This post follows up on the previous one to report some timings. I've checked all the code into GitHub (johnmay/efficient-bits/fp-idx), and it includes some standalone programs that can be run from the command line.
Currently there are a few limitations that we'll get out of the way first:
- Only generation of the CDK ECFP4 is supported, folded to a length of 1024 bits; this should be a close approximation to what Matt used in MongoDB (the RDKit Morgan fingerprint). Other fingerprints and fold lengths could be used, but generation of path-based fingerprints in the CDK is currently (painfully) slow. A sketch of what the ECFP4 generation amounts to follows this list.
- Building the index is done in memory; since 1,000,000 × 1024-bit fingerprints is only 122 MiB (1,000,000 × 1024 bits / 8 = 128 MB), you can easily build indexes of fewer than 10 million fingerprints on modern hardware.
- During index searching the entire index is memory mapped; setting the chunks system property (see the GitHub README) will avoid this at a slight performance cost.
- Results return the id within the index (an indirection); to recover the original ID it must be resolved against another file (generated by mkidx).
- Index update operations are not supported without rebuilding the index.
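Since ECFP4 generation is the only supported fingerprint, here is roughly what that step looks like with the CDK's CircularFingerprinter. This is a minimal sketch for orientation, not the code from the repository, and it assumes getBitFingerprint() folds to the default length of 1024 bits.

import org.openscience.cdk.fingerprint.CircularFingerprinter;
import org.openscience.cdk.fingerprint.IBitFingerprint;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
import org.openscience.cdk.smiles.SmilesParser;

public class Ecfp4Demo {
    public static void main(String[] args) throws Exception {
        SmilesParser smipar = new SmilesParser(SilentChemObjectBuilder.getInstance());
        IAtomContainer mol = smipar.parseSmiles("c1cc(c(cc1CCN)O)O"); // dopamine, the query used below

        // ECFP4 (circular, diameter 4); getBitFingerprint() folds the hashed
        // features into a fixed-length bit set (assumed default of 1024 bits).
        CircularFingerprinter fpr = new CircularFingerprinter(CircularFingerprinter.CLASS_ECFP4);
        IBitFingerprint fp = fpr.getBitFingerprint(mol);

        System.out.println(fp.cardinality() + " bits set of " + fp.size());
    }
}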
These are all pretty trivial to resolve and I've simply omitted them due to time. With that out of the way, here's a quick synopsis of building the index; there is more detail in the GitHub README.
$ ./smi2fps /data/chembl_19.smi chembl_19.fps # ~5 mins
$ ./mkidx chembl_19.fps chembl_19.idx # seconds
The fpsscan utility does a linear search, computing all the Tanimoto coefficients and outputting the lines that are above a given threshold. The simmer and toper utilities use the index: simmer filters for hits above a similarity threshold, while toper retrieves the top k results. Both can take multiple SMILES via the command line or from a file.
$ ./fpsscan /data/chembl_19.fps 'c1cc(c(cc1CCN)O)O' 0.7 # ~ 1 second
$ ./simmer chembl_19.idx 0.7 'c1cc(c(cc1CCN)O)O' # < 1 second
$ ./toper chembl_19.idx 50 'c1cc(c(cc1CCN)O)O' # < 1 second (top 50)
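For context, the core of a linear scan is just bit operations and popcounts over the 1024-bit fingerprints (sixteen 64-bit words per entry). The method below is an illustrative sketch, not the fp-idx code; a linear search applies it to every entry and keeps those at or above the threshold.

// Tanimoto between two 1024-bit fingerprints packed as sixteen longs.
// Illustrative sketch, not the fp-idx implementation.
static double tanimoto(long[] query, long[] target) {
    int both = 0, either = 0;
    for (int i = 0; i < query.length; i++) {
        both   += Long.bitCount(query[i] & target[i]); // bits in common
        either += Long.bitCount(query[i] | target[i]); // bits in either
    }
    return either == 0 ? 1d : both / (double) either;
}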
Using the same queries from the MongoDB search, I get the following distribution of search times at different thresholds.
Some median search times are as follows.
Threshold | Median time (ms)
----------|-----------------
0.90      | 14
0.80      | 31
0.70      | 46
0.60      | 53
In the box plot above the same (first) query is always the slowest; this is likely due to JIT warm-up.
It's interesting to see that the times seem to flatten out. By plotting how many fingerprints the search had to check, we can see that below a certain threshold we are essentially checking the entire dataset.
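One way to see why is the standard bit-count bound (Swamidass and Baldi): a query with nq bits set can only reach a Tanimoto of t or above against fingerprints whose cardinality lies between ceil(t * nq) and floor(nq / t), and that window widens quickly as t drops. The snippet below just computes the window; I'm assuming cardinality-based pruning here, and this is not the code from fp-idx.

// Bit-count bound: only fingerprints with a cardinality in
// [ceil(t * nq), floor(nq / t)] can reach Tanimoto >= t against a query
// with nq bits set. (Assumes cardinality-based pruning; illustrative only.)
static int[] candidateCardinalityRange(int nq, double t) {
    int lo = (int) Math.ceil(t * nq);
    int hi = (int) Math.floor(nq / t);
    return new int[]{lo, hi};
}

For a query with around 40 bits set the window is [36, 44] at a threshold of 0.90 but already [24, 66] at 0.60, and a large fraction of folded circular fingerprints are likely to fall inside it, so little can be skipped.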
The reason for this is potentially the sparse circular fingerprints: most entries have similarly low bit counts, so there is little to discriminate on. Examining the result file (see the GitHub README), we can estimate that on average we're computing 23,556,103 Tanimoto coefficients a second. This also means that retrieving the top k results isn't bad either; asking for the top 10,000, for example, gives a median time of 72 ms:
$ ./toper chembl_19.idx 10000 queries.smi
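Selecting the top k during a scan only needs a bounded min-heap of the best hits seen so far. The class below is a sketch of that idea; it's an assumption about the approach, not how toper is actually implemented.

import java.util.PriorityQueue;

// Keep the k best (id, score) pairs seen during a scan using a min-heap,
// so the current worst hit is always cheap to evict. Illustrative only.
class TopK {
    static final class Hit {
        final int id;
        final double score;
        Hit(int id, double score) { this.id = id; this.score = score; }
    }

    private final int k;
    private final PriorityQueue<Hit> heap =
            new PriorityQueue<>((a, b) -> Double.compare(a.score, b.score));

    TopK(int k) { this.k = k; }

    void offer(int id, double score) {
        if (heap.size() < k) {
            heap.add(new Hit(id, score));
        } else if (score > heap.peek().score) {
            heap.poll();                  // evict the current worst hit
            heap.add(new Hit(id, score));
        }
    }
}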
Next I'll look at some like-for-like comparisons.