The SMILES Arbitrary Target Specification (SMARTS) is a succinct way of expressing and matching structural attributes of molecules. SMARTS is ubiquitous within cheminformatics and involves first interpreting an expression into a query and matching that query against one or more target molecules. When performing a SMARTS match one may presume a bottleneck is the required subgraph isomorphism. In general this is true but whilst playing around with the Chemistry Development Kit (CDK) yesterday I discovered how some small changes could make a noticeable difference.
Benchmarking is difficult, benchmarking between toolkits, even more so. It's often hard to isolate the functionality and remove parts we don't really want to measure. Utilising a use case only provides an approximate measure for comparison but is easy to relate to.
Use case
A use case I've wanted to test for a while was the "time taken to find O=[C,N]aa[N,O;!H0] hits in 250,251 SMILES of the NCI August 2000 data". This is taken from a presentation by NextMove Software on Efficient matching of multiple chemical subgraphs (see slide 6). Edit: I now realise that is actually an example from the Daylight Chemistry Cartridge - Intra-molecular hydrogen bonds.
The CDK, did not finish and I was generally curious to see how long it would take. I think the DNF was actually due to some exceptions being thrown but I'll make those clear in the results. Running on different hardware means we can't draw concrete comparison between the times but they should give a ball park figure. One should remember: "Measurement is not prophesy" - The Computer Language Benchmarks Game.
The NCI August 2000 data (available: cactus.nci.nih.gov) has several desirable properties.
The results where measured using the following program. The times shown are from t0 and t1 includes IO and are from a cold start (no warmup through repeats). The VM startup time was not included but is generally minimal.
The CDK, did not finish and I was generally curious to see how long it would take. I think the DNF was actually due to some exceptions being thrown but I'll make those clear in the results. Running on different hardware means we can't draw concrete comparison between the times but they should give a ball park figure. One should remember: "Measurement is not prophesy" - The Computer Language Benchmarks Game.
The NCI August 2000 data (available: cactus.nci.nih.gov) has several desirable properties.
- The data is publicly available
- The SMILES are all valid (actually quite rare)
- The number of molecules isn't too large
- The SMILES are a kekulé representation and aromaticity must be perceived
The results where measured using the following program. The times shown are from t0 and t1 includes IO and are from a cold start (no warmup through repeats). The VM startup time was not included but is generally minimal.
Code 1 - SMARTS matching a file
static final IChemObjectBuilder bldr = SilentChemObjectBuilder.getInstance(); public static void main(String[] args) throws IOException { String path = "/databases/nci/nci_aug00.smi.gz"; IteratingSMILESReader smis = new IteratingSMILESReader(gunzip(path), bldr); SMARTSQueryTool query = new SMARTSQueryTool("O=[C,N]aa[N,O;!H0]", bldr); int nHits = 0, nFail = 0; long t0 = System.nanoTime(); while (smis.hasNext()) { try { if (query.matches(smis.next())) nHits++; } catch (Exception e) { nFail++; } } long t1 = System.nanoTime(); System.out.println(nHits + " hits (" + nFail + " in error) in " + toSeconds(t1 - t0)); smis.close(); } static String toSeconds(long t) { long s = TimeUnit.NANOSECONDS.toSeconds(t); long ms = TimeUnit.NANOSECONDS.toMillis(t - TimeUnit.SECONDS.toNanos(s)); return s + " s " + ms + " ms"; } static Reader gunzip(String path) throws IOException { // ensure GZIP buffer lines up with BufferedReader (SMILES Iterator) return new InputStreamReader(new GZIPInputStream(new FileInputStream(path), 8192)); }
Results
The above program was run on several versions of CDK. The 1.4.x branches are the stable releases whilst the 1.5.x are the unstable developer releases. The 1.5.4.git is the current (unreleased) master. The 1.5.4.git+ and 1.5.4.git++ include some additional tweaks.
Version | Matched | Errors | Time |
---|---|---|---|
1.4.15 | 11,145 | 64 | 819 s 860 ms |
1.5.3 | 11,145 | 37 | 279 s 725 ms |
1.5.4.git | 11,146 | 37 | 225 s 639 ms |
1.5.4.git+ | 10,136 | 37 | 160 s 713 ms |
1.5.4.git++ | 10,136 | 0 | 16 s 18 ms |
1.4.15 to 1.5.3
The difference seen between 1.4.15 and 1.5.3 is primarily due to the AllRingsFinder. This is also the cause of the errors. The 1.5.3 version includes the optimisations discussed previously (AllRingsFinder, Sport Edition) and accounts for the majority of the improvement. The AllRingsFinder was being used in SMARTS matches for setting the ring invariants, although this isn't needed at all it only takes ~5 seconds and isn't a huge factor when it's removed later.1.5.3 to 1.5.4.git
The 1.5.4.git version includes several changes such as a new SMILES parser and faster atom typing using the new RingSearch (Scaling Up - Faster Ring Detection in CDK). The new SMILES parser behaviour is a little different, kekulisation is now automatic and hydrogen counts are set on all atoms. Atom typing and aromaticity perception is not done automatically by the parser but is still done by the SMARTSQueryTool and so removal from the parser has little impact on performance. The number of matched molecules changed between 1.5.3 and 1.5.4.git. This was due to a bug in the aromaticity perception. Allowing removal of items from a collection whilst iterating can result in some being skipped over (fixed: f03170a).
Erroneous aromaticity fixed in 1.5.4.git means one structure is no longer matched and two previously unmatched are now found. Matched atoms are highlighted. |
The 1.5.4.git+ is faster than 1.5.4.git as no atom typing is required and faster ring perception is used with a new aromaticity utility. Providing implicit hydrogen counts are set atom typing is only needed if the CDK aromaticity model or orbital hybridisation queries (an extension to SMARTS) is used. The new Aromaticity implementation allows us to define different models of electron contribution and choose which ring sets we apply that model to. Using a new model which is more similar to Daylight's requires only bond orders and hydrogen counts and not the atom types. There are differences in perceived aromaticity and this results in a different number of matches. From the differences tested the 1.5.4.git+ matches agree with DEPICTMATCH. I've included all the matched SMILES at the end of the post.
Some of the different matches (highlighted) due to aromaticity models |
1.5.4.git+ to 1.5.4.git++
Perhaps most surprisingly was the small change between 1.5.4.git+ and 1.5.4.git++. It turns out that the majority of 160 seconds was spent in initializeMolecule() rather than the actual subgraph matching. The initialisation computes several invariants on the target molecule required for some query atoms to match. The invariants are the values such as; ring connectivity, ring number, degree, total hydrogen count and valence. These can be provided lazily or precomputed. Lazy properties would require each atom knowing the molecule to which it belongs. This is not how the CDK API is structured and so these are precomputed before a SMARTS target is matched.
As mentioned earlier AllRingsFinder is not needed and was removed. You may want it as an extension and although it's now fast enough it certainly shouldn't be a default option. The SSSRFinder was replaced with a faster implementation and is now only computed if there are ring queries. Finally the other invariants are incrementally built up using a data structures optimised for access. Egon and I were actually looking at using this technique to also speed up atom typing. There are no fancy tricks and for brevity I'll omit a full description but it should be included in the code base soon enough. The 16 seconds shown earlier is the current time taken using the new default options on my patch.
Could it go even faster?
There is currently some overhead going between different data structures. The SMILES are loaded with Beam and then converted to the CDK IAtomContainers. CDK takes 3,327 ms to load the NCI molecules whilst Beam takes only 1,500 ms. To read an apply the aromaticity model in CDK takes 9,297 ms whilst in Beam it takes 5,492 ms.It takes about a 1 second to convert the 250,000 CDK IAtomContainers to a graph representation optimised for quick traversal and this has to be done 3-4 times. Once for computing the invariants, aromaticity and isomorphism test. The forth time depends on whether ring information is required - the ring info isn't persisted on the IAtomContainer despite having already been done for the aromaticity. Improvements here would require a large restructuring of the API and domain objects.
Finally you may have noticed I didn't touch the UniversalIsomorphismTester. It's is actually pretty fast but the subgraph isomorphism is a by-product of an MCS computation. Using an optimised Ullmann/VF2 implementation could see further improvements. I might look in to patching the matcher from Ambit-SMARTS for further gains. I did try the subgraph implementation from the latest SMSD but for this particularly use case it was slower than the existing UniversalIsomorphismTester.
Matches
Different SMILES matched between the versions
To obtain the input, download the 2D SDF file and extract the SMILES field.
~: grep -A1 SMILES NCI_aug00_2D.sdz | grep -ve 'SMILES\|^--$' > nci_aug00.smi