Thursday, 13 February 2014

Animated Algorithm: Canonical Tautomer Assignment

Effectively understanding an algorithm with only a description is difficult. Reading source code, possibly more so. Although these approaches explain the finer details, invariants and proofs, a higher level view offers clarity. A great example of this is seen at McKay's and Piperno's canonical labelling site, The Search Tree.

Tautomers are constitutional isomers of organic compounds that rapidly interconvert (Wikipedia). The most common form, is the relocation of a proton. Many computer representations are tautomer specific and distinguish different tautomeric forms of a compound. They would for example not have the same unique SMILES.

There are several approaches and algorithms for handling tautomers (Warr W, 2010). At the Daylight EuroMUG99, Roger Sayle and Jack Delany presented an algorithm for enumerating and assigning a unique tautomer (Sayle R and Delany J, 1999). From slide 20 in that presentation, there is a very nice step wise example on guanine.

This afternoon I had fun creating a animation for algorithm using an implementation I wrote last year. I've used a more complicated example that emphasises the backtracking when an incorrect assignment is made.

Being a large compound, I didn't include the keto-enol types. Also I had to modify the order (normally the nitrogens would be assigned first) to allow it be watchable in reasonable time. Each proton is placed and the hetroatoms become either a donor (green) or acceptor (blue). At several points it attempts to place two protons in each of the five membered rings. After updating adjacent atoms the mistake is identified and the assignment backtracks setting one as an proton acceptor.

Be sure to set the HD option.


Warr W (2010) shows an example labelled as "hidden tautomers". This compound presents an interesting challenge and a nice demonstration. In total there are 68 tautomers generated.



References


Warr WA. Tautomerism in chemical information management systems. J Comput Aided Mol Des. 24(6-7):497-520. 2010
Also presented at the ChemAxon 2010 UGM - http://www.youtube.com/watch?v=1C-RTD4DAJ8

Sayle RA and Delany JJ. Canonicalization and enumeration of tauomers. Presented at EuroMUG99, Cambridge, UK, 28-29 Oct 1999


Saturday, 11 January 2014

New SMILES behaviour - generating (CDK 1.5.4)

In the previous post I outline the changes to SMILES Parsing in the Chemistry Development Kit (CDK). The original plan was to have several posts detailing the changes but in the end it was more practical to put this in a single release note document (available: here). Although the key changes to SMILES Generation are summarised in the release notes I thought it would be worth touching on some details.

Types of SMILES


The original API involved creating a mutable generator and then creating SMILES by invoking one of several methods.
Code 1 - Existing SMILES generation API
SmilesGenerator smigen = new SmilesGenerator();
String smi    = smigen.createSMILES(molecule);
String chismi = smigen.createChiralSMILES(molecule);

The new generator is immutable and a change in the configuration creates a new instance. There aren't many configuration options but key point is there isn't an internal state. This makes it a lot more robust.
Code 2 - New SMILES generation API
SmilesGenerator smigen = SmilesGenerator.unique()
                                        .aromatic()
                                        .withAtomClasses();
smigen.createSMILES(molecule); // now deprecated
smigen.create(molecule);
The generator can now create several different types of SMILES. The naming follows the Daylight specification.
non-canonicalcanonical
no isotopes / stereo
generic
unique
with isotopes / stereo
isomeric
absolute

CDK previously used the term chiral SMILES to refer to SMILES with stereochemistry. The term isomeric SMILES is now used and aligns with other toolkits (Daylight, RDKit, ChemAxon and OEChem).

Unique and Absolute SMILES

The equitable refinement published by Daylight (Weininger et al. 1988) has been optimised and is used to produce the unique SMILES. Generation of absolute SMILES is more tricky and currently uses the InChI algorithm. I believe this idea was first suggested by Andrew Dalke (inchi-dicuss archive). In 2012, Noel O'Boyle published and implemented (in Open Babel) procedures for generate Universal and Inchified SMILES (O'Boyle 2012).

The procedure used by the CDK absolute SMILES follows the Universal SMILES rules. There may still be some small differences in the implementation so I'm hesitant of declaring it as a Universal SMILES implementation before that validation can be done. There are also some problems, such as, delocalised charges that would be useful to handle.

Performance

Generally the handling of SMILES in the CDK is much more robust with hydrogen counts and stereochemistry correctly round tripped. In addition to this the performance is also much better. This is most noticeable when using the non-canonical outputs and only storage is needed but it is also true of the canonical output.

The following table summarises the time taken to read and generate canonical SMILES for three different size datasets. The times are one-off measurements on a cold JVM. The red number in brackets is the number of compounds which caused an error. Using the InChI the absolute SMILES can not encode unknown atoms (CC*) which are ubiquitous in ChEBI (ontology classes).

Data Compounds CDK 1.4.15 CDK 1.5.4 unique CDK 1.5.4 absolute
ChEBI 108 ~27,000 54 s 5 s 23 s (1251)
NCI Aug '00 ~250,000 2 m 8 s 14 s 2 m 12 s (184)
ChEMBL 17 ~1,300,000 23 m 48 s (11) 1 m 38 s 15 m 26 s (36)

References

Wednesday, 18 December 2013

New SMILES behaviour - parsing (CDK 1.5.4)

It's been a while since the last CDK development release (over 4 months). The CDK 1.5.4 will be released in the next few days and has a lot of improved functionality. This is the first of several posts expanding on the functionality and detailing some design decisions.

One major overhaul has been the handling of the SMILES format. Mainly, tetrahedral and double-bond stereo-chemistry is read and written but they'll be more on that in the release note. In general the SMILES parsing API is the same as older versions but there is some behaviour changes which are worth highlighting to avoid possible "gotchas" with SMILES.

The previous parser behaviour was to create a structure, perform atom typing and add hydrogens using the CDK atom types. Bonds in aromatic SMILES were left as single bonds with aromatic flags. Aromaticity was optionally re-perceived. Leaving aromatic bonds with correct bond orders caused problems elsewhere in the toolkit.

Atom-typing / Implicit hydrogen counts

Atom-typing is no longer performed, this mirrors the behaviour of the other CDK readers and avoids atom-typing being invoked twice. As a significant overhead the atom-typer should be optional (although it is often required). If you read SMILES and use a procedure that requires the CDK atom-types you must now explicitly invoke the atom-type configuration.

Although atom-types are not automatically set, the implicit hydrogen count is. The hydrogen count follows the Daylight SMILES specification allowing the formula to be precisely determined. If the atom-typing / hydrogen adding is redone with the CDK you may lose information. As is shown below, removing the automatic atom-typing allows better round-tripping.

IChemObjectBuilder     builder = SilentChemObjectBuilder.getInstance();
SmilesParser           sp      = new SmilesParser(builder);
SmilesGenerator        sg      = new SmilesGenerator();
InChIGeneratorFactory  igf     = InChIGeneratorFactory.getInstance();
        
IAtomContainer m = sp.parseSmiles("[O]");
System.out.println(sg.create(m));                         // [O]
System.out.println(igf.getInChIGenerator(m).getInchi());  // InChI=1S/O
        
// configure atom types
AtomContainerManipulator.percieveAtomTypesAndConfigureAtoms(m);
CDKHydrogenAdder.getInstance(builder).addImplicitHydrogens(m);

System.out.println(sg.create(m));                         // O ([OH2])
System.out.println(igf.getInChIGenerator(m).getInchi());  // InChI=1S/H2O/h1H2

Kekulé representation

Reading an aromatic SMILES will now provide a structure with defined bond orders. The kekulisation is efficient and correctly handles some tough cases. Example inputs and depictions are shown below. There will be a future post describing the algorithm in detail.


Input Depiction
c1(c)c(c)c(c)c(c)c(c)c1(c)
c1cccc1c1cccc1
oc1ccc(o)cc1
oc1ccocc1
oc(cc)co
If a structure could not be kekulised it is considered invalid and a checked exception is thrown. The most common cause of errors is omitting a hydrogen present on an aromatic nitrogen (or phosphorus).

Input Depiction
n1cccc1 Error - could not kekulise
[nH]1cccc1


Why are invalid SMILES not "fixed"?

The decision to consider such SMILES invalid may be controversial. The primary issue is there is no guaranteed way to unambiguously handle the input such the intended structure was obtained.

To kekulise the pyrrole example, n1cccc1, we would need to change the formula of the structure by adding a hydrogen ([nH]1cccc1) or assign a negative charge to the nitrogen ([n-]1cccc1). Both alter the SMILES string by the same number of edits, such decisions can lead to ambiguities.

In the case of pyrrole the decision seems obvious and we would arrive at the correct representation. Most would favour adding a hydrogen rather than modifying the charge (although positive charges are frequently assigned to 4 valent nitrogens in normalisation). In general the placement of the hydrogen is not always as trivial.

Fixarom is a program available in the Daylight Contrib that attempts to modify SMILES to allow parsing (kekulisation). In effect it tries every possible combination of hydrogen placements and is how most resolutions would work. Apart from the obvious inefficiency the structure we end up with will be different depending on the input order. This is because SMILES is tautomer specific, that is, different tautomers have different SMILES.

Input Depiction
CC(C)C1=CC=CC2=C1NC=N2
or
CC(C)c1cccc2c1[nH]cn2
CC(C)C1=CC=CC2=C1N=CN2
or
CC(C)c1cccc2c1nc[nH]2
 

I'm a huge fan and regularly use the Open Babel ready-to-use programs, recently I found that the obabel utility will automatically add hydrogens to aromatic nitrogens. As seen below, the input only differs in the atom order. We generate two different canonical SMILES because the output structures are different tautomers.

obabel -:'CC(C)c1cccc2ncnc12' -ocan
CC(c1cccc2c1nc[nH]2)C 
1 molecule converted
obabel -:'CC(c1cccc2c1ncn2)C' -ocan
CC(c1cccc2c1[nH]cn2)C
1 molecule converted

It would be possible to assign a single representation by choosing the canonical tautomer. There is however no guarantee that the representation is the same as was used to generate the original SMILES.

Rather than the choice of hydrogen placement it may be that there really should be no hydrogen placed. CHEBI:32730 which currently has the SMILES NC(Cc1c[n]c2ccccc12)C(O)=O serves as a good example. Following our decision on pyrrole we would incorrectly add a hydrogen to the aromatic nitrogen when it was actually a radical. We could use the brackets as a trigger to indicate that there really should be no hydrogen but then we still can't assign bond orders. I believe this error in ChEBI is actually due to the incorrect use of the 'loose' instead of 'general' aromaticity option in Marvin. Again, we can obtain two different structures depending on our interpretation. In this case it is the less likely one which is correct.


Input Depiction
NC(Cc1c[n]c2ccccc12)C(O)=O
(interpreted as NC(Cc1c[nH]c2ccccc12)C(O)=O)
NC(Cc1c[n]c2ccccc12)C(O)=O
(interpreted as NC(Cc1c[N]c2ccccc12)C(O)=O)


As you can see resolving missing hydrogens in aromatic SMILES is ambiguous. In practise providing the SMILES were generated correctly the number of invalid strucutres is small. There are fewer than 40 invalid inputs between ChEBI and ChEMBL and generally they involved organometallics incorrectly represented in SMILES with non-covalent bonds specified as covalent.

If you still really want to parse these structures then 'kekulise' can be set to false but this is not recommended.

IChemObjectBuilder builder   = SilentChemObjectBuilder.getInstance();
SmilesParser       sp        = new SmilesParser(builder);
IAtomContainer     container = sp.parseSmiles("n1cccc1");  // throws exception
sp.kekulise(false);                                        // turn off kekulisation
IAtomContainer     container = sp.parseSmiles("n1cccc1");  // creates an AtomContainer

Aromaticity re-perception

Previously the option setPreservingAromaticity could be set to return atoms and bonds with the aromaticity flags matching the input. By default the option was off and the CDKHueckelAromaticityDetector was automatically invoked. The new default behaviour is to preserve the aromatic specification in the input (even if the input is not aromatic). With the bond orders now set we can safely apply a new aromaticity model if desired or simply remove the aromatic flags. The functionality followed from the removal of automatic atom-typing. The option, setPreservingAromaticity, has been deprecated.

IChemObjectBuilder builder   = SilentChemObjectBuilder.getInstance();
SmilesParser       sp        = new SmilesParser(builder);

// cyclobuta-1,3-diene is not aromatic but has been specified as such
IAtomContainer container = sp.parseSmiles("c1ccc1");

// remove flags -> molecule is still valid thanks to kekulisation
for (IAtom a : container.atoms()) 
    a.setFlag(CDKConstants.ISAROMATIC, false);
for (IBond b : container.bonds()) 
    b.setFlag(CDKConstants.ISAROMATIC, false);

Ring closure bond type

A SMILES ring closure may optionally have an explicit bond type. As recommended by Open SMILES an explicit symbol should only be written on either the opening or the closure. Writing the symbol on both the ring opening and closure is ambiguous.

A violation of this is sometimes found with up/down directional labels in double-bond stereo-chemistry. The bond labels are relative and reverse sense on ring closures. Such specification obviously occurs in rings but we can mock an example to show that C/1.C/C=C\1 is correct C/1.C/C=C/1 is incorrect. The new SMILES parser will not accept the second input and highlights where the problem was.
InvalidSMILESException: could not parse 'C/1.C/C=C/1', Ring closure bonds
did not match. Ring was opened with '/' and closed with '\'. Note -
directional bonds ('/','\') are relative.
C/1.C/C=C/1
         ^
I decided to highlight this as during testing I encountered a couple of structures with this issue. CHEMBL1315255 currently has the SMILES shown below, the incorrect labels on the ring closures are highlighted.

Cl.Cl.Cc1cccc2\C(=C/3\\Nc4ccccc4/C/3=N\OCCN5CCNCC5)\C(=O)Nc12 1315255

Generally if a parser accepts this input it may be choosing either to use the ring opening or ring closure. A simple test to determine if the opening or closing is use is to parse C=1CCCC#1 and C#1CCCC=1. The choice of whether to prefer the open or close label differs between toolkits and I don't think there is a correct answer.

I would like to thank the valuable input and discussions from the CDK developers, EBI colleagues and CCN attendees. I have previously notified the ChEBI and ChEMBL curators of the problematic SMILES identified.

I'll leave you with a fun and seasonal example of valid SMILES input.

HoHoHo Merry Christmas (Daylight DEPICT)

Monday, 11 November 2013

Improved Substructure Matching


Since it was written the UniversalIsomorphismTester (UIT) has been the goto class in the Chemistry Development Kit (CDK) to check structure identity, substructure and overlap matching. The utility works really well but derives the substructure from the more expensive MCS computation. There has been several efforts to bring faster structure matching through Ullmann [1] and the Vento-Foggia (VF) [2] algorithms. Most notably VF was used for structure matching in OrChem [3] and SMSD [4]. The original SMSD code was added to CDK but since then SMSD has been completely rewritten and optimised. I recently tried reintegrating the latest version but have had several failed attempts. 

Substructure and identity checking is fundamental and I felt it needed a more centralised home in the code base. This weekend I wrote the first parts of the API and implementations which I will discuss here. 

Lazy Matching

For brevity I won't go over how substructure matching works, Rich Apodaca has two excellent introductions on One of These Things is Not Like The Others and Substructure Search From Scratch which gives the key ideas. In the first post it is mentioned that recursion is at the heart of any implementation. Generally recursive implementations are one-or-all, that is you can get the first match or all matches.

As recursion is simply storing frames on the call stack for matching you can easily replace the recursion with a non-recursive stack which only stores each candidate matchings. Providing front end access to this allows a lazy substructure matcher which improves memory and has another benefit we'll see later.

So far I've added three matchers VentoFoggia (substructure/identity) and Ullmann (substructure) with scope for others [5].

API

The new structure matching paradigm is similar to Java regular expressions - a Pattern is created for a query and then that pattern can be tested on multiple targets. Whether the pattern is matching SMARTS queries, identity or substructure doesn't matter. Personally I've found my self parsing the query and target molecules in the wrong order with the existing matchers and using the pattern makes it impossible to get the order wrong.

Code 1 - Matching API predicate
IAtomContainer query   = ...;
Pattern        pattern = Ullmann.findSubstructure(query);
for (IAtomContainer target : targets) 
    if (pattern.matches(target))
        hits++;
With language features due in Java 8 the matching hits can even be done in one line.
List<IAtomContainer> hits = targets.filter(Ullmann.findSubstructure(query)::matches);
The first atom-mapping can be obtained with match() and all atom mappings with matchAll() (which is lazy).
Code 1 - Matching API, lazy loop over all matches
for (IAtomContainer target : targets) 
    for (int[] mapping : pattern.matchAll(target))
        hits++;

Counting Self Matches 

To test the performance I first looked at counting self matches. You'd generally want to do this with a partition refinement procedure (see. Using the CDK's group module) but it serves as a good test for the matchers. To compare the new matchers I've used the CDK's UniversalIsomorphismTester (UIT) and the latest SMSD's Substructure (VF implementation). It should be noted that SMSD does go a lot faster with an API change (see. Benchmarking Substructure Search) but this has not yet been added.
ferrocene as 200 isomorphisms to it's self
The average (r=10) time taken to match and count isomorphisms (n) was tested on several symetric structures. All structures were normalised to only have sigma bonds. The SMSD implementation did not get the correct number of matches which I think is a bug propagated from the original ChemKit implementation. If any matcher took longer than 10 minutes I considered it as timed out (shown as n/a).

Structure Matched UIT SMSD Ullmann VentoFoggia
nt (s) t (s) t (s) t (s)
ferrocene 2000.281 0.166 (n=180?) 0.007 0.001
fullerene C60 120279 7.5 (n=1) 1.2 0.017
fullerene C70 60n/a 9.8 (n=1) 2 0.058
fullerene C320 120n/a n/an/a 0.527

VentoFoggia only took half a second to find all matches in fullerene C320. I was curious to see how far it could go. The new matchers allow disconnected structures and matching multiple cyclopropanes we quickly reach a huge number of matches.

C1CC1.C1CC1.C1CC1.C1CC1.C1CC1.C1CC1 has more than 33 million isomorphisms
Structure Matched Ullmann VentoFoggia
(SMILES) nt (s) t (s)
C1CC1 6 0 0
C1CC1.C1CC1 72 0.002 0.002
C1CC1.C1CC1.C1CC1 1,296 0.002 0.005
C1CC1.C1CC1.C1CC1.C1CC1 31,104 0.168 0.037
C1CC1.C1CC1.C1CC1.C1CC1.C1CC1 933,120 6.5 1.1
C1CC1.C1CC1.C1CC1.C1CC1.C1CC1.C1CC1 33,592,320 270.5 36.3

Dataset Searching

As a more general gauge of performance the NCI Aug 2000 (~250,000 structures) was searched for structures containing O=C1C=CCC=C1. It's a single isolated test so conclusions can't really be drawn but it provides a sense for the timings. 
Query structure matched against ~250,000 structures in the NCI Aug 2000 data set.
The SMSD uses heuristics internally whilst the others were run without heuristics on every structure (even if the query was larger than the target). How long the matching took for each implementation is shown below. When finding hits we can find the first match in each structure (3,623 total) or find all the hits (12,706 total). The SMSD matcher is fast when finding the first matching in each structure but much slower to find all matches.

Implementationt firstt all
UIT43.4 s43.6 s
SMSD28.6 s4 m 40 s
Ullmann6.3 s6.5 s
VentoFoggia5.1 s4.9 s

Stereochemistry Matching

The new matchers also check the tetrahedral and geometric (double-bond) stereo configurations are consistent. Technically it's more efficient to check stereo configurations as the vertices are mapped but I decided to go for a post-matching check. This is more modular and also means we can use the same stereo consistency checking for other matchers (the existing UIT). This is where the lazy matching is really useful as we can get the first match which is stereochemically valid without finding all matchings.The stereo checking also seems to be fast enough and I've left it on by default (including all times in this post).

Using the VentoFoggia implementation, it takes ~38 seconds to find all 4,306,056 matches for C(C)(C)(C)(C) in ChEMBL 17. Matching one with a stereo configuration ([C@](C)(C)(C)(C)) is actually just as quick, taking ~37 seconds to find 1,007,652 matches.

Finally I'll leave you with a sneak peak at the Smiles class I have planned which makes loading and matching molecules from SMILES concise and elegant.
Pattern pattern = VentoFoggia.findSubstructure(Smiles.fromString("[C@](C)(C)(C)(C)"));
for (IAtomContainer target : Smiles.fromPath("/db/chembl/chembl_17.smi.gz")) {
    hits += Iterables.size(pattern.matchAll(target)); // Guava utils
}

Code

The matcher code is currently awaiting review on the branch feature/isomorphisms. If the branch is missing then it will be in the package org.openscience.cdk.isomorphism.

References

1. Ullmann JR. An Algorithm for Subgraph Isomorphism. Journal of the Association for Computing Machinery. 2004 
2. Cordella LP et al. A (Sub)Graph Isomorphism Algorithm for Matching Large Graphs. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE. 2004
3. Rijnbeek M and Steinbeck C. OrChem - An open source chemistry search engine for Oracle. Journal of Cheminformatics. 2009
4. Rahman SA et al. Small Molecule Subgraph Detector (SMSD) toolkit. Journal of Cheminformatics. 2009
5. Gouda K and Hassaan, M. A fast algorithm for subgraph search problem. Informatics and Systems (INFOS). 2012

Friday, 25 October 2013

All The Small Things

The SMILES Arbitrary Target Specification (SMARTS) is a succinct way of expressing and matching structural attributes of molecules. SMARTS is ubiquitous within cheminformatics and involves first interpreting an expression into a query and matching that query against one or more target molecules. When performing a SMARTS match one may presume a bottleneck is the required subgraph isomorphism. In general this is true but whilst playing around with the Chemistry Development Kit (CDK) yesterday I discovered how some small changes could make a noticeable difference.

Benchmarking is difficult, benchmarking between toolkits, even more so. It's often hard to isolate the functionality and remove parts we don't really want to measure. Utilising a use case only provides an approximate measure for comparison but is easy to relate to. 

Use case

A use case I've wanted to test for a while was the "time taken to find O=[C,N]aa[N,O;!H0] hits in 250,251 SMILES of the NCI August 2000 data". This is taken from a presentation by NextMove Software on Efficient matching of multiple chemical subgraphs (see slide 6). Edit: I now realise that is actually an example from the Daylight Chemistry Cartridge - Intra-molecular hydrogen bonds.

The CDK, did not finish and I was generally curious to see how long it would take. I think the DNF was actually due to some exceptions being thrown but I'll make those clear in the results. Running on different hardware means we can't draw concrete comparison between the times but they should give a ball park figure. One should remember: "Measurement is not prophesy" - The Computer Language Benchmarks Game.

The NCI August 2000 data (available: cactus.nci.nih.gov) has several desirable properties.

  • The data is publicly available
  • The SMILES are all valid (actually quite rare)
  • The number of molecules isn't too large
  • The SMILES are a kekulĂ© representation and aromaticity must be perceived

The results where measured using the following program. The times shown are from t0 and t1 includes IO and are from a cold start (no warmup through repeats). The VM startup time was not included but is generally minimal.
Code 1 - SMARTS matching a file
static final IChemObjectBuilder bldr = SilentChemObjectBuilder.getInstance();

public static void main(String[] args) throws IOException {

    String path = "/databases/nci/nci_aug00.smi.gz";

    IteratingSMILESReader smis  = new IteratingSMILESReader(gunzip(path), bldr);
    SMARTSQueryTool       query = new SMARTSQueryTool("O=[C,N]aa[N,O;!H0]", bldr);  
    int nHits = 0, nFail = 0;
        
    long t0 = System.nanoTime();
    while (smis.hasNext()) {
        try {
            if (query.matches(smis.next()))
                nHits++;
        } catch (Exception e) {
            nFail++;
        }
    }
    long t1 = System.nanoTime();

    System.out.println(nHits + " hits (" + nFail + " in error) in "
                             + toSeconds(t1 - t0));
        
    smis.close();
}
  
static String toSeconds(long t) {
    long s  = TimeUnit.NANOSECONDS.toSeconds(t);
    long ms = TimeUnit.NANOSECONDS.toMillis(t - TimeUnit.SECONDS.toNanos(s));
    return s + " s " + ms + " ms";
}

static Reader gunzip(String path) throws IOException {
    // ensure GZIP buffer lines up with BufferedReader (SMILES Iterator)
    return new InputStreamReader(new GZIPInputStream(new FileInputStream(path), 8192));
}

Results

The above program was run on several versions of CDK. The 1.4.x branches are the stable releases whilst the 1.5.x are the unstable developer releases. The 1.5.4.git is the current (unreleased) master. The 1.5.4.git+ and 1.5.4.git++ include some additional tweaks.

VersionMatchedErrorsTime
1.4.1511,14564819 s 860 ms
1.5.311,14537279 s 725 ms
1.5.4.git11,14637225 s 639 ms
1.5.4.git+10,13637160 s 713 ms
1.5.4.git++10,136016 s 18 ms

1.4.15 to 1.5.3

The difference seen between 1.4.15 and 1.5.3 is primarily due to the AllRingsFinder. This is also the cause of the errors. The 1.5.3 version includes the optimisations discussed previously (AllRingsFinder, Sport Edition) and accounts for the majority of the improvement. The AllRingsFinder was being used in SMARTS matches for setting the ring invariants, although this isn't needed at all it only takes ~5 seconds and isn't a huge factor when it's removed later.

1.5.3 to 1.5.4.git

The 1.5.4.git version includes several changes such as a new SMILES parser and faster atom typing using the new RingSearch (Scaling Up - Faster Ring Detection in CDK). The new SMILES parser behaviour is a little different, kekulisation is now automatic and hydrogen counts are set on all atoms. Atom typing and aromaticity perception is not done automatically by the parser but is still done by the SMARTSQueryTool and so removal from the parser has little impact on performance. The number of matched molecules changed between 1.5.3 and 1.5.4.git. This was due to a bug in the aromaticity perception. Allowing removal of items from a collection whilst iterating can result in some being skipped over (fixed: f03170a).

Erroneous aromaticity fixed in 1.5.4.git means one structure is no longer matched and two previously unmatched are now found. Matched atoms are highlighted.
1.5.4.git to 1.5.4.git+

The 1.5.4.git+ is faster than 1.5.4.git as no atom typing is required and faster ring perception is used with a new aromaticity utility. Providing implicit hydrogen counts are set atom typing is only needed if the CDK aromaticity model or orbital hybridisation queries (an extension to SMARTS) is used. The new Aromaticity implementation allows us to define different models of electron contribution and choose which ring sets we apply that model to. Using a new model which is more similar to Daylight's requires only bond orders and hydrogen counts and not the atom types. There are differences in perceived aromaticity and this results in a different number of matches. From the differences tested the 1.5.4.git+ matches agree with DEPICTMATCH. I've included all the matched SMILES at the end of the post.


Some of the different matches (highlighted) due to aromaticity models

1.5.4.git+ to 1.5.4.git++

Perhaps most surprisingly was the small change between 1.5.4.git+ and 1.5.4.git++It turns out that  the majority of 160 seconds was spent in initializeMolecule() rather than the actual subgraph matching. The initialisation computes several invariants on the target molecule required for some query atoms to match. The invariants are the values such as; ring connectivity, ring number, degree, total hydrogen count and valence. These can be provided lazily or precomputed. Lazy properties would require each atom knowing the molecule to which it belongs. This is not how the CDK API is structured and so these are precomputed before a SMARTS target is matched. 

As mentioned earlier AllRingsFinder is not needed and was removed. You may want it as an extension and although it's now fast enough it certainly shouldn't be a default option. The SSSRFinder was replaced with a faster implementation and is now only computed if there are ring queries. Finally the other invariants are incrementally built up using a data structures optimised for access. Egon and I were actually looking at using this technique to also speed up atom typing. There are no fancy tricks and for brevity I'll omit a full description but it should be included in the code base soon enough. The 16 seconds shown earlier is the current time taken using the new default options on my patch.

Could it go even faster?

There is currently some overhead going between different data structures. The SMILES are loaded with Beam and then converted to the CDK IAtomContainers. CDK takes 3,327 ms to load the NCI molecules whilst Beam takes only 1,500 ms. To read an apply the aromaticity model in CDK takes 9,297 ms whilst in Beam it takes 5,492 ms.

It takes about a 1 second to convert the 250,000 CDK IAtomContainers to a graph representation optimised for quick traversal and this has to be done 3-4 times. Once for computing the invariants, aromaticity and isomorphism test. The forth time depends on whether ring information is required - the ring info isn't persisted on the IAtomContainer despite having already been done for the aromaticity. Improvements here would require a large restructuring of the API and domain objects.

Finally you may have noticed I didn't touch the UniversalIsomorphismTester. It's is actually pretty fast but the subgraph isomorphism is a by-product of an MCS computation. Using an optimised Ullmann/VF2 implementation could see further improvements. I might look in to patching the matcher from Ambit-SMARTS for further gainsI did try the subgraph implementation from the latest SMSD but for this particularly use case it was slower than the existing UniversalIsomorphismTester.

Matches

Different SMILES matched between the versions
To obtain the input, download the 2D SDF file and extract the SMILES field.
~: grep -A1 SMILES NCI_aug00_2D.sdz | grep -ve 'SMILES\|^--$' > nci_aug00.smi