Saturday 11 January 2014

New SMILES behaviour - generating (CDK 1.5.4)

In the previous post I outline the changes to SMILES Parsing in the Chemistry Development Kit (CDK). The original plan was to have several posts detailing the changes but in the end it was more practical to put this in a single release note document (available: here). Although the key changes to SMILES Generation are summarised in the release notes I thought it would be worth touching on some details.

Types of SMILES


The original API involved creating a mutable generator and then creating SMILES by invoking one of several methods.
Code 1 - Existing SMILES generation API
SmilesGenerator smigen = new SmilesGenerator();
String smi    = smigen.createSMILES(molecule);
String chismi = smigen.createChiralSMILES(molecule);

The new generator is immutable and a change in the configuration creates a new instance. There aren't many configuration options but key point is there isn't an internal state. This makes it a lot more robust.
Code 2 - New SMILES generation API
SmilesGenerator smigen = SmilesGenerator.unique()
                                        .aromatic()
                                        .withAtomClasses();
smigen.createSMILES(molecule); // now deprecated
smigen.create(molecule);
The generator can now create several different types of SMILES. The naming follows the Daylight specification.
non-canonicalcanonical
no isotopes / stereo
generic
unique
with isotopes / stereo
isomeric
absolute

CDK previously used the term chiral SMILES to refer to SMILES with stereochemistry. The term isomeric SMILES is now used and aligns with other toolkits (Daylight, RDKit, ChemAxon and OEChem).

Unique and Absolute SMILES

The equitable refinement published by Daylight (Weininger et al. 1988) has been optimised and is used to produce the unique SMILES. Generation of absolute SMILES is more tricky and currently uses the InChI algorithm. I believe this idea was first suggested by Andrew Dalke (inchi-dicuss archive). In 2012, Noel O'Boyle published and implemented (in Open Babel) procedures for generate Universal and Inchified SMILES (O'Boyle 2012).

The procedure used by the CDK absolute SMILES follows the Universal SMILES rules. There may still be some small differences in the implementation so I'm hesitant of declaring it as a Universal SMILES implementation before that validation can be done. There are also some problems, such as, delocalised charges that would be useful to handle.

Performance

Generally the handling of SMILES in the CDK is much more robust with hydrogen counts and stereochemistry correctly round tripped. In addition to this the performance is also much better. This is most noticeable when using the non-canonical outputs and only storage is needed but it is also true of the canonical output.

The following table summarises the time taken to read and generate canonical SMILES for three different size datasets. The times are one-off measurements on a cold JVM. The red number in brackets is the number of compounds which caused an error. Using the InChI the absolute SMILES can not encode unknown atoms (CC*) which are ubiquitous in ChEBI (ontology classes).

Data Compounds CDK 1.4.15 CDK 1.5.4 unique CDK 1.5.4 absolute
ChEBI 108 ~27,000 54 s 5 s 23 s (1251)
NCI Aug '00 ~250,000 2 m 8 s 14 s 2 m 12 s (184)
ChEMBL 17 ~1,300,000 23 m 48 s (11) 1 m 38 s 15 m 26 s (36)

References