In the previous post I outlined the changes to SMILES parsing in the Chemistry Development Kit (CDK). The original plan was to have several posts detailing the changes, but in the end it was more practical to put this in a single release note document (available: here). Although the key changes to SMILES generation are summarised in the release notes, I thought it would be worth touching on some details.

Types of SMILES

The original API involved creating a mutable generator and then creating SMILES by invoking one of several methods.

Code 1 - Existing SMILES generation API

SmilesGenerator smigen = new SmilesGenerator();
String smi    = smigen.createSMILES(molecule);
String chismi = smigen.createChiralSMILES(molecule);

The new generator is immutable: changing the configuration creates a new instance rather than modifying the existing one. There aren't many configuration options, but the key point is that there is no internal state, which makes the generator a lot more robust.

Code 2 - New SMILES generation API

SmilesGenerator smigen = SmilesGenerator.unique()
                                        .aromatic()
                                        .withAtomClasses();
smigen.createSMILES(molecule); // now deprecated
smigen.create(molecule);
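
The immutability described above can be illustrated with a minimal sketch of the pattern. This is not the CDK source: the class `ImmutableGeneratorSketch` and its fields are hypothetical and exist only to show how each configuration call can return a fresh instance while leaving the original untouched.

```java
// Hypothetical illustration of an immutable, fluent configuration object.
// It mirrors the style of the new API but is NOT the CDK implementation.
final class ImmutableGeneratorSketch {

    private final boolean canonical;
    private final boolean aromatic;
    private final boolean atomClasses;

    private ImmutableGeneratorSketch(boolean canonical, boolean aromatic, boolean atomClasses) {
        this.canonical = canonical;
        this.aromatic = aromatic;
        this.atomClasses = atomClasses;
    }

    // Factory methods choose the starting configuration.
    static ImmutableGeneratorSketch generic() {
        return new ImmutableGeneratorSketch(false, false, false);
    }

    static ImmutableGeneratorSketch unique() {
        return new ImmutableGeneratorSketch(true, false, false);
    }

    // Each option returns a new instance; 'this' is never modified.
    ImmutableGeneratorSketch aromatic() {
        return new ImmutableGeneratorSketch(canonical, true, atomClasses);
    }

    ImmutableGeneratorSketch withAtomClasses() {
        return new ImmutableGeneratorSketch(canonical, aromatic, true);
    }

    boolean isCanonical()    { return canonical; }
    boolean isAromatic()     { return aromatic; }
    boolean hasAtomClasses() { return atomClasses; }
}
```

Because every field is final, an instance such as `ImmutableGeneratorSketch.unique()` can be shared between threads or stored in a constant, and a later call to `aromatic()` on it cannot change the behaviour of code already holding the original.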

The generator can now create several different types of SMILES. The naming follows the Daylight specification.

	non-canonical	canonical
no isotopes / stereo	generic	unique
with isotopes / stereo	isomeric	absolute

CDK previously used the term chiral SMILES to refer to SMILES with stereochemistry. The term isomeric SMILES is now used and aligns with other toolkits (Daylight, RDKit, ChemAxon and OEChem).
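
The naming table above boils down to two yes/no choices. As a small sketch (the enum `SmilesKindSketch` is a hypothetical name used only for illustration and is not part of the CDK), the four names can be looked up from those two flags:

```java
// Hypothetical enum encoding the Daylight naming table; not a CDK type.
enum SmilesKindSketch {
    GENERIC(false, false),   // non-canonical, no isotopes / stereo
    UNIQUE(true, false),     // canonical,     no isotopes / stereo
    ISOMERIC(false, true),   // non-canonical, with isotopes / stereo
    ABSOLUTE(true, true);    // canonical,     with isotopes / stereo

    final boolean canonical;
    final boolean isomeric;

    SmilesKindSketch(boolean canonical, boolean isomeric) {
        this.canonical = canonical;
        this.isomeric = isomeric;
    }

    // Look up the name for a given combination of the two flags.
    static SmilesKindSketch of(boolean canonical, boolean isomeric) {
        for (SmilesKindSketch kind : values()) {
            if (kind.canonical == canonical && kind.isomeric == isomeric) {
                return kind;
            }
        }
        throw new IllegalStateException("unreachable: all four combinations are covered");
    }
}
```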

Unique and Absolute SMILES

The equitable refinement published by Daylight (Weininger et al. 1988) has been optimised and is used to produce the unique SMILES. Generation of absolute SMILES is trickier and currently uses the InChI algorithm. I believe this idea was first suggested by Andrew Dalke (inchi-discuss archive). In 2012, Noel O'Boyle published and implemented (in Open Babel) procedures for generating Universal and Inchified SMILES (O'Boyle 2012).

The procedure used by the CDK absolute SMILES follows the Universal SMILES rules. There may still be some small differences in the implementation, so I'm hesitant to declare it a Universal SMILES implementation before it has been validated. There are also some open problems, such as delocalised charges, that would be useful to handle.
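
To give a feel for the refinement step behind unique SMILES, here is a minimal sketch of a generic partition refinement on a molecular graph. It is an assumption-laden illustration, not the CDK's optimised code and not the exact published procedure: the class `RefinementSketch` is hypothetical, the only starting invariant is the atom degree, neighbour ranks are compared as sorted lists, and the tie-breaking needed to obtain a full canonical ordering is left out.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of equitable partition refinement (requires Java 9+
// for Arrays.compare). Not the CDK implementation.
final class RefinementSketch {

    /**
     * Refines atom ranks until the number of classes stops growing.
     * adj[v] lists the neighbours of atom v. Atoms that the refinement
     * cannot tell apart end up with the same rank.
     */
    static int[] refine(int[][] adj) {
        int n = adj.length;
        int[] rank = new int[n];
        for (int v = 0; v < n; v++) {
            rank[v] = adj[v].length; // initial invariant: degree only
        }
        Comparator<int[]> lexicographic = (a, b) -> Arrays.compare(a, b);
        int classes = -1;
        while (true) {
            // Key for each atom: its own rank followed by sorted neighbour ranks.
            int[][] keys = new int[n][];
            for (int v = 0; v < n; v++) {
                int[] nbrRanks = new int[adj[v].length];
                for (int i = 0; i < nbrRanks.length; i++) {
                    nbrRanks[i] = rank[adj[v][i]];
                }
                Arrays.sort(nbrRanks);
                int[] key = new int[nbrRanks.length + 1];
                key[0] = rank[v];
                System.arraycopy(nbrRanks, 0, key, 1, nbrRanks.length);
                keys[v] = key;
            }
            // Distinct keys in sorted order become the new dense ranks 0..k-1.
            TreeMap<int[], Integer> index = new TreeMap<>(lexicographic);
            for (int[] key : keys) {
                index.put(key, 0);
            }
            int next = 0;
            for (Map.Entry<int[], Integer> entry : index.entrySet()) {
                entry.setValue(next++);
            }
            for (int v = 0; v < n; v++) {
                rank[v] = index.get(keys[v]);
            }
            if (index.size() == classes) {
                return rank; // no class was split: the partition is stable
            }
            classes = index.size();
        }
    }
}
```

For a five-atom chain given as `{{1}, {0, 2}, {1, 3}, {2, 4}, {3}}`, the two terminal atoms share one rank, the two atoms next to them share another, and the central atom gets its own. A canonical labelling additionally has to break such ties, which this sketch does not attempt.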

Performance

Generally, the handling of SMILES in the CDK is now much more robust, with hydrogen counts and stereochemistry correctly round-tripped. Performance is also much better. The improvement is most noticeable with the non-canonical outputs, when the SMILES is only needed for storage, but it also holds for the canonical output.

The following table summarises the time taken to read the structures and generate canonical SMILES for three datasets of different sizes. The times are one-off measurements on a cold JVM. The number in brackets is the number of compounds that caused an error. Because it relies on the InChI, absolute SMILES generation cannot encode unknown atoms (e.g. CC*), which are ubiquitous in ChEBI (ontology classes).

Data	Compounds	CDK 1.4.15	CDK 1.5.4 unique	CDK 1.5.4 absolute
ChEBI 108	~27,000	54 s	5 s	23 s (1251)
NCI Aug '00	~250,000	2 m 8 s	14 s	2 m 12 s (184)
ChEMBL 17	~1,300,000	23 m 48 s (11)	1 m 38 s	15 m 26 s (36)
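
As a rough worked example, the speed-up of the unique output over CDK 1.4.15 can be computed directly from the table. The class `SpeedupSketch` below is a hypothetical helper written for this post; the times are copied from the table, and since they are one-off cold-JVM measurements the ratios are only indicative.

```java
// Hypothetical helper: turns the table's "m s" timings into speed-up ratios.
final class SpeedupSketch {

    /** Converts minutes and seconds to a total number of seconds. */
    static int seconds(int minutes, int secs) {
        return 60 * minutes + secs;
    }

    /** Ratio of the old time to the new time. */
    static double speedup(int oldSeconds, int newSeconds) {
        return (double) oldSeconds / newSeconds;
    }

    public static void main(String[] args) {
        // CDK 1.4.15 versus CDK 1.5.4 unique, values from the table above.
        System.out.printf("ChEBI 108:   %.1fx%n", speedup(seconds(0, 54), seconds(0, 5)));
        System.out.printf("NCI Aug '00: %.1fx%n", speedup(seconds(2, 8), seconds(0, 14)));
        System.out.printf("ChEMBL 17:   %.1fx%n", speedup(seconds(23, 48), seconds(1, 38)));
    }
}
```

On these numbers the unique output is roughly 11, 9 and 15 times faster for the three datasets respectively, while the absolute output, which goes through the InChI, is within a factor of about two of the old timings.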

Efficient Bits

Saturday, 11 January 2014

New SMILES behaviour - generating (CDK 1.5.4)
