Types of SMILES
The original API involved creating a mutable generator and then creating SMILES by invoking one of several methods.
Code 1 - Existing SMILES generation API
SmilesGenerator smigen = new SmilesGenerator();
String smi = smigen.createSMILES(molecule);
String chismi = smigen.createChiralSMILES(molecule);
The new generator is immutable, and changing the configuration creates a new instance. There aren't many configuration options, but the key point is that there is no internal state, which makes the generator a lot more robust.
Code 2 - New SMILES generation API
SmilesGenerator smigen = SmilesGenerator.unique()
                                        .aromatic()
                                        .withAtomClasses();
smigen.createSMILES(molecule); // now deprecated
smigen.create(molecule);
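To illustrate why the lack of internal state matters, here is a small stand-alone sketch of the pattern. The class `FlavourConfig` and its accessors are hypothetical and are not the CDK implementation; the sketch only assumes that each configuration call returns a fresh instance and leaves the original untouched.

```java
// Hypothetical illustration of an immutable, fluent configuration object.
// This is NOT the CDK SmilesGenerator; all names here are assumptions.
final class FlavourConfig {
    private final boolean canonical;
    private final boolean aromatic;
    private final boolean atomClasses;

    private FlavourConfig(boolean canonical, boolean aromatic, boolean atomClasses) {
        this.canonical = canonical;
        this.aromatic = aromatic;
        this.atomClasses = atomClasses;
    }

    // Starting points, loosely mirroring the non-canonical / canonical split.
    static FlavourConfig generic() { return new FlavourConfig(false, false, false); }
    static FlavourConfig unique()  { return new FlavourConfig(true, false, false); }

    // Each option returns a new instance; 'this' is never modified.
    FlavourConfig aromatic()        { return new FlavourConfig(canonical, true, atomClasses); }
    FlavourConfig withAtomClasses() { return new FlavourConfig(canonical, aromatic, true); }

    boolean isCanonical()    { return canonical; }
    boolean isAromatic()     { return aromatic; }
    boolean hasAtomClasses() { return atomClasses; }

    public static void main(String[] args) {
        FlavourConfig base = FlavourConfig.unique();
        FlavourConfig derived = base.aromatic().withAtomClasses();
        System.out.println(base.isAromatic());    // false: base is unchanged
        System.out.println(derived.isAromatic()); // true
    }
}
```

Because `base` can never change after construction, it can be shared between threads or cached without one caller's settings leaking into another's output.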
| | non-canonical | canonical |
---|---|---|
no isotopes / stereo | generic | unique |
with isotopes / stereo | isomeric | absolute |
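The four names are just the combinations of two yes/no choices. As a rough sketch, the table can be written as a lookup; the enum `SmilesKind` is a hypothetical name for illustration and is not a CDK type.

```java
// Hypothetical enum mirroring the 2x2 table above; not part of the CDK.
enum SmilesKind {
    GENERIC, UNIQUE, ISOMERIC, ABSOLUTE;

    /** Picks the SMILES type from the two choices in the table. */
    static SmilesKind of(boolean canonical, boolean isotopesAndStereo) {
        if (canonical) {
            return isotopesAndStereo ? ABSOLUTE : UNIQUE;
        }
        return isotopesAndStereo ? ISOMERIC : GENERIC;
    }

    public static void main(String[] args) {
        System.out.println(SmilesKind.of(true, true));   // ABSOLUTE
        System.out.println(SmilesKind.of(false, false)); // GENERIC
    }
}
```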
CDK previously used the term chiral SMILES to refer to SMILES with stereochemistry. The term isomeric SMILES is now used and aligns with other toolkits (Daylight, RDKit, ChemAxon and OEChem).
Unique and Absolute SMILES
The equitable refinement published by Daylight (Weininger et al. 1988) has been optimised and is used to produce the unique SMILES. Generating absolute SMILES is trickier and currently uses the InChI algorithm. I believe this idea was first suggested by Andrew Dalke (inchi-discuss archive). In 2012, Noel O'Boyle published and implemented (in Open Babel) procedures for generating Universal and Inchified SMILES (O'Boyle 2012). The procedure used for the CDK absolute SMILES follows the Universal SMILES rules. There may still be some small differences in the implementation, so I'm hesitant to declare it a Universal SMILES implementation until it has been validated. There are also some outstanding problems, such as delocalised charges, that would be useful to handle.
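For readers unfamiliar with refinement, the general idea can be sketched in a few lines. This is a simplified illustration only: it is neither the published Daylight procedure nor the optimised CDK code, it does no tie-breaking (so it finds equivalence classes rather than a canonical labelling), and the class `RefineSketch` with its adjacency-list input is an assumption made for the example. Each atom's class is repeatedly split using the sorted classes of its neighbours until no further split occurs.

```java
import java.util.Arrays;

// Simplified partition refinement on a molecular graph (illustrative only).
final class RefineSketch {

    /** Refines atom classes until the number of classes stops growing. */
    static int[] refine(int[][] adj, int[] initial) {
        int n = adj.length;
        int[][] keys = new int[n][];
        for (int v = 0; v < n; v++) keys[v] = new int[]{initial[v]};
        int[] rank = rank(keys);
        int classes = countClasses(rank);
        while (true) {
            // Key = own rank followed by the sorted ranks of the neighbours.
            for (int v = 0; v < n; v++) {
                int[] nbr = new int[adj[v].length];
                for (int i = 0; i < nbr.length; i++) nbr[i] = rank[adj[v][i]];
                Arrays.sort(nbr);
                int[] key = new int[nbr.length + 1];
                key[0] = rank[v];
                System.arraycopy(nbr, 0, key, 1, nbr.length);
                keys[v] = key;
            }
            int[] next = rank(keys);
            int nextClasses = countClasses(next);
            if (nextClasses == classes) return rank; // nothing was split
            rank = next;
            classes = nextClasses;
        }
    }

    /** Dense ranking: equal keys share a rank, ranks start at 0. */
    static int[] rank(int[][] keys) {
        int n = keys.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Arrays.compare(keys[a], keys[b]));
        int[] rank = new int[n];
        int r = 0;
        for (int i = 0; i < n; i++) {
            if (i > 0 && Arrays.compare(keys[order[i - 1]], keys[order[i]]) != 0) r++;
            rank[order[i]] = r;
        }
        return rank;
    }

    static int countClasses(int[] rank) {
        int max = 0;
        for (int r : rank) max = Math.max(max, r);
        return max + 1;
    }

    public static void main(String[] args) {
        // Carbon skeleton of 2-methylbutane: C0-C1(-C4)-C2-C3
        int[][] adj = {{1}, {0, 2, 4}, {1, 3}, {2}, {1}};
        int[] degree = {1, 3, 2, 1, 1};
        System.out.println(Arrays.toString(refine(adj, degree))); // [1, 3, 2, 0, 1]
    }
}
```

In the example the two methyl groups on C1 end up in the same class, while the terminal methyl C3 is distinguished from them even though all three start with the same degree.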
Performance
Generally, the handling of SMILES in the CDK is much more robust, with hydrogen counts and stereochemistry correctly round-tripped. The performance is also much better. This is most noticeable when using the non-canonical output where only storage is needed, but it is also true of the canonical output.

The following table summarises the time taken to read and generate canonical SMILES for three datasets of different sizes. The times are one-off measurements on a cold JVM. The number in brackets is the number of compounds that caused an error. Because it relies on the InChI, the absolute SMILES cannot encode unknown atoms (CC*), which are ubiquitous in ChEBI (ontology classes).
Data | Compounds | CDK 1.4.15 | CDK 1.5.4 unique | CDK 1.5.4 absolute |
---|---|---|---|---|
ChEBI 108 | ~27,000 | 54 s | 5 s | 23 s (1251) |
NCI Aug '00 | ~250,000 | 2 m 8 s | 14 s | 2 m 12 s (184) |
ChEMBL 17 | ~1,300,000 | 23 m 48 s (11) | 1 m 38 s | 15 m 26 s (36) |
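To put the table in relative terms, the snippet below converts the reported times to seconds and computes the ratio between CDK 1.4.15 and the CDK 1.5.4 unique output. The class `TimingRatios` is just a helper written for this post, and since the times are one-off measurements the ratios should be read as indicative only.

```java
import java.util.Locale;

// Helper for turning the table's times into rough speed-up ratios.
final class TimingRatios {

    /** Parses strings such as "54 s" or "2 m 8 s" into seconds. */
    static int toSeconds(String text) {
        String[] parts = text.trim().split("\\s+");
        int seconds = 0;
        for (int i = 0; i + 1 < parts.length; i += 2) {
            int value = Integer.parseInt(parts[i]);
            seconds += parts[i + 1].equals("m") ? value * 60 : value;
        }
        return seconds;
    }

    /** Ratio of the old time to the new time. */
    static double speedup(String oldTime, String newTime) {
        return (double) toSeconds(oldTime) / toSeconds(newTime);
    }

    public static void main(String[] args) {
        // Columns: dataset, CDK 1.4.15, CDK 1.5.4 unique (values from the table)
        String[][] rows = {
            {"ChEBI 108", "54 s", "5 s"},
            {"NCI Aug '00", "2 m 8 s", "14 s"},
            {"ChEMBL 17", "23 m 48 s", "1 m 38 s"},
        };
        for (String[] row : rows) {
            System.out.println(String.format(Locale.ROOT, "%s: %.1fx",
                                             row[0], speedup(row[1], row[2])));
        }
        // ChEBI 108: 10.8x
        // NCI Aug '00: 9.1x
        // ChEMBL 17: 14.6x
    }
}
```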