One major overhaul has been the handling of the SMILES format. Mainly, tetrahedral and double-bond stereo-chemistry is read and written but they'll be more on that in the release note. In general the SMILES parsing API is the same as older versions but there is some behaviour changes which are worth highlighting to avoid possible "gotchas" with SMILES.
The previous parser behaviour was to create a structure, perform atom typing and add hydrogens using the CDK atom types. Bonds in aromatic SMILES were left as single bonds with aromatic flags. Aromaticity was optionally re-perceived. Leaving aromatic bonds with correct bond orders caused problems elsewhere in the toolkit.
Atom-typing / Implicit hydrogen counts
Atom-typing is no longer performed, this mirrors the behaviour of the other CDK readers and avoids atom-typing being invoked twice. As a significant overhead the atom-typer should be optional (although it is often required). If you read SMILES and use a procedure that requires the CDK atom-types you must now explicitly invoke the atom-type configuration.Although atom-types are not automatically set, the implicit hydrogen count is. The hydrogen count follows the Daylight SMILES specification allowing the formula to be precisely determined. If the atom-typing / hydrogen adding is redone with the CDK you may lose information. As is shown below, removing the automatic atom-typing allows better round-tripping.
IChemObjectBuilder builder = SilentChemObjectBuilder.getInstance(); SmilesParser sp = new SmilesParser(builder); SmilesGenerator sg = new SmilesGenerator(); InChIGeneratorFactory igf = InChIGeneratorFactory.getInstance(); IAtomContainer m = sp.parseSmiles("[O]"); System.out.println(sg.create(m)); // [O] System.out.println(igf.getInChIGenerator(m).getInchi()); // InChI=1S/O // configure atom types AtomContainerManipulator.percieveAtomTypesAndConfigureAtoms(m); CDKHydrogenAdder.getInstance(builder).addImplicitHydrogens(m); System.out.println(sg.create(m)); // O ([OH2]) System.out.println(igf.getInChIGenerator(m).getInchi()); // InChI=1S/H2O/h1H2
Kekulé representation
Reading an aromatic SMILES will now provide a structure with defined bond orders. The kekulisation is efficient and correctly handles some tough cases. Example inputs and depictions are shown below. There will be a future post describing the algorithm in detail.
Input | Depiction |
---|---|
c1(c)c(c)c(c)c(c)c(c)c1(c)
|
|
c1cccc1c1cccc1
| |
oc1ccc(o)cc1
| |
oc1ccocc1
| |
oc(cc)co
|
If a structure could not be kekulised it is considered invalid and a checked exception is thrown. The most common cause of errors is omitting a hydrogen present on an aromatic nitrogen (or phosphorus).
Input | Depiction |
---|---|
n1cccc1
|
Error - could not kekulise |
[nH]1cccc1
|
Why are invalid SMILES not "fixed"?
The decision to consider such SMILES invalid may be controversial. The primary issue is there is no guaranteed way to unambiguously handle the input such the intended structure was obtained.
To kekulise the pyrrole example, n1cccc1, we would need to change the formula of the structure by adding a hydrogen ([nH]1cccc1) or assign a negative charge to the nitrogen ([n-]1cccc1). Both alter the SMILES string by the same number of edits, such decisions can lead to ambiguities.
In the case of pyrrole the decision seems obvious and we would arrive at the correct representation. Most would favour adding a hydrogen rather than modifying the charge (although positive charges are frequently assigned to 4 valent nitrogens in normalisation). In general the placement of the hydrogen is not always as trivial.
Fixarom is a program available in the Daylight Contrib that attempts to modify SMILES to allow parsing (kekulisation). In effect it tries every possible combination of hydrogen placements and is how most resolutions would work. Apart from the obvious inefficiency the structure we end up with will be different depending on the input order. This is because SMILES is tautomer specific, that is, different tautomers have different SMILES.
I'm a huge fan and regularly use the Open Babel ready-to-use programs, recently I found that the
It would be possible to assign a single representation by choosing the canonical tautomer. There is however no guarantee that the representation is the same as was used to generate the original SMILES.
Rather than the choice of hydrogen placement it may be that there really should be no hydrogen placed.
To kekulise the pyrrole example, n1cccc1, we would need to change the formula of the structure by adding a hydrogen ([nH]1cccc1) or assign a negative charge to the nitrogen ([n-]1cccc1). Both alter the SMILES string by the same number of edits, such decisions can lead to ambiguities.
In the case of pyrrole the decision seems obvious and we would arrive at the correct representation. Most would favour adding a hydrogen rather than modifying the charge (although positive charges are frequently assigned to 4 valent nitrogens in normalisation). In general the placement of the hydrogen is not always as trivial.
Fixarom is a program available in the Daylight Contrib that attempts to modify SMILES to allow parsing (kekulisation). In effect it tries every possible combination of hydrogen placements and is how most resolutions would work. Apart from the obvious inefficiency the structure we end up with will be different depending on the input order. This is because SMILES is tautomer specific, that is, different tautomers have different SMILES.
Input | Depiction |
---|---|
CC(C)C1=CC=CC2=C1NC=N2 or CC(C)c1cccc2c1[nH]cn2
|
|
CC(C)C1=CC=CC2=C1N=CN2 or CC(C)c1cccc2c1nc[nH]2
|
I'm a huge fan and regularly use the Open Babel ready-to-use programs, recently I found that the
obabel
utility will automatically add hydrogens to aromatic nitrogens. As seen below, the input only differs in the atom order. We generate two different canonical SMILES because the output structures are different tautomers.obabel -:'CC(C)c1cccc2ncnc12' -ocan CC(c1cccc2c1nc[nH]2)C 1 molecule converted obabel -:'CC(c1cccc2c1ncn2)C' -ocan CC(c1cccc2c1[nH]cn2)C 1 molecule converted
It would be possible to assign a single representation by choosing the canonical tautomer. There is however no guarantee that the representation is the same as was used to generate the original SMILES.
Rather than the choice of hydrogen placement it may be that there really should be no hydrogen placed.
CHEBI:32730
which currently has the SMILES NC(Cc1c[n]c2ccccc12)C(O)=O
serves as a good example. Following our decision on pyrrole we would incorrectly add a hydrogen to the aromatic nitrogen when it was actually a radical. We could use the brackets as a trigger to indicate that there really should be no hydrogen but then we still can't assign bond orders. I believe this error in ChEBI is actually due to the incorrect use of the 'loose' instead of 'general' aromaticity option in Marvin. Again, we can obtain two different structures depending on our interpretation. In this case it is the less likely one which is correct.Input | Depiction |
---|---|
NC(Cc1c[n]c2ccccc12)C(O)=O (interpreted as NC(Cc1c[nH]c2ccccc12)C(O)=O )
|
|
NC(Cc1c[n]c2ccccc12)C(O)=O (interpreted as NC(Cc1c[N]c2ccccc12)C(O)=O )
|
As you can see resolving missing hydrogens in aromatic SMILES is ambiguous. In practise providing the SMILES were generated correctly the number of invalid strucutres is small. There are fewer than 40 invalid inputs between ChEBI and ChEMBL and generally they involved organometallics incorrectly represented in SMILES with non-covalent bonds specified as covalent.
If you still really want to parse these structures then 'kekulise' can be set to false but this is not recommended.
IChemObjectBuilder builder = SilentChemObjectBuilder.getInstance(); SmilesParser sp = new SmilesParser(builder); IAtomContainer container = sp.parseSmiles("n1cccc1"); // throws exception sp.kekulise(false); // turn off kekulisation IAtomContainer container = sp.parseSmiles("n1cccc1"); // creates an AtomContainer
Aromaticity re-perception
Previously the option
setPreservingAromaticity
could be set to return atoms and bonds with the aromaticity flags matching the input. By default the option was off and the CDKHueckelAromaticityDetector
was automatically invoked. The new default behaviour is to preserve the aromatic specification in the input (even if the input is not aromatic). With the bond orders now set we can safely apply a new aromaticity model if desired or simply remove the aromatic flags. The functionality followed from the removal of automatic atom-typing. The option, setPreservingAromaticity, has been deprecated.IChemObjectBuilder builder = SilentChemObjectBuilder.getInstance(); SmilesParser sp = new SmilesParser(builder); // cyclobuta-1,3-diene is not aromatic but has been specified as such IAtomContainer container = sp.parseSmiles("c1ccc1"); // remove flags -> molecule is still valid thanks to kekulisation for (IAtom a : container.atoms()) a.setFlag(CDKConstants.ISAROMATIC, false); for (IBond b : container.bonds()) b.setFlag(CDKConstants.ISAROMATIC, false);
Ring closure bond type
A SMILES ring closure may optionally have an explicit bond type. As recommended by Open SMILES an explicit symbol should only be written on either the opening or the closure. Writing the symbol on both the ring opening and closure is ambiguous.
A violation of this is sometimes found with up/down directional labels in double-bond stereo-chemistry. The bond labels are relative and reverse sense on ring closures. Such specification obviously occurs in rings but we can mock an example to show that
Generally if a parser accepts this input it may be choosing either to use the ring opening or ring closure. A simple test to determine if the opening or closing is use is to parse
A violation of this is sometimes found with up/down directional labels in double-bond stereo-chemistry. The bond labels are relative and reverse sense on ring closures. Such specification obviously occurs in rings but we can mock an example to show that
C/1.C/C=C\1
is correct
C/1.C/C=C/1
is incorrect. The new SMILES parser will not accept the second input and highlights where the problem was.
InvalidSMILESException: could not parse 'C/1.C/C=C/1', Ring closure bonds did not match. Ring was opened with '/' and closed with '\'. Note - directional bonds ('/','\') are relative. C/1.C/C=C/1 ^I decided to highlight this as during testing I encountered a couple of structures with this issue. CHEMBL1315255 currently has the SMILES shown below, the incorrect labels on the ring closures are highlighted.
Cl.Cl.Cc1cccc2\C(=C/3\\Nc4ccccc4/C/3=N\OCCN5CCNCC5)\C(=O)Nc12 1315255
Generally if a parser accepts this input it may be choosing either to use the ring opening or ring closure. A simple test to determine if the opening or closing is use is to parse
C=1CCCC#1
and C#1CCCC=1
. The choice of whether to prefer the open or close label differs between toolkits and I don't think there is a correct answer.
I would like to thank the valuable input and discussions from the CDK developers, EBI colleagues and CCN attendees. I have previously notified the ChEBI and ChEMBL curators of the problematic SMILES identified.
I'll leave you with a fun and seasonal example of valid SMILES input.
HoHoHo Merry Christmas (Daylight DEPICT) |