Wednesday, 18 December 2013

New SMILES behaviour - parsing (CDK 1.5.4)

It's been a while since the last CDK development release (over 4 months). The CDK 1.5.4 will be released in the next few days and has a lot of improved functionality. This is the first of several posts expanding on the functionality and detailing some design decisions.

One major overhaul has been the handling of the SMILES format. Mainly, tetrahedral and double-bond stereo-chemistry is read and written but they'll be more on that in the release note. In general the SMILES parsing API is the same as older versions but there is some behaviour changes which are worth highlighting to avoid possible "gotchas" with SMILES.

The previous parser behaviour was to create a structure, perform atom typing and add hydrogens using the CDK atom types. Bonds in aromatic SMILES were left as single bonds with aromatic flags. Aromaticity was optionally re-perceived. Leaving aromatic bonds with correct bond orders caused problems elsewhere in the toolkit.

Atom-typing / Implicit hydrogen counts

Atom-typing is no longer performed, this mirrors the behaviour of the other CDK readers and avoids atom-typing being invoked twice. As a significant overhead the atom-typer should be optional (although it is often required). If you read SMILES and use a procedure that requires the CDK atom-types you must now explicitly invoke the atom-type configuration.

Although atom-types are not automatically set, the implicit hydrogen count is. The hydrogen count follows the Daylight SMILES specification allowing the formula to be precisely determined. If the atom-typing / hydrogen adding is redone with the CDK you may lose information. As is shown below, removing the automatic atom-typing allows better round-tripping.

IChemObjectBuilder     builder = SilentChemObjectBuilder.getInstance();
SmilesParser           sp      = new SmilesParser(builder);
SmilesGenerator        sg      = new SmilesGenerator();
InChIGeneratorFactory  igf     = InChIGeneratorFactory.getInstance();
        
IAtomContainer m = sp.parseSmiles("[O]");
System.out.println(sg.create(m));                         // [O]
System.out.println(igf.getInChIGenerator(m).getInchi());  // InChI=1S/O
        
// configure atom types
AtomContainerManipulator.percieveAtomTypesAndConfigureAtoms(m);
CDKHydrogenAdder.getInstance(builder).addImplicitHydrogens(m);

System.out.println(sg.create(m));                         // O ([OH2])
System.out.println(igf.getInChIGenerator(m).getInchi());  // InChI=1S/H2O/h1H2

Kekulé representation

Reading an aromatic SMILES will now provide a structure with defined bond orders. The kekulisation is efficient and correctly handles some tough cases. Example inputs and depictions are shown below. There will be a future post describing the algorithm in detail.


Input Depiction
c1(c)c(c)c(c)c(c)c(c)c1(c)
c1cccc1c1cccc1
oc1ccc(o)cc1
oc1ccocc1
oc(cc)co
If a structure could not be kekulised it is considered invalid and a checked exception is thrown. The most common cause of errors is omitting a hydrogen present on an aromatic nitrogen (or phosphorus).

Input Depiction
n1cccc1 Error - could not kekulise
[nH]1cccc1


Why are invalid SMILES not "fixed"?

The decision to consider such SMILES invalid may be controversial. The primary issue is there is no guaranteed way to unambiguously handle the input such the intended structure was obtained.

To kekulise the pyrrole example, n1cccc1, we would need to change the formula of the structure by adding a hydrogen ([nH]1cccc1) or assign a negative charge to the nitrogen ([n-]1cccc1). Both alter the SMILES string by the same number of edits, such decisions can lead to ambiguities.

In the case of pyrrole the decision seems obvious and we would arrive at the correct representation. Most would favour adding a hydrogen rather than modifying the charge (although positive charges are frequently assigned to 4 valent nitrogens in normalisation). In general the placement of the hydrogen is not always as trivial.

Fixarom is a program available in the Daylight Contrib that attempts to modify SMILES to allow parsing (kekulisation). In effect it tries every possible combination of hydrogen placements and is how most resolutions would work. Apart from the obvious inefficiency the structure we end up with will be different depending on the input order. This is because SMILES is tautomer specific, that is, different tautomers have different SMILES.

Input Depiction
CC(C)C1=CC=CC2=C1NC=N2
or
CC(C)c1cccc2c1[nH]cn2
CC(C)C1=CC=CC2=C1N=CN2
or
CC(C)c1cccc2c1nc[nH]2
 

I'm a huge fan and regularly use the Open Babel ready-to-use programs, recently I found that the obabel utility will automatically add hydrogens to aromatic nitrogens. As seen below, the input only differs in the atom order. We generate two different canonical SMILES because the output structures are different tautomers.

obabel -:'CC(C)c1cccc2ncnc12' -ocan
CC(c1cccc2c1nc[nH]2)C 
1 molecule converted
obabel -:'CC(c1cccc2c1ncn2)C' -ocan
CC(c1cccc2c1[nH]cn2)C
1 molecule converted

It would be possible to assign a single representation by choosing the canonical tautomer. There is however no guarantee that the representation is the same as was used to generate the original SMILES.

Rather than the choice of hydrogen placement it may be that there really should be no hydrogen placed. CHEBI:32730 which currently has the SMILES NC(Cc1c[n]c2ccccc12)C(O)=O serves as a good example. Following our decision on pyrrole we would incorrectly add a hydrogen to the aromatic nitrogen when it was actually a radical. We could use the brackets as a trigger to indicate that there really should be no hydrogen but then we still can't assign bond orders. I believe this error in ChEBI is actually due to the incorrect use of the 'loose' instead of 'general' aromaticity option in Marvin. Again, we can obtain two different structures depending on our interpretation. In this case it is the less likely one which is correct.


Input Depiction
NC(Cc1c[n]c2ccccc12)C(O)=O
(interpreted as NC(Cc1c[nH]c2ccccc12)C(O)=O)
NC(Cc1c[n]c2ccccc12)C(O)=O
(interpreted as NC(Cc1c[N]c2ccccc12)C(O)=O)


As you can see resolving missing hydrogens in aromatic SMILES is ambiguous. In practise providing the SMILES were generated correctly the number of invalid strucutres is small. There are fewer than 40 invalid inputs between ChEBI and ChEMBL and generally they involved organometallics incorrectly represented in SMILES with non-covalent bonds specified as covalent.

If you still really want to parse these structures then 'kekulise' can be set to false but this is not recommended.

IChemObjectBuilder builder   = SilentChemObjectBuilder.getInstance();
SmilesParser       sp        = new SmilesParser(builder);
IAtomContainer     container = sp.parseSmiles("n1cccc1");  // throws exception
sp.kekulise(false);                                        // turn off kekulisation
IAtomContainer     container = sp.parseSmiles("n1cccc1");  // creates an AtomContainer

Aromaticity re-perception

Previously the option setPreservingAromaticity could be set to return atoms and bonds with the aromaticity flags matching the input. By default the option was off and the CDKHueckelAromaticityDetector was automatically invoked. The new default behaviour is to preserve the aromatic specification in the input (even if the input is not aromatic). With the bond orders now set we can safely apply a new aromaticity model if desired or simply remove the aromatic flags. The functionality followed from the removal of automatic atom-typing. The option, setPreservingAromaticity, has been deprecated.

IChemObjectBuilder builder   = SilentChemObjectBuilder.getInstance();
SmilesParser       sp        = new SmilesParser(builder);

// cyclobuta-1,3-diene is not aromatic but has been specified as such
IAtomContainer container = sp.parseSmiles("c1ccc1");

// remove flags -> molecule is still valid thanks to kekulisation
for (IAtom a : container.atoms()) 
    a.setFlag(CDKConstants.ISAROMATIC, false);
for (IBond b : container.bonds()) 
    b.setFlag(CDKConstants.ISAROMATIC, false);

Ring closure bond type

A SMILES ring closure may optionally have an explicit bond type. As recommended by Open SMILES an explicit symbol should only be written on either the opening or the closure. Writing the symbol on both the ring opening and closure is ambiguous.

A violation of this is sometimes found with up/down directional labels in double-bond stereo-chemistry. The bond labels are relative and reverse sense on ring closures. Such specification obviously occurs in rings but we can mock an example to show that C/1.C/C=C\1 is correct C/1.C/C=C/1 is incorrect. The new SMILES parser will not accept the second input and highlights where the problem was.
InvalidSMILESException: could not parse 'C/1.C/C=C/1', Ring closure bonds
did not match. Ring was opened with '/' and closed with '\'. Note -
directional bonds ('/','\') are relative.
C/1.C/C=C/1
         ^
I decided to highlight this as during testing I encountered a couple of structures with this issue. CHEMBL1315255 currently has the SMILES shown below, the incorrect labels on the ring closures are highlighted.

Cl.Cl.Cc1cccc2\C(=C/3\\Nc4ccccc4/C/3=N\OCCN5CCNCC5)\C(=O)Nc12 1315255

Generally if a parser accepts this input it may be choosing either to use the ring opening or ring closure. A simple test to determine if the opening or closing is use is to parse C=1CCCC#1 and C#1CCCC=1. The choice of whether to prefer the open or close label differs between toolkits and I don't think there is a correct answer.

I would like to thank the valuable input and discussions from the CDK developers, EBI colleagues and CCN attendees. I have previously notified the ChEBI and ChEMBL curators of the problematic SMILES identified.

I'll leave you with a fun and seasonal example of valid SMILES input.

HoHoHo Merry Christmas (Daylight DEPICT)

No comments:

Post a Comment

Note: only a member of this blog may post a comment.