Sunday 10 July 2016

Generic Structure Depiction

Last week I attended the Seventh Joint Sheffield Conference on Chemoinformatics. It was a great meeting with some cool science and attendees. I had the pleasure of chatting briefly with John Barnard who's contributed a lot to the representation, storage, and retrieval of generic (aka Markush) structures (see Torus, Digital Chemistry - now owned by Lhasa).

At NextMove we've been doing a bit on processing sketches from patents (see Sketchy Sketches). I learnt a few things about how generic structures are typically depicted I thought be interesting to share.

Substituent Variation (R groups)


The most common type of generic feature is substituent variation, colloquially known as R groups. The variation allows concise representation with an invariant/fixed part of a compound and variable/optional part.


wherein R denotes

That is: anisole, toluene, or ethylbenzene.

Substituent Labels


Multiple substituent labels may be distinguished by a number R1, R2, ... Rn. However in reality, any label can and will be used. This can be particularly confusing when they collide with elements, examples include: Ra (Radium), Rg (Roentgenium) B (Boron), D (Deuterium), Y (Yttrium), W (Tungsten). The distinction between the label Ra and Radium may be semantically captured by a format but lost in depiction.

To distinguish such labels we can style them differently. By using superscripting and italicizing the label the distinction becomes clear and also somewhat improves the aesthetics of numbered R groups. We avoid subscript due to ambiguities with stoichiometry, for example: –NR2.

Attachment Points


For substituents there are different notation options. In writing, radical nomenclature is used, for the above example we'd say: methyl-oxyl (-OMe), ethyl (-Et), or methyl (-Me). However this doesn't translate well to depictions: .

The CTfile actually does stores substituents this way and specifies the attachment point (APO) information separately.
$RGP
  1
$CTAB
  2  1  0  0  0  0            999 V2000
    1.9048   -0.0893    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.6192    0.3232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  APO  1   1   1
M  END
$END CTAB
$CTAB
  1  0  0  0  0  0            999 V2000
    1.9940   -1.2869    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  APO  1   1   1
M  END
$END CTAB
$CTAB
  2  1  0  0  0  0            999 V2000
    1.8750   -2.3286    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.5895   -1.9161    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  APO  1   1   1
M  END
$END CTAB
$END RGP
Alternatively we may use a virtual or 'null' atom. We can convert to/from CTfile format although it's slightly easier to delete the null atom that add it on, due to coordinate generation. A disadvantage of this is the atom count isn't accurate, however the labelled group is also a type of null atom and already distorts the atom count. There are unfortunately different ways of depicting this null atom.
Don't use a dative bond style! You have to fudge the valences and just doesn't work, how would I show a double bond attachment?

The first time I'd encountered attachment points was in ChEBI where and R group means 'something attaches here' (CHEBI:58314, CHEBI:52569), whilst a 'star' label means 'attaches to something' (CHEBI:37807, CHEBI:32861). This actually a nice way of thinking about it, like two jigsaw pieces the asymmetry allows the substituent to connect to the labelled atom.

The 'star' atom used by ChEBI is tempting to use as there is a star atom in SMILES.
*OC
*C
*CC

However a '*' in SMILES actually means 'unspecified atomic number', some toolkits impose additional semantics. ChemAxon reads a 'star' to mean 'any atom', whilst OEChem, Indigo, and OpenBabel actually read more like an R Group, with [*:1] and [*:2] being R1 and R2 etc. ChemAxon Extended SMILES allows us to explicitly encode attachment points.
*OC |$_AP1$|
*C |$_AP1$|
*CC |$_AP1$|
I opted to implement the wavy line notation in CDK which is preferred by IUPAC graphical representation guidelines.
A major disadvantage of this notation is mis-encoding by users mistaking it for a wavy up/down stereo bond. I talk more about this in the poster (Sketchy Sketches) but the number of times you see the following drawn:
The captured connection table for that sketch does not have null atoms but instead uses carbon: