Efficient Bits: Bringing Molfile Sgroups to the CDK

In the last but one post I gave a demonstration of S(ubstance)group rendering in the CDK. Now I want to give some implementation insights.

Abbreviations (Superatoms)

Abbreviations contract part of a structure to a line formula or common abbreviation.


Full-structure	Abbreviated-structure

Abbreviating too much or using unfamiliar terms (e.g. maybe using CAR for carbazole) can make a depiction worse. However some structures, like CHEMBL590010, can be dramatically improved.

CHEMBL590010

One way to implement abbreviations would be by modifying the molecule data structure with literal collapse/contract and expand operations. Whilst this approach is perfectly reasonable, deleting atoms/bonds is expensive (in most toolkits) and it somewhat subtracts the "display shortcut" nature of this Sgroup.
For efficiency abbreviations are implemented by hiding parts of the depictions and remapping symbols. Just before rendering we iterator over the Sgroups and set hint flags that these atoms/bonds should not be included in the final image. If there is one attachment (e.g. Phenyl) we remap the attach point symbol text to the abbreviation label ('C'->'Ph'). When there are no attachments (e.g. AlCl₃) we add a new symbol to the centroid of the hidden atoms.


Hide atoms and bonds	Symbol Remap	Abbreviated Result

For two or more attachments (e.g. SO₂) you also need coordinate remapping.

Multiple Group

Multiple groups allow, contraction of a discrete number of repeating units. They are handled similarly to the abbreviations except we don't need to remap parts.

CHEBI:1233

All atoms are present in the data structure but are laid out on top of each other (demonstrated below). We have a list of parent atoms that form the repeat unit. Therefore to display multiple groups we hide all atoms and bonds in the Sgroup except for parent atoms and the crossing bonds.

It's worth mentioning that hidden symbols are still generated but simply excluded from the final result. This allows bond back off for hetero atoms to be calculated correctly as is seen in this somewhat tangled example:

Brackets

Polymer and Multiple group Sgroups require rendering of brackets. Encoded in the molfile (and when laid out) brackets are described by two points, a line. It is therefore up to the renderer to decide which side of the line the tick marks should face.
I've seen some implementations use the order of the points to convey bracket direction. Another method would be to point the brackets at each other. As shown for CHEBI:59342 this is not always correct.


Poor bracket direction	Preferred bracket direction
CHEBI:59342

I originally thought the solution might involve a graph traversal/flood-fill but it turns out there is a very easy way to get the direction correct. First we consider that brackets may or may not be placed on bonds, if a bracket is on a bond this information is available (crossing bonds).

For a bracket on a crossing bond exactly one end atom will be contained in the Sgroup, the bracket should point towards this atom.
If a bracket doesn't cross a bond then the direction should point to the centroid of all atoms in the Sgroup.

Efficient Bits

Saturday, 21 November 2015

Bringing Molfile Sgroups to the CDK - Rendering Tips

Abbreviations (Superatoms)

Multiple Group

Brackets

No comments:

Post a Comment