Efficient Bits

Rules for Interpreting Up/Down Wedge Bonds

2019-09-02T15:45:00.000+01:00

Yesterday I was reminded of an old anecdote about a maths professor. A professor was lecturing an auditorium and writing down a proof. At one point they proclaimed, "and obviously x implies y". A student raised their hand and asked, "Is it obvious?". The professor then studied the equation for 30 minutes before stating "Yes, it is obvious" and continuing with the proof.

chemists and SMILES experts, is the bridge carbon point towards us, or away from us? FC(F)(F)c1ccc(nc1)O[C@H]1C[C@H](C[C@H]12)C([2H])([2H])N2C(=O)c1cccc(F)c1-c1=Nccc=N1 https://t.co/6WwEvUSbkl (cc @jwmay @cdsouthan) pic.twitter.com/93q8TpZA58
— Egon Willighⓐgen (@egonwillighagen) September 1, 2019

Egon asked whether the bridge in the following compound is up or down. I replied that obviously it's pointing down. Since it might not be obvious to everyone, I thought I'd explain the three rules for easily assigning up/down wedges to the other bonds around a stereocentre in 2D.

The insight actually comes from the algorithm used to assign up/down wedges to a depiction. Since only a handful of people have ever had to write such an algorithm I'm not sure how common knowledge it is (i.e. is it actually obvious?).

Rule 1 (D₄)

A tetrahedral centre with four bonds must have alternating up/down bonds. Therefore, no matter what the angles are, if one bond is labelled as up we know the bond opposite it must also be up, and the two either side must be down (the inverse of up).
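
Rule 1 can be sketched in a few lines of plain Java. This is only an illustration under some assumptions: the class name WedgeRule1, the Dir enum and the idea of describing each bond by its angle in degrees are all made up for the example and are not taken from any toolkit.

```java
import java.util.Arrays;

/** Sketch of Rule 1: labels alternate as we walk around a four-bond centre. */
final class WedgeRule1 {
    enum Dir {
        UP, DOWN;
        Dir flip() { return this == UP ? DOWN : UP; }
    }

    /**
     * @param angles angle in degrees (0-360) of each of the four bonds (hypothetical input form)
     * @param known  index into angles of the bond whose label is drawn
     * @param label  the drawn label of that bond
     * @return the implied label of every bond, in input order
     */
    static Dir[] assign(double[] angles, int known, Dir label) {
        if (angles.length != 4)
            throw new IllegalArgumentException("Rule 1 needs exactly four bonds");
        // order the bond indices by angle so we can walk around the centre
        Integer[] order = {0, 1, 2, 3};
        Arrays.sort(order, (a, b) -> Double.compare(angles[a], angles[b]));
        int start = Arrays.asList(order).indexOf(known);
        Dir[] result = new Dir[4];
        for (int step = 0; step < 4; step++) {
            int bond = order[(start + step) % 4];
            // an even number of steps from the known bond keeps its label, odd flips it
            result[bond] = (step % 2 == 0) ? label : label.flip();
        }
        return result;
    }
}
```

Note the actual angles are only used for ordering, which matches the "no matter what the angle" part of the rule.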

Rule 2 (D₃)

For three bonds, when bonds are spaced evenly (i.e. all angles < 180 degrees) then all bonds are the same direction, all up, or all down.

Rule 3 (D₃)

For three bonds, when an angle > 180 exists then think about it like the D₄ case with one neighbour missing. The "outside" bonds are opposite direction to the "middle" one.
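
Rules 2 and 3 can be combined into one small routine. The sketch below is a plain Java illustration with made-up names (WedgeRule23, Dir) and assumes bonds are given as angles in degrees; it refuses the exactly-180 case rather than guessing, which is only one of the behaviours a toolkit might choose.

```java
import java.util.Arrays;

/** Sketch of Rules 2 and 3 for a centre with three drawn bonds. */
final class WedgeRule23 {
    enum Dir {
        UP, DOWN;
        Dir flip() { return this == UP ? DOWN : UP; }
    }

    static Dir[] assign(double[] angles, int known, Dir label) {
        if (angles.length != 3)
            throw new IllegalArgumentException("expected exactly three bonds");
        Integer[] order = {0, 1, 2};
        Arrays.sort(order, (a, b) -> Double.compare(angles[a], angles[b]));
        // look for a gap larger than 180 degrees between neighbouring bonds
        int middle = -1;
        for (int i = 0; i < 3; i++) {
            double from = angles[order[i]];
            double to   = angles[order[(i + 1) % 3]];
            double gap  = ((to - from) % 360 + 360) % 360;
            if (gap == 180)
                throw new IllegalArgumentException("ambiguous: an angle is exactly 180 degrees");
            if (gap > 180)
                middle = order[(i + 2) % 3]; // the bond that does not border the large gap
        }
        Dir[] result = new Dir[3];
        for (int b = 0; b < 3; b++) {
            if (middle < 0) {
                result[b] = label;           // Rule 2: all bonds share one label
            } else {
                // Rule 3: the middle bond is opposite to the two outside bonds
                boolean sameGroup = (b == middle) == (known == middle);
                result[b] = sameGroup ? label : label.flip();
            }
        }
        return result;
    }
}
```

A real implementation would compare against 180 with a tolerance rather than with ==.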

Exceptions

Like all rules there are exceptions. A well-known ambiguity is when we have three bonds and the angle is exactly 180 degrees:

The problem here is you could move the central atom slightly to apply either Rule 2 or Rule 3. Some chemistry toolkits will refuse to read this; others will side with the more likely interpretation (Rule 3).

A bigger and perhaps more common issue is mixing up/down wedges with perspective projection. More precisely MDL (and then SYMYX, Accelrys, now BIOVIA) had something known as the "triangle rule". The idea was if you were looking at a molecule in 3D the lengths of the bond would indicate which way round you were looking at it. They imposed this concept on 2D interpretation.

In practice what this means is these two structures are read as different enantiomers (by BIOVIA) depending on whether the H is inside or outside the "triangle":

You're unlikely to encounter such cases except when a projection is involved. For example, for the bridged system pictured below, perspective has been used and we may end up with the H within the "triangle". Note it's the point stored in the file for the atom, not the actual "H" glyph, that matters.

This isn't to say projections are bad, only that mixing perspective with up/down wedges can be problematic.

Creating Chemical Structure Animations

2019-06-19T21:57:00.000+01:00

I've just got back from the Eighth Joint Sheffield Conference on Chemoinformatics where I presented about the technical details of subgraph isomorphism algorithms. It was a great conference (as usual) with good science, interesting posters and lots of fun. Noel was live tweeting the whole thing so check out the #Shef2019 hashtag if you want to catch up.

To help explain the algorithms in my talk I created some animations (as videos) that showed the backtracking search procedures. After the talk several delegates asked how I created these, so I thought I'd do a quick blog post on it (and also to remind me in future how to do it again). I'd done something similar before to demonstrate the Sayle and Delany algorithm for tautomer enumeration. Here's the PDF (and PPTX with the videos) of my Sheffield talk.

CDK + ffmpeg

The general idea is to generate a bunch of PNGs and then stick them together with the ffmpeg command line tool. In older blog posts I used to generate a GIF but it turns out mp4 compresses better with higher quality.

Step 1: Generating the PNGs with CDK

The example code below loads the SMILES for indole and then loops around highlighting one atom. Other than some normal params we also set the symbol visibility. By default the depiction will add in symbols for highlighted carbons, we can turn this off by overriding the parameter.

String dest = "/tmp/example1";
IChemObjectBuilder bldr   = SilentChemObjectBuilder.getInstance();
SmilesParser       smipar = new SmilesParser(bldr);
IAtomContainer     mol    = smipar.parseSmiles("[nH]1ccc2c1cccc2");
DepictionGenerator dg = new DepictionGenerator().withZoom(5)
                                                .withSize(1280, 720)
                                                .withParam(StandardGenerator.Visibility.class,
                                                           SymbolVisibility.iupacRecommendationsWithoutTerminalCarbon())
                                                .withOuterGlowHighlight();
new File(dest).mkdirs();
for (int i=0; i<100; i++) {
    IAtom atm = mol.getAtom(i % mol.getAtomCount());
    dg.withHighlight(Collections.singleton(atm), Color.RED)
      .depict(mol)
      .writeTo(String.format("%s/frame.%03d.png", dest, i));
}

Step 2: Stick them all together

$ ffmpeg -r 12/1 -start_number 0 -i /tmp/example1/frame.%03d.png \
  -c:v libx264 -r 30 -pix_fmt yuv420p example1.mp4

The key parameters here are the output name (last arg) and input name template (-i /tmp/example1/frame.%03d.png). I zero-padded the frame numbers and matched this in the template (%03d). You don't need to zero-pad, but it means the files sort in frame order when listed alphabetically by your file system, which is handy. The first -r sets the input frame rate to 12/1 (12 frames per second); the second -r sets the output frame rate to 30.

Okay so how did it turn out....

Download: example1.mp4

Not bad but the inter-frame alignment is off due to the different size caused by the moving outer-glow. In future I might have an anchor attribute that would lock the depiction in place relative to some atom but that's quite specialised. We can fix the shifting by highlighting all other atoms to the same as the background. This is possible with the high-level APIs but it's easy enough just to set the field on the atom:

for (int i=0; i<100; i++) {
    IAtom target = mol.getAtom(i % mol.getAtomCount());
    // reset all
    for (IAtom a : mol.atoms())
        a.setProperty(StandardGenerator.HIGHLIGHT_COLOR,
                      Color.WHITE);
    target.setProperty(StandardGenerator.HIGHLIGHT_COLOR,
                       Color.RED);
    dg.depict(mol)
      .writeTo(String.format("%s/frame.%03d.png", dest, i));
}

Download: example2.mp4

Add/Remove Atoms and Bonds?

Because of the alignment issue we need to use some tricks to add/remove atoms and bonds. Essentially you draw the whole thing and then hide the parts you don't want by setting them to the background color. For example:

int frameId = 0;
for (IAtom atom : mol.atoms())
    hide(atom);
for (IBond bond : mol.bonds())
    hide(bond);
// add bonds
for (int i = 0; i < mol.getBondCount(); i++) {
    IBond bnd = mol.getBond(i);
    show(bnd);
    show(bnd.getBegin());
    show(bnd.getEnd());
    dg.depict(mol)
      .writeTo(String.format("%s/frame.%03d.png", dest, ++frameId));
}
// hold for 12 frames
for (int i = 0; i < 12; i++)
    dg.depict(mol)
      .writeTo(String.format("%s/frame.%03d.png", dest, ++frameId));
// remove bonds
for (int i = 0; i < mol.getBondCount(); i++) {
    IBond bnd = mol.getBond(mol.getBondCount()-i-1);
    hide(bnd);
    if (!visible(bnd.getBegin()))
        hide(bnd.getBegin());
    if (!visible(bnd.getEnd()))
        hide(bnd.getEnd());
    dg.depict(mol)
      .writeTo(String.format("%s/frame.%03d.png", dest, ++frameId));
}

Where show/hide are:

private static void hide(IChemObject x) {
    x.setProperty(StandardGenerator.HIGHLIGHT_COLOR,
                  Color.WHITE);
}

private static void show(IChemObject x) {
    x.setProperty(StandardGenerator.HIGHLIGHT_COLOR,
                  Color.BLACK);
}

Producing:

Download: example3.mp4

That's just about it, the CDK depiction is quite configurable but not really intended to be a general purpose drawing/animation tool. However using some tricks you can work around some quirks and get some nice results. If you make animations let me know as I'd love to see them!

CDK Depict on HTTPS

2018-10-25T12:47:00.001+01:00

Just a quick post to say CDK Depict is now using HTTPS https://simolecule.com/cdkdepict/depict.html. The main reason for this was Blogger stopped allowing image links to HTTP resources. In general browsers are being more fussy about non HTTPS content.

I used Let's Encrypt, which turned out to be very easy to configure with Tomcat.

Step 1

Install the certbot utility and use it to generate a certificate.

$ sudo certbot certonly

Step 2

Configure the Tomcat 8+ connectors. This used to be more complex on older Tomcat servers, which needed a separate keystore to be generated. Editing $CATALINA_HOME/conf/server.xml we configure the base connector: redirectPort is changed from 8443 to 443 (and the port from 8080 to 80).

<Connector port="80" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="443" />

We also configure the SSL connector using port 443, change to the NIO based protocol org.apache.coyote.http11.Http11NioProtocol (the default requires an extra native library) with an HTTP/2 upgrade via org.apache.coyote.http2.Http2Protocol, and set the file paths to the .pem files generated by certbot.

<Connector port="443" 
           protocol="org.apache.coyote.http11.Http11NioProtocol"
           maxThreads="150" SSLEnabled="true" >
  <UpgradeProtocol className="org.apache.coyote.http2.Http2Protocol" />
  <SSLHostConfig>
    <Certificate certificateKeyFile="/etc/letsencrypt/live/www.simolecule.com/privkey.pem"
                 certificateFile="/etc/letsencrypt/live/www.simolecule.com/cert.pem"
                 certificateChainFile="/etc/letsencrypt/live/www.simolecule.com/chain.pem"
                 type="RSA" />
  </SSLHostConfig>
</Connector>

Step 3 (optional)

If a client tries to visit the HTTP site we want to redirect them to HTTPS. To do this we edit $CATALINA_HOME/conf/web.xml, adding this section to the end of the <web-app> block:

<security-constraint>
  <web-resource-collection>
    <web-resource-name>Entire Application</web-resource-name>
    <url-pattern>/*</url-pattern>
  </web-resource-collection>
  <user-data-constraint>
    <transport-guarantee>CONFIDENTIAL</transport-guarantee>
  </user-data-constraint>
</security-constraint>

Bit packing for fast atom type assignment

2018-10-22T08:26:00.000+01:00

Many cheminformatics algorithms perform some form of atom typing as their first step. The atom types you need and the granularity depend on the algorithm. At the core of all atom typing lies some form of decision tree, usually manifesting as a complex if-else cascade.

Using an elegant technique Roger showed me a few years ago, you can replace this if-else cascade with a table/switch lookup. This turns out to be very clean, easy to extend, and efficient. The technique relies on first computing a value that captures the bond environment around an atom and packing this in to a single value.

Here's what it looks like:

Algorithm 1

int btypes = atom.getImplicitHydrogenCount();
for (IBond bond : atom.bonds()) {
  switch (bond.getOrder()) {
    case SINGLE: btypes += 0x0001; break;
    case DOUBLE: btypes += 0x0010; break;
    case TRIPLE: btypes += 0x0100; break;
    default:     btypes += 0x1000; break;
  }
}

After the value btypes has been calculated (once for each atom) it contains the count of single, double, triple, and other bonds, with implicit hydrogens counted as single bonds. We can inspect each of these counts individually by masking off the relevant bits, or switch on the entire value, for example:

Algorithm 2

switch (btypes) {
 case 0x004: // 4 single bonds, e.g. Sp3 carbon
  break;
 case 0x012: // 1 double bond, 2 single bonds e.g. Sp2 carbon
  break;
 case 0x020: // 2 double bonds e.g. Sp cumulated carbon
  break;
 case 0x101: // 1 triple bond, 1 single bond e.g. Sp triple bonded carbon
  break;
}
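
Inspecting a single count comes down to a shift and a mask. The sketch below is a standalone illustration: the class name, the shift constants and the pack helper are assumptions made for the example, chosen to match the increments used in Algorithm 1.

```java
/** Sketch: reading individual bond counts back out of a packed btypes value. */
final class BondTypeNibbles {
    // nibble offsets matching the increments 0x0001, 0x0010, 0x0100 and 0x1000
    static final int SINGLE_SHIFT = 0;
    static final int DOUBLE_SHIFT = 4;
    static final int TRIPLE_SHIFT = 8;
    static final int OTHER_SHIFT  = 12;

    /** Extract one 4-bit count from the packed value. */
    static int count(int btypes, int shift) {
        return (btypes >>> shift) & 0xF;
    }

    /** Hypothetical helper that packs counts the same way Algorithm 1 does. */
    static int pack(int hcount, int nSingle, int nDouble, int nTriple) {
        return hcount + nSingle * 0x0001 + nDouble * 0x0010 + nTriple * 0x0100;
    }

    public static void main(String[] args) {
        int btypes = pack(0, 2, 1, 0); // one double and two single bonds
        System.out.println("0x" + Integer.toHexString(btypes));
        System.out.println("doubles=" + count(btypes, DOUBLE_SHIFT));
    }
}
```

If a count could ever exceed 15 it would carry into the next nibble, so a wider field would be needed for that category.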

Using a nibble allows us to store counts up to 15 (2⁴ − 1) - more than enough for any sane chemistry. In the example above I shoved default bonds under an 'other' category but of course you could extend it to handle quadruple bonds and even additional properties of the bonds:

Algorithm 3

int btypes = atom.getImplicitHydrogenCount();
for (IBond bond : atom.bonds()) {
  switch (bond.getOrder()) {
    case SINGLE: 
      if (bond.isAromatic())
        btypes += 0x010001; 
      else
        btypes += 0x000001;
      break;
    case DOUBLE: 
      if (bond.isAromatic())
        btypes += 0x010010; 
      else if (isOxygen(bond.getOther(atom)))
        btypes += 0x100010; // dbs to oxygens
      else
        btypes += 0x000010;
      break;
    case TRIPLE: 
       btypes += 0x000100;
       break;
  }
}
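
With the extended layout of Algorithm 3 the packed value can also drive a table lookup instead of a switch. The following is a minimal sketch: the class, the constant names and the type labels ("C.sp3" and so on) are invented for illustration, and it assumes the element has already been checked to be carbon since btypes does not encode it.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: table lookup keyed on the packed value from Algorithm 3. */
final class BondTypeTable {
    static final int SINGLE       = 0x000001;
    static final int DOUBLE       = 0x000010;
    static final int TRIPLE       = 0x000100;
    static final int AROMATIC     = 0x010000; // added once per aromatic bond
    static final int DB_TO_OXYGEN = 0x100000; // added once per double bond to oxygen

    static final Map<Integer, String> TYPES = new HashMap<>();
    static {
        TYPES.put(4 * SINGLE, "C.sp3");
        TYPES.put(DOUBLE + 2 * SINGLE, "C.sp2");
        TYPES.put(TRIPLE + SINGLE, "C.sp");
        // C=O plus two single bonds
        TYPES.put(DB_TO_OXYGEN + DOUBLE + 2 * SINGLE, "C.carbonyl");
        // aromatic single + aromatic double + one implicit hydrogen
        TYPES.put((AROMATIC + SINGLE) + (AROMATIC + DOUBLE) + SINGLE, "C.aromatic");
    }

    static String typeOf(int btypes) {
        return TYPES.getOrDefault(btypes, "unknown");
    }
}
```

Adding a new type is then a one-line change to the table rather than another branch in a cascade.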

RDKit Reaction SMARTS

2018-04-13T16:00:00.006+01:00

Update 19/09/22

I believe I was wrong: Daylight did allow SMIRKS to be run backwards, and the "direction" parameter controls it ("personal communication" from Roger). The key insight is that the documentation notes SMARTS patterns should only appear on atoms that don't have bond changes around them. From the man page:

"Atomic expressions may be any valid atomic SMARTS expression for nodes where the bonding (connectivity & bond order) doesn't change.1 Otherwise, the atomic expressions must be valid SMILES."

Whether this makes sense or not for writing portable SMIRKS is questionable.

This is supplementary to my original grumble on ambiguous naming. I still believe RDKit should just call it SMIRKS, or at least something which doesn't already mean something else (e.g. "RdSmirks"). I do agree there is a lot of overlap between SMARTS and SMIRKS but conflating these terms is problematic. Shortly after I wrote the original post, SMIRKS Native Open Force Field (SMIRNOFF) was published. In that work SMARTS is used, but they called it SMIRKS. There is some wiggling attempted by the authors that they use the "SMIRKS feature" of atom maps. However, atom maps are not a SMIRKS-only feature: they also appear in what Daylight called Reaction SMARTS, see the table "Examples of Reaction SMARTS" (SMARTS Theory Page). Of course SMIRNOFF was just too good a pun to change, but it again creates more confusion.

I recently ran some comparisons/benchmarks on different SMIRKS implementations. Everyone's semantics are consistently inconsistent and the community would benefit from an effort to standardise. Perhaps then I can convince RDKit that they really do have a (good) SMIRKS implementation and it would be less confusing to just call it that.

There have been some papers using the RDKit for synthesis planning. If you're writing a paper and use the term "Reaction SMARTS" make sure you mean what everyone else thinks it means.

The SMILES, SMARTS, and SMIRKS line notations were created* by Daylight for storing, matching, and transforming connection tables.

SMILES describes a connection table to store molecule and reactions
SMARTS describes a pattern (or query) to match molecules and reactions
SMIRKS describes a transform (or "reaction") to modify molecules

RDKit uses the term "Reaction SMARTS" to mean "transform" (see RDKit Book). Unfortunately in Daylight's terminology Reaction SMARTS is a pattern not a transform.

Screenshot from the Daylight SMARTS theory manual.

Reaction SMARTS is primarily useful for searching reaction databases. For example this Reaction SMILES:

[cH:13]1[c:14]([cH:19][c:20]([c:10]([c:11]1[Cl:12])[n:9]2[cH:8][c:5]([c:4]([n:22]2)[NH2:3])[C:6]#[N:7])[Cl:21])[C:15]([F:16])([F:17])[F:18].[OH:1][OH:2]>C(Cl)Cl.C(=O)(C(F)(F)F)OC(=O)C(F)(F)F.O>[cH:13]1[c:14]([cH:19][c:20]([c:10]([c:11]1[Cl:12])[n:9]2[cH:8][c:5]([c:4]([n:22]2)[N+:3](=[O:1])[O-:2])[C:6]#[N:7])[Cl:21])[C:15]([F:16])([F:17])[F:18]

is matched by this Reaction SMARTS

[*:1][Nh2:2]>>[*:1][Nh0:2](~[OD1])~[OD1] amino to nitro

You can highlight the substructure:

Highlighting the SMARTS in the SMILES using CDK Depict

But that's a transform!

Yes, but it's matching a transform (SMARTS), not applying one (SMIRKS). Some may think you could read this unmodified as a SMIRKS but this is not the case. SMIRKS needs "real parts" after the second angle bracket as these are the parts created by the transform. Note that '*' is valid SMILES and in SMIRKS it kind of means "unmodified". This actually gives us the nice invariants:

All SMILES are valid SMARTS but not all SMARTS are valid SMILES
and
All SMIRKS are valid SMARTS but not all SMARTS are valid SMIRKS

Here is the SMIRKS transform for amino to nitro

[*:1][ND3:2]([H])([H])>>[*:1][N:2](=O)=O amino to nitro

With SMIRKS I can apply this transform to "molecules" and it will create "reactions". Note these molecules do not need to have atom maps but they will come out with atom maps (see dt_transform)!

c1ccccc1N
[nH]1ccc2c1cc(N)cc2

The output is

c1cccc[c:1]1[NH2:2]>>c1cccc[c:1]1[N:2](=O)=O
[nH]1ccc2c1c[c:1]([NH2:2])cc2>>[nH]1ccc2c1c[c:1]([N:2](=O)=O)cc2

And another thing...

In general you can't run SMIRKS backwards. If I want to run a nitro to amino transform then, because the atoms/bonds we're adding need to be "real", we need to encode the reverse transform separately!

[*:1][ND3:2]([H])([H])>>[*:1][N:2](=O)=O amino to nitro
[*:1][ND3:2](~[OD1])(~[OD1])>>[*:1][N:2]([H])[H] nitro to amino

Although dt_transform specifies a direction this only controls whether the input molecules appear on the left or right of the output reaction.

*SMILES was created by Dave Weininger whilst at EPA

Sharp Tools for Java Refactoring: Byte Code Analysis

2017-05-06T15:06:00.001+01:00

I'm currently refactoring parts of the CDK core classes. As part of this I often need to find specific patterns/idioms that need to be changed. Whilst source code inspections and an IDE can make this task easy sometimes the tools aren't quite sharp enough.

I needed to find all occurrences of a reference (instead of value) comparison on a particular class. In Java there is no operator overloading and so you can have situations like this:

Integer a = new Integer(25);
Integer b = new Integer(25);
if (a == b) {} // false
if (a.equals(b)) {} // true

I mentioned operator overloading but it's more subtle: it's really about reference vs. value comparison. In C/C++ we can have similar behaviour:

int aval = 25, bval = 25;
int *a = &aval;
int *b = &bval;
if (a == b) {} // false
if (*a == *b) {} // true
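
The same pitfall applies to any user-defined class. As a rough sketch with entirely hypothetical classes (these are not CDK types), two wrapper objects around the same underlying atom are different references even though they represent the same thing:

```java
/** Sketch: reference comparison silently fails once objects can be wrapped. */
final class RefVsValue {
    /** Hypothetical stand-in for a domain class. */
    static final class Atom {
        final String symbol;
        Atom(String symbol) { this.symbol = symbol; }
    }

    /** Hypothetical wrapper that delegates to an underlying Atom. */
    static final class AtomWrapper {
        final Atom atom;
        AtomWrapper(Atom atom) { this.atom = atom; }

        // two wrappers are equal when they wrap the very same atom
        @Override public boolean equals(Object o) {
            return o instanceof AtomWrapper && ((AtomWrapper) o).atom == atom;
        }
        @Override public int hashCode() {
            return System.identityHashCode(atom);
        }
    }

    public static void main(String[] args) {
        Atom carbon = new Atom("C");
        AtomWrapper a = new AtomWrapper(carbon);
        AtomWrapper b = new AtomWrapper(carbon);
        System.out.println(a == b);      // false, different wrapper objects
        System.out.println(a.equals(b)); // true, same wrapped atom
    }
}
```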

Most IDEs and code inspection programs will warn about common occurrences (for example Integer) but I wanted to find places where the CDK's classes were used like this. A simple text grep will find some but will have false positives and negatives requiring lots of manual checking. Daniel suggested the well known FindBugs might be able to help.

Rather than analyse source code like PMD and Checkstyle, FindBugs analyses Java byte code with a set of default detectors to find often subtle but critical mistakes. FindBugs can be configured with custom detectors (see here), however the inspection I needed (RC: Suspicious reference comparison to constant) was almost there. After digging around in the source code I found you can provide a list of custom classes to detect. However it took a bit of trial and error to get what I needed.

First up we turn off all inspections except for the one we're looking for (we need to fix many others reported but I was looking for something specific). To do this we create an XML config that will only run the specific inspection (RC for Reference Comparison):

findbugs-include.xml

<?xml version="1.0" encoding="UTF-8"?>
<FindBugsFilter>
  <Match>
    <Bug code="RC"/>
  </Match>
</FindBugsFilter>

We then run findbugs with this configuration and provide the frc.suspicious property.

Running findbugs

$> findbugs -textui \
            -include findbugs-include.xml \
            -property "frc.suspicious=org.openscience.cdk.interfaces.IAtom" \
            base/*/target/cdk-*-2.0-SNAPSHOT.jar

This produces an accurate report of all the places the references are compared. Here's a sample:

H C RC: Suspicious comparison of org.openscience.cdk.interfaces.IAtom references in org.openscience.cdk.Bond.getOther(IAtom)  At Bond.java:[line 253]
H C RC: Suspicious comparison of org.openscience.cdk.interfaces.IAtom references in org.openscience.cdk.Bond.getConnectedAtom(IAtom)  At Bond.java:[line 265]
H C RC: Suspicious comparison of org.openscience.cdk.interfaces.IAtom references in org.openscience.cdk.Bond.getConnectedAtoms(IAtom)  At Bond.java:[line 281]
H C RC: Suspicious comparison of org.openscience.cdk.interfaces.IAtom references in org.openscience.cdk.Bond.contains(IAtom)  At Bond.java:[line 300]
...

CDK AtomContainers are Slow - Let's fix that

2017-04-03T22:51:00.000+01:00

The core class for molecule representation in CDK is the AtomContainer. The AtomContainer uses an edge-list data structure for storing the underlying connection table (see The Right Representation for the Job).

Essentially this edge-list representation is efficient in space, and atoms can be shared between and belong to multiple AtomContainers. However, because an atom does not know its own bonds, querying connectivity (is this atom connected to this other atom) means scanning the bond list and is linear time in the number of bonds.

The inefficiency of the AtomContainer can really sting. If someone was to describe Morgan's relaxation algorithm you may implement it like Code 1. The algorithm looks reasonable however it will run much slower than you expected. You may expect the runtime of this algorithm to be ~N² but it's actually ~N³. I've annotated with XXX where the extra effort creeps in.

Code 1 - Naive Morgan-like Relaxation (AtomContainer/AtomIter)

// Step 1. Algorithm body
int[] prev = new int[mol.getAtomCount()];
int[] next = new int[mol.getAtomCount()];
for (int i = 0; i < mol.getAtomCount(); i++) {
  next[i] = prev[i] = mol.getAtom(i).getAtomicNumber();
}
for (int rep = 0; rep < mol.getAtomCount(); rep++) { // 0..numAtoms
  for (int j = 0; j < mol.getAtomCount(); j++) {     // 0..numAtoms
    IAtom atom = mol.getAtom(j);
    // XXX: linear traversal! 0..numBonds
    for (IBond bond : mol.getConnectedBondsList(atom)) {
      IAtom nbr = bond.getConnectedAtom(atom); 
      // XXX: linear traversal! 0..numAtoms avg=numAtoms/2
      next[j] += prev[mol.getAtomNumber(nbr)]; 
    }
  }
  System.arraycopy(next, 0, prev, 0, next.length);
}

A New Start: API Rewrite?

Ultimately, fixing this problem correctly would involve changing the core AtomContainer representation. Unfortunately this would require an API change; optimally I think it would need the added constraint that atoms/bonds can not be in multiple molecules**. This would be a monumental change and not one I can stomach right now.

Existing Trade Off: The GraphUtil class

In 2013 I added the GraphUtil class for converting an AtomContainer to a more optimal adjacency list (int[][]) that was subsequently used to speed up many algorithms including: ring finding, canonicalisation, and substructure searching. Each time one of these algorithms is invoked with an IAtomContainer the first step is to build the adjacency list 2D array.

Code 2 - GraphUtil usage

IAtomContainer mol = ...;
int[][]        adj = GraphUtil.toAdjList(mol);

// optional with lookup map to bonds
EdgeToBondMap  e2b = EdgeToBondMap.withSpaceFor(mol);
int[][]        adj = GraphUtil.toAdjList(mol, e2b);
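
To picture what such a conversion involves, here is a minimal standalone sketch that turns an edge list into an int[][] adjacency list in two passes. It is not CDK's implementation: the class name and the representation of bonds as {begin, end} index pairs are assumptions for the example.

```java
import java.util.Arrays;

/** Sketch: edge list to adjacency list in time linear in atoms + bonds. */
final class AdjListSketch {
    /**
     * @param numAtoms number of atoms (vertices)
     * @param bonds    each entry is {beginIndex, endIndex} (hypothetical input form)
     */
    static int[][] toAdjList(int numAtoms, int[][] bonds) {
        // pass 1: count the neighbours of each atom so arrays can be sized exactly
        int[] degree = new int[numAtoms];
        for (int[] bond : bonds) {
            degree[bond[0]]++;
            degree[bond[1]]++;
        }
        int[][] adj = new int[numAtoms][];
        for (int i = 0; i < numAtoms; i++)
            adj[i] = new int[degree[i]];
        // pass 2: fill in the neighbours in both directions
        int[] fill = new int[numAtoms];
        for (int[] bond : bonds) {
            adj[bond[0]][fill[bond[0]]++] = bond[1];
            adj[bond[1]][fill[bond[1]]++] = bond[0];
        }
        return adj;
    }

    public static void main(String[] args) {
        int[][] adj = toAdjList(3, new int[][]{{0, 1}, {1, 2}});
        System.out.println(Arrays.deepToString(adj)); // [[1], [0, 2], [1]]
    }
}
```

Once built, the neighbours of atom i are simply adj[i], with no scan over the bond list.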

Although useful the usage of GraphUtil is somewhat clunky requiring passing around not just the adjacency list but the original molecule and the EdgeToBondMap if needed.

Code 3 - GraphUtil Depth First Traversal

void visit(IAtomContainer mol, int[][] adj, EdgeToBondMap bondmap, int beg, int prev) {
  mol.getAtom(beg).setFlag(CDKConstants.VISITED, true);
  for (int end : adj[beg]) {
    if (end == prev)
      continue;
    if (!mol.getAtom(end).getFlag(CDKConstants.VISITED))
      visit(mol, adj, bondmap, end, beg);
    else
      bondmap.get(beg, end).setIsInRing(true); // back edge
  }
}

Using the GraphUtil approach has been successful but due to the clunky-ness I've not felt comfortable exposing the option of passing these through to public APIs. It was only ever meant as an internal optimisation to be hidden from the caller. Beyond causing unintentional poor performance (Code 1) what often happens in a workflow is GraphUtil is invoked multiple times. A typical use case would be matching multiple SMARTS against one AtomContainer.

A New Public API: Atom and Bond References

I wanted something nicer to work with and came up with the idea of using object composition to extend the existing Atom and Bond APIs with methods to improve performance and connectivity checks.

Essentially the idea is to provide two classes, an AtomRef and a BondRef, that reference a given atom or bond in a particular AtomContainer. An AtomRef knows about the original atom, its connected bonds, and its index; the BondRef knows about the original bond, its index, and the AtomRefs for the connected atoms. The majority of methods (e.g. setSymbol, setImplicitHydrogenCount, setBondOrder) are passed straight through to the original atom or bond. Some methods (such as setAtom on IBond) are blocked as being unmodifiable.

Code 4 - AtomRef and BondRef structure

class AtomRef implements IAtom {
  IAtom         atm;
  int           idx;
  List<BondRef> bnds;
}

class BondRef implements IBond {
  IBond         bnd;
  int           idx;
  AtomRef       beg, end;
}
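
How the references get wired together can be sketched without any CDK types. The version below is a simplification under stated assumptions: it drops the wrapped atm/bnd fields and the interface delegation, takes bonds as {begin, end} index pairs, and all names other than AtomRef/BondRef are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: building linked atom/bond references in one pass over the bonds. */
final class RefSketch {
    static final class AtomRef {
        final int idx;
        final List<BondRef> bonds = new ArrayList<>();
        AtomRef(int idx) { this.idx = idx; }
    }

    static final class BondRef {
        final int idx;
        final AtomRef beg, end;
        BondRef(int idx, AtomRef beg, AtomRef end) {
            this.idx = idx;
            this.beg = beg;
            this.end = end;
        }
        /** Constant time: no lookup in the parent container is needed. */
        AtomRef getOther(AtomRef atom) {
            if (atom == beg) return end;
            if (atom == end) return beg;
            throw new IllegalArgumentException("atom is not part of this bond");
        }
    }

    static AtomRef[] build(int numAtoms, int[][] bonds) {
        AtomRef[] arefs = new AtomRef[numAtoms];
        for (int i = 0; i < numAtoms; i++)
            arefs[i] = new AtomRef(i);
        for (int i = 0; i < bonds.length; i++) {
            AtomRef beg = arefs[bonds[i][0]];
            AtomRef end = arefs[bonds[i][1]];
            BondRef bref = new BondRef(i, beg, end);
            beg.bonds.add(bref); // each bond is registered on both of its atoms
            end.bonds.add(bref);
        }
        return arefs;
    }
}
```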

We can now re-write the Morgan-like relaxation (Code 1) using AtomRef and BondRef. The scaling of this algorithm is now ~N² as you would expect.

Code 5 - Morgan-like Relaxation (AtomRef/AtomIter)

// Step 1. Initial up front conversion cost
AtomRef[] arefs = AtomRef.getAtomRefs(mol);

// Step 2. Algorithm body
int[]   prev  = new int[mol.getAtomCount()];
int[]   next  = new int[mol.getAtomCount()];
for (int i = 0; i < mol.getAtomCount(); i++) {
  next[i] = prev[i] = mol.getAtom(i).getAtomicNumber();
}
for (int rep = 0; rep < mol.getAtomCount(); rep++) {
  for (AtomRef aref : arefs) {
    int idx = aref.getIndex();
    for (BondRef bond : aref.getBonds()) {
      next[idx] += prev[bond.getConnectedAtom(aref).getIndex()];
    }
  }
  System.arraycopy(next, 0, prev, 0, next.length);
}

The depth first implementation also improves in readability and only requires two arguments.

Code 6 - AtomRef Depth First (AtomRef/AtomFlags)

// Step 1. Initial up front conversion cost
void visit(AtomRef beg, BondRef prev) {
  beg.setFlag(CDKConstants.VISITED, true);
  for (BondRef bond : beg.getBonds()) {
    if (bond == prev)
      continue;
    AtomRef nbr = bond.getConnectedAtom(beg);
    if (!nbr.getFlag(CDKConstants.VISITED))
      visit(nbr, bond);
    else
      bond.setIsInRing(true); // back edge
  }
}

Benchmark

I like the idea of exposing the AtomRef and BondRef to public APIs. I wanted to check the trade-off in calculating and using the AtomRef/BondRef vs the current internal GraphUtil. To test this I wrote a benchmark that implements some variants of Depth First Search and Morgan-like algorithms. I varied the algorithm implementations and whether I used IAtomContainer, GraphUtil, or AtomRef.

The performance was measured over ChEMBL 22, averaging the run time over 1/10th of the database (167,839 records). You can find the code on GitHub (Benchmark.java). Each algorithm computes a checksum to verify the same work is being done. Here are the raw results: depthfirst.tsv, and relaxation.tsv.

Depth First Traversal

A depth first traversal is a linear time algorithm. I tested eight implementations that varied the graph data structure and whether I used an external visit array or atom flags to mark visited atoms. When looking just at initialisation time the AtomRef creation is about the same as GraphUtil. There was some variability between the different variants but I couldn't isolate where the difference came from (maybe GC/JIT related). The runtime of the AtomRef was marginally slower than GraphUtil. Both were significantly faster (18-20x) than the AtomContainer to do the traversal. When we look at the total run-time (initialisation+traversal) we see that even for a linear algorithm, the AtomRef (and GraphUtil) were ~3x faster. Including the EdgeToBondMap adds a significant penalty.

Graph Relaxation

A more interesting test is a Morgan-like relaxation, as a more expensive algorithm (N²) it should emphasise any difference between the AtomRef and GraphUtil. The variability in this algorithm is whether we relax over atoms (AtomIter - see Code 1/5) or bonds (BondIter). We see a huge variability in AtomContainer/AtomIter implementation. This is because the algorithm is more susceptible to difference in input (molecule) size.

Clearly the AtomContainer/AtomIter is really bad (~80x slower). Excluding this result shows that, as expected, the AtomRef/AtomIter is slower than the GraphUtil/AtomIter equivalent (~2x slower). However because the AtomRef has a richer syntax, we can do a trick with XOR number storage to improve performance or iterate over bonds (BondIter) giving like-for-like speeds.
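
The XOR trick isn't spelled out here, so the following is only one plausible reading of it: a bond stores the XOR of its two atom indices, and the index of the "other" atom is recovered with a single XOR and no branching. The class name is made up for the sketch.

```java
/** Sketch: storing begin XOR end so the other end of a bond is one operation away. */
final class XorBond {
    private final int xor; // begin index XOR end index

    XorBond(int beg, int end) {
        this.xor = beg ^ end;
    }

    /** Valid only when idx is one of the two atom indices of this bond. */
    int getOther(int idx) {
        return xor ^ idx;
    }

    public static void main(String[] args) {
        XorBond bond = new XorBond(3, 7);
        System.out.println(bond.getOther(3)); // 7
        System.out.println(bond.getOther(7)); // 3
    }
}
```

The catch is that passing an index that isn't on the bond returns a meaningless value rather than failing, so the caller has to guarantee the precondition.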

Conclusions

The proposed AtomRef and BondRef provide a convenience API to use the CDK in a natural way with efficient connectivity access. The conversion to an AtomRef is efficient and provides a speedup even for linear algorithms. The encapsulation facilitates passing it as a public API parameter: users will be able to compute it ahead of time and pass it along to multiple algorithms.

I'm somewhat tempted to provide an equivalent AtomContainerRef allowing a drop-in replacement for methods that take the IAtomContainer interface. It is technically possible to implement writes (e.g. delete bond) efficiently in which case it would no longer be a 'Ref'. Maybe I'll omit that functionality or use a better name?

Footnotes

** My colleague Daniel Lowe notes that OPSIN allows atoms to be in multiple molecules and know about their neighbours but it's a bit of a fudge. It's certainly possible with some extra bookkeeping but prevents some other optimisations from being applied.

Generic Structure Depiction

2016-07-10T11:26:00.001+01:00

Last week I attended the Seventh Joint Sheffield Conference on Chemoinformatics. It was a great meeting with some cool science and attendees. I had the pleasure of chatting briefly with John Barnard who's contributed a lot to the representation, storage, and retrieval of generic (aka Markush) structures (see Torus, Digital Chemistry - now owned by Lhasa).

At NextMove we've been doing a bit on processing sketches from patents (see Sketchy Sketches). I learnt a few things about how generic structures are typically depicted that I thought would be interesting to share.

Substituent Variation (R groups)

The most common type of generic feature is substituent variation, colloquially known as R groups. The variation allows concise representation with an invariant/fixed part of a compound and variable/optional part.

wherein R denotes

That is: anisole, toluene, or ethylbenzene.

Substituent Labels

Multiple substituent labels may be distinguished by a number R¹, R², ... Rⁿ. However in reality, any label can and will be used. This can be particularly confusing when they collide with elements, examples include: Ra (Radium), Rg (Roentgenium), B (Boron), D (Deuterium), Y (Yttrium), W (Tungsten). The distinction between the label Ra and Radium may be semantically captured by a format but lost in depiction.

To distinguish such labels we can style them differently. By using superscripting and italicizing the label the distinction becomes clear and also somewhat improves the aesthetics of numbered R groups. We avoid subscript due to ambiguities with stoichiometry, for example: –NR₂.

Attachment Points

For substituents there are different notation options. In writing, radical nomenclature is used; for the above example we'd say: methyl-oxyl (-OMe), ethyl (-Et), or methyl (-Me). However this doesn't translate well to depictions.

The CTfile actually does store substituents this way and specifies the attachment point (APO) information separately.

$RGP
  1
$CTAB
  2  1  0  0  0  0            999 V2000
    1.9048   -0.0893    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.6192    0.3232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  APO  1   1   1
M  END
$END CTAB
$CTAB
  1  0  0  0  0  0            999 V2000
    1.9940   -1.2869    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  APO  1   1   1
M  END
$END CTAB
$CTAB
  2  1  0  0  0  0            999 V2000
    1.8750   -2.3286    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.5895   -1.9161    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  APO  1   1   1
M  END
$END CTAB
$END RGP

Alternatively we may use a virtual or 'null' atom. We can convert to/from CTfile format although it's slightly easier to delete the null atom than add it on, due to coordinate generation. A disadvantage of this is the atom count isn't accurate, however the labelled group is also a type of null atom and already distorts the atom count. There are unfortunately different ways of depicting this null atom.

Don't use a dative bond style! You have to fudge the valences and it just doesn't work: how would I show a double bond attachment?

The first time I'd encountered attachment points was in ChEBI where an R group means 'something attaches here' (CHEBI:58314, CHEBI:52569), whilst a 'star' label means 'attaches to something' (CHEBI:37807, CHEBI:32861). This is actually a nice way of thinking about it: like two jigsaw pieces, the asymmetry allows the substituent to connect to the labelled atom.

The 'star' atom used by ChEBI is tempting to use as there is a star atom in SMILES.

*OC
*C
*CC

However a '*' in SMILES actually means 'unspecified atomic number' and some toolkits impose additional semantics. ChemAxon reads a 'star' to mean 'any atom', whilst OEChem, Indigo, and OpenBabel actually read it more like an R Group, with [*:1] and [*:2] being R1 and R2 etc. ChemAxon Extended SMILES allows us to explicitly encode attachment points.

*OC |$_AP1$|
*C |$_AP1$|
*CC |$_AP1$|

I opted to implement the wavy line notation in CDK which is preferred by IUPAC graphical representation guidelines.

A major disadvantage of this notation is mis-encoding by users mistaking it for a wavy up/down stereo bond. I talk more about this in the poster (Sketchy Sketches) but the number of times you see the following drawn:

The captured connection table for that sketch does not have null atoms but instead uses carbon:

Bringing Molfile Sgroups to the CDK - Rendering Tips

2015-11-21T17:55:00.000+00:00

In the last but one post I gave a demonstration of S(ubstance)group rendering in the CDK. Now I want to give some implementation insights.

Abbreviations (Superatoms)

Abbreviations contract part of a structure to a line formula or common abbreviation.


Full-structure	Abbreviated-structure

Abbreviating too much or using unfamiliar terms (e.g. maybe using CAR for carbazole) can make a depiction worse. However some structures, like CHEMBL590010, can be dramatically improved.

CHEMBL590010

One way to implement abbreviations would be by modifying the molecule data structure with literal collapse/contract and expand operations. Whilst this approach is perfectly reasonable, deleting atoms/bonds is expensive (in most toolkits) and it somewhat detracts from the "display shortcut" nature of this Sgroup.
For efficiency, abbreviations are implemented by hiding parts of the depiction and remapping symbols. Just before rendering we iterate over the Sgroups and set hint flags that these atoms/bonds should not be included in the final image. If there is one attachment (e.g. Phenyl) we remap the attach point symbol text to the abbreviation label ('C'->'Ph'). When there are no attachments (e.g. AlCl₃) we add a new symbol at the centroid of the hidden atoms.


Hide atoms and bonds	Symbol Remap	Abbreviated Result

For two or more attachments (e.g. SO₂) you also need coordinate remapping.
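The hide-and-remap bookkeeping can be sketched without any toolkit types. This is only a rough illustration and not the CDK implementation: atoms are plain integer indices, bonds are index pairs, and the names AbbreviationSketch, attachAtoms and hiddenAtoms are made up for the example. It covers the zero and one attachment cases and does none of the coordinate remapping needed for two or more.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not CDK API: atoms are indices, bonds are {begin, end} pairs.
final class AbbreviationSketch {

    // Atoms inside the Sgroup that sit on a crossing bond (one end in, one end out).
    static Set<Integer> attachAtoms(Set<Integer> sgroup, int[][] bonds) {
        Set<Integer> attach = new LinkedHashSet<>();
        for (int[] b : bonds) {
            boolean begIn = sgroup.contains(b[0]);
            boolean endIn = sgroup.contains(b[1]);
            if (begIn != endIn)
                attach.add(begIn ? b[0] : b[1]);
        }
        return attach;
    }

    // Atoms to flag as hidden: everything in the Sgroup except the attach atoms,
    // whose symbol is remapped to the label instead (e.g. 'C' -> 'Ph'). With no
    // attachments every atom is hidden and the label is placed at the centroid.
    static Set<Integer> hiddenAtoms(Set<Integer> sgroup, int[][] bonds) {
        Set<Integer> hidden = new LinkedHashSet<>(sgroup);
        hidden.removeAll(attachAtoms(sgroup, bonds));
        return hidden;
    }

    public static void main(String[] args) {
        // toluene: atom 0 is the methyl carbon, atoms 1-6 are the phenyl ring
        int[][] bonds = {{0, 1}, {1, 2}, {2, 3}, {3, 4}, {4, 5}, {5, 6}, {6, 1}};
        Set<Integer> phenyl = new LinkedHashSet<>(Arrays.asList(1, 2, 3, 4, 5, 6));
        System.out.println("relabel as Ph: " + attachAtoms(phenyl, bonds));
        System.out.println("hide:          " + hiddenAtoms(phenyl, bonds));
    }
}
```

Bonds with both ends hidden would be flagged in the same pass; I've left that out to keep the sketch short.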

Multiple Group

Multiple groups allow contraction of a discrete number of repeating units. They are handled similarly to abbreviations except we don't need to remap parts.

CHEBI:1233

All atoms are present in the data structure but are laid out on top of each other (demonstrated below). We have a list of parent atoms that form the repeat unit. Therefore to display multiple groups we hide all atoms and bonds in the Sgroup except for parent atoms and the crossing bonds.

It's worth mentioning that hidden symbols are still generated but simply excluded from the final result. This allows bond back off for hetero atoms to be calculated correctly as is seen in this somewhat tangled example:

Brackets

Polymer and Multiple group Sgroups require rendering of brackets. Encoded in the molfile (and when laid out) brackets are described by two points, a line. It is therefore up to the renderer to decide which side of the line the tick marks should face.
I've seen some implementations use the order of the points to convey bracket direction. Another method would be to point the brackets at each other. As shown for CHEBI:59342 this is not always correct.


Poor bracket direction	Preferred bracket direction
CHEBI:59342

I originally thought the solution might involve a graph traversal/flood-fill but it turns out there is a very easy way to get the direction correct. First we consider that brackets may or may not be placed on bonds. If a bracket is on a bond this information is available (crossing bonds).

For a bracket on a crossing bond exactly one end atom will be contained in the Sgroup, the bracket should point towards this atom.
If a bracket doesn't cross a bond then the direction should point to the centroid of all atoms in the Sgroup.
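Both rules boil down to picking which of the two normals of the bracket line faces a target point. Here is a minimal sketch in plain Java, assuming the bracket is given as two distinct end points and the target is either the Sgroup atom on the crossing bond or the centroid of the Sgroup atoms. The class and method names are hypothetical and not part of the CDK API.

```java
// Hypothetical helper for illustration, not part of the CDK API.
final class BracketDirection {

    // Unit vector perpendicular to the bracket line (x1,y1)-(x2,y2), chosen so
    // that it points to the side of the line where the target (tx,ty) lies.
    static double[] towards(double x1, double y1, double x2, double y2,
                            double tx, double ty) {
        double nx = -(y2 - y1), ny = x2 - x1;          // one of the two normals
        double len = Math.hypot(nx, ny);
        nx /= len;
        ny /= len;
        double mx = (x1 + x2) / 2, my = (y1 + y2) / 2; // bracket mid-point
        if (nx * (tx - mx) + ny * (ty - my) < 0) {     // facing away? flip it
            nx = -nx;
            ny = -ny;
        }
        return new double[]{nx, ny};
    }

    // Centroid of the Sgroup atom coordinates, used when no bond is crossed.
    static double[] centroid(double[][] atoms) {
        double x = 0, y = 0;
        for (double[] a : atoms) {
            x += a[0];
            y += a[1];
        }
        return new double[]{x / atoms.length, y / atoms.length};
    }

    public static void main(String[] args) {
        // vertical bracket at x=0, the Sgroup atom on the crossing bond is at (-1,0)
        double[] d = towards(0, -1, 0, 1, -1, 0);
        System.out.printf("ticks point along (%.1f, %.1f)%n", d[0], d[1]);
    }
}
```

For a bracket that doesn't cross a bond, pass the result of centroid(...) as the target instead of the atom position.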

Java Serialization: Great power but at what cost?

2015-10-17T19:38:00.003+01:00

The default Java serialization framework provides a convenient mechanism for streaming in-memory Objects to another computer or storing them on disk. Beyond the obvious badness of being tied to the internal object layout (i.e. not stable through changes), serialization can be very inefficient. Externalization and libraries like Kryo are popular for improving performance.

SMILES: CO[C@@H]([C@H](OC(C)=O)[C@@H](OC(C)=O)[C@H](OC(C)=O)[C@H](OC(C)=O)COC(C)=O)SC

In the domain of Chemistry we have a rich variety of formats (e.g. SMILES) with which we can store molecules and reactions (in memory these are labelled graphs). Although these formats do not completely fulfil the utility of Object serialization they can be used as a building block upon which we build. Not only are these de facto standards but they can be much faster and more compact than default serialization of the in-memory connection table (graph) representation.

Recent History

Crafting efficient (de)serialization is beneficial and you can get great speed with simple setup. A few years ago I ran some experiments on writing an externalization stream for the Chemistry Development Kit (CDK) molecules (thread - High Performance Structure IO). Since the objects are huge any improvement over the default would be useful. This partly fed into the needs of CDK-Knime (a workflow tool) where I think CML was being used originally. From testing on ChEBI (~20,000 molecules) we see actually the ObjectInputStream was about as fast as an SDfile and much faster than CML. SDfiles are now much faster but that would be another post.

Read Performance

Method	Time	Size	Throughput
AtomContainerStream	346 ms	11.1 MiB	63739 s^-1
SDfile	4159 ms	51.7 MiB	5302 s^-1
CML	18605 ms	91.5 MiB	1185 s^-1
ObjectInputStream	5552 ms	93.9 MiB	3972 s^-1

It was around that time that Andrew Dalke paid a visit to EMBL-EBI. In discussing what I was currently working on he promptly showed me how fast OEChem could read/write SMILES. Needless to say – pretty quick and as fast if not faster than my attempt at 'High Throughput' streaming.

The CDK now also has fast SMILES processing and I wanted to compare this to the serialization to see how much of a performance penalty there is.

Benchmark

For a benchmark I used 100,000 structures from ChEMBL 20.

$ shuf chembl_20.smi | head -n 100000 > chembl_20_subset.smi

Writing it to an ObjectOutputStream takes 28.78 seconds. The SMILES subset file takes up 6.8 MiB on disk whilst the serialized objects take up 295 MiB. Ouch, that's 42x larger.

Code 1 - Writing to an ObjectOutputStream

IChemObjectBuilder bldr = SilentChemObjectBuilder.getInstance();
SmilesParser smipar = new SmilesParser(bldr);

String srcname = "/data/chembl_20_subset.smi";
String destname = "/data/chembl_20_subset.obj";

try (InputStream in = new FileInputStream(srcname);
     Reader rdr = new InputStreamReader(in, StandardCharsets.UTF_8);
     BufferedReader brdr = new BufferedReader(rdr);
     ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(destname))) {
    String line;
    long t0 = System.nanoTime();
    while ((line = brdr.readLine()) != null) {
        try {
            IAtomContainer mol = smipar.parseSmiles(line);

            // stereochemistry does not implement serializable...
            // so need to remove it
            mol.setStereoElements(new ArrayList(0));

            oos.writeObject(mol);
        } catch (CDKException e) {
            System.err.println(e.getMessage());
        }
    }
    long t1 = System.nanoTime();
    System.err.printf("write time: %.2f s\n", (t1 - t0) / 1e9);
}

In CDK we first read SMILES with Beam and then convert to the CDK objects so we'll also look at that small overhead. Here I compare the time to read the 100,000 SMILES using Beam, CDK, and the objects using an ObjectInputStream. Both CDK and Beam take less than 1 second whilst the ObjectInputStream takes more than 50.

In terms of throughput (mol per sec) here is the kind of speed we hit. I also show the total elapsed time for all 15 repeats.

Method	Min	Max	Elapsed Time	Size
Deserialization	1961 s^-1	2089 s^-1	12 m 16 s	295 MiB
Kryo (Auto)	42401 s^-1	44557 s^-1	33.9 s	186 MiB
Kryo Unsafe (Auto)	44854 s^-1	47331 s^-1	31.9 s	231 MiB
CDK	135286 s^-1	142126 s^-1	10.7 s	6.8 MiB
Beam	347534 s^-1	489545 s^-1	3.2 s	6.8 MiB
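As a quick cross-check of the table, the mean throughput can be recomputed from the elapsed times. This is a small sketch that assumes each elapsed time covers all 15 repeats of the 100,000 structure read and nothing else; the class name is made up.

```java
final class ThroughputCheck {

    // Mean molecules per second given the total elapsed time over all repeats.
    static double molsPerSecond(int molecules, int repeats, double elapsedSeconds) {
        return (double) molecules * repeats / elapsedSeconds;
    }

    public static void main(String[] args) {
        int mols = 100_000, repeats = 15;
        String[] method  = {"Deserialization", "Kryo (Auto)", "Kryo Unsafe (Auto)", "CDK", "Beam"};
        double[] elapsed = {12 * 60 + 16, 33.9, 31.9, 10.7, 3.2}; // seconds, from the table
        for (int i = 0; i < method.length; i++)
            System.out.printf("%-20s %8.0f s^-1%n", method[i],
                              molsPerSecond(mols, repeats, elapsed[i]));
    }
}
```

Under that assumption each mean lands between the Min and Max columns, e.g. roughly 2,040 s^-1 for default deserialization against about 140,000 s^-1 for CDK SMILES parsing.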

Auxiliary Data

With a performance difference that huge why would anyone want to use Serialization? Some use-cases might be that a format doesn't store the parts we need. A common argument against SMILES is the lack of coordinates but we can simply store this supplementary to the SMILES if we know what the input order will be (Code 2).

Code 2 - Writing Coordinates with SMILES

IAtomContainer  mol = ...;
// 'Generic' - avoid canon SMILES we are not doing identity check
SmilesGenerator sg  = SmilesGenerator.generic(); 

int   n     = mol.getAtomCount();
int[] order = new int[n];

// the order array is filled up as the SMILES is generated
String smi = sg.create(mol, order);

// load the coordinates array such that they are in the order the atoms
// are read when parsing the SMILES
Point2d[] coords = new Point2d[mol.getAtomCount()];
for (int i = 0; i < coords.length; i++)
  coords[order[i]] = mol.getAtom(i).getPoint2d();

// SMILES string suffixed by the coordinates
String smi2d = smi + " " + Arrays.toString(coords);

Using that same technique it's relatively simple to extend this to handle arbitrary data fields and it even forms the basis of ChemAxon's extended SMILES. A more advanced method would be combining the SMILES with a DataOutputStream since we know how many coordinates there are expected to be.
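A rough sketch of the DataOutputStream idea, using only the JDK. The record layout and the SmilesCoordsIO/Entry names are assumptions for illustration. With a SMILES parser to hand the atom count could be taken from the parsed structure; here I write it explicitly so the example needs no toolkit.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical record layout for illustration: SMILES, atom count, then x/y pairs.
final class SmilesCoordsIO {

    static final class Entry {
        final String smiles;
        final double[][] coords; // coords[i] = {x, y} of the i-th atom in the SMILES

        Entry(String smiles, double[][] coords) {
            this.smiles = smiles;
            this.coords = coords;
        }
    }

    static void write(DataOutputStream out, Entry e) throws IOException {
        out.writeUTF(e.smiles);        // 2-byte length prefix + modified UTF-8 bytes
        out.writeInt(e.coords.length); // explicit count, see the lead-in
        for (double[] xy : e.coords) {
            out.writeDouble(xy[0]);
            out.writeDouble(xy[1]);
        }
    }

    static Entry read(DataInputStream in) throws IOException {
        String smi = in.readUTF();
        double[][] coords = new double[in.readInt()][2];
        for (double[] xy : coords) {
            xy[0] = in.readDouble();
            xy[1] = in.readDouble();
        }
        return new Entry(smi, coords);
    }

    public static void main(String[] args) throws IOException {
        Entry ethanol = new Entry("CCO", new double[][]{{0, 0}, {1.3, 0.75}, {2.6, 0}});
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            write(out, ethanol);
        }
        Entry copy = read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(copy.smiles + " with " + copy.coords.length
                           + " coordinates in " + bytes.size() + " bytes");
    }
}
```

Note that writeUTF limits a single string to 65,535 encoded bytes, which is plenty for most SMILES but worth knowing about for very large structures.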

Summary

I'm certainly not against a performant AtomContainerInputStream but the default Java serialization should never be the first choice. Hopefully this post has put some numbers on why and will discourage knee-jerk usage.

Update

Added Kryo performance

Bringing Molfile Sgroups to the CDK - Demo

2015-09-08T20:43:00.001+01:00

Despite the flaws, the molfile has been a de facto standard for chemical representation for several decades. The core format (atom and bond block) is well supported in many toolkits but more advanced features (dark corners) of the property block may be skipped.

At this year's Fall ACS (Boston '15) I bumped into an old colleague from ChEBI who told me they (ChEBI) couldn't use CDK because they wanted to display repeating brackets on records and CDK didn't do that.

Polymer representation (more precisely Structural Repeat Unit) used by ChEBI falls under the category of a Ctab Sgroup. I'd wanted to add support for Sgroups for some time and now had motivation to do so.

Substructure (or Substance) Groups

Over the years there seems to have been a shift in definition. The original literature[1] uses the term "substructure groups" but more recent materials use "substance groups"[2,3]. Personally I prefer "substructure" since it concisely summarises what they really are about.

Essentially an Sgroup annotates some part of the connection table (a substructure) with meta-information (data). There are several types of Sgroup that formalise the types of annotation present:

Display Shortcuts
- Abbreviations
- Multiple Groups
Polymers
- Structural Repeat Unit (SRU)
- Monomer
- Copolymer (alternating, block, or random)
- Mer
- Crosslink
- Graft
- Modified
- Any
Mixtures
- Unordered Mixture
- Ordered Mixture (formulation)
- Component
Generic
Data

Example ChEBI Depictions

Egon reviewed the first patch (pull/149) last week that focussed on representation and molfile round tripping. The second patch enhances the rendering code to handle more than basic SRUs (e.g. >2 brackets) and display shortcuts.

As of ChEBI 131 there are 809 entries with at least one Sgroup. Generating the depictions of these from an SDfile took < 3 seconds, then a further 11 to actually write the files to disk. The rest of this post demonstrates some examples of those depictions.

Display Shortcut, Abbreviations

Previously referred to as "superatoms", parts of a structure can be abbreviated to a more concise name (e.g. Ph for a phenyl substituent). The full structure is present but is only displayed when the expansion flag is set.


CHEBI:29441	CHEBI:7725

Display Shortcut, Multiple Group

Multiple groups allow structures with fixed repeating parts to be drawn more concisely. Similar to abbreviations, all the atoms and bonds are present but are hidden from display. They're actually all overlaid on one another with duplicated coordinates but for rendering you still want to omit them from display.


CHEBI:1233	CHEBI:79399

Polymer, SRUs

The most common Sgroup used in ChEBI is the Structural Repeat Unit (SRU). An SRU defines a repeat unit of variable length. The brackets do not necessarily come in pairs, lie parallel, or point towards each other.


CHEBI:16838	CHEBI:4294

CHEBI:53422	CHEBI:59342

Polymer, Others

A few entries encode copolymers and source-based representations (monomer).


CHEBI:59599	CHEBI:3814 (overlap in original)

Combinations

A structure can have more than one Sgroup and they can be nested. Here we see a multiple group within an SRU. There is also a data Sgroup attached to the Zn-N bond marking it as a coordination bond for Marvin. I've not decided whether to render those yet, but we have the information there.

CHEBI:81539

Additional Reading

MMFF Partial Charges Improvements in CDK

2015-08-09T15:33:00.002+01:00

Some time last year Mark Williamson brought to my attention discrepancies in CDK's MMFF partial charge calculation. Investigating further it seemed to mainly be a problem with atom typing. There were two existing classes that could assign MMFF atom types using a combination of a decision tree and string matching hose codes. The 761 molecules from the MMFF94 Validation Suite provided by Paul Kersey were used to give a more comprehensive overview than our current tests.

The results showed reasonable precision per-atom in the validation suite but were less favourable per-molecule, the best implementation assigned types to <90% of the molecules with <16% assigned correctly.

	Assigned Types (Atoms)		Correct Types (Atoms)		Assigned Types (Molecules)		Correct Types (Molecules)
ForceFieldConfigurator	15576	90.1%	12932	74.8%	678	89.1%	118	15.5%
MMFF94AtomTypeMatcher	17120	99.1%	12309	71.2%	659	86.6%	75	9.9%
MmffAtomTypeMatcher	17279	100.0%	17279	100.0%	761	100.0%	761	100.0%

I wasn't keen to hard code the atom typing procedure but was delighted to find Robert Hanson of JMol had some SMARTS patterns that could be used as a starting point. After about a month of tweaking I managed to simplify the SMARTS patterns and achieve 100% precision on the validation suite. You can find the SMARTS patterns here: /org/openscience/cdk/forcefield/mmff/MMFFSYMB.sma.

Apart from improving atom type assignments the charge assignment also needed updating to include charge sharing and bond class differences. This wasn't quite as simple as I first thought as the parameter set parsing also needed reworking. After many months of analysis paralysis I decided last week to just rewrite what was needed and delegate calls from the existing implementation.

Now the patch is finished, charge assignments are much better. Notice that in the previous version (labelled CDK 1.5.10) equivalent terminal oxygens and the nitrogens in imidazole anion have different values. The overall charge was also inconsistent with the formal charges.

Improved charge assignment

Roger Sayle noted to me this week that MMFF charges should not be affected by representation, for example, charge separated pi bonds in nitro groups or phosphates.

Charges are independent of representation

Many thanks to Mark and Alison Choy for reporting the problem and adding patches for debugging and testing.

PhD Thesis Now Available

2015-01-29T17:38:00.002+00:00

I'm pleased to announce that my PhD thesis is now available from the Cambridge DSpace repository: https://www.repository.cam.ac.uk/handle/1810/246652. One thing potentially of note is the description of fast Kekulisation that I originally intended to write as a blog post. Also following up from NextMove Software's recent post by Daniel on Cahn-Ingold-Prelog (CIP), the results of Chapter 6 contain some more CIP madness.

CDK Release 1.5.10

2014-12-30T21:02:00.000+00:00

CDK 1.5.10 has been released and is available from sourceforge (download here) and the Maven central repository (XML 1).

This release follows very shortly after 1.5.9 and is the first release available from the central maven repository. This means there is now no need to include a custom repo when using the library in downstream projects (XML 1).

The short release notes (1.5.10-Release-Notes) summarise and detail the changes. Other than the availability in the central repository the release includes a new MolecularFormulaGenerator contributed by Tomáš Pluskal that provides mass to formula generation in a fraction of the time of the old MassToFormulaTool.

XML 1 - Maven POM configuration

<dependency>
  <groupId>org.openscience.cdk</groupId>
  <artifactId>cdk-bundle</artifactId>
  <version>1.5.10</version>
</dependency>

CDK Release 1.5.9

2014-12-24T15:56:00.002+00:00

CDK 1.5.9 has been released and is available from sourceforge (download here) and the EBI maven repo (XML 1).

This is the first release to be built using Java 7 and will require the Java SE Runtime 7 to execute. The previous release (1.5.8) will be the last to work with Java SE 6.

The full release notes (1.5.9-Release-Notes) summarise and detail the changes. One of the new features is the recognition of perspective projection stereochemistry.

XML 1 - Maven POM configuration

<repository>
  <url>http://www.ebi.ac.uk/intact/maven/nexus/content/repositories/ebi-repo/</url>
</repository>
...
<dependency>
  <groupId>org.openscience.cdk</groupId>
  <artifactId>cdk-bundle</artifactId>
  <version>1.5.9</version>
</dependency>

Memory Mapped Fingerprint Index - Part II

2014-12-01T23:31:00.001+00:00

This post follows up on the previous to report some timings. I've checked all the code into GitHub (johnmay/efficient-bits/fp-idx) and it has some stand alone programs that can be run from the command line.

Currently there are a few limitations that we'll get out the way:

Only generation of the CDK ECFP4 is supported and at a folded length of 1024, this should give a close approximation to what Matt used in MongoDB (RDKit Morgan FP). Other fingerprints and foldings could be used but the generation time of path based fingerprints in the CDK is currently (painfully) slow.
Building the index is done in memory, since 1,000,000x1024 bit fingerprints is only 122 MiB you can easily build indexes with less than 10 million on modern hardware.
During index searching the entire index is memory mapped, setting the chunks system property (see the GitHub readme) will avoid this at a slight performance cost.
Results return the id in the index (indirection) and to get the original Id one would need to resolve it with another file (generated by mkidx).
Index update operations are not supported without rebuilding it.

These are all pretty trivial to resolve and I've simply omitted them due to time. With that done, here's a quick synopsis of making the index, there is more in the GitHub readme.

Code 1 - Synopsis

$ ./smi2fps /data/chembl_19.smi chembl_19.fps # ~5 mins
$ ./mkidx chembl_19.fps chembl_19.idx # seconds

The fpsscan util does a linear search, computing every Tanimoto coefficient and outputting the lines that are above a certain threshold. The simmer and toper utils use the index; they either filter for similarity or return the top k results. They can take multiple SMILES via the command line or from a file.
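For reference, a linear Tanimoto scan over fingerprints stored as 64-bit words looks something like the following. This is an illustrative sketch and not the fpsscan source; the class and method names are made up, and a 1024 bit fingerprint would simply be a long[16].

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only, not the fpsscan implementation.
final class TanimotoScan {

    // Tanimoto coefficient of two equal-length bit sets stored as 64-bit words.
    static double tanimoto(long[] a, long[] b) {
        int both = 0, either = 0;
        for (int i = 0; i < a.length; i++) {
            both   += Long.bitCount(a[i] & b[i]);
            either += Long.bitCount(a[i] | b[i]);
        }
        return either == 0 ? 0 : (double) both / either;
    }

    // Linear scan: positions of every fingerprint at or above the threshold.
    static List<Integer> scan(long[][] fps, long[] query, double threshold) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < fps.length; i++)
            if (tanimoto(fps[i], query) >= threshold)
                hits.add(i);
        return hits;
    }

    public static void main(String[] args) {
        long[][] fps = {{0b1110L}, {0b0111L}, {0b1000L}};
        System.out.println(scan(fps, new long[]{0b0111L}, 0.5)); // prints [0, 1]
    }
}
```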

Code 2 - Running queries

$ ./fpsscan /data/chembl_19.fps 'c1cc(c(cc1CCN)O)O' 0.7 # ~ 1 second
$ ./simmer chembl_19.idx 0.7 'c1cc(c(cc1CCN)O)O' # < 1 second
$ ./toper chembl_19.idx 50 'c1cc(c(cc1CCN)O)O' # < 1 second (top 50)

Using the same queries from the MongoDB search I get the following distribution of search times for different thresholds.

Some median search times are as follows.

Threshold	Median time (ms)
0.90	14
0.80	31
0.70	46
0.60	53

In the box plot above the same (first) query is always the slowest; this is likely due to JIT.
It's interesting to see that the times seem to flatten out. By plotting how many fingerprints the search had to check we observe that below a certain threshold we are essentially checking the entire dataset.

The reason for this is potentially due to the sparse circular fingerprints. Examining the result file (see the GitHub readme) we can estimate that on average we're calculating 23,556,103 Tanimoto coefficients a second. This also means that retrieving the top k queries isn't bad either. For example 10,000 gives a median time (Code 3) of 72 ms.

Code 3 - Top 10,000 hits for queries (same as before)

$ ./toper chembl_19.idx 10000 queries.smi

Next I'll look at some like-for-like comparisons.

Memory Mapped Fingerprint Index - Part I

2014-11-26T22:50:00.001+00:00

I attended an interesting talk this afternoon (CCNM) by Matt Swain on using MongoDB for chemical similarity searching (code: github/mcs07/mongodb-chemistry).

The similarity searching partially uses the "Baldi" algorithm with some extra tweaks based on checking rare bits. The Baldi method is nicely summarised along with others by Tim Vandermeersch in his post on Fingerprint Searching Using Various Indexing Methods. As is noted by Tim, it can be improved upon.
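The bound at the heart of the Baldi method follows from the Tanimoto coefficient never exceeding min(|A|,|B|) / max(|A|,|B|), where |A| is the number of set bits. For a threshold t, only fingerprints with between t·|A| and |A|/t bits set can possibly match the query A. A minimal sketch of that window (names are hypothetical, and the 40 bit query is just an example value):

```java
// Hypothetical helper showing the popcount window used to prune a search.
final class BaldiBounds {

    // Fewest set bits a candidate can have and still reach the threshold.
    static int minBits(int queryBits, double threshold) {
        return (int) Math.ceil(queryBits * threshold);
    }

    // Most set bits a candidate can have, capped at the fingerprint length.
    static int maxBits(int queryBits, double threshold, int fpLength) {
        return (int) Math.min(fpLength, Math.floor(queryBits / threshold));
    }

    public static void main(String[] args) {
        int queryBits = 40; // an example sparse query, not a measured value
        for (double t : new double[]{0.9, 0.8, 0.7, 0.6})
            System.out.printf("t=%.2f -> check popcounts %d..%d%n", t,
                              minBits(queryBits, t), maxBits(queryBits, t, 1024));
    }
}
```

An index sorted by popcount only needs to visit the fingerprints inside this window. The window widens as the threshold drops, so for a database whose fingerprints all have similar popcounts it soon covers nearly everything.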

Anyways, I had an implementation of a memory mapped Baldi index lying around; there is also one in the OrChem database cartridge. I prototyped the implementation back in April and it was/is part of a "nfp" (new fingerprint) module for CDK. I've now put the code on a GitHub project (github/johnmay/efficient-bits/fp-idx) and will do some benchmarking to see how it does.

My feeling is that the very simple (it's about 100 lines) memory mapped index can give competitive performance on small datasets (<5 million entries).
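To give a feel for how little code memory mapping needs, here is a stripped-down sketch using java.nio. It is not the code from the fp-idx project: the file layout (raw 1024 bit fingerprints back to back) and all names are assumptions, and it leaves out the part that makes it a Baldi index, namely sorting the rows by popcount and storing bucket offsets.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.LongBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Illustrative only; this is not the code in the fp-idx project.
final class MappedFpScan {
    static final int WORDS = 16; // 1024 bit fingerprint = 16 x 64-bit words

    // Writes the fingerprints as raw 64-bit words, one row after another.
    // Each fingerprint must have exactly WORDS entries.
    static void write(Path path, long[][] fps) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(fps.length * WORDS * Long.BYTES);
        for (long[] fp : fps)
            for (long word : fp)
                buf.putLong(word);
        Files.write(path, buf.array());
    }

    // Maps the file read-only and returns the row ids at or above the threshold.
    static List<Integer> search(Path path, long[] query, double threshold) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            LongBuffer words = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()).asLongBuffer();
            List<Integer> hits = new ArrayList<>();
            int n = words.capacity() / WORDS;
            for (int id = 0; id < n; id++) {
                int both = 0, either = 0;
                for (int w = 0; w < WORDS; w++) {
                    long x = words.get(id * WORDS + w);
                    both   += Long.bitCount(x & query[w]);
                    either += Long.bitCount(x | query[w]);
                }
                if (either > 0 && both >= threshold * either)
                    hits.add(id);
            }
            return hits;
        }
    }

    public static void main(String[] args) throws IOException {
        long[][] fps = new long[3][WORDS];
        fps[0][0] = 0b1110L;
        fps[1][0] = 0b0111L;
        long[] query = new long[WORDS];
        query[0] = 0b0111L;
        Path idx = Files.createTempFile("fps", ".idx");
        write(idx, fps);
        System.out.println(search(idx, query, 0.5));
    }
}
```

The operating system pages the file in on demand, so the search code never manages its own buffers.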

Fun (and abuse) of implicit methods

2014-11-16T17:10:00.000+00:00

Earlier this year I wrote up some Chemical Toolkit Rosetta examples of using the CDK in Scala (github/cdk/cdk-scala-examples). When I was writing this it sprung to mind that it would be cool to (ab)use one feature for interoperability between cheminformatics toolkits.

Scala is a statically typed functional language that runs on the Java Virtual Machine. It has some nice features and syntax that can produce some very concise code. One thing that is particularly neat is the ability to define implicit methods. Essentially these are methods that define how to convert between types; they are implicit because the compiler can insert them automatically.

Implicit methods are very similar to auto(un)boxing that was introduced in Java 5 to simplify conversion of primitives and their instance wrappers (Code 1).

Code 1 - Autoboxing and autounboxing in Java

Integer x = 5; // ~ Integer x = Integer.valueOf(5);
int y = x;     // ~ int y = x.intValue();
x = y;         // ~ x = Integer.valueOf(y);

if (x == y) {  // ~ x.intValue() == y 
  
}

Much like it is possible in some programming languages to define custom operators, Scala makes it possible to define custom conversions that are inserted at compile time. The main advantage is it allows APIs to be extended to accept different types without introducing extra methods.

Conversion from line notations

Line notations are a concise means of encoding a chemical structure as a sequence of characters (String). Common examples include SMILES, InChI*, WLN, SLN, and systematic nomenclature. Conversion to and from these formats isn't too computationally expensive but probably not something you want to do on-the-fly. However, just for fun, let's see what an implicit method for converting from strings can do. First we need the specific methods for loading from a known string type. We'll use the CDK for SMILES and InChI with Opsin for nomenclature.

Code 2a - Parsing of linear notations

val bldr = SilentChemObjectBuilder.getInstance
val sp   = new SmilesParser(bldr)
val igf  = InChIGeneratorFactory.getInstance

def inchipar(inchi: String) = 
  igf.getInChIToStructure(inchi, bldr).getAtomContainer

def cdksmipar(smi: String) = 
  sp.parseSmiles(smi)

def nompar(nom: String) = 
  cdksmipar(NameToStructure.getInstance.parseToSmiles(nom))

def cansmi(ac: IAtomContainer) =
  SmilesGenerator.unique().create(ac)
  
// Universal SMILES (see. O'Boyle N, 2012**)
def unismi(ac: IAtomContainer) = 
  SmilesGenerator.absolute().create(ac)

Code 2b - Implicit conversion from a String to an IAtomContainer

implicit def autoParseCDK(str: String): IAtomContainer = {
    if (str.startsWith("InChI=")) { 
      inchipar(str)
    } else if (str.startsWith("1S/")) {
      inchipar("InChI=" + str)
    } else {
      try {
        cdksmipar(str)
      } catch {
        case _: InvalidSmilesException => nompar(str)
      }
    }
}

Now the implicit method has been defined, any method in the CDK API that accepts an IAtomContainer can now behave as though it accepts a linear notation. Code 3 shows how we can get the same Universal SMILES for different representations of caffeine and compute the ECFP4 fingerprint for porphyrin.

Code 3 - Using implicit methods

println(unismi("caffeine"))
println(unismi("CN1C=NC2=C1C(=O)N(C)C(=O)N2C"))
println(unismi("InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3"))
println(unismi("1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3"))
  
val fp = new CircularFingerprinter(CLASS_ECFP4).getCountFingerprint("porphyrin")

Conversion between toolkits

It is also possible to add in implicit methods to auto-convert between toolkit types. To convert between the CDK and RDKit (Java bindings) I'll go via SMILES. This conversion is lossy without an auxiliary data vector but serves as a proof of concept. I've lifted the Java bindings from the RDKit lucene project (github/rdkit/org.rdkit.lucene/) as the shared library works out the box for me. We can also add in the from string implicit conversions.

Code 4 shows the implicit method definitions. The additional autoParseRDKit allows us to bootstrap the RDKit API to also accept linear notations on all methods that expect an RWMol (or ROMol).

Code 4 - Implicit methods for conversion between CDK and RDKit

implicit def cdk2rdkit(ac : IAtomContainer) : RWMol = 
  RWMol.MolFromSmiles(SmilesGenerator.isomeric.create(ac))

// XXX: better to use non-canon SMILES
implicit def rdkit2cdk(rwmol : RWMol) : IAtomContainer = 
  cdksmipar(RDKFuncs.getCanonSmiles(rwmol))

implicit def autoParseRDKit(str: String): RWMol = 
  cdk2rdkit(autoParseCDK(str))

Now we can obtain the Avalon fingerprint of caffeine from its name and pass an RWMol to the CDK's PubchemFingerprinter (Code 5).

Code 5 - Using the RDKit API

val fp2 = new ExplicitBitVect(512)
RDKFuncs.getAvalonFP("caffeine", fp2)

val caffeine : RWMol = "caffeine"
new PubchemFingerprinter(bldr).getBitFingerprint(caffeine)

Given that auto(un)boxing primitives in Java can sting you in performance critical code, the above examples should be used sparingly. They do serve as a fun example of what is possible and I've put together the working code example in a Scala project for others to try github/johnmay/efficient-bits/impl-conversion.

Footnotes

* - Technically "InChI is not a replacement for any existing internal structure representations", Heller S. ICCS. 2014 but we'll allow it.
** - http://www.jcheminf.com/content/4/1/22

Not to scale

2014-09-12T14:59:00.001+01:00

The latest release of the CDK (1.5.8) includes a new generator for rendering structure diagrams. A detailed introduction to configuring the new generator is available on the CDK wiki^[1].

The new generator can be used as a drop in replacement in existing code. However, one aspect of rendering that I've struggled with previously was getting good sized depictions with the CDK - most notably with vector graphic output. This post will look at how we can size depictions and will provide code in an example project as a reference.

ChEBI's current entity of the month - maytansine [CHEBI:6701] will be used to demonstrate the sizing.

Parameters

Three parameters are important in the overall sizing of depictions. These are the BondLength, Scale, and Zoom which are all registered as BasicSceneGenerator parameters. The Zoom is not needed if we allow our diagram to be fitted automatically.

The BondLength can be set by the user and has a default value of '40' whilst the Scale is set during diagram generation. BondLength units are arbitrary - for now we'll consider this as '40 px'.

Scaling

The Scale parameter is used to render molecules with different coordinate systems consistently^[2,3]. The value is determined using the BondLength parameter and the bond length in the molecule. For maytansine [CHEBI:6701] the median bond length is ~0.82. Again, the units are arbitrary - this could be Angstroms (it isn't).

The Scale is therefore the ratio of the desired bond length (40 px) to the measured bond length (~0.82). For this example, the scale is 48.48. The coordinates must be scaled by ~4800% such that each bond is drawn as 40 px long.

Bounds

Now we know our scale (~48.48), how big is our depiction going to be? It depends how we measure it. One way would be to check the bounding box that contains all atom coordinates (using GeometryUtil.getMinMax()). However, this does not consider the positions of adjuncts and would lead to parts of the diagram being cut off^[4].

Fortunately the new generator provides a Bounds rendering element allowing us to determine the true diagram bounds of 8.46x8.03. Since the scale is ~48.48 the final size of the depiction would be ~410x390 px. A margin is also added.
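The arithmetic is simple enough to spell out. In this sketch the median bond length is taken as 0.825, which is my assumption to reproduce the quoted scale of 48.48 (the text only gives ~0.82), and the margin is left as a parameter since its value isn't stated above. The names are hypothetical.

```java
// Hypothetical helper spelling out the scale and size arithmetic.
final class DepictionSize {

    // Scale = desired bond length / bond length measured in the molecule.
    static double scale(double desiredBondLength, double measuredBondLength) {
        return desiredBondLength / measuredBondLength;
    }

    // Final size = model bounds multiplied by the scale, plus a margin on each side.
    static double[] size(double width, double height, double scale, double margin) {
        return new double[]{width * scale + 2 * margin, height * scale + 2 * margin};
    }

    public static void main(String[] args) {
        double s = scale(40, 0.825);                 // assumed unrounded bond length
        double[] px = size(8.46, 8.03, s, 0);        // bounds from the text, no margin
        System.out.printf("scale=%.2f size=%.0fx%.0f px%n", s, px[0], px[1]);
    }
}
```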

Current Rendering API

Now we have the size of our diagram we can render raster images. Unfortunately the current rendering API makes this a little tricky as the diagram is generated after the desired image size is provided by the user. To get the correct size we need to generate the diagram twice (to get the bounds) or use an intermediate representation (we'll see this later).

Code 1 - Current rendering API

// structure with coordinates
IAtomContainer container = ...; 

// create the renderer - we don't use a font manager
List<IGenerator<IAtomContainer>> generators =
        Arrays.asList(new BasicSceneGenerator(),
                      new StandardGenerator(new Font("Verdana", Font.PLAIN, 18)));
AtomContainerRenderer renderer = new AtomContainerRenderer(generators, null);

Graphics2D g2 = ...; // Graphics2D to draw raster / vector graphics
Rectangle2D bounds = ...; // need the bounds here!
renderer.paint(container,
               new AWTDrawVisitor(g2),
               bounds,
               true); // resetCenter

Vector graphics

To render scalable graphics we can use the VectorGraphics2D^[5] implementations of the Java Graphics2D class. Vector graphics output can use varied units (e.g. pt, mm, px) - the VectorGraphics2D uses mm.

Without adjusting our scaling the render of maytansine [CHEBI:6701] would be displayed with bond lengths of 40 mm and a total size of ~410x390 mm. The output can be rescaled after rendering but the default width of 41 cm is a bit large. We therefore need to change our desired bond length.

The bond length of published structure diagrams varies between journals. A common and recommended style for Wikipedia^[6] is 'ACS 1996' - the style has a bond length of '5.08 mm'. Although setting the BondLength parameter to '5.08' would work, other parameters would need adjusting, such as the font size (which is provided in pt!).

To render the diagram with the same proportions as the raster image we can instead resize the bounds and fit the diagram to this. Since the desired bond length is '5.08 mm' instead of '40 mm' we need to rescale the diagram to 12.7 % of its size (5.08 / 40 = 0.127). Our final diagram size is then ~52x50 mm. The border for ACS 1996 is '0.56 mm', which can be added to the diagram size.
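The rescaling is simple enough to check by hand, but a small sketch makes the order of operations explicit: scale first, then add the border on every side. The names below are hypothetical helpers, not part of the CDK or the example project.

```java
final class VectorSizeSketch {

    /** Factor that maps a diagram laid out at one bond length onto another. */
    static double rescale(double layoutBondLength, double targetBondLength) {
        return targetBondLength / layoutBondLength;
    }

    /** Final page size: the rescaled diagram plus a border on every side. */
    static double[] pageSize(double width, double height, double factor, double border) {
        return new double[]{width * factor + 2 * border,
                            height * factor + 2 * border};
    }

    public static void main(String[] args) {
        double factor = rescale(40.0, 5.08);                  // 40 mm -> 5.08 mm
        double[] page = pageSize(410.0, 390.0, factor, 0.56); // ACS 1996 border
        System.out.println(page[0] + " x " + page[1] + " mm");
    }
}
```

For the ~410x390 mm diagram this gives a factor of 0.127 and a page of roughly 53x51 mm once the 0.56 mm border is included.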

Example code

To help demonstrate the above rendering I've put together a quick GitHub project johnmay/efficient-bits/scaled-renders. The code provides a convenient API and a command line utility for generating images.

Code 2 - Intermediate object

// structure with coordinates
IAtomContainer container = ...; 

// create the depiction generator
Font font = new Font("Verdana", Font.PLAIN, 18);
DepictionGenerator generator = new DepictionGenerator(new BasicSceneGenerator(),
                                                      new StandardGenerator(font));

// generate the intermediate 'depiction'
Depiction depiction = generator.generate(container);

// holds on to the rendering primitives as well as the size
double w = depiction.width();
double h = depiction.height(); 

// draw at 'default' size
depiction.draw(g2, w, h);

// generate a PDF (or SVG)
String pdfDefault = depiction.toPdf(); // default size
String pdfScaled  = depiction.toPdf(1.5); // 1.5 * default size
String pdfAcs     = depiction.toPdf(0.508, 0.056); // bond length, margin

The command line utility provides several options to play with and can load from molfile, SMILES, InChI, or name (using OPSIN^[7]).

Code 3 - Command line

# In the project root set the following alias
$: alias render='mvn exec:java -Dexec.mainClass=Main'

# Using OPSIN to load porphyrin and generate a PDF
$: render -Dexec.args="-name porphyrin -pdf"

# Highlight one of the pyrroles in porphyrin
$: render -Dexec.args="-name porphyrin -pdf -sma n1cccc1"

# Show atom numbers
$: render -Dexec.args="-name porphyrin -pdf -atom-numbers"

# Show CIP labels
$: render -Dexec.args="-name '(2R)-butan-2-ol' -pdf -cip-labels"

# Generate a PDF / SVG for ethanol SMILES
$: render -Dexec.args="-smi CCO -pdf ethanol.pdf -svg ethanol.svg"

# Load a molfile
$: render -Dexec.args="-mol ChEBI_6701.mol -pdf chebi-6701.pdf"

You can even play with the font:

$: render -Dexec.args="-name 'caffeine' -svg cc-caffeine.svg -font-family 'Cinnamon Cake' -stroke-scale 0.6 -kekule"

Links/References

CDK Release 1.5.8

2014-09-11T11:40:00.001+01:00

CDK 1.5.8 has been released and is available from SourceForge (download here) and the EBI Maven repo (XML 1).

The release notes (1.5.8-Release-Notes) summarise and detail the changes.

XML 1 - Maven POM configuration

<repository>
  <url>http://www.ebi.ac.uk/intact/maven/nexus/content/repositories/ebi-repo/</url>
</repository>
...
<dependency>
  <groupId>org.openscience.cdk</groupId>
  <artifactId>cdk-bundle</artifactId>
  <version>1.5.8</version>
</dependency>

Polish-ed SMARTS parsing

2014-07-22T11:58:00.002+01:00

As introduced in previous posts, SMARTS is a concise notation for describing chemical substructure queries. There are several aspects to a SMARTS implementation: subgraph matching, parsing, generating, and even optimisation^[1,2].

In this post I'll show a way of parsing the binary atom expressions that I found quite neat.

Preliminaries

Conceptually, a SMARTS atom expression is composed of primitives and operators (binary and unary). The primitives test whether some property of an atom (e.g. element, charge, valence, etc.) has a certain value^[3]. The operators invert and combine these primitives through conjunction (and), disjunction (or), and negation (not).

Some examples of atom expressions are:

[O&X1]
[!C&!N]
[C,c;X3&v4]
[N&!H0&X3]
[!#6&X4]
[O,S,#7,#15]
[C&X3;$([H2]),$([H1][#6]),$(C([#6])[#6])]

The operators in these expressions ordered by their precedence are:

! unary not
& binary and (high)
, binary or
; binary and (low)

The default operator is '&' and can often be omitted such that the first pattern would read [OX1]. There are two 'and' operators with different precedence, allowing logical expressions like:

[C,N&X1]  C or (N and X1)
[C,N;X1]  (C or N) and X1

More complex expression trees can be accomplished with recursive SMARTS.

A formal grammar for SMARTS that respects precedence looks something like this (lifted from the CDK javacc implementation):

SMARTS EBNF grammar

AtomExpression    ::= "[" <LowAndExpression> "]"
LowAndExpression  ::= <OrExpression> [ ";" <LowAndExpression> ]
OrExpression      ::= <HighAndExpression> [ "," <OrExpression> ]
HighAndExpression ::= <NotExpression> [ "&" <HighAndExpression> ]
NotExpression     ::= [ "!" ] <AtomPrimitive>

Notice this is a recursive procedure where I ascend the precedence hierarchy while descending into the grammar. The small number of operators in SMARTS means this is generally good enough. However, there is a non-recursive alternative.
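To show what "descending into the grammar" means in practice, here is a minimal recursive descent sketch with one method per grammar rule. It is not the CDK JavaCC parser: the class name is made up, primitives are kept as raw strings, and recursive SMARTS ($(...)) is not handled. The result is a parenthesised string so the grouping is visible.

```java
final class RecursiveDescentSketch {
    private final String s;
    private int pos;

    private RecursiveDescentSketch(String s) { this.s = s; }

    /** AtomExpression ::= "[" LowAndExpression "]" */
    static String parse(String atomExpr) {
        RecursiveDescentSketch p = new RecursiveDescentSketch(atomExpr);
        p.expect('[');
        String expr = p.lowAnd();
        p.expect(']');
        return expr;
    }

    private String lowAnd() {   // lowest precedence, parsed first
        String left = or();
        return accept(';') ? "(" + left + " and " + lowAnd() + ")" : left;
    }

    private String or() {
        String left = highAnd();
        return accept(',') ? "(" + left + " or " + or() + ")" : left;
    }

    private String highAnd() {
        String left = not();
        return accept('&') ? "(" + left + " and " + highAnd() + ")" : left;
    }

    private String not() {      // highest precedence, parsed last
        return accept('!') ? "(not " + primitive() + ")" : primitive();
    }

    private String primitive() { // simplification: anything up to the next operator
        int start = pos;
        while (pos < s.length() && "!&,;]".indexOf(s.charAt(pos)) < 0)
            pos++;
        if (start == pos)
            throw new IllegalArgumentException("primitive expected at " + pos);
        return s.substring(start, pos);
    }

    private boolean accept(char c) {
        if (pos < s.length() && s.charAt(pos) == c) { pos++; return true; }
        return false;
    }

    private void expect(char c) {
        if (!accept(c))
            throw new IllegalArgumentException("'" + c + "' expected at " + pos);
    }
}
```

Parsing "[C,N&X1]" yields "(C or (N and X1))" and "[C,N;X1]" yields "((C or N) and X1)", the two groupings described earlier.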

Reverse Polish notation

Reverse Polish notation (RPN) is a notation where the operator follows the operands of an expression^[4]. Some simple mathematical expressions are written as follows:

5 + 1              5 1 +
3 + 4 * 2          3 4 2 * +
(3 + 4) * 2        3 4 + 2 *

RPN is extremely useful and simple for implementing and performing operations on stack-based machines^[5]. An excellent property is that the operators are applied as soon as they are encountered. Notice that I don't need parentheses to change the multiply and addition order. Also notice that a lookahead check for operator validity isn't needed, when an operator is applied the primitives have already been parsed.
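The stack-based evaluation is short enough to write out in full. This is a minimal sketch for the arithmetic examples above, handling only non-negative integers with '+' and '*'; the class name is illustrative.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class RpnSketch {

    /** Evaluates a space-separated RPN expression of integers, '+' and '*'. */
    static int eval(String rpn) {
        Deque<Integer> stack = new ArrayDeque<>();
        for (String token : rpn.trim().split("\\s+")) {
            switch (token) {
                case "+": {
                    int b = stack.pop(), a = stack.pop();
                    stack.push(a + b); // applied as soon as it is encountered
                    break;
                }
                case "*": {
                    int b = stack.pop(), a = stack.pop();
                    stack.push(a * b);
                    break;
                }
                default:
                    stack.push(Integer.parseInt(token)); // operand
            }
        }
        return stack.pop();
    }
}
```

Here eval("3 4 2 * +") returns 11 and eval("3 4 + 2 *") returns 14, with no parentheses or lookahead involved.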

SMARTS operators are infix but let us see what RPN SMARTS might look like:

[O&X1]             O X1 &
[!C&!N]            C ! N ! &
[C,c;X3&v4]        C c , X3 v4 & ;
[N&!H0&X3]         N H0 ! X3 & &
[!#6&X4]           #6 ! X4 &
[O,S,#7,#15]       O S #7 #15 , , ,

It is much simpler to write a parser that respects precedence for RPN SMARTS. All that is needed is a way to convert from infix to postfix. The Shunting-yard algorithm^[6] does just that.

Implementation

The Shunting-yard algorithm is explained well on many other webpages so I'll skip that here. I will be converting from infix to postfix and building the expression at the same time. To do this, two stacks are needed, one for atom primitives and one for operators. The atom primitive stack is essentially the output of the Shunting-yard but I apply the operators instead of appending them to the postfix string.

Code 1 consumes characters from the input and either shunts an operator or parses a primitive. Once all the input is consumed the remaining operators are applied. The created query atom is on top of the stack and is returned. A low precedence no-operator is pushed on the stack to make things simpler and buffer the shunting.

To handle the implicit '&' between primitives a little more work is needed. Essentially one would optionally invoke shunt(atoms, operators, '&'); as needed at each iteration.

Code 1 - Primary loop

IQueryAtom parse(CharBuffer buffer) throws IOException {

    // stacks of atom primitives and operators
    Deque<IQueryAtom> atoms     = new ArrayDeque<>();
    Deque<Character>  operators = new ArrayDeque<>();
    operators.push(Character.MAX_VALUE); // a pseudo low precedence op

    while (buffer.hasRemaining()) {
        char c = buffer.get();
        if (isOperator(c)) // c == '!' or '&' or ',' or ';'? 
           shunt(atoms, operators, c);
        else                
           atoms.push(parsePrimitive(buffer));
    }

    // apply remaining operators
    while (!operators.isEmpty())
        apply(operators.pop(), atoms);

    return atoms.pop();
}
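One way to picture the implicit '&' handling mentioned before Code 1 is as a pre-pass that makes every omitted '&' explicit. The sketch below works on the text between the brackets and assumes, as a simplification, that every primitive is a single symbol character followed by optional digits (so two-letter elements and recursive SMARTS are out of scope). The class name is hypothetical; in the real loop one would call shunt(atoms, operators, '&') at the same points instead of appending a character.

```java
final class ImplicitAndSketch {

    /** Inserts the '&' operators that SMARTS allows to be omitted. */
    static String explicit(String expr) {
        StringBuilder out = new StringBuilder();
        boolean prevPrimitive = false; // was the last token a primitive?
        int i = 0;
        while (i < expr.length()) {
            char c = expr.charAt(i);
            if (c == '&' || c == ',' || c == ';') {
                out.append(c);
                prevPrimitive = false;
                i++;
            } else if (c == '!') {
                if (prevPrimitive) out.append('&'); // primitive directly before '!'
                out.append(c);
                prevPrimitive = false;
                i++;
            } else {
                if (prevPrimitive) out.append('&'); // two adjacent primitives
                out.append(c);
                i++;
                while (i < expr.length() && Character.isDigit(expr.charAt(i)))
                    out.append(expr.charAt(i++));
                prevPrimitive = true;
            }
        }
        return out.toString();
    }
}
```

For example, explicit("OX1") gives "O&X1" and explicit("N!H0X3") gives "N&!H0&X3".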

Code 2 shows the creation of query atom primitives; here they are delegated to several self-explanatory utility methods. For compactness only a subset of primitives is read.

Code 2 - Parsing selected primitives

IQueryAtom parsePrimitive(CharBuffer buffer) throws IOException {
    switch (buffer.get(buffer.position() - 1)) {
        case 'A': return newAliphaticQryAtm();
        case 'C': return newAliphaticQryAtm(6);
        case 'N': return newAliphaticQryAtm(7);
        case 'O': return newAliphaticQryAtm(8);
        case 'P': return newAliphaticQryAtm(15);
        case 'S': return newAliphaticQryAtm(16);

        case 'a': return newAromaticQryAtm();
        case 'c': return newAromaticQryAtm(6);
        case 'n': return newAromaticQryAtm(7);
        case 'o': return newAromaticQryAtm(8);
        case 'p': return newAromaticQryAtm(15);
        case 's': return newAromaticQryAtm(16);

        case '#': return newNumberQryAtm(parseNum(buffer));
        case 'X': return newConnectivityQryAtm(parseNum(buffer));
        case 'H': return newHydrogenCountQryAtm(parseNum(buffer));
        case 'R': return newRingMembershipQryAtom(parseNum(buffer));
        case 'v': return newValenceQryAtom(parseNum(buffer));
    }
    throw new IOException("Primitive not handled");
}

To apply an operator, take the operands (primitives) off the top of the atom stack, create a new query atom, and push it back on to the stack (Code 3). If there aren't enough operands, the expression is invalid (not shown).

Code 3 - Applying an operator

void apply(char op, Deque<IQueryAtom> atoms) {
    if (op == '&' || op == ';')
        atoms.push(and(atoms.pop(), atoms.pop()));
    else if (op == ',')
        atoms.push(or(atoms.pop(), atoms.pop()));
    else if (op == '!')
        atoms.push(not(atoms.pop()));
}

Finally, to handle the operator (Code 4), check if the operator currently on top of the stack has precedence over the new operator. If so, pop it from the stack and apply it. The new operator is then added to the stack. Conveniently the code point of the operator character can be used as the precedence.

Code 4 - Handling operator precedence

void shunt(Deque<IQueryAtom> atoms, Deque<Character> operators, char op) {
    while (precedence(operators.peek()) < precedence(op))
        apply(operators.pop(), atoms);
    operators.push(op);
}

static int precedence(char c) {
    return c; // in ASCII, '!' < '&' < ',' < ';' 
}

With the exception of a few utility methods these four snippets are essentially the whole implementation. You can find the fully functional code on the GitHub project^[7].
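For readers who want something to run without the CDK on the classpath, the same shunting logic can be condensed into a sketch that emits the postfix string instead of building query atoms. This is a simplified variant of Codes 1 to 4 and not the code from the GitHub project: primitives are any run of non-operator characters, the '&' must be explicit, and the class name is made up.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class ShuntingYardSketch {

    static boolean isOperator(char c) {
        return c == '!' || c == '&' || c == ',' || c == ';';
    }

    /** Converts a bracketed atom expression, e.g. "[C,c;X3&v4]", to postfix. */
    static String toPostfix(String atomExpr) {
        String expr = atomExpr.substring(1, atomExpr.length() - 1); // strip [ ]
        StringBuilder out = new StringBuilder();
        Deque<Character> operators = new ArrayDeque<>();
        operators.push(Character.MAX_VALUE); // sentinel, lowest precedence

        int i = 0;
        while (i < expr.length()) {
            char c = expr.charAt(i);
            if (isOperator(c)) {
                // a smaller code point binds tighter, so flush those first
                while (operators.peek() < c)
                    emit(out, String.valueOf(operators.pop()));
                operators.push(c);
                i++;
            } else {
                int start = i;
                while (i < expr.length() && !isOperator(expr.charAt(i)))
                    i++;
                emit(out, expr.substring(start, i)); // a primitive
            }
        }
        while (operators.size() > 1) // leave the sentinel on the stack
            emit(out, String.valueOf(operators.pop()));
        return out.toString();
    }

    private static void emit(StringBuilder out, String token) {
        if (out.length() > 0) out.append(' ');
        out.append(token);
    }

    public static void main(String[] args) {
        System.out.println(toPostfix("[C,c;X3&v4]")); // C c , X3 v4 & ;
        System.out.println(toPostfix("[N&!H0&X3]"));  // N H0 ! X3 & &
    }
}
```

Because the comparison is strict, operators of equal precedence stay on the stack, which is why a chain such as [O,S,#7,#15] comes out as "O S #7 #15 , , ," with the commas grouped at the end.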

Not only is the code incredibly compact and elegant, it can also be easily extended. Several convenience extensions to SMARTS have been made in the past - for example, #X for !#1!#6. A common requirement in general expressions and the Shunting-yard is to handle parentheses. These need special treatment but it is only a simple modification to the shunting and the precedence value (Code 5).

Code 5 - Handling parentheses

void shunt(Deque<IQueryAtom> atoms, Deque<Character> operators, char op) {
    if (op == ')') {
        while ((op = operators.pop()) != '(')
            apply(op, atoms);
    } else {
        if (op != '(') {
            while (precedence(operators.peek()) < precedence(op))
                apply(operators.pop(), atoms);
        }
        operators.push(op);               
    }
}

int precedence(char c) {
    switch (c) {
        case '!': return 1;
        case '&': return 2;
        case ',': return 3;
        case ';': return 4;
        case '(':
        case ')': return 5;
        default:  return 6;
    }
}

The parser will now correctly handle the following expressions without recursive SMARTS:

[!(C,N,O,P,S)]              C N O P S , , , , !
[!(C,N,O&X1)]               C N O X1 & , , !
[((C,N)&X3),((O,S)&X2)]     C N , X3 & O S , X2 & ,

All source code is available at github/johnmay/efficient-bits/polished-smarts.

References

CDK Release 1.5.7

2014-07-18T11:30:00.001+01:00

CDK 1.5.7 has been released and is available from SourceForge (download here) and the EBI Maven repo (XML 1).

The release notes (1.5.7-Release-Notes) summarise and detail the changes. Among the new bug fixes and features, several plugins have been added to the build. The release notes describe how these plugins can be run and what they do, so be sure to check the notes out if you're a contributor.

XML 1 - Maven POM configuration

<repository>
  <url>http://www.ebi.ac.uk/intact/maven/nexus/content/repositories/ebi-repo/</url>
</repository>
...
<dependency>
  <groupId>org.openscience.cdk</groupId>
  <artifactId>cdk-bundle</artifactId>
  <version>1.5.7</version>
</dependency>

Mischievous SMARTS Queries

2014-03-26T19:06:00.000+00:00

Last year I extended the CDK SMARTS implementation to match component groupings and stereochemistry. Specifying stereochemistry presents some interesting logical predicates that can be tricky to handle.

Here are some examples that I came up with for testing the correctness of query handling. They start simple before getting a little mischievous. First, recursion and component grouping.

query	targets	n_match	Comment
Component grouping (fragment)
`(O).(O)`	`O=O`	0	Example from Daylight
	`OCCO`	0
	`O.CCO`	2
Component grouping (connected)
`(O.O)`	`O=O`	2	Example from Daylight
	`OCCO`	2
	`O.CCO`	0
Recursion, ad infinitum
`[$(CC[$(CCO),$(CCN)])]`	`CCCCO`	1
	`CCCCN`	1
	`CCCCC`	0
Recursive component grouping
`[O;D1;$(([a,A]).([A,a]))][CH]=O`	`OC=O.c1ccccc1`	1	Feature/Bug #1312
	`OC=O`	0

These next ones are concerned with logic and stereochemistry.

query	targets	n_match	Comment
Ensure local stereo matching
`[@]()()()`	`O[C@](N)(C)CC`	12	tetrahedrons have 12 rotation symmetries
	`O[C@@](N)(C)CC`	12
	`O[C](N)(C)CC`	0
Implicit (hydrogen or lone-pair) neighbour
`CC[S@](C)=O`	`CC[S@](C)=O`	1
	`CC[S@@](C)=O`	0
	`CC[S](C)=O`	0
Either (tetrahedral)
`CC[@,@@](C)O`	`CC[C@H](C)O`	1
	`CC[C@@H](C)O`	1
	`CCC(C)O`	0
Both (tetrahedral)
`CC[@&@@](C)O`	`CC[C@H](C)O`	0
	`CC[C@@H](C)O`	0
	`CCC(C)O`	0
Respect logical precedence 1
`CC[@,Si@@](C)O`	`CC[C@H](C)O`	1
	`CC[C@@H](C)O`	0
	`CCC(C)=O`	0
Respect logical precedence 2
`CC[C@,Si@@](C)O`	`CC[C@H](C)O`	1
	`CC[C@@H](C)O`	0
	`CCC(C)O`	0
	`CC[Si@H](C)O`	0
	`CC[Si@@H](C)O`	1
	`CC[Si](C)O`	0
Unspecified
`CC[@@?](C)O`	`CC[C@H](C)O`	0
	`CC[C@@H](C)O`	1
	`CCC(C)O`	1
Negation
`CC[!@](C)O`	`CC[C@H](C)O`	0	`!@@` is also equivalent to `@?`
	`CC[C@@H](C)O`	1
	`CCC(C)O`	1
Neither (tetrahedral) using 'or unspecified'
`CC[@?@@?](C)O`	`CC[C@H](C)O`	0
	`CC[C@@H](C)O`	0
	`CCC(C)O`	1
Neither (tetrahedral) using negation
`CC[!@!@@](C)O`	`CC[C@H](C)O`	0
	`CC[C@@H](C)O`	0
	`CCC(C)O`	1
Either (geometric)
`C/C=C/,\C`	`C/C=C/C`	1
	`C/C=C\C`	1
	`CC=CC`	0
Neither (geometric)
`C/C=C!/!\C`	`C/C=C/C`	0
	`C/C=C\C`	0
	`CC=CC`	1

The last two are quite tricky (and not currently implemented) but once the atom-centric handling is correct it's a simple reduction. It's quite fun to work out so I'll leave that up to the reader.
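As an aside on the first stereo row: the 12 matches are the ways four neighbours can be reordered without inverting the centre, which are exactly the even permutations of four items. A minimal sketch that counts them by brute force (the class and method names are illustrative and unrelated to the CDK matcher):

```java
final class TetrahedralSymmetrySketch {

    /** Number of pairs (i, j) with i < j and perm[i] > perm[j]. */
    static int inversions(int[] perm) {
        int n = 0;
        for (int i = 0; i < perm.length; i++)
            for (int j = i + 1; j < perm.length; j++)
                if (perm[i] > perm[j]) n++;
        return n;
    }

    /** Counts the even permutations obtained by permuting perm[k..]. */
    static int countEven(int[] perm, int k) {
        if (k == perm.length)
            return inversions(perm) % 2 == 0 ? 1 : 0;
        int count = 0;
        for (int i = k; i < perm.length; i++) {
            swap(perm, k, i);
            count += countEven(perm, k + 1);
            swap(perm, k, i); // restore before trying the next choice
        }
        return count;
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i];
        a[i] = a[j];
        a[j] = t;
    }
}
```

Calling countEven(new int[]{0, 1, 2, 3}, 0) returns 12; the other 12 (odd) permutations correspond to the mirror image and so should not match.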

CDK now built using Maven

2014-02-19T15:59:00.000+00:00

At 13 years and 4 months the Chemistry Development Kit (CDK) is reasonably mature for a software project. Over the years there have been many changes in development practices as the code base evolved. This post is a departure from the usual algorithms and performance tests and looks at a recent and major change in the CDK development process.

On Monday, Egon, Nina and I made the final alterations that changed the build system from Ant to Maven. This change has been in the works for a long time and has been suggested multiple times. The actual migration has taken about a year's worth of planning.

If you want to have a play with the new build system yourself I've put together a brief guide that also describes how to import the project into several popular IDEs - Building CDK. The project README also summarises the command line usage.

I download CDK releases and use them in my project, what has changed?

If you are using the CDK as a dependency you should not notice any difference. The library and bundled dependencies will still be distributed at each release. If you are also using Maven then CDK module artefacts have been deployed for the last few releases. These are by far the easiest way to use the library, as dependency versioning is managed by Maven and newer releases can be downloaded automatically. Please see the project README for repository details.

I build the CDK source and use it in my project, what has changed?

The source code is now built with Maven - the README summarises the steps. As with releases, SNAPSHOT artefacts will be deployed to a remote repository (currently EBI).

I have modifications to the CDK that I apply, what has changed?

If your patches are Git commits then these can still be applied. Git will sort out and use the correct file locations for any modified files. If the patch creates new files these will need to be moved manually to the correct location.

CDK Modules in Dec 2013 - Egon W, Bits of Blah

Existing project structure

Prior to Monday the project code was organised under a single root folder. The Ant build would then read instructions in the source code and assemble the modules during compilation. This approach allowed the code to be progressively partitioned into modules over an extended period. Without this system we would not have been able to convert to Maven at all.

This system was customised and specific to the CDK which, in my opinion, made it a significant barrier to contributions. I know that personally I struggled to understand what was going on at compile time. A highly customised build process is difficult not only for a human to comprehend but also for any automated tooling (Integrated Development Environments, IDEs). Superficial support has been provided for the Eclipse and NetBeans editors but neither correctly interpreted the modules and the relationships between them.

Separate source trees

The most noticeable difference in the project is that each module now has a separate source tree. This allows easier reasoning about the contents of a module and provides a visual cue about the modular structure. Below we can see the 'cip' (Cahn-Ingold-Prelog) module source tree only contains the classes relevant to the module. Separate source trees are not specific to Maven and we could still use Ant. The main benefit is that Maven supports and encourages this kind of structuring by default.

Source code in the CIP module

There is still more work to do on the module organisation, for example, CMLWriter is in the 'libiocml' module whilst the CMLReader is in the 'io' module. The modules are mainly organised by their dependencies but in future it may be beneficial to organise by function. Normally classes with similar dependencies have similar function but this isn't always the case. An example of this is seen with the LINGOFingerprinter and SignatureFingerprinter in the 'smiles' and 'signatures' modules rather than the 'fingerprint' module.

Super modules

The Maven build also allows us to define groupings of the existing modules. These intermediate modules group the code base into a few digestible sections. You can see these groupings at the root level in the repository - https://github.com/cdk/cdk/. There are five groups and an additional misc/ module for the leftovers. I'm planning to write a more in-depth guide for the wiki but here is a quick overview of what is present in each.

base/ - API and implementations of domain objects and central algorithms for handling chemical information
descriptor/ - fingerprinters, qsars and signatures for describing and characterising attributes of a compound.
storage/ - reading and writing of chemical compounds from multiple file formats
tool/ - structure diagram generation, smarts, smsd, hashcodes, tautomer and tools that either answer a question directly, manipulate input or compute intrinsic properties
display/ - rendering of 2D depictions

Animated Algorithm: Canonical Tautomer Assignment

2014-02-13T18:58:00.000+00:00

Effectively understanding an algorithm with only a description is difficult. Reading source code, possibly more so. Although these approaches explain the finer details, invariants and proofs, a higher level view offers clarity. A great example of this is seen at McKay's and Piperno's canonical labelling site, The Search Tree.

Tautomers are constitutional isomers of organic compounds that rapidly interconvert (Wikipedia). The most common form is the relocation of a proton. Many computer representations are tautomer specific and distinguish different tautomeric forms of a compound. They would, for example, not have the same unique SMILES.

There are several approaches and algorithms for handling tautomers (Warr W, 2010). At the Daylight EuroMUG99, Roger Sayle and Jack Delany presented an algorithm for enumerating and assigning a unique tautomer (Sayle R and Delany J, 1999). From slide 20 in that presentation, there is a very nice stepwise example on guanine.

This afternoon I had fun creating an animation of the algorithm using an implementation I wrote last year. I've used a more complicated example that emphasises the backtracking when an incorrect assignment is made.

As it is a large compound, I didn't include the keto-enol types. I also had to modify the order (normally the nitrogens would be assigned first) to make it watchable in a reasonable time. Each proton is placed and the heteroatoms become either a donor (green) or acceptor (blue). At several points it attempts to place two protons in each of the five-membered rings. After updating adjacent atoms the mistake is identified and the assignment backtracks, setting one as a proton acceptor.

Be sure to set the HD option.

Warr W (2010) shows an example labelled as "hidden tautomers". This compound presents an interesting challenge and a nice demonstration. In total there are 68 tautomers generated.

References

Warr WA. Tautomerism in chemical information management systems. J Comput Aided Mol Des. 24(6-7):497-520. 2010
Also presented at the ChemAxon 2010 UGM - http://www.youtube.com/watch?v=1C-RTD4DAJ8

Sayle RA and Delany JJ. Canonicalization and enumeration of tautomers. Presented at EuroMUG99, Cambridge, UK, 28-29 Oct 1999
