Tuesday, 30 December 2014

CDK Release 1.5.10


CDK 1.5.10 has been released and is available from sourceforge (download here) and the Maven central repository (XML 1).

This release follows very shortly after 1.5.9 and is the first release available from the central maven repository. This means there is now no need to include a custom repo when using the library in downstream projects (XML 1)

The short release notes (1.5.10-Release-Notes) summarise and detail the changes. Other than the availability in the central repository the release includes a new MolecularFormulaGenerator contributed by Tomáš Pluskal that provide mass to formula generation in a fraction of the time of the old MassToFormulaTool.

XML 1 - Maven POM configuration

Wednesday, 24 December 2014

CDK Release 1.5.9


CDK 1.5.9 has been released and is available from sourceforge (download here) and the EBI maven repo (XML 1).

This is the first release to be built using Java 7 and will require the Java SE Runtime 7 to execute. The previous release (1.5.8) will be the last to work with Java SE 6.

The full release notes (1.5.9-Release-Notes) summarise and detail the changes. One of the new features is the recognition of perspective projection stereochemistry.

Stereochemistry recognition
XML 1 - Maven POM configuration

Monday, 1 December 2014

Memory Mapped Fingerprint Index - Part II

This post follows up on the previous to report some timings. I've checked all the code into GitHub (johnmay/efficient-bits/fp-idx) and it has some stand alone programs that can be run from the command line.

Currently there are a few limitations that we'll get out the way:

  • Only generation of the CDK ECFP4 is supported and at a folded length of 1024, this should give a close approximation to what Matt used in MongoDB (RDKit Morgan FP). Other fingerprints and foldings could be used but the generation time of path based fingerprints in the CDK is currently (painfully) slow.
  • Building the index is done in memory, since 1,000,000x1024 bit fingerprints is only 122 MiB you can easily build indexes with less than 10 million on modern hardware.
  • During index searching the entire index is memory mapped, setting the chunks system property (see the GitHub readme) will avoid this at a slight performance cost.
  • Results return the id in the index (indirection) and to get the original Id one would need to resolve it with another file (generated by mkidx). 
  • Index update operations are not supported without rebuilding it.
These are all pretty trivial to resolve and I've simply omitted them due to time. With that done, here's a quick synopsis of making the index, there is more in the GitHub readme.
Code 1 - Synopsis
$ ./smi2fps /data/chembl_19.smi chembl_19.fps # ~5 mins
$ ./mkidx chembl_19.fps chembl_19.idx # seconds

The fpsscan does a linear search computing all Tanimoto's and outputting the lines that are above a certain threshold. The simmer and toper utils use the index, they either filter for similarity or the top k results. They can take multiple SMILES via the command line or from a file.

Code 2 - Running queries
$ ./fpsscan /data/chembl_19.fps 'c1cc(c(cc1CCN)O)O' 0.7 # ~ 1 second
$ ./simmer chembl_19.idx 0.7 'c1cc(c(cc1CCN)O)O' # < 1 second
$ ./toper chembl_19.idx 50 'c1cc(c(cc1CCN)O)O' # < 1 second (top 50)

Using the same queries from the MongoDB search I get the following distribution of search times for different thresholds.

Some median search times are as follows.
Threshold Median time (ms)
0.90 14
0.80 31
0.70 46
0.60 53
In the box plot above the same (first) query is always the slowest, this is likely due to JIT.
It's interesting to see that the times seem to flatten out. By plotting how many fingerprints the search had to check we observe that below a certain threshold we are essentially checking the entire dataset.

The reason for this is potentially due to the sparse circular fingerprints. Examining the result file (see the github README) we can estimate that on average we're calculating 23,556,103 Tanimoto's a second. This also means that retrieving the top k queries isn't bad either. For example 10,000 gives a median time (Code 3) of 72 ms.

Code 3 - Top 10,000 hits for queries (same as before)
$ ./toper chembl_19.idx 10000 queries.smi

Next I'll look at some like-for-like comparisons.

Wednesday, 26 November 2014

Memory Mapped Fingerprint Index - Part I

I attended an interesting talk this afternoon (CCNM) by Matt Swain on using MongoDB for chemical similarity searching (code: github/mcs07/mongodb-chemistry).

The similarity searching partially uses the "Baldi" algorithm with some extra tweaks based on checking rare bits. The Baldi method is nicely summarised along with others by Tim Vandermeersch in his post on Fingerprint Searching Using Various Indexing Methods. As is noted by Tim, it can be improved upon.

Anyways, I had an implementation of a memory mapped Baldi index lying around, there is also one in the OrChem database cartridge.  I prototyped the implementation back in April and was/is part of a "nfp" (new fingerprint) module for CDK. I've now put the code on a GitHub project (github/johnmay/efficient-bits/fp-idx) and will do some benchmarking to see how it does.

My feeling is that the very simple (it's about 100 lines) memory mapped index can give competitive performance on small datasets (<5 million entries).

Sunday, 16 November 2014

Fun (and abuse) of implicit methods

Earlier this year I wrote up some Chemical Toolkit Rosetta examples of using the CDK in Scala (github/cdk/cdk-scala-examples). When I was writing this it sprung to mind that it would be cool to (ab)use one feature for interoperability between cheminformatics toolkits.

Scala is a statically typed functional language that runs on the Java Virtual Machine. It has some nice features and syntax that can produce some very concise code. One thing particular neat is the ability to define implicit methods. Essentially these are methods that define how to convert between types, they are implicit because the compiler can insert them automatically.

Implicit methods are very similar to auto(un)boxing that was introduced in Java 5 to simplify conversion of primitives and their instance wrappers (Code 1).

Code 1 - Autoboxing and autounboxing in Java
Integer x = 5; // ~ Integer x = Integer.valueOf(5);
int y = x;     // ~ int y = x.intValue();
x = y;         // ~ x = Integer.valueOf(y);

if (x == y) {  // ~ x.intValue() == y 

Much like it is possible in some programming languages to define custom operators, Scala makes it possible to define custom conversions that are inserted at compile time. The main advantage is it allows APIs to be extended to accept different types without introducing extra methods.

Conversion from line notations

Line notations are a concise means of encoding a chemical structure as sequence of characters (String). Common examples include SMILES, InChI*, WLN, SLN, and systematic nomenclature. Conversion to and from these formats isn't too computationally expensive but probably not something you want to do on-the-fly. However, just for fun, let's see what an implicit method for converting from strings can do. First we need the specified methods for loading from a known string type. We'll use the CDK for SMILES and InChI with Opsin for nomenclature.

Code 2a - Parsing of linear notations
val bldr = SilentChemObjectBuilder.getInstance
val sp   = new SmilesParser(bldr)
val igf  = InChIGeneratorFactory.getInstance

def inchipar(inchi: String) = 
  igf.getInChIToStructure(inchi, bldr).getAtomContainer

def cdksmipar(smi: String) = 

def nompar(nom: String) = 

def cansmi(ac: IAtomContainer) =
// Universal SMILES (see. O'Boyle N, 2012**)
def unismi(ac: IAtomContainer) = 
Code 2b - Implicit conversion from a String to an IAtomContainer
implicit def autoParseCDK(str: String): IAtomContainer = {
    if (str.startsWith("InChI=")) { 
    } else if (str.startsWith("1S/")) {
      inchipar("InChI=" + str)
    } else {
      try {
      } catch {
        case _: InvalidSmilesException => nompar(str)

Now the implicit method has been defined, any method in the CDK API that accepts an IAtomContainer can now behave as though it accepts a linear notation. Code 3 shows how we can get the same Universal SMILES for different representations of caffeine and compute the ECFP4 fingerprint for porphyrin

Code 3 - Using implicit methods
val fp = new CircularFingerprinter(CLASS_ECFP4).getCountFingerprint("porphyrin")

Conversion between toolkits

It is also possible to add in implicit methods to auto-convert between toolkit types. To convert between the CDK and RDKit (Java bindings) I'll go via SMILES. This conversion is lossy without an auxiliary data vector but serves as a proof of concept. I've lifted the Java bindings from the RDKit lucene project (github/rdkit/org.rdkit.lucene/) as the shared library works out the box for me. We can also add in the from string implicit conversions.

Code 4 shows the implicit method definitions. The additional autoParseRDKit allows us to bootstrap the RDKit API to also accept linear notations on all methods that expect an RWMol (or ROMol).

Code 4 - Implicit methods for conversion between CDK and RDKit
implicit def cdk2rdkit(ac : IAtomContainer) : RWMol = 

// XXX: better to use non-canon SMILES
implicit def rdkit2cdk(rwmol : RWMol) : IAtomContainer = 

implicit def autoParseRDKit(str: String): RWMol = 

Now we can obtain the Avalon fingerprint of caffeine from it's name and pass an RWMol to the CDK's PubchemFingerprinter (Code 5).

Code 5 - Using the RDKit API
val fp = new ExplicitBitVect(512)
RDKFuncs.getAvalonFP("caffeine", fp2)

val caffeine : RWMol = "caffeine"
new PubchemFingerprinter(bldr).getBitFingerprint(caffeine)

Given that auto(un)boxing primitives in Java can sting you in performance critical code, the above examples should be used sparingly. They do serve as a fun example of what is possible and I've put together the working code example in a Scala project for others to try github/johnmay/efficient-bits/impl-conversion.


Friday, 12 September 2014

Not to scale

The latest release of the CDK (1.5.8) includes a new generator for rendering structure diagrams. A detailed introduction to configuring the new generator is available on the CDK wiki[1].

The new generator can be used as a drop in replacement in existing code. However, one aspect of rendering that I've struggled with previously was getting good sized depictions with the CDK - most notably with vector graphic output. This post will look at how we can size depictions and will provide code in an example project as a reference.

ChEBI's current entity of the month - maytansine [CHEBI:6701] will be used to demonstrate the sizing.


Three parameters that are important in the overall sizing of depictions. These are the BondLength, Scale, and Zoom which are all registered as BasicSceneGenerator parameters. The Zoom is not needed if we allow our diagram to be fitted automatically.

The BondLength can be set by the user and has a default value of '40' whilst the Scale is set during diagram generation. BondLength units are arbitrary - for now we'll consider this as '40 px'.


The Scale parameter is used to render molecules with different coordinate systems consistently[2,3]. The value is determined using the BondLength parameter and the bond length in the molecule. For maytansine [CHEBI:6701] the median bond length is ~0.82. Again, the units are arbitrary - this could be Angstroms (it isn't).

The Scale is therefore the ratio of the measured bond length (0.82) to the desired bond length (40 px). For this example, the scale is 48.48. The coordinates must be scaled by ~4800% such that each bond is drawn as 40 px long.


Now we know our scale (~48.48), how big is our depiction going to be? It depends how we measure it. One way would be to check the bounding box that contains all atom coordinates (using GeometryUtil.getMinMax()). However, this does not consider the positions of adjuncts and would lead to parts of the diagram being cut off[4].

Fortunately the new generator provides a Bounds rendering element allowing us to determine the true diagrams bounds of 8.46x8.03. Since the scale is ~48.48 the final size of the depiction would be ~410x390 px. A margin is also added.

Current Rendering API

Now we have the size of our diagram we can render raster images. Unfortunately the current rendering API makes this a little tricky as the diagram is generated after the desired image size is provided by the user. To get the correct size we need to generate the diagram twice (to get the bounds) or use an intermediate representation (we'll see this later).

Code 1 - Current rendering API
// structure with coordinates
IAtomContainer container = ...; 

// create the renderer - we don't use a font manager
List<IGenerator<IAtomContainer>> generators = 
        Arrays.asList(new BasicSceneGenerator(),
                      new StandardGenerator(new Font("Verdana", Plain, 18));
AtomContainerRenderer renderer = new AtomContainerRenderer(generators, null); 

Graphics2D g2 = ...; // Graphics2D to draw raster / vector graphics
Rectangle2D bounds = ...; // need the bounds here!
renderer.paint(new AWTDrawVisitor(g2),

Vector graphics

To render scalable graphics we can use the VectorGraphics2D[5] implementations of the Java Graphics2D class. Vector graphics output can use varied units (e.g. pt, mm, px) - the VectorGraphics2D uses mm.

Without adjusting our scaling the render of maytansine [CHEBI:6701] would be displayed with bond lengths of 40 mm and a total size of ~410x390 mm. The output can be rescaled after rendering but the default width of 41 cm is a bit large. We therefore need to change our desired bond length.

The bond length of published structure diagrams varies between journals. A common and recommended style for wikipedia [6] is 'ACS 1996' - the style has a bond length of '5.08 mm'. Although setting the BondLength parameter to '5.08' would work, other parameters would need adjusting such as Font size (which is provided in pt!).

To render the diagram with the same proportions as the raster image we can instead resize the bounds and fit the diagram to this. Since the desired bond length is '5.08 mm' instead of '40 mm' we need rescale the diagram by 12.7 %. Our final diagram size is then ~52x50 mm. The border for ACS 1996 is '0.56 mm' which can be added to the diagram size.

Example code

To help demonstrate the above rendering I've put together a quick GitHub project johnmay/efficient-bits/scaled-renders. The code provides a convenient API and a command line utility for generating images.

Code 2 - Intermediate object
// structure with coordinates
IAtomContainer container = ...; 

// create the depiction generator
Font font = new Font("Verdana", Plain, 18);
DepictionGenerator generator = new DepictionGenerator(new BasicSceneGenerator(),
                                                      new StandardGenerator(font));

// generate the intermediate 'depiction'
Depiction depiction = generator.generate(container);

// holds on to the rendering primitives as well as the size
double w = depiction.width();
double h = depiction.height(); 

// draw at 'default' size
depiction.draw(g2, w, h);

// generate a PDF (or SVG)
String pdfContent = depiction.toPdf(); // default size
String pdfContent = depiction.toPdf(1.5); // 1.5 * default size
String pdfContent = depiction.toPdf(0.508, 0.056); // bond length, margin

The command line utility provides several options to play with and can load from molfile, SMILES, InChI, or name (using OPSIN[7]).

Code 3 - Command line
# In the project root set the following alias
$: alias render='mvn exec:java -Dexec.mainClass=Main'

# Using OPSIN to load porphyrin and generate a PDF
$: render -Dexec.args="-name porphyrin -pdf"

# Highlight one of the pyrrole in porphyrin
$: render -Dexec.args="-name porphyrin -pdf -sma n1cccc1"

# Show atom numbers
$: render -Dexec.args="-name porphyrin -pdf -atom-numbers"

# Show CIP labels
$: render -Dexec.args="-name '(2R)-butan-2-ol' -pdf -cip-labels"

# Generate a PDF / SVG for ethanol SMILES
$: render -Dexec.args="-smi CCO -pdf ethanol.pdf -svg ethanol.svg"

# Load a molfile
$: render -Dexec.args="-mol ChEBI_6701.mol -pdf chebi-6701.pdf"

You can even play with the font

$: render -Dexec.args="-name 'caffeine' -svg cc-caffeine.svg -font-family 'Cinnamon Cake' -stroke-scale 0.6 -kekule"


  1. https://github.com/cdk/cdk/wiki/Standard-Generator
  2. http://gilleain.blogspot.co.uk/2010/09/scaling-and-text.html
  3. http://gilleain.blogspot.co.uk/2010/09/consistent-zoom-with-models-of.html
  4. http://onlinelibrary.wiley.com/doi/10.1002/minf.201200171/abstract
  5. http://trac.erichseifert.de/vectorgraphics2d/
  6. http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Chemistry/Structure_drawing
  7. http://opsin.ch.cam.ac.uk/

Thursday, 11 September 2014

CDK Release 1.5.8


CDK 1.5.8 has been released and is available from sourceforge (download here) and the EBI maven repo (XML 1).

The release notes (1.5.8-Release-Notes) summarise and detail the changes.

XML 1 - Maven POM configuration

Tuesday, 22 July 2014

Polish-ed SMARTS parsing

As introduced in previous posts, SMARTS is a concise notation for describing chemical substructure queries. There are several aspects to a SMARTS implementation: subgraph graph matching, parsing, generating, and even optimisation[1,2].

In this post I'll show a way of parsing the binary atom expressions that I found quite neat.


Conceptually, a SMARTS atom expression is composed of primitives and operators (binary and unary). The primitives test whether some property of a atom (e.g. element, charge, valence, etc) has a certain value[3]. The operators invert and combine these primitives through conjunction (and), disjunction (or), and negation (not).

Some examples of atom expressions are:


The operators in these expressions ordered by their precedence are:

! unary not
& binary and (high)
, binary or
; binary and (low)

The default operator is '&' and can often be omitted such that the first pattern would read [OX1]. There are two 'and' operators with difference precedence allowing logical expressions like:

[C,N&X1]  C or (N and X1)
[C,N;X1]  (C or N) and X1

More complex expression trees can be accomplished with recursive SMARTS.

A formal grammar for SMARTS that respects precedence looks something like this (lifted from the CDK javacc implementation):

AtomExpression    ::= "[" <LowAndExpression> "]"
LowAndExpression  ::= <OrExpression> [ ";" <LowAndExpression> ]
OrExpression      ::= <HighAndExpression> [ "," <OrExpression> ]
HighAndExpression ::= <NotExpression> [ '&' <HighAndExpression> ]
NotExpression     ::= [ "!" ] <AtomPrimitive>

Notice this is a recursive procedure where I ascend up the precedence hierarchy while descending into the grammar. The small number of operators in SMARTS means this is generally good enough. However there is a non-recursive alternative.

Reverse Polish notation

Reverse Polish notation (RPN) is a notation where the operator follows the operands of an expression[4]. Some simple mathematical expressions are written as follows:

5 + 1              5 1 +
3 + 4 * 2          3 4 2 * +
(3 + 4) * 2        3 4 + 2 *

RPN is extremely useful and simple for implementing and performing operations on stack-based machines[5]. An excellent property is that the operators are applied as soon as they are encountered. Notice that I don't need parentheses to change the multiply and addition order. Also notice that a lookahead check for operator validity isn't needed, when an operator is applied the primitives have already been parsed.

SMARTS operators are infix but let us see what RPN SMARTS might look like:

[O&X1]             O X1 &
[!C&!N]            C ! N ! &
[C,c;X3&v4]        C c , X3 v4 & ;
[N&!H0&X3]         N H0 ! X3 & &
[!#6&X4]           #6 ! X4 &
[O,S,#7,#15]       O S #7 #15 , , ,

RPN SMARTS is much simpler to write a parser for that respect precedence. All that is needed is a way to convert from infix to postfix. The Shunting-yard algorithm[6] does just that.


The Shunting-yard algorithm is explained well in many other webpages so I'll neglect that here. I will be converting from infix to postfix and build the expression at the same time. To do this, two stacks are needed, one for atom primitives and one for operators. The atom primitive stack is essentially the output of the Shunting-yard but I apply the operators instead of appending them to the postfix string.

Code 1 consumes characters from the input and either shunts an operator or parses a primitive. Once all the input is consumed the remaining operators are applied. The created query atom is on top of the stack and is returned. A low precedence no-operator is pushed on the stack to make thinks simpler and buffer the shunting.

To handle the implicit '&' between primitives a little more work is needed. Essentially one would optionally invoke shunt(atoms, operators, '&'); as needed at each iteration.

Code 1 - Primary loop
IQueryAtom parse(CharBuffer buffer) throws IOException {

    // stacks of atom primitives and operators
    Deque<IQueryAtom> atoms     = new ArrayDeque<>();
    Deque<Character>  operators = new ArrayDeque<>();
    operators.push(Character.MAX_VALUE); // a pseudo low precedence op

    while (buffer.hasRemaining()) {
        char c = buffer.get();
        if (isOperator(c)) // c == '!' or '&' or ',' or ';'? 
           shunt(atoms, operators, c);

    // apply remaining operators
    while (!operators.isEmpty())
        apply(operators.pop(), atoms);

    return atoms.pop();

Code 2 shows the creation of query atom primitives, here they are delegated to several self explanatory utility methods. For compactness only a subset of primitives are read.

Code 2 - Parsing selected primitives
IQueryAtom parsePrimitive(CharBuffer buffer) throws IOException {
    switch (buffer.get(buffer.position() - 1)) {
        case 'A': return newAliphaticQryAtm();
        case 'C': return newAliphaticQryAtm(6);
        case 'N': return newAliphaticQryAtm(7);
        case 'O': return newAliphaticQryAtm(8);
        case 'P': return newAliphaticQryAtm(15);
        case 'S': return newAliphaticQryAtm(16);

        case 'a': return newAromaticQryAtm();
        case 'c': return newAromaticQryAtm(6);
        case 'n': return newAromaticQryAtm(7);
        case 'o': return newAromaticQryAtm(8);
        case 'p': return newAromaticQryAtm(15);
        case 's': return newAromaticQryAtm(16);

        case '#': return newNumberQryAtm(parseNum(buffer));
        case 'X': return newConnectivityQryAtm(parseNum(buffer));
        case 'H': return newHydrogenCountQryAtm(parseNum(buffer));
        case 'R': return newRingMembershipQryAtom(parseNum(buffer));
        case 'v': return newValenceQryAtom(parseNum(buffer));
    throw new IOException("Primitive not handled");

To apply an operator, take the operands (primitives) off the top of the atom stack, create a new query atom, and push it back on to the stack (Code 3). If there aren't enough operands, the expression is invalid (not shown).

Code 3 - Applying an operator
void apply(char op, Deque<IQueryAtom> atoms) {
    if (op == '&' || op == ';')
        atoms.push(and(atoms.pop(), atoms.pop()));
    else if (op == ',')
        atoms.push(or(atoms.pop(), atoms.pop()));
    else if (op == '!')

Finally, to handle the operator (Code 4), check if the operator currently on top of the stack has precedence over the new operator. If so, pop it from the stack and apply it. The new operator is then added to the stack. Conveniently the code point of the operator character can be used as the precedence.

Code 4 - Handling operator precedence
void shunt(Deque<IQueryAtom> atoms, Deque<Characters> operators, char op) {
    while (precedence(operators.peek()) < precedence(op))
        apply(operators.pop(), atoms);

static int precedence(char c) {
    return c; // in ASCII, '!' < '&' < ',' < ';' 

With the exception of a few utility methods these four snippets are essentially the whole implementation. You can find the fully functional code on the GitHub project[7].

Not only is the code is incredibly compact and elegant but it can easily be expanded. Several convenience extensions to SMARTS have been made in the past – for example, #X for !#1!#6. A common requirement in general expressions and the Shunting-yard is to handle parenthesis. These need special treatment but it is only a simple modification to the shunting and the precedence value (Code 5).

Code 5 - Handling parenthesis
void shunt(Deque atoms, Deque operators, char op) {
    if (op == ')') {
        while ((op = operators.pop()) != '(')
            apply(op, atoms);
    } else {
        if (op != '(') {
            while (precedence(operators.peek()) < precedence(op))
                apply(operators.pop(), atoms);

int precedence(char c) {
    switch (c) {
        case '!': return 1;
        case '&': return 2;
        case ',': return 3;
        case ';': return 4;
        case '(':
        case ')': return 5;
        default:  return 6;

The parser will now correctly handle the following expressions without recursive SMARTS:

[!(C,N,O,P,S)]              C N O P , , , !
[!(C,N,O&X1)]               C N O X1 & , , !
[((C,N)&X3),((O,S)&X2)]     C N , X3 & O S , X2 & ,

All source code is available at github/johnmay/efficient-bits/polished-smarts.


  1. PATSY, NextMove Software
  2. SMARTS Optimisation & Compilation: Introduction & Optimisation, Tim Vandermeersch
  3. Daylight theory manual, Daylight CIS
  4. Reverse Polish notation, Wikipedia
  5. Reverse Polish notation and the stack, Computerphile
  6. Shunting-yard algorithm, Wikipedia
  7. github/johnmay/efficient-bits/polished-smarts

Friday, 18 July 2014

CDK Release 1.5.7

CDK 1.5.7 has been released and is available from sourceforge (download here) and the EBI maven repo (XML 1).

The release notes (1.5.7-Release-Notes) summarise and detail the changes. Among the new bug fixes and features, several plugins have been added to the build. The release notes describe how these plugins can be run and what they do so be sure check the notes out if you're a contributor.

XML 1 - Maven POM configuration

Wednesday, 26 March 2014

Mischievous SMARTS Queries

Last year I extended the CDK SMARTS implementation to match component groupings and stereochemistry. Specifying stereochemistry presents some interesting logical predicate that might be tricky to handle.

Here are some examples that I came up with for testing the correctness of query handling. They start simple before getting a little mischievous. First, recursion and component grouping.

Component grouping (fragment)
(O).(O)O=O0Example from Daylight
Component grouping (connected)
(O.O)O=O2Example from Daylight
Recursion, ad infinitum
Recursive component grouping
[O;D1;$(([a,A]).([A,a]))][CH]=OOC=O.c1ccccc11Feature/Bug #1312

These next ones are concerned with logic and stereochemistry.
Ensure local stereo matching
*[@](*)(*)(*)O[C@](N)(C)CC12tetrahedrons have 12 rotation symmetries
Implicit (hydrogen or lone-pair) neighbour
Either (tetrahedral)
Both (tetrahedral)
Respect logical precedence 1
Respect logical precedence 2
CC[!@](C)OCC[C@H](C)O0!@@ is also equivalent to @?
Neither (tetrahedral) using 'or unspecified'
Neither (tetrahedral) using negation
Either (geomeric)
Neither (geomeric)
The last two are quite tricky (and not currently implemented) but once the atom-centric handling is correct it's a simple reduction. It's quite fun to work out so i'll leaf that up to the reader.

Wednesday, 19 February 2014

CDK now built using Maven

At 13 years and 4 months the Chemistry Development Kit (CDK) is reasonably mature for a software project. Over the years there have been many changes in development practices as the code base evolved. This post is a departure for the usual algorithms and performance tests and looks at a recent and major change in the CDK development process.

On Monday, Egon, Nina and I made the final alterations that changed the build system from Ant to Maven. This change has been in the works for a long time and has been suggested multiple times. The actually migration has taken about a years worth of planning.

If you want to have a play with the new build system yourself I've put together a brief guide that also describes how to import the project into several popular IDEs - Building CDK. The project README also summarises the command line usage.

I download CDK releases and use it my project, what has changed?

If you are using the CDK as a dependancy you should not notice any difference. The library and bundled dependencies will still be distributed at each release. If you are also using maven then CDK module artefacts have been deployed for last few releases. These are by far the easiest way to use the library as dependency versioning is managed maven and newer releases can be automatically downloaded. Please see the project README for repository details.

I build the CDK source and use it my project, what has changed?

The source code is now built with maven - the README summarises the steps. As with releases, SNAPSHOT artefacts will be deployed to a remote repository (currently EBI).

I have modifications to the CDK that I apply, what has changed?

If your patches are Git commits then these can still be applied. Git will sort out and use the correct file locations to any modified files. If the patch creates new files these will need to be moved manually to the correct location.

CDK Modules in Dec 2013 - Egon W, Bits of Blah

Existing project structure

Prior to Monday the project code was organised under a single root folder. The Ant build would then read instructions in the source code and assemble the modules during compilation. This approach allowed progressive partition the code into modules over an extended period. Without this system we would not have be able to convert to maven at all.

This system was customised and specific to the CDK which, in my opinion, made it a significant barrier to contributions. I know that personally I struggled to understand what was going on at compile time. A highly customised build process makes it not only difficult for a human to comprehend but also any automated tooling (Integrated Development Environments, IDEs). Superficial support has been provided for Eclipse and Netbeans editors but neither correctly interrupted the modules and relationships between them.

Separate source trees

The most noticeable difference in the project is each module now has a separate source tree. This allows easier reasoning about the contents of module and provides a visual cue about the modular structure. Below we can see the 'cip' (Cahn-Ingold-Prelog) module source tree only contains the classes relevant to the module. Separate source trees are not specific to maven and we could still use Ant. The main benefit is that Maven supports and encourages this kind of structuring by default.

Source code in the CIP module
There is still more work to do on the module organisation, for example, CMLWritier is the the 'libiocml' module whilst the CMLReader is the 'io' module. The modules are mainly organised by their dependancies but in future it may be beneficial to organise by function. Normally classes with similar dependencies have similar function but this isn't always the case. An example of this is seen with the LINGOFingerprinter and SignatureFingerprinter in the 'smiles' and 'signatures' modules rather than the 'fingerprint' module. 

Super modules

The Maven build also allows us to define groupings of the existing modules. These intermediate modules group the code base in a few digestible sections. You can see these groupings at the root level in the repository - https://github.com/cdk/cdk/. There are five groups and an additional misc/ module for the left overs. I'm planning to write a more in depth guide for the wiki but here is a quick overview of what is present in each.
  • base/ - API and implementations of domain objects and central algorithms to handling chemical information
  • descriptor/ - fingerprinters, qsars and signatures for describing and characterising attributes of a compound.
  • storage/ - reading and writing of chemical compounds from multiple file formats
  • tool/ - structure diagram generation, smarts, smsd, hashcodes, tautomer and tools that either answer a question directly, manipulate input or compute intrinsic properties
  • display/ - rendering of 2D depictions

Thursday, 13 February 2014

Animated Algorithm: Canonical Tautomer Assignment

Effectively understanding an algorithm with only a description is difficult. Reading source code, possibly more so. Although these approaches explain the finer details, invariants and proofs, a higher level view offers clarity. A great example of this is seen at McKay's and Piperno's canonical labelling site, The Search Tree.

Tautomers are constitutional isomers of organic compounds that rapidly interconvert (Wikipedia). The most common form, is the relocation of a proton. Many computer representations are tautomer specific and distinguish different tautomeric forms of a compound. They would for example not have the same unique SMILES.

There are several approaches and algorithms for handling tautomers (Warr W, 2010). At the Daylight EuroMUG99, Roger Sayle and Jack Delany presented an algorithm for enumerating and assigning a unique tautomer (Sayle R and Delany J, 1999). From slide 20 in that presentation, there is a very nice step wise example on guanine.

This afternoon I had fun creating a animation for algorithm using an implementation I wrote last year. I've used a more complicated example that emphasises the backtracking when an incorrect assignment is made.

Being a large compound, I didn't include the keto-enol types. Also I had to modify the order (normally the nitrogens would be assigned first) to allow it be watchable in reasonable time. Each proton is placed and the hetroatoms become either a donor (green) or acceptor (blue). At several points it attempts to place two protons in each of the five membered rings. After updating adjacent atoms the mistake is identified and the assignment backtracks setting one as an proton acceptor.

Be sure to set the HD option.

Warr W (2010) shows an example labelled as "hidden tautomers". This compound presents an interesting challenge and a nice demonstration. In total there are 68 tautomers generated.


Warr WA. Tautomerism in chemical information management systems. J Comput Aided Mol Des. 24(6-7):497-520. 2010
Also presented at the ChemAxon 2010 UGM - http://www.youtube.com/watch?v=1C-RTD4DAJ8

Sayle RA and Delany JJ. Canonicalization and enumeration of tauomers. Presented at EuroMUG99, Cambridge, UK, 28-29 Oct 1999

Saturday, 11 January 2014

New SMILES behaviour - generating (CDK 1.5.4)

In the previous post I outline the changes to SMILES Parsing in the Chemistry Development Kit (CDK). The original plan was to have several posts detailing the changes but in the end it was more practical to put this in a single release note document (available: here). Although the key changes to SMILES Generation are summarised in the release notes I thought it would be worth touching on some details.

Types of SMILES

The original API involved creating a mutable generator and then creating SMILES by invoking one of several methods.
Code 1 - Existing SMILES generation API
SmilesGenerator smigen = new SmilesGenerator();
String smi    = smigen.createSMILES(molecule);
String chismi = smigen.createChiralSMILES(molecule);

The new generator is immutable and a change in the configuration creates a new instance. There aren't many configuration options but key point is there isn't an internal state. This makes it a lot more robust.
Code 2 - New SMILES generation API
SmilesGenerator smigen = SmilesGenerator.unique()
smigen.createSMILES(molecule); // now deprecated
The generator can now create several different types of SMILES. The naming follows the Daylight specification.
no isotopes / stereo
with isotopes / stereo

CDK previously used the term chiral SMILES to refer to SMILES with stereochemistry. The term isomeric SMILES is now used and aligns with other toolkits (Daylight, RDKit, ChemAxon and OEChem).

Unique and Absolute SMILES

The equitable refinement published by Daylight (Weininger et al. 1988) has been optimised and is used to produce the unique SMILES. Generation of absolute SMILES is more tricky and currently uses the InChI algorithm. I believe this idea was first suggested by Andrew Dalke (inchi-dicuss archive). In 2012, Noel O'Boyle published and implemented (in Open Babel) procedures for generate Universal and Inchified SMILES (O'Boyle 2012).

The procedure used by the CDK absolute SMILES follows the Universal SMILES rules. There may still be some small differences in the implementation so I'm hesitant of declaring it as a Universal SMILES implementation before that validation can be done. There are also some problems, such as, delocalised charges that would be useful to handle.


Generally the handling of SMILES in the CDK is much more robust with hydrogen counts and stereochemistry correctly round tripped. In addition to this the performance is also much better. This is most noticeable when using the non-canonical outputs and only storage is needed but it is also true of the canonical output.

The following table summarises the time taken to read and generate canonical SMILES for three different size datasets. The times are one-off measurements on a cold JVM. The red number in brackets is the number of compounds which caused an error. Using the InChI the absolute SMILES can not encode unknown atoms (CC*) which are ubiquitous in ChEBI (ontology classes).

Data Compounds CDK 1.4.15 CDK 1.5.4 unique CDK 1.5.4 absolute
ChEBI 108 ~27,000 54 s 5 s 23 s (1251)
NCI Aug '00 ~250,000 2 m 8 s 14 s 2 m 12 s (184)
ChEMBL 17 ~1,300,000 23 m 48 s (11) 1 m 38 s 15 m 26 s (36)