Thursday, 25 October 2018

CDK Depict on HTTPS

Just a quick post to say CDK Depict is now using HTTPS https://simolecule.com/cdkdepict/depict.html. The main reason for this was Blogger stopped allowing image links to HTTP resources. In general browsers are being more fussy about non HTTPS content.

I used LetsEncrypt that turned out to be very easy to configure with TomCat.

Step 1


Install the certbot utility and use it generate a certificate.

$ sudo certbot certonly

Step 2


Configure TomCat 8+ connectors. This used to be more complex on older TomCat servers with the need to generate a separate keystore. Editing $CATALINA_HOME/confg/server.xml we configure the base connected, redirectPort is changed from 8443 to 443 (and 8080 to 80).

<Connector port="80" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="443" />

We also configure SSL connector, using port 443, change to NIO based protocol (the default requires extra native library) org.apache.coyote.http2.Http2Protocol, and set the file paths to the .pem files generated by certbot.

<Connector port="443" 
           protocol="org.apache.coyote.http11.Http11NioProtocol"
           maxThreads="150" SSLEnabled="true" >
  <UpgradeProtocol className="org.apache.coyote.http2.Http2Protocol" />
  <SSLHostConfig>
    <Certificate certificateKeyFile="/etc/letsencrypt/live/www.simolecule.com/privkey.pem"
                 certificateFile="/etc/letsencrypt/live/www.simolecule.com/cert.pem"
                 certificateChainFile="/etc/letsencrypt/live/www.simolecule.com/chain.pem"
                 type="RSA" />
  </SSLHostConfig>
</Connector>       

Step 3 (optional)


If a client tries to visit the HTTP site we want to redirect them to HTTPS. To do this we edit $CATALINA_HOME/confg/web.xml adding this section to the end of the <web-app> block

<security-constraint>
  <web-resource-collection>
    <web-resource-name>Entire Application</web-resource-name>
    <url-pattern>/*</url-pattern>
  </web-resource-collection>
  <user-data-constraint>
    <transport-guarantee>CONFIDENTIAL</transport-guarantee>
  </user-data-constraint>
</security-constraint>

Monday, 22 October 2018

Bit packing for fast atom type assignment

Many cheminformatics algorithms perform some form of atom typing as their first step. The atom types you need and the granularity depend on the algorithm. At the core of all atom typing lies some form of decision tree, usually manifesting as a complex if-else cascade.

Using an elegant technique Roger showed me a few years ago, you can replace this if-else cascade with a table/switch lookup. This turns out to be very clean, easy to extend, and efficient. The technique relies on first computing a value that captures the bond environment around an atom and packing this in to a single value.

Here's what it looks like:
Algorithm 1
int btypes = atom.getImplicitHydrogenCount();
for (IBond bond : atom.bonds()) {
  switch (bond.getOrder()) {
    case SINGLE: btypes += 0x0001; break;
    case DOUBLE: btypes += 0x0010; break;
    case TRIPLE: btypes += 0x0100; break;
    default:     btypes += 0x1000; break;
  }
}

After the value btypes has been calculated (once for each atom) it contains the count of single, double, triple, and other bonds. We can inspect each of these counts individually my masking of the relevant bits or the entire value, for example:
Algorithm 2
switch (btypes) {
 case 0x004: // 4 single bonds, e.g. Sp3 carbon
  break;
 case 0x012: // 1 double bond, 2 single bonds e.g. Sp2 carbon
  break;
 case 0x020: // 2 double bonds e.g. Sp cumulated carbon
  break;
 case 0x101: // 1 triple bond, 1 single bond e.g. Sp triple bonded carbon
  break;
}

Using a nibble allows us to store numbers up to 16 (24) - more than enough for any sane chemistry. In the example above I shoved default bonds under an 'other' category but of course you could extend it to handle quadruple bonds and even additional properties of the bonds:
Algorithm 3
int btypes = atom.getImplicitHydrogenCount();
for (IBond bond : atom.bonds()) {
  switch (bond.getOrder()) {
    case SINGLE: 
      if (bond.isAromatic())
        btypes += 0x010001; 
      else
        btypes += 0x000001;
      break;
    case DOUBLE: 
      if (bond.isAromatic())
        btypes += 0x010010; 
      else if (isOxygen(bond.getOther(atom)))
        btypes += 0x100010; // dbs to oxygens
      else
        btypes += 0x000010;
      break;
    case TRIPLE: 
       btypes += 0x000100;
       break;
  }
}

Friday, 13 April 2018

RDKit Reaction SMARTS

There's a been some papers using the RDKit for synthesis planning. If you're writing a paper and use the term "Reaction SMARTS" make sure you mean what everyone else thinks it means.

The SMILES, SMARTS, and SMIRKS line notations were created* by Daylight for storing, matching, and transforming connection tables.

  • SMILES describes a connection table to store molecule and reactions
  • SMARTS describes a pattern (or query) to match molecules and reactions
  • SMIRKS describes a transform (or "reaction") to modify molecules

RDKit uses the term "Reaction SMARTS" to mean "transform" (see RDKit Book). Unfortunately in Daylight's terminology Reaction SMARTS is a pattern not a transform.

Screenshot from the Daylight SMARTS theory manual.
Reactions SMARTS is primarily useful for searching reaction databases. For example this Reaction SMILES:

[cH:13]1[c:14]([cH:19][c:20]([c:10]([c:11]1[Cl:12])[n:9]2[cH:8][c:5]([c:4]([n:22]2)[NH2:3])[C:6]#[N:7])[Cl:21])[C:15]([F:16])([F:17])[F:18].[OH:1][OH:2]>C(Cl)Cl.C(=O)(C(F)(F)F)OC(=O)C(F)(F)F.O>[cH:13]1[c:14]([cH:19][c:20]([c:10]([c:11]1[Cl:12])[n:9]2[cH:8][c:5]([c:4]([n:22]2)[N+:3](=[O:1])[O-:2])[C:6]#[N:7])[Cl:21])[C:15]([F:16])([F:17])[F:18]

is matched by this Reaction SMARTS

[*:1][Nh2:2]>>[*:1][Nh0:2](~[OD1])~[OD1] amino to nitro

You can highlight the substructure:

Highlighting the SMARTS in the SMILES using CDK Depict

But that's a transform!


Yes but it's matching a transform (SMARTS) not applying one (SMIRKS), some may think you could read this unmodified as a SMIRKS but this is not the case. SMIRKS needs "real parts" after the second angled bracket as these are the parts created by the transform. Note that '*' is valid SMILES and in SMIRKS it kind of means "unmodified". This actually gives us the nice invariants:

All SMILES are valid SMARTS but not all SMARTS are valid SMILES
and
All SMIRKS are valid SMARTS but not all SMARTS are valid SMIRKS

Here is the SMIRKS transform for amino to nitro

[*:1][ND3:2]([H])([H])>>[*:1][N:2](=O)=O amino to nitro

In SMIRKS I can apply this SMIRKS to "molecules" and it will create "reactions". Note these molecules do not need to have atom-maps but they will come out with atom maps (see dt_transform)!

c1ccccc1N
[nH]1ccc2c1cc(N)cc2

The output is

c1cccc[c:1]1[NH2:2]>>c1cccc[c:1]1[N:2](=O)=O
[nH]1ccc2c1c[c:1]([NH2:2])cc2>>[nH]1ccc2c1c[c:1]([N:2](=O)=O)cc2

And another thing...


In general you can't run SMIRKS backwards. If I want to run a nitro to amino because the atoms/bonds we're adding need to be "real" we need to encode the reverse transform separately!

[*:1][ND3:2]([H])([H])>>[*:1][N:2](=O)=O amino to nitro
[*:1][ND3:2](~[OD1])(~[OD1])>>[*:1][N:2]([H])[H] nitro to amino

Although dt_transform specifies a direction this only controls whether the input molecules appear on the left or right of the output reaction.


*SMILES was created by Dave Weininger whilst at EPA

Saturday, 6 May 2017

Sharp Tools for Java Refactoring: Byte Code Analysis

I'm currently refactoring parts of the CDK core classes. As part of this I often need to find specific patterns/idioms that need to be changed. Whilst source code inspections and an IDE can make this task easy sometimes the tools aren't quite sharp enough.

I needed to find all occurrences of a reference (instead of value) comparison on a particular class. In Java there is no operator overload and so you can have situations like this:
Integer a = new Integer(25);
Integer b = new Integer(25);
if (a == b) {} // false
if (a.equals(b)) {} // true
I mentioned operating overloading but it's more subtle and is more about comparing reference vs. value comparison. In C/C++ we can have similar behaviour:
int aval = 25, bval = 25;
int *a = &aval;
int *b = &bval;
if (a == b) {} // false
if (*a == *b) {} // true
Most IDE's and code inspection programs will warn about common occurrences (for example Integer) but I wanted to find places where the CDK's classes were used like this. A simple text grep will find some but will have false positives and negatives requiring lots of manual checking. Daniel suggested the well known FindBugs might be able help.

Rather than analyze source code like PMD and Checkstyle, FindBugs analyses Java byte code with a set of defaults detectors to find often subtle but critical mistakes. FindBugs can be configured with custom detectors (see here), however the inspection I needed (RC: Suspicious reference comparison to constant) was almost there. After digging around in the source code I found you can provide a list of custom classes to detect. However it took a bit of trial and error to get what I needed.

First up we turn off all inspections except for the one we're looking for (we need to fix many others reported but I was looking for something specific). To do this we create an XML config that will only run the specific inspection (RC for Reference Comparison):
findbugs-include.xml
<?xml version="1.0" encoding="UTF-8"?>
<FindBugsFilter>
  <Match>
    <Bug code="RC"/>
  </Match>
</FindBugsFilter>
We then run findbugs with this configuration and provide the frc.suspicious property.
Running findbugs
$> findbugs -textui \
            -include findbugs-include.xml \
            -property "frc.suspicious=org.openscience.cdk.interfaces.IAtom" \
            base/*/target/cdk-*-2.0-SNAPSHOT.jar 
This produces an accurate report of all the places the references are compared. Here's a sample:
H C RC: Suspicious comparison of org.openscience.cdk.interfaces.IAtom references in org.openscience.cdk.Bond.getOther(IAtom)  At Bond.java:[line 253]
H C RC: Suspicious comparison of org.openscience.cdk.interfaces.IAtom references in org.openscience.cdk.Bond.getConnectedAtom(IAtom)  At Bond.java:[line 265]
H C RC: Suspicious comparison of org.openscience.cdk.interfaces.IAtom references in org.openscience.cdk.Bond.getConnectedAtoms(IAtom)  At Bond.java:[line 281]
H C RC: Suspicious comparison of org.openscience.cdk.interfaces.IAtom references in org.openscience.cdk.Bond.contains(IAtom)  At Bond.java:[line 300]
...

Monday, 3 April 2017

CDK AtomContainers are Slow - Let's fix that

The core class for molecule representation in CDK is the AtomContainer. The AtomContainer uses an edge-list data structure for storing the underlying connection table (see The Right Representation for the Job).

Essentially this edge-list representation is efficient in space. Atoms can be shared between and belong to multiple AtomContainers. Therefore querying connectivity (is this atom connected to this other atom) is linear time in the number of bonds.

The inefficiency of the AtomContainer can really sting. If someone was to describe Morgan's relaxation algorithm you may implement it like Code 1. The algorithm looks reasonable however it will run much slower than you expected. You may expect the runtime of this algorithm to be ~N2 but it's actually ~N3. I've annotated with XXX where the extra effort creeps in.
Code 1 - Naive Morgan-like Relaxation (AtomContainer/AtomIter)
// Step 1. Algorithm body
int[] prev = new int[mol.getAtomCount()];
int[] next = new int[mol.getAtomCount()];
for (int i = 0; i < mol.getAtomCount(); i++) {
  next[i] = prev[i] = mol.getAtom(i).getAtomicNumber();
}
for (int rep = 0; rep < mol.getAtomCount(); rep++) { // 0..numAtoms
  for (int j = 0; j < mol.getAtomCount(); j++) {     // 0..numAtoms
    IAtom atom = mol.getAtom(j);
    // XXX: linear traversal! 0..numBonds
    for (IBond bond : mol.getConnectedBondsList(atom)) {
      IAtom nbr = bond.getConnectedAtom(atom); 
      // XXX: linear traversal! 0..numAtoms avg=numAtoms/2
      next[j] += prev[mol.getAtomNumber(nbr)]; 
    }
  }o
  System.arraycopy(next, 0, prev, 0, next.length);
}

A New Start: API Rewrite?


Ultimately to fix this problem correctly, would involve changing the core AtomContainer representation, unfortunately this would require an API change, optimally I think adding the constraint that atoms/bonds can not be in multiple molecules would be needed**. This would be a monumental change and not one I can stomach right now.

Existing Trade Off: The GraphUtil class


In 2013 I added the GraphUtil class for converting an AtomContainer to a more optimal adjacency list (int[][]) that was subsequently used to speed up many algorithms including: ring finding, canonicalisation, and substructure searching. Each time one of these algorithm is invoked with an IAtomContainer the first step is to build the adjacency list 2D array.

Code 2 - GraphUtil usage
IAtomContainer mol = ...;
int[][]        adj = GraphUtil.toAdjList(mol);

// optional with lookup map to bonds
EdgeToBondMap  e2b = EdgeToBondMap.withSpaceFor(mol);
int[][]        adj = GraphUtil.toAdjList(mol, e2b);

Although useful the usage of GraphUtil is somewhat clunky requiring passing around not just the adjacency list but the original molecule and the EdgeToBondMap if needed.
Code 3 - GraphUtil Depth First Traversal
void visit(IAtomContainer mol, int[][] adj, EdgeToBondMap bondmap, int beg, int prev) {
  mol.getAtom(beg).setFlag(CDKConstants.VISITED, true);
  for (int end : adjlist[beg]) {
    if (end == prev)
      continue;
    if (!mol.getAtom(end).getFlag(CDKConstants.VISITED))
      visit(mol, adj, bondmap, end, beg);
    else
      bondmap.get(beg, end).setIsInRing(true); // back edge
  }
}

Using the GraphUtil approach has been successful but due to the clunky-ness I've not felt comfortable exposing the option of passing these through to public APIs. It was only ever meant as an internal optimisation to be hidden from the caller. Beyond causing unintentional poor performance (Code 1) what often happens in a workflow is GraphUtil is invoked multiple times. A typical use case would be matching multiple SMARTS against one AtomContainer.

A New Public API: Atom and Bond References


I wanted something nicer to work with and came up with the idea of using object composition to extend the existing Atom and Bond APIs with methods to improve performance and connectivity checks.

Essentially the idea is to provide two classes, and AtomRef and BondRef that reference a given atom or bond in a particular AtomContainer. An AtomRef knows about the original atom it's connected bonds and the index, the BondRef knows about the original bond, it's index and the AtomRef for the connected atoms. The majority of methods (e.g. setSymbol, setImplicitHydrogenCount, setBondOrder) are passed straight through to the original atom. Some methods (such as setAtom on IBond) are blocked as being unmodifiable.

Code 4 - AtomRef and BondRef structure
class AtomRef implements IAtom {
  IAtom         atm;
  int           idx;
  List<BondRef> bnds;
}

class BondRef implements IBond {
  IBond         bnd;
  int           idx;
  AtomRef       beg, end;
}

We can now re-write the Morgan-like relaxation (Code 1) using AtomRef and BondRef. The scaling of this algorithm is now ~N2 as you would expect.
Code 5 - Morgan-like Relaxation (AtomRef/AtomIter)
// Step 1. Initial up front conversion cost
AtomRef[] arefs = AtomRef.getAtomRefs(mol);

// Step 2. Algorithm body
int[]   prev  = new int[mol.getAtomCount()];
int[]   next  = new int[mol.getAtomCount()];
for (int i = 0; i < mol.getAtomCount(); i++) {
  next[i] = prev[i] = mol.getAtom(i).getAtomicNumber();
}
for (int rep = 0; rep < mol.getAtomCount(); rep++) {
  for (AtomRef aref : arefs) {
    int idx = aref.getIndex();
    for (BondRef bond : aref.getBonds()) {
      next[idx] += prev[bond.getConnectedAtom(aref).getIndex()];
    }
  }
  System.arraycopy(next, 0, prev, 0, next.length);
}   

The depth first implementation also improves in readability and only requires two arguments.
Code 6 - AromRef Depth First (AtomRef/AtomFlags)
// Step 1. Initial up front conversion cost
void visit(AtomRef beg, BondRef prev) {
  beg.setFlag(CDKConstants.VISITED, true);
  for (BondRef bond : beg.getBonds()) {
    if (bond == prev)
      continue;
    AtomRef nbr = bond.getConnectedAtom(beg);
    if (!nbr.getFlag(CDKConstants.VISITED))
      visit(nbr, bond);
    else
      bond.setIsInRing(true); // back edge
  }
} 


Benchmark


I like the idea of exposing the AtomRef and BondRef to public APIs. I wanted to check that the trade-off in calculating and using the AtomRef/BondRef vs the current internal GraphUtil. To test this I wrote a benchmark that implements some variants of a Depth First Search and Morgan-like algorithms. I varied the algorithm implementations and whether I used, IAtomContainer, GraphUtil, or AtomRef.

The performance was measured over ChEMBL 22 and averaged the run time performance over 1/10th (167,839 records). You can find the code on GitHub (Benchmark.java). Each algorithm computes a checksum to verify the same work is being done. Here are the raw results: depthfirst.tsv, and relaxation.tsv.


Depth First Traversal


A Depth first traversal is a linear time algorithm. I tested eight implementations that varied the graph data structure and whether I used an external visit array or atom flags to mark visited atoms. When looking just at initialisation time the AtomRef creation is about the same as GraphUtil. There was some variability between the different variants but I couldn't isolate where the different came from (maybe GC/JIT related). The runtime of the AtomRef was marginally slower than GraphUtil. Both were significantly faster (18-20x) than the AtomContainer to do the traversal. When we look at the total run-time (initialisation+traversal) we see that even for a linear algorithm, the AtomRef (and GraphUtil) were ~3x faster. Including the EdgeToBondMap adds a significant penalty.




Graph Relaxation


A more interesting test is a Morgan-like relaxation, as a more expensive algorithm (N2) it should emphasise any difference between the AtomRef and GraphUtil. The variability in this algorithm is whether we relax over atoms (AtomIter - see Code 1/5) or bonds (BondIter). We see a huge variability in AtomContainer/AtomIter implementation. This is because the algorithm is more susceptible to difference in input (molecule) size.



Clearly the AtomContainer/AtomIter is really bad (~80x slower). Excluding this results shows that as expected the AtomRef/AtomIter is slower than the GraphUtil/AtomIter equivalent (~2x slower). However because the AtomRef has a richer syntax, we can do a trick with XOR number storage to improve performance or iterate over bonds (BondIter) giving like-for-like speeds.



Conclusions


The proposed AtomRef and BondRef provide a convenience API to use the CDK in a natural way with efficient connectivity access. The conversion to an AtomRef is efficient and provides a speedup even for linear algorithms. The encapsulation facilities the passing as a public API parameter, users will be able to compute it ahead of time and pass it along to multiple algorithms.

I'm somewhat tempted to provide an equivalent AtomContainerRef allowing a drop-in replacement for methods that take the IAtomContainer interface. It is technically possible to implement writes (e.g. delete bond) efficiently in which case it would no longer be a 'Ref'. Maybe I'll omit that functionality or use a better name?

Footnotes


  • ** My colleague Daniel Lowe notes that OPSIN allows atoms to be in multiple molecules and know about their neighbours but it's a bit of a fudge. It's certainly possible with some extra book keeping but prevents some other optimisations from being applied.