Monday, 2 September 2019

Rules for Interpreting Up/Down Wedge Bonds

Yesterday I was reminded of an old anecdote about a maths professor. A professor was lecturing an auditorium and writing down a proof. They proclaim at one part "and obviously x infers y". A student raises their hand and asks, "is it obvious?". The professor then studies the equation for 30 minutes until stating "Yes, it is obvious" and continued on with the proof.


Egon asked is the bridge in the following compound up or down? I replied obviously it's pointing down. Since it might not be obvious to everyone I thought I'd explain the three rules to easily assign up/down wedges to other bonds around a stereocentre in 2D.

The insight actually comes from the algorithm used to assign up/down wedges to a depiction. Since only a handful of people have ever had to write such an algorithm I'm not sure how common knowledge it is (i.e. is it actually obvious?).

Rule 1 (D4)


A tetrahedral centre with four bonds must have alternating up/down bonds. Therefore no mater what the angle if one bond is labelled as up, we know the bond opposite it must also be up, and the two either side must be down (inverse of up).


Rule 2 (D3)


For three bonds, when bonds are spaced evenly (i.e. all angles < 180 degrees) then all bonds are the same direction, all up, or all down.



Rule 3 (D3)


For three bonds, when an angle > 180 exists then think about it like the D4 case with one neighbour missing. The "outside" bonds are opposite direction to the "middle" one.



Exceptions


Like all rules there are exceptions, a well know ambiguity is when we have three bonds and the angle is exactly 180 degrees:


The problem here is you could move the central atom slightly to apply either Rule 2 or Rule 3. Some chemistry toolkits will refuse to read this others will side on the more likely interpretation (Rule 3).

A bigger and perhaps more common issue is mixing up/down wedges with perspective projection. More precisely MDL (and then SYMYX, Accelrys, now BIOVIA) had something known as the "triangle rule". The idea was if you were looking at a molecule in 3D the lengths of the bond would indicate which way round you were looking at it. They imposed this concept on 2D interpretation.

In practice what the this means is these two structures are read as different enantiomers (by BIOVIA) depending on whether the H is inside or outside the "triangle":


You're unlikely to encounter such cases except when a projection is involved. For example for the bridged system pictured below, perspective has been used and we may end up with the H within the "triangle". Note it's the point stored in the file for the atom not the actual "H" glyph that maters.


This isn't to say projections are bad, only that mixing perspective with up/down wedges can be problematic.

Wednesday, 19 June 2019

Creating Chemical Structure Animations

I've just got back from the Eighth Joint Sheffield Conference on Chemoinfomatics where I presented about the technical details of subgraph isomorphism algorithms. It was a great conference (as usual) with good science, interesting posters and lots of fun. Noel was live tweeting the whole thing so check out the #Shef2019 hashtag if you want to catch up.

To help explain the algorithms in my talk I created some animations (as videos) that showed the backtracking search procedures. After the talk several delegates asked how I created these so I thought I do a quick blog post on it (and also to remind me in future how to do it again). I'd done something similar before to demonstrate the Sayle and Delany algorithm for tautomer enumeration. Here's the PDF (and PPTX with the videos) of my Sheffield talk.

CDK + ffmpeg


The general idea is to generate a bunch of PNGs and then stick them together with the ffmpeg command line tool. In older blog posts I used to generate a GIF but it turns out mp4 compresses better with higher quality.

Step 1: Generating the PNGs with CDK


The example code below loads the SMILES for indole and then loops around highlighting one atom. Other than some normal params we also set the symbol visibility. By default the depiction will add in symbols for highlighted carbons, we can turn this off by overriding the parameter.

String dest = "/tmp/example1";
IChemObjectBuilder bldr   = SilentChemObjectBuilder.getInstance();
SmilesParser       smipar = new SmilesParser(bldr);
IAtomContainer     mol    = smipar.parseSmiles("[nH]1ccc2c1cccc2");
DepictionGenerator dg = new DepictionGenerator().withZoom(5)
                                                .withSize(1280, 720)
                                                .withParam(StandardGenerator.Visibility.class,
                                                           SymbolVisibility.iupacRecommendationsWithoutTerminalCarbon())
                                                .withOuterGlowHighlight();
new File(dest).mkdirs();
for (int i=0; i<100; i++) {
    IAtom atm = mol.getAtom(i % mol.getAtomCount());
    dg.withHighlight(Collections.singleton(atm), Color.RED)
      .depict(mol)
      .writeTo(String.format("%s/frame.%03d.png", dest, i));
}

Step 2: Stick them all together


$ ffmpeg -r 12/1 -start_number 0 -i /tmp/example1/frame.%03d.png \
  -c:v libx264 -r 30 -pix_fmt yuv420p example1.mp4

The key parameters here are the output name (last arg) and input name template (-i /tmp/example1/frame.%03d.png). I zero padded the numbers so they will be ordered correctly when alphabetically sorted and match this in the argument. You don't need to zero-pad but it means they should then alphabetical in your OS file system which is handy. The first -r is used to set the input frame rate to 12/1 (12 per 1 second).

Okay so how did it turn out....

Download: example1.mp4

Not bad but the inter-frame alignment is off due to the different size caused by the moving outer-glow. In future I might have an anchor attribute that would lock the depiction in place relative to some atom but that's quite specialised. We can fix the shifting by highlighting all other atoms to the same as the background. This is possible with the high-level APIs but it's easy enough just to set the field on the atom:

for (int i=0; i<100; i++) {
    IAtom target = mol.getAtom(i % mol.getAtomCount());
    // reset all
    for (IAtom a : mol.atoms())
        a.setProperty(StandardGenerator.HIGHLIGHT_COLOR,
                      Color.WHITE);
    target.setProperty(StandardGenerator.HIGHLIGHT_COLOR,
                       Color.RED);
    dg.depict(mol)
      .writeTo(String.format("%s/frame.%03d.png", dest, i));
}


Download: example2.mp4

Add/Remove Atoms and Bonds?


Because of the alignment issue we need to use some tricks to add/remove atoms and bonds. Essentially you draw the whole thing and then hide the parts you don't want by setting them to the background color. For example:

int frameId = 0;
for (IAtom atom : mol.atoms())
    hide(atom);
for (IBond bond : mol.bonds())
    hide(bond);
// add bonds
for (int i = 0; i < mol.getBondCount(); i++) {
    IBond bnd = mol.getBond(i);
    show(bnd);
    show(bnd.getBegin());
    show(bnd.getEnd());
    dg.depict(mol)
      .writeTo(String.format("%s/frame.%03d.png", dest, ++frameId));
}
// hold for 12 frames
for (int i = 0; i < 12; i++)
    dg.depict(mol)
      .writeTo(String.format("%s/frame.%03d.png", dest, ++frameId));
// remove bonds
for (int i = 0; i < mol.getBondCount(); i++) {
    IBond bnd = mol.getBond(mol.getBondCount()-i-1);
    hide(bnd);
    if (!visible(bnd.getBegin()))
        hide(bnd.getBegin());
    if (!visible(bnd.getEnd()))
        hide(bnd.getEnd());
    dg.depict(mol)
      .writeTo(String.format("%s/frame.%03d.png", dest, ++frameId));
}

Where show/hide are:

private static void hide(IChemObject x) {
    x.setProperty(StandardGenerator.HIGHLIGHT_COLOR,
                  Color.WHITE);
}

private static void show(IChemObject x) {
    x.setProperty(StandardGenerator.HIGHLIGHT_COLOR,
                  Color.BLACK);
}

Producing:

Download: example3.mp4

That's just about it, the CDK depiction is quite configurable but not really intended to be a general purpose drawing/animation tool. However using some tricks you can work around some quirks and get some nice results. If you make animations let me know as I'd love to see them!

Thursday, 25 October 2018

CDK Depict on HTTPS

Just a quick post to say CDK Depict is now using HTTPS https://simolecule.com/cdkdepict/depict.html. The main reason for this was Blogger stopped allowing image links to HTTP resources. In general browsers are being more fussy about non HTTPS content.

I used LetsEncrypt that turned out to be very easy to configure with TomCat.

Step 1


Install the certbot utility and use it generate a certificate.

$ sudo certbot certonly

Step 2


Configure TomCat 8+ connectors. This used to be more complex on older TomCat servers with the need to generate a separate keystore. Editing $CATALINA_HOME/confg/server.xml we configure the base connected, redirectPort is changed from 8443 to 443 (and 8080 to 80).

<Connector port="80" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="443" />

We also configure SSL connector, using port 443, change to NIO based protocol (the default requires extra native library) org.apache.coyote.http2.Http2Protocol, and set the file paths to the .pem files generated by certbot.

<Connector port="443" 
           protocol="org.apache.coyote.http11.Http11NioProtocol"
           maxThreads="150" SSLEnabled="true" >
  <UpgradeProtocol className="org.apache.coyote.http2.Http2Protocol" />
  <SSLHostConfig>
    <Certificate certificateKeyFile="/etc/letsencrypt/live/www.simolecule.com/privkey.pem"
                 certificateFile="/etc/letsencrypt/live/www.simolecule.com/cert.pem"
                 certificateChainFile="/etc/letsencrypt/live/www.simolecule.com/chain.pem"
                 type="RSA" />
  </SSLHostConfig>
</Connector>       

Step 3 (optional)


If a client tries to visit the HTTP site we want to redirect them to HTTPS. To do this we edit $CATALINA_HOME/confg/web.xml adding this section to the end of the <web-app> block

<security-constraint>
  <web-resource-collection>
    <web-resource-name>Entire Application</web-resource-name>
    <url-pattern>/*</url-pattern>
  </web-resource-collection>
  <user-data-constraint>
    <transport-guarantee>CONFIDENTIAL</transport-guarantee>
  </user-data-constraint>
</security-constraint>

Monday, 22 October 2018

Bit packing for fast atom type assignment

Many cheminformatics algorithms perform some form of atom typing as their first step. The atom types you need and the granularity depend on the algorithm. At the core of all atom typing lies some form of decision tree, usually manifesting as a complex if-else cascade.

Using an elegant technique Roger showed me a few years ago, you can replace this if-else cascade with a table/switch lookup. This turns out to be very clean, easy to extend, and efficient. The technique relies on first computing a value that captures the bond environment around an atom and packing this in to a single value.

Here's what it looks like:
Algorithm 1
int btypes = atom.getImplicitHydrogenCount();
for (IBond bond : atom.bonds()) {
  switch (bond.getOrder()) {
    case SINGLE: btypes += 0x0001; break;
    case DOUBLE: btypes += 0x0010; break;
    case TRIPLE: btypes += 0x0100; break;
    default:     btypes += 0x1000; break;
  }
}

After the value btypes has been calculated (once for each atom) it contains the count of single, double, triple, and other bonds. We can inspect each of these counts individually my masking of the relevant bits or the entire value, for example:
Algorithm 2
switch (btypes) {
 case 0x004: // 4 single bonds, e.g. Sp3 carbon
  break;
 case 0x012: // 1 double bond, 2 single bonds e.g. Sp2 carbon
  break;
 case 0x020: // 2 double bonds e.g. Sp cumulated carbon
  break;
 case 0x101: // 1 triple bond, 1 single bond e.g. Sp triple bonded carbon
  break;
}

Using a nibble allows us to store numbers up to 16 (24) - more than enough for any sane chemistry. In the example above I shoved default bonds under an 'other' category but of course you could extend it to handle quadruple bonds and even additional properties of the bonds:
Algorithm 3
int btypes = atom.getImplicitHydrogenCount();
for (IBond bond : atom.bonds()) {
  switch (bond.getOrder()) {
    case SINGLE: 
      if (bond.isAromatic())
        btypes += 0x010001; 
      else
        btypes += 0x000001;
      break;
    case DOUBLE: 
      if (bond.isAromatic())
        btypes += 0x010010; 
      else if (isOxygen(bond.getOther(atom)))
        btypes += 0x100010; // dbs to oxygens
      else
        btypes += 0x000010;
      break;
    case TRIPLE: 
       btypes += 0x000100;
       break;
  }
}

Friday, 13 April 2018

RDKit Reaction SMARTS

There's a been some papers using the RDKit for synthesis planning. If you're writing a paper and use the term "Reaction SMARTS" make sure you mean what everyone else thinks it means.

The SMILES, SMARTS, and SMIRKS line notations were created* by Daylight for storing, matching, and transforming connection tables.

  • SMILES describes a connection table to store molecule and reactions
  • SMARTS describes a pattern (or query) to match molecules and reactions
  • SMIRKS describes a transform (or "reaction") to modify molecules

RDKit uses the term "Reaction SMARTS" to mean "transform" (see RDKit Book). Unfortunately in Daylight's terminology Reaction SMARTS is a pattern not a transform.

Screenshot from the Daylight SMARTS theory manual.
Reactions SMARTS is primarily useful for searching reaction databases. For example this Reaction SMILES:

[cH:13]1[c:14]([cH:19][c:20]([c:10]([c:11]1[Cl:12])[n:9]2[cH:8][c:5]([c:4]([n:22]2)[NH2:3])[C:6]#[N:7])[Cl:21])[C:15]([F:16])([F:17])[F:18].[OH:1][OH:2]>C(Cl)Cl.C(=O)(C(F)(F)F)OC(=O)C(F)(F)F.O>[cH:13]1[c:14]([cH:19][c:20]([c:10]([c:11]1[Cl:12])[n:9]2[cH:8][c:5]([c:4]([n:22]2)[N+:3](=[O:1])[O-:2])[C:6]#[N:7])[Cl:21])[C:15]([F:16])([F:17])[F:18]

is matched by this Reaction SMARTS

[*:1][Nh2:2]>>[*:1][Nh0:2](~[OD1])~[OD1] amino to nitro

You can highlight the substructure:

Highlighting the SMARTS in the SMILES using CDK Depict

But that's a transform!


Yes but it's matching a transform (SMARTS) not applying one (SMIRKS), some may think you could read this unmodified as a SMIRKS but this is not the case. SMIRKS needs "real parts" after the second angled bracket as these are the parts created by the transform. Note that '*' is valid SMILES and in SMIRKS it kind of means "unmodified". This actually gives us the nice invariants:

All SMILES are valid SMARTS but not all SMARTS are valid SMILES
and
All SMIRKS are valid SMARTS but not all SMARTS are valid SMIRKS

Here is the SMIRKS transform for amino to nitro

[*:1][ND3:2]([H])([H])>>[*:1][N:2](=O)=O amino to nitro

In SMIRKS I can apply this SMIRKS to "molecules" and it will create "reactions". Note these molecules do not need to have atom-maps but they will come out with atom maps (see dt_transform)!

c1ccccc1N
[nH]1ccc2c1cc(N)cc2

The output is

c1cccc[c:1]1[NH2:2]>>c1cccc[c:1]1[N:2](=O)=O
[nH]1ccc2c1c[c:1]([NH2:2])cc2>>[nH]1ccc2c1c[c:1]([N:2](=O)=O)cc2

And another thing...


In general you can't run SMIRKS backwards. If I want to run a nitro to amino because the atoms/bonds we're adding need to be "real" we need to encode the reverse transform separately!

[*:1][ND3:2]([H])([H])>>[*:1][N:2](=O)=O amino to nitro
[*:1][ND3:2](~[OD1])(~[OD1])>>[*:1][N:2]([H])[H] nitro to amino

Although dt_transform specifies a direction this only controls whether the input molecules appear on the left or right of the output reaction.


*SMILES was created by Dave Weininger whilst at EPA