Efficient Bits: 2018

Thursday, 25 October 2018

CDK Depict on HTTPS

Just a quick post to say CDK Depict is now using HTTPS: https://simolecule.com/cdkdepict/depict.html. The main reason for this was that Blogger stopped allowing image links to HTTP resources. In general, browsers are becoming fussier about non-HTTPS content.

I used Let's Encrypt, which turned out to be very easy to configure with Tomcat.

Step 1

Install the certbot utility and use it to generate a certificate.

$ sudo certbot certonly

Step 2

Configure the Tomcat 8+ connectors. This used to be more complex on older Tomcat servers, which needed a separate keystore to be generated. Editing $CATALINA_HOME/conf/server.xml, we configure the base connector: redirectPort is changed from 8443 to 443 (and the port from 8080 to 80).

<Connector port="80" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="443" />

We also configure the SSL connector: use port 443, change to the NIO-based protocol org.apache.coyote.http11.Http11NioProtocol (the default requires an extra native library), add org.apache.coyote.http2.Http2Protocol as an upgrade protocol, and set the file paths to the .pem files generated by certbot.

<Connector port="443" 
           protocol="org.apache.coyote.http11.Http11NioProtocol"
           maxThreads="150" SSLEnabled="true" >
  <UpgradeProtocol className="org.apache.coyote.http2.Http2Protocol" />
  <SSLHostConfig>
    <Certificate certificateKeyFile="/etc/letsencrypt/live/www.simolecule.com/privkey.pem"
                 certificateFile="/etc/letsencrypt/live/www.simolecule.com/cert.pem"
                 certificateChainFile="/etc/letsencrypt/live/www.simolecule.com/chain.pem"
                 type="RSA" />
  </SSLHostConfig>
</Connector>

Step 3 (optional)

If a client tries to visit the HTTP site, we want to redirect them to HTTPS. To do this, we edit $CATALINA_HOME/conf/web.xml, adding this section to the end of the <web-app> block:

<security-constraint>
  <web-resource-collection>
    <web-resource-name>Entire Application</web-resource-name>
    <url-pattern>/*</url-pattern>
  </web-resource-collection>
  <user-data-constraint>
    <transport-guarantee>CONFIDENTIAL</transport-guarantee>
  </user-data-constraint>
</security-constraint>

Monday, 22 October 2018

Bit packing for fast atom type assignment

Many cheminformatics algorithms perform some form of atom typing as their first step. The atom types you need and the granularity depend on the algorithm. At the core of all atom typing lies some form of decision tree, usually manifesting as a complex if-else cascade.

Using an elegant technique Roger showed me a few years ago, you can replace this if-else cascade with a table/switch lookup. This turns out to be very clean, easy to extend, and efficient. The technique relies on first capturing the bond environment around an atom and packing it into a single value.

Here's what it looks like:

Algorithm 1

int btypes = atom.getImplicitHydrogenCount();
for (IBond bond : atom.bonds()) {
  switch (bond.getOrder()) {
    case SINGLE: btypes += 0x0001; break;
    case DOUBLE: btypes += 0x0010; break;
    case TRIPLE: btypes += 0x0100; break;
    default:     btypes += 0x1000; break;
  }
}

After the value btypes has been calculated (once for each atom), it contains the counts of single, double, triple, and other bonds, with implicit hydrogens counted as single bonds. We can inspect each of these counts individually by masking off the relevant bits, or switch on the entire value, for example:

Algorithm 2

switch (btypes) {
 case 0x004: // 4 single bonds, e.g. Sp3 carbon
  break;
 case 0x012: // 1 double bond, 2 single bonds e.g. Sp2 carbon
  break;
 case 0x020: // 2 double bonds e.g. Sp cumulated carbon
  break;
 case 0x101: // 1 triple bond, 1 single bond e.g. Sp triple bonded carbon
  break;
}
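
The switch above matches the entire value. Pulling out an individual count by masking could be sketched roughly as below. This is only an illustration that assumes the nibble layout from Algorithm 1; the class and method names (BondCounts, singles, doubles, triples, others) are hypothetical and not part of CDK.

```java
public final class BondCounts {
    private BondCounts() {}

    // Each count occupies one nibble (4 bits) of the packed value,
    // following the layout used in Algorithm 1.
    static int singles(int btypes) { return btypes & 0xF; }          // includes implicit hydrogens
    static int doubles(int btypes) { return (btypes >> 4) & 0xF; }
    static int triples(int btypes) { return (btypes >> 8) & 0xF; }
    static int others(int btypes)  { return (btypes >> 12) & 0xF; }

    public static void main(String[] args) {
        int btypes = 0x012; // 1 double bond, 2 single bonds
        System.out.println(singles(btypes)); // 2
        System.out.println(doubles(btypes)); // 1
        System.out.println(triples(btypes)); // 0
    }
}
```

Masking is handy when you only care about one property, for example "has at least one double bond", while the switch is better when the whole environment must match.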

Using a nibble for each count allows us to store values up to 15 (2⁴ − 1), more than enough for any sane chemistry. In the example above I shoved default bonds under an 'other' category but of course you could extend it to handle quadruple bonds and even additional properties of the bonds:

Algorithm 3

int btypes = atom.getImplicitHydrogenCount();
for (IBond bond : atom.bonds()) {
  switch (bond.getOrder()) {
    case SINGLE: 
      if (bond.isAromatic())
        btypes += 0x010001; 
      else
        btypes += 0x000001;
      break;
    case DOUBLE: 
      if (bond.isAromatic())
        btypes += 0x010010; 
      else if (isOxygen(bond.getOther(atom)))
        btypes += 0x100010; // dbs to oxygens
      else
        btypes += 0x000010;
      break;
    case TRIPLE:
      btypes += 0x000100;
      break;
  }
}
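
A lookup on the extended value might then look something like the sketch below. It assumes the layout from Algorithm 3 (aromatic bond count in the fifth nibble, double bonds to oxygen in the sixth) and a Kekulé representation with aromatic flags. The class name, constants, and type labels are made up for illustration and are not CDK atom types.

```java
public final class ExtendedTyping {
    private ExtendedTyping() {}

    // Increments matching Algorithm 3 (hypothetical constant names).
    static final int SINGLE       = 0x000001;
    static final int DOUBLE       = 0x000010;
    static final int TRIPLE       = 0x000100;
    static final int AROMATIC     = 0x010000;
    static final int DB_TO_OXYGEN = 0x100000;

    // Maps a packed bond environment to an illustrative label.
    static String classify(int btypes) {
        switch (btypes) {
            case 0x000004: return "C.sp3";      // 4 single bonds
            case 0x000012: return "C.sp2";      // 1 double, 2 single
            case 0x020012: return "C.aromatic"; // 2 aromatic bonds (1 single, 1 double) + 1 single
            case 0x100012: return "C.carbonyl"; // 1 double bond to oxygen + 2 single
            case 0x000101: return "C.sp";       // 1 triple, 1 single
            default:       return "unknown";
        }
    }

    public static void main(String[] args) {
        // Aromatic CH: one implicit hydrogen, one aromatic single, one aromatic double.
        int btypes = 1 + (AROMATIC + SINGLE) + (AROMATIC + DOUBLE);
        System.out.println(classify(btypes)); // C.aromatic
    }
}
```

Adding a new type is then a one-line case rather than another branch in an if-else cascade.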

Friday, 13 April 2018

RDKit Reaction SMARTS

Update 19/09/22

I believe I was wrong: Daylight did allow SMIRKS to be run backwards, and the "direction" parameter controls it ("personal communication" from Roger). The key insight is that the documentation notes SMARTS patterns should only appear on atoms that don't have bond changes around them. From the man page:

"Atomic expressions may be any valid atomic SMARTS expression for nodes where the bonding (connectivity & bond order) doesn't change. Otherwise, the atomic expressions must be valid SMILES."

Whether this makes sense for writing portable SMIRKS is questionable.

This is supplementary to my original grumble on ambiguous naming. I still believe RDKit should just call it SMIRKS, or at least something that didn't already mean something else (e.g. "RdSmirks"). I do agree there is a lot of overlap between SMARTS and SMIRKS, but conflating these terms is problematic. Shortly after I wrote the original post, the SMIRKS Native Open Force Field (SMIRNOFF) was published. In that work SMARTS is used, but the authors call it SMIRKS. They attempt some wiggling, saying that they use the "SMIRKS features" of atom maps. However, it is incorrect to treat atom maps as a SMIRKS feature, since they are also part of what Daylight called Reaction SMARTS; see the table "Examples of Reaction SMARTS" (SMARTS Theory Page). Of course SMIRNOFF was just too good a pun to change, but it again creates more confusion.

I recently ran some comparisons/benchmarks on different SMIRKS implementations. Everyone's semantics are consistently inconsistent, and the community would benefit from an effort to standardise. Perhaps then I can convince RDKit that they really do have a (good) SMIRKS implementation and it would be less confusing to just call it that.