tag:blogger.com,1999:blog-61316748094990987892024-03-14T02:44:24.495+00:00Efficient BitsAlgorithms, data structures, and utilities for processing chemical informationJohn Mayfieldhttp://www.blogger.com/profile/03062858450255240984noreply@blogger.comBlogger39125tag:blogger.com,1999:blog-6131674809499098789.post-19165448947107540892019-09-02T15:45:00.000+01:002019-09-02T15:45:21.436+01:00Rules for Interpreting Up/Down Wedge BondsYesterday I was reminded of an old anecdote about a maths professor. A professor was lecturing an auditorium and writing down a proof. They proclaim at one part "and obviously x infers y". A student raises their hand and asks, "is it obvious?". The professor then studies the equation for 30 minutes until stating "Yes, it is obvious" and continued on with the proof.<br />
<br />
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">chemists and SMILES experts, is the bridge carbon point towards us, or away from us? FC(F)(F)c1ccc(nc1)O[C@H]1C[C@H](C[C@H]12)C([2H])([2H])N2C(=O)c1cccc(F)c1-c1=Nccc=N1 <a href="https://t.co/6WwEvUSbkl">https://t.co/6WwEvUSbkl</a> (cc <a href="https://twitter.com/jwmay?ref_src=twsrc%5Etfw">@jwmay</a> <a href="https://twitter.com/cdsouthan?ref_src=twsrc%5Etfw">@cdsouthan</a>) <a href="https://t.co/93q8TpZA58">pic.twitter.com/93q8TpZA58</a></p>— Egon Willighⓐgen (@egonwillighagen) <a href="https://twitter.com/egonwillighagen/status/1168141020598587395?ref_src=twsrc%5Etfw">September 1, 2019</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<br />
Egon asked is the bridge in the following compound up or down? I replied obviously it's pointing down. Since it might not be obvious to everyone I thought I'd explain the three rules to easily assign up/down wedges to other bonds around a stereocentre in 2D.<br />
<br />
The insight actually comes from the algorithm used to assign up/down wedges to a depiction. Since only a handful of people have ever had to write such an algorithm I'm not sure how common knowledge it is (i.e. is it actually obvious?).<br />
<br />
<h2>
Rule 1 <b>(D<sub>4</sub>)</b></h2>
<br />
A tetrahedral centre with four bonds must have alternating up/down bonds. Therefore no mater what the angle if one bond is labelled as up, we know the bond opposite it must also be up, and the two either side must be down (inverse of up).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgu1tHqBKQYNIojOytL29gvTV1EzM0QWgS5jLMp1w8NznklKI4Q92WBWso-I-AVFf3km-Xjhfv8mHBtdd0yryzfQKCKK68GFbGjtsTy1AeyI-aOIEIVMFO39yHQ_jtrtOHmCQZtQKtIfhc/s1600/D4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="166" data-original-width="600" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgu1tHqBKQYNIojOytL29gvTV1EzM0QWgS5jLMp1w8NznklKI4Q92WBWso-I-AVFf3km-Xjhfv8mHBtdd0yryzfQKCKK68GFbGjtsTy1AeyI-aOIEIVMFO39yHQ_jtrtOHmCQZtQKtIfhc/s1600/D4.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<h2>
Rule 2 <b>(D<sub>3</sub>)</b></h2>
<br />
For three bonds, when bonds are spaced evenly (i.e. all angles < 180 degrees) then all bonds are the same direction, all up, or all down.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYjQ6kOYSVq1CtGbV_mGc5sy-H8Yz_y84iWmPHvXE4M9UgS9fekSGugSu4eTQxoFBGTwp1oFaUUwheosVcAxU2boCpnBrQ3oa-yYYFvXRqrTM4dc741Zbfa3Re9iDuo2vSh3kaeW_l61Q/s1600/D3-1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="137" data-original-width="325" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYjQ6kOYSVq1CtGbV_mGc5sy-H8Yz_y84iWmPHvXE4M9UgS9fekSGugSu4eTQxoFBGTwp1oFaUUwheosVcAxU2boCpnBrQ3oa-yYYFvXRqrTM4dc741Zbfa3Re9iDuo2vSh3kaeW_l61Q/s1600/D3-1.png" /></a></div>
<br />
<br />
<h2>
Rule 3 <b>(D<sub>3</sub>)</b></h2>
<br />
For three bonds, when an angle > 180 exists then think about it like the <b>D<sub>4</sub></b> case with one neighbour missing. The "outside" bonds are opposite direction to the "middle" one.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgioOQ5fz3EBsdtCzVIaxpkyCxTznM4AEf1zZywTtUe-6s8JhLIeh04y3jp_R8qJlwBWM4v3th9t4CGz37gb-I1-utTqpBaqD02-yhXdDkeDDY7E4citJVLzs7w78rg7hfuegeDitL6pXQ/s1600/D3-2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="225" data-original-width="341" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgioOQ5fz3EBsdtCzVIaxpkyCxTznM4AEf1zZywTtUe-6s8JhLIeh04y3jp_R8qJlwBWM4v3th9t4CGz37gb-I1-utTqpBaqD02-yhXdDkeDDY7E4citJVLzs7w78rg7hfuegeDitL6pXQ/s1600/D3-2.png" /></a></div>
<br />
<br />
<h2>
<b>Exceptions</b></h2>
<br>
Like all rules there are exceptions, a well know ambiguity is when we have three bonds and the angle is exactly 180 degrees:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgGB9xydotQyv9TEVe09OOMISMEWDr5gEpgch-g99o8PuDSuh6y1S8DJexiJiYr8qIam06-1HfyzHdoH775TeTGeW453N-W90hf2owgZfkW4GlXdyA1MJ_6eFLoWurgKzzL43ZoCLS6UQ/s1600/D3-3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="162" data-original-width="100" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgGB9xydotQyv9TEVe09OOMISMEWDr5gEpgch-g99o8PuDSuh6y1S8DJexiJiYr8qIam06-1HfyzHdoH775TeTGeW453N-W90hf2owgZfkW4GlXdyA1MJ_6eFLoWurgKzzL43ZoCLS6UQ/s1600/D3-3.png" /></a></div>
<br />
The problem here is you could move the central atom slightly to apply either Rule 2 or Rule 3. Some chemistry toolkits will refuse to read this others will side on the more likely interpretation (Rule 3).<br />
<br />
A bigger and perhaps more common issue is mixing up/down wedges with perspective projection. More precisely MDL (and then SYMYX, Accelrys, now BIOVIA) had something known as the "triangle rule". The idea was if you were looking at a molecule in 3D the lengths of the bond would indicate which way round you were looking at it. They imposed this concept on 2D interpretation.<br />
<br />
In practice what the this means is these two structures are read as different enantiomers (by BIOVIA) depending on whether the H is inside or outside the "triangle":<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTHSQUAQwXL7N_qrmLZxGwIxd85aHPkKtd6oYSHHxuc7TSCCuS3zwTkZ01oe335bVTtZoFvWQXBXwAQAoUnLGxTpA6EhDqLuWdga_9tb-ekx0ZEHewkLFdPJC_Z9Zc5O8VwSZHYImxNtU/s1600/D4-2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="154" data-original-width="541" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTHSQUAQwXL7N_qrmLZxGwIxd85aHPkKtd6oYSHHxuc7TSCCuS3zwTkZ01oe335bVTtZoFvWQXBXwAQAoUnLGxTpA6EhDqLuWdga_9tb-ekx0ZEHewkLFdPJC_Z9Zc5O8VwSZHYImxNtU/s1600/D4-2.png" /></a></div>
<br />
You're unlikely to encounter such cases except when a projection is involved. For example for the bridged system pictured below, perspective has been used and we may end up with the H within the "triangle". Note it's the point stored in the file for the atom not the actual "H" glyph that maters.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguS0x33Z3pP7c85uH8_8tsYZfrFeD3X6NHWCx6eiwu5iWmLARa5NgsswUuEq0P0eKITTbO8UaXfkHEbkzGFBJln8n6VCvP2B0-9NnZPfC_zqyMYzb9er9g_l8ap_YT0gybrLU6ekjv9Ls/s1600/D4-3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="212" data-original-width="379" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguS0x33Z3pP7c85uH8_8tsYZfrFeD3X6NHWCx6eiwu5iWmLARa5NgsswUuEq0P0eKITTbO8UaXfkHEbkzGFBJln8n6VCvP2B0-9NnZPfC_zqyMYzb9er9g_l8ap_YT0gybrLU6ekjv9Ls/s1600/D4-3.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
This isn't to say projections are bad, only that mixing perspective with up/down wedges can be problematic.</div>
<br />John Mayfieldhttp://www.blogger.com/profile/03062858450255240984noreply@blogger.com0tag:blogger.com,1999:blog-6131674809499098789.post-16080592622013439952019-06-19T21:57:00.000+01:002019-06-19T21:57:03.346+01:00Creating Chemical Structure AnimationsI've just got back from the <a href="https://cisrg.shef.ac.uk/shef2019/"><span id="goog_1463174872"></span>Eighth Joint Sheffield Conference on Chemoinfomatics</a> where I presented about the technical details of subgraph isomorphism algorithms. It was a great conference (as usual) with good science, interesting posters and lots of fun. <a href="https://baoilleach.blogspot.com/">Noel</a> was live tweeting the whole thing so check out the <a href="https://twitter.com/hashtag/Shef2019?src=hash">#Shef2019</a> hashtag if you want to catch up.<br />
<br />
To help explain the algorithms in my talk I created some animations (as videos) that showed the backtracking search procedures. After the talk several delegates asked how I created these so I thought I do a quick blog post on it (and also to remind me in future how to do it again). I'd done something similar before to demonstrate the <a href="https://efficientbits.blogspot.com/2014/02/animated-algorithm-canonical-tautomer.html">Sayle and Delany algorithm</a> for tautomer enumeration. Here's the <a href="https://www.nextmovesoftware.com/talks/Mayfield_SecretsOfFastSmartsMatching_Sheffield_201906.pdf">PDF</a> (and <a href="https://www.nextmovesoftware.com/talks/Mayfield_SecretsOfFastSmartsMatching_Sheffield_201906.pptx">PPTX</a> with the videos) of my Sheffield talk.<br />
<br />
<h2>
CDK + ffmpeg</h2>
<br />
<div>
The general idea is to generate a bunch of PNGs and then stick them together with the <a href="https://ffmpeg.org/">ffmpeg</a> command line tool. In older blog posts I used to generate a GIF but it turns out mp4 compresses better with higher quality.</div>
<div>
<br /></div>
<h4>
Step 1: Generating the PNGs with CDK</h4><br/>
The example code below loads the SMILES for indole and then loops around highlighting one atom. Other than some normal params we also set the symbol visibility. By default the depiction will add in symbols for highlighted carbons, we can turn this off by overriding the parameter.
<br /><br/>
<pre class="prettyprint lang-java">
String dest = "/tmp/example1";
IChemObjectBuilder bldr = SilentChemObjectBuilder.getInstance();
SmilesParser smipar = new SmilesParser(bldr);
IAtomContainer mol = smipar.parseSmiles("[nH]1ccc2c1cccc2");
DepictionGenerator dg = new DepictionGenerator().withZoom(5)
.withSize(1280, 720)
.withParam(StandardGenerator.Visibility.class,
SymbolVisibility.iupacRecommendationsWithoutTerminalCarbon())
.withOuterGlowHighlight();
new File(dest).mkdirs();
for (int i=0; i<100; i++) {
IAtom atm = mol.getAtom(i % mol.getAtomCount());
dg.withHighlight(Collections.singleton(atm), Color.RED)
.depict(mol)
.writeTo(String.format("%s/frame.%03d.png", dest, i));
}
</pre>
<br/>
<h4>
Step 2: Stick them all together</h4>
<br/>
<pre class="prettyprint lang-bash">$ ffmpeg -r 12/1 -start_number 0 -i /tmp/example1/frame.%03d.png \
-c:v libx264 -r 30 -pix_fmt yuv420p example1.mp4
</pre>
<br/>
The key parameters here are the output name (last arg) and input name template (<code>-i /tmp/example1/frame.%03d.png</code>). I zero padded the numbers so they will be ordered correctly when alphabetically sorted and match this in the argument. You don't need to zero-pad but it means they should then alphabetical in your OS file system which is handy. The first <code>-r</code> is used to set the input frame rate to <code>12/1</code> (12 per 1 second).
<br/><br/>
Okay so how did it turn out....
<br/>
<div>
<center>
<video width="320" height="240" autoplay loop>
<source src="https://www.simolecule.com/blog/cdkanimate/example1.mp4" type="video/mp4">
Your browser does not support the video tag.
</video><br/>
Download: <a href="https://www.simolecule.com/blog/cdkanimate/example1.mp4">example1.mp4</a>
</center>
</div>
<br/>
Not bad but the inter-frame alignment is off due to the different size caused by the moving outer-glow. In future I might have an anchor attribute that would lock the depiction in place relative to some atom but that's quite specialised. We can fix the shifting by highlighting all other atoms to the same as the background. This is possible with the high-level APIs but it's easy enough just to set the field on the atom:
<br/><br/>
<pre class="prettyprint lang-java">
for (int i=0; i<100; i++) {
IAtom target = mol.getAtom(i % mol.getAtomCount());
// reset all
for (IAtom a : mol.atoms())
a.setProperty(StandardGenerator.HIGHLIGHT_COLOR,
Color.WHITE);
target.setProperty(StandardGenerator.HIGHLIGHT_COLOR,
Color.RED);
dg.depict(mol)
.writeTo(String.format("%s/frame.%03d.png", dest, i));
}
</pre>
<br/>
<div>
<center>
<video width="320" height="240" autoplay loop>
<source src="https://www.simolecule.com/blog/cdkanimate/example2.mp4" type="video/mp4">
Your browser does not support the video tag.
</video><br/>
Download: <a href="https://www.simolecule.com/blog/cdkanimate/example2.mp4">example2.mp4</a>
</center>
</div>
<br/>
<h2>Add/Remove Atoms and Bonds?</h2>
<br/>
Because of the alignment issue we need to use some tricks to add/remove atoms and bonds. Essentially you draw the whole thing and then hide the parts you don't want by setting them to the background color. For example:
<br/><br/>
<pre class="prettyprint">
int frameId = 0;
for (IAtom atom : mol.atoms())
hide(atom);
for (IBond bond : mol.bonds())
hide(bond);
// add bonds
for (int i = 0; i < mol.getBondCount(); i++) {
IBond bnd = mol.getBond(i);
show(bnd);
show(bnd.getBegin());
show(bnd.getEnd());
dg.depict(mol)
.writeTo(String.format("%s/frame.%03d.png", dest, ++frameId));
}
// hold for 12 frames
for (int i = 0; i < 12; i++)
dg.depict(mol)
.writeTo(String.format("%s/frame.%03d.png", dest, ++frameId));
// remove bonds
for (int i = 0; i < mol.getBondCount(); i++) {
IBond bnd = mol.getBond(mol.getBondCount()-i-1);
hide(bnd);
if (!visible(bnd.getBegin()))
hide(bnd.getBegin());
if (!visible(bnd.getEnd()))
hide(bnd.getEnd());
dg.depict(mol)
.writeTo(String.format("%s/frame.%03d.png", dest, ++frameId));
}
</pre>
<br/>
Where show/hide are:
<br/><br/>
<pre class="prettyprint">
private static void hide(IChemObject x) {
x.setProperty(StandardGenerator.HIGHLIGHT_COLOR,
Color.WHITE);
}
private static void show(IChemObject x) {
x.setProperty(StandardGenerator.HIGHLIGHT_COLOR,
Color.BLACK);
}
</pre><br/>
Producing:
<div>
<center>
<video width="320" height="240" autoplay loop>
<source src="https://www.simolecule.com/blog/cdkanimate/example3.mp4" type="video/mp4">
Your browser does not support the video tag.
</video><br/>
Download: <a href="https://www.simolecule.com/blog/cdkanimate/example3.mp4">example3.mp4</a>
</center>
</div>
<br/>
That's just about it, the CDK depiction is quite configurable but not really intended to be a general purpose drawing/animation tool. However using some tricks you can work around some quirks and get some nice results. If you make animations let me know as I'd love to see them!John Mayfieldhttp://www.blogger.com/profile/03062858450255240984noreply@blogger.com0tag:blogger.com,1999:blog-6131674809499098789.post-3006505611410737002018-10-25T12:47:00.001+01:002018-10-25T19:33:46.018+01:00CDK Depict on HTTPSJust a quick post to say CDK Depict is now using HTTPS <a href="https://simolecule.com/cdkdepict/depict.html">https://simolecule.com/cdkdepict/depict.html</a>. The main reason for this was Blogger stopped allowing image links to HTTP resources. In general browsers are being more fussy about non HTTPS content.
<br><br>
I used <a href="https://letsencrypt.org/">LetsEncrypt</a> that turned out to be very easy to configure with TomCat.
<br><br>
<h3>Step 1</h3>
<br>
<div style="margin-left: 20px">
Install the <code>certbot</code> utility and use it generate a certificate. <br><br>
<pre>
$ sudo certbot certonly
</pre>
</div>
<br>
<h3>Step 2</h3>
<br>
<div style="margin-left: 20px">
Configure TomCat 8+ connectors. This used to be more complex on older TomCat servers with the need to generate a separate keystore. Editing <code>$CATALINA_HOME/confg/server.xml</code> we configure the base connected, redirectPort is changed from 8443 to 443
(and 8080 to 80). <br><br>
<pre class="prettyprint xml">
<Connector port="80" protocol="HTTP/1.1"
connectionTimeout="20000"
redirectPort="443" />
</pre>
<br>
We also configure SSL connector, using port 443, change to NIO based protocol (the default requires extra native library) <code>org.apache.coyote.http2.Http2Protocol</code>, and set the file paths to the <code>.pem</code> files generated by certbot. <br><br>
<pre class="prettyprint xml">
<Connector port="443"
protocol="org.apache.coyote.http11.Http11NioProtocol"
maxThreads="150" SSLEnabled="true" >
<UpgradeProtocol className="org.apache.coyote.http2.Http2Protocol" />
<SSLHostConfig>
<Certificate certificateKeyFile="/etc/letsencrypt/live/www.simolecule.com/privkey.pem"
certificateFile="/etc/letsencrypt/live/www.simolecule.com/cert.pem"
certificateChainFile="/etc/letsencrypt/live/www.simolecule.com/chain.pem"
type="RSA" />
</SSLHostConfig>
</Connector>
</pre>
</div>
<br>
<h3>Step 3 (optional)</h3>
<br>
<div style="margin-left: 20px">
If a client tries to visit the HTTP site we want to redirect them to HTTPS. To do this we edit <code>$CATALINA_HOME/confg/web.xml</code> adding this section to the end of the <code><web-app></code> block<br><br>
<pre class="prettyprint xml">
<security-constraint>
<web-resource-collection>
<web-resource-name>Entire Application</web-resource-name>
<url-pattern>/*</url-pattern>
</web-resource-collection>
<user-data-constraint>
<transport-guarantee>CONFIDENTIAL</transport-guarantee>
</user-data-constraint>
</security-constraint>
</pre>
</div>John Mayfieldhttp://www.blogger.com/profile/03062858450255240984noreply@blogger.com0tag:blogger.com,1999:blog-6131674809499098789.post-83471914364819066012018-10-22T08:26:00.000+01:002018-10-22T09:58:54.900+01:00Bit packing for fast atom type assignmentMany cheminformatics algorithms perform some form of atom typing as their first step. The atom types you need and the granularity depend on the algorithm. At the core of all atom typing lies some form of decision tree, usually manifesting as a complex if-else cascade.<br />
<br />
Using an elegant technique Roger showed me a few years ago, you can replace this if-else cascade with a table/switch lookup. This turns out to be very clean, easy to extend, and efficient. The technique relies on first computing a value that captures the bond environment around an atom and packing this in to a single value.
<br/><br/>
Here's what it looks like:<br />
<div class="code-header">Algorithm 1</div>
<pre class="java prettyprint">
int btypes = atom.getImplicitHydrogenCount();
for (IBond bond : atom.bonds()) {
switch (bond.getOrder()) {
case SINGLE: btypes += 0x0001; break;
case DOUBLE: btypes += 0x0010; break;
case TRIPLE: btypes += 0x0100; break;
default: btypes += 0x1000; break;
}
}
</pre>
<div class="code-footer"></div>
<br/>
After the value <code>btypes</code> has been calculated (once for each atom) it contains the count of single, double, triple, and other bonds. We can inspect each of these counts individually my masking of the relevant bits or the entire value, for example:
<br />
<div class="code-header">Algorithm 2</div>
<pre class="java prettyprint">
switch (btypes) {
case 0x004: // 4 single bonds, e.g. Sp3 carbon
break;
case 0x012: // 1 double bond, 2 single bonds e.g. Sp2 carbon
break;
case 0x020: // 2 double bonds e.g. Sp cumulated carbon
break;
case 0x101: // 1 triple bond, 1 single bond e.g. Sp triple bonded carbon
break;
}
</pre>
<div class="code-footer"></div>
<br/>
Using a nibble allows us to store numbers up to 16 (2<sup>4</sup>) - more than enough for any sane chemistry. In the example above I shoved default bonds under an 'other' category but of course you could extend it to handle quadruple bonds and even additional properties of the bonds:
<div class="code-header">Algorithm 3</div>
<pre class="java prettyprint">
int btypes = atom.getImplicitHydrogenCount();
for (IBond bond : atom.bonds()) {
switch (bond.getOrder()) {
case SINGLE:
if (bond.isAromatic())
btypes += 0x010001;
else
btypes += 0x000001;
break;
case DOUBLE:
if (bond.isAromatic())
btypes += 0x010010;
else if (isOxygen(bond.getOther(atom)))
btypes += 0x100010; // dbs to oxygens
else
btypes += 0x000010;
break;
case TRIPLE:
btypes += 0x000100;
break;
}
}
</pre>
<div class="code-footer"></div>
John Mayfieldhttp://www.blogger.com/profile/03062858450255240984noreply@blogger.com1tag:blogger.com,1999:blog-6131674809499098789.post-73610095887807432962018-04-13T16:00:00.006+01:002022-09-19T09:05:55.000+01:00RDKit Reaction SMARTS<div><h3><b>Update 19/09/22</b></h3><div><br /></div><div><i>I believe I was wrong and Daylight did allow SMIRKS to be run backwards and that the "direction" parameter controls it (</i><i>"personal communication" from Roger).</i><i>. The key insight is the documentation notes SMARTS patterns should only appear on atoms that don't have bond changes around them. From the man page:</i></div><div><i><br /></i></div></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><div><div style="text-align: left;"><i>"Atomic expressions may be any valid atomic SMARTS expression for nodes where the bonding (connectivity & bond order) doesn't change.1 Otherwise, the atomic expressions must be valid SMILES."</i></div></div></blockquote><div><br /></div><div><i>Whether this make sense or not for writing portable SMIRKS is questionable.</i></div><div><i><br /></i></div><div><i>This is supplementary to my original grumble on ambiguous naming. I still believe RDKit should just call it SMIRKS or at least something which didn't already mean something else (e.g. "RdSmirks"). I do agree there is a lot of overlap between SMARTS and SMIRKS but conflating these terms is problematic. </i><i>Shortly after I originally wrote this original post, </i><a href="https://docs.openforcefield.org/projects/toolkit/en/0.5.0/smirnoff.html" style="font-style: italic;">SMIRKS Native Open Force Field (SMIRNOFF)</a><i> was published. In that work SMARTS is used, but they called it SMIRKS. There is some wiggling attempted by the authors that they use "SMIRKS features" of atom maps. However it is incorrect that these were a SMIRKS feature as in what Daylight called <b>Reaction SMARTS</b>, </i><span>see the table "Examples of Reaction SMARTS" (</span><a href="https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html" style="font-family: inherit;">SMARTS Theory Page</a><span>). Of course SMIRNOFF was just too good a pun to change, but it again creates more </span>confusion<span>.</span></div><div><i><br /></i></div><div><i>I recently ran some comparisons/benchmarks on different SMIRKS implementations, everyone's semantics are consistently inconsistent and the community would benefit by an effort to standardise. Perhaps then I can convince RDKit that they really do have a (good) SMIRKS implementation and it would be less confusing to just call it that.</i></div><div><i><br /></i></div><div><div><br /></div><div><br /></div></div><div><br /></div><hr /><div><span></span><span><a name='more'></a></span><p style="text-align: left;">There's a been some papers using the RDKit for synthesis planning. If you're writing a paper and use the term "Reaction SMARTS" make sure you mean what everyone else thinks it means.</p></div>
The <a href="http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html">SMILES</a>, <a href="http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html">SMARTS</a>, and <a href="http://www.daylight.com/dayhtml/doc/theory/theory.smirks.html">SMIRKS</a> line notations were created* by Daylight for storing, matching, and transforming connection tables.<br />
<br />
<ul>
<li>SMILES <i>describes a connection table to store molecule and reactions</i></li>
<li>SMARTS <i>describes a pattern (or query) to match molecules and reactions</i></li>
<li>SMIRKS <i>describes a transform (or "reaction") to modify molecules</i></li>
</ul>
<br />
RDKit uses the term "Reaction SMARTS" to mean "transform" (see <a href="http://www.rdkit.org/docs/RDKit_Book.html#reaction-smarts">RDKit Book</a>). Unfortunately in Daylight's terminology Reaction SMARTS is a pattern not a transform.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdjnmp1Yrzkmkyahyp_zUBtMuahyphenhypheniFbiPQEurVF1A4YA1-BDmKK_Zbk8vCtQP8mOlze2-InotikRSN9Z06wvEl5bAhYTM-0PW_SpyOqRdaYg01zdVH6cJfdQKGrrRYvz2KZuuxiRo5M3k/s1600/Screen+Shot+2018-04-13+at+14.55.51.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="124" data-original-width="797" height="97" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdjnmp1Yrzkmkyahyp_zUBtMuahyphenhypheniFbiPQEurVF1A4YA1-BDmKK_Zbk8vCtQP8mOlze2-InotikRSN9Z06wvEl5bAhYTM-0PW_SpyOqRdaYg01zdVH6cJfdQKGrrRYvz2KZuuxiRo5M3k/s640/Screen+Shot+2018-04-13+at+14.55.51.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Screenshot from the Daylight SMARTS theory manual.</td></tr>
</tbody></table>
Reactions SMARTS is primarily useful for searching reaction databases. For example this Reaction SMILES:<br />
<br />
<pre style="white-space: pre-wrap;">[cH:13]1[c:14]([cH:19][c:20]([c:10]([c:11]1[Cl:12])[n:9]2[cH:8][c:5]([c:4]([n:22]2)[NH2:3])[C:6]#[N:7])[Cl:21])[C:15]([F:16])([F:17])[F:18].[OH:1][OH:2]>C(Cl)Cl.C(=O)(C(F)(F)F)OC(=O)C(F)(F)F.O>[cH:13]1[c:14]([cH:19][c:20]([c:10]([c:11]1[Cl:12])[n:9]2[cH:8][c:5]([c:4]([n:22]2)[N+:3](=[O:1])[O-:2])[C:6]#[N:7])[Cl:21])[C:15]([F:16])([F:17])[F:18]
</pre>
<br />
is matched by this Reaction SMARTS<br />
<br />
<pre>[*:1][Nh2:2]>>[*:1][Nh0:2](~[OD1])~[OD1] amino to nitro
</pre>
<br />
You can highlight the substructure:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQpdVZFcv1eVETdswtW5uXFKTMzgMlkIc1vA0-43Mjd8j4zfqX5Aap_o71lXvup6ZisErVF0xGSB-X0qB7YToLtu_j0xADTfIEew8fKXD1UPJG3aMZ05jOcTRUCn28iZj9kTmmifSk3iQ/s1600/amino-to-nitro.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="230" data-original-width="1019" height="144" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQpdVZFcv1eVETdswtW5uXFKTMzgMlkIc1vA0-43Mjd8j4zfqX5Aap_o71lXvup6ZisErVF0xGSB-X0qB7YToLtu_j0xADTfIEew8fKXD1UPJG3aMZ05jOcTRUCn28iZj9kTmmifSk3iQ/s640/amino-to-nitro.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Highlighting the SMARTS in the SMILES using <a href="http://simolecule.com/cdkdepict/depict.html">CDK Depict</a></td></tr>
</tbody></table>
<br />
<h3>
But that's a transform!</h3>
<br />
Yes but it's <i>matching</i> a transform (SMARTS) not applying one (SMIRKS), some may think you could read this unmodified as a SMIRKS but this is not the case. SMIRKS needs "real parts" after the second angled bracket as these are the parts created by the transform. Note that '*' is valid SMILES and in SMIRKS it kind of means "unmodified". This actually gives us the nice invariants:<br />
<br />
<div style="display: block; font-family: "sans-serif"; font-size: 14pt;">
<i>All SMILES are valid SMARTS but not all SMARTS are valid SMILES</i><br />
and<br />
<i>All SMIRKS are valid SMARTS but not all SMARTS are valid SMIRKS</i>
</div>
<br />
Here is the SMIRKS transform for amino to nitro<br />
<br />
<pre>[*:1][ND3:2]([H])([H])>>[*:1][N:2](=O)=O amino to nitro</pre>
<br />
<div>
</div>
<div>
In SMIRKS I can <i>apply</i> this SMIRKS to <b>"molecules"</b> and it will create "<b>reactions</b>". Note these molecules do not need to have atom-maps but they will come out with atom maps (see <a href="http://www.daylight.com/dayhtml/doc/man/man3/dt_transform.html">dt_transform</a>)!</div>
<div>
<br /></div>
<div>
<pre>c1ccccc1N
[nH]1ccc2c1cc(N)cc2</pre>
</div>
<br />
The output is<br />
<br />
<pre>c1cccc[c:1]1[NH2:2]>>c1cccc[c:1]1[N:2](=O)=O
[nH]1ccc2c1c[c:1]([NH2:2])cc2>>[nH]1ccc2c1c[c:1]([N:2](=O)=O)cc2</pre>
<br />
<h3>
And another thing...</h3>
<br />
In general you can't run SMIRKS backwards. If I want to run a nitro to amino because the atoms/bonds we're adding need to be "real" we need to encode the reverse transform separately!
<br />
<br />
<pre>[*:1][ND3:2]([H])([H])>>[*:1][N:2](=O)=O amino to nitro
[*:1][ND3:2](~[OD1])(~[OD1])>>[*:1][N:2]([H])[H] nitro to amino</pre>
<br />
Although <a href="http://www.daylight.com/dayhtml/doc/man/man3/dt_transform.html">dt_transform</a> specifies a <b> direction</b> this only controls whether the input molecules appear on the left or right of the output reaction.<div><br /></div><div><div>
*SMILES was created by Dave Weininger whilst at EPA</div></div>John Mayfieldhttp://www.blogger.com/profile/03062858450255240984noreply@blogger.com0tag:blogger.com,1999:blog-6131674809499098789.post-82330044310564163642017-05-06T15:06:00.001+01:002017-05-06T15:06:27.027+01:00Sharp Tools for Java Refactoring: Byte Code AnalysisI'm currently refactoring parts of the <a href="http://cdk.github.io">CDK</a> core classes. As part of this I often need to find specific patterns/idioms that need to be changed. Whilst source code inspections and an <a href="https://en.wikipedia.org/wiki/Integrated_development_environment">IDE</a> can make this task easy sometimes the tools aren't quite sharp enough.
<br><br>
I needed to find all occurrences of a reference (instead of value) comparison on a particular class. In Java there is no operator overload and so you can have situations like this:
<br>
<pre class="prettyprint lang-java">
Integer a = new Integer(25);
Integer b = new Integer(25);
if (a == b) {} // false
if (a.equals(b)) {} // true
</pre>
I mentioned operating overloading but it's more subtle and is more about comparing reference vs. value comparison. In <b>C/C++</b> we can have similar behaviour:
<br>
<pre class="prettyprint lang-c">
int aval = 25, bval = 25;
int *a = &aval;
int *b = &bval;
if (a == b) {} // false
if (*a == *b) {} // true
</pre>
Most IDE's and code inspection programs will warn about common occurrences (for example Integer) but I wanted to find places where the CDK's classes were used like this. A simple text grep will find some but will have false positives and negatives requiring lots of manual checking.
Daniel suggested the well known <a href="http://findbugs.sourceforge.net/">FindBugs</a> might be able help.
<br><br>
Rather than analyze source code like <a href="https://pmd.github.io/">PMD</a> and <a href="http://checkstyle.sourceforge.net/">Checkstyle</a>, FindBugs analyses Java byte code with a set of defaults <a href="http://findbugs.sourceforge.net/bugDescriptions.html">detectors</a> to find often subtle but critical mistakes. FindBugs can be configured with custom detectors (see <a href="https://www.ibm.com/developerworks/library/j-findbug2/">here</a>), however the inspection I needed (<a href="http://findbugs.sourceforge.net/bugDescriptions.html">RC: Suspicious reference comparison to constant</a>) was almost there. After digging around in the source code I found you can provide a list of custom classes to detect. However it took a bit of trial and error to get what I needed.
<br><br>
First up we turn off all inspections except for the one we're looking for (we need to fix many others reported but I was looking for something specific). To do this we create an XML config that will only run the specific inspection (RC for Reference Comparison):
<div class="code-header">
findbugs-include.xml
</div>
<pre class="prettyprint lang-xml">
<?xml version="1.0" encoding="UTF-8"?>
<FindBugsFilter>
<Match>
<Bug code="RC"/>
</Match>
</FindBugsFilter>
</pre>
<div class="code-footer"></div>
We then run findbugs with this configuration and provide the <code>frc.suspicious</code> property.
<div class="code-header">Running findbugs</div>
<pre>
$> findbugs -textui \
-include findbugs-include.xml \
-property "frc.suspicious=org.openscience.cdk.interfaces.IAtom" \
base/*/target/cdk-*-2.0-SNAPSHOT.jar
</pre>
<div class="code-footer"></div>
This produces an accurate report of all the places the references are compared. Here's a sample:
<pre style="overflow: scroll; scroll-x: auto;">
H C RC: Suspicious comparison of org.openscience.cdk.interfaces.IAtom references in org.openscience.cdk.Bond.getOther(IAtom) At Bond.java:[line 253]
H C RC: Suspicious comparison of org.openscience.cdk.interfaces.IAtom references in org.openscience.cdk.Bond.getConnectedAtom(IAtom) At Bond.java:[line 265]
H C RC: Suspicious comparison of org.openscience.cdk.interfaces.IAtom references in org.openscience.cdk.Bond.getConnectedAtoms(IAtom) At Bond.java:[line 281]
H C RC: Suspicious comparison of org.openscience.cdk.interfaces.IAtom references in org.openscience.cdk.Bond.contains(IAtom) At Bond.java:[line 300]
...
</pre>
John Mayfieldhttp://www.blogger.com/profile/03062858450255240984noreply@blogger.com0tag:blogger.com,1999:blog-6131674809499098789.post-84219121678283624302017-04-03T22:51:00.000+01:002017-12-15T20:53:31.703+00:00CDK AtomContainers are Slow - Let's fix thatThe core class for molecule representation in CDK is the <code>AtomContainer</code>. The <code>AtomContainer</code> uses an edge-list data structure for storing the underlying connection table (see <a href="http://efficientbits.blogspot.co.uk/2012/12/the-right-representation-for-job.html">The Right Representation for the Job</a>).
<br />
<br />
Essentially this edge-list representation is efficient in space. Atoms can be shared between and belong to multiple <code>AtomContainer</code>s. Therefore querying connectivity (is this atom connected to this other atom) is linear time in the number of bonds.<br />
<br />
The inefficiency of the <code>AtomContainer</code> can really sting. If someone was to describe <a href="https://graphiteworks.wordpress.com/2011/08/31/chemoinformatics-curiosities-i-the-morgan-algorithm/">Morgan's relaxation algorithm</a> you may implement it like Code 1. The algorithm looks reasonable however it will run much slower than you expected. You may expect the runtime of this algorithm to be ~<i>N<sup>2</sup></i> but it's actually ~<i>N<sup>3</sup></i>. I've annotated with <code>XXX</code> where the extra effort creeps in.
<br />
<div class="code-header">
<b>Code 1</b> - Naive Morgan-like Relaxation (AtomContainer/AtomIter)</div>
<pre class="prettyprint lang-java">// Step 1. Algorithm body
int[] prev = new int[mol.getAtomCount()];
int[] next = new int[mol.getAtomCount()];
for (int i = 0; i < mol.getAtomCount(); i++) {
next[i] = prev[i] = mol.getAtom(i).getAtomicNumber();
}
for (int rep = 0; rep < mol.getAtomCount(); rep++) { // 0..numAtoms
for (int j = 0; j < mol.getAtomCount(); j++) { // 0..numAtoms
IAtom atom = mol.getAtom(j);
// XXX: linear traversal! 0..numBonds
for (IBond bond : mol.getConnectedBondsList(atom)) {
IAtom nbr = bond.getConnectedAtom(atom);
// XXX: linear traversal! 0..numAtoms avg=numAtoms/2
next[j] += prev[mol.getAtomNumber(nbr)];
}
}o
System.arraycopy(next, 0, prev, 0, next.length);
}
</pre>
<br />
<h3>
A New Start: API Rewrite?</h3>
<br />
Ultimately to fix this problem correctly, would involve changing the core <code>AtomContainer</code> representation, unfortunately this would require an API change, optimally I think adding the constraint that atoms/bonds can not be in multiple molecules would be needed**. This would be a monumental change and not one I can stomach right now.
<br />
<br />
<h3>
Existing Trade Off: The GraphUtil class</h3>
<br />
In 2013 I added the <a href="https://github.com/cdk/cdk/commit/4597f69234d63a10cff171e7439bd54407679d25"><code>GraphUtil</code></a> class for converting an <code>AtomContainer</code> to a more optimal adjacency list (<span style="font-family: "courier new" , "courier" , monospace; font-weight: bold;">int[][]</span>) that was subsequently used to speed up many algorithms including: ring finding, canonicalisation, and substructure searching. Each time one of these algorithm is invoked with an <code>IAtomContainer</code> the first step is to build the adjacency list 2D array.<br />
<br />
<div class="code-header">
<b>Code 2</b> - GraphUtil usage</div>
<pre class="prettyprint lang-java">IAtomContainer mol = ...;
int[][] adj = GraphUtil.toAdjList(mol);
// optional with lookup map to bonds
EdgeToBondMap e2b = EdgeToBondMap.withSpaceFor(mol);
int[][] adj = GraphUtil.toAdjList(mol, e2b);
</pre>
<div class="code-footer">
</div>
<br />
Although useful the usage of <code>GraphUtil</code> is somewhat clunky requiring passing around not just the adjacency list but the original molecule and the <code>EdgeToBondMap</code> if needed.
<br />
<div class="code-header">
<b>Code 3</b> - GraphUtil Depth First Traversal</div>
<pre class="prettyprint lang-java">void visit(IAtomContainer mol, int[][] adj, EdgeToBondMap bondmap, int beg, int prev) {
mol.getAtom(beg).setFlag(CDKConstants.VISITED, true);
for (int end : adjlist[beg]) {
if (end == prev)
continue;
if (!mol.getAtom(end).getFlag(CDKConstants.VISITED))
visit(mol, adj, bondmap, end, beg);
else
bondmap.get(beg, end).setIsInRing(true); // back edge
}
}
</pre>
<div class="code-footer">
</div>
<br />
Using the <code>GraphUtil</code> approach has been successful but due to the clunky-ness I've not felt comfortable exposing the option of passing these through to public APIs. It was only ever meant as an internal optimisation to be hidden from the caller. Beyond causing unintentional poor performance (Code 1) what often happens in a workflow is <code>GraphUtil</code> is invoked multiple times. A typical use case would be matching multiple SMARTS against one <code>AtomContainer</code>.
<br />
<br />
<h3>
A New Public API: Atom and Bond References</h3>
<br />
I wanted something nicer to work with and came up with the idea of using <a href="https://en.wikipedia.org/wiki/Object_composition">object composition</a> to extend the existing Atom and Bond APIs with methods to improve performance and connectivity checks.
<br />
<br />
Essentially the idea is to provide two classes, and <code>AtomRef</code> and <code>BondRef</code> that <i>reference</i> a given atom or bond in a particular AtomContainer. An <code>AtomRef</code> knows about the original <code>atom</code> it's <code>connected bonds</code> and the <code>index</code>, the <code>BondRef</code> knows about the original <code>bond</code>, it's <code>index</code> and the <code>AtomRef</code> for the connected atoms. The majority of methods (e.g. setSymbol, setImplicitHydrogenCount, setBondOrder) are passed straight through to the original atom. Some methods (such as setAtom on <code>IBond</code>) are blocked as being unmodifiable.
<br />
<br />
<div class="code-header">
<b>Code 4</b> - AtomRef and BondRef structure</div>
<pre class="prettyprint lang-java">class AtomRef implements IAtom {
IAtom atm;
int idx;
List<BondRef> bnds;
}
class BondRef implements IBond {
IBond bnd;
int idx;
AtomRef beg, end;
}
</pre>
<div class="code-footer">
</div>
<br />
We can now re-write the Morgan-like relaxation (Code 1) using AtomRef and BondRef. The scaling of this algorithm is now ~<i>N<sup>2</sup></i> as you would expect.
<br />
<div class="code-header">
<b>Code 5</b> - Morgan-like Relaxation (AtomRef/AtomIter)</div>
<pre class="prettyprint lang-java">// Step 1. Initial up front conversion cost
AtomRef[] arefs = AtomRef.getAtomRefs(mol);
// Step 2. Algorithm body
int[] prev = new int[mol.getAtomCount()];
int[] next = new int[mol.getAtomCount()];
for (int i = 0; i < mol.getAtomCount(); i++) {
next[i] = prev[i] = mol.getAtom(i).getAtomicNumber();
}
for (int rep = 0; rep < mol.getAtomCount(); rep++) {
for (AtomRef aref : arefs) {
int idx = aref.getIndex();
for (BondRef bond : aref.getBonds()) {
next[idx] += prev[bond.getConnectedAtom(aref).getIndex()];
}
}
System.arraycopy(next, 0, prev, 0, next.length);
}
</pre>
<div class="code-footer">
</div>
<br />
The depth first implementation also improves in readability and only requires two arguments.
<br />
<div class="code-header">
<b>Code 6</b> - AromRef Depth First (AtomRef/AtomFlags)</div>
<pre class="prettyprint lang-java">// Step 1. Initial up front conversion cost
void visit(AtomRef beg, BondRef prev) {
beg.setFlag(CDKConstants.VISITED, true);
for (BondRef bond : beg.getBonds()) {
if (bond == prev)
continue;
AtomRef nbr = bond.getConnectedAtom(beg);
if (!nbr.getFlag(CDKConstants.VISITED))
visit(nbr, bond);
else
bond.setIsInRing(true); // back edge
}
}
</pre>
<div class="code-footer">
</div>
<br />
<br />
<h3>
Benchmark</h3>
<br />
I like the idea of exposing the <code>AtomRef</code> and <code>BondRef</code> to public APIs. I wanted to check that the trade-off in <b>calculating</b> and <b>using</b> the <ci ode="">AtomRef/BondRef</ci> vs the current internal <code>GraphUtil</code>. To test this I wrote a benchmark that implements some variants of a Depth First Search and Morgan-like algorithms. I varied the algorithm implementations and whether I used, <code>IAtomContainer</code>, <code>GraphUtil</code>, or <code>AtomRef</code>.
<br />
<br />
The performance was measured over ChEMBL 22 and averaged the run time performance over 1/10th (167,839 records). You can find the code on <a href="https://github.com/johnmay/efficient-bits/blob/master/readopt-ds">GitHub</a> (<a href="https://github.com/johnmay/efficient-bits/blob/master/readopt-ds/src/main/java/com/blogspot/efficientbits/Benchmark.java">Benchmark.java</a>). Each algorithm computes a checksum to verify the same work is being done. Here are the raw results: <a href="https://github.com/johnmay/efficient-bits/blob/master/readopt-ds/results/depthfirst.tsv">depthfirst.tsv</a>, and <a href="https://github.com/johnmay/efficient-bits/blob/master/readopt-ds/results/relaxation.tsv">relaxation.tsv</a>.
<br />
<br />
<br />
<h4>
Depth First Traversal</h4>
<br />
A Depth first traversal is a linear time algorithm. I tested eight implementations that varied the graph data structure and whether I used an external visit array or atom flags to mark visited atoms. When looking just at initialisation time the <code>AtomRef</code> creation is about the same as <code>GraphUtil</code>. There was some variability between the different variants but I couldn't isolate where the different came from (maybe GC/JIT related). The runtime of the <code>AtomRef</code> was marginally slower than <code>GraphUtil</code>. Both were significantly faster (18-20x) than the <code>AtomContainer</code> to do the traversal. When we look at the total run-time (initialisation+traversal) we see that even for a linear algorithm, the <code>AtomRef</code> (and <code>GraphUtil</code>) were ~3x faster. Including the <code>EdgeToBondMap</code> adds a significant penalty.
<br />
<br />
<img src="https://github.com/johnmay/efficient-bits/raw/master/readopt-ds/results/chart_dfs.png" width="100%" />
<br />
<br />
<br />
<h4>
Graph Relaxation</h4>
<br />
A more interesting test is a Morgan-like relaxation, as a more expensive algorithm (N<sup>2</sup>) it should emphasise any difference between the <code>AtomRef</code> and <code>GraphUtil</code>. The variability in this algorithm is whether we relax over atoms (AtomIter - see Code 1/5) or bonds (BondIter). We see a huge variability in AtomContainer/AtomIter implementation. This is because the algorithm is more susceptible to difference in input (molecule) size.
<br />
<br />
<img src="https://github.com/johnmay/efficient-bits/raw/master/readopt-ds/results/chart_relax.png" width="100%" />
<br />
<br />
Clearly the AtomContainer/AtomIter is really bad (~80x slower). Excluding this results shows that as expected the <code>AtomRef/AtomIter</code> is slower than the <code>GraphUtil/AtomIter</code> equivalent (~2x slower). However because the <code>AtomRef</code> has a richer syntax, we can do a trick with XOR number storage to improve performance or iterate over bonds (BondIter) giving like-for-like speeds.
<br />
<br />
<img src="https://github.com/johnmay/efficient-bits/raw/master/readopt-ds/results/chart_relax_noacai.png" width="100%" />
<br />
<br />
<h3>
Conclusions</h3>
<br />
The proposed <code>AtomRef</code> and <code>BondRef</code> provide a convenience API to use the CDK in a natural way with efficient connectivity access. The conversion to an <code>AtomRef</code> is efficient and provides a speedup even for linear algorithms. The encapsulation facilities the passing as a public API parameter, users will be able to compute it ahead of time and pass it along to multiple algorithms.
<br />
<br />
I'm somewhat tempted to provide an equivalent <code>AtomContainerRef</code> allowing a drop-in replacement for methods that take the <code>IAtomContainer</code> interface. It is technically possible to implement writes (e.g. delete bond) efficiently in which case it would no longer be a 'Ref'. Maybe I'll omit that functionality or use a better name?
<br />
<br />
<h3>
Footnotes</h3>
<br />
<ul>
<li><sup>**</sup> My colleague Daniel Lowe notes that <a href="https://www.google.co.uk/url?sa=t&rct=j&q=&esrc=s&source=web&cd=14&cad=rja&uact=8&ved=0ahUKEwiCluuF0IXTAhWKJcAKHb8AA14QFghbMA0&url=http%3A%2F%2Fopsin.ch.cam.ac.uk%2F&usg=AFQjCNFZrMt03WhzWiy_UnwwbDy8q09gBA&sig2=R-uUx187tVagLxyGEbT-Bg">OPSIN</a> allows atoms to be in multiple molecules and know about their neighbours but it's a bit of a fudge. It's certainly possible with some extra book keeping but prevents some other optimisations from being applied.</li>
</ul>
John Mayfieldhttp://www.blogger.com/profile/03062858450255240984noreply@blogger.com0tag:blogger.com,1999:blog-6131674809499098789.post-20506816143691098032016-07-10T11:26:00.001+01:002016-07-16T15:36:56.496+01:00Generic Structure DepictionLast week I attended the <a href="http://cisrg.shef.ac.uk/shef2016/">Seventh Joint Sheffield Conference on Chemoinformatics</a>. It was a great meeting with some cool science and attendees. I had the pleasure of chatting briefly with <a href="http://www.digitalchemistry.co.uk/">John Barnard</a> who's contributed a lot to the representation, storage, and retrieval of generic (aka Markush) structures (see <a href="http://www.digitalchemistry.co.uk/prod_torus_patent.html">Torus, Digital Chemistry</a> - now owned by <a href="http://www.lhasalimited.org/">Lhasa</a>).<br />
<br />
At <a href="http://nextmovesoftware.com/">NextMove</a> we've been doing a bit on processing sketches from patents (see <a href="http://www.slideshare.net/NextMoveSoftware/sketchy-sketches-hiding-chemistry-in-plain-sight">Sketchy Sketches</a>). I learnt a few things about how generic structures are typically depicted I thought be interesting to share.<br />
<br />
<h3>
Substituent Variation (R groups)</h3>
<div>
<br /></div>
<div>
The most common type of generic feature is <b>substituent variation</b>, colloquially known as <b>R groups</b>. The variation allows concise representation with an invariant/fixed part of a compound and variable/optional part.<br />
<br />
<div style="margin-left: auto; margin-right: auto; text-align: center;">
<img src="https://cdkdepict-openchem.rhcloud.com/depict/bow/svg?smi=*c1ccccc1+%7C$R;;;;;;;$%7C&abbr=on&suppressh=true&showtitle=true&zoom=1.3&annotate=none" />
<br />
wherein <b>R</b> denotes<br />
<img src="https://cdkdepict-openchem.rhcloud.com/depict/bow/svg?smi=*OC.*C.*CC%20%7C%24_AP1%3B%3B%3B_AP1%3B%3B_AP1%24%7C%20R1&abbr=on&suppressh=true&showtitle=false&zoom=1.3&annotate=none" />
<br />
That is: <b>anisole</b>, <b>toluene</b>, or <b>ethylbenzene</b>.
</div>
<h3>Substituent Labels</h3>
<br />
Multiple substituent labels may be distinguished by a number <b>R<sup>1</sup></b>, <b>R<sup>2</sup></b>, ... <b>R<sup>n</sup></b>. However in reality, any label can and will be used. This can be particularly confusing when they collide with elements, examples include: <b>Ra</b> (Radium), <b>Rg</b> (Roentgenium) <b>B</b> (Boron), <b>D</b> (Deuterium), <b>Y</b> (Yttrium), <b>W</b> (Tungsten). The distinction between the label <b>Ra</b> and Radium may be semantically captured by a format but lost in depiction.<br />
<br />
<div style="margin-left: auto; margin-right: auto; text-align: center;">
<img border="0" height="128" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhebgr4X0QmCL7IRGIqr-o7dpNfIw2KmhkK2WZMt1-onacCg8zlNTqwvf8hZN6FYozBZlH1KrOMv0Wc-AZtem7_4XiJzkxKt3q5h9sclMmheTE1bV9EjmknGwgJBUAT3tk_AtYeR3nJPoI/s320/RgrpNums.png" width="128" />
<img border="0" height="128" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwgi9bFt3ueGzuDbskumn_R411AKjDH3nu4Rk660IUFN6TRa1rrY3iAIqIW7VB7ogiXEe6iz_1oOEG2osPQh8kvSMhcfQbKNU6e8zLPw99Ti9qg86zjq707GayTtSEMynP0vSgB04nt-4/s320/YgrpNums.png" width="128" />
</div>
To distinguish such labels we can style them differently. By using superscripting and italicizing the label the distinction becomes clear and also somewhat improves the aesthetics of numbered R groups. We avoid subscript due to ambiguities with stoichiometry, for example: <b>–NR<sub>2</sub></b>.
<br />
<div style="margin-left: auto; margin-right: auto; text-align: center;">
<img border="0" height="128" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbFTmKPen-aCUjElfk9L0mmfEmmoGGxL9IEiAyxWe_X4R2mWT5eNsyg9vOtR8F7cBG4WjexIkkGjKfTCoQhxS0jNgPPD4w6tOMc3O5FS5wQ94yk_dNottLKBAmCkMxK1pjtDHHPhxHBDM/s320/RgrpNums.png" width="128" /><img border="0" height="128" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwGY-r-Kwlvw6-bypNoVhDW_VfkI8oQXAaFAiiapeXE3ZIUi8YtXeFtlYnYYJZMQu_7Oa5KY54iLJ5sLxjKxhqIPQ2acOHodj9OwZWuqogKkMu1tg5qRnY3kyvwjwq8uWwGtAHfSvPVZw/s320/YgrpNums.png" width="128" />
</div>
<h3>Attachment Points</h3>
<br/>
For substituents there are different notation options. In writing, radical nomenclature is used, for the above example we'd say: <b>methyl-oxyl</b> (-OMe), <b>ethyl</b> (-Et), or <b>methyl</b> (-Me). However this doesn't translate well to depictions: <img style="vertical-align: top;" src="https://cdkdepict-openchem.rhcloud.com/depict/bow/svg?smi=[O]C.[CH3].[CH2]C&abbr=on&suppressh=true&showtitle=false&zoom=1.3&annotate=none" />. <br />
<br/>
The CTfile actually does stores substituents this way and specifies the attachment point (<code>APO</code>) information separately.
<pre style="font-size: 7pt; background: #444; color: #fff; padding: 5px;">
$RGP
1
$CTAB
2 1 0 0 0 0 999 V2000
1.9048 -0.0893 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2.6192 0.3232 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
M APO 1 1 1
M END
$END CTAB
$CTAB
1 0 0 0 0 0 999 V2000
1.9940 -1.2869 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
M APO 1 1 1
M END
$END CTAB
$CTAB
2 1 0 0 0 0 999 V2000
1.8750 -2.3286 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.5895 -1.9161 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
M APO 1 1 1
M END
$END CTAB
$END RGP
</pre>
Alternatively we may use a virtual or 'null' atom. We can convert to/from CTfile format although it's slightly easier to delete the null atom that add it on, due to coordinate generation. A disadvantage of this is the atom count isn't accurate, however the labelled group is also a type of null atom and already distorts the atom count. There are unfortunately different ways of depicting this null atom.
<div style="margin-left: auto; margin-right: auto; text-align: center;">
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWSDXyMjglu89FFRjOqDYvON9fx09gkXVJU9YWKylZrJhVvxtzdEppJMnPTeGeVpj8iIEZ3Mq7qEUjsHpz1VAb1k5yDHgJdVXNNX8Cwxh9PdfvwB9zHbmhCmQs4c6iYfJ1CPWyrfbCZB8/s320/Untitled+Document-1.png" width="320" height="97" />
</div>
<b style="color: red;">Don't use a dative bond style! You have to fudge the valences and just doesn't work, how would I show a double bond attachment?</b><br/>
<br/>
The first time I'd encountered attachment points was in ChEBI where and R group means 'something attaches here' (<a href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=58314">CHEBI:58314</a>, <a href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=52569">CHEBI:52569</a>), whilst a 'star' label means 'attaches to something' (<a href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=37807">CHEBI:37807</a>, <a href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=32861">CHEBI:32861</a>). This actually a nice way of thinking about it, like two jigsaw pieces the asymmetry allows the substituent to connect to the labelled atom.
<br/>
<br/>
The 'star' atom used by ChEBI is tempting to use as there is a star atom in SMILES.
<pre style="font-size: 7pt; background: #444; color: #fff; padding: 5px;">
*OC
*C
*CC
</pre>
<div style="margin-left: auto; margin-right: auto; text-align: center;">
<img src="https://cdkdepict-openchem.rhcloud.com/depict/bow/svg?smi=*OC.*C.*CC&abbr=on&suppressh=true&showtitle=false&zoom=1.3&annotate=none" /><br />
</div>
However a '*' in SMILES actually means 'unspecified atomic number', some toolkits impose additional semantics. ChemAxon reads a 'star' to mean 'any atom', whilst OEChem, Indigo, and OpenBabel actually read more like an R Group, with [*:1] and [*:2] being R1 and R2 etc. ChemAxon Extended SMILES allows us to explicitly encode attachment points.
<pre style="font-size: 7pt; background: #444; color: #fff; padding: 5px;">
*OC |$_AP1$|
*C |$_AP1$|
*CC |$_AP1$|
</pre>
I opted to implement the wavy line notation in CDK which is preferred by <a href="http://iupac.org/publications/pac/pdf/2008/pdf/8002x0277.pdf">IUPAC graphical representation guidelines</a>.
<div style="margin-left: auto; margin-right: auto; text-align: center;">
<img src="https://cdkdepict-openchem.rhcloud.com/depict/bow/svg?smi=*OC.*C.*CC%20%7C%24_AP1%3B%3B%3B_AP1%3B%3B_AP1%24%7C%20R1&abbr=on&suppressh=true&showtitle=false&zoom=1.3&annotate=none" />
</div>
A major disadvantage of this notation is mis-encoding by users mistaking it for a wavy up/down stereo bond. I talk more about this in the poster (<a href="http://www.slideshare.net/NextMoveSoftware/sketchy-sketches-hiding-chemistry-in-plain-sight">Sketchy Sketches</a>) but the number of times you see the following drawn:
<div style="margin-left: auto; margin-right: auto; text-align: center;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDsjJv_qgRMAdJYNp41MWsvTcSQ33iiB1f5V-ywWmzhNsn1Z-Dl2GFutJWmFNxR1D2Qiug9qmFGMltspeoaKCKRR_KDrAbYumBnQMPSJZWl8OE72qkUcQhfinMheszE9yZmHzp9gQZ7e8/s320/bad-wavy.png" width="320" height="77" /></div>
The captured connection table for that sketch does not have null atoms but instead uses carbon:
<div style="margin-left: auto; margin-right: auto; text-align: center;"><img border="0" src="https://cdkdepict-openchem.rhcloud.com/depict/bow/svg?smi=COC(C)(C)C.COC(C)C.COC.CC&abbr=off&suppressh=true&showtitle=false&zoom=1.3&annotate=none" width="256" /></div>
</div>
John Mayfieldhttp://www.blogger.com/profile/03062858450255240984noreply@blogger.com0tag:blogger.com,1999:blog-6131674809499098789.post-78346470341431861182015-11-21T17:55:00.000+00:002017-11-18T08:02:18.674+00:00Bringing Molfile Sgroups to the CDK - Rendering TipsIn the <a href="http://efficientbits.blogspot.co.uk/2015_09_01_archive.html">last but one post</a> I gave a demonstration of S(ubstance)group rendering in the CDK. Now I want to give some implementation insights.<br />
<h3>
Abbreviations (Superatoms)</h3>
Abbreviations contract part of a structure to a line formula or common abbreviation.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr>
<td style="text-align: center;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiiVSfCs3QWmadAEc8WRpVT_SeE7qpKbAaen0JswDPOx8ud8lwMKT9xgG9EGRgrFeyqXf7_6l_0OoVptnH9U4vk_irbvoGyU68btySaqFJ81J5nSQumJMvX2SmuAs4_ZjkNxtwg2sfhzA0/s200/expanded.png" style="cursor: move;" width="200" /></td>
<td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0u6_G5cL5IH2VpQqlVaxQWDYYA6EgOoFaG5x1e8JOjsKwnvrfsWdo2Ae1Hkaf9gclfgNk2zC-L4WNEp4OSbmxpy1i4rqRRsSGCR33FRDRIBwPcLbW6pVjxvXRpDsuhvDVRb1quMIcwYE/s1600/collapsed.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0u6_G5cL5IH2VpQqlVaxQWDYYA6EgOoFaG5x1e8JOjsKwnvrfsWdo2Ae1Hkaf9gclfgNk2zC-L4WNEp4OSbmxpy1i4rqRRsSGCR33FRDRIBwPcLbW6pVjxvXRpDsuhvDVRb1quMIcwYE/s200/collapsed.png" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Full-structure</td><td class="tr-caption" style="text-align: center;">Abbreviated-structure</td></tr>
</tbody></table>
Abbreviating too much or using unfamiliar terms (e.g. maybe using CAR for carbazole) can make a depiction worse. However some structures, like <span style="font-family: "courier new" , "courier" , monospace;"><b><a href="https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL590010">CHEMBL590010</a></b></span>, can be dramatically improved.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEyMbZuiHH5Y2Yu3x_eEWPYXGHjt5l8YXZkWjLSFXF3_-tql8v2amWJJEbxrOQsZmqDboQ0Ks3RkN1HD5_Yj0OCo2hdPpuQFBUfDvZ-JLLUGc327EYxboSTGJP_C1cNaHlOf1EpNS2_j8/s1600/CHEMBL590010.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEyMbZuiHH5Y2Yu3x_eEWPYXGHjt5l8YXZkWjLSFXF3_-tql8v2amWJJEbxrOQsZmqDboQ0Ks3RkN1HD5_Yj0OCo2hdPpuQFBUfDvZ-JLLUGc327EYxboSTGJP_C1cNaHlOf1EpNS2_j8/s640/CHEMBL590010.png" width="512" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-family: "courier new" , "courier" , monospace;"><b><a href="https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL590010">CHEMBL590010</a></b></span></td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiiVSfCs3QWmadAEc8WRpVT_SeE7qpKbAaen0JswDPOx8ud8lwMKT9xgG9EGRgrFeyqXf7_6l_0OoVptnH9U4vk_irbvoGyU68btySaqFJ81J5nSQumJMvX2SmuAs4_ZjkNxtwg2sfhzA0/s1600/expanded.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"></a></div>
One way to implement abbreviations would be by modifying the molecule data structure with literal <b><i>collapse/contract</i></b> and <b><i>expand</i></b> operations. Whilst this approach is perfectly reasonable, deleting atoms/bonds is expensive (in most toolkits) and it somewhat subtracts the "<i>display shortcut</i>" nature of this Sgroup.<br />
For efficiency abbreviations are implemented by hiding parts of the depictions and remapping symbols. Just before rendering we iterator over the Sgroups and set hint flags that these atoms/bonds should not be included in the final image. If there is one attachment (e.g. Phenyl) we remap the attach point symbol text to the abbreviation label ('C'->'Ph'). When there are no attachments (e.g. AlCl<sub>3</sub>) we add a new symbol to the centroid of the hidden atoms.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;">
<tbody>
<tr>
<td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnD8k8FuXN4nTSbZVf8x1E6VsaWKNFzIYzuvgUWhfyC8LhXhTNdU_6y9fjdO8N6mTQkKOk9yUObYLXh656bp7K26LubtULioo-1TzlHs_MtyRvWAhIr0WevQ5X3pKJA5C9nOMnocKw_E4/s1600/sgroup-a1.png" imageanchor="1" style="margin-left: auto; margin-right: auto;">
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnD8k8FuXN4nTSbZVf8x1E6VsaWKNFzIYzuvgUWhfyC8LhXhTNdU_6y9fjdO8N6mTQkKOk9yUObYLXh656bp7K26LubtULioo-1TzlHs_MtyRvWAhIr0WevQ5X3pKJA5C9nOMnocKw_E4/s200/sgroup-a1.png" width="150" />
</a>
</td>
<td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg4ehyyFLxZzbdRdKh_HhnzVP9YLAwvRMfdOBdRCt3TOOhD6BDb3x0abvYfyD-sYf58Ig_6MIbFXgowF13nlvvwHD3ZdJnbiiBf05VKRe5UH4FMX9n66YUC9f_sZlGmQrPI8d9kDdDnffI/s1600/sgroup-a2.png" imageanchor="1" style="margin-left: auto; margin-right: auto;">
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg4ehyyFLxZzbdRdKh_HhnzVP9YLAwvRMfdOBdRCt3TOOhD6BDb3x0abvYfyD-sYf58Ig_6MIbFXgowF13nlvvwHD3ZdJnbiiBf05VKRe5UH4FMX9n66YUC9f_sZlGmQrPI8d9kDdDnffI/s1600/sgroup-a2.png" width="150" />
</a>
</td>
<td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVK0Zo78sCwGQQ61EDV2MuedLuw3cji3DJYfp2QMw_Z_kph_Mxh06ibA1MqPrqxCOHdDkvLowxCut-kwkL12z7qt8h3Oruse_kv9iYfuk5rm7BB3QjptqQNekjkDOmwqMsbEbRmBHNdSk/s1600/sgroup-a3.png" imageanchor="1" style="margin-left: auto; margin-right: auto;">
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVK0Zo78sCwGQQ61EDV2MuedLuw3cji3DJYfp2QMw_Z_kph_Mxh06ibA1MqPrqxCOHdDkvLowxCut-kwkL12z7qt8h3Oruse_kv9iYfuk5rm7BB3QjptqQNekjkDOmwqMsbEbRmBHNdSk/s1600/sgroup-a3.png" width="150" />
</a>
</td>
</tr>
<tr>
<td class="tr-caption" style="text-align: center;">Hide atoms and bonds</td>
<td class="tr-caption" style="text-align: center;">Symbol Remap</td>
<td class="tr-caption" style="text-align: center;">Abbreviated Result</td>
</tr>
</tbody>
</table>
For two or more attachments (e.g. SO<sub>2</sub>) you also need coordinate remapping.<br />
<h3>
Multiple Group</h3>
Multiple groups allow, contraction of a discrete number of repeating units. They are handled similarly to the abbreviations except we don't need to remap parts.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhH4uNmds2lYjvV46-O4Z0WsZKjlM9Va5C7u_6q_lZ7_i6Lceb1-mVL6WVKlXt9PeVeduwk4AnISJxSu7fC0xho3DkkSDmzOmLIpteemlJMnlkV8HnbVO7WqqCZBMJdpbDIIV6G8ut16Zw/s1600/sgroup-mg-1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhH4uNmds2lYjvV46-O4Z0WsZKjlM9Va5C7u_6q_lZ7_i6Lceb1-mVL6WVKlXt9PeVeduwk4AnISJxSu7fC0xho3DkkSDmzOmLIpteemlJMnlkV8HnbVO7WqqCZBMJdpbDIIV6G8ut16Zw/s640/sgroup-mg-1.png" width="512" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-family: "courier new" , "courier" , monospace;"><b><a href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:1233">CHEBI:1233</a></b></span></td></tr>
</tbody></table>
All atoms are present in the data structure but are laid out on top of each other (demonstrated below). We have a list of parent atoms that form the repeat unit. Therefore to display multiple groups we hide all atoms and bonds in the Sgroup except for parent atoms and the crossing bonds.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;">
<tbody>
<tr>
<td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHsPPwRbC9p8pocLKc8z9B9gk5LbjYw4yBYRTQzGbXamV1AQ49ZRu8DXkAVrf6oucIc4-u6YC4lWqMNOYF88OksaUgtRAQ29vKE4sY3fyA5aZxNsww1p7JuelgC1IqGHpjQBSOa8RGuRs/s1600/tmp.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;">
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHsPPwRbC9p8pocLKc8z9B9gk5LbjYw4yBYRTQzGbXamV1AQ49ZRu8DXkAVrf6oucIc4-u6YC4lWqMNOYF88OksaUgtRAQ29vKE4sY3fyA5aZxNsww1p7JuelgC1IqGHpjQBSOa8RGuRs/s320/tmp.png" width="200" />
</a>
</td>
<td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzjyg3-a-3a_kdwE5zU4lIJ22k-MJYApvDh6FDCKCEEkkOqUX5FpouOPCV75JTuJlk-rkReV84bYfPttPz3ex1h2E8fwR-oYRDu6-LCxr7FK7WpKsN7nEG1pC6klDoKY8DI_A9jqutuyU/s1600/output_C2cFon.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;">
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzjyg3-a-3a_kdwE5zU4lIJ22k-MJYApvDh6FDCKCEEkkOqUX5FpouOPCV75JTuJlk-rkReV84bYfPttPz3ex1h2E8fwR-oYRDu6-LCxr7FK7WpKsN7nEG1pC6klDoKY8DI_A9jqutuyU/s320/output_C2cFon.gif" width="200" />
</a>
</td>
<td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGJb0uBbyxFlJ9wfQHChh6sPhdmI3t8VpZyp8YwFpi-ZrNJYjFNg-mw_xFlM3RpfV9fPQfxe839i57QGKvhHZX209s3jQL5kxon2apRIixHxI7ZghlCtIv6h1zy5FauyX0EBxQIgpXqfE/s1600/tmp.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;">
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGJb0uBbyxFlJ9wfQHChh6sPhdmI3t8VpZyp8YwFpi-ZrNJYjFNg-mw_xFlM3RpfV9fPQfxe839i57QGKvhHZX209s3jQL5kxon2apRIixHxI7ZghlCtIv6h1zy5FauyX0EBxQIgpXqfE/s320/tmp.png" width="200" />
</a>
</td>
</tr>
</tbody>
</table>
It's worth mentioning that hidden symbols are still generated but simply excluded from the final result. This allows bond back off for hetero atoms to be calculated correctly as is seen in this somewhat tangled example:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7i2UQspqJ_DD6eXmW2xiC-1Al50Tu1lQPYBIwSlwDPPdv4Do3zenMpUByn2t-mgG9wsDHjST1FhFg-Qca7HOicckk1tTb59Am5fqRtp0DP-T9He8OoTPHr-WjdQL2-JntOdUGA3qdi6M/s1600/tmp.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;">
<img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7i2UQspqJ_DD6eXmW2xiC-1Al50Tu1lQPYBIwSlwDPPdv4Do3zenMpUByn2t-mgG9wsDHjST1FhFg-Qca7HOicckk1tTb59Am5fqRtp0DP-T9He8OoTPHr-WjdQL2-JntOdUGA3qdi6M/s320/tmp.png" width="320" />
</a>
</div>
<h3>
Brackets</h3>
Polymer and Multiple group Sgroups require rendering of brackets. Encoded in the molfile (and when laid out) brackets are described by two points, a line. It is therefore up to the renderer to decide which side of the line the tick marks should face.<br />
I've seen some implementations use the order of the points to convey bracket direction. Another method would be to point the brackets at each other. As shown for <span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;"><b><a href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:59342">CHEBI:<span style="background-color: white; color: #222222; line-height: 19.994px;">59342</span></a></b></span> this is not always correct.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;">
<tbody>
<tr>
<td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPvIHBCRA20QUuQARAmqIdDG6ejiOX1YGnBpP3SQRiqmbWUFY_zNUUxAGys3KJ8dvuN4TqPGKd5KiRt_7NgHDBfrDpHW3PUqXEkT-msdT0VrFMnUoeuM3JROZ4AZeEknvoPMviRNcXmfI/s1600/tmp.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;">
<img border="0" height="256" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPvIHBCRA20QUuQARAmqIdDG6ejiOX1YGnBpP3SQRiqmbWUFY_zNUUxAGys3KJ8dvuN4TqPGKd5KiRt_7NgHDBfrDpHW3PUqXEkT-msdT0VrFMnUoeuM3JROZ4AZeEknvoPMviRNcXmfI/s320/tmp.png" />
</a>
</td><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgsJtjx7GiCqVc3eSAebvBqVKypV8MD1wY_zJgaJOgqNHpppAZwiOKPOaZ94dx1qSDX0oDDsUnaLDDGCkZfJ32Z4rvU3QGh_RH-Gxw9BsDI4nI0YrMbZ1fHVFztByH5OjkGVXxWSQbPnAY/s1600/tmp.png" imageanchor="1" style="margin-left: auto; margin-right: auto;">
<img border="0" height="256" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgsJtjx7GiCqVc3eSAebvBqVKypV8MD1wY_zJgaJOgqNHpppAZwiOKPOaZ94dx1qSDX0oDDsUnaLDDGCkZfJ32Z4rvU3QGh_RH-Gxw9BsDI4nI0YrMbZ1fHVFztByH5OjkGVXxWSQbPnAY/s320/tmp.png" />
</a>
</td>
</tr>
<tr>
<td class="tr-caption" style="text-align: center;">Poor bracket direction</td>
<td class="tr-caption" style="text-align: center;">Preferred bracket direction</td>
</tr>
<tr><td class="tr-caption" colspan="2" style="text-align: center;">CHEBI:59342</td></tr>
</tbody>
</table>
I originally thought the solution might involve a graph traversal/flood-fill but it turns out there is a very easy way to get the direction correct. First we consider that brackets may or may not be placed on bonds, if a bracket is on a bond this information is available (crossing bonds).<br />
<ul>
<li>For a bracket on a crossing bond exactly one end atom will be contained in the Sgroup, the bracket should point towards this atom.</li>
<li>If a bracket doesn't cross a bond then the direction should point to the centroid of all atoms in the Sgroup.</li>
</ul>
John Mayfieldhttp://www.blogger.com/profile/03062858450255240984noreply@blogger.com0tag:blogger.com,1999:blog-6131674809499098789.post-35343683373432238892015-10-17T19:38:00.003+01:002015-10-18T20:55:48.197+01:00Java Serialization: Great power but at what cost?<p>The default Java serialization framework provides a convenient mechanism for streaming in-memory Objects to another computer or storing them on disk. Beyond the obvious badness of being tied to the internal object layout (i.e. not stable through changes), serialization can be very inefficient. Externalization and libraries like <a href="https://github.com/EsotericSoftware/kryo">Kyro</a> are popular for improving performance.</p>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjol8U0uAIk0BXICK3QDVezO9JrGwQnzIvk0jvg-zogNoamYocLgwwXpC7YCUhtR-S3lmcEMs_UkKEc0wohI8uBizhZH5ZBLr8OAh5bJKE8WshqYnKdVEsrl4twDeZx0qXpPnsMAzM45Y4/s1600/Screen+Shot+2015-10-17+at+17.06.36.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjol8U0uAIk0BXICK3QDVezO9JrGwQnzIvk0jvg-zogNoamYocLgwwXpC7YCUhtR-S3lmcEMs_UkKEc0wohI8uBizhZH5ZBLr8OAh5bJKE8WshqYnKdVEsrl4twDeZx0qXpPnsMAzM45Y4/s1600/Screen+Shot+2015-10-17+at+17.06.36.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">SMILES: <span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"><b>CO[C@@H]([C@H](OC(C)=O)[C@@H](OC(C)=O)[C@H](OC(C)=O)[C@H](OC(C)=O)COC(C)=O)SC</b></span></td></tr>
</tbody></table>
<p>In the domain of Chemistry we have a rich variety of formats (e.g. <a href="https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system">SMILES</a>) with which we can store molecules and reactions (in memory these are labelled graphs). Although these formats do not completely fulfil the utility of Object serialization they can be used as building block upon which we build. Not only are these <i>defacto</i> standards but they can be much faster and compact than default serialization of the in-memory connection table (graph) representation.</p>
<h4>
Recent History</h4>
<p>Crafting efficient (de)serialization is beneficial and you can get great speed with simple setup. A few years ago I ran some experiments on writing an externalization stream for the Chemistry Development Kit (CDK) molecules (thread - <a href="http://sourceforge.net/p/cdk/mailman/message/29411539/">High Performance Structure IO</a>). Since the objects are huge any improvement over the default would be useful. This partly fed into the needs of <a href="https://tech.knime.org/community/cdk">CDK-Knime</a> (a workflow tool) where I think CML was being used originally. From testing on ChEBI (~20,000 molecules) we see actually the ObjectInputStream was about as fast as an SDfile and much faster than CML. SDfiles are now much faster but that would be another post.</p>
<center>
<b>Read Performance</b>
<table style="text-align: center;">
<tbody>
<tr>
<th>Method</th>
<th>Time</th>
<th>Size</th>
<th>Throughput</th>
</tr>
<tr>
<th>AtomContainerStream</th>
<td>346 ms</td>
<td>11.1 MiB</td>
<td>63739 s<sup>-1</sup></td>
</tr>
<tr>
<th>SDfile</th>
<td>4159 ms</td>
<td>51.7 MiB</td>
<td>5302 s<sup>-1</sup></td>
</tr>
<tr>
<th>CML</th>
<td>18605 ms</td>
<td>91.5 MiB</td>
<td>1185 s<sup>-1</sup></td>
</tr>
<tr>
<th>ObjectInputStream</th>
<td>5552 ms</td>
<td>93.9 MiB</td>
<td>3972 s<sup>-1</sup></td>
</tr>
</tbody></table>
</center>
<p>It was around that time that <a href="http://www.dalkescientific.com/">Andrew Dalke</a> payed a visit to EMBL-EBI. In discussing what I was currently working on he promptly showed me how fast <a href="http://www.eyesopen.com/oechem-tk">OEChem</a> could read/write SMILES. Needless to say – pretty quick and as fast if not faster than my attempt at <i>'High Throughput'</i> streaming.</p>
<p>The CDK now also has fast SMILES processing and I wanted to compare this to the serialization to see how much of a performance penalty there is.</p>
<h4>
Benchmark</h4>
<p>For a benchmark I used 100,000 structures for ChEMBL 20.</p>
<pre>$ shuf chembl_20.smi | head -n 100,000 > chembl_20_subset.smi
</pre>
<p>Writing it to a <code>ObjectOutputStream</code> takes <b>28.78</b> seconds. The SMILES subset file takes up <b>6.8 MiB</b> on disk whilst the serialized objects take up <b>295 MiB</b>. Ouch, that's <b>42x</b> larger.</p>
<div class="code-header">
Code 1 - Writing to an ObjectOutputStream</div>
<pre class="prettyprint">IChemObjectBuilder bldr = SilentChemObjectBuilder.getInstance();
SmilesParser smipar = new SmilesParser(bldr);
String srcname = "/data/chembl_20_subset.smi";
String destname = "/data/chembl_20_subset.obj";
try (InputStream in = new FileInputStream(srcname);
Reader rdr = new InputStreamReader(in, StandardCharsets.UTF_8);
BufferedReader brdr = new BufferedReader(rdr);
ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(destname))) {
String line;
long t0 = System.nanoTime();
while ((line = brdr.readLine()) != null) {
try {
IAtomContainer mol = smipar.parseSmiles(line);
// stereochemistry does not implement serializable...
// so need to remove it
mol.setStereoElements(new ArrayList<istereoelement>(0));
oos.writeObject(mol);
} catch (CDKException e) {
System.err.println(e.getMessage());
}
}
long t1 = System.nanoTime();
System.err.printf("write time: %.2f s\n", (t1 - t0) / 1e9);
}
</istereoelement></pre>
<div class="code-footer">
</div>
<p>In CDK we first read SMILES with Beam and then convert to the CDK objects so we'll also look at that small overhead. Here I compare the time to read the 100,000 SMILES using Beam, CDK, and the objects using an <code>ObjectInputStream</code>. Both CDK and Beam take less than <b>1</b> second whilst the <code>ObjectInputStream</code> takes more than <b>50</b>.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5XUoCY0WLPMyE0mwqL97aREthxTBW-3ngm6iUjkXdotOJ2zhQ9_YN1_ErAvgAcw-KIsaC_lzG57Njfo7lDn4QapQ1CBDuDSWyq3n4q3dQ-I9cZmrv0tTXuJvrmoOePreMP_DfPU6l7zs/s1600/Rplot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5XUoCY0WLPMyE0mwqL97aREthxTBW-3ngm6iUjkXdotOJ2zhQ9_YN1_ErAvgAcw-KIsaC_lzG57Njfo7lDn4QapQ1CBDuDSWyq3n4q3dQ-I9cZmrv0tTXuJvrmoOePreMP_DfPU6l7zs/s320/Rplot.png" /></a></div>
<p>In terms of throughput (mol per sec) here is the kind of speed we hit. I also show the total elapsed time for all 15 repeats.</p>
<center>
<table style="text-align: center;">
<tbody>
<tr><th>Method</th><th>Min</th><th>Max</th><th>Elapsed Time</th><th>Size</th></tr>
<tr><td>Deserialization</td><td>1961 s<sup>-1</sup></td><td>2089 s<sup>-1</sup></td><td>12 m 16 s</td><td>295 MiB</td></tr>
<tr><td>Kryo (Auto)</td><td>42401 s<sup>-1</sup></td><td>44557 s<sup>-1</sup></td><td>33.9 s</td><td>186 MiB</td></tr>
<tr><td>Kryo Unsafe (Auto)</td><td>44854 s<sup>-1</sup></td><td>47331 s<sup>-1</sup></td><td>31.9 s</td><td>231 MiB</td></tr>
<tr><td>CDK</td><td>135286 s<sup>-1</sup></td><td>142126 s<sup>-1</sup></td><td>10.7 s</td><td>6.8 MiB</td></tr>
<tr><td>Beam</td><td>347534 s<sup>-1</sup></td><td>489545 s<sup>-1</sup></td><td>3.2 s</td><td>6.8 MiB</td></tr>
</tbody></table>
</center>
<h4>
Auxiliary Data</h4>
<p>With a performance difference that huge why would anyone want to use Serialization? Some use-cases might be that a format doesn't store the parts we need. A common argument against SMILES is the lack of coordinates but we can simply store this supplementary to the SMILES if we no what the input order will be (Code 2).</p>
<div class="code-header">
Code 2 - Writing Coordinates with SMILES
</div>
<pre class="prettyprint">IAtomContainer mol = ...;
// 'Generic' - avoid canon SMILES we are not doing identity check
SmilesGenerator sg = SmilesGenerator.generic();
int n = mol.getAtomCount();
int[] order = new int[n];
// the order array is filled up as the SMILES is generated
String smi = sg.create(mol, order);
// load the coordinates array such that they are in the order the atoms
// are read when parsing the SMILES
Point2d[] coords = new Point2d[mol.getAtomCount()];
for (int i = 0; i < coords.length; i++)
coords[order[i]] = container.getAtom(i).getPoint2d();
// SMILES string suffixed by the coordinates
String smi2d = smi + " " + Arrays.toString(coords);
</pre>
<div class="code-footer">
</div>
<p>Using that same technique it's relatively simply to extend this to handle arbitrary data fields and it even forms the basis of <a href="https://www.chemaxon.com/marvin-archive/15.3.9.0/help/formats/cxsmiles-doc.html">ChemAxon's extended SMILES</a>. A more advanced method would be combining the SMILES with a <code>DataOutputStream</code> since we know how many coordinates there are expected to be.</p>
<h4>
Summary</h4>
<p>
I'm certainly not against a performant <code>AtomContainerInputStream</code> but the default Java serialization should never be the first choice. Hopefully this post has put some numbers on why and will discourage knee-jerk usage.</p>
<h4>Update</h4>
<p>Added Kryo performance</p>
John Mayfieldhttp://www.blogger.com/profile/03062858450255240984noreply@blogger.com0tag:blogger.com,1999:blog-6131674809499098789.post-26390519253312056322015-09-08T20:43:00.001+01:002015-09-08T20:51:27.752+01:00Bringing Molfile Sgroups to the CDK - Demo<p>Despite the <a href="http://molmatinf.com/whynotmolsdf.html">flaws</a>, the molfile has been a defacto standard for chemical representation for several decades. The core format (atom and bond block) is well supported in many toolkits but more advanced features (<a href="http://depth-first.com/articles/2009/06/10/dark-corners-of-the-molfile-specification-sgroups-and-substructure-abbreviations/">dark corners</a>) of the property block may be skipped.</p>
<p>At this year's Fall ACS (Boston '15) I bumped into an old colleague from <a href="https://www.ebi.ac.uk/chebi/">ChEBI</a> who told me they (ChEBI) couldn't use CDK because they wanted to display repeating brackets on records and CDK didn't do that.</p>
<p>Polymer representation (more precisely Structural Repeat Unit) used by ChEBI falls under the category of a Ctab Sgroup. I'd wanted to add support for Sgroups for some time and now had motivation to do so.</p>
<br />
<h3>Substructure (or Substance) Groups</h3>
<p>Over the years there seems to have been a shift in definition. The original literature[1] uses the term "substructure groups" but more recent materials use "substance groups"[2,3]. Personally I prefer "substructure" since it concisely summarises what they really are about.</p>
<p>Essentially an Sgroup annotates some part of the connection table (a substructure) with meta-information (data). There are several types of Sgroup that formalise the types of annotation present:
<ul>
<li>Display Shortcuts
<ul>
<li>Abbreviations</li>
<li>Multiple Groups</li>
</ul>
</li>
<li>Polymers
<ul>
<li>Structural Repeat Unit (SRU)</li>
<li>Monomer</li>
<li>Copolymer (alternating, block, or random)</li>
<li>Mer</li>
<li>Crosslink</li>
<li>Graft</li>
<li>Modified</li>
<li>Any</li>
</ul>
<li>Mixtures
<ul>
<li>Unordered Mixture</li>
<li>Ordered Mixture (formulation)</li>
<li>Component</li>
</ul>
<li>Generic</li>
<li>Data</li>
</ul>
</p>
<h3>Example ChEBI Depictions</h3>
<p><a href="chem-bla-ics.blogspot.nl">Egon</a> reviewed the first patch (<a href="https://github.com/cdk/cdk/pull/149">pull/149</a>) last week that focussed on representation and molfile round tripping. The second patch enhances the rendering code to handle more than basic SRUs (e.g. >2 brackets) and display shortcuts.</p>
<p>As of ChEBI 131 there are 809 entries with at least one Sgroup. Generating the depictions of these from an SDfile took < 3 seconds, then a further 11 to actually write the files to disk. The rest of this post demonstrates some example of those depictions.</p>
<h4>
Display Shortcut, Abbreviations</h4>
<div>
<p>Previously referred to as "superatoms", parts of a structure can be abbreviated to a more concise name (e.g. Ph for a phenyl substituent). The full structure is present but is only displayed when the expansion flag is set.
</p></div>
<table>
<tbody>
<tr>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZGr__ALSiAusvj3g0Gjw6hKxXhU8Wn5QBNE6MkotRQo8AYPuaG95NlBrKKnHyUTXWFaXfxDWrJi2iFEVhCixqtk0QLTbDlsVfkUYo-sIArMthMC2uOZkwda4JqHv2mklCAc6Gugw7PVo/s1600/CHEBI_29441.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZGr__ALSiAusvj3g0Gjw6hKxXhU8Wn5QBNE6MkotRQo8AYPuaG95NlBrKKnHyUTXWFaXfxDWrJi2iFEVhCixqtk0QLTbDlsVfkUYo-sIArMthMC2uOZkwda4JqHv2mklCAc6Gugw7PVo/s320/CHEBI_29441.png" /></a></td>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipvYZXmWdhA6vykxU8e_FrSLAsiTWMF_SFYq8I2uCbBgYeJhA7TeOMuzNSoWYZYNeh8SIuWaJFZMr5t8owI5RdUb-LT9BW0v1vENLBR8v1Ai4f1y8d9YNMufeghNWF1F64euWsQ0P6A0g/s1600/CHEBI_7725.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipvYZXmWdhA6vykxU8e_FrSLAsiTWMF_SFYq8I2uCbBgYeJhA7TeOMuzNSoWYZYNeh8SIuWaJFZMr5t8owI5RdUb-LT9BW0v1vENLBR8v1Ai4f1y8d9YNMufeghNWF1F64euWsQ0P6A0g/s320/CHEBI_7725.png" /></a></td>
</tr>
<tr>
<td style="text-align: center;"><a href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:29441">CHEBI:29441</a></td>
<td style="text-align: center;"><a href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:7725">CHEBI:7725</a></td>
</tr>
</tbody></table>
<br/>
<h4>
Display Shortcut, Multiple Group</h4>
<p>
Multiple groups allow structures with fixed repeating parts to be drawn more concisely. Similar to abbreviations, all the atoms and bonds are present but are hidden from display. They're actually all overlaid on one another with duplicated coordinates but for rendering you still want omit them from display.
</p>
<br />
<table>
<tbody>
<tr>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjpBnp3-qWBotD47tCw7U3B52hg0q9zNy2D3cIGCt6W1sRPfaHfWNq8lsUI27N51WHk1fAi3lWn_FP0swzqgMVvy23kD_8usFuBui-hoEnKPGhZTVyEmHN662RfVQ8dSpIG-EUNyA9NEsA/s1600/CHEBI_1233.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjpBnp3-qWBotD47tCw7U3B52hg0q9zNy2D3cIGCt6W1sRPfaHfWNq8lsUI27N51WHk1fAi3lWn_FP0swzqgMVvy23kD_8usFuBui-hoEnKPGhZTVyEmHN662RfVQ8dSpIG-EUNyA9NEsA/s320/CHEBI_1233.png" /></a></td>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgah_8Ds-usDfMhYNINd2s1QwwhUs-uz1hLiItGwWsf0x2RytYPv_RIxx0yQRya3blP6NcW3ukKssn2Clpl77nJ4ZLvJ7YZQCwL2blacLlgmteb9PGaXH7b5H_ng2QlmbqWKUGcP4wCEXI/s1600/CHEBI_79399.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgah_8Ds-usDfMhYNINd2s1QwwhUs-uz1hLiItGwWsf0x2RytYPv_RIxx0yQRya3blP6NcW3ukKssn2Clpl77nJ4ZLvJ7YZQCwL2blacLlgmteb9PGaXH7b5H_ng2QlmbqWKUGcP4wCEXI/s320/CHEBI_79399.png" /></a></td>
</tr>
<tr>
<td style="text-align: center;"><a href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:1233">CHEBI:1233</a></td>
<td style="text-align: center;"><a href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:79399">CHEBI:79399</a></td>
</tr>
</tbody></table>
<br/>
<h4>
Polymer, SRUs</h4>
<p>
The most common Sgroup used in ChEBI is the Structure Repeat Unit (SRU), an SRU defines a repeat unit of variable length. The brackets do not necessarily come in pairs, are parallel, or point towards each other.
</p>
<br />
<table>
<tbody>
<tr>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjn229Setd90r4jkNsfzo7JjGhbqfYICzQeMIfIcUV3HhDHm_QRvXeahyffpby-uFv5WX9uymq01Im_hQTArhTs_8_IFsBzwvt3dD4OBQE_ScxljrXS0COVeRfahcLNxepmjQOh77vPakk/s1600/CHEBI_16838.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjn229Setd90r4jkNsfzo7JjGhbqfYICzQeMIfIcUV3HhDHm_QRvXeahyffpby-uFv5WX9uymq01Im_hQTArhTs_8_IFsBzwvt3dD4OBQE_ScxljrXS0COVeRfahcLNxepmjQOh77vPakk/s320/CHEBI_16838.png" /></a></td>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBXza25fLWKQFZhO2tELaVihbPLN8vYfjKZUIMmSDD4ULXiLV-cVxiKSMs-LZ07HMX-Gvy9qmEFCk83LYRCipd7vVcKOl1zLIBCe2FOVFlkz9F0hIsVSyWYQCCX_1MagSvo81Bi-KUJq8/s1600/CHEBI_4294.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBXza25fLWKQFZhO2tELaVihbPLN8vYfjKZUIMmSDD4ULXiLV-cVxiKSMs-LZ07HMX-Gvy9qmEFCk83LYRCipd7vVcKOl1zLIBCe2FOVFlkz9F0hIsVSyWYQCCX_1MagSvo81Bi-KUJq8/s320/CHEBI_4294.png" /></a></td>
</tr>
<tr>
<td style="text-align: center;"><a href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:16838">CHEBI:16838</a></td>
<td style="text-align: center;"><a href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:4294">CHEBI:4294</a></td>
</tr>
<tr>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgAql3RQEo7tCxlyzghOUzL9FJPj1KyQgNhTfzHzn_54eJ-iNLTTsSZe3VOYU7uyIENlOG_-0rEcvPA6tunu7YUluIOOUu82uF3YJOUY9uGuTHFdKwW0-x0z2RwtPPnxEBzyEJ4V0vcxiw/s1600/CHEBI_53422.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgAql3RQEo7tCxlyzghOUzL9FJPj1KyQgNhTfzHzn_54eJ-iNLTTsSZe3VOYU7uyIENlOG_-0rEcvPA6tunu7YUluIOOUu82uF3YJOUY9uGuTHFdKwW0-x0z2RwtPPnxEBzyEJ4V0vcxiw/s320/CHEBI_53422.png" /></a></td>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiB1NM5MwFUSPzPZVLzp2tivy6B0cAP54VmlaNlGCnqqetf645zLW2aIryuYqwx-8PB0-Ey7dfwoOIAZFmphoe0WsUdx1EUB30jsKOz50tANY4lX5QUasfUiBKw2JUFJqvBHeLDJ1HgtU8/s1600/CHEBI_59342.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiB1NM5MwFUSPzPZVLzp2tivy6B0cAP54VmlaNlGCnqqetf645zLW2aIryuYqwx-8PB0-Ey7dfwoOIAZFmphoe0WsUdx1EUB30jsKOz50tANY4lX5QUasfUiBKw2JUFJqvBHeLDJ1HgtU8/s320/CHEBI_59342.png" /></a></td>
</tr>
<tr>
<td style="text-align: center;"><a href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:53422">CHEBI:53422</a></td>
<td style="text-align: center;"><a href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:59342">CHEBI:59342</a></td>
</tr>
</tbody></table>
<br/>
<h4>
Polymer, Others</h4>
<p>
A few entries encode copolymers and source-based representations (monomer).
</p>
<br />
<table>
<tbody>
<tr>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLQZTQBz8Zt4btFZtYIbHOAVcXGXU0GLPaF1tS1W-x2wRNPh1yAQEfglaEhIQEM5g8sHwlAKsPwmKMteMY0zWDmK01JMptQasKzbwfGdgqggk-LMOkIdbqg1D8uUT1Fp-53wiPlASIAm8/s1600/CHEBI_59599.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLQZTQBz8Zt4btFZtYIbHOAVcXGXU0GLPaF1tS1W-x2wRNPh1yAQEfglaEhIQEM5g8sHwlAKsPwmKMteMY0zWDmK01JMptQasKzbwfGdgqggk-LMOkIdbqg1D8uUT1Fp-53wiPlASIAm8/s320/CHEBI_59599.png" /></a></td>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxfR7BG-kI-LCxcV1cTps11P0LtwpRd-Sd7UDhj5YR7TJbLThIadCtWamzXb-LD4NvuPYeHUPLJWYpKfn1CrOqmsjxr5XN7EY9knrod6c7Mg3AU9WZ3IGFS-nV3GbYwlJwrS-F-GklZuc/s1600/CHEBI_3814.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxfR7BG-kI-LCxcV1cTps11P0LtwpRd-Sd7UDhj5YR7TJbLThIadCtWamzXb-LD4NvuPYeHUPLJWYpKfn1CrOqmsjxr5XN7EY9knrod6c7Mg3AU9WZ3IGFS-nV3GbYwlJwrS-F-GklZuc/s320/CHEBI_3814.png" /></a></td>
</tr>
<tr>
<td style="text-align: center;"><a href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:59599">CHEBI:59599</a></td>
<td style="text-align: center;"><a href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:3814">CHEBI:3814</a> (overlap in original)</td>
</tr>
</tbody></table>
<br/>
<h4>
Combinations</h4>
<p>A structure can have more than one Sgroup and they can be nested. Here we see a multiple group within an SRU. There is also a data Sgroup attached to the Zn-N bond marking it as a coordination bond for Marvin. I've not decided whether to render those yet, but we have the information there.</p>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgd9TIyV7on3klxz-5C9UkOyy4SY_H6xBBpVs7gNFbRds9lqqnZx53j5pds6DkgCMsVn-g8zrXgSAl20qf3gytKANrHIjp7RreHEHZiPuVbsaDG_FaFf1A_RgVkJ5CA1pEtU4QbpkuAGxU/s1600/CHEBI_81539.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="247" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgd9TIyV7on3klxz-5C9UkOyy4SY_H6xBBpVs7gNFbRds9lqqnZx53j5pds6DkgCMsVn-g8zrXgSAl20qf3gytKANrHIjp7RreHEHZiPuVbsaDG_FaFf1A_RgVkJ5CA1pEtU4QbpkuAGxU/s400/CHEBI_81539.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:81539">CHEBI:81539</a></td></tr>
</tbody></table>
<br/>
<h3>Additional Reading</h3>
<ol>
<li><a href="http://pubs.acs.org/doi/abs/10.1021/ci00004a003">Gushurst <i>et al.</i> The substance module: the representation, storage, and searching of complex structures. <i><b>J. Chem. Inf. Comput. Sci.</b></i> (1991)</a></li>
<li><a href="http://www.structurependium.com/index.php/chemical-representation-of-structures-and-reactions/sgroups-abbreviations-mixtures-formulations-polymers-structures-with-statistical-distribution-and-other-special-cases">Blanke G. Sgroups – Abbreviations, Mixtures, Formulations, Polymers, Structures with Statistical Distribution and Other Special Cases. <b><i>Online - StructurePendium Technologies GmbH</i></b></a></li>
<li><a href="help.accelrysonline.com/insight/2.2/Content/PDF_files/AccelrysChemicalRepresentation.pdf">Accelrys Chemical Representation</a></li>
<li><a href="http://accelrys.com/products/collaborative-science/biovia-draw/ctfile-no-fee.html">CTfile Formats Specification</a></li>
</ol>John Mayfieldhttp://www.blogger.com/profile/03062858450255240984noreply@blogger.com0tag:blogger.com,1999:blog-6131674809499098789.post-63007280758210522712015-08-09T15:33:00.002+01:002015-08-09T15:33:39.204+01:00MMFF Partial Charges Improvements in CDK<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
</style>
<p>
Some time last year <a href="http://www-ucc.ch.cam.ac.uk/members/mw529">Mark Williamson</a> brought to my attention discrepancies in CDK's MMFF partial charge calculation. Investigating further it seemed to mainly be a problem with atom typing. There were two existing classes that could assigned MMFF atom types using a combination of a decision tree and string matching hose codes. The 761 molecules from the <a href="http://www.ccl.net/cca/data/MMFF94/">MMFF94 Validation Suite</a> provided by Paul Kersey were used to give a more comprehensive overview then our current tests.
</p>
<p>
The results showed reasonable precision per-atom in the validation suite but were less favourable per-molecule, the best implementation assigned types to <90% of the molecules with <16% assigned correctly.
</p>
<table class="tg">
<tbody>
<tr>
<th class="tg-031e"></th>
<th class="tg-031e" colspan="2">Assigned Types <br />
(Atoms)</th>
<th class="tg-031e" colspan="2">Correct Types <br />
(Atoms)</th>
<th class="tg-031e" colspan="2">Assigned Types <br />
(Molecules)</th>
<th class="tg-031e" colspan="2">Correct Types <br />
(Molecules)</th>
</tr>
<tr>
<td class="tg-031e">ForceFieldConfigurator</td>
<td class="tg-031e">15576</td>
<td class="tg-031e">90.1%</td>
<td class="tg-031e">12932</td>
<td class="tg-031e">74.8%</td>
<td class="tg-031e">678</td>
<td class="tg-031e">89.1%</td>
<td class="tg-031e">118</td>
<td class="tg-031e">15.5%</td>
</tr>
<tr>
<td class="tg-031e">MMFF94AtomTypeMatcher</td>
<td class="tg-031e">17120</td>
<td class="tg-031e">99.1%</td>
<td class="tg-031e">12309</td>
<td class="tg-031e">71.2%</td>
<td class="tg-031e">659</td>
<td class="tg-031e">86.6%</td>
<td class="tg-031e">75</td>
<td class="tg-031e">9.9%</td>
</tr>
<tr>
<td class="tg-031e">MmffAtomTypeMatcher</td>
<td class="tg-031e">17279</td>
<td class="tg-031e">100.0%</td>
<td class="tg-031e">17279</td>
<td class="tg-031e">100.0%</td>
<td class="tg-031e">761</td>
<td class="tg-031e">100.0%</td>
<td class="tg-031e">761</td>
<td class="tg-031e">100.0%</td>
</tr>
</tbody></table>
<p>
I wasn't keen to hard code the atom typing procedure but was delighted to find Robert Hanson of <a href="http://jmol.sourceforge.net/">JMol</a> had some SMARTS patterns that could be used as a starting point. After about a month of tweaking I managed to simplify the SMARTS patterns and achieve 100% precision on the validation suite. You can find the SMARTS patterns here: <a href="https://github.com/cdk/cdk/blob/master/tool/forcefield/src/main/resources/org/openscience/cdk/forcefield/mmff/MMFFSYMB.sma">/org/openscience/cdk/forcefield/mmff/MMFFSYMB.sma</a>.
</p>
<p>
Apart from improving atom type assignments the charge assignment also needed updating to include charge sharing and bond class differences. This wasn't quite as simple as I first thought as the parameter set parsing also needed reworking. After many months of analysis paralysis I decided last week to just rewrite what was needed and delegate calls from the existing implementation.
</p>
<p>
Now the patch is finished, charge assignments are much better. Notice that in the previous version (labelled CDK 1.5.10) equivalent terminal oxygens and the nitrogens in imidazole anion have different values. The overall charge was also inconsistent with the formal charges.
</p>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgawh7cfYfXYqwcog-IAEZ1OMqLPySDPFteLr3-jRzQ5hOL41NkK-iRF0cvsagkazw3pwIQ1D5oSo9xD97YJXpFsIvzvJlPK6Yqt5aU4oRjthq6S0vaK_c6c775WkErpJejLRtnrpvgL8Q/s1600/mmff-improvements.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgawh7cfYfXYqwcog-IAEZ1OMqLPySDPFteLr3-jRzQ5hOL41NkK-iRF0cvsagkazw3pwIQ1D5oSo9xD97YJXpFsIvzvJlPK6Yqt5aU4oRjthq6S0vaK_c6c775WkErpJejLRtnrpvgL8Q/s640/mmff-improvements.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Improved charge assignment</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
<p>Roger Sayle noted to me this week that MMFF charges should not be affected by representation, for example, charge separated pi bonds in nitro groups or phosphates.</p>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVxwnSuDuHbHOBgPWZ1Wz-34YJWfQNtqX3BHPOKoxvNmJxzb5wUEsn6SWr-7G7VNE5d7okrrrIVyGxW1UGzYFCQa1FNMECRhtIkXzm5vO3MevNW-j4CvVhZOZwh7eB2R8thDJ48373sfM/s1600/mmff-diffforms.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="121" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVxwnSuDuHbHOBgPWZ1Wz-34YJWfQNtqX3BHPOKoxvNmJxzb5wUEsn6SWr-7G7VNE5d7okrrrIVyGxW1UGzYFCQa1FNMECRhtIkXzm5vO3MevNW-j4CvVhZOZwh7eB2R8thDJ48373sfM/s320/mmff-diffforms.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Charges are independant of representation</td></tr>
</tbody></table>
<p>
Many thanks to Mark and <a href="http://www-ucc.ch.cam.ac.uk/members/apkc2">Alison Choy</a> for reporting the problem and adding patches for debugging and testing.</p>John Mayfieldhttp://www.blogger.com/profile/03062858450255240984noreply@blogger.com0tag:blogger.com,1999:blog-6131674809499098789.post-70229103089630968182015-01-29T17:38:00.002+00:002015-01-29T17:38:49.610+00:00PhD Thesis Now Available<p>
I'm please to announce that my PhD thesis is now available from the Cambridge DSpace repository: <a href="https://www.repository.cam.ac.uk/handle/1810/246652">https://www.repository.cam.ac.uk/handle/1810/246652</a>. One thing potentially of note is the description of fast Kekulisation that I originally <a href="http://efficientbits.blogspot.co.uk/2013/12/new-smiles-behaviour-parsing-cdk-154.html">intended</a> to write as a blog post. Also following up from <a href="http://nextmovesoftware.com/">NextMove Software</a>'s recent post by Daniel on <a href="http://nextmovesoftware.com/blog/2015/01/21/r-or-s-lets-vote/">Cahn-Ingold-Prelog (CIP)</a>, the results of Chapter 6 contains some more CIP madness.</p>John Mayfieldhttp://www.blogger.com/profile/03062858450255240984noreply@blogger.com0tag:blogger.com,1999:blog-6131674809499098789.post-4790683942493067292014-12-30T21:02:00.000+00:002014-12-30T21:02:12.379+00:00CDK Release 1.5.10<a href="http://dx.doi.org/10.5281/zenodo.13559"><img src="https://zenodo.org/badge/doi/10.5281/zenodo.13559.svg" alt="10.5281/zenodo.13559"></a>
<p>CDK 1.5.10 has been released and is available from sourceforge (<a href="http://sourceforge.net/projects/cdk/files/cdk%20%28development%29/1.5.10/">download here</a>) and the Maven central repository (XML 1).</p>
<p>
This release follows very shortly after 1.5.9 and is the first release available from the central maven repository. This means there is now no need to include a custom repo when using the library in downstream projects (XML 1)</p>
<p>
The short release notes (<a href="https://github.com/cdk/cdk/wiki/1.5.10-Release-Notes">1.5.10-Release-Notes</a>) summarise and detail the changes. Other than the availability in the central repository the release includes a new MolecularFormulaGenerator contributed by Tomáš Pluskal that provide mass to formula generation in a fraction of the time of the old MassToFormulaTool.</p>
<div>
<div class="code-header">
<b>XML 1</b> - Maven POM configuration</div>
<pre class="prettyprint lang-xml">
<dependency>
<groupId>org.openscience.cdk</groupId>
<artifactId>cdk-bundle</artifactId>
<version>1.5.10</version>
</dependency>
</pre>
<div class="code-footer">
</div>
</div>
John Mayfieldhttp://www.blogger.com/profile/03062858450255240984noreply@blogger.com0tag:blogger.com,1999:blog-6131674809499098789.post-6718065261905312772014-12-24T15:56:00.002+00:002014-12-24T15:59:06.475+00:00CDK Release 1.5.9<a href="http://dx.doi.org/10.5281/zenodo.13368">
<img alt="10.5281/zenodo.13368" src="https://zenodo.org/badge/6043/cdk/cdk.svg" />
</a><br />
<p>CDK 1.5.9 has been released and is available from sourceforge (<a href="http://sourceforge.net/projects/cdk/files/cdk%20%28development%29/1.5.9/">download here</a>) and the EBI maven repo (XML 1).</p>
<p>
This is the first release to be built using Java 7 and will require the Java SE Runtime 7 to execute. The previous release (1.5.8) will be the last to work with Java SE 6.</p>
<p>
The full release notes (<a href="https://github.com/cdk/cdk/wiki/1.5.9-Release-Notes">1.5.9-Release-Notes</a>) summarise and detail the changes. One of the new features is the recognition of perspective projection stereochemistry.</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://cdk.github.io/cdk/img/misc/releasenotes/stereorecognition.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Stereochemistry recognition" border="0" src="http://cdk.github.io/cdk/img/misc/releasenotes/stereorecognition.png" height="266" width="400" /></a></div>
<div>
<div class="code-header">
<b>XML 1</b> - Maven POM configuration</div>
<pre class="prettyprint lang-xml"><repository>
<url>http://www.ebi.ac.uk/intact/maven/nexus/content/repositories/ebi-repo/</url>
</repository>
...
<dependency>
<groupId>org.openscience.cdk</groupId>
<artifactId>cdk-bundle</artifactId>
<version>1.5.9</version>
</dependency>
</pre>
<div class="code-footer">
</div>
</div>
John Mayfieldhttp://www.blogger.com/profile/03062858450255240984noreply@blogger.com0tag:blogger.com,1999:blog-6131674809499098789.post-25162354661162042372014-12-01T23:31:00.001+00:002014-12-01T23:56:55.788+00:00Memory Mapped Fingerprint Index - Part IIThis post follows up on the previous to report some timings. I've checked all the code into GitHub (<a href="https://github.com/johnmay/efficient-bits/tree/master/fp-idx">johnmay/efficient-bits/fp-idx</a>) and it has some stand alone programs that can be run from the command line.<br />
<br />
Currently there are a few limitations that we'll get out the way:<br />
<br />
<ul>
<li>Only generation of the CDK ECFP4 is supported and at a folded length of 1024, this should give a close approximation to what <a href="http://blog.matt-swain.com/post/87093745652/chemical-similarity-search-in-mongodb">Matt used in MongoDB</a> (RDKit Morgan FP). Other fingerprints and foldings could be used but the generation time of path based fingerprints in the CDK is currently (painfully) slow.</li>
<li>Building the index is done in memory, since 1,000,000x1024 bit fingerprints is only 122 MiB you can easily build indexes with less than 10 million on modern hardware.</li>
<li>During index searching the entire index is memory mapped, setting the <span style="font-family: Courier New, Courier, monospace;">chunks</span> system property (see the GitHub readme) will avoid this at a slight performance cost.</li>
<li>Results return the id in the index (indirection) and to get the original Id one would need to resolve it with another file (generated by <span style="font-family: Courier New, Courier, monospace;">mkidx</span>). </li>
<li>Index update operations are not supported without rebuilding it.</li>
</ul>
These are all pretty trivial to resolve and I've simply omitted them due to time. With that done, here's a quick synopsis of making the index, there is more in the GitHub readme.
<br />
<div class="code-header">
<b>Code 1</b> - Synopsis</div>
<pre class="prettyprint lang-java">$ ./smi2fps /data/chembl_19.smi chembl_19.fps # ~5 mins
$ ./mkidx chembl_19.fps chembl_19.idx # seconds
</pre>
<div class="code-footer">
</div>
<p>
The <code>fpsscan</code> does a linear search computing all Tanimoto's and outputting the lines that
are above a certain threshold. The <code>simmer</code> and <code>toper</code> utils use the index, they either filter for similarity or the top <b><i>k</i></b> results. They can take multiple SMILES via the command line or from a file.
</p>
<div class="code-header">
<b>Code 2</b> - Running queries</div>
<pre class="prettyprint lang-java">$ ./fpsscan /data/chembl_19.fps 'c1cc(c(cc1CCN)O)O' 0.7 # ~ 1 second
$ ./simmer chembl_19.idx 0.7 'c1cc(c(cc1CCN)O)O' # < 1 second
$ ./toper chembl_19.idx 50 'c1cc(c(cc1CCN)O)O' # < 1 second (top 50)
</pre>
<div class="code-footer">
</div>
<p>
Using the same queries from the <a href="http://blog.matt-swain.com/post/87093745652/chemical-similarity-search-in-mongodb">MongoDB search</a> I get the following distribution of search times for different thresholds.</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjsFTDFAdxA9708iDyR-yTGNXPcUKIut0VcZXL1ecyzlSStm5m5Ds2uNkC8oWNwu41Y6la-8EUMaj2jcnzbxKbvm6rxjRDerHjjFqAtCBd-mOyi6ENEzl0u5IvZ6zEx9RMI8VFk1Lh-TA/s1600/Rplot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjsFTDFAdxA9708iDyR-yTGNXPcUKIut0VcZXL1ecyzlSStm5m5Ds2uNkC8oWNwu41Y6la-8EUMaj2jcnzbxKbvm6rxjRDerHjjFqAtCBd-mOyi6ENEzl0u5IvZ6zEx9RMI8VFk1Lh-TA/s1600/Rplot.png" height="248" width="400" /></a></div>
Some median search times are as follows.
<table style="margin: 10px auto; width: 250px;">
<tbody>
<tr>
<th>Threshold</th>
<th>Median time (ms)</th>
</tr>
<tr>
<td>0.90</td>
<td>14</td>
</tr>
<tr>
<td>0.80</td>
<td>31</td>
</tr>
<tr>
<td>0.70</td>
<td>46</td>
</tr>
<tr>
<td>0.60</td>
<td>53</td>
</tr>
</tbody></table>
In the box plot above the same (first) query is always the slowest, this is likely due to JIT.
<br/>
It's interesting to see that the times seem to flatten out. By plotting how many fingerprints the search had to check we observe that below a certain threshold we are essentially checking the entire dataset.
<br />
<div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhkJzw76SHimIEdWIuSs1EQRUDyqRYHGlA6tujhSgWqU-uZnUE_cp1WtvoT1Zd-CGY8w4hCmyu58BpukkjvJQ14R9ORqVgTrm8N7zPXAyu7QqZ7GT8WI19j0K9Im-IVI0QZr62Ovq37AQs/s1600/Rplot02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhkJzw76SHimIEdWIuSs1EQRUDyqRYHGlA6tujhSgWqU-uZnUE_cp1WtvoT1Zd-CGY8w4hCmyu58BpukkjvJQ14R9ORqVgTrm8N7zPXAyu7QqZ7GT8WI19j0K9Im-IVI0QZr62Ovq37AQs/s1600/Rplot02.png" height="248" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br /></div>
<p>
The reason for this is potentially due to the sparse circular fingerprints. Examining the result file (see the github README) we can estimate that on average we're calculating 23,556,103 Tanimoto's a second. This also means that retrieving the top <b><i>k</i></b> queries isn't bad either. For example 10,000 gives a median time (Code 3) of 72 ms.
</p>
<div class="code-header">
<b>Code 3</b> - Top 10,000 hits for queries (same as before)</div>
<pre class="prettyprint lang-java">
$ ./toper chembl_19.idx 10000 queries.smi
</pre>
<div class="code-footer">
</div>
<p>
Next I'll look at some like-for-like comparisons.
</p>John Mayfieldhttp://www.blogger.com/profile/03062858450255240984noreply@blogger.com0tag:blogger.com,1999:blog-6131674809499098789.post-19027398158309392562014-11-26T22:50:00.001+00:002014-11-26T22:50:28.996+00:00Memory Mapped Fingerprint Index - Part I<p>I attended an interesting talk this afternoon (<a href="http://www.c-inf.net/">CCNM</a>) by Matt Swain on using <a href="http://blog.matt-swain.com/post/87093745652/chemical-similarity-search-in-mongodb">MongoDB for chemical similarity searching</a> (code: <a href="https://github.com/mcs07/mongodb-chemistry">github/mcs07/mongodb-chemistry</a>).</p>
<p>
The similarity searching partially uses the "Baldi" algorithm with some extra tweaks based on checking rare bits. The Baldi method is nicely summarised along with others by Tim Vandermeersch in his post on <a href="http://timvdm.blogspot.co.uk/2012/08/fingerprint-searching-using-various.html">Fingerprint Searching Using Various Indexing Methods</a>. As is noted by Tim, it can be improved upon.</p>
<p>
Anyways, I had an implementation of a memory mapped Baldi index lying around, there is also one in the <a href="https://github.com/cdk/orchem">OrChem</a> database cartridge. I prototyped the implementation back in April and was/is part of a "nfp" (new fingerprint) module for <a href="https://github.com/cdk/cdk/">CDK</a>. I've now put the code on a GitHub project (<a href="https://github.com/johnmay/efficient-bits/tree/master/fp-idx">github/johnmay/efficient-bits/fp-idx</a>) and will do some benchmarking to see how it does.</p>
<p>
My feeling is that the very simple (<a href="https://github.com/johnmay/efficient-bits/blob/master/fp-idx/src/main/java/org/openscience/cdk/nfp/SimilarityIndex.java">it's about 100 lines</a>) memory mapped index can give competitive performance on small datasets (<5 million entries).
</p>
John Mayfieldhttp://www.blogger.com/profile/03062858450255240984noreply@blogger.com0tag:blogger.com,1999:blog-6131674809499098789.post-51810821946959081702014-11-16T17:10:00.000+00:002015-04-29T15:38:39.745+01:00Fun (and abuse) of implicit methods<p>Earlier this year I wrote up some <a href="http://ctr.wikia.com/wiki/Chemistry_Toolkit_Rosetta_Wiki">Chemical Toolkit Rosetta</a> examples of using the <a href="https://github.com/cdk/cdk/">CDK</a> in <a href="http://www.scala-lang.org/">Scala</a> (<a href="http://www.github.com/cdk/cdk-scala-examples">github/cdk/cdk-scala-examples</a>). When I was writing this it sprung to mind that it would be cool to (ab)use one feature for interoperability between cheminformatics toolkits.</p>
<p>
Scala is a statically typed functional language that runs on the Java Virtual Machine. It has some nice features and syntax that can produce some very concise code. One thing particular neat is the ability to define implicit methods. Essentially these are methods that define how to convert between types, they are implicit because the compiler can insert them automatically.
</p>
<p>
Implicit methods are very similar to <a href="https://docs.oracle.com/javase/tutorial/java/data/autoboxing.html">auto(un)boxing</a> that was introduced in Java 5 to simplify conversion of primitives and their instance wrappers (Code 1).
</p>
<div>
<div class="code-header">
<b>Code 1</b> - Autoboxing and autounboxing in Java</div>
<pre class="prettyprint lang-java">
Integer x = 5; // ~ Integer x = Integer.valueOf(5);
int y = x; // ~ int y = x.intValue();
x = y; // ~ x = Integer.valueOf(y);
if (x == y) { // ~ x.intValue() == y
}
</pre>
<div class="code-footer">
</div>
</div>
<p>
Much like it is possible in some programming languages to define custom operators, Scala makes it possible to define custom conversions that are inserted at compile time. The main advantage is it allows APIs to be extended to accept different types without introducing extra methods.
</p>
<h3>Conversion from line notations</h3>
<p>
Line notations are a concise means of encoding a chemical structure as sequence of characters (String). Common examples include SMILES, InChI*, WLN, SLN, and systematic nomenclature. Conversion to and from these formats isn't too computationally expensive but probably not something you want to do on-the-fly. However, just for fun, let's see what an implicit method for converting from strings can do. First we need the specified methods for loading from a known string type. We'll use the CDK for SMILES and InChI with <a href="http://opsin.ch.cam.ac.uk/">Opsin</a> for nomenclature.</p>
<div>
<div class="code-header">
<b>Code 2a</b> - Parsing of linear notations</div>
<pre class="prettyprint lang-scala">
val bldr = SilentChemObjectBuilder.getInstance
val sp = new SmilesParser(bldr)
val igf = InChIGeneratorFactory.getInstance
def inchipar(inchi: String) =
igf.getInChIToStructure(inchi, bldr).getAtomContainer
def cdksmipar(smi: String) =
sp.parseSmiles(smi)
def nompar(nom: String) =
cdksmipar(NameToStructure.getInstance.parseToSmiles(nom))
def cansmi(ac: IAtomContainer) =
SmilesGenerator.unique().create(ac)
// Universal SMILES (see. O'Boyle N, 2012**)
def unismi(ac: IAtomContainer) =
SmilesGenerator.absolute().create(ac)
</pre>
<div class="code-footer">
</div>
</div>
<div>
<div class="code-header">
<b>Code 2b</b> - Implicit conversion from a String to an IAtomContainer</div>
<pre class="prettyprint lang-scala">
implicit def autoParseCDK(str: String): IAtomContainer = {
if (str.startsWith("InChI=")) {
inchipar(str)
} else if (str.startsWith("1S/")) {
inchipar("InChI=" + str)
} else {
try {
cdksmipar(str)
} catch {
case _: InvalidSmilesException => nompar(str)
}
}
}
</pre>
<div class="code-footer">
</div>
</div>
<p>Now the implicit method has been defined, any method in the CDK API that accepts an IAtomContainer can now behave as though it accepts a linear notation. Code 3 shows how we can get the same Universal SMILES for different representations of caffeine and compute the <a href="http://cheminf20.org/2014/02/21/open-source-ecfpfcfp-circular-fingerprints-in-cdk/">ECFP4 fingerprint</a> for porphyrin</p>
<div>
<div class="code-header">
<b>Code 3</b> - Using implicit methods</div>
<pre class="prettyprint lang-java">
println(unismi("caffeine"))
println(unismi("CN1C=NC2=C1C(=O)N(C)C(=O)N2C"))
println(unismi("InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3"))
println(unismi("1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3"))
val fp = new CircularFingerprinter(CLASS_ECFP4).getCountFingerprint("porphyrin")
</pre>
<div class="code-footer">
</div>
</div>
<p>
</p>
<h3>Conversion between toolkits</h3>
<p>
It is also possible to add in implicit methods to auto-convert between toolkit types. To convert between the CDK and <a href="http://www.rdkit.org/">RDKit</a> (Java bindings) I'll go via SMILES. This conversion is lossy without an auxiliary data vector but serves as a proof of concept. I've lifted the Java bindings from the RDKit lucene project (<a href="https://github.com/rdkit/org.rdkit.lucene/">github/rdkit/org.rdkit.lucene/</a>) as the shared library works out the box for me. We can also add in the from string implicit conversions.
</p>
<p>
Code 4 shows the implicit method definitions. The additional <code>autoParseRDKit</code> allows us to bootstrap the RDKit API to also accept linear notations on all methods that expect an RWMol (or ROMol).
</p>
<div>
<div class="code-header">
<b>Code 4</b> - Implicit methods for conversion between CDK and RDKit</div>
<pre class="prettyprint lang-java">
implicit def cdk2rdkit(ac : IAtomContainer) : RWMol =
RWMol.MolFromSmiles(SmilesGenerator.isomeric.create(ac))
// XXX: better to use non-canon SMILES
implicit def rdkit2cdk(rwmol : RWMol) : IAtomContainer =
cdksmipar(RDKFuncs.getCanonSmiles(rwmol))
implicit def autoParseRDKit(str: String): RWMol =
cdk2rdkit(autoParseCDK(str))
</pre>
<div class="code-footer">
</div>
</div>
<p>
Now we can obtain the <a href="http://sourceforge.net/projects/avalontoolkit/">Avalon</a> fingerprint of caffeine from it's name and pass an RWMol to the CDK's <a href="ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.pdf">PubchemFingerprinter</a> (Code 5).
</p>
<div>
<div class="code-header">
<b>Code 5</b> - Using the RDKit API</div>
<pre class="prettyprint lang-java">
val fp = new ExplicitBitVect(512)
RDKFuncs.getAvalonFP("caffeine", fp2)
val caffeine : RWMol = "caffeine"
new PubchemFingerprinter(bldr).getBitFingerprint(caffeine)
</pre>
<div class="code-footer">
</div>
</div>
<p>
Given that auto(un)boxing primitives in Java can sting you in performance critical code, the above examples should be used sparingly. They do serve as a fun example of what is possible and I've put together the working code example in a Scala project for others to try <a href="https://github.com/johnmay/efficient-bits/tree/master/impl-conversion">github/johnmay/efficient-bits/impl-conversion</a>.</p>
<h4>Footnotes</h4>
<p>
<ul>
<li>* - Technically "InChI is not a replacement for any
existing internal structure
representations", <a href="http://www.hellers.com/steve/pub-talks/ams-5-14.pdf">Heller S. ICCS. 2014</a> but we'll allow it.
</li>
<li>** - <a href="http://www.jcheminf.com/content/4/1/22">http://www.jcheminf.com/content/4/1/22</a></li>
</ul>
</p>John Mayfieldhttp://www.blogger.com/profile/03062858450255240984noreply@blogger.com0tag:blogger.com,1999:blog-6131674809499098789.post-55975211822117770012014-09-12T14:59:00.001+01:002014-09-13T11:04:42.965+01:00Not to scale<p>
The latest release of the CDK (1.5.8) includes a new generator for rendering structure diagrams. A detailed introduction to configuring the new generator is available on the CDK wiki<sup>[<a href="http://efficientbits.blogspot.co.uk/2014/09/not-to-scale.html#ref1">1</a>]</sup>.
</p>
<p>
The new generator can be used as a drop in replacement in existing code. However, one aspect of rendering that I've struggled with previously was getting good sized depictions with the CDK - most notably with vector graphic output. This post will look at how we can size depictions and will provide code in an example project as a reference.
</p>
<p>
ChEBI's current <a href="http://www.ebi.ac.uk/chebi/entityMonthForward.do">entity of the month</a> - maytansine [<a href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=6701"><code>CHEBI:6701</code></a>] will be used to demonstrate the sizing.
</p>
<h4>Parameters</h4>
<p>
Three parameters that are important in the overall sizing of depictions. These are the <code>BondLength</code>, <code>Scale</code>, and <code>Zoom</code> which are all registered as <code>BasicSceneGenerator</code> parameters. The <code>Zoom</code> is not needed if we allow our diagram to be fitted automatically.</p>
<p>The <code>BondLength</code> can be set by the user and has a default value of '40' whilst the <code>Scale</code> is set during diagram generation. <code>BondLength</code> units are arbitrary - for now we'll consider this as '40 px'.
</p>
<h4>Scaling</h4>
<p>
The <code>Scale</code> parameter is used to render molecules with different coordinate systems consistently<sup>[<a href="http://efficientbits.blogspot.co.uk/2014/09/not-to-scale.html#ref2">2</a>,<a href="http://efficientbits.blogspot.co.uk/2014/09/not-to-scale.html#ref3">3</a>]</sup>. The value is determined using the <code>BondLength</code> parameter and the bond length in the molecule. For maytansine [<a href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=6701"><code>CHEBI:6701</code></a>] the median bond length is ~0.82. Again, the units are arbitrary - this could be Angstroms (it isn't).
</p>
<p>
The <code>Scale</code> is therefore the ratio of the measured bond length (0.82) to the desired bond length (40 px). For this example, the scale is 48.48. The coordinates must be scaled by ~4800% such that each bond is drawn as 40 px long.
</p>
<h4>Bounds</h4>
<p>
Now we know our scale (~48.48), how big is our depiction going to be? It depends how we measure it. One way would be to check the bounding box that contains all atom coordinates (using <code>GeometryUtil.getMinMax()</code>). However, this does not consider the positions of adjuncts and would lead to parts of the diagram being cut off<sup>[<a href="http://efficientbits.blogspot.co.uk/2014/09/not-to-scale.html#ref4">4</a>]</sup>.
</p>
<div align="center">
<img style="border-width: 1px; border-style: solid; border-color: #444444; padding: 0;" src="http://cdk.github.io/cdk/img/stdgen/bounds/chebi-6701-bounds.svg" width="40%" align="center"/>
<img style="border-width: 1px; border-style: solid; border-color: #444444; padding: 0;" src="http://cdk.github.io/cdk/img/stdgen/bounds/chebi-6701-bad-bounds.svg" width="36%" align="center"/>
</div>
<p>
Fortunately the new generator provides a <code>Bounds</code> rendering element allowing us to determine the true diagrams bounds of 8.46x8.03. Since the scale is ~48.48 the final size of the depiction would be ~410x390 px. A margin is also added.
</p>
<div align="center">
<img style="border-width: 1px; border-style: solid; border-color: #444444; padding: 0;" src="http://cdk.github.io/cdk/img/stdgen/bounds/chebi-6701-good-bounds.svg" width="40%" align="center"/>
<img style="border-width: 1px; border-style: solid; border-color: #444444; padding: 0;" src="http://cdk.github.io/cdk/img/stdgen/bounds/chebi-6701-with-border.svg" width="42%" align="center"/>
</div>
<br/>
<h4>Current Rendering API</h4>
<p>
Now we have the size of our diagram we can render raster images. Unfortunately the current rendering API makes this a little tricky as the diagram is generated after the desired image size is provided by the user. To get the correct size we need to generate the diagram twice (to get the bounds) or use an intermediate representation (we'll see this later).
</p>
<div>
<div class="code-header">
<b>Code 1</b> - Current rendering API</div>
<pre class="prettyprint lang-java">
// structure with coordinates
IAtomContainer container = ...;
// create the renderer - we don't use a font manager
List<IGenerator<IAtomContainer>> generators =
Arrays.asList(new BasicSceneGenerator(),
new StandardGenerator(new Font("Verdana", Plain, 18));
AtomContainerRenderer renderer = new AtomContainerRenderer(generators, null);
Graphics2D g2 = ...; // Graphics2D to draw raster / vector graphics
Rectangle2D bounds = ...; // need the bounds here!
renderer.paint(new AWTDrawVisitor(g2),
bounds);
</pre>
<div class="code-footer">
</div>
</div>
<br/>
<h4>Vector graphics</h4>
<p>
To render scalable graphics we can use the <code>VectorGraphics2D</code><sup>[<a href="http://efficientbits.blogspot.co.uk/2014/09/not-to-scale.html#ref5">5</a>]</sup> implementations of the Java <code>Graphics2D</code> class. Vector graphics output can use varied units (e.g. pt, mm, px) - the <code>VectorGraphics2D</code> uses mm.
</p>
<p>
Without adjusting our scaling the render of maytansine [<a href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=6701"><code>CHEBI:6701</code></a>] would be displayed with bond lengths of 40 mm and a total size of ~410x390 mm. The output can be rescaled after rendering but the default width of 41 cm is a bit large. We therefore need to change our desired bond length.
</p>
<p>
The bond length of published structure diagrams varies between journals. A common and recommended style for wikipedia <sup>[<a href="http://efficientbits.blogspot.co.uk/2014/09/not-to-scale.html#ref6">6</a>]</sup> is 'ACS 1996' - the style has a bond length of '5.08 mm'. Although setting the <code>BondLength</code> parameter to '5.08' would work, other parameters would need adjusting such as Font size (which is provided in pt!).</p>
<p>
To render the diagram with the same proportions as the raster image we can instead resize the bounds and fit the diagram to this. Since the desired bond length is '5.08 mm' instead of '40 mm' we need rescale the diagram by 12.7 %. Our final diagram size is then ~52x50 mm. The border for ACS 1996 is '0.56 mm' which can be added to the diagram size.
</p>
<h4>Example code</h4>
<p>
To help demonstrate the above rendering I've put together a quick GitHub project <a href="https://github.com/johnmay/efficient-bits/tree/master/scaled-renders">johnmay/efficient-bits/scaled-renders</a>. The code provides a convenient API and a command line utility for generating images.
</p>
<div>
<div class="code-header">
<b>Code 2</b> - Intermediate object</div>
<pre class="prettyprint lang-java">
// structure with coordinates
IAtomContainer container = ...;
// create the depiction generator
Font font = new Font("Verdana", Plain, 18);
DepictionGenerator generator = new DepictionGenerator(new BasicSceneGenerator(),
new StandardGenerator(font));
// generate the intermediate 'depiction'
Depiction depiction = generator.generate(container);
// holds on to the rendering primitives as well as the size
double w = depiction.width();
double h = depiction.height();
// draw at 'default' size
depiction.draw(g2, w, h);
// generate a PDF (or SVG)
String pdfContent = depiction.toPdf(); // default size
String pdfContent = depiction.toPdf(1.5); // 1.5 * default size
String pdfContent = depiction.toPdf(0.508, 0.056); // bond length, margin
</pre>
<div class="code-footer">
</div>
</div>
<p>The command line utility provides several options to play with and can load from molfile, SMILES, InChI, or name (using OPSIN<sup>[<a href="http://efficientbits.blogspot.co.uk/2014/09/not-to-scale.html#ref7">7</a>]</sup>).</p>
<div>
<div class="code-header">
<b>Code 3</b> - Command line</div>
<pre class="prettyprint lang-shell">
# In the project root set the following alias
$: alias render='mvn exec:java -Dexec.mainClass=Main'
# Using OPSIN to load porphyrin and generate a PDF
$: render -Dexec.args="-name porphyrin -pdf"
# Highlight one of the pyrrole in porphyrin
$: render -Dexec.args="-name porphyrin -pdf -sma n1cccc1"
# Show atom numbers
$: render -Dexec.args="-name porphyrin -pdf -atom-numbers"
# Show CIP labels
$: render -Dexec.args="-name '(2R)-butan-2-ol' -pdf -cip-labels"
# Generate a PDF / SVG for ethanol SMILES
$: render -Dexec.args="-smi CCO -pdf ethanol.pdf -svg ethanol.svg"
# Load a molfile
$: render -Dexec.args="-mol ChEBI_6701.mol -pdf chebi-6701.pdf"
</pre>
<div class="code-footer">
</div>
</div>
<p>
You can even play with the font</p>
<pre>$: render -Dexec.args="-name 'caffeine' -svg cc-caffeine.svg -font-family 'Cinnamon Cake' -stroke-scale 0.6 -kekule"</pre>
<div align="center">
<img style="border-width: 1px; border-style: solid; border-color: #444444; padding: 0;" src="http://cdk.github.io/cdk/img/stdgen/bounds/cc-caffeine.svg" width="42%" align="center"/>
</div>
<h1>Links/References</h1>
<ol>
<li id="ref1"><a href="https://github.com/cdk/cdk/wiki/Standard-Generator">https://github.com/cdk/cdk/wiki/Standard-Generator</a></li>
<li id="ref2"><a href="http://gilleain.blogspot.co.uk/2010/09/scaling-and-text.html">http://gilleain.blogspot.co.uk/2010/09/scaling-and-text.html</a></li>
<li id="ref3"><a href="http://gilleain.blogspot.co.uk/2010/09/consistent-zoom-with-models-of.html">http://gilleain.blogspot.co.uk/2010/09/consistent-zoom-with-models-of.html</a></li>
<li id="ref4"><a href="http://onlinelibrary.wiley.com/doi/10.1002/minf.201200171/abstract">http://onlinelibrary.wiley.com/doi/10.1002/minf.201200171/abstract</a></li>
<li id="ref5"><a href="http://trac.erichseifert.de/vectorgraphics2d/">http://trac.erichseifert.de/vectorgraphics2d/</a></li>
<li id="ref6"><a href="http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Chemistry/Structure_drawing">http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Chemistry/Structure_drawing</a></li>
<li id="ref7"><a href="http://opsin.ch.cam.ac.uk/">http://opsin.ch.cam.ac.uk/</a></li>
</ol>John Mayfieldhttp://www.blogger.com/profile/03062858450255240984noreply@blogger.com0tag:blogger.com,1999:blog-6131674809499098789.post-22012468973959793262014-09-11T11:40:00.001+01:002014-09-11T15:49:22.071+01:00CDK Release 1.5.8<a href="http://dx.doi.org/10.5281/zenodo.11681"><img style="border: 0px; padding: 0px;" src="https://zenodo.org/badge/doi/10.5281/zenodo.11681.png" alt="10.5281/zenodo.11681"></a>
<p>CDK 1.5.8 has been released and is available from sourceforge (<a href="http://sourceforge.net/projects/cdk/files/cdk%20%28development%29/1.5.8/">download here</a>) and the EBI maven repo (XML 1).</p>
<p>The release notes (<a href="https://github.com/cdk/cdk/wiki/1.5.8-Release-Notes">1.5.8-Release-Notes</a>) summarise and detail the changes.</p>
<div>
<div class="code-header">
<b>XML 1</b> - Maven POM configuration</div>
<pre class="prettyprint lang-xml">
<repository>
<url>http://www.ebi.ac.uk/intact/maven/nexus/content/repositories/ebi-repo/</url>
</repository>
...
<dependency>
<groupId>org.openscience.cdk</groupId>
<artifactId>cdk-bundle</artifactId>
<version>1.5.8</version>
</dependency>
</pre>
<div class="code-footer">
</div>
</div>John Mayfieldhttp://www.blogger.com/profile/03062858450255240984noreply@blogger.com0tag:blogger.com,1999:blog-6131674809499098789.post-10815706929114138122014-07-22T11:58:00.002+01:002014-07-22T12:46:30.948+01:00Polish-ed SMARTS parsing<p>As introduced in previous posts, <a href="http://en.wikipedia.org/wiki/Smiles_arbitrary_target_specification">SMARTS</a> is a concise notation for describing chemical substructure queries. There are several aspects to a SMARTS implementation: subgraph graph matching, parsing, generating, and even optimisation<sup>[<a href="http://efficientbits.blogspot.co.uk/2014/07/polish-ed-smarts-parsing.html#ref1">1</a>,<a href="http://efficientbits.blogspot.co.uk/2014/07/polish-ed-smarts-parsing.html#ref2">2</a>]</sup>.</p>
<p>In this post I'll show a way of parsing the binary atom expressions that I found quite neat.</p>
<h4>Preliminaries</h4>
<p>Conceptually, a SMARTS atom expression is composed of primitives and operators (binary and unary). The primitives test whether some property of a atom (e.g. element, charge, valence, etc) has a certain value<sup>[<a href="http://efficientbits.blogspot.co.uk/2014/07/polish-ed-smarts-parsing.html#ref3">3</a>]</sup>. The operators invert and combine these primitives through conjunction (and), disjunction (or), and negation (not).</p>
<p>Some examples of atom expressions are:</p>
<pre>
[O&X1]
[!C&!N]
[C,c;X3&v4]
[N&!H0&X3]
[!#6&X4]
[O,S,#7,#15]
[C&X3;$([H2]),$([H1][#6]),$(C([#6])[#6])]
</pre>
<p>The operators in these expressions ordered by their precedence are:</p>
<pre>
! unary not
& binary and (high)
, binary or
; binary and (low)
</pre>
<p>The default operator is '&' and can often be omitted such that the first pattern would read <code>[OX1]</code>. There are two 'and' operators with difference precedence allowing logical expressions like:</p>
<pre>
[C,N&X1] C or (N and X1)
[C,N;X1] (C or N) and X1
</pre>
<p>More complex expression trees can be accomplished with recursive SMARTS.</p>
<!--<p>There are a couple of ways to parse and interpret these SMARTS expressions. It can be convenient to separate out the lexical analysis and parsing but I'll keep them together here.</p>-->
<p>A formal grammar for SMARTS that respects precedence looks something like this (lifted from the CDK javacc implementation):</p>
<div>
<div class="code-header">
SMARTS EBNF grammar</div>
<pre class="prettyprint lang-java">
AtomExpression ::= "[" <LowAndExpression> "]"
LowAndExpression ::= <OrExpression> [ ";" <LowAndExpression> ]
OrExpression ::= <HighAndExpression> [ "," <OrExpression> ]
HighAndExpression ::= <NotExpression> [ '&' <HighAndExpression> ]
NotExpression ::= [ "!" ] <AtomPrimitive>
</pre>
<div class="code-footer">
</div>
</div>
<p>Notice this is a recursive procedure where I ascend up the precedence hierarchy while descending into the grammar. The small number of operators in SMARTS means this is generally good enough. However there is a non-recursive alternative. </p>
<h4>Reverse Polish notation</h4>
<p>Reverse Polish notation (RPN) is a notation where the operator follows the operands of an expression<sup>[<a href="http://efficientbits.blogspot.co.uk/2014/07/polish-ed-smarts-parsing.html#ref4">4</a>]</sup>. Some simple mathematical expressions are written as follows:</p>
<pre>
5 + 1 5 1 +
3 + 4 * 2 3 4 2 * +
(3 + 4) * 2 3 4 + 2 *
</pre>
<p>RPN is extremely useful and simple for implementing and performing operations on stack-based machines<sup>[<a href="http://efficientbits.blogspot.co.uk/2014/07/polish-ed-smarts-parsing.html#ref5">5</a>]</sup>. An excellent property is that the operators are applied as soon as they are encountered. Notice that I don't need parentheses to change the multiply and addition order. Also notice that a lookahead check for operator validity isn't needed, when an operator is applied the primitives have already been parsed.</p>
<p>SMARTS operators are infix but let us see what RPN SMARTS might look like:</p>
<pre>
[O&X1] O X1 &
[!C&!N] C ! N ! &
[C,c;X3&v4] C c , X3 v4 & ;
[N&!H0&X3] N H0 ! X3 & &
[!#6&X4] #6 ! X4 &
[O,S,#7,#15] O S #7 #15 , , ,
</pre>
<p>RPN SMARTS is much simpler to write a parser for that respect precedence. All that is needed is a way to convert from infix to postfix. The Shunting-yard algorithm<sup>[<a href="http://efficientbits.blogspot.co.uk/2014/07/polish-ed-smarts-parsing.html#ref6">6</a>]</sup> does just that.</p>
<h4>Implementation</h4>
<p>The Shunting-yard algorithm is explained well in many other webpages so I'll neglect that here. I will be converting from infix to postfix and build the expression at the same time. To do this, two stacks are needed, one for atom primitives and one for operators. The atom primitive stack is essentially the output of the Shunting-yard but I apply the operators instead of appending them to the postfix string.</p>
<p>Code 1 consumes characters from the input and either shunts an operator or parses a primitive. Once all the input is consumed the remaining operators are applied. The created query atom is on top of the stack and is returned. A low precedence no-operator is pushed on the stack to make thinks simpler and buffer the shunting.</p>
<p>To handle the implicit '&' between primitives a little more work is needed. Essentially one would optionally invoke <code>shunt(atoms, operators, '&');</code> as needed at each iteration.</p>
<div>
<div class="code-header">
<b>Code 1</b> - Primary loop</div>
<pre class="prettyprint lang-java">IQueryAtom parse(CharBuffer buffer) throws IOException {
// stacks of atom primitives and operators
Deque<IQueryAtom> atoms = new ArrayDeque<>();
Deque<Character> operators = new ArrayDeque<>();
operators.push(Character.MAX_VALUE); // a pseudo low precedence op
while (buffer.hasRemaining()) {
char c = buffer.get();
if (isOperator(c)) // c == '!' or '&' or ',' or ';'?
shunt(atoms, operators, c);
else
atoms.push(parsePrimitive(buffer));
}
// apply remaining operators
while (!operators.isEmpty())
apply(operators.pop(), atoms);
return atoms.pop();
}
</pre>
<div class="code-footer">
</div>
</div>
<p>Code 2 shows the creation of query atom primitives, here they are delegated to several self explanatory utility methods. For compactness only a subset of primitives are read.</p>
<div>
<div class="code-header">
<b>Code 2</b> - Parsing selected primitives</div>
<pre class="prettyprint lang-java">IQueryAtom parsePrimitive(CharBuffer buffer) throws IOException {
switch (buffer.get(buffer.position() - 1)) {
case 'A': return newAliphaticQryAtm();
case 'C': return newAliphaticQryAtm(6);
case 'N': return newAliphaticQryAtm(7);
case 'O': return newAliphaticQryAtm(8);
case 'P': return newAliphaticQryAtm(15);
case 'S': return newAliphaticQryAtm(16);
case 'a': return newAromaticQryAtm();
case 'c': return newAromaticQryAtm(6);
case 'n': return newAromaticQryAtm(7);
case 'o': return newAromaticQryAtm(8);
case 'p': return newAromaticQryAtm(15);
case 's': return newAromaticQryAtm(16);
case '#': return newNumberQryAtm(parseNum(buffer));
case 'X': return newConnectivityQryAtm(parseNum(buffer));
case 'H': return newHydrogenCountQryAtm(parseNum(buffer));
case 'R': return newRingMembershipQryAtom(parseNum(buffer));
case 'v': return newValenceQryAtom(parseNum(buffer));
}
throw new IOException("Primitive not handled");
}
</pre>
<div class="code-footer">
</div>
</div>
<p>To apply an operator, take the operands (primitives) off the top of the atom stack, create a new query atom, and push it back on to the stack (Code 3). If there aren't enough operands, the expression is invalid (not shown).</p>
<div>
<div class="code-header">
<b>Code 3</b> - Applying an operator</div>
<pre class="prettyprint lang-java">void apply(char op, Deque<IQueryAtom> atoms) {
if (op == '&' || op == ';')
atoms.push(and(atoms.pop(), atoms.pop()));
else if (op == ',')
atoms.push(or(atoms.pop(), atoms.pop()));
else if (op == '!')
atoms.push(not(atoms.pop()));
}
</iqueryatom></pre>
<div class="code-footer">
</div>
</div>
<p>Finally, to handle the operator (Code 4), check if the operator currently on top of the stack has precedence over the new operator. If so, pop it from the stack and apply it. The new operator is then added to the stack. Conveniently the code point of the operator character can be used as the precedence.</p>
<div>
<div class="code-header">
<b>Code 4</b> - Handling operator precedence</div>
<pre class="prettyprint lang-java">void shunt(Deque<IQueryAtom> atoms, Deque<Characters> operators, char op) {
while (precedence(operators.peek()) < precedence(op))
apply(operators.pop(), atoms);
operators.push(op);
}
static int precedence(char c) {
return c; // in ASCII, '!' < '&' < ',' < ';'
}
</pre>
<div class="code-footer">
</div>
</div>
<p>With the exception of a few utility methods these four snippets are essentially the whole implementation. You can find the fully functional code on the GitHub project<sup>[<a href="http://efficientbits.blogspot.co.uk/2014/07/polish-ed-smarts-parsing.html#ref7">7</a>]</sup>.</p>
<p>Not only is the code is incredibly compact and elegant but it can easily be expanded. Several convenience extensions to SMARTS have been made in the past – for example, <code>#X</code> for <code>!#1!#6</code>. A common requirement in general expressions and the Shunting-yard is to handle parenthesis. These need special treatment but it is only a simple modification to the shunting and the precedence value (Code 5).</p>
<div>
<div class="code-header">
<b>Code 5</b> - Handling parenthesis</div>
<pre class="prettyprint lang-java">
void shunt(Deque<IQueryAtom> atoms, Deque<Character> operators, char op) {
if (op == ')') {
while ((op = operators.pop()) != '(')
apply(op, atoms);
} else {
if (op != '(') {
while (precedence(operators.peek()) < precedence(op))
apply(operators.pop(), atoms);
}
operators.push(op);
}
}
int precedence(char c) {
switch (c) {
case '!': return 1;
case '&': return 2;
case ',': return 3;
case ';': return 4;
case '(':
case ')': return 5;
default: return 6;
}
}
</pre>
<div class="code-footer">
</div>
</div>
<p>The parser will now correctly handle the following expressions without recursive SMARTS:</p>
<pre>
[!(C,N,O,P,S)] C N O P , , , !
[!(C,N,O&X1)] C N O X1 & , , !
[((C,N)&X3),((O,S)&X2)] C N , X3 & O S , X2 & ,
</pre>
<p>All source code is available at <a href="https://github.com/johnmay/efficient-bits/blob/master/polished-smarts/src/main/java/SmartsAtomExprParser.java">github/johnmay/efficient-bits/polished-smarts</a>.</p>
<h4>
References</h4>
<ol>
<li id="ref1"><a href="http://www.nextmovesoftware.com/patsy.html">PATSY, NextMove Software</a></li>
<li id="ref2"><a href="http://timvdm.blogspot.co.uk/2012/09/smarts-optimisation-compilation.html">SMARTS Optimisation & Compilation: Introduction & Optimisation, Tim Vandermeersch</a></li>
<li id="ref3"><a href="http://www.daylight.com/dayhtml/doc/theory/index.pdf">Daylight theory manual, Daylight CIS</a></li>
<li id="ref4"><a href="http://en.wikipedia.org/wiki/Reverse_Polish_notation">Reverse Polish notation, Wikipedia</a></li>
<li id="ref5"><a href="https://www.youtube.com/watch?v=7ha78yWRDlE">Reverse Polish notation and the stack, Computerphile</a></li>
<li id="ref6"><a href="http://en.wikipedia.org/wiki/Shunting-yard_algorithm">Shunting-yard algorithm, Wikipedia</a></li>
<li id="ref7"><a href="https://github.com/johnmay/efficient-bits/blob/master/polished-smarts/src/main/java/SmartsAtomExprParser.java">github/johnmay/efficient-bits/polished-smarts</a></li>
</ol>John Mayfieldhttp://www.blogger.com/profile/03062858450255240984noreply@blogger.com0tag:blogger.com,1999:blog-6131674809499098789.post-18710451036984167742014-07-18T11:30:00.001+01:002014-07-18T11:30:27.133+01:00CDK Release 1.5.7<p>CDK 1.5.7 has been released and is available from sourceforge (<a href="http://sourceforge.net/projects/cdk/files/cdk%20%28development%29/1.5.7/">download here</a>) and the EBI maven repo (XML 1).</p>
<p>The release notes (<a href="https://github.com/cdk/cdk/wiki/1.5.7-Release-Notes">1.5.7-Release-Notes</a>) summarise and detail the changes. Among the new bug fixes and features, several plugins have been added to the build. The release notes describe how these plugins can be run and what they do so be sure check the notes out if you're a contributor.</p>
<div>
<div class="code-header">
<b>XML 1</b> - Maven POM configuration</div>
<pre class="prettyprint lang-xml">
<repository>
<url>http://www.ebi.ac.uk/intact/maven/nexus/content/repositories/ebi-repo/</url>
</repository>
...
<dependency>
<groupId>org.openscience.cdk</groupId>
<artifactId>cdk-bundle</artifactId>
<version>1.5.7</version>
</dependency>
</pre>
<div class="code-footer">
</div>
</div>John Mayfieldhttp://www.blogger.com/profile/03062858450255240984noreply@blogger.com0tag:blogger.com,1999:blog-6131674809499098789.post-78815725667947901672014-03-26T19:06:00.000+00:002014-03-26T20:06:02.960+00:00Mischievous SMARTS QueriesLast year I extended the <a href="http://sourceforge.net/projects/cdk/">CDK</a> <a href="http://en.wikipedia.org/wiki/Smiles_arbitrary_target_specification">SMARTS</a> implementation to match component groupings and stereochemistry. Specifying stereochemistry presents some interesting logical predicate that might be tricky to handle.<br />
<br />
Here are some examples that I came up with for testing the correctness of query handling. They start simple before getting a little mischievous. First, recursion and component grouping.<br />
<br />
<table style="width: 100%;">
<tbody>
<tr>
<th>query</th><th>targets</th><th><i>n<sub>match</sub></i></th><th>Comment</th>
</tr>
<tr>
</tr>
<tr><td colspan="3"><i>Component grouping (fragment)</i></td></tr>
<tr>
<td><code>(O).(O)</code></td><td><code>O=O</code></td><td>0</td><td>Example from Daylight</td>
</tr>
<tr>
<td></td><td><code>OCCO</code></td><td>0</td>
</tr>
<tr>
<td></td><td><code>O.CCO</code></td><td>2</td>
</tr>
<tr><td colspan="3"><i>Component grouping (connected)</i></td></tr>
<tr>
<td><code>(O.O)</code></td><td><code>O=O</code></td><td>2</td><td>Example from Daylight</td>
</tr>
<tr>
<td></td><td><code>OCCO</code></td><td>2</td>
</tr>
<tr>
<td></td><td><code>O.CCO</code></td><td>0</td>
</tr>
<tr><td colspan="3"><i>Recursion, ad infinitum</i></td></tr>
<tr>
<td><code>[$(CC[$(CCO),$(CCN)])]</code></td><td><code>CCCCO</code></td><td>1</td><td></td>
</tr>
<tr>
<td></td><td><code>CCCCN</code></td><td>1</td>
</tr>
<tr>
<td></td><td><code>CCCCC</code></td><td>0</td>
</tr>
<tr><td colspan="3"><i>Recursive component grouping</i></td></tr>
<tr>
<td><code>[O;D1;$(([a,A]).([A,a]))][CH]=O</code></td><td><code>OC=O.c1ccccc1</code></td><td>1</td><td><a href="http://sourceforge.net/p/cdk/bugs/1312/">Feature/Bug #1312</a></td>
</tr>
<tr>
<td></td><td><code>OC=O</code></td><td>0</td>
</tr>
</tbody></table>
<br />
<table style="width: 100%px;">
<tbody>
</tbody>
</table>
These next ones are concerned with logic and stereochemistry.
<table style="width: 100%;">
<tbody>
<tr>
<th>query</th><th>targets</th><th><i>n<sub>match</sub></i></th><th>Comment</th>
</tr>
<tr/>
<tr><td colspan="3"><i>Ensure local stereo matching</i></td></tr>
<tr>
<td><code>*[@](*)(*)(*)</code></td><td><code>O[C@](N)(C)CC</code></td><td>12</td><td>tetrahedrons have 12 rotation symmetries</td>
</tr>
<tr>
<td></td><td><code>O[C@@](N)(C)CC</code></td><td>12</td>
</tr>
<tr>
<td></td><td><code>O[C](N)(C)CC</code></td><td>0</td>
</tr>
<tr><td colspan="3"><i>Implicit (hydrogen or lone-pair) neighbour</i></td></tr>
<tr>
<td><code>CC[S@](C)=O</code></td><td><code>CC[S@](C)=O</code></td><td>1</td>
</tr>
<tr>
<td></td><td><code>CC[S@@](C)=O</code></td><td>0</td>
</tr>
<tr>
<td></td><td><code>CC[S](C)=O</code></td><td>0</td>
</tr>
<tr><td colspan="3"><i>Either (tetrahedral)</i></td></tr>
<tr>
<td><code>CC[@,@@](C)O</code></td><td><code>CC[C@H](C)O</code></td><td>1</td>
</tr>
<tr>
<td></td><td><code>CC[C@@H](C)O</code></td><td>1</td>
</tr>
<tr>
<td></td><td><code>CCC(C)O</code></td><td>0</td>
</tr>
<tr><td colspan="3"><i>Both (tetrahedral)</i></td></tr>
<tr>
<td><code>CC[@&@@](C)O</code></td><td><code>CC[C@H](C)O</code></td><td>0</td>
</tr>
<tr>
<td></td><td><code>CC[C@@H](C)O</code></td><td>0</td>
</tr>
<tr>
<td></td><td><code>CCC(C)O</code></td><td>0</td>
</tr>
<tr><td colspan="3"><i>Respect logical precedence 1</i></td></tr>
<tr>
<td><code>CC[@,Si@@](C)O</code></td><td><code>CC[C@H](C)O</code></td><td>1</td>
</tr>
<tr>
<td></td><td><code>CC[C@@H](C)O</code></td><td>0</td>
</tr>
<tr>
<td></td><td><code>CCC(C)=O</code></td><td>0</td>
</tr>
<tr><td colspan="3"><i>Respect logical precedence 2</i></td></tr>
<tr>
<td><code>CC[C@,Si@@](C)O</code></td><td><code>CC[C@H](C)O</code></td><td>1</td>
</tr>
<tr>
<td></td><td><code>CC[C@@H](C)O</code></td><td>0</td>
</tr>
<tr>
<td></td><td><code>CCC(C)O</code></td><td>0</td>
</tr>
<tr>
<td></td><td><code>CC[Si@H](C)O</code></td><td>0</td>
</tr>
<tr>
<td></td><td><code>CC[Si@@H](C)O</code></td><td>1</td>
</tr>
<tr>
<td></td><td><code>CC[Si](C)O</code></td><td>0</td>
</tr>
<tr><td colspan="3"><i>Unspecified</i></td></tr>
<tr>
<td><code>CC[@@?](C)O</code></td><td><code>CC[C@H](C)O</code></td><td>0</td>
</tr>
<tr>
<td></td><td><code>CC[C@@H](C)O</code></td><td>1</td>
</tr>
<tr>
<td></td><td><code>CCC(C)O</code></td><td>1</td>
</tr>
<tr>
</tr>
<tr><td colspan="3"><i>Negation</i></td></tr>
<tr>
<td><code>CC[!@](C)O</code></td><td><code>CC[C@H](C)O</code></td><td>0</td><td><code>!@@</code> is also equivalent to <code>@?</code></td>
</tr>
<tr>
<td></td><td><code>CC[C@@H](C)O</code></td><td>1</td>
</tr>
<tr>
<td></td><td><code>CCC(C)O</code></td><td>1</td>
</tr>
<tr><td colspan="3"><i>Neither (tetrahedral) using 'or unspecified'</i></td></tr>
<tr>
<td><code>CC[@?@@?](C)O</code></td><td><code>CC[C@H](C)O</code></td><td>0</td>
</tr>
<tr>
<td></td><td><code>CC[C@@H](C)O</code></td><td>0</td>
</tr>
<tr>
<td></td><td><code>CCC(C)O</code></td><td>1</td>
</tr>
<tr><td colspan="3"><i>Neither (tetrahedral) using negation</i></td></tr>
<tr>
<td><code>CC[!@!@@](C)O</code></td><td><code>CC[C@H](C)O</code></td><td>0</td>
</tr>
<tr>
<td></td><td><code>CC[C@@H](C)O</code></td><td>0</td>
</tr>
<tr>
<td></td><td><code>CCC(C)O</code></td><td>1</td>
</tr>
<tr><td colspan="3"><i>Either (geomeric)</i></td></tr>
<tr>
<td><code>C/C=C/,\C</code></td><td><code>C/C=C/C</code></td><td>1</td>
</tr>
<tr>
<td></td><td><code>C/C=C\C</code></td><td>1</td>
</tr>
<tr>
<td></td><td><code>CC=CC</code></td><td>0</td>
</tr>
<tr><td colspan="3"><i>Neither (geomeric)</i></td></tr>
<tr>
<td><code>C/C=C!/!\C</code></td><td><code>C/C=C/C</code></td><td>0</td>
</tr>
<tr>
<td></td><td><code>C/C=C\C</code></td><td>0</td>
</tr>
<tr>
<td></td><td><code>CC=CC</code></td><td>1</td>
</tr>
</tbody></table>
The last two are quite tricky (and not currently implemented) but once the atom-centric handling is correct it's a simple reduction. It's quite fun to work out so i'll <b style="font-style: italic;">leaf</b> that up to the reader.John Mayfieldhttp://www.blogger.com/profile/03062858450255240984noreply@blogger.com0tag:blogger.com,1999:blog-6131674809499098789.post-38009323785722197382014-02-19T15:59:00.000+00:002014-02-20T10:37:00.555+00:00CDK now built using Maven<div>
At 13 years and 4 months the <a href="https://sourceforge.net/projects/cdk/">Chemistry Development Kit</a> (CDK) is reasonably mature for a software project. Over the years there have been many changes in development practices as the code base evolved. This post is a departure for the usual algorithms and performance tests and looks at a recent and major change in the CDK development process.<br />
<br />
On Monday, Egon, Nina and I made the final alterations that changed the build system from <a href="http://ant.apache.org/">Ant</a> to <a href="http://maven.apache.org/">Maven</a>. This change has been in the works for a long time and has been suggested multiple times. The actually migration has taken about a years worth of planning.<br />
<h4>
<span style="font-weight: normal;">If you want to have a play with the new build system yourself I've put together a brief guide that also describes how to import the project into several popular IDEs - </span><a href="https://github.com/cdk/cdk/wiki/Building-CDK" style="font-weight: normal;">Building </a><span style="font-weight: normal;"><a href="https://github.com/cdk/cdk/wiki/Building-CDK">CDK</a>. The project <a href="https://github.com/cdk/cdk/blob/master/README.md">README</a> also summarises the command line usage.</span></h4>
<h4>
I download CDK releases and use it my project, what has changed?</h4>
If you are using the CDK as a dependancy you should not notice any difference. The library and bundled dependencies will still be distributed at each release. If you are also using maven then CDK module artefacts have been deployed for last few releases. These are by far the easiest way to use the library as dependency versioning is managed maven and newer releases can be automatically downloaded. Please see the project <a href="https://github.com/cdk/cdk/blob/master/README.md">README</a> for repository details.<br />
<h4>
I build the CDK source and use it my project, what has changed?</h4>
<div>
The source code is now built with maven - the <a href="https://github.com/cdk/cdk/blob/master/README.md">README</a> summarises the steps. As with releases, SNAPSHOT artefacts will be deployed to a remote repository (currently EBI).</div>
<h4>
I have modifications to the CDK that I apply, what has changed?</h4>
If your patches are Git commits then these can still be applied. Git will sort out and use the correct file locations to any modified files. If the patch creates new files these will need to be moved manually to the correct location.<br />
<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://31.media.tumblr.com/9f461496abbf6f52568d6525a619d1b9/tumblr_my5f4bwDRg1spphago1_1280.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://31.media.tumblr.com/9f461496abbf6f52568d6525a619d1b9/tumblr_my5f4bwDRg1spphago1_1280.png" height="345" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">CDK Modules in Dec 2013 - Egon W, <a href="http://egonw.tumblr.com/post/70669027045/current-overview-of-cdk-module-dependencies-in">Bits of Blah</a></td></tr>
</tbody></table>
<br />
<h4>
Existing project structure</h4>
Prior to Monday the project code was organised under a single root folder. The Ant build would then read instructions in the source code and assemble the modules during compilation. This approach allowed progressive partition the code into modules over an extended period. Without this system we would not have be able to convert to maven at all.<br />
<br />
This system was customised and specific to the CDK which, in my opinion, made it a significant barrier to contributions. I know that personally I struggled to understand what was going on at compile time. A highly customised build process makes it not only difficult for a human to comprehend but also any automated tooling (<a href="http://en.wikipedia.org/wiki/Integrated_development_environment">Integrated Development Environments</a>, IDEs). Superficial support has been provided for Eclipse and Netbeans editors but neither correctly interrupted the modules and relationships between them.<br />
<h4>
Separate source trees</h4>
The most noticeable difference in the project is each module now has a separate source tree. This allows easier reasoning about the contents of module and provides a visual cue about the modular structure. Below we can see the 'cip' (Cahn-Ingold-Prelog) module source tree only contains the classes relevant to the module. Separate source trees are not specific to maven and we could still use Ant. The main benefit is that Maven supports and encourages this kind of structuring by default.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh8hbk7E9FW6M0oxNf7-Z-k3KY81nmJXw7_CTR8rt6yIgzg4L5COojhgK-kcVfrnmRoHS_tT6SO8-3L6iLMtZRE8VWvmXnTCkJlqUCGEVkVUVqpJPQimoWCreyw-PKgqmCgC5_exxXMjp0/s1600/Screen+Shot+2014-02-18+at+21.17.20.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh8hbk7E9FW6M0oxNf7-Z-k3KY81nmJXw7_CTR8rt6yIgzg4L5COojhgK-kcVfrnmRoHS_tT6SO8-3L6iLMtZRE8VWvmXnTCkJlqUCGEVkVUVqpJPQimoWCreyw-PKgqmCgC5_exxXMjp0/s1600/Screen+Shot+2014-02-18+at+21.17.20.png" height="400" width="331" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Source code in the CIP module</td></tr>
</tbody></table>
There is still more work to do on the module organisation, for example, CMLWritier is the the 'libiocml' module whilst the CMLReader is the 'io' module. The modules are mainly organised by their dependancies but in future it may be beneficial to organise by function. Normally classes with similar dependencies have similar function but this isn't always the case. An example of this is seen with the <b>LINGOFingerprinter</b> and <b>SignatureFingerprinter</b> in the 'smiles' and 'signatures' modules rather than the 'fingerprint' module. </div>
<div>
<h4>
Super modules</h4>
<div>
The Maven build also allows us to define groupings of the existing modules. These intermediate modules group the code base in a few digestible sections. You can see these groupings at the root level in the repository - <a href="https://github.com/cdk/cdk/">https://github.com/cdk/cdk/</a>. There are five groups and an additional <b>misc/ </b>module for the left overs. I'm planning to write a more in depth guide for the wiki but here is a quick overview of what is present in each.</div>
<div>
<ul>
<li><b>base/</b> <i>- API and implementations of domain objects and central algorithms to handling chemical information</i></li>
<li><b>descriptor/</b> - <i>fingerprinters, qsars and signatures for describing and characterising attributes of a compound.</i></li>
<li><b>storage/</b> <i>- reading and writing of chemical compounds from multiple file formats</i></li>
<li><b>tool/</b> <i>- structure diagram generation, smarts, smsd, hashcodes, tautomer and tools that either answer a question directly, manipulate input or compute intrinsic properties</i></li>
<li><b>display/</b> - <i>rendering of 2D depictions</i></li>
</ul>
</div>
<br />
<br />
<br /></div>
<ul><ul>
</ul>
</ul>
John Mayfieldhttp://www.blogger.com/profile/03062858450255240984noreply@blogger.com0tag:blogger.com,1999:blog-6131674809499098789.post-1306566501267034952014-02-13T18:58:00.000+00:002014-02-14T09:42:33.486+00:00Animated Algorithm: Canonical Tautomer AssignmentEffectively understanding an algorithm with only a description is difficult. Reading source code, possibly more so. Although these approaches explain the finer details, invariants and proofs, a higher level view offers clarity. A great example of this is seen at McKay's and Piperno's canonical labelling site, <a href="http://pallini.di.uniroma1.it/SearchTree.html">The Search Tree</a>.<br />
<br />
<i>Tautomers are constitutional isomers of organic compounds that rapidly interconvert </i>(<a href="http://en.wikipedia.org/wiki/Tautomer">Wikipedia</a>). The most common form, is the relocation of a proton. Many computer representations are tautomer specific and distinguish different tautomeric forms of a compound. They would for example not have the same unique SMILES.<br />
<br />
There are several approaches and algorithms for handling tautomers (Warr W, 2010). At the Daylight EuroMUG99, Roger Sayle and Jack Delany presented an algorithm for enumerating and assigning a unique tautomer (Sayle R and Delany J, 1999). From <a href="http://www.daylight.com/meetings/emug99/Delany/taut_html/sld020.htm">slide 20</a> in that presentation, there is a very nice step wise example on guanine.<br />
<br />
This afternoon I had fun creating a animation for algorithm using an implementation I wrote last year. I've used a more complicated example that emphasises the backtracking when an incorrect assignment is made.<br />
<br />
Being a large compound, I didn't include the keto-enol types. Also I had to modify the order (normally the nitrogens would be assigned first) to allow it be watchable in reasonable time. Each proton is placed and the hetroatoms become either a donor (green) or acceptor (blue). At several points it attempts to place two protons in each of the five membered rings. After updating adjacent atoms the mistake is identified and the assignment backtracks setting one as an proton acceptor.<br />
<br />
Be sure to set the HD option.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='756' height='425' src='https://www.youtube.com/embed/N4px9SsbJ1I?feature=player_embedded' frameborder='0'></iframe></div>
<br />
<h4>
<span style="font-weight: normal;">Warr W (2010) shows an example labelled as "hidden tautomers". This compound presents an interesting challenge and a nice demonstration. In total there are 68 tautomers generated.</span></h4>
<div>
<span style="font-weight: normal;"><br /></span></div>
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='756' height='425' src='https://www.youtube.com/embed/uEBzeYaXM0A?feature=player_embedded' frameborder='0'></iframe></div>
<div>
<span style="font-weight: normal;"><br /></span></div>
<h4>
References</h4>
<br />
<span style="font-size: x-small;">Warr WA. Tautomerism in chemical information management systems. J Comput Aided Mol Des. 24(6-7):497-520. 2010</span><br />
<span style="font-size: x-small;">Also presented at the ChemAxon 2010 UGM - http://www.youtube.com/watch?v=1C-RTD4DAJ8</span><br />
<span style="font-size: x-small;"><br /></span>
<span style="font-size: x-small;">Sayle RA and Delany JJ. Canonicalization and enumeration of tauomers. Presented at EuroMUG99, Cambridge, UK, 28-29 Oct 1999</span><br />
<br />
<br />John Mayfieldhttp://www.blogger.com/profile/03062858450255240984noreply@blogger.com0