Earlier this year I wrote up some Chemical Toolkit Rosetta examples of using the CDK in Scala (github/cdk/cdk-scala-examples). While I was writing these it sprang to mind that it would be cool to (ab)use one feature for interoperability between cheminformatics toolkits.
Scala is a statically typed functional language that runs on the Java Virtual Machine. It has some nice features and syntax that can produce very concise code. One particularly neat feature is the ability to define implicit methods. Essentially these are methods that define how to convert between types; they are implicit because the compiler can insert them automatically.
Implicit methods are very similar to the auto(un)boxing that was introduced in Java 5 to simplify conversion between primitives and their wrapper classes (Code 1).
    Integer x = 5;   // ~ Integer x = Integer.valueOf(5);
    int y = x;       // ~ int y = x.intValue();
    x = y;           // ~ x = Integer.valueOf(y);
    if (x == y) {    // ~ x.intValue() == y
    }
Much as some programming languages let you define custom operators, Scala lets you define custom conversions that are inserted at compile time. The main advantage is that APIs can be extended to accept different types without introducing extra methods.
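To make that concrete, here is a minimal toolkit-free sketch of an implicit conversion. The `Mass` type and the `isHeavy` method are hypothetical names made up for illustration; only `scala.language.implicitConversions` comes from the standard library.

```scala
import scala.language.implicitConversions

object ImplicitSketch {
  // Hypothetical wrapper type standing in for a richer API type.
  final case class Mass(daltons: Double)

  // The implicit view: the compiler may insert a call to this method
  // wherever a Double is supplied but a Mass is expected.
  implicit def doubleToMass(d: Double): Mass = Mass(d)

  // An API method that only knows about Mass.
  def isHeavy(m: Mass): Boolean = m.daltons > 500.0

  def main(args: Array[String]): Unit = {
    // Compiles as isHeavy(doubleToMass(194.19)); no overload for Double is needed.
    println(isHeavy(194.19))
    println(isHeavy(614.7))
  }
}
```

The method `isHeavy` never gained a `Double` overload, yet it can be called with one, which is the API-extension trick the rest of this post leans on.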
Conversion from line notations
Line notations are a concise means of encoding a chemical structure as a sequence of characters (String). Common examples include SMILES, InChI*, WLN, SLN, and systematic nomenclature. Conversion to and from these formats isn't too computationally expensive but probably not something you want to do on-the-fly. However, just for fun, let's see what an implicit method for converting from strings can do. First we need the specific methods for loading from a known string type. We'll use the CDK for SMILES and InChI, with OPSIN for nomenclature.
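    import org.openscience.cdk.exception.InvalidSmilesException
    import org.openscience.cdk.inchi.InChIGeneratorFactory
    import org.openscience.cdk.interfaces.IAtomContainer
    import org.openscience.cdk.silent.SilentChemObjectBuilder
    import org.openscience.cdk.smiles.{SmilesGenerator, SmilesParser}
    import uk.ac.cam.ch.wwmm.opsin.NameToStructure
    import scala.language.implicitConversions

    val bldr = SilentChemObjectBuilder.getInstance
    val sp   = new SmilesParser(bldr)
    val igf  = InChIGeneratorFactory.getInstance

    // one explicit parser per notation
    def inchipar(inchi: String) = igf.getInChIToStructure(inchi, bldr).getAtomContainer
    def cdksmipar(smi: String)  = sp.parseSmiles(smi)
    def nompar(nom: String)     = cdksmipar(NameToStructure.getInstance.parseToSmiles(nom))

    def cansmi(ac: IAtomContainer) = SmilesGenerator.unique().create(ac)
    // Universal SMILES (see O'Boyle N, 2012**)
    def unismi(ac: IAtomContainer) = SmilesGenerator.absolute().create(ac)

    // try InChI first (with or without the "InChI=" prefix), then SMILES,
    // and fall back to name-to-structure if the SMILES parser rejects the input
    implicit def autoParseCDK(str: String): IAtomContainer = {
      if (str.startsWith("InChI=")) {
        inchipar(str)
      } else if (str.startsWith("1S/")) {
        inchipar("InChI=" + str)
      } else {
        try {
          cdksmipar(str)
        } catch {
          case _: InvalidSmilesException => nompar(str)
        }
      }
    }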
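The order of the checks matters, so it can help to exercise the dispatch on its own, without the toolkits on the classpath. The sketch below is only an approximation under stated assumptions: `parseInchi`, `parseSmiles`, `parseName` and `BadSmiles` are hypothetical stubs that tag their input, and the character test is a crude stand-in for a real SMILES parser, not a validity check.

```scala
object DispatchSketch {
  // Hypothetical exception playing the role of a parser's "invalid SMILES" error.
  final class BadSmiles(msg: String) extends Exception(msg)

  // Crude stand-in for SMILES parsing: accept only a small character set.
  private val allowed: Set[Char] =
    "BCFHINOPSbcnopslr0123456789()[]=#+-@/\\%.".toSet

  // Stub parsers: each simply tags the input it was given.
  def parseInchi(s: String): String = s"inchi:$s"
  def parseSmiles(s: String): String =
    if (s.forall(allowed)) s"smiles:$s" else throw new BadSmiles(s)
  def parseName(s: String): String = s"name:$s"

  // Same decision order as the implicit method above:
  // full InChI, bare InChI, SMILES, then fall back to a name.
  def parse(str: String): String =
    if (str.startsWith("InChI=")) parseInchi(str)
    else if (str.startsWith("1S/")) parseInchi("InChI=" + str)
    else
      try parseSmiles(str)
      catch { case _: BadSmiles => parseName(str) }

  def main(args: Array[String]): Unit = {
    println(parse("caffeine"))
    println(parse("CN1C=NC2=C1C(=O)N(C)C(=O)N2C"))
    println(parse("1S/C8H10N4O2"))
  }
}
```

Because the name lookup is only reached after the SMILES attempt fails, any name that also happens to be a valid SMILES string will be read as SMILES.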
Now that the implicit method has been defined, any method in the CDK API that accepts an IAtomContainer can behave as though it accepts a linear notation. Code 3 shows how we can get the same Universal SMILES for different representations of caffeine and compute the ECFP4 fingerprint for porphyrin.
    println(unismi("caffeine"))
    println(unismi("CN1C=NC2=C1C(=O)N(C)C(=O)N2C"))
    println(unismi("InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3"))
    println(unismi("1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3"))

    val fp = new CircularFingerprinter(CLASS_ECFP4).getCountFingerprint("porphyrin")
Conversion between toolkits
It is also possible to add implicit methods that auto-convert between toolkit types. To convert between the CDK and RDKit (Java bindings) I'll go via SMILES. This conversion is lossy without an auxiliary data vector but serves as a proof of concept. I've lifted the Java bindings from the RDKit lucene project (github/rdkit/org.rdkit.lucene/) as the shared library works out of the box for me. We can also add in the from-string implicit conversions.
Code 4 shows the implicit method definitions. The additional autoParseRDKit allows us to bootstrap the RDKit API to also accept linear notations on all methods that expect an RWMol (or ROMol).
    implicit def cdk2rdkit(ac: IAtomContainer): RWMol =
      RWMol.MolFromSmiles(SmilesGenerator.isomeric.create(ac))

    // XXX: better to use non-canon SMILES
    implicit def rdkit2cdk(rwmol: RWMol): IAtomContainer =
      cdksmipar(RDKFuncs.getCanonSmiles(rwmol))

    implicit def autoParseRDKit(str: String): RWMol =
      cdk2rdkit(autoParseCDK(str))
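The explicit string-to-RWMol method is needed because the Scala compiler applies at most one implicit conversion to an expression; it will not chain String to IAtomContainer to RWMol by itself. A minimal sketch of that rule, using hypothetical stand-in types `MolA` and `MolB` rather than the real toolkit classes:

```scala
import scala.language.implicitConversions

object BridgeSketch {
  // Hypothetical stand-ins for two toolkits' molecule types.
  final case class MolA(smiles: String)
  final case class MolB(smiles: String)

  implicit def stringToA(s: String): MolA = MolA(s)
  implicit def aToB(a: MolA): MolB = MolB(a.smiles)

  // Without this composite, describe("CCO") would not compile: the compiler
  // never searches for the two-step path String => MolA => MolB.
  implicit def stringToB(s: String): MolB = aToB(stringToA(s))

  // An API method that only accepts the second toolkit's type.
  def describe(m: MolB): String = s"B(${m.smiles})"

  def main(args: Array[String]): Unit = {
    println(describe("CCO"))       // goes through stringToB
    println(describe(MolA("CCO"))) // goes through aToB
  }
}
```

Both calls print `B(CCO)`. Each extra toolkit therefore needs its own from-string method, or callers must annotate the intermediate type by hand.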
Now we can obtain the Avalon fingerprint of caffeine from its name and pass an RWMol to the CDK's PubchemFingerprinter (Code 5).
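    // Avalon fingerprint (RDKit) straight from a name
    val fp2 = new ExplicitBitVect(512)
    RDKFuncs.getAvalonFP("caffeine", fp2)

    // an RDKit molecule passed to a CDK fingerprinter
    val caffeine: RWMol = "caffeine"
    new PubchemFingerprinter(bldr).getBitFingerprint(caffeine)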
Given that auto(un)boxing primitives in Java can sting you in performance-critical code, the above examples should be used sparingly. They do serve as a fun example of what is possible, and I've put together the working code in a Scala project for others to try: github/johnmay/efficient-bits/impl-conversion.
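The cost is easy to overlook because the conversion is re-run at every call site where a string is passed. A rough sketch of that effect, with a hypothetical `Mol` type and a counter standing in for real parsing work:

```scala
import scala.language.implicitConversions

object CostSketch {
  // Hypothetical molecule type; "parsing" here just wraps the string.
  final case class Mol(smiles: String)

  // Counts how often the implicit conversion is invoked.
  var parseCount = 0

  implicit def parse(s: String): Mol = { parseCount += 1; Mol(s) }

  // An API method that expects an already-parsed molecule.
  def letters(m: Mol): Int = m.smiles.count(_.isLetter)

  def run(): Int = {
    parseCount = 0
    val text = "CCO"
    for (_ <- 1 to 3) letters(text) // converted on each of the three calls
    val mol: Mol = text             // converted once, then reused
    for (_ <- 1 to 3) letters(mol)
    parseCount
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Here `run()` returns 4: three conversions for the three calls that pass the string, and one for the typed `val`. Holding on to the converted value keeps the convenience without paying for the parse each time.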
Footnotes
- * - Technically "InChI is not a replacement for any existing internal structure representations", Heller S. ICCS. 2014 but we'll allow it.
- ** - http://www.jcheminf.com/content/4/1/22