Wednesday, 19 February 2014

CDK now built using Maven

At 13 years and 4 months the Chemistry Development Kit (CDK) is reasonably mature for a software project. Over the years there have been many changes in development practices as the code base evolved. This post is a departure from the usual algorithms and performance tests and looks at a recent and major change in the CDK development process.

On Monday, Egon, Nina and I made the final alterations that changed the build system from Ant to Maven. This change has been in the works for a long time and has been suggested multiple times. The actual migration has taken about a year's worth of planning.

If you want to have a play with the new build system yourself I've put together a brief guide that also describes how to import the project into several popular IDEs - Building CDK. The project README also summarises the command line usage.

I download CDK releases and use them in my project, what has changed?

If you are using the CDK as a dependency you should not notice any difference. The library and bundled dependencies will still be distributed at each release. If you are also using Maven then CDK module artefacts have been deployed for the last few releases. These are by far the easiest way to use the library as dependency versioning is managed by Maven and newer releases can be automatically downloaded. Please see the project README for repository details.

I build the CDK source and use it in my project, what has changed?

The source code is now built with Maven - the README summarises the steps. As with releases, SNAPSHOT artefacts will be deployed to a remote repository (currently EBI).

I have modifications to the CDK that I apply, what has changed?

If your patches are Git commits then these can still be applied. Git will sort out and use the correct file locations for any modified files. If the patch creates new files these will need to be moved manually to the correct location.

CDK Modules in Dec 2013 - Egon W, Bits of Blah

Existing project structure

Prior to Monday the project code was organised under a single root folder. The Ant build would then read instructions in the source code and assemble the modules during compilation. This approach allowed progressive partitioning of the code into modules over an extended period. Without this system we would not have been able to convert to Maven at all.

This system was customised and specific to the CDK which, in my opinion, made it a significant barrier to contributions. I know that personally I struggled to understand what was going on at compile time. A highly customised build process is difficult not only for a human to comprehend but also for automated tooling such as Integrated Development Environments (IDEs). Superficial support was provided for the Eclipse and NetBeans editors but neither correctly interpreted the modules and the relationships between them.

Separate source trees

The most noticeable difference in the project is that each module now has a separate source tree. This allows easier reasoning about the contents of a module and provides a visual cue about the modular structure. Below we can see the 'cip' (Cahn-Ingold-Prelog) module source tree only contains the classes relevant to the module. Separate source trees are not specific to Maven and we could still use Ant. The main benefit is that Maven supports and encourages this kind of structuring by default.

Source code in the CIP module

There is still more work to do on the module organisation. For example, CMLWriter is in the 'libiocml' module whilst the CMLReader is in the 'io' module. The modules are mainly organised by their dependencies but in future it may be beneficial to organise by function. Normally classes with similar dependencies have similar function but this isn't always the case. An example of this is seen with the LINGOFingerprinter and SignatureFingerprinter, which are in the 'smiles' and 'signatures' modules rather than the 'fingerprint' module.

Super modules

The Maven build also allows us to define groupings of the existing modules. These intermediate modules group the code base into a few digestible sections. You can see these groupings at the root level in the repository - https://github.com/cdk/cdk/. There are five groups and an additional misc/ module for the leftovers. I'm planning to write a more in-depth guide for the wiki but here is a quick overview of what is present in each.

base/ - API and implementations of domain objects and central algorithms for handling chemical information
descriptor/ - fingerprinters, QSAR descriptors and signatures for describing and characterising attributes of a compound
storage/ - reading and writing of chemical compounds from multiple file formats
tool/ - structure diagram generation, SMARTS, SMSD, hash codes, tautomers and tools that either answer a question directly, manipulate input or compute intrinsic properties
display/ - rendering of 2D depictions

Thursday, 13 February 2014

Animated Algorithm: Canonical Tautomer Assignment

Effectively understanding an algorithm with only a description is difficult. Reading source code, possibly more so. Although these approaches explain the finer details, invariants and proofs, a higher level view offers clarity. A great example of this is seen at McKay's and Piperno's canonical labelling site, The Search Tree.

Tautomers are constitutional isomers of organic compounds that rapidly interconvert (Wikipedia). The most common form is the relocation of a proton. Many computer representations are tautomer specific and distinguish different tautomeric forms of a compound. They would, for example, not have the same unique SMILES.

There are several approaches and algorithms for handling tautomers (Warr W, 2010). At the Daylight EuroMUG99, Roger Sayle and Jack Delany presented an algorithm for enumerating and assigning a unique tautomer (Sayle R and Delany J, 1999). From slide 20 in that presentation, there is a very nice stepwise example on guanine.

This afternoon I had fun creating an animation of the algorithm using an implementation I wrote last year. I've used a more complicated example that emphasises the backtracking when an incorrect assignment is made.

Being a large compound, I didn't include the keto-enol types. Also I had to modify the order (normally the nitrogens would be assigned first) to make it watchable in a reasonable time. Each proton is placed and the heteroatoms become either a donor (green) or acceptor (blue). At several points it attempts to place two protons in each of the five-membered rings. After updating adjacent atoms the mistake is identified and the assignment backtracks, setting one as a proton acceptor.

Be sure to set the HD option.

Warr W (2010) shows an example labelled as "hidden tautomers". This compound presents an interesting challenge and a nice demonstration. In total there are 68 tautomers generated.

References

Warr WA. Tautomerism in chemical information management systems. J Comput Aided Mol Des. 24(6-7):497-520. 2010
Also presented at the ChemAxon 2010 UGM - http://www.youtube.com/watch?v=1C-RTD4DAJ8

Sayle RA and Delany JJ. Canonicalization and enumeration of tautomers. Presented at EuroMUG99, Cambridge, UK, 28-29 Oct 1999

Saturday, 11 January 2014

New SMILES behaviour - generating (CDK 1.5.4)

In the previous post I outlined the changes to SMILES parsing in the Chemistry Development Kit (CDK). The original plan was to have several posts detailing the changes but in the end it was more practical to put this in a single release note document (available: here). Although the key changes to SMILES generation are summarised in the release notes I thought it would be worth touching on some details.

Types of SMILES

The original API involved creating a mutable generator and then creating SMILES by invoking one of several methods.

Code 1 - Existing SMILES generation API

SmilesGenerator smigen = new SmilesGenerator();
String smi    = smigen.createSMILES(molecule);
String chismi = smigen.createChiralSMILES(molecule);

The new generator is immutable and a change in the configuration creates a new instance. There aren't many configuration options but the key point is that there is no internal state. This makes it a lot more robust.

Code 2 - New SMILES generation API

SmilesGenerator smigen = SmilesGenerator.unique()
                                        .aromatic()
                                        .withAtomClasses();
smigen.createSMILES(molecule); // now deprecated
smigen.create(molecule);

The generator can now create several different types of SMILES. The naming follows the Daylight specification.

	non-canonical	canonical
no isotopes / stereo	generic	unique
with isotopes / stereo	isomeric	absolute

CDK previously used the term chiral SMILES to refer to SMILES with stereochemistry. The term isomeric SMILES is now used and aligns with other toolkits (Daylight, RDKit, ChemAxon and OEChem).
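To make the naming table and the "configuration returns a new instance" idea concrete, here is a minimal sketch in plain Java. The class `SmilesFlavour` and its methods are hypothetical and are not part of the CDK API; the sketch only shows how an immutable options object can map the two choices (canonical or not, stereo/isotopes or not) onto the four Daylight names.

```java
/** Hypothetical illustration, not the CDK SmilesGenerator. */
final class SmilesFlavour {
    private final boolean canonical; // is the atom order canonicalised?
    private final boolean stereo;    // are isotopes / stereo written?

    private SmilesFlavour(boolean canonical, boolean stereo) {
        this.canonical = canonical;
        this.stereo = stereo;
    }

    /** Starting point: non-canonical, no isotopes / stereo. */
    static SmilesFlavour generic() {
        return new SmilesFlavour(false, false);
    }

    /** Returns a NEW instance; the receiver is never modified. */
    SmilesFlavour canonical() {
        return new SmilesFlavour(true, stereo);
    }

    /** Returns a NEW instance; the receiver is never modified. */
    SmilesFlavour withStereo() {
        return new SmilesFlavour(canonical, true);
    }

    /** Name of this flavour following the table above. */
    String daylightName() {
        if (canonical) return stereo ? "absolute" : "unique";
        return stereo ? "isomeric" : "generic";
    }

    public static void main(String[] args) {
        SmilesFlavour base = SmilesFlavour.generic();
        SmilesFlavour abs  = base.canonical().withStereo();
        System.out.println(base.daylightName()); // generic (unchanged)
        System.out.println(abs.daylightName());  // absolute
    }
}
```

Because every method returns a fresh object, an instance can be shared between threads or stored in a constant without any risk that another caller reconfigures it.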

Unique and Absolute SMILES

The equitable refinement published by Daylight (Weininger et al. 1988) has been optimised and is used to produce the unique SMILES. Generation of absolute SMILES is trickier and currently uses the InChI algorithm. I believe this idea was first suggested by Andrew Dalke (inchi-discuss archive). In 2012, Noel O'Boyle published and implemented (in Open Babel) procedures for generating Universal and Inchified SMILES (O'Boyle 2012).

The procedure used for the CDK absolute SMILES follows the Universal SMILES rules. There may still be some small differences in the implementation so I'm hesitant to declare it a Universal SMILES implementation before it has been validated. There are also some problems, such as delocalised charges, that would be useful to handle.

Performance

Generally the handling of SMILES in the CDK is much more robust, with hydrogen counts and stereochemistry correctly round-tripped. In addition to this the performance is also much better. This is most noticeable when using the non-canonical outputs and only storage is needed, but it is also true of the canonical output.

The following table summarises the time taken to read and generate canonical SMILES for three different size datasets. The times are one-off measurements on a cold JVM. The number in brackets is the number of compounds which caused an error. Because it uses the InChI, the absolute SMILES cannot encode unknown atoms (e.g. CC*), which are ubiquitous in ChEBI (ontology classes).

Data	Compounds	CDK 1.4.15	CDK 1.5.4 unique	CDK 1.5.4 absolute
ChEBI 108	~27,000	54 s	5 s	23 s (1251)
NCI Aug '00	~250,000	2 m 8 s	14 s	2 m 12 s (184)
ChEMBL 17	~1,300,000	23 m 48 s (11)	1 m 38 s	15 m 26 s (36)

Wednesday, 18 December 2013

New SMILES behaviour - parsing (CDK 1.5.4)

It's been a while since the last CDK development release (over 4 months). The CDK 1.5.4 will be released in the next few days and has a lot of improved functionality. This is the first of several posts expanding on the functionality and detailing some design decisions.

One major overhaul has been the handling of the SMILES format. Mainly, tetrahedral and double-bond stereochemistry is read and written but there'll be more on that in the release note. In general the SMILES parsing API is the same as in older versions but there are some behaviour changes which are worth highlighting to avoid possible "gotchas" with SMILES.

The previous parser behaviour was to create a structure, perform atom typing and add hydrogens using the CDK atom types. Bonds in aromatic SMILES were left as single bonds with aromatic flags. Aromaticity was optionally re-perceived. Leaving aromatic bonds without correct bond orders caused problems elsewhere in the toolkit.

Atom-typing / Implicit hydrogen counts

Atom-typing is no longer performed. This mirrors the behaviour of the other CDK readers and avoids atom-typing being invoked twice. As it is a significant overhead, atom-typing should be optional (although it is often required). If you read SMILES and use a procedure that requires the CDK atom-types you must now explicitly invoke the atom-type configuration.

Although atom-types are not automatically set, the implicit hydrogen count is. The hydrogen count follows the Daylight SMILES specification allowing the formula to be precisely determined. If the atom-typing / hydrogen adding is redone with the CDK you may lose information. As is shown below, removing the automatic atom-typing allows better round-tripping.

IChemObjectBuilder     builder = SilentChemObjectBuilder.getInstance();
SmilesParser           sp      = new SmilesParser(builder);
SmilesGenerator        sg      = new SmilesGenerator();
InChIGeneratorFactory  igf     = InChIGeneratorFactory.getInstance();
        
IAtomContainer m = sp.parseSmiles("[O]");
System.out.println(sg.create(m));                         // [O]
System.out.println(igf.getInChIGenerator(m).getInchi());  // InChI=1S/O
        
// configure atom types
AtomContainerManipulator.percieveAtomTypesAndConfigureAtoms(m);
CDKHydrogenAdder.getInstance(builder).addImplicitHydrogens(m);

System.out.println(sg.create(m));                         // O ([OH2])
System.out.println(igf.getInChIGenerator(m).getInchi());  // InChI=1S/H2O/h1H2

Kekulé representation

Reading an aromatic SMILES will now provide a structure with defined bond orders. The kekulisation is efficient and correctly handles some tough cases. Example inputs and depictions are shown below. There will be a future post describing the algorithm in detail.

Input	Depiction
`c1(c)c(c)c(c)c(c)c(c)c1(c)`
`c1cccc1c1cccc1`
`oc1ccc(o)cc1`
`oc1ccocc1`
`oc(cc)co`

If a structure could not be kekulised it is considered invalid and a checked exception is thrown. The most common cause of errors is omitting a hydrogen present on an aromatic nitrogen (or phosphorus).

Input	Depiction
`n1cccc1`	Error - could not kekulise
`[nH]1cccc1`
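Why the first input fails can be seen by treating kekulisation as a pairing problem: every aromatic atom that still needs a double bond must be paired with exactly one neighbour that also needs one. The sketch below is a deliberately naive backtracking search written for illustration; it is not the CDK algorithm, and the `needsDouble` flags are supplied by hand rather than derived from valence rules. In `n1cccc1` all five ring atoms need a double bond and an odd number of atoms cannot be paired off, whereas in `[nH]1cccc1` the nitrogen is already saturated and the four carbons pair up.

```java
/** Illustrative sketch only: kekulisation as a perfect-matching search. */
class KekuleSketch {

    /** True if every flagged atom can be paired with one flagged neighbour. */
    static boolean canKekulise(int[][] adj, boolean[] needsDouble) {
        return assign(adj, needsDouble, new boolean[adj.length], 0);
    }

    private static boolean assign(int[][] adj, boolean[] needsDouble,
                                  boolean[] matched, int v) {
        // skip atoms that do not need a double bond or already have one
        while (v < adj.length && (!needsDouble[v] || matched[v])) v++;
        if (v == adj.length) return true; // every atom is satisfied
        matched[v] = true;
        for (int w : adj[v]) {
            if (needsDouble[w] && !matched[w]) {
                matched[w] = true;        // try a double bond v=w
                if (assign(adj, needsDouble, matched, v + 1)) return true;
                matched[w] = false;       // undo and try the next neighbour
            }
        }
        matched[v] = false;
        return false;                     // v could not be paired
    }

    /** Adjacency list of a simple ring with n atoms. */
    static int[][] ring(int n) {
        int[][] adj = new int[n][];
        for (int i = 0; i < n; i++)
            adj[i] = new int[]{(i + n - 1) % n, (i + 1) % n};
        return adj;
    }

    public static void main(String[] args) {
        // n1cccc1: all five atoms need a double bond
        System.out.println(canKekulise(ring(5),
                new boolean[]{true, true, true, true, true}));   // false
        // [nH]1cccc1: atom 0 (nH) is already saturated
        System.out.println(canKekulise(ring(5),
                new boolean[]{false, true, true, true, true}));  // true
    }
}
```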

Why are invalid SMILES not "fixed"?

The decision to consider such SMILES invalid may be controversial. The primary issue is that there is no guaranteed way to unambiguously handle the input such that the intended structure is obtained.

To kekulise the pyrrole example, n1cccc1, we would need to either change the formula of the structure by adding a hydrogen ([nH]1cccc1) or assign a negative charge to the nitrogen ([n-]1cccc1). Both alter the SMILES string by the same number of edits, so neither is preferred on that basis, and such decisions can lead to ambiguities.

In the case of pyrrole the decision seems obvious and we would arrive at the correct representation. Most would favour adding a hydrogen rather than modifying the charge (although positive charges are frequently assigned to 4 valent nitrogens in normalisation). In general the placement of the hydrogen is not always as trivial.

Fixarom is a program available in the Daylight Contrib that attempts to modify SMILES to allow parsing (kekulisation). In effect it tries every possible combination of hydrogen placements and is how most resolutions would work. Apart from the obvious inefficiency the structure we end up with will be different depending on the input order. This is because SMILES is tautomer specific, that is, different tautomers have different SMILES.

Input	Depiction
`CC(C)C1=CC=CC2=C1NC=N2` or `CC(C)c1cccc2c1[nH]cn2`
`CC(C)C1=CC=CC2=C1N=CN2` or `CC(C)c1cccc2c1nc[nH]2`

I'm a huge fan of the Open Babel ready-to-use programs and use them regularly. Recently I found that the obabel utility will automatically add hydrogens to aromatic nitrogens. As seen below, the input only differs in the atom order. We generate two different canonical SMILES because the output structures are different tautomers.

obabel -:'CC(C)c1cccc2ncnc12' -ocan
CC(c1cccc2c1nc[nH]2)C 
1 molecule converted
obabel -:'CC(c1cccc2c1ncn2)C' -ocan
CC(c1cccc2c1[nH]cn2)C
1 molecule converted

It would be possible to assign a single representation by choosing the canonical tautomer. There is however no guarantee that the representation is the same as was used to generate the original SMILES.

Rather than the choice of hydrogen placement it may be that there really should be no hydrogen placed. CHEBI:32730 which currently has the SMILES NC(Cc1c[n]c2ccccc12)C(O)=O serves as a good example. Following our decision on pyrrole we would incorrectly add a hydrogen to the aromatic nitrogen when it was actually a radical. We could use the brackets as a trigger to indicate that there really should be no hydrogen but then we still can't assign bond orders. I believe this error in ChEBI is actually due to the incorrect use of the 'loose' instead of 'general' aromaticity option in Marvin. Again, we can obtain two different structures depending on our interpretation. In this case it is the less likely one which is correct.

Input	Depiction
`NC(Cc1c[n]c2ccccc12)C(O)=O` (interpreted as `NC(Cc1c[nH]c2ccccc12)C(O)=O`)
`NC(Cc1c[n]c2ccccc12)C(O)=O` (interpreted as `NC(Cc1c[N]c2ccccc12)C(O)=O`)

As you can see, resolving missing hydrogens in aromatic SMILES is ambiguous. In practice, provided the SMILES were generated correctly, the number of invalid structures is small. There are fewer than 40 invalid inputs between ChEBI and ChEMBL and generally they involve organometallics incorrectly represented in SMILES with non-covalent bonds specified as covalent.

If you still really want to parse these structures then 'kekulise' can be set to false but this is not recommended.

IChemObjectBuilder builder   = SilentChemObjectBuilder.getInstance();
SmilesParser       sp        = new SmilesParser(builder);
IAtomContainer     container = sp.parseSmiles("n1cccc1");  // throws an exception (cannot kekulise)
sp.kekulise(false);                                        // turn off kekulisation
container = sp.parseSmiles("n1cccc1");                     // now creates an AtomContainer

Aromaticity re-perception

Previously the option setPreservingAromaticity could be set to return atoms and bonds with the aromaticity flags matching the input. By default the option was off and the CDKHueckelAromaticityDetector was automatically invoked. The new default behaviour is to preserve the aromatic specification in the input (even if the input is not aromatic). With the bond orders now set we can safely apply a new aromaticity model if desired or simply remove the aromatic flags. The functionality followed from the removal of automatic atom-typing. The option, setPreservingAromaticity, has been deprecated.

IChemObjectBuilder builder   = SilentChemObjectBuilder.getInstance();
SmilesParser       sp        = new SmilesParser(builder);

// cyclobuta-1,3-diene is not aromatic but has been specified as such
IAtomContainer container = sp.parseSmiles("c1ccc1");

// remove flags -> molecule is still valid thanks to kekulisation
for (IAtom a : container.atoms()) 
    a.setFlag(CDKConstants.ISAROMATIC, false);
for (IBond b : container.bonds()) 
    b.setFlag(CDKConstants.ISAROMATIC, false);

Ring closure bond type

A SMILES ring closure may optionally have an explicit bond type. As recommended by OpenSMILES an explicit symbol should only be written on either the opening or the closure. Writing the symbol on both the ring opening and closure is ambiguous.

A violation of this is sometimes found with up/down directional labels in double-bond stereochemistry. The bond labels are relative and reverse sense on ring closures. Such specification obviously occurs in rings but we can mock an example to show that C/1.C/C=C\1 is correct and C/1.C/C=C/1 is incorrect. The new SMILES parser will not accept the second input and highlights where the problem was.

InvalidSMILESException: could not parse 'C/1.C/C=C/1', Ring closure bonds
did not match. Ring was opened with '/' and closed with '\'. Note -
directional bonds ('/','\') are relative.
C/1.C/C=C/1
         ^
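The rule behind this error can be sketched in a few lines of plain Java. This is not the CDK parser: it only scans single-digit ring numbers outside brackets and ignores `%nn` closures. It also assumes that the same non-directional symbol on both ends is tolerated while conflicting symbols are rejected; the text above only states that writing both is ambiguous. Directional labels are treated as relative, so a ring opened with '/' must be closed with '\' (or with nothing).

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only: consistency of ring-closure bond symbols. */
class RingBondCheck {

    static boolean isBond(char c) {
        return "-=#$:/\\".indexOf(c) >= 0;
    }

    /** A symbol of 0 means "no explicit symbol at this end". */
    static boolean compatible(char open, char close) {
        if (open == 0 || close == 0) return true; // only one end is labelled
        if (open == '/')  return close == '\\';   // directional labels flip
        if (open == '\\') return close == '/';
        return open == close;                     // assumption: same symbol is tolerated
    }

    static boolean ringBondsConsistent(String smi) {
        Map<Character, Character> opened = new HashMap<>();
        boolean inBracket = false;
        for (int i = 0; i < smi.length(); i++) {
            char c = smi.charAt(i);
            if (c == '[') inBracket = true;
            else if (c == ']') inBracket = false;
            else if (!inBracket && Character.isDigit(c)) {
                // bond symbol written directly before the ring number, if any
                char bond = (i > 0 && isBond(smi.charAt(i - 1))) ? smi.charAt(i - 1) : 0;
                if (opened.containsKey(c)) {
                    if (!compatible(opened.remove(c), bond)) return false;
                } else {
                    opened.put(c, bond);
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(ringBondsConsistent("C/1.C/C=C\\1")); // true
        System.out.println(ringBondsConsistent("C/1.C/C=C/1"));  // false
        System.out.println(ringBondsConsistent("C=1CCCC#1"));    // false
    }
}
```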

I decided to highlight this as during testing I encountered a couple of structures with this issue. CHEMBL1315255 currently has the SMILES shown below; the incorrect labels are on the ring closures numbered 3, both of which are written with '/'.

Cl.Cl.Cc1cccc2\C(=C/3\\Nc4ccccc4/C/3=N\OCCN5CCNCC5)\C(=O)Nc12 1315255

Generally if a parser accepts this input it may be choosing to use either the ring opening or the ring closure. A simple test to determine whether the opening or closing is used is to parse C=1CCCC#1 and C#1CCCC=1. The choice of whether to prefer the open or close label differs between toolkits and I don't think there is a correct answer.

I would like to thank the CDK developers, EBI colleagues and CCN attendees for their valuable input and discussions. I have previously notified the ChEBI and ChEMBL curators of the problematic SMILES identified.

I'll leave you with a fun and seasonal example of valid SMILES input.

HoHoHo Merry Christmas (Daylight DEPICT)

Monday, 11 November 2013

Improved Substructure Matching

Since it was written the UniversalIsomorphismTester (UIT) has been the go-to class in the Chemistry Development Kit (CDK) to check structure identity, substructure and overlap matching. The utility works really well but derives the substructure from the more expensive MCS computation. There have been several efforts to bring faster structure matching through the Ullmann [1] and Vento-Foggia (VF) [2] algorithms. Most notably VF was used for structure matching in OrChem [3] and SMSD [4]. The original SMSD code was added to CDK but since then SMSD has been completely rewritten and optimised. I recently tried reintegrating the latest version but have had several failed attempts.

Substructure and identity checking is fundamental and I felt it needed a more centralised home in the code base. This weekend I wrote the first parts of the API and implementations which I will discuss here.

Lazy Matching

For brevity I won't go over how substructure matching works. Rich Apodaca has two excellent introductions, One of These Things is Not Like The Others and Substructure Search From Scratch, which give the key ideas. In the first post it is mentioned that recursion is at the heart of any implementation. Generally recursive implementations are one-or-all, that is you can get the first match or all matches.

As recursion simply stores frames on the call stack, you can easily replace it with an explicit (non-recursive) stack which stores only the candidate matchings. Providing front end access to this allows a lazy substructure matcher, which reduces memory use and has another benefit we'll see later.

So far I've added three matchers: VentoFoggia (substructure and identity) and Ullmann (substructure), with scope for others [5].
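The idea of swapping recursion for an explicit stack can be sketched without any toolkit. The class below is a hypothetical illustration using plain backtracking over adjacency matrices; it is neither the Ullmann nor the VF algorithm and is not the CDK code. The partial mapping array doubles as the stack, so the search can stop after each complete mapping and resume from the same state when the next one is requested.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Illustrative only: lazily enumerates subgraph matches with an explicit stack. */
class LazyMatcher implements Iterator<int[]> {
    private final boolean[][] q, t; // adjacency matrices of query and target
    private final int[] map;        // map[i] = target vertex for query vertex i (-1 = none)
    private final boolean[] used;   // target vertices currently in the mapping
    private int depth = 0;          // query vertex being assigned
    private int[] buffered;         // next mapping to hand out, null when exhausted

    LazyMatcher(boolean[][] query, boolean[][] target) {
        q = query;
        t = target;
        map = new int[q.length];
        Arrays.fill(map, -1);
        used = new boolean[t.length];
        buffered = advance();       // only the first match is computed up front
    }

    // c is allowed for query vertex d if it is free and every query edge to an
    // already mapped vertex is also present in the target
    private boolean feasible(int d, int c) {
        if (used[c]) return false;
        for (int j = 0; j < d; j++)
            if (q[d][j] && !t[c][map[j]]) return false;
        return true;
    }

    // resume the search where it stopped and run until the next complete mapping
    private int[] advance() {
        while (depth >= 0) {
            if (depth == q.length) {              // all query vertices assigned
                depth--;                          // next call resumes at the last vertex
                return map.clone();
            }
            int c = map[depth];
            if (c >= 0) used[c] = false;          // release the previous choice
            do { c++; } while (c < t.length && !feasible(depth, c));
            if (c < t.length) { map[depth] = c; used[c] = true; depth++; }
            else              { map[depth] = -1; depth--; }   // backtrack
        }
        return null;
    }

    public boolean hasNext() { return buffered != null; }

    public int[] next() {
        if (buffered == null) throw new NoSuchElementException();
        int[] result = buffered;
        buffered = advance();
        return result;
    }

    static boolean[][] graph(int n, int[][] edges) {
        boolean[][] g = new boolean[n][n];
        for (int[] e : edges) g[e[0]][e[1]] = g[e[1]][e[0]] = true;
        return g;
    }

    static int count(boolean[][] query, boolean[][] target) {
        int n = 0;
        for (LazyMatcher m = new LazyMatcher(query, target); m.hasNext(); m.next()) n++;
        return n;
    }

    public static void main(String[] args) {
        boolean[][] ring3 = graph(3, new int[][]{{0, 1}, {1, 2}, {0, 2}});
        System.out.println(count(ring3, ring3)); // 6, as for C1CC1 matched on itself
    }
}
```

A caller that only wants to know whether a match exists calls `hasNext()` once and never pays for the remaining mappings.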

API

The new structure matching paradigm is similar to Java regular expressions - a Pattern is created for a query and then that pattern can be tested on multiple targets. Whether the pattern is matching SMARTS queries, identity or substructure doesn't matter. Personally I've found myself passing the query and target molecules in the wrong order with the existing matchers, and using the pattern makes it impossible to get the order wrong.

Code 1 - Matching API predicate

IAtomContainer query   = ...;
Pattern        pattern = Ullmann.findSubstructure(query);
for (IAtomContainer target : targets) 
    if (pattern.matches(target))
        hits++;
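For comparison, this is the regular expression idiom the API is modelled on, using only `java.util.regex` from the standard library. The strings here are matched as plain text, not as chemical structures; the point is only the shape of the calls: the query is compiled once and the resulting pattern is then applied to each target.

```java
import java.util.regex.Pattern;

/** The java.util.regex idiom that the matching API mirrors (text matching only). */
class RegexAnalogy {

    static int countHits(String regex, String[] targets) {
        Pattern pattern = Pattern.compile(regex);   // built once from the query
        int hits = 0;
        for (String target : targets)
            if (pattern.matcher(target).find())     // applied to every target
                hits++;
        return hits;
    }

    public static void main(String[] args) {
        String[] targets = {"CCO", "c1ccccc1", "CC(=O)O"};
        System.out.println(countHits("O", targets)); // 2
    }
}
```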

With language features due in Java 8 the matching hits can even be done in one line.

List<IAtomContainer> hits = targets.stream().filter(Ullmann.findSubstructure(query)::matches).collect(Collectors.toList());

The first atom-mapping can be obtained with match() and all atom mappings with matchAll() (which is lazy).

Code 2 - Matching API, lazy loop over all matches

for (IAtomContainer target : targets) 
    for (int[] mapping : pattern.matchAll(target))
        hits++;

Counting Self Matches

To test the performance I first looked at counting self matches. You'd generally want to do this with a partition refinement procedure (see Using the CDK's group module) but it serves as a good test for the matchers. To compare the new matchers I've used the CDK's UniversalIsomorphismTester (UIT) and the latest SMSD's Substructure (VF implementation). It should be noted that SMSD does go a lot faster with an API change (see Benchmarking Substructure Search) but this has not yet been added.

Ferrocene has 200 isomorphisms to itself

The average (r=10) time taken to match and count isomorphisms (n) was tested on several symmetric structures. All structures were normalised to only have sigma bonds. The SMSD implementation did not get the correct number of matches, which I think is a bug propagated from the original ChemKit implementation. If any matcher took longer than 10 minutes I considered it as timed out (shown as n/a).

Structure	Matched	UIT	SMSD	Ullmann	VentoFoggia
Structure	n	t (s)	t (s)	t (s)	t (s)
ferrocene	200	0.281	0.166 (n=180?)	0.007	0.001
fullerene C60	120	279	7.5 (n=1)	1.2	0.017
fullerene C70	60	n/a	9.8 (n=1)	2	0.058
fullerene C320	120	n/a	n/a	n/a	0.527

VentoFoggia only took half a second to find all matches in fullerene C320. I was curious to see how far it could go. The new matchers allow disconnected structures, and by matching multiple cyclopropanes we quickly reach a huge number of matches.

C1CC1.C1CC1.C1CC1.C1CC1.C1CC1.C1CC1 has more than 33 million isomorphisms

Structure	Matched	Ullmann	VentoFoggia
(SMILES)	n	t (s)	t (s)
`C1CC1`	6	0	0
`C1CC1.C1CC1`	72	0.002	0.002
`C1CC1.C1CC1.C1CC1`	1,296	0.002	0.005
`C1CC1.C1CC1.C1CC1.C1CC1`	31,104	0.168	0.037
`C1CC1.C1CC1.C1CC1.C1CC1.C1CC1`	933,120	6.5	1.1
`C1CC1.C1CC1.C1CC1.C1CC1.C1CC1.C1CC1`	33,592,320	270.5	36.3
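The counts in the table follow a simple formula. A single cyclopropane ring maps onto itself in 6 ways, and k disconnected copies can additionally be permuted among themselves in k! ways, giving 6^k x k! isomorphisms. A short check in plain Java (the class name is only for illustration):

```java
/** Number of self matches of k disconnected cyclopropanes: 6^k * k! */
class CyclopropaneCount {

    static long selfMatches(int copies) {
        long count = 1;
        for (int i = 1; i <= copies; i++)
            count *= 6L * i;   // 6 ways per ring, times the i-th factorial term
        return count;
    }

    public static void main(String[] args) {
        for (int k = 1; k <= 6; k++)
            System.out.println(k + " rings: " + selfMatches(k));
        // last line printed: 6 rings: 33592320
    }
}
```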

Dataset Searching

As a more general gauge of performance the NCI Aug 2000 (~250,000 structures) was searched for structures containing O=C1C=CCC=C1. It's a single isolated test so conclusions can't really be drawn but it provides a sense for the timings.

Query structure matched against ~250,000 structures in the NCI Aug 2000 data set.

The SMSD uses heuristics internally whilst the others were run without heuristics on every structure (even if the query was larger than the target). How long the matching took for each implementation is shown below. When finding hits we can find the first match in each structure (3,623 total) or find all the hits (12,706 total). The SMSD matcher is fast when finding the first match in each structure but much slower to find all matches.

Implementation	t first	t all
UIT	43.4 s	43.6 s
SMSD	28.6 s	4 m 40 s
Ullmann	6.3 s	6.5 s
VentoFoggia	5.1 s	4.9 s

Stereochemistry Matching

The new matchers also check that the tetrahedral and geometric (double-bond) stereo configurations are consistent. Technically it's more efficient to check stereo configurations as the vertices are mapped but I decided to go for a post-matching check. This is more modular and also means we can use the same stereo consistency checking for other matchers (the existing UIT). This is where the lazy matching is really useful as we can get the first match which is stereochemically valid without finding all matchings. The stereo checking also seems to be fast enough and I've left it on by default (including all times in this post).

Using the VentoFoggia implementation, it takes ~38 seconds to find all 4,306,056 matches for C(C)(C)(C)(C) in ChEMBL 17. Matching one with a stereo configuration ([C@](C)(C)(C)(C)) is actually just as quick, taking ~37 seconds to find 1,007,652 matches.
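The benefit of combining a lazy source of mappings with a post-matching check can be shown with a small stand-alone sketch. Nothing here is CDK code: the mappings are hard-coded arrays, the predicate is a stand-in for a stereo consistency check, and the `Counting` wrapper exists only to show how many candidates were actually pulled before a valid one was found.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

/** Illustrative sketch: take the first mapping that passes a post-match check. */
class FirstValidMatch {

    /** Wraps an iterator and counts how many elements were pulled from it. */
    static final class Counting<T> implements Iterator<T> {
        private final Iterator<T> inner;
        int pulled = 0;
        Counting(Iterator<T> inner) { this.inner = inner; }
        public boolean hasNext() { return inner.hasNext(); }
        public T next() { pulled++; return inner.next(); }
    }

    /** Returns the first mapping accepted by the check, or null if there is none. */
    static int[] firstValid(Iterator<int[]> mappings, Predicate<int[]> valid) {
        while (mappings.hasNext()) {
            int[] mapping = mappings.next();
            if (valid.test(mapping)) return mapping; // stop early, rest is never generated
        }
        return null;
    }

    public static void main(String[] args) {
        List<int[]> all = Arrays.asList(
                new int[]{0, 1, 2}, new int[]{0, 2, 1}, new int[]{1, 0, 2});
        Counting<int[]> lazy = new Counting<>(all.iterator());
        int[] hit = firstValid(lazy, m -> m[1] == 2); // stand-in for a stereo check
        System.out.println(Arrays.toString(hit) + " after " + lazy.pulled + " candidates");
        // prints: [0, 2, 1] after 2 candidates
    }
}
```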

Finally I'll leave you with a sneak peek at the Smiles class I have planned which makes loading and matching molecules from SMILES concise and elegant.

Pattern pattern = VentoFoggia.findSubstructure(Smiles.fromString("[C@](C)(C)(C)(C)"));
for (IAtomContainer target : Smiles.fromPath("/db/chembl/chembl_17.smi.gz")) {
    hits += Iterables.size(pattern.matchAll(target)); // Guava utils
}

Code

The matcher code is currently awaiting review on the branch feature/isomorphisms. If the branch is missing then it will be in the package org.openscience.cdk.isomorphism.

References

1. Ullmann JR. An Algorithm for Subgraph Isomorphism. Journal of the Association for Computing Machinery. 1976
2. Cordella LP et al. A (Sub)Graph Isomorphism Algorithm for Matching Large Graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2004
3. Rijnbeek M and Steinbeck C. OrChem - An open source chemistry search engine for Oracle. Journal of Cheminformatics. 2009
4. Rahman SA et al. Small Molecule Subgraph Detector (SMSD) toolkit. Journal of Cheminformatics. 2009
5. Gouda K and Hassaan M. A fast algorithm for subgraph search problem. Informatics and Systems (INFOS). 2012