The core class for molecule representation in CDK is the
AtomContainer
. The
AtomContainer
uses an edge-list data structure for storing the underlying connection table (see
The Right Representation for the Job).
Essentially this edge-list representation is efficient in space. Atoms can be shared between and belong to multiple
AtomContainer
s. Therefore querying connectivity (is this atom connected to this other atom) is linear time in the number of bonds.
The inefficiency of the
AtomContainer
can really sting. If someone was to describe
Morgan's relaxation algorithm you may implement it like Code 1. The algorithm looks reasonable however it will run much slower than you expected. You may expect the runtime of this algorithm to be ~
N2 but it's actually ~
N3. I've annotated with
XXX
where the extra effort creeps in.
// Step 1. Algorithm body
int[] prev = new int[mol.getAtomCount()];
int[] next = new int[mol.getAtomCount()];
for (int i = 0; i < mol.getAtomCount(); i++) {
next[i] = prev[i] = mol.getAtom(i).getAtomicNumber();
}
for (int rep = 0; rep < mol.getAtomCount(); rep++) { // 0..numAtoms
for (int j = 0; j < mol.getAtomCount(); j++) { // 0..numAtoms
IAtom atom = mol.getAtom(j);
// XXX: linear traversal! 0..numBonds
for (IBond bond : mol.getConnectedBondsList(atom)) {
IAtom nbr = bond.getConnectedAtom(atom);
// XXX: linear traversal! 0..numAtoms avg=numAtoms/2
next[j] += prev[mol.getAtomNumber(nbr)];
}
}o
System.arraycopy(next, 0, prev, 0, next.length);
}
A New Start: API Rewrite?
Ultimately to fix this problem correctly, would involve changing the core
AtomContainer
representation, unfortunately this would require an API change, optimally I think adding the constraint that atoms/bonds can not be in multiple molecules would be needed**. This would be a monumental change and not one I can stomach right now.
Existing Trade Off: The GraphUtil class
In 2013 I added the
GraphUtil
class for converting an
AtomContainer
to a more optimal adjacency list (
int[][]) that was subsequently used to speed up many algorithms including: ring finding, canonicalisation, and substructure searching. Each time one of these algorithm is invoked with an
IAtomContainer
the first step is to build the adjacency list 2D array.
IAtomContainer mol = ...;
int[][] adj = GraphUtil.toAdjList(mol);
// optional with lookup map to bonds
EdgeToBondMap e2b = EdgeToBondMap.withSpaceFor(mol);
int[][] adj = GraphUtil.toAdjList(mol, e2b);
Although useful the usage of
GraphUtil
is somewhat clunky requiring passing around not just the adjacency list but the original molecule and the
EdgeToBondMap
if needed.
void visit(IAtomContainer mol, int[][] adj, EdgeToBondMap bondmap, int beg, int prev) {
mol.getAtom(beg).setFlag(CDKConstants.VISITED, true);
for (int end : adjlist[beg]) {
if (end == prev)
continue;
if (!mol.getAtom(end).getFlag(CDKConstants.VISITED))
visit(mol, adj, bondmap, end, beg);
else
bondmap.get(beg, end).setIsInRing(true); // back edge
}
}
Using the
GraphUtil
approach has been successful but due to the clunky-ness I've not felt comfortable exposing the option of passing these through to public APIs. It was only ever meant as an internal optimisation to be hidden from the caller. Beyond causing unintentional poor performance (Code 1) what often happens in a workflow is
GraphUtil
is invoked multiple times. A typical use case would be matching multiple SMARTS against one
AtomContainer
.
A New Public API: Atom and Bond References
I wanted something nicer to work with and came up with the idea of using
object composition to extend the existing Atom and Bond APIs with methods to improve performance and connectivity checks.
Essentially the idea is to provide two classes, and
AtomRef
and
BondRef
that
reference a given atom or bond in a particular AtomContainer. An
AtomRef
knows about the original
atom
it's
connected bonds
and the
index
, the
BondRef
knows about the original
bond
, it's
index
and the
AtomRef
for the connected atoms. The majority of methods (e.g. setSymbol, setImplicitHydrogenCount, setBondOrder) are passed straight through to the original atom. Some methods (such as setAtom on
IBond
) are blocked as being unmodifiable.
class AtomRef implements IAtom {
IAtom atm;
int idx;
List<BondRef> bnds;
}
class BondRef implements IBond {
IBond bnd;
int idx;
AtomRef beg, end;
}
We can now re-write the Morgan-like relaxation (Code 1) using AtomRef and BondRef. The scaling of this algorithm is now ~
N2 as you would expect.
// Step 1. Initial up front conversion cost
AtomRef[] arefs = AtomRef.getAtomRefs(mol);
// Step 2. Algorithm body
int[] prev = new int[mol.getAtomCount()];
int[] next = new int[mol.getAtomCount()];
for (int i = 0; i < mol.getAtomCount(); i++) {
next[i] = prev[i] = mol.getAtom(i).getAtomicNumber();
}
for (int rep = 0; rep < mol.getAtomCount(); rep++) {
for (AtomRef aref : arefs) {
int idx = aref.getIndex();
for (BondRef bond : aref.getBonds()) {
next[idx] += prev[bond.getConnectedAtom(aref).getIndex()];
}
}
System.arraycopy(next, 0, prev, 0, next.length);
}
The depth first implementation also improves in readability and only requires two arguments.
// Step 1. Initial up front conversion cost
void visit(AtomRef beg, BondRef prev) {
beg.setFlag(CDKConstants.VISITED, true);
for (BondRef bond : beg.getBonds()) {
if (bond == prev)
continue;
AtomRef nbr = bond.getConnectedAtom(beg);
if (!nbr.getFlag(CDKConstants.VISITED))
visit(nbr, bond);
else
bond.setIsInRing(true); // back edge
}
}
Benchmark
I like the idea of exposing the
AtomRef
and
BondRef
to public APIs. I wanted to check that the trade-off in
calculating and
using the
AtomRef/BondRef vs the current internal
GraphUtil
. To test this I wrote a benchmark that implements some variants of a Depth First Search and Morgan-like algorithms. I varied the algorithm implementations and whether I used,
IAtomContainer
,
GraphUtil
, or
AtomRef
.
The performance was measured over ChEMBL 22 and averaged the run time performance over 1/10th (167,839 records). You can find the code on
GitHub (
Benchmark.java). Each algorithm computes a checksum to verify the same work is being done. Here are the raw results:
depthfirst.tsv, and
relaxation.tsv.
Depth First Traversal
A Depth first traversal is a linear time algorithm. I tested eight implementations that varied the graph data structure and whether I used an external visit array or atom flags to mark visited atoms. When looking just at initialisation time the
AtomRef
creation is about the same as
GraphUtil
. There was some variability between the different variants but I couldn't isolate where the different came from (maybe GC/JIT related). The runtime of the
AtomRef
was marginally slower than
GraphUtil
. Both were significantly faster (18-20x) than the
AtomContainer
to do the traversal. When we look at the total run-time (initialisation+traversal) we see that even for a linear algorithm, the
AtomRef
(and
GraphUtil
) were ~3x faster. Including the
EdgeToBondMap
adds a significant penalty.
Graph Relaxation
A more interesting test is a Morgan-like relaxation, as a more expensive algorithm (N
2) it should emphasise any difference between the
AtomRef
and
GraphUtil
. The variability in this algorithm is whether we relax over atoms (AtomIter - see Code 1/5) or bonds (BondIter). We see a huge variability in AtomContainer/AtomIter implementation. This is because the algorithm is more susceptible to difference in input (molecule) size.
Clearly the AtomContainer/AtomIter is really bad (~80x slower). Excluding this results shows that as expected the
AtomRef/AtomIter
is slower than the
GraphUtil/AtomIter
equivalent (~2x slower). However because the
AtomRef
has a richer syntax, we can do a trick with XOR number storage to improve performance or iterate over bonds (BondIter) giving like-for-like speeds.
Conclusions
The proposed
AtomRef
and
BondRef
provide a convenience API to use the CDK in a natural way with efficient connectivity access. The conversion to an
AtomRef
is efficient and provides a speedup even for linear algorithms. The encapsulation facilities the passing as a public API parameter, users will be able to compute it ahead of time and pass it along to multiple algorithms.
I'm somewhat tempted to provide an equivalent
AtomContainerRef
allowing a drop-in replacement for methods that take the
IAtomContainer
interface. It is technically possible to implement writes (e.g. delete bond) efficiently in which case it would no longer be a 'Ref'. Maybe I'll omit that functionality or use a better name?
Footnotes
- ** My colleague Daniel Lowe notes that OPSIN allows atoms to be in multiple molecules and know about their neighbours but it's a bit of a fudge. It's certainly possible with some extra book keeping but prevents some other optimisations from being applied.