Wednesday 19 December 2012

These go up to eleven

You're on ten ... Where can you go? Where? ... Nowhere. Exactly. What we do is, if we need that extra push over the cliff, you know what we do? [pause] Put it up to eleven. - Nigel Tufnel, This is Spinal Tap

Today I was reminded of this great movie scene while testing a new piece of code. The problem I was testing was finding the shortest paths from one vertex in a graph to all other connected vertices. Listing one path is easy; listing all alternative shortest paths is a bit trickier. The current implementation only provided a single shortest path, and there was a bug associated with this problem.

I'm going to skim over the details here but will show some of the API. The patch is currently on this SourceForge tracker if you wish to see the complete implementation. To start, some simple cases such as cubane were tested.

The 6 shortest paths between opposite vertices in a cubane molecule (CHEBI:33014)

With the correct result confirmed, it was time to turn up the complexity. C60 fullerene seemed a good choice. Unfortunately, after running it, I was disappointed to find there are at most only six possible paths to an opposite vertex. This one is a bit tricky to verify, but luckily there is a ball-and-stick model in the office which made it a little easier. I also generated some spiro compounds, but I was looking for a more difficult molecule.

C60 fullerene - CHEBI:33128


I was pretty much done with testing until this morning, when I stumbled across the FULLERENE program and database, which provides a download for a C720 fullerene. Now that is a challenge!


C720 fullerene


Time to turn it up to eleven!

Here is the existing code to reconstruct the path from each atom to every other atom. Note we only do the comparison one way, as it is quicker to reverse the path to get both the path from i to j and from j to i.

Code 1 - All pairs shortest paths
for(int i = 0, n = fullerene.getAtomCount(); i < n; i++){
    for(int j = i + 1; j < n; j++){
        // a single shortest path (list of atoms) between atoms i and j
        List path = PathTools.getShortestPath(fullerene,
                                              fullerene.getAtom(i),
                                              fullerene.getAtom(j));
    }
}

I only managed to run it once, as it took 3,481,106 ms (~58 min) to complete. Using the new utility we can do the equivalent in well under a second (~200 ms).

Code 2 - All pairs shortest paths new API
AllShortestPaths asp = new AllShortestPaths(fullerene);
 
for(int i = 0, n = fullerene.getAtomCount(); i < n; i++){
    for(int j = i + 1; j < n; j++){  
        IAtom[] path = asp.from(i).atomsTo(j);
    }
}

The current code is actually just a wrapper around several searches, one from each atom. Including the path reconstruction in the Floyd-Warshall algorithm (implemented as TopologicalMatrix) is also possible, but for simplicity I left AllShortestPaths as a simple helper. The atomsTo method can be replaced with the pathTo method, which will run even quicker. We can also count the number of shortest paths between each pair of vertices.

Code 3 - Counting all pairs paths
sum += asp.from(i).nPathsTo(j);
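
For clarity, that line sits inside the same double loop as Code 2; a minimal sketch of the full count (the sum variable is just a local accumulator):

long sum = 0;
for(int i = 0, n = fullerene.getAtomCount(); i < n; i++){
    for(int j = i + 1; j < n; j++){
        sum += asp.from(i).nPathsTo(j);   // cheap count, no path reconstruction
    }
}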

From this we find a total of 39,086,040 shortest paths (double that if you include j to i). We can get the actual paths, but this takes significantly longer - around 8000 ms. Now that's pretty slow, but we can check the theoretical maximum speed by filling an array the size of each shortest-path set.

Code 4 - storing all pairs paths
// simulate reconstructing every shortest path by just filling an array
// of the required size
int[][] paths = new int[asp.from(i).nPathsTo(j)][asp.from(i).distanceTo(j) + 1];
for(int k = 0; k < paths.length; k++){
    for(int l = 0; l < paths[k].length; l++) {
        paths[k][l] = k + l;
    }
}


This will take 3000+ ms on its own. It may be possible to get the reconstruction time close to that, but it is likely that most of the difference is spent in array copies and traversing the recursive method calls. Currently, this isn't worth optimising further. As the API provides a way to check how many shortest paths there are, it is better to only reconstruct all the paths if there are sufficiently few. After all, if you could reconstruct 39 million paths, you've still got to do something with them afterwards.
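
To make that last point concrete, here is a rough sketch of such a guard. The MAX_PATHS cut-off and the pathsTo method used to reconstruct every shortest path are illustrative assumptions rather than the final API.

final int MAX_PATHS = 1000; // arbitrary cut-off, tune for the application

for(int i = 0, n = fullerene.getAtomCount(); i < n; i++){
    for(int j = i + 1; j < n; j++){
        // cheap count first - only reconstruct when the set is small
        if(asp.from(i).nPathsTo(j) <= MAX_PATHS){
            IAtom[] path = asp.from(i).atomsTo(j);      // one shortest path (as in Code 2)
            // int[][] paths = asp.from(i).pathsTo(j);  // hypothetical: reconstruct all shortest paths
        }
    }
}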


Friday 7 December 2012

Scaling Up - Faster Ring Detection in CDK

In the last post I wrote about how different graph representations can affect algorithm performance. The choice of representation not only affects basic operations but can also change how an algorithm is structured. To demonstrate this I decided to take a look at ring detection in chemical structures. For this post I refer to a ring as a cycle and detection as a test of whether a vertex (atom) belongs to a cycle of any size. I'm going to explain some concepts from the very basics, but we end up with a particularly elegant (and efficient) method for cycle detection. Finally, I will show how it performs when we scale up our input.

I should add there is more than one (quick) way to detect cycles; the final implementation is certainly nothing new, but it was fun to program.

Cycle Detection in the CDK

In the Chemistry Development Kit (CDK) the SpanningTree class is the usual choice for cycle detection. It is used to obtain a molecule (aka AtomContainer) that only contains the cyclic atoms. Each atom is then tested for whether it is present in this cyclic fragment.

Code 1 - Using the existing SpanningTree
// create the spanning tree and obtain the cyclic fragments
SpanningTree   tree   = new SpanningTree(molecule);
IAtomContainer cyclic = tree.getCyclicFragmentsContainer();
 
// test whether the cyclic fragment contains each 'atom'
for(IAtom atom : molecule.atoms()){
    cyclic.contains(atom);
}

The method is relatively fast and is used in some core tasks, including aromaticity detection and atom typing.

Depth First Target Search

I'll come back to why the spanning tree is important and what it actually is later. If we dig down in the code we find the spanning tree uses a depth first target search to test whether a vertex is in a cycle. This mirrors the core idea of a cycle in an undirected graph.

An undirected graph has a cycle if and only if a depth-first search (DFS) finds an edge that points to an already-visited vertex (a back edge) - Wikipedia

The exact piece of code which is used is a depth first target search. Let's have a look at this targeted DFS - I've restructured the code a little so it fits in the snippet; you can see the original here.

Code 2 - SpanningTree internals
// see org.openscience.cdk.PathTools.depthFirstTargetSearch for original code
boolean targetedDFS(IAtomContainer mol, IAtom atom, IAtom target, IAtomContainer path) {        
     
    atom.setFlag(VISITED, true);                                   // step 1                                           
        
    for (IBond bond : mol.getConnectedBondsList(atom)) {           // step 2            
            
        IAtom connected = bond.getConnectedAtom(atom);             // step 3
            
        if (!connected.getFlag(VISITED)) {                         // step 4 
        
            path.addAtom(connected);                               // step 5
            path.addBond(bond);                                    
                
            if (connected == target) {                             // step 6   
                return true;                  
            } else {                     
                if (targetedDFS(mol, connected, target, path)) {   // step 7
                    return true; 
                } else {            
                    path.removeAtom(connected);                    // step 8 
                    path.removeBond(bond);                         
                }
            }
        
        }
    }
        
    return false;                                                  // step 9
}

I'm going to step through each block so it is clear what it does, and I will add some commentary on various aspects after.
  1. indicate we are visiting atom
  2. access all bonds connected to the atom
  3. for each bond from the atom get the connected atom
  4. check we have not visited the connected atom - note we actually may be in the process of visiting it further up the stack
  5. add the connected atom to our path
  6. if the connected atom is our target return true - the path will contain all atoms we visited to get to the target
  7. connected atom was not the target, recursively visit the neighbours of the connected atom and attempt to find the target. If whilst recursively visiting we found the target, return true, the path still contains all atoms which lead to the target.
  8. we did not find the target whilst visiting the connected atom; the connected atom therefore cannot lead to the target, so we remove it from the path.
  9. we have visited all connected atoms and did not find the target, return false

It is important to note the atom/target cannot be the same atom. For the purposes of demonstration let us presume this is allowed - this will become clearer when we look back at the spanning tree. The following animations visually represent a targeted DFS on the atoms a, f and b. The visit order of the neighbours is determined by the order in which the edges are stored - the animations depict the worst possible route, but this can of course occur.

A DFS from vertex a; the search terminates when we visit vertex b a second time. Vertex b was not the target, so no path is returned.

A DFS from vertex f; the search terminates when we visit vertex e for the second time. Vertex e was not the target, so no path is returned. Note that we are traversing the same edges as before.
A DFS from b; the search finds the target (b) and the path of {bhg, edc} (which are all in a cycle) is returned.

Time Complexity

There are a couple of things to note about the complexity of the above process. Here I denote the number of vertices (atoms) as |V| and the number of edges (bonds) as |E|. This follows the standard big O notation for graphs.

Firstly, the VISITED flag (step 1) needs to be reset for each subsequent search. This is a consequence of mixing the algorithm state with our data model: to re-invoke this method over the same set of atoms (which we need to do) we first need to reset all the flags, which takes time proportional to |V| + |E|. Secondly, to access the connected bonds (step 2) we need to perform a linear search (in the coordinate list representation), taking at most |E| operations. Thirdly, to remove the atom and bond (step 8) we again need a linear search over all atoms/bonds, taking at most |V| and |E| operations respectively. Phew, that's a lot of |V|s and |E|s floating about. It should also be noted that we need to do a targeted search for every vertex (atom). This would take a very long time - surely there is a way we can reduce the number of atoms we need to test? This is exactly where the spanning tree is used.

Spanning Tree

Okay, so what is a spanning tree? I'm not a graph theorist so my definitions may be slightly loose; the links below will provide a more accurate definition. A tree is a connected graph in which every pair of vertices is connected by exactly one simple path - that is, there are no cycles. A spanning tree of a graph is a transformation of the graph (which may contain cycles) into a tree which keeps only enough edges to span all connected vertices. In other words, it does not contain redundant edges.

Valid spanning trees of our example graph. The dashed line indicates where an edge was removed.

The spanning tree provides several interesting properties. Any edges that are not present in the tree are redundant and must therefore be in a cycle. If we obtain multiple trees from our graph, our graph is disconnected. The CDK uses an implementation of Kruskal's algorithm to calculate the spanning tree, which runs in O(|E| log |E|).
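
To make the idea concrete, here is a minimal union-find sketch of how the cycle-closing (removed) edges can be identified in an unweighted molecular graph; for unweighted graphs Kruskal's algorithm reduces to processing the edges in any order. This is an illustration only, not the CDK SpanningTree implementation.

// collect the bonds that close a cycle, i.e. the edges 'removed' when
// building a spanning tree - union-find over atom indices
List<IBond> removedEdges(IAtomContainer mol) {
    int[] parent = new int[mol.getAtomCount()];
    for (int i = 0; i < parent.length; i++)
        parent[i] = i;

    List<IBond> removed = new ArrayList<IBond>();
    for (IBond bond : mol.bonds()) {
        int u = find(parent, mol.getAtomNumber(bond.getAtom(0)));
        int v = find(parent, mol.getAtomNumber(bond.getAtom(1)));
        if (u == v)
            removed.add(bond);  // endpoints already connected - closes a cycle
        else
            parent[u] = v;      // joins two components - a tree edge
    }
    return removed;
}

// representative of the component containing i (with path compression)
int find(int[] parent, int i) {
    return parent[i] == i ? i : (parent[i] = find(parent, parent[i]));
}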

Once the spanning tree is determined we know which edges are in a cycle. We then use the depth first targeted search (above) between the two vertices of each edge that was removed.
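
Putting the two together, here is a sketch of how the cyclic atoms could then be recovered using the targeted DFS from Code 2. In this sketch 'tree' is assumed to be a copy of the molecule with the removed bonds deleted, removedEdges is the helper sketched above, and VISITED is the same CDK flag used earlier - it mirrors the idea rather than the exact SpanningTree code.

for (IBond removed : removedEdges(mol)) {

    // reset the flags left over from the previous search
    for (IAtom atom : tree.atoms())
        atom.setFlag(VISITED, false);

    IAtom begin = removed.getAtom(0);
    IAtom end   = removed.getAtom(1);

    IAtomContainer path = tree.getBuilder().newInstance(IAtomContainer.class);
    path.addAtom(begin);

    // a path between the two endpoints in the tree, plus the removed bond,
    // forms a cycle - every atom on that path is cyclic
    if (targetedDFS(tree, begin, end, path)) {
        // mark the atoms in 'path' as being in a ring
    }
}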

Depth First Target Search on a spanning tree. The operation is equivalent to the previous self-target search on single vertices; instead of the same query/target vertex we have the same query/target edge.

This process is efficient as we only ever check targets which must be in cycles. However, if we have two or more cycles we still traverse the graph multiple times. Can we detect all cycles and only visit each edge once...?

Changing the Representation

Firstly, let's use the more friendly adjacency list for traversal. As shown previously, the adjacency list allows quicker access to adjacent neighbours (step 2). When we create the adjacency list we index each vertex with an int value between 0 and |V|. This index allows us to quickly access the vertex value (atom), but it also allows us to decouple the algorithm's visited state by storing a boolean array of size |V| (step 1). In fact we can do even better by storing which vertices we have visited in a BitSet or, if the molecule is small enough, on a single 64-bit long value. Similarly, as we do not need to know the order, we can also store the path we visited as a BitSet (step 8).
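
A minimal sketch of that indexing (the names here are illustrative): map each atom to an int once, and the algorithm state then lives in a plain array or BitSet rather than on the atoms themselves.

// index each atom once: atom -> 0 .. |V|-1
Map<IAtom, Integer> index = new HashMap<IAtom, Integer>();
for (IAtom atom : mol.atoms())
    index.put(atom, index.size());

// algorithm state decoupled from the data model - no flags to reset
boolean[] visited = new boolean[mol.getAtomCount()];
BitSet    path    = new BitSet(mol.getAtomCount());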
All these changes sound great, but we would still be visiting the same vertices multiple times and we would still need to calculate the spanning tree. There is actually a far simpler way - one which doesn't require the spanning tree at all. In truth I truncated the Wikipedia quote at the start of this post. The solution should be clear from the first diagram, but if not, here is the full quote.

An undirected graph has a cycle if and only if a depth-first search (DFS) finds an edge that points to an already-visited vertex (a back edge). Equivalently, all the back edges, which DFS skips over, are part of cycles. In the case of undirected graphs, only O(n) time is required, since at most n − 1 edges can be tree edges (where n is the number of vertices). - Wikipedia

Here O(n) means O(|V| + |E|) - now that's an algorithm that scales!

The animations below demonstrate what the targeted DFS was doing and what we can actually do.

The previous targeted DFS from vertex a; the search terminates when we visit vertex b a second time. Vertex b was not the target, so no path is returned and no cycles were marked.
A DFS from a; when b is reached for the second time, all vertices we visited between now and when we first visited b can be marked as being in a cycle.

With our adjacency representation we can do this very easily. The core method of the algorithm can be written in only a few lines of code.

Code 3 - Is in ring
void check(int i, long parentPath, long path) {

    explored |= path = setBit(stack[i] = path, i);

    for (int j : graph.get(i)) {
        if (isBitSet(parentPath, j))
            rings |= stack[j] ^ path;
        else if (!isBitSet(explored, j))
            check(j, stack[i], path);
    }
}

Okay, I'm cheating a bit; let's make this more legible and add some comments.

Code 4 - Is in ring, verbose
// adjacency list representation
List<List<Integer>> graph;

// index of set bits indicates which vertices we have visited
long explored;

// index of set bits indicates the vertices in rings
long rings;

// keeps track of our state when we visit each vertex - each vertex is only
// visited once
long[] stack;

boolean isInRing(int i) {
    // check if vertex i has been explored
    if (!isBitSet(explored, i))
        check(i, 0, 0); // search from i to all connected components
    // check whether bit 'i' is set and thus in a ring
    return isBitSet(rings, i);
}

void check(int i, long parentPath, long path) {

    stack[i] = path;             // store our state before we visit 'i'
    path     = setBit(path, i);  // mark 'i' as visited in our current path
    explored = explored | path;  // mark 'i' as explored globally

    // for each neighbour of i
    for (int j : graph.get(i)) {

        // have we visited it before in our parent path?
        // - we use the parent path as one 'j' will be our
        //   parent and this would falsely indicate a cycle.
        if (isBitSet(parentPath, j))

            // take the difference between our current path and our state
            // right before we last visited 'j'. this difference is the
            // set of vertices we traversed to reach 'j' again. mark all
            // of these as in a cycle/ring.
            rings |= stack[j] ^ path;

        // have we explored j? stops us revisiting parent vertices
        else if (!isBitSet(explored, j))
            // check vertex 'j' passing the path before we
            // visited 'i' as the parent path and our current path
            check(j, stack[i], path);
    }
}
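
The setBit and isBitSet helpers are not shown above; on a single 64-bit long they might look like the following (a sketch - a BitSet would be needed for molecules with more than 64 atoms).

// set the bit at index i
long setBit(long value, int i) {
    return value | 1L << i;
}

// test whether the bit at index i is set
boolean isBitSet(long value, int i) {
    return (value & (1L << i)) != 0;
}

With the fields above initialised (and stack sized to the atom count), testing a whole molecule is then just a loop calling isInRing for each vertex index.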
    

Benchmark

To benchmark this I wrote a quick executable jar that takes an input SDF and tests how long it takes to find all the ring atoms. It prints out the answer from each method to confirm they performed correctly. It also includes a stress test which I'll come back to in a bit. The source code and cycle-detection.jar can be found on the GitHub project. Here is an example usage (--help will print all options):

Code 5 - Runnable benchmark example
java -Xmx1G -jar cycle-detection.jar -i ~/Downloads/Compound_000000001_000025000.sdf
    

The times change a little, but normally the BasicRingTester is about 800% to 1000% faster than the spanning tree. For 20,000 molecules the BasicRingTester usually takes about 120-190 ms and the SpanningTree takes between 900-1000 ms to complete.

Average (n=50) time taken to detect vertices which belong to cycles in PubChem Compound ID 00001-25000


I mentioned in the last post there is some overhead in converting our graph to the adjacency list. We can test how much this overhead is by removing the actual ring computation. This can be done by setting the simulate option.

Code 6 - Runnable benchmark example 2
java -Xmx1G -jar cycle-detection.jar -i ~/Downloads/Compound_000000001_000025000.sdf --simulate
    

If this option is set we'll see that it takes 90-110 ms (on my test input) just to convert the actual graph, and that the true time of the computation is very quick at around 20-50 ms. We can show just how quick the algorithm is if we scale up the input and perform the stress test.

The stress test loads a graphene molecule with 2,598 atoms in cycles and a single atom which is not.

Molecule loaded for stress testing

If we run the benchmark we find the BasicRingTester runs in a couple of milliseconds while the SpanningTree can take up to and over a minute.

Average (n=5) time taken to detect vertices which belong to cycles in a graphene molecule with ~2,500 atoms

To run the stress test yourself:

Code 7 - Runnable benchmark example 3
java -Xmx1G -jar cycle-detection.jar --stress-test -i dummy
    

Final Thoughts

Although the above code is just a boolean test (i.e. is this atom in a ring?), it is still very useful. The simple calculation in the CDKHueckelAromaticityDetector uses the spanning tree simply to get which atoms are in a ring system. Clearly it is possible to adapt the code to detect the independent ring systems. As the technique also roughly preserves the concept of a path, it may well be possible to determine other cycle properties (i.e. size) just by looking at the sets of vertices as they are discovered.

Edit: added graphs to better demonstrate the performance of each method

Edit: this is now available using org.openscience.cdk.ringsearch.RingSearch.

Monday 3 December 2012

The Right Representation for the Job

All programmers know that premature optimisation is the root of all evil. Sometimes, however, you just can't shake the hunch that something really is slowing your code down. It's often wise to try to suppress such thoughts, as in reality it probably won't make much difference. Occasionally, it actually does matter.

This particular case concerns the representation of (chemical) graphs. Below is a simple undirected graph with seven vertices, seven edges and a single cycle.


simple graph


We can store this in several ways. Wikipedia has a nice article on the abstract data type which also lists the time complexity of operations on each representation.


Representations of the simple graph. The adjacency list lists all adjacent vertices for each vertex. The adjacency matrix indicates which vertices are connected by marking the appropriate row/column coordinate.

It is clear just from drawing the representations that the matrix takes a lot more space. We can reduce the space required by changing the representation when we store it. A simple way would be to store the row and column indices which are connected - known as a coordinate list. It is also possible to compress the graph further and to tune it depending on whether the matrix has sparse rows or sparse columns.

Representation of the simple graph using only the edges.

The popular CTFile format (MDL Molfile) uses a coordinate list to store the connections between the atoms. The Chemistry Development Kit and OpenBabel both mirror this storage in their data model with a list of atoms (vertices) and bonds (edges). This is clearly the correct approach when you're short on space, but it can be sub-optimal for some operations. As an example, to find the neighbours of 'b' we need to perform a linear search over the whole list to find all edges that include 'b'. Clearly there is going to be some performance improvement if we convert our coordinate list to an adjacency list or matrix. Two questions I wanted to ask were how much you gain and whether the overhead of doing the conversion is worth it. To test this I wrote a simple function which iterates over all the atoms and sums the atomic numbers.

Code 1 - Graph iteration
/* atom container iteration */
int iterate(IAtomContainer container) {
    int total = 0;
    for (int d = 0; d < 8; d++) {
        for (IAtom atom : container.atoms()) {
            for (IAtom neighbour : container.getConnectedAtomsList(atom)) {
                total += neighbour.getAtomicNumber();
            }
        }
    }
    return total;
}

/* adjacency list iteration */
int iterate(AdjacencyList graph) {
    int total = 0;
    for (int d = 0; d < 8; d++) {
        for (int i = 0; i < graph.n(); i++) {
            // access adjacent neighbours of i
            for (int j : graph.neighbors(i)) {
                total += graph.getAtom(j).getAtomicNumber();
            }
        }
    }
    return total;
}

/* adjacency matrix iteration */
int iterate(AdjacencyMatrix matrix) {
    int[][] graph = matrix.getGraph();
    int total = 0;
    for (int d = 0; d < 8; d++) {
        for (int i = 0; i < graph.length; i++) {
            for (int j = 0; j < graph[i].length; j++) {
                // test whether i and j are connected
                if (graph[i][j] != 0)
                    total += matrix.getAtom(j).getAtomicNumber();
            }
        }
    }
    return total;
}
    

The size of eight was chosen as this mimics the default path length of the CDK fingerprinters. The iteration was tested on PubChem Compound entries 1-25000, both with only the molecules of fewer than 60 atoms (20,587) and with all molecules (23,128).

Average time taken to iterate over all graphs. Results were averaged over 50 repetitions.

The difference between the adjacency list and matrix is negligible, but both are much faster than the coordinate list. It is also clear that these representations scale better with molecule size: when we remove the filter for < 60 atoms we only have ~10% more molecules, but the iteration takes twice as long.

To put these times in perspective, if we scaled up our test and presumed the rest of PubChem has a similar distribution of molecule size, the adjacency list would finish in around 4 minutes whilst the coordinate list would take nearly 2 hours 15 minutes.

Of course, it takes time to actually convert the representation from the coordinate list in the first place. The time taken to convert to an adjacency list is ~250-350 ms and to the matrix ~150 ms. So is it worth it? Yes - if you're going to be repeatedly accessing the neighbours, considerable time can be gained.
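
For reference, the conversion being timed is essentially the following sketch (the AdjacencyList class used in Code 1 is assumed to wrap something like this):

// build an adjacency list (vertex index -> indices of adjacent vertices)
// from the CDK atom/bond (coordinate list) representation
List<List<Integer>> graph = new ArrayList<List<Integer>>();
for (int i = 0; i < container.getAtomCount(); i++)
    graph.add(new ArrayList<Integer>());

for (IBond bond : container.bonds()) {
    int u = container.getAtomNumber(bond.getAtom(0));
    int v = container.getAtomNumber(bond.getAtom(1));
    graph.get(u).add(v);  // undirected - add each edge in both directions
    graph.get(v).add(u);
}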

The CDK provides a utility to convert a molecule into an adjacency matrix; however, this representation is rarely used in the rest of the library. Finally, I should add that there is no perfect representation, but there are representations that are better suited to certain tasks.