**AllRingsFinder**in CDK provides all simple cycles (every possible ring) in a graph by progressively collapsing edges (Hanser et al, 1996). The

**AllRingsFinder**is used (sometimes incorrectly) throughout the library in many different modules.

To perceive all simple cycles can take a long time and I was pessimistic about making any improvement with it's speed. Fortunately, I was pleasantly surprised.

I'll go through the technical aspects of the implementation in the next post. This post will primarily focus on the performance. It should be noted that the algorithm is exactly the same and only the data structures and operations were modified.

## Preliminaries

We will be testing the new implementation named**AllCycles**against the existing

**AllRingsFinder**and

**HanserRingFinder**. The

**HanserRingFinder**was written by Rich Apodaca (depth-first) and is included in the CDK SMSD module. The

**AllCycles**takes as input an adjacency list graph with the vertices described by integers. This is standard for graph algorithms - the time to convert the CDK

*molecule*(

**IAtomContainer**) is included in all results.

The benchmarks were run with Java 7 (21) on a 2.66 GHz Intel Core i7.

## Complete Graphs Benchmark

When each vertex in the graph is adjacent to every other vertex we have the worst case for the algorithm. The number of simple cycles in these graphs grows very rapidly. The graphs are known as complete graphs and are named*K*

_{n, }*n*being the number of vertices. The table below summarises the results of each implementation in milliseconds.

**AllRingsFinder'**is a modified version of

**AllRingsFinder**that does not convert the paths to

**IRing**objects. This conversion is currently inefficient and can impact the performance when there is a large number of cycles. The

**AllCycles**and

**HanserRingFinder**implementations already return the paths as a sequence of integers and atoms respectively.

*Edit*- I realise now the reason for the difference here is that I accidentally used the

*default*and not

*silent*domain objects. The

*default*objects fire events on updates which can cause massive performance hits - particularly when adding atoms. Using the

*silent*objects (no events) yields similar times to

**AllRingsFinder'**without the IRing conversion. In short - don't use the

*default*objects, ever... you don't need it, trust me. I'll leave the timings as reminded not to do this - the dataset results presented were already using the

*silent*objects.

Graph | Number of Cycles | Implementation | Time (ms) | |
---|---|---|---|---|

K_{3} |
1 | AllCycles | 0.00 ms | |

HanserRingFinder | 0.00 ms | |||

AllRingsFinder | 0.15 ms | |||

AllRingsFinder' | 0.00 ms | |||

K_{4} |
7 | AllCycles | 0.00 ms | |

HanserRingFinder | 0.00 ms | |||

AllRingsFinder | 0.95 ms | |||

AllRingsFinder' | 0.00 ms | |||

K_{5} |
37 | AllCycles | 0.00 ms | |

HanserRingFinder | 0.15 ms | |||

AllRingsFinder | 2.20 ms | |||

AllRingsFinder' | 0.75 ms | |||

K_{6} |
197 | AllCycles | 0.10 ms | |

HanserRingFinder | 0.85 ms | |||

AllRingsFinder | 7.80 ms | |||

AllRingsFinder' | 2.00 ms | |||

K_{7} |
1172 | AllCycles | 0.55 ms | |

HanserRingFinder | 26.10 ms | |||

AllRingsFinder | 193.80 ms | |||

AllRingsFinder' | 39.80 ms | |||

K_{8} |
8018 | AllCycles | 12.0 ms | |

HanserRingFinder | 817.00 ms | |||

AllRingsFinder | 10078.75 ms | |||

AllRingsFinder' | 1360.75 ms | |||

K_{9} |
62814 | AllCycles | 994.70 ms | |

HanserRingFinder | 54524.85 ms | |||

AllRingsFinder | 389597.00 ms | |||

AllRingsFinder' | 81699.53 ms |

Performance of each implementation on complete graphs - log scale |

**300000%**improvement in performance (

*K*).

_{9}## Dataset Benchmark

Examining the worst case gives us a concise view of how the implementation scales. To judge the*real*performance several datasets were used; ChEBI, HMDB and a subset of PubChem Compound

*(CID 0 - 25000)*

*.*

There are molecules in the datasets with far more cycles than we can reasonably handle. Such molecules are considered

*impractical*and depends on some predefined escape clause.

**AllRingsFinder**measures the time when the algorithm starts and terminates if a set time limit is exceeded.**AllCycles**terminates when a preset maximum degree (connections) on a vertex is exceeded.**HanserRingFinder**does not have a timeout and so was not tested on the datasets.

**AllRingsFinder**was reduced from the default

**5000 ms**

**to**

**50 ms**. Running on ChEBI with the default timeout took

**389597**

**ms**(

**29 impractical**) and clogged the memory so much the timing was not stable.

**The maximum degree used for**

**AllCycles**was arbitrarily chosen as

**500**. Additionally the

**findAllRingsInIsolatedRingSystem()**was used instead of

**findAllRings().**This allowed nearly identical prepossessing for both implementations using the fast

**RingSearch**. Invoking the

**findAllRings()**directly will yield slower results than those listed below.

Data | Molecules | Implementation | Time (ms) | Impractical | Cycles |
---|---|---|---|---|---|

ChEBI | 29954 | AllCycles | 256 ± 56 ms | 36 | 155476 |

AllRingsFinder | 10324 ± 189 ms | 86.86 | 131124.72 | ||

HMDB | 37495 | AllCycles | 390 ± 48 ms | 2 | 93148 |

AllRingsFinder | 5410 ± 180 ms | 40.83 | 86230.38 | ||

PubChem Compound Subset | 22676 | AllCycles | 195 ± 55 ms | 1 | 61229 |

AllRingsFinder | 1942 ± 173 ms | 5.05 | 60587.58 |

We can see that for ChEBI we were able to perceive

**20000+**more cycles in**~2%**of the time. As mentioned the times include the preprocessing and conversion, were molecules loaded direct as adjacency lists the time for PubChem Compound subset (**195 ms**) would be even faster! The counts for**AllRingsFinder**are fractional due to the fluctuations between repeats.## Choosing a Maximum Degree Limit

The performance of new implementation depends on choice of the maximum degree. The limit doesn't really have a meaning and may be chosen incorrectly. Set the limit too low and we lose information, too high and we may generate more information than we can handle .

Running the implementation on all of PubChem Compound (Dec '12) with a limit of

Plotting the cumulated percent (T) against the maximum degree reached indicates that

To reduced the number of

**5000**we can retrospectively calculate the limit for a given percentile. In PubChem Compound there were**17842848**ring systems with more than one ring,**987 (****0.005%)**of which exceeded our hard maximum degree of**5000**.Plotting the cumulated percent (T) against the maximum degree reached indicates that

**99%**of ring systems can be completed with a limit of just**9**. Due to the sheer number of systems, if we did choose that value there would still be**166131**systems which were considered*impractical*and did not finish*.*To reduced the number of

*impractical*molecules we need to go to fractions of a percent.Maximum Degree | Percent (%) | Completed (ring systems) | Impractical (ring systems) |
---|---|---|---|

72 | 99.95 | 17834013 | 8835 |

84 | 99.96 | 17835876 | 6972 |

126 | 99.97 | 17837692 | 5156 |

216 | 99.98 | 17839293 | 3555 |

684 | 99.99 | 17841065 | 1783 |

882 | 99.991 | 17841342 | 1506 |

1062 | 99.992 | 17841429 | 1419 |

1440 | 99.993 | 17841602 | 1246 |

3072 | 99.994 | 17841789 | 1059 |

This breakdown provides a meaningful insight for an appropriate value. I haven't decided yet where to include a

*Run Forever*option - due to the combinatorial explosion there will always be molecules which we can't compute. I'll leave you with this entertaining video which explains the amazing power of combinatorial explosion.

## No comments:

## Post a Comment

Note: only a member of this blog may post a comment.