The default Java serialization framework provides a convenient mechanism for streaming in-memory objects to another computer or storing them on disk. Beyond the obvious badness of being tied to the internal object layout (i.e. not stable through changes), serialization can be very inefficient. Externalization and libraries like Kryo are popular for improving performance.
SMILES: CO[C@@H]([C@H](OC(C)=O)[C@@H](OC(C)=O)[C@H](OC(C)=O)[C@H](OC(C)=O)COC(C)=O)SC
In the domain of Chemistry we have a rich variety of formats (e.g. SMILES) with which we can store molecules and reactions (in memory these are labelled graphs). Although these formats do not completely fulfil the utility of Object serialization, they can be used as building blocks upon which we build. Not only are these de facto standards but they can be much faster and more compact than default serialization of the in-memory connection table (graph) representation.
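As a rough illustration of the size difference, the sketch below serializes a tiny labelled graph with default Java serialization and compares the byte count with the equivalent SMILES string. The `Atom`, `Bond`, and `Molecule` classes are hypothetical stand-ins invented for this example, not the CDK types, and the ratio for a single tiny object is dominated by class descriptors, so it should not be read as representative of a large stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy classes for illustration only; these are NOT the CDK types.
final class ToyGraphSize {

    static final class Atom implements Serializable {
        private static final long serialVersionUID = 1L;
        final String symbol;
        Atom(String symbol) { this.symbol = symbol; }
    }

    static final class Bond implements Serializable {
        private static final long serialVersionUID = 1L;
        final int beg, end, order; // atom indices and bond order
        Bond(int beg, int end, int order) {
            this.beg = beg;
            this.end = end;
            this.order = order;
        }
    }

    static final class Molecule implements Serializable {
        private static final long serialVersionUID = 1L;
        final List<Atom> atoms = new ArrayList<>();
        final List<Bond> bonds = new ArrayList<>();
    }

    // Ethanol as a labelled graph: C-C-O (hydrogens left implicit).
    static Molecule ethanol() {
        Molecule mol = new Molecule();
        mol.atoms.add(new Atom("C"));
        mol.atoms.add(new Atom("C"));
        mol.atoms.add(new Atom("O"));
        mol.bonds.add(new Bond(0, 1, 1));
        mol.bonds.add(new Bond(1, 2, 1));
        return mol;
    }

    // Number of bytes default Java serialization writes for one object.
    static int serializedSize(Serializable obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
            oos.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Number of bytes needed to store a line notation as UTF-8 text.
    static int textSize(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.printf("serialized: %d bytes, SMILES: %d bytes%n",
                          serializedSize(ethanol()), textSize("CCO"));
    }
}
```

In a stream of many molecules the class descriptors are written once and then referenced, so per-object overhead shrinks, but every field of every atom and bond is still written out.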
Recent History
Crafting efficient (de)serialization is beneficial and you can get great speed with a simple setup. A few years ago I ran some experiments on writing an externalization stream for the Chemistry Development Kit (CDK) molecules (thread - High Performance Structure IO). Since the objects are huge, any improvement over the default would be useful. This partly fed into the needs of CDK-Knime (a workflow tool) where I think CML was being used originally. From testing on ChEBI (~20,000 molecules) we see the ObjectInputStream was actually about as fast as an SDfile and much faster than CML. SDfiles are now much faster but that would be another post.
Read Performance
Method | Time | Size | Throughput |
AtomContainerStream | 346 ms | 11.1 MiB | 63739 s-1 |
SDfile | 4159 ms | 51.7 MiB | 5302 s-1 |
CML | 18605 ms | 91.5 MiB | 1185 s-1 |
ObjectInputStream | 5552 ms | 93.9 MiB | 3972 s-1 |
It was around that time that Andrew Dalke paid a visit to EMBL-EBI. In discussing what I was currently working on he promptly showed me how fast OEChem could read/write SMILES. Needless to say, it was pretty quick, and as fast as, if not faster than, my attempt at 'High Throughput' streaming.
The CDK now also has fast SMILES processing and I wanted to compare this to the serialization to see how much of a performance penalty there is.
Benchmark
For a benchmark I used 100,000 structures from ChEMBL 20.
$ shuf chembl_20.smi | head -n 100000 > chembl_20_subset.smi
Writing it to an ObjectOutputStream takes 28.78 seconds. The SMILES subset file takes up 6.8 MiB on disk whilst the serialized objects take up 295 MiB. Ouch, that's more than 40x larger.
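import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.ObjectOutputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import org.openscience.cdk.exception.CDKException;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.interfaces.IChemObjectBuilder;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
import org.openscience.cdk.smiles.SmilesParser;

IChemObjectBuilder bldr = SilentChemObjectBuilder.getInstance();
SmilesParser smipar = new SmilesParser(bldr);
String srcname = "/data/chembl_20_subset.smi";
String destname = "/data/chembl_20_subset.obj";
try (InputStream in = new FileInputStream(srcname);
     Reader rdr = new InputStreamReader(in, StandardCharsets.UTF_8);
     BufferedReader brdr = new BufferedReader(rdr);
     ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(destname))) {
    String line;
    long t0 = System.nanoTime();
    while ((line = brdr.readLine()) != null) {
        try {
            IAtomContainer mol = smipar.parseSmiles(line);
            // stereochemistry does not implement Serializable...
            // so we need to remove it before writing
            mol.setStereoElements(new ArrayList<>(0));
            oos.writeObject(mol);
        } catch (CDKException e) {
            // report the unparsable SMILES and carry on with the next line
            System.err.println(e.getMessage());
        }
    }
    long t1 = System.nanoTime();
    System.err.printf("write time: %.2f s\n", (t1 - t0) / 1e9);
}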
In CDK we first read SMILES with Beam and then convert to the CDK objects, so we'll also look at that small overhead. Here I compare the time to read the 100,000 SMILES using Beam and CDK against reading the objects using an ObjectInputStream. Both CDK and Beam take less than 1 second whilst the ObjectInputStream takes about 50 seconds.
In terms of throughput (mol per sec) here is the kind of speed we hit. I also show the total elapsed time for all 15 repeats.
Method | Min | Max | Elapsed Time | Size |
Deserialization | 1961 s-1 | 2089 s-1 | 12 m 16 s | 295 MiB |
Kryo (Auto) | 42401 s-1 | 44557 s-1 | 33.9 s | 186 MiB |
Kryo Unsafe (Auto) | 44854 s-1 | 47331 s-1 | 31.9 s | 231 MiB |
CDK | 135286 s-1 | 142126 s-1 | 10.7 s | 6.8 MiB |
Beam | 347534 s-1 | 489545 s-1 | 3.2 s | 6.8 MiB |
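To make the columns concrete: throughput is the number of molecules read divided by the time for one repeat, and the minimum and maximum are taken across repeats. As a cross-check on the table, 15 repeats of 100,000 molecules in 33.9 s works out at roughly 44,250 per second, which sits inside the Kryo (Auto) min/max range. The sketch below shows one way such figures could be collected; the class and method names are hypothetical and this is not the harness used to produce the numbers above.

```java
// Hypothetical helper for illustration; not the benchmark code behind the table.
final class ThroughputSummary {

    // Molecules per second for a single repeat.
    static double perSecond(int molecules, long elapsedNanos) {
        return molecules / (elapsedNanos / 1e9);
    }

    // Returns {min, max} throughput across all repeats.
    static double[] minMax(int molecules, long[] elapsedNanos) {
        double min = Double.POSITIVE_INFINITY;
        double max = 0;
        for (long ns : elapsedNanos) {
            double rate = perSecond(molecules, ns);
            min = Math.min(min, rate);
            max = Math.max(max, rate);
        }
        return new double[]{min, max};
    }

    // Runs the task the given number of times and records each elapsed time.
    static long[] time(Runnable task, int repeats) {
        long[] elapsed = new long[repeats];
        for (int i = 0; i < repeats; i++) {
            long t0 = System.nanoTime();
            task.run();
            elapsed[i] = System.nanoTime() - t0;
        }
        return elapsed;
    }

    public static void main(String[] args) {
        int molecules = 100_000;
        // Stand-in workload; a real run would parse or deserialize the molecules here.
        Runnable task = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < molecules; i++) sb.append(i % 10);
            if (sb.length() != molecules) throw new IllegalStateException();
        };
        double[] range = minMax(molecules, time(task, 15));
        System.out.printf("min %.0f s-1, max %.0f s-1%n", range[0], range[1]);
    }
}
```

A serious benchmark would also want warm-up iterations so the JIT compiler has settled before timings are recorded.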
Auxiliary Data
With a performance difference that huge why would anyone want to use Serialization? One use-case might be that a format doesn't store the parts we need. A common argument against SMILES is the lack of coordinates, but we can simply store these as a supplement to the SMILES if we know what the input order will be (Code 2).
import java.util.Arrays;
import javax.vecmath.Point2d;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.smiles.SmilesGenerator;

IAtomContainer mol = ...;
// 'Generic' - avoid canonical SMILES as we are not doing an identity check
SmilesGenerator sg = SmilesGenerator.generic();
int n = mol.getAtomCount();
int[] order = new int[n];
// the order array is filled up as the SMILES is generated
String smi = sg.create(mol, order);
// load the coordinates array such that they are in the order the atoms
// are read when parsing the SMILES
Point2d[] coords = new Point2d[n];
for (int i = 0; i < coords.length; i++)
    coords[order[i]] = mol.getAtom(i).getPoint2d();
// SMILES string suffixed by the coordinates
String smi2d = smi + " " + Arrays.toString(coords);
Using that same technique it's relatively simple to extend this to handle arbitrary data fields and it even forms the basis of ChemAxon's extended SMILES. A more advanced method would be combining the SMILES with a DataOutputStream since we know how many coordinates there are expected to be.
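A minimal sketch of that binary variant follows, assuming a hypothetical record layout of SMILES string, atom count, then x,y pairs as doubles in SMILES atom order. The `SmilesCoordsRecord` class and its layout are invented for illustration. The explicit count is an assumption made to keep the example free of a SMILES parser; an implementation that parses the SMILES first could derive the count from it and omit that field.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical record layout for illustration: SMILES, atom count, x/y pairs.
final class SmilesCoordsRecord {
    final String smiles;
    final double[] xy; // x0, y0, x1, y1, ... in SMILES atom order

    SmilesCoordsRecord(String smiles, double[] xy) {
        if (xy.length % 2 != 0)
            throw new IllegalArgumentException("expected x,y pairs");
        this.smiles = smiles;
        this.xy = xy;
    }

    void write(DataOutputStream out) throws IOException {
        out.writeUTF(smiles);        // note: writeUTF is limited to 65535 encoded bytes
        out.writeInt(xy.length / 2); // number of atoms with coordinates
        for (double v : xy)
            out.writeDouble(v);
    }

    static SmilesCoordsRecord read(DataInputStream in) throws IOException {
        String smiles = in.readUTF();
        int n = in.readInt();
        double[] xy = new double[2 * n];
        for (int i = 0; i < xy.length; i++)
            xy[i] = in.readDouble();
        return new SmilesCoordsRecord(smiles, xy);
    }

    // Writes the record to a byte array and reads it back again.
    static SmilesCoordsRecord roundTrip(SmilesCoordsRecord rec) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                rec.write(out);
            }
            try (DataInputStream in = new DataInputStream(
                     new ByteArrayInputStream(bytes.toByteArray()))) {
                return read(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        double[] xy = {0.0, 0.0, 1.5, 0.0, 2.25, 1.3};
        SmilesCoordsRecord copy = roundTrip(new SmilesCoordsRecord("CCO", xy));
        System.out.println(copy.smiles + " " + Arrays.toString(copy.xy));
    }
}
```

Compared with appending `Arrays.toString(coords)` to the SMILES, the doubles are stored exactly and need no text parsing on the way back in, at the cost of the file no longer being plain text.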
Summary
I'm certainly not against a performant AtomContainerInputStream but the default Java serialization should never be the first choice. Hopefully this post has put some numbers on why and will discourage knee-jerk usage.
Update
Added Kryo performance