Tuesday 22 July 2014

Polish-ed SMARTS parsing

As introduced in previous posts, SMARTS is a concise notation for describing chemical substructure queries. There are several aspects to a SMARTS implementation: subgraph matching, parsing, generating, and even optimisation[1,2].

In this post I'll show a way of parsing the binary atom expressions that I found quite neat.

Preliminaries

Conceptually, a SMARTS atom expression is composed of primitives and operators (binary and unary). The primitives test whether some property of an atom (e.g. element, charge, valence, etc.) has a certain value[3]. The operators invert and combine these primitives through conjunction (and), disjunction (or), and negation (not).

Some examples of atom expressions are:

[O&X1]
[!C&!N]
[C,c;X3&v4]
[N&!H0&X3]
[!#6&X4]
[O,S,#7,#15]
[C&X3;$([H2]),$([H1][#6]),$(C([#6])[#6])]

The operators in these expressions, ordered from highest to lowest precedence, are:

! unary not
& binary and (high)
, binary or
; binary and (low)

The default operator is '&' and can often be omitted, such that the first pattern would read [OX1]. There are two 'and' operators with different precedence, allowing logical expressions like:

[C,N&X1]  C or (N and X1)
[C,N;X1]  (C or N) and X1

More complex expression trees can be accomplished with recursive SMARTS.

A formal grammar for SMARTS that respects precedence looks something like this (lifted from the CDK javacc implementation):

SMARTS EBNF grammar
AtomExpression    ::= "[" <LowAndExpression> "]"
LowAndExpression  ::= <OrExpression> [ ";" <LowAndExpression> ]
OrExpression      ::= <HighAndExpression> [ "," <OrExpression> ]
HighAndExpression ::= <NotExpression> [ '&' <HighAndExpression> ]
NotExpression     ::= [ "!" ] <AtomPrimitive>

Notice this is a recursive procedure where I climb the precedence hierarchy, from the lowest-precedence operator to the highest, while descending into the grammar. The small number of operators in SMARTS means this is generally good enough. However, there is a non-recursive alternative.
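To make the recursive procedure concrete, here is a minimal sketch of a recursive-descent parser that follows the grammar above. It is not the CDK javacc code: the class name RecursiveAtomExpr and the parenthesised string it returns are assumptions made for illustration, and a primitive is simplified to one symbol character followed by optional digits.

```java
final class RecursiveAtomExpr {
    private final String s;
    private int pos;

    private RecursiveAtomExpr(String s) { this.s = s; }

    /** Parses e.g. "[C,N&X1]" into a fully parenthesised string. */
    static String parse(String smarts) {
        RecursiveAtomExpr p = new RecursiveAtomExpr(smarts);
        p.expect('[');
        String expr = p.lowAnd();
        p.expect(']');
        return expr;
    }

    // LowAndExpression ::= OrExpression [ ";" LowAndExpression ]
    private String lowAnd() {
        String left = or();
        return accept(';') ? "(" + left + " and " + lowAnd() + ")" : left;
    }

    // OrExpression ::= HighAndExpression [ "," OrExpression ]
    private String or() {
        String left = highAnd();
        return accept(',') ? "(" + left + " or " + or() + ")" : left;
    }

    // HighAndExpression ::= NotExpression [ "&" HighAndExpression ]
    private String highAnd() {
        String left = not();
        return accept('&') ? "(" + left + " and " + highAnd() + ")" : left;
    }

    // NotExpression ::= [ "!" ] AtomPrimitive
    private String not() {
        return accept('!') ? "(not " + primitive() + ")" : primitive();
    }

    // simplified primitive: one symbol character followed by optional digits
    private String primitive() {
        int start = pos;
        if (pos >= s.length() || "[]!&,;".indexOf(s.charAt(pos)) >= 0)
            throw new IllegalArgumentException("primitive expected at " + pos);
        pos++;
        while (pos < s.length() && Character.isDigit(s.charAt(pos))) pos++;
        return s.substring(start, pos);
    }

    // consumes c if it is the next character
    private boolean accept(char c) {
        if (pos < s.length() && s.charAt(pos) == c) { pos++; return true; }
        return false;
    }

    private void expect(char c) {
        if (!accept(c))
            throw new IllegalArgumentException("expected '" + c + "' at " + pos);
    }
}
```

With this sketch, RecursiveAtomExpr.parse("[C,N&X1]") gives "(C or (N and X1))" and RecursiveAtomExpr.parse("[C,N;X1]") gives "((C or N) and X1)", matching the two readings shown earlier.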

Reverse Polish notation

Reverse Polish notation (RPN) is a notation where the operator follows the operands of an expression[4]. Some simple mathematical expressions are written as follows:

5 + 1              5 1 +
3 + 4 * 2          3 4 2 * +
(3 + 4) * 2        3 4 + 2 *

RPN is extremely useful and simple to implement and evaluate on stack-based machines[5]. An excellent property is that the operators are applied as soon as they are encountered. Notice that I don't need parentheses to change the order of multiplication and addition. Also notice that a lookahead check for operator validity isn't needed: by the time an operator is applied, its operands have already been parsed.
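As a small illustration of "operators are applied as soon as they are encountered", the sketch below evaluates the three arithmetic examples with a single stack. The class and method names (RpnCalculator, evaluate) are made up for this post, and only integers, '+' and '*' are supported.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class RpnCalculator {
    /** Evaluates a space-separated postfix expression of integers, '+' and '*'. */
    static int evaluate(String rpn) {
        Deque<Integer> stack = new ArrayDeque<>();
        for (String token : rpn.trim().split("\\s+")) {
            switch (token) {
                // an operator pops its two operands and pushes the result
                case "+": stack.push(stack.pop() + stack.pop()); break;
                case "*": stack.push(stack.pop() * stack.pop()); break;
                // anything else is treated as an integer operand
                default:  stack.push(Integer.parseInt(token));
            }
        }
        return stack.pop();
    }
}
```

RpnCalculator.evaluate("3 4 2 * +") returns 11 while RpnCalculator.evaluate("3 4 + 2 *") returns 14, with no parentheses involved.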

SMARTS operators are infix but let us see what RPN SMARTS might look like:

[O&X1]             O X1 &
[!C&!N]            C ! N ! &
[C,c;X3&v4]        C c , X3 v4 & ;
[N&!H0&X3]         N H0 ! X3 & &
[!#6&X4]           #6 ! X4 &
[O,S,#7,#15]       O S #7 #15 , , ,

It is much simpler to write a parser that respects precedence for RPN SMARTS. All that is needed is a way to convert from infix to postfix. The Shunting-yard algorithm[6] does just that.
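For reference, a bare-bones Shunting-yard conversion from an infix atom expression to the RPN strings listed above might look like the following. This is only a sketch under simplifying assumptions: the class name SmartsToRpn is hypothetical, every operator is explicit, a primitive is one symbol character followed by optional digits, and the character code is used directly as the precedence (lower code binds tighter).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class SmartsToRpn {
    /** Converts e.g. "[C,c;X3&v4]" to "C c , X3 v4 & ;". */
    static String convert(String smarts) {
        String expr = smarts.substring(1, smarts.length() - 1); // strip '[' and ']'
        List<String> output = new ArrayList<>();
        Deque<Character> operators = new ArrayDeque<>();
        int i = 0;
        while (i < expr.length()) {
            char c = expr.charAt(i);
            if ("!&,;".indexOf(c) >= 0) {
                // pop operators that bind tighter than the incoming one
                while (!operators.isEmpty() && operators.peek() < c)
                    output.add(String.valueOf(operators.pop()));
                operators.push(c);
                i++;
            } else {
                // simplified primitive: a symbol followed by optional digits
                int start = i++;
                while (i < expr.length() && Character.isDigit(expr.charAt(i))) i++;
                output.add(expr.substring(start, i));
            }
        }
        // flush whatever operators are left
        while (!operators.isEmpty())
            output.add(String.valueOf(operators.pop()));
        return String.join(" ", output);
    }
}
```

Running SmartsToRpn.convert("[N&!H0&X3]") yields "N H0 ! X3 & &". Operators of equal precedence are left on the stack rather than popped, which is harmless here because 'and' and 'or' are associative.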

Implementation

The Shunting-yard algorithm is explained well on many other webpages, so I'll skip that here. I will be converting from infix to postfix and building the expression at the same time. To do this, two stacks are needed: one for atom primitives and one for operators. The atom primitive stack is essentially the output of the Shunting-yard, but I apply the operators instead of appending them to the postfix string.

Code 1 consumes characters from the input and either shunts an operator or parses a primitive. Once all the input is consumed, the remaining operators are applied. The created query atom is on top of the stack and is returned. A low-precedence no-op operator is pushed on the stack to make things simpler and buffer the shunting.

To handle the implicit '&' between primitives a little more work is needed. Essentially one would optionally invoke shunt(atoms, operators, '&'); as needed at each iteration.
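One way to decide when the implicit '&' is needed is to track whether the previous token was a primitive: if it was, and the next character is '!' or the start of another primitive, an '&' belongs in between. The sketch below shows that condition by rewriting the string rather than calling shunt; the class name ImplicitAnd is hypothetical and, as a simplifying assumption, a primitive is one symbol character followed by optional digits. The same test could guard the optional call to shunt in the main loop.

```java
final class ImplicitAnd {
    /** Rewrites e.g. "[OX1]" as "[O&X1]" by inserting the implicit '&'. */
    static String makeExplicit(String smarts) {
        StringBuilder sb = new StringBuilder();
        boolean afterPrimitive = false; // was the previous token a primitive?
        for (int i = 0; i < smarts.length(); i++) {
            char c = smarts.charAt(i);
            boolean binaryOp = c == '&' || c == ',' || c == ';';
            boolean bracket  = c == '[' || c == ']';
            boolean digit    = Character.isDigit(c);
            // '!' or a new primitive directly after a primitive implies '&'
            if (afterPrimitive && !binaryOp && !bracket && !digit)
                sb.append('&');
            sb.append(c);
            if (binaryOp || c == '!' || c == '[')
                afterPrimitive = false;
            else if (!bracket)
                afterPrimitive = true;
        }
        return sb.toString();
    }
}
```

For example, ImplicitAnd.makeExplicit("[!C!N]") returns "[!C&!N]", and an expression whose operators are already explicit is returned unchanged.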

Code 1 - Primary loop
IQueryAtom parse(CharBuffer buffer) throws IOException {

    // stacks of atom primitives and operators
    Deque<IQueryAtom> atoms     = new ArrayDeque<>();
    Deque<Character>  operators = new ArrayDeque<>();
    operators.push(Character.MAX_VALUE); // a pseudo low precedence op

    while (buffer.hasRemaining()) {
        char c = buffer.get();
        if (isOperator(c)) // c == '!' or '&' or ',' or ';'?
            shunt(atoms, operators, c);
        else
            atoms.push(parsePrimitive(buffer));
    }

    // apply remaining operators
    while (!operators.isEmpty())
        apply(operators.pop(), atoms);

    return atoms.pop();
}

Code 2 shows the creation of query atom primitives; here they are delegated to several self-explanatory utility methods. For compactness, only a subset of primitives is read.

Code 2 - Parsing selected primitives
IQueryAtom parsePrimitive(CharBuffer buffer) throws IOException {
    // the caller has already consumed the symbol, so re-read it at position - 1
    switch (buffer.get(buffer.position() - 1)) {
        case 'A': return newAliphaticQryAtm();
        case 'C': return newAliphaticQryAtm(6);
        case 'N': return newAliphaticQryAtm(7);
        case 'O': return newAliphaticQryAtm(8);
        case 'P': return newAliphaticQryAtm(15);
        case 'S': return newAliphaticQryAtm(16);

        case 'a': return newAromaticQryAtm();
        case 'c': return newAromaticQryAtm(6);
        case 'n': return newAromaticQryAtm(7);
        case 'o': return newAromaticQryAtm(8);
        case 'p': return newAromaticQryAtm(15);
        case 's': return newAromaticQryAtm(16);

        case '#': return newNumberQryAtm(parseNum(buffer));
        case 'X': return newConnectivityQryAtm(parseNum(buffer));
        case 'H': return newHydrogenCountQryAtm(parseNum(buffer));
        case 'R': return newRingMembershipQryAtom(parseNum(buffer));
        case 'v': return newValenceQryAtom(parseNum(buffer));
    }
    throw new IOException("Primitive not handled");
}

To apply an operator, take the operands (primitives) off the top of the atom stack, create a new query atom, and push it back on to the stack (Code 3). If there aren't enough operands, the expression is invalid (not shown).

Code 3 - Applying an operator
void apply(char op, Deque<IQueryAtom> atoms) {
    if (op == '&' || op == ';')
        atoms.push(and(atoms.pop(), atoms.pop()));
    else if (op == ',')
        atoms.push(or(atoms.pop(), atoms.pop()));
    else if (op == '!')
        atoms.push(not(atoms.pop()));
}

Finally, to handle the operator (Code 4), check if the operator currently on top of the stack has precedence over the new operator. If so, pop it from the stack and apply it. The new operator is then added to the stack. Conveniently the code point of the operator character can be used as the precedence.

Code 4 - Handling operator precedence
void shunt(Deque<IQueryAtom> atoms, Deque<Character> operators, char op) {
    while (precedence(operators.peek()) < precedence(op))
        apply(operators.pop(), atoms);
    operators.push(op);
}

static int precedence(char c) {
    return c; // in ASCII, '!' < '&' < ',' < ';' 
}

With the exception of a few utility methods these four snippets are essentially the whole implementation. You can find the fully functional code on the GitHub project[7].

Not only is the code incredibly compact and elegant, but it can also easily be extended. Several convenience extensions to SMARTS have been made in the past (for example, #X for !#1!#6). A common requirement in general expressions and the Shunting-yard algorithm is to handle parentheses. These need special treatment, but it is only a simple modification to the shunting and the precedence value (Code 5).

Code 5 - Handling parentheses
void shunt(Deque<IQueryAtom> atoms, Deque<Character> operators, char op) {
    if (op == ')') {
        while ((op = operators.pop()) != '(')
            apply(op, atoms);
    } else {
        if (op != '(') {
            while (precedence(operators.peek()) < precedence(op))
                apply(operators.pop(), atoms);
        }
        operators.push(op);               
    }
}

int precedence(char c) {
    switch (c) {
        case '!': return 1;
        case '&': return 2;
        case ',': return 3;
        case ';': return 4;
        case '(':
        case ')': return 5;
        default:  return 6;
    }
}

The parser will now correctly handle the following expressions without recursive SMARTS:

[!(C,N,O,P,S)]              C N O P S , , , , !
[!(C,N,O&X1)]               C N O X1 & , , !
[((C,N)&X3),((O,S)&X2)]     C N , X3 & O S , X2 & ,
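
The parenthesis handling can be checked in isolation with a postfix-string version of the shunting. This is a sketch rather than the code from the GitHub project: the class name ParenSmartsToRpn is hypothetical, operators are written to an output list instead of being applied to query atoms, and a primitive is again simplified to one symbol character followed by optional digits.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class ParenSmartsToRpn {
    // same ordering as Code 5: a smaller value binds tighter
    static int precedence(char c) {
        switch (c) {
            case '!': return 1;
            case '&': return 2;
            case ',': return 3;
            case ';': return 4;
            case '(': return 5;
            default:  return 6;
        }
    }

    /** Converts a parenthesised atom expression to a postfix string. */
    static String convert(String smarts) {
        String expr = smarts.substring(1, smarts.length() - 1); // strip '[' and ']'
        List<String> output = new ArrayList<>();
        Deque<Character> operators = new ArrayDeque<>();
        int i = 0;
        while (i < expr.length()) {
            char c = expr.charAt(i);
            if (c == ')') {
                // unwind the operator stack to the matching '('
                char op;
                while ((op = operators.pop()) != '(')
                    output.add(String.valueOf(op));
                i++;
            } else if ("!&,;(".indexOf(c) >= 0) {
                // '(' is pushed unconditionally and acts as a barrier
                if (c != '(') {
                    while (!operators.isEmpty()
                            && precedence(operators.peek()) < precedence(c))
                        output.add(String.valueOf(operators.pop()));
                }
                operators.push(c);
                i++;
            } else {
                // simplified primitive: a symbol followed by optional digits
                int start = i++;
                while (i < expr.length() && Character.isDigit(expr.charAt(i))) i++;
                output.add(expr.substring(start, i));
            }
        }
        while (!operators.isEmpty())
            output.add(String.valueOf(operators.pop()));
        return String.join(" ", output);
    }
}
```

For instance, ParenSmartsToRpn.convert("[((C,N)&X3),((O,S)&X2)]") returns "C N , X3 & O S , X2 & ,". An unbalanced ')' makes operators.pop() throw on the empty stack, so real code would want a friendlier error.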

All source code is available at github/johnmay/efficient-bits/polished-smarts.

References

  1. PATSY, NextMove Software
  2. SMARTS Optimisation & Compilation: Introduction & Optimisation, Tim Vandermeersch
  3. Daylight theory manual, Daylight CIS
  4. Reverse Polish notation, Wikipedia
  5. Reverse Polish notation and the stack, Computerphile
  6. Shunting-yard algorithm, Wikipedia
  7. github/johnmay/efficient-bits/polished-smarts

Friday 18 July 2014

CDK Release 1.5.7

CDK 1.5.7 has been released and is available from sourceforge (download here) and the EBI maven repo (XML 1).

The release notes (1.5.7-Release-Notes) summarise and detail the changes. Among the new bug fixes and features, several plugins have been added to the build. The release notes describe how these plugins can be run and what they do, so be sure to check the notes out if you're a contributor.

XML 1 - Maven POM configuration
<repository>
  <url>http://www.ebi.ac.uk/intact/maven/nexus/content/repositories/ebi-repo/</url>
</repository>
...
<dependency>
  <groupId>org.openscience.cdk</groupId>
  <artifactId>cdk-bundle</artifactId>
  <version>1.5.7</version>
</dependency>