Efficient Bits: RDKit Reaction SMARTS

Update 19/09/22

I believe I was wrong: Daylight did allow SMIRKS to be run backwards, and the "direction" parameter controls this ("personal communication" from Roger). The key insight is that the documentation says SMARTS patterns should only appear on atoms that don't have bond changes around them. From the man page:

"Atomic expressions may be any valid atomic SMARTS expression for nodes where the bonding (connectivity & bond order) doesn't change. Otherwise, the atomic expressions must be valid SMILES."

Whether this makes sense for writing portable SMIRKS is questionable.

This is supplementary to my original grumble on ambiguous naming. I still believe RDKit should just call it SMIRKS, or at least something that doesn't already mean something else (e.g. "RdSmirks"). I do agree there is a lot of overlap between SMARTS and SMIRKS, but conflating these terms is problematic. Shortly after I wrote the original post, SMIRKS Native Open Force Field (SMIRNOFF) was published. In that work SMARTS is used, but they called it SMIRKS. The authors attempt some wiggling by saying they use the "SMIRKS features" of atom maps. However, atom maps are not specific to SMIRKS: they are already part of what Daylight called Reaction SMARTS, see the table "Examples of Reaction SMARTS" (SMARTS Theory Page). Of course SMIRNOFF was just too good a pun to change, but it creates yet more confusion.

I recently ran some comparisons/benchmarks on different SMIRKS implementations. Everyone's semantics are consistently inconsistent, and the community would benefit from an effort to standardise. Perhaps then I can convince RDKit that they really do have a (good) SMIRKS implementation and that it would be less confusing to just call it that.

There have been some papers using the RDKit for synthesis planning. If you're writing a paper and use the term "Reaction SMARTS", make sure you mean what everyone else thinks it means.

The SMILES, SMARTS, and SMIRKS line notations were created* by Daylight for storing, matching, and transforming connection tables.

SMILES describes a connection table to store molecules and reactions
SMARTS describes a pattern (or query) to match molecules and reactions
SMIRKS describes a transform (or "reaction") to modify molecules

RDKit uses the term "Reaction SMARTS" to mean "transform" (see RDKit Book). Unfortunately in Daylight's terminology Reaction SMARTS is a pattern not a transform.

Screenshot from the Daylight SMARTS theory manual.

Reaction SMARTS is primarily useful for searching reaction databases. For example this Reaction SMILES:

[cH:13]1[c:14]([cH:19][c:20]([c:10]([c:11]1[Cl:12])[n:9]2[cH:8][c:5]([c:4]([n:22]2)[NH2:3])[C:6]#[N:7])[Cl:21])[C:15]([F:16])([F:17])[F:18].[OH:1][OH:2]>C(Cl)Cl.C(=O)(C(F)(F)F)OC(=O)C(F)(F)F.O>[cH:13]1[c:14]([cH:19][c:20]([c:10]([c:11]1[Cl:12])[n:9]2[cH:8][c:5]([c:4]([n:22]2)[N+:3](=[O:1])[O-:2])[C:6]#[N:7])[Cl:21])[C:15]([F:16])([F:17])[F:18]

is matched by this Reaction SMARTS

[*:1][Nh2:2]>>[*:1][Nh0:2](~[OD1])~[OD1] amino to nitro

You can highlight the substructure:

Highlighting the SMARTS in the SMILES using CDK Depict
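To make the structure of that Reaction SMILES a little more concrete, here is a minimal sketch using only the Python standard library. It splits the string on `>` into reactants, agents, and products, and compares the mapped atoms on each side. The helper names (`split_reaction`, `mapped_atoms`) are mine and purely illustrative, and the regular expression is a crude stand-in for a real SMILES parser, so treat it as an illustration rather than a toolkit replacement.

```python
import re

# The Reaction SMILES from above, written as reactants>agents>products.
RXN_SMILES = (
    "[cH:13]1[c:14]([cH:19][c:20]([c:10]([c:11]1[Cl:12])[n:9]2[cH:8][c:5]"
    "([c:4]([n:22]2)[NH2:3])[C:6]#[N:7])[Cl:21])[C:15]([F:16])([F:17])[F:18]"
    ".[OH:1][OH:2]"
    ">C(Cl)Cl.C(=O)(C(F)(F)F)OC(=O)C(F)(F)F.O>"
    "[cH:13]1[c:14]([cH:19][c:20]([c:10]([c:11]1[Cl:12])[n:9]2[cH:8][c:5]"
    "([c:4]([n:22]2)[N+:3](=[O:1])[O-:2])[C:6]#[N:7])[Cl:21])[C:15]([F:16])"
    "([F:17])[F:18]"
)


def split_reaction(rxn):
    """Split 'reactants>agents>products' into its three parts."""
    reactants, agents, products = rxn.split(">")
    return reactants, agents, products


def mapped_atoms(part):
    """Return {atom map number: bracket atom text} for one side of a reaction."""
    return {int(m): atom for atom, m in re.findall(r"\[([^\]:]+):(\d+)\]", part)}


reactants, agents, products = split_reaction(RXN_SMILES)
before, after = mapped_atoms(reactants), mapped_atoms(products)

# Mapped atoms whose bracket text differs between the two sides.
changed = sorted(m for m in before if before[m] != after.get(m))
print(changed)
```

The atoms reported as changed are the amine nitrogen and the two peroxide oxygens, which is the part of the reaction the amino to nitro pattern describes.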

But that's a transform!

Yes, but it's matching a transform (SMARTS), not applying one (SMIRKS). Some may think you could read this unmodified as a SMIRKS, but this is not the case. SMIRKS needs "real parts" after the second angle bracket, as these are the parts created by the transform. Note that '*' is valid SMILES, and in SMIRKS it kind of means "unmodified". This actually gives us the nice invariants:

All SMILES are valid SMARTS but not all SMARTS are valid SMILES
and
All SMIRKS are valid SMARTS but not all SMARTS are valid SMIRKS

Here is the SMIRKS transform for amino to nitro

[*:1][ND3:2]([H])([H])>>[*:1][N:2](=O)=O amino to nitro
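The "real parts" requirement can be illustrated with a rough check on the product side of a transform. The sketch below is only a heuristic of my own, not how any toolkit validates SMIRKS: it assumes a simplified bracket-atom grammar (isotope, symbol, chirality, H count, charge, atom map) and flags a handful of SMARTS-only characters. The function name `product_side_looks_like_smiles` is hypothetical.

```python
import re

# Simplified SMILES bracket atom: isotope, symbol, chirality, H count,
# charge, atom map. Anything else (D3, h0, ...) is treated as SMARTS-only.
_SMILES_ATOM = re.compile(
    r"(\d+)?([A-Z][a-z]?|[a-z]{1,2}|\*)(@{1,2})?(H\d?)?"
    r"(\+{1,2}|-{1,2}|[+-]\d+)?(:\d+)?"
)
# Characters that only have meaning in SMARTS (any-bond, logic, recursion).
_SMARTS_ONLY = set("~!,;&$")


def product_side_looks_like_smiles(transform):
    """Heuristic: is everything after the last '>' plausibly plain SMILES?"""
    product_side = transform.split(">")[-1]
    if _SMARTS_ONLY & set(product_side):
        return False
    atoms = re.findall(r"\[([^\]]*)\]", product_side)
    return all(_SMILES_ATOM.fullmatch(atom) is not None for atom in atoms)


reaction_smarts = "[*:1][Nh2:2]>>[*:1][Nh0:2](~[OD1])~[OD1]"
smirks = "[*:1][ND3:2]([H])([H])>>[*:1][N:2](=O)=O"

print(product_side_looks_like_smiles(reaction_smarts))  # pattern only
print(product_side_looks_like_smiles(smirks))           # usable as a transform
```

The Reaction SMARTS fails the check because its right-hand side uses `h0`, `D1`, and `~`, none of which tell a toolkit what atom or bond to actually create.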

I can apply this SMIRKS to "molecules" and it will create "reactions". Note that these molecules do not need to have atom maps, but they will come out with atom maps (see dt_transform)!

c1ccccc1N
[nH]1ccc2c1cc(N)cc2

The output is

c1cccc[c:1]1[NH2:2]>>c1cccc[c:1]1[N:2](=O)=O
[nH]1ccc2c1c[c:1]([NH2:2])cc2>>[nH]1ccc2c1c[c:1]([N:2](=O)=O)cc2
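For comparison, here is a minimal sketch of applying the same kind of transform with the RDKit. Two things are my own adaptations rather than anything from Daylight: I write the amine as `[NH2:2]` instead of using explicit `[H]` atoms, and I use the charge-separated nitro form `[N+](=O)[O-]` so the product passes the RDKit's valence checks. Unlike dt_transform, `RunReactants` returns product molecules rather than mapped reactions. The helper name `apply_transform` is hypothetical.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Amino to nitro, adapted for the RDKit (see the assumptions above).
rxn = AllChem.ReactionFromSmarts("[*:1][NH2:2]>>[*:1][N+:2](=O)[O-]")


def apply_transform(smiles):
    """Apply the transform and return the unique products as canonical SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    results = set()
    for (product,) in rxn.RunReactants((mol,)):
        Chem.SanitizeMol(product)  # recompute valences and implicit hydrogens
        results.add(Chem.MolToSmiles(product))
    return sorted(results)


for smi in ("c1ccccc1N", "[nH]1ccc2c1cc(N)cc2"):
    print(smi, "->", apply_transform(smi))
```

Collecting the products in a set is a small design choice: a pattern that matches the same atoms in more than one way would otherwise give duplicate products.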

And another thing...

In general you can't run SMIRKS backwards. Because the atoms/bonds we're adding need to be "real", if I want to run nitro to amino I need to encode the reverse transform separately!

[*:1][ND3:2]([H])([H])>>[*:1][N:2](=O)=O amino to nitro
[*:1][ND3:2](~[OD1])(~[OD1])>>[*:1][N:2]([H])[H] nitro to amino

Although dt_transform specifies a direction, this only controls whether the input molecules appear on the left or right of the output reaction.

*SMILES was created by Dave Weininger whilst at EPA

Friday, 13 April 2018