Friday, December 23, 2011

Christmas presents

Jérôme Pansanel has completed the new serialization code for mychem and pgchem, so there is no need to handle stereo and non-stereo queries differently anymore. I have moved the index functions to GCC's vector extensions where applicable, and the first result is that index build times have been roughly cut by half while substructure search times have improved, but not that much.

Index build times


SystemIndex build time
pgchem with OpenBabel or Indigo352137 ms
pgchem with OpenBabel or Indigo vectorized192815 ms

OpenBabel with binary storage and FP2 fingerprint vectorized


QueryHitsno IndexHitswith Index
GH2484098416 ms484017044 ms
GH726094053 ms2601564 ms
GH13580113690 ms58034504 ms
GH162691099365 ms2691055154 ms

Merry Christmas and a happy new year!

Saturday, December 17, 2011

Benchmark data published

The base data used in the previous benchmarks can be found here: pubchem_0_100000.

Thursday, December 15, 2011

Selected GH17 results for 10^6 structures

GH17 substructure search speed


OpenBabel with binary+SMILES storage and FP2 fingerprint


QueryHitsno IndexHitswith Index
GH24840108807 ms484021164 ms
GH7260105050 ms2601934 ms
GH13580118978 ms58052416 ms
GH1626910109886 ms2691064742 ms

Indigo with binary storage and ext+sub fingerprint


QueryHitsno IndexHitswith Index
GH24840213075 ms484027887 ms
GH7410178963 ms4104451 ms
GH13580251938 ms58039134 ms
GH1627100172534 ms2710080523 ms

Bingo 1.7beta2 with molfiles as text storage


QueryHitsno IndexHitswith Index
GH24710647889 ms471021733 ms
GH7410538784 ms4106658 ms
GH13580675093 ms58012418 ms
GH1627100528891 ms2710028541 ms

Index build times


SystemIndex build time
pgchem with OpenBabel or Indigo352137 ms
Bingo3458681ms


Again, Bingo without it's index is apparently killed by the overhead of parsing text into the internal molecule format. With index it's a mixed bag, while it shines at GH13 and GH16, pgchem is about equal or faster at GH2 and GH7.

Wednesday, December 14, 2011

GH17 results

Mikhail Rybalkin from GGA Software asked me to do this, so here it is...

The GH17 test queries used


There is a small set of queries used in article Chemical substructure search in SQL by Golovin and Henrick. These queries were lately reused in other articles:

GH1 ONC1CC(C(O)C1O)[n]2cnc3c(NC4CC4)ncnc23
GH2 Nc1ncnc2[n]cnc12
GH3 CNc1ncnc2[n](C)cnc12
GH4 Nc1ncnc2[n](cnc12)C3CCCC3
GH5 CC12CCC3C(CCC4=CC(O)CCC34C)C1CCC2
GH6 OC2=CC(=O)c1c(cccc1)O2
GH7 Nc1nnc(S)s1
GH8 C1C2SCCN2C1
GH9 CP(O)(O)=O
GH10 CCCCCP(O)(O)=O
GH11 N2CCC13CCCCC1C2Cc4c3cccc4
GH12 s1cncc1
GH13 C34CCC1C(CCC2CC(=O)CCC12)C3CCC4
GH14 CCCCCCCCCCCP(O)(O)=O
GH15 CC1CCCC1
GH16 CCC1CCCC1
GH17 CCCC1CCCC1

GH17 substructure search speed


OpenBabel with binary+SMILES storage and FP2 fingerprint


QueryHitsno IndexHitswith Index
GH109517 ms025 ms
GH24848519 ms484111 ms
GH3638632 ms6343 ms
GH458950 ms548 ms
GH53610020 ms3678 ms
GH608696 ms032 ms
GH7268279 ms2631 ms
GH81708454 ms17056 ms
GH93488068 ms34871 ms
GH10368820 ms3621 ms
GH11669113 ms6652 ms
GH128317920 ms831124 ms
GH13589864 ms58448 ms
GH1449998 ms436 ms
GH1530088549 ms3008555 ms
GH1626918665 ms2691501 ms
GH1722908717 ms2290560 ms

Indigo with binary storage and ext+sub fingerprint


QueryHitsno IndexHitswith Index
GH1028161 ms032 ms
GH248419188 ms484187 ms
GH36820498 ms6878 ms
GH4523388 ms533 ms
GH53625887 ms3662 ms
GH6021034 ms031 ms
GH74116302 ms4131 ms
GH817016629 ms17078 ms
GH937314005 ms37398 ms
GH103716881 ms3721 ms
GH116624091 ms6678 ms
GH1282914842 ms829210 ms
GH135824470 ms58133 ms
GH14421403 ms436 ms
GH15304715817 ms3047749 ms
GH16271016767 ms2710732 ms
GH17230417524 ms2304788 ms

Bingo 1.7beta2 with molfiles as text storage


QueryHitsno IndexHitswith Index
GH1070277 ms0125 ms
GH247156821 ms471156 ms
GH36857754 ms68125 ms
GH4560067 ms5125 ms
GH53665586 ms36140 ms
GH67957188 ms79125 ms
GH74152134 ms41125 ms
GH817051685 ms170140 ms
GH937347613 ms373138 ms
GH103749961 ms37110 ms
GH116661176 ms66125 ms
GH1277450281 ms829156 ms
GH135861108 ms58156 ms
GH14453636 ms4140 ms
GH15304750213 ms3047343 ms
GH16271051227 ms2710327 ms
GH17230451495 ms2304362 ms

Fingerprint efficiency (with regard to false positives)


FP2


QueryCandidates screenedHits matchedfalse positivesEfficiency
GH10001.000
GH248548410.998
GH3696360.913
GH4375320.135
GH512036840.300
GH6790790.000
GH74126150.634
GH817717070.960
GH9377348290.923
GH10373610.973
GH111236610.537
GH1283183101.000
GH1313465812880.043
GH14204160.200
GH15376030087520.800
GH16330526916140.814
GH17330522907150.762

ext+sub


QueryCandidates screenedHits matchedfalse positivesEfficiency
GH10001.000
GH248448411.000
GH3686801.000
GH45501.000
GH54736110.766
GH60001.000
GH7414101.000
GH817017001.000
GH937337301.000
GH10373701.000
GH11666601.000
GH1282982901.000
GH13259582010.224
GH14204160.200
GH1530613047140.995
GH1627202710100.996
GH17272023044160.847

Index build times


SystemIndex build time
pgchem with OpenBabel or Indigo25690 ms
Bingo336319 ms


Indigo's ext+sub fingerprint is truly more selective than FP2. Still, OpenBabel with binary storage shows the better prformance because of its faster matcher lower query overhead.

Also, the result for GH3, GH7, GH9, GH12, GH15, GH16, and GH17 are different between OpenBabel and Indigo, and Bingo finds 79 hits for GH6 where pgchem finds zero.

Index building on pgchem is 13 times faster than Bingo, but since pgchem (currently) does not support features like tautomer searching or SMARTS searching with index support this comparison is a bit like apples and oranges.

The slow performance of Bingo without index, comparable to pgchem without binary storage, is quite likely a result of the storage of molecules in textual representation. Parsing text to binary molecules is a first class performance killer. Unfortunately, there is no way to convert molecules into native format directly with Bingo for PostgreSQL, but Bingo does the conversion implicitly when building the index.

Friday, December 9, 2011

OBMol de-/serialization revisited

The current serialization/deserialization mechanism in pgchem and mychem does not preserve stereochemistry of OBMol objects properly. As Tim Vandermeersch wrote:

The OBChiralData isn't used anymore. Also the functions OBAtom::IsClockwise, and OBAtom::IsAntiClockwise are obsolate. Instead, you should serialize the OBCisTransStereo and OBTetrahedralStereo data objects.

Fortunately (for me, since I don't have the time at the moment), Jérôme Pansanel has:

...started the serialization of the OBCisTransStereo and OBTetrahedralStereo objects.

For the time being, I have tried the following workaround and it seems to be working well. First, I have removed now unneccessary code from the unserialization:

bool unserializeOBMol(OBBase* pOb, const char *serializedInput)
{
  OBMol* pmol = pOb->CastAndClear<OBMol>();
  OBMol &mol = *pmol;
  unsigned int i,natoms,nbonds;

  unsigned int *intptr = (unsigned int*) serializedInput;

  ++intptr;

  natoms = *intptr;

  ++intptr;

  nbonds = *intptr;

  ++intptr;

  _ATOM *atomptr = (_ATOM*) intptr;

  mol.ReserveAtoms(natoms);

  OBAtom atom;
  int stereo;

  for (i = 1; i <= natoms; i++) {
    atom.SetIdx(atomptr->idx);
    atom.SetHyb(atomptr->hybridization);
    atom.SetAtomicNum((int) atomptr->atomicnum);
    atom.SetIsotope((unsigned int) atomptr->isotope);
    atom.SetFormalCharge((int) atomptr->formalcharge);
    stereo = atomptr->stereo;

    if(stereo == 3) {
      atom.SetChiral();
    }

    atom.SetSpinMultiplicity((short) atomptr->spinmultiplicity);

    if(atomptr->aromatic != 0) {
      atom.SetAromatic();
    }

    if (!mol.AddAtom(atom)) {
      return false;
    }

    atom.Clear();

    ++atomptr;
  }

  _BOND *bondptr = (_BOND*) atomptr;

  unsigned int start,end,order,flags;

  for (i = 0;i < nbonds;i++) {
    flags = 0;

    start = bondptr->beginidx;
    end = bondptr->endidx;
    order = (int) bondptr->order;

    if (start == 0 || end == 0 || order == 0 || start > natoms || end > natoms) {
      return false;
    }

    order = (unsigned int) (order == 4) ? 5 : order;

    stereo = bondptr->stereo;

    if (stereo) {
      if (stereo == 1) {
        flags |= OB_WEDGE_BOND;
      }
      if (stereo == 6) {
        flags |= OB_HASH_BOND;
      }
    }

    if (bondptr->aromatic != 0) {
      flags |= OB_AROMATIC_BOND;
    }

    if (!mol.AddBond(start,end,order,flags)) {
      return false;
    }

    ++bondptr;
  }

  intptr = (unsigned int*) bondptr;

  mol.SetAromaticPerceived();
  mol.SetKekulePerceived();

  return true;
}
Then, when matching, I check if the query might contain tetrahedral stereo. If yes, I build the target OBMol fresh from SMILES, if no directly from the serialized object:
 if (strchr (querysmi, '@') != NULL)
    {
        //Match against an OBMol generated from SMILES
    }
    else
    {
         //Match against an OBMol deserialized from binary
    }

Tuesday, December 6, 2011

First Light

As promised, here the first results for substructure searching c1ccccc1Cl on the first 100000 compounds from Pubchem: select * from pubchem.compound where compound >= 'c1ccccc1Cl'::molecule

Substructure search speed


RankBuildStorageFingerprintHitsno IndexHitswith Index
1OpenBabelbinary+SMILESFP280709236 ms8067936 ms
2Indigobinaryext+sub804924821 ms80492418 ms
3OpenBabelSMILESFP2807057971 ms80675432 ms

I've checked why the OpenBabel FP2 fingerprint eliminates three structures that otherwise would pass: Using the VF2 OBIsomorphismMapper instead of the OBSmartsPattern, it's also 8067 without index. But it's about four times slower, 90526 ms without index, 8798 with index.
The structures in question are: 80944, 83450 and 99925 and I'm pretty sure it's caused by differences in aromaticity detection.

Stereochemistry in substructure searches


CheckQueryExpectedIndigo binaryOpenBabel SMILESOpenBabel binary
R/S differentselect 'C[C@H]([C@@H](C(=O)O)N)O'::molecule <= 'O=C(O)[C@H](N)[C@@H](O)C'::moleculefalsepassfailfail
R/S sameselect 'C[C@H]([C@@H](C(=O)O)N)O'::molecule <= 'C[C@H]([C@@H](C(=O)O)N)O'::moleculetruepasspasspass
E/Z different select 'C(=C\Cl)\Cl'::molecule <= 'C(=C/Cl)\Cl'::moleculefalsepassfailfail

OpenBabel fails the 'E/Z different' and 'R/S different' checks, but these are known issues up to version 2.3.1. More disturbing are the issues with 'R/S different'. For SMILES I can reproduce it with obgrep, so it's not my code causing it.

Indigo has no obvious issues with matching R/S and E/Z stereochemistry. Speedwise, OpenBabel with binary storage would be the leader of the pack, but has inconsistent behaviour compared with it's SMILES storage sibling. So the serialization/deserialization code clearly is missing something. But I've found a simple workaround: If the query contains chirality information it uses SMILES, otherwise binary. Speedwise, OpenBabel with binary storage now is the leader of the pack.

Fingerprint efficiency


Fingerprint typeCandidates screenedHits matchedfalse positivesEfficiency
ext+sub80908049410.995
FP281458067780.99

Pretty close.

Wednesday, November 30, 2011

What's up, doc?


OK, I have now a working (i.e. not crashing & giving meaningful results) hybrid version of pgchem::tigress with OpenBabel and Indigo, where the OpenBabel functions in the search engine have been replaced by Indigo functionality.

Why?

  • To see if it can be done :-)
  • To learn more about programming with Indigo
  • To see which is faster
  • Indigo has better reaction support as it seems

Interestingly I could keep the external interface almost stable. As only a dozen or so functions outside the search engine are affected by the change, I think with some #IFDEFs I can make a common codebase that compiles either to 100% OpenBabel or the hybrid version.

Monday, November 14, 2011

R-Matrix implementation with Indigo

Long time no see.

I have been trying programming with Indigo recently and here is my first, straightforward (to avoid the term 'naive') implementation of an Ugi like R-Matrix (Gasteiger/Engel, Chemoinformatics, 2003, Wiley-VCH, p. 185ff.) for adding make/break/change bond detection to its reaction mapper:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "indigo.h"

static inline int mp(const int column, const int row, const int maxRows)
{
    return column-1+((row-1)*maxRows);
}

int main(int argc, char *argv[])
{

    int index, reaction, reactant, product, atom, bond, atomIter, reactantIter, productIter, neighborIter, totalAtoms, mappingNumber, targetMappingNumber, bondMakeBreak;
    int *bMatrix, *eMatrix;//, *rMatrix;

    totalAtoms=0;

    qword session = indigoAllocSessionId();

    indigoSetSessionId(session);

    printf("%s\n",indigoVersion());

    reaction = indigoLoadReactionFromFile("c:/temp/testrxn.rxn");

    //indigoFoldHydrogens(handle);

    indigoDearomatize(reaction);

    //printf("1:%s\n",indigoGetLastError());

    indigoAutomap(reaction,"discard");

    reactantIter = indigoIterateReactants(reaction);

    while((reactant = indigoNext(reactantIter)))
    {
        totalAtoms+=indigoCountAtoms(reactant);
    }

    indigoFree(reactantIter);

    bMatrix = (int*) calloc(totalAtoms*totalAtoms,sizeof(int));
    eMatrix = (int*) calloc(totalAtoms*totalAtoms,sizeof(int));
    //rMatrix = (int*) calloc(totalAtoms*totalAtoms,sizeof(int));

    reactantIter = indigoIterateReactants(reaction);

    while((reactant = indigoNext(reactantIter)))
    {
        atomIter = indigoIterateAtoms(reactant);
        while((atom = indigoNext(atomIter)))
        {
            mappingNumber = indigoGetAtomMappingNumber(reaction,atom);
            //printf("%i\n",mappingNumber);
            //indigoGetExplicitValence(atom, &valence);
            //printf("%i\n",valence);
            //bMatrix[mp(mappingNumber, mappingNumber, totalAtoms)]=valence;

            neighborIter = indigoIterateNeighbors(atom);

            while((atom = indigoNext(neighborIter)))
            {
                targetMappingNumber = indigoGetAtomMappingNumber(reaction,atom);

                index = mp(mappingNumber, targetMappingNumber, totalAtoms);

                if(bMatrix[index]==0)
                {
                    bond=indigoBond(atom);
                    bMatrix[index]=indigoBondOrder(bond);
                }
            }

            //printf("\n");

            indigoFree(neighborIter);
        }
        indigoFree(atomIter);
    }

    indigoFree(reactantIter);

    productIter = indigoIterateProducts(reaction);

    while((product = indigoNext(productIter)))
    {
        atomIter = indigoIterateAtoms(product);
        while((atom = indigoNext(atomIter)))
        {
            mappingNumber = indigoGetAtomMappingNumber(reaction,atom);
            //printf("%i\n",mappingNumber);
            //indigoGetExplicitValence(atom, &valence);
            //printf("%i\n",valence);
            //eMatrix[mp(mappingNumber, mappingNumber, totalAtoms)]=valence;

            neighborIter = indigoIterateNeighbors(atom);

            while((atom = indigoNext(neighborIter)))
            {
                targetMappingNumber = indigoGetAtomMappingNumber(reaction,atom);

                index = mp(mappingNumber, targetMappingNumber, totalAtoms);

                //printf("%i",indigoBondOrder(bond));
                if(eMatrix[index]==0)
                {
                    bond=indigoBond(atom);
                    eMatrix[index]=indigoBondOrder(bond);
                }
            }

            //printf("\n");

            indigoFree(neighborIter);
        }
        indigoFree(atomIter);
    }

    indigoFree(productIter);

    for(int i=0; i<(totalAtoms*totalAtoms); i++)
    {
        bondMakeBreak=eMatrix[i]-bMatrix[i];

        //rMatrix[i]=bondMakeBreak;

        if(bondMakeBreak>0)
        {
            if(bMatrix[i]>0)
                printf("change: %i to %i\n",(i / totalAtoms)+1,(i % totalAtoms)+1);
            else
                printf("make: %i to %i\n",(i / totalAtoms)+1,(i % totalAtoms)+1);
        }
        else if(bondMakeBreak<0)
        {
            if(bMatrix[i]>1)
                printf("change: %i to %i\n",(i / totalAtoms)+1,(i % totalAtoms)+1);
            else
                printf("break: %i to %i\n",(i / totalAtoms)+1,(i % totalAtoms)+1);
        }
    }

    //printf("2:%s\n",indigoGetLastError());

    indigoSaveRxnfileToFile(reaction,"c:/temp/testrxn_mapped.rxn");

    //printf("3:%s\n",indigoGetLastError());

    indigoFree(reaction);
    indigoReleaseSessionId(session);
    free(bMatrix);
    free(eMatrix);
    //free(rMatrix);

    return 0;
}
At the moment it has the following shortcomings:
  1. 50% too much memory consumption, since only half of the R-Matrix would suffice
  2. Each bond is found twice (Atom1 to Atom2 and Atom2 to Atom1)
  3. The make/break/change info is not added to the mapped reaction file

Have fun!

Monday, May 23, 2011

Putting things into perspective

How to do it the wrong way:

Radiation in Japan

I am no physicist, but this is so straightforward ignorant of factors like long term exposure effects, ingestion of radionuclides and non-uniform contamination that it leaves me speechless.

Dear silicon valley executives with geiger counters, leave them at home. Please.

Monday, May 9, 2011

JSDraw available again

Apprently, the HTML/JavaScript chemical editor JSDraw is available again:

JSDraw

As being haunted by 'issues' with plugin type editors based on Java or ActiveX for years, it's nice to see that browser-native editing/depiction of chemical structures is gaining more and more momentum.

The other players I know of are:

ChemWriter
Ketcher
Sketcher 2D

Friday, April 29, 2011

Java in-memory compression

Sometimes it would be nice to have a means to compress seldom used objects in memory to time for space (since compression/decompression takes a bit of overhead time). Especially in chemoinformatics applications where objects often carry substantial compactible information, e.g. textual representations of structures, the compression ratio is quite noticable.

Sure, one could do it for every variable individually, but how about compressing whole Objects using the standard serializing mechanism?

If you use WEKA, you have the SerializedObject.

If not - enter the CompressedReference:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedReference<T extends Serializable> implements Serializable {

private static final long serialVersionUID = 7967994340450625830L;

private byte[] theCompressedReferent = null;

public CompressedReference(T referent) {
try {
compress(referent);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}

public int size() {
return theCompressedReferent.length;
}

public T get() {
try {
return decompress();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ClassNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return null;
}

private void compress(T referent) throws IOException {

ByteArrayOutputStream bos = new ByteArrayOutputStream();
GZIPOutputStream zos = new GZIPOutputStream(bos);
ObjectOutputStream ous = new ObjectOutputStream(zos);

ous.writeObject(referent);

zos.finish();

bos.flush();

theCompressedReferent = bos.toByteArray();

bos.close();
}

@SuppressWarnings("unchecked")
private T decompress() throws IOException, ClassNotFoundException {
T tmpObject = null;
ByteArrayInputStream bis = new ByteArrayInputStream(theCompressedReferent);
GZIPInputStream zis = new GZIPInputStream(bis);
ObjectInputStream ois = new ObjectInputStream(zis);
tmpObject = (T) ois.readObject();

ois.close();

return tmpObject;
}
}

A quick test shows 528 byte size for a String of 250 characters (since Unicode needs two bytes per char) and 64 bytes after compression, a ratio of about 8:1. The only requirement is that the Object stored in the reference has to implement Serializable.

And yes, I have to clear those TODO reminders... ;-)

Update
The compression ratio for a CDK Molecule representation of 2-Fluoronaphthalene is about 3:1.
The compression ratio for the V2000 mofile of 2-Fluoronaphthalene is about 7:1.