The plate is bad: December 2011

Friday, December 23, 2011

Christmas presents

Jérôme Pansanel has completed the new serialization code for mychem and pgchem, so stereo and non-stereo queries no longer need to be handled differently. I have also moved the index functions to GCC's vector extensions where applicable. The first result: index build times have been cut roughly in half (from 352137 ms to 192815 ms), while substructure search times have improved less dramatically.
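The kind of fingerprint screening that benefits from vectorization can be sketched as follows. This is only an illustration, not pgchem's actual index code: the names v4u, FP_WORDS and fpContains are made up for the example, and it assumes a 1024-bit fingerprint (the length of FP2) stored as 32-bit words. It uses the GCC vector_size attribute to check four words at a time whether every bit set in the query is also set in the target.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// 128-bit vector holding four 32-bit lanes (GCC/Clang vector extension).
typedef std::uint32_t v4u __attribute__((vector_size(16)));

// Assumption for this sketch: 1024-bit fingerprint = 32 words of 32 bits.
const std::size_t FP_WORDS = 32;

// Returns true if every bit set in `query` is also set in `target`,
// i.e. the target passes the substructure screen for this query.
bool fpContains(const std::uint32_t *target, const std::uint32_t *query)
{
    v4u missing = {0, 0, 0, 0};

    for (std::size_t i = 0; i < FP_WORDS; i += 4) {
        v4u t, q;
        std::memcpy(&t, target + i, sizeof t);  // load without alignment assumptions
        std::memcpy(&q, query + i, sizeof q);
        missing |= q & ~t;  // bits the query needs but the target lacks
    }

    return (missing[0] | missing[1] | missing[2] | missing[3]) == 0;
}
```

The loop accumulates all missing bits and tests once at the end, which keeps the body branch-free; whether an early exit would be faster depends on the data.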

Index build times

System	Index build time
pgchem with OpenBabel or Indigo	352137 ms
pgchem with OpenBabel or Indigo vectorized	192815 ms

OpenBabel with binary storage and FP2 fingerprint vectorized

Query	Hits (no index)	Time (no index)	Hits (with index)	Time (with index)
GH2	4840	98416 ms	4840	17044 ms
GH7	260	94053 ms	260	1564 ms
GH13	580	113690 ms	580	34504 ms
GH16	26910	99365 ms	26910	55154 ms

Merry Christmas and a happy new year!

Saturday, December 17, 2011

Benchmark data published

The base data used in the previous benchmarks can be found here: pubchem_0_100000.

Thursday, December 15, 2011

Selected GH17 results for 10^6 structures

GH17 substructure search speed

OpenBabel with binary+SMILES storage and FP2 fingerprint

Query	Hits (no index)	Time (no index)	Hits (with index)	Time (with index)
GH2	4840	108807 ms	4840	21164 ms
GH7	260	105050 ms	260	1934 ms
GH13	580	118978 ms	580	52416 ms
GH16	26910	109886 ms	26910	64742 ms

Indigo with binary storage and ext+sub fingerprint

Query	Hits (no index)	Time (no index)	Hits (with index)	Time (with index)
GH2	4840	213075 ms	4840	27887 ms
GH7	410	178963 ms	410	4451 ms
GH13	580	251938 ms	580	39134 ms
GH16	27100	172534 ms	27100	80523 ms

Bingo 1.7beta2 with molfiles as text storage

Query	Hits (no index)	Time (no index)	Hits (with index)	Time (with index)
GH2	4710	647889 ms	4710	21733 ms
GH7	410	538784 ms	410	6658 ms
GH13	580	675093 ms	580	12418 ms
GH16	27100	528891 ms	27100	28541 ms

Index build times

System	Index build time
pgchem with OpenBabel or Indigo	352137 ms
Bingo	3458681 ms

Again, Bingo without its index is apparently killed by the overhead of parsing text into the internal molecule format. With index it's a mixed bag: Bingo shines at GH13 and GH16, while pgchem is about equal or faster at GH2 and GH7.

Wednesday, December 14, 2011

GH17 results

Mikhail Rybalkin from GGA Software asked me to do this, so here it is...

The GH17 test queries used

There is a small set of queries used in the article Chemical substructure search in SQL by Golovin and Henrick. These queries have since been reused in other articles:

GH1 ONC1CC(C(O)C1O)[n]2cnc3c(NC4CC4)ncnc23
GH2 Nc1ncnc2[n]cnc12
GH3 CNc1ncnc2[n](C)cnc12
GH4 Nc1ncnc2[n](cnc12)C3CCCC3
GH5 CC12CCC3C(CCC4=CC(O)CCC34C)C1CCC2
GH6 OC2=CC(=O)c1c(cccc1)O2
GH7 Nc1nnc(S)s1
GH8 C1C2SCCN2C1
GH9 CP(O)(O)=O
GH10 CCCCCP(O)(O)=O
GH11 N2CCC13CCCCC1C2Cc4c3cccc4
GH12 s1cncc1
GH13 C34CCC1C(CCC2CC(=O)CCC12)C3CCC4
GH14 CCCCCCCCCCCP(O)(O)=O
GH15 CC1CCCC1
GH16 CCC1CCCC1
GH17 CCCC1CCCC1

GH17 substructure search speed

OpenBabel with binary+SMILES storage and FP2 fingerprint

Query	Hits (no index)	Time (no index)	Hits (with index)	Time (with index)
GH1	0	9517 ms	0	25 ms
GH2	484	8519 ms	484	111 ms
GH3	63	8632 ms	63	43 ms
GH4	5	8950 ms	5	48 ms
GH5	36	10020 ms	36	78 ms
GH6	0	8696 ms	0	32 ms
GH7	26	8279 ms	26	31 ms
GH8	170	8454 ms	170	56 ms
GH9	348	8068 ms	348	71 ms
GH10	36	8820 ms	36	21 ms
GH11	66	9113 ms	66	52 ms
GH12	831	7920 ms	831	124 ms
GH13	58	9864 ms	58	448 ms
GH14	4	9998 ms	4	36 ms
GH15	3008	8549 ms	3008	555 ms
GH16	2691	8665 ms	2691	501 ms
GH17	2290	8717 ms	2290	560 ms

Indigo with binary storage and ext+sub fingerprint

Query	Hits (no index)	Time (no index)	Hits (with index)	Time (with index)
GH1	0	28161 ms	0	32 ms
GH2	484	19188 ms	484	187 ms
GH3	68	20498 ms	68	78 ms
GH4	5	23388 ms	5	33 ms
GH5	36	25887 ms	36	62 ms
GH6	0	21034 ms	0	31 ms
GH7	41	16302 ms	41	31 ms
GH8	170	16629 ms	170	78 ms
GH9	373	14005 ms	373	98 ms
GH10	37	16881 ms	37	21 ms
GH11	66	24091 ms	66	78 ms
GH12	829	14842 ms	829	210 ms
GH13	58	24470 ms	58	133 ms
GH14	4	21403 ms	4	36 ms
GH15	3047	15817 ms	3047	749 ms
GH16	2710	16767 ms	2710	732 ms
GH17	2304	17524 ms	2304	788 ms

Bingo 1.7beta2 with molfiles as text storage

Query	Hits (no index)	Time (no index)	Hits (with index)	Time (with index)
GH1	0	70277 ms	0	125 ms
GH2	471	56821 ms	471	156 ms
GH3	68	57754 ms	68	125 ms
GH4	5	60067 ms	5	125 ms
GH5	36	65586 ms	36	140 ms
GH6	79	57188 ms	79	125 ms
GH7	41	52134 ms	41	125 ms
GH8	170	51685 ms	170	140 ms
GH9	373	47613 ms	373	138 ms
GH10	37	49961 ms	37	110 ms
GH11	66	61176 ms	66	125 ms
GH12	774	50281 ms	829	156 ms
GH13	58	61108 ms	58	156 ms
GH14	4	53636 ms	4	140 ms
GH15	3047	50213 ms	3047	343 ms
GH16	2710	51227 ms	2710	327 ms
GH17	2304	51495 ms	2304	362 ms

Fingerprint efficiency (with regard to false positives)

FP2

Query	Candidates screened	Hits matched	false positives	Efficiency
GH1	0	0	0	1.000
GH2	485	484	1	0.998
GH3	69	63	6	0.913
GH4	37	5	32	0.135
GH5	120	36	84	0.300
GH6	79	0	79	0.000
GH7	41	26	15	0.634
GH8	177	170	7	0.960
GH9	377	348	29	0.923
GH10	37	36	1	0.973
GH11	123	66	57	0.537
GH12	831	831	0	1.000
GH13	1346	58	1288	0.043
GH14	20	4	16	0.200
GH15	3760	3008	752	0.800
GH16	3305	2691	614	0.814
GH17	3005	2290	715	0.762

ext+sub

Query	Candidates screened	Hits matched	false positives	Efficiency
GH1	0	0	0	1.000
GH2	484	484	0	1.000
GH3	68	68	0	1.000
GH4	5	5	0	1.000
GH5	47	36	11	0.766
GH6	0	0	0	1.000
GH7	41	41	0	1.000
GH8	170	170	0	1.000
GH9	373	373	0	1.000
GH10	37	37	0	1.000
GH11	66	66	0	1.000
GH12	829	829	0	1.000
GH13	259	58	201	0.224
GH14	20	4	16	0.200
GH15	3061	3047	14	0.995
GH16	2720	2710	10	0.996
GH17	2720	2304	416	0.847
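
The columns of the two tables above are related by simple arithmetic, which can be sketched like this. The function names are made up for the example; the convention that zero candidates counts as an efficiency of 1.000 is read off the GH1 rows, not taken from any pgchem code.

```cpp
#include <cassert>

// Candidates that passed the fingerprint screen but failed the exact match.
unsigned falsePositives(unsigned candidates, unsigned hits)
{
    assert(hits <= candidates);
    return candidates - hits;
}

// Share of screened candidates that turn out to be real hits.
// With no candidates at all, nothing was wasted, so this returns 1.0
// (the convention used in the GH1 rows).
double screeningEfficiency(unsigned candidates, unsigned hits)
{
    assert(hits <= candidates);
    if (candidates == 0) {
        return 1.0;
    }
    return static_cast<double>(hits) / candidates;
}
```

For the FP2 row of GH13, falsePositives(1346, 58) gives 1288 and screeningEfficiency(1346, 58) gives about 0.043.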

Index build times

System	Index build time
pgchem with OpenBabel or Indigo	25690 ms
Bingo	336319 ms

Indigo's ext+sub fingerprint is clearly more selective than FP2. Still, OpenBabel with binary storage shows the better performance because of its ~~faster matcher~~ lower query overhead.

Also, the results for GH3, GH7, GH9, GH10, GH12, GH15, GH16, and GH17 differ between OpenBabel and Indigo, and Bingo finds 79 hits for GH6 where pgchem finds zero.

Index building on pgchem is about 13 times faster than on Bingo, but since pgchem (currently) does not support features like tautomer searching or SMARTS searching with index support, this comparison is a bit like apples and oranges.
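
The factor of 13 follows directly from the two index build times in the table above. As a small sketch (the helper name speedup is made up for the example):

```cpp
#include <cassert>

// Ratio of two timings; a value above 1 means the second timing is faster.
double speedup(double baselineMs, double improvedMs)
{
    assert(improvedMs > 0.0);
    return baselineMs / improvedMs;
}
```

With the figures above, speedup(336319, 25690) comes out at roughly 13.1.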

The slow performance of Bingo without index, comparable to pgchem without binary storage, is quite likely a result of storing molecules in textual representation. Parsing text to binary molecules is a first-class performance killer. Unfortunately, there is no way to convert molecules into native format directly with Bingo for PostgreSQL, but Bingo does the conversion implicitly when building the index.

Friday, December 9, 2011

OBMol de-/serialization revisited

The current serialization/deserialization mechanism in pgchem and mychem does not preserve stereochemistry of OBMol objects properly. As Tim Vandermeersch wrote:

The OBChiralData isn't used anymore. Also the functions OBAtom::IsClockwise, and OBAtom::IsAntiClockwise are obsolete. Instead, you should serialize the OBCisTransStereo and OBTetrahedralStereo data objects.

Fortunately (for me, since I don't have the time at the moment), Jérôme Pansanel has:

...started the serialization of the OBCisTransStereo and OBTetrahedralStereo objects.

For the time being, I have tried the following workaround, and it seems to be working well. First, I have removed the now unnecessary code from the unserialization:

bool unserializeOBMol(OBBase* pOb, const char *serializedInput)
{
  OBMol* pmol = pOb->CastAndClear<OBMol>();
  OBMol &mol = *pmol;
  unsigned int i,natoms,nbonds;

  unsigned int *intptr = (unsigned int*) serializedInput;

  ++intptr;

  natoms = *intptr;

  ++intptr;

  nbonds = *intptr;

  ++intptr;

  _ATOM *atomptr = (_ATOM*) intptr;

  mol.ReserveAtoms(natoms);

  OBAtom atom;
  int stereo;

  for (i = 1; i <= natoms; i++) {
    atom.SetIdx(atomptr->idx);
    atom.SetHyb(atomptr->hybridization);
    atom.SetAtomicNum((int) atomptr->atomicnum);
    atom.SetIsotope((unsigned int) atomptr->isotope);
    atom.SetFormalCharge((int) atomptr->formalcharge);
    stereo = atomptr->stereo;

    if(stereo == 3) {
      atom.SetChiral();
    }

    atom.SetSpinMultiplicity((short) atomptr->spinmultiplicity);

    if(atomptr->aromatic != 0) {
      atom.SetAromatic();
    }

    if (!mol.AddAtom(atom)) {
      return false;
    }

    atom.Clear();

    ++atomptr;
  }

  _BOND *bondptr = (_BOND*) atomptr;

  unsigned int start,end,order,flags;

  for (i = 0;i < nbonds;i++) {
    flags = 0;

    start = bondptr->beginidx;
    end = bondptr->endidx;
    order = (int) bondptr->order;

    if (start == 0 || end == 0 || order == 0 || start > natoms || end > natoms) {
      return false;
    }

    // Stored order 4 is mapped to 5 (presumably aromatic: molfile bond type 4, OpenBabel 2.x bond order 5)
    order = (order == 4) ? 5 : order;

    stereo = bondptr->stereo;

    if (stereo) {
      if (stereo == 1) {
        flags |= OB_WEDGE_BOND;
      }
      if (stereo == 6) {
        flags |= OB_HASH_BOND;
      }
    }

    if (bondptr->aromatic != 0) {
      flags |= OB_AROMATIC_BOND;
    }

    if (!mol.AddBond(start,end,order,flags)) {
      return false;
    }

    ++bondptr;
  }

  intptr = (unsigned int*) bondptr;

  mol.SetAromaticPerceived();
  mol.SetKekulePerceived();

  return true;
}
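
As a side note on the header handling in the function above: reading through a casted pointer relies on the buffer being suitably aligned for unsigned int. A more defensive variant, sketched here under assumptions, copies the three header words out with memcpy instead. The struct SerializedHeader and the function readHeader are hypothetical names for this example; the meaning of the first word is not shown in the code above, so it is only carried along.

```cpp
#include <cstring>

// Hypothetical mirror of the three leading unsigned ints in the buffer.
struct SerializedHeader {
    unsigned int first;   // skipped by unserializeOBMol; meaning not shown in the post
    unsigned int natoms;  // number of atom records that follow
    unsigned int nbonds;  // number of bond records after the atoms
};

// Copies the header out of the buffer without assuming any alignment
// and returns a pointer to the first byte after it.
const char *readHeader(const char *serializedInput, SerializedHeader &hdr)
{
    std::memcpy(&hdr, serializedInput, sizeof hdr);
    return serializedInput + sizeof hdr;
}
```

The returned pointer would then take the place of the intptr value that the original code casts to _ATOM*.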

Then, when matching, I check whether the query might contain tetrahedral stereo. If it does, I build the target OBMol fresh from SMILES; if not, directly from the serialized object:

  if (strchr(querysmi, '@') != NULL) {
    // Match against an OBMol generated from SMILES
  } else {
    // Match against an OBMol deserialized from binary
  }
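
The decision itself can be isolated into a small helper, sketched below. The enum TargetSource and the function chooseTargetSource are made-up names for this example. The optional alsoCisTrans flag is an extension that is not part of the workaround described above: it additionally treats the SMILES bond direction markers / and \ as a reason to fall back to SMILES.

```cpp
#include <cstring>

enum TargetSource { FROM_SMILES, FROM_BINARY };

// Decides how the target molecule should be built for a given query SMILES.
// '@' marks tetrahedral stereo in SMILES; '/' and '\' mark cis/trans bonds.
TargetSource chooseTargetSource(const char *querysmi, bool alsoCisTrans = false)
{
    const char *markers = alsoCisTrans ? "@/\\" : "@";
    return std::strpbrk(querysmi, markers) != NULL ? FROM_SMILES : FROM_BINARY;
}
```

With the default flag this behaves like the strchr check above: a query such as c1ccccc1Cl goes down the fast binary path, while C[C@H](N)O is matched against a molecule rebuilt from SMILES.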

Tuesday, December 6, 2011

First Light

As promised, here are the first results for substructure searching c1ccccc1Cl on the first 100000 compounds from PubChem: select * from pubchem.compound where compound >= 'c1ccccc1Cl'::molecule

Substructure search speed

Rank	Build	Storage	Fingerprint	Hits (no index)	Time (no index)	Hits (with index)	Time (with index)
1	OpenBabel	binary+SMILES	FP2	8070	9236 ms	8067	936 ms
2	Indigo	binary	ext+sub	8049	24821 ms	8049	2418 ms
3	OpenBabel	SMILES	FP2	8070	57971 ms	8067	5432 ms

I've checked why the OpenBabel FP2 fingerprint eliminates three structures that would otherwise pass. Using the VF2 OBIsomorphismMapper instead of OBSmartsPattern, the hit count is also 8067 without index. But it's about four times slower: 90526 ms without index, 8798 ms with index.
The structures in question are 80944, 83450, and 99925, and I'm pretty sure the discrepancy is caused by differences in aromaticity detection.

Stereochemistry in substructure searches

Check	Query	Expected	Indigo binary	OpenBabel SMILES	OpenBabel binary
R/S different	select 'C[C@H]([C@@H](C(=O)O)N)O'::molecule <= 'O=C(O)[C@H](N)[C@@H](O)C'::molecule	false	pass	fail	fail
R/S same	select 'C[C@H]([C@@H](C(=O)O)N)O'::molecule <= 'C[C@H]([C@@H](C(=O)O)N)O'::molecule	true	pass	pass	pass
E/Z different	select 'C(=C\Cl)\Cl'::molecule <= 'C(=C/Cl)\Cl'::molecule	false	pass	fail	fail

OpenBabel fails the 'E/Z different' and 'R/S different' checks, but these are known issues up to version 2.3.1. ~~More disturbing are the issues with 'R/S different'. For SMILES I can reproduce it with obgrep, so it's not my code causing it.~~

Indigo has no obvious issues with matching R/S and E/Z stereochemistry. ~~Speedwise, OpenBabel with binary storage would be the leader of the pack, but has inconsistent behaviour compared with its SMILES storage sibling.~~ So the serialization/deserialization code is clearly missing something. But I've found a simple workaround: if the query contains chirality information, it uses SMILES, otherwise binary. Speedwise, OpenBabel with binary storage is now the leader of the pack.

Fingerprint efficiency

Fingerprint type	Candidates screened	Hits matched	false positives	Efficiency
ext+sub	8090	8049	41	0.995
FP2	8145	8067	78	0.990

Pretty close.
