Wednesday, September 26, 2012
Friday, December 23, 2011
Christmas presents
Jérôme Pansanel has completed the new serialization code for mychem and pgchem, so stereo and non-stereo queries no longer need to be handled differently. I have moved the index functions to GCC's vector extensions where applicable. The first result: index build times are roughly cut in half (352137 ms down to 192815 ms), while substructure search times have improved only modestly.
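To illustrate the kind of operation that benefits from vectorization, here is a minimal sketch of a fingerprint screening test written with GCC's vector extensions. It is not the pgchem code: the function name `fp_contains`, the 32-bit word layout, and the requirement that the fingerprint length is a multiple of four words are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Four 32-bit words processed as one 128-bit vector (GCC/Clang extension).
typedef std::uint32_t v4u32 __attribute__((vector_size(16)));

// Returns true if every bit set in `query` is also set in `candidate`.
// Assumption for this sketch: nwords is a multiple of 4.
bool fp_contains(const std::uint32_t* candidate,
                 const std::uint32_t* query,
                 std::size_t nwords)
{
    assert(nwords % 4 == 0);
    for (std::size_t i = 0; i < nwords; i += 4) {
        v4u32 c, q;
        // memcpy avoids alignment and aliasing problems on the raw buffers.
        std::memcpy(&c, candidate + i, sizeof c);
        std::memcpy(&q, query + i, sizeof q);
        // Bits required by the query but absent from the candidate.
        v4u32 missing = q & ~c;
        if (missing[0] | missing[1] | missing[2] | missing[3]) {
            return false;
        }
    }
    return true;
}
```

A candidate that passes this test still has to go through the full substructure match, since a fingerprint screen can only rule structures out.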
Index build times
| System | Index build time |
|---|---|
| pgchem with OpenBabel or Indigo | 352137 ms |
| pgchem with OpenBabel or Indigo vectorized | 192815 ms |
OpenBabel with binary storage and FP2 fingerprint vectorized
| Query | Hits | no Index | Hits | with Index |
|---|---|---|---|---|
| GH2 | 4840 | 98416 ms | 4840 | 17044 ms |
| GH7 | 260 | 94053 ms | 260 | 1564 ms |
| GH13 | 580 | 113690 ms | 580 | 34504 ms |
| GH16 | 26910 | 99365 ms | 26910 | 55154 ms |
Merry Christmas and a happy new year!
Saturday, December 17, 2011
Benchmark data published
Thursday, December 15, 2011
Selected GH17 results for 10^6 structures
GH17 substructure search speed
OpenBabel with binary+SMILES storage and FP2 fingerprint
| Query | Hits | no Index | Hits | with Index |
|---|---|---|---|---|
| GH2 | 4840 | 108807 ms | 4840 | 21164 ms |
| GH7 | 260 | 105050 ms | 260 | 1934 ms |
| GH13 | 580 | 118978 ms | 580 | 52416 ms |
| GH16 | 26910 | 109886 ms | 26910 | 64742 ms |
Indigo with binary storage and ext+sub fingerprint
| Query | Hits | no Index | Hits | with Index |
|---|---|---|---|---|
| GH2 | 4840 | 213075 ms | 4840 | 27887 ms |
| GH7 | 410 | 178963 ms | 410 | 4451 ms |
| GH13 | 580 | 251938 ms | 580 | 39134 ms |
| GH16 | 27100 | 172534 ms | 27100 | 80523 ms |
Bingo 1.7beta2 with molfiles as text storage
| Query | Hits | no Index | Hits | with Index |
|---|---|---|---|---|
| GH2 | 4710 | 647889 ms | 4710 | 21733 ms |
| GH7 | 410 | 538784 ms | 410 | 6658 ms |
| GH13 | 580 | 675093 ms | 580 | 12418 ms |
| GH16 | 27100 | 528891 ms | 27100 | 28541 ms |
Index build times
| System | Index build time |
|---|---|
| pgchem with OpenBabel or Indigo | 352137 ms |
| Bingo | 3458681 ms |
Again, Bingo without its index is apparently killed by the overhead of parsing text into the internal molecule format. With index it's a mixed bag: Bingo shines at GH13 and GH16, while pgchem is about equal or faster at GH2 and GH7.
Wednesday, December 14, 2011
GH17 results
Mikhail Rybalkin from GGA Software asked me to do this, so here it is...
The GH17 test queries used
The article "Chemical substructure search in SQL" by Golovin and Henrick uses a small set of queries, which were recently reused in other articles:
GH1 ONC1CC(C(O)C1O)[n]2cnc3c(NC4CC4)ncnc23
GH2 Nc1ncnc2[n]cnc12
GH3 CNc1ncnc2[n](C)cnc12
GH4 Nc1ncnc2[n](cnc12)C3CCCC3
GH5 CC12CCC3C(CCC4=CC(O)CCC34C)C1CCC2
GH6 OC2=CC(=O)c1c(cccc1)O2
GH7 Nc1nnc(S)s1
GH8 C1C2SCCN2C1
GH9 CP(O)(O)=O
GH10 CCCCCP(O)(O)=O
GH11 N2CCC13CCCCC1C2Cc4c3cccc4
GH12 s1cncc1
GH13 C34CCC1C(CCC2CC(=O)CCC12)C3CCC4
GH14 CCCCCCCCCCCP(O)(O)=O
GH15 CC1CCCC1
GH16 CCC1CCCC1
GH17 CCCC1CCCC1
GH17 substructure search speed
OpenBabel with binary+SMILES storage and FP2 fingerprint
| Query | Hits | no Index | Hits | with Index |
|---|---|---|---|---|
| GH1 | 0 | 9517 ms | 0 | 25 ms |
| GH2 | 484 | 8519 ms | 484 | 111 ms |
| GH3 | 63 | 8632 ms | 63 | 43 ms |
| GH4 | 5 | 8950 ms | 5 | 48 ms |
| GH5 | 36 | 10020 ms | 36 | 78 ms |
| GH6 | 0 | 8696 ms | 0 | 32 ms |
| GH7 | 26 | 8279 ms | 26 | 31 ms |
| GH8 | 170 | 8454 ms | 170 | 56 ms |
| GH9 | 348 | 8068 ms | 348 | 71 ms |
| GH10 | 36 | 8820 ms | 36 | 21 ms |
| GH11 | 66 | 9113 ms | 66 | 52 ms |
| GH12 | 831 | 7920 ms | 831 | 124 ms |
| GH13 | 58 | 9864 ms | 58 | 448 ms |
| GH14 | 4 | 9998 ms | 4 | 36 ms |
| GH15 | 3008 | 8549 ms | 3008 | 555 ms |
| GH16 | 2691 | 8665 ms | 2691 | 501 ms |
| GH17 | 2290 | 8717 ms | 2290 | 560 ms |
Indigo with binary storage and ext+sub fingerprint
| Query | Hits | no Index | Hits | with Index |
|---|---|---|---|---|
| GH1 | 0 | 28161 ms | 0 | 32 ms |
| GH2 | 484 | 19188 ms | 484 | 187 ms |
| GH3 | 68 | 20498 ms | 68 | 78 ms |
| GH4 | 5 | 23388 ms | 5 | 33 ms |
| GH5 | 36 | 25887 ms | 36 | 62 ms |
| GH6 | 0 | 21034 ms | 0 | 31 ms |
| GH7 | 41 | 16302 ms | 41 | 31 ms |
| GH8 | 170 | 16629 ms | 170 | 78 ms |
| GH9 | 373 | 14005 ms | 373 | 98 ms |
| GH10 | 37 | 16881 ms | 37 | 21 ms |
| GH11 | 66 | 24091 ms | 66 | 78 ms |
| GH12 | 829 | 14842 ms | 829 | 210 ms |
| GH13 | 58 | 24470 ms | 58 | 133 ms |
| GH14 | 4 | 21403 ms | 4 | 36 ms |
| GH15 | 3047 | 15817 ms | 3047 | 749 ms |
| GH16 | 2710 | 16767 ms | 2710 | 732 ms |
| GH17 | 2304 | 17524 ms | 2304 | 788 ms |
Bingo 1.7beta2 with molfiles as text storage
| Query | Hits | no Index | Hits | with Index |
|---|---|---|---|---|
| GH1 | 0 | 70277 ms | 0 | 125 ms |
| GH2 | 471 | 56821 ms | 471 | 156 ms |
| GH3 | 68 | 57754 ms | 68 | 125 ms |
| GH4 | 5 | 60067 ms | 5 | 125 ms |
| GH5 | 36 | 65586 ms | 36 | 140 ms |
| GH6 | 79 | 57188 ms | 79 | 125 ms |
| GH7 | 41 | 52134 ms | 41 | 125 ms |
| GH8 | 170 | 51685 ms | 170 | 140 ms |
| GH9 | 373 | 47613 ms | 373 | 138 ms |
| GH10 | 37 | 49961 ms | 37 | 110 ms |
| GH11 | 66 | 61176 ms | 66 | 125 ms |
| GH12 | 774 | 50281 ms | 829 | 156 ms |
| GH13 | 58 | 61108 ms | 58 | 156 ms |
| GH14 | 4 | 53636 ms | 4 | 140 ms |
| GH15 | 3047 | 50213 ms | 3047 | 343 ms |
| GH16 | 2710 | 51227 ms | 2710 | 327 ms |
| GH17 | 2304 | 51495 ms | 2304 | 362 ms |
Fingerprint efficiency (with regard to false positives)
FP2
| Query | Candidates screened | Hits matched | false positives | Efficiency |
|---|---|---|---|---|
| GH1 | 0 | 0 | 0 | 1.000 |
| GH2 | 485 | 484 | 1 | 0.998 |
| GH3 | 69 | 63 | 6 | 0.913 |
| GH4 | 37 | 5 | 32 | 0.135 |
| GH5 | 120 | 36 | 84 | 0.300 |
| GH6 | 79 | 0 | 79 | 0.000 |
| GH7 | 41 | 26 | 15 | 0.634 |
| GH8 | 177 | 170 | 7 | 0.960 |
| GH9 | 377 | 348 | 29 | 0.923 |
| GH10 | 37 | 36 | 1 | 0.973 |
| GH11 | 123 | 66 | 57 | 0.537 |
| GH12 | 831 | 831 | 0 | 1.000 |
| GH13 | 1346 | 58 | 1288 | 0.043 |
| GH14 | 20 | 4 | 16 | 0.200 |
| GH15 | 3760 | 3008 | 752 | 0.800 |
| GH16 | 3305 | 2691 | 614 | 0.814 |
| GH17 | 3005 | 2290 | 715 | 0.762 |
ext+sub
| Query | Candidates screened | Hits matched | false positives | Efficiency |
|---|---|---|---|---|
| GH1 | 0 | 0 | 0 | 1.000 |
| GH2 | 484 | 484 | 0 | 1.000 |
| GH3 | 68 | 68 | 0 | 1.000 |
| GH4 | 5 | 5 | 0 | 1.000 |
| GH5 | 47 | 36 | 11 | 0.766 |
| GH6 | 0 | 0 | 0 | 1.000 |
| GH7 | 41 | 41 | 0 | 1.000 |
| GH8 | 170 | 170 | 0 | 1.000 |
| GH9 | 373 | 373 | 0 | 1.000 |
| GH10 | 37 | 37 | 0 | 1.000 |
| GH11 | 66 | 66 | 0 | 1.000 |
| GH12 | 829 | 829 | 0 | 1.000 |
| GH13 | 259 | 58 | 201 | 0.224 |
| GH14 | 20 | 4 | 16 | 0.200 |
| GH15 | 3061 | 3047 | 14 | 0.995 |
| GH16 | 2720 | 2710 | 10 | 0.996 |
| GH17 | 2720 | 2304 | 416 | 0.847 |
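The efficiency column appears to be the fraction of screened candidates that turn out to be real hits (for example 484 / 485 ≈ 0.998 for GH2 with FP2), with 1.000 reported when the screen passes no candidates at all. A minimal sketch under that assumption; both function names are hypothetical:

```cpp
#include <cassert>

// Fraction of screened candidates that are confirmed hits.
// Assumption: an empty candidate set counts as perfectly efficient (1.0),
// matching the GH1 rows in the tables above.
double screening_efficiency(unsigned long candidates, unsigned long hits)
{
    assert(hits <= candidates);
    if (candidates == 0) {
        return 1.0;
    }
    return static_cast<double>(hits) / static_cast<double>(candidates);
}

// False positives are the candidates that fail the full substructure match.
unsigned long false_positives(unsigned long candidates, unsigned long hits)
{
    assert(hits <= candidates);
    return candidates - hits;
}
```

For GH13 with FP2, `screening_efficiency(1346, 58)` gives about 0.043 and `false_positives(1346, 58)` gives 1288, in line with the table.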
Index build times
| System | Index build time |
|---|---|
| pgchem with OpenBabel or Indigo | 25690 ms |
| Bingo | 336319 ms |
Indigo's ext+sub fingerprint is truly more selective than FP2. Still, OpenBabel with binary storage shows the better performance because of its
Also, the results for GH3, GH7, GH9, GH10, GH12, GH15, GH16, and GH17 differ between OpenBabel and Indigo, and Bingo finds 79 hits for GH6 where pgchem finds zero.
Index building on pgchem is 13 times faster than on Bingo, but since pgchem (currently) does not support features like tautomer searching or SMARTS searching with index support, this comparison is a bit like apples and oranges.
The slow performance of Bingo without index, comparable to pgchem without binary storage, is quite likely a result of storing molecules in textual representation. Parsing text to binary molecules is a first-class performance killer. Unfortunately, there is no way to convert molecules into native format directly with Bingo for PostgreSQL, but Bingo does the conversion implicitly when building the index.
Friday, December 9, 2011
OBMol de-/serialization revisited
The OBChiralData isn't used anymore. Also, the functions OBAtom::IsClockwise and OBAtom::IsAntiClockwise are obsolete. Instead, you should serialize the OBCisTransStereo and OBTetrahedralStereo data objects.
Fortunately (for me, since I don't have the time at the moment), Jérôme Pansanel has "...started the serialization of the OBCisTransStereo and OBTetrahedralStereo objects."
For the time being, I have tried the following workaround, and it seems to be working well. First, I have removed the now unnecessary code from the unserialization:
bool unserializeOBMol(OBBase* pOb, const char *serializedInput)
{
    OBMol* pmol = pOb->CastAndClear<OBMol>();
    OBMol &mol = *pmol;
    unsigned int i, natoms, nbonds;

    // Header: skip the first word, then read the atom and bond counts.
    unsigned int *intptr = (unsigned int*) serializedInput;
    ++intptr;
    natoms = *intptr;
    ++intptr;
    nbonds = *intptr;
    ++intptr;

    // The atom records follow directly after the header.
    _ATOM *atomptr = (_ATOM*) intptr;
    mol.ReserveAtoms(natoms);
    OBAtom atom;
    int stereo;

    for (i = 1; i <= natoms; i++) {
        atom.SetIdx(atomptr->idx);
        atom.SetHyb(atomptr->hybridization);
        atom.SetAtomicNum((int) atomptr->atomicnum);
        atom.SetIsotope((unsigned int) atomptr->isotope);
        atom.SetFormalCharge((int) atomptr->formalcharge);
        stereo = atomptr->stereo;
        if (stereo == 3) {
            atom.SetChiral();
        }
        atom.SetSpinMultiplicity((short) atomptr->spinmultiplicity);
        if (atomptr->aromatic != 0) {
            atom.SetAromatic();
        }
        if (!mol.AddAtom(atom)) {
            return false;
        }
        atom.Clear();
        ++atomptr;
    }

    // The bond records follow directly after the atom records.
    _BOND *bondptr = (_BOND*) atomptr;
    unsigned int start, end, order, flags;

    for (i = 0; i < nbonds; i++) {
        flags = 0;
        start = bondptr->beginidx;
        end = bondptr->endidx;
        order = (unsigned int) bondptr->order;
        // Reject bonds with missing or out-of-range atom indices.
        if (start == 0 || end == 0 || order == 0 || start > natoms || end > natoms) {
            return false;
        }
        // Map stored bond order 4 to 5.
        order = (order == 4) ? 5 : order;
        stereo = bondptr->stereo;
        if (stereo == 1) {
            flags |= OB_WEDGE_BOND;
        }
        if (stereo == 6) {
            flags |= OB_HASH_BOND;
        }
        if (bondptr->aromatic != 0) {
            flags |= OB_AROMATIC_BOND;
        }
        if (!mol.AddBond(start, end, order, flags)) {
            return false;
        }
        ++bondptr;
    }

    intptr = (unsigned int*) bondptr;

    // Aromaticity and Kekule flags come from the stored data, so mark them as perceived.
    mol.SetAromaticPerceived();
    mol.SetKekulePerceived();
    return true;
}
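Second, queries containing '@' are matched against an OBMol generated from SMILES, while all other queries are matched against the OBMol deserialized from binary:
if (strchr(querysmi, '@') != NULL)
{
    // Match against an OBMol generated from SMILES
}
else
{
    // Match against an OBMol deserialized from binary
}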
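The check above can be wrapped in a small helper. This is only a sketch of the dispatch decision; the function name `needs_smiles_match` is hypothetical, and the actual matching calls are left out.

```cpp
#include <cassert>
#include <cstring>

// Returns true if the query SMILES contains '@' and should therefore be
// matched against an OBMol generated from SMILES rather than one
// deserialized from binary storage.
bool needs_smiles_match(const char* querysmi)
{
    assert(querysmi != nullptr);
    return std::strchr(querysmi, '@') != nullptr;
}
```

Like the original check, this only looks for '@'. E/Z descriptors written with '/' and '\' do not trigger it, so a query such as `C(=C\Cl)\Cl` still takes the binary path.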
Tuesday, December 6, 2011
First Light
Substructure search speed
| Rank | Build | Storage | Fingerprint | Hits | no Index | Hits | with Index |
|---|---|---|---|---|---|---|---|
| 1 | OpenBabel | binary+SMILES | FP2 | 8070 | 9236 ms | 8067 | 936 ms |
| 2 | Indigo | binary | ext+sub | 8049 | 24821 ms | 8049 | 2418 ms |
| 3 | OpenBabel | SMILES | FP2 | 8070 | 57971 ms | 8067 | 5432 ms |
I've checked why the OpenBabel FP2 fingerprint eliminates three structures that would otherwise pass: using the VF2 OBIsomorphismMapper instead of OBSmartsPattern, the hit count is also 8067 without index. But it's about four times slower: 90526 ms without index, 8798 ms with index.
The structures in question are 80944, 83450, and 99925, and I'm pretty sure it's caused by differences in aromaticity detection.
Stereochemistry in substructure searches
| Check | Query | Expected | Indigo binary | OpenBabel SMILES | OpenBabel binary |
|---|---|---|---|---|---|
| R/S different | select 'C[C@H]([C@@H](C(=O)O)N)O'::molecule <= 'O=C(O)[C@H](N)[C@@H](O)C'::molecule | false | pass | fail | fail |
| R/S same | select 'C[C@H]([C@@H](C(=O)O)N)O'::molecule <= 'C[C@H]([C@@H](C(=O)O)N)O'::molecule | true | pass | pass | pass |
| E/Z different | select 'C(=C\Cl)\Cl'::molecule <= 'C(=C/Cl)\Cl'::molecule | false | pass | fail | fail |
OpenBabel fails the 'E/Z different' and 'R/S different' checks, but these are known issues up to version 2.3.1.
Indigo has no obvious issues with matching R/S and E/Z stereochemistry.
Fingerprint efficiency
| Fingerprint type | Candidates screened | Hits matched | false positives | Efficiency |
|---|---|---|---|---|
| ext+sub | 8090 | 8049 | 41 | 0.995 |
| FP2 | 8145 | 8067 | 78 | 0.990 |
Pretty close.