Mikhail Rybalkin from GGA Software asked me to do this, so here it is...
The GH17 test queries used
There is a small set of queries used in article Chemical substructure search in SQL by Golovin and Henrick. These queries were lately reused in other articles:
GH1 ONC1CC(C(O)C1O)[n]2cnc3c(NC4CC4)ncnc23
GH2 Nc1ncnc2[n]cnc12
GH3 CNc1ncnc2[n](C)cnc12
GH4 Nc1ncnc2[n](cnc12)C3CCCC3
GH5 CC12CCC3C(CCC4=CC(O)CCC34C)C1CCC2
GH6 OC2=CC(=O)c1c(cccc1)O2
GH7 Nc1nnc(S)s1
GH8 C1C2SCCN2C1
GH9 CP(O)(O)=O
GH10 CCCCCP(O)(O)=O
GH11 N2CCC13CCCCC1C2Cc4c3cccc4
GH12 s1cncc1
GH13 C34CCC1C(CCC2CC(=O)CCC12)C3CCC4
GH14 CCCCCCCCCCCP(O)(O)=O
GH15 CC1CCCC1
GH16 CCC1CCCC1
GH17 CCCC1CCCC1
GH17 substructure search speed
OpenBabel with binary+SMILES storage and FP2 fingerprint
Query | Hits | no Index | Hits | with Index |
---|---|---|---|---|
GH1 | 0 | 9517 ms | 0 | 25 ms |
GH2 | 484 | 8519 ms | 484 | 111 ms |
GH3 | 63 | 8632 ms | 63 | 43 ms |
GH4 | 5 | 8950 ms | 5 | 48 ms |
GH5 | 36 | 10020 ms | 36 | 78 ms |
GH6 | 0 | 8696 ms | 0 | 32 ms |
GH7 | 26 | 8279 ms | 26 | 31 ms |
GH8 | 170 | 8454 ms | 170 | 56 ms |
GH9 | 348 | 8068 ms | 348 | 71 ms |
GH10 | 36 | 8820 ms | 36 | 21 ms |
GH11 | 66 | 9113 ms | 66 | 52 ms |
GH12 | 831 | 7920 ms | 831 | 124 ms |
GH13 | 58 | 9864 ms | 58 | 448 ms |
GH14 | 4 | 9998 ms | 4 | 36 ms |
GH15 | 3008 | 8549 ms | 3008 | 555 ms |
GH16 | 2691 | 8665 ms | 2691 | 501 ms |
GH17 | 2290 | 8717 ms | 2290 | 560 ms |
Indigo with binary storage and ext+sub fingerprint
Query | Hits | no Index | Hits | with Index |
---|---|---|---|---|
GH1 | 0 | 28161 ms | 0 | 32 ms |
GH2 | 484 | 19188 ms | 484 | 187 ms |
GH3 | 68 | 20498 ms | 68 | 78 ms |
GH4 | 5 | 23388 ms | 5 | 33 ms |
GH5 | 36 | 25887 ms | 36 | 62 ms |
GH6 | 0 | 21034 ms | 0 | 31 ms |
GH7 | 41 | 16302 ms | 41 | 31 ms |
GH8 | 170 | 16629 ms | 170 | 78 ms |
GH9 | 373 | 14005 ms | 373 | 98 ms |
GH10 | 37 | 16881 ms | 37 | 21 ms |
GH11 | 66 | 24091 ms | 66 | 78 ms |
GH12 | 829 | 14842 ms | 829 | 210 ms |
GH13 | 58 | 24470 ms | 58 | 133 ms |
GH14 | 4 | 21403 ms | 4 | 36 ms |
GH15 | 3047 | 15817 ms | 3047 | 749 ms |
GH16 | 2710 | 16767 ms | 2710 | 732 ms |
GH17 | 2304 | 17524 ms | 2304 | 788 ms |
Bingo 1.7beta2 with molfiles as text storage
Query | Hits | no Index | Hits | with Index |
---|---|---|---|---|
GH1 | 0 | 70277 ms | 0 | 125 ms |
GH2 | 471 | 56821 ms | 471 | 156 ms |
GH3 | 68 | 57754 ms | 68 | 125 ms |
GH4 | 5 | 60067 ms | 5 | 125 ms |
GH5 | 36 | 65586 ms | 36 | 140 ms |
GH6 | 79 | 57188 ms | 79 | 125 ms |
GH7 | 41 | 52134 ms | 41 | 125 ms |
GH8 | 170 | 51685 ms | 170 | 140 ms |
GH9 | 373 | 47613 ms | 373 | 138 ms |
GH10 | 37 | 49961 ms | 37 | 110 ms |
GH11 | 66 | 61176 ms | 66 | 125 ms |
GH12 | 774 | 50281 ms | 829 | 156 ms |
GH13 | 58 | 61108 ms | 58 | 156 ms |
GH14 | 4 | 53636 ms | 4 | 140 ms |
GH15 | 3047 | 50213 ms | 3047 | 343 ms |
GH16 | 2710 | 51227 ms | 2710 | 327 ms |
GH17 | 2304 | 51495 ms | 2304 | 362 ms |
Fingerprint efficiency (with regard to false positives)
FP2
Query | Candidates screened | Hits matched | false positives | Efficiency |
---|---|---|---|---|
GH1 | 0 | 0 | 0 | 1.000 |
GH2 | 485 | 484 | 1 | 0.998 |
GH3 | 69 | 63 | 6 | 0.913 |
GH4 | 37 | 5 | 32 | 0.135 |
GH5 | 120 | 36 | 84 | 0.300 |
GH6 | 79 | 0 | 79 | 0.000 |
GH7 | 41 | 26 | 15 | 0.634 |
GH8 | 177 | 170 | 7 | 0.960 |
GH9 | 377 | 348 | 29 | 0.923 |
GH10 | 37 | 36 | 1 | 0.973 |
GH11 | 123 | 66 | 1 | 0.537 |
GH12 | 831 | 831 | 0 | 1.000 |
GH13 | 1346 | 58 | 1288 | 0.043 |
GH14 | 20 | 4 | 16 | 0.200 |
GH15 | 3760 | 3008 | 752 | 0.800 |
GH16 | 3305 | 2691 | 614 | 0.814 |
GH17 | 3305 | 2290 | 715 | 0.762 |
ext+sub
Query | Candidates screened | Hits matched | false positives | Efficiency |
---|---|---|---|---|
GH1 | 0 | 0 | 0 | 1.000 |
GH2 | 484 | 484 | 1 | 1.000 |
GH3 | 68 | 68 | 0 | 1.000 |
GH4 | 5 | 5 | 0 | 1.000 |
GH5 | 47 | 36 | 11 | 0.766 |
GH6 | 0 | 0 | 0 | 1.000 |
GH7 | 41 | 41 | 0 | 1.000 |
GH8 | 170 | 170 | 0 | 1.000 |
GH9 | 373 | 373 | 0 | 1.000 |
GH10 | 37 | 37 | 0 | 1.000 |
GH11 | 66 | 66 | 0 | 1.000 |
GH12 | 829 | 829 | 0 | 1.000 |
GH13 | 259 | 58 | 201 | 0.224 |
GH14 | 20 | 4 | 16 | 0.200 |
GH15 | 3061 | 3047 | 14 | 0.995 |
GH16 | 2720 | 2710 | 10 | 0.996 |
GH17 | 2720 | 2304 | 416 | 0.847 |
Index build times
System | Index build time |
---|---|
pgchem with OpenBabel or Indigo | 25690 ms |
Bingo | 336319 ms |
Indigo's ext+sub fingerprint is truly more selective than FP2. Still, OpenBabel with binary storage shows the better prformance because of its
Also, the result for GH3, GH7, GH9, GH12, GH15, GH16, and GH17 are different between OpenBabel and Indigo, and Bingo finds 79 hits for GH6 where pgchem finds zero.
Index building on pgchem is 13 times faster than Bingo, but since pgchem (currently) does not support features like tautomer searching or SMARTS searching with index support this comparison is a bit like apples and oranges.
The slow performance of Bingo without index, comparable to pgchem without binary storage, is quite likely a result of the storage of molecules in textual representation. Parsing text to binary molecules is a first class performance killer. Unfortunately, there is no way to convert molecules into native format directly with Bingo for PostgreSQL, but Bingo does the conversion implicitly when building the index.
Thank you for this detailed comparison. It is very interesting!
ReplyDeleteAs I see from the results, pgchem has much lower overhead per query. Minimal time per query in pgchem is only 20 ms, while Bingo takes at least 120 ms. But if there are a lot of hits Bingo search time is lower (GH15, GH16, GH17).
> "OpenBabel with binary storage shows the better prformance because of its faster matcher"
This is true for comparing OpenBabel and Indigo, where we have overhead due to API simplification. Search time for GH15, GH16, GH17 shows that matcher in Bingo is faster. Your results show where we have to optimize our code.
It is also interesting to get similar comparison search time on larger datasets, for example with 1 million molecules, because overhead per query may be constant both in pgchem and Bingo.
--
Mikhail Rybalkin
"It is also interesting to get similar comparison search time on larger datasets, for example with 1 million molecules"
ReplyDeleteI can do that. The easiest way would be duplicating the current dataset 10 times, only adding mass but not diversity.
And I won't run all queries again, one or two you would be most interested in?
I think that the most interesting queries are GH2, and GH16 (and GH7, GH13).
ReplyDeleteWhat database did you use to do the searching? Also, I've written a similar post on my blog using the same queries: http://timvdm.blogspot.be/2012/08/fingerprint-efficiency-for-substructure.html
ReplyDelete