Wednesday, December 14, 2011

GH17 results

Mikhail Rybalkin from GGA Software asked me to do this, so here it is...

The GH17 test queries used


The article "Chemical substructure search in SQL" by Golovin and Henrick uses a small set of queries, which have since been reused in other articles:

GH1 ONC1CC(C(O)C1O)[n]2cnc3c(NC4CC4)ncnc23
GH2 Nc1ncnc2[n]cnc12
GH3 CNc1ncnc2[n](C)cnc12
GH4 Nc1ncnc2[n](cnc12)C3CCCC3
GH5 CC12CCC3C(CCC4=CC(O)CCC34C)C1CCC2
GH6 OC2=CC(=O)c1c(cccc1)O2
GH7 Nc1nnc(S)s1
GH8 C1C2SCCN2C1
GH9 CP(O)(O)=O
GH10 CCCCCP(O)(O)=O
GH11 N2CCC13CCCCC1C2Cc4c3cccc4
GH12 s1cncc1
GH13 C34CCC1C(CCC2CC(=O)CCC12)C3CCC4
GH14 CCCCCCCCCCCP(O)(O)=O
GH15 CC1CCCC1
GH16 CCC1CCCC1
GH17 CCCC1CCCC1

GH17 substructure search speed


OpenBabel with binary+SMILES storage and FP2 fingerprint


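| Query | Hits (no index) | Time (no index) | Hits (with index) | Time (with index) |
|---|---|---|---|---|
| GH1 | 0 | 9517 ms | 0 | 25 ms |
| GH2 | 484 | 8519 ms | 484 | 111 ms |
| GH3 | 63 | 8632 ms | 63 | 43 ms |
| GH4 | 5 | 8950 ms | 5 | 48 ms |
| GH5 | 36 | 10020 ms | 36 | 78 ms |
| GH6 | 0 | 8696 ms | 0 | 32 ms |
| GH7 | 26 | 8279 ms | 26 | 31 ms |
| GH8 | 170 | 8454 ms | 170 | 56 ms |
| GH9 | 348 | 8068 ms | 348 | 71 ms |
| GH10 | 36 | 8820 ms | 36 | 21 ms |
| GH11 | 66 | 9113 ms | 66 | 52 ms |
| GH12 | 831 | 7920 ms | 831 | 124 ms |
| GH13 | 58 | 9864 ms | 58 | 448 ms |
| GH14 | 4 | 9998 ms | 4 | 36 ms |
| GH15 | 3008 | 8549 ms | 3008 | 555 ms |
| GH16 | 2691 | 8665 ms | 2691 | 501 ms |
| GH17 | 2290 | 8717 ms | 2290 | 560 ms |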

Indigo with binary storage and ext+sub fingerprint


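| Query | Hits (no index) | Time (no index) | Hits (with index) | Time (with index) |
|---|---|---|---|---|
| GH1 | 0 | 28161 ms | 0 | 32 ms |
| GH2 | 484 | 19188 ms | 484 | 187 ms |
| GH3 | 68 | 20498 ms | 68 | 78 ms |
| GH4 | 5 | 23388 ms | 5 | 33 ms |
| GH5 | 36 | 25887 ms | 36 | 62 ms |
| GH6 | 0 | 21034 ms | 0 | 31 ms |
| GH7 | 41 | 16302 ms | 41 | 31 ms |
| GH8 | 170 | 16629 ms | 170 | 78 ms |
| GH9 | 373 | 14005 ms | 373 | 98 ms |
| GH10 | 37 | 16881 ms | 37 | 21 ms |
| GH11 | 66 | 24091 ms | 66 | 78 ms |
| GH12 | 829 | 14842 ms | 829 | 210 ms |
| GH13 | 58 | 24470 ms | 58 | 133 ms |
| GH14 | 4 | 21403 ms | 4 | 36 ms |
| GH15 | 3047 | 15817 ms | 3047 | 749 ms |
| GH16 | 2710 | 16767 ms | 2710 | 732 ms |
| GH17 | 2304 | 17524 ms | 2304 | 788 ms |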

Bingo 1.7beta2 with molfiles as text storage


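| Query | Hits (no index) | Time (no index) | Hits (with index) | Time (with index) |
|---|---|---|---|---|
| GH1 | 0 | 70277 ms | 0 | 125 ms |
| GH2 | 471 | 56821 ms | 471 | 156 ms |
| GH3 | 68 | 57754 ms | 68 | 125 ms |
| GH4 | 5 | 60067 ms | 5 | 125 ms |
| GH5 | 36 | 65586 ms | 36 | 140 ms |
| GH6 | 79 | 57188 ms | 79 | 125 ms |
| GH7 | 41 | 52134 ms | 41 | 125 ms |
| GH8 | 170 | 51685 ms | 170 | 140 ms |
| GH9 | 373 | 47613 ms | 373 | 138 ms |
| GH10 | 37 | 49961 ms | 37 | 110 ms |
| GH11 | 66 | 61176 ms | 66 | 125 ms |
| GH12 | 774 | 50281 ms | 829 | 156 ms |
| GH13 | 58 | 61108 ms | 58 | 156 ms |
| GH14 | 4 | 53636 ms | 4 | 140 ms |
| GH15 | 3047 | 50213 ms | 3047 | 343 ms |
| GH16 | 2710 | 51227 ms | 2710 | 327 ms |
| GH17 | 2304 | 51495 ms | 2304 | 362 ms |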

Fingerprint efficiency (with regard to false positives)


FP2


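| Query | Candidates screened | Hits matched | False positives | Efficiency |
|---|---|---|---|---|
| GH1 | 0 | 0 | 0 | 1.000 |
| GH2 | 485 | 484 | 1 | 0.998 |
| GH3 | 69 | 63 | 6 | 0.913 |
| GH4 | 37 | 5 | 32 | 0.135 |
| GH5 | 120 | 36 | 84 | 0.300 |
| GH6 | 79 | 0 | 79 | 0.000 |
| GH7 | 41 | 26 | 15 | 0.634 |
| GH8 | 177 | 170 | 7 | 0.960 |
| GH9 | 377 | 348 | 29 | 0.923 |
| GH10 | 37 | 36 | 1 | 0.973 |
| GH11 | 123 | 66 | 1 | 0.537 |
| GH12 | 831 | 831 | 0 | 1.000 |
| GH13 | 1346 | 58 | 1288 | 0.043 |
| GH14 | 20 | 4 | 16 | 0.200 |
| GH15 | 3760 | 3008 | 752 | 0.800 |
| GH16 | 3305 | 2691 | 614 | 0.814 |
| GH17 | 3305 | 2290 | 715 | 0.762 |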

ext+sub


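| Query | Candidates screened | Hits matched | False positives | Efficiency |
|---|---|---|---|---|
| GH1 | 0 | 0 | 0 | 1.000 |
| GH2 | 484 | 484 | 1 | 1.000 |
| GH3 | 68 | 68 | 0 | 1.000 |
| GH4 | 5 | 5 | 0 | 1.000 |
| GH5 | 47 | 36 | 11 | 0.766 |
| GH6 | 0 | 0 | 0 | 1.000 |
| GH7 | 41 | 41 | 0 | 1.000 |
| GH8 | 170 | 170 | 0 | 1.000 |
| GH9 | 373 | 373 | 0 | 1.000 |
| GH10 | 37 | 37 | 0 | 1.000 |
| GH11 | 66 | 66 | 0 | 1.000 |
| GH12 | 829 | 829 | 0 | 1.000 |
| GH13 | 259 | 58 | 201 | 0.224 |
| GH14 | 20 | 4 | 16 | 0.200 |
| GH15 | 3061 | 3047 | 14 | 0.995 |
| GH16 | 2720 | 2710 | 10 | 0.996 |
| GH17 | 2720 | 2304 | 416 | 0.847 |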
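
The efficiency column is not defined explicitly in this post. Judging from the numbers, it appears to be hits matched divided by candidates screened, with an empty candidate set counted as 1.000. The following Python sketch works under that assumption and recomputes a few rows; the function name `screening_efficiency` is made up for illustration.

```python
def screening_efficiency(candidates: int, hits: int) -> float:
    """Fraction of fingerprint-screened candidates that are true hits.

    Assumption: efficiency = hits / candidates, which reproduces the
    values listed in the tables for the rows used below.
    """
    if candidates == 0:
        # Nothing passed the screen, so there are no false positives;
        # the tables list this case as 1.000.
        return 1.0
    return hits / candidates


# (candidates screened, hits matched), copied from the tables above
fp2_rows = {"GH3": (69, 63), "GH4": (37, 5), "GH13": (1346, 58)}
ext_sub_rows = {"GH3": (68, 68), "GH5": (47, 36), "GH13": (259, 58)}

for name, rows in (("FP2", fp2_rows), ("ext+sub", ext_sub_rows)):
    for query, (candidates, hits) in rows.items():
        false_positives = candidates - hits
        efficiency = screening_efficiency(candidates, hits)
        print(f"{name} {query}: {false_positives} false positives, "
              f"efficiency {efficiency:.3f}")
```

Every false positive is a candidate the matcher has to check in full and then discard, so a low efficiency (GH13 with FP2, for example) translates directly into wasted matching time.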

Index build times


| System | Index build time |
|---|---|
| pgchem with OpenBabel or Indigo | 25690 ms |
| Bingo | 336319 ms |


Indigo's ext+sub fingerprint is clearly more selective than FP2. Still, OpenBabel with binary storage shows the better performance because of its faster matcher and lower query overhead.
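
To put a number on that comparison, here is a small Python sketch that tallies the with-index timings of OpenBabel and Indigo query by query. The lists are transcribed from the two timing tables above (GH1 to GH17, in order); the variable names are just for illustration.

```python
# With-index search times in ms for GH1..GH17, transcribed from the tables above.
openbabel_ms = [25, 111, 43, 48, 78, 32, 31, 56, 71, 21, 52, 124, 448, 36, 555, 501, 560]
indigo_ms = [32, 187, 78, 33, 62, 31, 31, 78, 98, 21, 78, 210, 133, 36, 749, 732, 788]

# Count the queries where one toolkit is strictly faster than the other.
openbabel_wins = sum(ob < ind for ob, ind in zip(openbabel_ms, indigo_ms))
indigo_wins = sum(ind < ob for ob, ind in zip(openbabel_ms, indigo_ms))
ties = len(openbabel_ms) - openbabel_wins - indigo_wins

print(f"OpenBabel faster on {openbabel_wins} queries, "
      f"Indigo faster on {indigo_wins}, ties on {ties}")
print(f"Total time: OpenBabel {sum(openbabel_ms)} ms, Indigo {sum(indigo_ms)} ms")
```

Note that the largest single win for Indigo is GH13, which is also the query where FP2 produces by far the most false positives.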

Also, the hit counts for GH3, GH7, GH9, GH12, GH15, GH16, and GH17 differ between OpenBabel and Indigo, and Bingo finds 79 hits for GH6 where pgchem finds none.

Index building on pgchem is about 13 times faster than on Bingo, but since pgchem (currently) does not support features like tautomer searching or SMARTS searching with index support, this comparison is a bit like apples and oranges.
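
The factor of 13 follows directly from the two build times listed above, as this short Python check shows (variable names are illustrative):

```python
# Index build times in ms, taken from the "Index build times" table.
pgchem_build_ms = 25690
bingo_build_ms = 336319

ratio = bingo_build_ms / pgchem_build_ms
print(f"pgchem: {pgchem_build_ms / 1000:.1f} s, Bingo: {bingo_build_ms / 1000:.1f} s")
print(f"Bingo takes {ratio:.1f} times as long to build its index")
```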

The slow performance of Bingo without an index, comparable to pgchem without binary storage, is quite likely a result of storing the molecules in a textual representation. Parsing text into binary molecules is a first-class performance killer. Unfortunately, there is no way to convert molecules into the native format directly with Bingo for PostgreSQL, but Bingo does the conversion implicitly when building the index.

4 comments:

  1. Thank you for this detailed comparison. It is very interesting!

    As I see from the results, pgchem has much lower overhead per query. The minimal time per query in pgchem is only 20 ms, while Bingo takes at least 120 ms. But if there are a lot of hits, Bingo's search time is lower (GH15, GH16, GH17).

    > "OpenBabel with binary storage shows the better performance because of its faster matcher"

    This is true when comparing OpenBabel and Indigo, where we have overhead due to API simplification. The search times for GH15, GH16, and GH17 show that the matcher in Bingo is faster. Your results show where we have to optimize our code.

    It would also be interesting to see a similar search-time comparison on larger datasets, for example with 1 million molecules, because the overhead per query may be constant in both pgchem and Bingo.

    --
    Mikhail Rybalkin

  2. "It is also interesting to get similar comparison search time on larger datasets, for example with 1 million molecules"

    I can do that. The easiest way would be to duplicate the current dataset 10 times, which only adds mass but not diversity.

    And I won't run all queries again. Which one or two would you be most interested in?

  3. I think the most interesting queries are GH2 and GH16 (and also GH7 and GH13).

  4. What database did you use to do the searching? Also, I've written a similar post on my blog using the same queries: http://timvdm.blogspot.be/2012/08/fingerprint-efficiency-for-substructure.html
