Substructure search speed
Rank | Build | Storage | Fingerprint | Hits (no index) | Time (no index) | Hits (with index) | Time (with index) |
---|---|---|---|---|---|---|---|
1 | OpenBabel | binary+SMILES | FP2 | 8070 | 9236 ms | 8067 | 936 ms |
2 | Indigo | binary | ext+sub | 8049 | 24821 ms | 8049 | 2418 ms |
3 | OpenBabel | SMILES | FP2 | 8070 | 57971 ms | 8067 | 5432 ms |
I checked why the OpenBabel FP2 fingerprint eliminates three structures that would otherwise pass: with the VF2 OBIsomorphismMapper instead of OBSmartsPattern, the hit count is also 8067 without the index. But it's about four times slower: 90526 ms without index, 8798 ms with index.
The structures in question are 80944, 83450 and 99925, and I'm pretty sure the discrepancy is caused by differences in aromaticity detection.
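For a rough sense of what the index buys, the timings in the table above can be reduced to speedup factors. The sketch below is only arithmetic on the published numbers; the dictionary keys and the `speedup` helper are names made up for illustration.

```python
# Timings in ms copied from the table above: (without index, with index).
timings_ms = {
    "OpenBabel binary+SMILES": (9236, 936),
    "Indigo binary": (24821, 2418),
    "OpenBabel SMILES": (57971, 5432),
}


def speedup(no_index_ms, with_index_ms):
    """Return how many times faster the indexed search is."""
    return no_index_ms / with_index_ms


# Speedup factor per setup, rounded to one decimal place.
speedups = {name: round(speedup(*t), 1) for name, t in timings_ms.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor}x faster with index")
```

All three setups come out at roughly a tenfold speedup, so the index helps each engine and storage combination by a similar factor.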
Stereochemistry in substructure searches
Check | Query | Expected | Indigo binary | OpenBabel SMILES | OpenBabel binary |
---|---|---|---|---|---|
R/S different | select 'C[C@H]([C@@H](C(=O)O)N)O'::molecule <= 'O=C(O)[C@H](N)[C@@H](O)C'::molecule | false | pass | fail | fail |
R/S same | select 'C[C@H]([C@@H](C(=O)O)N)O'::molecule <= 'C[C@H]([C@@H](C(=O)O)N)O'::molecule | true | pass | pass | pass |
E/Z different | select 'C(=C\Cl)\Cl'::molecule <= 'C(=C/Cl)\Cl'::molecule | false | pass | fail | fail |
OpenBabel fails the 'E/Z different' and 'R/S different' checks, but these are known issues up to version 2.3.1.
Indigo has no obvious issues with matching R/S and E/Z stereochemistry.
Fingerprint efficiency
Fingerprint type | Candidates screened | Hits matched | False positives | Efficiency |
---|---|---|---|---|
ext+sub | 8090 | 8049 | 41 | 0.995 |
FP2 | 8145 | 8067 | 78 | 0.99 |
Pretty close.
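The efficiency column appears to be the fraction of screened candidates that survive the full match, i.e. hits divided by candidates. That reading is an assumption, but it reproduces the table values. A minimal sketch, with `screening_efficiency` as a hypothetical helper name:

```python
def screening_efficiency(candidates, hits):
    """Fraction of fingerprint-screened candidates that are true matches."""
    if candidates <= 0:
        raise ValueError("candidates must be positive")
    return hits / candidates


# (candidates screened, hits matched) copied from the table above.
results = {"ext+sub": (8090, 8049), "FP2": (8145, 8067)}

for name, (candidates, hits) in results.items():
    false_positives = candidates - hits
    efficiency = round(screening_efficiency(candidates, hits), 3)
    print(name, false_positives, efficiency)
```

A value of 1.0 would mean the fingerprint screen let through no false positives at all.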
Hello! Could you explain how you tested the "R/S same" check? I have checked it using Indigo and it works well: Indigo finds a match of C[C@H]([C@@H](C(=O)O)N)O in the same molecule.
I'm currently checking this. I suspect it is a bug in my code...
Yes, it was a bug on my side. Fixed it and updated the post.
Hello!
I have tested substructure searching for c1ccccc1Cl on the first 100000 compounds from the PubChem snapshot dated May with the Bingo-SqlServer cartridge. It returned 8688 results, which differs from your result. I suppose we may be using different versions of PubChem. Can you publish compounds used in your tests?
Also I'm interested in efficiency of FP2 fingerprints. In Bingo 8705 compounds passed fingerprint test and 8688 were finally matched, so in this case fingerprints screening efficiency is 0.998. Can you provide such results for FP2?
As for time measurements, I recommend you to install Bingo-PostgreSQL cartridge and compare it to pgchem::tigress. Indigo is simplified high level C API over low level C++ API, which is used in Bingo with code optimizations, so that Bingo should be faster.
Michael Kvyatkovskiy, GGA Software.
ergo, do you have plans to test some set of queries? Your single query is not very representative.
Mikhail Rybalkin, GGA Software.
"Also I'm interested in efficiency of FP2 fingerprints. In Bingo 8705 compounds passed fingerprint test and 8688 were finally matched, so in this case fingerprints screening efficiency is 0.998. Can you provide such results for FP2?"
Done.
"Can you publish compounds used in your tests?"
It's a 55 MB archive. I'll see where I can host it...
"As for time measurements, I recommend you to install Bingo-PostgreSQL cartridge and compare it to pgchem::tigress. Indigo is simplified high level C API over low level C++ API, which is used in Bingo with code optimizations, so that Bingo should be faster."
This was not intended as a shootout between Bingo and pgchem :-), only as a comparison between Indigo and OpenBabel as alternative matching engines for pgchem. Since I cannot use C++ directly in PostgreSQL, OpenBabel is called through a C wrapper library, just like Indigo, so I think from pgchem's perspective this is the correct way to compare them.
"Your single query is not very representative."
I agree, but the test data used is also not representative. The question is, what is 'representative'? I'd say this depends on the type of system: a catalogue of commercially available chemicals will contain different data than a medchem screening library, and the typical queries will be different too. To roughly show differences between the various alternatives my approach is sufficient, I'd say, and that was the only intention, not an in-depth comparison between Indigo and OpenBabel.
But I'm open to suggestions. What would a representative dataset for general benchmarking be like? And what queries should be used?
best regards,
Ergo
"This was not intended as a shootout between Bingo and pgchem :-)"
Indigo and Bingo share the same core code. The simplified Indigo API was designed for easy access to the underlying cheminformatics algorithms, but it is not heavily optimized. For Bingo we have also developed an optimized C API for sequential processing, and we are thinking about documenting this API as was done with Indigo. A comparison with Bingo could show us whether it is worth it or not.
"What would a representative dataset for general benchmarking be like? And what queries should be used?"
This is a very interesting question, and I do not know the answer to it. Just yesterday Andrew Dalke asked such a question on the BlueObelisk forum. At the least, I would suggest testing fingerprint efficiency on different queries, because I'm sure that the Indigo fingerprint is better than FP2.
There is a small set of queries used in the article "Chemical substructure search in SQL" by Golovin and Henrick. These queries were later reused in other articles:
GH1 ONC1CC(C(O)C1O)[n]2cnc3c(NC4CC4)ncnc23
GH2 Nc1ncnc2[n]cnc12
GH3 CNc1ncnc2[n](C)cnc12
GH4 Nc1ncnc2[n](cnc12)C3CCCC3
GH5 CC12CCC3C(CCC4=CC(O)CCC34C)C1CCC2
GH6 OC2=CC(=O)c1c(cccc1)O2
GH7 Nc1nnc(S)s1
GH8 C1C2SCCN2C1
GH9 CP(O)(O)=O
GH10 CCCCCP(O)(O)=O
GH11 N2CCC13CCCCC1C2Cc4c3cccc4
GH12 s1cncc1
GH13 C34CCC1C(CCC2CC(=O)CCC12)C3CCC4
GH14 CCCCCCCCCCCP(O)(O)=O
GH15 CC1CCCC1
GH16 CCC1CCCC1
GH17 CCCC1CCCC1
I think that such a list is sufficient for a simple comparison.
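One way to run such a query set is a small timing loop. The sketch below is an assumption about how a comparison could be organized, not the setup used in any of the tests above: `run_query` is a hypothetical callable that executes one substructure search and returns the hit count, and only the first three GH queries are listed for brevity.

```python
import time

# First three queries from the list above; the rest follow the same pattern.
GH_QUERIES = {
    "GH1": "ONC1CC(C(O)C1O)[n]2cnc3c(NC4CC4)ncnc23",
    "GH2": "Nc1ncnc2[n]cnc12",
    "GH3": "CNc1ncnc2[n](C)cnc12",
}


def benchmark(run_query, queries):
    """Time each query once; return {name: (hit_count, elapsed_ms)}.

    run_query is a hypothetical callable: it takes a query string and
    returns the number of hits from whatever engine is under test.
    """
    results = {}
    for name, smiles in queries.items():
        start = time.perf_counter()
        hits = run_query(smiles)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        results[name] = (hits, elapsed_ms)
    return results


if __name__ == "__main__":
    # Stand-in engine so the sketch runs on its own; replace it with a
    # real database call for an actual comparison.
    def fake_engine(smiles):
        return len(smiles)

    for name, (hits, ms) in benchmark(fake_engine, GH_QUERIES).items():
        print(f"{name}: {hits} hits in {ms:.3f} ms")
```

Passing the engine in as a callable keeps the loop identical for every engine under test, so only the search itself differs between runs. A single timed run per query is noisy; repeating each query and taking the median would be a natural refinement.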