The plate is bad: First Light

Tuesday, December 6, 2011

First Light

As promised, here are the first results for substructure searching c1ccccc1Cl on the first 100000 compounds from PubChem: select * from pubchem.compound where compound >= 'c1ccccc1Cl'::molecule

Substructure search speed

Rank	Build	Storage	Fingerprint	Hits (no index)	Time (no index)	Hits (with index)	Time (with index)
1	OpenBabel	binary+SMILES	FP2	8070	9236 ms	8067	936 ms
2	Indigo	binary	ext+sub	8049	24821 ms	8049	2418 ms
3	OpenBabel	SMILES	FP2	8070	57971 ms	8067	5432 ms

I've checked why the OpenBabel FP2 fingerprint eliminates three structures that would otherwise pass: using the VF2 OBIsomorphismMapper instead of OBSmartsPattern, the hit count is also 8067 without index. But it's about four times slower: 90526 ms without index, 8798 ms with index.
The structures in question are 80944, 83450 and 99925, and I'm pretty sure the discrepancy is caused by differences in aromaticity detection.
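
To put the timings in perspective, here is a small sketch that computes how much the index speeds up each configuration. The numbers are copied from the table and the paragraph above; the dictionary keys and the function name are just labels I made up for illustration, and the storage used for the VF2 run is not stated, so it is labelled accordingly.

```python
# Timings in milliseconds as (without index, with index), taken from the post.
# The keys are illustrative labels, not identifiers from pgchem.
timings_ms = {
    "openbabel_binary_smiles": (9236, 936),
    "indigo_binary": (24821, 2418),
    "openbabel_smiles": (57971, 5432),
    "openbabel_vf2_storage_not_stated": (90526, 8798),
}


def index_speedup(no_index_ms, with_index_ms):
    """Return how many times faster the indexed search ran."""
    return no_index_ms / with_index_ms


speedups = {name: index_speedup(*pair) for name, pair in timings_ms.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x faster with index")
```

In every configuration the index buys roughly one order of magnitude, so the ranking between the engines is the same with and without it.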

Stereochemistry in substructure searches

Check	Query	Expected	Indigo binary	OpenBabel SMILES	OpenBabel binary
R/S different	select 'C[C@H]([C@@H](C(=O)O)N)O'::molecule <= 'O=C(O)[C@H](N)[C@@H](O)C'::molecule	false	pass	fail	fail
R/S same	select 'C[C@H]([C@@H](C(=O)O)N)O'::molecule <= 'C[C@H]([C@@H](C(=O)O)N)O'::molecule	true	pass	pass	pass
E/Z different	select 'C(=C\Cl)\Cl'::molecule <= 'C(=C/Cl)\Cl'::molecule	false	pass	fail	fail

OpenBabel fails the 'E/Z different' and 'R/S different' checks, but these are known issues up to version 2.3.1. ~~More disturbing are the issues with 'R/S different'. For SMILES I can reproduce it with obgrep, so it's not my code causing it.~~

Indigo has no obvious issues with matching R/S and E/Z stereochemistry. ~~Speedwise, OpenBabel with binary storage would be the leader of the pack, but has inconsistent behaviour compared with its SMILES storage sibling.~~ So the serialization/deserialization code clearly is missing something. But I've found a simple workaround: if the query contains chirality information, the search runs against the SMILES storage, otherwise against the binary storage. Speedwise, OpenBabel with binary storage is now the leader of the pack.
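
The workaround can be sketched roughly like this. This is not the actual pgchem code, only an illustration under the assumption that "contains chirality information" can be decided by looking for the SMILES stereo markers `@` (tetrahedral centres) and `/` or `\` (double bond geometry). The names `has_stereo` and `choose_storage` are hypothetical.

```python
# SMILES characters that carry stereo information:
# '@' marks tetrahedral chirality, '/' and '\' mark double bond geometry.
STEREO_MARKERS = ("@", "/", "\\")


def has_stereo(query_smiles: str) -> bool:
    """Return True if the query string contains any stereo marker."""
    return any(marker in query_smiles for marker in STEREO_MARKERS)


def choose_storage(query_smiles: str) -> str:
    """Pick the storage to match against: SMILES for stereo queries, else binary."""
    return "smiles" if has_stereo(query_smiles) else "binary"


print(choose_storage("c1ccccc1Cl"))                # binary
print(choose_storage("C[C@H]([C@@H](C(=O)O)N)O"))  # smiles
print(choose_storage(r"C(=C\Cl)\Cl"))              # smiles
```

A plain string test is cheap enough to run on every query, so queries without stereo markers keep the faster binary path.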

Fingerprint efficiency

Fingerprint type	Candidates screened	Hits matched	False positives	Efficiency
ext+sub	8090	8049	41	0.995
FP2	8145	8067	78	0.99

Pretty close.
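
For reference, the efficiency column is consistent with hits divided by screened candidates. A minimal sketch that reproduces the table from the two counts per fingerprint (the function name is my own choice for illustration):

```python
def screening_efficiency(candidates: int, hits: int) -> float:
    """Fraction of fingerprint-screened candidates that are confirmed as hits."""
    if candidates <= 0:
        raise ValueError("candidates must be positive")
    return hits / candidates


# (candidates screened, hits matched) per fingerprint type, from the table above.
rows = {
    "ext+sub": (8090, 8049),
    "FP2": (8145, 8067),
}

for name, (candidates, hits) in rows.items():
    false_positives = candidates - hits
    efficiency = screening_efficiency(candidates, hits)
    print(f"{name}: {false_positives} false positives, efficiency {efficiency:.3f}")
```

The closer the value is to 1, the fewer candidates the slower atom-by-atom matcher has to reject after the fingerprint screen.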

7 comments:

Mikhail, December 8, 2011 at 7:49 AM
Hello! Could you explain how you tested the "R/S same" check? I have checked it using Indigo and it works well: Indigo finds a match of C[C@H]([C@@H](C(=O)O)N)O in the same molecule.
ergo, December 8, 2011 at 8:18 AM
I'm currently checking this. I suspect it is a bug in my code...
ergo, December 8, 2011 at 9:24 AM
Yes, it was a bug on my side. Fixed it and updated the post.
Michael, December 8, 2011 at 10:20 AM
Hello!
I have tested substructure searching c1ccccc1Cl on the first 100000 compounds from the PubChem snapshot dated May with the Bingo-SqlServer cartridge. It returned 8688 results, which differs from your count. I suppose that we may be using different versions of PubChem. Can you publish compounds used in your tests?
Also I'm interested in efficiency of FP2 fingerprints. In Bingo 8705 compounds passed fingerprint test and 8688 were finally matched, so in this case fingerprints screening efficiency is 0.998. Can you provide such results for FP2?

As for time measurements, I recommend you to install Bingo-PostgreSQL cartridge and compare it to pgchem::tigress. Indigo is simplified high level C API over low level C++ API, which is used in Bingo with code optimizations, so that Bingo should be faster.

Michael Kvyatkovskiy, GGA Software.
Mikhail, December 8, 2011 at 8:55 PM
ergo, do you have plans to test some set of queries? Your single query is not very representative.

Mikhail Rybalkin, GGA Software.
ergo, December 9, 2011 at 11:09 AM
"Also I'm interested in efficiency of FP2 fingerprints. In Bingo 8705 compounds passed fingerprint test and 8688 were finally matched, so in this case fingerprints screening efficiency is 0.998. Can you provide such results for FP2?"

Done.

"Can you publish compounds used in your tests?"

It's a 55 MB archive. I'll see where I can host it...

"As for time measurements, I recommend you to install Bingo-PostgreSQL cartridge and compare it to pgchem::tigress. Indigo is simplified high level C API over low level C++ API, which is used in Bingo with code optimizations, so that Bingo should be faster."

This was not intended as a shootout between Bingo and pgchem :-), only as a comparison between Indigo and OpenBabel as alternative matching engines for pgchem. Since I cannot use C++ directly in PostgreSQL, OpenBabel is called through a C wrapper library, like Indigo, so I think from pgchem's perspective this is the correct way to compare them.

"Your single query is not very representative."

I agree, but the test data used is also not representative. The question is, what is 'representative'? I'd say this depends on the type of system: a catalogue of commercially available chemicals will contain different data than a medchem screening library, and the typical queries will be different too. To roughly show the differences between the various alternatives, my approach is sufficient, I'd say, and that was the only intention, not an in-depth comparison between Indigo and OpenBabel.

But I'm open to suggestions. What would a representative dataset for general benchmarking be like? And what queries should be used?

best regards,
Ergo
Mikhail, December 13, 2011 at 11:33 AM
"This was not intended as a shootout between Bingo and pgchem :-)"

Indigo and Bingo share the same core code. The simplified Indigo API was designed for easy access to the underlying cheminformatics algorithms, but it is not heavily optimized. For Bingo we have also developed an optimized C API for sequential processing, and we are thinking about documenting this API as was done with Indigo. A comparison with Bingo could show us whether it is worth it or not.

"What would a representative dataset for general benchmarking be like? And what queries should be used?"

This is a very interesting question, and I do not know the answer to it. Just yesterday Andrew Dalke asked the same question on the BlueObelisk forum. At least I would suggest testing fingerprint efficiency on different queries, because I'm sure that the Indigo fingerprint is better than FP2.

There is a small set of queries used in the article "Chemical substructure search in SQL" by Golovin and Henrick. These queries were later reused in other articles:
GH1 ONC1CC(C(O)C1O)[n]2cnc3c(NC4CC4)ncnc23
GH2 Nc1ncnc2[n]cnc12
GH3 CNc1ncnc2[n](C)cnc12
GH4 Nc1ncnc2[n](cnc12)C3CCCC3
GH5 CC12CCC3C(CCC4=CC(O)CCC34C)C1CCC2
GH6 OC2=CC(=O)c1c(cccc1)O2
GH7 Nc1nnc(S)s1
GH8 C1C2SCCN2C1
GH9 CP(O)(O)=O
GH10 CCCCCP(O)(O)=O
GH11 N2CCC13CCCCC1C2Cc4c3cccc4
GH12 s1cncc1
GH13 C34CCC1C(CCC2CC(=O)CCC12)C3CCC4
GH14 CCCCCCCCCCCP(O)(O)=O
GH15 CC1CCCC1
GH16 CCC1CCCC1
GH17 CCCC1CCCC1

I think that such a list is sufficient for a simple comparison.