We still have postgres on a VMWare, but now directly attached to a beefy SAN (HP EVA 4400). For ChEBIs molecule of the month we get:
select count(*) from compound where molecule('CC1=C(C(=O)C(=C(C1=O)OC)OC)CC=C(C)CCC=C(C)CCC=C(C)CCC=C(C)CCC=C(C)CCC=C') <= mol_structure; count ------- 63 Time: 65308.260 ms
So that's around one minute, and the query plan is QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=113782.66..113782.67 rows=1 width=0) -> Bitmap Heap Scan on compound (cost=5423.32..113710.53 rows=28851 width=0) Recheck Cond: ('CC1=C(C(=O)C(=C(C1=O)OC)OC)CC=C(C)CCC=C(C)CCC=C(C)CCC=C(C)CCC=C(C)CCC=C'::molecule <= mol_structure) -> Bitmap Index Scan on compound_structure (cost=0.00..5416.11 rows=28851 width=0) Index Cond: ('CC1=C(C(=O)C(=C(C1=O)OC)OC)CC=C(C)CCC=C(C)CCC=C(C)CCC=C(C)CCC=C(C)CCC=C'::molecule <= mol_structure)
sorry for the late reply. Technically, PostgreSQL and tigress can handle this. But deriving from Steffen's experience with 28M it will not be fast. Having said that, the key questions are:
What means 'fast' to you - what is your expectancy for query response time? How many concurrent users? How much power does the intended hardware deliver? What is the update frequency of the database? How do your queries look like?
Especially the last question is very important to know to make a sound recommendation because queries can be tuned in various ways when the access pattern of the intended application is known.
Besides, substructure searching 100M structures will push any chemical cartridge to some or another limit...
metchem=# select count(*) from compound;
ReplyDeletecount
----------
28838421
(1 row)
This took ~15 hours on a dual-"CPU" VMware image
attached to a HP storage SAN box.
Thanks for pgchem !
Yours,
Steffen
How about search performance. Is it usable with that amount of data?
ReplyDeleteHi,
ReplyDeleteWe still have postgres on a VMWare, but now
directly attached to a beefy SAN (HP EVA 4400).
For ChEBIs molecule of the month we get:
select count(*) from compound where molecule('CC1=C(C(=O)C(=C(C1=O)OC)OC)CC=C(C)CCC=C(C)CCC=C(C)CCC=C(C)CCC=C(C)CCC=C') <= mol_structure;
count
-------
63
Time: 65308.260 ms
So that's around one minute, and the query plan is
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=113782.66..113782.67 rows=1 width=0)
-> Bitmap Heap Scan on compound (cost=5423.32..113710.53 rows=28851 width=0)
Recheck Cond: ('CC1=C(C(=O)C(=C(C1=O)OC)OC)CC=C(C)CCC=C(C)CCC=C(C)CCC=C(C)CCC=C(C)CCC=C'::molecule <= mol_structure)
-> Bitmap Index Scan on compound_structure (cost=0.00..5416.11 rows=28851 width=0)
Index Cond: ('CC1=C(C(=O)C(=C(C1=O)OC)OC)CC=C(C)CCC=C(C)CCC=C(C)CCC=C(C)CCC=C(C)CCC=C'::molecule <= mol_structure)
Yours,
Steffen
One more: The second execution (with all caches filled) is *much* faster:
ReplyDeleteTime: 209.294 ms
Steffen
Hi there,
ReplyDeleteI am considering to use pgchem (for substructure/similarity searching) - but I have a DB of 100M chemicals.
Do you reccommend it?
Hello JP,
ReplyDeletesorry for the late reply. Technically, PostgreSQL and tigress can handle this. But deriving from Steffen's experience with 28M it will not be fast. Having said that, the key questions are:
What means 'fast' to you - what is your expectancy for query response time?
How many concurrent users?
How much power does the intended hardware deliver?
What is the update frequency of the database?
How do your queries look like?
Especially the last question is very important to know to make a sound recommendation because queries can be tuned in various ways when the access pattern of the intended application is known.
Besides, substructure searching 100M structures will push any chemical cartridge to some or another limit...
best regards,
Ergo