The plate is bad: pgchem::tigress sets new world record!

Friday, November 26, 2010

pgchem::tigress sets new world record!

The IPB now has successfully loaded ~28 million structures from pubchem into a pgchem::tigress / PostgreSQL database.

This is the largest installation I know of and therefore sets a new world record - for pgchem. :-)

6 comments:

AnonymousNovember 30, 2010 at 2:57 PM
metchem=# select count(*) from compound;
count
----------
28838421
(1 row)

This took ~15 hours on a dual-"CPU" VMware image
attached to a HP storage SAN box.

Thanks for pgchem !

Yours,
Steffen
ReplyDelete
Replies
ergoDecember 2, 2010 at 12:19 PM
How about search performance. Is it usable with that amount of data?
ReplyDelete
Replies
AnonymousDecember 14, 2010 at 3:28 PM
Hi,

We still have postgres on a VMWare, but now
directly attached to a beefy SAN (HP EVA 4400).
For ChEBIs molecule of the month we get:

select count(*) from compound where molecule('CC1=C(C(=O)C(=C(C1=O)OC)OC)CC=C(C)CCC=C(C)CCC=C(C)CCC=C(C)CCC=C(C)CCC=C') <= mol_structure;
count
-------
63
Time: 65308.260 ms

So that's around one minute, and the query plan is
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=113782.66..113782.67 rows=1 width=0)
-> Bitmap Heap Scan on compound (cost=5423.32..113710.53 rows=28851 width=0)
Recheck Cond: ('CC1=C(C(=O)C(=C(C1=O)OC)OC)CC=C(C)CCC=C(C)CCC=C(C)CCC=C(C)CCC=C(C)CCC=C'::molecule <= mol_structure)
-> Bitmap Index Scan on compound_structure (cost=0.00..5416.11 rows=28851 width=0)
Index Cond: ('CC1=C(C(=O)C(=C(C1=O)OC)OC)CC=C(C)CCC=C(C)CCC=C(C)CCC=C(C)CCC=C(C)CCC=C'::molecule <= mol_structure)

Yours,
Steffen
ReplyDelete
Replies
AnonymousDecember 14, 2010 at 3:34 PM
One more: The second execution (with all caches filled) is *much* faster:
Time: 209.294 ms

Steffen
ReplyDelete
Replies
UnknownFebruary 28, 2011 at 10:43 AM
Hi there,

I am considering to use pgchem (for substructure/similarity searching) - but I have a DB of 100M chemicals.

Do you reccommend it?
ReplyDelete
Replies
ergoMarch 7, 2011 at 12:22 PM
Hello JP,

sorry for the late reply. Technically, PostgreSQL and tigress can handle this. But deriving from Steffen's experience with 28M it will not be fast. Having said that, the key questions are:

What means 'fast' to you - what is your expectancy for query response time?
How many concurrent users?
How much power does the intended hardware deliver?
What is the update frequency of the database?
How do your queries look like?

Especially the last question is very important to know to make a sound recommendation because queries can be tuned in various ways when the access pattern of the intended application is known.

Besides, substructure searching 100M structures will push any chemical cartridge to some or another limit...

best regards,
Ergo
ReplyDelete
Replies

Add comment

Friday, November 26, 2010

pgchem::tigress sets new world record!

6 comments:

Blog Archive

Blog Shortlist