Friday, November 26, 2010

pgchem::tigress sets new world record!

The IPB now has successfully loaded ~28 million structures from pubchem into a pgchem::tigress / PostgreSQL database.

This is the largest installation I know of and therefore sets a new world record - for pgchem. :-)

6 comments:

  1. metchem=# select count(*) from compound;
    count
    ----------
    28838421
    (1 row)

    This took ~15 hours on a dual-"CPU" VMware image
    attached to a HP storage SAN box.

    Thanks for pgchem !

    Yours,
    Steffen

    ReplyDelete
  2. How about search performance. Is it usable with that amount of data?

    ReplyDelete
  3. Hi,

    We still have postgres on a VMWare, but now
    directly attached to a beefy SAN (HP EVA 4400).
    For ChEBIs molecule of the month we get:

    select count(*) from compound where molecule('CC1=C(C(=O)C(=C(C1=O)OC)OC)CC=C(C)CCC=C(C)CCC=C(C)CCC=C(C)CCC=C(C)CCC=C') <= mol_structure;
    count
    -------
    63
    Time: 65308.260 ms

    So that's around one minute, and the query plan is
    QUERY PLAN
    ----------------------------------------------------------------------------------------------------------------------------------
    Aggregate (cost=113782.66..113782.67 rows=1 width=0)
    -> Bitmap Heap Scan on compound (cost=5423.32..113710.53 rows=28851 width=0)
    Recheck Cond: ('CC1=C(C(=O)C(=C(C1=O)OC)OC)CC=C(C)CCC=C(C)CCC=C(C)CCC=C(C)CCC=C(C)CCC=C'::molecule <= mol_structure)
    -> Bitmap Index Scan on compound_structure (cost=0.00..5416.11 rows=28851 width=0)
    Index Cond: ('CC1=C(C(=O)C(=C(C1=O)OC)OC)CC=C(C)CCC=C(C)CCC=C(C)CCC=C(C)CCC=C(C)CCC=C'::molecule <= mol_structure)

    Yours,
    Steffen

    ReplyDelete
  4. One more: The second execution (with all caches filled) is *much* faster:
    Time: 209.294 ms

    Steffen

    ReplyDelete
  5. Hi there,

    I am considering to use pgchem (for substructure/similarity searching) - but I have a DB of 100M chemicals.

    Do you reccommend it?

    ReplyDelete
  6. Hello JP,

    sorry for the late reply. Technically, PostgreSQL and tigress can handle this. But deriving from Steffen's experience with 28M it will not be fast. Having said that, the key questions are:

    What means 'fast' to you - what is your expectancy for query response time?
    How many concurrent users?
    How much power does the intended hardware deliver?
    What is the update frequency of the database?
    How do your queries look like?

    Especially the last question is very important to know to make a sound recommendation because queries can be tuned in various ways when the access pattern of the intended application is known.

    Besides, substructure searching 100M structures will push any chemical cartridge to some or another limit...

    best regards,
    Ergo

    ReplyDelete