The case at hand: Screening large sets of molecules for chemical simliarity.
Since (sub)graph isomorphism searching faces some mathematical challenges because of nonpolynomial O - even if you can use a specialized index, like pgchem::tigress does - fast similarity searching based on binary fingerprints has gained popularity in recent years.
I was tasked with evaluating a solution to the problem of similarity screening large sets of molecules with PostgreSQL where the fingerprints are generated externally, e.g. with the CDK.
This is, what I came up with...
Preparing the Racetrack
CREATE TABLE cdk.externalfp (
id int4 NOT NULL,
smiles varchar NOT NULL,
pubchemfp varbit NULL,
"cardinality" int4 NULL,
CONSTRAINT externalfp_pk PRIMARY KEY (id)
);
Above is the table definition of the final table. The cardinality column will be not used now, but since it is calculated by the fingerprint generator anyway, keeping it will save some work later. If you want to copy my example code 1:1, please use a database named chemistry and a schema named cdk.
First we need to load some data into the table. I used the free NCISMA99 dataset from the National Cancer Institute, containing 249081 chemical structures in SMILES notation.
COPY cdk.externalfp (id, smiles) FROM '/tmp/NCISMA99'
WITH (DELIMITER ' ', HEADER false, FORMAT csv);
And a few seconds later you should have 249081 rows in the table. Now we need to generate the fingerprints. The generator code is here, additionally you need the CDK 2.2 and a PostgreSQL JDBC driver. After changing the code to reflect your JDBC URL you are good to go.
Running the FingerprintGenerator should show no errors and takes about 30 Minutes on my Core i5 Linux Notebook. The fingerprint used is the PubChem fingerprint as described here.
Now we can put an index on the cardinality column (also used later) and are all set.
CREATE INDEX externalfp_cardinality_idx ON cdk.externalfp USING btree (cardinality);
Almost...
The function to calculate the similarity measure is still missing. We use the Tanimoto coefficient, as it is widely used and fairly easy to understand. The Tanimoto coefficient over PostgreSQL BIT VARYING can thus be written in pure SQL as:
CREATE OR REPLACE FUNCTION cdk.tanimoto(bit varying, bit varying)
RETURNS real
LANGUAGE sql
IMMUTABLE STRICT SECURITY INVOKER LEAKPROOF
AS $function$
select length(replace(($1 & $2)::text, '0', ''))::real / length(replace(($1 | $2)::text, '0', ''))::real;
$function$;
The Baseline. No brain, no muscles
For the first naive test, we run FindBySimilarity with the following inputs:
SMILES: CC(=O)OC1=CC=CC=C1C(=O)O
Threshold: 0.9
top N: 10
which gives the following plan from EXPLAIN ANALYZE:
Limit (cost=14515.22..14516.38 rows=10 width=77) (actual time=6436.022..6436.621 rows=10 loops=1)
-> Gather Merge (cost=14515.22..22587.95 rows=69190 width=77) (actual time=6436.021..6436.618 rows=10 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=13515.19..13601.68 rows=34595 width=77) (actual time=6432.534..6432.535 rows=8 loops=3)
Sort Key: (((length(replace((('1100000001110000001110000000000000000000000000000000000000000000000000000000000000000000000000
Sort Method: top-N heapsort Memory: 27kB
Worker 0: Sort Method: quicksort Memory: 27kB
Worker 1: Sort Method: top-N heapsort Memory: 27kB
-> Parallel Seq Scan on externalfp (cost=0.00..12767.61 rows=34595 width=77) (actual time=41.329..6432.411 rows=30 loops=3)
Filter: (((length(replace((('110000000111000000111000000000000000000000000000000000000000000000000000000000000000000000
Rows Removed by Filter: 82997
Planning Time: 0.169 ms
Execution Time: 6436.648 ms
Including retrieval, this plan leads to an overall response time of about 14 seconds.
First race. Introducing the Hedgehog
An index could be helpful, but can we build one without major programming effort? Yes, due to the work of S. Joshua Swamidass and Pierre Baldi. In their 2007 paper "Bounds and Algorithms for Fast Exact Searches of Chemical Fingerprints in Linear and Sub-Linear Time", they found ways to calculate upper and lower bounds on the cardinality of fingerprints necessary to meet a given similarity once the cardinality of the search fingerprint is known.
The paper covers these calculations for various similarity measures. For Tanimoto it is:
min_cardinality_of_target = floor(cardinality_of_search_argument * similarity_threshold)
and
max_cardinality_of_target = ceil(cardinality_of_search_argument / similarity_threshold)
The function swamidassBaldiLimitsForTanimoto() in FindBySimilarity calculates those bounds and if you change:
//int[] lohi = swamidassBaldiLimitsForTanimoto(fp.cardinality(), threshold);
int[] lohi = {0, Integer.MAX_VALUE};
to
int[] lohi = swamidassBaldiLimitsForTanimoto(fp.cardinality(), threshold);
//int[] lohi = {0, Integer.MAX_VALUE};
they will be used. Now the index we already put on the "cardinality" column makes sense, allowing the database to filter out impossible candidates before invoking the tanimoto function.
For CC(=O)OC1=CC=CC=C1C(=O)O, the fingerprint has a cardinality of 115, which gives an upper bound of 128 and a lower bound of 103 bits. All fingerprints with cardinalities outside that bounds can be safely ignored since they cannot yield a Tanimoto coefficient >= 0.9.
Now the plan becomes:
Limit (cost=10445.68..10446.85 rows=10 width=77) (actual time=1492.430..1495.552 rows=10 loops=1)
-> Gather Merge (cost=10445.68..12328.11 rows=16134 width=77) (actual time=1492.429..1495.548 rows=10 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=9445.66..9465.82 rows=8067 width=77) (actual time=1489.566..1489.567 rows=7 loops=3)
Sort Key: (((length(replace((('1100000001110000001110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Sort Method: top-N heapsort Memory: 27kB
Worker 0: Sort Method: top-N heapsort Memory: 27kB
Worker 1: Sort Method: top-N heapsort Memory: 27kB
-> Parallel Bitmap Heap Scan on externalfp (cost=826.09..9271.33 rows=8067 width=77) (actual time=22.032..1489.502 rows=30 loops=3)
Recheck Cond: ((cardinality >= 103) AND (cardinality <= 128))
Filter: (((length(replace((('110000000111000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Rows Removed by Filter: 19303
Heap Blocks: exact=1928
-> Bitmap Index Scan on externalfp_cardinality_idx (cost=0.00..821.25 rows=58083 width=0) (actual time=5.238..5.238 rows=58000 loops=1)
Index Cond: ((cardinality >= 103) AND (cardinality <= 128))
Planning Time: 0.166 ms
Execution Time: 1495.587 ms
Including retrieval, this plan gives us now an overall response time of about 3 seconds, or 4.6 times faster.
Second race. Introducing the Hare
Time to pull out the big guns. The tanimoto calculation in SQL is apparently slow, primarily because PostgreSQL is lacking a native bit count function for BIT VARYING, so this must be emulated using binary string replace() and length().
However, we can build one in C. Since counting bits is a simple operation for a microprocessor (Actually, it is an art in itself. See Andrew Dalke's popcount benchmark for many different ways to count bits.), a C function should perform much better.
After you installed tanimoto.c, changed the tanimoto function calls in the SQL in FindBySimilarity to tanimoto_c, and disabled the Swamidass/Baldi indexing, we see the raw power of a native function.
The plan now becomes:
Limit (cost=10363.85..10365.02 rows=10 width=77) (actual time=45.829..48.101 rows=10 loops=1)
-> Gather Merge (cost=10363.85..18436.58 rows=69190 width=77) (actual time=45.827..48.097 rows=10 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=9363.83..9450.32 rows=34595 width=77) (actual time=43.667..43.668 rows=8 loops=3)
Sort Key: (tanimoto_c('11000000011100000011100000000000000000000000000000000000000000000000000000000000000000000000000000
Sort Method: top-N heapsort Memory: 27kB
Worker 0: Sort Method: top-N heapsort Memory: 27kB
Worker 1: Sort Method: top-N heapsort Memory: 27kB
-> Parallel Seq Scan on externalfp (cost=0.00..8616.24 rows=34595 width=77) (actual time=0.095..43.559 rows=30 loops=3)
Filter: (tanimoto_c('1100000001110000001110000000000000000000000000000000000000000000000000000000000000000000000000
Rows Removed by Filter: 82997
Planning Time: 0.062 ms
Execution Time: 48.145 ms
Overall response is 0.128 seconds, or 110 times faster.
Muscle beats brain. At least here.
Third race. Relay
But if we enable the Swamidass/Baldi index again and keep the native Tanimoto function, the real magic happens.
The final plan:
Limit (cost=9427.54..9427.56 rows=10 width=77) (actual time=35.567..35.570 rows=10 loops=1)
-> Sort (cost=9427.54..9475.94 rows=19361 width=77) (actual time=35.566..35.566 rows=10 loops=1)
Sort Key: (tanimoto_c('110000000111000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Sort Method: top-N heapsort Memory: 26kB
-> Bitmap Heap Scan on externalfp (cost=826.09..9009.15 rows=19361 width=77) (actual time=5.404..35.459 rows=90 loops=1)
Recheck Cond: ((cardinality >= 103) AND (cardinality <= 128))
Filter: (tanimoto_c('11000000011100000011100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Rows Removed by Filter: 57910
Heap Blocks: exact=6801
-> Bitmap Index Scan on externalfp_cardinality_idx (cost=0.00..821.25 rows=58083 width=0) (actual time=4.307..4.307 rows=58000 loops=1)
Index Cond: ((cardinality >= 103) AND (cardinality <= 128))
Planning Time: 0.149 ms
Execution Time: 35.641 ms
Final overall response now 0.04 seconds, or 350 times faster!
Muscle + brain: unmatched.
Conclusion
Sometimes the Hare wins, sometimes the Hedgehog does. Always attack nontrivial optimization problems from different angles and experiment, experiment, experiment.While C largely outperforms the index in this case, it's still good to know both, because using native functions is not always possible, e.g. on an AWS RDS instance. Then, 4.6x faster is better than nothing.
PostgreSQL needs a native builtin function for counting bits in BIT VARYING. Since fingerprint screening can be used not only for chemical structures, but for virtually every complex data that can be broken down into binary features, this would be ever so useful.