Friday, April 19, 2019

The Hare and the Hedgehog. Muscle, brain - or both?

In the famous fairy tale the hedgehog wins the race against the hare because he uses his brain to outwit the much faster hare: Brain beats muscle. But is that always the case? And what if we combine the two virtues?

The case at hand: Screening large sets of molecules for chemical simliarity.

Since (sub)graph isomorphism searching faces some mathematical challenges because of nonpolynomial O - even if you can use a specialized index, like pgchem::tigress does - fast similarity searching based on binary fingerprints has gained popularity in recent years.

I was tasked with evaluating a solution to the problem of similarity screening large sets of molecules with PostgreSQL where the fingerprints are generated externally, e.g. with the CDK.

This is, what I came up with...

Preparing the Racetrack


CREATE TABLE cdk.externalfp (
id int4 NOT NULL,
smiles varchar NOT NULL,
pubchemfp varbit NULL,
"cardinality" int4 NULL,
CONSTRAINT externalfp_pk PRIMARY KEY (id)
);

Above is the table definition of the final table. The cardinality column will be not used now, but since it is calculated by the fingerprint generator anyway, keeping it will save some work later. If you want to copy my example code 1:1, please use a database named chemistry and a schema named cdk.

First we need to load some data into the table. I used the free NCISMA99 dataset  from the National Cancer Institute, containing 249081 chemical structures in SMILES notation.

COPY cdk.externalfp (id, smiles) FROM '/tmp/NCISMA99' 
WITH (DELIMITER ' ', HEADER false, FORMAT csv);

And a few seconds later you should have 249081 rows in the table. Now we need to generate the fingerprints. The generator code is here, additionally you need the CDK 2.2 and a PostgreSQL JDBC driver. After changing the code to reflect your JDBC URL you are good to go.

Running the FingerprintGenerator should show no errors and takes about 30 Minutes on my Core i5 Linux Notebook. The fingerprint used is the PubChem fingerprint as described here.
Now we can put an index on the cardinality column (also used later) and are all set.

CREATE INDEX externalfp_cardinality_idx ON cdk.externalfp USING btree (cardinality);

Almost...

The function to calculate the similarity measure is still missing. We use the Tanimoto coefficient, as it is widely used and fairly easy to understand. The Tanimoto coefficient over PostgreSQL BIT VARYING can thus be written in pure SQL as:

CREATE OR REPLACE FUNCTION cdk.tanimoto(bit varying, bit varying)
 RETURNS real
 LANGUAGE sql
 IMMUTABLE STRICT SECURITY INVOKER LEAKPROOF
AS $function$
select length(replace(($1 & $2)::text, '0', ''))::real / length(replace(($1 | $2)::text, '0', ''))::real;
$function$;

Please note that the required bitcount function on BIT VARYING is emulated by removing all 0s and measuring the length of the remaining string, containing all 1s.


The Baseline. No brain, no muscles


For the first naive test, we run FindBySimilarity with the following inputs:

SMILES: CC(=O)OC1=CC=CC=C1C(=O)O
Threshold: 0.9
top N: 10

which gives the following plan from EXPLAIN ANALYZE:

Limit  (cost=14515.22..14516.38 rows=10 width=77) (actual time=6436.022..6436.621 rows=10 loops=1)                                         
  ->  Gather Merge  (cost=14515.22..22587.95 rows=69190 width=77) (actual time=6436.021..6436.618 rows=10 loops=1)                         
        Workers Planned: 2                                                                                                                 
        Workers Launched: 2                                                                                                                
        ->  Sort  (cost=13515.19..13601.68 rows=34595 width=77) (actual time=6432.534..6432.535 rows=8 loops=3)                            
              Sort Key: (((length(replace((('1100000001110000001110000000000000000000000000000000000000000000000000000000000000000000000000
              Sort Method: top-N heapsort  Memory: 27kB                                                                                    
              Worker 0:  Sort Method: quicksort  Memory: 27kB                                                                              
              Worker 1:  Sort Method: top-N heapsort  Memory: 27kB                                                                         
              ->  Parallel Seq Scan on externalfp  (cost=0.00..12767.61 rows=34595 width=77) (actual time=41.329..6432.411 rows=30 loops=3)
                    Filter: (((length(replace((('110000000111000000111000000000000000000000000000000000000000000000000000000000000000000000
                    Rows Removed by Filter: 82997                                                                                          
Planning Time: 0.169 ms                                                                                                                    
Execution Time: 6436.648 ms                                                                                                                                                                                                                                                                                                                           
Including retrieval, this plan leads to an overall response time of about 14 seconds.


First race. Introducing the Hedgehog


An index could be helpful, but can we build one without major programming effort? Yes, due to the work of S. Joshua Swamidass and Pierre Baldi. In their 2007 paper "Bounds and Algorithms for Fast Exact Searches of Chemical Fingerprints in Linear and Sub-Linear Time", they found ways to calculate upper and lower bounds on the cardinality of fingerprints necessary to meet a given similarity once the cardinality of the search fingerprint is known.

The paper covers these calculations for various similarity measures. For Tanimoto it is:

min_cardinality_of_target = floor(cardinality_of_search_argument * similarity_threshold)

and 

max_cardinality_of_target = ceil(cardinality_of_search_argument / similarity_threshold)

The function swamidassBaldiLimitsForTanimoto() in FindBySimilarity calculates those bounds and if you change:

//int[] lohi = swamidassBaldiLimitsForTanimoto(fp.cardinality(), threshold);
int[] lohi = {0, Integer.MAX_VALUE};

to

int[] lohi = swamidassBaldiLimitsForTanimoto(fp.cardinality(), threshold);
//int[] lohi = {0, Integer.MAX_VALUE};

they will be used. Now the index we already put on the "cardinality" column makes sense, allowing the database to filter out impossible candidates before invoking the tanimoto function.

For CC(=O)OC1=CC=CC=C1C(=O)O, the fingerprint has a cardinality of 115, which gives an upper bound of 128 and a lower bound of 103 bits. All fingerprints with cardinalities outside that bounds can be safely ignored since they cannot yield a Tanimoto coefficient >= 0.9.

Now the plan becomes:

Limit  (cost=10445.68..10446.85 rows=10 width=77) (actual time=1492.430..1495.552 rows=10 loops=1)                                                           
  ->  Gather Merge  (cost=10445.68..12328.11 rows=16134 width=77) (actual time=1492.429..1495.548 rows=10 loops=1)                                           
        Workers Planned: 2                                                                                       
        Workers Launched: 2                                                                                                                                  
        ->  Sort  (cost=9445.66..9465.82 rows=8067 width=77) (actual time=1489.566..1489.567 rows=7 loops=3)                                                 
              Sort Key: (((length(replace((('1100000001110000001110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
              Sort Method: top-N heapsort  Memory: 27kB                                                                                                      
              Worker 0:  Sort Method: top-N heapsort  Memory: 27kB                                                                                           
              Worker 1:  Sort Method: top-N heapsort  Memory: 27kB                                                                                           
              ->  Parallel Bitmap Heap Scan on externalfp  (cost=826.09..9271.33 rows=8067 width=77) (actual time=22.032..1489.502 rows=30 loops=3)          
                    Recheck Cond: ((cardinality >= 103) AND (cardinality <= 128))                                                                            
                    Filter: (((length(replace((('110000000111000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
                    Rows Removed by Filter: 19303                                                                                                            
                    Heap Blocks: exact=1928                                                                                                                  
                    ->  Bitmap Index Scan on externalfp_cardinality_idx  (cost=0.00..821.25 rows=58083 width=0) (actual time=5.238..5.238 rows=58000 loops=1)
                          Index Cond: ((cardinality >= 103) AND (cardinality <= 128))                                                                        
Planning Time: 0.166 ms                                                                                                                                      
Execution Time: 1495.587 ms                                                                                                                                                             

Including retrieval, this plan gives us now an overall response time of about 3 seconds, or 4.6 times faster.

Second race. Introducing the Hare


Time to pull out the big guns. The tanimoto calculation in SQL is apparently slow, primarily because PostgreSQL is lacking a native bit count function for BIT VARYING, so this must be emulated using binary string replace() and length().

However, we can build one in C. Since counting bits is a simple operation for a microprocessor (Actually, it is an art in itself. See Andrew Dalke's popcount benchmark for many different ways to count bits.), a C function should perform much better.

After you installed tanimoto.c, changed the tanimoto function calls in the SQL in FindBySimilarity to tanimoto_c, and disabled the Swamidass/Baldi indexing, we see the raw power of a native function.

The plan now becomes:

Limit  (cost=10363.85..10365.02 rows=10 width=77) (actual time=45.829..48.101 rows=10 loops=1)                                         
  ->  Gather Merge  (cost=10363.85..18436.58 rows=69190 width=77) (actual time=45.827..48.097 rows=10 loops=1)                         
        Workers Planned: 2                                                                                                             
        Workers Launched: 2                                                                                                            
        ->  Sort  (cost=9363.83..9450.32 rows=34595 width=77) (actual time=43.667..43.668 rows=8 loops=3)                              
              Sort Key: (tanimoto_c('11000000011100000011100000000000000000000000000000000000000000000000000000000000000000000000000000
              Sort Method: top-N heapsort  Memory: 27kB                                                                                
              Worker 0:  Sort Method: top-N heapsort  Memory: 27kB                                                                     
              Worker 1:  Sort Method: top-N heapsort  Memory: 27kB                                                                     
              ->  Parallel Seq Scan on externalfp  (cost=0.00..8616.24 rows=34595 width=77) (actual time=0.095..43.559 rows=30 loops=3)
                    Filter: (tanimoto_c('1100000001110000001110000000000000000000000000000000000000000000000000000000000000000000000000
                    Rows Removed by Filter: 82997                                                                                      
Planning Time: 0.062 ms                                                                                                                
Execution Time: 48.145 ms     


Overall response is 0.128 seconds, or 110 times faster.

Muscle beats brain. At least here.


Third race. Relay


But if we enable the Swamidass/Baldi index again and keep the native Tanimoto function, the real magic happens.

The final plan:


Limit  (cost=9427.54..9427.56 rows=10 width=77) (actual time=35.567..35.570 rows=10 loops=1)                                                           
  ->  Sort  (cost=9427.54..9475.94 rows=19361 width=77) (actual time=35.566..35.566 rows=10 loops=1)                                                   
        Sort Key: (tanimoto_c('110000000111000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
        Sort Method: top-N heapsort  Memory: 26kB                                                                                                      
        ->  Bitmap Heap Scan on externalfp  (cost=826.09..9009.15 rows=19361 width=77) (actual time=5.404..35.459 rows=90 loops=1)                     
              Recheck Cond: ((cardinality >= 103) AND (cardinality <= 128))                                                                            
              Filter: (tanimoto_c('11000000011100000011100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
              Rows Removed by Filter: 57910                                                                                                            
              Heap Blocks: exact=6801                                                                                                                  
              ->  Bitmap Index Scan on externalfp_cardinality_idx  (cost=0.00..821.25 rows=58083 width=0) (actual time=4.307..4.307 rows=58000 loops=1)
                    Index Cond: ((cardinality >= 103) AND (cardinality <= 128))                                                                        
Planning Time: 0.149 ms                                                                                                                                
Execution Time: 35.641 ms                                                                                                                                                                                                                                          
                                                                                                       
Final overall response now 0.04 seconds, or 350 times faster!

Muscle + brain: unmatched.

Conclusion

Sometimes the Hare wins, sometimes the Hedgehog does. Always attack nontrivial optimization problems from different angles and experiment, experiment, experiment.

While C largely outperforms the index in this case, it's still good to know both, because using native functions is not always possible, e.g. on an AWS RDS instance. Then, 4.6x faster is better than nothing.

PostgreSQL needs a native builtin function for counting bits in BIT VARYING. Since fingerprint screening can be used not only for chemical structures, but for virtually every complex data that can be broken down into binary features, this would be ever so useful.