Thursday, May 2, 2019

Not all CASTs are created equal?

Can somebody explain this?

[Solved: See the comments section]

PostgreSQL 11.2.

The documentation says:

"A type cast specifies a conversion from one data type to another. PostgreSQL accepts two equivalent syntaxes for type casts:

CAST ( expression AS type )


The CAST syntax conforms to SQL; the syntax with :: is historical PostgreSQL usage."

But when I test the lower limits of PostgreSQL's integer types, strange things happen.

select cast(-9223372036854775808 as bigint);
select cast(-2147483648 as integer);
select cast(-32768 as smallint);

All OK.

select -9223372036854775808::bigint;
select -2147483648::integer;
select -32768::smallint;

All fail with SQL Error [22003]: ERROR: out of range


select -9223372036854775807::bigint;
select -2147483647::integer;
select -32767::smallint;

All OK.


Friday, April 19, 2019

The Hare and the Hedgehog. Muscle, brain - or both?

In the famous fairy tale the hedgehog wins the race against the hare because he uses his brain to outwit the much faster hare: Brain beats muscle. But is that always the case? And what if we combine the two virtues?

The case at hand: Screening large sets of molecules for chemical simliarity.

Since (sub)graph isomorphism searching faces some mathematical challenges because of nonpolynomial O - even if you can use a specialized index, like pgchem::tigress does - fast similarity searching based on binary fingerprints has gained popularity in recent years.

I was tasked with evaluating a solution to the problem of similarity screening large sets of molecules with PostgreSQL where the fingerprints are generated externally, e.g. with the CDK.

This is, what I came up with...

Preparing the Racetrack

CREATE TABLE cdk.externalfp (
id int4 NOT NULL,
smiles varchar NOT NULL,
pubchemfp varbit NULL,
"cardinality" int4 NULL,
CONSTRAINT externalfp_pk PRIMARY KEY (id)

Above is the table definition of the final table. The cardinality column will be not used now, but since it is calculated by the fingerprint generator anyway, keeping it will save some work later. If you want to copy my example code 1:1, please use a database named chemistry and a schema named cdk.

First we need to load some data into the table. I used the free NCISMA99 dataset  from the National Cancer Institute, containing 249081 chemical structures in SMILES notation.

COPY cdk.externalfp (id, smiles) FROM '/tmp/NCISMA99' 

And a few seconds later you should have 249081 rows in the table. Now we need to generate the fingerprints. The generator code is here, additionally you need the CDK 2.2 and a PostgreSQL JDBC driver. After changing the code to reflect your JDBC URL you are good to go.

Running the FingerprintGenerator should show no errors and takes about 30 Minutes on my Core i5 Linux Notebook. The fingerprint used is the PubChem fingerprint as described here.
Now we can put an index on the cardinality column (also used later) and are all set.

CREATE INDEX externalfp_cardinality_idx ON cdk.externalfp USING btree (cardinality);


The function to calculate the similarity measure is still missing. We use the Tanimoto coefficient, as it is widely used and fairly easy to understand. The Tanimoto coefficient over PostgreSQL BIT VARYING can thus be written in pure SQL as:

CREATE OR REPLACE FUNCTION cdk.tanimoto(bit varying, bit varying)
AS $function$
select length(replace(($1 & $2)::text, '0', ''))::real / length(replace(($1 | $2)::text, '0', ''))::real;

Please note that the required bitcount function on BIT VARYING is emulated by removing all 0s and measuring the length of the remaining string, containing all 1s.

The Baseline. No brain, no muscles

For the first naive test, we run FindBySimilarity with the following inputs:

Threshold: 0.9
top N: 10

which gives the following plan from EXPLAIN ANALYZE:

Limit  (cost=14515.22..14516.38 rows=10 width=77) (actual time=6436.022..6436.621 rows=10 loops=1)                                         
  ->  Gather Merge  (cost=14515.22..22587.95 rows=69190 width=77) (actual time=6436.021..6436.618 rows=10 loops=1)                         
        Workers Planned: 2                                                                                                                 
        Workers Launched: 2                                                                                                                
        ->  Sort  (cost=13515.19..13601.68 rows=34595 width=77) (actual time=6432.534..6432.535 rows=8 loops=3)                            
              Sort Key: (((length(replace((('1100000001110000001110000000000000000000000000000000000000000000000000000000000000000000000000
              Sort Method: top-N heapsort  Memory: 27kB                                                                                    
              Worker 0:  Sort Method: quicksort  Memory: 27kB                                                                              
              Worker 1:  Sort Method: top-N heapsort  Memory: 27kB                                                                         
              ->  Parallel Seq Scan on externalfp  (cost=0.00..12767.61 rows=34595 width=77) (actual time=41.329..6432.411 rows=30 loops=3)
                    Filter: (((length(replace((('110000000111000000111000000000000000000000000000000000000000000000000000000000000000000000
                    Rows Removed by Filter: 82997                                                                                          
Planning Time: 0.169 ms                                                                                                                    
Execution Time: 6436.648 ms                                                                                                                                                                                                                                                                                                                           
Including retrieval, this plan leads to an overall response time of about 14 seconds.

First race. Introducing the Hedgehog

An index could be helpful, but can we build one without major programming effort? Yes, due to the work of S. Joshua Swamidass and Pierre Baldi. In their 2007 paper "Bounds and Algorithms for Fast Exact Searches of Chemical Fingerprints in Linear and Sub-Linear Time", they found ways to calculate upper and lower bounds on the cardinality of fingerprints necessary to meet a given similarity once the cardinality of the search fingerprint is known.

The paper covers these calculations for various similarity measures. For Tanimoto it is:

min_cardinality_of_target = floor(cardinality_of_search_argument * similarity_threshold)


max_cardinality_of_target = ceil(cardinality_of_search_argument / similarity_threshold)

The function swamidassBaldiLimitsForTanimoto() in FindBySimilarity calculates those bounds and if you change:

//int[] lohi = swamidassBaldiLimitsForTanimoto(fp.cardinality(), threshold);
int[] lohi = {0, Integer.MAX_VALUE};


int[] lohi = swamidassBaldiLimitsForTanimoto(fp.cardinality(), threshold);
//int[] lohi = {0, Integer.MAX_VALUE};

they will be used. Now the index we already put on the "cardinality" column makes sense, allowing the database to filter out impossible candidates before invoking the tanimoto function.

For CC(=O)OC1=CC=CC=C1C(=O)O, the fingerprint has a cardinality of 115, which gives an upper bound of 128 and a lower bound of 103 bits. All fingerprints with cardinalities outside that bounds can be safely ignored since they cannot yield a Tanimoto coefficient >= 0.9.

Now the plan becomes:

Limit  (cost=10445.68..10446.85 rows=10 width=77) (actual time=1492.430..1495.552 rows=10 loops=1)                                                           
  ->  Gather Merge  (cost=10445.68..12328.11 rows=16134 width=77) (actual time=1492.429..1495.548 rows=10 loops=1)                                           
        Workers Planned: 2                                                                                       
        Workers Launched: 2                                                                                                                                  
        ->  Sort  (cost=9445.66..9465.82 rows=8067 width=77) (actual time=1489.566..1489.567 rows=7 loops=3)                                                 
              Sort Key: (((length(replace((('1100000001110000001110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
              Sort Method: top-N heapsort  Memory: 27kB                                                                                                      
              Worker 0:  Sort Method: top-N heapsort  Memory: 27kB                                                                                           
              Worker 1:  Sort Method: top-N heapsort  Memory: 27kB                                                                                           
              ->  Parallel Bitmap Heap Scan on externalfp  (cost=826.09..9271.33 rows=8067 width=77) (actual time=22.032..1489.502 rows=30 loops=3)          
                    Recheck Cond: ((cardinality >= 103) AND (cardinality <= 128))                                                                            
                    Filter: (((length(replace((('110000000111000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
                    Rows Removed by Filter: 19303                                                                                                            
                    Heap Blocks: exact=1928                                                                                                                  
                    ->  Bitmap Index Scan on externalfp_cardinality_idx  (cost=0.00..821.25 rows=58083 width=0) (actual time=5.238..5.238 rows=58000 loops=1)
                          Index Cond: ((cardinality >= 103) AND (cardinality <= 128))                                                                        
Planning Time: 0.166 ms                                                                                                                                      
Execution Time: 1495.587 ms                                                                                                                                                             

Including retrieval, this plan gives us now an overall response time of about 3 seconds, or 4.6 times faster.

Second race. Introducing the Hare

Time to pull out the big guns. The tanimoto calculation in SQL is apparently slow, primarily because PostgreSQL is lacking a native bit count function for BIT VARYING, so this must be emulated using binary string replace() and length().

However, we can build one in C. Since counting bits is a simple operation for a microprocessor (Actually, it is an art in itself. See Andrew Dalke's popcount benchmark for many different ways to count bits.), a C function should perform much better.

After you installed tanimoto.c, changed the tanimoto function calls in the SQL in FindBySimilarity to tanimoto_c, and disabled the Swamidass/Baldi indexing, we see the raw power of a native function.

The plan now becomes:

Limit  (cost=10363.85..10365.02 rows=10 width=77) (actual time=45.829..48.101 rows=10 loops=1)                                         
  ->  Gather Merge  (cost=10363.85..18436.58 rows=69190 width=77) (actual time=45.827..48.097 rows=10 loops=1)                         
        Workers Planned: 2                                                                                                             
        Workers Launched: 2                                                                                                            
        ->  Sort  (cost=9363.83..9450.32 rows=34595 width=77) (actual time=43.667..43.668 rows=8 loops=3)                              
              Sort Key: (tanimoto_c('11000000011100000011100000000000000000000000000000000000000000000000000000000000000000000000000000
              Sort Method: top-N heapsort  Memory: 27kB                                                                                
              Worker 0:  Sort Method: top-N heapsort  Memory: 27kB                                                                     
              Worker 1:  Sort Method: top-N heapsort  Memory: 27kB                                                                     
              ->  Parallel Seq Scan on externalfp  (cost=0.00..8616.24 rows=34595 width=77) (actual time=0.095..43.559 rows=30 loops=3)
                    Filter: (tanimoto_c('1100000001110000001110000000000000000000000000000000000000000000000000000000000000000000000000
                    Rows Removed by Filter: 82997                                                                                      
Planning Time: 0.062 ms                                                                                                                
Execution Time: 48.145 ms     

Overall response is 0.128 seconds, or 110 times faster.

Muscle beats brain. At least here.

Third race. Relay

But if we enable the Swamidass/Baldi index again and keep the native Tanimoto function, the real magic happens.

The final plan:

Limit  (cost=9427.54..9427.56 rows=10 width=77) (actual time=35.567..35.570 rows=10 loops=1)                                                           
  ->  Sort  (cost=9427.54..9475.94 rows=19361 width=77) (actual time=35.566..35.566 rows=10 loops=1)                                                   
        Sort Key: (tanimoto_c('110000000111000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
        Sort Method: top-N heapsort  Memory: 26kB                                                                                                      
        ->  Bitmap Heap Scan on externalfp  (cost=826.09..9009.15 rows=19361 width=77) (actual time=5.404..35.459 rows=90 loops=1)                     
              Recheck Cond: ((cardinality >= 103) AND (cardinality <= 128))                                                                            
              Filter: (tanimoto_c('11000000011100000011100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
              Rows Removed by Filter: 57910                                                                                                            
              Heap Blocks: exact=6801                                                                                                                  
              ->  Bitmap Index Scan on externalfp_cardinality_idx  (cost=0.00..821.25 rows=58083 width=0) (actual time=4.307..4.307 rows=58000 loops=1)
                    Index Cond: ((cardinality >= 103) AND (cardinality <= 128))                                                                        
Planning Time: 0.149 ms                                                                                                                                
Execution Time: 35.641 ms                                                                                                                                                                                                                                          
Final overall response now 0.04 seconds, or 350 times faster!

Muscle + brain: unmatched.


Sometimes the Hare wins, sometimes the Hedgehog does. Always attack nontrivial optimization problems from different angles and experiment, experiment, experiment.

While C largely outperforms the index in this case, it's still good to know both, because using native functions is not always possible, e.g. on an AWS RDS instance. Then, 4.6x faster is better than nothing.

PostgreSQL needs a native builtin function for counting bits in BIT VARYING. Since fingerprint screening can be used not only for chemical structures, but for virtually every complex data that can be broken down into binary features, this would be ever so useful.

Wednesday, October 18, 2017

Wrong prediction of the day

"There is this fear people have that eventually actors are going to be replaced by computer characters. I don't think it is valid at all. The only thing that the technology is going to do is provide the actors with new places to go and new ways to go there."

- Steven Lisberger, 1982, BYTE Magazine, Vol. 07, #11, p. 74

Wednesday, March 29, 2017

Using PostgreSQL to get things done: 1. Installation

Currently I'm doing a short series of videos about data analysis PostgreSQL. Since this is for people who usually do not directly deal with databases every day, it starts very basic with the installation on Linux and Windows.

What do you think, is it worth the effort?

Thursday, March 23, 2017

Windows, keep your dirty fingers off my files!

I spent the better part of the morning figuring out why a colleague could not import a PostgreSQL dump in plain format made on Linux on his Windows machine.

According to documentation, this works like so (OS agnostic):

psql dbname < infile

However, this  gave the following error:

ERROR:  missing data for ...

However, the documentation for psql gives an alternative way to read commands from a file:

The -f switch.

Read commands from the file filename, rather than standard input. This option can be repeated and combined in any order with the -c option. When either -c or -f is specified, psql does not read commands from standard input; instead it terminates after processing all the -c and -f options in sequence. Except for that, this option is largely equivalent to the meta-command \i.
If filename is - (hyphen), then standard input is read until an EOF indication or \q meta-command. This can be used to intersperse interactive input with input from files. Note however that Readline is not used in this case (much as if -n had been specified).
Using this option is subtly different from writing psql < filename. In general, both will do what you expect, but using -f enables some nice features such as error messages with line numbers. There is also a slight chance that using this option will reduce the start-up overhead. On the other hand, the variant using the shell's input redirection is (in theory) guaranteed to yield exactly the same output you would have received had you entered everything by hand.

What this doesn't tell you, is that on Windows, CMD.exe apparently somehow tries to interpret the file it reads. And by doing so, it destroyed data in the dump so that COPY was unable to understand it anymore. So the last sentence of the statement above is just theory on Windows.

Long story short, with psql -f all went fine - and don't use I/O redirection with psql on Windows!