Thursday, February 26, 2015

The Long Tail - vertical table partitioning II

With all parts from the previous post in place, some mechanism is needed to do the routine maintenance by calling the transfer function automagically.

Of course this could be done with pgAgent or cron, but since it should be elegant, this calls for a background worker process.

For illustration purposes, I wrote a sample implementation called worker_ltt, based on the worker_spi sample code. It's sloppy - even the original comments are still in there.

Adding worker_ltt to shared_preload_libraries and

worker_ltt.naptime = 60
worker_ltt.database = 'yourdb'
worker_ltt.user = 'youruser'
worker_ltt.function = 'move_longtail'
 
to postgresql.conf starts executing move_longtail() every 60 seconds in yourdb as youruser. If the user is omitted, it runs with superuser rights!

Since move_longtail() basically can do anything, restricting the user is a good idea.
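
For example, a dedicated role with only the rights move_longtail() actually needs might look like this (a sketch; the role name is made up):

```sql
-- Hypothetical minimal-privilege role for the background worker:
CREATE ROLE longtail_worker LOGIN;

-- move_longtail() only reads and deletes from the_table
-- and inserts into the_table_archive:
GRANT SELECT, DELETE ON the_table TO longtail_worker;
GRANT INSERT ON the_table_archive TO longtail_worker;
```

With worker_ltt.user = 'longtail_worker', the worker then can't touch anything else in the database.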

For more security, the SQL statements could be moved entirely into the background worker, but then the approach would lose much of its flexibility... But this is a concept anyway; there is always room for improvement.

But it really works.

In part III, I'll try to give a rough estimate of how big the performance penalty is when the partitioned table switches from fast to slow storage during a query. And there is another important problem to be solved...

The Long Tail - vertical table partitioning I


DISCLAIMER: This is just an idea; I haven't tried this in a production environment!

Having said that, any input is welcome. :-)

Storing tons of rows in a table of which only a small percentage is frequently queried is a common scenario for an RDBMS, especially with databases that must keep historical information in case they are audited, e.g. in environments regulated by law.

With the advent of the first affordable 8 TB hard disks, and fast SSDs still being much more expensive per GB, I wondered whether such long-tailed tables could be split over an SSD holding the frequently accessed pages and a near-line storage HDD keeping the archive - elegantly - with PostgreSQL.

By elegant, I mean without fiddling around with VIEWs and INSTEAD OF triggers, while exposing a clean and familiar interface to the developer.

OK, since PostgreSQL already supports horizontal partitioning, spreading one table transparently over many parallel tables, how about vertical partitioning, spreading one table over a hierarchy of speed?

The speed zones can be mapped to tablespaces:

CREATE TABLESPACE fast LOCATION '/mnt/fastdisk';
CREATE TABLESPACE slow LOCATION '/mnt/slowdisk';

Next comes the table(s):

CREATE TABLE the_table
(
  id integer NOT NULL,
  value real
)
WITH (
  OIDS=FALSE
)
TABLESPACE fast;


CREATE TABLE the_table_archive
(
)
INHERITS (the_table)
WITH (
  OIDS=FALSE
)
TABLESPACE slow;


Table inheritance in PostgreSQL is so cool...

And a function to move data from fast to slow:

CREATE OR REPLACE FUNCTION move_longtail()
  RETURNS boolean AS
$BODY$
DECLARE
  worked   boolean := false;
  rowcount integer;
BEGIN
  SELECT count(*) INTO rowcount FROM ONLY the_table WHERE id >= 5000000;
  IF rowcount > 100000 THEN
    INSERT INTO the_table_archive SELECT * FROM ONLY the_table WHERE id >= 5000000;
    DELETE FROM ONLY the_table WHERE id >= 5000000;
    worked := true;
  END IF;
  RETURN worked;
END;
$BODY$
  LANGUAGE plpgsql VOLATILE STRICT;


This function moves rows only if a minimum number of movable rows qualify. This is recommended because SMR disks favor large contiguous writes, owing to how shingled magnetic recording works.

Notice the ONLY keyword. It allows precise control over which partition of the table is affected by a DML statement. Did I say already that table inheritance in PostgreSQL is so cool?
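
For instance, with the tables from above:

```sql
-- Without ONLY, the whole inheritance hierarchy is scanned:
SELECT count(*) FROM the_table;       -- rows on fast AND slow storage

-- With ONLY, just the parent table on fast storage:
SELECT count(*) FROM ONLY the_table;

-- Likewise, DML can be pinned to a single partition:
DELETE FROM ONLY the_table WHERE id >= 5000000;
```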

And basically that's it. All the developer sees is SELECT * FROM the_table; or something, staying oblivious to the underlying machinery.

Some kind of automatic maintenance is missing. On to part II...

Wednesday, February 25, 2015

pgchem::tigress 3.2 released

pgchem::tigress 3.2 is finally out!

  • This builds against PostgreSQL 9.4.x and OpenBabel 2.3.2 on Linux.
  • It contains all fixes and contributions of the previous versions.
  • Windows is not supported anymore - and since it builds and runs way better on Linux, probably never will be again.
  • Depiction functions have been removed. Their run time was too unpredictable to be run inside a database server.
  • Theoretical isotope pattern generation with MERCURY7 is now available with isotopes for 39 elements.

So: CREATE EXTENSION pgchem_tigress;

Friday, February 20, 2015

Counting rows again - inheritance strangeness solved

Taking a second look at the execution plans, I noticed that the scan on slow reads twice as many pages from disk as the one on fast:

Buffers: shared read=442478 vs. Buffers: shared read=221239

Since I loaded all rows into slow first and then moved 50% of them into fast, this makes sense, I guess.
If I understand it correctly, those pages in slow are now empty, but PostgreSQL keeps them for future use.

So I tried a VACUUM FULL on slow and ran my queries again. That changed the plans:

Buffers: shared read=221239 vs. Buffers: shared read=221239

And execution times are now about equal. The observed effect had nothing to do with table inheritance at all.

So VACUUM FULL can help when moving big chunks of data between tables with otherwise low write activity.
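
In this setup, that amounts to (the ANALYZE afterwards is my addition, to refresh planner statistics after the rewrite):

```sql
-- Rewrite the table, giving back the pages emptied by the bulk move:
VACUUM FULL everything.slow;

-- Refresh planner statistics for the rewritten table:
ANALYZE everything.slow;
```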

Counting rows again - inheritance strangeness

Hm, for something I'm trying at the moment on 9.4, I created two identical tables where the second one inherits all from the first:

CREATE UNLOGGED TABLE everything.slow
(
  id integer NOT NULL,
  value real
)
WITH (
  OIDS=FALSE
);


CREATE UNLOGGED TABLE everything.fast
(
)
INHERITS (everything.slow)
WITH (
  OIDS=FALSE
);


No indexes. Default tablespace.

If I then load 50 million records into each of those tables and query them individually using the ONLY restrictor, a count(*) on the parent table (slow) is slower than on the descendant (fast):

select count(*) from only everything.slow;

"Aggregate  (cost=1067783.10..1067783.11 rows=1 width=0) (actual time=4973.812..4973.813 rows=1 loops=1)"
"  Output: count(*)"
"  Buffers: shared read=442478"
"  ->  Seq Scan on everything.slow  (cost=0.00..942722.08 rows=50024408 width=0) (actual time=1012.708..3416.349 rows=50000000 loops=1)"
"        Output: id, value"
"        Buffers: shared read=442478"
"Planning time: 0.118 ms"
"Execution time: 4973.901 ms"


select count(*) from only everything.fast;

"Aggregate  (cost=846239.00..846239.01 rows=1 width=0) (actual time=3988.235..3988.235 rows=1 loops=1)"
"  Output: count(*)"
"  Buffers: shared read=221239"
"  ->  Seq Scan on everything.fast  (cost=0.00..721239.00 rows=50000000 width=0) (actual time=0.101..2403.813 rows=50000000 loops=1)"
"        Output: id, value"
"        Buffers: shared read=221239"
"Planning time: 0.086 ms"
"Execution time: 3988.302 ms"


The same happens with other aggregates, like avg(), too.

I had expected some overhead when querying without ONLY on slow, because of the traversal of the inheritance hierarchy, but not when I restrict the query to a specific table with ONLY...

Can someone explain this? UPDATE: Maybe I found it myself. See next post for my explanation.

Wednesday, February 11, 2015

Finding mass spectra with PostgreSQL: Indexing

When naively searching for a specific spectral contrast angle, e.g.

select id
  from spectrum,
       round(spectral_contrast('[{"m/z":144.1150218,"intensity":0.908209840784744},{"m/z":145.118459841064,"intensity":0.0841924148716925}]'::json, id), 2) as spectral_contrast
 where spectral_contrast = 1.0;

performance is not really stellar on my ~270k spectra database:

Execution time: 18695.619 ms for 71 hits.

This is no surprise, since the database has to calculate the spectral contrast angle for all rows and filter each row. How about indexing?

The simplest method I can envision is to precalculate the number of peaks for each spectrum, index that column, and add it to the query. Since the spectral contrast angle requires the number of peaks of the compared spectra to be equal, this is a safe restriction.
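
A sketch of that precalculation (assuming, hypothetically, that the peaks are stored in a json column named peaks; all names here are made up):

```sql
-- Precalculate the peak count once per spectrum:
ALTER TABLE spectrum ADD COLUMN num_peaks integer;

-- json_array_length() counts the peak objects in the json array:
UPDATE spectrum SET num_peaks = json_array_length(peaks);

-- A plain btree index supports the equality restriction in the query:
CREATE INDEX idx_spectrum_num_peaks ON spectrum (num_peaks);
```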

select id
  from spectrum,
       round(spectral_contrast('[{"m/z":144.1150218,"intensity":0.908209840784744},{"m/z":145.118459841064,"intensity":0.0841924148716925}]'::json, id), 2) as spectral_contrast
 where spectral_contrast = 1.0
   and num_peaks = 2;

Execution time: 797.346 ms for 71 hits.

OK, but can we do better? Yes, by using a more selective restriction, like the lower and upper bounds of the m/z values of the spectra: index that column and add it to the query. These bounds can easily be obtained with min(m/z) and max(m/z) and then be stored in a numrange column. Since the spectral contrast angle requires the m/z values of the peaks of the compared spectra to be equal, this again is a safe restriction.
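
Sketched the same way (again assuming a hypothetical json column peaks; numrange() takes numeric bounds, hence the casts):

```sql
ALTER TABLE spectrum ADD COLUMN mzrange numrange;

-- Store the inclusive [min, max] m/z range of each spectrum:
UPDATE spectrum s
   SET mzrange = (SELECT numrange(min("m/z")::numeric,
                                  max("m/z")::numeric, '[]')
                    FROM json_to_recordset(s.peaks)
                      AS x("m/z" double precision, intensity double precision));

-- A btree index supports the equality restriction in the query:
CREATE INDEX idx_spectrum_mzrange ON spectrum (mzrange);
```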

select id
  from spectrum,
       round(spectral_contrast('[{"m/z":144.1150218,"intensity":0.908209840784744},{"m/z":145.118459841064,"intensity":0.0841924148716925}]'::json, id), 2) as spectral_contrast
 where spectral_contrast = 1.0
   and mzrange = numrange(144.1150218,145.118459841064,'[]');

Execution time: 35.128 ms for 71 hits.

About 530 times faster.

Using a range type instead of storing the lower and upper bounds in discrete columns gives more flexibility when querying with other criteria, like the Tanimoto index. Then PostgreSQL's range operators, like && (overlap), might come in handy.
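
For example, prefiltering by overlap instead of exact equality (the window values here are made up):

```sql
-- Find spectra whose m/z range overlaps a window of interest:
SELECT id
  FROM spectrum
 WHERE mzrange && numrange(140.0, 150.0, '[]');
```

Note that a plain btree index only helps with equality on the range column; to support && with an index, a GiST index (CREATE INDEX ... USING gist (mzrange)) would be needed.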

Monday, February 9, 2015

Finding mass spectra with PostgreSQL: Spectra as function arguments

If the spectra are represented as tables/recordsets, as in my last posts on the topic, the question arises how to pass a table as an argument to a function in PostgreSQL without creating a custom datatype or using temporary tables.

1. As a one-dimensional array of m/z values followed by the corresponding intensity values. The array can then be cut in half and unnested inside the function into a set of records:

select unnest(sdarr[1:1]) as "m/z", unnest(sdarr[2:2]) as intensity from ...;

2. As a two-dimensional array of m/z values and their corresponding intensity values. The partial arrays can then be unnested inside the function into a set of records:

select unnest(tdarray[1:3]) as "m/z", unnest(tdarray[4:6]) as intensity from ...;

3. As JSON of m/z values and their corresponding intensity values. The JSON can then be converted inside the function into a set of records:
 
select * from json_to_recordset( '[{"m/z":1.0,"intensity":2.2},{"m/z":3.3,"intensity":4.8}]') as x("m/z" double precision, intensity double precision);

All statements above generate a recordset like:

m/z    intensity
1.0    2.2
3.3    4.8
...    ...

In an application, I'd go with JSON, since it has the structure most understandable to a developer working against this API, and it does not require fiddling around with the driver's array support functions, e.g. createArrayOf() in JDBC.
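
A minimal sketch of such a function taking a whole spectrum as JSON (the function peak_count is hypothetical, just to show the signature):

```sql
-- Hypothetical example: a spectrum passed in as a json argument.
CREATE OR REPLACE FUNCTION peak_count(spectrum json)
RETURNS integer AS
$BODY$
  SELECT count(*)::integer
    FROM json_to_recordset(spectrum)
      AS x("m/z" double precision, intensity double precision);
$BODY$
LANGUAGE sql IMMUTABLE;

SELECT peak_count('[{"m/z":1.0,"intensity":2.2},{"m/z":3.3,"intensity":4.8}]'); -- 2
```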

Thursday, February 5, 2015

count(*) is faster than count(1) - no more with nine.four

In PostgreSQL 9.4, count(1) is as fast as count(*).

Another reason to upgrade to 9.4.

count(<column>) is still slower if no index is used...

Wednesday, February 4, 2015

count(*) is faster than count(1) revisited

Whoa, I didn't expect that much of a response to the last post. :-)

"Can you post the `explain analyze` output of both queries? I'm pretty sure this is normal "distribution" caused by other stuff going on in the system"

OK. Since my 'test' was on a VirtualBox VM with a self-compiled PostgreSQL, which might not be a reliable test setup, I hijacked one of our real PostgreSQL servers:

uname -a
Linux  2.6.32-358.23.2.el6.x86_64 #1 SMP Sat Sep 14 05:32:37 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux

Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 16 GB RAM, storage on SAN, ext4.

select version();
"PostgreSQL 9.3.5 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4), 64-bit"

create table hmr(id serial, value real);

insert into hmr (value) select random()*10000000 from generate_series(1,10000000);


CREATE INDEX idx_id
  ON hmr
  USING btree
  (id);


vacuum analyze hmr;

-----------------------------------------------------------------------

explain analyze select count(1) from hmr;

"Aggregate  (cost=169248.00..169248.01 rows=1 width=0) (actual time=1439.521..1439.521 rows=1 loops=1)"
"  ->  Seq Scan on hmr  (cost=0.00..144248.00 rows=10000000 width=0) (actual time=0.006..642.276 rows=10000000 loops=1)"
"Total runtime: 1439.548 ms"


select count(1) from hmr; --811 msec.

-----------------------------------------------------------------------

explain analyze select count(*) from hmr;


"Aggregate  (cost=169248.00..169248.01 rows=1 width=0) (actual time=1299.034..1299.034 rows=1 loops=1)"
"  ->  Seq Scan on hmr  (cost=0.00..144248.00 rows=10000000 width=0) (actual time=0.006..644.912 rows=10000000 loops=1)"
"Total runtime: 1299.061 ms"


select count(*) from hmr; --670 msec 

-----------------------------------------------------------------------

explain analyze select count(id) from hmr;


"Aggregate  (cost=169248.00..169248.01 rows=1 width=4) (actual time=1576.046..1576.046 rows=1 loops=1)"
"  ->  Seq Scan on hmr  (cost=0.00..144248.00 rows=10000000 width=4) (actual time=0.004..636.076 rows=10000000 loops=1)"
"Total runtime: 1576.069 ms"


select count(id) from hmr; --920 msec

----------------------------------------------------------------------- 

And the winner is: still count(*), with count(1) second and count(id) third.

This could be an explanation for count(1) being slower:

"So count(1) is explicitly passing the 1 in and checking it for being NULL before incrementing a counter while count(*) is simply incrementing the counter."

And count(id) might be slower because it takes additional time to consider using the index?
Or because it's actually reading data: width=4 vs. width=0?

I'm out of my league here.