Wednesday, March 11, 2015

The Long Tail - vertical table partitioning III

"In part III I'll try to give a raw estimate how big the performance penalty is when the partitioned table switches from fast to slow storage during a query. And there is another important problem to be solved..."

It took some time...

Since I don't have an SMR HDD, I used a standard 5400 RPM 2 TB HDD together with a 128 GB SSD instead. Both drives are attached to SATA2 ports.

The test machine is a Fujitsu Workstation with one Intel i7-4770, 16 GB RAM, running Windows 8.1 64 Bit and PostgreSQL 9.4.1 64 Bit.

CrystalDiskMark gives the following performance data for the pair:

HDD:

           Sequential Read :   139.764 MB/s
          Sequential Write :   128.897 MB/s
         Random Read 512KB :    17.136 MB/s
        Random Write 512KB :    71.074 MB/s
    Random Read 4KB (QD=1) :     0.280 MB/s [    68.3 IOPS]
   Random Write 4KB (QD=1) :     0.642 MB/s [   156.8 IOPS]
   Random Read 4KB (QD=32) :     0.999 MB/s [   243.8 IOPS]
  Random Write 4KB (QD=32) :     0.889 MB/s [   217.0 IOPS]

SSD:

           Sequential Read :   431.087 MB/s
          Sequential Write :   299.641 MB/s
         Random Read 512KB :   268.955 MB/s
        Random Write 512KB :   293.199 MB/s
    Random Read 4KB (QD=1) :    24.519 MB/s [  5986.0 IOPS]
   Random Write 4KB (QD=1) :    67.369 MB/s [ 16447.6 IOPS]
   Random Read 4KB (QD=32) :   328.456 MB/s [ 80189.5 IOPS]
  Random Write 4KB (QD=32) :   205.667 MB/s [ 50211.6 IOPS]


As you can see, the SSD is about 3x faster reading sequentially and about 85x - 328x faster reading random blocks, depending on the command queue depth.

PostgreSQL is running with

shared_buffers = 128kB # min 128kB

to minimize cache hits since I want to see how the disks perform.

For the 'benchmark' I first set up two tablespaces, hdd and ssd. Then the long tailed table was created as shown in the previous posts:

CREATE UNLOGGED TABLE fast
(
  id serial NOT NULL,
  value real
)
WITH (
  OIDS=FALSE
)
TABLESPACE ssd;

CREATE UNLOGGED TABLE slow
(
)
INHERITS (fast)
TABLESPACE hdd;
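
The two tablespaces themselves were created beforehand, as in part I further down; a minimal sketch, with the mount points assumed:

CREATE TABLESPACE ssd LOCATION '/mnt/ssd';
CREATE TABLESPACE hdd LOCATION '/mnt/hdd';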

Then I created one billion rows in fast and slow:

INSERT INTO fast (value) SELECT random()*1000000000 FROM generate_series(1,1000000000);

INSERT INTO slow SELECT * FROM fast;

First, I wanted to see how each table performs with full table scans. All these numbers are ten-run averages; that's one reason why it took some time :-)

SELECT avg(value) FROM ONLY slow; -- 210 sec

SELECT avg(value) FROM ONLY fast; -- 90 sec

Which pretty much reflects the 3/1 ratio from CrystalDiskMark.

For the random read test, I created primary keys on the id columns of each table, but put their underlying indexes on the SSD to be fair. Then, 10000 random rows were selected from the whole table:

SELECT avg(value) FROM ONLY fast WHERE id IN (SELECT 1+floor(random()*1000000000)::integer FROM generate_series(1,10000)); -- 6 sec

SELECT avg(value) FROM ONLY slow WHERE id IN (SELECT 1+floor(random()*1000000000)::integer FROM generate_series(1,10000)); -- 100 sec

Here, the HDD is about 16x slower than the SSD.
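
For reference, the primary keys used here, with their underlying indexes placed on the SSD, could be created like this; a sketch, with the constraint names assumed:

ALTER TABLE fast ADD CONSTRAINT fast_pkey PRIMARY KEY (id) USING INDEX TABLESPACE ssd;
ALTER TABLE slow ADD CONSTRAINT slow_pkey PRIMARY KEY (id) USING INDEX TABLESPACE ssd;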

And from the top 20% of each table:

SELECT avg(value) FROM ONLY fast WHERE id IN (SELECT 800000001+floor(random()*200000000)::integer FROM generate_series(1,10000)); -- 5 sec


SELECT avg(value) FROM ONLY slow WHERE id IN (SELECT 800000001+floor(random()*200000000)::integer FROM generate_series(1,10000)); -- 80 sec

Again, the HDD is about 16x slower than the SSD.

Knowing how each table performs, I then moved the top 20% of rows into fast and left the remaining 80% in slow, thus creating the long tailed table.
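
The post does not show the statements for that step. Since both tables held a full copy of the data at this point, one way to get there is simply deleting the complementary ranges; a sketch, with the boundary assumed at id 800 million, followed by a VACUUM FULL so the dead space does not skew the scans (see the February 20 post below):

DELETE FROM ONLY fast WHERE id <= 800000000;
DELETE FROM ONLY slow WHERE id >  800000000;
VACUUM FULL fast;
VACUUM FULL slow;

With the long tailed table in place, a full scan over the whole table: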

SELECT avg(value) FROM fast; -- 178 sec

Surprise, surprise: 210*0.8=168, 90*0.2=18, 168+18=186. At 178 seconds, the long tailed table is not slower than the weighted combination of its individual parts!

And with random reads?

Whole table:

SELECT avg(value) FROM fast WHERE id IN (SELECT 1+floor(random()*1000000000)::integer FROM generate_series(1,10000)); -- 50 sec

This is way faster than expected: with 80% of the rows on the HDD, a weighted estimate from the individual tables (0.8*100 sec + 0.2*6 sec ≈ 81 sec) suggests a much longer runtime. This seems to be an anomaly I cannot explain at the moment. Either it helps a lot to have two indexes instead of one, or most rows were selected from the SSD part.

Top 20% only:

SELECT avg(value) FROM fast WHERE id IN (SELECT 800000001+floor(random()*200000000)::integer FROM generate_series(1,10000)); -- 4 sec

A bit faster than having the whole table on SSD.

Conclusion:

Aside from the (positive) anomaly with random reads on the whole long tailed table, a long tailed table is at least not slower than a vanilla table, yet you can put your data graveyard on slow but inexpensive storage while keeping the hot rows and the indexes on the fast drives. Completely transparent to the clients.

However, one question remains...

Is it possible to ask PostgreSQL what the most frequently accessed rows of a table are?

If so, the background worker could balance the long tailed table without having to know a specific, application dependent access pattern!

And that would be the icing on the cake...

A quick glance over pg_stat* and pg_statio* didn't show anything usable for this task, but I'm open to suggestions. :-)

Thursday, February 26, 2015

The Long Tail - vertical table partitioning II

With all parts from the previous post in place, some mechanism is needed that does the routine maintenance by calling the transfer function automagically.

Of course this could be done with pgAgent or cron, but since it should be elegant, it calls for a background worker process.

For illustration purposes, I wrote a sample implementation called worker_ltt, based on the worker_spi sample code. Sloppy - even the original comments are still in there.

Adding worker_ltt to shared_preload_libraries and

worker_ltt.naptime = 60
worker_ltt.database = 'yourdb'
worker_ltt.user = 'youruser'
worker_ltt.function = 'move_longtail'
 
to postgresql.conf starts executing move_longtail() every 60 seconds in yourdb as youruser. If the user is omitted, it runs with superuser rights!

Since move_longtail() basically can do anything, restricting the user is a good idea.

For more security, the SQL statements could be moved entirely into the background worker, but then the approach loses much of its flexibility... But this is a concept anyway; there is always room for improvement.

But it really works.

In part III I'll try to give a raw estimate how big the performance penalty is when the partitioned table switches from fast to slow storage during a query. And there is another important problem to be solved...

The Long Tail - vertical table partitioning I


DISCLAIMER: This is just an idea, I haven't tried this in a production environment!

Having said that, any input is welcome. :-)

Storing tons of rows in a table of which only a small percentage is frequently queried is a common scenario for an RDBMS, especially with databases that have to keep historical information just in case they might be audited, e.g. in environments regulated by law.

With the advent of the first affordable 8 TB hard disks, and fast SSDs still being much more expensive per GB, I wondered if such long-tailed tables could be split over an SSD holding the frequently accessed pages and a near-line storage HDD keeping the archive - elegantly - with PostgreSQL.

By elegant, I mean without fiddling around with VIEWs or INSTEAD OF triggers, while exposing a clean and familiar interface to the developer.

OK, since PostgreSQL already supports horizontal partitioning, spreading one table transparently over many parallel tables, how about vertical partitioning, spreading one table over a hierarchy of speed?

The speed zones can be mapped to tablespaces:

CREATE TABLESPACE fast LOCATION '/mnt/fastdisk';
CREATE TABLESPACE slow LOCATION '/mnt/slowdisk';

Next comes the table(s):

CREATE TABLE the_table
(
  id integer NOT NULL,
  value real
)
WITH (
  OIDS=FALSE
)
TABLESPACE fast;


CREATE TABLE the_table_archive
(
)
INHERITS (the_table)
WITH (
  OIDS=FALSE
)
TABLESPACE slow;


Table inheritance in PostgreSQL is so cool...

And a function to move data from fast to slow:

CREATE OR REPLACE FUNCTION move_longtail()
  RETURNS boolean AS
$BODY$
DECLARE worked BOOLEAN;
DECLARE rowcount INTEGER;
BEGIN
worked := false;
rowcount := count(*) FROM ONLY the_table WHERE id >= 5000000;
IF (rowcount > 100000) THEN
INSERT INTO the_table_archive SELECT * FROM ONLY the_table WHERE id >= 5000000;
DELETE FROM ONLY the_table WHERE id >= 5000000;
worked := true;
END IF;
RETURN worked;
END;
$BODY$
  LANGUAGE plpgsql VOLATILE STRICT;


This function only does its work if a minimum number of movable rows qualifies. This is recommended since SMR disks like large contiguous writes, due to how SMR works technically.

Notice the ONLY keyword. It allows precise control over which partition of the table is affected by a DML statement. Did I say already that table inheritance in PostgreSQL is so cool?

And basically that's it. All the developer sees is SELECT * FROM the_table; or something like that, being oblivious to the underlying machinery.
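
For illustration, a tiny hypothetical session; the values are made up, and with only one row above the threshold, move_longtail() simply returns false:

INSERT INTO the_table VALUES (1, 3.14), (6000000, 2.71);

SELECT move_longtail();                       -- false, fewer than 100000 movable rows
SELECT count(*) FROM the_table;               -- sees both rows, wherever they live
SELECT count(*) FROM ONLY the_table_archive;  -- 0, nothing moved yet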

Some kind of automatic maintenance is missing. On to part II...

Wednesday, February 25, 2015

pgchem::tigress 3.2 released

pgchem::tigress 3.2 is finally out!

  • This builds against PostgreSQL 9.4.x and OpenBabel 2.3.2 on Linux.
  • It contains all fixes and contributions of the previous versions.
  • Windows is not supported anymore - and since it builds and runs way better on Linux, probably never will be again.
  • Depiction functions have been removed. Their run time was too unpredictable to be run inside a database server.
  • Theoretical isotope pattern generation with MERCURY7 is now available with isotopes for 39 elements.

So: CREATE EXTENSION pgchem_tigress;

Friday, February 20, 2015

Counting rows again - inheritance strangeness solved

Taking a second look at the execution plans, I've noticed that the scan on slow read twice as many pages from disk as the one on fast:

Buffers: shared read=442478 vs. Buffers: shared read=221239

Since I loaded all rows into slow first and then moved 50% of them into fast, this makes sense, I guess.
If I understand it correctly, those pages in slow are now empty, but PostgreSQL keeps them for future use.

So I tried a VACUUM FULL on slow and ran my queries again. That changed the plans:

Buffers: shared read=221239 vs. Buffers: shared read=221239

And execution times are now about equal. The observed effect had nothing to do with table inheritance at all.

So VACUUM FULL can help when moving big chunks of data between tables with otherwise low write activity.
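
For reference, the commands behind this are plain; the EXPLAIN options are reconstructed from the plan output shown in the post below:

VACUUM FULL everything.slow;
EXPLAIN (ANALYZE, VERBOSE, BUFFERS) SELECT count(*) FROM ONLY everything.slow;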

Counting rows again - inheritance strangeness

Hm, for something I'm trying at the moment on 9.4, I created two identical tables where the second one inherits everything from the first:

CREATE UNLOGGED TABLE everything.slow
(
  id integer NOT NULL,
  value real
)
WITH (
  OIDS=FALSE
);


CREATE UNLOGGED TABLE everything.fast
(
)
INHERITS (everything.slow)
WITH (
  OIDS=FALSE
);


No indexes. Default tablespace.
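
How the 50 million rows per table mentioned in the next paragraph got loaded is not shown here; according to the follow-up post above, the rows went into slow first and half of them were then moved into fast. A sketch of that, with the value expression and the split boundary assumed:

INSERT INTO everything.slow SELECT i, random()*1000000 FROM generate_series(1,100000000) AS i;

INSERT INTO everything.fast SELECT * FROM ONLY everything.slow WHERE id > 50000000;
DELETE FROM ONLY everything.slow WHERE id > 50000000;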

If I then load 50 million records into each of those tables and query them individually using the ONLY restrictor, a count(*) on the parent table (slow) is slower than on the descendant (fast):

select count(*) from only everything.slow;

"Aggregate  (cost=1067783.10..1067783.11 rows=1 width=0) (actual time=4973.812..4973.813 rows=1 loops=1)"
"  Output: count(*)"
"  Buffers: shared read=442478"
"  ->  Seq Scan on everything.slow  (cost=0.00..942722.08 rows=50024408 width=0) (actual time=1012.708..3416.349 rows=50000000 loops=1)"
"        Output: id, value"
"        Buffers: shared read=442478"
"Planning time: 0.118 ms"
"Execution time: 4973.901 ms"


select count(*) from only everything.fast;

"Aggregate  (cost=846239.00..846239.01 rows=1 width=0) (actual time=3988.235..3988.235 rows=1 loops=1)"
"  Output: count(*)"
"  Buffers: shared read=221239"
"  ->  Seq Scan on everything.fast  (cost=0.00..721239.00 rows=50000000 width=0) (actual time=0.101..2403.813 rows=50000000 loops=1)"
"        Output: id, value"
"        Buffers: shared read=221239"
"Planning time: 0.086 ms"
"Execution time: 3988.302 ms"


The same happens with other aggregates, like avg().

I had expected some overhead when querying without ONLY on slow, because of the traversal of the inheritance hierarchy, but not when I restrict the query to a specific table with ONLY...

Can someone explain this? UPDATE: Maybe I found it myself. See the next post for my explanation.

Wednesday, February 11, 2015

Finding mass spectra with PostgreSQL: Indexing

When naively searching for a specific spectral contrast angle, e.g.

select id from spectrum, round(spectral_contrast('[{"m/z":144.1150218,"intensity":0.908209840784744},{"m/z":145.118459841064,"intensity":0.0841924148716925}]'::json, id),2) as spectral_contrast where spectral_contrast = 1.0;

performance is not really stellar on my ~ 270k spectra database:

Execution time: 18695.619 ms for 71 hits.

This is no surprise, since the database has to calculate the spectral contrast angle for all rows and filter each row. How about indexing?

The simplest method I can envision is to precalculate the number of peaks for each spectrum, index the column and add it to the query. Since the spectral contrast angle requires the number of peaks of the compared spectra to be equal, this is a safe restriction.
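
The precalculation could look like this; a sketch that assumes, hypothetically, that each spectrum's peaks are stored in a json column named peaks, in the same format as the query argument above:

ALTER TABLE spectrum ADD COLUMN num_peaks integer;
UPDATE spectrum SET num_peaks = json_array_length(peaks);
CREATE INDEX spectrum_num_peaks_idx ON spectrum (num_peaks);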

select id from spectrum, round(spectral_contrast('[{"m/z":144.1150218,"intensity":0.908209840784744},{"m/z":145.118459841064,"intensity":0.0841924148716925}]'::json, id),2) as spectral_contrast where spectral_contrast = 1.0 and num_peaks = 2;

Execution time: 797.346 ms for 71 hits.

OK, but can we do better? Yes, by using a more selective restriction, like the lower and upper bounds of the m/z values of each spectrum, indexing that column and adding it to the query. These bounds can easily be obtained with min(m/z) and max(m/z) and then be stored in a numrange column. Since the spectral contrast angle requires the m/z values of the peaks of the compared spectra to be equal, this again is a safe restriction.
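
Again a sketch, using the same hypothetical peaks column as above:

ALTER TABLE spectrum ADD COLUMN mzrange numrange;

UPDATE spectrum s SET mzrange = b.r
FROM (SELECT id,
             numrange(min((p->>'m/z')::numeric), max((p->>'m/z')::numeric), '[]') AS r
        FROM spectrum, json_array_elements(peaks) AS p
       GROUP BY id) AS b
WHERE b.id = s.id;

CREATE INDEX spectrum_mzrange_idx ON spectrum (mzrange);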

select id from spectrum, round(spectral_contrast('[{"m/z":144.1150218,"intensity":0.908209840784744},{"m/z":145.118459841064,"intensity":0.0841924148716925}]'::json, id),2) as spectral_contrast where spectral_contrast = 1.0 and mzrange = numrange(144.1150218,145.118459841064,'[]');

Execution time: 35.128 ms for 71 hits.

About 534 times faster.

Using a range type instead of storing the lower and upper bounds in discrete columns gives more flexibility when querying with other criteria, like the Tanimoto index. Then, PostgreSQL's range operators, like && (overlap), might come in handy.
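
For example, to find all spectra whose m/z range overlaps a window of interest; a sketch with made-up values:

select id from spectrum where mzrange && numrange(140.0, 150.0, '[]');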

Monday, February 9, 2015

Finding mass spectra with PostgreSQL: Spectra as function arguments

If the spectra are represented as tables/recordsets, like in my last posts on the topic, the question arises how to use a table as an argument to a function in PostgreSQL without creating a custom datatype or using temporary tables.

1. As a one-dimensional array of m/z values followed by the corresponding peak values. The array can then be cut in half and unnested inside the function into a set of records:

select unnest(sdarr[1:1]) as "m/z", unnest(sdarr[2:2]) as intensity from ...;

2. As a two-dimensional array of m/z values and their corresponding peak values. The partial arrays can then be unnested inside the function into a set of records:

select unnest(tdarray[1:3]) as "m/z", unnest(tdarray[4:6]) as intensity from ...;

3. As JSON of m/z values and their corresponding peak values. The JSON can then be converted inside the function into a set of records:
 
select * from json_to_recordset( '[{"m/z":1.0,"intensity":2.2},{"m/z":3.3,"intensity":4.8}]') as x("m/z" double precision, intensity double precision);

All statements above generate a recordset like:

m/z    intensity
1.0    2.2
3.3    4.8
...    ...

In an application, I'd go with JSON, since it has the most understandable structure for a developer working against this API, and it does not require fiddling around with the array support functions of the driver, like createArrayOf() in JDBC.
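
For illustration, option 3 wrapped into an actual function could look like this; a minimal sketch, with the function name and its trivial purpose made up:

CREATE OR REPLACE FUNCTION peak_count(peaks json)
  RETURNS bigint AS
$BODY$
  SELECT count(*)
    FROM json_to_recordset(peaks)
      AS x("m/z" double precision, intensity double precision);
$BODY$
  LANGUAGE sql IMMUTABLE;

select peak_count('[{"m/z":1.0,"intensity":2.2},{"m/z":3.3,"intensity":4.8}]'); -- 2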