Thursday, June 25, 2015

nseq - A datatype for the efficient storage of nucleotide sequences in PostgreSQL

The nseq datatype allows storing DNA and RNA sequences, consisting of the letters AGCT or AGCU respectively, in PostgreSQL.

By encoding four bases per byte, it uses 75 percent less space on disk than text. While this idea is obvious and far from novel, it is still one of the most efficient compression schemes for DNA/RNA, if not the most efficient.
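The principle can be sketched in plain SQL. This only illustrates the idea, not nseq's actual internals, and the base-to-bits mapping is an assumption:

-- Pack 'AGCT' into a single byte, two bits per base (assuming A=00, C=01, G=10, T=11):
SELECT (0 << 6) | (2 << 4) | (1 << 2) | 3 AS packed_byte; -- binary 00100111 = 39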

As of now, nseq only supports very basic native operations. The main shortcoming is that it has to be decompressed into text, expanding fourfold, for e.g. substring operations and the like.

This will change.

Enough said - here it is...

Check out PostBIS as well, it already has many more features.

Wednesday, June 3, 2015

Update to pgchem::tigress isotope pattern generation code

The isotope_pattern() function now contains data for the stable isotopes of 82 elements.

Thus, it fully supports HMDB, UNPD and ChEBI (except the transuranics) and is available here.
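A call could look like this. This is just a sketch: the argument format, here a molecular formula, is my assumption and not taken from the pgchem::tigress documentation:

SELECT isotope_pattern('C6H12O6');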

The individually affected files are obwrapper.cpp and libmercury++.h, in case you want to update your installation in-place.

Wednesday, March 11, 2015

The Long Tail - vertical table partitioning III

"In part III I'll try to give a raw estimate how big the performance penalty is when the partitioned table switches from fast to slow storage during a query. And there is another important problem to be solved..."

It took some time...

Since I don't have an SMR HDD, I used a standard 5400 RPM 2TB HDD together with a 128 GB SSD instead. Both drives are attached to SATA2 ports.

The test machine is a Fujitsu Workstation with one Intel i7-4770, 16 GB RAM, running Windows 8.1 64 Bit and PostgreSQL 9.4.1 64 Bit.

CrystalDiskMark gives the following performance data for the pair:

HDD:

           Sequential Read :   139.764 MB/s
          Sequential Write :   128.897 MB/s
         Random Read 512KB :    17.136 MB/s
        Random Write 512KB :    71.074 MB/s
    Random Read 4KB (QD=1) :     0.280 MB/s [    68.3 IOPS]
   Random Write 4KB (QD=1) :     0.642 MB/s [   156.8 IOPS]
   Random Read 4KB (QD=32) :     0.999 MB/s [   243.8 IOPS]
  Random Write 4KB (QD=32) :     0.889 MB/s [   217.0 IOPS]

SSD:

           Sequential Read :   431.087 MB/s
          Sequential Write :   299.641 MB/s
         Random Read 512KB :   268.955 MB/s
        Random Write 512KB :   293.199 MB/s
    Random Read 4KB (QD=1) :    24.519 MB/s [  5986.0 IOPS]
   Random Write 4KB (QD=1) :    67.369 MB/s [ 16447.6 IOPS]
   Random Read 4KB (QD=32) :   328.456 MB/s [ 80189.5 IOPS]
  Random Write 4KB (QD=32) :   205.667 MB/s [ 50211.6 IOPS]


As you can see, the SSD is about 3x faster at sequential reads and about 85x to 328x faster at random reads, depending on the command queue depth.

PostgreSQL is running with

shared_buffers = 128kB # min 128kB

to minimize cache hits since I want to see how the disks perform.

For the 'benchmark' I first set up two tablespaces, hdd and ssd.
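Something like this, with the locations being placeholders for directories on the respective drives:

CREATE TABLESPACE ssd LOCATION 'E:\pg_ssd';
CREATE TABLESPACE hdd LOCATION 'F:\pg_hdd';

Then the long tailed table was created as shown in the previous posts: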

CREATE UNLOGGED TABLE fast
(
  id serial NOT NULL,
  value real
)
WITH (
  OIDS=FALSE
)
TABLESPACE ssd;

CREATE UNLOGGED TABLE slow
(
)
INHERITS (fast)
TABLESPACE hdd;

Then I created one billion rows in fast and slow:

INSERT INTO fast (value) SELECT random()*1000000000 FROM generate_series(1,1000000000);

INSERT INTO slow SELECT * FROM fast;

First, I wanted to see how each table performs with full table scans. All these numbers are ten-run averages, which is one reason why it took some time :-)

SELECT avg(value) FROM ONLY slow; -- 210 sec

SELECT avg(value) FROM ONLY fast; -- 90 sec

Which pretty much reflects the 3/1 ratio from CrystalDiskMark.

For the random read test, I created primary keys on the id columns of both tables, but put their underlying indexes on the SSD to be fair.
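That could look like this (the constraint names are illustrative):

ALTER TABLE fast ADD CONSTRAINT fast_pkey PRIMARY KEY (id) USING INDEX TABLESPACE ssd;
ALTER TABLE slow ADD CONSTRAINT slow_pkey PRIMARY KEY (id) USING INDEX TABLESPACE ssd;

Then, 10000 random rows were selected from the whole table: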

SELECT avg(value) FROM ONLY fast WHERE id IN (SELECT 1+floor(random()*1000000000)::integer FROM generate_series(1,10000)); -- 6 sec

SELECT avg(value) FROM ONLY slow WHERE id IN (SELECT 1+floor(random()*1000000000)::integer FROM generate_series(1,10000)); -- 100 sec

Here, the HDD is about 16x slower than the SSD.

And from the top 20% of each table:

SELECT avg(value) FROM ONLY fast WHERE id IN (SELECT 800000001+floor(random()*200000000)::integer FROM generate_series(1,10000)); -- 5 sec


SELECT avg(value) FROM ONLY slow WHERE id IN (SELECT 800000001+floor(random()*200000000)::integer FROM generate_series(1,10000)); -- 80 sec

Again, the HDD is about 16x slower than the SSD.

Knowing how each table performs, I then moved the top 20% of rows into fast and left the remaining 80% in slow, thus creating the long tailed table.
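A minimal sketch of that reshuffling, assuming id > 800000000 marks the hot top 20%:

DELETE FROM ONLY fast WHERE id <= 800000000;
DELETE FROM ONLY slow WHERE id > 800000000;

Now, a full scan over the whole long tailed table: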

SELECT avg(value) FROM fast; -- 178 sec

Surprise, surprise: 210*0.8 = 168, 90*0.2 = 18, 168+18 = 186 seconds expected, 178 measured. The long tailed table is not slower than its individual parts!

And with random reads?

Whole table:

SELECT avg(value) FROM fast WHERE id IN (SELECT 1+floor(random()*1000000000)::integer FROM generate_series(1,10000)); -- 50 sec

It's way faster than the weighted mix of its parts would suggest (0.8*100 + 0.2*6 ≈ 83 seconds). This seems to be an anomaly I cannot explain at the moment. Either it helps a lot to have two indexes instead of one, or most rows were selected from the SSD part.

Top 20% only:

SELECT avg(value) FROM fast WHERE id IN (SELECT 800000001+floor(random()*200000000)::integer FROM generate_series(1,10000)); -- 4 sec

A bit faster than having the whole table on SSD.

Conclusion:

Aside from the (positive) anomaly with random reads on the whole long tailed table, a long tailed table is at least not slower than a vanilla table, but you can put your data graveyard on slow but inexpensive storage while keeping the hot rows and the indexes on the fast drives. Completely transparent for the clients.

However, one question remains...

Is it possible to ask PostgreSQL what the most frequently accessed rows of a table are?

If so, the background worker could balance the long tailed table without having to know a specific, application dependent access pattern!

And that would be the icing on the cake...

A quick glance over pg_stat* and pg_statio* didn't show anything usable for this task, but I'm open to suggestions. :-)
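For illustration, the granularity problem: pg_statio_user_tables, for example, only exposes per-table counters for block reads and buffer hits, nothing at row level:

SELECT relname, heap_blks_read, heap_blks_hit
  FROM pg_statio_user_tables
 WHERE relname IN ('fast', 'slow');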

Thursday, February 26, 2015

The Long Tail - vertical table partitioning II

Having all parts from the previous post in place, some mechanism is needed to do the routine maintenance by calling the transfer function automagically.

Of course this could be done with pgAgent or cron, but since it should be elegant, this calls for a background worker process.

For illustration purposes, I wrote a sample implementation called worker_ltt based on the worker_spi sample code. Sloppy - even the original comments are still in there.

Adding worker_ltt to shared_preload_libraries and

worker_ltt.naptime = 60
worker_ltt.database = 'yourdb'
worker_ltt.user = 'youruser'
worker_ltt.function = 'move_longtail'
 
to postgresql.conf starts executing move_longtail() every 60 seconds in yourdb as youruser. If the user is omitted, it runs with superuser rights!

Since move_longtail() basically can do anything, restricting the user is a good idea.

For more security, the SQL statements could be moved entirely into the background worker, but then the approach would lose much of its flexibility... But this is a concept anyway; there is always room for improvement.

But it really works.
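For a quick manual test, the transfer function can of course also be called directly:

SELECT move_longtail();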

In part III I'll try to give a raw estimate how big the performance penalty is when the partitioned table switches from fast to slow storage during a query. And there is another important problem to be solved...

The Long Tail - vertical table partitioning I


DISCLAIMER: This is just an idea, I haven't tried this in a production environment!

Having said that, any input is welcome. :-)

Storing tons of rows in a table of which only a small percentage is frequently queried is a common scenario for an RDBMS, especially with databases that have to keep historical information just in case they might be audited, e.g. in environments regulated by law.

With the advent of the first affordable 8TB hard disks, and with fast SSDs still being much more expensive per GB, I wondered if such long-tailed tables could be split over an SSD holding the frequently accessed pages and a near-line storage HDD keeping the archive - elegantly - with PostgreSQL.

By elegant, I mean without fiddling around with VIEWs and INSTEAD OF triggers, while exposing a clean and familiar interface to the developer.

OK, since PostgreSQL already supports horizontal partitioning, spreading one table transparently over many parallel tables, how about vertical partitioning, spreading one table over a hierarchy of speed?

The speed zones can be mapped to tablespaces:

CREATE TABLESPACE fast LOCATION '/mnt/fastdisk';
CREATE TABLESPACE slow LOCATION '/mnt/slowdisk';

Next comes the table(s):

CREATE TABLE the_table
(
  id integer NOT NULL,
  value real
)
WITH (
  OIDS=FALSE
)
TABLESPACE fast;


CREATE TABLE the_table_archive
(
)
INHERITS (the_table)
WITH (
  OIDS=FALSE
)
TABLESPACE slow;


Table inheritance in PostgreSQL is so cool...

And a function to move data from fast to slow:

CREATE OR REPLACE FUNCTION move_longtail()
  RETURNS boolean AS
$BODY$
DECLARE
  worked boolean := false;
  rowcount integer;
BEGIN
  -- Count how many rows in the fast partition qualify for archiving.
  SELECT count(*) INTO rowcount FROM ONLY the_table WHERE id >= 5000000;
  -- Move them only if enough have accumulated, to get large contiguous writes.
  IF rowcount > 100000 THEN
    INSERT INTO the_table_archive SELECT * FROM ONLY the_table WHERE id >= 5000000;
    DELETE FROM ONLY the_table WHERE id >= 5000000;
    worked := true;
  END IF;
  RETURN worked;
END;
$BODY$
  LANGUAGE plpgsql VOLATILE STRICT;


This function runs only if a minimum number of movable rows qualifies. This is recommended since SMR disks like large contiguous writes, due to how shingled recording technically works.

Notice the ONLY keyword. It allows precise control over which partition of the table is affected by a DML statement. Did I say already that table inheritance in PostgreSQL is so cool?
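For example:

SELECT count(*) FROM the_table;      -- parent plus all descendants
SELECT count(*) FROM ONLY the_table; -- the fast parent partition alone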

And basically that's it. All the developer sees is SELECT * FROM the_table; or something like it, remaining oblivious to the underlying machinery.

Some kind of automatic maintenance is missing. On to part II...

Wednesday, February 25, 2015

pgchem::tigress 3.2 released

pgchem::tigress 3.2 is finally out!

  • This builds against PostgreSQL 9.4.x and OpenBabel 2.3.2 on Linux.
  • It contains all fixes and contributions of the previous versions.
  • Windows is not supported anymore - and since it builds and runs way better on Linux, it probably never will be again.
  • Depiction functions have been removed. Their run time was too unpredictable to be run inside a database server.
  • Theoretical isotope pattern generation with MERCURY7 is now available with isotopes for 39 elements.

So: CREATE EXTENSION pgchem_tigress;

Friday, February 20, 2015

Counting rows again - inheritance strangeness solved

Taking a second look at the execution plans, I noticed that the scan on slow read twice as many pages from disk as the one on fast:

Buffers: shared read=442478 vs. Buffers: shared read=221239

Since I loaded all rows into slow first and then moved 50% of them into fast, this makes sense, I guess.
If I understand it correctly, those pages in slow are now empty, but PostgreSQL keeps them for future use.

So I tried a VACUUM FULL on slow to compact it and give the empty pages back:
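VACUUM FULL everything.slow;

Running the queries again changed the plans: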

Buffers: shared read=221239 vs. Buffers: shared read=221239

And execution times are now about equal. The observed effect had nothing to do with table inheritance at all.

So VACUUM FULL can help when moving big chunks of data between tables with otherwise low write activity.

Counting rows again - inheritance strangeness

Hm, for something I'm trying at the moment on 9.4, I created two identical tables, where the second one inherits everything from the first:

CREATE UNLOGGED TABLE everything.slow
(
  id integer NOT NULL,
  value real
)
WITH (
  OIDS=FALSE
);


CREATE UNLOGGED TABLE everything.fast
(
)
INHERITS (everything.slow)
WITH (
  OIDS=FALSE
);


No indexes. Default tablespace.
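Both tables were then filled with 50 million rows each. A sketch of such a load (the concrete values don't matter; as the follow-up post explains, all rows actually went into slow first and half were then moved into fast):

INSERT INTO everything.slow (id, value)
  SELECT n, random()*1000000 FROM generate_series(1, 100000000) AS n;

INSERT INTO everything.fast SELECT * FROM ONLY everything.slow WHERE id > 50000000;
DELETE FROM ONLY everything.slow WHERE id > 50000000;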

Querying them individually using the ONLY restrictor, a count(*) on the parent table (slow) is slower than on the descendant (fast):

select count(*) from only everything.slow;

"Aggregate  (cost=1067783.10..1067783.11 rows=1 width=0) (actual time=4973.812..4973.813 rows=1 loops=1)"
"  Output: count(*)"
"  Buffers: shared read=442478"
"  ->  Seq Scan on everything.slow  (cost=0.00..942722.08 rows=50024408 width=0) (actual time=1012.708..3416.349 rows=50000000 loops=1)"
"        Output: id, value"
"        Buffers: shared read=442478"
"Planning time: 0.118 ms"
"Execution time: 4973.901 ms"


select count(*) from only everything.fast;

"Aggregate  (cost=846239.00..846239.01 rows=1 width=0) (actual time=3988.235..3988.235 rows=1 loops=1)"
"  Output: count(*)"
"  Buffers: shared read=221239"
"  ->  Seq Scan on everything.fast  (cost=0.00..721239.00 rows=50000000 width=0) (actual time=0.101..2403.813 rows=50000000 loops=1)"
"        Output: id, value"
"        Buffers: shared read=221239"
"Planning time: 0.086 ms"
"Execution time: 3988.302 ms"


The same happens with other aggregates like avg(), too.

I had expected some overhead when querying without ONLY on slow, because of the traversal of the inheritance hierarchy, but not when I restrict the query to a specific table with ONLY...

Can someone explain this? UPDATE: Maybe I found it myself. See next post for my explanation.