Wednesday, March 11, 2015

The Long Tail - vertical table partitioning III

"In part III I'll try to give a rough estimate of how big the performance penalty is when the partitioned table switches from fast to slow storage during a query. And there is another important problem to be solved..."

It took some time...

Since I don't have an SMR HDD, I used a standard 5400 RPM 2 TB HDD together with a 128 GB SSD instead. Both drives are attached to SATA2 ports.

The test machine is a Fujitsu Workstation with one Intel i7-4770, 16 GB RAM, running Windows 8.1 64 Bit and PostgreSQL 9.4.1 64 Bit.

CrystalDiskMark gives the following performance data for the pair:

HDD:

           Sequential Read :   139.764 MB/s
          Sequential Write :   128.897 MB/s
         Random Read 512KB :    17.136 MB/s
        Random Write 512KB :    71.074 MB/s
    Random Read 4KB (QD=1) :     0.280 MB/s [    68.3 IOPS]
   Random Write 4KB (QD=1) :     0.642 MB/s [   156.8 IOPS]
   Random Read 4KB (QD=32) :     0.999 MB/s [   243.8 IOPS]
  Random Write 4KB (QD=32) :     0.889 MB/s [   217.0 IOPS]

SSD:

           Sequential Read :   431.087 MB/s
          Sequential Write :   299.641 MB/s
         Random Read 512KB :   268.955 MB/s
        Random Write 512KB :   293.199 MB/s
    Random Read 4KB (QD=1) :    24.519 MB/s [  5986.0 IOPS]
   Random Write 4KB (QD=1) :    67.369 MB/s [ 16447.6 IOPS]
   Random Read 4KB (QD=32) :   328.456 MB/s [ 80189.5 IOPS]
  Random Write 4KB (QD=32) :   205.667 MB/s [ 50211.6 IOPS]


As you can see, the SSD is about 3x faster at sequential reads and about 88x - 329x faster at random reads, depending on the command queue depth.

PostgreSQL is running with

shared_buffers = 128kB # min 128kB

to minimize cache hits since I want to see how the disks perform.

For the 'benchmark' I first set up two tablespaces, hdd and ssd. Then the long tailed table was created as shown in the previous posts:

CREATE UNLOGGED TABLE fast
(
  id serial NOT NULL,
  value real
)
WITH (
  OIDS=FALSE
)
TABLESPACE ssd;

CREATE UNLOGGED TABLE slow
(
)
INHERITS (fast)
TABLESPACE hdd;
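The tablespace definitions themselves aren't shown in the post; they might have looked roughly like this (the directory names are hypothetical, each pointing at an empty directory on the respective drive):

CREATE TABLESPACE ssd LOCATION 'S:\pg_ssd';

CREATE TABLESPACE hdd LOCATION 'H:\pg_hdd';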

Then I created one billion rows in fast and slow:

INSERT INTO fast (value) SELECT random()*1000000000 FROM generate_series(1,1000000000);

INSERT INTO slow SELECT * FROM fast;

First, I wanted to see how each table performs with full table scans. All these numbers are ten-run averages, that's one reason why it took some time :-)

SELECT avg(value) FROM ONLY slow; -- 210 sec

SELECT avg(value) FROM ONLY fast; -- 90 sec

This roughly reflects the 3:1 sequential read ratio from CrystalDiskMark.

For the random read test, I created primary keys on the id columns of both tables, but put their underlying indexes on the SSD to be fair. Then, 10000 random rows were selected from the whole table:

SELECT avg(value) FROM ONLY fast WHERE id IN (SELECT 1+floor(random()*1000000000)::integer FROM generate_series(1,10000)); -- 6 sec

SELECT avg(value) FROM ONLY slow WHERE id IN (SELECT 1+floor(random()*1000000000)::integer FROM generate_series(1,10000)); -- 100 sec

Here, the HDD is about 16x slower than the SSD.
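The index placement isn't shown above; with PostgreSQL's USING INDEX TABLESPACE clause it could have been done along these lines (the constraint names are my choice):

ALTER TABLE fast ADD CONSTRAINT fast_pkey PRIMARY KEY (id) USING INDEX TABLESPACE ssd;

ALTER TABLE slow ADD CONSTRAINT slow_pkey PRIMARY KEY (id) USING INDEX TABLESPACE ssd;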

And from the top 20% of each table:

SELECT avg(value) FROM ONLY fast WHERE id IN (SELECT 800000001+floor(random()*200000000)::integer FROM generate_series(1,10000)); -- 5 sec


SELECT avg(value) FROM ONLY slow WHERE id IN (SELECT 800000001+floor(random()*200000000)::integer FROM generate_series(1,10000)); -- 80 sec

Again, the HDD is about 16x slower than the SSD.

Knowing how each table performs, I then moved the top 20% of rows into fast and left the remaining 80% in slow, thus creating the long tailed table.
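The post doesn't spell out how the rows were rearranged. Since both tables start with a full copy, one straightforward way is to delete the complementary ranges (the boundary value is taken from the queries above):

DELETE FROM ONLY fast WHERE id <= 800000000; -- ONLY is required, or the DELETE cascades into slow via inheritance

DELETE FROM slow WHERE id > 800000000;

VACUUM ANALYZE fast; -- reclaim dead tuples so scans don't read them
VACUUM ANALYZE slow;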

SELECT avg(value) FROM fast; -- 178 sec

Surprise, surprise: 210*0.8 = 168, 90*0.2 = 18, 168+18 = 186. At 178 seconds, the long tailed table is not slower than the weighted sum of its individual parts! 

And with random reads?

Whole table:

SELECT avg(value) FROM fast WHERE id IN (SELECT 1+floor(random()*1000000000)::integer FROM generate_series(1,10000)); -- 50 sec

At 50 seconds, it's way faster than the roughly 86 seconds the weighted mix of the parts would suggest (0.8*100 + 0.2*6). This seems to be an anomaly I cannot explain at the moment. Either it helps a lot to have two smaller indexes instead of one large one, or most rows were selected from the SSD part.

Top 20% only:

SELECT avg(value) FROM fast WHERE id IN (SELECT 800000001+floor(random()*200000000)::integer FROM generate_series(1,10000)); -- 4 sec

A bit faster than having the whole table on SSD.

Conclusion:

Aside from the (positive) anomaly with random reads on the whole long tailed table, a long tailed table is at least not slower than a vanilla table, but you can put your data graveyard on slow, inexpensive storage while keeping the hot rows and the indexes on the fast drives. Completely transparent to the clients.

However, one question remains...

Is it possible to ask PostgreSQL what the most frequently accessed rows of a table are?

If so, the background worker could balance the long tailed table without having to know a specific, application-dependent access pattern!

And that would be the icing on the cake...

A quick glance over the pg_stat* and pg_statio* views didn't show anything usable for this task, but I'm open to suggestions. :-)
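For illustration, the closest built-in statistics I'm aware of are table-level I/O counters, which say nothing about individual rows - e.g.:

SELECT relname, heap_blks_read, heap_blks_hit
FROM pg_statio_user_tables
WHERE relname IN ('fast', 'slow');

Per-row access frequencies would have to be tracked by the application or a custom extension.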