Tuesday, August 25, 2015

The 'other' cloud - parasitic storage as a service

Since some NoSQL products by default ship with no security at all, this was no real surprise after MongoDB, but the magnitude is astounding.

How about using this for something useful?
  1. Scan the Internet for known products/servers that allow unconditional read/write access
  2. Write storage adapters
  3. Invent a mechanism to store data encrypted and with some redundancy, in case someone gets a wake up call
  4. Invent a mechanism to rebalance storage if servers become unavailable or new ones are added to the list of storage nodes
  5. Build a service around 1, 2, 3, and 4
There it is, the 'other' cloud, at almost no cost except bandwidth...

But of course, this is purely fictional. The bad guys don't have good ideas and the good guys won't do it. There is no illegal data stored in your session cache.

Just keep ignoring the fine manuals and carry on. Nobody needs database administrators, everybody knows that...

Tuesday, August 18, 2015

So, what is TABLESAMPLE actually good for?

Q: So, what is TABLESAMPLE actually good for?

A: To speed things up that can be answered from a small sample instead of the whole population!

Like so:

CREATE TABLE public.ts_test
  id integer NOT NULL DEFAULT nextval('ts_test_id_seq'::regclass),
  val double precision,
  CONSTRAINT ts_test_pkey PRIMARY KEY (id)

Insert 100 million random values in the range 0.0 <= x < 1.0:

INSERT INTO public.ts_test (val)
    generate_series(1, 100000000);

And now, some queries to get the standard deviation.

For the whole population:

SELECT round(stddev_pop(val)::numeric,5) FROM ts_test;

0.28869 in 15005 ms
Error: 0%
Speedup: 1x

For a sample of size 20000 (good for 1% error and 99% confidence):

SELECT round(stddev_samp(val)::numeric,5) FROM ts_test TABLESAMPLE BERNOULLI(0.02);

0.28867 in 2387 ms
Error: < 1%
Speedup: 6.3x

SELECT round(stddev_samp(val)::numeric,5) FROM ts_test TABLESAMPLE SYSTEM(0.02);

0.28791 in 25 ms
Error: < 1%
Speedup: 600x

With a table that large, both methods BERNOULLI and SYSTEM are accurate enough to give results well below the 1% error target.

(For potential problems with SYSTEM, see this post)

SYSTEM aces BERNOULLI with a speedup of 600x vs. 6.3x relative to a full population query.

And this is hard to beat...

Monday, August 17, 2015

ToroDB up and running in five simple steps

A working ToroDB installation in five steps.

You'll need:

ToroDB from 8Kdata:

A working PostgreSQL instance, 9.4 or better:

A suitable JRE, version 6 or higher. It has been mainly tested with Oracle JRE 8.

The mongo shell and mongoimport from MongoDB:

Some test data from citusdata:

"Hitman" Kevin MacLeod from incompetech.com
Licensed under Creative Commons: By Attribution 3.0

Tuesday, August 4, 2015

...gently down the stream

So, for the test I changed the location to San Francisco and after about one hour the table tstest.tstream contains 3705 tweets.

select tag, count(*) as tag_count from tstest.tagstream group 
              by tag order by tag_count desc limit 10;



So there is something going on on MTV in the Bay Area. :-)

In a production system, I probably would write a background worker for stream injection instead of an external job.

And following discussions on Database Soup, it could be indeed interesting to configure the stream server for maximum throughput sacrificing durability for speed and then pull any data that should be preserved over a postgres_fdw link into another server with a more conservative setup.

BTW: If you plan to keep way more data in the ringbuffer to dig further into history occasionally, this would be a perfect case for using a long-tailed table.

Row, row, row your boat...

In the announcement of PipelineDB, Josh Berkus gave a hint that -  and how - a similar functionality might be achieved with stock PostgreSQL. Let's see, if we can analyze a continuous twitter stream too...

First, we need a fast table to act as the ringbuffer:

  rec timestamp with time zone NOT NULL DEFAULT clock_timestamp(),
  tevent jsonb NOT NULL,
  CONSTRAINT pk_tstream PRIMARY KEY (rec)

To inject new tweets into the buffer and trim the outdated ones, I used this little injector, written in Java using the hosebird client and the example code provided there.

Once it runs in it's infinite loop, the table is populated with new tweets, while tweets older than one hour are evicted every five minutes by courtesy of a fixed rate timer task. So, the maximum size of the buffer table is one hour and five minutes worth of tweets in the worst case.

The analysis view is taken verbatim from the PipelineDB example:

  CREATE OR REPLACE VIEW tstest.tagstream as
    SELECT jsonb_array_elements(tevent #>
      ARRAY['entities','hashtags']) ->> 'text' AS tag
    FROM tstest.tstream
    WHERE rec >
          ( clock_timestamp() - interval '1 hour' );

Plus the missing ')' before ->> 'text'. :-)

As is the most popular tags query:

select tag, count(*) as tag_count from tstest.tagstream group
              by tag order by tag_count desc limit 10;

Does it work? Yes.

Since I don't have access to my development machine right now, stay tuned for Part 2 with some results.

Thursday, July 23, 2015

Deriving the elemental composition from a molformula in pgchem::tigress

pgchem::tigress can generate molecular formulae like C3H6NO2- from chemical structures.

But what if we need access to the elemental composition as a relation, e.g:

C 3
N 1
O 2

Fortunately, PostgreSQL is awesome:

CREATE OR REPLACE FUNCTION elemental_composition(molformula TEXT)
DECLARE elements TEXT[];
elements := ARRAY['C','N','O','P','S','Cl']; --expand as needed
molformula := REPLACE(REPLACE(molformula,'-',''),'+','');

FOREACH element IN ARRAY elements LOOP
count := 1;
token := REGEXP_MATCHES(molformula, element || '[\d?]*');

IF (token[1] IS NOT NULL) THEN
    token :=
        IF (token[1] iS NOT NULL) THEN
            count := token[1]::INTEGER;
        END IF;
  COST 1000;

SELECT * FROM elemental_composition('C3H6NO2-');

And that's it. Did I already mention that PostgreSQL is awesome? :-)

Tuesday, July 21, 2015

1st impression of pg_shard

1st impression of pg_shard:

  1. It works as described, documented limitations apply
  2. Status '3' of a shard_placement means inactive, '1' means online
  3. Read the issues list - unless you want to discover them yourselves :-)
  4. If a whole worker node goes offline this may go unnoticed, since there seems to be no heartbeat between the head and worker nodes unless you try to write data
  5. It already supports range based sharding but you have to activate it manually like so
  6. It does not support prepared statements at the moment

Thursday, June 25, 2015

nseq - A datatype for the efficient storage of nucleotide sequences in PostgreSQL

The nseq datatype allows to store DNA and RNA sequences consisting of the letters AGCT or AGCU respectively in PostgreSQL.

By encoding four bases per Byte, it uses 75 percent less space on disk than text. While this idea is obvious and far from being novel, it still is one of the most efficient compression schemes for DNA/RNA, if not the most efficient.

As of now, nseq only supports very basic native operations. The main shortcoming is, that it has to be decompressed into text hence expanding by 4x, for e. g. substring operations and the like.

This will change.

Enough said - here it is...

Check out PostBIS as well, it already has much more features.