Tuesday, August 25, 2015

The 'other' cloud - parasitic storage as a service

Since some NoSQL products by default ship with no security at all, this was no real surprise after MongoDB, but the magnitude is astounding.

How about using this for something useful?
  1. Scan the Internet for known products/servers that allow unconditional read/write access
  2. Write storage adapters
  3. Invent a mechanism to store data encrypted and with some redundancy, in case someone gets a wake-up call
  4. Invent a mechanism to rebalance storage if servers become unavailable or new ones are added to the list of storage nodes
  5. Build a service around 1, 2, 3, and 4
There it is, the 'other' cloud, at almost no cost except bandwidth...

But of course, this is purely fictional. The bad guys don't have good ideas and the good guys won't do it. There is no illegal data stored in your session cache.

Just keep ignoring the fine manuals and carry on. Nobody needs database administrators, everybody knows that...

Tuesday, August 18, 2015

So, what is TABLESAMPLE actually good for?

Q: So, what is TABLESAMPLE actually good for?

A: To speed things up that can be answered from a small sample instead of the whole population!

Like so:

CREATE TABLE public.ts_test
(
  id serial NOT NULL,
  val double precision,
  CONSTRAINT ts_test_pkey PRIMARY KEY (id)
)
WITH (
  OIDS=FALSE
);

Insert 100 million random values in the range 0.0 <= x < 1.0:

INSERT INTO public.ts_test (val)
SELECT
    random()
FROM
    generate_series(1, 100000000);

And now, some queries to get the standard deviation.

For the whole population:

SELECT round(stddev_pop(val)::numeric,5) FROM ts_test;

0.28869 in 15005 ms
Error: 0%
Speedup: 1x

For a sample of size 20000 (good for 1% error and 99% confidence):

SELECT round(stddev_samp(val)::numeric,5) FROM ts_test TABLESAMPLE BERNOULLI(0.02);

0.28867 in 2387 ms
Error: < 1%
Speedup: 6.3x

SELECT round(stddev_samp(val)::numeric,5) FROM ts_test TABLESAMPLE SYSTEM(0.02);

0.28791 in 25 ms
Error: < 1%
Speedup: 600x

With a table this large, both methods, BERNOULLI and SYSTEM, are accurate enough to stay well below the 1% error target.

(For potential problems with SYSTEM, see this post)

SYSTEM aces BERNOULLI with a speedup of 600x vs. 6.3x relative to a full population query: SYSTEM samples whole pages and therefore reads only a tiny fraction of the table's blocks, while BERNOULLI still scans every page and picks rows individually.

And this is hard to beat...
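
If you want to see where the difference comes from, EXPLAIN makes it visible: SYSTEM touches only a handful of pages, while BERNOULLI reads them all. Just a sketch, the exact plans and buffer counts will vary on your installation:

EXPLAIN (ANALYZE, BUFFERS)
SELECT stddev_samp(val) FROM ts_test TABLESAMPLE SYSTEM (0.02);

EXPLAIN (ANALYZE, BUFFERS)
SELECT stddev_samp(val) FROM ts_test TABLESAMPLE BERNOULLI (0.02);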

Addendum: TABLESAMPLE only works on tables and materialized views!
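
A quick way to check that limitation yourself (the exact error text may differ between versions):

-- works: materialized views can be sampled just like tables
CREATE MATERIALIZED VIEW ts_test_mv AS SELECT id, val FROM ts_test;
SELECT round(stddev_samp(val)::numeric, 5) FROM ts_test_mv TABLESAMPLE SYSTEM (0.02);

-- fails: plain views are not supported
CREATE VIEW ts_test_v AS SELECT id, val FROM ts_test;
SELECT round(stddev_samp(val)::numeric, 5) FROM ts_test_v TABLESAMPLE SYSTEM (0.02);
-- ERROR:  TABLESAMPLE clause can only be applied to tables and materialized views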

Monday, August 17, 2015

ToroDB up and running in five simple steps



A working ToroDB installation in five steps.

You'll need:

ToroDB from 8Kdata:
https://github.com/torodb/torodb

A working PostgreSQL instance, 9.4 or better:
http://www.postgresql.org/download/

A suitable JRE, version 6 or higher. ToroDB has been mainly tested with Oracle JRE 8.

The mongo shell and mongoimport from MongoDB:
https://www.mongodb.org/

Some test data from citusdata:
http://examples.citusdata.com/customer_reviews_nested_1998.json.gz

Music:
"Hitman" Kevin MacLeod from incompetech.com
Licensed under Creative Commons: By Attribution 3.0
http://creativecommons.org/licenses/by/3.0/

Tuesday, August 4, 2015

...gently down the stream

So, for the test I changed the location to San Francisco, and after about one hour the table tstest.tstream contained 3705 tweets.

select tag, count(*) as tag_count from tstest.tagstream
group by tag order by tag_count desc limit 10;

produces:

"tag";"tag_count" 
"MTVHottest";376
"Hiring";210
"CareerArc";172
"Job";158
"job";157
"Jobs";145
"SanFrancisco";79
"hiring";69
"Sales";40
"SanJose";39

So there is something going on on MTV in the Bay Area. :-)

In a production system, I probably would write a background worker for stream injection instead of an external job.

And following discussions on Database Soup, it could indeed be interesting to configure the stream server for maximum throughput, sacrificing durability for speed, and then pull any data that should be preserved over a postgres_fdw link into another server with a more conservative setup.
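
A minimal sketch of what that pull could look like from the conservative server's side (all names below are made up, and the durable archive table lives on the local instance):

CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER stream_srv FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'streambox', dbname 'twitter');

CREATE USER MAPPING FOR CURRENT_USER SERVER stream_srv
  OPTIONS (user 'archiver', password 'secret');

CREATE SCHEMA IF NOT EXISTS tstest;

CREATE FOREIGN TABLE tstest.tstream_remote (
  rec    timestamp with time zone,
  tevent jsonb
) SERVER stream_srv
  OPTIONS (schema_name 'tstest', table_name 'tstream');

-- durable local copy of everything worth keeping
CREATE TABLE IF NOT EXISTS tstest.tstream_archive (
  rec    timestamp with time zone NOT NULL,
  tevent jsonb NOT NULL
);

-- pull everything newer than what has already been archived
INSERT INTO tstest.tstream_archive (rec, tevent)
SELECT rec, tevent
FROM   tstest.tstream_remote
WHERE  rec > coalesce((SELECT max(rec) FROM tstest.tstream_archive),
                      '-infinity');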

BTW: If you plan to keep much more data in the ring buffer so you can occasionally dig further into history, this would be a perfect case for a long-tailed table.

Row, row, row your boat...


In the announcement of PipelineDB, Josh Berkus gave a hint that - and how - similar functionality might be achieved with stock PostgreSQL. Let's see if we can analyze a continuous Twitter stream too...

First, we need a fast table to act as the ringbuffer:

CREATE UNLOGGED TABLE tstest.tstream
(
  rec timestamp with time zone NOT NULL DEFAULT clock_timestamp(),
  tevent jsonb NOT NULL,
  CONSTRAINT pk_tstream PRIMARY KEY (rec)
)
WITH (
  OIDS=FALSE
);


To inject new tweets into the buffer and trim the outdated ones, I used this little injector, written in Java using the hosebird client and the example code provided there.
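
On the database side, every incoming tweet boils down to a plain INSERT like the following sketch; the real injector passes the full tweet JSON via a JDBC PreparedStatement, and rec is filled in by the clock_timestamp() default:

INSERT INTO tstest.tstream (tevent)
VALUES ('{"entities": {"hashtags": [{"text": "MTVHottest"}]}}'::jsonb);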

Once it runs in its infinite loop, the table is populated with new tweets, while tweets older than one hour are evicted every five minutes, courtesy of a fixed-rate timer task. So, in the worst case the buffer table holds one hour and five minutes' worth of tweets.
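
The eviction itself is just as simple; this is, in essence, what the timer task executes every five minutes:

DELETE FROM tstest.tstream
WHERE rec < ( clock_timestamp() - interval '1 hour' );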

The analysis view is taken verbatim from the PipelineDB example:

  CREATE OR REPLACE VIEW tstest.tagstream as
    SELECT jsonb_array_elements(tevent #>
      ARRAY['entities','hashtags']) ->> 'text' AS tag
    FROM tstest.tstream
    WHERE rec >
          ( clock_timestamp() - interval '1 hour' );


Plus the closing ')' before ->> 'text' that was missing in the original. :-)


As is the most popular tags query:

select tag, count(*) as tag_count from tstest.tagstream
group by tag order by tag_count desc limit 10;


Does it work? Yes.

Since I don't have access to my development machine right now, stay tuned for Part 2 with some results.