Friday, May 22, 2020

Native (PostgreSQL only) streaming data tables

If you want to see (and analyze) only a window of data over some continuous data stream in PostgreSQL, one way is to use a specialized tool like the PipelineDB extension. But if you can't do that, e.g. because you are stuck with AWS RDS or for some other reason, streaming data tables, or continuous views, can be implemented with pretty much PostgreSQL alone.

The basic idea is to have a table that allows for fast INSERT operations, is aggressively VACUUMed, and has some key that can be used to prune outdated entries. This table is fed with the events from the data stream and regularly pruned. VoilĂ : a streaming data table.

We have done some testing with two approaches on an UNLOGGED table, prune on every INSERT, and pruning at reqular intervals. UNLOGGED is not a problem here, since a view on a data stream can be considered pretty much as ephemeral.

The timed variant is about 5x - 8x faster on INSERTs. And if you balance the timing and the pruning interval right, the window size is almost as stable.

The examples are implemented in Python3 with psycopg2. Putting an index on the table can help or hurt performance, INSERT might get slower but pruning with DELETE faster, depending on the size and structure of the data. Feel free to experiment. In our case, a vanilla BRIN index did just fine.

Instead of using an external scheduler for pruning, like the Python daemon thread in the stream_timed_cleanup.py example, other scheduling mechanisms can be of course used, e.g. pg_cron, or a scheduled Lambda on AWS, or similar.

Feel free to experiment and improve...

No comments:

Post a Comment