Friday, May 22, 2020

Native (PostgreSQL only) streaming data tables

If you want to see (and analyze) only a window of data over some continuous data stream in PostgreSQL, one way is to use a specialized tool like the PipelineDB extension. But if you can't do that, e.g. because you are stuck with AWS RDS or for some other reason, streaming data tables, or continuous views, can be implemented with pretty much PostgreSQL alone.

The basic idea is to have a table that allows for fast INSERT operations, is aggressively VACUUMed, and has some key that can be used to prune outdated entries. This table is fed with the events from the data stream and regularly pruned. VoilĂ : a streaming data table.

We have done some testing with two approaches on an UNLOGGED table, prune on every INSERT, and pruning at reqular intervals. UNLOGGED is not a problem here, since a view on a data stream can be considered pretty much as ephemeral.

The timed variant is about 5x - 8x faster on INSERTs. And if you balance the timing and the pruning interval right, the window size is almost as stable.

The examples are implemented in Python3 with psycopg2. Putting an index on the table can help or hurt performance, INSERT might get slower but pruning with DELETE faster, depending on the size and structure of the data. Feel free to experiment. In our case, a vanilla BRIN index did just fine.

Instead of using an external scheduler for pruning, like the Python daemon thread in the stream_timed_cleanup.py example, other scheduling mechanisms can be of course used, e.g. pg_cron, or a scheduled Lambda on AWS, or similar.

Feel free to experiment and improve...

Tuesday, May 19, 2020

MQTT as transport for PostgreSQL events

MQTT has become a de-facto standard for the transport of messages between IoT devices. As a result, a plethora of libraries and MQTT message brokers have become available. Can we use this to transport messages originating from PostgreSQL?

Aa message broker we use Eclipse Mosquitto which is dead simple to set up if you don't have to change the default settings. Such a default installation is neither secure nor highly available, but for our demo it will do just fine. The event generators are written in Python3 with Eclipse paho mqtt for Python.

There are at least two ways to generate events from a PostgreSQL database, pg_recvlogical and NOTIFY / LISTEN. Both have their advantages and shortcomings.

pg_recvlogical:

  • Configured on server and database level
  • Generates comprehensive information about everything that happens in the database
  • No additional programming neccessary
  • Needs plugins to decode messages, e.g. into JSON
  • Filtering has to be done later, e.g. by the decoder plugin
NOTIFY / LISTEN:
  • Configured on DDL and table level
  • Generates exactly the information and format you program into the triggers
  • Filtering can be done before sending the message
  • Needs trigger programming
  • The message size is limited to 8000 bytes
Examples for both approaches can be found here. The NOTIFY / LISTEN example lacks a proper decoder but this makes be a good excercise to start with. The pg_recvlogical example needs the wal2json plugin, which can be found here and the proper setup, which is also explained in the Readme. Please note, that the slot used in the example is mqtt_slot, not test_slot:


pg_recvlogical -d postgres --slot mqtt_slot --create-slot -P wal2json

Otherwise, setup.sql should generate all objects to run both examples.

Saturday, April 25, 2020

It looks like pgchem::tigress just got a major upgrade

With the Release of PostgreSQL 12.x and OpenBabel 3.x, I decided to see if pgchem::tigress would still compile. Well, it took some minor changes, but YES, it does!

And - it seems like OpenBabel now handles E/Z and enantiomer stereochemistry correctly, at least in SMILES notation. This is a major step forward, but I have to do some more checks before the next release...

Sunday, March 15, 2020

Authenticate PostgreSQL users against the Amazon AWS Cognito service

I was asked recently if PostgreSQL could authenticate login users against AWS Cognito.  Since PostgreSQL allows PAM authentication, I was pretty sure it could.

But an (admittedly not exhaustive) search on the web did not produce any PAMs for Cognito.

So I wrote one, using pam-pythonboto3, warrant and pyJWT:


It is designed primarily for PostgreSQL and pgbouncer, so it only supports pam_sm_authenticate and pam_sm_acct_mgmt, and all the work is done in pam_sm_authenticate. Because calling Cognito is comparatively slow, I didn't want to call it twice.

The necessary pam.d config is:

#%PAM-1.0
# Information for PostgreSQL process with the 'pam' option.
auth required  pam_python.so cognito_PAM.py aws_region user_pool_id client_id 
account required pam_python.so cognito_PAM.py

If you use PAM authentication, passwords are sent in cleartext, so transport layer encryption, e.g. SSL/TLS, between client and server becomes mandatory!

I think it does the correct dance of authentication with Cognito and supports USER_SRP_AUTH, but if you see any problem, please raise a paw.


Friday, February 14, 2020

Excel and ODF support for cloudfs_fdw

cloudfs_fdw now supports .xls (Excel 97-2003), .xlsx, and .ods (Open Document Format) Spreadsheets via pandas, xlrd, and odfpy. It requires pandas >= 1.0.1, so Multicorn must be compiled against Python 3.

Since pandas provides sorting and filtering capabilities, cloudfs_fdw tries to push down SQL qualifiers and sort keys when they can be translated into pandas notation.

Take a look and have fun.

Wednesday, September 25, 2019

cloudfs_fdw

Since I needed a Foreign Data Wrapper for files stored on S3, and the ones I found did things like loading the whole file in memory before sending the first rows, I wrote my own, using Multicorn.

Along the way, I discovered libraries like smart-open and ijson that allow to stream various file formats from various filesystems - and so this escalated a bit, into cloudfs_fdw.

It currently supports CSV and JSON files from S3, HTTP/HTTPS sources and local or network  filesystems but since smart-open supports more than that (e.g. HDFS, SSH), it certainly can be extended if needed.

For now, have fun.

Thursday, May 2, 2019

Not all CASTs are created equal?

Can somebody explain this?

[Solved: See the comments section]

PostgreSQL 11.2.

The documentation says:

"A type cast specifies a conversion from one data type to another. PostgreSQL accepts two equivalent syntaxes for type casts:

CAST ( expression AS type )

expression::type

The CAST syntax conforms to SQL; the syntax with :: is historical PostgreSQL usage."

But when I test the lower limits of PostgreSQL's integer types, strange things happen.

select cast(-9223372036854775808 as bigint);
select cast(-2147483648 as integer);
select cast(-32768 as smallint);

All OK.

select -9223372036854775808::bigint;
select -2147483648::integer;
select -32768::smallint;

All fail with SQL Error [22003]: ERROR: out of range

But:

select -9223372036854775807::bigint;
select -2147483647::integer;
select -32767::smallint;

All OK.

???