Friday, December 30, 2016

One-time passwords with Google Authenticator PAM (and friends)

PostgreSQL allows for more than plain password authentication in pg_hba.conf. One of the most flexible options is authenticating against a PAM (Pluggable Authentication Module).

Let's see how this works with one-time passwords from Google Authenticator.

1.) Install Google Authenticator on your Android or iOS device.

2.) Install the Google Authenticator PAM on the machine where your PostgreSQL server lives, as in steps 1-4 of this guide.

3.) Connect your device with the account on that machine.

4.) Create a login role for the database, but without a password. Since roles without a password can never log in through password-based authentication (see the PASSWORD remarks here), this effectively limits the role to PAM logins.
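For example (the role name is made up):

CREATE ROLE otp_user LOGIN;   -- note: no PASSWORD clause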

5.) Configure a PAM service for PostgreSQL, i.e. create a file named postgresql in the directory where your PAM configurations live (on Ubuntu this is /etc/pam.d/). The file should look like this:

auth         sufficient     pam_google_authenticator.so

6.) Configure PostgreSQL to use the PAM. E.g. a line in pg_hba.conf could look like this:

hostssl    all    all    127.0.0.1/32   pam    pamservice=postgresql

And that's basically it. The next time you log in, PostgreSQL will ask you for a password that is generated individually on your device.

Of course, you can use all kinds of PAMs with PostgreSQL this way.

Unfortunately, I also found a few caveats along the way. :-(

First, PostgreSQL clients will ask for only one password, regardless of whether you chain n PAMs for n-factor authentication.

So if you chain, for example, a PAM against LDAP with Google Authenticator as the second factor, this won't work. This seems to be a shortcoming of PostgreSQL's PAM implementation, which does not expect multiple password prompts. n-factor authentication is still possible, but only one PAM module can prompt for a password. If the other factors are hardware devices that do not prompt for a password, like a fingerprint scanner, you are fine.

Alternatively, you can provide your own PAM that takes all passwords in one prompt and handles them internally.

Second, PAM requires PostgreSQL clients to send the password in plaintext. So now is the time to switch on TLS and make it mandatory (did you notice the hostssl entry above?).

Third, some clients like pgAdmin3 break with one-time passwords, because they apparently open new connections without prompting for a password again, re-using the initial one instead until you disconnect. This obviously does not work with passwords that are valid for only one login attempt.

Wednesday, November 30, 2016

Stampede is out!

Stampede just went live!

I'll try this for sure, when I have time, when I have time...

Sunday, November 27, 2016

pgchem::tigress 4.0 released

pgchem::tigress 4.0 is finally out!
  • This builds against PostgreSQL >= 9.4.x and OpenBabel 2.4.x on Linux.
  • It contains all fixes and contributions of the previous versions.
  • Support for building with Indigo has been removed.
  • A lot of dead code has been removed.
  • All palloc calls have been replaced by palloc0.
  • MOLECULE is not binary compatible and needs to be recreated when you upgrade from 3.2 to 4.0!
  • MS spectrum creation has been removed. You may use enviPat instead for much more accurate spectra.
So: CREATE EXTENSION pgchem_tigress;

Friday, November 18, 2016

tcn again

In the previous post, I suggested that contrib/tcn could, for example, be used to selectively update caches.

OK, here is my try:
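Here is a minimal sketch of the idea in Python with psycopg2; the connection string is an assumption, and the dict is just a stand-in for whatever cache you actually use:

import select

import psycopg2
import psycopg2.extensions

cache = {}  # toy stand-in for a real cache (memcached, Redis, ...)

conn = psycopg2.connect("dbname=postgres user=postgres")  # adjust to your setup
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

cur = conn.cursor()
cur.execute("LISTEN cats;")

def refresh(payload):
    # tcn payloads look like: "cats",I,"face_name"='Tailchaser'
    table, op, keyspec = payload.split(",", 2)
    value = keyspec.split("=", 1)[1].strip("'")
    key = (table.strip('"'), value)
    if op == "D":
        cache.pop(key, None)  # the row is gone, so drop it from the cache
    else:
        cur.execute("SELECT * FROM cats WHERE face_name = %s", (value,))
        cache[key] = cur.fetchone()  # re-read only the affected row

while True:
    if select.select([conn], [], [], 5) == ([], [], []):
        continue  # timeout, just wait again
    conn.poll()
    while conn.notifies:
        refresh(conn.notifies.pop(0).payload)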


By touching the cache only when the database signals a change to the base table, and by using the passed information to touch only the affected rows, this should be much more efficient than strategies like routinely invalidating the whole cache every n minutes or so.

Monday, November 14, 2016

Module Monday: tcn

The next installment of the series covers contrib/tcn.

tl;dr

tcn provides a native trigger function that sends notifications about INSERTs, UPDATEs and DELETEs on any table it is attached to, to subscribers of an arbitrarily named channel, via PostgreSQL's LISTEN/NOTIFY mechanism. This is useful, e.g., for refreshing caches automagically.

First, tcn must be installed in the database:

CREATE EXTENSION tcn;

Now, a test table is created:

CREATE TABLE public.cats
(
  id serial NOT NULL,
  face_name text NOT NULL,
  CONSTRAINT cats_pkey PRIMARY KEY (face_name)
)
WITH (
  OIDS=FALSE
);

And the trigger:

CREATE TRIGGER cats_tcn_trigger
   AFTER INSERT OR UPDATE OR DELETE ON cats
   FOR EACH ROW EXECUTE PROCEDURE triggered_change_notification('cats');

Here, the channel is named 'cats'. If the parameter is omitted, the channel is 'tcn' by default.

It's all set, now we can subscribe to channel 'cats':

LISTEN cats;

Now, when we INSERT some data, the following happens:

INSERT INTO cats (face_name) VALUES ('Tailchaser');

Query returned successfully: one row affected, 13 msec execution time.

Asynchronous notification of 'cats' received from backend pid 5457

Data: "cats",I,"face_name"='Tailchaser'

Multiple operations generate multiple notifications:

INSERT INTO cats (face_name) VALUES ('Hushpad'),('Pouncequick');

Query returned successfully: 2 rows affected, 11 msec execution time.

Asynchronous notification of 'cats' received from backend pid 5457
Data: "cats",I,"face_name"='Hushpad'
Asynchronous notification of 'cats' received from backend pid 5457
Data: "cats",I,"face_name"='Pouncequick'

But what is this good for?

Well, some PostgreSQL drivers support asynchronous notification, e.g. libpq for C, psycopg2 for Python or JDBC for Java. So this can be used for notifying all clients if the content of a table has been changed.

Then the client can react accordingly, e.g. refresh the cache of such a table, or display the changes in a (near) real-time information system, or whatever action is required in the context of the application if a change has occurred.

The following Python code shows an example matching the 'cats' table from above.

Start with python notify.py:
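A minimal notify.py with psycopg2 could look like this (the connection string is an assumption, adjust it to your setup):

import select

import psycopg2
import psycopg2.extensions

conn = psycopg2.connect("dbname=postgres user=postgres")
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

cur = conn.cursor()
cur.execute("LISTEN cats;")
print("Waiting for notifications on channel 'cats'")

while True:
    # wait until the connection's socket becomes readable, checking every 5 seconds
    if select.select([conn], [], [], 5) == ([], [], []):
        continue
    conn.poll()
    while conn.notifies:
        notify = conn.notifies.pop(0)
        print("Got NOTIFY:", notify.pid, notify.channel, notify.payload)
        print("Must refresh cache")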


Waiting for notifications on channel 'cats'

Another INSERT

INSERT INTO cats (face_name) VALUES ('Roofshadow');

yields

Got NOTIFY: 5457 cats "cats",I,"face_name"='Roofshadow'
Must refresh cache

from Python.

As always, imagination is the limit.

Monday, November 7, 2016

Locking PostgreSQL shared memory to physical RAM

As an anonymous reader of my last post remarked, instance-level encryption might leak decrypted data to disk.

It looks like there is a solution to this problem. If you change only one line in src/include/portability/mem.h from:

#define PG_MMAP_FLAGS                   (MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)

to:

#define PG_MMAP_FLAGS                   (MAP_LOCKED|MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)

PostgreSQL shared memory should be locked into physical RAM and never be swapped/paged out to disk.

For this to work, you obviously need enough physical RAM, and the user PostgreSQL runs as needs permission to lock enough memory. So you'd better check the ulimit:

ulimit -l

unlimited (or at least big enough)
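If it is too low, one way to raise it is an entry in /etc/security/limits.conf for the user PostgreSQL runs as (the user name postgres is an assumption; if PostgreSQL is started by systemd, LimitMEMLOCK in the service unit is needed instead):

postgres    soft    memlock    unlimited
postgres    hard    memlock    unlimited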

Otherwise you get this rather cryptic error at startup:

"could not map anonymous shared memory: resource temporarily unavailable"

If you are not root, it will say "resource temporarily unavailable" and hide the real cause. And since PostgreSQL refuses to run as root, you'll never see the real error: "cannot allocate memory".

Well, I'm no PostgreSQL hacker. It seems to work, but is it really that easy to fix?


Sunday, November 6, 2016

The Elephant in the Locker Room: Instance-encrypted PostgreSQL meets Vault

Recently, Cybertec released a patched PostgreSQL 9.6.0 with instance-level encryption. And since I'm frequently asked whether PostgreSQL supports this, I had to try it out immediately.

Well, it seems to work as advertised, but as always when dealing with encryption, the pesky question is: where to put the key?

One method of providing the key is to put a passphrase into an environment variable named PGENCRYPTIONKEY, which is a bit of a misnomer, because the actual key is derived from the passphrase with SHA-256. But the documentation mentions a second way to provide the key.

Once the cluster is initialized with initdb, you can provide a command that retrieves the key at startup, but no example for this is given in the release announcement. So, here is how it works:

As key repository, I chose HashiCorp's Vault, which is "a tool for managing secrets" as they say.

First, fire up Vault. I use development mode here, which is the easiest way, but of course not recommended for production systems.

vault server -dev

Keep the output of this command, you'll need it later!

Second, get the SHA256 hash of the passphrase the database cluster was initialized with. Mine was 'sekrit'. :-)

echo -n sekrit | sha256sum

7e35aaaea6b83ab43c9271737060ce485dc285837d9b4b98b6458408886ac7b1  -

This is the secret to store in Vault. Don't forget the '-n' option for echo!

vault write secret/pg_key value=encryptionkey=7e35aaaea6b83ab43c9271737060ce485dc285837d9b4b98b6458408886ac7b1

The documentation says that the value returned by the key retrieval command must be 'encryptionkey=...', so the whole string was written to the vault as the secret, not only the key. 'pg_key' is an arbitrary name under which the secret can be retrieved later.

Now, we need a command to retrieve the secret from the Vault. Using the REST interface this becomes:

curl -s -X GET -H "X-Vault-Token:$VAULT_TOKEN" http://localhost:8200/v1/secret/pg_key | jq '.data.value' | tr -d '"'

This is the raw response:


{
  "request_id": "1d485f2a-893f-67e7-6257-12b3b401f916",
  "lease_id": "",
  "renewable": false,
  "lease_duration": 2592000,
  "data": {
    "value": "encryptionkey=7e35aaaea6b83ab43c9271737060ce485dc285837d9b4b98b6458408886ac7b1"
  },
  "wrap_info": null,
  "warnings": null,
  "auth": null
}

curl does the actual REST call, jq parses the value of interest out of the JSON response and tr removes the superfluous quotes around it. At the end we get the desired output:

encryptionkey=7e35aaaea6b83ab43c9271737060ce485dc285837d9b4b98b6458408886ac7b1

Looks OK, so this can be put into postgresql.conf.

pgcrypto.keysetup_command='curl -s -X GET -H "X-Vault-Token:$VAULT_TOKEN" http://localhost:8200/v1/secret/pg_key | jq ''.data.value'' | tr -d ''"'''

And with that in place, every time the PostgreSQL server starts, the key is pulled out of vault and all is set.

But what is the point of using a secret manager like Vault instead of other mechanisms? Well, Vault needs authentication and authorization to provide its services.

Actually, the REST interface requires an authorization token to work: -H "X-Vault-Token:$VAULT_TOKEN". Vault writes a root token to stdout at first startup; that's why the output of the command above is important. Vault will never reveal that token again!

Unseal Key (hex)   : d6426284cb1b9756d8a108f17a2508a261963acf529cfbbd6b12b8cdc90da643
Unseal Key (base64): 1kJihMsbl1bYoQjxeiUIomGWOs9SnPu9axK4zckNpkM=
Root Token: 60a42d59-1e2b-76fb-329f-c32f3b77291e

In production, you would of course never use the root token directly, but create additional tokens with limited rights and lifetimes! And HTTPS instead of HTTP is the protocol of choice for the REST API.

Also note the unseal key. A Vault in production mode will emit several unseal keys, a subset of which is needed to unseal the secret storage.

So in a production setup, Vault allows you to:
  1. Manage all your secrets in a central service
  2. Restrict access to the secrets by scope and time
  3. Protect all secrets in encrypted storage
  4. Seal the vault immediately if you suspect your system is compromised
  5. Unseal the secrets only by consent of several unseal key holders
  6. And much more, see the documentation for details
I'm not affiliated with HashiCorp in any way, and when I took a first look at Vault I was not impressed. But the more I dive into its documentation and try things out, the more this changes. And it can not only manage secrets for PostgreSQL, but also use PostgreSQL as a storage backend!

Monday, October 31, 2016

Module Monday: spi/timetravel

For pg_sentinel I used an SPI function for convenience. After that I decided to refresh my SPI knowledge and discovered the contrib/spi module that comes with PostgreSQL, especially timetravel, which I will introduce today.

tl;dr

Timetravel transparently morphs UPDATEs to INSERTs and DELETEs to UPDATEs together with timestamps to preserve DML changes made to a table. Thus, the state of the data in the table can be reproduced for any given point in time since the table was put under timetravel control.

First, timetravel must be installed in the database:

create extension timetravel;

Now, a test table is created

CREATE TABLE baz (
id INTEGER,
val INTEGER,
PRIMARY KEY (id));

Let's INSERT and UPDATE a single row.

INSERT INTO baz(id,val) VALUES(1,0);

UPDATE baz SET val=1 WHERE id=1;
UPDATE baz SET val=2 WHERE id=1;
UPDATE baz SET val=3 WHERE id=1;

A SELECT reveals that only the last UPDATE survived, which is the normal behaviour.

SELECT * FROM baz;

Now, in order to enable timetravel capabilities on table baz, it has to be changed a bit.

TRUNCATE TABLE baz;

ALTER TABLE baz ADD COLUMN v_start ABSTIME;
ALTER TABLE baz ADD COLUMN v_end ABSTIME;
ALTER TABLE baz DROP CONSTRAINT baz_pkey;
ALTER TABLE baz ADD PRIMARY KEY (id, v_end);

Now, for every row, two timestamp columns define the timespan for which this row was valid. These columns can be named arbitrarily, but have to be of type ABSTIME. And the primary key must be extended with the 'end' timestamp, otherwise primary key violations would occur as soon as a row is updated.

Now, baz can be put under timetravel control:

CREATE TRIGGER baz_timetravel
BEFORE INSERT OR DELETE OR UPDATE ON baz
FOR EACH ROW
EXECUTE PROCEDURE
timetravel (v_start, v_end);

Again, we INSERT and UPDATE a single row.

INSERT INTO baz(id,val) VALUES(1,0);

UPDATE baz SET val=1 WHERE id=1;
UPDATE baz SET val=2 WHERE id=1;
UPDATE baz SET val=3 WHERE id=1;

Now, a SELECT reveals what timetravel does - for every UPDATE a new row was INSERTed and timestamped.

SELECT * FROM baz;
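With the statements above, the result looks something like this (the timestamps are of course from my run and purely illustrative):

id  val  v_start                 v_end
1   0    2016-10-31 10:00:00+01  2016-10-31 10:00:05+01
1   1    2016-10-31 10:00:05+01  2016-10-31 10:00:10+01
1   2    2016-10-31 10:00:10+01  2016-10-31 10:00:15+01
1   3    2016-10-31 10:00:15+01  infinity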

See, all values were persisted. The row of the last UPDATE has an 'end' timestamp of infinity, indicating that this is the current value of the row.

Likewise, a DELETE becomes an UPDATE, setting the 'end' timestamp to the time when the DELETE happened.

DELETE FROM baz WHERE id=1 AND val=3;

SELECT * FROM baz;

If you try to UPDATE or DELETE a non-active row, timetravel intercepts it and turns it into a no-op. Only active rows, i.e. those with an 'end' of infinity, are mutable.

Timetravel can be switched on and off per table with SELECT set_timetravel('table_name', 1) and SELECT set_timetravel('table_name', 0) respectively, while SELECT get_timetravel('table_name') shows the current status.

And that's basically it. However, timetravel has a few shortcomings:
  • Tables and primary keys have to be changed
  • Does not protect against TRUNCATE TABLE
  • Needs additional disk space
  • The timestamps have to be ABSTIME, which is deprecated according to the PostgreSQL documentation
  • You can only travel backwards, not into the future ;-)
To be continued every Monday whenever my spare time permits...

Tuesday, October 11, 2016

pg_sentinel - Protecting data with sentinel values

According to OWASP, SQL injection is the #1 attack vector against web applications. Well, at least it was in 2013, and a 2016 survey is under way. But since it was also #1 in 2010, it might make it a triple.

But even if injection does not work, an amazing number of database servers can be discovered by Shodan - and it's not only MongoDB...

Quite a few years ago, I read about the idea of detecting leaking data by using sentinel values. The concept is to put values into the data that are never queried and thus never emitted as long as a certain application uses the data as it should. But if an attacker bypasses security and slurps all of the data, e.g. with SELECT * FROM sometable;, the sentinels are emitted and detected, and countermeasures can be triggered.

As far as I can remember, the original idea was to put sensors on the network to detect the sentinels, but since then I always wondered if it could be applied to the PostgreSQL server itself.

Where to put the sensor then?

The application is certainly the wrong place, since the attacker might bypass it altogether. The driver isn't an option either, because the attacker might bring his own driver. Even a middleman like pgbouncer or pgpool-II could be bypassed, so the only way is to sink the detector as deep as possible into the heart of the server.

To cut a long story short, as proof-of-concept I chose to write a PostgreSQL module that hooks into ExecutorRun(): pg_sentinel.

Documentation comes with the module.

If you find things to improve, feel free to do so. After all, except for a minuscule contribution to pg_plan_filter, I had no previous experience programming modules and hooks and started eight hours ago from zero.

Friday, September 23, 2016

pgchem::tigress vs. OpenBabel 2.4.0

OpenBabel 2.4.0 has been released, the build process worked flawlessly, and 100% of the tests passed.

Now I have to see if the Tigress still likes it...

Well, pgchem::tigress builds against OpenBabel 2.4.0 without errors, but will it work?

Yes, pgchem::tigress works with OpenBabel 2.4.0 without any notable alterations, except changing:

OPENBABEL_SOURCE=openbabel-2.3.2

in the Makefile to:

OPENBABEL_SOURCE=openbabel-2.4.0

As always with .0 releases, subtle issues might be lurking below the surface, though.

Thursday, September 15, 2016

Blending gene sequence variation data into PostgreSQL

VCF, the Variant Call Format, is a clever idea. Instead of storing all genetic information of a sampled genome, it only stores the delta against some reference genome. This squeezes a lot of redundancy out of the data and thus occupies a lot less storage space.

Unfortunately, VCF is also an unwieldy format. Only a part of it is fixed, with metadata in the header describing fields in the body, which in turn describe the actual data format of the samples.

This makes VCF especially hard to grasp for systems like Hadoop that work on chunked files spread over many compute nodes. With VCF, every chunk has to carry a copy of the header to make sense of the data it carries.

Formats like ADAM that tackle these (and many more) problems are under development, but when I was given the task of making VCF files accessible to a database last year, I took the direct route:

Since quite powerful tools and libraries for working with VCF files have been around for some time, I used Multicorn and PyVCF and wrote a foreign data wrapper for PostgreSQL that understands the VCF format.

After it was finished, I realized that it had been done before...

However, my implementation is different in some parts. Most notably it abstracts more from the actual storage of the VCF files and it works with vanilla PyVCF and does not need any special modifications.

A few days ago, I was granted permission to release the code into the wild by the people who paid for it in the first place, so now you have the freedom of choice. ;-)

Ladies and Gentlemen, please take a look at just another multicorn based foreign data wrapper for VCF files for PostgreSQL, the one and only truly integrarelational DBMS.

Thursday, August 25, 2016

Timeseries: How long can the elephant remember?

Frankly, I don't know from experience where the practical limit for the number of rows in a single PostgreSQL table is, but the interwebs seem to agree on 10^9 for narrow tables.

After a lively discussion yesterday with a NoSQL aficionado about the (in)ability to effectively store timeseries data in an RDBMS, I made a quick calculation.

Timeseries data is usually a triple of the form (key, timestamp, value), so it can be stored in a pretty narrow table; hence I stick to the 10^9 row limit.

If we get a data point every second, we can store 10^9 seconds worth of data. 10^9 seconds is about 16,666,667 minutes, which is about 277,778 hours, which is about 11,574 days, which is good for about 31 years of recording.

Every second of 31 years. Per table.

Thursday, August 18, 2016

Hexastores are easy

Did you know that you can make a Hexastore from an RDF triple in just one line of SQL? (This needs PostgreSQL 9.4 or better, because of the multi-array unnest.)

CREATE OR REPLACE FUNCTION hexify(
    IN sub text,
    IN pred text,
    IN obj text)
  RETURNS TABLE(ord text, a text, b text, c text) AS
$$select A.t || B.t || C.t as ord, A.v, B.v, C.v from (select * from unnest(ARRAY[sub, pred, obj],ARRAY['s', 'p', 'o'])) as A(v, t) cross join (select * from unnest(ARRAY[sub, pred, obj],ARRAY['s', 'p', 'o'])) as B(v, t) cross join (select * from unnest(ARRAY[sub, pred, obj],ARRAY['s', 'p', 'o'])) as C(v, t) where a.v != b.v and a.v != c.v and b.v != c.v order by ord desc$$
  LANGUAGE sql IMMUTABLE STRICT
  COST 100
  ROWS 6;

SELECT * FROM hexify('subject','predicate','object');
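This should return the six permutations of the triple, something like:

ord  a          b          c
spo  subject    predicate  object
sop  subject    object     predicate
pso  predicate  subject    object
pos  predicate  object     subject
osp  object     subject    predicate
ops  object     predicate  subject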


Sometimes, PostgreSQL SQL is just awesome...

More on Hexastores here and here.

Thursday, June 23, 2016

When 'good enough' is good enough - approximate answers with PostgreSQL 9.4999

Approximation in databases seems to be an alien concept at first. But if your application can deal with a known and controllable degree of error, it can help even in cases where conventional tuning is not an option for whatever reason.

Approximation is not evil 

 

One of the top requirements for database systems is reliability. Whether you run a big bank or a small retail business, you don't want to lose a cent here or there or charge your customer twice for the Pink Fluffy Unicorn he just bought, just because the DBMS gave a wrong answer. Classic OLTP operations have to be always 100% correct.

However, for the case of analytics, things become different. In some cases, it can be desirable to trade a bit of accuracy for a lot of speed. This is called approximation and to many database people (and users), the concept of accepting results with less than 100% accuracy seems strange at first.

But if you know - and can control - the error introduced by approximation, it is not. It can even be very useful, if a 95% accurate answer now is worth more than a 100% accurate answer tomorrow.

Welcome to approximation in PostgreSQL 9.5.

Approximating queries

 

Approximate queries work on subsets, called samples, of the whole data set, called the population. If the sampling is done in a statistically sound way, a sample much smaller than the whole population gives answers close to the real answer, within a known error range.

A possible application for the hypothetical retail business would be to find out which product is currently trending.
Instead of knowing that exactly 500, 1000 and 2000 Pink Fluffy Unicorns were sold in the last three weeks, knowing that 498, 1001 and 1999 Pink Fluffy Unicorns were sold in the last three weeks, with, say, a 5% error, tells the procurement people just as well as the exact numbers that Pink Fluffy Unicorns are a trending product. Only, they might have to wait a few seconds for the answer instead of a few hours...

PostgreSQL 9.5 has built-in support for approximate queries. Because I'm lazy and already wrote about this I just point to the corresponding post.
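For reference, the 9.5 feature in question is presumably the TABLESAMPLE clause; a quick illustration (the orders table is made up):

-- SYSTEM samples roughly 1% of the table's pages; BERNOULLI samples individual rows
SELECT count(*) * 100 AS estimated_rows
  FROM orders TABLESAMPLE SYSTEM (1);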

Still, all the population data has to be there for approximate queries to work. How about running queries without storing the underlying data at all?

 

Approximating data structures 

 

If PostgreSQL has a weakness, it's the comparably poor performance of count() and distinct.
Due to the lock-free multiversion concurrency design of PostgreSQL, count() has to touch each row in a table to check whether it is visible in the current transaction or not. Unlike locking DBMSs like Oracle, it can use an index to count only in a few cases. Otherwise: full table scan.

DISTINCT always has to sort the table. It can use an index, but only a covering one, and the larger the index is compared to the table, the less likely PostgreSQL is to use it. Sorting can be tuned by raising work_mem, but since this is a per-session parameter, it is limited by the available RAM.

So count(distinct) is like the worst of both worlds. (In the following example, distinct alone is slower because it has to return ten million rows to the client; count(distinct) returns only one value.)
Like here (times are without / with an index on value):
 
create table hmr(id serial, value real);

insert into hmr (value) select random()*10000000 from generate_series(1,10000000);

select count (value) from hmr; --826 msec. / 817 msec.

select distinct value from hmr; --33917 msec. / 32420 msec.

select count (distinct value) from hmr; -- 9665 msec. / 9439 msec.

Enter the HyperLogLog cardinality estimator. Some clever people at Google observed that the cardinality of a multiset of evenly distributed random numbers can be estimated by finding the maximum number of leading zeroes in the binary representation of those numbers: for a maximum of k leading zeroes, the cardinality is approximately 2^k.

HyperLogLog uses a hash function to transform arbitrary input values into such random numbers and thus allows estimating the cardinality of an input multiset, even for cardinalities beyond 10^9, with a 2-3% error, using only 1280 bytes of storage.

PostgreSQL has a HyperLogLog extension, hll.
 
create extension hll;

CREATE TABLE cardinality (
            id      integer,
            set     hll
    );

INSERT INTO cardinality(id, set)
    SELECT 1, (select hll_add_agg(hll_hash_any(value))
    FROM hmr); -- 2267 msec.

SELECT hll_cardinality(set)::int FROM cardinality WHERE id = 1; -- 11 msec.
 
Since count(distinct value) = 8187749 and hll_cardinality = 8470057, the error is ~3%.

Another, not so PostgreSQL-specific example would be a database that has a stream table, e.g. holding only one hour's worth of events at any given point in time. I showed how to do this with stock PostgreSQL and a bit of Java here and here.

If you also want to know how many distinct events that stream has seen in total, that's impossible unless you store all distinct values and update their counts every time a new event arrives. But then you might end up storing all events - which is not what you wanted in the first place if you chose to use a stream table.

With HyperLogLog it's easy. Update your HyperLogLog estimator on every new event and you get a good approximation of how many distinct values the stream has seen in total.
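A sketch, reusing the cardinality table from above (in practice this would probably live in a trigger on the stream table; the event value here is made up):

-- fold every new event into the hll for stream 1
UPDATE cardinality
   SET set = hll_add(set, hll_hash_any('some new event'::text))
 WHERE id = 1;

-- still only 1280 bytes, no matter how many distinct events have been folded in
SELECT hll_cardinality(set)::int FROM cardinality WHERE id = 1;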

Approximating indexes

 

9.5 introduced BRIN indexes for very large tables. Unlike e.g. a btree, BRIN stores only ranges of values and points to the physical pages where a value that falls into that range could possibly be found.

A BRIN index thus only gives precise answers to the question where a certain value could not be found on disk.

9.6 will have Bloom filter indexes as an extension. Bloom filters can tell you with perfect accuracy that a value does not exist in a set. But the question of whether a value exists in the set can only be answered with a probability that increases with the collision resilience of the underlying hash.

So, since BRIN and Bloom indexes are both approximating indexes, every index hit has to be rechecked by the DBMS against the underlying data. But if you know their limitations and use them accordingly, they too can speed up your queries quite a bit.
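For completeness, here is what creating them looks like (a sketch; the BRIN index reuses the hmr table from above, while the Bloom example table is made up, since Bloom really shines with queries over many columns):

CREATE INDEX hmr_value_brin ON hmr USING brin (value);

CREATE EXTENSION bloom;

CREATE TABLE events (a int4, b int4, c int4, d int4);

CREATE INDEX events_bloom ON events USING bloom (a, b, c, d);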

Monday, March 14, 2016

Vítaný!

According to the access statistics, my blog now has more readers from the Czech Republic than from the U.S.A., which had the lead for the last few years.

Wednesday, March 9, 2016

More fun with an integrarelational DBMS: SoilGrids

While the SoilGrids FDW in my first post on this subject works fine, I now think there is a better, more flexible, and more elegant way to do it.

Since PostgreSQL has JSON built in, why not return the whole response and parse it with SQL? This way you can get all the data from SoilGrids without having to return a bazillion columns, and the JSON response can be stored for future use, and indexed as well.

And this is how it looks:

CREATE FOREIGN TABLE public.soilgrids1km
   (response jsonb , -- json also works, PostgreSQL does the correct cast for us
    latitude real ,
    longitude real )
   SERVER soilgrids_srv;

select (response->'properties'->'PHIHOX'->'M'->>'sd1')::REAL / 10.0::real from soilgrids1km where latitude = '51.57' and longitude = '5.39'

5.8
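And since the response is ordinary jsonb, it can be cached in a local table and indexed, e.g. (table and index names are made up):

CREATE TABLE soilgrids_cache AS
  SELECT latitude, longitude, response
    FROM public.soilgrids1km
   WHERE latitude = '51.57' AND longitude = '5.39';

CREATE INDEX soilgrids_cache_response_idx ON soilgrids_cache USING gin (response);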

And again, that level of extensibility is one of the reasons why I think that PostgreSQL is one awesome DBMS. Integrarelational even...

Wednesday, February 17, 2016

The Devil is in the details: Returning bytea from PL/R

If you return an R object from PL/R as bytea, it is passed through the R serialization interface.

This is mentioned in the fine manual, but not really prominently so.

I assume the rationale behind this is that you can use R objects returned from one PL/R function directly as input to some other PL/R function.

But if you export an image or some other binary object for use outside of PL/R, you must get rid of the serialization first.

And there is a utility function for this in PL/R already:

SELECT plr_get_raw(some_plr_function_returning_bytea());

Otherwise the binary output will not be what you expect...

Friday, February 12, 2016

Fun with an integrarelational DBMS: SoilGrids

PostgreSQL has many ways to extend its capabilities in well-defined ways: custom datatypes, custom functions, custom operators, even custom indexes.

And then there's the Foreign Data Wrapper, an API to pull (almost) any kind of data into PostgreSQL and treat it (almost) like a native table. There is already an impressive list of them, but sometimes you have to write your own. Like yesterday, when I wanted to integrate pH data from the SoilGrids1km project into an already existing PostgreSQL/PostGIS system.

The data from SoilGrids is freely available, so I could just have downloaded it and put it into PostGIS. But the data set is a) huge, b) constantly updated, and c) they have a REST API, so why not tap into it directly?

Maybe because the native language of the Foreign Data Wrapper is C and that's not exactly well suited for a fast prototype. :-) But then there is Multicorn, a bridge between the FDW API and Python, so I gave it a try...

After two hours, including the installation of Multicorn itself, I had this up and running:

CREATE SERVER soilgrids_srv
   FOREIGN DATA WRAPPER multicorn
  OPTIONS (wrapper 'soilgrids_fdw.SoilGridsForeignDataWrapper');

CREATE FOREIGN TABLE public.soilgrids1km
   (latitude real ,
    longitude real ,
    ph_l real ,
    ph_m real ,
    ph_u real ,
    depth real ,
    publication_date date )
   SERVER soilgrids_srv;

select * from public.soilgrids1km where latitude = '51.57' and longitude = '5.39'

latitude longitude ph_l ph_m ph_u depth  publication_date
51.57    5.39      4.2  5.8  7.4  -0.025 2014-04-02 

I bet a more pythonic person than me could write something like this in under one hour or so.

And that level of extensibility is one of the reasons why I think that PostgreSQL is one awesome DBMS. Integrarelational even...

Thursday, January 28, 2016

ML based prediction with PostgreSQL & PL/R in four rounds - IV

Further digging into the PL/R documentation shows a way to run code at startup and make it globally available.

Round 4

First we need a table called plr_modules:
CREATE TABLE plr_modules (
  modseq int4,
  modsrc text
);
Then we add the necessary entries:
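Something like this (a sketch: the file path is an assumption, and the model is assumed to have been saved with saveRDS() as in round 3):

INSERT INTO plr_modules
  VALUES (0, $m$
    library(e1071)
    mysvm <- readRDS('/var/lib/postgresql/mysvm.rds')
  $m$);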

The SVM is now globally available and the predictor function can be reduced to the following:
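A sketch of the reduced function (mysvm and the column name num come from the module code above):

CREATE OR REPLACE FUNCTION r_predict4(num integer) RETURNS text AS
$$
  as.character(predict(mysvm, data.frame(num = num)))
$$ LANGUAGE plr;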

Let's run this statement three times again:

select s.*, r_predict4(s.*) from generate_series(1,1000) s;

541 ms for the first run. 281 ms for each of the following two. Average: 368 ms.

That's only a 73% improvement compared to the original code. r_predict3() is faster than r_predict4().

Since the only difference is that the mysvm object is now global, that might explain it. Maybe it is more expensive to work with global objects than with local ones in R?

But there is another way, suggested by a reader of the first installment of this series.

Push the whole loop into R and wrap it into an SRF.
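A sketch of what that could look like: PL/R maps the integer[] argument to an R vector, predict() works on the whole vector at once, and the returned character vector becomes a set of rows:

CREATE OR REPLACE FUNCTION r_predict5(nums integer[]) RETURNS SETOF text AS
$$
  as.character(predict(mysvm, data.frame(num = nums)))
$$ LANGUAGE plr;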


Let's see and run this statement three times again:

select * from r_predict5(array(select * from generate_series(1,1000)));

341 ms for the first run. 31 ms for each of the following two. Average: 134 ms.

Now that's a 90% improvement compared to the original code. Ten times faster.
If only the post initialization runs are taken into account it's even better: about 45x.

If you process sets anyway and can live with passing arrays, we have a winner here!

Bottom line:
  1. Do expensive initializations only once.
  2. Pre-can R objects that are immutable and expensive to create with saveRDS().
  3. If you process sets, push the whole loop down to R and stay there as long as possible.

ML based prediction with PostgreSQL & PL/R in four rounds - III

The r_predict2() function initializes the SVM only on the first call, which improves performance significantly. But it still needs to build the model from scratch on that first call.

If the training takes a comparatively long time or the training data cannot be provided along with the code, this is a problem.

Round 3

R has the ability to serialize objects to disk and read them back with saveRDS() and readRDS().

Having saved the SVM object like that, we can restore it from disk instead of rebuilding it each time.
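A sketch of r_predict3(); the file path is an assumption, and the model is assumed to have been saved beforehand with saveRDS(mysvm, '/var/lib/postgresql/mysvm.rds'):

CREATE OR REPLACE FUNCTION r_predict3(num integer) RETURNS text AS
$$
  if (!exists("mysvm", envir = .GlobalEnv)) {
    library(e1071)
    assign("mysvm", readRDS('/var/lib/postgresql/mysvm.rds'), envir = .GlobalEnv)
  }
  as.character(predict(mysvm, data.frame(num = num)))
$$ LANGUAGE plr;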

Let's run this statement three times again:

select s.*, r_predict3(s.*) from generate_series(1,1000) s;

484 ms for the first run. 302 ms for each of the following two. Average: 363 ms.

That's a 75% improvement compared to the original code.

Still, the first call is more expensive than the subsequent ones.

Can we do better?

Wednesday, January 27, 2016

ML based prediction with PostgreSQL & PL/R in four rounds - II

The r_predict1() function from the first post of this series has a performance problem: the SVM is trained every time the function is called. Can this be corrected?

Round 2:

Sifting through the PL/R documentation reveals a way to do expensive initializations only once and persist them between function calls.

This leads to the first optimized version of our predictor function:
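A sketch (the training data and the column name num are made up, consistent with the sketch of r_predict1() in the first post):

CREATE OR REPLACE FUNCTION r_predict2(num integer) RETURNS text AS
$$
  if (!exists("mysvm", envir = .GlobalEnv)) {
    library(e1071)
    train <- data.frame(num = 1:10,
                        classes = factor(c(rep('b', 6), rep('a', 4))))
    assign("mysvm", svm(classes ~ num, data = train), envir = .GlobalEnv)
  }
  as.character(predict(mysvm, data.frame(num = num)))
$$ LANGUAGE plr;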

Let's run this statement three times again:

select s.*, r_predict2(s.*) from generate_series(1,1000) s;

671 ms for the first run. 302 ms for each of the following two. Average: 425 ms.

That's a roughly 70% improvement compared to the original code.

But we still need to provide the training data and run the training once. What if we can't, because of sheer size, or legal or intellectual property restrictions?

Can we do better?

ML based prediction with PostgreSQL & PL/R in four rounds - I

By means of PL/R, PostgreSQL has been able to execute R code inside the database server since 2008 or so (welcome, MS SQL Server 2016 ;->).

Lately, I tried to teach PostgreSQL a new trick with the help of PL/R, and here are the results...

Round 1:

Let's start with a very simple example of supervised machine learning, using a support vector machine (SVM) from the e1071 package, which I stole from here.
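A sketch of the idea (not the original code): ten numbers, labelled 'b' for 1-6 and 'a' for 7-10, an SVM trained on them, and a cross-check against the very same data, all wrapped in a PL/R function:

CREATE OR REPLACE FUNCTION r_train_svm() RETURNS text AS
$$
  library(e1071)
  num     <- 1:10
  classes <- factor(c(rep('b', 6), rep('a', 4)))
  train   <- data.frame(num, classes)
  mysvm   <- svm(classes ~ num, data = train)
  pred    <- predict(mysvm, train)
  paste(capture.output(table(pred, classes)), collapse = "\n")
$$ LANGUAGE plr;

SELECT r_train_svm();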

When you run this, the output is something like this:

     classes
pred  a  b
   a  4  0
   b  0  6

This is the confusion matrix of the SVM, and it tells us that it predicted all classes correctly from the numerical input.

Which in this case is not surprising, because we validated with the very same data used for training. Usually you don't do this, but split the data into a training and a validation set; for the sake of brevity, though, this model will do.

The model is now ready to predict classes from numerical input with R's predict() function.

And this is already all we need for a naive implementation of an SVM-based predictor function in PostgreSQL:
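A sketch, with the same made-up training data as above, and the SVM (re)trained inside the function on every call:

CREATE OR REPLACE FUNCTION r_predict1(num integer) RETURNS text AS
$$
  library(e1071)
  train <- data.frame(num = 1:10,
                      classes = factor(c(rep('b', 6), rep('a', 4))))
  mysvm <- svm(classes ~ num, data = train)
  as.character(predict(mysvm, data.frame(num = num)))
$$ LANGUAGE plr;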

Let's try...

select r_predict1(7);

a

select r_predict1(2);

b

Whoa, it lives! :-) But what about performance? Let's run this statement three times:

select s.*, r_predict1(s.*) from generate_series(1,1000) s;

1.4 seconds for each run. Average: 1.4 s.

That's not exactly stellar.

Can we do better?

Monday, January 11, 2016

9.5 Feature: ALTER TABLE SET LOGGED / UNLOGGED

With PostgreSQL 9.5, you can now switch tables between LOGGED and UNLOGGED mode on the fly:

CREATE TABLE demo (id INTEGER, val INTEGER); -- LOGGED

INSERT INTO demo (id,val) select s.*,s.* FROM generate_series(1,1000000) s;


-- Query returned successfully: 1000000 rows affected, 5.4 secs execution time.

TRUNCATE TABLE demo;

ALTER TABLE demo SET UNLOGGED; -- now UNLOGGED

INSERT INTO demo (id,val) select s.*,s.* FROM generate_series(1,1000000) s;


-- Query returned successfully: 1000000 rows affected, 515 msec execution time.

ALTER TABLE demo SET LOGGED;  -- and back again

Did you notice the INSERT speed difference between LOGGED and UNLOGGED?

Very convenient, e.g. when initially populating large tables and then switching to production mode.

So, 9.5 is not only jsonb, there is much more goodness inside... :-)

One reader pointed out that:

You haven't showed the time for "ALTER TABLE demo SET LOGGED" and haven't said that it gets AccessExclusiveLock on the table so you can't even read from it.

OK:

ALTER TABLE demo SET LOGGED;

-- Query returned successfully with no result in 3.9 secs.

So it takes some time. I still maintain my opinion that this is a useful feature, given the fact that it was impossible to do this in place before 9.5.

Especially for tasks that do not require constant availability like: "initially populating [sic] large tables and then switch to production mode".

Thursday, January 7, 2016

pgchem::tigress - bugfix release

I recently discovered a bug in the fingerprinter code of pgchem::tigress which led to the following undesirable behaviour:

select 'ClC1=CC=C(CN2CCN\C2=N\N(=O)=O)C=N1'::molecule = '[O-][N+](=O)N=C1NCCN1Cc1ccc(Cl)nc1'::molecule

true

but:

select fp2string('ClC1=CC=C(CN2CCN\C2=N\N(=O)=O)C=N1'::molecule) = fp2string('[O-][N+](=O)N=C1NCCN1Cc1ccc(Cl)nc1'::molecule)

false

I.e. the fingerprinter code regarded the two differently drawn nitro groups as chemically different, while the exact matcher correctly recognizes them as chemically equal. So whenever an index was used for exact-match searching, it was overselective, possibly filtering out correct hits.

This has been fixed.

Also, the add_hydrogens() function now accepts a fourth parameter to specify the pH value for protonation. If unspecified, it defaults to 7.4, so existing code using this function won't break.