Thursday, January 28, 2016

ML based prediction with PostgreSQL & PL/R in four rounds - IV

Further digging into the PL/R documentation shows a way to run code at startup and make it globally available.

Round 4

First we need a table called plr_modules:
CREATE TABLE plr_modules (
  modseq int4,
  modsrc text
);
Then we add the necessary entries:
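Something like the following entry loads the serialized SVM from round 3 into a global variable. This is a sketch; the object name mysvm and the file path are assumptions, not the original values. PL/R executes the modsrc code in modseq order whenever the R interpreter starts, so the model is loaded once per backend:

INSERT INTO plr_modules
  VALUES (0, $$mysvm <- readRDS('/tmp/mysvm.rds')$$);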

The SVM is now globally available and the predictor function can be reduced to the following:
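A minimal sketch of how that could look, assuming the global mysvm from plr_modules and the feature name num from the round 1 reconstruction:

CREATE OR REPLACE FUNCTION r_predict4(val integer)
RETURNS text AS
$BODY$
  # mysvm was loaded globally via plr_modules - no initialization needed here
  as.character(predict(mysvm, data.frame(num = val)))
$BODY$
LANGUAGE plr IMMUTABLE STRICT;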

Let's run this statement three times again:

select s.*, r_predict4(s.*) from generate_series(1,1000) s;

541 ms for the first run. 281 ms for each of the following two. Average: 368 ms.

That's only a 73% improvement compared to the original code. predict3() is faster than predict4().

Since the only difference is that the mysvm object is now global, that might explain the gap. Maybe it is more expensive to work with global than with local objects in R?

But there is another way, suggested by a reader of the first installment of this series.

Push the whole loop into R and wrap it into an SRF.
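A sketch of such a set-returning function, again assuming the global mysvm and the feature name num; PL/R returns one row per element of the resulting vector:

CREATE OR REPLACE FUNCTION r_predict5(vals integer[])
RETURNS SETOF text AS
$BODY$
  # predict the whole input array with a single R call
  as.character(predict(mysvm, data.frame(num = vals)))
$BODY$
LANGUAGE plr IMMUTABLE STRICT;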


Let's see and run this statement three times again:

select * from r_predict5(array(select * from generate_series(1,1000)));

341 ms for the first run. 31 ms for each of the following two. Average: 134 ms.

Now that's a 90% improvement compared to the original code. Ten times faster.
If only the post-initialization runs are taken into account, it's even better: about 45x.

If you process sets anyway and can live with passing arrays, we have a winner here!

Bottom line:
  1. Do expensive initializations only once.
  2. Pre-can immutable, expensive-to-create R objects with saveRDS().
  3. If you process sets, push the whole loop down to R and stay there as long as possible.

ML based prediction with PostgreSQL & PL/R in four rounds - III

The predict2() function initializes the SVM only on the first call, which improves performance significantly. But it still needs to build the model from scratch.

If the training takes a comparatively long time or the training data cannot be provided along with the code, this is a problem.

Round 3

R has the ability to serialize objects to disk and read them back with saveRDS() and readRDS().
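For example, the trained model from round 1 could be dumped to a file like this (object name and path are assumptions):

saveRDS(mysvm, file = '/tmp/mysvm.rds')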

Having saved the SVM object like that, we can restore it from disk instead of rebuilding it each time.
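A sketch of how the predictor could look; file path, object name and feature name are again assumptions:

CREATE OR REPLACE FUNCTION r_predict3(val integer)
RETURNS text AS
$BODY$
  # deserialize the model once per session instead of retraining it
  if (!exists("mysvm", envir = .GlobalEnv)) {
    assign("mysvm", readRDS('/tmp/mysvm.rds'), envir = .GlobalEnv)
  }
  as.character(predict(mysvm, data.frame(num = val)))
$BODY$
LANGUAGE plr IMMUTABLE STRICT;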

Let's run this statement three times again:

select s.*, r_predict3(s.*) from generate_series(1,1000) s;

484 ms for the first run. 302 ms for each of the following two. Average: 363 ms.

That's a 75% improvement compared to the original code.

Still, the first call is more expensive than the subsequent ones.

Can we do better?

Wednesday, January 27, 2016

ML based prediction with PostgreSQL & PL/R in four rounds - II

The predict1() function from the first post of this series has a performance problem: the SVM is trained every time the function is called. Can this be corrected?

Round 2:

Sifting through the PL/R documentation reveals a way to do expensive initializations only once and persist them between function calls.

This leads to the first optimized version of our predictor function:
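A sketch of how this could look, reusing the reconstructed training data from round 1; the data values are assumptions, the cache-on-first-call idiom is standard R:

CREATE OR REPLACE FUNCTION r_predict2(val integer)
RETURNS text AS
$BODY$
  # train the SVM only on the first call and cache it in the R global environment
  if (!exists("mysvm", envir = .GlobalEnv)) {
    library(e1071)
    train <- data.frame(num = 1:10,
                        classes = factor(c(rep('b', 6), rep('a', 4))))
    assign("mysvm", svm(classes ~ num, data = train, type = 'C-classification'),
           envir = .GlobalEnv)
  }
  as.character(predict(mysvm, data.frame(num = val)))
$BODY$
LANGUAGE plr IMMUTABLE STRICT;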

Let's run this statement three times again:

select s.*, r_predict2(s.*) from generate_series(1,1000) s;

671 ms for the first run. 302 ms for each of the following two. Average: 425 ms.

That's a 70% improvement compared to the original code.

But we still need to provide the training data and run the training once.
What if we can't, e.g. because of sheer size, or because of legal or intellectual property restrictions?

Can we do better?

ML based prediction with PostgreSQL & PL/R in four rounds - I

By means of PL/R, PostgreSQL has been able to execute R code inside the database server since 2008 or so (welcome, MS SQL Server 2016 ;->).

I tried to teach PostgreSQL a new trick lately with the help of PL/R and here are the results...

Round 1:

Let's start with a very simple example of supervised machine learning, using a support vector machine (SVM) from the e1071 package that I stole from here.
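A minimal version of such an example could look like this; the data values are assumptions, chosen to match the confusion matrix below:

library(e1071)

# ten numbers, class 'b' for the lower six, class 'a' for the upper four
train <- data.frame(num = 1:10,
                    classes = factor(c(rep('b', 6), rep('a', 4))))

# train a classifying SVM on the toy data
mysvm <- svm(classes ~ num, data = train, type = 'C-classification')

# cross validate against the training data itself
pred <- predict(mysvm, train)
table(pred, classes = train$classes)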

When you run this, the output is something like this:

     classes
pred  a  b
   a  4  0
   b  0  6

This is the confusion matrix of the SVM, and it tells us that it predicted all classes correctly from the numerical input.

Which in this case is not surprising, because we validated against the very same data used for training. Usually you would split the data into a training set and a validation set, but for the sake of brevity this model will do.

The model is now ready to predict classes from numerical input, like so:
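For example (feature name num as in the reconstruction above):

predict(mysvm, data.frame(num = 7))
# a

predict(mysvm, data.frame(num = 2))
# b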

And this is already all we need for a naive implementation of an SVM-based predictor function in PostgreSQL.
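A sketch of such a naive r_predict1(), with the training moved verbatim into the function body, i.e. the model is rebuilt on every single call:

CREATE OR REPLACE FUNCTION r_predict1(val integer)
RETURNS text AS
$BODY$
  # train the model from scratch - on every call
  library(e1071)
  train <- data.frame(num = 1:10,
                      classes = factor(c(rep('b', 6), rep('a', 4))))
  mysvm <- svm(classes ~ num, data = train, type = 'C-classification')
  as.character(predict(mysvm, data.frame(num = val)))
$BODY$
LANGUAGE plr IMMUTABLE STRICT;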

Let's try...

select r_predict1(7);

a

select r_predict1(2);

b

Whoa, it lives! :-) But what about performance? Let's run this statement three times:

select s.*, r_predict1(s.*) from generate_series(1,1000) s;

1.4 seconds for each run. Average: 1.4 s.

That's not exactly stellar.

Can we do better?

Monday, January 11, 2016

9.5 Feature: ALTER TABLE SET LOGGED / UNLOGGED

With PostgreSQL 9.5 you can now switch tables between LOGGED and UNLOGGED mode on the fly:

CREATE TABLE demo (id INTEGER, val INTEGER); -- LOGGED

INSERT INTO demo (id,val) select s.*,s.* FROM generate_series(1,1000000) s;


-- Query returned successfully: 1000000 rows affected, 5.4 secs execution time.

TRUNCATE TABLE demo;

ALTER TABLE demo SET UNLOGGED; -- now UNLOGGED

INSERT INTO demo (id,val) select s.*,s.* FROM generate_series(1,1000000) s;


-- Query returned successfully: 1000000 rows affected, 515 msec execution time.

ALTER TABLE demo SET LOGGED;  -- and back again

Notice the INSERT speed difference between LOGGED and UNLOGGED?

Very convenient, e.g. when initially populating large tables and then switching to production mode.

So, 9.5 is not only jsonb, there is much more goodness inside... :-)

One reader pointed out that:

You haven't showed the time for "ALTER TABLE demo SET LOGGED" and haven't said that it gets AccessExclusiveLock on the table so you can't even read from it.

OK:

ALTER TABLE demo SET LOGGED;

-- Query returned successfully with no result in 3.9 secs.

So it takes some time. I still maintain that this is a useful feature, given that before 9.5 it was impossible to do this in-place.

Especially for tasks that do not require constant availability like: "initially populating [sic] large tables and then switch to production mode".

Thursday, January 7, 2016

pgchem::tigress - bugfix release

I recently discovered a bug in the fingerprinter code of pgchem::tigress which led to the following undesirable behaviour:

select 'ClC1=CC=C(CN2CCN\C2=N\N(=O)=O)C=N1'::molecule = '[O-][N+](=O)N=C1NCCN1Cc1ccc(Cl)nc1'::molecule

true

but:

select fp2string('ClC1=CC=C(CN2CCN\C2=N\N(=O)=O)C=N1'::molecule) = fp2string('[O-][N+](=O)N=C1NCCN1Cc1ccc(Cl)nc1'::molecule)

false

I.e. the fingerprinter code regards the two differently drawn nitro groups as chemically different, while the exact matcher correctly recognizes them as chemically equal. So, whenever an index was used for exact match searching, it was overselective, possibly filtering out correct hits.

This has been fixed.

Also, the add_hydrogens() function now accepts a fourth parameter to specify the pH value for protonation. If unspecified, it defaults to 7.4, so existing code using this function won't break.

Thursday, December 31, 2015

From dot product to matrix multiplication

From dot product to matrix multiplication is only a small step now:

CREATE OR REPLACE FUNCTION public._matmul(
    m1 anyarray,
    m2 anyarray,
    r anyarray)
  RETURNS anyarray AS
$BODY$
DECLARE
ubm1 integer; -- number of rows of m1
ubm2 integer; -- number of rows of m2 = number of columns of m1
ubm2c integer; -- number of columns of m2
BEGIN
ubm1 := array_upper(m1,1);
ubm2 := array_upper(m2,1);
ubm2c := array_upper(m2,2);

FOR i IN 1..ubm1 LOOP
FOR j IN 1..ubm2c LOOP
-- each result element is the dot product (+*) of row i of m1 and column j of m2
r[i][j] := m1[i:i][1:ubm2] +* m2[1:ubm2][j:j];
END LOOP;
END LOOP;

RETURN r;
END;
$BODY$
  LANGUAGE plpgsql IMMUTABLE STRICT;

Since I didn't find a way either to create an array of type anyarray or to determine the type of the input array's elements, a correctly sized result array has to be created outside the function and passed in as an anyarray argument:

select _matmul('{{1.0,2.0,3.0},{4.0,5.0,6.0}}'::double precision[], '{{7.0,8.0},{9.0,10.0},{11.0,12.0}}'::double precision[], '{{null,null},{null,null}}');

{{58,64},{139,154}}

11 ms

This is only a proof of concept. It likely has problems but works well enough to run the example. I especially don't like the two nested loops and the indexed access to the result elements; maybe there is a better solution...

Tuesday, December 29, 2015

The arraymath extension vs. plpgsql

After arraymath was fixed, I did some simple benchmarks against my plpgsql code from the previous posts.

All tests were done with PostgreSQL 9.5rc1 on a Fujitsu Celsius H730 (Core i7 4700MQ, 16 GB RAM, 500 GB SSD), 1 million random elements per array.

Here are the results.

Array * Array, plpgsql:

select array(select random() from generate_series(1,1000000)) * array(select random() from generate_series(1,1000000));

1208 ms

Array * Array, arraymath:

select array(select random() from generate_series(1,1000000)) @* array(select random() from generate_series(1,1000000));

935 ms

Array * Scalar, plpgsql:

select array(select random() from generate_series(1,1000000)) * 2.0::double precision;

784 ms

Array * Scalar, arraymath:

select array(select random() from generate_series(1,1000000)) @* 2.0::double precision;

743 ms

So, arraymath is about 29% faster for Array * Array and 6% faster for Array * Scalar. C wins over plpgsql, but not as decisively as I initially thought. The hard work is indeed the array traversal - and arraymath has to do it too.

And it's probably not a good idea to access arrays by index in plpgsql if you intend to scan the whole array anyway.

More FORTRAN 90 like vector operations in PostgreSQL

It is also possible to do element wise operations of a vector and a scalar:

CREATE OR REPLACE FUNCTION public._vectormuls(
    v1 anyarray,
    v2 anyelement)
  RETURNS anyarray AS
$BODY$
BEGIN
IF array_ndims(v1) != 1 THEN
RAISE EXCEPTION 'Only vectors supported';
END IF;

-- multiply every element of v1 with the scalar v2
RETURN array(select v.a * v.b from (select unnest(v1) as a, v2 as b) v);
END;
$BODY$
  LANGUAGE plpgsql IMMUTABLE STRICT;

CREATE OPERATOR public.*(
  PROCEDURE = _vectormuls,
  LEFTARG = anyarray,
  RIGHTARG = anyelement);
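A quick usage sketch:

select array[1.0,2.0,3.0]::double precision[] * 2.0::double precision;

{2,4,6}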

The runtime of the vector/scalar function is about 60% of the vector/vector variant, so the effort for unnest() clearly dominates the runtime of such functions.

And since the dot product of two vectors is the sum of the partial products, it is as simple as this:

CREATE OR REPLACE FUNCTION public._vectordotprod(
    v1 anyarray,
    v2 anyarray)
  RETURNS anyelement AS
$BODY$
BEGIN
IF array_ndims(v1) != 1 OR array_ndims(v2) != 1 THEN
RAISE EXCEPTION 'Only vectors supported';
END IF;

IF array_length(v1,1) != array_length(v2,1) THEN
RAISE EXCEPTION 'Vector size mismatch';
END IF;

-- sum of the element wise products
-- RETURN sum(v.a * v.b) from (select unnest(v1) as a, unnest(v2) as b) v; -- pre-9.4
RETURN sum(a*b) from unnest(v1,v2) v (a,b); -- 9.4+
END;
$BODY$
  LANGUAGE plpgsql IMMUTABLE STRICT;

CREATE OPERATOR public.+*(
  PROCEDURE = _vectordotprod,
  LEFTARG = anyarray,
  RIGHTARG = anyarray,
  COMMUTATOR = +*);
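A quick usage sketch: the dot product of {1,2,3} and {4,5,6} is 1*4 + 2*5 + 3*6 = 32.

select array[1.0,2.0,3.0]::double precision[] +* array[4.0,5.0,6.0]::double precision[];

32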

The operator name should be . or *+, but since PostgreSQL won't allow that, +* was the next best thing that worked...

BTW: I tried the same with sql instead of plpgsql as the function language and the timings are the same. So the additional sanity checks and the language overhead effectively cost nothing compared to the unrolling of the vector(s).

Thursday, December 24, 2015

FORTRAN 90 like vector operations in PostgreSQL revisited

I tried to compare the arraymath extension suggested in a comment with my plpgsql solution from the last post.

PostgreSQL was 9.5rc1 on Ubuntu 14.04 LTS.

make
make install
create extension arraymath;

No problems so far, but...

select array(select random() from generate_series(1,1000000)) @+ array(select random() from generate_series(1,1000000));

NOTICE:  Hello from ArrayMath 1.0
********** Error **********

Server crash!

Whereas the plpgsql code does the addition of two one-million-element vectors in 1.2 seconds.

Interestingly this works:

SELECT ARRAY[1,2,3,4,5,6,7,8] @+ ARRAY[1,2,3,4,5,6,7,8];

While this crashes:

SELECT ARRAY[1,2,3,4,5,6,7,8,9] @+ ARRAY[1,2,3,4,5,6,7,8,9];

So arraymath seems to fail with vectors larger than 8 elements. Probably only on 9.5, but I couldn't get it to work properly.