Thursday, December 31, 2015

From dot product to matrix multiplication

Going from the dot product to matrix multiplication is only a small step now:

CREATE OR REPLACE FUNCTION public._matmul(
    m1 anyarray,
    m2 anyarray,
    r anyarray)
  RETURNS anyarray AS
$BODY$
DECLARE
ubm1 integer; -- rows of m1 (= rows of the result)
ubm2 integer; -- rows of m2 (= columns of m1, the shared dimension)
ubc2 integer; -- columns of m2 (= columns of the result)
BEGIN
ubm1 := array_upper(m1,1);
ubm2 := array_upper(m2,1);
ubc2 := array_upper(m2,2);

FOR i IN 1..ubm1 LOOP
FOR j IN 1..ubc2 LOOP
-- each result cell is the dot product of row i of m1 and column j of m2
r[i][j] := m1[i:i][1:ubm2] +* m2[1:ubm2][j:j];
END LOOP;
END LOOP;

return r;
END;
$BODY$
  LANGUAGE plpgsql IMMUTABLE STRICT;

Since I didn't find a way to either create an array of type anyarray or to determine the type of the input array's elements, a correctly sized result array has to be created outside the function and passed in as the third anyarray argument:

select _matmul('{{1.0,2.0,3.0},{4.0,5.0,6.0}}'::double precision[], '{{7.0,8.0},{9.0,10.0},{11.0,12.0}}'::double precision[], '{{null,null},{null,null}}');

{{58,64},{139,154}}

11 ms

This is only a proof of concept. It likely has problems, but it works well enough to run the example. I especially don't like the two nested loops and the element-by-element access of the result array by index; maybe there is a better solution...
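
One direction I might explore, sketched here only as an untested idea: let SQL do the looping. With generate_subscripts() and the array_agg() over arrays that 9.5 brings, the whole multiplication can be written as one set-based query (the CTE names a, b and cells are made up for this example):

WITH a(m) AS (SELECT '{{1.0,2.0,3.0},{4.0,5.0,6.0}}'::double precision[]),
     b(m) AS (SELECT '{{7.0,8.0},{9.0,10.0},{11.0,12.0}}'::double precision[]),
     cells AS (
       -- one row per result cell: sum over the shared dimension k
       SELECT i, j, sum(a.m[i][k] * b.m[k][j]) AS val
       FROM a, b,
            generate_subscripts(a.m, 1) AS i,
            generate_subscripts(b.m, 2) AS j,
            generate_subscripts(b.m, 1) AS k
       GROUP BY i, j
     )
SELECT array_agg(r ORDER BY i)
FROM (SELECT i, array_agg(val ORDER BY j) AS r
      FROM cells
      GROUP BY i) rows;

For the example above this should again give {{58,64},{139,154}}; whether it beats the nested loops on larger matrices I haven't measured.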

Tuesday, December 29, 2015

The arraymath extension vs. plpgsql

After arraymath was fixed, I did some simple benchmarks against my plpgsql code from the previous posts.

All tests were done with PostgreSQL 9.5rc1 on a Fujitsu Celsius H730 (Core i7 4700MQ, 16 GB RAM, 500 GB SSD), 1 million random elements per array.

Here are the results.

Array * Array, plpgsql:

select array(select random() from generate_series(1,1000000)) * array(select random() from generate_series(1,1000000));

1208 ms

Array * Array, arraymath:

select array(select random() from generate_series(1,1000000)) @* array(select random() from generate_series(1,1000000));

935 ms

Array * Scalar, plpgsql:

select array(select random() from generate_series(1,1000000)) * 2.0::double precision;

784 ms

Array * Scalar, arraymath:

select array(select random() from generate_series(1,1000000)) @* 2.0::double precision;

743 ms

So, arraymath is about 29% faster for Array * Array and 6% faster for Array * Scalar. C wins over plpgsql, but not as decisively as I initially thought. The hard work is indeed the array traversal - and arraymath has to do it too.

And it is probably not a good idea to access arrays by index in plpgsql if you intend to scan the whole array anyway.
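
To illustrate the point, here is a minimal sketch of the two styles for summing a vector (function names made up, not part of the benchmark above):

-- element access by index: every v[i] is evaluated as a separate expression
CREATE OR REPLACE FUNCTION public._sum_by_index(v double precision[])
  RETURNS double precision AS
$BODY$
DECLARE
s double precision := 0;
BEGIN
FOR i IN 1..coalesce(array_length(v,1),0) LOOP
s := s + v[i];
END LOOP;
return s;
END;
$BODY$
  LANGUAGE plpgsql IMMUTABLE STRICT;

-- set-based: unnest() traverses the array once
CREATE OR REPLACE FUNCTION public._sum_by_unnest(v double precision[])
  RETURNS double precision AS
$BODY$
DECLARE
BEGIN
return SUM(x) from unnest(v) x;
END;
$BODY$
  LANGUAGE plpgsql IMMUTABLE STRICT;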

More FORTRAN 90 like vector operations in PostgreSQL

It is also possible to do element-wise operations between a vector and a scalar:

CREATE OR REPLACE FUNCTION public._vectormuls(
    v1 anyarray,
    v2 anyelement)
  RETURNS anyarray AS
$BODY$
DECLARE
BEGIN
if array_ndims(v1) != 1 Then
RAISE EXCEPTION 'Only vectors supported';
end if;

return array(select v.a * v.b from (select unnest(v1) as a, v2 as b) v);
END;
$BODY$
  LANGUAGE plpgsql IMMUTABLE STRICT;

CREATE OPERATOR public.*(
  PROCEDURE = _vectormuls,
  LEFTARG = anyarray,
  RIGHTARG = anyelement);
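
A quick check of the new operator with a small example of my own (both sides are cast so the polymorphic types resolve):

select ARRAY[1.0,2.0,3.0]::double precision[] * 2.0::double precision;

{2,4,6}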

The runtime of the vector/scalar function is about 60% of the vector/vector variant, so the effort for unnest() clearly dominates the runtime of such functions.

And since the dot product of two vectors is the sum of the partial products, it is as simple as this:

CREATE OR REPLACE FUNCTION public._vectordotprod(
    v1 anyarray,
    v2 anyarray)
  RETURNS anyelement AS
$BODY$
DECLARE
BEGIN
if array_ndims(v1) != 1 OR array_ndims(v2) != 1 Then
RAISE EXCEPTION 'Only vectors supported';
end if;

if array_length(v1,1) != array_length(v2,1)  Then
RAISE EXCEPTION 'Vector size mismatch';
end if;

-- return SUM(v.a * v.b) from (select unnest(v1) as a, unnest(v2) as b) v; -- 9.4
return SUM(a*b) from unnest(v1,v2) v (a,b); -- 9.5
END;
$BODY$
  LANGUAGE plpgsql IMMUTABLE STRICT;

CREATE OPERATOR public.+*(
  PROCEDURE = _vectordotprod,
  LEFTARG = anyarray,
  RIGHTARG = anyarray,
  COMMUTATOR = +*);
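
Again a quick check with a small example of my own:

select ARRAY[1.0,2.0,3.0]::double precision[] +* ARRAY[4.0,5.0,6.0]::double precision[];

32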

The operator name should be . or *+, but since PostgreSQL won't allow that, +* was the next best thing that worked...

BTW: I tried the same with sql instead of plpgsql as the function language and the timings are the same. So the additional sanity checks and the language overhead effectively cost nothing compared to the unrolling of the vector(s).
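
For illustration, such a sql variant could look about like this (only a sketch under a hypothetical name, and without the sanity checks the plpgsql version has):

CREATE OR REPLACE FUNCTION public._vectordotprod_sql(
    v1 anyarray,
    v2 anyarray)
  RETURNS anyelement AS
$BODY$
  -- for double precision arrays SUM() returns double precision,
  -- which matches the resolved anyelement return type
  SELECT SUM(a*b) FROM unnest(v1,v2) v (a,b); -- 9.5
$BODY$
  LANGUAGE sql IMMUTABLE STRICT;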

Thursday, December 24, 2015

FORTRAN 90 like vector operations in PostgreSQL revisited

I tried to compare the arraymath extension suggested in a comment with my plpgsql solution from the last post.

PostgreSQL was 9.5rc1 on Ubuntu 14.04 LTS.

make
make install
create extension arraymath;

No problems so far, but...

select array(select random() from generate_series(1,1000000)) @+ array(select random() from generate_series(1,1000000));

NOTICE:  Hello from ArrayMath 1.0
********** Error **********

Server crash!

The plpgsql code, on the other hand, does the addition of two one-million-element vectors in 1.2 seconds.

Interestingly this works:

SELECT ARRAY[1,2,3,4,5,6,7,8] @+ ARRAY[1,2,3,4,5,6,7,8];

While this crashes:

SELECT ARRAY[1,2,3,4,5,6,7,8,9] @+ ARRAY[1,2,3,4,5,6,7,8,9];

So arraymath seems to fail with vectors of more than 8 elements. Probably only on 9.5, but I couldn't get it to work properly.


Sunday, December 20, 2015

FORTRAN 90 like vector operations in PostgreSQL

Let's say you have two vectors A and B and want to do element-wise operations on them, creating a new vector C. FORTRAN 90 (and by now many other languages) has a concise notation for that. E.g.

C = A + B

does the element wise addition of A and B into C.

Can we have that in PostgreSQL too?

CREATE OR REPLACE FUNCTION _vectoradd(
    v1 anyarray,
    v2 anyarray)
  RETURNS anyarray AS
$BODY$
DECLARE
BEGIN
if array_ndims(v1) != 1 OR array_ndims(v2) != 1 Then
RAISE EXCEPTION 'Only vectors supported';
end if;

if array_length(v1,1) != array_length(v2,1)  Then
RAISE EXCEPTION 'Vector size mismatch';
end if;

-- return array(select v.a + v.b from (select unnest(v1) as a, unnest(v2) as b) v); --9.4
return array(select a+b from unnest(v1,v2) v (a,b)); -- 9.5
END;
$BODY$
  LANGUAGE plpgsql immutable strict;

CREATE OPERATOR + (
    LEFTARG = anyarray,
    RIGHTARG = anyarray,
    PROCEDURE = _vectoradd,
    COMMUTATOR = +);

select ARRAY[1,2,3,4] + ARRAY[1,2,3,4];

{2,4,6,8}

Yes, we can! :-)

Friday, December 11, 2015

The MongoDB BI analytics connector ... revisited

Woah, comments! :-)

Instead of individually answering them, I'll try to write a follow up.

In the meantime I've read the MongoDB BI connector setup guide. First, it doesn't mention that Python is required, so the real thing could be written in C. It makes some sense to develop the concept with Python / Multicorn first and, if it works, go to C for performance. RDKit is developed that way.

Second, I assumed that it would pull data from MongoDB into PostgreSQL for performance reasons. Apparently it doesn't. From the look of it, PostgreSQL is just the runtime for the FDW, routing queries directly to MongoDB.

This attenuates my initial concern a bit: from a managerial point of view, a setup where you have to administer both MongoDB and PostgreSQL, and develop against both, will soon raise the question whether one database wouldn't do.

And with Tableau, explicitly mentioned in the installation document, the winner wouldn't be MongoDB.

I've worked with Tableau and it is totally geared towards the relational model. To make things worse for anything non-SQL, it relies on ODBC, which further limits the queries and data types it understands.
If accessing PostgreSQL's advanced features from Tableau is already difficult, talking to MongoDB directly would be hard. So this happened instead...

Still, if there is no business reason why it must be MongoDB and you couldn't go with PostgreSQL jsonb, for example, then justifying sticking with MongoDB once the users have tasted NewSQL (aaah, those buzzwords) blood could be difficult, especially with the BI folks, who just love having a rich analysis toolbox right in the database. So it's a slippery route for MongoDB as a company to introduce their users to a viable competitor product.

As for PostgreSQL not being a distributed database: it isn't out of the box, but it can be. Whether and what will work for you depends, as always, on the use case:

BDR
pg_shard
PostgreSQL FDW
PL/Proxy
pgpool-II
pmpp

Pick your poison...

(PostgreSQL can even do map/reduce, if you want it to. The mind's the limit.)

Wednesday, December 9, 2015

The MongoDB BI analytics connector ... PostgreSQL FDW

Now this gets interesting. Apparently MongoDB will get a BI connector - which seems to be a Multicorn foreign data wrapper for PostgreSQL!

While the step is logical in some way, given MongoDB's limited built-in analytical capabilities vs. what PostgreSQL can do by declaration, e.g. CTEs, window functions or TABLESAMPLE, this could also backfire badly. Well, I'm almost convinced it will backfire.

PostgreSQL already has 'NoSQL' capabilities like native JSON and HSTORE k/v, and there is ToroDB, emulating a wire-protocol-compatible MongoDB on top of PostgreSQL. There is already work on views (Slide 34) in ToroDB, which will enable users to query documents stored in ToroDB not only with the MongoDB query language but also with SQL, thus seamlessly integrating ToroDB document data with plain PostgreSQL relational data.
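
Just to illustrate the 'NoSQL' part with a trivial, made-up example: documents go into a jsonb column and can be queried (and GIN-indexed) with containment operators:

CREATE TABLE docs (id serial PRIMARY KEY, doc jsonb);

INSERT INTO docs (doc) VALUES ('{"name": "Alice", "tags": ["postgres", "nosql"]}');

CREATE INDEX ON docs USING gin (doc);

SELECT id, doc->>'name' FROM docs WHERE doc @> '{"tags": ["nosql"]}';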

Then, there is no reason to use MongoDB at all, except maybe data ingestion speed. Data ingestion in ToroDB is way slower than with a 'real' MongoDB, but this is being worked on.

And from my experience in a current project, with a bit of anticipatory thinking, PostgreSQL data ingestion speed can at least challenge MongoDB - with security, integrity, transactions and all - on a server with a quarter of the CPU cores the Mongo server has.

So, the wolf and the lamb will feed together, and the lion will eat straw like the ox... - there are truly interesting times ahead. :-)