Thursday, January 28, 2016

ML based prediction with PostgreSQL & PL/R in four rounds - IV

Further digging into the PL/R documentation shows a way to run code at startup and make it globally available.

Round 4

First we need a table called plr_modules:
CREATE TABLE plr_modules (
  modseq int4,
  modsrc text
);
Then we add the necessary entries:

The SVM is now globally available and the predictor function can be reduced to the following:

Let's run this statement three times again:

select s.*, r_predict4(s.*) from generate_series(1,1000) s;

541 ms for the first run. 281 ms for each of the following two. Average: 368 ms.

That's only a 73% improvement compared to the original code. predict3() is faster then predict4().

Since the only difference is, that the mysvm object is now global, that might explain the difference. Maybe it is more expensive to work with global than local objects in R?

But there is another way, suggested by a reader of the first installment of this series.

Push the whole loop into R and wrap it into an SRF.


Let's see and run this statement three times again:

select * from r_predict5(array(select * from generate_series(1,1000)));

341 ms for the first run. 31 ms for each of the following two. Average: 134 ms.

Now that's a 90% improvement compared to the original code. Ten times faster.
If only the post initialization runs are taken into account it's even better: about 45x.

If you process sets anyway and you can live with passing arrays we have a winner here!

Bottom line:
  1. Do expensive initializations only once.
  2. Pre-can R objects that are immutable and expensive to create with saveRDS().
  3. If you process sets, push the whole loop down to R and stay there as long as possible.

No comments:

Post a Comment