Thursday, March 20, 2014

Eclipse compiler produces faster FP code?

While testing this very simple FP code in Java:

package the.plateisbad;

public class Simple {

  public static void main(String[] args) {
    final long arg_flops;
    double i,x = 0.0d, y = 0.0d;
    final long start, end;

    if (args.length != 2) {
      System.out.println("Usage: the.plateisbad.Simple ");
      return;
    }

    arg_flops = Long.parseLong(args[0]);
    y = Long.parseLong(args[1]);

    System.out.println("Thinking really hard for " + arg_flops + " flops...");

    start = System.currentTimeMillis();

    for (i = 0; i < arg_flops; i++) {
      x = i * y;
    }

    end = System.currentTimeMillis();

    System.out.println("We calculated: " + x + " in " +(end-start)+ " ms");
  }
}



I stumbled over the fact that it runs considerably faster when compiled with the Eclipse compiler (ECJ) than with standard javac.

With ECJ, executed with JDK 1.7:

java -server the.plateisbad.Simple 1000000000 3

Thinking really hard for 1000000000 flops...
We calculated: 2.999999997E9 in 1964 ms

With javac, executed with JDK 1.7:

java -server the.plateisbad.Simple 1000000000 3

Thinking really hard for 1000000000 flops...
We calculated: 2.999999997E9 in 3514 ms

With the new JDK 1.8, there is no noticeable difference between javac and ECJ:

java -server the.plateisbad.Simple 1000000000 3

Thinking really hard for 1000000000 flops...
We calculated: 2.999999997E9 in 3727 ms

but it is always the slowest of the three. The bytecode tells me that ECJ builds a tail-controlled loop that repeats while i < arg_flops:

      64: invokestatic  #52                 // Method java/lang/System.currentTimeMillis:()J
      67: lstore        9
      69: dconst_0     
      70: dstore_3     
      71: goto          84
      74: dload_3      
      75: dload         7
      77: dmul         
      78: dstore        5
      80: dload_3      
      81: dconst_1     
      82: dadd         
      83: dstore_3     
      84: dload_3      
      85: lload_1      
      86: l2d          
      87: dcmpg        
      88: iflt          74
      91: invokestatic  #52                 // Method java/lang/System.currentTimeMillis:()J


while javac builds a head-controlled loop that exits if i >= arg_flops:

      67: invokestatic  #13                 // Method java/lang/System.currentTimeMillis:()J
      70: lstore        9
      72: dconst_0     
      73: dstore_3     
      74: dload_3      
      75: lload_1      
      76: l2d          
      77: dcmpg        
      78: ifge          94
      81: dload_3      
      82: dload         7
      84: dmul         
      85: dstore        5
      87: dload_3      
      88: dconst_1     
      89: dadd         
      90: dstore_3     
      91: goto          74
      94: invokestatic  #13                 // Method java/lang/System.currentTimeMillis:()J
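To make the two listings easier to compare, here is a source-level approximation of both loop shapes. The class and method names are made up for illustration, and since Java has no goto, the tail-controlled variant is written as a guarded do-while. This is only a sketch of the control flow; compiling it will not necessarily reproduce the exact bytecode shown above.

```java
// Source-level approximation of the two loop shapes seen in the bytecode.
// Class and method names are hypothetical.
final class LoopShapes {

  // javac shape: the condition is tested at the head, and an
  // unconditional jump at the bottom goes back to that test.
  static double headControlled(long flops, double y) {
    double x = 0.0d;
    double i = 0.0d;
    while (true) {
      if (i >= flops) {    // corresponds to ifge: leave the loop
        break;
      }
      x = i * y;
      i++;
    }                      // corresponds to goto: back to the test
    return x;
  }

  // ECJ shape: the condition sits at the tail and branches backwards
  // while i < flops. The outer if stands in for the initial goto.
  static double tailControlled(long flops, double y) {
    double x = 0.0d;
    double i = 0.0d;
    if (i < flops) {
      do {
        x = i * y;
        i++;
      } while (i < flops); // corresponds to iflt: back to the body
    }
    return x;
  }

  public static void main(String[] args) {
    System.out.println(headControlled(1000L, 3.0d));
    System.out.println(tailControlled(1000L, 3.0d));
  }
}
```

Both methods compute the same value for the same arguments, so any difference between the two shapes can only be one of execution speed, not of result.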

ECJ also uses StringBuffer where javac uses StringBuilder for the string concatenation, but since that happens outside the loop, it should not make any difference.

Does somebody know what is going on here?

UPDATE: This seems to be an anomaly. SciMark 2.0 shows no significant differences between ECJ and javac or between JDK 1.7 and JDK 1.8, with 1.8 being slightly faster.
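One way to check whether a difference like this is a measurement artifact is to time the same loop several times within a single JVM run, so that an early, possibly not yet fully optimized pass can be told apart from later ones. The following is a minimal sketch under that assumption. The class name, the method names and the number of rounds are made up for illustration, and this is not how SciMark measures.

```java
// Hypothetical harness: repeats the timed loop within one JVM run.
final class RepeatedTiming {

  // The loop from Simple, moved into its own method so it can be called repeatedly.
  static double multiplyLoop(long flops, double y) {
    double x = 0.0d;
    for (double i = 0; i < flops; i++) {
      x = i * y;
    }
    return x;
  }

  // Runs the loop 'rounds' times and returns the elapsed milliseconds of each round.
  static long[] timeRounds(int rounds, long flops, double y) {
    long[] millis = new long[rounds];
    for (int r = 0; r < rounds; r++) {
      long start = System.nanoTime();
      double x = multiplyLoop(flops, y);
      long end = System.nanoTime();
      millis[r] = (end - start) / 1000000L;
      System.out.println("Round " + r + ": " + x + " in " + millis[r] + " ms");
    }
    return millis;
  }

  public static void main(String[] args) {
    timeRounds(5, 100000000L, 3.0d);
  }
}
```

If the per-round times differ a lot from each other, a single measurement like the ones above says little about the compiler that produced the class file.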

Wednesday, March 5, 2014

A suggestion to all architects of high-security buildings ;->

Please, don't put the restrooms outside the security gates!

Wednesday, February 12, 2014

Arbitrary parallel (well, almost) ad-hoc queries with PostgreSQL

Contemporary PostgreSQL lacks the ability to run single queries on multiple cores, nodes etc., i.e. it lacks automatic horizontal scaling. While this seems to be under development, what can be done today?

PL/Proxy allows database partitioning, and its RUN ON ALL clause executes a function on all nodes simultaneously. PL/Proxy is deliberately limited to the partitioned execution of functions and has good reasons for this design. But PostgreSQL can execute dynamic SQL within functions, so let's see how far we can get.

Worker function (on all worker nodes):

CREATE OR REPLACE FUNCTION parallel_query(statement text)
  RETURNS SETOF record AS
$BODY$
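DECLARE
  r record;
BEGIN
  -- Crude guard: only statements that start with SELECT are executed.
  IF lower($1) LIKE 'select%' THEN
    -- Run the dynamic SQL and return each result row to the caller.
    FOR r IN EXECUTE $1 LOOP
      RETURN NEXT r;
    END LOOP;
  ELSE
    RAISE EXCEPTION 'Only queries allowed';
  END IF;
END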
$BODY$
  LANGUAGE plpgsql VOLATILE;


Proxy function (on all head nodes):

CREATE OR REPLACE FUNCTION parallel_query(statement text)
  RETURNS SETOF record AS
$BODY$
 CLUSTER 'head'; RUN ON ALL;
$BODY$
  LANGUAGE plproxy VOLATILE;


Table (on all worker nodes):

CREATE TABLE users
(
  username text NOT NULL,
  CONSTRAINT users_pkey PRIMARY KEY (username)
)
WITH (
  OIDS=FALSE
);


With 10000 rows on two nodes, partitioned by username hash (~5000 on each node),

select * from parallel_query('select * from users') as (username text);

returns all 10000 rows. Since the nodes can be databases within the same server, there is no need for additional hardware, server installations etc. But if more performance is required in the future, adding more boxes is possible.

All it takes is logical partitioning and a bit of PL/pgSQL if you really need to run parallel queries.
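To illustrate what "partitioned by username hash" means, here is a small sketch that spreads usernames over two nodes. Everything in it is an assumption made for illustration: the class and method names are hypothetical, Java's String.hashCode() merely stands in for whatever hash function the cluster actually uses, and the node count is assumed to be a power of two so that the node can be picked from the low bits of the hash.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of hash partitioning; not PL/Proxy code.
final class HashPartitioning {

  // Maps a username to a node index; nodeCount is assumed to be a power of two.
  // String.hashCode() is a stand-in for the database's hash function.
  static int nodeFor(String username, int nodeCount) {
    return username.hashCode() & (nodeCount - 1);
  }

  // Distributes the usernames over nodeCount lists, one list per node.
  static List<List<String>> partition(List<String> usernames, int nodeCount) {
    List<List<String>> nodes = new ArrayList<List<String>>();
    for (int n = 0; n < nodeCount; n++) {
      nodes.add(new ArrayList<String>());
    }
    for (String u : usernames) {
      nodes.get(nodeFor(u, nodeCount)).add(u);
    }
    return nodes;
  }

  public static void main(String[] args) {
    List<String> usernames = new ArrayList<String>();
    for (int i = 0; i < 10000; i++) {
      usernames.add("user_name_" + i);
    }
    List<List<String>> nodes = partition(usernames, 2);
    System.out.println("node 0: " + nodes.get(0).size() + " rows");
    System.out.println("node 1: " + nodes.get(1).size() + " rows");
  }
}
```

Each username ends up on exactly one node, which is why a query that runs on all nodes sees every row exactly once.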

There are some differences though. Take the following query:

select * from parallel_query('select max(username) from users') as (username text);

"user_name_9995"
"user_name_9999"

It now returns two maximums, one for each partition. To get the expected result a second stage is needed:

select max(username) from parallel_query('select max(username) from users') as (username text);

"user_name_9999"

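The same applies to other aggregate functions like avg() etc. Note that avg() cannot simply be applied a second time: the overall average has to be recombined from the sum() and count() of each partition.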
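The second stage can be sketched outside the database as well. The following example combines per-partition results for max() and avg(). The class and method names are hypothetical, the numbers for avg() are made up, and Java's String.compareTo() only stands in for the text ordering the database would use.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the second aggregation stage.
final class TwoStageAggregation {

  // Stage two for max(): the maximum of the per-partition maximums.
  static String combineMax(List<String> partitionMaxima) {
    String max = null;
    for (String m : partitionMaxima) {
      if (max == null || m.compareTo(max) > 0) {
        max = m;
      }
    }
    return max;
  }

  // Stage two for avg(): each partition reports {sum, count}.
  // Averaging the per-partition averages would weight the partitions wrongly.
  static double combineAvg(double[][] sumAndCount) {
    double sum = 0.0d;
    double count = 0.0d;
    for (double[] p : sumAndCount) {
      sum += p[0];
      count += p[1];
    }
    return sum / count;
  }

  public static void main(String[] args) {
    System.out.println(combineMax(Arrays.asList("user_name_9995", "user_name_9999")));
    // Partition A holds 1, 2, 3 (sum 6, count 3); partition B holds 10 (sum 10, count 1).
    System.out.println(combineAvg(new double[][] {{6.0d, 3.0d}, {10.0d, 1.0d}}));
  }
}
```

With the made-up numbers, the combined average is 4.0, whereas the average of the two partition averages (2 and 10) would be 6.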

Finally, the proxy function can be hidden in a view:

CREATE OR REPLACE VIEW "users" AS select * from parallel_query('select * from users') as (username text);

Thursday, January 23, 2014

From palloc() to palloc0()

I've just replaced all palloc()/memset() pairs for getting zeroed-out memory with palloc0(). I doubt that this will show a significant speedup, but it fits better into the PostgreSQL memory model and reduces the lines of code.

E.g.:

retval = (text *) palloc (len + VARHDRSZ);
memset(retval,0x0,len + VARHDRSZ);

becomes:

retval = (text *) palloc0 (len + VARHDRSZ);

The changes have been committed to GitHub.

Tuesday, December 17, 2013

It compiles...

>But, since Postgresql 9.3 just came out, I'll see if I can manage at least to compile pgchem::tigress against 9.3, OpenBabel 2.3.2 and Indigo 1.1.11 before 2013 ends...

Short update: I have merged the two pull requests from Steffen Neumann and Björn Grüning into the repository and then made some minor corrections. Writing LANGUAGE C (case-insensitive, but without the '') solves one problem; replacing int4 with int32 solves the other. This was actually not a gcc issue: the PostgreSQL developers have apparently decided to remove int4 as a PostgreSQL-internal data type between 9.2 and 9.3.

I had always wondered why there were so many duplicate but internally identical (e.g. int4 (number of bytes) and int32 (number of bits)) datatypes anyway. Probably some legacy...

Also, pgchem_tigress can now be installed as a relocatable extension with the CREATE EXTENSION mechanism. There is no need to run various installation scripts anymore (though they are still supported).

I still need to put an 'install' target into the Makefile to fully integrate pgchem_tigress into the extension system. Currently, the shared objects still have to be copied by hand before calling CREATE EXTENSION. The PGXS Makefile already takes care of that, so there is progress towards a new release.