Thursday, January 29, 2015

Finding mass spectra with PostgreSQL: Tanimoto similarity in SQL

Now that we can store mass spectra in PostgreSQL, the next question is how to find them?

There is more than one solution to this problem. One is the Tanimoto coefficient or Jaccard index of the peaks.

The Tanimoto coefficient T can be calculated as T = c / (a+b-c) with c the number of peaks both spectra have in common and a and b the number of peaks in each spectrum.

Similarity is then expressed as a number between 0 and 1, with 0 meaning no and 1 maximum similarity. Likewise the Tanimoto distance 1-T expresses the distance between two spectra.

On a table like

CREATE TABLE <spectra_table>
  id integer NOT NULL,
  "m/z" numeric NOT NULL,
  intensity numeric NOT NULL


this can easily be done in SQL with a common table expression:

WITH SQ AS (select "m/z" from <spectra_table> where id=1),
ST AS (select "m/z" from <spectra_table> where id=2),
SC AS (select count(1)::float as common from SQ, ST where sq."m/z" = st."m/z")
select (select common FROM SC) / ((select count(1) from sq) + (select count(1) from st) - (SELECT common FROM SC))
as spectral_tanimoto;

If desired, further properties like the peak intensity can be used to narrow the match:

WITH SQ AS (select "m/z", intensity from <spectra_table> where id=1),
ST AS (select "m/z", intensity from <spectra_table> where id=2),
SC AS (select count(1)::float as common from SQ, ST where sq."m/z" = st."m/z" and sq.intensity = st.intensity)
select (select common FROM SC) / ((select count(1) from sq) + (select count(1) from st) - (SELECT common FROM SC))
as spectral_tanimoto;

The problem here is, that the join between the peaks relies on exact matching properties. 'Fuzzy' joins can relax this to some extent. The simple variant is to cast the m/z values to integers:

WITH SQ AS (select "m/z"::int from <spectra_table> where id=1),
ST AS (select "m/z"::int from <spectra_table> where id=2),
SC AS (select count(1)::float as common from SQ, ST where sq."m/z" = st."m/z")
select (select common FROM SC) / ((select count(1) from sq) + (select count(1) from st) - (SELECT common FROM SC))
as spectral_tanimoto;

PostgreSQL's range types allow finer grained control:

WITH SQ AS (select "m/z" from <spectra_table> where id=1),
ST AS (select "m/z" from
<spectra_table> where id=2),
SC AS (select count(1)::float as common from SQ, ST where numrange(sq."m/z"-0.25, sq."m/z"+0.25,'[]') && numrange(st."m/z"-0.25, st."m/z"+0.25,'[]'))
select (select common FROM SC) / ((select count(1) from sq) + (select count(1) from st) - (SELECT common FROM SC))
as spectral_tanimoto;

Neat. But this comes with a serious caveat, since 'fuzzy' joins may produce excess tuples by their very nature. Given a table with two entries 1.1 and 0.9, joining with 1.1 and a range of +/- 0.2, now both tuples will match, generating two result tuples!

In the context of the Tanimoto coefficient, this can lead to the anomaly that the numerator c becomes larger than the denominator. If this happens, T > 1 or T < 0 may occur, which are strictly undefined. So handle with care.

The nicer properties of the calculation are, that you can easily transform spectra measured with different ionization levels and that it can be made run very fast, if you know the target T you like to filter with beforehand.

1.) Since an m/z of 50 measured at an ionization of +1 means a mass of 50 u but a m/z of 50 measured at ionization +2 means a mass of 100 u (m/z is the ratio between mass and charge of the Ion measured), the statements above would give false results if the ionization was different and not taken into account. This can be done at the join level:

...where sq."m/z"*2 = st."m/z"...
Usually the MS software does that for you so you have normalized spectra, but anyway...

2.) If you know the number of peaks in each spectrum and the target T you want to filter with, similarity searches can be made very fast. Here is the paper on how to do it.

That's it for today. Spectral contrast in SQL next.

Tuesday, January 27, 2015

oracle_fdw: 'cannot use LDAP' means 'don't even touch'

The README for the oracle_fdw clearly says: "You cannot use LDAP functionality both in PostgreSQL and in Oracle, period."

Actually this means: "DO NOT EVEN TOUCH!"

I found this out yesterday when tracking down frequent sigsevs in one of our PostgreSQL servers.

The Oracle instant client was set up with NAMES.DIRECTORY_PATH=(TNSNAMES, LDAP) in sqlnet.ora.

Even when the connect string used in the FDW is in the //<server>:<port>/<service> form, i.e. no Oracle name resolution is required, just allowing LDAP in sqlnet.ora will kill your server.

Thursday, January 22, 2015

Calculating theoretical isotope patterns for mass spectrometry with PostgreSQL and pgchem::tigress

New year, new challenge: mass spectra.

To calculate the theoretical isotope pattern for a given molecule in PostgreSQL you need two things:
  1. a cheminformatics toolkit to determine the element composition of the molecule and it's exact mass
  2. a usably fast and accurate algorithm to infer the isotope pattern from the composition
For 1. I take OpenBabel, for 2. the Mercury7 algorithm as implemented in libmercury++.
Then, the two have to be made working together. Like so:
typedef struct
    size_t num_entries;
    double *mz;
    double *intensity;
    double *intensity_normalized;
    unsigned int *md;
ob_molfile_to_isotope_pattern(char *molfile, int charge, double normal)
    _ISOTOPE_PATTERN *retval;
    vector msa_mz;
    vector msa_abundance;
    vector composition;
    vector m;
    const double limit = 10e-30;
    OBMol mol;
    OBConversion conv;
    string tmpStr (molfile);
    istringstream molstream (tmpStr);
    double mass_diff, monoisotopic_mass, scale = 1.0, delta = numeric_limits::max();
    int i=0;

    //Initialize composition vector to all zeroes
    for(unsigned int i=0; i         composition.push_back(0);


    conv.Read(&mol, &molstream);

    if (mol.Empty ())
        return NULL;

    // Add missing hydrogens, if any
    mol.AddHydrogens(false, false); // All Hs, no PH correction

    //Iterate over atoms and update the composition vector accordingly
    FOR_ATOMS_OF_MOL(atom, mol)
        switch (atom->GetAtomicNum())
        case 1: // H
        case 6: // C
        case 7: // N
        case 8: // O
        case 16: // S
        case 35: // Br
        case 17: // Cl
        case 53: // I
        case 15: // P
        case 9: // F
        default: // Others ignored

    // Calculate isotope patterns with MERCURY7
    if(0 != mercury::mercury(msa_mz, msa_abundance, composition, charge, limit))
        return NULL;

    monoisotopic_mass = mol.GetExactMass();

    // Allocate return value and copy values
    retval = (_ISOTOPE_PATTERN*) calloc(1,sizeof(_ISOTOPE_PATTERN));

    retval->num_entries = msa_mz.size();
    retval->mz=(double*) calloc(msa_mz.size(), sizeof(double));
    retval->intensity=(double*) calloc(msa_abundance.size(), sizeof(double));
    retval->intensity_normalized=(double*) calloc(msa_abundance.size(), sizeof(double));
    retval->md=(unsigned int*) calloc(msa_abundance.size(), sizeof(unsigned int));

    copy(msa_mz.begin(), msa_mz.end(), retval->mz);
    copy(msa_abundance.begin(), msa_abundance.end(), retval->intensity);
    copy(msa_abundance.begin(), msa_abundance.end(), retval->intensity_normalized);

    for(std::vector::iterator it = msa_mz.begin(); it != msa_mz.end(); it++)
        mass_diff = *it-monoisotopic_mass;


        if(abs(mass_diff) < delta)
            delta = abs(mass_diff);
            scale = msa_abundance[i];

    scale = normal / scale;

    for(i=0; retval->num_entries; i++)
        retval->intensity_normalized[i] *= scale;

    copy(m.begin(), m.end(), retval->md);

    return retval;

This parses a V2000 or V3000 molfile into a molecule object, iterates over the atoms and generates the elemental composition for Mercury7, calculates the exact mass and the theoretical isotope pattern. It then stores the m/z vector, the peak vector and the derived peak vector normalized to some value, e.g. 100%, and the mass distance relative to the base peak m/z value. All of this is stored in the C structure _ISOTOPE_PATTERN in order to transfer it from C++ to the C code of PostgreSQL.
Over to the PostgreSQL side of things :-)
PG_FUNCTION_INFO_V1 (pgchem_isotope_pattern);

pgchem_isotope_pattern (PG_FUNCTION_ARGS)
    _ISOTOPE_PATTERN *isotope_pattern = NULL;
    _ISOTOPE_PATTERN *pattern_copy = NULL;
    FuncCallContext *ctx=NULL;
    MemoryContext original_ctx=NULL;
    DECOMPRESSED_DATA *original_data=NULL;
    char *tmpMolfile=NULL;

        MOLECULE *arg_molecule = PG_GETARG_MOLECULE_P (0);
        int32 arg_charge = PG_GETARG_INT32 (1);
        float8 arg_normal = PG_GETARG_FLOAT8 (2);
        ctx = SRF_FIRSTCALL_INIT();

        original_ctx = MemoryContextSwitchTo(ctx->multi_call_memory_ctx);

        if(arg_molecule->original_format==FORMAT_V2000 || arg_molecule->original_format==FORMAT_V3000)
            original_data = decompress_data(CIPTR(arg_molecule),arg_molecule->compressed_sizeo, arg_molecule->sizeo);
            tmpMolfile = strdup(original_data->decompressed_data);
            tmpMolfile = ob_smiles_to_V2000 (SMIPTR(arg_molecule));

        isotope_pattern = ob_molfile_to_isotope_pattern(tmpMolfile, arg_charge, arg_normal);


        if(NULL == isotope_pattern)
            elog (ERROR, "Could not generate isotope pattern!");

        pattern_copy = (_ISOTOPE_PATTERN*) palloc0(sizeof(_ISOTOPE_PATTERN));
        pattern_copy->mz = (double*) palloc0(isotope_pattern->num_entries*sizeof(double));
        pattern_copy->intensity = (double*) palloc0(isotope_pattern->num_entries*sizeof(double));
        pattern_copy->intensity_normalized = (double*) palloc0(isotope_pattern->num_entries*sizeof(double));
        pattern_copy->md = (unsigned int*) palloc0(isotope_pattern->num_entries*sizeof(unsigned int));

        pattern_copy->num_entries = isotope_pattern->num_entries;

        memcpy(pattern_copy->mz, isotope_pattern->mz, isotope_pattern->num_entries*sizeof(double));
        memcpy(pattern_copy->intensity, isotope_pattern->intensity, isotope_pattern->num_entries*sizeof(double));
        memcpy(pattern_copy->intensity_normalized, isotope_pattern->intensity_normalized, isotope_pattern->num_entries*sizeof(double));
        memcpy(pattern_copy->md, isotope_pattern->md, isotope_pattern->num_entries*sizeof(unsigned int));

        if(NULL != isotope_pattern->md)

        if(NULL != isotope_pattern->mz)

        if(NULL != isotope_pattern->intensity)

        if(NULL != isotope_pattern->intensity_normalized)

        if(NULL != isotope_pattern)

        isotope_pattern = NULL;

        ctx->max_calls = pattern_copy->num_entries;
        ctx->user_fctx = pattern_copy;

        if(get_call_result_type(fcinfo, NULL, &ctx->tuple_desc) != TYPEFUNC_COMPOSITE)
            elog (ERROR, "Calling the isotope pattern SRF in a context that does not expect tuples in return is not supported!");



    ctx = SRF_PERCALL_SETUP();

    isotope_pattern = ctx->user_fctx;

    if(ctx->call_cntr < ctx->max_calls)
        HeapTuple rettuple = NULL;
        Datum *retvals;
        bool *retnulls;

        retvals = (Datum*) palloc0(4*sizeof(Datum));
        retnulls = (bool*) palloc0(4*sizeof(bool));

        retvals[0] = Float8GetDatum(isotope_pattern->mz[ctx->call_cntr]);
        retvals[1] = Float8GetDatum(isotope_pattern->intensity[ctx->call_cntr]);
        retvals[2] = Float8GetDatum(isotope_pattern->intensity_normalized[ctx->call_cntr]);
        retvals[3] = UInt8GetDatum(isotope_pattern->md[ctx->call_cntr]);

        retnulls[0] = false;
        retnulls[1] = false;
        retnulls[2] = false;
        retnulls[3] = false;

        rettuple = heap_form_tuple(ctx->tuple_desc, retvals, retnulls);


        SRF_RETURN_NEXT(ctx, HeapTupleGetDatum(rettuple));

This is a set returning function, meaning that it can return multiple rows in one call, i.e. it returns a complete table. Basically it retrieves the molecule argument from PostgreSQL, calls ob_molfile_to_isotope_pattern(), constructs a set of tuples and returns them to PostgreSQL. The basics of set returning functions can be found here and here.
Now we can declare the function to PostgreSQL:
CREATE OR REPLACE FUNCTION isotope_pattern(IN molecule, IN integer DEFAULT 0, IN double precision DEFAULT 100.0)
  RETURNS TABLE("m/z" double precision, intensity double precision, intensity_normalized double precision, mdtbp integer) AS
'libpgchem', 'pgchem_isotope_pattern'

and use it:
select * from isotope_pattern('c1ccccc1I'::molecule) where intensity >= 0.01;
m/z  intensity intensity_normalized mdtbp
203.9436 0.934689933304048 100 0
204.946987674881 0.0634781699018538 6.79136124612619 1


Visualized it looks like this:

Nice. And everything (except the GUI) in the database.

Thursday, December 4, 2014

Remember the Therac!

Personal opinion warning!!!

Every time I read something like this, I immediately think of the famous Therac-25.

But let's start from the bottom...

"If you are deploying technology in your healthcare organization, ask yourself first if your people are ready. If they are not, you are setting the stage for failure."

This is so generic, Mr. Crounse should win a a prize. I'm sure, there is a prize somewhere for phrases like that.

How about:

"If you are deploying technology in your organization, ask yourself first if your people are ready. If they are not, you are setting the stage for failure."

Almost universally true but the gain of knowledge is essentially zilch.

But what really irks me is stuff like this (on the training of physicians):

"We spend our young adulthood immersed in the scholarly pursuit of a medical degree. We take four or more years in specialty training. [...] It’s all about following a certain process, a definitive kind of workflow"

Call me oldschool, but I expect a physician to be well trained and following certain, proven processes, when I lay my fate into her hands!

IT has already a major role in healthcare. Without advanced signal processing and medical imaging, MRI would be useless. Genome data simply cannot be analyzed without the help of computers. The human body can be simulated to some extent to reduce the need of testing new drugs on animals and human beings alike. Particle therapy needs absolute precision controls, provided by computers. And so on and on...

And what do I read?

"They are embracing cloud technologies to streamline IT resources and focus more of those limited resources on that which healthcare systems are all about—providing care to patients and increasingly, focusing on ways to improve population health and disease prevention."

OK, people are expensive, care is expensive so let's use IT to squeeze more out of the staff we have.

There is no single sentence about medical progress in this article. It's from top to bottom about cost-cutting.

I have to give credit to the overall assumption, that something good could be gained  from the integration of health-care data, though. A lot of knowledge about how diseases work and what weapon works best to fight them with could be gained from a holistic view on all the data available.

Yet, in many countries, there is a strict regulation on medical data. And here I have to give credit again - it's all about people. People building systems, people using systems and people having trust in systems.

In 2010 and 2011, physicians manipulated medical data of their patients waiting for a liver transplant in order to push them higher up the waiting list. This so called Organspendeskandal led to the lowest trust and henceforth lowest number of potential organ donors in Germany since 2002.

I am a registered bone marrow donor at DKMS. Yet, we are only a scant 25 million worldwide.  Already way to few. Guess what would happen, if there was a major security breach in this database? Or in any other large health data database?

"Information is no candy bar" they say in Bombardiers, (hilarious and highly recommended) If you eat it the candy bar its gone, but information can be stolen without notice.

Let's assume somebody would steal all the paper patient files from a hospital, or the insurance company. This could be noticed, because a) they are gone and b) somebody might ask what the four trucks full of paper are all about. Electronic files can be stolen without notice on an USB stick.

We, the people, still cannot build reliable and secure software systems at a large scale. SQL Injection is still #1 at OWASP. As it was 2010. 2007 it was on #2. Unbelievable.

Even in highly specialized and controlled areas, we fail, fail, fail.

So, where does this strict regulation on medical data come from again?

Whenever you work on hard or software for healthcare applications - please remember the Therac.

So much useful knowledge could be gained, but so much trust can be lost.

Personal opinion warning!!!

Wednesday, November 26, 2014

A working RAMDisk for PostgreSQL on Windows

Aside from the debate if this is useful or dangerous, I tried several free RAMDisks for Windows, Gavotte RamDisk, AMD Radeon RAMDisk and SoftPerfect RAM Disk - and none of them works well with PostgreSQL.

Either PostgreSQL refuses to use them at all for datafiles or it seems to work, but at some point in time the server tells you, that there is something wrong with the datafiles on the ramdrive and data corruption bites.

Apparently most ramdrives on Windows provide drive emulations only good enough to support simple file storage.

ImDisk is the first I tried, that seems to work well with PostgreSQL.

It's free, has a GUI and a command line, and the driver is properly signed for x64 Windows.

Friday, November 7, 2014

There are people := {a,b} for a ≠ b IN (SELECT people FROM world): Those who understand SQL, and those who have to wait.

Couldn't resist. ;->

I just pushed a view on an Oracle Database from 464 seconds down to 3 seconds (~ 157x speedup or 0.0065% of the original execution time) by removing a few pointless ORDER BYs and replacing one expensive JOIN with subqueries.

Thursday, October 30, 2014

toroDB Adventures: Adding unimplemented commands


> db.createCollection("test")
{ "ok" : 0, "errmsg" : "Unimplemented command: create", "code" : 1000002 }


> db.createCollection("test")
{ "ok" : 1 }

in five simple steps:

1.) Find out how the command is called internally. toroDB tells you that in the message:

"Unimplemented command: create"

so it is "create".

2.) Look up the "create" command. Start in with the QueryCommandGroup enum which contains all enums of all known commands. Since it is an administration command, you'll find it in

3.)  "create" does nothing at the moment, so add some code.
I just copied it from "createIndexes" so


becomes now

create {
        public void doCall(RequestBaseMessage queryMessage, BSONDocument query, ProcessorCaller caller) throws Exception {

4.) Now, ProcessorCaller needs to know about the new command. It's  enclosed in, so by adding

public void create(@Nonnull BSONDocument document) throws Exception {
queryCommandProcessor.create(document, messageReplier);

to ProcessorCaller AND the prototype

public void create(@Nonnull BSONDocument document, @Nonnull MessageReplier messageReplier) throws Exception;

to the enclosing QueryCommandProcessor class, make it known.

5.) Implement the command in, which is the actual implementation of QueryCommandProcessor:

    public void create(BSONDocument document, MessageReplier messageReplier) {
      Map keyValues = new HashMap();
      keyValues.put("ok", MongoWP.OK);
      String collectionName = (String) document.getValue("create");
      Boolean capped = (Boolean) document.getValue("capped");
      Boolean autoIndexId = (Boolean) document.getValue("autoIndexId");
      Boolean usePowerOf2Sizes = (Boolean) document.getValue("usePowerOf2Sizes");
      Double size = (Double) document.getValue("size");
      Double max = (Double) document.getValue("max");
      BSONDocument reply = new MongoBSONDocument(keyValues);

And that's pretty much it. As of now it just reads all allowed values from the command and acknowledges OK. But now everything is set to make it a "real" command if needed.