## Thursday, January 29, 2015

### Finding mass spectra with PostgreSQL: Tanimoto similarity in SQL

Now that we can store mass spectra in PostgreSQL, the next question is how to find them?

There is more than one solution to this problem. One is the Tanimoto coefficient or Jaccard index of the peaks.

The Tanimoto coefficient T can be calculated as T = c / (a+b-c) with c the number of peaks both spectra have in common and a and b the number of peaks in each spectrum.

Similarity is then expressed as a number between 0 and 1, with 0 meaning no and 1 maximum similarity. Likewise the Tanimoto distance 1-T expresses the distance between two spectra.

On a table like

CREATE TABLE <spectra_table>
(
id integer NOT NULL,
"m/z" numeric NOT NULL,
intensity numeric NOT NULL

)

this can easily be done in SQL with a common table expression:

WITH SQ AS (select "m/z" from <spectra_table> where id=1),
ST AS (select "m/z" from <spectra_table> where id=2),
SC AS (select count(1)::float as common from SQ, ST where sq."m/z" = st."m/z")
select (select common FROM SC) / ((select count(1) from sq) + (select count(1) from st) - (SELECT common FROM SC))
as spectral_tanimoto;

If desired, further properties like the peak intensity can be used to narrow the match:

WITH SQ AS (select "m/z", intensity from <spectra_table> where id=1),
ST AS (select "m/z", intensity from <spectra_table> where id=2),
SC AS (select count(1)::float as common from SQ, ST where sq."m/z" = st."m/z" and sq.intensity = st.intensity)
select (select common FROM SC) / ((select count(1) from sq) + (select count(1) from st) - (SELECT common FROM SC))
as spectral_tanimoto;

The problem here is, that the join between the peaks relies on exact matching properties. 'Fuzzy' joins can relax this to some extent. The simple variant is to cast the m/z values to integers:

WITH SQ AS (select "m/z"::int from <spectra_table> where id=1),
ST AS (select "m/z"::int from <spectra_table> where id=2),
SC AS (select count(1)::float as common from SQ, ST where sq."m/z" = st."m/z")
select (select common FROM SC) / ((select count(1) from sq) + (select count(1) from st) - (SELECT common FROM SC))
as spectral_tanimoto;

PostgreSQL's range types allow finer grained control:

WITH SQ AS (select "m/z" from <spectra_table> where id=1),
ST AS (select "m/z" from
<spectra_table> where id=2),
SC AS (select count(1)::float as common from SQ, ST where numrange(sq."m/z"-0.25, sq."m/z"+0.25,'[]') && numrange(st."m/z"-0.25, st."m/z"+0.25,'[]'))
select (select common FROM SC) / ((select count(1) from sq) + (select count(1) from st) - (SELECT common FROM SC))
as spectral_tanimoto;

Neat. But this comes with a serious caveat, since 'fuzzy' joins may produce excess tuples by their very nature. Given a table with two entries 1.1 and 0.9, joining with 1.1 and a range of +/- 0.2, now both tuples will match, generating two result tuples!

In the context of the Tanimoto coefficient, this can lead to the anomaly that the numerator c becomes larger than the denominator. If this happens, T > 1 or T < 0 may occur, which are strictly undefined. So handle with care.

The nicer properties of the calculation are, that you can easily transform spectra measured with different ionization levels and that it can be made run very fast, if you know the target T you like to filter with beforehand.

1.) Since an m/z of 50 measured at an ionization of +1 means a mass of 50 u but a m/z of 50 measured at ionization +2 means a mass of 100 u (m/z is the ratio between mass and charge of the Ion measured), the statements above would give false results if the ionization was different and not taken into account. This can be done at the join level:

...where sq."m/z"*2 = st."m/z"...

Usually the MS software does that for you so you have normalized spectra, but anyway...

2.) If you know the number of peaks in each spectrum and the target T you want to filter with, similarity searches can be made very fast. Here is the paper on how to do it.

That's it for today. Spectral contrast in SQL next.