Wednesday, February 11, 2015

Finding mass spectra with PostgreSQL: Indexing

When naively searching for a specific spectral contrast angle, e.g.

select id from spectrum, round(spectral_contrast('[{"m/z":144.1150218,"intensity":0.908209840784744},{"m/z":145.118459841064,"intensity":0.0841924148716925}]'::json, id),2) as spectral_contrast where spectral_contrast = 1.0;

performance is not really stellar on my ~ 270k spectra database:

Execution time: 18695.619 ms for 71 hits.

This is no surprise, since the database has to calculate the spectral contrast angle for all rows and filter each row. How about indexing?

The simplest method I can envision, is to precalculate the number of peaks for each spectrum, index the column and add it to the query. Since the spectral contrast angle requires the number of peaks of the compared spectra to be equal, this is a safe restriction.

select id from spectrum, round(spectral_contrast('[{"m/z":144.1150218,"intensity":0.908209840784744},{"m/z":145.118459841064,"intensity":0.0841924148716925}]'::json, id),2) as spectral_contrast where spectral_contrast = 1.0 and num_peaks = 2;

Execution time: 797.346 ms for 71 hits.

OK, but can we do better? Yes, by using a more selective restriction, like the lower and upper bounds of the m/z values of the spectra, index the column and add it to the query. These can easily be obtained by min(m/z)and max(m/z) and then be stored in a numrange column. Since the spectral contrast angle requires the m/z values of the peaks of the compared spectra to be equal, this again is a safe restriction.

select id from spectrum, round(spectral_contrast('[{"m/z":144.1150218,"intensity":0.908209840784744},{"m/z":145.118459841064,"intensity":0.0841924148716925}]'::json, id),2) as spectral_contrast where spectral_contrast = 1.0 and mzrange = numrange(144.1150218,145.118459841064,'[]');

Execution time: 35.128 ms for 71 hits.

About 534 times faster.

Using a range type instead of storing the lower and upper bound in discrete columns gives more flexibility when querying with other criteria, like the Tanimoto index. Then, PostgreSQL's range operators, like &&  (overlap), might come in handy.

No comments:

Post a Comment