Thursday, June 25, 2015

nseq - A datatype for the efficient storage of nucleotide sequences in PostgreSQL

The nseq datatype stores DNA and RNA sequences, consisting of the letters AGCT or AGCU respectively, in PostgreSQL.

By encoding four bases per byte, i.e. two bits per base, it uses 75 percent less space on disk than text. The idea is obvious and far from novel, but it is still one of the most efficient compression schemes for DNA/RNA, if not the most efficient.
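
To illustrate the four-bases-per-byte idea, here is a minimal Python sketch. It is not the nseq implementation: the bit assignment (A=0, C=1, G=2, T=3), the function names pack and unpack, and the choice to fill each byte from the high bits downward are all assumptions made for this example, and it handles DNA letters only.

```python
# Hypothetical 2-bit code table; the bit assignment nseq actually uses may differ.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a DNA string into two bits per base, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)  # round up to whole bytes
    for i, base in enumerate(seq):
        # Fill each byte from the most significant bit pair downward.
        out[i // 4] |= CODES[base] << (6 - 2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Expand packed bytes back into text.

    The base count must be passed in, because the last byte may be padded.
    """
    return "".join(
        BASES[(data[i // 4] >> (6 - 2 * (i % 4))) & 0b11]
        for i in range(length)
    )


if __name__ == "__main__":
    seq = "GATTACA"
    packed = pack(seq)
    print(len(seq), "bases ->", len(packed), "bytes")
    print(unpack(packed, len(seq)))
```

Note that the number of bases has to be kept alongside the packed bytes, since a sequence whose length is not a multiple of four leaves unused bits in the last byte.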

As of now, nseq supports only very basic native operations. The main shortcoming is that a sequence has to be decompressed into text, expanding it by 4x, for operations such as substring.

This will change.

Enough said - here it is...

Check out PostBIS as well; it already has many more features.

Wednesday, June 3, 2015

Update to pgchem::tigress isotope pattern generation code

The isotope_pattern() function now contains data for the stable isotopes of 82 elements.

Thus, it fully supports HMDB, UNPD and ChEBI (except the transuranics) and is available here.

The affected files are obwrapper.cpp and libmercury++.h, in case you want to update your installation in place.