Thursday, June 25, 2015

nseq - A datatype for the efficient storage of nucleotide sequences in PostgreSQL

The nseq datatype stores DNA and RNA sequences, consisting of the letters AGCT or AGCU respectively, in PostgreSQL.

By encoding four bases per byte (two bits per base), it uses 75 percent less space on disk than text. The idea is obvious and far from novel, but it is still one of the most efficient compression schemes for DNA/RNA, if not the most efficient.
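To make the space saving concrete, here is a minimal sketch of 2-bit packing in Python. It is not the nseq implementation: the bit assignment (A=0, C=1, G=2, T=3), the big-endian ordering within a byte, and the function names `pack` and `unpack` are all assumptions made for illustration, and only the DNA alphabet is handled.

```python
# Hypothetical 2-bit code per base; the mapping nseq actually uses is not stated here.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)  # round up to whole bytes
    for i, base in enumerate(seq):
        shift = 6 - 2 * (i % 4)           # first base goes into the top two bits
        out[i // 4] |= CODE[base] << shift
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Expand packed bytes back into text; length is the number of bases."""
    return "".join(
        BASES[(data[i // 4] >> (6 - 2 * (i % 4))) & 3]
        for i in range(length)
    )


packed = pack("GATTACA")
print(len("GATTACA"), "bases ->", len(packed), "bytes")
print(unpack(packed, 7))
```

The sequence length has to be stored alongside the packed bytes, because the padding bits in the last byte are indistinguishable from trailing A's under this mapping.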

As of now, nseq supports only very basic native operations. The main shortcoming is that a sequence has to be decompressed into text, expanding it by 4x, for operations such as substring extraction.

This will change.

Enough said - here it is...

Check out PostBIS as well; it already has many more features.

2 comments:

  1. PostBIS seems to have some more features, e.g. an efficient substring operation: https://colab.mpi-bremen.de/wiki/display/pbis/PostBIS . It was developed by a student in our group, and we have been using it for quite some time without any issues. Would be happy to join forces!?

  2. To be honest, I took a look at PostBIS before starting nseq and it is way more feature complete. To me, nseq is a way of learning more about the problem and how to write datatypes for PostgreSQL. :-)

    I'll definitely take a second look at PostBIS now, especially at how well the PostBIS dna_sequence and rna_sequence types compress compared to nseq, and yes, maybe there is a chance of collaboration.

    BTW: If you need a good chemoinformatics companion for PostBIS, you might want to take a look at pgchem::tigress. :-)
