Thursday, August 25, 2016

Timeseries: How long can the elephant remember?

Frankly, I don't know from experience where the practical limit for the number of rows in a single PostgreSQL table lies, but the interwebs seem to agree on 10^9 for narrow tables.

After a lively discussion yesterday with a NoSQL aficionado about the (in)ability to effectively store timeseries data in an RDBMS, I made a quick calculation.

Timeseries data usually comes as a triple of the form (key, timestamp, value), so it can be stored in a pretty narrow table; hence I stick to the 10^9 row limit.

If we get a data point every second, we can store 10^9 seconds' worth of data. 10^9 seconds is about 16,666,667 minutes, or 277,778 hours, or 11,574 days, which is good for about 31 years of recording.

Every second of 31 years. Per table.
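
To make the assumption concrete, here is a minimal sketch of such a narrow table (all names are placeholders), plus the interval arithmetic redone in SQL:

    CREATE TABLE timeseries (
        key   integer          NOT NULL,
        ts    timestamptz      NOT NULL,
        value double precision NOT NULL,
        PRIMARY KEY (key, ts)
    );

    -- 10^9 one-second samples really are about 31 years of data:
    SELECT interval '1000000000 seconds';  -- 11574 days 01:46:40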

12 comments:

  1. Most people storing time series would laugh at the idea of storing only one data point per second. A typical use case is performance data, where a single app on a single node can generate hundreds of points each minute without being seen as wasteful. On the other hand, you probably don't care about going back more than 1 year with 1-second granularity; you can either delete or aggregate old data.

    Postgres might still be a good tool for that, but you'll have to use many tables, probably with inheritance.

    1. True. If there were no use cases for specialized time series databases, there would not be many of them, I guess. However, in the case discussed, the sensors monitor something slow, so even 5-second resolution might be more than enough. But the data points need to be annotated, so the debate was: is the problem really big enough to justify a dedicated time series database plus another database to store related data, e.g. the annotations, or could it be done with PostgreSQL alone?
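
      A rough sketch of what the PostgreSQL-only variant could look like: the annotations simply live in a second table that references the narrow timeseries table from the post (names are made up for illustration).

        CREATE TABLE annotation (
            key  integer     NOT NULL,
            ts   timestamptz NOT NULL,
            note text        NOT NULL,
            FOREIGN KEY (key, ts) REFERENCES timeseries (key, ts)
        );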

  2. True...
    But I believe that if you use PostgreSQL in conjunction with the pg_pathman extension, you essentially get unlimited time-series storage while maintaining good performance.

    1. Or a long tailed table: http://theplateisbad.blogspot.de/2015/02/the-long-tail-vertical-table.html, depending on the type of queries.

  3. Postgres can hold far more than 2^31 rows in a single table, and quite practically so. Currently the only major limitation on the number of rows a table can hold is that the table itself cannot take up more than 32 TB of hard drive space; even with rows 128 bytes wide, that's rather more than 100x your estimate.
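
    The arithmetic behind that claim, assuming the 32 TB table size limit and 128-byte rows from above, can be checked directly:

      SELECT 32 * 1024::bigint * 1024 * 1024 * 1024 / 128 AS max_rows;
      -- 274877906944, i.e. roughly 275 * 10^9 rows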

    1. Furthermore, there is the option of using inheritance and partitioning data by time, allowing you to create a literally unlimited number of tables, each with hundreds of billions of rows.

      2.147 billion of anything was a lot 15 years ago; more than anyone was likely to need to keep track of. Today it is beginning to verge on mundane, and there are many practical scenarios where a 32-bit ID is not sufficient. Likewise with databases: any database that chokes on a mere 2.2 billion rows is rather poor.
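
      For illustration, a minimal sketch of that inheritance-based time partitioning (the 9.x style available in 2016), assuming the parent is the narrow timeseries table from the post; child names and ranges are made up:

        CREATE TABLE timeseries_2016_08 (
            CHECK (ts >= '2016-08-01' AND ts < '2016-09-01')
        ) INHERITS (timeseries);

        CREATE TABLE timeseries_2016_09 (
            CHECK (ts >= '2016-09-01' AND ts < '2016-10-01')
        ) INHERITS (timeseries);

        -- With constraint exclusion, queries against the parent only scan
        -- the children whose CHECK constraints can match the WHERE clause:
        SET constraint_exclusion = partition;
        SELECT count(*) FROM timeseries WHERE ts >= '2016-08-25' AND ts < '2016-08-26';

      Routing inserts into the right child is left to a trigger or to the application.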

  4. I typically denormalize and use an array datatype to store a range of timeseries samples per row.
    Add a view to get the classic row-per-sample data back out, and a stored procedure or two to handle updates/inserts.
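
    As I read it, that layout is roughly the following (all names invented): one row holds a chunk of a series in parallel arrays, and a view unnests them back into rows.

      CREATE TABLE timeseries_chunk (
          key   integer       NOT NULL,
          ts    timestamptz[] NOT NULL,
          value float8[]      NOT NULL
      );

      -- unnest() several arrays in parallel (9.4+) to restore the
      -- classic one-row-per-sample shape:
      CREATE VIEW timeseries_flat AS
      SELECT c.key, u.ts, u.value
        FROM timeseries_chunk AS c,
             unnest(c.ts, c.value) AS u(ts, value);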


    1. I just tried it the other way round: storing data points from different sources belonging to the same timestamp in an array attached to that timestamp.
      With array_positions() in 9.5 I can find any entry in a 10^6-element array in 50 milliseconds, so arrays are definitely not slow and are a viable option.
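
      Roughly what that experiment looks like (table name and values are invented):

        CREATE TABLE sample_by_ts (
            ts   timestamptz NOT NULL PRIMARY KEY,
            vals float8[]    NOT NULL
        );

        -- array_positions() (new in 9.5) returns every index at which a
        -- value occurs, even in very large arrays:
        SELECT ts, array_positions(vals, 42.0::float8) AS hits
          FROM sample_by_ts
         WHERE ts = '2016-08-25 23:45:00+02';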

    2. Array datatypes also have the benefit of transparent compression when they get sent to TOAST. Combined with the savings on indexes (vs. a normalized representation), you can get some very interesting use cases. I just wish I didn't have to re-invent the wheel every time with views/stored procs/etc. Also, I haven't gone back and looked at how this compares with BRIN indexes.

    3. I just took a brief look at BRIN, and it seems to work as advertised. Index creation on a 100 million row timeseries table takes 3 minutes for btree and 15 seconds for BRIN. Queries like select count(*) from timeseries2 where ts between '2016-08-25 23:44:39.778459+02' and '2016-08-25 23:45:39.778459+02' (covering 50 million rows) take 8 seconds with btree and 6 seconds with BRIN.
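
      For reference, the statements behind such a comparison (index names are made up; the table and query are the ones quoted above). To time the two fairly, build one index at a time or drop the other before rerunning:

        CREATE INDEX timeseries2_ts_btree ON timeseries2 USING btree (ts);
        CREATE INDEX timeseries2_ts_brin  ON timeseries2 USING brin  (ts);

        EXPLAIN ANALYZE
        SELECT count(*)
          FROM timeseries2
         WHERE ts BETWEEN '2016-08-25 23:44:39.778459+02'
                      AND '2016-08-25 23:45:39.778459+02';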

  5. In a tick database use case, there can easily be hundreds of events per millisecond. Partitioning can address this use case at least partially.

    1. True. If there were no use cases for specialized time series databases, there would not be many of them, I guess. However, in the case discussed, the sensors monitor something slow, so even 5-second resolution might be more than enough. But the data points need to be annotated, so the debate was: is the problem really big enough to justify a dedicated time series database plus another database to store related data, e.g. the annotations, or could it be done with PostgreSQL alone?
