The plate is bad: FDW

Showing posts with label FDW. Show all posts

Friday, February 14, 2020

Excel and ODF support for cloudfs_fdw

cloudfs_fdw now supports .xls (Excel 97-2003), .xlsx, and .ods (Open Document Format) Spreadsheets via pandas, xlrd, and odfpy. It requires pandas >= 1.0.1, so Multicorn must be compiled against Python 3.

Since pandas provides sorting and filtering capabilities, cloudfs_fdw tries to push down SQL qualifiers and sort keys when they can be translated into pandas notation.

Take a look and have fun.

Wednesday, September 25, 2019

cloudfs_fdw

Since I needed a Foreign Data Wrapper for files stored on S3, and the ones I found did things like loading the whole file in memory before sending the first rows, I wrote my own, using Multicorn.

Along the way, I discovered libraries like smart-open and ijson that allow to stream various file formats from various filesystems - and so this escalated a bit, into cloudfs_fdw.

It currently supports CSV and JSON files from S3, HTTP/HTTPS sources and local or network filesystems but since smart-open supports more than that (e.g. HDFS, SSH), it certainly can be extended if needed.

For now, have fun.

Thursday, September 15, 2016

Blending gene sequence variation data into PostgreSQL

VCF, the Variant Call Format, is a clever idea. Instead of storing all genetic information of a sampled genome, it only stores the delta against some reference genome. This squeezes a lot of redundancy out of the data and thus occupies a lot less storage space.

Unfortunately, VCF is also a unwieldy format. Only a part is fixed, with metadata in the header describing fields in the body which in turn describe the actual data format of the samples.

This makes VCF especially hard to grasp for systems like Hadoop that work on chunked files spread over many compute nodes. With VCF, every chunk has to carry a copy of the header to make sense of the data it carries.

Formats like ADAM are under development that tackle this (any many more) problems, but when I was given the task to make VCF files accessible to a database last year, I took the direct route:

Since there have been already quite powerful tools and libraries to work with VCF files around for some time, I used Multicorn and PyVCF and wrote a foreign data wrapper for PostgreSQL that understands the VCF format.

After it was finished, I realized that it had been done before...

However, my implementation is different in some parts. Most notably it abstracts more from the actual storage of the VCF files and it works with vanilla PyVCF and does not need any special modifications.

A few days ago, I was granted permission to release the code into the wild by the people who paid for it in the first place, so now you have the freedom of choice. ;-)

Ladies and Gentlemen, please take a look at just another multicorn based foreign data wrapper for VCF files for PostgreSQL, the one and only truly integrarelational DBMS.

Wednesday, March 9, 2016

More fun with a integrarelational DBMS: SoilGrids

While the SoilGrids FDW in my first post on this subject works fine, I now think there is a better, more flexible, and more elegant way to do it.

Since PostgreSQL has JSON built in, why not return the whole response and parse with SQL? This way you can get all the data from SoilGrids without having to return a bazillion columns, the JSON response can be stored for future use, and indexed as well.

And this is how it looks like:

CREATE FOREIGN TABLE public.soilgrids1km
(response jsonb , -- json also works, PostgreSQL does the correct cast for us
latitude real ,
longitude real )
SERVER soilgrids_srv;

select (response->'properties'->'PHIHOX'->'M'->>'sd1')::REAL / 10.0::real from soilgrids1km where latitude = '51.57' and longitude = '5.39'

5.8

And again, that level of extensibility is one of the reasons why I think that PostgreSQL is one awesome DBMS. Integrarelational even...

Friday, February 12, 2016

Fun with a integrarelational DBMS: SoilGrids

PostgreSQL has many ways to extend it's capabilities in well defined ways: Custom datatypes, custom functions, custom operators, even custom indexes.

And then there's the Foreign Data Wrapper, an API to pull (almost) any kind of data into PostgreSQL and treat it (almost) like a native table. There is already an impressive list of them, but sometimes you have to write your own one. Like yesterday, when I wanted to integrate pH data from the SoilGrids1km project into an already existing PostgreSQL/PostGIS system.

The data from SoilGrids is freely available, so I just could have downloaded it and put it into PostGIS. But the data set is a) huge and b) constantly updated and c) they have a REST API, so why not tap into it directly?

Maybe because the native language of the Foreign Data Wrapper is C and that's not exactly well suited for a fast prototype. :-) But then there is Multicorn, a bridge between the FDW API and Python, so I gave it a try...

After two hours, including the installation of Multicorn itself, I had this up and running:

CREATE SERVER soilgrids_srv
FOREIGN DATA WRAPPER multicorn
OPTIONS (wrapper 'soilgrids_fdw.SoilGridsForeignDataWrapper');

CREATE FOREIGN TABLE public.soilgrids1km
(latitude real ,
longitude real ,
ph_l real ,
ph_m real ,
ph_u real ,
depth real ,
publication_date date )
SERVER soilgrids_srv;

select * from public.soilgrids1km where latitude = '51.57' and longitude = '5.39'

latitude longitude ph_l ph_m ph_u depth publication_date
51.57 5.39 4.2 5.8 7.4 -0.025 2014-04-02

I bet a more pythonic person than me could write something like this in under one hour or so.

And that level of extensibility is one of the reasons why I think that PostgreSQL is one awesome DBMS. Integrarelational even...