Monday, May 15, 2023

How to build RDKit 2023 on Windows w/o Conda

It looks like the build instructions for RDKit on Windows have not been updated for some time; at least, they did not work for me. So I dug the correct build process out of the Conda builds:

General instructions:

https://github.com/rdkit/rdkit/blob/master/Docs/Book/Install.md

Required to run the binaries, IF you don't have the Visual Studio C++ compiler and SDK installed:

Visual Studio 2015, 2017, 2019 and 2022 redistributable

https://learn.microsoft.com/en-us/cpp/windows/latest-supported-vc-redist?view=msvc-170

Development environment:

Windows 11 x64

Microsoft Visual Studio 2022 (Community Edition or Professional)

Windows SDK version 10.0.22000.0

Compiler MSVC 19.35.32217.1

PowerShell

Directory structure:

c:\Devel\RDBuild

boost
cairo
eigen3
freetype
RDKit
zlib

RDBASE = C:\Devel\RDBuild\RDKit

RDKit source 2023_03_1:

https://codeload.github.com/rdkit/rdkit/zip/refs/tags/Release_2023_03_1

ZLib 1.2.13:

https://github.com/kiyolee/zlib-win-build

Eigen 3.4.0:

https://gitlab.com/libeigen/eigen/-/archive/3.4.0/eigen-3.4.0.zip

Cairo 1.17.2:

https://github.com/preshing/cairo-windows/releases

Boost 1.82.0:

https://sourceforge.net/projects/boost/files/boost-binaries/

freetype 2.11.1:

https://codeload.github.com/ubawurinna/freetype-windows-binaries

CMake 3.26.3:

https://github.com/Kitware/CMake/releases/download/v3.26.3/cmake-3.26.3-windows-x86_64.msi

PostgreSQL 15.2-1:

https://get.enterprisedb.com/postgresql/postgresql-15.2-1-windows-x64-binaries.zip

Python 3.10.11:

https://www.python.org/downloads/release/python-31011/

Build commands:

1.) Prepare for DLLs and Python package:

c:/cmake/bin/cmake -DRDK_BUILD_PYTHON_WRAPPERS=ON -DBOOST_ROOT=C:/Devel/RDBuild/boost -DRDK_BUILD_CAIRO_SUPPORT=ON -DRDK_BUILD_INCHI_SUPPORT=ON -DRDK_BUILD_AVALON_SUPPORT=ON -DRDK_BUILD_PGSQL=OFF -DPostgreSQL_ROOT="C:\PostgreSQL\15" -DRDK_INSTALL_INTREE=OFF -DCMAKE_INSTALL_PREFIX=c:/RDKit -DEIGEN3_INCLUDE_DIR=C:/Devel/RDBuild/eigen3 -DFREETYPE_INCLUDE_DIRS=c:/Devel/RDbuild/freetype/include -DFREETYPE_LIBRARY="c:/Devel/RDBuild/freetype/release dll/win64/freetype.lib" -DZLIB_INCLUDE_DIR=c:/Devel/RDBuild/zlib/include -DZLIB_LIBRARY=c:/Devel/RDBuild/zlib/libz.lib -DCAIRO_INCLUDE_DIRS=c:/Devel/RDBuild/cairo/include -DCAIRO_LIBRARIES=c:/Devel/RDBuild/cairo/lib/x64/cairo.lib -DRDK_BUILD_FREETYPE_SUPPORT=ON -DRDK_BUILD_COMPRESSED_SUPPLIERS=ON -DRDK_OPTIMIZE_POPCNT=ON -DRDK_INSTALL_STATIC_LIBS=OFF -DRDK_INSTALL_DLLS_MSVC=ON -DCMAKE_BUILD_TYPE=Release -DRDK_BUILD_THREADSAFE_SSS=ON -G"Visual Studio 17 2022" -A x64 ..

2.) Build & install:

C:\CMake\bin\cmake --build . --config=Release --target install

3.) Prepare for static libraries and PostgreSQL extension:

c:/cmake/bin/cmake -DRDK_BUILD_PYTHON_WRAPPERS=OFF -DBOOST_ROOT=C:/Devel/RDBuild/boost -DRDK_BUILD_CAIRO_SUPPORT=ON -DRDK_BUILD_INCHI_SUPPORT=ON -DRDK_BUILD_AVALON_SUPPORT=ON -DRDK_BUILD_PGSQL=ON -DPostgreSQL_ROOT="C:\PostgreSQL\15" -DRDK_INSTALL_INTREE=OFF -DCMAKE_INSTALL_PREFIX=c:/RDKit -DEIGEN3_INCLUDE_DIR=C:/Devel/RDBuild/eigen3 -DFREETYPE_INCLUDE_DIRS=c:/Devel/RDbuild/freetype/include -DFREETYPE_LIBRARY="c:/Devel/RDBuild/freetype/release dll/win64/freetype.lib" -DZLIB_INCLUDE_DIR=c:/Devel/RDBuild/zlib/include -DZLIB_LIBRARY=c:/Devel/RDBuild/zlib/libz.lib -DCAIRO_INCLUDE_DIRS=c:/Devel/RDBuild/cairo/include -DCAIRO_LIBRARIES=c:/Devel/RDBuild/cairo/lib/x64/cairo.lib -DRDK_BUILD_FREETYPE_SUPPORT=ON -DRDK_BUILD_COMPRESSED_SUPPLIERS=ON -DRDK_OPTIMIZE_POPCNT=ON -DRDK_INSTALL_STATIC_LIBS=ON -DRDK_INSTALL_DLLS_MSVC=OFF -DCMAKE_BUILD_TYPE=Release -DRDK_BUILD_THREADSAFE_SSS=ON -G"Visual Studio 17 2022" -A x64 ..

4.) Build & install:

C:\CMake\bin\cmake --build . --config=Release --target install

Usage:

If you want to use the PostgreSQL extension, it needs rdkit.dll, which is automatically copied to the PostgreSQL /lib directory, BUT also freetype.dll and boost_serialization-vc142-mt-x64-1_82.dll. The easiest way is to copy them to the PostgreSQL /lib directory AND set the PATH, e.g. $env:PATH+=';C:\PostgreSQL\15\lib'. You also need to define RDBASE, e.g. $env:RDBASE='c:\RDKit'. Start PostgreSQL, and 'CREATE EXTENSION IF NOT EXISTS rdkit;' should work.
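
As a quick smoke test, something like the following works (a minimal sketch only; psycopg2 and the connection parameters are my assumptions, adjust them to your local setup):

import psycopg2

# Connection parameters are assumptions -- adjust to your local PostgreSQL installation.
conn = psycopg2.connect(dbname="postgres", user="postgres", host="localhost")
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS rdkit;")
    cur.execute("SELECT 'c1ccccc1'::mol;")  # cast a SMILES string to the mol type
    print(cur.fetchone()[0])                # should print a SMILES string if the cartridge works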

If you want to use the Python package, you must point PYTHONPATH to the RDKit installation directory AND, for Python < 3.8, add <where the DLLs live> to the PATH, OR, for Python >= 3.8, add the DLL path explicitly with:

os.add_dll_directory(<where the DLLs live>)

In addition to the RDKit DLLs, freetype.dll, cairo.dll, libz.dll, boost_serialization-vc142-mt-x64-1_82.dll, boost_bzip2-vc143-mt-x64-1_82.dll, boost_iostreams-vc143-mt-x64-1_82.dll, boost_python310-vc143-mt-x64-1_82.dll, and boost_zlib-vc143-mt-x64-1_82.dll must be present.
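
Putting it together, a minimal sketch (the exact directories are assumptions derived from the install prefix c:/RDKit used above; substitute your own paths for <where the DLLs live>):

import os
import sys

# Paths are assumptions based on CMAKE_INSTALL_PREFIX=c:/RDKit above -- adjust to your installation.
RDKIT_INSTALL = r"C:\RDKit"       # the RDKit installation directory (alternative to setting PYTHONPATH)
RDKIT_DLL_DIR = r"C:\RDKit\Lib"   # <where the DLLs live> (assumption)

sys.path.append(RDKIT_INSTALL)

if sys.version_info >= (3, 8):
    os.add_dll_directory(RDKIT_DLL_DIR)   # Python >= 3.8
else:
    os.environ["PATH"] = RDKIT_DLL_DIR + os.pathsep + os.environ["PATH"]  # Python < 3.8

from rdkit import Chem
print(Chem.MolToSmiles(Chem.MolFromSmiles("c1ccccc1")))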

Result:

  • C++ static libraries
  • C++ shared libraries
  • PostgreSQL extension
  • Python package

Sunday, February 5, 2023

The BfArM database of essential drugs in short supply is lacking an API. So I built one.

The BfArM database of essential drugs in short supply is lacking an API. So I implemented one in about 12 hours. The source code can be found on GitHub.

tl;dr 

The state of digitalization in Germany is in dire straits, especially when looking at the healthcare system and government/administration. If APIs exist, they are often difficult to discover and/or undocumented (the private bund.dev project tries to collect and document them in a central repository).

Yet I was surprised when, inspired by an article about the shortage of essential drugs in Germany, I took a closer look at the official governmental database on this matter.

This database is hosted by the "Bundesinstitut für Arzneimittel und Medizinprodukte (BfArM)", and can be found here.

It can be accessed in two ways:

  1. As a filterable, dynamic HTML/JavaScript table rendered using JSF.
  2. As a CSV download.

No API for M2M communication or the like. At least I could not find one. There is also no documentation about the data itself. 

Regarding data quality, I noticed the following issues:

  1. The CSV file is encoded in ISO-8859-1 (Latin-1) and not UTF-8. While this is not uncommon, it is a bit unexpected, since ISO-8859-1 only covers the first 256 Unicode code points. The file encoding is not documented.
  2. The CSV is actually not comma- but semicolon-separated.
  3. Missing data is not only NULL, it is also encoded as "N/A", "n.a.", "-", and "'-". There might be more undocumented encodings.
  4. "*" encodes "Altdatenübernahme war nicht möglich", meaning that older data could not be transferred. The symbol is documented in the legend of the table, but not what it actually means.
  5. The update frequency of the data on display is not documented.

Having worked with databases for about 30 years now, I'd say this data comes directly from some kind of manually curated data set. There is obviously no decent data standardization process in place.

But whining alone doesn't help, so I decided to implement the missing API on a tiny server hosted in Germany. It took me about 12 hours of my private time, including implementing basic data sanitization.
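
To illustrate the data sanitization, here is a minimal sketch based on the observations above (the file name is a placeholder, and the set of "not available" markers is what I have seen so far; there may be more):

import csv

CSV_FILE = "lieferengpaesse.csv"  # placeholder name for the downloaded BfArM CSV

# Values observed to encode "no data" in the source; the list is probably incomplete.
NA_VALUES = {"", "N/A", "n.a.", "-", "'-"}

def sanitize(value):
    # Map the various "not available" encodings to None, keep everything else.
    value = value.strip()
    return None if value in NA_VALUES else value

# The file is ISO-8859-1 encoded and semicolon separated, not UTF-8 and comma separated.
with open(CSV_FILE, encoding="iso-8859-1", newline="") as f:
    reader = csv.DictReader(f, delimiter=";")
    rows = [{key: sanitize(val) for key, val in row.items()} for row in reader]

print(f"{len(rows)} rows read")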

The most difficult part was automating the CSV download, since the submit button calls some JavaScript function and thus can't be triggered by a plain HTTP request or a scraping library like BeautifulSoup. I'm now using a remote-controlled headless browser via Selenium. The fact that the HTML name attribute changes frequently does not help either; this has been solved with an XPath expression on the value attribute.
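
The core of that automation looks roughly like this (a sketch only; the URL and the button's value attribute are placeholders, the real ones come from the BfArM page and may change):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

URL = "https://example.invalid/bfarm-shortage-table"  # placeholder, not the real URL
BUTTON_VALUE = "CSV Export"                           # placeholder value attribute of the submit button

options = Options()
options.add_argument("-headless")           # run Firefox without a visible window
driver = webdriver.Firefox(options=options)

try:
    driver.get(URL)
    # Select the submit button by its value attribute instead of the unstable name attribute.
    button = driver.find_element(By.XPATH, f"//input[@value='{BUTTON_VALUE}']")
    button.click()                          # triggers the JavaScript that starts the CSV download
finally:
    driver.quit()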

The demo API is available under https://3.73.42.17:8443/docs. You will be warned because of the self-signed SSL certificate. This is OK: AWS did not want to register a domain, so no Let's Encrypt. But this just swaps SSL trust semantics for SSH-style (trust on first use) semantics anyway. Since it is for demonstration purposes only and runs on a small server, there is a rate limiter in place. Resource names and data are in German, as in the source system.

The source code can be found on GitHub. Maybe somebody at BfArM realizes that it does not cost a fortune to implement an API on top of what they already have, and finds my example useful to build on.

The public data (I assume it is public, since there is no need to register) is updated daily at 11:00 UTC. In compliance with Article 4 of the GDPR, personal information (client IP address, telephone number, and e-mail address) is not stored or displayed. As mentioned before, the API is hosted in Germany.

I was not intending to run this forever, and outages were possible at any time. By now, I have shut down the service.

Tuesday, December 27, 2022

pg_sentinel - Update

In 2016 I released pg_sentinel as a proof of concept for implementing a sentinel value sensor deep inside the PostgreSQL server. It no longer compiles with PostgreSQL >= 12.x because of changes in the internal API, so I adapted my old code to make it work again.

As a bonus, it does not need SPI any more.

Saturday, July 23, 2022

After 18 years, pgchem::tigress retires

To whom it may concern.

Today I will retire pgchem::tigress, the PostgreSQL chemoinformatics extension based on OpenBabel, after 18 years of service. This decision is based on three main reasons:

  1. I have not touched the GiST index code for at least eight years, but beginning with PostgreSQL 14.x it started to cause SIGSEGVs when building the index on molecules, and I'm unable to find the cause.
  2. OpenBabel 3.x made changes to its API that would require me to rewrite functions or disable them, and those changes are not very well documented.
  3. Since my recent brush with death, I have decided that there are better ways to spend my time than chasing Signal 11s, especially since the RDKit cartridge has come a long way and is more powerful than pgchem::tigress ever was.

This decision was not easy, since building pgchem::tigress was a part of my life. It was the first open source software ever released by Bayer AG (at least in Germany), and it is the foundation of my PhD thesis.

The code will remain public as long as there is a way to publish it.

Friday, May 22, 2020

Native (PostgreSQL only) streaming data tables

If you want to see (and analyze) only a window of data over some continuous data stream in PostgreSQL, one way is to use a specialized tool like the PipelineDB extension. But if you can't do that, e.g. because you are stuck with AWS RDS or for some other reason, streaming data tables, or continuous views, can be implemented with pretty much PostgreSQL alone.

The basic idea is to have a table that allows for fast INSERT operations, is aggressively VACUUMed, and has some key that can be used to prune outdated entries. This table is fed with the events from the data stream and regularly pruned. Voilà: a streaming data table.

We have done some testing with two approaches on an UNLOGGED table: pruning on every INSERT, and pruning at regular intervals. UNLOGGED is not a problem here, since a view on a data stream can be considered pretty much ephemeral.

The timed variant is about 5x - 8x faster on INSERTs. And if you balance the timing and the pruning interval right, the window size is almost as stable.

The examples are implemented in Python 3 with psycopg2. Putting an index on the table can help or hurt performance: INSERT might get slower but pruning with DELETE faster, depending on the size and structure of the data. Feel free to experiment. In our case, a vanilla BRIN index did just fine.
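
A condensed sketch of the timed variant (the table and column names are made up for this example, and the connection string is an assumption; the real code is in the linked examples):

import threading
import time
import psycopg2

DSN = "dbname=postgres user=postgres"  # assumption -- adjust to your setup

def setup(conn):
    # UNLOGGED table with a timestamp key for pruning and a BRIN index on it.
    with conn.cursor() as cur:
        cur.execute("""
            CREATE UNLOGGED TABLE IF NOT EXISTS stream_window (
                ts      timestamptz NOT NULL DEFAULT now(),
                payload jsonb
            );
            CREATE INDEX IF NOT EXISTS stream_window_ts_brin
                ON stream_window USING brin (ts);
        """)

def prune(window_seconds=60, interval_seconds=5):
    # Runs in a daemon thread and deletes everything older than the window.
    conn = psycopg2.connect(DSN)
    conn.autocommit = True
    with conn.cursor() as cur:
        while True:
            cur.execute(
                "DELETE FROM stream_window WHERE ts < now() - %s * interval '1 second'",
                (window_seconds,),
            )
            time.sleep(interval_seconds)

conn = psycopg2.connect(DSN)
conn.autocommit = True
setup(conn)

threading.Thread(target=prune, daemon=True).start()

# Feed the stream: fast INSERTs only, pruning happens in the background.
with conn.cursor() as cur:
    while True:
        # In reality the payload comes from the data stream.
        cur.execute("INSERT INTO stream_window (payload) VALUES (%s)", ('{"value": 42}',))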

Instead of an external scheduler for pruning, like the Python daemon thread in the stream_timed_cleanup.py example, other scheduling mechanisms can of course be used, e.g. pg_cron, a scheduled Lambda on AWS, or similar.

Feel free to experiment and improve...

Tuesday, May 19, 2020

MQTT as transport for PostgreSQL events

MQTT has become a de-facto standard for the transport of messages between IoT devices. As a result, a plethora of libraries and MQTT message brokers have become available. Can we use this to transport messages originating from PostgreSQL?

As the message broker we use Eclipse Mosquitto, which is dead simple to set up if you don't have to change the default settings. Such a default installation is neither secure nor highly available, but for our demo it will do just fine. The event generators are written in Python 3 with Eclipse Paho MQTT for Python.

There are at least two ways to generate events from a PostgreSQL database: pg_recvlogical and NOTIFY / LISTEN. Both have their advantages and shortcomings.

pg_recvlogical:

  • Configured on server and database level
  • Generates comprehensive information about everything that happens in the database
  • No additional programming necessary
  • Needs plugins to decode messages, e.g. into JSON
  • Filtering has to be done later, e.g. by the decoder plugin

NOTIFY / LISTEN:

  • Configured on DDL and table level
  • Generates exactly the information and format you program into the triggers
  • Filtering can be done before sending the message
  • Needs trigger programming
  • The message size is limited to 8000 bytes

Examples for both approaches can be found here. The NOTIFY / LISTEN example lacks a proper decoder, but this makes it a good exercise to start with. The pg_recvlogical example needs the wal2json plugin, which can be found here, and the proper setup, which is also explained in the Readme. Please note that the slot used in the example is mqtt_slot, not test_slot:


pg_recvlogical -d postgres --slot mqtt_slot --create-slot -P wal2json

Otherwise, setup.sql should generate all objects to run both examples.
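
For the NOTIFY / LISTEN path, the forwarding loop boils down to something like this (a sketch only; the channel name, topic, and connection settings are my assumptions, and the trigger that issues the NOTIFY comes from setup.sql):

import select
import psycopg2
import paho.mqtt.client as mqtt

DSN = "dbname=postgres user=postgres"  # assumption -- adjust to your setup
CHANNEL = "mqtt_events"                # assumed NOTIFY channel name
TOPIC = "postgres/events"              # assumed MQTT topic

# Listen on the PostgreSQL notification channel.
conn = psycopg2.connect(DSN)
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(f"LISTEN {CHANNEL};")

# Connect to a default local Mosquitto installation.
client = mqtt.Client()  # paho-mqtt 1.x style; 2.x needs a callback API version argument
client.connect("localhost", 1883)
client.loop_start()

# Forward every notification payload (max. 8000 bytes) as an MQTT message.
while True:
    if select.select([conn], [], [], 5) == ([], [], []):
        continue  # timeout, poll again
    conn.poll()
    while conn.notifies:
        notify = conn.notifies.pop(0)
        client.publish(TOPIC, notify.payload)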

Saturday, April 25, 2020

It looks like pgchem::tigress just got a major upgrade

With the release of PostgreSQL 12.x and OpenBabel 3.x, I decided to see if pgchem::tigress would still compile. Well, it took some minor changes, but YES, it does!

And - it seems like OpenBabel now handles E/Z and enantiomer stereochemistry correctly, at least in SMILES notation. This is a major step forward, but I have to do some more checks before the next release...