Thursday, July 23, 2015

Deriving the elemental composition from a molformula in pgchem::tigress

pgchem::tigress can generate molecular formulae like C3H6NO2- from chemical structures.

But what if we need access to the elemental composition as a relation, e.g:

elementcount
C 3
N 1
O 2

Fortunately, PostgreSQL is awesome:

CREATE OR REPLACE FUNCTION elemental_composition(molformula TEXT)
  RETURNS TABLE(element TEXT, count INTEGER) AS
$BODY$
DECLARE token TEXT[];
DECLARE elements TEXT[];
BEGIN
elements := ARRAY['C','N','O','P','S','Cl']; --expand as needed
molformula := REPLACE(REPLACE(molformula,'-',''),'+','');

FOREACH element IN ARRAY elements LOOP
count := 1;
token := REGEXP_MATCHES(molformula, element || '[\d?]*');

IF (token[1] IS NOT NULL) THEN
    token :=
REGEXP_MATCHES(token[1],'[0-9]+');
        IF (token[1] iS NOT NULL) THEN
            count := token[1]::INTEGER;
        END IF;
RETURN NEXT;
END IF;
END LOOP;
RETURN;
END;
$BODY$
  LANGUAGE plpgsql IMMUTABLE STRICT
  COST 1000;


SELECT * FROM elemental_composition('C3H6NO2-');

And that's it. Did I already mention that PostgreSQL is awesome? :-)

5 comments:

  1. You can do this without writing a function, or you could wrap the SQL in a function without PL/pgsql. The orignal was indented, but I don't know what blogspot will do to it.

    SELECT
    (regexp_split_to_array(pair,' '))[1] AS element,
    (regexp_split_to_array(pair,' '))[2] AS count
    FROM
    regexp_split_to_table( /* Split elemens and counts into a table */
    regexp_replace( /* Inject spaces between element and count */
    regexp_replace( /* Inject commas between elements */
    regexp_replace( /* Normalize element counts with 1s */
    'C3H6NO2',
    '([A-Za-z])([A-Z])',
    $$\11\2$$,
    'g'
    ),
    '([[:digit:]]+)(.)',
    $$\1,\2$$,
    'g'
    ),
    '([^[:digit:]])([[:digit:]])',
    $$\1 \2$$,
    'g'
    ),
    ','
    ) AS elements(pair);

    ReplyDelete
    Replies
    1. Whoa, cool.

      This approach has the benefit of recognizing all element types, without needing a list of 'allowed' atoms. However it does not remove charges (easy to fix), if the last atom has no count it evaluates the count to NULL instead of 1, and speed wise I can measure no difference between the two.

      So since some poor guys in operations may have to take a look at this too, I stay with the function for readability, albeit the Kung-Fu factor of your pure SQL solution is way higher of course. :-)

      Delete
    2. To fix that problem, replace

      '([A-Za-z])([A-Z])',

      with

      '([A-Za-z])([A-Z])?',

      If you want to stick with enumerating every element, you should at least include all known elements... but of course then it breaks when a new one is discovered. With proper indentation and trivial knowledge of regex it's really not hard to figure out what David's code is doing...

      Delete
    3. Damn blogger ate my reply. :(

      If you indent David's code properly and have a small understanding of regex then it's really not hard to figure out what the code is doing.

      To fix the bug, replace

      '([A-Za-z])([A-Z])',

      with

      '([A-Za-z])([A-Z])?',

      Delete
    4. "Damn blogger ate my reply. :(" And I should check my waiting comments more often. :-(

      "To fix the bug, replace"

      Indeed it does. Unfortunately it also does this now:

      'C3H6NO2' ->

      C 13
      H 16
      N 1
      O 2

      which is wrong.

      It also counts elements that do not exist:

      'C3H6NO2X' ->

      C 13
      H 16
      N 1
      O 2
      X 1

      So, having a whitelist of valid elements might come in handy. (In this particular case, it was also an application requirement.)

      And then, chemists might show up with something like this: 'CH3COOH'

      Or 'D2O' should be recognized as 'H2O' anyway.

      The tigress does not generate such formula but since they are perfectly valid, users might. This is easier to intercept in a function I guess.

      Delete