The Mess That Is PostgreSQL Foreign Data Wrappers, tds_fdw, & diacritic characters

This post may seem off-topic, however, after spending considerable time over the past week grappling with getting data containing text with diacritic remarks–in my case, German names & email addresses that contain ß, ä, ö, and/or ü–capturing some notes about the issue would be worth it in case I ever need to deal with this again!

I’m not going to do a deep dive into Foreign Data Wrappers (FDWs), but I will say that they are perhaps my favorite feature of PostgreSQL. And when combined with Airflow, FDWs have allowed me to keep many of my DAGs simple by just using PostgresHook & PostgresOperator for all SQL interactions, rather than having to deal with other database operators and/or write my own custom operators to move data from other databases into PostgreSQL.

So the messy part of this is that MS SQL–at least the SQL Server databases that I have to interact with–don’t (appear to) do UTF-8 encoding. That means that instead of being able to assume that some text fields, like name & email, if they happen to contain characters with the aforementioned diacritic marks, almost certainly are not going to be inserted properly into PostgreSQL as long as those columns in the foreign table have a data type of text or character varying (varchar).

The errors seen in psql were similar to:

[Code: 0, SQL State: 22021] ERROR: invalid byte sequence for encoding "UTF8": 0xe9 0x73 0x40

With the 3 hex values depending on the character to be rendered.

There were 3 things I did to resolve this issue:

  1. Exclude the table from IMPORT FOREIGN SCHEMA statement.
  2. Manually CREATE FOREIGN TABLE and set the data type for the name & email columns to BYTEA (binary format).
  3. Modify any queries in my DAG(s) to use CONVERT_FROM() string function for name & email columns.

Exclude (or Limit) Tables When Importing Foreign Schema

IMPORT FOREIGN SCHEMA dbo
    EXCEPT (customer)
    FROM SERVER mssql INTO mssrvr;

This command will import ALL tables found in the remote database schema into my local schema, called mssrvr. However, that could be overkill as you may import far more tables than you really need to. So you can limit the import to be only a specified list of tables:

IMPORT FOREIGN SCHEMA dbo
    LIMIT TO (products, transactions)
    FROM SERVER mssql INTO mssrvr;

Either way, you want to make sure that the customer table is not imported!

Manually Create Foreign Table

There’s nothing revolutionary here, just replace the data type for the necessary column(s).

CREATE FOREIGN TABLE customer (
    id integer NOT NULL
    , name bytea   -- Changed from text to bytea
    , email bytea  -- Changed from text to bytea
    , city text
    , state text
    , country text
    , username text
    , created timestamp without time zone
    , ...
)
    SERVER mssql
    OPTIONS (schema_name 'dbo', table_name 'customer');

Modify SQL Query to Convert Fields

Simply casting a bytea object to text isn’t going to work here. We need to actually convert it with the convert_from() function. MS SQL (or at least older versions of it) default to LATIN-1 encoding.

INSERT INTO customer_extract (
    id, name, email, city, state, country, username, created, ...)
(
    SELECT
        id
        , convert_from(name, 'latin-1') AS name
        , convert_from(email, 'latin-1') AS email
        , city
        , state
        , country
        , username
        , created
        , ...
    FROM mssrvr.customer
);

And that does it. However, a word of warning: convert_from() still doesn’t like all values as I can only retrieve data on the remote database that were created from mid-2013 on. I might investigate further, and if I figure it out, I’ll update this post.

3 thoughts on “The Mess That Is PostgreSQL Foreign Data Wrappers, tds_fdw, & diacritic characters

  1. Pierre-Jean ALBERTINI-FUJII

    An ugly regex based solution : find everything in a specific regular expression pattern… and choose not to display it / replace it with nothing.
    Foreign data wrapper is configured once, then a view – which can be later modified – displays the cleaned / slashed text.

    REGEXP_REPLACE(your_faulty_column, E'[\n\r\t\u0080-\uFFFF]’, ”, ‘gi’)

    Reply
    1. lalligood Post author

      That would be great…except I didn’t know exactly what the diacritic characters were. I needed a more broad & general solution. REGEXP_REPLACE() is a fantastically useful function when you know what pattern you want to replace though.

      Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.