This post may seem off-topic, however, after spending considerable time over the past week grappling with getting data containing text with diacritic remarks–in my case, German names & email addresses that contain ß, ä, ö, and/or ü–capturing some notes about the issue would be worth it in case I ever need to deal with this again!
I’m not going to do a deep dive into Foreign Data Wrappers (FDWs), but I will say that they are perhaps my favorite feature of PostgreSQL. And when combined with Airflow, FDWs have allowed me to keep many of my DAGs simple by just using PostgresHook
& PostgresOperator
for all SQL interactions, rather than having to deal with other database operators and/or write my own custom operators to move data from other databases into PostgreSQL.
So the messy part of this is that MS SQL–at least the SQL Server databases that I have to interact with–don’t (appear to) do UTF-8 encoding. That means that instead of being able to assume that some text fields, like name & email, if they happen to contain characters with the aforementioned diacritic marks, almost certainly are not going to be inserted properly into PostgreSQL as long as those columns in the foreign table have a data type of text
or character varying
(varchar).
The errors seen in psql were similar to:
[Code: 0, SQL State: 22021] ERROR: invalid byte sequence for encoding "UTF8": 0xe9 0x73 0x40
With the 3 hex values depending on the character to be rendered.
There were 3 things I did to resolve this issue:
- Exclude the table from
IMPORT FOREIGN SCHEMA
statement. - Manually
CREATE FOREIGN TABLE
and set the data type for the name & email columns toBYTEA
(binary format). - Modify any queries in my DAG(s) to use
CONVERT_FROM()
string function for name & email columns.
Exclude (or Limit) Tables When Importing Foreign Schema
IMPORT FOREIGN SCHEMA dbo
EXCEPT (customer)
FROM SERVER mssql INTO mssrvr;
This command will import ALL tables found in the remote database schema into my local schema, called mssrvr
. However, that could be overkill as you may import far more tables than you really need to. So you can limit the import to be only a specified list of tables:
IMPORT FOREIGN SCHEMA dbo
LIMIT TO (products, transactions)
FROM SERVER mssql INTO mssrvr;
Either way, you want to make sure that the customer
table is not imported!
Manually Create Foreign Table
There’s nothing revolutionary here, just replace the data type for the necessary column(s).
CREATE FOREIGN TABLE customer (
id integer NOT NULL
, name bytea -- Changed from text to bytea
, email bytea -- Changed from text to bytea
, city text
, state text
, country text
, username text
, created timestamp without time zone
, ...
)
SERVER mssql
OPTIONS (schema_name 'dbo', table_name 'customer');
Modify SQL Query to Convert Fields
Simply casting a bytea object to text isn’t going to work here. We need to actually convert it with the convert_from()
function. MS SQL (or at least older versions of it) default to LATIN-1 encoding.
INSERT INTO customer_extract (
id, name, email, city, state, country, username, created, ...)
(
SELECT
id
, convert_from(name, 'latin-1') AS name
, convert_from(email, 'latin-1') AS email
, city
, state
, country
, username
, created
, ...
FROM mssrvr.customer
);
And that does it. However, a word of warning: convert_from()
still doesn’t like all values as I can only retrieve data on the remote database that were created from mid-2013 on. I might investigate further, and if I figure it out, I’ll update this post.
Very helpful.
Thanks
An ugly regex based solution : find everything in a specific regular expression pattern… and choose not to display it / replace it with nothing.
Foreign data wrapper is configured once, then a view – which can be later modified – displays the cleaned / slashed text.
REGEXP_REPLACE(your_faulty_column, E'[\n\r\t\u0080-\uFFFF]’, ”, ‘gi’)
That would be great…except I didn’t know exactly what the diacritic characters were. I needed a more broad & general solution. REGEXP_REPLACE() is a fantastically useful function when you know what pattern you want to replace though.