The Data Studio

Load a File to a Database and Convert the Character Encoding

PostgreSQL

PostgreSQL makes it very easy, as we have come to expect:

  copy gazetteer 
  from '/Users/ronballard/Downloads/gaz50k_gb/Data/50kgaz2016.txt'
  with
      csv
      delimiter ':'
      encoding 'LATIN1';

  COPY 258382
    

See the Postgres documentation, Character Set Support, for the list of supported encoding names.
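
As a quick sanity check after the load, you can confirm that the database stores text as UTF-8 and spot-check a few rows. This is a minimal sketch: the wildcard select and the limit of 10 are arbitrary choices for illustration, not part of the original example.

  -- confirm that the database stores text as UTF-8
  show server_encoding;

  -- spot-check a few rows; accented place names should now display correctly
  select *
  from gazetteer
  limit 10;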

Hive

We have not been able to find much information in the Hive documentation, but the Hortonworks Community website came to the rescue. The relevant page is Processing Files in Hive Using Native (Non-UTF8) Character Sets. Be sure to read the comments too, because they address some particular issues we ran into on a recent project.

The way to use this feature is to specify the encoding of the underlying file when we create a Hive external table over that file. Here is an example:

  create external table snpfddr
     (line string)
  row format
      serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
      with serdeproperties ('serialization.encoding' = 'WINDOWS-1252')
  location
      'hdfs://cluster/landing/snpfddr'
  tblproperties ('skip.header.line.count' = '1');
    

(Mainframe files often have stupid names like "snpfddr" - some people seem to think that data names should be like this.)

You can now select data from this table and see all the characters as they would have appeared in the source system. More usefully, you can use this table to populate a Hive table (in ORC or some other format), either with create table as select ... or by creating the target table first and using insert into table ... select .... In either case the data is converted to UTF-8 for you, and the target table can be used on its own or joined with other Hive tables.
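
Here is a minimal sketch of that second step; the target table name snpfddr_orc is our own choice for illustration, not a name from the source system:

  -- create an ORC table from the external table; the serde decodes
  -- WINDOWS-1252 on read, so the ORC table is stored as UTF-8
  create table snpfddr_orc
  stored as orc
  as
  select line
  from snpfddr;

  -- alternatively, create snpfddr_orc first and then:
  -- insert into table snpfddr_orc select line from snpfddr;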

In this example we used the WINDOWS-1252 encoding because that is what the source system used for this file. There are many, many supported encodings; you can see the full list by typing the following on an edge node:

  iconv -l