The Data Studio

Hive "OpenCSVSerde" Changes Your Table Definition

This thing with an ugly name is described in the Hive documentation. Its behaviour is described accurately, but that is no excuse for the vandalism that this thing inflicts on data quality.

When you define a table you specify a data-type for every column. If you use the "OpenCSVSerde" it changes the definition silently so that every column is defined as a string.

The only commands that should change a table definition are create and alter. The behaviour of the OpenCSVSerde is in line with the cavalier attitude to data quality and error handling that is rife in Hive. This "feature" caused us to waste a lot of time in a project I was working on.

This is how it works. You define a table as you want it to be, and specify that you want to read the data from a csv file, like this:

      create external table using_open_csv_serde
      (
          test_id         int,
          version         int,
          account_name    varchar(16),
          start_date      date,
          paid_timestamp  timestamp,
          amount          decimal(9,2),
          approved        boolean,
          comment         varchar(40)
      )
      row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
      with serdeproperties 
      (
          'separatorChar' = ',',
          'quoteChar'     = '\"',
          'escapeChar'    = '\\'
      )
      location '/user/ron/hive_load_failure/';
    

I think they are trying to wind me up by forcing me to use the ridiculous abbreviation "serde" four times in one statement, but this irritation is nothing compared to the way I feel about having my table definition silently changed. This is the resulting table.

      describe using_open_csv_serde;

      +-----------------+------------+--------------------+--+
      |    col_name     | data_type  |      comment       |
      +-----------------+------------+--------------------+--+
      | test_id         | string     | from deserializer  |
      | version         | string     | from deserializer  |
      | account_name    | string     | from deserializer  |
      | start_date      | string     | from deserializer  |
      | paid_timestamp  | string     | from deserializer  |
      | amount          | string     | from deserializer  |
      | approved        | string     | from deserializer  |
      | comment         | string     | from deserializer  |
      +-----------------+------------+--------------------+--+
      8 rows selected (0.079 seconds)
    

Adding the comment does not diminish the bad behaviour of the OpenCSVSerde. It is a crime against software architecture and a crime against data quality.

I will talk about Data Quality in many other pages on this website. If it is not apparent to you just how bad this feature is, then I hope you will be curious enough to read some of those pages.