The Data Studio

Hive: Silent failure on loading data

One process for loading data into Hive is:

  1. copy a file, or set of files, into an HDFS directory
  2. define an external table over the HDFS directory
  3. create a table with a more efficient structure (ORC, say) and copy the data from the external table into the ORC table, as sketched below
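
Step 3, for example, can be a single statement once the external table from step 2 exists. Here is a minimal sketch; raw_orc is an illustrative name and raw is the external table defined further down in this article:

        -- Step 3 (sketch): copy the external table "raw" into an ORC-backed table.
        -- CREATE TABLE ... AS SELECT creates the table and loads it in one go.
        create table raw_orc stored as orc
        as select * from raw;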

There are problems with the process defined above.

Hopefully you will define your tables with strong data-types. By this I mean the data-type most appropriate to the data: if a field holds an integer you should use an int data-type, if it holds a date use a date, and so on. In other words, we use the most restrictive data-type that will hold the data we are loading. The intention is to know that our data is clean, which saves work over and over again further down the line as we try to derive meaning from it. If you are trying to calculate sales volume, for example, you want all the individual sales amounts to be valid decimal numbers so that they add up without errors and without loss of data.

Unfortunately, this does not work well in Hive. Hive loves silent failures: values that do not fit the declared type (too big, not numeric, not a valid date or timestamp) are silently loaded as NULL; strings longer than the declared varchar length are silently truncated; impossible dates are silently "adjusted" to different dates; decimals with too many decimal places are silently rounded; and anything other than a true/false value in a boolean column becomes NULL. (Scroll further down to see the test_ids for each of the cases listed here.)

All this happens silently. Neither the create table... nor the select... will give any error message. Unless you do more research, you will not even know that there is missing or dirty data in your table. All that wonderful Data Science you want to apply to your data will now produce unreliable results.
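
To make this concrete with the test data shown below (a sketch, not one of the queries from the original tests): summing the amount column of the typed table quietly leaves out the rows whose amounts failed to convert, so the "total sales" figure is simply wrong and nothing warns you.

      -- test_ids 12 and 13 end up with NULL in typed.amount, so they silently
      -- drop out of the total; the query succeeds with no error or warning.
      select sum(amount) as total_paid from typed;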

How To Ensure Data Quality When Using Hive

Here is a method that you can use to detect data quality errors and report on them. If you do this you will then be able to fix them once, rather than having the Data Science and Analytics teams battle with the data every time they use it.

What you do is define two external tables over every set of files that makes up one table. The first external table has all data-types defined as "string" (because that will accept anything) and the second has the data-types defined as you want them. You can then run queries across both tables to highlight values that are different between the two tables; these will be your data quality errors.

To do this you need a reliable unique key. In these examples we use test_id, and we have arranged for it to be unique. You may not be so lucky with the data you are loading. If there is no reliable unique key we pre-process every file to add one. The key can be a sequence of integers (big integers if you have, or expect, over 2 billion rows), a hash of the data columns, or a UUID. However you do it, it is extra work, and the promise of the Data Lake as something that is really easy and may even have some practical use takes another knock.
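
If you do have to add a key yourself, one possible approach (a sketch, not part of the original process; the table names are illustrative) is to generate it while staging the data, for example as a UUID via Hive's reflect() UDF:

        -- Sketch: stage the keyless file as an all-string external table
        -- (here called raw_no_key), then add a surrogate key while copying it.
        create table raw_keyed stored as orc as
        select
            reflect('java.util.UUID', 'randomUUID') as row_key,  -- random UUID as a string
            a.*
        from raw_no_key a;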

Here's a test to illustrate what we have described so far. We start with a csv file (you don't need to read it; there are easier presentations of this data below):

        1,123,Smith,2014-07-19,2017-03-02 10:47:33,157.26,true,All~valid
        2,3999456772,Green,2011-11-30,2017-02-13 12:36:45,931.75,True,id too long
        3,ABC,Brown,2013-01-29,2017-01-23 05:42:00,76.42,false,id non-numeric
        4,345,F'tang F'tang Ole Biscuit-Barrel,1999-12-12,2016-12-17 14:30:20,1572.33,False,account_name too long
        5,567,Wilson,02/24/09,2017-01-31 11:46:30,722.45,TRUE,US-format date
        6,789,Jones,2017-02-29,2017-02-24 12:15:00,4.25,FALSE,invalid day for February in date
        7,201,Philips,ABC,2017-03-30 10:45:55,76243.00,TrUe,nothing like a date
        8,432,Clarke,1989-10-03,2017-01-26T10:24:23,17.62,FaLsE,"T" between date and time in timestamp
        9,756,Walters,2002-11-15,2017-03-34 07:15:15,48.00,tRuE,invalid day number in timestamp
        10,642,Anderson,2008-04-01,ABCDEFGHIJKLMNOPQRS,473.67,fAlSe,nothing like a timestamp
        11,315,Billings,2007-05-27,2017-02-28 13:57:20,42.0462946482,true,too many decimal places in amount
        12,742,Carson,2008-09-22,2017-03-11 16:00:45,1234567890.42,false,too many digits before decimal point
        13,246,Diprose,1994-02-30,2017-01-10 15:35:48,ABCDEFGHIJK,true,should be decimal but not numeric
        14,174,Emery,2003-04-17,2017-01-22 10:45:25,121.11,yes,boolean "yes"
        15,476,Finnegan,2012-02-25,2017-02-08 13:46:15,32.46,no,boolean "no"
        16,927,Granger,2010-07-23,2017-03-06 09:32:50,-46.20,0,boolean "0"
        17,913,Harris,2004-07-01,2017-01-29 11:23:45,239.50,1,boolean "1"
        18,327,Ingliss,2001-11-19,2017-02-14 12:36:32,-157.25,Junk,invalid boolean
      

Here's the first external table definition: all columns are defined as strings (except for test_id, for the reason described above):

        create external table raw
        (
            test_id         int,
            version         string,
            account_name    string,
            start_date      string,
            paid_timestamp  string,
            amount          string,
            approved        string,
            comment         string
        )
        row format delimited fields terminated by ','
        location '/user/ron/hive_load_failure/';
      
and this makes our csv file into a table looking like this:
+--------------+--------------+-----------------------------------+-----------------+----------------------+----------------+---------------+-----------------------------------------+--+
| raw.test_id  | raw.version  |         raw.account_name          | raw.start_date  |  raw.paid_timestamp  |   raw.amount   | raw.approved  |               raw.comment               |
+--------------+--------------+-----------------------------------+-----------------+----------------------+----------------+---------------+-----------------------------------------+--+
| 1            | 123          | Smith                             | 2014-07-19      | 2017-03-02 10:47:33  | 157.26         | true          | All~valid                               |
| 2            | 3999456772   | Green                             | 2011-11-30      | 2017-02-13 12:36:45  | 931.75         | True          | id too long                             |
| 3            | ABC          | Brown                             | 2013-01-29      | 2017-01-23 05:42:00  | 76.42          | false         | id non-numeric                          |
| 4            | 345          | F'tang F'tang Ole Biscuit-Barrel  | 1999-12-12      | 2016-12-17 14:30:20  | 1572.33        | False         | account_name too long                   |
| 5            | 567          | Wilson                            | 02/24/09        | 2017-01-31 11:46:30  | 722.45         | TRUE          | US-format date                          |
| 6            | 789          | Jones                             | 2017-02-29      | 2017-02-24 12:15:00  | 4.25           | FALSE         | invalid day for February in date        |
| 7            | 201          | Philips                           | ABC             | 2017-03-30 10:45:55  | 76243.00       | TrUe          | nothing like a date                     |
| 8            | 432          | Clarke                            | 1989-10-03      | 2017-01-26T10:24:23  | 17.62          | FaLsE         | "T" between date and time in timestamp  |
| 9            | 756          | Walters                           | 2002-11-15      | 2017-03-34 07:15:15  | 48.00          | tRuE          | invalid day number in timestamp         |
| 10           | 642          | Anderson                          | 2008-04-01      | ABCDEFGHIJKLMNOPQRS  | 473.67         | fAlSe         | nothing like a timestamp                |
| 11           | 315          | Billings                          | 2007-05-27      | 2017-02-28 13:57:20  | 42.0462946482  | true          | too many decimal places in amount       |
| 12           | 742          | Carson                            | 2008-09-22      | 2017-03-11 16:00:45  | 1234567890.42  | false         | too many digits before decimal point    |
| 13           | 246          | Diprose                           | 1994-02-30      | 2017-01-10 15:35:48  | ABCDEFGHIJK    | true          | should be decimal but not numeric       |
| 14           | 174          | Emery                             | 2003-04-17      | 2017-01-22 10:45:25  | 121.11         | yes           | boolean "yes"                           |
| 15           | 476          | Finnegan                          | 2012-02-25      | 2017-02-08 13:46:15  | 32.46          | no            | boolean "no"                            |
| 16           | 927          | Granger                           | 2010-07-23      | 2017-03-06 09:32:50  | -46.20         | 0             | boolean "0"                             |
| 17           | 913          | Harris                            | 2004-07-01      | 2017-01-29 11:23:45  | 239.50         | 1             | boolean "1"                             |
| 18           | 327          | Ingliss                           | 2001-11-19      | 2017-02-14 12:36:32  | -157.25        | Junk          | invalid boolean                         |
+--------------+--------------+-----------------------------------+-----------------+----------------------+----------------+---------------+-----------------------------------------+--+
18 rows selected (0.201 seconds)
      

And here's the second external table definition: all columns are defined as the data-types they really are:

        create external table typed
        (
            test_id         int,
            version         int,
            account_name    varchar(16),
            start_date      date,
            paid_timestamp  timestamp,
            amount          decimal(9,2),
            approved        boolean,
            comment         varchar(40)
        )
        row format delimited fields terminated by ','
        location '/user/ron/hive_load_failure/';
      
and this makes our csv file into a table looking like this:
+----------------+----------------+---------------------+-------------------+------------------------+---------------+-----------------+-----------------------------------------+--+
| typed.test_id  | typed.version  | typed.account_name  | typed.start_date  |  typed.paid_timestamp  | typed.amount  | typed.approved  |              typed.comment              |
+----------------+----------------+---------------------+-------------------+------------------------+---------------+-----------------+-----------------------------------------+--+
| 1              | 123            | Smith               | 2014-07-19        | 2017-03-02 10:47:33.0  | 157.26        | true            | All~valid                               |
| 2              | NULL           | Green               | 2011-11-30        | 2017-02-13 12:36:45.0  | 931.75        | true            | id too long                             |
| 3              | NULL           | Brown               | 2013-01-29        | 2017-01-23 05:42:00.0  | 76.42         | false           | id non-numeric                          |
| 4              | 345            | F'tang F'tang Ol    | 1999-12-12        | 2016-12-17 14:30:20.0  | 1572.33       | false           | account_name too long                   |
| 5              | 567            | Wilson              | NULL              | 2017-01-31 11:46:30.0  | 722.45        | true            | US-format date                          |
| 6              | 789            | Jones               | 2017-03-01        | 2017-02-24 12:15:00.0  | 4.25          | false           | invalid day for February in date        |
| 7              | 201            | Philips             | NULL              | 2017-03-30 10:45:55.0  | 76243.00      | true            | nothing like a date                     |
| 8              | 432            | Clarke              | 1989-10-03        | NULL                   | 17.62         | false           | "T" between date and time in timestamp  |
| 9              | 756            | Walters             | 2002-11-15        | NULL                   | 48.00         | true            | invalid day number in timestamp         |
| 10             | 642            | Anderson            | 2008-04-01        | NULL                   | 473.67        | false           | nothing like a timestamp                |
| 11             | 315            | Billings            | 2007-05-27        | 2017-02-28 13:57:20.0  | 42.05         | true            | too many decimal places in amount       |
| 12             | 742            | Carson              | 2008-09-22        | 2017-03-11 16:00:45.0  | NULL          | false           | too many digits before decimal point    |
| 13             | 246            | Diprose             | 1994-03-02        | 2017-01-10 15:35:48.0  | NULL          | true            | should be decimal but not numeric       |
| 14             | 174            | Emery               | 2003-04-17        | 2017-01-22 10:45:25.0  | 121.11        | NULL            | boolean "yes"                           |
| 15             | 476            | Finnegan            | 2012-02-25        | 2017-02-08 13:46:15.0  | 32.46         | NULL            | boolean "no"                            |
| 16             | 927            | Granger             | 2010-07-23        | 2017-03-06 09:32:50.0  | -46.20        | NULL            | boolean "0"                             |
| 17             | 913            | Harris              | 2004-07-01        | 2017-01-29 11:23:45.0  | 239.50        | NULL            | boolean "1"                             |
| 18             | 327            | Ingliss             | 2001-11-19        | 2017-02-14 12:36:32.0  | -157.25       | NULL            | invalid boolean                         |
+----------------+----------------+---------------------+-------------------+------------------------+---------------+-----------------+-----------------------------------------+--+
18 rows selected (0.159 seconds)
      

With 18 rows in the test table it is easy to see where things have gone wrong. In a real-world data-set it is usually much harder to find the errors. The only feasible way to manage it is to automate the checks. We end up needing a variety of SQL statements to cover all the necessary checks, and several of those checks are made more complex by Hive's missing set operators, which force us to write the comparisons as joins. Here we'll show some sample queries; in a later article we will show you how to automate this.

Finding data that fails to load into integer data-type:

      select
          a.version,
          b.version,
          b.comment
      from
          raw a
          join
          typed b
          on a.test_id = b.test_id
      where 
          a.version is not null and
          b.version is null;

      +-------------+------------+-----------------+--+
      |  a.version  | b.version  |    b.comment    |
      +-------------+------------+-----------------+--+
      | 3999456772  | NULL       | id too long     |
      | ABC         | NULL       | id non-numeric  |
      +-------------+------------+-----------------+--+
      2 rows selected (0.909 seconds)
    

Finding data that is truncated when loading into varchar data-type:

      select
          a.account_name,
          b.account_name,
          b.comment
      from
          raw a
          join
          typed b
          on a.test_id = b.test_id
      where 
          length(a.account_name) > length(b.account_name);

      +-----------------------------------+-------------------+------------------------+--+
      |          a.account_name           |  b.account_name   |       b.comment        |
      +-----------------------------------+-------------------+------------------------+--+
      | F'tang F'tang Ole Biscuit-Barrel  | F'tang F'tang Ol  | account_name too long  |
      +-----------------------------------+-------------------+------------------------+--+
      1 row selected (0.931 seconds)
    

Finding dates that get "adjusted" on loading into date data-type:

      select
          a.start_date,
          b.start_date,
          b.comment
      from
          raw a
          join
          typed b
          on a.test_id = b.test_id
      where 
          a.start_date <> cast(b.start_date as varchar(10));

      +---------------+---------------+------------------------------------+--+
      | a.start_date  | b.start_date  |             b.comment              |
      +---------------+---------------+------------------------------------+--+
      | 2017-02-29    | 2017-03-01    | invalid day for February in date   |
      | 1994-02-30    | 1994-03-02    | should be decimal but not numeric  |
      +---------------+---------------+------------------------------------+--+
      2 rows selected (1.259 seconds)
    

This is not clever: the system has no way of knowing what the user actually intended, so the "adjusted" date is just one possible guess, and it will usually be wrong.

Finding data that fails to load into date data-type:

      select
          a.start_date,
          b.start_date,
          b.comment
      from
          raw a
          join
          typed b
          on a.test_id = b.test_id
      where 
          a.start_date is not null and
          b.start_date is null;

      +---------------+---------------+----------------------+--+
      | a.start_date  | b.start_date  |      b.comment       |
      +---------------+---------------+----------------------+--+
      | 02/24/09      | NULL          | US-format date       |
      | ABC           | NULL          | nothing like a date  |
      +---------------+---------------+----------------------+--+
      2 rows selected (0.798 seconds)
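
The remaining columns need checks in the same style. For example, a check for boolean values that fail to load, and a single query that counts the conversion failures per column, might look like this (these are sketches in the spirit of the queries above, not results from the original test run):

      -- booleans that arrive as something other than true/false (test_ids 14 to 18)
      select
          a.approved,
          b.comment
      from
          raw a
          join
          typed b
          on a.test_id = b.test_id
      where
          a.approved is not null and
          b.approved is null;

      -- one pass that counts, per column, how many values were silently turned into NULL
      select
          sum(case when a.version        is not null and b.version        is null then 1 else 0 end) as bad_version,
          sum(case when a.start_date     is not null and b.start_date     is null then 1 else 0 end) as bad_start_date,
          sum(case when a.paid_timestamp is not null and b.paid_timestamp is null then 1 else 0 end) as bad_paid_timestamp,
          sum(case when a.amount         is not null and b.amount         is null then 1 else 0 end) as bad_amount,
          sum(case when a.approved       is not null and b.approved       is null then 1 else 0 end) as bad_approved
      from
          raw a
          join
          typed b
          on a.test_id = b.test_id;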
    

You can see from these queries that, although each one is very simple, there will be a lot of queries to run to detect all the data-quality issues that we have illustrated here. The automation (which we will provide in a later article) makes it a lot easier, but you still have to run all the tests and check the results. This can take significant resources on the cluster and significant human effort to follow it all through and get the problems fixed.

This is extra cost that we incur because the product set is badly designed. I will not accept that there is any justification for the silent-failure approach. If the quality of the data you are loading is not important, then why on earth are you bothering to load it and then run analyses that may have serious errors? If you are happy to settle for fiction, then just write fiction - it is much cheaper.

Protest Warning

I have to issue a Protest Warning here. Maybe you will say that it is a bit overdue.

Even if you do all this, things can go wrong

I issue the Protest Warning because my description of what can go wrong sounds very cynical, but unfortunately it happens: even when you detect and report the data-quality errors, you may find that nobody is prepared to fix them. In that case the only practical solution for you may be to start looking for a better organisation to work for. You probably invested a lot in this one, but a job like this will eat your soul; you owe it to yourself to take your skills somewhere more rewarding.