The Data Studio

Hive: Booleans Are Too Confusing To Be Usable

Tested Using Hortonworks Data Platform (HDP) Sandbox, Release 2.5 (Hive 1.2.1)

(Update for Hive 2.1.0 here)

PostgreSQL has a data-type called "boolean" that has just three possible states: true, false and null. Booleans can be very useful, because in most systems there are some elements that can only have these three states (true, false and unknown (= null)) and the boolean type models those efficiently and with no possibility of any other data getting into the column. (If you used a varchar, for example, you could store any character you wanted in that column.) Booleans also make queries simpler, so long as you name the column properly, for example:

      select
          first_name,
          last_name,
          department 
      from
          employee 
      where
          security_cleared;
    

Hive does support the boolean data-type but the implementation is flawed in several ways. In fact I would advise not using booleans in Hive because there are too many ways you can get the wrong answer. You don't want to be sending someone who isn't security_cleared on a secret mission, do you?

In common usage and in coding, there are several synonyms for true (T, Yes, Y, 1, etc.) and for false (F, No, N, 0, etc.). It is handy if your database accepts these synonyms.

Loading Data From A File Into Hive

For these tests we define a table like this:

      create external table boolean_test
      (
         varchar_type varchar(8),
         boolean_type boolean
      )
      row format delimited fields terminated by ','
      location '/user/ron/boolean_test/';
    

Now we need a data file in the HDFS directory specified in the location above. This file contains various potential boolean values and one definitely invalid value "Junk". There are two values on every row: one to be interpreted as a string (varchar) so that we know what to expect in the boolean field, and the same value to be interpreted as a boolean. Here's the file:

      TRUE,TRUE
      true,true
      True,True
      T,T
      t,t
      FALSE,FALSE
      false,false
      False,False
      F,F
      f,f
      0,0
      1,1
      YES,YES
      yes,yes
      Yes,Yes
      Y,Y
      y,y
      NO,NO
      no,no
      No,No
      N,N
      n,n
      Junk,Junk
    

Having created the file and copied it into HDFS, we can see what Hive makes of it through the external table definition:

      select * from boolean_test;
      +----------------------------+----------------------------+--+
      | boolean_test.varchar_type  | boolean_test.boolean_type  |
      +----------------------------+----------------------------+--+
      | TRUE                       | true                       |
      | true                       | true                       |
      | True                       | true                       |
      | T                          | NULL                       |
      | t                          | NULL                       |
      | FALSE                      | false                      |
      | false                      | false                      |
      | False                      | false                      |
      | F                          | NULL                       |
      | f                          | NULL                       |
      | 0                          | NULL                       |
      | 1                          | NULL                       |
      | YES                        | NULL                       |
      | yes                        | NULL                       |
      | Yes                        | NULL                       |
      | Y                          | NULL                       |
      | y                          | NULL                       |
      | NO                         | NULL                       |
      | no                         | NULL                       |
      | No                         | NULL                       |
      | N                          | NULL                       |
      | n                          | NULL                       |
      | Junk                       | NULL                       |
      +----------------------------+----------------------------+--+
      23 rows selected (0.434 seconds)
    

Hive interprets the true and false values as we want, without being phased by the use of upper and lower case, so that is good. The other values are all silently ignored and set to null.

This does mean that we have to be careful when loading booleans from a file and we do need to write code to check what is in the file so that we don't miss anything. In many cases the developers won't bother, or (more likely in my experience) their managers will be so keen to get all the data loaded really fast (as the vendor promised) that the developers will feel that they shouldn't be using up valuable time checking this sort of thing. This is a big factor in the poor data quality in many of our databases.

Using SQL Insert Statements To Load Data In Hive

Firstly we'll create an "orc" table containing the same data. This is generally advised because the orc format will be more efficient.

      create table boolean_test_orc
      (
          varchar_type varchar(8),
          boolean_type boolean
      )
      stored as orc;

      insert into table boolean_test_orc
      select * from boolean_test;
    

Now we can run our insert statements:

      insert into table boolean_test_orc values ('TRUE', 'TRUE');
      insert into table boolean_test_orc values ('true', 'true');
      insert into table boolean_test_orc values ('True', 'True');
      insert into table boolean_test_orc values ('T', 'T');
      insert into table boolean_test_orc values ('t', 't');
      insert into table boolean_test_orc values ('FALSE', 'FALSE');
      insert into table boolean_test_orc values ('false', 'false');
      insert into table boolean_test_orc values ('False', 'False');
      insert into table boolean_test_orc values ('F', 'F');
      insert into table boolean_test_orc values ('f', 'f');
      insert into table boolean_test_orc values ('0', '0');
      insert into table boolean_test_orc values ('1', '1');
      insert into table boolean_test_orc values ('YES', 'YES');
      insert into table boolean_test_orc values ('yes', 'yes');
      insert into table boolean_test_orc values ('Yes', 'Yes');
      insert into table boolean_test_orc values ('Y', 'Y');
      insert into table boolean_test_orc values ('y', 'y');
      insert into table boolean_test_orc values ('NO', 'NO');
      insert into table boolean_test_orc values ('no', 'no');
      insert into table boolean_test_orc values ('No', 'No');
      insert into table boolean_test_orc values ('N', 'N');
      insert into table boolean_test_orc values ('n', 'n');
      insert into table boolean_test_orc values ('Junk', 'Junk');
    

Every one of the above statements gives output like the following:

      0: jdbc:hive2://172.16.190.194:10000> insert into table boolean_test_orc values ('FALSE', 'FALSE');
      INFO  : Session is already open
      INFO  : Dag name: insert into table boolean_test_or...'FALSE')(Stage-1)
      INFO  : 

      INFO  : Status: Running (Executing on YARN cluster with App id application_1489395979188_0001)

      INFO  : Map 1: 0(+1)/1	
      INFO  : Map 1: 1/1	
      INFO  : Loading data to table default.boolean_test_orc from hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/boolean_test_orc/.hive-staging_hive_2017-03-13_09-44-24_976_3660911276893003694-1/-ext-10000
      INFO  : Table default.boolean_test_orc stats: [numFiles=6, numRows=6, totalSize=1708, rawDataSize=547]
      No rows affected (0.938 seconds)
    

Note, in particular, that this output says "No rows affected" when we just inserted one, and that there is no error message. What we now have in the table is:

      0: jdbc:hive2://172.16.190.194:10000> select * from boolean_test_orc;
      +--------------------------------+--------------------------------+--+
      | boolean_test_orc.varchar_type  | boolean_test_orc.boolean_type  |
      +--------------------------------+--------------------------------+--+
      | TRUE                           | true                           |
      | true                           | true                           |
      | 0                              | true                           |
      | 1                              | true                           |
      | YES                            | true                           |
      | yes                            | true                           |
      | Yes                            | true                           |
      | Y                              | true                           |
      | y                              | true                           |
      | NO                             | true                           |
      | no                             | true                           |
      | No                             | true                           |
      | True                           | true                           |
      | N                              | true                           |
      | n                              | true                           |
      | Junk                           | true                           |
      | T                              | true                           |
      | t                              | true                           |
      | FALSE                          | true                           |
      | false                          | true                           |
      | False                          | true                           |
      | F                              | true                           |
      | f                              | true                           |
      +--------------------------------+--------------------------------+--+
      23 rows selected (0.184 seconds)
    

So all of the input strings have been interpreted as "true"! Do I need to explain how crazy that is?

It is crazy, but my insert statements are probably wrong, since the boolean values such as "true" and "false" should be keywords rather than string literals.

I'll fix them and try again.

      truncate table boolean_test_orc;

      insert into table boolean_test_orc values ('TRUE', TRUE);
      insert into table boolean_test_orc values ('true', true);
      insert into table boolean_test_orc values ('True', True);
      insert into table boolean_test_orc values ('T', T);
      insert into table boolean_test_orc values ('t', t);
      insert into table boolean_test_orc values ('FALSE', FALSE);
      insert into table boolean_test_orc values ('false', false);
      insert into table boolean_test_orc values ('False', False);
      insert into table boolean_test_orc values ('F', F);
      insert into table boolean_test_orc values ('f', f);
      insert into table boolean_test_orc values ('0', 0);
      insert into table boolean_test_orc values ('1', 1);
      insert into table boolean_test_orc values ('YES', YES);
      insert into table boolean_test_orc values ('yes', yes);
      insert into table boolean_test_orc values ('Yes', Yes);
      insert into table boolean_test_orc values ('Y', Y);
      insert into table boolean_test_orc values ('y', y);
      insert into table boolean_test_orc values ('NO', NO);
      insert into table boolean_test_orc values ('no', no);
      insert into table boolean_test_orc values ('No', No);
      insert into table boolean_test_orc values ('N', N);
      insert into table boolean_test_orc values ('n', n);
      insert into table boolean_test_orc values ('Junk', Junk);
    

Wow! I got some error messages, like this:

      0: jdbc:hive2://172.16.190.194:10000> insert into table boolean_test_orc values ('F', F);
      Error: Error while compiling statement: FAILED: SemanticException [Error 10293]: Unable to create temp file for insert values Expression of type TOK_TABLE_OR_COL not supported in insert/values (state=42000,code=10293)
    

It isn't in plain English, but I can work out that my unquoted F doesn't make sense to Hive, or as the message says: "Expression of type TOK_TABLE_OR_COL not supported".

However, some of these did work:

      select * from boolean_test_orc;
      +--------------------------------+--------------------------------+--+
      | boolean_test_orc.varchar_type  | boolean_test_orc.boolean_type  |
      +--------------------------------+--------------------------------+--+
      | TRUE                           | true                           |
      | true                           | true                           |
      | True                           | true                           |
      | FALSE                          | false                          |
      | false                          | false                          |
      | False                          | false                          |
      | 0                              | true                           |
      | 1                              | true                           |
      +--------------------------------+--------------------------------+--+
      8 rows selected (0.084 seconds)
    

Oops! How come 0 and 1 both mean true?

You may say that I should not be using insert statements with Hive. I would have two issues with that:

  1. The Big Data vendors are pushing hard to get Hadoop-based systems to replace relational databases. The temptation from that $46 billion database market is too great to resist.
  2. The insert statement is implemented and documented. If I'm not supposed to use it, don't make it available. If I am, then make it work.

What Would Postgres Do?

Postgres is a pretty good guide to the right response to any contentious behaviour by database systems. This is what happens in Postgres:

      create table boolean_test 
      (
          varchar_type varchar(8),
          boolean_type boolean
      );
      CREATE TABLE

      # \d boolean_test 
                 Table "public.boolean_test"
          Column    |         Type         | Modifiers 
      --------------+----------------------+-----------
       varchar_type | character varying(8) | 
       boolean_type | boolean              | 

      # copy boolean_test from '/Users/ronballard/SVN/the_data_studio/research_in_progress/boolean.csv' with csv;
      ERROR:  invalid input syntax for type boolean: "Junk"
      CONTEXT:  COPY boolean_test, line 23, column boolean_type: "Junk"
    

Postgres refuses to copy bad data, as it should. So I clean up the file by removing the junk row (the appropriate action in this test) and try again.

      # copy boolean_test from '/Users/ronballard/SVN/the_data_studio/research_in_progress/boolean.csv' with csv;
      COPY 22

      claims=# select * from boolean_test;
       varchar_type | boolean_type 
      --------------+--------------
       TRUE         | t
       true         | t
       True         | t
       T            | t
       t            | t
       FALSE        | f
       false        | f
       False        | f
       F            | f
       f            | f
       0            | f
       1            | t
       YES          | t
       yes          | t
       Yes          | t
       Y            | t
       y            | t
       NO           | f
       no           | f
       No           | f
       N            | f
       n            | f
      (22 rows)
    

That is exactly what I expect and want.

Now let's try the separate inserts:

      # truncate table boolean_test;
      TRUNCATE TABLE
      # insert into boolean_test values ('TRUE', 'TRUE');
      INSERT 0 1
      # insert into boolean_test values ('true', 'true');
      INSERT 0 1
      # insert into boolean_test values ('True', 'True');
      INSERT 0 1
      # insert into boolean_test values ('T', 'T');
      INSERT 0 1
      # insert into boolean_test values ('t', 't');
      INSERT 0 1
      # insert into boolean_test values ('FALSE', 'FALSE');
      INSERT 0 1
      # insert into boolean_test values ('false', 'false');
      INSERT 0 1
      # insert into boolean_test values ('False', 'False');
      INSERT 0 1
      # insert into boolean_test values ('F', 'F');
      INSERT 0 1
      # insert into boolean_test values ('f', 'f');
      INSERT 0 1
      # insert into boolean_test values ('0', '0');
      INSERT 0 1
      # insert into boolean_test values ('1', '1');
      INSERT 0 1
      # insert into boolean_test values ('YES', 'YES');
      INSERT 0 1
      # insert into boolean_test values ('yes', 'yes');
      INSERT 0 1
      # insert into boolean_test values ('Yes', 'Yes');
      INSERT 0 1
      # insert into boolean_test values ('Y', 'Y');
      INSERT 0 1
      # insert into boolean_test values ('y', 'y');
      INSERT 0 1
      # insert into boolean_test values ('NO', 'NO');
      INSERT 0 1
      # insert into boolean_test values ('no', 'no');
      INSERT 0 1
      # insert into boolean_test values ('No', 'No');
      INSERT 0 1
      # insert into boolean_test values ('N', 'N');
      INSERT 0 1
      # insert into boolean_test values ('n', 'n');
      INSERT 0 1
      # insert into boolean_test values ('Junk', 'Junk');
      ERROR:  invalid input syntax for type boolean: "Junk"
      LINE 1: insert into boolean_test values ('Junk', 'Junk');
                                                 ^
      # select * from boolean_test;
       varchar_type | boolean_type 
      --------------+--------------
       TRUE         | t
       true         | t
       True         | t
       T            | t
       t            | t
       FALSE        | f
       false        | f
       False        | f
       F            | f
       f            | f
       0            | f
       1            | t
       YES          | t
       yes          | t
       Yes          | t
       Y            | t
       y            | t
       NO           | f
       no           | f
       No           | f
       N            | f
       n            | f
      (22 rows)
    

Postgres gives the right results and it stops the input of junk, with a simple clear error message. That is all I ask for.

Update: Tested Using Hortonworks Data Platform (HDP) Sandbox, Release 2.5, Hive 2.1.0.2.5.0.0-1245

The results were exactly the same as with Hive 1.2.1 except for the output on each insert. With Hive 2.1.0 we get:

INFO  : Executing command(queryId=hive_20170314105037_9c85dd4f-2734-4c8c-b4ed-4463491baed9): insert into table boolean_test_orc values ('1', 1)
INFO  : Query ID = hive_20170314105037_9c85dd4f-2734-4c8c-b4ed-4463491baed9
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Session is already open
INFO  : Dag name: insert into table boolean_test_orc valu...1)(Stage-1)
INFO  : 

INFO  : Status: Running (Executing on YARN cluster with App id application_1489485677402_0002)

INFO  : Map 1: 0(+1)/1	
INFO  : Map 1: 1/1	
INFO  : Status: DAG finished successfully in 0.22 seconds
INFO  : 
INFO  : Query Execution Summary
INFO  : ----------------------------------------------------------------------------------------------
INFO  : ----------------------------------------------------------------------------------------------
INFO  : Compile Query                           0.07s
INFO  : Prepare Plan                            0.06s
INFO  : Submit Plan                             0.13s
INFO  : Start DAG                               0.21s
INFO  : Run DAG                                 0.22s
INFO  : ----------------------------------------------------------------------------------------------
INFO  : 
INFO  : Task Execution Summary
INFO  : --------------------------------------------------------------------------------------------------------------------------------
INFO  : --------------------------------------------------------------------------------------------------------------------------------
INFO  :      Map 1            1                0             0         217.00             0            0              1               0
INFO  : --------------------------------------------------------------------------------------------------------------------------------
INFO  : 
INFO  : LLAP IO Summary
INFO  : ----------------------------------------------------------------------------------------------
INFO  : ----------------------------------------------------------------------------------------------
INFO  :      Map 1         0         0          0        0B         0B          0B       0B     0.00s
INFO  : ----------------------------------------------------------------------------------------------
INFO  : 
INFO  : FileSystem Counters Summary
INFO  : 
INFO  : ----------------------------------------------------------------------------------------------
INFO  : ----------------------------------------------------------------------------------------------
INFO  :      Map 1              0B             0                  0                 0B             0
INFO  : ----------------------------------------------------------------------------------------------
INFO  : 
INFO  : ----------------------------------------------------------------------------------------------
INFO  : ----------------------------------------------------------------------------------------------
INFO  :      Map 1              4B             4                  0               364B             3
INFO  : ----------------------------------------------------------------------------------------------
INFO  : 
INFO  : org.apache.tez.common.counters.DAGCounter:
INFO  :    NUM_SUCCEEDED_TASKS: 1
INFO  :    TOTAL_LAUNCHED_TASKS: 1
INFO  :    DATA_LOCAL_TASKS: 1
INFO  :    AM_CPU_MILLISECONDS: 130
INFO  :    AM_GC_TIME_MILLIS: 0
INFO  : File System Counters:
INFO  :    FILE_BYTES_READ: 0
INFO  :    FILE_BYTES_WRITTEN: 0
INFO  :    FILE_READ_OPS: 0
INFO  :    FILE_LARGE_READ_OPS: 0
INFO  :    FILE_WRITE_OPS: 0
INFO  :    HDFS_BYTES_READ: 4
INFO  :    HDFS_BYTES_WRITTEN: 364
INFO  :    HDFS_READ_OPS: 4
INFO  :    HDFS_LARGE_READ_OPS: 0
INFO  :    HDFS_WRITE_OPS: 3
INFO  : org.apache.tez.common.counters.TaskCounter:
INFO  :    INPUT_RECORDS_PROCESSED: 1
INFO  :    INPUT_SPLIT_LENGTH_BYTES: 4
INFO  :    OUTPUT_RECORDS: 0
INFO  : HIVE:
INFO  :    CREATED_FILES: 1
INFO  :    DESERIALIZE_ERRORS: 0
INFO  :    RECORDS_IN_Map_1: 1
INFO  :    RECORDS_OUT_1_default.boolean_test_orc: 1
INFO  : TaskCounter_Map_1_INPUT_values__tmp__table__35:
INFO  :    INPUT_RECORDS_PROCESSED: 1
INFO  :    INPUT_SPLIT_LENGTH_BYTES: 4
INFO  : TaskCounter_Map_1_OUTPUT_out_Map_1:
INFO  :    OUTPUT_RECORDS: 0
INFO  : Starting task [Stage-2:DEPENDENCY_COLLECTION] in serial mode
INFO  : Starting task [Stage-0:MOVE] in serial mode
INFO  : Loading data to table default.boolean_test_orc from hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/boolean_test_orc/.hive-staging_hive_2017-03-14_10-50-37_632_6559944984770499213-3/-ext-10000
INFO  : Starting task [Stage-3:STATS] in serial mode
INFO  : Table default.boolean_test_orc stats: [numFiles=8, numRows=8, totalSize=2368, rawDataSize=733]
INFO  : Completed executing command(queryId=hive_20170314105037_9c85dd4f-2734-4c8c-b4ed-4463491baed9); Time taken: 0.725 seconds
INFO  : OK
No rows affected (0.822 seconds)
    

That's what I call verbose. And it still says "No rows affected" which is still not true.