The Data Studio

Hive does not support the "or" logical operator in a join

Tested Using Hortonworks Data Platform (HDP) Sandbox, Release 2.5, Hive 2.1.0.2.5.0.0-1245

Here's a simple example. We have two tables, driver and passenger, and this is what they contain:

      select * from driver;

      +------------+--------------------+-------------------+--+
      | driver.id  | driver.first_name  | driver.last_name  |
      +------------+--------------------+-------------------+--+
      | 6338       | James              | Smith             |
      | 6392       | Julie              | James             |
      | 5437       | Martin             | Walters           |
      | 4482       | Ruby               | Jones             |
      | 6389       | Thomas             | Kelly             |
      +------------+--------------------+-------------------+--+
      5 rows selected (0.161 seconds)

      select * from passenger;

      +---------------+-----------------------+----------------------+--+
      | passenger.id  | passenger.first_name  | passenger.last_name  |
      +---------------+-----------------------+----------------------+--+
      | 7492          | Martha                | James                |
      | 7641          | Peter                 | Warren               |
      | 1324          | Lucy                  | Wilson               |
      | 3521          | Martin                | Ruby                 |
      | 2749          | Mary                  | Baldwin              |
      +---------------+-----------------------+----------------------+--+
      5 rows selected (0.196 seconds)
    

For some reason that we don't need to go into here, we want to find cases where a driver has the same last_name as a passenger or the same first name as a passenger. Here's the obvious way to write this query:

      select
          a.id,
          a.first_name,
          a.last_name,
          b.id,
          b.first_name,
          b.last_name
      from
          driver a
          join
          passenger b
          on
              a.last_name = b.last_name or
              a.first_name = b.first_name;
      

But this query gives an error message:

      Error: Error while compiling statement: FAILED: SemanticException [Error 10019]: Line 13:8 OR not supported in JOIN currently 'first_name' (state=42000,code=10019)
    

It is possible to workaround this using "union", if you are very careful. The obvious way to write this as a union is:

      select
          a.id,
          a.first_name,
          a.last_name,
          b.id,
          b.first_name,
          b.last_name
      from
          driver a
          join
          passenger b
          on a.last_name = b.last_name
      union
      select
          c.id,
          c.first_name,
          c.last_name,
          d.id,
          d.first_name,
          d.last_name
      from
          driver c
          join
          passenger d
          on c.first_name = d.first_name;
    

... and the results are:

      +--------+----------------+---------------+--+
      | u2.id  | u2.first_name  | u2.last_name  |
      +--------+----------------+---------------+--+
      | 5437   | Martin         | Walters       |
      | 6392   | Julie          | James         |
      +--------+----------------+---------------+--+
      2 rows selected (1.755 seconds)
    

That gives me three result columns and I specified six. Hive always prefixes the column heading with the table it came from but it seems to have lost track of what it is doing here.

Maybe this will fix it, we'll give the result columns specific unique names:

      select
          a.id          as driver_id,
          a.first_name  as driver_first_name,
          a.last_name   as driver_last_name,
          b.id          as passenger_id,
          b.first_name  as passenger_first_name,
          b.last_name   as passenger_last_name
      from
          driver a
          join
          passenger b
          on a.last_name = b.last_name
      union
      select
          c.id,
          c.first_name,
          c.last_name,
          d.id,
          d.first_name,
          d.last_name
      from
          driver c
          join
          passenger d
          on c.first_name = d.first_name;
    
      Error: Error while compiling statement: FAILED: SemanticException [Error 10007]: Ambiguous column reference id in _u1-subquery2 (state=42000,code=10007)
    

OK, we'll spoonfeed Hive and give it as little to work out as possible.

      select
          a.id          as driver_id,
          a.first_name  as driver_first_name,
          a.last_name   as driver_last_name,
          b.id          as passenger_id,
          b.first_name  as passenger_first_name,
          b.last_name   as passenger_last_name
      from
          driver a
          join
          passenger b
          on a.last_name = b.last_name
      union
      select
          c.id          as driver_id,
          c.first_name  as driver_first_name,
          c.last_name   as driver_last_name,
          d.id          as passenger_id,
          d.first_name  as passenger_first_name,
          d.last_name   as passenger_last_name
      from
          driver c
          join
          passenger d
          on c.first_name = d.first_name
    

At last, the correct answer.

      +---------------+-----------------------+----------------------+------------------+--------------------------+-------------------------+--+
      | u2.driver_id  | u2.driver_first_name  | u2.driver_last_name  | u2.passenger_id  | u2.passenger_first_name  | u2.passenger_last_name  |
      +---------------+-----------------------+----------------------+------------------+--------------------------+-------------------------+--+
      | 5437          | Martin                | Walters              | 3521             | Martin                   | Ruby                    |
      | 6392          | Julie                 | James                | 7492             | Martha                   | James                   |
      +---------------+-----------------------+----------------------+------------------+--------------------------+-------------------------+--+
      2 rows selected (1.354 seconds)
    

What Would Postgres Do?

Postgres is a pretty good guide to the right response to any contentious behaviour by database systems. This is what happens in Postgres:

      create table driver
      (
          id           integer,
          first_name   varchar(32),
          last_name    varchar(32)
      );
      CREATE TABLE
      create table passenger
      (
          id           integer,
          first_name   varchar(32),
          last_name    varchar(32)
      );
      CREATE TABLE
      copy driver from '/Users/ronballard/SVN/the_data_studio/research_in_progress/driver.csv' with csv;
      COPY 5
      copy passenger from '/Users/ronballard/SVN/the_data_studio/research_in_progress/passenger.csv' with csv;
      COPY 5
      select
          a.id,
          a.first_name,
          a.last_name,
          b.id,
          b.first_name,
          b.last_name
      from
          driver a
          join
          passenger b
          on
              a.last_name = b.last_name or
              a.first_name = b.first_name;

        id  | first_name | last_name |  id  | first_name | last_name 
      ------+------------+-----------+------+------------+-----------
       6392 | Julie      | James     | 7492 | Martha     | James
       5437 | Martin     | Walters   | 3521 | Martin     | Ruby
      (2 rows)
    

This gives you some idea of the productivity level you can achieve with PostgreSQL compared with Hive. Everything in Hive seems to be a big drama. With Hive you have to keep trying things, re-reading the documentation, rummage around on the community pages. Eventually you get something that is right, if you're lucky. PostgreSQL is particularly good, but most of the established real relational databases are much closer to PostgreSQL in productivity and miles ahead of Hive.