The Data Studio

Hive: How To Find Invisible Characters

Invisible characters are things like: horizontal tab, "null" byte, bell, etc. These are generally control characters: hex 00 thru hex 1f and hex 7f.

Invisible characters are a problem because if they are embedded in a string you are searching for, you probably won't find it because you will type in what you see and the database stores what you see plus one or more hidden characters.

For a more complete description of this and related problems see: Where Did These Funny Characters Come From?.

If you have got past the stage of complete bafflement about what is in your data, and you now have some suspicion that particular hidden characters are present you can search for them using the ascii() function and using character literals in the form 'u\xxxx' where xxxx is a hexadecimal number.

The tab character is hex 09 (= decimal 9, = octal 011). So, if I suspect that some values in a column have a tab character in position 50, I can search for them like this:

      select
          ascii(substring(line, 50, 1)) as possible_tab
      from
          hdkehb_fixed_length_table 
      where
          ascii(substring(line, 50, 1)) = 9 
      limit 3;
 
      +--------------+--+ 
      | possible_tab | 
      +--------------+--+ 
      | 9            | 
      | 9            | 
      | 9            | 
      +--------------+--+ 
      3 rows selected (8.087 seconds)
    

That's a pretty specific suspicion, suggesting that I already kniow quite a lot about this data. If I think there is a tab somewhere in a column value, but I don't know where, I can use the chr() function like this:

      select
          instr(line, chr(9)) as position_of_tab 
      from
          hdkehb_fixed_length_table 
      where
          line like concat('%', chr(9), '%') 
      limit 3;
  
      +-----------------+--+ 
      | position_of_tab |
      +-----------------+--+ 
      | 49              | 
      | 50              | 
      | 47              | 
      +-----------------+--+ 
      3 rows selected (8.099 seconds) 
    

We are using the function chr(9) in two ways:

The chr() function only works for values from zero to 127. This is not what the documentation says. If you search on chr(226) for example, you will not get a match, even if that byte value exists in the data.

You can also search on the Unicode Code Point using the contruct \uxxxx where x is a hexadecimal digit. The number is the Code Point, not the value stored in the bytes. As an example, the Euro symbol (€) has Unicode Code Point \u20ac. If you search for this, you will find the Euro symbol although it is stored as hex:e2 82 ac in UTF-8 (and as hex:80 in Windows-1252).

The approaches described here are simple and quick. If you have a strong idea of what may be causing a problem these are good tools to test your suspicions. If not, then you need to take a more comprehensive approach as described in Where Did These Funny Characters Come From?. The comprehensive approach is essential if you are looking to load files of similar data on a regular basis. It isn't difficult and it can save you a lot of ad-hoc trouble-shooting.