The Data Studio

Common Data Quality Problems - How Your Data Gets Into A Mess

I have seen estimates that between 60% and 80% of Data Scientists' time is spent on cleaning up data. That level seems plausible to me based on the amount of time I have spent in my life dealing with data quality issues.

This page describes some of the causes of dirty data. In other pages we will look at what to do about it.

Inconsistency In Tables That Are Not "Normalised"

We are talking here about normalisation in the relational database sense. This kind of normalisation is a way of organising data so that things which belong together are in the same table. Normalisation also provides a consistent way of linking between related tables. Normalisation reduces duplication of data, and duplication is an enemy of data quality.

In many cases, tables are not normalised to start with. In other cases we start with normalised tables and then de-normalise them so that, for example, a calculated value is stored "to save recalculating it every time". So we might define an active customer as one who has made a transaction in the last month. We can always determine whether a customer is active or not by looking at her transactions. If we decide to store the "active" attribute with the customer, then we have the same information twice. So then, what does it mean if the customer status is "not active" but there is a transaction in the last month? If this happened it would be a defect – perhaps the application that maintained the "active" attribute did not take account of late transactions. If we de-normalise then we have to check that the update happened. We have introduced a possibility of error, so have we really saved any time?
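
As a quick sketch of the kind of check this forces on us (the customer and transaction table and column names here are just illustrative), a query like this finds customers who are flagged as not active even though they have a transaction in the last month:

        select cu.id
        from customer cu
        where not cu.active
        and exists (
            select 1
            from transaction tr
            where tr.customer_id = cu.id
            and tr.created_at > current_date - interval '1 month'
        );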

It is now very common to "de-normalise" groups of two or more tables to create one "Big Wide Table". The motivation is to reduce the number of joins between tables and to get all the data required for some analysis into one rectangular table, like a big Excel spreadsheet. Very often the Big Wide Table is an Excel spreadsheet. In the insurance industry this is called a "Bordereau" which sounds very grand, but it is just a Big Wide Table. Here's an example:

      select
          a.first_name,
          a.family_name,
          a.city,
          a.policy_number,
          d.description,
          c.created_at        as transaction_time,
          c.actual_or_reserve,
          c.amount
      from
          person a
          join claimant b    on a.id = b.person_id
          join transaction c on b.id = c.claimant_id
          join claim_head d  on d.id = c.claim_head_id
      order by
          family_name,
          first_name,
          transaction_time;
    
  first_name  |     family_name      |       city       | policy_number |                  description                   |  transaction_time   | actual_or_reserve |  amount  
--------------+----------------------+------------------+---------------+------------------------------------------------+---------------------+-------------------+----------
 Angus        | Addams               | Thousand Oaks    | EET/4429      | Third Party Bodily Injury Defendant Fees       | 2015-02-21 12:25:13 | Actual            |    13.24
 Angus        | Addams               | Thousand Oaks    | EET/4429      | First Party Malicious damage                   | 2015-05-17 18:49:17 | Reserve           |    75.54
 Angus        | Addams               | Thousand Oaks    | EET/4429      | First Party Personal Accident                  | 2015-06-16 02:55:14 | Actual            |    88.31
 Angus        | Addams               | Thousand Oaks    | EET/4429      | First Party Fire                               | 2015-07-27 20:01:30 | Reserve           |    70.29
 Angus        | Addams               | Thousand Oaks    | EET/4429      | First Party Theft                              | 2015-11-15 23:48:06 | Reserve           |    83.93
 Angus        | Addams               | Thousand Oaks    | EET/4429      | Third Party Bodily Injury Damages              | 2015-12-22 10:29:55 | Actual            |  6474.68
 Angus        | Addams               | Thousand Oaks    | EET/4429      | First Party Keys                               | 2016-01-04 16:10:30 | Reserve           |  9785.91
 Sachiko      | Addams               | Santa Ana        | MN.938.M      | First Party Locks                              | 2013-05-09 21:50:55 | Actual            |    36.83
 Sachiko      | Addams               | Santa Ana        | MN.938.M      | First Party Fire                               | 2013-06-17 09:03:51 | Reserve           |    15.86
 Sachiko      | Addams               | Santa Ana        | MN.938.M      | Third Party Property Damage Hire (credit)      | 2013-08-11 23:44:32 | Reserve           |    90.86
 Sachiko      | Addams               | Santa Ana        | MN.938.M      | First Party Accidental damage                  | 2013-12-20 19:26:15 | Reserve           |    83.06
 Sachiko      | Addams               | Santa Ana        | MN.938.M      | First Party Fire                               | 2014-02-09 03:45:17 | Actual            |  9305.91
 Sachiko      | Addams               | Santa Ana        | MN.938.M      | Third Party Bodily Injury Legal Costs          | 2014-02-17 19:44:27 | Reserve           |    45.92
 Sachiko      | Addams               | Santa Ana        | MN.938.M      | Third Party Property Damage Non vehicle damage | 2014-02-25 01:22:10 | Actual            |    15.50
    

I have restricted the number of columns to be able to show this on the web page, but usually the user will want all the columns. He will then create cleaned or manipulated tables from this. There are many opportunities to make mistakes with the duplicated data (the first four columns) and, now that the data is de-normalised, the database cannot do anything to help protect the integrity.

Duplicate Keys

If the keys are in a flat file, or are in a database but with primary key constraints not enforced, there is a really high probability that duplicates will creep in. The process for generating new keys may be flawed, or there may be some back-door process that lets a maintenance team edit the database. It happens.

Once you have duplicate keys, they may not be noticed by the users. The users may arrange to look at only the first record they find for a particular key. We worked with an insurance company that managed to duplicate some of its policy numbers. There were a number of very dissatisfied customers before the error was noticed.

If a database constraint is being used to ensure that the key is unique, this error cannot occur and there is no need for another check. Often, though, database constraints are not enforced, or the data is in a file (an XML file perhaps) which does not support such constraints. In these cases it is necessary to check for duplicates explicitly, often at a high cost.
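
At least the check itself is simple. Assuming, for illustration, a policy table keyed on policy_number, this query lists any duplicated keys:

        select policy_number, count(*)
        from policy
        group by policy_number
        having count(*) > 1;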

Relational databases have a feature called "referential integrity" or "foreign key constraints". These rather grand terms just mean that the database automatically checks that references from one record to another exist where they should; in other words, there are no broken links. An example would be when you pay some money into your bank account. In the bank's database, there must be a reference from the record of your transaction to your account. If not, the bank would have lost the information that makes this your money. You would care about this.

Unfortunately it is common practice for database administrators to fail to implement referential integrity constraints, or to turn them off. Usually the excuse is performance. Firstly, they should make sure that there really is a performance issue, and secondly they must make sure that the link can be guaranteed somehow. Usually the "referential integrity constraint" in the database is the most efficient way to do the job.
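
Using the claimant and person tables from the example above, the constraint is a one-line declaration (the constraint name here is just for illustration):

        alter table claimant
            add constraint claimant_person_fk
            foreign key (person_id) references person (id);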

When there are broken links, the users often add "outer joins" to make sure that they collect unmatched as well as matched records. This is inefficient but, more importantly, it can give the wrong results. The worst cases happen where the results are plausible but wrong. Then the wrong business decisions are taken and they can be very costly.

If all else fails, we can explicitly check for broken links in a database like this:

        select person_id from claimant 
        except 
        select id from person;
      

If this returns any rows then one or more person_id values in claimant are missing from the person table. This query works, but it can be an expensive way to tackle the problem, not so much in running the query as in trying to fix the broken data, which often takes a lot of research.

Inappropriate Data Types

Relational databases allow you to define "strong" data-types: for example, an integer data-type will hold only whole numbers between -2147483648 and +2147483647 (4 bytes as a binary number). Similarly, a varchar(16) data-type will hold up to 16 characters. If you try to load anything outside the defined range you get an error that you can trap in your application and use to warn the user. "Crash, don't trash" as Andy Hunt and Dave Thomas say in The Pragmatic Programmer. It is so much better to stop if the data is wrong, and put it right, than it is to carry on regardless and trash your database.
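
As a sketch (the payment table here is hypothetical), this is what strong data-types buy you: the bad insert fails immediately instead of quietly storing nonsense:

        create table payment (
            id             integer       not null,
            account_number varchar(16)   not null,
            amount         numeric(12,2) not null,
            paid_at        timestamp     not null
        );

        -- this fails with an error, which is exactly what we want:
        insert into payment (id, account_number, amount, paid_at)
        values (1, 'ACC-001', 'twelve pounds', '2016-01-04 16:10:30');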

The great benefit of strong data-types is that they allow you to detect completely invalid data values and keep them out of the database. Strong data-types enable you to crash when you need to and that is a very good thing.

Often, though, we see long character fields (or the non-standard "string" data-type in some databases) used to store everything. This means that you get alphabetic characters in numbers, months with 37 days in dates, and telephone numbers stored as amounts of money. And then you get highly skilled Data Scientists wasting their time trying to find gems in the mountain of dirt that is your database. It stops the data load from crashing, but we want it to crash if the data is wrong!

Sometimes we are forced to load bad data, so we load it into a staging area using weak data-types, and then clean it before loading it into the reporting system. This wastes resources and is unreliable. Once the data is dirty you can never be sure what it was intended to be; the best you can do is guess, whatever the salesman with a "data cleansing tool" tells you. (And why is it "cleansing", rather than "cleaning"? It sounds more thorough somehow, but it is still a guess.)

Nulls

Relational databases have the concept of null, which means that if a column in a particular row (a cell) is null then it has no value. Since a null cell has no value it cannot be equal to anything, not even another null. This can be useful, but it is widely misunderstood.
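
You can see the behaviour for yourself in PostgreSQL, for example:

        -- neither of these is true; a null is not equal (or unequal) to anything, not even another null
        select null = null;
        select null <> null;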

In cases where a variety of users will be querying the database, it is best to avoid nulls altogether. Most databases can ensure that a column cannot be null by labelling that column "not null" in the table definition.

Where possible, we should use default values instead of nulls because they are easier for the users to understand. Whether this is possible or not depends on the data-type of the column and the possible valid values. In a character (varchar) column, it may make sense to use the string "UNKNOWN" where the value is unknown, rather than making it null. If the column can validly hold positive integers only, then -1 may be a useful default value meaning unknown. But if the users need to do arithmetic on the column values then -1 would not be acceptable.
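
A hypothetical table definition showing both approaches mentioned above, the "UNKNOWN" string and the -1 placeholder:

        create table contact (
            id       integer     not null,
            city     varchar(50) not null default 'UNKNOWN',
            children integer     not null default -1  -- -1 means "not known"
        );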

However you decide to attack this problem, you must make sure that the users are trained so that they understand what they are dealing with. We really do not want our users to be guessing.

Late or Incomplete Data Loads

Users really need to know what data is available to them. If data is loaded to a reporting system for users then they must know how up-to-date the data is, and whether any data is missing. Data can be missing from time to time, especially if it is supplied from multiple sources. One example we saw was in a mobile phone company where sometimes one switch would fail to transmit its data successfully. If one hour's worth of data is missing from one switch, that can have a significant impact on how much should be billed to customers whose calls went through that switch.

If the data never arrives, then the statistics will be wrong, but the gap may be interpreted as being a significant change in behaviour.

The system must be responsible not only for providing the data to the users, but also for telling them what they have and what may be missing.

Missing Data

Sometimes the data is not just late, it is missing for ever. The Internet of Things is particularly vulnerable to this type of error because the data is being transmitted from many small devices to a central processing facility. A particular device may fail, or it may be unable to transmit for a period of time. These devices usually have limited storage capacity and are designed to overwrite the oldest data with new data when they become full. If the old data has not been transmitted then it is lost for ever.

We need to know if any data is missing, so such interfaces should be controlled by some kind of double-entry auditing. For example, the device could send, each day, a list of the files that should have been received, so that the central process can detect the failure if any are missing.
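
A sketch of the check on the receiving side, assuming we load the device's daily list into an expected_files table and record what actually arrived in received_files (both names are hypothetical); any rows returned are files that were promised but never arrived:

        select file_name from expected_files
        except
        select file_name from received_files;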

Even with such auditing it is possible to miss files if the loss occurs before the point at which the source system creates the audit file. Sometimes we can identify a loss of this type with detailed trending, which might show an unusually low value for the day on which the loss occurred.
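
The trending itself can be as simple as a daily row count, using the transaction table from the earlier example purely for illustration, so that an unusually low day stands out:

        select date_trunc('day', created_at) as day, count(*) as row_count
        from transaction
        group by 1
        order by 1;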

Problems With Character Encoding

There is a separate article on this subject: Dodgy Characters. I would recommend reading this if you run into this kind of problem.

What happens is that either a marker for an unknown character appears, or the data looks normal but does not match a search-string that looks exactly the same. The marker may be a question mark, or may be the special "replacement character" �. If you get the replacement character then that is helpful because it gives you a clue about what is wrong. But maybe you get something like this:

         select id, name from customer where id = 15650;

          id   |      name      
        -------+----------------
         15650 | Jasmine Tatlow
        (1 row)
      

That looks normal, so maybe next time you want to search on the name:

        select id, name from customer where name = 'Jasmine Tatlow';

         id | name 
        ----+------
        (0 rows)
      

Has someone deleted it already? You go back and search on the id again and the row you want is still there, so you check the spelling. You copy and paste the name from the one that worked. Still nothing! You could go on for days like this, but we all have lives to lead, so here is the answer:

        select id, name, ascii(substring(name from 8 for 1)) from customer where id = 15650;

         id   |      name      | ascii 
       -------+----------------+-------
        15650 | Jasmine Tatlow |     9
       (1 row)
      

ASCII 9 is the tab character. It looks like a space, but it does not match a space, because a space is ASCII 32.
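
Once you know what you are looking for, you can find every other name with the same problem; in PostgreSQL, for example:

        select id, name
        from customer
        where name like '%' || chr(9) || '%';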

There are many character values that will not show up on your display or will look like something else. If you run into such problems, have a look at Dodgy Characters.

Some databases offer more protection than others in this area. As usual PostgreSQL is one of the best and Hive hasn't even worked out what the problem is yet. But even PostgreSQL won't protect you against tab characters, because they could be valid in some circumstances. The best data quality means that users get what they expect as often as possible. When it isn't possible then we must give them more training to explain what they are seeing.

Format Errors

When we load data into a database, we must follow some conventions about how we tell the database what we are giving it.

Loading dates is particularly sensitive. If we see a date like this "12/07/29" what does it mean? Is it 12-Jul-1929 or 7-Dec-2029 or 29-Jul-2012 or what? Americans, Swedes and British people would all guess differently.
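
PostgreSQL, for example, interprets that literal according to the session's DateStyle setting, so the same string can load as two different dates; an explicit format removes the ambiguity:

        set datestyle = 'ISO, DMY';
        select '12/07/29'::date;                     -- 2029-07-12

        set datestyle = 'ISO, MDY';
        select '12/07/29'::date;                     -- 2029-12-07

        select to_date('2012-07-29', 'YYYY-MM-DD');  -- unambiguous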

Most websites now use date-pickers to make sure we specify what we want. That is a good idea, but there are many systems that let us type in whatever we like and then interpret our input however they like. I once worked in a company in which half the PCs were set up to use UK-format dates and half with US-format dates (not deliberately, of course, it was just careless). When this was discovered, there were many thousands of dates that were wrong, and we could not tell which were which. The company decided to live with the problem rather than fix the database, because the only solution (for the ambiguous dates) was to go back to the source documents and that would have been prohibitively expensive.

We can also get format errors in numbers. There are different conventions in different countries for decimal points and thousands separators, for example. Sometimes we see scientific notation (1.5367E+7, for example).

Time may be specified in seconds, milliseconds, hours:minutes:seconds and several other formats.

We want our data to be accurate, so we need to know what the format is, interpret it correctly, and crash (yes, really) if we detect any error.
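
In PostgreSQL, for example, a cast to a numeric type does exactly that: it copes with scientific notation, and it crashes on anything it cannot interpret:

        select '1.5367E+7'::numeric;   -- 15367000
        select '12,345.67'::numeric;   -- error: the comma needs to be handled explicitly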

Bad Names

People roll their eyes when I say that bad naming is a data quality issue. OK I'm picky, and proud to be.

I went into the train station recently to collect some tickets that I had ordered online. When I got to the front of the queue and asked for my tickets, the guy behind the counter said "TVM". I said "what?" He said "TVM". I said "what is TVM?". He gesticulated to his left. By this time the queue was getting restless. I understand; they all had trains to catch and every second counts. So I gave up my place in the queue and wandered off to the right to solve this puzzle in my own time. Eventually I guessed that he meant "Ticket Vending Machine". There was a ticket machine but it wasn't labelled "TVM" or even "Ticket Vending Machine". I don't know what other kind of ticket machine would be useful apart from one that sells tickets, but they obviously needed to add a word to be able to get a TLA.

Later that day, in the office, I was looking at the specification of a file that we had to load into our Data Lake. The file was a transaction file. One of the 150 fields in this file was called "TRAN_IND". I scrolled across the brightly coloured cells of the Excel spreadsheet to the "Description" column. It said "transaction indicator". I said "WTF?". The company that produced this file was bought by my client many years ago. Nobody could tell me what TRAN_IND means. I marked the TRAN_IND field "worthless" and moved on.

Bad naming is a Data Quality issue, a big one. We can see codes and numbers in a file, but if we don't know what they mean, then they are worthless. It is better to ignore data we don't understand than to guess what it means.

Deprecation

What an ugly word - "deprecation"! I know some smart people who can't even say it right.

I like Java. I think it is a good development language, not perfect, none are, but good. But why, oh why, does it come with all this flowery, subtly misused language? Anyway, the tide of language moves on, and "deprecated" is the word we now use for something that is obsolete: it can still be used, but it is better if we don't.

The concept of deprecation is also something to avoid. Most databases I see are littered with zombie tables and zombie columns. These are deprecated, but nobody thought to tell the users. So when something is wrong, we need to correct it and remove the wrong data. Yes it will cause some upheaval, but we will get to a better place more quickly if we force the complete change now. If we leave the zombies there they will keep reappearing as detected or undetected data quality issues. They also eat expensive storage space.

Here's a concrete example from a mobile phone company:

A column called payment_type was used to show whether a customer pays by a pre-pay, post-pay or hybrid method. This was shown to be unreliable for certain customers, and a new column called payment_combination was added from a more reliable source. However, the old column was left in the database. New users developed reports using the old column, and every time somebody noticed that the reports were wrong, it was raised as another data quality issue. What a waste of time!
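
The fix is short, once every report has been moved over to payment_combination (the customer table name here is an assumption):

        alter table customer drop column payment_type;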

What Can We Do About It?