The Data Studio

Data Profiling

Data profiling looks at the characteristics of data for the purpose of finding potential data quality issues.

Data profiling is a technique that reveals the content and structure of the data by measuring the data set as a whole, rather than by examining individual rows in detail. For example, checking the maximum and minimum values of a column, rather than spot-checking individual values, will quickly tell you whether any values are outside the range you would expect.
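As a minimal sketch of the idea (using Python and pandas, with a hypothetical file and column name), a range check across a whole column takes only a few lines:

    # A sketch of a whole-column range check; "accounts.csv" and "balance" are hypothetical.
    import pandas as pd

    df = pd.read_csv("accounts.csv", dtype=str)                # read every field as text
    balance = pd.to_numeric(df["balance"], errors="coerce")    # non-numeric values become NaN

    print("minimum:", balance.min())
    print("maximum:", balance.max())
    print("non-numeric values:", int(balance.isna().sum() - df["balance"].isna().sum()))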

You can use data profiling on data sets that you are bringing into a database from existing systems in your organisation, and on data sets that come from external sources. External sources might be data that you are buying from another company, or data that is published by government or international sources. There is a lot of free data available now.

You could reasonably expect that data you are paying for and data from official sources would be good quality. Profile it and prepare to be surprised! Some is good, but I have adjusted my view of some big organisations after profiling their data.

You can also profile data that is in your own databases. Again you should be prepared to be surprised.

Data Profiling Tools

Tools are available from several vendors: Trillium Software (TS Discovery), SAS (DataFlux), IBM (InfoSphere Information Analyzer), Oracle (EDQ), etc. These tools vary in capability - Trillium and DataFlux are very capable, but expensive.

Alternatively you can use our tool, which is free and will certainly add to your understanding of your data.

The Output from Data Profiling

You can see some samples of the output from our Data Profiler here. The output from any data profiling tool needs to be interpreted before it is useful. For example, if a high proportion of values for a particular field are null, this often indicates that there is a problem with that field. But think about a system that records customer accounts. Most of these, in a successful business, will be active accounts, so a field that records the cancellation date will be null in most records; the cancellation date will be recorded only for those customers who have cancelled their accounts. In this case, a large proportion of nulls is a good thing.
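A simple sketch of how a null-proportion report might be produced (Python and pandas again; the file and its columns are hypothetical, and the interpretation still has to be done by a person who knows the business):

    # A sketch: the percentage of nulls in every column, highest first.
    import pandas as pd

    df = pd.read_csv("customer_accounts.csv", dtype=str)
    null_report = (df.isna().mean() * 100).round(1).sort_values(ascending=False)
    print(null_report)   # a mostly-null cancellation_date may be exactly what we expect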

The characteristics that we measure for each data field will mean less to our users than they do to data analysts. It is not enough for the data analyst to run the data profiling reports and paste them into a specification. The analysts need to read the profile report, investigate any anomalies, decide what action should be taken and describe the impact of these anomalies on the users.

The actions we take may include:

Implementing data cleaning rules is a last resort. It is unsatisfactory because:

You can buy expensive Data Cleansing tools. Note that they don't just "clean", they "cleanse". Right! In practice it is possible to standardise addresses, because reliable data is available for almost all valid addresses in the UK, and because there is duplication between the written address and the postcode, so the two forms of the address can be cross-checked against each other. This is the exception. There are many ways to find data that is invalid, but no general way to work out what the correct values should be.
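As an illustration of the kind of check that is possible (not a substitute for cross-checking against the reference address data mentioned above), the sketch below applies the published UK postcode format as a simple pattern test. The regular expression is a simplification and the function name is mine:

    # A sketch of a format check on UK postcodes; the pattern is a simplification
    # and only tests the shape of the value, not whether the postcode exists.
    import re

    POSTCODE_FORMAT = re.compile(r"^[A-Z]{1,2}[0-9][A-Z0-9]?\s?[0-9][A-Z]{2}$")

    def looks_like_uk_postcode(value: str) -> bool:
        return bool(POSTCODE_FORMAT.match(value.strip().upper()))

    print(looks_like_uk_postcode("SW1A 1AA"))   # True
    print(looks_like_uk_postcode("12345"))      # False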

Protest Warning Your manager (we'll call him Dumas), of course, does not want to hear that there is an issue that requires ongoing work. Not only is it work but also it is highly variable and will not fit nicely into a plan. No sooner does this thought cross his mind than a call from a sales person comes through telling him that there is a Data Cleansing Silver Bullet that the sales person would be only too happy to demonstrate to him. There's no point in getting the techies involved; they are always so negative. And who is going to have to battle with this piece of fiction? Who is going to get beaten up when, despite being given this wonderful, expensive Silver Bullet you still have not been able to polish the garbage that lives in the company's data? It won't be Dumas the manager.

Back in the real world, the developers find the true characteristics of the data when their applications fail on conditions that were not described in the data specification. This is an expensive way to do it, and delays the project, usually at a critical time. If we profile the data effectively then any development using the data will avoid much of the rework that normally takes place.

Another option, and a better one in my opinion, is to combine the analyst and programmer roles. It sounds very old-fashioned to have an "analyst/programmer"; that was my job title 40 years ago. But it makes sense. If the same person is doing the analysis and the programming then she can optimise her work and do each task at the time it gives most benefit.

Source Data

The data provided for data profiling should be produced using exactly the same process as that which will be used for the applications being developed.

Ideally, the data for profiling, and for development, should come directly from the operational source system. Normally the source system keeps its data in a database, so a read-only connection to the operational database is the ideal mechanism.
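A minimal sketch of that mechanism, using SQLite purely as a stand-in for whatever operational database you actually have (the file, table and column names are hypothetical):

    # A sketch: open the operational database read-only, so profiling cannot change anything,
    # and run an aggregate query instead of pulling rows across one by one.
    import sqlite3

    conn = sqlite3.connect("file:operational.db?mode=ro", uri=True)   # read-only connection
    row = conn.execute(
        "SELECT COUNT(*), COUNT(cancellation_date), MIN(opened), MAX(opened) FROM accounts"
    ).fetchone()
    print(row)
    conn.close()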

If the source data is in a file system, we want a link to the operational files. Data profiling will only read the data, so it is safe to do this. At worst there will be a small performance impact as the data profiling tool reads it. We need to be sensitive to peak loads on the source system and run data profiling jobs at quiet times. But usually the impact is much, much smaller than the fear of the owners of the operational system would suggest.

Sometimes the data being provided for profiling and being used for the new application will be extracted from the operational database by the owners of that system. You need to politely, diplomatically and very firmly resist this approach because it introduces another source of error. In practice, developers writing data extract processes can rarely resist the temptation to be selective about what they extract and to “improve” the data in the process. You need to see the data as it really is, otherwise this exercise is a waste of time and we hate to waste time.

If an extract is the only mechanism available to us, then it is essential that the data is extracted by an automated process that always behaves in exactly the same way; otherwise neither the profiling nor our new applications will be reliable. We have seen source data that is edited by hand after being extracted from the source and before being delivered to the target system. This is disastrous - time to stamp your feet and have a big tantrum.

Desktop tools, particularly Microsoft Excel, should never be used as an intermediary between the source system and the data profiler or the application. Such tools change the data. For example Microsoft Excel strips leading zeros off anything it considers to be a number. It therefore converts telephone numbers from, for example: "07123456789" to "7123456789". It also reformats dates, times and floating point numbers. Microsoft Word changes quotation characters and hyphens, corrects spelling and changes capitalisation. Other desktop tools may be less cavalier, but they are still unreliable for the purposes of data analysis, data profiling and data extraction so they should never be used in these processes.
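If you have to read a delimited file programmatically, a safer sketch (using pandas; the file and column names are hypothetical) is to read every field as text and convert individual columns deliberately, so nothing is silently reformatted on the way in:

    # A sketch: read every column as text so leading zeros, dates and long numbers
    # arrive exactly as they appear in the file.
    import pandas as pd

    df = pd.read_csv("contacts.csv", dtype=str, keep_default_na=False)
    print(df["telephone"].head())   # "07123456789" keeps its leading zero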

If the data arrives in files do not copy them, or allow them to be copied, between operating systems, specifically not between Microsoft Windows and UNIX. If this is really necessary, then use sftp (or ftp) in binary mode, because this will not change the content of the file. Copying in ASCII mode converts end-of-line markers between UNIX and Windows conventions. If the file contains binary data, then any byte that looks like an end-of-line marker will be converted in error.
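If you want to see, without changing the file, which end-of-line convention it actually uses, a small sketch like this (the file name is hypothetical) will tell you:

    # A sketch: read the file in binary mode and count end-of-line markers.
    with open("extract.dat", "rb") as f:
        data = f.read()

    crlf = data.count(b"\r\n")
    lf_only = data.count(b"\n") - crlf
    print(f"CRLF (Windows) lines: {crlf}, LF-only (UNIX) lines: {lf_only}")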

Always use full data sets for data profiling. The process of taking a sample will necessarily make assumptions that are very likely to exclude the data that breaks the rules we are trying to check. However big your data set is (it may be a million rows, a billion, or more), we are interested in the one row that will cause our data load process to fail, or will load OK but mess up our database. If we analyse only a sample of the data then there is very little chance of finding the bad guy.
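Profiling the full set does not have to mean loading it all into memory. A sketch of the streaming approach (the file, the column and the 11-digit telephone rule are all hypothetical):

    # A sketch: stream through every row of the file and report only the rows
    # that break the rule we care about.
    import csv

    with open("contacts.csv", newline="", encoding="utf-8") as f:
        for line_number, row in enumerate(csv.DictReader(f), start=2):   # line 1 is the header
            phone = row.get("telephone", "")
            if phone and not (phone.isdigit() and len(phone) == 11):
                print(f"line {line_number}: unexpected telephone value {phone!r}")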

Protest Warning The usual analysis process that takes place at the start of a data migration or a data load into a Data Lake goes like this. The analyst (who got into this because he didn't like programming) gets a few rows of data (maybe 10, maybe 100) and pastes them into an Excel spreadsheet. Alongside the spreadsheet he has a "specification" of the data set, prepared by someone else, probably a long time ago, probably also in a spreadsheet, probably with the cells colour-coded in primary colours. In this "specification" there is a list of obscure field names. Further to the right there will be a column which has some hint at the data-types and/or lengths of some of the fields. This may be specified in various ways. I have recently seen COBOL-style field definitions such as "PIC S99V9999 COMP". It's just as well the analyst didn't try to interpret these because he would have got them wrong. Then, further along is a description column. Some of the cells in this are empty, some repeat the field name in all its obscurity, some are guesses about what the cryptic names mean (we know they are guesses because some of them are wrong) and, once in a while, there is some useful information.

Once again, this process generally works better with an analyst/programmer doing the analysis, someone who has the technical skills to dig into what the data actually contains.

Character Encoding

Amazingly, many analysts and even programmers have no idea what character encoding is, or how it can affect what you see when you look at a file on your screen.

A file is a series of bytes. Each byte is a number between zero and 255. What those numbers mean depends on the encoding. If you don't know the encoding then you don't know what the bytes mean.

For those of us using English as our main language, it seems simpler than that, because the first 128 byte values (zero to 127) are the same in the most common encodings, and they represent unaccented letters, numeric digits and common punctuation. As soon as you use accented letters, or even quite common symbols like £, €, ¢, ¥, ×, ÷, you cannot make any assumptions. The Web is already completely committed to multiple languages and writing systems. Corporate and government systems move more slowly, but the change is happening and we must now be aware of the different encoding possibilities for the data we process.
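A two-line sketch makes the point: the same bytes give you different text depending on the encoding you assume.

    # The same two bytes decoded under two common encodings give different results.
    raw = b"\xc2\xa3"                  # the UTF-8 encoding of the pound sign
    print(raw.decode("utf-8"))         # £
    print(raw.decode("latin-1"))       # Â£ (two characters, neither of them intended)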

Have a look at Dodgy Characters where we explain what you need to know about character encoding. We also provide a free tool that, in a few seconds, gives you valuable information about the encoding of any file, and tells you how to deal with whatever you find. This can save you a lot of time, and help you prevent a lot of errors, as you analyse and process the data that is used in your organisation.

When you understand the character encoding of the data you are working with, you can then profile the fields (or data elements) that you find there. Again this saves a lot of time later, when you try to use the data for whatever purposes you have, from printing invoices to analysing your customers' sentiments. The next section explains this kind of data profiling.

Working With The Data Profiling Output

Structure

We should use the data profiler to answer the following questions about the structure of the data supplied:

Content

Patterns

The data profiler lists the patterns found in the first 64 characters of strings. The patterns are shown using the following codes:

Code   Meaning
A      An alphabetic character
9      A numeric character
s      A normal space (U+0020)
!      A punctuation mark
$      A currency symbol
é      An accented character
+      A mathematical or other symbol
^      A control character
b      A non-breaking space (U+00A0)
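The sketch below shows one way such a pattern string could be built from the Unicode category of each character. It is an illustration of the idea, not the profiler's own algorithm, and the mapping from categories to codes is my assumption based on the table above:

    # A sketch of pattern coding using Unicode general categories; an illustration
    # of the idea, not the profiler's actual implementation.
    import string
    import unicodedata

    def pattern(value: str, max_chars: int = 64) -> str:
        codes = []
        for ch in value[:max_chars]:
            cat = unicodedata.category(ch)
            if ch == " ":
                codes.append("s")        # normal space (U+0020)
            elif ch == "\u00a0":
                codes.append("b")        # non-breaking space (U+00A0)
            elif ch in string.ascii_letters:
                codes.append("A")        # unaccented alphabetic character
            elif cat.startswith("L"):
                codes.append("é")        # other letters, treated here as accented
            elif cat == "Nd":
                codes.append("9")        # numeric character
            elif cat == "Sc":
                codes.append("$")        # currency symbol
            elif cat.startswith("P"):
                codes.append("!")        # punctuation mark
            elif cat == "Cc":
                codes.append("^")        # control character
            elif cat.startswith("S"):
                codes.append("+")        # mathematical or other symbol
            else:
                codes.append("?")        # anything else (my own placeholder)
        return "".join(codes)

    print(pattern("SW1A 1AA"))   # AA9As9AA
    print(pattern("£12.50"))     # $99!99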
There are many uses for patterns:

To see some examples of the profiler results click here.