The Data Studio

Standardise Text For Searching

The Problem

It can be quite hard to find what you want in a database, even assuming that you know which table and column to look in.

The strong data-types (numbers, dates, timestamps) are straightforward, although we do often need to use ranges or rounded values to catch all the values we are interested in.
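For example, to catch every timestamp in a year it is usually safest to use a half-open range rather than rounding each stored value. A minimal sketch, with a hypothetical table and column:

  -- Hypothetical example: everything recorded during 2016, whatever
  -- the time of day, with no rounding of the stored values.
  select *
  from   orders
  where  order_timestamp >= date '2016-01-01'
  and    order_timestamp <  date '2017-01-01';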

Searching for text is more complicated.

If you, the person reading this, are from Google, then you are very welcome and I may be able to offer something useful for your internal systems, but I don't think I can offer anything to help with your core business. In general, on this website, I am assuming that you are not from Google.

Google became successful because it is very smart at searching vast quantities of text. We are interested in some of the things that Google does very successfully and we do want to emulate some of Google's functionality. This does not mean that we want to build our systems the way that Google does. Google tells us, in "How Google search works", something of the scale of its operation. As I write this (17-Apr-2017) Google says that it searches 130 trillion web pages. That figure will have increased by the time you read this. Even for a large organisation, like an international bank or an airline or a mobile phone company, the volume of data being processed will be a tiny fraction of this. If, for example, your corporate database has 130 billion records, that would be quite big and it would still be one thousandth of the size of the data that Google is working with. And this assumes that your average record is as big as the average web page. It probably isn't. So don't think that trying to copy Google's technical architecture will be economical for you.

One big thing that Google does to improve searches is to process the data before the search happens. We can learn from that.

A very common problem when searching for a text string is that what the user types does not match what is in the database, in a number of subtle ways:

- The capitalisation may be different: First Class vs First class
- The punctuation may be different: HARRINGTON-SMITH & CO vs HARRINGTON, SMITH & CO
- The user may misspell the text, or it may be misspelt in the database: NOBULL-COMPUTING LTD vs NOBLE COMPUTING LTD
- The text may be abbreviated by the user or in the database: Abbey Fm vs Abbey Farm
- There may be characters that are accented inconsistently between the user and the database: İstanbul vs Istanbul (capital I with and without the dot)
- There may be symbols that look similar but the user types a different one from the one in the database: ST JAMES’S vs ST JAMES'S (the apostrophes are different)
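A quick check in psql confirms the first of these; an exact comparison is case-sensitive:

  # select 'First Class' = 'First class' as exact_match;
   exact_match 
  -------------
   f
  (1 row)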

In all these cases the strings would not match; the user would not find what they were looking for. If these differences occurred between human beings, one would say to the other, "oh well, you know what I mean". There have been many attempts to make the computer as smart as we are. For these kinds of problems there are various solutions:

Standardisation - A Possible Solution

Especially in Data Warehouses, we often use standardisation, particularly for names, which suffer from all of the variations described above.

For any column that we want to standardise, we create another, separate column. We leave the original column exactly as it is, as we received it. In the new column we put a standardised version of the original column. We take the original data and do the following:

- convert all the letters to upper case;
- replace accented characters with their unaccented equivalents;
- remove all punctuation, symbols and spaces;
- replace double characters with single characters ("CC" becomes "C", for example).
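These steps can be written as a small PostgreSQL user-defined function. Here is a minimal sketch, assuming the unaccent extension is available; the name standard_sketch and the details are ours, and the real standard function described below may differ:

  create extension if not exists unaccent;

  create or replace function standard_sketch(p_text text)
  returns text
  language sql
  stable   -- unaccent consults a dictionary, so this is not immutable
  as $$
      select regexp_replace(                      -- 4. collapse repeated letters
                 regexp_replace(                  -- 3. keep letters and digits only
                     upper(unaccent(p_text)),     -- 1, 2. strip accents, upper-case
                     '[^A-Z0-9]', '', 'g'),
                 '([A-Z])\1+', '\1', 'g');
  $$;

With this in place, standard_sketch('İstanbul') and standard_sketch('ISTANBUL') should both come back as ISTANBUL, and standard_sketch('HARRINGTON-SMITH & CO') and standard_sketch('HARRINGTON, SMITH & CO') should both come back as HARINGTONSMITHCO.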

These can be implemented quite efficiently. You can see an example of this as a PostgreSQL user-defined function called standard on our downloads page. We have used this function to build a version of the companies_house data, in which we have added standardised versions of the company name, the town where the registered office is and the 10 previous company names. We keep the original name, and store the standardised name in a separate column with the same name prefixed with standard_. (This is an acceptable use of a prefix, for those of you who have read my thoughts on naming things.) Here are the first few columns of this table to show this:

                          Table "public.companies_house"
                    Column                  |          Type          | Modifiers 
  ------------------------------------------+------------------------+-----------
   company_name                             | character varying(160) | 
   standard_company_name                    | character varying(160) | 
   company_number                           | character varying(8)   | 
   registered_office_care_of                | character varying(100) | 
   registered_office_p_o_box                | character varying(10)  | 
   registered_office_address_1              | character varying(300) | 
   registered_office_address_2              | character varying(300) | 
   registered_office_town                   | character varying(50)  | 
   standard_registered_office_town          | character varying(50)  | 
   registered_office_county                 | character varying(50)  | 
   registered_office_country                | character varying(50)  | 
   [...]
    

We used the standard function to populate the standardised columns and now we can use them in queries. Companies registered in the UK may have their registered office outside the UK. Suppose we want to find any companies that are in Istanbul. We have noticed that the place names seem to be in uppercase and sometimes there is a district following the name of the town or city, so we try this:

  # select
  #     company_name,
  #     incorporation_date,
  #     registered_office_town 
  # from
  #     companies_house 
  # where
  #     registered_office_town like 'ISTANBUL%';
   company_name | incorporation_date | registered_office_town 
  --------------+--------------------+------------------------
  (0 rows)
    

We expected to find something, so we try searching on the standardised town name:

  # select
  #     company_name,
  #     incorporation_date,
  #     registered_office_town
  # from
  #     companies_house
  # where
  #     standard_registered_office_town like 'ISTANBUL%';
        company_name       | incorporation_date | registered_office_town 
  -------------------------+--------------------+------------------------
   NEZIH AND ERCAN LIMITED | 2016-02-19         | İSTANBUL KARTAL
  (1 row)
    

Our query searches on standard_registered_office_town and returns the original registered_office_town. We can see that this has a dot over the "I" in "İSTANBUL". "I" does not match "İ", and that is why we did not find this row the first time.
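We can check this directly in psql. Assuming the unaccent extension, which maps accented characters to their plain equivalents, a query like this should show both the mismatch and the cure:

  # select 'İSTANBUL' like 'ISTANBUL%' as raw_match,
  #        unaccent('İSTANBUL') like 'ISTANBUL%' as unaccented_match;
   raw_match | unaccented_match 
  -----------+------------------
   f         | t
  (1 row)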

Standardising does take some extra processing time. In fact the most costly part of the process is replacing double characters with single characters ("CC" ⇾ "C", for example). But this cost is worthwhile if it means that we do this process once, as we load the data, and our report writers or data scientists can use what we have done without having to repeat it whenever they access the data.
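Doing the work once at load time, with the hypothetical standard_sketch function from earlier, might look like this (the index name is also ours):

  -- Populate the standardised column once, in a single pass, at load time.
  update companies_house
  set    standard_registered_office_town = standard_sketch(registered_office_town);

  -- text_pattern_ops makes the index usable for prefix searches
  -- such as LIKE 'ISTANBUL%', even in databases not using the C locale.
  create index companies_house_std_town_idx
      on companies_house (standard_registered_office_town text_pattern_ops);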

Protest Warning: Vendors of Big Data products often suggest that cleaning is not necessary with their tools. But they do not offer anything to clean the data, or anything that magically gives the correct answers from dirty data. They say what your manager wants to hear, but it isn't true. Given that we do, actually, have to clean the data we receive to make it useful, it is cheaper to do the cleaning once, up front, so that all the users get the benefit.

Notice that we do not clean the data in the sense of replacing the original values. We leave the original values exactly as they are and provide the "standardised" version as well, to make searching easier. This is important. One reason why data scientists often do not like working with cleaned data is that the cleaning may sometimes lose information, just as some genuine emails sometimes get filtered to junk. None of us is clever enough to foresee all the possible cases that may go through our cleaning rules. This is a better solution: keep the original and provide a cleaned version as well.

Often we can go further than we have here; it depends on the content and the ways in which that content gets into the system. In particular data sets there will be common abbreviations and common noise words. People querying these datasets may eventually get to know their idiosyncrasies, but new users will stumble over the same problems. If we look at the datasets we have used in the character encoding profiles we can see some examples of what is possible.

The next example comes from the gazetteer and shows how standardisation can make it easier to search for unusual names, with accents or hard-to-remember punctuation. If you were asked to look for "Durbans Farm", would you guess that the name was "D'Urban's Farm"? I would have guessed the second apostrophe, but not the first. Would you?

 definitive_name | standard  
-----------------+-----------
 Ty'n-y-Cwm      | TYNYCWM
 Br o' Turk      | BROTURK
 Ty'n y-Maes     | TYNYMAES
 St John's Br    | STJOHNSBR
 Ty'n-y-lôn      | TYNYLON
 Back o' Wall    | BACKOWALL
 Ty'n-y-llwyn    | TYNYLLWYN
 Cave Fm, The    | CAVEFMTHE
 Cow & Calf      | COWCALF
 Mare & Foal     | MAREFOAL
 Allt a' Ghil    | ALLTAGHIL
 D'Urban's Fm    | DURBANSFM
 Ord/An t-Òrd    | ORDANTORD
 Ford & Àird     | FORDAIRD
 Cow & Calves    | COWCALVES
 Cnoc a' Sgà     | CNOCASGA
 Cross & Hand    | CROSSHAND
 Old Man's Br    | OLDMANSBR
 Sròn a' Chrò    | SRONACHRO
    

This small sample hints at another problem in searching this data. It looks as though "Farm" may be abbreviated to "Fm" and maybe "Bridge" is abbreviated to "Br". Looking at this table in more detail I found "Fm", "Br", "Ho", "Ct", "Pt", "Pl", "Ch", "Pk", "Sq", "Lo" and quite a few others. What could be the full-length equivalents of these abbreviations appear in other rows of the data as "Farm", "Bridge", "House", "Court", "Point", "Place", "Square" and "Lodge". These are my guesses. Before implementing any cleaning, I would want to talk to the owners of this data to see if these are indeed standard abbreviations and whether I have guessed correctly. What often happens is that nobody really knows, and it may be that some abbreviations have been used to mean two different things. In this gazetteer there are places called <something> Lookout. Could it be that "Lo" stands for Lookout rather than Lodge?
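If the owners of the data confirm our guesses, one way to apply them is a mapping table consulted before standardisation. This is only a sketch: the table, the function and every expansion in it are the unconfirmed guesses from above:

  -- Hypothetical table mapping abbreviations to the full words we think
  -- they stand for; every row still needs confirming by the data owners.
  create table gazetteer_abbreviation (
      abbrev    text primary key,
      full_word text not null
  );

  insert into gazetteer_abbreviation (abbrev, full_word) values
      ('Fm', 'Farm'), ('Br', 'Bridge'), ('Ho', 'House'), ('Ct', 'Court'),
      ('Pt', 'Point'), ('Pl', 'Place'), ('Sq', 'Square'), ('Lo', 'Lodge');

  create or replace function expand_abbreviations(p_name text)
  returns text
  language plpgsql
  stable
  as $$
  declare
      r      record;
      v_name text := p_name;
  begin
      for r in select abbrev, full_word from gazetteer_abbreviation loop
          -- \m and \M are PostgreSQL word-boundary markers, so 'Br'
          -- matches the word "Br" but not the "Br" inside "Brown".
          v_name := regexp_replace(v_name,
                                   '\m' || r.abbrev || '\M',
                                   r.full_word, 'g');
      end loop;
      return v_name;
  end;
  $$;

With this, "D'Urban's Fm" expands to "D'Urban's Farm", which then standardises to DURBANSFARM instead of DURBANSFM.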

Having researched these abbreviations as much as we can, we will often want to go with our best guess. The approach of keeping the original value as well as the standardised and/or cleaned value gives us a chance to correct errors if we get more information. Until we are absolutely certain that these are correct, we should make sure that our users understand that the cleaned values are our best guesses and should be treated with caution. We will need to remind our users about this frequently, because once they have produced plausible-looking data from guesses they will start to believe it. I have seen one large company go out of business based on such a wrong assumption.

With that big caveat, we may decide to apply extra standardisation to this particular data set to change this set of abbreviations into the full words they stand for. If we do this, the process must be: