The Data Studio

Control Characters

Control Characters are like paper bank statments. You don't want them because you can check your balance and transactions any time you like on your phone or your computer. They arrive with other mail, some of which you do want. You can't just throw them in the recycling because they have personal infomation in them so you have to go to the bother of shredding them. They waste resources and you, the customer, end up paying for them even though they are redundant. Probably, like me, you are gradually reducing the number of paper statments you receive, but some organisations stubbornly keep on sending them.

The good news about control characters is that they are the same in UTF-8, ASCII, the ISO/IEC 8859 family and Windows-1252. The bad news is that they are invisible and cause SQL queries to fail silently. This means failing to return all the data you expect, with no warning. When our customers point out gaps in our reports it is embarrassing at best. We therefore must deal with control characters.

Most of the control characters are obsolete device control codes. Bell (which sounds the bell on a teletype), carriage return, vertical tab, etc. (which refer back to typewriters and early printers) and several others are no longer needed. Sometimes we see some of these re-used as field delimiters. It's a big mistake, but it happens.

Newline (also known as "linefeed") is the only important control character because, by convention, it marks the end of a record in a UNIX file.

Unfortunately, Windows marks the end of a record with a pair of control characters: carriage return followed by newline.

We use newline, or carriage return and newline, to delimit the records in a file. If a control character has been used as a field delimiter, we split each record into fields using the delimiter. After splitting the data into records and fields, we have no use for any control charactersin any database system. We should therefore leave the delimiters behind and not carry them forward to our database. OK, you may have multi-line text fields in your database. In general, I try to avoid that. If you have quite long text fields I would put them in separate files and include a link to each one in the database. For very short notes fields it sometimes makes sense to keep them in the database, sometimes. They will be a pain to import from files, but it can be done.

All other control characters must be removed and we should never see them in any database.

The control characters are decimal 0 to 31 and 127. That is hex 00 to 1F and hex 7F. Newline is decimal 10, hex 0a. Carriage return is decimal 13, hex 0d.

There is a simple bash script that will read a file and give you a copy of it with all the control characters except newline converted to spaces. Each control character is converted to a single space so the file will still work as a fixed format file. csv files will also convert correctly. Delimited files that (foolishly) use a control character as a delimiter will not survive this conversion, but you could modify the script to preserve only the control character that is used as a delimiter.

You can download the script frm our downloads page. The interface is simply:

      clean_control_characters.bash INPUT_FILE OUTPUT_FILE
    

The script, as written, will not allow you to overwrite a file; you have to specify a new file as the output, so it's hard to do any harm with it. On a moderately sized Linux box it processes over 100MB/sec; that's about 250,000 records/sec (depending on the record width) so it won't break the bank either.