The Data Studio

Where Did These Funny Characters Come From?

Are you seeing something
like either of these?

The question marks here are a problem because they represent values in our data that are not valid characters;
we don't know what those bytes mean. This is an example of mojibake. This article explains what is happening and how to solve this problem.

Problems You May See

These are problems you may see when you load data from some source file or database into your target database.

What Causes These Problems

Here are some possible reasons:

Here we are going to focus on the character encoding issues. We will explain how to see what is happening, and how to resolve it. We also offer some free tools to help with this.

Character Encoding - The Basics

When your application reads data from a file, or a message or a stream, it is getting a series of bytes each of which is a number between zero and 255. If you do not know the encoding of the file, you do not know what these bytes mean!

Most programmers have been able to ignore this awkward fact for years because we were dealing with very limited character sets that were all consistent with one another. This all worked fine when we were only processing Latin uppercase and lowercase letters, numeric digits and a few basic punctuation marks. Now, more and more, we have friends around the world and we work across the boundaries of writing systems, we have many different currency symbols, we may want to write complex mathematical formulae, we may want to use emojis. We need more characters and we need a more sophisticated way of encoding them. You can read more about the history, if you're interested. For our purposes here, we need to know that several different encoding systems exist, we need to know which encodings we are dealing with and how to convert between them correctly.

Of the many encoding systems that have been used, just a few account for most of the data in the world today and we will focus on those. I am taking a "western" view, from the countries using western European languages. Those with an "eastern" view, from China, Japan, and other countries with non-European writing systems, are generally more well-informed about character encoding because they needed to be. I apologise for the bias, but I know more about what happens in the west and, generally, those of us in the west are the people who need to learn more about this.

The encoding systems we most need to understand are:

Encoding What We Need To Know
ASCII One character = one byte. Uses values from 0-127 only.
ISO/IEC 8859 family One character = one byte. Values 0-127 are the same as ASCII. Values 128-255 give different characters from one family member to another (and different from the Windows-12xx family).
Windows 12xx family One character = one byte. Values 0-127 are the same as ASCII. Values 128-255 give different characters from one family member to another (and different from the ISO/IEC 8859 family).
UTF-8 One character may take 1, 2, 3 or 4 bytes. Values 0-127 are the same as ASCII. (More details below).

How It Works

A byte is 8 bits, and the value can conveniently be represented in hexadecimal (usually abbreviated to "hex") or in decimal or, less conveniently, in octal or binary.

For example the character "A" in is represented in a single byte like this:

Binary 0 1 0 0 0 0 0 1
Hex 4 1
Decimal 65
Octal 1 0 1
Unicode Code Point U+0041

The character "A" is the same in UTF-8, ASCII, ISO/IEC 8859 and Windows 12xx, all our usual sources. So in this case we don't have to worry about any incompatibility because there isn't any.

If we look at the Euro symbol (€) it's a completely different story:

€ - Euro currency symbol
Character Encoding UTF-8 (3-byte sequence) ISO/IEC 8859-15 Windows-1252
Binary 11100010 10000010 10101100 10100100 10000000
Hex e2 82 ac a4 80
Decimal 225 130 172 164 128
Octal 342 202 254 244 200
Unicode Code Point U+20ac

Our commonly-used encoding systems all represent the Euro symbol differently. If we copy the bytes from a file encoded in ISO-8859-15 to a database running in Windows-1252, our Euro symbol (hex a4) will not look like a Euro symbol any more. In Windows-1252 hex a4 is "¤". Going from Windows-1252 to ISO-8859-15 we would get a question mark or a "◼" because in ISO-8859-15 hex 80 is undefined.  7-bit ASCII and EBCDIC do not have any way to represent a Euro symbol. These encoding systems were defined before the Euro existed, so that is not surprising.

While we could get away with using one consistent 8-bit code then everything was very simple, but we can't do that in the real world any more, so we need something better. UTF-8 is that something better, so we'll explain a bit about how that works.

UTF-8 is one implementation of Unicode. There are other implementations that you may come across (UTF-16 and UTF-32, for example). UTF-8 is the most popular by far (see history, if you want to know why).

UTF-8 has a very clever scheme for encoding over one million different characters economically and consistently.

Byte values What they are used for
Decimal Hex
0 to 127 00 to 7f 1-byte characters, exactly the same as ASCII
194 to 223 c2 to df Leading Bytes Introduce a 2-byte sequence
224 to 239 e0 to ef Introduce a 3-byte sequence
240 to 244 f0 to f4 Introduce a 4-byte sequence
128 to 191 80 to bf Continuation Bytes Follow a Leading Byte

If you look again at the table showing the encoding of the Euro currency symbol you will see that the 3-byte sequence hex e2 82 ac is the Leading Byte hex e2 followed by the two Continuation Bytes hex 82 ac.

The ranges are distinct so that any given byte value can be either a single-byte character or a Leading Byte or a Continuation Byte. If you get a corrupted or truncated file you can always find the next, or previous, valid sequence within four bytes following, or preceding, the corruption.

UTF-8 also gives us the possibility to validate the file because:

This means that we can tell if a file is a valid UTF-8 file or is using a different encoding or is corrupted.

There is a relationship betweeen the Unicode Code Point and the values in the encoded bytes. You will probably never need to calculate one from the other, but if you want to, you can find out how on Wikipedia. Also, there is working code to convert from the UTF-8 bytes to the Unicode Code Point in the toCodePoint() method in the EncodingProfile source code provided on our downloads page.

Practical Steps

Firstly, we need to understand what character encoding is used in our source data. If we are lucky, then the provider of the data will have told us. They usually do not, and when they do they are sometimes wrong. It's cheap to check and worth doing. Have a look at How To Find Out What Character Encoding Is Used In a Particular File. This explains what to do, and provides a tool that will scan any file in seconds and give you comprehensive information about the encoding. (When I say seconds, this is for files up to 2GB or so; if you have a file of 10GB it will take three or four minutes.)

Secondly, we need to translate between character encoding systems when copying data from one to another. That's OK, there are quite simple ways to do that. See Load a File to a Database and Convert the Character Encoding for examples.