The Data Studio

How To Find Out What Character Encoding Is Used In a Particular File

Although we need to know the character encoding to understand what the bytes in a file mean, it is amazing how often nobody seems to know it. So you would hope that there would be some clever application that would work it out for you. Wikipedia has a sensible article on character-set detection. The bottom line is that character-set detection is not an exact science. There are some tools available, but none of them can give a completely reliable answer about which character set has actually been used.

We have our own tool, which you can get free from our Downloads page. It has the same limitations as all the other tools, but it can help to point you in the right direction. It is a simple Java program that reads a byte-stream and reports what the character encoding is likely to be. It can analyse a 12GB file in 4 minutes and 30 seconds on my MacBook Pro, so you really can afford to find out this important information about your source file. It is a lot less bother to do it early on than to struggle with all the odd things you find in your data if you just load it without checking the character encoding first.

There are four examples below, showing what you see with different kinds of files.

First you need to download the program and compile it. Stop complaining, it is just a single class using only standard Java classes, so this is all that is necessary:

      $ javac EncodingProfile.java 
    

This gives you a Java class file (EncodingProfile.class).

Example 1: A File Encoded In UTF-8

This is a file of UK company data (available from Companies House). I'm running the program with just one parameter, the path to the file I want to analyse:

      $ java EncodingProfile /Users/ronballard/Downloads/BasicCompanyData-2017-04-01-part4_5.csv 
    

Here's the first part of the output:

  Scanning file: /Users/ronballard/Downloads/BasicCompanyData-2017-04-01-part4_5.csv

  +------------------------------------------------------------|-----------------+
  | Lines:                                                     |         850,002 |
  | Bytes:                                                     |     409,479,315 |
  | Windows line ends:                                         |               0 |
  | UNIX line ends:                                            |         850,001 |
  | Control Characters (excluding newline & carriage return):  |               0 |
  | UTF-8 Leading bytes:                                       |             388 |
  | UTF-8 Continuation bytes:                                  |             542 |
  | UTF-8 Leading bytes without enough Continuation bytes:     |               0 |
  | UTF-8 Continuation bytes without a preceding Leading byte: |               0 |
  | UTF-8 Invalid bytes:                                       |               0 |
  | Byte values specific to Windows 1252:                      |             523 |
  | Byte values valid in ISO 8859 and Windows 12xx families:   |             930 |
  | Carriage returns without newline:                          |               0 |
  +------------------------------------------------------------|-----------------+
    

It turns out that this is a very clean file. I wish all our source files were as good as this. It is clear from this output that the file is a UNIX (or Mac) file, because the end of each line is marked by a newline character alone (not by a carriage return followed by a newline, as in Windows files). We can also see that most of the characters are single bytes: there are only 388 UTF-8 multi-byte sequences, all of which have the right number of continuation bytes following the leading byte. There are no bytes in this file that are invalid in UTF-8.
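To make the UTF-8 rows of the summary more concrete, here is a minimal sketch of how leading and continuation bytes could be counted in a single pass. It is not the source of EncodingProfile; the class name `Utf8StructureSketch`, its field names and the exact counting rules are assumptions made for illustration. It only checks the byte structure (it does not detect overlong sequences or encoded surrogates), so treat it as a rough guide rather than a full validator.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical sketch, not the EncodingProfile source:
// counts the structural byte classes of UTF-8 in one pass.
class Utf8StructureSketch {
    long leading, continuation, truncated, orphaned, invalid;
    private int pending = 0; // continuation bytes still expected

    void accept(int b) {
        if (b < 0x80) {                       // 0xxxxxxx: single-byte character
            if (pending > 0) { truncated++; pending = 0; }
        } else if (b < 0xC0) {                // 10xxxxxx: continuation byte
            continuation++;
            if (pending > 0) pending--; else orphaned++;
        } else if (b == 0xC0 || b == 0xC1 || b >= 0xF5) {
            invalid++;                        // never appears in valid UTF-8
            if (pending > 0) { truncated++; pending = 0; }
        } else {                              // 0xC2..0xF4: leading byte
            if (pending > 0) truncated++;     // previous sequence was cut short
            leading++;
            pending = b < 0xE0 ? 1 : b < 0xF0 ? 2 : 3;
        }
    }

    // Call once at end of input to catch a sequence cut off by end of file.
    void finish() {
        if (pending > 0) { truncated++; pending = 0; }
    }

    static Utf8StructureSketch profile(byte[] data) {
        Utf8StructureSketch p = new Utf8StructureSketch();
        for (byte x : data) p.accept(x & 0xFF);
        p.finish();
        return p;
    }

    public static void main(String[] args) throws IOException {
        Utf8StructureSketch p = new Utf8StructureSketch();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(Paths.get(args[0])))) {
            int b;
            while ((b = in.read()) != -1) p.accept(b);
        }
        p.finish();
        System.out.printf("leading=%,d continuation=%,d truncated=%,d orphaned=%,d invalid=%,d%n",
                p.leading, p.continuation, p.truncated, p.orphaned, p.invalid);
    }
}
```

In a file that really is UTF-8 you would expect the last three counters to be zero, as they are in the report above. A file in Windows-1252 or ISO 8859-1 that contains accented letters will usually produce truncated or orphaned counts, because a lone byte above 0x7f is rarely followed by the bytes UTF-8 requires.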

This means that if we read this file using the UTF-8 encoding and load it into a database that is also in UTF-8, then all the characters should appear correctly.
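If you want your own loading code to confirm this rather than trust it, one option in Java is to decode strictly, so that a byte sequence that is not valid UTF-8 is reported instead of being quietly replaced. The sketch below uses the standard `CharsetDecoder` with `CodingErrorAction.REPORT`; the class name `StrictUtf8Check` and the method `decodesAsUtf8` are made up for this example and are not part of our tool.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical sketch: reads a stream as strict UTF-8 and reports whether it decoded cleanly.
class StrictUtf8Check {
    static boolean decodesAsUtf8(InputStream in) {
        // REPORT makes the decoder throw on bad input instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try (Reader reader = new BufferedReader(new InputStreamReader(in, decoder))) {
            char[] buffer = new char[8192];
            while (reader.read(buffer) != -1) {
                // Characters are discarded here; a real loader would process them.
            }
            return true;
        } catch (CharacterCodingException e) {
            return false;                     // some byte sequence was not valid UTF-8
        } catch (IOException e) {
            throw new UncheckedIOException(e); // a genuine I/O failure
        }
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(decodesAsUtf8(in) ? "Decoded cleanly as UTF-8" : "Not valid UTF-8");
        }
    }
}
```

A clean result here only says that the bytes form valid UTF-8. It does not prove that the author of the file intended UTF-8, although for a file with several hundred multi-byte sequences and no errors it is a very strong hint.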

We can also see that this file has no control characters apart from newlines. This is very good news and will make working with this data much easier.
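The line-end and control-character rows of the summary can be gathered in the same kind of single pass. The following is a rough sketch under my own assumptions about what counts as what (for example, that a tab and the delete byte 0x7f count as control characters); the class name `LineEndSketch` and its fields are invented for illustration and may not match how EncodingProfile does it.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical sketch: counts line-end styles and other control characters.
class LineEndSketch {
    long windowsLineEnds, unixLineEnds, loneCarriageReturns, controlCharacters;
    private boolean previousWasCr = false;

    void accept(int b) {
        if (b == '\n') {
            // A newline directly after a carriage return is a Windows line end.
            if (previousWasCr) windowsLineEnds++; else unixLineEnds++;
        } else if (previousWasCr) {
            loneCarriageReturns++;            // the carriage return had no newline after it
        }
        // Control characters other than newline and carriage return (assumed to include 0x7f).
        if ((b < 0x20 && b != '\n' && b != '\r') || b == 0x7F) controlCharacters++;
        previousWasCr = (b == '\r');
    }

    // Call once at end of input in case the file ends with a bare carriage return.
    void finish() {
        if (previousWasCr) { loneCarriageReturns++; previousWasCr = false; }
    }

    static LineEndSketch profile(byte[] data) {
        LineEndSketch p = new LineEndSketch();
        for (byte x : data) p.accept(x & 0xFF);
        p.finish();
        return p;
    }

    public static void main(String[] args) throws IOException {
        LineEndSketch p = new LineEndSketch();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(Paths.get(args[0])))) {
            int b;
            while ((b = in.read()) != -1) p.accept(b);
        }
        p.finish();
        System.out.printf("windows=%,d unix=%,d loneCR=%,d control=%,d%n",
                p.windowsLineEnds, p.unixLineEnds, p.loneCarriageReturns, p.controlCharacters);
    }
}
```

A file that mixes Windows and UNIX line ends, or that has carriage returns on their own, has usually been stitched together from more than one source, which is worth knowing before you load it.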

There may still be some issues with characters that we want to regard as the same, but that are actually variants, such as straight quotes and curly quotes, various kinds of hyphens, accented characters, etc. That isn't a fault in the file; it is a feature of working in a multi-lingual world.

The next section of the analysis tells us some more about the characters we find in this file. The report gives us a table with one row for each of the 256 possible byte values, how many of each there are in the file, and what they might mean. Since we have already convinced ourselves that this is a valid UTF-8 file, the "Name" column and the "Specific to Windows-1252" column are not very interesting to us (we'll see examples where they are interesting later). The UTF-8 Group is interesting.
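As an aside before the table itself: the "Number of Bytes" column is just a count per byte value, and a sketch of how such counts could be gathered is shown here. The class name `ByteHistogramSketch` is hypothetical, and the output prints only the decimal, hexadecimal and octal values with the count, not the descriptive columns of the real report.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical sketch: counts how often each of the 256 byte values occurs.
class ByteHistogramSketch {
    static long[] count(InputStream in) {
        long[] counts = new long[256];
        byte[] buffer = new byte[1 << 16];
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                // Mask with 0xFF because Java bytes are signed.
                for (int i = 0; i < n; i++) counts[buffer[i] & 0xFF]++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(Paths.get(args[0])))) {
            long[] counts = count(in);
            for (int value = 0; value < counts.length; value++) {
                if (counts[value] > 0) {
                    System.out.printf("| %03d |  %02x | %03o | %,15d |%n",
                            value, value, value, counts[value]);
                }
            }
        }
    }
}
```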


+-----+-----+-----+-----------------+-----------+--------+----------+------------------------------+--------------+-----------------------------------------------+
|     |     |     |                 |   ASCII   |   C    | teletype | Name (W = Windows-1252,      | Specific to  |                                               |
| Dec | Hex | Oct | Number of Bytes | Printable | Escape | notation | I = ISO 8859-1, C = Control  | Windows-1252 |                  UTF-8 Group                  |
+-----+-----+-----+-----------------+-----------+--------+----------+------------------------------+--------------+-----------------------------------------------+
| 000 |  00 | 000 |               0 |           |   \0   |    ^@    | C:null                       |              | Control character, ASCII compatible           |
| 001 |  01 | 001 |               0 |           |        |    ^A    | C:start of heading           |              | Control character, ASCII compatible           |
| 002 |  02 | 002 |               0 |           |        |    ^B    | C:start of text              |              | Control character, ASCII compatible           |
| 003 |  03 | 003 |               0 |           |        |    ^C    | C:end of text                |              | Control character, ASCII compatible           |
| 004 |  04 | 004 |               0 |           |        |    ^D    | C:end of transmission        |              | Control character, ASCII compatible           |
| 005 |  05 | 005 |               0 |           |        |    ^E    | C:enquiry                    |              | Control character, ASCII compatible           |
| 006 |  06 | 006 |               0 |           |        |    ^F    | C:acknowledgement            |              | Control character, ASCII compatible           |
| 007 |  07 | 007 |               0 |           |   \a   |    ^G    | C:bell                       |              | Control character, ASCII compatible           |
| 008 |  08 | 010 |               0 |           |   \b   |    ^H    | C:backspace                  |              | Control character, ASCII compatible           |
| 009 |  09 | 011 |               0 |           |   \t   |    ^I    | C:horizontal tab             |              | Control character, ASCII compatible           |
| 010 |  0a | 012 |         850,001 |           |   \n   |    ^J    | C:newline                    |              | Control character, ASCII compatible           |
| 011 |  0b | 013 |               0 |           |   \v   |    ^K    | C:vertical tab               |              | Control character, ASCII compatible           |
| 012 |  0c | 014 |               0 |           |   \f   |    ^L    | C:form feed                  |              | Control character, ASCII compatible           |
| 013 |  0d | 015 |               0 |           |   \r   |    ^M    | C:carriage return            |              | Control character, ASCII compatible           |
| 014 |  0e | 016 |               0 |           |        |    ^N    | C:shift out                  |              | Control character, ASCII compatible           |
| 015 |  0f | 017 |               0 |           |        |    ^O    | C:shift in                   |              | Control character, ASCII compatible           |
| 016 |  10 | 020 |               0 |           |        |    ^P    | C:data link escape           |              | Control character, ASCII compatible           |
| 017 |  11 | 021 |               0 |           |        |    ^Q    | C:device control 1           |              | Control character, ASCII compatible           |
| 018 |  12 | 022 |               0 |           |        |    ^R    | C:device control 2           |              | Control character, ASCII compatible           |
| 019 |  13 | 023 |               0 |           |        |    ^S    | C:device control 3           |              | Control character, ASCII compatible           |
| 020 |  14 | 024 |               0 |           |        |    ^T    | C:device control 4           |              | Control character, ASCII compatible           |
| 021 |  15 | 025 |               0 |           |        |    ^U    | C:negative acknowledgement   |              | Control character, ASCII compatible           |
| 022 |  16 | 026 |               0 |           |        |    ^V    | C:synchronous idle           |              | Control character, ASCII compatible           |
| 023 |  17 | 027 |               0 |           |        |    ^W    | C:end of transmission block  |              | Control character, ASCII compatible           |
| 024 |  18 | 030 |               0 |           |        |    ^X    | C:cancel                     |              | Control character, ASCII compatible           |
| 025 |  19 | 031 |               0 |           |        |    ^Y    | C:end of medium              |              | Control character, ASCII compatible           |
| 026 |  1a | 032 |               0 |           |        |    ^Z    | C:substitute                 |              | Control character, ASCII compatible           |
| 027 |  1b | 033 |               0 |           |   \e   |    ^[    | C:escape                     |              | Control character, ASCII compatible           |
| 028 |  1c | 034 |               0 |           |        |    ^\    | C:file separator             |              | Control character, ASCII compatible           |
| 029 |  1d | 035 |               0 |           |        |    ^]    | C:group separator            |              | Control character, ASCII compatible           |
| 030 |  1e | 036 |               0 |           |        |    ^^    | C:record separator           |              | Control character, ASCII compatible           |
| 031 |  1f | 037 |               0 |           |        |    ^_    | C:unit separator             |              | Control character, ASCII compatible           |
| 032 |  20 | 040 |      15,800,241 |           |        |          | space                        |              | Punctuation and symbols, ASCII compatible     |
| 033 |  21 | 041 |             263 |     !     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 034 |  22 | 042 |      93,500,826 |     "     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 035 |  23 | 043 |              64 |     #     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 036 |  24 | 044 |              15 |     $     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 037 |  25 | 045 |              18 |     %     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 038 |  26 | 046 |          67,442 |     &     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 039 |  27 | 047 |          39,568 |     '     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 040 |  28 | 050 |         107,261 |     (     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 041 |  29 | 051 |         107,245 |     )     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 042 |  2a | 052 |              22 |     *     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 043 |  2b | 053 |             611 |     +     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 044 |  2c | 054 |      46,119,290 |     ,     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 045 |  2d | 055 |       1,052,240 |     -     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 046 |  2e | 056 |       3,154,454 |     .     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 047 |  2f | 057 |      14,650,996 |     /     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 048 |  30 | 060 |      20,486,203 |     0     |        |          |                              |              | Numeric digits, ASCII compatible              |
| 049 |  31 | 061 |      13,402,111 |     1     |        |          |                              |              | Numeric digits, ASCII compatible              |
| 050 |  32 | 062 |      10,311,624 |     2     |        |          |                              |              | Numeric digits, ASCII compatible              |
| 051 |  33 | 063 |       5,145,397 |     3     |        |          |                              |              | Numeric digits, ASCII compatible              |
| 052 |  34 | 064 |       2,842,819 |     4     |        |          |                              |              | Numeric digits, ASCII compatible              |
| 053 |  35 | 065 |       3,059,618 |     5     |        |          |                              |              | Numeric digits, ASCII compatible              |
| 054 |  36 | 066 |       4,062,589 |     6     |        |          |                              |              | Numeric digits, ASCII compatible              |
| 055 |  37 | 067 |       3,974,086 |     7     |        |          |                              |              | Numeric digits, ASCII compatible              |
| 056 |  38 | 070 |       3,648,346 |     8     |        |          |                              |              | Numeric digits, ASCII compatible              |
| 057 |  39 | 071 |       3,556,712 |     9     |        |          |                              |              | Numeric digits, ASCII compatible              |
| 058 |  3a | 072 |         850,834 |     :     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 059 |  3b | 073 |           2,521 |     ;     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 060 |  3c | 074 |               0 |     <     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 061 |  3d | 075 |              13 |     =     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 062 |  3e | 076 |               5 |     >     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 063 |  3f | 077 |              35 |     ?     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 064 |  40 | 100 |             365 |     @     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 065 |  41 | 101 |       5,863,105 |     A     |        |          |                              |              | English alphabet, ASCII compatible            |
| 066 |  42 | 102 |       1,084,065 |     B     |        |          |                              |              | English alphabet, ASCII compatible            |
| 067 |  43 | 103 |       3,235,111 |     C     |        |          |                              |              | English alphabet, ASCII compatible            |
| 068 |  44 | 104 |       3,935,404 |     D     |        |          |                              |              | English alphabet, ASCII compatible            |
| 069 |  45 | 105 |       7,459,938 |     E     |        |          |                              |              | English alphabet, ASCII compatible            |
| 070 |  46 | 106 |       1,036,159 |     F     |        |          |                              |              | English alphabet, ASCII compatible            |
| 071 |  47 | 107 |       1,540,215 |     G     |        |          |                              |              | English alphabet, ASCII compatible            |
| 072 |  48 | 110 |       2,000,497 |     H     |        |          |                              |              | English alphabet, ASCII compatible            |
| 073 |  49 | 111 |       4,967,303 |     I     |        |          |                              |              | English alphabet, ASCII compatible            |
| 074 |  4a | 112 |         178,031 |     J     |        |          |                              |              | English alphabet, ASCII compatible            |
| 075 |  4b | 113 |       1,617,689 |     K     |        |          |                              |              | English alphabet, ASCII compatible            |
| 076 |  4c | 114 |       6,309,948 |     L     |        |          |                              |              | English alphabet, ASCII compatible            |
| 077 |  4d | 115 |       2,984,347 |     M     |        |          |                              |              | English alphabet, ASCII compatible            |
| 078 |  4e | 116 |       5,480,952 |     N     |        |          |                              |              | English alphabet, ASCII compatible            |
| 079 |  4f | 117 |       5,801,125 |     O     |        |          |                              |              | English alphabet, ASCII compatible            |
| 080 |  50 | 120 |       2,637,947 |     P     |        |          |                              |              | English alphabet, ASCII compatible            |
| 081 |  51 | 121 |         162,305 |     Q     |        |          |                              |              | English alphabet, ASCII compatible            |
| 082 |  52 | 122 |       4,451,269 |     R     |        |          |                              |              | English alphabet, ASCII compatible            |
| 083 |  53 | 123 |       4,716,094 |     S     |        |          |                              |              | English alphabet, ASCII compatible            |
| 084 |  54 | 124 |       6,000,582 |     T     |        |          |                              |              | English alphabet, ASCII compatible            |
| 085 |  55 | 125 |       2,771,194 |     U     |        |          |                              |              | English alphabet, ASCII compatible            |
| 086 |  56 | 126 |         465,045 |     V     |        |          |                              |              | English alphabet, ASCII compatible            |
| 087 |  57 | 127 |         921,131 |     W     |        |          |                              |              | English alphabet, ASCII compatible            |
| 088 |  58 | 130 |         647,040 |     X     |        |          |                              |              | English alphabet, ASCII compatible            |
| 089 |  59 | 131 |         826,496 |     Y     |        |          |                              |              | English alphabet, ASCII compatible            |
| 090 |  5a | 132 |          63,695 |     Z     |        |          |                              |              | English alphabet, ASCII compatible            |
| 091 |  5b | 133 |              58 |     [     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 092 |  5c | 134 |              72 |     \     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 093 |  5d | 135 |              56 |     ]     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 094 |  5e | 136 |               0 |     ^     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 095 |  5f | 137 |              36 |     _     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 096 |  60 | 140 |             213 |     `     |        |          |                              |              | Punctuation and symbols, ASCII compatible     |
| 097 |  61 | 141 |       6,897,508 |     a     |        |          |                              |              | English alphabet, ASCII compatible            |
| 098 |  62 | 142 |       1,105,763 |     b     |        |          |                              |              | English alphabet, ASCII compatible            |
| 099 |  63 | 143 |       3,457,617 |     c     |        |          |                              |              | English alphabet, ASCII compatible            |
| 100 |  64 | 144 |       5,090,043 |     d     |        |          |                              |              | English alphabet, ASCII compatible            |
| 101 |  65 | 145 |       7,947,337 |     e     |        |          |                              |              | English alphabet, ASCII compatible            |
| 102 |  66 | 146 |         614,145 |     f     |        |          |                              |              | English alphabet, ASCII compatible            |
| 103 |  67 | 147 |       2,332,947 |     g     |        |          |                              |              | English alphabet, ASCII compatible            |
| 104 |  68 | 150 |       1,600,725 |     h     |        |          |                              |              | English alphabet, ASCII compatible            |
| 105 |  69 | 151 |      10,134,107 |     i     |        |          |                              |              | English alphabet, ASCII compatible            |
| 106 |  6a | 152 |          23,360 |     j     |        |          |                              |              | English alphabet, ASCII compatible            |
| 107 |  6b | 153 |         913,405 |     k     |        |          |                              |              | English alphabet, ASCII compatible            |
| 108 |  6c | 154 |       1,258,171 |     l     |        |          |                              |              | English alphabet, ASCII compatible            |
| 109 |  6d | 155 |       4,032,702 |     m     |        |          |                              |              | English alphabet, ASCII compatible            |
| 110 |  6e | 156 |       6,957,174 |     n     |        |          |                              |              | English alphabet, ASCII compatible            |
| 111 |  6f | 157 |       5,520,560 |     o     |        |          |                              |              | English alphabet, ASCII compatible            |
| 112 |  70 | 160 |       3,402,028 |     p     |        |          |                              |              | English alphabet, ASCII compatible            |
| 113 |  71 | 161 |          36,113 |     q     |        |          |                              |              | English alphabet, ASCII compatible            |
| 114 |  72 | 162 |       2,688,756 |     r     |        |          |                              |              | English alphabet, ASCII compatible            |
| 115 |  73 | 163 |       4,714,021 |     s     |        |          |                              |              | English alphabet, ASCII compatible            |
| 116 |  74 | 164 |       9,014,245 |     t     |        |          |                              |              | English alphabet, ASCII compatible            |
| 117 |  75 | 165 |       2,504,613 |     u     |        |          |                              |              | English alphabet, ASCII compatible            |
| 118 |  76 | 166 |       3,134,225 |     v     |        |          |                              |              | English alphabet, ASCII compatible            |
| 119 |  77 | 167 |         118,911 |     w     |        |          |                              |              | English alphabet, ASCII compatible            |
| 120 |  78 | 170 |          25,252 |     x     |        |          |                              |              | English alphabet, ASCII compatible            |
| 121 |  79 | 171 |       1,995,000 |     y     |        |          |                              |              | English alphabet, ASCII compatible            |
| 122 |  7a | 172 |           8,690 |     z     |        |          |                              |              | English alphabet, ASCII compatible            |
| 123 |  7b | 173 |               3 |     {     |        |          |                              |              | English alphabet, ASCII compatible            |
| 124 |  7c | 174 |               2 |     |     |        |          |                              |              | English alphabet, ASCII compatible            |
| 125 |  7d | 175 |               4 |     }     |        |          |                              |              | English alphabet, ASCII compatible            |
| 126 |  7e | 176 |               0 |     ~     |        |          |                              |              | English alphabet, ASCII compatible            |
| 127 |  7f | 177 |               0 |           |        |    ^?    | C: delete                    |              | Control character, ASCII compatible           |
| 128 |  80 | 200 |             168 |           |        |          | W: Euro symbol               |     Yes      | Continuation byte                             |
| 129 |  81 | 201 |              14 |           |        |          |                              |              | Continuation byte                             |
| 130 |  82 | 202 |               2 |           |        |          | W: curved single open-quote  |     Yes      | Continuation byte                             |
| 131 |  83 | 203 |               2 |           |        |          | W: small f with hook         |     Yes      | Continuation byte                             |
| 132 |  84 | 204 |               5 |           |        |          | W: curved double open-quote  |     Yes      | Continuation byte                             |
| 133 |  85 | 205 |               1 |           |        |          | W: ellipsis                  |     Yes      | Continuation byte                             |
| 134 |  86 | 206 |               3 |           |        |          | W: dagger                    |     Yes      | Continuation byte                             |
| 135 |  87 | 207 |               5 |           |        |          | W: double dagger             |     Yes      | Continuation byte                             |
| 136 |  88 | 210 |              12 |           |        |          | W: circumflex                |     Yes      | Continuation byte                             |
| 137 |  89 | 211 |              91 |           |        |          | W: permille                  |     Yes      | Continuation byte                             |
| 138 |  8a | 212 |               5 |           |        |          | W: capital S caron           |     Yes      | Continuation byte                             |
| 139 |  8b | 213 |               2 |           |        |          | W: open single guillemet     |     Yes      | Continuation byte                             |
| 140 |  8c | 214 |               1 |           |        |          | W: capital O+E               |     Yes      | Continuation byte                             |
| 141 |  8d | 215 |               3 |           |        |          |                              |              | Continuation byte                             |
| 142 |  8e | 216 |               1 |           |        |          | W: capital Z caron           |     Yes      | Continuation byte                             |
| 143 |  8f | 217 |               2 |           |        |          |                              |              | Continuation byte                             |
| 144 |  90 | 220 |               0 |           |        |          |                              |              | Continuation byte                             |
| 145 |  91 | 221 |               2 |           |        |          | W: curved single close quote |     Yes      | Continuation byte                             |
| 146 |  92 | 222 |               2 |           |        |          | W: curved single open quote  |     Yes      | Continuation byte                             |
| 147 |  93 | 223 |               9 |           |        |          | W: curved double open quote  |     Yes      | Continuation byte                             |
| 148 |  94 | 224 |               5 |           |        |          | W: curved double close quote |     Yes      | Continuation byte                             |
| 149 |  95 | 225 |               0 |           |        |          | W: bullet                    |     Yes      | Continuation byte                             |
| 150 |  96 | 226 |              19 |           |        |          | W: en-dash                   |     Yes      | Continuation byte                             |
| 151 |  97 | 227 |               0 |           |        |          | W: em-dash                   |     Yes      | Continuation byte                             |
| 152 |  98 | 230 |               7 |           |        |          | W: tilde                     |     Yes      | Continuation byte                             |
| 153 |  99 | 231 |             154 |           |        |          | W: trade mark                |     Yes      | Continuation byte                             |
| 154 |  9a | 232 |               4 |           |        |          | W: small S caron             |     Yes      | Continuation byte                             |
| 155 |  9b | 233 |               1 |           |        |          | W: close single guillemet    |     Yes      | Continuation byte                             |
| 156 |  9c | 234 |               9 |           |        |          | W: small O+E                 |     Yes      | Continuation byte                             |
| 157 |  9d | 235 |               0 |           |        |          |                              |              | Continuation byte                             |
| 158 |  9e | 236 |               0 |           |        |          | W: small Z caron             |     Yes      | Continuation byte                             |
| 159 |  9f | 237 |               0 |           |        |          | W: capital Y umlaut          |     Yes      | Continuation byte                             |
| 160 |  a0 | 240 |               0 |           |        |          | W+I: non-breaking space      |              | Continuation byte                             |
| 161 |  a1 | 241 |               0 |           |        |          | W+I: inverted exclamation    |              | Continuation byte                             |
| 162 |  a2 | 242 |               0 |           |        |          | W+I: cent                    |              | Continuation byte                             |
| 163 |  a3 | 243 |               6 |           |        |          | W+I: UK pound sign           |              | Continuation byte                             |
| 164 |  a4 | 244 |               0 |           |        |          | W+I: currency                |              | Continuation byte                             |
| 165 |  a5 | 245 |               0 |           |        |          | W+I: Yen sign                |              | Continuation byte                             |
| 166 |  a6 | 246 |               0 |           |        |          | W+I: broken bar              |              | Continuation byte                             |
| 167 |  a7 | 247 |               0 |           |        |          | W+I: section                 |              | Continuation byte                             |
| 168 |  a8 | 250 |               0 |           |        |          | W+I: trema                   |              | Continuation byte                             |
| 169 |  a9 | 251 |               0 |           |        |          | W+I: copyright sign          |              | Continuation byte                             |
| 170 |  aa | 252 |               3 |           |        |          | W+I: feminine ordinal        |              | Continuation byte                             |
| 171 |  ab | 253 |               0 |           |        |          | W+I: open double guillemet   |              | Continuation byte                             |
| 172 |  ac | 254 |               0 |           |        |          | W+I: logical complement      |              | Continuation byte                             |
| 173 |  ad | 255 |               0 |           |        |          | W+I: soft hyphen             |              | Continuation byte                             |
| 174 |  ae | 256 |               0 |           |        |          | W+I: registered sign         |              | Continuation byte                             |
| 175 |  af | 257 |               0 |           |        |          | W+I: macron                  |              | Continuation byte                             |
| 176 |  b0 | 260 |               1 |           |        |          | W+I: degree                  |              | Continuation byte                             |
| 177 |  b1 | 261 |               0 |           |        |          | W+I: plus-or-minus sign      |              | Continuation byte                             |
| 178 |  b2 | 262 |               0 |           |        |          | W+I: raise to power of 2     |              | Continuation byte                             |
| 179 |  b3 | 263 |               0 |           |        |          | W+I: raise to power of 3     |              | Continuation byte                             |
| 180 |  b4 | 264 |               2 |           |        |          | W+I: acute accent            |              | Continuation byte                             |
| 181 |  b5 | 265 |               0 |           |        |          | W+I: Greek letter mu         |              | Continuation byte                             |
| 182 |  b6 | 266 |               0 |           |        |          | W+I: pilcrow (paragraph)     |              | Continuation byte                             |
| 183 |  b7 | 267 |               0 |           |        |          | W+I: middle dot              |              | Continuation byte                             |
| 184 |  b8 | 270 |               0 |           |        |          | W+I: cedilla                 |              | Continuation byte                             |
| 185 |  b9 | 271 |               0 |           |        |          | W+I: superscript 1           |              | Continuation byte                             |
| 186 |  ba | 272 |               0 |           |        |          | W+I: masculine ordinal       |              | Continuation byte                             |
| 187 |  bb | 273 |               0 |           |        |          | W+I: close double guillemet  |              | Continuation byte                             |
| 188 |  bc | 274 |               0 |           |        |          | W+I: one quarter             |              | Continuation byte                             |
| 189 |  bd | 275 |               1 |           |        |          | W+I: one half                |              | Continuation byte                             |
| 190 |  be | 276 |               0 |           |        |          | W+I: three quarters          |              | Continuation byte                             |
| 191 |  bf | 277 |               0 |           |        |          | W+I: inverted question mark  |              | Continuation byte                             |
| 192 |  c0 | 300 |               0 |           |        |          | W+I: capital A grave         |              | Invalid in UTF-8                              |
| 193 |  c1 | 301 |               0 |           |        |          | W+I: capital A  acute        |              | Invalid in UTF-8                              |
| 194 |  c2 | 302 |               6 |           |        |          | W+I: capital A circumflex    |              | Leading Byte, Latin                           |
| 195 |  c3 | 303 |             212 |           |        |          | W+I: capital A tilde         |              | Leading Byte, Latin                           |
| 196 |  c4 | 304 |               7 |           |        |          | W+I: capital A umlaut        |              | Leading Byte, Latin                           |
| 197 |  c5 | 305 |               9 |           |        |          | W+I: capital A ring          |              | Leading Byte, Latin                           |
| 198 |  c6 | 306 |               0 |           |        |          | W+I: capital ash A+E         |              | Leading Byte, Latin                           |
| 199 |  c7 | 307 |               0 |           |        |          | W+I: capital C cedilla       |              | Leading Byte, Latin                           |
| 200 |  c8 | 310 |               0 |           |        |          | W+I: capital E grave         |              | Leading Byte, Latin                           |
| 201 |  c9 | 311 |               0 |           |        |          | W+I: capital E acute         |              | Leading Byte, International Phonetic Alphabet |
| 202 |  ca | 312 |               0 |           |        |          | W+I: capital E circumflex    |              | Leading Byte, International Phonetic Alphabet |
| 203 |  cb | 313 |               0 |           |        |          | W+I: capital E umlaut        |              | Leading Byte, International Phonetic Alphabet |
| 204 |  cc | 314 |               0 |           |        |          | W+I: capital I grave         |              | Leading Byte, accents                         |
| 205 |  cd | 315 |               0 |           |        |          | W+I: capital I acute         |              | Leading Byte, accents                         |
| 206 |  ce | 316 |               0 |           |        |          | W+I: capital I circumflex    |              | Leading Byte, Greek                           |
| 207 |  cf | 317 |               0 |           |        |          | W+I: capital I umlaut        |              | Leading Byte, Greek                           |
| 208 |  d0 | 320 |               0 |           |        |          | W+I: capital D with bar      |              | Leading Byte, Cyrillic                        |
| 209 |  d1 | 321 |               0 |           |        |          | W+I: capital N tilde         |              | Leading Byte, Cyrillic                        |
| 210 |  d2 | 322 |               0 |           |        |          | W+I: capital O grave         |              | Leading Byte, Cyrillic                        |
| 211 |  d3 | 323 |               0 |           |        |          | W+I: capital O acute         |              | Leading Byte, Cyrillic                        |
| 212 |  d4 | 324 |               0 |           |        |          | W+I: capital O circumflex    |              | Leading Byte, Cyrillic                        |
| 213 |  d5 | 325 |               0 |           |        |          | W+I: capital O tilde         |              | Leading Byte, Armenian                        |
| 214 |  d6 | 326 |               0 |           |        |          | W+I: capital O umlaut        |              | Leading Byte, Hebrew                          |
| 215 |  d7 | 327 |               0 |           |        |          | W+I: multiplication sign     |              | Leading Byte, Hebrew                          |
| 216 |  d8 | 330 |               0 |           |        |          | W+I: capital O with slash    |              | Leading Byte, Arabic                          |
| 217 |  d9 | 331 |               0 |           |        |          | W+I: capital U grave         |              | Leading Byte, Arabic                          |
| 218 |  da | 332 |               0 |           |        |          | W+I: capital U acute         |              | Leading Byte, Arabic                          |
| 219 |  db | 333 |               0 |           |        |          | W+I: capital U circumflex    |              | Leading Byte, Arabic                          |
| 220 |  dc | 334 |               0 |           |        |          | W+I: capital u umlaut        |              | Leading Byte, Syriac                          |
| 221 |  dd | 335 |               0 |           |        |          | W+I: capital Y acute         |              | Leading Byte, Arabic                          |
| 222 |  de | 336 |               0 |           |        |          | W+I: capital thorn           |              | Leading Byte, Thaana                          |
| 223 |  df | 337 |               0 |           |        |          | W+I: capital eszett          |              | Leading Byte, N'Ko                            |
| 224 |  e0 | 340 |               0 |           |        |          | W+I: small a grave           |              | Leading Byte, Indic                           |
| 225 |  e1 | 341 |               0 |           |        |          | W+I: small a acute           |              | Leading Byte, Miscellaneous                   |
| 226 |  e2 | 342 |             154 |           |        |          | W+I: small a circumflex      |              | Leading Byte, Symbol                          |
| 227 |  e3 | 343 |               0 |           |        |          | W+I: small a tilde           |              | Leading Byte, Kana & Chinese/Japanese/Korean  |
| 228 |  e4 | 344 |               0 |           |        |          | W+I: small a umlaut          |              | Leading Byte, Chinese/Japanese/Korean unified |
| 229 |  e5 | 345 |               0 |           |        |          | W+I: small a ring            |              | Leading Byte, Chinese/Japanese/Korean unified |
| 230 |  e6 | 346 |               0 |           |        |          | W+I: small ash a+e           |              | Leading Byte, Chinese/Japanese/Korean unified |
| 231 |  e7 | 347 |               0 |           |        |          | W+I: small c cedilla         |              | Leading Byte, Chinese/Japanese/Korean unified |
| 232 |  e8 | 350 |               0 |           |        |          | W+I: small e grave           |              | Leading Byte, Chinese/Japanese/Korean unified |
| 233 |  e9 | 351 |               0 |           |        |          | W+I: small e acute           |              | Leading Byte, Chinese/Japanese/Korean unified |
| 234 |  ea | 352 |               0 |           |        |          | W+I: small e circumflex      |              | Leading Byte, Asian                           |
| 235 |  eb | 353 |               0 |           |        |          | W+I: small e umlaut          |              | Leading Byte, Hangul                          |
| 236 |  ec | 354 |               0 |           |        |          | W+I: small i grave           |              | Leading Byte, Hangul                          |
| 237 |  ed | 355 |               0 |           |        |          | W+I: small i acute           |              | Leading Byte, Hangul                          |
| 238 |  ee | 356 |               0 |           |        |          | W+I: small i circumflex      |              | Leading Byte, Private Use Areas               |
| 239 |  ef | 357 |               0 |           |        |          | W+I: small i umlaut          |              | Leading Byte, Forms                           |
| 240 |  f0 | 360 |               0 |           |        |          | W+I: small eth               |              | Leading Byte, Supplementary Planes            |
| 241 |  f1 | 361 |               0 |           |        |          | W+I: small n tilde           |              | Leading Byte                                  |
| 242 |  f2 | 362 |               0 |           |        |          | W+I: small o grave           |              | Leading Byte                                  |
| 243 |  f3 | 363 |               0 |           |        |          | W+I: small o acute           |              | Leading Byte, Supplementary Planes            |
| 244 |  f4 | 364 |               0 |           |        |          | W+I: small o circumflex      |              | Leading Byte, Supplementary Planes            |
| 245 |  f5 | 365 |               0 |           |        |          | W+I: small o tilde           |              | Invalid in UTF-8                              |
| 246 |  f6 | 366 |               0 |           |        |          | W+I: small o umlaut          |              | Invalid in UTF-8                              |
| 247 |  f7 | 367 |               0 |           |        |          | W+I: division sign           |              | Invalid in UTF-8                              |
| 248 |  f8 | 370 |               0 |           |        |          | W+I: small o with slash      |              | Invalid in UTF-8                              |
| 249 |  f9 | 371 |               0 |           |        |          | W+I: small u grave           |              | Invalid in UTF-8                              |
| 250 |  fa | 372 |               0 |           |        |          | W+I: small u acute           |              | Invalid in UTF-8                              |
| 251 |  fb | 373 |               0 |           |        |          | W+I: small u circumflex      |              | Invalid in UTF-8                              |
| 252 |  fc | 374 |               0 |           |        |          | W+I: small u umlaut          |              | Invalid in UTF-8                              |
| 253 |  fd | 375 |               0 |           |        |          | W+I: small y acute           |              | Invalid in UTF-8                              |
| 254 |  fe | 376 |               0 |           |        |          | W+I: small thorn             |              | Invalid in UTF-8                              |
| 255 |  ff | 377 |               0 |           |        |          | W+I: small y umlaut          |              | Invalid in UTF-8                              |
+-----+-----+-----+-----------------+-----------+--------+----------+------------------------------+--------------+-----------------------------------------------+

Start Time: 12-Apr-2017 10:18:12
End Time: 12-Apr-2017 10:18:20
Elapsed Time: 00:00:07
    

From this table, we see that almost all of the bytes are valid ASCII characters. ASCII is a subset of UTF-8, so that is good.

Next there are a number of Continuation Bytes. These only make sense with their Leading Bytes, so we'll look at those next.

The Names shown for Windows-1252 and ISO 8859-1 are irrelevant for a UTF-8 file. We didn't know that it was UTF-8 until we had run the analysis, but now that we do know, don't be distracted by those names.

The leading bytes we have are:

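+--------------+-----------------+-----------------+--------------+
| Leading Byte | Number of Bytes | Character Group | Number Found |
|    (hex)     |   in Sequence   | (Unicode block) | In This File |
+--------------+-----------------+-----------------+--------------+
|      c2      |        2        | Latin           |            6 |
|      c3      |        2        | Latin           |          212 |
|      c4      |        2        | Latin           |            7 |
|      c5      |        2        | Latin           |            9 |
|      e2      |        3        | Symbol          |          154 |
+--------------+-----------------+-----------------+--------------+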
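
The "Number of Bytes in Sequence" figure follows directly from the value of the leading byte. Here is a minimal sketch of that rule in Java. The class name Utf8ByteClass and the method sequenceLength are made up for this illustration; this is not code taken from EncodingProfile, just one way the byte ranges in the big table above could be expressed.

```java
// Hypothetical helper for illustration only; not part of EncodingProfile.
final class Utf8ByteClass {

    /**
     * Returns the length of the UTF-8 sequence introduced by this byte value:
     * 1 for ASCII, 2 to 4 for a leading byte, 0 for a continuation byte,
     * and -1 for a value that can never appear in valid UTF-8.
     */
    static int sequenceLength(int b) {
        b &= 0xff;                    // treat the byte as unsigned
        if (b <= 0x7f) return 1;      // 00-7f: ASCII
        if (b <= 0xbf) return 0;      // 80-bf: continuation byte
        if (b <= 0xc1) return -1;     // c0, c1: invalid in UTF-8
        if (b <= 0xdf) return 2;      // c2-df: leading byte of a 2-byte sequence
        if (b <= 0xef) return 3;      // e0-ef: leading byte of a 3-byte sequence
        if (b <= 0xf4) return 4;      // f0-f4: leading byte of a 4-byte sequence
        return -1;                    // f5-ff: invalid in UTF-8
    }

    public static void main(String[] args) {
        // The five leading bytes found in the Companies House file.
        int[] leadingBytes = {0xc2, 0xc3, 0xc4, 0xc5, 0xe2};
        for (int b : leadingBytes) {
            System.out.printf("%02x introduces a %d-byte sequence%n", b, sequenceLength(b));
        }
    }
}
```

Returning 0 and -1 for continuation and invalid bytes is just a convenience for this sketch; a real profiler would probably want a proper enum.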

I want to find these, to see what kinds of characters are being used, because our users probably will not think to search for accented characters or special symbols. So how are we going to do this in large files? I thought back to the time when I was using the Netezza Data Warehouse Appliance (happy days!). The Netezza Data Loader was very good at telling me:

More recently, I was working on a Hive project. Hive works on the philosophy that you just don't want to know about errors. Its design principle is "keep quiet and carry on". (That's why Data Scientists spend 80% of their time cleaning data. Brilliant. Not!). Anyway, having spent much time helping developers play "Hunt The Dodgy Character" I decided to write a program to give me this information.

So, if we run the same command as before, but add a second parameter to switch on verbose output, we get this.

  $ java EncodingProfile ~/Downloads/BasicCompanyData-2017-04-01-part4_5.csv true | more
  Scanning file: /Users/ronballard/Downloads/BasicCompanyData-2017-04-01-part4_5.csv
  Line: 2470, column: 77  Found UTF-8 leading byte 226, Hex: e2, Octal: 342, UTF-8 Leading Byte, Symbol, Windows-1252 Lowercase A circumflex
  Valid UTF-8 multi-byte sequence: e2 80 99  Unicode Code Point: U+2019
  Line: 4878, column: 109  Found UTF-8 leading byte 196, Hex: c4, Octal: 304, UTF-8 Leading Byte, Latin, Windows-1252 Uppercase A umlaut
  Valid UTF-8 multi-byte sequence: c4 b0  Unicode Code Point: U+0130
  Line: 5880, column: 36  Found UTF-8 leading byte 195, Hex: c3, Octal: 303, UTF-8 Leading Byte, Latin, Windows-1252 Uppercase A tilde
  Valid UTF-8 multi-byte sequence: c3 84  Unicode Code Point: U+00c4
  Line: 7098, column: 16  Found UTF-8 leading byte 195, Hex: c3, Octal: 303, UTF-8 Leading Byte, Latin, Windows-1252 Uppercase A tilde
  Valid UTF-8 multi-byte sequence: c3 9c  Unicode Code Point: U+00dc
  Line: 9359, column: 6  Found UTF-8 leading byte 195, Hex: c3, Octal: 303, UTF-8 Leading Byte, Latin, Windows-1252 Uppercase A tilde
  Valid UTF-8 multi-byte sequence: c3 88  Unicode Code Point: U+00c8
  Line: 15699, column: 10  Found UTF-8 leading byte 226, Hex: e2, Octal: 342, UTF-8 Leading Byte, Symbol, Windows-1252 Lowercase A circumflex
  Valid UTF-8 multi-byte sequence: e2 80 99  Unicode Code Point: U+2019
    

This is just the first few lines of what can be very verbose. It highlights possible errors (as we'll see in later examples) and shows all the multi-byte UTF-8 sequences. For most files using a Latin alphabet this will be fine, but if you are using a writing system that generates a lot of multi-byte sequences (Cyrillic or Chinese, for example) this would be a pain. You could redirect it to a file, or pipe it through more as I did here.

Where we have a source file like this one, with a very small proportion of multi-byte characters, it can be useful to see which of these characters are used. We might then choose to provide standardised versions of these characters for searching.

In the extract above we see some valid UTF-8 characters. The report shows the line number and character position within the line for each of the interesting characters (which are all UTF-8 Leading Bytes in this case). Given the line number, we can find that line in the file. Here's an example using standard UNIX commands (you aren't trying to do this in Windows, are you?):

  $ head -2470 /Users/ronballard/Downloads/BasicCompanyData-2017-04-01-part4_5.csv | tail -1
  "NEWTREE CAPITAL MANAGEMENT LLP","OC392611","","","NORFOLK HOUSE 31 ST JAMES’S SQUARE","1ST FLOOR","LONDON",
  "","ENGLAND","SW1Y 4JJ","Limited Liability Partnership","Active","United Kingdom","","11/04/2014","30","4",
  "31/01/2018","30/04/2016","TOTAL EXEMPTION SMALL","09/05/2017","11/04/2016","0","0","0","0","None Supplied",
  "","","","0","0","http://business.data.gov.uk/id/company/OC392611","","","","","","","","","","","","","","",
  "","","","","","","25/04/2018",""
    

This takes us directly to line 2470, the first line mentioned in the verbose report above. Then we have to find the 77th character in that line. We can see that this is the apostrophe in "JAMES’S". There are several characters that can be used as an apostrophe. Here's one way to find out which of the options we are using here (just piping the output from the previous command through hexdump):

  $ head -2470 /Users/ronballard/Downloads/BasicCompanyData-2017-04-01-part4_5.csv | tail -1 | hexdump -C
  00000000  22 4e 45 57 54 52 45 45  20 43 41 50 49 54 41 4c  |"NEWTREE CAPITAL|
  00000010  20 4d 41 4e 41 47 45 4d  45 4e 54 20 4c 4c 50 22  | MANAGEMENT LLP"|
  00000020  2c 22 4f 43 33 39 32 36  31 31 22 2c 22 22 2c 22  |,"OC392611","","|
  00000030  22 2c 22 4e 4f 52 46 4f  4c 4b 20 48 4f 55 53 45  |","NORFOLK HOUSE|
  00000040  20 33 31 20 53 54 20 4a  41 4d 45 53 e2 80 99 53  | 31 ST JAMES...S|
  00000050  20 53 51 55 41 52 45 22  2c 22 31 53 54 20 46 4c  | SQUARE","1ST FL|
  00000060  4f 4f 52 22 2c 22 4c 4f  4e 44 4f 4e 22 2c 22 22  |OOR","LONDON",""|
    

We can see that the hex string for the apostrophe between "JAMES" and "S" is e2 80 99. Hex e2 is a UTF-8 Leading Byte that introduces a 3-byte sequence.
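
If you would rather not depend on head, tail and hexdump, roughly the same check can be sketched in Java. The class LineHex and its two methods are hypothetical names for this illustration, not part of EncodingProfile. The file is read as ISO 8859-1 only because that encoding maps every byte to exactly one character, so the original bytes can be recovered unchanged whatever the real encoding turns out to be.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of EncodingProfile.
final class LineHex {

    /** Returns line n (1-based) from the reader, or null if there are fewer than n lines. */
    static String nthLine(BufferedReader reader, long n) throws IOException {
        String line = null;
        for (long i = 0; i < n; i++) {
            line = reader.readLine();
            if (line == null) return null;
        }
        return line;
    }

    /** Formats bytes as space-separated lowercase hex, like the middle of a hexdump. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 2) {
            // Usage: java LineHex <path> <line number>
            try (BufferedReader r = Files.newBufferedReader(
                    Paths.get(args[0]), StandardCharsets.ISO_8859_1)) {
                String line = nthLine(r, Long.parseLong(args[1]));
                if (line != null) {
                    System.out.println(toHex(line.getBytes(StandardCharsets.ISO_8859_1)));
                }
            }
        } else {
            // No file given: show the bytes of the fragment discussed above.
            System.out.println(toHex("JAMES\u2019S".getBytes(StandardCharsets.UTF_8)));
        }
    }
}
```

Run without arguments, it prints the bytes of "JAMES’S" encoded as UTF-8, and the e2 80 99 sequence appears between the two S characters just as it does in the hexdump.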

We've made this a bit easier by showing in the verbose report the 3-byte UTF-8 sequence, and the Unicode Code Point that this sequence encodes:

      Valid UTF-8 multi-byte sequence: e2 80 99  Unicode Code Point: U+2019
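
To see where U+2019 comes from, here is a small sketch of the arithmetic in Java. The class Utf8Decode and the method codePoint are names invented for this example, and the method assumes it is handed one well-formed sequence; it does no validation, so it is an illustration of the bit-shuffling rather than a safe decoder.

```java
// Hypothetical helper for illustration only; not part of EncodingProfile.
final class Utf8Decode {

    /** Decodes one well-formed UTF-8 sequence (1 to 4 bytes) to its Unicode code point. */
    static int codePoint(int... bytes) {
        int first = bytes[0] & 0xff;
        if (bytes.length == 1) return first;          // plain ASCII
        // Keep only the payload bits of the leading byte:
        // 110xxxxx (2 bytes), 1110xxxx (3 bytes), 11110xxx (4 bytes).
        int cp = first & (0xff >> (bytes.length + 1));
        // Each continuation byte (10xxxxxx) contributes its low 6 bits.
        for (int i = 1; i < bytes.length; i++) {
            cp = (cp << 6) | (bytes[i] & 0x3f);
        }
        return cp;
    }

    public static void main(String[] args) {
        System.out.printf("U+%04X%n", codePoint(0xe2, 0x80, 0x99)); // prints U+2019
        System.out.printf("U+%04X%n", codePoint(0xc4, 0xb0));       // prints U+0130
    }
}
```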
    

There are over a million Unicode Code Points, so how do we find this one? There are several websites that I have found useful for this:

The next one we find is at line: 4878, column: 109. This is Unicode Code Point: U+0130 which is "İ", capital "I" with a dot above it. Turkish treats dotted "İ" and dotless "I" as two separate letters with different sounds. "İ" is the correct first letter of the city "İstanbul". Since Istanbul is often spelled (outside Turkey) with a plain "I" we would want to find both "İstanbul" and "Istanbul" if we were searching for this city in our databases. This is another reason for maintaining a standardised version for searching.

What this shows is that, even with a very clean file like this one, it is worth understanding what data you have so that you can make informed decisions about how to offer this data to users.
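
One possible way to build such a standardised search version is sketched below, using the standard java.text.Normalizer class. The class SearchKey and the method fold are hypothetical names, and the rules (strip accents, replace the curly apostrophe U+2019 with a plain one) are only my assumption about what "standardised" might mean for this file. Your own rules will depend on your data and your users.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; the folding rules are an assumption.
final class SearchKey {

    /** Returns a simplified copy of the text, intended for searching only. */
    static String fold(String s) {
        // NFD splits an accented letter into a base letter plus combining marks,
        // e.g. U+0130 becomes "I" followed by U+0307 (combining dot above).
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // Remove the combining marks, leaving just the base letters.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Map the right single quotation mark (U+2019) to a plain ASCII apostrophe.
        return stripped.replace('\u2019', '\'');
    }

    public static void main(String[] args) {
        System.out.println(fold("\u0130stanbul"));            // prints Istanbul
        System.out.println(fold("31 ST JAMES\u2019S SQUARE")); // prints 31 ST JAMES'S SQUARE
    }
}
```

The folding loses information, so the folded text would be stored alongside the original value, never instead of it.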

Example 2: A file that appears to be encoded using ISO/IEC 8859-1

This is another file from the UK Government's Open Data initiative. It is a list of places with map references (a gazetteer). You can download the file from Open Data Gazetteer. The format is described as "ASCII text, Colon separated".

This is what we found:

  Scanning file: /Users/ronballard/Downloads/gaz50k_gb/Data/50kgaz2016.txt

  +------------------------------------------------------------|-----------------+
  | Lines:                                                     |         258,383 |
  | Bytes:                                                     |      27,865,091 |
  | Windows line ends:                                         |         258,382 |
  | UNIX line ends:                                            |               0 |
  | Control Characters (excluding newline & carriage return):  |               0 |
  | UTF-8 Leading bytes:                                       |           6,728 |
  | UTF-8 Continuation bytes:                                  |               0 |
  | UTF-8 Leading bytes without enough Continuation bytes:     |           6,728 |
  | UTF-8 Continuation bytes without a preceding Leading byte: |               0 |
  | UTF-8 Invalid bytes:                                       |           1,586 |
  | Byte values specific to Windows 1252:                      |               0 |
  | Byte values valid in ISO 8859 and Windows 12xx families:   |           8,314 |
  | Carriage returns without newline:                          |               0 |
  +------------------------------------------------------------|-----------------+
    

All the line ends are Windows-style (carriage return/linefeed) so this is clearly a Windows file. There are no other control characters and no carriage returns without linefeeds, so all that is good.
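
The three line-end counts in the summary can be worked out in a single pass over the bytes. Here is a minimal sketch of one way to do it. The class LineEnds and the method count are invented for this example, and it works on a byte array in memory for simplicity; I am not claiming this is how EncodingProfile does it.

```java
// Hypothetical helper for illustration only; not part of EncodingProfile.
final class LineEnds {

    /**
     * Returns three counts: [0] Windows line ends (carriage return + newline),
     * [1] UNIX line ends (newline on its own), [2] carriage returns without newline.
     */
    static long[] count(byte[] data) {
        long crlf = 0, lf = 0, cr = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\r') {
                if (i + 1 < data.length && data[i + 1] == '\n') {
                    crlf++;
                    i++;          // skip the newline we have just counted
                } else {
                    cr++;         // carriage return on its own
                }
            } else if (data[i] == '\n') {
                lf++;             // newline not preceded by a carriage return
            }
        }
        return new long[] {crlf, lf, cr};
    }

    public static void main(String[] args) {
        byte[] sample = "a\r\nb\nc\rd\r\n".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        long[] c = count(sample);
        System.out.println("Windows: " + c[0] + ", UNIX: " + c[1] + ", CR only: " + c[2]);
    }
}
```

A file with a mixture of line-end styles is usually a sign that it has been stitched together from several sources, which is worth knowing before you load it.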

The encoding is clearly not UTF-8 because there are 1,586 bytes that are invalid in UTF-8, and all the bytes that could be UTF-8 Leading Bytes are not followed by the correct number of Continuation Bytes. In fact, there are no Continuation Bytes.
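
If all you want is a yes/no answer to "could this be UTF-8?", the standard Java decoder can give it when told to report errors instead of silently replacing bad bytes. The sketch below is an alternative to the byte-counting approach, not a description of it. The class Utf8Check and the method isValidUtf8 are hypothetical names, and the whole input is held in memory, so a large file would need to be fed through in chunks.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of EncodingProfile.
final class Utf8Check {

    /** Returns true if the bytes form a completely valid UTF-8 byte stream. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)       // fail on bad sequences
                    .onUnmappableCharacter(CodingErrorAction.REPORT)  // fail rather than substitute
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9 bar";   // made-up sample text containing one accented letter
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.UTF_8)));      // prints true
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.ISO_8859_1))); // prints false
    }
}
```

In the second call the accented letter is stored as the single byte e9, which UTF-8 treats as a leading byte. It is not followed by any continuation bytes, which is the same situation the summary above reports for this file.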

The documentation says that the file format is ASCII. It is not really ASCII because there are 8-bit values. These could be valid in the ISO 8859 and Windows 12xx families, both of which are sometimes described as "8-bit ASCII". It's time to look at what byte values we actually have.


+-----+-----+-----+-----------------+-----------+--------+----------+------------------------------+--------------+--------------------------------------------------+
|     |     |     |                 |   ASCII   |   C    | teletype | Name (W = Windows-1252,      | Specific to  |                   UTF-8 Group                    |
| Dec | Hex | Oct | Number of Bytes | Printable | Escape | notation | I = ISO 8859-1, C = Control  | Windows-1252 |Leading Byte(n) = first byte of an n-byte sequence|
+-----+-----+-----+-----------------+-----------+--------+----------+------------------------------+--------------+--------------------------------------------------+
| 000 |  00 | 000 |               0 |           |   \0   |    ^@    | C:null                       |              | Control character, ASCII compatible              |
| 001 |  01 | 001 |               0 |           |        |    ^A    | C:start of heading           |              | Control character, ASCII compatible              |
| 002 |  02 | 002 |               0 |           |        |    ^B    | C:start of text              |              | Control character, ASCII compatible              |
| 003 |  03 | 003 |               0 |           |        |    ^C    | C:end of text                |              | Control character, ASCII compatible              |
| 004 |  04 | 004 |               0 |           |        |    ^D    | C:end of transmission        |              | Control character, ASCII compatible              |
| 005 |  05 | 005 |               0 |           |        |    ^E    | C:enquiry                    |              | Control character, ASCII compatible              |
| 006 |  06 | 006 |               0 |           |        |    ^F    | C:acknowledgement            |              | Control character, ASCII compatible              |
| 007 |  07 | 007 |               0 |           |   \a   |    ^G    | C:bell                       |              | Control character, ASCII compatible              |
| 008 |  08 | 010 |               0 |           |   \b   |    ^H    | C:backspace                  |              | Control character, ASCII compatible              |
| 009 |  09 | 011 |               0 |           |   \t   |    ^I    | C:horizontal tab             |              | Control character, ASCII compatible              |
| 010 |  0a | 012 |         258,382 |           |   \n   |    ^J    | C:newline                    |              | Control character, ASCII compatible              |
| 011 |  0b | 013 |               0 |           |   \v   |    ^K    | C:vertical tab               |              | Control character, ASCII compatible              |
| 012 |  0c | 014 |               0 |           |   \f   |    ^L    | C:form feed                  |              | Control character, ASCII compatible              |
| 013 |  0d | 015 |         258,382 |           |   \r   |    ^M    | C:carriage return            |              | Control character, ASCII compatible              |
| 014 |  0e | 016 |               0 |           |        |    ^N    | C:shift out                  |              | Control character, ASCII compatible              |
| 015 |  0f | 017 |               0 |           |        |    ^O    | C:shift in                   |              | Control character, ASCII compatible              |
| 016 |  10 | 020 |               0 |           |        |    ^P    | C:data link escape           |              | Control character, ASCII compatible              |
| 017 |  11 | 021 |               0 |           |        |    ^Q    | C:device control 1           |              | Control character, ASCII compatible              |
| 018 |  12 | 022 |               0 |           |        |    ^R    | C:device control 2           |              | Control character, ASCII compatible              |
| 019 |  13 | 023 |               0 |           |        |    ^S    | C:device control 3           |              | Control character, ASCII compatible              |
| 020 |  14 | 024 |               0 |           |        |    ^T    | C:device control 4           |              | Control character, ASCII compatible              |
| 021 |  15 | 025 |               0 |           |        |    ^U    | C:negative acknowledgement   |              | Control character, ASCII compatible              |
| 022 |  16 | 026 |               0 |           |        |    ^V    | C:synchronous idle           |              | Control character, ASCII compatible              |
| 023 |  17 | 027 |               0 |           |        |    ^W    | C:end of transmission block  |              | Control character, ASCII compatible              |
| 024 |  18 | 030 |               0 |           |        |    ^X    | C:cancel                     |              | Control character, ASCII compatible              |
| 025 |  19 | 031 |               0 |           |        |    ^Y    | C:end of medium              |              | Control character, ASCII compatible              |
| 026 |  1a | 032 |               0 |           |        |    ^Z    | C:substitute                 |              | Control character, ASCII compatible              |
| 027 |  1b | 033 |               0 |           |   \e   |    ^[    | C:escape                     |              | Control character, ASCII compatible              |
| 028 |  1c | 034 |               0 |           |        |    ^\    | C:file separator             |              | Control character, ASCII compatible              |
| 029 |  1d | 035 |               0 |           |        |    ^]    | C:group separator            |              | Control character, ASCII compatible              |
| 030 |  1e | 036 |               0 |           |        |    ^^    | C:record separator           |              | Control character, ASCII compatible              |
| 031 |  1f | 037 |               0 |           |        |    ^_    | C:unit separator             |              | Control character, ASCII compatible              |
| 032 |  20 | 040 |         450,803 |           |        |          | space                        |              | Punctuation and symbols, ASCII compatible        |
| 033 |  21 | 041 |               1 |     !     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 034 |  22 | 042 |               0 |     "     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 035 |  23 | 043 |               0 |     #     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 036 |  24 | 044 |               0 |     $     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 037 |  25 | 045 |               0 |     %     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 038 |  26 | 046 |          22,612 |     &     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 039 |  27 | 047 |          10,798 |     '     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 040 |  28 | 050 |             936 |     (     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 041 |  29 | 051 |             936 |     )     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 042 |  2a | 052 |               0 |     *     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 043 |  2b | 053 |               0 |     +     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 044 |  2c | 054 |           5,026 |     ,     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 045 |  2d | 055 |         533,238 |     -     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 046 |  2e | 056 |         461,258 |     .     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 047 |  2f | 057 |             842 |     /     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 048 |  30 | 060 |       2,374,346 |     0     |        |          |                              |              | Numeric digits, ASCII compatible                 |
| 049 |  31 | 061 |       1,511,488 |     1     |        |          |                              |              | Numeric digits, ASCII compatible                 |
| 050 |  32 | 062 |         989,312 |     2     |        |          |                              |              | Numeric digits, ASCII compatible                 |
| 051 |  33 | 063 |       1,024,366 |     3     |        |          |                              |              | Numeric digits, ASCII compatible                 |
| 052 |  34 | 064 |         843,459 |     4     |        |          |                              |              | Numeric digits, ASCII compatible                 |
| 053 |  35 | 065 |       1,461,683 |     5     |        |          |                              |              | Numeric digits, ASCII compatible                 |
| 054 |  36 | 066 |         657,658 |     6     |        |          |                              |              | Numeric digits, ASCII compatible                 |
| 055 |  37 | 067 |         537,147 |     7     |        |          |                              |              | Numeric digits, ASCII compatible                 |
| 056 |  38 | 070 |         633,110 |     8     |        |          |                              |              | Numeric digits, ASCII compatible                 |
| 057 |  39 | 071 |         992,640 |     9     |        |          |                              |              | Numeric digits, ASCII compatible                 |
| 058 |  3a | 072 |       4,909,258 |     :     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 059 |  3b | 073 |               0 |     ;     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 060 |  3c | 074 |               0 |     <     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 061 |  3d | 075 |               0 |     =     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 062 |  3e | 076 |               0 |     >     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 063 |  3f | 077 |               0 |     ?     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 064 |  40 | 100 |               0 |     @     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 065 |  41 | 101 |         344,725 |     A     |        |          |                              |              | English alphabet, ASCII compatible               |
| 066 |  42 | 102 |         119,886 |     B     |        |          |                              |              | English alphabet, ASCII compatible               |
| 067 |  43 | 103 |         154,944 |     C     |        |          |                              |              | English alphabet, ASCII compatible               |
| 068 |  44 | 104 |         133,774 |     D     |        |          |                              |              | English alphabet, ASCII compatible               |
| 069 |  45 | 105 |         121,824 |     E     |        |          |                              |              | English alphabet, ASCII compatible               |
| 070 |  46 | 106 |         123,899 |     F     |        |          |                              |              | English alphabet, ASCII compatible               |
| 071 |  47 | 107 |          88,399 |     G     |        |          |                              |              | English alphabet, ASCII compatible               |
| 072 |  48 | 110 |         196,118 |     H     |        |          |                              |              | English alphabet, ASCII compatible               |
| 073 |  49 | 111 |         274,687 |     I     |        |          |                              |              | English alphabet, ASCII compatible               |
| 074 |  4a | 112 |          49,845 |     J     |        |          |                              |              | English alphabet, ASCII compatible               |
| 075 |  4b | 113 |          72,503 |     K     |        |          |                              |              | English alphabet, ASCII compatible               |
| 076 |  4c | 114 |         121,584 |     L     |        |          |                              |              | English alphabet, ASCII compatible               |
| 077 |  4d | 115 |         341,014 |     M     |        |          |                              |              | English alphabet, ASCII compatible               |
| 078 |  4e | 116 |         355,446 |     N     |        |          |                              |              | English alphabet, ASCII compatible               |
| 079 |  4f | 117 |         112,953 |     O     |        |          |                              |              | English alphabet, ASCII compatible               |
| 080 |  50 | 120 |          92,189 |     P     |        |          |                              |              | English alphabet, ASCII compatible               |
| 081 |  51 | 121 |          21,345 |     Q     |        |          |                              |              | English alphabet, ASCII compatible               |
| 082 |  52 | 122 |         286,448 |     R     |        |          |                              |              | English alphabet, ASCII compatible               |
| 083 |  53 | 123 |         429,591 |     S     |        |          |                              |              | English alphabet, ASCII compatible               |
| 084 |  54 | 124 |         157,599 |     T     |        |          |                              |              | English alphabet, ASCII compatible               |
| 085 |  55 | 125 |          72,011 |     U     |        |          |                              |              | English alphabet, ASCII compatible               |
| 086 |  56 | 126 |           6,729 |     V     |        |          |                              |              | English alphabet, ASCII compatible               |
| 087 |  57 | 127 |         348,404 |     W     |        |          |                              |              | English alphabet, ASCII compatible               |
| 088 |  58 | 130 |         160,216 |     X     |        |          |                              |              | English alphabet, ASCII compatible               |
| 089 |  59 | 131 |          83,431 |     Y     |        |          |                              |              | English alphabet, ASCII compatible               |
| 090 |  5a | 132 |          14,263 |     Z     |        |          |                              |              | English alphabet, ASCII compatible               |
| 091 |  5b | 133 |               0 |     [     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 092 |  5c | 134 |               0 |     \     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 093 |  5d | 135 |               0 |     ]     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 094 |  5e | 136 |               0 |     ^     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 095 |  5f | 137 |               0 |     _     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 096 |  60 | 140 |               0 |     `     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 097 |  61 | 141 |         436,387 |     a     |        |          |                              |              | English alphabet, ASCII compatible               |
| 098 |  62 | 142 |          89,308 |     b     |        |          |                              |              | English alphabet, ASCII compatible               |
| 099 |  63 | 143 |         102,788 |     c     |        |          |                              |              | English alphabet, ASCII compatible               |
| 100 |  64 | 144 |         241,431 |     d     |        |          |                              |              | English alphabet, ASCII compatible               |
| 101 |  65 | 145 |         602,135 |     e     |        |          |                              |              | English alphabet, ASCII compatible               |
| 102 |  66 | 146 |         104,573 |     f     |        |          |                              |              | English alphabet, ASCII compatible               |
| 103 |  67 | 147 |         138,304 |     g     |        |          |                              |              | English alphabet, ASCII compatible               |
| 104 |  68 | 150 |         340,458 |     h     |        |          |                              |              | English alphabet, ASCII compatible               |
| 105 |  69 | 151 |         380,724 |     i     |        |          |                              |              | English alphabet, ASCII compatible               |
| 106 |  6a | 152 |             137 |     j     |        |          |                              |              | English alphabet, ASCII compatible               |
| 107 |  6b | 153 |          96,302 |     k     |        |          |                              |              | English alphabet, ASCII compatible               |
| 108 |  6c | 154 |         361,293 |     l     |        |          |                              |              | English alphabet, ASCII compatible               |
| 109 |  6d | 155 |         154,664 |     m     |        |          |                              |              | English alphabet, ASCII compatible               |
| 110 |  6e | 156 |         418,551 |     n     |        |          |                              |              | English alphabet, ASCII compatible               |
| 111 |  6f | 157 |         452,656 |     o     |        |          |                              |              | English alphabet, ASCII compatible               |
| 112 |  70 | 160 |          33,636 |     p     |        |          |                              |              | English alphabet, ASCII compatible               |
| 113 |  71 | 161 |             565 |     q     |        |          |                              |              | English alphabet, ASCII compatible               |
| 114 |  72 | 162 |         581,728 |     r     |        |          |                              |              | English alphabet, ASCII compatible               |
| 115 |  73 | 163 |         379,348 |     s     |        |          |                              |              | English alphabet, ASCII compatible               |
| 116 |  74 | 164 |         298,914 |     t     |        |          |                              |              | English alphabet, ASCII compatible               |
| 117 |  75 | 165 |         151,608 |     u     |        |          |                              |              | English alphabet, ASCII compatible               |
| 118 |  76 | 166 |          32,055 |     v     |        |          |                              |              | English alphabet, ASCII compatible               |
| 119 |  77 | 167 |          85,049 |     w     |        |          |                              |              | English alphabet, ASCII compatible               |
| 120 |  78 | 170 |          26,774 |     x     |        |          |                              |              | English alphabet, ASCII compatible               |
| 121 |  79 | 171 |         124,604 |     y     |        |          |                              |              | English alphabet, ASCII compatible               |
| 122 |  7a | 172 |           1,276 |     z     |        |          |                              |              | English alphabet, ASCII compatible               |
| 123 |  7b | 173 |               0 |     {     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 124 |  7c | 174 |               0 |     |     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 125 |  7d | 175 |               0 |     }     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 126 |  7e | 176 |               0 |     ~     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 127 |  7f | 177 |               0 |           |        |    ^?    | C: delete                    |              | Control character, ASCII compatible              |
| 128 |  80 | 200 |               0 |           |        |          | W: Euro symbol               |     Yes      | Continuation byte                                |
| 129 |  81 | 201 |               0 |           |        |          |                              |              | Continuation byte                                |
| 130 |  82 | 202 |               0 |           |        |          | W: curved single open-quote  |     Yes      | Continuation byte                                |
| 131 |  83 | 203 |               0 |           |        |          | W: small f with hook         |     Yes      | Continuation byte                                |
| 132 |  84 | 204 |               0 |           |        |          | W: curved double open-quote  |     Yes      | Continuation byte                                |
| 133 |  85 | 205 |               0 |           |        |          | W: ellipsis                  |     Yes      | Continuation byte                                |
| 134 |  86 | 206 |               0 |           |        |          | W: dagger                    |     Yes      | Continuation byte                                |
| 135 |  87 | 207 |               0 |           |        |          | W: double dagger             |     Yes      | Continuation byte                                |
| 136 |  88 | 210 |               0 |           |        |          | W: circumflex                |     Yes      | Continuation byte                                |
| 137 |  89 | 211 |               0 |           |        |          | W: permille                  |     Yes      | Continuation byte                                |
| 138 |  8a | 212 |               0 |           |        |          | W: capital S caron           |     Yes      | Continuation byte                                |
| 139 |  8b | 213 |               0 |           |        |          | W: open single guillemet     |     Yes      | Continuation byte                                |
| 140 |  8c | 214 |               0 |           |        |          | W: capital O+E               |     Yes      | Continuation byte                                |
| 141 |  8d | 215 |               0 |           |        |          |                              |              | Continuation byte                                |
| 142 |  8e | 216 |               0 |           |        |          | W: capital Z caron           |     Yes      | Continuation byte                                |
| 143 |  8f | 217 |               0 |           |        |          |                              |              | Continuation byte                                |
| 144 |  90 | 220 |               0 |           |        |          |                              |              | Continuation byte                                |
| 145 |  91 | 221 |               0 |           |        |          | W: curved single open quote  |     Yes      | Continuation byte                                |
| 146 |  92 | 222 |               0 |           |        |          | W: curved single close quote |     Yes      | Continuation byte                                |
| 147 |  93 | 223 |               0 |           |        |          | W: curved double open quote  |     Yes      | Continuation byte                                |
| 148 |  94 | 224 |               0 |           |        |          | W: curved double close quote |     Yes      | Continuation byte                                |
| 149 |  95 | 225 |               0 |           |        |          | W: bullet                    |     Yes      | Continuation byte                                |
| 150 |  96 | 226 |               0 |           |        |          | W: en-dash                   |     Yes      | Continuation byte                                |
| 151 |  97 | 227 |               0 |           |        |          | W: em-dash                   |     Yes      | Continuation byte                                |
| 152 |  98 | 230 |               0 |           |        |          | W: tilde                     |     Yes      | Continuation byte                                |
| 153 |  99 | 231 |               0 |           |        |          | W: trade mark                |     Yes      | Continuation byte                                |
| 154 |  9a | 232 |               0 |           |        |          | W: small S caron             |     Yes      | Continuation byte                                |
| 155 |  9b | 233 |               0 |           |        |          | W: close single guillemet    |     Yes      | Continuation byte                                |
| 156 |  9c | 234 |               0 |           |        |          | W: small O+E                 |     Yes      | Continuation byte                                |
| 157 |  9d | 235 |               0 |           |        |          |                              |              | Continuation byte                                |
| 158 |  9e | 236 |               0 |           |        |          | W: small Z caron             |     Yes      | Continuation byte                                |
| 159 |  9f | 237 |               0 |           |        |          | W: capital Y umlaut          |     Yes      | Continuation byte                                |
| 160 |  a0 | 240 |               0 |           |        |          | W+I: non-breaking space      |              | Continuation byte                                |
| 161 |  a1 | 241 |               0 |           |        |          | W+I: inverted exclamation    |              | Continuation byte                                |
| 162 |  a2 | 242 |               0 |           |        |          | W+I: cent                    |              | Continuation byte                                |
| 163 |  a3 | 243 |               0 |           |        |          | W+I: UK pound sign           |              | Continuation byte                                |
| 164 |  a4 | 244 |               0 |           |        |          | W+I: currency                |              | Continuation byte                                |
| 165 |  a5 | 245 |               0 |           |        |          | W+I: Yen sign                |              | Continuation byte                                |
| 166 |  a6 | 246 |               0 |           |        |          | W+I: broken bar              |              | Continuation byte                                |
| 167 |  a7 | 247 |               0 |           |        |          | W+I: section                 |              | Continuation byte                                |
| 168 |  a8 | 250 |               0 |           |        |          | W+I: trema                   |              | Continuation byte                                |
| 169 |  a9 | 251 |               0 |           |        |          | W+I: copyright sign          |              | Continuation byte                                |
| 170 |  aa | 252 |               0 |           |        |          | W+I: feminine ordinal        |              | Continuation byte                                |
| 171 |  ab | 253 |               0 |           |        |          | W+I: open double guillemet   |              | Continuation byte                                |
| 172 |  ac | 254 |               0 |           |        |          | W+I: logical complement      |              | Continuation byte                                |
| 173 |  ad | 255 |               0 |           |        |          | W+I: soft hyphen             |              | Continuation byte                                |
| 174 |  ae | 256 |               0 |           |        |          | W+I: registered sign         |              | Continuation byte                                |
| 175 |  af | 257 |               0 |           |        |          | W+I: macron                  |              | Continuation byte                                |
| 176 |  b0 | 260 |               0 |           |        |          | W+I: degree                  |              | Continuation byte                                |
| 177 |  b1 | 261 |               0 |           |        |          | W+I: plus-or-minus sign      |              | Continuation byte                                |
| 178 |  b2 | 262 |               0 |           |        |          | W+I: raise to power of 2     |              | Continuation byte                                |
| 179 |  b3 | 263 |               0 |           |        |          | W+I: raise to power of 3     |              | Continuation byte                                |
| 180 |  b4 | 264 |               0 |           |        |          | W+I: acute accent            |              | Continuation byte                                |
| 181 |  b5 | 265 |               0 |           |        |          | W+I: Greek letter mu         |              | Continuation byte                                |
| 182 |  b6 | 266 |               0 |           |        |          | W+I: pilcrow (paragraph)     |              | Continuation byte                                |
| 183 |  b7 | 267 |               0 |           |        |          | W+I: middle dot              |              | Continuation byte                                |
| 184 |  b8 | 270 |               0 |           |        |          | W+I: cedilla                 |              | Continuation byte                                |
| 185 |  b9 | 271 |               0 |           |        |          | W+I: superscript 1           |              | Continuation byte                                |
| 186 |  ba | 272 |               0 |           |        |          | W+I: masculine ordinal       |              | Continuation byte                                |
| 187 |  bb | 273 |               0 |           |        |          | W+I: close double guillemet  |              | Continuation byte                                |
| 188 |  bc | 274 |               0 |           |        |          | W+I: one quarter             |              | Continuation byte                                |
| 189 |  bd | 275 |               0 |           |        |          | W+I: one half                |              | Continuation byte                                |
| 190 |  be | 276 |               0 |           |        |          | W+I: three quarters          |              | Continuation byte                                |
| 191 |  bf | 277 |               0 |           |        |          | W+I: inverted question mark  |              | Continuation byte                                |
| 192 |  c0 | 300 |             487 |           |        |          | W+I: capital A grave         |              | Invalid in UTF-8                                 |
| 193 |  c1 | 301 |               0 |           |        |          | W+I: capital A acute         |              | Invalid in UTF-8                                 |
| 194 |  c2 | 302 |               0 |           |        |          | W+I: capital A circumflex    |              | Leading Byte(2), Latin                           |
| 195 |  c3 | 303 |               0 |           |        |          | W+I: capital A tilde         |              | Leading Byte(2), Latin                           |
| 196 |  c4 | 304 |               0 |           |        |          | W+I: capital A umlaut        |              | Leading Byte(2), Latin                           |
| 197 |  c5 | 305 |               0 |           |        |          | W+I: capital A ring          |              | Leading Byte(2), Latin                           |
| 198 |  c6 | 306 |               0 |           |        |          | W+I: capital ash A+E         |              | Leading Byte(2), Latin                           |
| 199 |  c7 | 307 |               0 |           |        |          | W+I: capital C cedilla       |              | Leading Byte(2), Latin                           |
| 200 |  c8 | 310 |              12 |           |        |          | W+I: capital E grave         |              | Leading Byte(2), Latin                           |
| 201 |  c9 | 311 |               0 |           |        |          | W+I: capital E acute         |              | Leading Byte(2), International Phonetic Alphabet |
| 202 |  ca | 312 |               0 |           |        |          | W+I: capital E circumflex    |              | Leading Byte(2), International Phonetic Alphabet |
| 203 |  cb | 313 |               0 |           |        |          | W+I: capital E umlaut        |              | Leading Byte(2), International Phonetic Alphabet |
| 204 |  cc | 314 |               5 |           |        |          | W+I: capital I grave         |              | Leading Byte(2), accents                         |
| 205 |  cd | 315 |               0 |           |        |          | W+I: capital I acute         |              | Leading Byte(2), accents                         |
| 206 |  ce | 316 |               0 |           |        |          | W+I: capital I circumflex    |              | Leading Byte(2), Greek                           |
| 207 |  cf | 317 |               0 |           |        |          | W+I: capital I umlaut        |              | Leading Byte(2), Greek                           |
| 208 |  d0 | 320 |               0 |           |        |          | W+I: capital D with bar      |              | Leading Byte(2), Cyrillic                        |
| 209 |  d1 | 321 |               0 |           |        |          | W+I: capital N tilde         |              | Leading Byte(2), Cyrillic                        |
| 210 |  d2 | 322 |              31 |           |        |          | W+I: capital O grave         |              | Leading Byte(2), Cyrillic                        |
| 211 |  d3 | 323 |               0 |           |        |          | W+I: capital O acute         |              | Leading Byte(2), Cyrillic                        |
| 212 |  d4 | 324 |               0 |           |        |          | W+I: capital O circumflex    |              | Leading Byte(2), Cyrillic                        |
| 213 |  d5 | 325 |               0 |           |        |          | W+I: capital O tilde         |              | Leading Byte(2), Armenian                        |
| 214 |  d6 | 326 |               0 |           |        |          | W+I: capital O umlaut        |              | Leading Byte(2), Hebrew                          |
| 215 |  d7 | 327 |               0 |           |        |          | W+I: multiplication sign     |              | Leading Byte(2), Hebrew                          |
| 216 |  d8 | 330 |               0 |           |        |          | W+I: capital O with slash    |              | Leading Byte(2), Arabic                          |
| 217 |  d9 | 331 |              11 |           |        |          | W+I: capital U grave         |              | Leading Byte(2), Arabic                          |
| 218 |  da | 332 |               0 |           |        |          | W+I: capital U acute         |              | Leading Byte(2), Arabic                          |
| 219 |  db | 333 |               0 |           |        |          | W+I: capital U circumflex    |              | Leading Byte(2), Arabic                          |
| 220 |  dc | 334 |               0 |           |        |          | W+I: capital U umlaut        |              | Leading Byte(2), Syriac                          |
| 221 |  dd | 335 |               0 |           |        |          | W+I: capital Y acute         |              | Leading Byte(2), Arabic                          |
| 222 |  de | 336 |               0 |           |        |          | W+I: capital thorn           |              | Leading Byte(2), Thaana                          |
| 223 |  df | 337 |               0 |           |        |          | W+I: small eszett            |              | Leading Byte(2), N'Ko                            |
| 224 |  e0 | 340 |           2,169 |           |        |          | W+I: small a grave           |              | Leading Byte(3), Indic                           |
| 225 |  e1 | 341 |               1 |           |        |          | W+I: small a acute           |              | Leading Byte(3), Miscellaneous                   |
| 226 |  e2 | 342 |             219 |           |        |          | W+I: small a circumflex      |              | Leading Byte(3), Symbol                          |
| 227 |  e3 | 343 |               0 |           |        |          | W+I: small a tilde           |              | Leading Byte(3), Kana & Chinese/Japanese/Korean  |
| 228 |  e4 | 344 |               0 |           |        |          | W+I: small a umlaut          |              | Leading Byte(3), Chinese/Japanese/Korean unified |
| 229 |  e5 | 345 |               0 |           |        |          | W+I: small a ring            |              | Leading Byte(3), Chinese/Japanese/Korean unified |
| 230 |  e6 | 346 |               0 |           |        |          | W+I: small ash a+e           |              | Leading Byte(3), Chinese/Japanese/Korean unified |
| 231 |  e7 | 347 |               0 |           |        |          | W+I: small c cedilla         |              | Leading Byte(3), Chinese/Japanese/Korean unified |
| 232 |  e8 | 350 |             486 |           |        |          | W+I: small e grave           |              | Leading Byte(3), Chinese/Japanese/Korean unified |
| 233 |  e9 | 351 |               4 |           |        |          | W+I: small e acute           |              | Leading Byte(3), Chinese/Japanese/Korean unified |
| 234 |  ea | 352 |              50 |           |        |          | W+I: small e circumflex      |              | Leading Byte(3), Asian                           |
| 235 |  eb | 353 |               1 |           |        |          | W+I: small e umlaut          |              | Leading Byte(3), Hangul                          |
| 236 |  ec | 354 |             422 |           |        |          | W+I: small i grave           |              | Leading Byte(3), Hangul                          |
| 237 |  ed | 355 |               0 |           |        |          | W+I: small i acute           |              | Leading Byte(3), Hangul                          |
| 238 |  ee | 356 |              51 |           |        |          | W+I: small i circumflex      |              | Leading Byte(3), Private Use Areas               |
| 239 |  ef | 357 |               0 |           |        |          | W+I: small i umlaut          |              | Leading Byte(3), Forms                           |
| 240 |  f0 | 360 |               0 |           |        |          | W+I: small eth               |              | Leading Byte(4), Supplementary Planes            |
| 241 |  f1 | 361 |               0 |           |        |          | W+I: small n tilde           |              | Leading Byte(4)                                  |
| 242 |  f2 | 362 |           3,060 |           |        |          | W+I: small o grave           |              | Leading Byte(4)                                  |
| 243 |  f3 | 363 |               0 |           |        |          | W+I: small o acute           |              | Leading Byte(4), Supplementary Planes            |
| 244 |  f4 | 364 |             206 |           |        |          | W+I: small o circumflex      |              | Leading Byte(4), Supplementary Planes            |
| 245 |  f5 | 365 |               0 |           |        |          | W+I: small o tilde           |              | Invalid in UTF-8                                 |
| 246 |  f6 | 366 |               0 |           |        |          | W+I: small o umlaut          |              | Invalid in UTF-8                                 |
| 247 |  f7 | 367 |               0 |           |        |          | W+I: division sign           |              | Invalid in UTF-8                                 |
| 248 |  f8 | 370 |               0 |           |        |          | W+I: small o with slash      |              | Invalid in UTF-8                                 |
| 249 |  f9 | 371 |           1,074 |           |        |          | W+I: small u grave           |              | Invalid in UTF-8                                 |
| 250 |  fa | 372 |               0 |           |        |          | W+I: small u acute           |              | Invalid in UTF-8                                 |
| 251 |  fb | 373 |              25 |           |        |          | W+I: small u circumflex      |              | Invalid in UTF-8                                 |
| 252 |  fc | 374 |               0 |           |        |          | W+I: small u umlaut          |              | Invalid in UTF-8                                 |
| 253 |  fd | 375 |               0 |           |        |          | W+I: small y acute           |              | Invalid in UTF-8                                 |
| 254 |  fe | 376 |               0 |           |        |          | W+I: small thorn             |              | Invalid in UTF-8                                 |
| 255 |  ff | 377 |               0 |           |        |          | W+I: small y umlaut          |              | Invalid in UTF-8                                 |
+-----+-----+-----+-----------------+-----------+--------+----------+------------------------------+--------------+--------------------------------------------------+
    

This shows us nearly 5 million colons, and that is consistent with the field delimiter being a colon.

The most interesting thing here is to see which characters are used from the ISO 8859 and Windows 12xx families, with byte values between 128 and 255. None of the characters used are both:

Assuming that we have either ISO 8859-1 or Windows 1252, the characters that are used are: À, È, Ì, Ò, Ù, à, á, â, è, é, ê, ë, ì, î, ò, ô, ù and û. The gazetteer covers England, Scotland and Wales. Place names in Scotland and Wales are often shown in the Gaelic and Welsh languages and these languages do use the accents we have found. Everything so far suggests that this file is valid. We can look in more detail by running the verbose analysis. Here are the first few lines:

  Scanning file: /Users/ronballard/Downloads/gaz50k_gb/Data/50kgaz2016.txt
  Line: 5, column: 15  Found UTF-8 leading byte 224, Hex: e0, Octal: 340, UTF-8 Leading Byte, Indic, Windows-1252 Lowercase A grave
  Line: 5, column: 16  Not enough UTF-8 Continuation bytes for preceding UTF-8 Leading byte 224, Hex: e0, Octal: 340
  Line: 11, column: 19  Found invalid UTF-8 byte, 192, Hex: c0, Octal: 300, Invalid in UTF-8, Windows-1252 Uppercase A grave
  Line: 12, column: 17  Found UTF-8 leading byte 236, Hex: ec, Octal: 354, UTF-8 Leading Byte, Hangul, Windows-1252 Lowercase I grave
  Line: 12, column: 18  Not enough UTF-8 Continuation bytes for preceding UTF-8 Leading byte 236, Hex: ec, Octal: 354
  Line: 16, column: 29  Found UTF-8 leading byte 242, Hex: f2, Octal: 362, UTF-8 Leading Byte, Windows-1252 Lowercase O grave
  Line: 16, column: 30  Not enough UTF-8 Continuation bytes for preceding UTF-8 Leading byte 242, Hex: f2, Octal: 362
    

Here is the data relating to these messages:

  1:TQ6004:1066 Country Walk:TQ60:50:49:0:16.7:104500:560500:E:ES:E Susx:East Sussex:X:20-SEP-2011:I:199:0:0
  2:TQ7715:1066 Country Walk:TQ60:50:54.7:0:31.5:115500:577500:E:ES:E Susx:East Sussex:X:20-SEP-2011:I:199:0:0
  3:TQ7610:1066 Country Walk Bexhill Link:TQ60:50:52:0:30.5:110500:576500:E:ES:E Susx:East Sussex:X:20-SEP-2011:I:199:0:0
  4:TQ8315:1066 Country Walk Hastings Link:TQ80:50:54.5:0:36.6:115500:583500:E:ES:E Susx:East Sussex:X:20-SEP-2011:I:199:0:0
  5:NF9073:A' Bhàd:NF86:57:38.7:7:11.3:873500:90500:W:WI:N Eil:Na h-Eileanan an Iar:X:19-SEP-2014:U:18:0:0
  6:NG7656:A' Bhainlir:NG64:57:32.6:5:44.1:856500:176500:W:HL:Highld:Highland:H:05-JAN-2012:U:24:0:0
  7:NM8403:A' Bheinn:NM80:56:10.5:5:28.3:703500:184500:W:AR:Arg & Bt:Argyll and Bute:X:13-NOV-2014:U:55:0:0
  8:NM9466:A' Bheinn Bhan:NM86:56:44.7:5:21.6:766500:194500:W:HL:Highld:Highland:X:01-MAR-1993:I:40:0:0
  9:NB5464:A' Bheirghe:NB46:58:30:6:12.9:964500:154500:W:WI:N Eil:Na h-Eileanan an Iar:X:15-SEP-2014:U:8:0:0
  10:NB1743:A' Bheirigh:NB04:58:17.4:6:49.3:943500:117500:W:WI:N Eil:Na h-Eileanan an Iar:X:05-JAN-2012:U:8:13:0
  11:NM5838:A' Bhog-Àirigh:NM42:56:28.6:5:55.3:738500:158500:W:AR:Arg & Bt:Argyll and Bute:X:10-NOV-2014:U:47:48:0
  12:NM2999:A' Bhrìdeanach:NM28:57:.4:6:27.4:799500:129500:W:HL:Highld:Highland:X:01-MAR-1993:I:39:0:0
  13:NN4790:A' Bhuidheanach:NN48:56:58.8:4:30.6:790500:247500:W:HL:Highld:Highland:H:01-MAR-1993:I:34:0:0
  14:NN6579:A' Bhuidheanach:NN66:56:53.2:4:12.5:779500:265500:W:HL:Highld:Highland:H:01-MAR-1993:I:42:0:0
  15:NN6677:A' Bhuidheanach Bheag:NN66:56:52.2:4:11.4:777500:266500:W:PK:Pth & Kin:Perth and Kinross:H:01-MAR-1993:I:42:0:0
  16:NN6678:A' Bhuidheanach Mhòr:NN66:56:52.7:4:11.4:778500:266500:W:PK:Pth & Kin:Perth and Kinross:X:18-JUL-2000:I:42:0:0
    

Line 5, column 15 is "à"
Line 11, column 19 is "À"
Line 12, column 17 is "ì"
Line 16, column 29 is "ò"
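
These four bytes can also be checked by hand. The following is a minimal sketch, not part of EncodingProfile: the class name `SingleByteCheck` and its two helper methods are made up for illustration. It decodes each flagged byte value as ISO 8859-1 (which agrees with Windows 1252 for byte values 160 to 255) and then asks a strict UTF-8 decoder whether the byte is valid on its own.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of EncodingProfile.
class SingleByteCheck {

    // Decode one byte as ISO 8859-1 (Windows 1252 agrees with it for 0xA0-0xFF).
    static String asLatin1(int byteValue) {
        return new String(new byte[] {(byte) byteValue}, StandardCharsets.ISO_8859_1);
    }

    // True only if the bytes form complete, well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        int[] flagged = {0xe0, 0xc0, 0xec, 0xf2}; // byte values from the verbose output
        for (int b : flagged) {
            System.out.printf("%02x  ISO 8859-1: %s  valid UTF-8 alone: %b%n",
                    b, asLatin1(b), isValidUtf8(new byte[] {(byte) b}));
        }
    }
}
```

Each of the four bytes decodes to the accented letter seen in the data when read as ISO 8859-1, and each is rejected by the strict UTF-8 decoder when it stands alone, which matches what the verbose analysis reports.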

So, at this level, this is another good quality file. The encoding is compatible with Windows 1252 and ISO/IEC 8859-1, as well as some of the less common variants in the Windows 12xx family and in the ISO/IEC 8859 family. I would be happy to load this file into a database as a one-off exercise, but if I were planning to load a new version of it every month, say, then I would want to know exactly what the encoding is, because next month's data might have something that works properly only in one of the various possible encodings that we have found.
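
For a recurring load, one cheap guard is to watch the byte range 128 to 159 (hex 80 to 9f). That is where ISO 8859-1 and Windows 1252 disagree: ISO 8859-1 has control codes there, while Windows 1252 has printable characters such as the Euro symbol. The sketch below is an assumption about how such a check might look, not something EncodingProfile provides; the class name `C1RangeCheck` is hypothetical. It only makes sense for a file already believed to be in a single-byte encoding, because in UTF-8 these values are ordinary continuation bytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical monthly guard for illustration; not part of EncodingProfile.
class C1RangeCheck {

    // Count the bytes in buf[0..len) that fall in 0x80-0x9F.
    static long countC1Range(byte[] buf, int len) {
        long count = 0;
        for (int i = 0; i < len; i++) {
            int b = buf[i] & 0xff; // treat the byte as unsigned
            if (b >= 0x80 && b <= 0x9f) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        long total = 0;
        byte[] buf = new byte[1 << 16];
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += countC1Range(buf, n);
            }
        }
        System.out.println("Bytes in 0x80-0x9f: " + total);
    }
}
```

A count of zero means the file still reads the same under either encoding. A non-zero count means the choice between the two now matters and is worth settling with the supplier.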

Databases should usually be in UTF-8 now, because more and more of the data that we receive in the future will arrive in UTF-8. For a database in an established organisation, with a set of applications consistently using one of the older encodings, such as ISO 8859-1, it would be easier now to create the database to use that encoding. But I would prefer to be ahead of the game, so I would always favour creating databases in UTF-8 now and converting all data to UTF-8 on loading. This isn't difficult (see Load a File to a Database and Convert the Character Encoding). What will make life difficult is dirty data, but that is always true, whatever encoding we use.
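
To give a feel for how little is involved in the conversion itself, here is a minimal sketch in Java. It is not necessarily the method described on the page referred to above, and the class name `ToUtf8` is hypothetical. It assumes the source really is ISO 8859-1; pass a different `Charset` if the file turns out to be something else.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical converter for illustration; not part of EncodingProfile.
class ToUtf8 {

    // Read bytes in the given source encoding and write them back out as UTF-8.
    static void transcode(InputStream in, Charset source, OutputStream out) throws IOException {
        Reader reader = new InputStreamReader(in, source);
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        char[] buffer = new char[8192];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, n);
        }
        writer.flush(); // push any buffered characters through to the output stream
    }

    public static void main(String[] args) throws IOException {
        // args[0] = input file (assumed ISO 8859-1), args[1] = output file (UTF-8)
        try (InputStream in = Files.newInputStream(Paths.get(args[0]));
             OutputStream out = Files.newOutputStream(Paths.get(args[1]))) {
            transcode(in, StandardCharsets.ISO_8859_1, out);
        }
    }
}
```

Every byte value is valid in ISO 8859-1, so this conversion never fails. That is exactly why it is worth profiling the file first: a wrong guess about the source encoding produces wrong characters silently rather than an error.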

Example 3: An XML file encoded in UTF-8

This is also a file from the UK Government's Open Data initiative. It is a list of roads, giving the road name, classification, coordinates (as points and series of connected points) and some other data. You can download the file from Open Roads. The format is described as "GML 3"; this is XML with a particular XML schema that is designed to support geographical information. The XML header says that the file is encoded in UTF-8.

  +------------------------------------------------------------|-----------------+
  | Lines:                                                     |      13,740,370 |
  | Bytes:                                                     |     786,690,530 |
  | Windows line ends:                                         |      13,740,369 |
  | UNIX line ends:                                            |               0 |
  | Control Characters (excluding newline & carriage return):  |      39,732,865 |
  | UTF-8 Leading bytes:                                       |               0 |
  | UTF-8 Continuation bytes:                                  |               0 |
  | UTF-8 Leading bytes without enough Continuation bytes:     |               0 |
  | UTF-8 Continuation bytes without a preceding Leading byte: |               0 |
  | UTF-8 Invalid bytes:                                       |               0 |
  | Byte values specific to Windows 1252:                      |               0 |
  | Byte values valid in ISO 8859 and Windows 12xx families:   |               0 |
  | Carriage returns without newline:                          |               0 |
  +------------------------------------------------------------|-----------------+
    

This shows us that we have a Windows file. It appears to use only ASCII characters. Since ASCII is a valid subset of UTF-8, this is consistent with its description in the XML header. There is an average of nearly three control characters per line (39,732,865 across 13,740,370 lines). We will need to understand what these control characters are and decide what to do about them.
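
Finding out which control characters are present comes down to a count per byte value. The sketch below shows the idea under stated assumptions: it is not EncodingProfile's source, the class name `ControlCharCount` is hypothetical, and it treats byte values 0 to 31 and 127 as control characters, leaving out newline and carriage return as the summary does.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical counter for illustration; not part of EncodingProfile.
class ControlCharCount {

    // Add the bytes in buf[0..len) to a 256-entry frequency table.
    static void accumulate(long[] counts, byte[] buf, int len) {
        for (int i = 0; i < len; i++) {
            counts[buf[i] & 0xff]++; // mask so the byte is used as an unsigned index
        }
    }

    // Assumed definition: control characters other than newline (10) and carriage return (13).
    static boolean isOtherControl(int b) {
        return (b < 32 || b == 127) && b != '\n' && b != '\r';
    }

    public static void main(String[] args) throws IOException {
        long[] counts = new long[256];
        byte[] buf = new byte[1 << 16];
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            int n;
            while ((n = in.read(buf)) != -1) {
                accumulate(counts, buf, n);
            }
        }
        // Print decimal value, hex value and count for each control character found.
        for (int b = 0; b < 256; b++) {
            if (isOtherControl(b) && counts[b] > 0) {
                System.out.printf("%3d  %02x  %,d%n", b, b, counts[b]);
            }
        }
    }
}
```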

+-----+-----+-----+-----------------+-----------+--------+----------+------------------------------+--------------+--------------------------------------------------+
|     |     |     |                 |   ASCII   |   C    | teletype | Name (W = Windows-1252,      | Specific to  |                   UTF-8 Group                    |
| Dec | Hex | Oct | Number of Bytes | Printable | Escape | notation | I = ISO 8859-1, C = Control  | Windows-1252 |Leading Byte(n) = first byte of an n-byte sequence|
+-----+-----+-----+-----------------+-----------+--------+----------+------------------------------+--------------+--------------------------------------------------+
| 000 |  00 | 000 |               0 |           |   \0   |    ^@    | C:null                       |              | Control character, ASCII compatible              |
| 001 |  01 | 001 |               0 |           |        |    ^A    | C:start of heading           |              | Control character, ASCII compatible              |
| 002 |  02 | 002 |               0 |           |        |    ^B    | C:start of text              |              | Control character, ASCII compatible              |
| 003 |  03 | 003 |               0 |           |        |    ^C    | C:end of text                |              | Control character, ASCII compatible              |
| 004 |  04 | 004 |               0 |           |        |    ^D    | C:end of transmission        |              | Control character, ASCII compatible              |
| 005 |  05 | 005 |               0 |           |        |    ^E    | C:enquiry                    |              | Control character, ASCII compatible              |
| 006 |  06 | 006 |               0 |           |        |    ^F    | C:acknowledgement            |              | Control character, ASCII compatible              |
| 007 |  07 | 007 |               0 |           |   \a   |    ^G    | C:bell                       |              | Control character, ASCII compatible              |
| 008 |  08 | 010 |               0 |           |   \b   |    ^H    | C:backspace                  |              | Control character, ASCII compatible              |
| 009 |  09 | 011 |      39,732,865 |           |   \t   |    ^I    | C:horizontal tab             |              | Control character, ASCII compatible              |
| 010 |  0a | 012 |      13,740,369 |           |   \n   |    ^J    | C:newline                    |              | Control character, ASCII compatible              |
| 011 |  0b | 013 |               0 |           |   \v   |    ^K    | C:vertical tab               |              | Control character, ASCII compatible              |
| 012 |  0c | 014 |               0 |           |   \f   |    ^L    | C:form feed                  |              | Control character, ASCII compatible              |
| 013 |  0d | 015 |      13,740,369 |           |   \r   |    ^M    | C:carriage return            |              | Control character, ASCII compatible              |
| 014 |  0e | 016 |               0 |           |        |    ^N    | C:shift out                  |              | Control character, ASCII compatible              |
| 015 |  0f | 017 |               0 |           |        |    ^O    | C:shift in                   |              | Control character, ASCII compatible              |
| 016 |  10 | 020 |               0 |           |        |    ^P    | C:data link escape           |              | Control character, ASCII compatible              |
| 017 |  11 | 021 |               0 |           |        |    ^Q    | C:device control 1           |              | Control character, ASCII compatible              |
| 018 |  12 | 022 |               0 |           |        |    ^R    | C:device control 2           |              | Control character, ASCII compatible              |
| 019 |  13 | 023 |               0 |           |        |    ^S    | C:device control 3           |              | Control character, ASCII compatible              |
| 020 |  14 | 024 |               0 |           |        |    ^T    | C:device control 4           |              | Control character, ASCII compatible              |
| 021 |  15 | 025 |               0 |           |        |    ^U    | C:negative acknowledgement   |              | Control character, ASCII compatible              |
| 022 |  16 | 026 |               0 |           |        |    ^V    | C:synchronous idle           |              | Control character, ASCII compatible              |
| 023 |  17 | 027 |               0 |           |        |    ^W    | C:end of transmission block  |              | Control character, ASCII compatible              |
| 024 |  18 | 030 |               0 |           |        |    ^X    | C:cancel                     |              | Control character, ASCII compatible              |
| 025 |  19 | 031 |               0 |           |        |    ^Y    | C:end of medium              |              | Control character, ASCII compatible              |
| 026 |  1a | 032 |               0 |           |        |    ^Z    | C:substitute                 |              | Control character, ASCII compatible              |
| 027 |  1b | 033 |               0 |           |   \e   |    ^[    | C:escape                     |              | Control character, ASCII compatible              |
| 028 |  1c | 034 |               0 |           |        |    ^\    | C:file separator             |              | Control character, ASCII compatible              |
| 029 |  1d | 035 |               0 |           |        |    ^]    | C:group separator            |              | Control character, ASCII compatible              |
| 030 |  1e | 036 |               0 |           |        |    ^^    | C:record separator           |              | Control character, ASCII compatible              |
| 031 |  1f | 037 |               0 |           |        |    ^_    | C:unit separator             |              | Control character, ASCII compatible              |
| 032 |  20 | 040 |      14,472,393 |           |        |          | space                        |              | Punctuation and symbols, ASCII compatible        |
| 033 |  21 | 041 |               0 |     !     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 034 |  22 | 042 |      18,971,808 |     "     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 035 |  23 | 043 |         818,026 |     #     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 036 |  24 | 044 |               0 |     $     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 037 |  25 | 045 |               0 |     %     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 038 |  26 | 046 |               0 |     &     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 039 |  27 | 047 |           7,852 |     '     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 040 |  28 | 050 |             580 |     (     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 041 |  29 | 051 |             580 |     )     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 042 |  2a | 052 |               0 |     *     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 043 |  2b | 053 |               0 |     +     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 044 |  2c | 054 |               0 |     ,     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 045 |  2d | 055 |      10,639,979 |     -     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 046 |  2e | 056 |       7,151,553 |     .     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 047 |  2f | 057 |      18,239,744 |     /     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 048 |  30 | 060 |       8,871,710 |     0     |        |          |                              |              | Numeric digits, ASCII compatible                 |
| 049 |  31 | 061 |       9,775,357 |     1     |        |          |                              |              | Numeric digits, ASCII compatible                 |
| 050 |  32 | 062 |       8,544,777 |     2     |        |          |                              |              | Numeric digits, ASCII compatible                 |
| 051 |  33 | 063 |       7,047,548 |     3     |        |          |                              |              | Numeric digits, ASCII compatible                 |
| 052 |  34 | 064 |       9,201,085 |     4     |        |          |                              |              | Numeric digits, ASCII compatible                 |
| 053 |  35 | 065 |       9,102,900 |     5     |        |          |                              |              | Numeric digits, ASCII compatible                 |
| 054 |  36 | 066 |       6,960,439 |     6     |        |          |                              |              | Numeric digits, ASCII compatible                 |
| 055 |  37 | 067 |       8,556,155 |     7     |        |          |                              |              | Numeric digits, ASCII compatible                 |
| 056 |  38 | 070 |       7,657,738 |     8     |        |          |                              |              | Numeric digits, ASCII compatible                 |
| 057 |  39 | 071 |       7,395,388 |     9     |        |          |                              |              | Numeric digits, ASCII compatible                 |
| 058 |  3a | 072 |      29,042,956 |     :     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 059 |  3b | 073 |               0 |     ;     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 060 |  3c | 074 |      18,477,664 |     <     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 061 |  3d | 075 |       9,485,904 |     =     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 062 |  3e | 076 |      18,477,664 |     >     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 063 |  3f | 077 |               2 |     ?     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 064 |  40 | 100 |               0 |     @     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 065 |  41 | 101 |       5,152,142 |     A     |        |          |                              |              | English alphabet, ASCII compatible               |
| 066 |  42 | 102 |       4,979,436 |     B     |        |          |                              |              | English alphabet, ASCII compatible               |
| 067 |  43 | 103 |       6,242,345 |     C     |        |          |                              |              | English alphabet, ASCII compatible               |
| 068 |  44 | 104 |       5,108,060 |     D     |        |          |                              |              | English alphabet, ASCII compatible               |
| 069 |  45 | 105 |       5,079,116 |     E     |        |          |                              |              | English alphabet, ASCII compatible               |
| 070 |  46 | 106 |       7,053,084 |     F     |        |          |                              |              | English alphabet, ASCII compatible               |
| 071 |  47 | 107 |       1,594,446 |     G     |        |          |                              |              | English alphabet, ASCII compatible               |
| 072 |  48 | 110 |          39,137 |     H     |        |          |                              |              | English alphabet, ASCII compatible               |
| 073 |  49 | 111 |           2,088 |     I     |        |          |                              |              | English alphabet, ASCII compatible               |
| 074 |  4a | 112 |           3,045 |     J     |        |          |                              |              | English alphabet, ASCII compatible               |
| 075 |  4b | 113 |           7,686 |     K     |        |          |                              |              | English alphabet, ASCII compatible               |
| 076 |  4c | 114 |       3,523,003 |     L     |        |          |                              |              | English alphabet, ASCII compatible               |
| 077 |  4d | 115 |       1,573,303 |     M     |        |          |                              |              | English alphabet, ASCII compatible               |
| 078 |  4e | 116 |       4,181,157 |     N     |        |          |                              |              | English alphabet, ASCII compatible               |
| 079 |  4f | 117 |       2,240,726 |     O     |        |          |                              |              | English alphabet, ASCII compatible               |
| 080 |  50 | 120 |       1,443,338 |     P     |        |          |                              |              | English alphabet, ASCII compatible               |
| 081 |  51 | 121 |           2,067 |     Q     |        |          |                              |              | English alphabet, ASCII compatible               |
| 082 |  52 | 122 |       7,119,818 |     R     |        |          |                              |              | English alphabet, ASCII compatible               |
| 083 |  53 | 123 |       3,592,027 |     S     |        |          |                              |              | English alphabet, ASCII compatible               |
| 084 |  54 | 124 |          20,649 |     T     |        |          |                              |              | English alphabet, ASCII compatible               |
| 085 |  55 | 125 |         281,792 |     U     |        |          |                              |              | English alphabet, ASCII compatible               |
| 086 |  56 | 126 |       2,312,353 |     V     |        |          |                              |              | English alphabet, ASCII compatible               |
| 087 |  57 | 127 |       1,272,947 |     W     |        |          |                              |              | English alphabet, ASCII compatible               |
| 088 |  58 | 130 |               1 |     X     |        |          |                              |              | English alphabet, ASCII compatible               |
| 089 |  59 | 131 |           1,042 |     Y     |        |          |                              |              | English alphabet, ASCII compatible               |
| 090 |  5a | 132 |              80 |     Z     |        |          |                              |              | English alphabet, ASCII compatible               |
| 091 |  5b | 133 |               0 |     [     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 092 |  5c | 134 |               0 |     \     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 093 |  5d | 135 |               0 |     ]     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 094 |  5e | 136 |               0 |     ^     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 095 |  5f | 137 |               0 |     _     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 096 |  60 | 140 |               0 |     `     |        |          |                              |              | Punctuation and symbols, ASCII compatible        |
| 097 |  61 | 141 |      35,755,077 |     a     |        |          |                              |              | English alphabet, ASCII compatible               |
| 098 |  62 | 142 |       3,897,775 |     b     |        |          |                              |              | English alphabet, ASCII compatible               |
| 099 |  63 | 143 |      13,262,679 |     c     |        |          |                              |              | English alphabet, ASCII compatible               |
| 100 |  64 | 144 |      26,580,440 |     d     |        |          |                              |              | English alphabet, ASCII compatible               |
| 101 |  65 | 145 |      44,588,944 |     e     |        |          |                              |              | English alphabet, ASCII compatible               |
| 102 |  66 | 146 |      11,671,023 |     f     |        |          |                              |              | English alphabet, ASCII compatible               |
| 103 |  67 | 147 |       9,099,361 |     g     |        |          |                              |              | English alphabet, ASCII compatible               |
| 104 |  68 | 150 |       3,297,806 |     h     |        |          |                              |              | English alphabet, ASCII compatible               |
| 105 |  69 | 151 |      36,252,810 |     i     |        |          |                              |              | English alphabet, ASCII compatible               |
| 106 |  6a | 152 |         219,286 |     j     |        |          |                              |              | English alphabet, ASCII compatible               |
| 107 |  6b | 153 |       4,462,008 |     k     |        |          |                              |              | English alphabet, ASCII compatible               |
| 108 |  6c | 154 |      24,962,776 |     l     |        |          |                              |              | English alphabet, ASCII compatible               |
| 109 |  6d | 155 |      16,836,685 |     m     |        |          |                              |              | English alphabet, ASCII compatible               |
| 110 |  6e | 156 |      31,912,294 |     n     |        |          |                              |              | English alphabet, ASCII compatible               |
| 111 |  6f | 157 |      42,172,928 |     o     |        |          |                              |              | English alphabet, ASCII compatible               |
| 112 |  70 | 160 |      10,690,187 |     p     |        |          |                              |              | English alphabet, ASCII compatible               |
| 113 |  71 | 161 |           1,911 |     q     |        |          |                              |              | English alphabet, ASCII compatible               |
| 114 |  72 | 162 |      31,991,890 |     r     |        |          |                              |              | English alphabet, ASCII compatible               |
| 115 |  73 | 163 |      24,470,636 |     s     |        |          |                              |              | English alphabet, ASCII compatible               |
| 116 |  74 | 164 |      27,894,640 |     t     |        |          |                              |              | English alphabet, ASCII compatible               |
| 117 |  75 | 165 |      12,524,979 |     u     |        |          |                              |              | English alphabet, ASCII compatible               |
| 118 |  76 | 166 |         804,032 |     v     |        |          |                              |              | English alphabet, ASCII compatible               |
| 119 |  77 | 167 |       4,870,569 |     w     |        |          |                              |              | English alphabet, ASCII compatible               |
| 120 |  78 | 170 |       5,509,663 |     x     |        |          |                              |              | English alphabet, ASCII compatible               |
| 121 |  79 | 171 |       4,020,240 |     y     |        |          |                              |              | English alphabet, ASCII compatible               |
| 122 |  7a | 172 |           1,597 |     z     |        |          |                              |              | English alphabet, ASCII compatible               |
| 123 |  7b | 173 |               0 |     {     |        |          |                              |              | English alphabet, ASCII compatible               |
| 124 |  7c | 174 |               0 |     |     |        |          |                              |              | English alphabet, ASCII compatible               |
| 125 |  7d | 175 |               0 |     }     |        |          |                              |              | English alphabet, ASCII compatible               |
| 126 |  7e | 176 |               0 |     ~     |        |          |                              |              | English alphabet, ASCII compatible               |
| 127 |  7f | 177 |               0 |           |        |    ^?    | C: delete                    |              | Control character, ASCII compatible              |
| 128 |  80 | 200 |               0 |           |        |          | W: Euro symbol               |     Yes      | Continuation byte                                |
| 129 |  81 | 201 |               0 |           |        |          |                              |              | Continuation byte                                |
| 130 |  82 | 202 |               0 |           |        |          | W: curved single open-quote  |     Yes      | Continuation byte                                |
| 131 |  83 | 203 |               0 |           |        |          | W: small f with hook         |     Yes      | Continuation byte                                |
| 132 |  84 | 204 |               0 |           |        |          | W: curved double open-quote  |     Yes      | Continuation byte                                |
| 133 |  85 | 205 |               0 |           |        |          | W: ellipsis                  |     Yes      | Continuation byte                                |
| 134 |  86 | 206 |               0 |           |        |          | W: dagger                    |     Yes      | Continuation byte                                |
| 135 |  87 | 207 |               0 |           |        |          | W: double dagger             |     Yes      | Continuation byte                                |
| 136 |  88 | 210 |               0 |           |        |          | W: circumflex                |     Yes      | Continuation byte                                |
| 137 |  89 | 211 |               0 |           |        |          | W: permille                  |     Yes      | Continuation byte                                |
| 138 |  8a | 212 |               0 |           |        |          | W: capital S caron           |     Yes      | Continuation byte                                |
| 139 |  8b | 213 |               0 |           |        |          | W: open single guillemet     |     Yes      | Continuation byte                                |
| 140 |  8c | 214 |               0 |           |        |          | W: capital O+E               |     Yes      | Continuation byte                                |
| 141 |  8d | 215 |               0 |           |        |          |                              |              | Continuation byte                                |
| 142 |  8e | 216 |               0 |           |        |          | W: capital Z caron           |     Yes      | Continuation byte                                |
| 143 |  8f | 217 |               0 |           |        |          |                              |              | Continuation byte                                |
| 144 |  90 | 220 |               0 |           |        |          |                              |              | Continuation byte                                |
| 145 |  91 | 221 |               0 |           |        |          | W: curved single close quote |     Yes      | Continuation byte                                |
| 146 |  92 | 222 |               0 |           |        |          | W: curved single open quote  |     Yes      | Continuation byte                                |
| 147 |  93 | 223 |               0 |           |        |          | W: curved double open quote  |     Yes      | Continuation byte                                |
| 148 |  94 | 224 |               0 |           |        |          | W: curved double close quote |     Yes      | Continuation byte                                |
| 149 |  95 | 225 |               0 |           |        |          | W: bullet                    |     Yes      | Continuation byte                                |
| 150 |  96 | 226 |               0 |           |        |          | W: en-dash                   |     Yes      | Continuation byte                                |
| 151 |  97 | 227 |               0 |           |        |          | W: em-dash                   |     Yes      | Continuation byte                                |
| 152 |  98 | 230 |               0 |           |        |          | W: tilde                     |     Yes      | Continuation byte                                |
| 153 |  99 | 231 |               0 |           |        |          | W: trade mark                |     Yes      | Continuation byte                                |
| 154 |  9a | 232 |               0 |           |        |          | W: small S caron             |     Yes      | Continuation byte                                |
| 155 |  9b | 233 |               0 |           |        |          | W: close single guillemet    |     Yes      | Continuation byte                                |
| 156 |  9c | 234 |               0 |           |        |          | W: small O+E                 |     Yes      | Continuation byte                                |
| 157 |  9d | 235 |               0 |           |        |          |                              |              | Continuation byte                                |
| 158 |  9e | 236 |               0 |           |        |          | W: small Z caron             |     Yes      | Continuation byte                                |
| 159 |  9f | 237 |               0 |           |        |          | W: capital Y umlaut          |     Yes      | Continuation byte                                |
| 160 |  a0 | 240 |               0 |           |        |          | W+I: non-breaking space      |              | Continuation byte                                |
| 161 |  a1 | 241 |               0 |           |        |          | W+I: inverted exclamation    |              | Continuation byte                                |
| 162 |  a2 | 242 |               0 |           |        |          | W+I: cent                    |              | Continuation byte                                |
| 163 |  a3 | 243 |               0 |           |        |          | W+I: UK pound sign           |              | Continuation byte                                |
| 164 |  a4 | 244 |               0 |           |        |          | W+I: currency                |              | Continuation byte                                |
| 165 |  a5 | 245 |               0 |           |        |          | W+I: Yen sign                |              | Continuation byte                                |
| 166 |  a6 | 246 |               0 |           |        |          | W+I: broken bar              |              | Continuation byte                                |
| 167 |  a7 | 247 |               0 |           |        |          | W+I: section                 |              | Continuation byte                                |
| 168 |  a8 | 250 |               0 |           |        |          | W+I: trema                   |              | Continuation byte                                |
| 169 |  a9 | 251 |               0 |           |        |          | W+I: copyright sign          |              | Continuation byte                                |
| 170 |  aa | 252 |               0 |           |        |          | W+I: feminine ordinal        |              | Continuation byte                                |
| 171 |  ab | 253 |               0 |           |        |          | W+I: open double guillemet   |              | Continuation byte                                |
| 172 |  ac | 254 |               0 |           |        |          | W+I: logical complement      |              | Continuation byte                                |
| 173 |  ad | 255 |               0 |           |        |          | W+I: soft hyphen             |              | Continuation byte                                |
| 174 |  ae | 256 |               0 |           |        |          | W+I: registered sign         |              | Continuation byte                                |
| 175 |  af | 257 |               0 |           |        |          | W+I: macron                  |              | Continuation byte                                |
| 176 |  b0 | 260 |               0 |           |        |          | W+I: degree                  |              | Continuation byte                                |
| 177 |  b1 | 261 |               0 |           |        |          | W+I: plus-or-minus sign      |              | Continuation byte                                |
| 178 |  b2 | 262 |               0 |           |        |          | W+I: raise to power of 2     |              | Continuation byte                                |
| 179 |  b3 | 263 |               0 |           |        |          | W+I: raise to power of 3     |              | Continuation byte                                |
| 180 |  b4 | 264 |               0 |           |        |          | W+I: acute accent            |              | Continuation byte                                |
| 181 |  b5 | 265 |               0 |           |        |          | W+I: Greek letter mu         |              | Continuation byte                                |
| 182 |  b6 | 266 |               0 |           |        |          | W+I: pilcrow (paragraph)     |              | Continuation byte                                |
| 183 |  b7 | 267 |               0 |           |        |          | W+I: middle dot              |              | Continuation byte                                |
| 184 |  b8 | 270 |               0 |           |        |          | W+I: cedilla                 |              | Continuation byte                                |
| 185 |  b9 | 271 |               0 |           |        |          | W+I: superscript 1           |              | Continuation byte                                |
| 186 |  ba | 272 |               0 |           |        |          | W+I: masculine ordinal       |              | Continuation byte                                |
| 187 |  bb | 273 |               0 |           |        |          | W+I: close double guillemet  |              | Continuation byte                                |
| 188 |  bc | 274 |               0 |           |        |          | W+I: one quarter             |              | Continuation byte                                |
| 189 |  bd | 275 |               0 |           |        |          | W+I: one half                |              | Continuation byte                                |
| 190 |  be | 276 |               0 |           |        |          | W+I: three quarters          |              | Continuation byte                                |
| 191 |  bf | 277 |               0 |           |        |          | W+I: inverted question mark  |              | Continuation byte                                |
| 192 |  c0 | 300 |               0 |           |        |          | W+I: capital A grave         |              | Invalid in UTF-8                                 |
| 193 |  c1 | 301 |               0 |           |        |          | W+I: capital A  acute        |              | Invalid in UTF-8                                 |
| 194 |  c2 | 302 |               0 |           |        |          | W+I: capital A circumflex    |              | Leading Byte(2), Latin                           |
| 195 |  c3 | 303 |               0 |           |        |          | W+I: capital A tilde         |              | Leading Byte(2), Latin                           |
| 196 |  c4 | 304 |               0 |           |        |          | W+I: capital A umlaut        |              | Leading Byte(2), Latin                           |
| 197 |  c5 | 305 |               0 |           |        |          | W+I: capital A ring          |              | Leading Byte(2), Latin                           |
| 198 |  c6 | 306 |               0 |           |        |          | W+I: capital ash A+E         |              | Leading Byte(2), Latin                           |
| 199 |  c7 | 307 |               0 |           |        |          | W+I: capital C cedilla       |              | Leading Byte(2), Latin                           |
| 200 |  c8 | 310 |               0 |           |        |          | W+I: capital E grave         |              | Leading Byte(2), Latin                           |
| 201 |  c9 | 311 |               0 |           |        |          | W+I: capital E acute         |              | Leading Byte(2), International Phonetic Alphabet |
| 202 |  ca | 312 |               0 |           |        |          | W+I: capital E circumflex    |              | Leading Byte(2), International Phonetic Alphabet |
| 203 |  cb | 313 |               0 |           |        |          | W+I: capital E umlaut        |              | Leading Byte(2), International Phonetic Alphabet |
| 204 |  cc | 314 |               0 |           |        |          | W+I: capital I grave         |              | Leading Byte(2), accents                         |
| 205 |  cd | 315 |               0 |           |        |          | W+I: capital I acute         |              | Leading Byte(2), accents                         |
| 206 |  ce | 316 |               0 |           |        |          | W+I: capital I circumflex    |              | Leading Byte(2), Greek                           |
| 207 |  cf | 317 |               0 |           |        |          | W+I: capital I umlaut        |              | Leading Byte(2), Greek                           |
| 208 |  d0 | 320 |               0 |           |        |          | W+I: capital D with bar      |              | Leading Byte(2), Cyrillic                        |
| 209 |  d1 | 321 |               0 |           |        |          | W+I: capital N tilde         |              | Leading Byte(2), Cyrillic                        |
| 210 |  d2 | 322 |               0 |           |        |          | W+I: capital O grave         |              | Leading Byte(2), Cyrillic                        |
| 211 |  d3 | 323 |               0 |           |        |          | W+I: capital O acute         |              | Leading Byte(2), Cyrillic                        |
| 212 |  d4 | 324 |               0 |           |        |          | W+I: capital O circumflex    |              | Leading Byte(2), Cyrillic                        |
| 213 |  d5 | 325 |               0 |           |        |          | W+I: capital O tilde         |              | Leading Byte(2), Armenian                        |
| 214 |  d6 | 326 |               0 |           |        |          | W+I: capital O umlaut        |              | Leading Byte(2), Hebrew                          |
| 215 |  d7 | 327 |               0 |           |        |          | W+I: multiplication sign     |              | Leading Byte(2), Hebrew                          |
| 216 |  d8 | 330 |               0 |           |        |          | W+I: capital O with slash    |              | Leading Byte(2), Arabic                          |
| 217 |  d9 | 331 |               0 |           |        |          | W+I: capital U grave         |              | Leading Byte(2), Arabic                          |
| 218 |  da | 332 |               0 |           |        |          | W+I: capital U acute         |              | Leading Byte(2), Arabic                          |
| 219 |  db | 333 |               0 |           |        |          | W+I: capital U circumflex    |              | Leading Byte(2), Arabic                          |
| 220 |  dc | 334 |               0 |           |        |          | W+I: capital u umlaut        |              | Leading Byte(2), Syriac                          |
| 221 |  dd | 335 |               0 |           |        |          | W+I: capital Y acute         |              | Leading Byte(2), Arabic                          |
| 222 |  de | 336 |               0 |           |        |          | W+I: capital thorn           |              | Leading Byte(2), Thaana                          |
| 223 |  df | 337 |               0 |           |        |          | W+I: capital eszett          |              | Leading Byte(2), N'Ko                            |
| 224 |  e0 | 340 |               0 |           |        |          | W+I: small a grave           |              | Leading Byte(3), Indic                           |
| 225 |  e1 | 341 |               0 |           |        |          | W+I: small a acute           |              | Leading Byte(3), Miscellaneous                   |
| 226 |  e2 | 342 |               0 |           |        |          | W+I: small a circumflex      |              | Leading Byte(3), Symbol                          |
| 227 |  e3 | 343 |               0 |           |        |          | W+I: small a tilde           |              | Leading Byte(3), Kana & Chinese/Japanese/Korean  |
| 228 |  e4 | 344 |               0 |           |        |          | W+I: small a umlaut          |              | Leading Byte(3), Chinese/Japanese/Korean unified |
| 229 |  e5 | 345 |               0 |           |        |          | W+I: small a ring            |              | Leading Byte(3), Chinese/Japanese/Korean unified |
| 230 |  e6 | 346 |               0 |           |        |          | W+I: small ash a+e           |              | Leading Byte(3), Chinese/Japanese/Korean unified |
| 231 |  e7 | 347 |               0 |           |        |          | W+I: small c cedilla         |              | Leading Byte(3), Chinese/Japanese/Korean unified |
| 232 |  e8 | 350 |               0 |           |        |          | W+I: small e grave           |              | Leading Byte(3), Chinese/Japanese/Korean unified |
| 233 |  e9 | 351 |               0 |           |        |          | W+I: small e acute           |              | Leading Byte(3), Chinese/Japanese/Korean unified |
| 234 |  ea | 352 |               0 |           |        |          | W+I: small e circumflex      |              | Leading Byte(3), Asian                           |
| 235 |  eb | 353 |               0 |           |        |          | W+I: small e umlaut          |              | Leading Byte(3), Hangul                          |
| 236 |  ec | 354 |               0 |           |        |          | W+I: small i grave           |              | Leading Byte(3), Hangul                          |
| 237 |  ed | 355 |               0 |           |        |          | W+I: small i acute           |              | Leading Byte(3), Hangul                          |
| 238 |  ee | 356 |               0 |           |        |          | W+I: small i circumflex      |              | Leading Byte(3), Private Use Areas               |
| 239 |  ef | 357 |               0 |           |        |          | W+I: small i umlaut          |              | Leading Byte(3), Forms                           |
| 240 |  f0 | 360 |               0 |           |        |          | W+I: small eth               |              | Leading Byte(4), Supplementary Planes            |
| 241 |  f1 | 361 |               0 |           |        |          | W+I: small n tilde           |              | Leading Byte(4)                                  |
| 242 |  f2 | 362 |               0 |           |        |          | W+I: small o grave           |              | Leading Byte(4)                                  |
| 243 |  f3 | 363 |               0 |           |        |          | W+I: small o acute           |              | Leading Byte(4), Supplementary Planes            |
| 244 |  f4 | 364 |               0 |           |        |          | W+I: small o circumflex      |              | Leading Byte(4), Supplementary Planes            |
| 245 |  f5 | 365 |               0 |           |        |          | W+I: small o tilde           |              | Invalid in UTF-8                                 |
| 246 |  f6 | 366 |               0 |           |        |          | W+I: small o umlaut          |              | Invalid in UTF-8                                 |
| 247 |  f7 | 367 |               0 |           |        |          | W+I: division sign           |              | Invalid in UTF-8                                 |
| 248 |  f8 | 370 |               0 |           |        |          | W+I: small o with slash      |              | Invalid in UTF-8                                 |
| 249 |  f9 | 371 |               0 |           |        |          | W+I: small u grave           |              | Invalid in UTF-8                                 |
| 250 |  fa | 372 |               0 |           |        |          | W+I: small u acute           |              | Invalid in UTF-8                                 |
| 251 |  fb | 373 |               0 |           |        |          | W+I: small u circumflex      |              | Invalid in UTF-8                                 |
| 252 |  fc | 374 |               0 |           |        |          | W+I: small u umlaut          |              | Invalid in UTF-8                                 |
| 253 |  fd | 375 |               0 |           |        |          | W+I: small y acute           |              | Invalid in UTF-8                                 |
| 254 |  fe | 376 |               0 |           |        |          | W+I: small thorn             |              | Invalid in UTF-8                                 |
| 255 |  ff | 377 |               0 |           |        |          | W+I: small y umlaut          |              | Invalid in UTF-8                                 |
+-----+-----+-----+-----------------+-----------+--------+----------+------------------------------+--------------+--------------------------------------------------+
    
Scanning file: /Users/ronballard/Downloads/oproad_gml3_gb/data/OSOpenRoads_TQ.gml
Line: 3, column: 1  Found control character Decimal: 9, Hex: 09, Octal: 011, Control character ^I \t horizontal tab
Line: 4, column: 1  Found control character Decimal: 9, Hex: 09, Octal: 011, Control character ^I \t horizontal tab
Line: 5, column: 1  Found control character Decimal: 9, Hex: 09, Octal: 011, Control character ^I \t horizontal tab
Line: 5, column: 2  Found control character Decimal: 9, Hex: 09, Octal: 011, Control character ^I \t horizontal tab
Line: 6, column: 1  Found control character Decimal: 9, Hex: 09, Octal: 011, Control character ^I \t horizontal tab
Line: 6, column: 2  Found control character Decimal: 9, Hex: 09, Octal: 011, Control character ^I \t horizontal tab
Line: 6, column: 3  Found control character Decimal: 9, Hex: 09, Octal: 011, Control character ^I \t horizontal tab
Line: 7, column: 1  Found control character Decimal: 9, Hex: 09, Octal: 011, Control character ^I \t horizontal tab
Line: 7, column: 2  Found control character Decimal: 9, Hex: 09, Octal: 011, Control character ^I \t horizontal tab
Line: 7, column: 3  Found control character Decimal: 9, Hex: 09, Octal: 011, Control character ^I \t horizontal tab
    

Running with verbose output produces a 4.8GB file, because there are so many tab characters. A quick scan of the output suggests that the tabs are used to indent the GML (see Loading XML Data to a Database), so the verbose output is not very useful in this case.

If we are going to load this data into a database, then the encoding will not give us any problems. Because it is purely 7-bit ASCII (which is compatible with all the Latin character encoding systems), we can load it into any of these without conversion. However, any XML format should always be restructured into atomic values, in tables and columns, as described in Loading XML Data to a Database. While doing this we would remove the tab characters because they are not part of the data.
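The "purely 7-bit ASCII" condition can also be checked on its own. The following is a minimal sketch, not code taken from EncodingProfile: the class name AsciiCheck and the method indexOfNonAscii are made up for illustration. It reads the file in chunks and stops at the first byte above hex 7f.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of EncodingProfile.
public class AsciiCheck {

    /** Index of the first byte in buf[0..len) above 0x7f, or -1 if all are 7-bit ASCII. */
    public static int indexOfNonAscii(byte[] buf, int len) {
        for (int i = 0; i < len; i++) {
            if (buf[i] < 0) {   // Java bytes are signed, so 0x80..0xff show up as negative values
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java AsciiCheck <file>");
            return;
        }
        byte[] buf = new byte[64 * 1024];
        long base = 0;   // file offset of buf[0]
        try (InputStream in = new FileInputStream(args[0])) {
            int n;
            while ((n = in.read(buf)) != -1) {
                int i = indexOfNonAscii(buf, n);
                if (i >= 0) {
                    System.out.println("First non-ASCII byte at offset " + (base + i));
                    return;
                }
                base += n;
            }
        }
        System.out.println("Pure 7-bit ASCII");
    }
}
```

If this reports "Pure 7-bit ASCII", the file reads the same in ASCII, ISO 8859-1, Windows-1252 and UTF-8. If it reports an offset, the full byte table is the place to look next.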

Example 4: A File From a Commercial Back-Office System

This section is about the file that made me realise the need for a character profiler. The file comes from a system that was the definitive record of customers for a large company. The company was formed by buying a piece of an even larger company, and this particular system was still run by the old company while the new offshoot built its own customer database. The old system was written decades ago and still runs on an IBM mainframe. The relationship between the old company and the new one is such that it is very difficult to get questions answered about data in this file. We know that the file comes from a mainframe, but the process by which the file is periodically delivered to a Windows server in the new company is a mystery. In fact the file is one of hundreds that make up the old customer database, but it is the file that contains the main customer details. It is nowhere near being normalised; the file has over 150 fields in each record, some of which are directly related to the customer and many that should be off in other tables because they have a many-to-one relationship, or an indirect relationship, to the customer.

None of this is unusual. It is everyday work for people in big organisations migrating data to new systems or building Data Warehouses, and now Data Lakes. Somehow we have to try to make sense of such data. Before we even get to that stage though, we have to load the data into some system that can help us to analyse it. This file was a delimited file, like a CSV file but with an exclamation mark as the delimiter. The choice of delimiter is not a problem, but this file was a problem. Whatever parser we used to divide the file into records, and then fields within records, would periodically be broken by some unexpected characters in the data. Hence our need to find out, quickly, what had caused today's failure.
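To give an idea of what "finding out quickly" can look like, here is a minimal sketch of a scanner that reports the line and column of every control character other than carriage return and newline. It is not the EncodingProfile source. The class name ControlCharFinder, the skipTabs option and the message layout are assumptions made for this illustration, and columns are counted in bytes, not characters.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of EncodingProfile.
public class ControlCharFinder {
    private final boolean skipTabs;   // assumption: lets us ignore tabs when they are just indentation
    private long line = 1;
    private long column = 0;          // byte position within the current line, 1-based once a byte is read

    public ControlCharFinder(boolean skipTabs) {
        this.skipTabs = skipTabs;
    }

    /** Feed one byte (0..255). Returns a report line for a control character, or null otherwise. */
    public String accept(int b) {
        if (b == '\n') {              // newline ends the line and is never reported
            line++;
            column = 0;
            return null;
        }
        column++;
        if (b == '\r' || (skipTabs && b == '\t')) {
            return null;
        }
        if (b < 0x20 || b == 0x7f) {  // ASCII control range plus delete
            return String.format(
                "Line: %d, column: %d  Found control character Decimal: %d, Hex: %02x, Octal: %03o",
                line, column, b, b, b);
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ControlCharFinder <file>");
            return;
        }
        ControlCharFinder finder = new ControlCharFinder(true);
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            int b;
            while ((b = in.read()) != -1) {
                String report = finder.accept(b);
                if (report != null) {
                    System.out.println(report);
                }
            }
        }
    }
}
```

With a line and column in hand, it is much easier to open the offending record and see which field the stray character landed in.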

Here is the first part of the analysis. It does not include the UTF-8 summary statistics, because that part of the program had not been written when I ran this and I no longer have access to the file to rerun it. It doesn't matter, because this was not a UTF-8 file.

  +------------------------------------------------------------|-----------------+
  | Lines:                                                     |       3,811,261 |
  | Bytes:                                                     |   3,898,861,310 |
  | Windows line ends:                                         |       3,811,261 |
  | UNIX line ends:                                            |               0 |
  | Control Characters  (excluding newline & carriage return): |           1,456 |
  | Carriage returns without newline:                          |               2 |
  +------------------------------------------------------------|-----------------+
    

A line is defined here as any sequence of characters ending in newline, or carriage return/newline. "Windows line ends" counts the number of carriage return/newline pairs, and in this case we have 3,811,261 of these. "UNIX line ends" counts the number of newlines that are not preceded by a carriage return, and there are none of these. This tells us that the file we are looking at was written, in the form we received it, on a Windows system (not UNIX and not Mac).

This summary also tells us that there are two carriage returns without newlines and that there are 1,456 other control characters.
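The definitions above can be written down as a small state machine that only needs to remember whether the previous byte was a carriage return. This is a sketch under those definitions, not the EncodingProfile source. The class name LineEndCounter and its field names are invented for the example, and counting delete (hex 7f) among the control characters is an assumption.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of EncodingProfile.
public class LineEndCounter {
    public long windowsLineEnds;      // carriage return immediately followed by newline
    public long unixLineEnds;         // newline not preceded by carriage return
    public long loneCarriageReturns;  // carriage return not followed by newline
    public long otherControls;        // control characters excluding newline and carriage return
    private boolean previousWasCr = false;

    /** Feed one byte (0..255). */
    public void accept(int b) {
        if (b == '\n') {
            if (previousWasCr) {
                windowsLineEnds++;
            } else {
                unixLineEnds++;
            }
        } else if (previousWasCr) {
            loneCarriageReturns++;    // the previous CR turned out not to be part of a CR/LF pair
        }
        if (b != '\n' && b != '\r' && (b < 0x20 || b == 0x7f)) {
            otherControls++;
        }
        previousWasCr = (b == '\r');
    }

    /** Call once at end of input so that a trailing carriage return is counted. */
    public void finish() {
        if (previousWasCr) {
            loneCarriageReturns++;
            previousWasCr = false;
        }
    }

    /** Lines as defined in the text: sequences ending in newline or carriage return/newline. */
    public long lines() {
        return windowsLineEnds + unixLineEnds;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java LineEndCounter <file>");
            return;
        }
        LineEndCounter c = new LineEndCounter();
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            int b;
            while ((b = in.read()) != -1) {
                c.accept(b);
            }
        }
        c.finish();
        System.out.printf("Lines: %,d%nWindows line ends: %,d%nUNIX line ends: %,d%n"
                + "Other control characters: %,d%nCarriage returns without newline: %,d%n",
                c.lines(), c.windowsLineEnds, c.unixLineEnds, c.otherControls, c.loneCarriageReturns);
    }
}
```

A file with consistent Windows line ends gives a UNIX count of zero, as in the summary above. A mixture of the two counts usually means the file has been assembled from more than one source.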

The next extract is the start of the table that lists every byte value (from 0 to 255) and the count of bytes with that value. I'm showing just the first 32 values. This covers all but one of the control characters:

+-----+-----+-----+-----------------+-----------+--------+----------+------------------------------+--------------+-----------------------------------------------+
|     |     |     |                 |   ASCII   |   C    | teletype | Name (W = Windows-1252,      | Specific to  |                                               |
| Dec | Hex | Oct | Number of Bytes | Printable | Escape | notation | I = ISO 8859-1, C = Control  | Windows-1252 |                  UTF-8 Group                  |
+-----+-----+-----+-----------------+-----------+--------+----------+------------------------------+--------------+-----------------------------------------------+
| 000 |  00 | 000 |              28 |           |   \0   |    ^@    | C:null                       |              | Control character, ASCII compatible           |
| 001 |  01 | 001 |               0 |           |        |    ^A    | C:start of heading           |              | Control character, ASCII compatible           |
| 002 |  02 | 002 |               0 |           |        |    ^B    | C:start of text              |              | Control character, ASCII compatible           |
| 003 |  03 | 003 |               0 |           |        |    ^C    | C:end of text                |              | Control character, ASCII compatible           |
| 004 |  04 | 004 |               0 |           |        |    ^D    | C:end of transmission        |              | Control character, ASCII compatible           |
| 005 |  05 | 005 |               0 |           |        |    ^E    | C:enquiry                    |              | Control character, ASCII compatible           |
| 006 |  06 | 006 |               0 |           |        |    ^F    | C:acknowledgement            |              | Control character, ASCII compatible           |
| 007 |  07 | 007 |               0 |           |   \a   |    ^G    | C:bell                       |              | Control character, ASCII compatible           |
| 008 |  08 | 010 |               0 |           |   \b   |    ^H    | C:backspace                  |              | Control character, ASCII compatible           |
| 009 |  09 | 011 |           1,428 |           |   \t   |    ^I    | C:horizontal tab             |              | Control character, ASCII compatible           |
| 010 |  0a | 012 |       3,811,261 |           |   \n   |    ^J    | C:newline                    |              | Control character, ASCII compatible           |
| 011 |  0b | 013 |               0 |           |   \v   |    ^K    | C:vertical tab               |              | Control character, ASCII compatible           |
| 012 |  0c | 014 |               0 |           |   \f   |    ^L    | C:form feed                  |              | Control character, ASCII compatible           |
| 013 |  0d | 015 |       3,811,263 |           |   \r   |    ^M    | C:carriage return            |              | Control character, ASCII compatible           |
| 014 |  0e | 016 |               0 |           |        |    ^N    | C:shift out                  |              | Control character, ASCII compatible           |
| 015 |  0f | 017 |               0 |           |        |    ^O    | C:shift in                   |              | Control character, ASCII compatible           |
| 016 |  10 | 020 |               0 |           |        |    ^P    | C:data link escape           |              | Control character, ASCII compatible           |
| 017 |  11 | 021 |               0 |           |        |    ^Q    | C:device control 1           |              | Control character, ASCII compatible           |
| 018 |  12 | 022 |               0 |           |        |    ^R    | C:device control 2           |              | Control character, ASCII compatible           |
| 019 |  13 | 023 |               0 |           |        |    ^S    | C:device control 3           |              | Control character, ASCII compatible           |
| 020 |  14 | 024 |               0 |           |        |    ^T    | C:device control 4           |              | Control character, ASCII compatible           |
| 021 |  15 | 025 |               0 |           |        |    ^U    | C:negative acknowledgement   |              | Control character, ASCII compatible           |
| 022 |  16 | 026 |               0 |           |        |    ^V    | C:synchronous idle           |              | Control character, ASCII compatible           |
| 023 |  17 | 027 |               0 |           |        |    ^W    | C:end of transmission block  |              | Control character, ASCII compatible           |
| 024 |  18 | 030 |               0 |           |        |    ^X    | C:cancel                     |              | Control character, ASCII compatible           |
| 025 |  19 | 031 |               0 |           |        |    ^Y    | C:end of medium              |              | Control character, ASCII compatible           |
| 026 |  1a | 032 |               0 |           |        |    ^Z    | C:substitute                 |              | Control character, ASCII compatible           |
| 027 |  1b | 033 |               0 |           |   \e   |    ^[    | C:escape                     |              | Control character, ASCII compatible           |
| 028 |  1c | 034 |               0 |           |        |    ^\    | C:file separator             |              | Control character, ASCII compatible           |
| 029 |  1d | 035 |               2 |           |        |    ^]    | C:group separator            |              | Control character, ASCII compatible           |
| 030 |  1e | 036 |               0 |           |        |    ^^    | C:record separator           |              | Control character, ASCII compatible           |
| 031 |  1f | 037 |               0 |           |        |    ^_    | C:unit separator             |              | Control character, ASCII compatible           |
    

We can see several interesting things from this table:

In this early version of the character encoding profile program, we did have the line and column numbers of characters that might cause us problems, or give us information about the possible encoding. Here is a very short extract:

  Line: 938543, column: 585  Found UTF-8 continuation byte 162, Hex: a2, Octal: 242, UTF-8 Continuation byte, Windows-1252 Cent (currency symbol) 
  Line: 938548, column: 585  Found UTF-8 continuation byte 162, Hex: a2, Octal: 242, UTF-8 Continuation byte, Windows-1252 Cent (currency symbol) 
  Line: 938553, column: 1  Carriage return (hex 0d) without newline (hex 0a) 
    

This extract shows a couple of lines that contain a byte with the value 162 (decimal), which is hex a2. If the file were in UTF-8 this would be a continuation byte, so there should be a leading byte before it (and there is not, in this file, because it would show on the line above). If the file were in Windows-1252 this would be the cent currency symbol (¢). Since this is a plausible symbol in Windows-1252 but an invalid byte sequence in UTF-8, it suggests that this file is more likely to be in Windows-1252 than in UTF-8. Hex a2 is also a valid character in the ISO/IEC 8859 family of character sets.
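To make that reasoning concrete, here is a minimal sketch in Java of how the same byte reads under the two encodings. It is not taken from EncodingProfile; the class name `ByteMeaning` and its methods are made up for illustration. It also assumes that your JDK ships the `windows-1252` charset, which is usual but is not one of the charsets Java guarantees.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of EncodingProfile.
class ByteMeaning {

    /** Says what role a single byte value would play if the file were UTF-8. */
    static String utf8Role(int b) {
        b &= 0xff;
        if (b < 0x80) return "ASCII";
        if (b < 0xc0) return "continuation";   // bit pattern 10xxxxxx
        if (b < 0xc2) return "invalid";        // c0 and c1 never occur in UTF-8
        if (b < 0xe0) return "lead-2";         // starts a 2-byte sequence
        if (b < 0xf0) return "lead-3";         // starts a 3-byte sequence
        if (b < 0xf5) return "lead-4";         // starts a 4-byte sequence
        return "invalid";                      // f5..ff never occur in UTF-8
    }

    /** True only if the whole array decodes as UTF-8 with no errors. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Decodes the bytes with the named charset (assumed to be installed). */
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "50" followed by a lone a2, as it might appear in a price field.
        byte[] lone = { '5', '0', (byte) 0xa2 };
        System.out.println(utf8Role(0xa2));               // continuation
        System.out.println(isValidUtf8(lone));            // false
        System.out.println(decode(lone, "windows-1252")); // 50 plus the cent sign

        // The same character in real UTF-8 needs a leading byte: c2 a2.
        byte[] proper = { '5', '0', (byte) 0xc2, (byte) 0xa2 };
        System.out.println(isValidUtf8(proper));          // true
    }
}
```

A lone a2 fails the strict UTF-8 decode but reads as a sensible character in Windows-1252, and that asymmetry is the evidence we are weighing here.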

The third line in this extract shows one of the lone carriage return characters. This enabled us to go and look at the line containing this aberration, and decide what to do about it. It is hard to see what use a lone carriage return would be. Since we gave up teletypes there has not been much use for the carriage return character, except for its ill-advised use as part of the end of a line in Windows.
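If you want to hunt for lone carriage returns yourself, the check is easy to sketch in Java. This is not the code from EncodingProfile. The class name `LoneCarriageReturns` is made up, and the counting rules are assumptions: lines are numbered from 1 and advance on each newline (hex 0a), and columns count bytes from 1 within the line.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch for illustration only; not part of EncodingProfile.
class LoneCarriageReturns {

    private static final String MESSAGE =
            "  Carriage return (hex 0d) without newline (hex 0a)";

    /** Reports every 0d byte that is not immediately followed by 0a. */
    static List<String> find(InputStream in) throws IOException {
        List<String> findings = new ArrayList<>();
        long line = 1;            // assumption: lines are numbered from 1
        long column = 0;          // assumption: columns count bytes from 1
        long crLine = 0;
        long crColumn = 0;
        boolean pendingCr = false;
        int b;
        while ((b = in.read()) != -1) {
            column++;
            if (pendingCr && b != 0x0a) {
                findings.add("Line: " + crLine + ", column: " + crColumn + MESSAGE);
            }
            pendingCr = (b == 0x0d);
            if (pendingCr) {      // remember where the carriage return was
                crLine = line;
                crColumn = column;
            }
            if (b == 0x0a) {      // a newline ends the current line
                line++;
                column = 0;
            }
        }
        if (pendingCr) {          // a carriage return as the very last byte
            findings.add("Line: " + crLine + ", column: " + crColumn + MESSAGE);
        }
        return findings;
    }

    /** Convenience wrapper for data that is already in memory. */
    static List<String> find(byte[] data) {
        try {
            return find(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] sample = { 'a', 'b', '\r', '\n', 'c', 'd', '\r', 'x', 'y', '\n', '\r' };
        for (String finding : find(sample)) {
            System.out.println(finding);
        }
    }
}
```

For a real file you would pass a `BufferedInputStream` wrapped around a `FileInputStream`, because reading one byte at a time from an unbuffered stream is very slow.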

Example 5: A large binary file

In binary files, the bytes do not represent characters. The whole file is a sequence of bits that are organised in some way that a particular program can interpret. Binary files are used for images, video, audio, compressed data, and others.

In general, there is not much point in running the character encoding profile on a binary file except, maybe:

I tried it for the third reason above. I was looking for the biggest file on my Mac and it was the latest Hortonworks Sandbox virtual machine. Here goes!

  $ java EncodingProfile /Users/ronballard/Downloads/HDP_2.5_vmware.ova

  Scanning file: /Users/ronballard/Downloads/HDP_2.5_vmware.ova
  
  +------------------------------------------------------------|-----------------+
  | Lines:                                                     |      44,635,520 |
  | Bytes:                                                     |  11,907,244,545 |
  | Windows line ends:                                         |         174,874 |
  | UNIX line ends:                                            |      44,460,645 |
  | Control Characters:                                        |   1,478,386,355 |
  | UTF-8 Leading bytes:                                       |   2,376,960,854 |
  | UTF-8 Continuation bytes:                                  |   2,960,058,394 |
  | UTF-8 Leading bytes without enough Continuation bytes:     |   1,977,694,633 |
  | UTF-8 Continuation bytes without a preceding Leading byte: |   2,302,335,902 |
  | UTF-8 Invalid bytes:                                       |     614,551,653 |
  | Windows 1252 bytes:                                        |   2,730,723,664 |
  | ISO 8859 bytes:                                            |   5,951,570,901 |
  | Carriage returns without newline:                          |      45,527,068 |
  +------------------------------------------------------------|-----------------+
    

As predicted, this isn't very useful. It shows a pretty even distribution of byte values. In fact the report by every byte value shows about 45 million occurrences of every possible byte value (so I won't bother you with it here). The most useful thing for me was the following line:

 
      Elapsed Time: 00:04:20
    

That's nearly 46 million bytes per second. For me that makes the profiling worth doing on every file we are asked to process.
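The arithmetic behind that figure is just the byte count from the summary divided by the elapsed time in seconds. As a small sketch in Java (the class name `Throughput` and its methods are made up for illustration):

```java
// Hypothetical helper for illustration only; not part of EncodingProfile.
class Throughput {

    /** Converts an elapsed time in hh:mm:ss form to whole seconds. */
    static long toSeconds(String hhmmss) {
        String[] parts = hhmmss.trim().split(":");
        return Long.parseLong(parts[0]) * 3600
                + Long.parseLong(parts[1]) * 60
                + Long.parseLong(parts[2]);
    }

    /** Whole bytes processed per second (integer division). */
    static long bytesPerSecond(long bytes, String elapsed) {
        return bytes / toSeconds(elapsed);
    }

    public static void main(String[] args) {
        long bytes = 11_907_244_545L;  // "Bytes:" from the summary above
        System.out.println(bytesPerSecond(bytes, "00:04:20")); // prints 45797094
    }
}
```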