
Common mistakes with character encodings - part 1

Know your encodings!

This might be pretty obvious to you once you've encountered it. But every once in a while I meet somebody who's stuck with weird hassles because they're simply using a different encoding than the one they're declaring.

As long as you're just using relatively safe 7-bit ASCII characters, everything might seem to work just fine. But as soon as you dare to move outside that range and try to use, say, a German umlaut or another "extended" 8-bit character, all hell breaks loose ... if your environment's character encodings don't match for some reason.

For example:

With Globalize for Ruby on Rails and other l10n/i18n tools, you hardcode strings in your templates that, when the template is rendered, are handed to the database (in the case of Globalize) or some other persistence mechanism. E.g. a typical Globalized template might contain: "I wonder how I'm going to be encoded".t. Other tools such as GLoc use _("I wonder how I'm going to be encoded") instead.

Either way the string is used as hardcoded data that the software works on and with. In the case of Globalize it's inserted into the database as a key for subsequent lookups.
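To make this concrete, here's a minimal sketch of what such a template might look like (the file path is made up; the .t and _() calls are just the Globalize and GLoc idioms quoted above):

    <%# app/views/pages/index.rhtml, a hypothetical Rails template %>

    <%# Globalize extends String with a #t method that uses the string as a lookup key: %>
    <p><%= "I wonder how I'm going to be encoded".t %></p>

    <%# GLoc wraps the string in a gettext-style _() helper instead: %>
    <p><%= _("I wonder how I'm going to be encoded") %></p>

Whatever the syntax, the raw bytes of that string travel from your template file into the lookup machinery, so the file's encoding matters.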

Now, when your source code editor for some reason encodes your template as Latin-1 while your database expects you to provide UTF-8, you're in trouble.
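If you catch such a mismatch, re-encoding the file is straightforward. Here's a minimal sketch using modern Ruby's String#encode, which didn't exist back in the Ruby 1.8 days this article dates from (the file path is made up):

    # Read the template with the encoding it was actually saved in ...
    latin1 = File.read("app/views/pages/index.rhtml", encoding: "ISO-8859-1")

    # ... and write it back out as the UTF-8 your database expects.
    File.write("app/views/pages/index.rhtml", latin1.encode("UTF-8"))

Most editors can do the same through a "save with encoding" command, of course.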

Recently somebody asked me about an error he got from MySQL 5.0 when following the instructions in my Globalize tutorial. His database told him: "Mysql::Error: #22001 Data too long for column 'tr_key' at row 1: INSERT INTO globalize_translations (`item_id`, `pl...".

It turned out that he'd been bitten by exactly this problem: he had encoded his files as Latin-1 while his database table was configured to use UTF-8. So these encodings clashed, quite understandably: Rails handed a Latin-1 encoded string to a database that expected UTF-8.
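If you're not sure what the database side expects, you can simply ask it. A sketch, assuming a MySQL-backed Rails app of that era (SHOW VARIABLES is a plain MySQL statement; running it through ActiveRecord's connection is just one convenient way to issue it):

    # e.g. in script/console:
    rows = ActiveRecord::Base.connection.select_all(
      "SHOW VARIABLES LIKE 'character_set%'")
    rows.each { |row| puts row.inspect }

    # character_set_client, character_set_connection and
    # character_set_database should agree with the encoding
    # your files are actually saved in.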

The backstory: while normally used to vi on a Linux box, he had been working with RadRails for Eclipse on Windows XP, and Windows XP file dialogs seem to offer "ANSI" as the default encoding for newly created files. ("ANSI" on Windows usually means the Windows-1252 code page, which is nearly identical to Latin-1.)

The MySQL error message was pretty misleading (nothing was actually "too long", whatever that message was meant to tell you in the first place), and this has been recognized and fixed in the meantime.

The lesson from this seems to be: Your files are encoded somehow. So, know about their encoding!

Your files are encoded!

That's what all this character encoding and charset stuff is all about. :)

You might want to read up on the details elsewhere. But for starters: all of this comes down to the fact that every character needs to be saved as bits and bytes. Basically, character encodings are conventions that determine which byte values represent which characters.

An application that consumes some chunk of data, e.g. a file, needs to know the character encoding that was used to save that data. Likewise, a browser that receives an HTML page from a webserver needs to know (or guess) the character encoding. It needs to decode the bits and bytes one way or another.

For example: the commonly used standard character encoding ISO 8859-1, or less formally Latin-1, will cause a character like the German umlaut Ä to be saved (encoded) as the hexadecimal byte C4 (which equals decimal 196).

But the same character will be encoded as an entirely different byte, namely hexadecimal 80 or decimal 128, when you tell your application to save (encode) it using the Mac OS Roman character encoding. And in Mac OS Roman the byte C4 represents a completely unrelated character instead, namely the florin sign ƒ.

Now, when you try to open such a file with another application on another computer, probably even on another operating system (browsing the web, you're doing this all the time) ... how would that software know what the byte C4 contained in the file is meant to be? Is it the German umlaut Ä or the florin sign ƒ?
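It can't, unless it's told (or guesses). A minimal demonstration with modern Ruby's Encoding API, which makes the ambiguity tangible (again, nothing like this existed in the Ruby of 2006):

    umlaut = "Ä"                       # this source file is saved as UTF-8

    umlaut.encode("ISO-8859-1").bytes  # => [196]  (hex C4)
    umlaut.encode("macRoman").bytes    # => [128]  (hex 80)

    # The same byte C4 decodes to different characters,
    # depending on which encoding you assume:
    raw = [0xC4].pack("C")
    raw.force_encoding("ISO-8859-1").encode("UTF-8")  # => "Ä"
    raw.force_encoding("macRoman").encode("UTF-8")    # => "ƒ"

Without some out-of-band hint, say an HTTP Content-Type header with a charset parameter or an editor setting, the byte itself carries no answer.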