← Home

Getting MySQL compare Unicode Greek Extended characters correctly

For example: We have a table named forms that contains a column named value. There are three records that contain the following values: eta, eta, eta.

Now, each of the following statements ...

SELECT * FROM forms WHERE value = 'eta'
SELECT * FROM forms WHERE value = 'eta'
SELECT * FROM forms WHERE value = 'eta'

... will select all of these three records. Got that? All of these statements select the same three records. Woops. That's obviously not what one would expect here.

For the record, the database's character-set is set to UTF-8 Unicode (utf8) and the table's and VARCHAR column's collation is set utf8_unicode_ci.

So what the heck is going on here?

After I've posted this question to the MySQL users mailinglist Chris White put me on the right track by pointing me to the LIKE BINARY keyword which enables you to do case-sensitive string comparsions on byte-level rather than character level.

As far as I understand what's happening here MySQL transforms my VARCHAR columns for string comparsions. As of MySQL 4.1 a collation can be specified therefor. That's the reason why MySQL's string comparsions will be case-insensitive by default.

This transformation of course needs to rely on some mechanism to determine that, for example, "a" and "A" represent the same character. Now, utf8_unicode_ci seems to be only able to digest a certain range of UTF-8 characters and treats the both latter characters (eta with different accents) as the same as the first one (eta).

Bonus question now being why exactly this happens. I suspect I've just not collected enough Unicode fu to master this lesson, but the solution must be hidden somewhere there. If you're able to answer this and found this article I'm sure you'll also find your way to my "add a comment" form! :-)

Having learned something new I was pretty excited and went ahead to change some queries in my application, like:


Form.find_all_by_value value

... to use the newly discovered MySQL operator:


Form.find_all ["value LIKE BINARY ?", value]

... which worked perfectly. There's been another query that also went wild in the same kind of way - this time using a DISTINCT selection:


Form.find(:all, :select => "DISTINCT value")

This had the same issues like the orignal query above in that it regarded all three characters as identical and, thus, in this case selected only one of them. BINARY to the rescue - one can use this keyword here also:


Form.find(:all, :select => "DISTINCT BINARY value as value")

... which also yielded the expected results.

I already thought about creating a mini-plugin to switch ActiveRecords to use something like this by default (further fueled by having seen a Django commit concerning something like this).

But that's not even necessary at all.

Things are even far simpler.

Looking through the MySQL manual for more information on the LIKE BINARY operator I finally found that there's actually a special column type that makes MySQL treat comparsions in exactly this same way:

Just use VARBINARY as your column type and you're ready to go a compare exotic funny characters with MySQL.

I were quite perplexed because I never knew about this entire field type and for years on now stupidly and blindly used VARCHAR columns.

At the end of the day now the question remains what limitations, funny behaviours or other things to consider there are with VARBINARY columns compared to their well-known cousin VARCHAR.

If you know any ... please let me know! :-)

Oh! And if you wonder how you can change your migrations to use this column type ... this is not what Rails knows as :binary! That's the MySQL BLOB type. Instead you can specify this manually by something like this:


class CreateForms <ActiveRecord::Migration
  def self.up
    create_table :cards do |t|
      # ...
      t.columns << 'value varbinary(255)'	
    end
  end
end

However, this approach requires that you append the custom typed columns at the end of the list. Is there any better way to get custon column types accepted by migrations?

blog comments powered by Disqus