Skip to content

MySQL is destroying my Umlauts

This is the english translation of an article in my german blog. This article, like the german original, is licensed CC-BY-SA. The english translation has been kindly provided by Tobias Klausmann.

Recently, I had to explain this to several people, hence a writeup for the blog for easier reference. The question:
I have content in my database that can be sucessfully read and written by my application, but if I do a mysqldump to transfer the data to a new system, all the non-ASCII characters like Umlauts are destroyed.
This happens if you save data to a DB with the wrong text encoding label.

In MySQL, every string has a label that describes the character encoding the string was written in (and should be interpreted in). The string _latin1"Köhntopp" thus (hopefully) is the character sequence K-0xF6-hntopp and the string _utf8"Köhntopp" consequently should be K-0xC3 0xB6-hntopp. Problems arise as soon as the label (_latin1 or _utf8) does not match the encoding inside the string (0xF6 vs. 0xC3 0xB6).

This is outlined in more detail in Handling character sets, and you should have read that article before you continue.

Continue reading "MySQL is destroying my Umlauts"

Handling character sets

Choose any font you like


Ex-web.de Colleague Markus asked me:
QUOTE:
Because ... we are now changing dedicated server providers, I have already updated the Gentoo base system. Since as we are already changing everything, we might as well now migrate from MySQL 4 to 5.

So my question is: Shall I be switching from latin1 to utf8? Or should I be staying with latin1?
I have been writing a german language article on character sets in MySQL and this is the translation of it.
Continue reading "Handling character sets"

latin1_german1_cs

For a migration, I needed a latin1_german1_cs collation, that is, I needed German umlauts to sort as their base characters.

Here is what to do: Find your charsets directory with "SHOW VARIABLES LIKE 'char%dir%'".

Inside that directory will be a file Index.xml. Find a free id and add your new collation using an unused id. Add a description and a name.

Inside the same directory is a file latin1.xml. Inside that file, copy the block of the collation matching your new collation most closely, and adapt it as needed. In my case, I changed latin1_german1_ci to _cs, and modified the character values to match.

Patch attached after the break.

How to use:

After applying the patch, the server must be restarted. "SHOW COLLATION" will now show the new latin1_german1_cs collation. "ALTER TABLE t CHANGE COLUMN oldcol newcol varchar(80) CHARSET latin1 COLLATE latin1_german1_cs NOT NULL" will change any old column to the new collation and recreate indices as needed.

Because we changed a single byte character set, we do not need to recomple anything. Try sorting by latin1_bin, latin1_german1_ci and latin1_german1_cs to observe how things are different with each collation. Continue reading "latin1_german1_cs"