MySQL is destroying my Umlauts
This is the English translation of an article in my German blog. This article, like the German original, is licensed CC-BY-SA. The English translation has been kindly provided by Tobias Klausmann.
Recently, I had to explain this to several people, hence a writeup for the blog for easier reference. The question:
I have content in my database that can be successfully read and written by my application, but if I do a mysqldump to transfer the data to a new system, all the non-ASCII characters like Umlauts are destroyed.
This happens if you save data to a DB with the wrong text encoding label.
In MySQL, every string has a label that describes the character encoding the string was written in (and should be interpreted in). The string _latin1"Köhntopp" thus (hopefully) is the character sequence K-0xF6-hntopp and the string _utf8"Köhntopp" consequently should be K-0xC3 0xB6-hntopp. Problems arise as soon as the label (_latin1 or _utf8) does not match the encoding inside the string (0xF6 vs. 0xC3 0xB6).
This is outlined in more detail in Handling character sets, and you should have read that article before you continue.
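The two byte sequences can be reproduced outside MySQL with any language's encoding routines; here is a quick check in Python (a sketch for illustration, not part of the original article):

```python
# 0xF6 is 'ö' in latin1; 0xC3 0xB6 is 'ö' in utf8.
name = "Köhntopp"

latin1_bytes = name.encode("latin-1")
utf8_bytes = name.encode("utf-8")

print(latin1_bytes.hex())  # 4bf6686e746f7070
print(utf8_bytes.hex())    # 4bc3b6686e746f7070
```

The second hex string is exactly what we will see stored in the table below.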
Reproducing the problem
Let's define a table with a VARCHAR column that has the
encoding label latin1.
CODE:
mysql> create table t (
-> id integer unsigned not null primary key,
-> f varchar(20) charset latin1
-> ) engine = innodb;
Query OK, 0 rows affected (0.25 sec)
The connection we'll use to talk to the database we'll also declare to be latin1. That is, the character encoding of column t.f and the encoding of the connection match.
CODE:
mysql> set names latin1;
Query OK, 0 rows affected (0.00 sec)
What we are doing now is to send utf8 characters over a connection that is labeled as latin1. In essence, we're lying to the database:
CODE:
mysql> select hex('Köhntopp') as umlaut;
+--------------------+
| umlaut |
+--------------------+
| 4BC3B6686E746F7070 |
+--------------------+
1 row in set (0.00 sec)
mysql> insert into t values ( 1, 'Köhntopp');
Query OK, 1 row affected (0.00 sec)
The data is encoded as utf8 but is transmitted using a connection labeled as latin1. The column t.f also has the label latin1. Hence, no conversion is necessary and MySQL stores the data as-is, i.e. exactly the same way we've sent it.
If we read the data, that works just fine. Data in column t.f is interpreted as latin1 and the connection used is latin1, too. Again, the data is returned as-is and shown correctly in our utf8-configured terminal:
CODE:
mysql> select f, hex(f) from t where id =1;
+-----------+--------------------+
| f | hex(f) |
+-----------+--------------------+
| Köhntopp | 4BC3B6686E746F7070 |
+-----------+--------------------+
1 row in set (0.00 sec)
The output has a correct Umlaut in its left column - but the terminal used is configured to use utf8. The right column shows the bytes as they are before being converted for output (i.e. how they are stored on disk) - 0xC3 0xB6 is the utf8 encoding of an o-umlaut.
Now let's take a look at the dump:
CODE:
...
/*!40101 SET NAMES utf8 */;
...
DROP TABLE IF EXISTS `t`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `t` (
`id` int(10) unsigned NOT NULL,
`f` varchar(20) CHARACTER SET latin1 DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
/*!40101 SET character_set_client = @saved_cs_client */;
--
-- Dumping data for table `t`
--
LOCK TABLES `t` WRITE;
/*!40000 ALTER TABLE `t` DISABLE KEYS */;
INSERT INTO `t` VALUES (1,'KÃ¶hntopp');
/*!40000 ALTER TABLE `t` ENABLE KEYS */;
UNLOCK TABLES;
We see that mysqldump sets the connection to utf8. It always does this, no matter what encoding the data is in. This is not a problem: the charset used is always a superset of all the charsets in the database and can represent all the string data in the database, no matter what encoding it is in on disk.
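The superset claim can be illustrated outside MySQL: latin1 maps every possible byte to a Unicode code point, and utf8 can encode every Unicode code point, so a utf8 dump connection can carry any correctly-labeled latin1 data without loss. A small Python sketch:

```python
# Every byte value 0x00-0xFF is a valid latin1 character (U+0000-U+00FF),
# and every Unicode code point can be encoded in utf8 - so the round trip
# latin1 -> utf8 -> latin1 is lossless for all possible column contents.
every_latin1_string = bytes(range(256)).decode("latin-1")
roundtrip = every_latin1_string.encode("utf-8").decode("utf-8")
assert roundtrip == every_latin1_string
```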
The data labeled as latin1 from the column t.f thus is converted from latin1 to utf8 on output. But it already is utf8-encoded, i.e. in reality, we convert data from utf8 to utf8 again. The result can be seen in the dump: "KÃ¶hntopp" - doubly encoded utf8.
Take note: MySQL is doing the right thing. The problem is that a latin1 column contains data that has been encoded as utf8 because we used a connection labeled as latin1 to save utf8-encoded data. We lied to the server and now we get our just punishment.
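The double conversion can be reproduced outside MySQL; the same byte arithmetic, sketched in Python:

```python
stored = "Köhntopp".encode("utf-8")   # what really sits in the latin1 column
# mysqldump believes these bytes are latin1 and converts them to utf8:
dumped = stored.decode("latin-1").encode("utf-8")
print(dumped.decode("utf-8"))  # KÃ¶hntopp - doubly encoded utf8
```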
There are two solutions to fix this:
Solution 1: editing the dump
This hair-raising command line:
CODE:
server:~ # mysqldump --default-character-set latin1 kris t | sed -e 's/SET NAMES latin1/SET NAMES utf8/'
creates a dump through a latin1-labeled connection - the data is thus dumped in exactly the same way as our application reads it.
Unfortunately, the dump itself still says the data is latin1-encoded. This is still wrong - the data is utf8-encoded. The sed command corrects this by changing the encoding declaration from latin1 to utf8. Label and content now agree.
The data will now be written to the database in the correct way. One more thing to do with the new database is to ensure that the application uses "SET NAMES utf8" when it writes utf8 data.
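The sed step above is nothing more than a string substitution on the dump; a Python sketch of the same relabeling (the sample line is a typical mysqldump header comment):

```python
# Only the connection label in the dump is rewritten;
# the data bytes in the INSERT statements stay untouched.
dump_line = '/*!40101 SET NAMES latin1 */;'
fixed = dump_line.replace('SET NAMES latin1', 'SET NAMES utf8')
print(fixed)  # /*!40101 SET NAMES utf8 */;
```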
Solution 2: fixing the original database
As described in the manual, you can:
Warning
The CONVERT TO operation converts column values between the character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8). In this case, you have to do the following for each such column:
CODE:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;
The reason this works is that there is no conversion when you convert to or from BLOB columns.
So that is what we'll do:
CODE:
mysql> show create table t\G
Table: t
Create Table: CREATE TABLE `t` (
`id` int(10) unsigned NOT NULL,
`f` varchar(20) CHARACTER SET latin1 DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
1 row in set (0.00 sec)
In the old table, column f is defined as containing latin1 data, but contains utf8 data.
CODE:
mysql> alter table t modify column f varbinary(20);
Query OK, 1 row affected (0.63 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> alter table t modify column f varchar(20) charset utf8;
Query OK, 1 row affected (0.62 sec)
Records: 1 Duplicates: 0 Warnings: 0
We convert f from VARCHAR to VARBINARY - basically a VARCHAR without a character set label, simply a binary blob (convert VARCHAR to VARBINARY, CHAR to BINARY and TEXT to BLOB). The data inside the column is kept as-is, but it's not strings anymore. Rather, it's just bits without meaning (to MySQL), not belonging to any particular encoding.
Then, we convert column f back to something with an encoding, in this case VARCHAR(20) CHARSET utf8, i.e. the correct encoding for the bits.
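The key property of this two-step ALTER is that the bytes on disk never change - only the label does. In Python terms, a sketch of what happens when the relabeled bytes are finally read under the correct encoding:

```python
# The bytes as stored on disk (the hex value from the earlier SELECT).
column_bytes = bytes.fromhex("4BC3B6686E746F7070")

# After VARBINARY -> VARCHAR CHARSET utf8, the same bytes are simply
# interpreted as utf8 - which is what they were all along:
print(column_bytes.decode("utf-8"))  # Köhntopp
```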
Results:
CODE:
mysql> show create table t\G
Table: t
Create Table: CREATE TABLE `t` (
`id` int(10) unsigned NOT NULL,
`f` varchar(20) CHARACTER SET utf8 DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
1 row in set (0.00 sec)
mysql> select f, hex(f) from t;
+-----------+--------------------+
| f | hex(f) |
+-----------+--------------------+
| Köhntopp | 4BC3B6686E746F7070 |
+-----------+--------------------+
1 row in set (0.00 sec)
We now have utf8 data in a utf8 column. That is, column label and column data agree and the mechanisms in the database server work for us, no longer against us.
Preventing this error in the future
Never lie to your database about the character set encoding. If you have used SET NAMES latin1, only send latin1; if you use SET NAMES utf8, only send utf8. The database will convert your data correctly, no matter how your columns are labeled or whether you use latin1- or utf8-encoding applications in a wild mix.
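When the labels are truthful, the server's conversion is applied exactly once and is lossless. For illustration, a Python sketch of a latin1 column being read over a utf8 connection:

```python
# Truthfully-labeled latin1 column content:
on_disk = "Köhntopp".encode("latin-1")

# The server converts once, from the column charset to the
# connection charset - and the text survives intact:
over_the_wire = on_disk.decode("latin-1").encode("utf-8")
print(over_the_wire.decode("utf-8"))  # Köhntopp
```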
Comments
Peter Laursen on :
http://webyog.com/faq/content/34/152/en/my-special-characters-display-as-_noisy_-latin-characters-in-sqlyog.html (following quite a lot of user reports)
As explained here it is not necessary to dump data to fix the problem in the database. Only two ALTER TABLE statements (encoded data >> binary data >> encoded data) will do.
My experience is that it is particularly PHP users that mess up their data with 'double encoding' (and the mess-up described in the FAQ is only one example - there are more ways!).
Stewart Smith on :
The joys of MySQL being able to mangle your data have meant that as an option to drizzledump, as an aid for migrating data from MySQL, we have the option:
--my-data-is-mangled
Do not make a UTF8 connection to MySQL, use if you have UTF8 data in a non-UTF8 table
joy.