This is the English translation of an article in my German blog. This article, like the German original, is licensed CC-BY-SA. The English translation has been kindly provided by Tobias Klausmann.
Recently, I had to explain this to several people, hence a writeup for the blog for easier reference. The question:

I have content in my database that can be successfully read and written by my application, but if I do a mysqldump to transfer the data to a new system, all the non-ASCII characters such as umlauts are destroyed.

This happens if you save data to a DB with the wrong character encoding label.
In MySQL, every string has a label that describes the character encoding the string was written in (and should be interpreted in). The string _latin1"Köhntopp" thus (hopefully) is the character sequence K-0xF6-hntopp, and the string _utf8"Köhntopp" consequently should be K-0xC3 0xB6-hntopp. Problems arise as soon as the label (_latin1 or _utf8) does not match the encoding inside the string (0xF6 vs. 0xC3 0xB6).
This is outlined in more detail in Handling character sets, and you should have read that article before you continue.
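You can ask the server to show both byte sequences directly. A quick sketch, assuming a latin1-configured terminal and connection, using a charset introducer and CONVERT():
CODE:
mysql> set names latin1;
mysql> select hex(_latin1'Köhntopp') as as_latin1,
    ->        hex(convert(_latin1'Köhntopp' using utf8)) as as_utf8;
+------------------+--------------------+
| as_latin1        | as_utf8            |
+------------------+--------------------+
| 4BF6686E746F7070 | 4BC3B6686E746F7070 |
+------------------+--------------------+
The latin1 form needs a single byte for the o-umlaut (0xF6), the utf8 form needs two (0xC3 0xB6).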
Reproducing the problem
Let's define a table with a VARCHAR column that has the encoding label latin1.
CODE:
mysql> create table t (
    ->   id integer unsigned not null primary key,
    ->   f varchar(20) charset latin1
    -> ) engine = innodb;
Query OK, 0 rows affected (0.25 sec)
We'll also declare the connection we use to talk to the database to be latin1. That is, the character encoding of column t.f and the encoding of the connection match.
CODE:
mysql> set names latin1;
Query OK, 0 rows affected (0.00 sec)
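SET NAMES is shorthand for setting the client, connection and results character sets of the session in one go. You can verify what the session now uses:
CODE:
mysql> show variables like 'character_set_%';
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | latin1 |
| character_set_connection | latin1 |
| character_set_results    | latin1 |
+--------------------------+--------+
(output trimmed to the three variables that SET NAMES changes)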
What we are doing now is to send utf8 characters over a connection that is labeled as latin1. In essence, we're lying to the database:
CODE:
mysql> select hex('Köhntopp') as umlaut;
+--------------------+
| umlaut             |
+--------------------+
| 4BC3B6686E746F7070 |
+--------------------+
1 row in set (0.00 sec)
mysql> insert into t values ( 1, 'Köhntopp');
Query OK, 1 row affected (0.00 sec)
The data is encoded as utf8 but is transmitted over a connection labeled as latin1. The column t.f also has the label latin1. Hence, no conversion is necessary and MySQL stores the data as-is, i.e. exactly the way we sent it.
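As an aside, this is also the setup in which conversion would happen correctly: when label and bytes agree, the server can transcode reliably. A small sketch, assuming a utf8-configured terminal:
CODE:
mysql> set names utf8;
mysql> select hex('ö') as sent, hex(convert('ö' using latin1)) as as_latin1;
+------+-----------+
| sent | as_latin1 |
+------+-----------+
| C3B6 | F6        |
+------+-----------+
mysql> set names latin1;  -- back to the lying setup for the rest of this walkthrough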
If we read the data back, that works just fine. Data in column t.f is interpreted as latin1 and the connection used is latin1, too. Again, the data is returned as-is and is shown correctly in our utf8-configured terminal:
CODE:
mysql> select f, hex(f) from t where id =1;
+-----------+--------------------+
| f         | hex(f)             |
+-----------+--------------------+
| Köhntopp  | 4BC3B6686E746F7070 |
+-----------+--------------------+
1 row in set (0.00 sec)
The output has a correct umlaut in its left column - but the terminal used is configured to use utf8. The right column shows the bytes as they are stored on disk, before any conversion for output - 0xC3 0xB6 is the utf8 encoding of an o-umlaut.
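We can preview the damage the dump is about to show us, without running mysqldump at all. A sketch: reading the same row over a connection truthfully labeled utf8 makes the server convert the (mislabeled) latin1 column to utf8 on output:
CODE:
mysql> set names utf8;
mysql> select f, hex(f) from t where id = 1;
+-----------+--------------------+
| f         | hex(f)             |
+-----------+--------------------+
| KÃ¶hntopp | 4BC3B6686E746F7070 |
+-----------+--------------------+
mysql> set names latin1;
The bytes 0xC3 and 0xB6 are treated as the latin1 characters 'Ã' and '¶' and re-encoded into utf8 - doubly encoded utf8, as we are about to see in the dump.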
Now let's take a look at the dump:
CODE:
...
/*!40101 SET NAMES utf8 */;
...
DROP TABLE IF EXISTS `t`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `t` (
`id` int(10) unsigned NOT NULL,
`f` varchar(20) CHARACTER SET latin1 DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
/*!40101 SET character_set_client = @saved_cs_client */;
--
-- Dumping data for table `t`
--
LOCK TABLES `t` WRITE;
/*!40000 ALTER TABLE `t` DISABLE KEYS */;
INSERT INTO `t` VALUES (1,'KÃ¶hntopp');
/*!40000 ALTER TABLE `t` ENABLE KEYS */;
UNLOCK TABLES;
We see that mysqldump sets the connection to utf8. It always does this, no matter what encoding the data is in. This is not a problem: the charset used is a superset of all the charsets in the database and can represent all the string data in the database, no matter what encoding it is in on disk.
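For reference, the dump shown above comes from a plain invocation with no charset options; assuming the database is called kris, as in the command line later in this article:
CODE:
server:~ # mysqldump kris t > dump.sql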
The data labeled as latin1 from the column t.f thus is converted from latin1 to utf8 on output. But it already is utf8-encoded, i.e. in reality we convert data from utf8 to utf8 a second time. The result can be seen in the dump: "KÃ¶hntopp" - doubly encoded utf8.
Take note: MySQL is doing the right thing. The problem is that a latin1 column contains data that has been encoded as utf8, because we used a connection labeled as latin1 to save utf8-encoded data. We lied to the server, and now we get our just punishment.
There are two solutions to fix this:
Solution 1: editing the dump
This hair-raising command line:
CODE:
server:~ # mysqldump --default-character-set latin1 kris t | sed -e 's/SET NAMES latin1/SET NAMES utf8/'
creates a dump through a latin1-labeled connection - the data is thus dumped in exactly the same way as our application reads it. Unfortunately, the dump itself still says the data is latin1-encoded. This is still wrong - the data is utf8-encoded. The sed command corrects this by changing the encoding declaration from latin1 to utf8. Label and content now agree.
The data will now be written to the new database in the correct way. One more thing to do on the new system is to ensure that the application uses "SET NAMES utf8" when it writes utf8 data.
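Putting both steps together - a sketch, assuming we write the corrected dump to a file called dump.sql and load it into a database also called kris on the new server:
CODE:
server:~ # mysqldump --default-character-set latin1 kris t \
               | sed -e 's/SET NAMES latin1/SET NAMES utf8/' > dump.sql
server:~ # mysql kris < dump.sql
The SET NAMES utf8 we patched into the dump makes the server interpret the incoming data correctly during the load, so no extra charset option is needed on the mysql client.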
Solution 2: fixing the original database
As described in the manual, you can:
Warning
The CONVERT TO operation converts column values between the character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8). In this case, you have to do the following for each such column:
CODE:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;
The reason this works is that there is no conversion when you convert to or from BLOB columns.
So that is what we'll do:
CODE:
mysql> show create table t\G
Table: t
Create Table: CREATE TABLE `t` (
`id` int(10) unsigned NOT NULL,
`f` varchar(20) CHARACTER SET latin1 DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
1 row in set (0.00 sec)
In the old table, column f is defined as containing latin1 data, but actually contains utf8 data.
CODE:
mysql> alter table t modify column f varbinary(20);
Query OK, 1 row affected (0.63 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> alter table t modify column f varchar(20) charset utf8;
Query OK, 1 row affected (0.62 sec)
Records: 1 Duplicates: 0 Warnings: 0
We convert f from VARCHAR to VARBINARY - basically a VARCHAR without a character set label, simply a binary blob (convert VARCHAR to VARBINARY, CHAR to BINARY and TEXT to BLOB). The data inside the column is kept as-is, but it is no longer a string. Rather, it's just bits without meaning (to MySQL), not belonging to any particular encoding.
Then, we convert column f back to something with an encoding, in this case VARCHAR(20) CHARSET utf8, i.e. the correct encoding for the bits.
Results:
CODE:
mysql> show create table t\G
Table: t
Create Table: CREATE TABLE `t` (
`id` int(10) unsigned NOT NULL,
`f` varchar(20) CHARACTER SET utf8 DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
1 row in set (0.00 sec)
mysql> select f, hex(f) from t;
+-----------+--------------------+
| f         | hex(f)             |
+-----------+--------------------+
| Köhntopp  | 4BC3B6686E746F7070 |
+-----------+--------------------+
1 row in set (0.00 sec)
We now have utf8 data in a utf8 column. That is, column label and column data agree, and the mechanisms in the database server work for us, no longer against us.
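If a real table has several affected columns, the same two-step dance applies to each of them, and both steps can be batched into a single ALTER per direction. A hypothetical sketch (table and column names invented for illustration):
CODE:
-- step 1: strip the wrong labels, keeping the bytes as-is
ALTER TABLE customers MODIFY name VARBINARY(80),
                      MODIFY city VARBINARY(40);
-- step 2: attach the label that matches the bytes
ALTER TABLE customers MODIFY name VARCHAR(80) CHARSET utf8,
                      MODIFY city VARCHAR(40) CHARSET utf8;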
Preventing this error in the future
Never lie to your database about the character set encoding. If you have used SET NAMES latin1, only send latin1; if you use SET NAMES utf8, only send utf8. The database will convert your data correctly, no matter how your columns are labeled or whether your applications use latin1 and utf8 encodings in a wild mix.
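In practice this means setting the connection character set explicitly right after connecting, instead of relying on defaults. A sketch for the command line client and for application sessions, reusing our example table:
CODE:
server:~ # mysql --default-character-set=utf8 kris   # client sets utf8 for the whole session

mysql> set names utf8;                         -- equivalent, from application code right after connecting
mysql> insert into t values (2, 'Köhntopp');   -- utf8 terminal, utf8 label: no lie this time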