Character Set Theory
A character set is just a collection of symbols. Thus, a character set is a completely abstract and nearly useless thing. The only thing that you can do with a character set is decide if a certain symbol is allowed in a certain context or not.
To make things printable, each symbol needs a shape. A collection of shapes for a character set is called a font. For example, "
ö" is a shape for LOWER CASE O-UMLAUT in the Arial font and "
ö" is the same symbol in another font, Times New Roman.
To make symbols accessible to computers, they need a binary representation, an encoding. In my terminal, KDE konsole, utf8 encoding is being used by default. The string "Köhntopp" is being represented as the hex sequence "4b c3 b6 68 6e 74 6f 70 70". As you can see, LOWER CASE o-UMLAUT is encoded as C3 B6.
CODE:
kris@linux:~> od -t x1a
Köhntopp
0000000 4b c3 b6 68 6e 74 6f 70 70 0a
K C 6 h n t o p p nl
0000012
I can select a different encoding from the menu: Settings->Encoding->Iso-8859-1 changes things to "4b f6 68 6e 74 6f 70 70 0a". Now the encoding for LOWER CASE o-UMLAUT is a single byte, F6.
CODE:
kris@linux:~> od -t x1a
Köhntopp
0000000 4b f6 68 6e 74 6f 70 70 0a
K v h n t o p p nl
0000011
Now, if we want to compare or order (sort) two strings, things get even more complicated: You will need a collation that matches your encoding. A collation can be thought of as a virtual canonical encoding of a string that is being used for comparison and sorting.
For example, the MySQL collation latin1_german1_ci internally transcodes "Köhntopp" into "kohntopp" before comparing it to other strings - it is a ci (case insensitive) collation, so all symbols are transcoded into their lower case equivalents, and ö is sorted as o in german1 collation. There is another german collation, latin1_german2_ci, which is used only in german phone books (don't ask). That one will compare "Köhntopp" as "koehntopp", transcoding ö into the two letter sequence oe.
Collations are virtual transcodings - in your columns you will always find the original string, "Köhntopp".
Character sets in the server
For some reason MySQL is calling an encoding a CHARACTER SET or CHARSET. The encodings available inside the server can be listed with SHOW CHARSET. They are also available in INFORMATION_SCHEMA.CHARACTER_SETS. In both cases the column "Maxlen" shows you how many bytes are needed to store a single character, worst case.
The list of supported collations is displayed with SHOW COLLATION, and is also available as INFORMATION_SCHEMA.COLLATIONS. A collation is useful only in the context of its matching encoding, so the INFORMATION_SCHEMA.CHARACTER_SETS can be joined into a n:m relationship with INFORMATION_SCHEMA.COLLATIONS through INFORMATION_SCHEMA. COLLATION_CHARACTER_SET_APPLICABILITY (or you do SHOW COLLATION LIKE '...%').
Chosing a encoding and a collation
Each string in MySQL is tagged with an encoding and a collation. For database objects this happens at the column level. A column with CHAR, VARCHAR or any TEXT type always has a CHARSET and COLLATION.
If you do not specify them when defining that column, the column inherits from the table which inherits from the database which inherits from the server default.
So you may specify in your my.cnf:
CODE:
[mysqld]
default-character-set=latin1
default-collation=latin1_german1_ci
Or you do it when you create a schema:
CODE:
root@localhost [(none)]> create database kris charset latin1 collate latin1_german1_ci;
Query OK, 1 row affected (0.00 sec)
root@localhost [(none)]> show create database kris\G
Database: kris
Create Database: CREATE DATABASE `kris`
/*!40100 DEFAULT CHARACTER SET latin1
COLLATE latin1_german1_ci */
1 row in set (0.00 sec)
For a table and its columns you might do:
CODE:
root@localhost [kris]> create table t (
id integer unsigned not null auto_increment primary key,
c char(20) charset utf8 collate utf8_general_ci,
d varchar(20) charset cp850 collate cp850_general_ci,
t text charset latin1 collate latin1_german1_ci
) charset latin1 collate latin1_general_ci;
Query OK, 0 rows affected (0.01 sec)
root@localhost [kris]> show create table t\G
Table: t
Create Table: CREATE TABLE `t` (
`id` int(10) unsigned NOT NULL auto_increment,
`c` char(20) character set utf8 default NULL,
`d` varchar(20) character set cp850 default NULL,
`t` text character set latin1 collate latin1_german1_ci,
PRIMARY KEY (`id`)
) ENGINE=MyISAM
DEFAULT CHARSET=latin1
COLLATE=latin1_general_ci
1 row in set (0.00 sec)
Of course, when you put data into your columns, the database has to know what encoding was used to send or type the data. This is needed because it is legal to enter LOWER CASE o-UMLAUT from an utf8 terminal to insert it into kris.t.t. But in order to do this, the database has to convert the symbol from utf8 (C3B6) to latin1 (F6).
You can set your default-character-set and default-encoding inside the [mysql] section of your my.cnf or change it online using SET NAMES at any time.
By selecting in KDE konsole Settings->Encoding->utf8 einstelle and then sending "SET NAMES utf8" in the mysql command line client, I am not only sending actual utf8 data, but also inform the server of this fact.
CODE:
root@localhost [kris]> set names utf8;
Query OK, 0 rows affected (0.00 sec)
root@localhost [kris]> select hex("ö");
+-----------+
| hex("ö") |
+-----------+
| C3B6 |
+-----------+
1 row in set (0.00 sec)
When I change this in KDE konsole to Settings->Encoding->latin1 and then "SET NAMES latin1", I will get instead
CODE:
root@localhost [kris]> set names latin1;
Query OK, 0 rows affected (0.00 sec)
root@localhost [kris]> select hex("ö");
+----------+
| hex("ö") |
+----------+
| F6 |
+----------+
1 row in set (0.00 sec)
Now let us insert umlauts into kris.t.t and see what happens:
CODE:
root@localhost [kris]> set names utf8;
Query OK, 0 rows affected (0.00 sec)
root@localhost [kris]> select hex("ö");
+-----------+
| hex("ö") |
+-----------+
| C3B6 |
+-----------+
1 row in set (0.00 sec)
root@localhost [kris]> insert into kris.t (t) values ("ö");
Query OK, 1 row affected (0.00 sec)
root@localhost [kris]> select hex(t), t from kris.t;
+--------+------+
| hex(t) | t |
+--------+------+
| F6 | ö |
+--------+------+
1 row in set (0.00 sec)
So my client is encoding to utf8. I can control this ("set names utf8") and double check it (I get C3B6). I am storing this utf8 encoded symbol into a column that is marked as latin1 encoded, and mysql changes the encoding for that symbol and stores the proper code (F6). I can double check by selecting hex(t) and t: The LOWER CASE o-UMLAUT is really being stored in latin1 encoding, but it shown properly as "ö" in my utf8 terminal. So MySQL even did code the value back properly!
Of course, you can encode "クリス" (isotopu) in utf8, but not in latin1. So what happens if you try to store utf8 isotopu in a latin1 column? We can try.
CODE:
root@localhost [kris]> insert into kris ( t ) values ("クリス");
ERROR 1406 (22001): Data too long for column 't' at row 1
And the other way around?
CODE:
root@localhost [kris]> alter table t change column t t varchar(100) charset utf8 not null;
root@localhost [kris]> insert into kris.t (t) values ("イソトプ");
root@localhost [kris]> set names latin1;
kris@localhost [kris]> select t, hex(t) from t;
+----------+--------------------+
| t | hex(t) |
+----------+--------------------+
| Köhntopp | 4BC3B6686E746F7070 |
| ??? | E382AFE383AAE382B9 |
+----------+--------------------+
2 rows in set (0.00 sec)
As you can see, things that cannot be represented in latin1 are transcoded to "?".
Loading a mysqldump
If you have dumped your old MySQL 4.0 data, you'll end up with database and table definitions without any character set information. You'll have to provide sensible defaults for this on your new server. Sensible means: So that the new databases and tables will be created with the CHARSETS and COLLATIONS you want. The easiest way to do this is to look for the CREATE DATABASE statements in your dump file and edit them the way you need them. These defaults will then be inherited by anything below that.
The connection used to read your dump also has to have proper encoding information. The easiest way to find that is to filter your dump through "od -t x1a" and search for a few special characters. Which encoding is used by your dumpfile?
That encoding must then be used when reading the dumpfile. The easiest way to force this is to edit a SET NAMES statement into the file.
CODE:
--
-- Table structure for table `t`
--
DROP TABLE IF EXISTS `t`;
CREATE TABLE `t` (
`id` int(10) unsigned NOT NULL,
`c` char(20) default NULL,
`d` varchar(20) default NULL,
`t` text,
PRIMARY KEY (`id`)
) TYPE=MyISAM;
--
-- Dumping data for table `t`
--
/*!40000 ALTER TABLE `t` DISABLE KEYS */;
LOCK TABLES `t` WRITE;
INSERT INTO `t` VALUES (1,NULL,NULL,'ö');
UNLOCK TABLES;
/*!40000 ALTER TABLE `t` ENABLE KEYS */;
...
linux:/export/data/charset # od -t x1a kris.sql
...
0001560 53 20 28 31 2c 4e 55 4c 4c 2c 4e 55 4c 4c 2c 27
S sp ( 1 , N U L L , N U L L , '
0001600 f6 27 29 3b 0a 55 4e 4c 4f 43 4b 20 54 41 42 4c
v ' ) ; nl U N L O C K sp T A B L
...
We now can easily modify that file and ament it with a "CREATE DATABASE kris CHARSET utf8 COLLATE utf8_general_ci" and a "SET NAMES latin1". We end up with a database names "kris", which stores strings encoded in utf8. Because the connection was defined as being latin1, all data will be read as latin1 encoded, transcoded to utf8 and then stored inside the database as utf8 data.
For your clients that is not relevant at all: They define with SET NAMES what encoding they prefer for their results, and MySQL will then deliver results in the selected encoding (if possible)
Convert data yourself
A string constant can be tagged with an encoding.
CODE:
root@localhost [kris]> select _utf8'Köhntopp';
+-----------+
| Köhntopp |
+-----------+
| Köhntopp |
+-----------+
1 row in set (0.00 sec)
This is your average "SELECT 'Köhntopp'", only that we tagged the string with an encoding independently from whatever SET NAMES said.
We can also force a certain encoding:
CODE:
root@localhost [kris]> select hex(convert('Köhntopp' using latin1)) as Beispiel;
+------------------+
| Beispiel |
+------------------+
| 4BF6686E746F7070 |
+------------------+
1 row in set (0.03 sec)
Here my utf8-Köhntopp from the connection has been transcoded into a latin1-Köhntopp with F6-umlaut.
The cost of utf8
Unicode is a character set with more than 256 symbols. utf8 is a multibyte encoding for this character set with variable length. Some symbols ("a", "Z") are encoded as single bytes, others ("ö") as sequences of bytes. The subset of unicode supported by MySQL can use up to 3 bytes for a character. So "SHOW CHARSET" display a Maxlen of 3 for utf8.
Another encoding of unicode is ucs2. While utf8 is being used by Unix and Java, ucs2 is preferred by Windows systems. ucs2 encodes all Unicode-Characters with a fixed length of 2 bytes. So Maxlen is shown as 2.
Compare:
CODE:
root@localhost [kris]> select hex(convert("Köhntopp" using ucs2)) as x;
+--------------------------------------+
| x |
+--------------------------------------+
| 004B00F60068006E0074006F00700070 |
+--------------------------------------+
1 row in set (0.00 sec)
root@localhost [kris]> select hex(convert("Köhntopp" using latin1)) as x;
+----------------------------------------+
| x |
+----------------------------------------+
| 4BF6686E746F7070 |
+----------------------------------------+
1 row in set (0.00 sec)
root@localhost [kris]> select hex("Köhntopp") as x;
+--------------------+
| x |
+--------------------+
| 4BC3B6686E746F7070 |
+--------------------+
1 row in set (0.00 sec)
In MySQL a CHAR(20) is defined a string of 20 characters. A character can be more than one byte, so MySQL has to reserve Maxlen * string length number of bytes. A CHAR(20) in utf8 will use up to 60 byte.
A VARCHAR(20) is a string of variable length. It will use up as many bytes as needed, plus one or two bytes to indicate the string length. A VARCHAR(20) in utf8 uses between 0 and 60 bytes plus the length byte.
In indexes MySQL operates in memory with static buffers and has to do worst case arithmetic. So even if MySQL stores index data in short or even prefix compressed records, internally utf8 counts with 3 bytes per character against the index length limits (1024 Bytes in general, 767 bytes in InnoDB). This leaves you with 343 and 255 characters, repective.
Recommendations are as follows:
- Define the server, the schema and the table always as single byte character sets, e.g. latin1.
- Define your connection as whatever you want. That might as well be utf8, even if the schema is latin1.
- Define columns that need a national character set or a national collation with that national single byte character set if possible. For example, a column for turkish text may be latin5 and a column for russion text may be latin2.
- Define columns that hold truly multilingual content and only these as utf8. Watch memory usage for storage and in index length limits.