Replication - now and then
One of the major contributing factors to the success of MySQL is the ease and simplicity of its replication. Read slaves for scaleout and backup slaves for non-disruptive backups have been the norm in every MySQL installation I have seen in the last two years.
So how does replication work? And how shall it be expanded in the future?
What is available?
The binlog written by MySQL currently logs all statements that change the tablespace; it is a serialization of all tablespace changes. The binlog position, expressed as a (binlog name, offset) pair, is a database-global timestamp. A wall-clock timestamp of the form seconds.fraction cannot do this job at any precision, because on a multi-core machine multiple things can happen at the same instant.
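For illustration, this is how you ask the server for that position pair; the file name and offset shown are invented example values:

    -- The database-global "timestamp": the current binlog file and offset.
    SHOW MASTER STATUS;
    -- File: mysql-bin.000042, Position: 1067233   (example values)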
If you want to make a consistent full backup of the database, the database must not change during the backup; that is, it must be possible to associate exactly one binlog position with the backup.
In fact, if you have such a backup - one associated with a binlog position - and you happen to have the binlogs from that time until now, you can do a point-in-time (PIT) recovery: restore the full backup, then replay the binlog from the backup's binlog position until now. That is why it is important to store the binlog on a filesystem that fails independently of the rest of your MySQL installation. It is also why you must not filter the binlog written by MySQL using binlog-do-db and binlog-ignore-db - if you do, you get an incomplete binlog that is useless in a PIT recovery scenario.
A slave in MySQL is nothing but a binlog downloader and executor: the slave is restored from a PIT-capable full backup, and is then told the binlog position of that backup and where to log in to fetch the missing binlog. The slave's IO_THREAD logs into the master server and downloads the binlog to local disk as fast as possible, storing it as the relay log. The slave's SQL_THREAD then executes the relay log as fast as possible. Replication can thus be thought of as an ongoing live recovery.
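A minimal sketch of that procedure in SQL, assuming the backup was taken at the invented position from above; host and credentials are placeholders:

    -- Point the freshly restored slave at the master and at the
    -- binlog position recorded with the backup:
    CHANGE MASTER TO
      MASTER_HOST     = 'master.example.com',
      MASTER_USER     = 'repl',
      MASTER_PASSWORD = 'secret',
      MASTER_LOG_FILE = 'mysql-bin.000042',
      MASTER_LOG_POS  = 1067233;
    START SLAVE;           -- starts both the IO_THREAD and the SQL_THREAD
    SHOW SLAVE STATUS\G    -- watch Slave_IO_Running and Slave_SQL_Running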
Time delayed replication - most customers want that
Unfortunately, this kind of recovery only helps against server crashes - it does not help against administrator oopses, because the slave will faithfully replicate every DROP DATABASE of doom a DBA enters on the master. If MySQL had time-delayed replication, things would be different. Actually, it has one: Kay Röpke has written a patch for it; MySQL "just" has to test and include it.
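I am not showing Kay's patch here; the following is only a sketch of the kind of knob one would want, with hypothetical syntax:

    -- Hypothetical: keep the slave's SQL_THREAD 30 minutes behind the
    -- master, leaving a window in which to stop the slave before an
    -- accidental DROP DATABASE from the master is applied locally.
    CHANGE MASTER TO MASTER_DELAY = 1800;
    START SLAVE;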
Integrity checks - many customers need that
Also, replication is currently quite fragile: there is no built-in mechanism to guarantee the integrity of the data while it is being transferred from the master to the slave, and no built-in way to verify that tables are identical on the master and the slave. Well, actually, there is, because Baron 'Xaprb' Schwartz has written the indispensable MySQL Toolkit, which contains not only a very efficient way to calculate table checksums on the master and the slave, but also a mechanism to create a minimal set of statements that synchronizes a slave with its master if there are differences. MySQL replication should really have something equivalent built into it.
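Lacking the toolkit, the crude built-in approximation is CHECKSUM TABLE, run on both servers and compared by hand; the table name here is just an example, and mysql-table-checksum is considerably smarter about the whole process:

    -- Run on the master and on the slave, then compare the results:
    CHECKSUM TABLE orders EXTENDED;   -- 'orders' is an example table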
As Baron said in his talk at the MySQL UC 2007: "I never thought I'd need mysql-table-sync until I had it and actually saw all the differences between my master and the slaves."
Server side filtering - without messing up the backup
Replication over a long-distance link is currently not very efficient, and with the advent of row-based replication in 5.1 it will become even less so. That is because the only properly working filter for partial replication is replicate-wild-do-table/replicate-wild-ignore-table, and that filter is applied only in the slave's SQL_THREAD: the slave still downloads all of the binlog, and only then starts to throw away the statements that are not applicable. This filtering really needs to be pushed upstream, so that it can happen in the master's connection handler for the slave - essentially this amounts to expanding the SHOW BINLOG EVENTS syntax to accept an arbitrary list of "replicate-wild-(do|ignore)-table" patterns.
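To make the idea concrete, here is a purely hypothetical extension of the existing syntax; none of this exists today:

    -- Hypothetical syntax: the master applies the patterns before
    -- shipping events, instead of the slave filtering after download.
    SHOW BINLOG EVENTS IN 'mysql-bin.000042'
      WITH REPLICATE_WILD_DO_TABLE     ('shop.%')
      WITH REPLICATE_WILD_IGNORE_TABLE ('shop.session%');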
Keeping busy more than one CPU
Also, replication slaves are currently seriously limited by the fact that the binlog is a serialization of the statements applied to the master: if you have a slave that does nothing but replicate (we call this a "backup slave", as opposed to a "read slave" that is used for scaleout), that slave can hardly utilize more than one core. Statement-based replication is notoriously hard to parallelize, but row-based replication is not, at least as long as you are dealing with rows that have proper primary keys: two transactions that are stored sequentially in the binlog can be applied in parallel when
- all rows in all tables in these transactions have properly defined primary keys
- and the intersection of these PK sets is empty; a worked example follows below.
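The worked example, with invented tables and values:

    -- Transaction 1 touches the PK set { orders: 17 }:
    UPDATE orders SET status = 'shipped' WHERE order_id = 17;
    -- Transaction 2 touches the PK set { customers: 23 }:
    UPDATE customers SET last_seen = NOW() WHERE customer_id = 23;
    -- The two PK sets are disjoint, so a parallel applier could run
    -- these transactions concurrently without changing the outcome.
    -- A conflicting pair, by contrast, must stay serialized:
    UPDATE orders SET status = 'shipped'   WHERE order_id = 17;
    UPDATE orders SET status = 'cancelled' WHERE order_id = 17;  -- same PK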
The problems of a global ID (and how to synchronize instead)
Replication is currently completely asynchronous, and that is a good thing, even if MySQL beginners usually hate that fact. Asynchronicity has one important property in scaleout: it is damn fast. Every asynchronous system can be made synchronous by adding waits; usually we do not want waits, and we certainly do not want waits by default anywhere in our system.
Most people who need synchronization of any kind in MySQL currently rely on the fact that the SQL_THREAD is single-threaded: a client that needs synchronization pushes a heartbeat t4 into a heartbeat table after applying transactions t1, t2, t3 to the master. The heartbeat table usually contains a primary key pair (ip, pid) and a value, the sequence number (seq). The (ip, pid) pair is guaranteed to be unique in our installation, and (seq) is like a timestamp, except that it can be incremented at will and faster than once per second.
The client can then look at a slave, and if it finds the desired (seq) for the given (ip, pid) pair, it knows that the slave has also seen the transactions t1-t3 that were committed before that heartbeat, because the binlog is applied strictly in order.
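A minimal sketch of the scheme, with invented names and values:

    -- The heartbeat table as described above:
    CREATE TABLE heartbeat (
      ip  VARCHAR(15)     NOT NULL,  -- client IP address
      pid INT UNSIGNED    NOT NULL,  -- client process id
      seq BIGINT UNSIGNED NOT NULL,  -- client-chosen sequence number
      PRIMARY KEY (ip, pid)
    );
    -- On the master, after committing t1..t3, push heartbeat t4:
    INSERT INTO heartbeat (ip, pid, seq) VALUES ('10.0.0.5', 4711, 42)
      ON DUPLICATE KEY UPDATE seq = VALUES(seq);
    -- On a slave, check whether it has caught up to that point:
    SELECT seq >= 42 AS caught_up
      FROM heartbeat WHERE ip = '10.0.0.5' AND pid = 4711;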
If the desired heartbeat cannot be found on the chosen slave, a client can either wait for a short time (say, 500 ms) and retry, or redirect the read to the master immediately. We usually read from the master rather than trying a different slave, because
- there is no telling whether another slave would have the data if we tried it, but the master is guaranteed to have it
- we already have a connection open to the master for our writes
- adding read load to the master when slaves are lagging will slow down the master and allow the slaves to catch up :-)
When we parallelize replication with the PK intersection algorithm outlined above, this no longer works. We need a special synchronization primitive: a way to mark a write statement such as an INSERT or UPDATE as a non-parallelizable or barrier statement. Another way to handle this would be global transaction IDs. The latter would theoretically also be useful in failover cases, because they could help tell the difference between the failed master and the slave taking over for that master, but I cannot see this yet. Either global IDs introduce some central synchronization mechanism (which comes down to an implementation of two-phase commit, and that will be very, very slow), or the binlogs of the master and the slave are not semantically identical. Before I believe in global IDs, I'd like to see more about that concept and an explanation of how it is different from (read: faster than) 2PC and still correct in all cases.
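What such a barrier marker would look like is anyone's guess; as a purely hypothetical sketch, it could be a hint that the parallel applier honors:

    -- Hypothetical: force the parallel applier to drain all in-flight
    -- transactions before applying this statement, and to start nothing
    -- new until it has been applied.
    INSERT /* replication:barrier */ INTO heartbeat (ip, pid, seq)
      VALUES ('10.0.0.5', 4711, 43)
      ON DUPLICATE KEY UPDATE seq = VALUES(seq);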
And what do you think?
What are your thoughts on replication? Jimmy wants to know, and he has set up something on Zoomerang to enable you to tell him.