
House and Heisenberg having Replication Delay

So I am getting a mail with a complaint about rising replication delays in a certain replication hierarchy.

Not good, because said hierarchy is one of the important ones. As in 'If that breaks, people are sleeping under the bridge'-important.

The theory was that the change rate in that hierarchy is too high for the single-threaded nature of MySQL replication. That was supported by the observation that all affected boxes had no local datadir, but were filer clients. Filer clients as slaves are dying first because the SAN introduces communication latencies that local disks don't have, and the single-threaded nature of replication is not helpful here, either. Filers are better when it comes to concurrent accesses, really.

So if that theory held, it would really ruin my day. Make that month: said hierarchy is just now recovering from severe refactoring surgery and should have almost no exploitable technical debt left that could be leveraged for short-term scaling and tuning. If that thing really accumulates delay, we are in serious trouble.

Now, I am used to lies. People lie. Boxes lie. So let's fire up Graphite and have a look at how bad things are. I am choosing a random host from the supposedly sick hierarchy:

The brand spanking new replication load monitor in graphite is a great toy.


This is based on rather brilliant work from Mark Leith, which Dennis Kaarsemaker has expanded on. Using performance_schema we are breaking down the time spent in the SQL_THREAD, graphing not only the idle time but also the individual wait classes.
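
A minimal sketch of the kind of query behind such a graph, assuming 5.5's performance_schema tables and thread naming (the real collector rates and aggregates these counters before sending them to Graphite):

CODE:
-- Wait time per event for the slave SQL thread; SUM_TIMER_WAIT is in
-- picoseconds, hence the division to get seconds.
select ewt.event_name,
       ewt.sum_timer_wait / 1e12 as seconds_waited
from performance_schema.events_waits_summary_by_thread_by_event_name as ewt
  join performance_schema.threads as t using (thread_id)
where t.name = 'thread/sql/slave_sql'
order by ewt.sum_timer_wait desc;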

My SQL_THREAD on my random box uses some 20% of its time in InnoDB Log IO, and another 10-20% in InnoDB Data IO. 60% are idle. Monday at 13:00 somebody had a panic attack and set 'innodb_flush_log_at_trx_commit = 2' in the global configurator for that hierarchy, shedding the 20% of Log IO time (the Data IO time of course cannot be shed that way).
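
The panic change itself is presumably the runtime equivalent of a single dynamic variable, so no restart was involved:

CODE:
-- Value 2: write the log to the OS on every commit, but fsync it only
-- about once per second; cheaper on a filer, at a small durability cost.
set global innodb_flush_log_at_trx_commit = 2;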

Anyway, this is not the graph of a dying and overloaded replication hierarchy. We may be having localized slave problems, but not on this one. This is also proof that the problem is not originating at the master. Good news!

Further research and questioning reveals a list of affected servers. Two of them are quality control systems, and have no memory. They are not interesting. One is a potential production box, which is currently idling. This one has memory, but is slow in an unexplained way.

Since when? I am grabbing the replication delay graph and am zooming out. The problem started on 20-Sep in the afternoon, says the graph. So I am zooming through all metrics for this box for 20-Sep, looking for anything I can possibly find.

Symptoms: Acute and life-threatening DML weakness since 20-Sep, afternoon.


This box is obviously sick. On 20-Sep, in the afternoon, we are dropping from 150/250/50 I/U/D to infinitesimally minuscule performance, but not by hitting a wall. Instead we are looking at a kind of discharge curve, as if something is accumulating drag. Also, some time on 21-Sep the box recovers for a short time and then dies again, slowly.

That is a good hint. We know that we are not looking at a direct config change, and we also know from the second spike that recovery is possible.

We are not looking for a direct config change, and that is a good thing, because there aren't any in the recent history of that box.

CODE:
root@dba-master-01 [dba]> select 
  new.report_date, 
  new.config_variable, 
  old.config_value, 
  new.config_value 
from 
  v_global_variables as old join v_global_variables as new on 
    old.hostname = new.hostname and 
    old.config_variable = new.config_variable and 
    old.report_date + interval 1 day = new.report_date
where 
  old.hostname like @host and 
  new.hostname like @host and 
  old.config_value != new.config_value and 
  old.report_date > now() - interval 6 month;
+-------------+----------------------------+--------------+--------------+
| report_date | config_variable            | config_value | config_value |
+-------------+----------------------------+--------------+--------------+
| 2012-04-20  | thread_cache_size          | 100          | 500          |
| 2012-05-12  | innodb_max_dirty_pages_pct | 20           | 75           |
+-------------+----------------------------+--------------+--------------+
2 rows in set (0.82 sec)


We are digging deeper, and find enlightenment in Graphite, because we can see the inverse of the Activity Graph pattern here:

Causa: The InnoDB Purge Thread stopped working shortly after noon, 20-Sep, comes back for a short time the day after and then turns back to procrastination.


This looks as if Cpt. Undo Log is riding again and needs to be cluebatted into logout. Is he?
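
Whether Cpt. Undo Log really is at it again can be checked on the box itself; a quick sketch, using what a 5.5 server exposes:

CODE:
-- Any long-running transactions pinning old row versions?
select trx_id, trx_started, trx_mysql_thread_id, trx_query
from information_schema.innodb_trx
order by trx_started
limit 5;

-- The 'History list length' line in the TRANSACTIONS section is the undo
-- backlog the purge thread still has to work through:
show engine innodb status\G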

Next stop: Processlist. We have only two candidate processes: One is a root from anywhere, and the other a cacti login from the local host. My money is on the root. A short question on Jabber and a KILL on root's connection ID will show me if I am right: If you aren't on live@... in Jabber, you'll probably make noise if you are losing your connections to random KILLs.
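
For the record, the experiment itself is nothing fancier than this (the connection id below is made up; use whatever the processlist actually reports):

CODE:
show full processlist;
-- kill the suspect's connection id from the output above
kill 1234567;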

The effect is... Null.

Ok. Then I will backstab the Cacti just because I am having a bad day.

A dose of laxative (DBA KILL forte N) clears the box spontaneously, but recovery is not permanent at all.


Wunderbar. The box is getting up and dropping as if it had been overestimating its L&B skills. Whenever I am stabbing the cacti user, the Undo Log drops back to zero and we are regaining performance. Then it goes back into slow death.

The old trick with processlist works here, too: the login is from

CODE:
| 6694419 | cacti         | <hostname>:44467 | performance_schema |


Wait. performance_schema. Cacti?
This is not cacti, at all.

CODE:
# lsof -i -n -P | grep 44467
mysqld    17242      mysql  178u  IPv6 3599986352       TCP ...:3306->...:44467 (ESTABLISHED)
diamond   18596       root    1u  IPv4 3599986351       TCP ...:44467->10.147.206.122:3306 (ESTABLISHED)


This is Diamond, the data collector for Graphite, using the cacti login as a temporary deployment solution. And which module for Diamond was rolled out on 20-Sep, in the afternoon?

CODE:
# cd /etc/graphite/collectors
# ls -l MySQL*conf
...
-r-------- 1 root root 471 Sep 20 14:19 MySQLPerfCollector.conf


This is the Replication Load Monitor. Yes, the one in the first image, at the top of this article.

The RLM is supposed to be r/o, but for some reason, on about a promille of all boxes, it does its work inside a r/o transaction. And that transaction is apparently opened on connect and then held forever. Only on connection loss is the REPEATABLE READ view dropped, and of course the connection will also never time out, because Diamond is polling it every few seconds to acquire data.

A short time later, after Puppet has set Diamond's MySQLPerfCollector.conf to 'enabled = false', we achieve permanent recovery without a KILL:

Natural recovery after removal of the root cause, without a KILL.


Learnings:
  • Monitoring can have performance effects. Okay, no news here.
  • The Undo Log and long running transactions are performance poison. Okay, no news here, too.
  • It wasn't replication load, and also not the filer, even if both are popular blame targets at the moment.
  • Graphite is my bitch.


What House is saying.


And now let's look into the intestines of that Collector.

Looking at the source of the Collector, we can very quickly establish that it cannot be the source of the problem at all. So let's check what it monitors. On all affected boxes, I find that P_S is on, but

CODE:
mysql> select * from performance_schema.threads;
Empty set (0.01 sec)


That's interesting. Let's have a look at the instruments, then.

CODE:
mysql> select * from performance_schema.setup_instruments;
Empty set (0.03 sec)

mysql> select * from performance_schema.setup_timers;
Empty set (0.01 sec)

mysql> select * from performance_schema.setup_consumers;
Empty set (0.02 sec)


and that persists across server restarts. Also, there are no error messages in the server log. Why does that happen?

CODE:
# cd /mysql/*/data/performance_schema/
# ls -l
total 1840
-rw-rw---- 1 mysql mysql  8624 Oct  6  2011 cond_instances.frm
-rw-rw---- 1 mysql mysql 98304 Oct  6  2011 cond_instances.ibd
-rw-rw---- 1 mysql mysql    61 Oct  6  2011 db.opt
-rw-rw---- 1 mysql mysql  9220 Oct  6  2011 events_waits_current.frm
-rw-rw---- 1 mysql mysql 98304 Oct  6  2011 events_waits_current.ibd
...


and a SHOW CREATE TABLE confirms it: these tables are ENGINE=InnoDB.
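
The same can be seen in one go via information_schema; on a healthy box the engine column would read PERFORMANCE_SCHEMA instead:

CODE:
-- Which storage engine are the P_S tables actually using?
select table_name, engine
from information_schema.tables
where table_schema = 'performance_schema'
order by table_name;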

So this is what goes on:

The collector works in P_S only. Since P_S does not normally contain any InnoDB or other transactional tables, it can never start a transaction. On the broken boxes, though, the tables in P_S were created as InnoDB, by way of an installation error that happened in October 2011.

That alone would not affect anything, as every command in MySQL is by default a transaction of its own. For the observed behaviour to occur, Python also needs to set AUTOCOMMIT = 0. And, as a colleague helpfully points out, the Python code in the collector is indeed doing this: "Starting with 1.2.0, MySQLdb disables autocommit by default, as required by the DB-API standard (PEP-249)".

In that case, reading the InnoDB tables in P_S will start a r/o-txn and create a stable view, which is maintained until the end of the transaction. Which never comes, unless you disconnect. The ever-growing size of the undo log adversely affects the performance of the broken boxes to the point of total uselessness. Killing the Graphite monitoring user will provide only temporary relief.
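
In SQL terms, this is what the collector connection effectively does on the wire once MySQLdb has turned autocommit off (a sketch):

CODE:
set autocommit = 0;
-- The first read from an InnoDB table implicitly starts a transaction and
-- opens a REPEATABLE READ snapshot ...
select * from performance_schema.threads;
-- ... and no COMMIT or ROLLBACK ever follows, so the snapshot (and all the
-- undo needed to maintain it) lives until the connection dies.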

The actual solution is to run
CODE:
# mysql -BNe "select 
  concat('drop table ', table_name, ';') 
from 
  information_schema.tables 
where 
  table_schema = 'performance_schema'" | 
mysql performance_schema
# mysql_upgrade


to cleanly remove the InnoDB tables from the system without messing up the InnoDB data dictionary. After that, the mysql_upgrade run will reinstall the P_S correctly. A server restart is not needed.

The cause for the strange behavior™ of the affected boxes remains unclear, but the time interval of 6-19 Oct 2011 suggests failures in the early 5.1 to 5.5 migration procedure. The affected hierarchy was indeed a pilot customer of that migration.

Anyway, the collector is now back in service, and I am getting stats.

Like many MySQLers I am of the conviction that AUTOCOMMIT = 0 and SELECT ... FOR UPDATE are not really helpful in most cases. This case will be another argument in my arsenal.
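
If a long-lived monitoring connection has to live with a driver that turns autocommit off, the defensive pattern is simple (a sketch):

CODE:
-- Either put the session back into autocommit mode ...
set session autocommit = 1;
-- ... or end every polling round explicitly so no read view is kept open:
commit;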

Comments

Rob Smith on :

MySQLPerfCollector is a custom collector you (or your company) wrote? I just checked and the current mysql collector in diamond shouldn't be able to cause this issue :)

Thanks for using Diamond! :)

Kristian Köhntopp on :

MySQLPerfCollector.py is a custom collector, but with the performance_schema being InnoDB and a persistent connection from a continuously running collector, any collector working on performance_schema would have caused this issue.

There is nothing specific or unusual to the code of this particular collector.
