Fighting the mysqld init script
Today we discovered a particularly subtle way of fucking up a server restart. After a routine configuration change and RPM upgrade, a colleague tried to restart an important master. That failed. The message:
The colleague tried multiple times, and finally resorted to manually typing a
That took care of production for now and left us with an investigation. Why is the init script trying to start the MySQL manager?
It is not, and never tried to. What happen?
CODE:
root@master ~]# /etc/init.d/mysql start
Starting MySQLCouldn't find MySQL manager (//bin/mysqlmanag[FAILED]erver (//bin/mysqld_safe)
Starting MySQLCouldn't find MySQL manager (//bin/mysqlmanag[FAILED]erver (//bin/mysqld_safe)
The colleague tried multiple times, and finally resorted to manually typing a
CODE:
nohup mysqld_safe ...
into a screen console, which he detached. That took care of production for now and left us with an investigation. Why is the init script trying to start the MySQL manager?
It is not, and never tried to. What happen?
To debug, I copied /etc/init.d/mysql to another location and instrumented a shell debugger. :-) That is, I changed the following lines:
In english: I let the script echo the command instead of executing it, and then I save all shell variables to a file and exit.
Looking at the variable dump, I can see that
So it is proven that the script never tried to actually execute the manager. In fact, the actual error message is
It is the red-colored [FAILURE] printed on top of that by the operating system sugar coating for the boot process that obscures the message and makes things fail.
The error message is neither unclear nor incomplete by itself. The only way it could possibly be cleaner would be
So the real error is that it cannot find //bin/mysqld_safe, which is pretty obvious, because it is /usr/bin/mysqld_safe. $bindir has the wrong content.
How so?
Here is how it is set:
Also, from the error message with the double-/ ("//bin/mysqld_safe") we now know that basedir is forced, and set to /.
And lo and behold:
There are several things to be learned here:
Oh. And we removed the [mysqld]-section from the .my.cnf file in root's home.
CODE:
287 case "$mode" in
288 'start')
289 # Start daemon
...
303 echo $echo_n "Starting MySQL"
304 if test -x $manager -a "$use_mysqld_safe" = "0"
305 then
...
311 # Give extra arguments to mysqld with the my.cnf file. This script may
312 # be overwritten at next upgrade.
313 echo "$manager" \
314 --mysqld-safe-compatible \
315 --user="$user" \
316 --pid-file="$pid_file" >/dev/null 2>&1 &
317 set > x
318 exit
319 wait_for_pid created $!; return_value=$?
...
327 elif test -x $bindir/mysqld_safe
328 then
329 # Give extra arguments to mysqld with the my.cnf file. This script
330 # may be overwritten at next upgrade.
331 pid_file=$server_pid_file
332 echo $bindir/mysqld_safe --datadir=$datadir --pid-file=$server_pid_file $other_args >/dev/null 2>&1 &
333 set > x
334 exit
335 wait_for_pid created $!; return_value=$?
288 'start')
289 # Start daemon
...
303 echo $echo_n "Starting MySQL"
304 if test -x $manager -a "$use_mysqld_safe" = "0"
305 then
...
311 # Give extra arguments to mysqld with the my.cnf file. This script may
312 # be overwritten at next upgrade.
313 echo "$manager" \
314 --mysqld-safe-compatible \
315 --user="$user" \
316 --pid-file="$pid_file" >/dev/null 2>&1 &
317 set > x
318 exit
319 wait_for_pid created $!; return_value=$?
...
327 elif test -x $bindir/mysqld_safe
328 then
329 # Give extra arguments to mysqld with the my.cnf file. This script
330 # may be overwritten at next upgrade.
331 pid_file=$server_pid_file
332 echo $bindir/mysqld_safe --datadir=$datadir --pid-file=$server_pid_file $other_args >/dev/null 2>&1 &
333 set > x
334 exit
335 wait_for_pid created $!; return_value=$?
In english: I let the script echo the command instead of executing it, and then I save all shell variables to a file and exit.
Looking at the variable dump, I can see that
CODE:
use_mysqld_safe=1
So it is proven that the script never tried to actually execute the manager. In fact, the actual error message is
CODE:
log_failure_msg "Couldn't find MySQL manager ($manager) or server ($bindir/mysqld_safe)"
It is the red-colored [FAILURE] printed on top of that by the operating system sugar coating for the boot process that obscures the message and makes things fail.
The error message is neither unclear nor incomplete by itself. The only way it could possibly be cleaner would be
CODE:
if [ "$use_mysqld_safe" = 0 ]
then
log_failure_msg "Couldn't find server ($bindir/mysqld_safe)"
else
log_failure_msg "Couldn't find MySQL manager ($bindir/mysqlmanager)"
fi
then
log_failure_msg "Couldn't find server ($bindir/mysqld_safe)"
else
log_failure_msg "Couldn't find MySQL manager ($bindir/mysqlmanager)"
fi
So the real error is that it cannot find //bin/mysqld_safe, which is pretty obvious, because it is /usr/bin/mysqld_safe. $bindir has the wrong content.
How so?
Here is how it is set:
CODE:
if test -z "$basedir"
then
basedir=/
bindir=/usr/bin
if test -z "$datadir"
then
datadir=/var/lib/mysql
fi
sbindir=/usr/sbin
libexecdir=/usr/sbin
else
bindir="$basedir/bin"
if test -z "$datadir"
then
datadir="$basedir/data"
fi
sbindir="$basedir/sbin"
libexecdir="$basedir/libexec"
fi
This code checks if $basedir is forced ('test -z "$basedir') and if it is NOT, it assumes FHS layout with the tools in /usr/bin where they belong. If $basedir is forced, it assumes instead what I tend to call /opt layout, with the tools in /opt/bin if for example $basedir were set to /opt.then
basedir=/
bindir=/usr/bin
if test -z "$datadir"
then
datadir=/var/lib/mysql
fi
sbindir=/usr/sbin
libexecdir=/usr/sbin
else
bindir="$basedir/bin"
if test -z "$datadir"
then
datadir="$basedir/data"
fi
sbindir="$basedir/sbin"
libexecdir="$basedir/libexec"
fi
Also, from the error message with the double-/ ("//bin/mysqld_safe") we now know that basedir is forced, and set to /.
And lo and behold:
CODE:
[root@master ~]# pwd
/root
[root@master ~]# cat .my.cnf
[client]
socket=/mysql/master/data/mysql.sock
password=...
[mysqld]
basedir=/
When did that happen? Well, git will know./root
[root@master ~]# cat .my.cnf
[client]
socket=/mysql/master/data/mysql.sock
password=...
[mysqld]
basedir=/
CODE:
[root@master root]# git log -p .my.cnf
commit 306254afee7293887a85884ecaaff9b2d13bb9af
Author: store_local_config <dba.cron>
Date: Mon Nov 8 14:03:11 2010 +0100
Changes committed by store_local_config
diff --git a/root/.my.cnf b/root/.my.cnf
index e871dad..6551caa 100644
--- a/root/.my.cnf
+++ b/root/.my.cnf
@@ -1,3 +1,6 @@
[client]
socket=/mysql/master/data/mysql.sock
password=...
+
+[mysqld]
+basedir=/
So at least we know when this happened, but not the who and why.commit 306254afee7293887a85884ecaaff9b2d13bb9af
Author: store_local_config <dba.cron>
Date: Mon Nov 8 14:03:11 2010 +0100
Changes committed by store_local_config
diff --git a/root/.my.cnf b/root/.my.cnf
index e871dad..6551caa 100644
--- a/root/.my.cnf
+++ b/root/.my.cnf
@@ -1,3 +1,6 @@
[client]
socket=/mysql/master/data/mysql.sock
password=...
+
+[mysqld]
+basedir=/
There are several things to be learned here:
- The start/stop script picks up server options from the local users .my.cnf files. This feels deeply dangerous and wrong to me.
- There is no internal state inside the script that marks the used layout type as either FHS or /opt. It is impossible later in the script to decide what happened in the original $basedir test and print that for debug.
- Operating system boot message decoration can really bite you in the ass.
- Obsessive-compulsive configuration change tracking is a good thing, expecially on a master.
Oh. And we removed the [mysqld]-section from the .my.cnf file in root's home.
Comments
Display comments as Linear | Threaded
Luciano Rocha on :
Kristian Köhntopp on :
That left me with an annotated execution trace in ~kris/log and a variable dump at the exit point in ~kris/x, and ultimately let to the solution that is this writeup.
Mike on :
http://dev.mysql.com/doc/refman/5.1/en/option-files.html
Simon J Mudd on :
So I've written a bug report about it which you can find here:
http://bugs.mysql.com/bug.php?id=60729
Let's hope that this can be addressed by better logging (in the script) and by the server and of course better change control by the DBAs.... :-)
Andrii on :
If user has permission to start mysqld, then it has right to define custom configuration for it. (i.e. user-specific)
Perhaps we have huge server box with multiple users who start multiple mysqld instances on it.
Kristian Köhntopp on :
The option --defaults-file exists, and can be used to maintain multiple instances in a clear and unambigous way.
In fact, I happen to believe that the following would be a great improvement: a server that refuses to start UNLESS you give it a --defaults-file option and which then will read server configuration from that one file from one single section, and which will complain about duplicate keywords with an error that prevents startup.
Kristian Köhntopp on :
And I'd like the configuration of each server in a separate file - the way mysqlmanager handles this is completely broken: all servers running under the same user id, configurations of multiple servers mixed up in one file, and an inheritance schema for configuration options so that you can neither backup this properly nor can you migrate one instance elsewhere easily. Somebody has been adding a lot of code to make my life harder here.