
A quick tour of DRBD

[Screenshot: the VMware configuration used - two running instances are required for the example]
This is a quick tour of DRBD and how it compares to local RAID and to MySQL replication. DRBD is short for "Distributed Replicated Block Device", so what it does is essentially RAID-1 over a network cable: you get two copies of a block device on two different physical machines, one of them the primary, active node and the other one a secondary, passive node.

The DRBD tour in this blog post has been created on two VMware instances, each with a Suse 10.0 Professional installation, which I am using to show the most essential features of DRBD. Each instance has a bit of memory, a network card, a boot disk with a text-only Suse 10 installation, and a second simulated 1 GB SCSI disk to demonstrate things. The two instances are connected to a simulated local vmnet and share the 10.99.99.x/24 network; they are called left (10.99.99.128) and right (10.99.99.129).

On each machine, in addition to the most basic Suse 10 Pro installation, the packages km_drbd and drbd have been installed. On left, we partition the second hard disk completely for Linux using fdisk, creating /dev/sdb1 for use with DRBD.

CODE:
left:/tmp/share # fdisk /dev/sdb

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-130, default 1): 1
Last cylinder or +size or +sizeM or +sizeK (1-130, default 130): 130

Command (m for help): p

Disk /dev/sdb: 1073 MB, 1073741824 bytes
255 heads, 63 sectors/track, 130 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start       End      Blocks   Id  System
/dev/sdb1                1       130    1044193+  83  Linux

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
left:/tmp/share #


We now have to define an /etc/drbd.conf to configure that disk into DRBD. A basic drbd.conf has to define a resource. A resource is something that contains a disk partition on the left node, a matching partition on the right node, a network connection between them, and definitions for error handling and synchronisation.

The config file format is straightforward and uses named blocks in curly brackets and semicolon-terminated statements - if you know BIND's named.conf, you'll feel right at home.

CODE:
global {
    # we want to be able to use up to 5 drbd devices
    minor-count 5;
    dialog-refresh 5; # 5 seconds
}

resource r0 {
  protocol C;
  incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

  on left {
    device     /dev/drbd0;
    disk       /dev/sdb1;
    address    10.99.99.128:7788;
    meta-disk  internal;
  }

  on right {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.99.99.129:7788;
    meta-disk internal;
  }

  disk {
    on-io-error   detach;
  }

  net {
    max-buffers     2048;
    ko-count 4;
    on-disconnect reconnect;
  }

  syncer {
    rate 10M;
    group 1;
    al-extents 257; # must be a prime number
  }

  startup {
    wfc-timeout  0;
    degr-wfc-timeout 120;    # 2 minutes.
  }
}


Global options are set in a section aptly named global, and are currently limited to "minor-count" (the number of DRBD devices you'll be able to define), "dialog-refresh" (how quickly the startup dialog redraws itself) and "disable-ip-verification" (disables some startup sanity checking that verifies we are on the right machine).
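
For reference, a global section that exercises all three options might look like this (the values are illustrative, not recommendations):

CODE:
global {
    minor-count 5;             # allow up to 5 DRBD devices
    dialog-refresh 5;          # redraw the startup dialog every 5 seconds
    disable-ip-verification;   # skip the are-we-on-the-right-host check
}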

All other configuration happens inside a resource section. That section needs a name, which can be anything and can be quoted; we happen to use something boring like r0. Any resource needs to have a protocol defined, and requires two "on" sections which define the two hosts we are going to use. It can also have startup, syncer, net and disk sections.

The protocol in DRBD is something like innodb_flush_log_at_trx_commit in MySQL: it determines when the node committing a disk write considers that write to be a success, and has essential influence on the speed and resiliency of your DRBDs. It can be defined as A, B or C:
  • In protocol C, which is the recommended setting, a write is considered completed when it has reached stable storage on the local and the remote node.
  • In protocol B, we relax the constraints a little and consider the write completed when it has reached the local disk and the remote buffer cache. This should be faster than C, but for some reason currently is not, so you should not be using it.
  • In protocol A, we consider a write completed when it has reached the local disk and the local TCP send buffer. This may be okay for you, but for most people it is not.
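
Switching a resource to a different protocol is a one-line change in its resource section. A sketch, with protocol B purely as an illustration (C remains the recommended choice):

CODE:
resource r0 {
  protocol B;   # ack once the write reached the local disk and the
                # remote buffer cache, instead of both disks as with C
  # ... rest of the resource definition unchanged ...
}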


The beef is in the "on" sections, which also need names, specifically the hostname of the system carrying the device in question. Inside your "on" section you'll define a device, a disk, an address and a meta-disk. This is fairly straightforward: the device is the /dev/drbdN which we will be working with later on.

The disk is the underlying real storage that will carry all our data. The address is an IP:port pair used to talk to the DRBD instance for this device (a different TCP port is needed for each DRBD device), and the meta-disk is either internal or some dedicated metadata storage device. For the simplicity of our example, we are using internal for now (DRBD will then use 128M of /dev/sdb1 for its internal purposes. Yes, that is a lot!).
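
If sacrificing 128M of the data partition is not acceptable, the metadata can live on a dedicated device instead. A sketch, assuming a hypothetical spare partition /dev/sdc1; the bracketed index selects one 128M metadata slot, so several resources can share the device:

CODE:
on left {
  device     /dev/drbd0;
  disk       /dev/sdb1;
  address    10.99.99.128:7788;
  meta-disk  /dev/sdc1[0];   # slot 0 on a dedicated metadata partition
}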

Look at the on-sections in the example above: on left and right we will be using /dev/drbd0, and have it write to /dev/sdb1. We will communicate using TCP port 7788 on both IP addresses, .128 and .129.

How we handle local disk errors is specified using the on-io-error handler of the unnamed disk section. "detach" means that on error we simply forget the local disk and operate in diskless mode: we read and write data from and to the disk of the remote node across the network. Other options are "pass_on" (the primary reports the error, the secondary ignores it) and "panic" (the node leaves the cluster with a kernel panic).

Both nodes are connected using a net section. Inside the net section, which is unnamed, we define the buffers and timeouts used by DRBD: sndbuf-size, timeout, connect-int, ping-int, max-buffers, max-epoch-size, ko-count, on-disconnect.

The sndbuf-size is specified in KB and determines how much buffer the local DRBD will reserve for communication with the remote node. It should be no smaller than 32K and no larger than 1M; the optimum size depends on the bandwidth-delay product of the connection to the remote node.

If the partner node does not reply within timeout tenths of a second, this counts as a K.O. After ko-count of these, the partner is considered dead and dropped from the cluster, and the primary goes into standalone mode. Also, the connection to the partner node is dropped on timeout and will be re-established immediately; if that fails, a new attempt is made every connect-int seconds. If, on the other hand, the connection between the two nodes is idle for more than ping-int seconds, a DRBD-internal ping is sent to the remote node to check if it is still present.

How the node handles a disconnect can be specified using the on-disconnect handler: valid choices are stand_alone (go from primary to standalone mode), reconnect (try to reconnect as described above) or freeze_io (try to reconnect, but halt all I/O as with an NFS hard mount until the reconnect is successful).

DRBD uses 4K buffers for writes to disk, and it uses at most max-buffers of these (the minimum of 32 buffers comes to 128K). If you see many writes, this number needs to be set to a larger value.

max-epoch-size needs to be at least 10, and determines how many data blocks may be seen at most between two write barriers.
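
Put together, a net section spelling out all of the options discussed here might look like this (the numbers are illustrative examples, not tuned recommendations):

CODE:
net {
  sndbuf-size    512k;   # send buffer towards the peer (32K to 1M)
  timeout        60;     # in tenths of a second: 6 seconds
  connect-int    10;     # retry a broken connection every 10 seconds
  ping-int       10;     # ping an idle peer every 10 seconds
  max-buffers    2048;   # 4K buffers for writes to the local disk
  max-epoch-size 2048;   # data blocks between two write barriers
  ko-count       4;      # 4 missed timeouts and the peer is dead
  on-disconnect  reconnect;
}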

Across the network connection, the syncer does its work to keep both disks neat and tidy. It will use at most rate K/sec of bandwidth to do so, and the default is quite low (250 K/sec). For synchronisation, the disks are cut up into slices, and for each slice an al-extent is used to indicate if and where it has been changed. A larger number of al-extents makes resynchronisation slower, but requires fewer metadata writes. The number used here should be a prime, because it is used internally in hashes that benefit from prime-number-sized structures.

If you have multiple devices, all of those in the same group are resynchronised in parallel. If two DRBD devices reside on different physical disks, you can put them into the same group so that they are resynchronised in parallel without competing for seeks on the same disk. If two DRBD devices are partitions on the same physical disk, put them into different groups to avoid disk head thrashing.
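
As an illustration, suppose a second resource r1 whose backing store /dev/sdb2 (hypothetical) is a partition on the same physical disk as r0's /dev/sdb1. Giving it a different group number serialises the two resynchronisations:

CODE:
resource r0 {
  # ... as above, disk /dev/sdb1 ...
  syncer { rate 10M; group 1; }
}

resource r1 {
  # hypothetical second resource on /dev/sdb2, same physical disk
  syncer { rate 10M; group 2; }   # resynced only after group 1 is done
}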

Inside the startup section, which is also unnamed, we define two wait-for-connection timeouts. On startup, DRBD will try to find its partner node on the network. DRBD remembers whether it was "degraded" the last time it went down - "degraded" here means that the partner node was already down and we are missing a mirror half. If we were degraded prior to the node restart, we wait for 120 seconds (degr-wfc-timeout) for the second node to come up, and continue the boot otherwise. If we were not degraded, we require the second node to be present for our own boot to complete (or require manual intervention), because wfc-timeout is 0, meaning wait forever.

Using this config file on left and right (copy it over using scp!), we can start DRBD on both nodes. According to our config, left will hang until the start on right has completed ("wfc-timeout" is set to 0, meaning wait forever).

CODE:
left:~ # rcdrbd start
Starting DRBD resources:    [ d0 s0 n0 ].
..........
 DRBD's startup script waits for the peer node(s) to appear.
 - In case this node was already a degraded cluster before the
   reboot the timeout is 120 seconds. [degr-wfc-timeout]
 - If the peer was available before the reboot the timeout will
   expire after 0 seconds. [wfc-timeout]
   (These values are for resource 'r0'; 0 sec -> wait forever)
 To abort waiting enter 'yes' [  10]:


The system is now not in sync, and DRBD does not know which disk is the leading disk in our little cluster. So both disks are in secondary mode, and cannot be written to.

CODE:
left:~ # cat /proc/drbd
version: 0.7.13 (api:77/proto:74)
SVN Revision: 1942 build by root@d233, 2006-01-21 02:46:41
 0: cs:Connected st:Secondary/Secondary ld:Inconsistent
    ns:0 nr:0 dw:0 dr:0 al:0 bm:112 lo:0 pe:0 ua:0 ap:0


To change that, we need to make one disk (the one on left) the primary, and then watch the system synchronize.

CODE:
left:~ # drbdadm primary r0
ioctl(,SET_STATE,) failed: Input/output error
Local replica is inconsistent (--do-what-I-say ?)
Command '/sbin/drbdsetup /dev/drbd0 primary' terminated with
exit code 21
left:~ # drbdadm -- --do-what-I-say primary r0
left:~ # cat /proc/drbd
version: 0.7.13 (api:77/proto:74)
SVN Revision: 1942 build by root@d233, 2006-01-21 02:46:41
 0: cs:SyncSource st:Primary/Secondary ld:Consistent
    ns:9204 nr:0 dw:0 dr:9204 al:0 bm:112 lo:0 pe:0 ua:0 ap:0
        [>...................] sync'ed:  1.4% (903916/913120)K
        finish: 0:01:34 speed: 9,204 (9,204) K/sec


When we try to switch the left copy of r0 to primary, this does not work: the local replica is marked inconsistent, and the system cannot decide on its own that it may act as a proper primary. We have to insist that the left copy of r0 is the one we want to become the primary, using the "-- --do-what-I-say" sledgehammer.

The system will then start to sync both disks, and by monitoring /proc/drbd we can follow the progress of this operation.

Our syncer is limited to 10M per second by our configuration, so we will see a synchronisation rate of approximately 10M/sec in /proc/drbd - the sync will take almost 1.5 minutes to complete.

We do not have to wait: even with the sync running, we are free to operate on the primary copy as we like. We would like to have a file system on /dev/drbd0 and then mount it. Here is how:

CODE:
left:~ # mkreiserfs /dev/drbd0
...
Format 3.6 with standard journal
Count of blocks on the device: 228272
Number of blocks consumed by mkreiserfs formatting process: 8218
Blocksize: 4096
Hash function used to sort names: "r5"
Journal Size 8193 blocks (first block 18)
Journal Max transaction length 1024
inode generation number: 0
UUID: 3f9270dd-894a-4da4-8818-35b691504974
ATTENTION: YOU SHOULD REBOOT AFTER FDISK!
        ALL DATA WILL BE LOST ON '/dev/drbd0'!
Continue (y/n):y
Initializing journal - 0%....20%....40%....60%....80%....100%
Syncing..ok
ReiserFS is successfully created on /dev/drbd0.
left:~ # mount /dev/drbd0 /usr/local
left:~ # df -Th /usr/local
Filesystem    Type    Size  Used Avail Use% Mounted on
/dev/drbd0
          reiserfs    892M   33M  860M   4% /usr/local


We get 860M usable: the 1020M partition breaks down into 892M of file system (33M of which is immediately eaten, mostly by the 32M reiserfs journal) and 128M of DRBD metadata. This overhead really hurts on small partitions.

Meanwhile, on right:

CODE:
right:~ # cat /proc/drbd
version: 0.7.13 (api:77/proto:74)
SVN Revision: 1942 build by root@d233, 2006-01-21 02:46:41
 0: cs:SyncTarget st:Secondary/Primary ld:Inconsistent
    ns:0 nr:684728 dw:684728 dr:0 al:0 bm:149 lo:0 pe:0 ua:0 ap:0
        [=============>......] sync'ed: 68.2% (294148/913120)K
        finish: 0:00:25 speed: 11,380 (9,824) K/sec


Onto this system we now install MySQL.

CODE:
# tar -C /usr/local -xf /tmp/share/mysql-max-5.0.22-linux-i686-glibc23.tar.gz
# cd /usr/local
# ln -s mysql-max-5.0.22-linux-i686-glibc23/ mysql
# cd mysql
# groupadd mysql
# useradd -g mysql mysql
# chown -R mysql.mysql .
# ./scripts/mysql_install_db --user=mysql
# ./support-files/mysql.server start
...
# ./bin/mysql -u root


To fail over from left to right, a number of things need to be done:
  • MySQL needs to be stopped.
  • The disk needs to be unmounted.
  • The disk needs to be put in secondary on left.
  • The disk needs to be put in primary on right.
  • The disk needs to be mounted on right.
  • MySQL needs to be started.
Actually, for the change to be completely transparent to the applications using MySQL, MySQL needs to be running on a virtual IP (e.g. 10.99.99.130), which also needs to be transferred from left to right. This can be automated, and even forced if left crashes, but not with DRBD alone. An additional package is needed, the Linux heartbeat package, which offers this functionality - we will be covering it in a later article.

Here is the manual switchover, on left:
CODE:
left:~ # /usr/local/mysql/support-files/mysql.server stop
Shutting down MySQL. 
left:~ # umount /dev/drbd0
left:~ # drbdadm secondary r0
left:~ #


And the inverse, on right:
CODE:
right:~ # cat /proc/drbd
version: 0.7.13 (api:77/proto:74)
SVN Revision: 1942 build by root@d233, 2006-01-21 02:46:41
 0: cs:Connected st:Secondary/Secondary ld:Consistent
    ns:0 nr:1109196 dw:1109196 dr:0 al:0 bm:168 lo:0 pe:0 ua:0 ap:0
right:~ # drbdadm primary r0
right:~ # cat /proc/drbd
version: 0.7.13 (api:77/proto:74)
SVN Revision: 1942 build by root@d233, 2006-01-21 02:46:41
 0: cs:Connected st:Primary/Secondary ld:Consistent
    ns:0 nr:1109196 dw:1109196 dr:0 al:0 bm:168 lo:0 pe:0 ua:0 ap:0
right:~ # mount /dev/drbd0 /usr/local
right:~ # /usr/local/mysql/support-files/mysql.server start
Starting MySQL.
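
The two transcripts above condense into a small script. This is a sketch only - it assumes passwordless root ssh from left to right and the paths used in this article; in production, heartbeat would be doing this for you:

CODE:
#!/bin/sh
# Manual DRBD failover from left to right (sketch, not production code).
set -e

# On left: stop MySQL, unmount the DRBD device, demote to secondary.
/usr/local/mysql/support-files/mysql.server stop
umount /dev/drbd0
drbdadm secondary r0

# On right: promote to primary, mount, start MySQL.
ssh root@right "drbdadm primary r0 \
  && mount /dev/drbd0 /usr/local \
  && /usr/local/mysql/support-files/mysql.server start"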


In a real MySQL failover scenario, we do not know why the failover took place in the first place, nor whether the server data on the DRBD disk is usable. Thus, MySQL would most likely need to run InnoDB crash recovery, and should also be run with MyISAM table autorepair. This will slow down the failover - a lot, if you happen to have a very large InnoDB log.
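
The MyISAM autorepair part is a my.cnf setting; InnoDB runs its crash recovery automatically on startup. A minimal sketch, using the standard MySQL 5.0 option values:

CODE:
[mysqld]
# Check and repair MyISAM tables as they are opened after the failover;
# BACKUP keeps a copy of any table that had to be modified.
myisam-recover = BACKUP,FORCE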


Comments


Axel Eble on :

Is it possible to combine Software RAID, LVM and DRBD to create a flexible system (LVM) that is fault tolerant against local disk crashes (Software RAID) and system crashes (DRBD)?
If so, where to put either one? Software RAID as lowest level, DRBD on top and finally, on top of that, LVM?

Isotopp on :

According to the FAQ, DRBD plays well with RAID, but there have been issues with LVM and DRBD 0.6. I have no information if these issues have been resolved with DRBD 0.7, and I haven't tested, yet.

Dave Edwards on :

Excellent post! It's covered in this week's Log Buffer.

Dave.

Ted Osadchuk on :

MySQL HA is such a myth!!!! When dealing with real high-volume instances - thousands of queries per second, a few hundred gigabytes of data - if a failure happens and mysql isn't shut down gracefully, the data volume is completely useless.

VMware is a great tool for proof of concept, but let's see how this failover strategy works out in the real world, where a table recovery on a few dozen 6 GB tables might take a few days.

I really wish MySQL would stop trying to be something it's not, a high availability reliable transactional database. MySQL is a lightweight, easy to use, low overhead database system and it'll never be anything more (anyone who's worked in high volume mysql environments knows that this is the truth)

-t

Isotopp on :

Just for the record, the use of VMware here is to provide a walkthrough for DRBD that just covers basic installations and allows you to review the setup procedure and play with it. VMware is not very useful for databases, and is also not useful for performance testing.

Regarding the recovery situation: This depends a lot on which storage engine you are using. If you are using MyISAM tables, it is strongly recommended that you have set up a myisam_recover option in your my.ini. Recovery time after a switchover is dependent on the amount of data you have, which can be very long if your database is large.

With a transactional table type, things are very different: Here you'll see a recovery that is proportional to the size of the used part of the redo log, which usually is much quicker. DRBD and InnoDB for example work very well and will have decent switchover times, which do not depend on the size of the data.

Krishna Chandra Prajapati on :

Is a MySQL failover scenario detected by DRBD or by heartbeat?

Pedro on :

Hello,

When are you going to come out with the heartbeat article?

Thanks,

Pedro

Matt Ruggles on :

Can you state what the performance gain (or loss) is over doing replication? In other words, convince me to switch to DRBD instead of continuing to replicate.

James on :

DRBD acts like a mirror of a disk, only instead of copying to a local disk on the server, it copies across the network to another server. So as soon as you make a change on the primary server, it is written to the disk of the failover server.

Darek on :

What happens on 'right' after you make reiserfs on 'left' and the initial sync is complete? Do you mount /dev/drbd0, or do you have to make an FS on right also? Both 'mount' and 'mke2fs -j' on /dev/drbd0 gave me errors.

Darren Cassar on :

Hi Kristian,

DRBD stands for Distributed Replicated Block Device, not Distributed Raw Block Device.

I tried emailing you instead of posting it as a comment but couldn't find a way :/ sorry. Just thought you should change it for correctness' sake.

Cheers,
Darren

Manish on :

Hi

Can I assign multiple target disks to device /dev/drbd0, like

device /dev/drbd0;
disk /dev/sdb1 /dev/sdc1;

which would combine the two disks into something like RAID 0?

Thank You,
Manish

Kristian Köhntopp on :

It is possible to define a /dev/md0 (Linux software RAID) or a /dev/system/mysql (LVM2) as a target disk. This can have any structure you want - any RAID level, or even more complicated setups.
