
A quick tour of LVM

The vmware config used for this example.
This is a quick tour of LVM and a demonstration of how it is superior to static partitions. Basically, LVM provides you with a way to create dynamic partitions - you will be able to grow and shrink partitions on demand, move them between disks and snapshot them for backup, all while the filesystem and database on top of them are active and busy.

The LVM tour in this blog post has been created on a vmware instance with a Suse 10.0 Professional installation, which I am using to show a combination of RAID and LVM configuration examples. The vmware has a bit of memory, a network card, a boot disk with a text-only Suse 10 installation, and eight small simulated SCSI disks besides the boot disk to demonstrate stuff.

Here is the configuration for the basic system.

We start off with partitioning. Partitioning 8 disks can be a hassle, so we take the first data disk, /dev/sdb, as a master and copy its partition table automatically to all the other disks.

CODE:
# fdisk -l /dev/sdb
Disk /dev/sdb: 214 MB, 214748160 bytes
64 heads, 32 sectors/track, 204 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1         204      208880   fd  Linux raid autodetect


The easiest way to do this is to use the command line partitioner, sfdisk, in a loop. "sfdisk -d /dev/sdb" will dump the partition table of /dev/sdb in a format that can be fed to another sfdisk instance, so the following loop will work just fine.

CODE:
for i in c d e f g h i
do
  sfdisk -d /dev/sdb | sfdisk /dev/sd$i
done
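
To double-check that the copies took, a quick sanity loop over fdisk will do; the grep pattern simply matches the partition type shown above (output omitted here):

CODE:
# for i in b c d e f g h i
> do
>   fdisk -l /dev/sd$i | grep "Linux raid autodetect"
> done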


Now we want to turn these disks into 4 RAID-1 pairs. We do not build a RAID-10 out of them, because we want to use LVM for this later on.

First, we copy the default raidtab sample into place and then modify it to match our requirements. To define a RAID device, we have to specify a "raiddev" line first, followed by all definitions for the RAID metadevice (md). Finally, we add a device and raid-disk pair for each physical disk we want to include in the RAID. The device line names the physical device file, and the raid-disk line gives an index number for that disk inside our metadevice.

CODE:
# cp /usr/share/doc/packages/raidtools/raid1.conf.sample /etc/raidtab
# vi /etc/raidtab
...
# cat /etc/raidtab
# Definition of /dev/md0 - a mirror of sdb1 and sdc1
raiddev                 /dev/md0
raid-level              1   # If you forget this, you'll get a RAID-0 by default
nr-raid-disks           2   # Two disk mirror
nr-spare-disks          0   # No hot spares here
chunk-size              128

# Specify /dev/sdb1 as disk 0, and /dev/sdc1 as disk 1 in the pair.
device                  /dev/sdb1
raid-disk               0
device                  /dev/sdc1
raid-disk               1

# Definition of /dev/md1 - a mirror of sdd1 and sde1
raiddev                 /dev/md1
raid-level              1
nr-raid-disks           2
nr-spare-disks          0
chunk-size              128

device                  /dev/sdd1
raid-disk               0
device                  /dev/sde1
raid-disk               1

# Definition of /dev/md2 - a mirror of sdf1 and sdg1
raiddev                 /dev/md2
raid-level              1
nr-raid-disks           2
nr-spare-disks          0
chunk-size              128

device                  /dev/sdf1
raid-disk               0
device                  /dev/sdg1
raid-disk               1

# Definition of /dev/md3 - a mirror of sdh1 and sdi1
raiddev                 /dev/md3
raid-level              1
nr-raid-disks           2
nr-spare-disks          0
chunk-size              128

device                  /dev/sdh1
raid-disk               0
device                  /dev/sdi1
raid-disk               1


We now have to mkraid each of our devices to initialize them. Afterwards we will find status information for the devices in /proc/mdstat.

CODE:
# for i in 0 1 2 3; do mkraid /dev/md$i; done
handling MD device /dev/md0
analyzing super-block
disk 0: /dev/sdb1, 208880kB, raid superblock at 208768kB
disk 1: /dev/sdc1, 208880kB, raid superblock at 208768kB
...
# cat /proc/mdstat
Personalities : [raid0] [raid1]
md3 : active raid1 sdi1[1] sdh1[0]
      208768 blocks [2/2] [UU]

md2 : active raid1 sdg1[1] sdf1[0]
      208768 blocks [2/2] [UU]

md1 : active raid1 sde1[1] sdd1[0]
      208768 blocks [2/2] [UU]

md0 : active raid1 sdc1[1] sdb1[0]
      208768 blocks [2/2] [UU]

unused devices: <none>


All of this leaves us with four metadevices, each of them a mirror pair of two physical disks. We now want to create a RAID-0 like structure from them, but we want some more flexibility, so we are using LVM for this.

LVM is a method to slice your physical devices into physical extents of equal size. All of them end up in a big bag called a volume group. You can build logical volumes from the free extents in your volume group as you see fit; you can resize logical volumes, migrate them and snapshot them. This gives you a lot of flexibility in storage management that you would not have if you had made a static RAID-0 from your RAID-1s.

If you want to put any devices into a volume group, you have to label them as physical volumes first. This is done using pvcreate, and can be checked using pvdisplay.

CODE:
# for i in 0 1 2 3
> do
> pvcreate /dev/md$i
> done
  Physical volume "/dev/md0" successfully created
  Physical volume "/dev/md1" successfully created
  Physical volume "/dev/md2" successfully created
  Physical volume "/dev/md3" successfully created
# pvdisplay
  --- NEW Physical volume ---
  PV Name               /dev/md0
  VG Name
  PV Size               203.69 MB
  Allocatable           NO
  PE Size (KByte)       0
  Total PE              0
  Free PE               0
  Allocated PE          0
  PV UUID               udoHkI-0wvA-KNYy-YdT0-G5G9-6Jg6-6SUUr9
...


We end up with four physical volumes which now need to be added to a volume group. A volume group has a name (ours will be called "system") and determines the physical extent (PE) size used for all physical volumes inside it. By default, the PE size is pretty small, 4M, and on real disks you would want something larger, such as 128M. But for our toy disks, the 4M default is just right.
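
On production-sized disks, you would simply pass a larger -s value when creating the volume group. As a sketch only - the volume group name "bigvg" and the device /dev/md9 are made up for this example:

CODE:
# vgcreate -s 128M bigvg /dev/md9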

Let's do it.

CODE:
# vgcreate -s 4M system /dev/md0
  Volume group "system" successfully created
# vgextend system /dev/md1 /dev/md2
  Volume group "system" successfully extended


We created the volume group with /dev/md0 as its first physical volume, and then added two more volumes to the group, /dev/md1 and /dev/md2, using vgextend. We will add /dev/md3 even later to demonstrate migration and simulate a hardware upgrade.

Our config now looks like this:

CODE:
# vgdisplay
  --- Volume group ---
  VG Name               system
  System ID
  Format                lvm2
  Metadata Areas        3
  Metadata Sequence No  2
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                3
  Act PV                3
  VG Size               600.00 MB
  PE Size               4.00 MB
  Total PE              150
  Alloc PE / Size       0 / 0
  Free  PE / Size       150 / 600.00 MB
  VG UUID               P8qZM4-5VmP-bhyf-gu9l-glbW-o2ji-4a4EGm


We now have a VG with 3 PVs in it. It is 600M in size, made up of 150 4M extents, all of which are free for allocation.

You may have noticed the UUID specifier for each physical or logical volume and for each volume group. LVM deals with devices completely on the basis of UUIDs and not device names, so renaming and renumbering devices will leave LVM unfazed and it will find your disks no matter under which device name they are present.
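
To see that mapping between device names and UUIDs at a glance, you can simply filter the pvdisplay output for the relevant fields (output omitted here):

CODE:
# pvdisplay | grep -E "PV Name|PV UUID"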

We now want to make logical disks. True to Wide Thin Disk Striping, we want to spread all of our volumes across all of our disks, so we need to specify the number of stripes and the stripe size. For MySQL, we need a data partition where the datadir will reside, and a log partition where the binlog, the relay log and all other logs go. As per MySQL recommendation, we use reiserfs 3.6.

CODE:
# lvcreate -i 3 -I 128K -L 100M -n data system
  Rounding size (25 extents) up to stripe boundary size (27 extents)
  Logical volume "data" created
# lvcreate -i3 -I 128K -L100M -n log system
  Rounding size (25 extents) up to stripe boundary size (27 extents)
  Logical volume "log" created


Using 4M extents, a 100M filesystem is 25 extents in size. But because we requested a stripe factor of 3, we need a number of extents in our logical volume that is evenly divisible by 3. Thus, we get a 27-extent volume.
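
If you want to double-check the rounding, the "Current LE" field of lvdisplay should show 27 extents for each of the two volumes (output omitted here):

CODE:
# lvdisplay /dev/system/data /dev/system/log | grep -E "LV Name|Current LE"

We now can create filesystems and mount them.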

CODE:
# mkreiserfs /dev/system/data
...
# mkreiserfs /dev/system/log
...
# mkdir /export
# mkdir /export/data
# mkdir /export/log
# mount /dev/system/data /export/data
# mount /dev/system/log /export/log
# df -Th
Filesystem    Type    Size  Used Avail Use% Mounted on
/dev/sda2 reiserfs    7.6G  659M  6.9G   9% /
tmpfs        tmpfs    126M     0  126M   0% /dev/shm
/dev/mapper/system-data
          reiserfs    108M   33M   76M  30% /export/data
/dev/mapper/system-log
          reiserfs    108M   33M   76M  30% /export/log


Onto this system we will now install MySQL.

CODE:
# cat wget
wget http://dev.mysql.com/get/Downloads/MySQL-5.0/\
mysql-max-5.0.22-linux-i686-glibc23.tar.gz/\
from/http://sunsite.informatik.rwth-aachen.de/mysql/
# sh wget
...
Length: 42,271,844 (40M) [application/x-tar]

100%[========================================================>] 42,271,844  1001.55M/s    ETA 00:00

17:58:59 (1.22 MB/s) - `mysql-max-5.0.22-linux-i686-glibc23.tar.gz' saved [42271844/42271844]
# tar -C /usr/local -xf mysql-max-5.0.22-linux-i686-glibc23.tar.gz
# cd /usr/local
# ln -s mysql-max-5.0.22-linux-i686-glibc23/ mysql
# cd mysql
# groupadd mysql
# useradd -g mysql mysql
# mkdir etc
# chown -R mysql.mysql .
# ./scripts/mysql_install_db --user=mysql
# vi etc/my.cnf
...
# cat etc/my.cnf
[mysqld]

datadir=/export/data
log-bin=/export/log/mysqld-bin
log-bin-index=/export/log/mysqld-bin.index
expire-logs-days=7
log-error=/export/log/mysqld.err
log-slow-queries=/export/log/mysql-slow.log
long-query-time=2
relay-log=/export/log/mysql-relay
relay-log-index=/export/log/mysql-relay.index
# ./support-files/mysql.server start
...
# ./bin/mysql -u root


This will now keep most logs in /export/log and the datadir in /export/data. If you enter the command line client and insert some data, you'll see how the simulated disks in the vmware blink in unison.
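
If you need something to type, a few throwaway statements like the following will generate enough write activity to watch. The database and table names are made up just for this test:

CODE:
# ./bin/mysql -u root -e "CREATE DATABASE lvmtest"
# ./bin/mysql -u root -e "CREATE TABLE lvmtest.t (id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, payload CHAR(255)) ENGINE=MyISAM"
# ./bin/mysql -u root -e "INSERT INTO lvmtest.t (payload) VALUES (REPEAT('x', 255))"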

Now the fun can start. While the filesystem is mounted and MySQL is running, we can easily add disk space.

CODE:
# df -Th
...
/dev/mapper/system-data
          reiserfs    108M   53M   56M  49% /export/data
...
linux:/mnt # lvextend -L+100M /dev/system/data
  Using stripesize of last segment 128KB
  Rounding size (52 extents) down to stripe boundary size for segment (51 extents)
  Extending logical volume data to 204.00 MB
  Logical volume data successfully resized
linux:/mnt # resize_reiserfs /dev/system/data
resize_reiserfs 3.6.18 (2003 www.namesys.com)
resize_reiserfs: On-line resizing finished successfully.
linux:/mnt # df -Th
...
/dev/mapper/system-data
          reiserfs    204M   53M  152M  26% /export/data
...


Now we add the fourth metadevice, /dev/md3, to the volume group, and evacuate the third one, /dev/md2, in order to remove it. All the time the database keeps running and the filesystems stay mounted. For this to work, the kernel must have loaded the dm_mirror module, which it does not do automatically in Suse Linux.

CODE:
# vgextend system /dev/md3
  Volume group "system" successfully extended
# modprobe dm_mirror
# pvmove /dev/md2 /dev/md3
# pvdisplay /dev/md2
  --- Physical volume ---
  PV Name               /dev/md2
  VG Name               system
  PV Size               200.00 MB / not usable 0
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              50
  Free PE               50
  Allocated PE          0
  PV UUID               cRVCVD-T3BK-sT96-2Zvy-sUts-Tl35-7G4wTz

linux:/mnt # vgreduce system /dev/md2
  Removed "/dev/md2" from volume group "system"


Another nice trick: snapshots, using the dm_snapshot module. A snapshot volume need not be as large as the original volume, because snapshots are copy-on-write: when you make a snapshot, nothing is copied at first. When you read from the snapshot, the requested block is read from the original volume. If a block is changed on the original volume, its original content is copied to the snapshot first. If you then read that block from the snapshot, the saved copy is read from the snapshot instead, shadowing the modified data on the original.

Even fancier: Snapshots are writeable. How about snapshotting your database and then testing the migration on a snapshot first, while production goes on uninterrupted on the main system?

CODE:
# modprobe dm_snapshot
# /usr/local/mysql/support-files/mysql.server stop
done.
# lvcreate -s -L 50M -n backup /dev/system/data
  Rounding up size to full physical extent 52.00 MB
  Logical volume "backup" created
# cd /usr/local/mysql
# ./support-files/mysql.server start
# mkdir /export/backup
# mount /dev/system/backup /export/backup
# df -Th
...
/dev/mapper/system-data
          reiserfs    204M   53M  152M  26% /export/data
/dev/mapper/system-backup
          reiserfs    204M   53M  152M  26% /export/backup
...
# vgdisplay -v
...
  --- Logical volume ---
  LV Name                /dev/system/data
  VG Name                system
  LV UUID                o4S2T2-MR83-UKTs-g8P0-ry4G-4Ko5-78pfUR
  LV Write Access        read/write
  LV snapshot status     source of
                         /dev/system/backup [active]
  LV Status              available
  # open                 2
  LV Size                204.00 MB
  Current LE             51
  Segments               2
  Allocation             inherit
  Read ahead sectors     0
  Block device           253:0

  --- Logical volume ---
  LV Name                /dev/system/backup
  VG Name                system
  LV UUID                VGamCn-NucQ-Wtts-Qilg-7Dkc-0Z93-Cdqytf
  LV Write Access        read/write
  LV snapshot status     active destination for /dev/system/data
  LV Status              available
  # open                 2
  LV Size                204.00 MB
  Current LE             51
  COW-table size         52.00 MB
  COW-table LE           13
  Allocated to snapshot  0.38%
  Snapshot chunk size    8.00 KB
  Segments               1
  Allocation             inherit
  Read ahead sectors     0
  Block device           253:4


We can see how /dev/system/data is marked as "source of /dev/system/backup [active]" and how the COW (copy-on-write) table on /dev/system/backup is currently 0.38% filled. The mounted snapshot appears to be as large as the original filesystem, but the space allocated to the snapshot can be smaller - it only has to be large enough to hold all changes made to the original filesystem during the lifetime of the snapshot.
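
It pays to keep an eye on that fill level while the snapshot exists, because a snapshot whose COW table runs full becomes unusable. A simple way to watch it (output omitted here):

CODE:
# lvdisplay /dev/system/backup | grep "Allocated to snapshot"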

We can now take our time and save the snapshot contents to tape.
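
What "saving to tape" looks like depends entirely on your setup; as a sketch, with a hypothetical tape device /dev/st0, any archiver reading from the mounted snapshot will do:

CODE:
# tar -cf /dev/st0 -C /export/backup .

When the backup is done, we get rid of the snapshot.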

CODE:
# umount /dev/system/backup
# lvremove /dev/system/backup
Do you really want to remove active logical volume "backup"? [y/n]: y
  Logical volume "backup" successfully removed


Trackbacks

linkdump on : A quick tour of LVM

Köhntopp has written a nice short guide to LVM2, which I am happy to point to. Especially with a certain skeptic in mind. ;-)

Die wunderbare Welt von Isotopp on : LVM, DRBD

Two articles in the English-language blog MySQL Dump: A Quick Tour of LVM: What is LVM and how do you use it? A step-by-step walkthrough on a Suse 10 Professional installation inside a VMware. A Quick Tour of DRBD: What is DRBD and how do you use it? A ...

Comments


eckes on :

If you add the disks directly to the LVM group, you do not even have to partition them. However, I do not know whether you can easily do RAID-1 that way (i.e. without md?).

xfs has a snapshot function - is that not needed with reiser, or is it enough there to have no applications on the filesystem? The xfs snapshot blocks all opens and makes sure the fs blocks are consistent.

Isotopp on :

reiserfs has barrier write patches (at least the version shipped with Suse Linux), so the state on disk is always almost consistent (data blocks go first, metadata blocks come after the data blocks that they point to, and so on). When the snapshot is mounted, reiserfs rolls the log forward and creates a consistent image that contains all transactions that had been committed when the snapshot was taken.

Pravin on :

Thanks lol...
Below mentioned link is very easy to understand,
http://www.redhatlinux.info/2010/11/lvm-logical-volume-manager.html

Marc 'Zugschlus' Haber on :

sfdisk makes it much easier to automatically create disk partitions without dirty tricks like copying disk partition tables.

Marc 'Zugschlus' Haber on :

Never mind. I should (a) actually read the article to its end before commenting and (b) do so while being actually awake.

Marc 'Zugschlus' Haber on :

I still have some useful comments ;)

I have always hated raidtools with a passion, mostly because they need an up-to-date /etc/raidtab even to work with an already-made RAID and do funky things if the /etc/raidtab does not fit the actual RAID. In my opinion, mdadm is a much better tool for RAID administration on Linux.

I blogged a short LVM introduction in German a few months ago. It might have some information which is not present here, so I am linking it: http://blog.zugschlus.de/archives/65-LVM-unter-Linux.html

Axel Eble on :

I haven't yet understood the benefits of using software RAID under Linux compared to a hardware RAID controller, except for the hardware costs.

Can anybody elaborate on that a bit?

Isotopp on :

Software RAID is only useful for RAID-1, RAID-0 and any mixtures of these. RAID-4 and RAID-5 are very hard to get right, even in hardware. Since RAID-1 and RAID-0 incur very little overhead, they can also be very fast as software RAIDs.

With Linux RAID, as opposed to hardware RAID, you have more degrees of freedom when designing the RAID. For example, most controllers require the disks of a RAID-1 pair to sit behind the same controller, whereas with Linux RAID you can have completely disjoint paths (different PCI bridges, different controllers, different disks). This is desirable for performance and availability reasons.

Also, with RAID-1 and LVM, you gain a lot of flexibility.

Axel Eble on :

Well, with a RAID adapter like a 3ware, only one disk is attached to any given port of the adapter, so the performance issue shouldn't be there (well, that's the main point of buying a RAID adapter anyway).
I can see the point in distributing a RAID-1 across two controllers, though, as that is my main issue with software RAID: if you have two disks attached to the same controller and one dies, the other disk is at the very least heavily impacted.

Bernd Eckenfels on :

A typical customer installation uses software mirroring to two SAN fibres on a metro-scale distribution to ensure all data is stored in at least two places.

This is IMHO a simpler solution than having some kind of synchronisation, and it also works well with failover clusters.

However, that's different from the DRBD approach, which does not need the expensive FC connections and FC switches.

Bernd
