To answer the title of this post in one word: No.
But as with all things computer-related, that “no” needs to be followed by the caveat: “Well, it depends upon your needs.”
From what I’ve seen, Linux clustering was designed primarily for high-availability services, with only a secondary effort to share disk resources across nodes.
I have tried Linux clustering services for a VM host cluster, and I would never use them in production. I know other people have done it and will continue to do it, but a properly configured (and managed) VM cluster does not need true clustering. (Again, “depending upon your needs.”)
Linux clustering requires fencing. (It didn’t always, but now it does.) Fencing is a great thing in a homogeneous cluster where every machine is a clone of every other, and the point of the cluster is that it can lose a machine or six and still provide the same service(s). The purpose of fencing is to “shoot a bad node in the head”. This can mean either power-cycling it with an iLO or PDU, or disconnecting it from shared resources such as a SAN at the switch level.
Fencing is tremendously undesirable in a VM host cluster. If the cluster decides that one of the nodes is bad, it will simply kill it. In a heterogeneous cluster, that means killing potentially tens (or even hundreds) of your VM guests in one stroke.
Of course, in a VM cluster, fencing would still be required if you were using a shared file system. However, LVM2 is another matter.
Another downside of Linux clustering is that to bring a failed cluster back to a consistent state, the recommended solution is to reboot all of the machines in the cluster simultaneously. (I’ve found that recommendation made by developers in RHEL’s bug database, amongst other places.) In a production VM cluster, that’s unacceptable.
Configuration and Management
From a configuration standpoint, there’s nothing special about running non-clustered LVM on a shared disk in a cluster. All you have to do is run vgcreate on one node using a shared LUN as a physical disk. Then run vgscan on the other nodes in the cluster and you’ll see your new volume group. No fuss, no muss.
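As a rough sketch of those two steps, assuming a shared LUN that shows up as /dev/mapper/san_lun0 and a volume group called vg_guests (both names are made up for illustration), the commands could be wrapped like this. The run helper and DRY_RUN switch are also my additions, so that the script prints what it would do by default.

```shell
#!/bin/sh
# Sketch only. LUN and VG are hypothetical names, not taken from a real setup.
LUN="${LUN:-/dev/mapper/san_lun0}"   # the shared LUN as seen by this host
VG="${VG:-vg_guests}"                # volume group to create on it
DRY_RUN="${DRY_RUN:-1}"              # 1 = print commands instead of running them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Run on ONE node only: create the volume group on the shared LUN.
create_shared_vg() {
    run vgcreate "$VG" "$LUN"
}

# Run on each of the other nodes: rescan, then list the new volume group.
discover_shared_vg() {
    run vgscan
    run vgs "$VG"
}
```

Call create_shared_vg on the first node and discover_shared_vg on the rest. Set DRY_RUN=0 only once the printed commands look right.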
From a management standpoint, you have to be careful. Very, very careful. Writing to the LVM metadata simultaneously from different nodes (such as doing two lvcreates) probably will result in metadata corruption, which could bork your entire cluster. It would be disastrous to employ cron jobs on more than one host, for example, that wrote to LVM’s metadata.
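One cheap safeguard for the cron-job case is to have any script that touches LVM metadata check which host it is running on before it does anything. This is a minimal sketch, assuming a single node is allowed to make changes and that its short hostname is vmhost01 (a hypothetical name). It is a convention check, not a lock.

```shell
#!/bin/sh
# Sketch only. WRITER_HOST is a hypothetical hostname for the single node
# that is allowed to change LVM metadata.
WRITER_HOST="${WRITER_HOST:-vmhost01}"

# Succeeds only when the given (or current) short hostname is the writer.
is_metadata_writer() {
    this_host="${1:-$(hostname -s)}"
    [ "$this_host" = "$WRITER_HOST" ]
}

# Example use at the top of a cron job that would call lvcreate or lvremove:
#   is_metadata_writer || { echo "not the LVM writer node, skipping" >&2; exit 0; }
```

The optional argument exists only so the check can be exercised without changing the machine's hostname.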
The best practice in this case would be to designate one node as the “metadata writer”. That simply means that you’d make all changes to LVM metadata from that machine. On all other cluster nodes, rename the LVM tools (usually /sbin/lvmconf), and put your own script in their place. The script should output something like, “Please use the metadata writer node for changes to LVM”.
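The replacement script can be tiny. Here is one possible shape for it, written as a helper that generates the stub at a path you choose. The helper name, the vmhost01 hostname and the exact wording are my assumptions, and the helper deliberately does not rename the real tool for you.

```shell
#!/bin/sh
# Sketch only. WRITER_HOST and the stub wording are illustrative assumptions.
WRITER_HOST="${WRITER_HOST:-vmhost01}"

# Write a refusal stub to the given path. The real tool must already have
# been renamed out of the way by hand; this function never touches it.
write_lvm_stub() {
    stub_path="$1"
    cat > "$stub_path" <<EOF
#!/bin/sh
echo "Please use the metadata writer node ($WRITER_HOST) for changes to LVM." >&2
exit 1
EOF
    chmod 755 "$stub_path"
}
```

The stub exits non-zero so that scripts calling it fail loudly instead of carrying on as if the change had been made.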
In most cases, the individual LVM commands (e.g. lvchange, etc) are just symlinks to lvm. If you want to be thorough, change the symlinks for read-only operations like lvdisplay, etc. to point to the renamed binary.
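Repointing the read-only symlinks might look something like the following. Treat it as a sketch: the list of commands I count as read-only, the /sbin location and the renamed binary's name (lvm.forreadonly here) are all assumptions to check against your own distribution. It relies on lvm working out which command to run from the name it was invoked under.

```shell
#!/bin/sh
# Sketch only. SBIN, RENAMED and the command list are assumptions; verify
# them on your own hosts before running anything.
SBIN="${SBIN:-/sbin}"
RENAMED="${RENAMED:-lvm.forreadonly}"   # name given to the real lvm binary
READONLY_CMDS="lvdisplay lvs lvscan vgdisplay vgs vgscan pvdisplay pvs pvscan"

repoint_readonly_symlinks() {
    for cmd in $READONLY_CMDS; do
        # Only touch entries that are already symlinks; leave real files alone.
        if [ -L "$SBIN/$cmd" ]; then
            ln -sfn "$RENAMED" "$SBIN/$cmd"
        fi
    done
}
```

Anything not in the list keeps pointing at lvm, which on a non-writer node is now the refusal script.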
Another gotcha is that there’s nothing to stop you from running two different instances of the same VM using the same logical volume. Because there’s no distributed locking, the system will let you do it. Of course, this is great if you designate the virtual disk as read-only, because you can share things like repositories and application images across as many VMs as you’d like. But running two read/write instances will result in data corruption.
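Since nothing enforces this, a pre-flight check before starting a guest is about the best you can do. The sketch below asks each VM host whether it currently has the logical volume open, using the sixth character of the lv_attr field reported by lvs. The host names and LV path are hypothetical, passwordless ssh between hosts is assumed, and the check is racy: it reduces the odds of a double start but does not replace a lock.

```shell
#!/bin/sh
# Sketch only. HOSTS is a hypothetical list of VM hosts; passwordless ssh
# between them is assumed.
HOSTS="${HOSTS:-vmhost01 vmhost02 vmhost03}"

# The sixth character of lv_attr is "o" when the device is open on that host.
attr_says_open() {
    case "$1" in
        ?????o*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Print every host on which the LV (e.g. vg_guests/guest01-disk) is open.
hosts_with_lv_open() {
    lv="$1"
    for h in $HOSTS; do
        attr="$(ssh -n "$h" lvs --noheadings -o lv_attr "$lv" 2>/dev/null | tr -d ' ')"
        if attr_says_open "$attr"; then
            echo "$h"
        fi
    done
}
```

If hosts_with_lv_open prints anything for a read/write disk, don't start a second instance.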
You do have recent backups, right?
I use this technique within a few CentOS / Xen clusters that I manage. The key being that I manage them. Bringing in an outside technician or new employee who doesn’t know how (and why) things are configured could very easily screw everything up, even with the best of intentions. (“Gee, I wonder why he renamed it to lvm.forreadonly. I’ll just use it anyway.”)
In fact, I’d wager that 99% of sysadmins would recommend that you don’t listen to me. However, I’ve been running things this way for over four years now without a hiccup (knock on chassis). Not having the overhead and headache of running a true cluster has been great.
Even I’d have to recommend against using this technique in a cluster larger than a handful of nodes.
Your mileage may vary, but my implementation is fairly straightforward: In my SAN array, I export each set of spindles as a LUN consisting of 100% of their capacity. The LUN is exposed to all the VM hosts, and I create a single volume group on the LUN-cum-physical disk. I then create logical volumes within that VG as needed for the VM guests. An LV is then used as a physical disk asset by a VM guest. (I know some people stick partitions on top of their LVs, but I don’t see the benefit to that.)
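To make the layout concrete, here is a sketch of creating a guest's disk and making it visible on the host that will run the guest. The VG name, LV name and size are invented for illustration. As far as I know, lvchange -ay changes only local activation and not the on-disk metadata, but on a host where the LVM tools have been replaced it would have to be invoked through the renamed binary.

```shell
#!/bin/sh
# Sketch only. VG, the LV name and the size are hypothetical.
VG="${VG:-vg_guests}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands instead of running them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# On the metadata writer node only: carve out a disk for a guest.
create_guest_lv() {
    name="$1"
    size="$2"
    run lvcreate -L "$size" -n "$name" "$VG"
}

# On the host that will run the guest: activate the LV locally so that
# its device node under /dev/$VG appears.
activate_guest_lv() {
    run lvchange -ay "$VG/$1"
}
```

For example, create_guest_lv guest01-disk 20G on the writer, then activate_guest_lv guest01-disk on the host that will boot the guest.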
I can do live migrations, and for times when that’s not necessary or desirable (e.g. a VM with a large memory footprint that is not mission critical), I use shared configuration files.
I store all of the conf files for my virtual machines on a shared LV that’s mounted as read-only on all VM hosts (dom0s) except for the “metadata writer” where it’s mounted as read/write. Therefore, when I want to migrate a guest VM from one host to another it’s just a matter of bringing the VM down, symlinking to the shared conf file on the new host, and removing the symlink from the old host.
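The cold migration described above can be sketched as a short sequence of commands run over ssh. The paths here are guesses for illustration (/srv/xen-conf for the shared mount, /etc/xen/auto for the per-host symlinks), as is the final step of starting the guest with xm create. By default the script only prints what it would do.

```shell
#!/bin/sh
# Sketch only. CONF_DIR and LINK_DIR are hypothetical paths; passwordless
# ssh between the dom0s is assumed.
CONF_DIR="${CONF_DIR:-/srv/xen-conf}"   # shared LV holding the guest configs
LINK_DIR="${LINK_DIR:-/etc/xen/auto}"   # per-host directory of symlinks
DRY_RUN="${DRY_RUN:-1}"                 # 1 = print commands instead of running them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Shut the guest down on the old host, move the symlink, start it on the new
# host. Each step runs only if the previous one succeeded.
cold_migrate() {
    guest="$1"
    old="$2"
    new="$3"
    run ssh "$old" xm shutdown -w "$guest" &&
    run ssh "$new" ln -s "$CONF_DIR/$guest" "$LINK_DIR/$guest" &&
    run ssh "$old" rm "$LINK_DIR/$guest" &&
    run ssh "$new" xm create "$LINK_DIR/$guest"
}
```

Chaining with && matters most for the first step: if the shutdown fails, nothing else runs, so the guest never ends up started on two hosts at once.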
I wanted to share my experiences and techniques not as a HOWTO, but simply as another way of looking at sharing resources between servers. I’ve gotten many a raised eyebrow (and worse) from other sysadmins when I’ve described my setup. But it does work, and despite all the pitfalls and caveats involved I maintain that it’s still pretty easy to ruin or disrupt a “traditional” cluster. This just has less overhead, and fewer places to make mistakes (though mistakes can be really, really B-A-D).