Thursday, November 22, 2012

RHEL Cluster Anti-Affinity Configuration

I'm often amused by how vendors define "High Availability", aka HA.  Customers always talk about Five Nines, but "off the shelf" HA solutions seldom achieve 99% availability.  In the case of RedHat's HA cluster service, the default configuration might provide unattended failover within 60 seconds.  Given that Five Nines only allows 25.9 seconds per month, a single failure can blow away a service level agreement.

To counter the real-world lag of application failover, a system must be load balanced across a cluster.  A real HA environment would be at least three nodes running two instances of the service.  If one node fails, the load balancer redirects everything to the second instance while the service is recovered on the third node.

There is an HA problem that VMware has addressed in its HA solution and RedHat has not: the anti-affinity rule.  Affinity is when two "processes" favor the same resource.  An example would be running a web and a database instance on the same machine to improve performance.  In the case of redundant services, running them on the same machine is pointless if the machine fails.  To prevent this, we need an anti-affinity rule that requires the two processes to never be on the same machine.

RedHat cluster suite provides affinity in the form of child services.  If the cluster moves the web service to another node, the database has to follow.  What they don't provide is an anti-affinity rule to prevent the load balanced services from trying to run on a single node.  As a matter of fact, by default, all services will start on the same cluster node.  (It will be the node with the lowest number.)

I found I could implement anti-affinity from within the service's init.d script.  First, we add an /etc/sysconfig/ file for the process, with the following variables:
CLUST_ENABLED="true"
CLUST_MYSERVICE="service:bark"
CLUST_COLLISION="service:meow service:moo"
A collision is when the presence of another service prevents this service from starting on this node.  The names should be listed exactly as they appear in clustat.  Make sure the script sources the config file:
# source sysconfig file
[ -f /etc/sysconfig/$prog ] && . /etc/sysconfig/$prog
Next, add a new subroutine to the existing init.d script:
cluster(){
  # look for colliding services already running on this host
  K=$(for J in $CLUST_COLLISION; do \
          clustat | grep -q "$J.*$HOSTNAME.*started" && \
          echo "$J "; \
          done)
  if [ -n "$K" ]; then
    # show the colliding service names
    echo -n "Cluster, collision $prog: $K"
    # fail, but with a success return code
    failure; echo; exit 0
  fi
  # look for this service running on other nodes
  K=$(clustat | grep "$CLUST_MYSERVICE.*started" | \
          awk '{print $2}')
  if [ -n "$K" ]; then
    # show hostname of other instance
    echo -n "Cluster, $prog exists: $(echo $K | cut -d. -f1)"
    # fail, but with a success return code
    failure; echo; exit 0
  fi
}
Finally, add a reference to the cluster sub in the start sub:
start(){
  if [ $(ps -C cluster-$prog.sh | grep -c $prog) == 0 ]; then
    # only check cluster status if enabled
    [ "$CLUST_ENABLED" == "true" ] && cluster
    echo -n "Starting $prog"
Here's what happens in the case of a collision:
  • rgmanager issues a start
  • the cluster sub recognizes the collision, but tells rgmanager that it started successfully (exit 0)
  • rgmanager shows the service as running
  • 30 seconds pass
  • rgmanager issues a status against the service, which fails, since the init.d script lied about the service running
  • the cluster orders a relocation of the service
  • rgmanager issues a start... on a different node
  • there is no collision this time, so the init.d runs as expected
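
The collision check can be exercised by hand before wiring it into the init.d script.  Here's a minimal sketch, run on a cluster node, assuming the example sysconfig values above (bark, meow, and moo are hypothetical service names) and that clustat is in the PATH:
. /etc/sysconfig/bark
# report any colliding service already started on this node
for J in $CLUST_COLLISION; do
  clustat | grep -q "$J.*$HOSTNAME.*started" && \
    echo "collision: $J is running on $HOSTNAME"
done
# print the node, if any, already running this service
clustat | grep "$CLUST_MYSERVICE.*started" | awk '{print $2}'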

Thursday, November 08, 2012

RHEL 6 Clustering, VM Fencing

I recently retasked one of my lab machines as a RedHat virtualization server, which RedHat calls RHEV, but is really KVM.  One of this machine's tasks is to support a test cluster of VMs.  Under normal circumstances, clustering would require a remote management interface such as an ILO, DRAC, or RMM.

As usual, I was disappointed with how difficult this was.  To make matters more difficult for you, I won't be covering clustering in this article.  This document's scope is limited to setting up RHEV VM fencing.

On the host machine, we need to install the fence daemon.  Considering this is very lightweight, I'm going to do a shotgun install:
yum install fence-virtd-*
On my machine, this loaded four packages: the daemon, the interface between the daemon and the hypervisor, and two "plugins".  (The serial plugin is probably not needed.)

The base RPM will provide the /etc/fence_virt.conf file.  Modify it to look like this:
listeners {
  multicast {
    family = "ipv4";
    interface = "virbr0";
    address = "225.0.0.12";
    port = "1229";
    key_file = "/etc/cluster/fence_xvm.key";
  }
}
fence_virtd {
  module_path = "/usr/lib64/fence-virt";
  backend = "libvirt";
  listener = "multicast";
}
backends {
  libvirt {
    uri = "qemu:///system";
  }
}
Two things to notice about the config file.  The key_file option is little more than a password in a text file, which is going to have to be duplicated on all the VMs in the cluster.  The "theory" is that only a device with the password will be able to fence other nodes.  This brings us to the second point, the multicast option.  If a cluster node issues a fence command, the symmetric authentication key will be multicast on the network in the clear.  Thus, the reality is that the key_file provides no real security.

Which brings us to a second issue with multicast.  Per RedHat, cross-host fencing is not supported.  As such, all cluster nodes have to exist on the same physical machine, rendering real-world VM clustering pretty much worthless.  Here's the reality of cross-host fencing: it is not supported because of the security concerns of multicasting the clear-text fencing password, and because RedHat cannot guarantee the multicast configuration of the switch infrastructure.  Given properly configured switches, and a dedicated host NIC and virtual bridge in each host, cross-host fencing works.  In this lab configuration, however, it is not a concern.
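
The config references /etc/cluster/fence_xvm.key, which has to be created by hand.  Here's a minimal sketch; I'm assuming any random bytes will do for the shared secret, and the size is arbitrary:
mkdir -p /etc/cluster
dd if=/dev/urandom of=/etc/cluster/fence_xvm.key bs=512 count=1
chmod 600 /etc/cluster/fence_xvm.key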

After creating a key_file, open the fenced port in IPtables:
-A INPUT -s 192.168.122.0/24 -m tcp -p tcp --dport 1229 -j ACCEPT
Copy the key_file to each clustered VM (they don't need the config file) and add the opposite IPtables rule:
-A OUTPUT -d 225.0.0.12 -m tcp -p tcp --dport 1229 -j ACCEPT
On the host, chkconfig and start fence_virtd.  Running netstat should show the host listening on 1229.  What it is listening "for" is the name of a VM to "destroy" (power off).  This means the names of the cluster nodes and the VMs recognized by KVM/QEMU have to match.  On the host, display the status of the VMs using:
watch virsh list
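For reference, the host-side steps look roughly like this (the netstat flags are just one way to check the listener):
chkconfig fence_virtd on
service fence_virtd start
netstat -lnp | grep 1229
Assuming the fence-virt agent is installed on a guest that already has the key, running fence_xvm -o list on that guest should return the host's domain list, which is a quick end-to-end check of the key and multicast path.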
Given a two node cluster, on node1 issue:
fence_node node2
On the host, the status of node2 should change from running to inactive, and a moment later, back to running.  For testing purposes, the fence_node command can be installed on the host, without the host being part of the cluster.  If you try this using yum, you'll get the entire cluster suite.  Instead, force these three RPMs:
rpm -ivh clusterlib-*.x86_64.rpm --nodeps
rpm -ivh corosynclib-*.x86_64.rpm --nodeps
rpm -ivh cman-*.x86_64.rpm --nodeps

Truthfully, the better choice is to build a VM to manage the cluster using Luci.

Sunday, November 04, 2012

Kickstart from Hard Drive ISO

I'm building a machine that may need to be remotely re-imaged, without the benefit of a kickstart server.  I've always heard that you can kick a machine from itself, but had never tried it.  Truthfully, it's probably more trouble than it's worth.  The best option would be to install a DVD drive with media, but configure the BIOS such that the DVD is after the main drive in the boot order.  Since I didn't want an optical drive on this machine, here's how to kickstart a machine from a hard drive.

Let's do this backwards.  Get the machine imaging off an HTTP server first, then change the ks.cfg:
#url --url http://1.2.3.4/rhel6
harddrive --partition=/dev/sdb1 --dir=/
Notice that I'm telling the machine that its image is on sdb, not sda.  Providing two drives is safer than trying to image off the boot/root drive, but it could be a partition on the same drive.  Besides, I've got dozens of little drives lying around doing nothing.  Further down in the file, I also indicated:
#clearpart --all
clearpart --drives=sda

Next, we mount the second drive and copy the ISO into the root of the drive.  To clarify:
mount /dev/sdb1 /mnt
cp rhel-6-x86_64-dvd.iso /mnt/
When kickstarting from a local drive, we use the ISO file itself... not an extracted or loop mounted filesystem.  Looking back at the first change we made to the ks.cfg, we indicated --dir=/, so we are telling the installer that the ISO is on the top level of the drive.

As a matter of convenience, mount the ISO, because we need a few files from it:
mkdir /mnt/rhel6
mount rhel-6-x86_64-dvd.iso /mnt/rhel6 -o loop
Copy three files:
mkdir /mnt/images
cp /mnt/rhel6/images/install.img /mnt/images
cp /mnt/rhel6/isolinux/vmlinuz /mnt/images
cp /mnt/rhel6/isolinux/initrd.img /mnt/images
If you were going to allow the machine to rebuild different versions (or distros), you would want to add a version number to each file.
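For example, something along these lines (the suffixes are just illustrative):
cp /mnt/rhel6/isolinux/vmlinuz /mnt/images/vmlinuz-rhel6
cp /mnt/rhel6/isolinux/initrd.img /mnt/images/initrd-rhel6.img
When staging a rebuild, copy the version you want into /boot and point the Grub stanza at those names.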

To initiate the rebuild, we will use the tried and true Grub rebuild hack:
cd /boot
cp /mnt/images/vmlinuz /boot
cp /mnt/images/initrd.img /boot
cat >> /boot/grub/grub.conf << EOF
title Rebuild
  root (hd0,0)
  kernel /vmlinuz ramdisk_size=8192 ks=hd:sdb1/ks/ks.cfg
  initrd /initrd.img
EOF
Many docs indicate that the second drive has to be made bootable.  In this case, we are still booting off the primary drive, but only long enough to read the ks.cfg from sdb and switch to install.img.  Once install.img has control, it will read the clearpart command, wipe sda, and reinstall from sdb.

There is a "gotcha" I've not quite worked out yet.  As we all know, Linux is notorious for renaming drive letters at boot time.  It is possible that  the machine might confuse sda and sdb.  This could be disastrous if the machine crashed, and while trying to rebuild, it wiped the image drive!  The good news is that the installer reports that it can't find the image and kickstart file, and fails.  Just reboot until it finds the correct drive.

* It would seem that either a UUID or LABEL could be used in both Grub and the kickstart file.  I'll add checking those possibilities to my ToDo list.  Or you could figure that part out and let me know which works.  It's only fair: I've already done the hard part for you.
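
If you want a head start, here's an untested sketch of the LABEL approach.  It assumes the RHEL 6 installer accepts LABEL= in both the ks= boot argument and the harddrive directive, which is exactly the part I haven't verified:
e2label /dev/sdb1 KSDRIVE
# in ks.cfg:
#   harddrive --partition=LABEL=KSDRIVE --dir=/
# in the Grub stanza:
#   kernel /vmlinuz ramdisk_size=8192 ks=hd:LABEL=KSDRIVE:/ks/ks.cfg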

Saturday, November 03, 2012

Citrix Xenserver: Apply Multiple Updates

As a result of reorganizing the servers in my lab, I had to reinstall Citrix Xenserver.  I should have downloaded 6.1, but decided to keep it at 6.0 and apply the updates that I already had on the NAS.  All went well with the install; I moved all my VMs and templates to this machine and retasked the other machine.

When I went to load the updates, a funny thing happened... It refused to load more than one, and expected to reboot after each.  After a moment of thought, I realized that I had probably never tried to load two at a time before.  It seems like something that should be simple, but the procedure is not obvious.

Here's how:
  1. Highlight a server, click the "General" tab, and expand the "Updates" pane.
  2. In the "Updates" pane, notice which updates have already been applied.
  3. On the menu bar, click "Tools", "Install Software Update", and "Next".
  4. Click the "Add" button, select the lowest numbered update, and click "Open".
  5. At this point, it's tempting to add another update, but don't just yet.
  6. Click "Next", select one or more servers, and click "Next".
  7. A check will be run against the update.
If the check succeeds, click "Previous", "Previous", and repeat from step 4.

If the check fails, do two things.  First, click "Cancel" and start the entire procedure over again, but don't add the update that failed the check.  Second, don't blame me-- I didn't create the interface.

Once you've added all the relevant updates, click "Next".  You'll have the choice of performing post-install steps automatically or manually.  What this really means is reboot now or later.  If you select manually (reboot later), it is possible that some of the updates will fail, but that's actually okay.  When an update succeeds, it appears in the "Updates" pane as Applied.  If it fails, it appears as Not applied.

To activate the not applied updates, repeat steps 1, 2, and 3, but instead of step 4, highlight the not applied update.  Continue through the rest of the steps, making sure to do automatic, as recommended.