To counter the real-world lag of application failover, a system must be load balanced across a cluster. A real HA environment needs at least three nodes running two instances of the service. If one node fails, the load balancer redirects all traffic to the second instance while the service is recovered on the third node.
There is an HA problem that VMware has addressed in their HA solution and RedHat has not: the anti-affinity rule. Affinity is when two "processes" favor the same resource. An example is running a web and a database instance on the same machine because it improves performance. Redundant services are the opposite case: running them on the same machine is pointless, because they all go down together if that machine fails. To prevent this, we need an anti-affinity rule that requires the two processes to never be on the same machine.
RedHat cluster suite provides affinity in the form of child services. If the cluster moves the web service to another node, the database has to follow. What they don't provide is an anti-affinity rule to prevent the load balanced services from trying to run on a single node. As a matter of fact, by default, all services will start on the same cluster node. (It will be the node with the lowest number.)
I found I could implement anti-affinity from within the service's init.d script. First, we add an /etc/sysconfig/ file for the process, with the following variables:
CLUST_ENABLED="true"
CLUST_MYSERVICE="service:bark"
CLUST_COLLISION="service:meow service:moo"
A collision is when the presence of a service prevents this service from starting on this node. The names should be listed exactly as they appear in clustat. Make sure the script sources the config file:
# source sysconfig file
[ -f /etc/sysconfig/$prog ] && . /etc/sysconfig/$prog
Next, add a new subroutine to the existing init.d script:
cluster(){
    # look for other services on this host
    K=$(for J in $CLUST_COLLISION; do
        clustat | grep -q "$J.*$HOSTNAME.*started" && echo "$J "
    done)
    # test with -n and quotes: an unquoted list of several names breaks [ ]
    if [ -n "$K" ]; then
        # show service names
        echo -n "Cluster, collision $prog: $K"
        # fail, but with a success return code
        failure; echo; exit 0
    fi
    # look for this service running on other nodes
    K=$(clustat | grep "$CLUST_MYSERVICE.*started" | awk '{print $2}')
    if [ -n "$K" ]; then
        # show hostname of other instance
        echo -n "Cluster, $prog exists: $(echo $K | cut -d. -f1)"
        # fail, but with a success return code
        failure; echo; exit 0
    fi
}
Finally, add a reference to the cluster sub in the start sub:
start(){
    if [ $(ps -C cluster-$prog.sh | grep -c $prog) == 0 ]; then
        # only check cluster status if enabled
        [ "$CLUST_ENABLED" == "true" ] && cluster
        echo -n "Starting $prog"
        # ... (the rest of the existing start sub is unchanged)
Here's what happens in the case of a collision:
- rgmanager issues a start
- the cluster sub recognizes the collision, but tells rgmanager that it started successfully (exit 0)
- rgmanager shows the service as running
- 30 seconds pass
- rgmanager issues a status against the service, which fails, since the init.d script lied about the service running
- the cluster orders a relocation of the service
- rgmanager issues a start... on a different node
- there is no collision this time, so the init.d runs as expected
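To try the collision test without touching a live cluster, the grep logic can be exercised against canned output. The following is only a rough sketch: the sample_clustat table is a simplified, made-up stand-in for real clustat output, and the find_collisions helper and the node names are hypothetical, not part of the init.d script above.

```shell
#!/bin/sh
# Hypothetical, simplified clustat service table (not captured from a real cluster).
sample_clustat() {
    cat <<'EOF'
 Service Name      Owner (Last)         State
 ------- ----      ----- ------         -----
 service:meow      node1.example.com    started
 service:moo       node3.example.com    started
EOF
}

# Print every name from the collision list that is "started" on the given host.
# Usage: clustat-like-output | find_collisions HOST "svc1 svc2 ..."
find_collisions() {
    host=$1
    names=$2
    status=$(cat)    # read the status table from stdin once
    for svc in $names; do
        # same pattern as the cluster sub: service name, then host, then "started"
        if printf '%s\n' "$status" | grep -q "$svc.*$host.*started"; then
            printf '%s\n' "$svc"
        fi
    done
}

# Dry run: where could a service that collides with meow and moo start?
for node in node1.example.com node2.example.com; do
    hits=$(sample_clustat | find_collisions "$node" "service:meow service:moo")
    if [ -n "$hits" ]; then
        echo "$node: collision with $hits, init script would exit 0 without starting"
    else
        echo "$node: no collision, init script would start the service"
    fi
done
```

With this sample table, the first node reports a collision and the second does not, which mirrors the first start attempt and the start after relocation in the list above. Which node rgmanager actually picks for the relocation is up to the cluster and is not modeled here.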