Monday, May 27, 2013

PVM, aka Beowulf Cluster

I stumbled upon a little project this weekend and found myself without a Beowulf cluster to help out.  It had been several years since I'd built a computational cluster, so I ran into a few "new" gotchas.  But... before we get to the fun stuff, let's review:
  • No, Beowulf is not "dead technology"
  • No, Hadoop is not the perfect tool for every job
What was once called a Beowulf cluster is really the use of a Message Passing Interface (MPI) that, when deployed across nodes, creates a Parallel Virtual Machine (PVM).  Ready for this-- the current generation of Hadoop (et al.) clusters are actually MPIs, and thus PVMs.  Given network-attached storage, a map-reduce cluster is, in theory, Beowulf compliant.

To set up the absolute simplest PVM, we need two nodes, with an NFS share, and a user account.  The user needs an SSH key pair distributed to all nodes such that the user can login to any machine, from any machine.  Each node's hostname must be able to resolve via DNS or /etc/hosts.  Each node's hostname and address must be statically configured, and cannot be "localhost".
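A rough sketch of that prep work, assuming two hosts named pvm1 and pvm2 (the 192.168.1.x addresses are placeholders -- substitute your own):
# as root, on every node, make both names resolvable
cat >> /etc/hosts <<'EOF'
192.168.1.101   pvm1
192.168.1.102   pvm2
EOF
# as the compute user, create a key pair and push it to every node
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
ssh-copy-id pvm1
ssh-copy-id pvm2
# sanity check: should not prompt for a password (repeat in the other direction)
ssh pvm2 hostname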

The first step is to install the base package from an EPEL repo.  (I'm using Scientific Linux 6.)  The package is delivered as source and must be compiled with a minimal set of options:
yum install -y pvm --nogpgcheck
rpm -ql pvm | grep -m1 pvm3
/usr/share/pvm3
This shows us where the RPM installed the source.  The issue with this incarnation is that it is still configured for RSH rather than SSH:
export PVM_ROOT=/usr/share/pvm3/
cd $PVM_ROOT
find . -type f -exec sed -i "s~bin/rsh~bin/ssh~g" {} \;
make; make install
Unfortunately, there are still hard-coded references to RSH in some of the binary libraries, so we spoof the references with a symlink:
ln -s /usr/bin/ssh /usr/bin/rsh
Repeat these steps on all (both) nodes.
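If you'd rather not repeat the build by hand, the same steps can be pushed over SSH -- a rough sketch, assuming you can reach pvm2 as root (adapt for sudo if not):
ssh root@pvm2 'yum install -y pvm --nogpgcheck'
ssh root@pvm2 'export PVM_ROOT=/usr/share/pvm3/; cd $PVM_ROOT; \
    find . -type f -exec sed -i "s~bin/rsh~bin/ssh~g" {} \; ; \
    make; make install; \
    ln -sf /usr/bin/ssh /usr/bin/rsh'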

On only one of the nodes (it doesn't matter which one), validate that PVM is not running, configure the PVM_ROOT variable, and start the first instance as the non-root user:
ps -ef | awk '!/awk/ && /pvm/'
echo "export PVM_ROOT=/usr/share/pvm3" >> ~/.bashrc
echo id | pvm
pvm> id
t40001
Console: exit handler called
pvmd still running.
ps -ef | awk '!/awk/ && /pvm/'
compute <snip> /usr/share/pvm3/lib/LINUXX86_64/pvmd3
Notice that the PVM daemon launched and remained resident.  Individual commands can be piped to PVM, or an interactive console can be used.  From the same node, remotely configure the next node:
ssh pvm2 'echo "export PVM_ROOT=/usr/share/pvm3" \
                >> ~/.bashrc'
# should not prompt for a password
ssh pvm2 'echo $PVM_ROOT'
/usr/share/pvm3
ssh pvm2 'rm -f /tmp/pvm*'
The last line is very, very important: it removes any stale PVM daemon files left in /tmp by a previous run, which would otherwise prevent a fresh pvmd from starting on that node.  From the first node, remotely start the second node:
pvm
pvmd already running.
pvm> conf
conf
1 host, 1 data format
   HOST     DTID     ARCH   SPEED       DSIG
   pvm1    40000 LINUXX86_64    1000 0x00408c41
pvm> add pvm2
add pvm2
1 successful
   HOST     DTID
   pvm2    80000
pvm> conf
conf
2 hosts, 1 data format
   HOST     DTID     ARCH   SPEED       DSIG
   pvm1    40000 LINUXX86_64    1000 0x00408c41
   pvm2    80000 LINUXX86_64    1000 0x00408c41
In this sequence, we accessed the console on pvm1 to view the cluster's configuration (conf), then added the second node, which started a pvmd on pvm2.  It is now displayed in the cluster's conf.
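As an aside, adding hosts one at a time gets tedious on a bigger grid.  Had we wanted to bring up both nodes in one shot, the console will also read a hostfile at startup -- a minimal sketch, assuming a clean start (no pvmd running yet) and the same host names:
cat > ~/hostfile <<'EOF'
pvm1
pvm2
EOF
pvm ~/hostfile
# the local pvmd starts first, then every listed host is added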

Just for fun, let's throw it the simplest of compute jobs:
pvm> spawn -4 -> /bin/hostname
4 successful
t8000b
t8000c
t4000c
t4000d
pvm>
[3:t4000d] pvm1
[3:t4000c] pvm1
[3:t4000c] EOF
[3:t4000d] EOF
[3:t8000b] pvm2
[3:t8000b] EOF
[3:t8000c] pvm2
[3:t8000c] EOF
[3] finished
There are a few things to notice about the output:
  1. The command asked the cluster to spawn the command "/bin/hostname" four times.
  2. The "->" option indicates we wanted the output returned to the console, which is completely abnormal... we only do this for testing.
  3. The prompt returned before the output.  The assumption is that our compute jobs will take an extended period of time.
  4. The responses were not displayed in any particular order.  They were displayed as they returned, because all this magic is happening asynchronously.
  5. Each job's responses, from each node, could be grep'ed from the output using the unique serial number automatically assigned to the job (see the sketch just after this list).
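A quick sketch of that last point, assuming the console output above was captured to a file (spawn.log is just a placeholder name):
# job 3's responses from every node, plus its "finished" marker
grep '^\[3' spawn.log
# or only the lines carrying actual output (drop the EOF/finished markers)
grep '^\[3' spawn.log | grep -v -e EOF -e finished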
To leave the console and return to the command prompt, issue the quit command.  All started nodes will continue to run.  To shut down the compute cluster, execute:
echo halt | pvm
Finally, remember this one last thing: The cluster is a peer-to-peer grid.  Any node can manage any other, any node can schedule jobs, and any node can issue a halt.
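In other words (a quick sketch, reusing the pvm2 name from above), the whole grid can be halted just as easily from the second node:
ssh pvm2 'echo halt | pvm'
# every pvmd in the virtual machine shuts down, not just the one on pvm2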