Because we had a steady stream of worker node upgrades and new additions, we needed a system that skips installing each worker node individually and also lets us upgrade all of the existing worker nodes in one go.
Basically we want to achieve a “plug-and-play” worker node infrastructure where each node uses its own HDD for job storage and processing.
Requirements:
- TFTP and DHCP server (with static IP-MAC binding)
- NFS server
- An up-to-date worker node system from which we’ll build the NFS root filesystem export for the worker nodes
The overview of the whole process is the following:
- Add a new physical worker node (WN)
- Add its MAC address to the DHCP server configuration file, assigning a fixed IP address
- Power on the WN
- If the HDD is blank, preformat it and populate it with the needed directories (/var, /tmp and /home)
- Configure the WN (NFS root filesystem, HDD mount points, hostname etc.) based on its IP address
- The node is ready to run as if it was installed normally
The actual configuration of the infrastructure
We won’t go into the details of setting up a TFTP and DHCP server, but it should be clearly stated that each WN must have its MAC address bound to a fixed IP in the DHCP server’s configuration file:
host my_new_wn {
    ....
    hardware ethernet ae:d6:90:bb:aa:1d;
    fixed-address 172.24.10.100;
    ....
}
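Since the WNs boot over the network, the same dhcpd configuration also has to point them at the TFTP server. A minimal sketch, assuming ISC dhcpd and a pxelinux-based netboot (the server address and boot file name are assumptions, adjust to your setup):

subnet 172.24.10.0 netmask 255.255.255.0 {
    option routers 172.24.10.1;
    next-server 172.24.10.1;      # TFTP server (assumed address)
    filename "pxelinux.0";        # pxelinux bootloader served over TFTP
    ....
}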
Also, you need to adjust the IPs and paths in the presented examples to match your environment.
The worker node must have its root filesystem on the NFS share. In order to achieve this, we must deal with two steps:
- NFS server configuration
- WN boot process modification with initrd customization
As stated in the requirements, we need an image of the root filesystem that each WN will use as its own. To build it, we take a normal, up-to-date worker node, copy its whole filesystem and export it through NFS.
In our case we’ve used /opt/nfs_export and exported it with (/etc/exports):
/opt/nfs_export 172.24.10.0/24(sync,no_subtree_check,ro,no_root_squash)
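After editing /etc/exports, the export has to be (re)loaded on the NFS server; with the standard nfs-utils tools that is simply:

exportfs -ra    # re-read /etc/exports
exportfs -v     # verify that /opt/nfs_export is exported read-only to 172.24.10.0/24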
The synchronization of the reference WN to the export directory was done over ssh with rsync:
rsync -av --one-file-system -l -t -e ssh root@REFERENCE_WN:/ /opt/nfs_export/
cd /opt/nfs_export
mv var var.ro
mkdir var
cd etc/sysconfig
rm network
ln -s ../../var/network network
We’ve moved the var directory because each WN will have its own /var, /tmp and /home mounted locally. Later on, we’ll populate the /var on each WN with the contents from var.ro.
Because we need each worker node to have its own hostname, it’s necessary to store this information locally on each WN (we’ve chosen /var/network as the path) and symlink that file to the actual configuration file, /etc/sysconfig/network.
Otherwise each WN would use the same /etc/sysconfig/network and we would end up with all the WNs having the same hostname.
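On a running WN, /var/network ends up being a regular sysconfig-style file, for example (the hostname is purely illustrative):

NETWORKING=yes
HOSTNAME=wn100.mygrid.example

The helper script shown further down only manages the HOSTNAME line; any other settings you add by hand (such as NETWORKING=yes) are preserved across reboots.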
The second step is to modify the boot process of the worker nodes so that they will get an IP address, mount the above exported filesystem and switchroot into it.
Extract the initrd image (from /boot) into a temporary directory:
mkdir /tmp/1; cd /tmp/1
cp /boot/THE_INITRD_IMAGE .
gunzip < THE_INITRD_IMAGE | cpio -i --make-directories
Now we have the contents of the initrd and we can examine and modify the boot sequence in the “init” script in the current directory. As you may notice, there aren’t many libraries and executables available there, so we need to add some utilities and modules to the initrd image (see the sketch after this list). In our case we had to add:
- busybox – a lot of useful commands
- network card module
- nfs modules
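A sketch of how those can be copied into the unpacked initrd tree, assuming a statically linked busybox binary and a RHEL-style /lib/modules layout (the busybox location and the module list for a pcnet32 NIC are assumptions for illustration):

# Run from /tmp/1, the unpacked initrd tree
cp /sbin/busybox bin/
for cmd in ifconfig udhcpc sleep; do
    ln -sf busybox bin/$cmd       # busybox applets used by our modified init
done

KVER=$(uname -r)
for m in mii pcnet32 sunrpc lockd nfs_acl nfs; do
    find /lib/modules/$KVER -name "$m.ko" -exec cp {} lib/ \;
done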
Then, we’ve modified the standard mounting procedure from:
echo Scanning and configuring dmraid supported devices
echo Scanning logical volumes
lvm vgscan --ignorelockingfailure
echo Activating logical volumes
lvm vgchange -ay --ignorelockingfailure VolGroup00
resume /dev/VolGroup00/LogVol01
echo Creating root device.
mkrootdev -t ext3 -o defaults,ro /dev/VolGroup00/LogVol00
echo Mounting root filesystem.
mount /sysroot
To:
/bin/insmod /lib/mii.ko
/bin/insmod /lib/pcnet32.ko
/bin/insmod /lib/sunrpc.ko
/bin/insmod /lib/lockd.ko
/bin/insmod /lib/nfs_acl.ko
/bin/insmod /lib/nfs.ko
/bin/ifconfig eth0 up
/bin/udhcpc -q -s /udhcpc.script
sleep 1
mount -o ro,proto=tcp,nolock 172.24.10.1:/opt/nfs_export /sysroot
So, we have deleted the part that uses the HDD for the root filesystem, and instead we mount the root filesystem from the NFS export. Before that, we load the necessary modules and request an IP address.
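The /udhcpc.script referenced above is the hook that busybox udhcpc runs when it obtains a lease; a minimal sketch (it only has to configure the interface, since the NFS server address is hardcoded in init):

#!/bin/sh
# Called by udhcpc with the event name as $1; the lease data arrives as
# environment variables ($interface, $ip, $subnet, ...)
case "$1" in
    deconfig)
        /bin/ifconfig "$interface" 0.0.0.0
        ;;
    bound|renew)
        /bin/ifconfig "$interface" "$ip" netmask "$subnet"
        ;;
esac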
Next, we need to repack the initrd image and copy it to the tftp server directory (you should tweak according to your setup):
find . | cpio --create --format=newc | gzip > ../initrd.img
cp ../initrd.img /tftpboot
At this point, every WN that is set to boot from the network should boot, mount the root filesystem from the NFS server, and continue with normal booting.
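For completeness, the pxelinux side of the netboot can be as small as the following /tftpboot/pxelinux.cfg/default (the kernel file name is an assumption; it is whatever kernel matches the initrd you just rebuilt):

DEFAULT wn
LABEL wn
    KERNEL vmlinuz
    APPEND initrd=initrd.img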
Of course, there are some more issues that need addressing:
- preformatting the HDD if necessary and copying some directories (in our specific case)
- setting the hostname correctly
In order to achieve this, we need to add some extra steps to the boot sequence of the OS. For that we call our own helper script from /opt/nfs_export/etc/rc.sysinit, right after udev has started:
/sbin/start_udev

#----FROM HERE--------
/etc/setup.sh
if [ -f /etc/sysconfig/network ]; then
    . /etc/sysconfig/network
fi
#------TO HERE-------

# Load other user-defined modules
for file in /etc/sysconfig/modules/*.modules ; do
    [ -x $file ] && $file
done
The script file contains the following:
#!/bin/bash

# Partition layout fed to fdisk: 3 primary partitions of ~6 GB, ~5 GB and ~51 GB
PARTDEF="n\np\n1\n \n+6000M\nn\np\n2\n \n+5000M\nn\np\n3\n \n+51000M\nw\n"

function initial_setup() {
    echo "Performing initial setup on $1"
    echo -ne "$PARTDEF" | fdisk "$1"
    sleep 10
    partprobe
    sleep 10
    mke2fs -j "$1"1
    mke2fs -j "$1"2
    mke2fs -j "$1"3
    sleep 1
    mount "$1"1 /var
    cp -aR /var.ro/* /var/
    # Write the hostname matching our IP address (taken from /etc/hosts)
    echo -n "HOSTNAME=" >> /var/network
    grep "`ifconfig eth0 | grep 'inet addr' | awk '{ print $2 }' | tr ':' ' ' | awk '{ print $2 }'`" /etc/hosts | awk '{ print $2 }' >> /var/network
    umount /var
}

if [ ! -e /dev/sda1 ]; then
    if [ -e /dev/sda ]; then
        DEVICE="/dev/sda"
        initial_setup $DEVICE
    else
        if [ ! -e /dev/hda1 ]; then
            if [ -e /dev/hda ]; then
                DEVICE="/dev/hda"
                initial_setup $DEVICE
            fi
        else
            DEVICE="/dev/hda"
        fi
    fi
else
    DEVICE="/dev/sda"
fi

echo "Storage on $DEVICE"
mount "$DEVICE"1 /var
mount "$DEVICE"2 /tmp
mount "$DEVICE"3 /home

if [ -e /var/wipe ]; then
    echo "Wipe HDD requested :)"
    sleep 10
    #echo b > /proc/sysrq-trigger  # to be done
fi

# Refresh the HOSTNAME line in /var/network on every boot
grep -v "HOSTNAME" /var/network > /var/network1
echo -n "HOSTNAME=" >> /var/network1
grep "`ifconfig eth0 | grep 'inet addr' | awk '{ print $2 }' | tr ':' ' ' | awk '{ print $2 }'`" /etc/hosts | awk '{ print $2 }' >> /var/network1
rm -f /var/network
mv /var/network1 /var/network
So, what does this script do? It should be clear that it needs adaptation for each environment but, in our case:
- It will check if we have /dev/sda or /dev/hda as an HDD
- It will check if there are any partitions on it
- If not, it will create partitions 1, 2 and 3 on that device with 6 GB / 5 GB / 51 GB, format them with ext3 and copy the original /var contents from /var.ro
- It will mount those partitions to /var, /tmp and /home
- It will set the hostname according to what it finds in /etc/hosts for the current IP address (so the hosts file must be kept up to date; see the example below). /etc/sysconfig/network is a symlink to /var/network because each WN must have a different configuration file, and that file can only live in a per-node read-write directory.
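For reference, the hostname lookup just matches the WN’s IP address against ordinary /etc/hosts lines and takes the second field, so entries along these lines are enough (names are illustrative):

172.24.10.100   wn100.mygrid.example   wn100
172.24.10.101   wn101.mygrid.example   wn101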
After this whole process, we’ve greatly simplified installation and updates of the WNs at our grid site. Now we only have to add the new worker node’s MAC address to the DHCP server’s config file and its IP address to the hosts file, then power up the WN. It formats itself and starts up just like a normally installed WN would.
There are a lot of tweaks that could be done, but we’ve simplified the process and got it to a point where it simply does what it’s supposed to do.