Highly Available Lustre MDS -- failover nodes with Red Hat cluster tools

Background: The Lustre Metadata Server (MDS) is integral to using Lustre. If it is not available, clients cannot use the filesystem. To maximize the availability of Lustre we set up a 2-node HA cluster in which one node is the active MDS and the other remains on standby to take over. The Lustre metadata partition is kept on shared iSCSI storage, but it should also be realistic to configure a shared GFS filesystem with the RH cluster tools. We depend on the cluster tools to make sure only one of the nodes is active at any time, and to determine when to fail over and when to "kill" a bad node.

RHEL5 and thus also SL5 ship with an integrated suite of cluster management tools. These tools somewhat simplify the process of setting up an HA cluster.

However, they are not perfect and do require some modification to work the way we want. Some changes to core scripts are also needed to accommodate the Lustre filesystem.

Initial cluster setup

We started with some documentation on the Dell wiki. The documentation Red Hat provides is similar. http://linux.dell.com/wiki/index.php/Products/HA/DellRedHatHALinuxCluster/Cluster

A brief summary of the procedure (all packages in SL5 base distribution):
  • Install the "ricci" package on cluster members
  • Install the "luci" package on a machine you'd like to use to manage the cluster. This machine isn't required for operation of the cluster, and it is recommended to use a machine separate from the cluster nodes.
  • On the management machine run "luci_admin init" and set an admin password.
  • Make sure luci and ricci services are started and chkconfig'd on
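The steps above can be sketched as a small shell script. This is only an illustration: the package names, "luci_admin init", service and chkconfig come from the procedure above, while the helper names (run, setup_cluster_node, setup_management_host) and the DRY_RUN switch are made up for this example. By default it only prints the commands.

```shell
#!/bin/bash
# Sketch of the package setup. DRY_RUN=1 (the default here) only prints the
# commands; set DRY_RUN=0 and run as root to actually execute them.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Run on every cluster member.
setup_cluster_node() {
    run yum -y install ricci
    run /sbin/service ricci start
    run /sbin/chkconfig ricci on
}

# Run on the separate management machine.
setup_management_host() {
    run yum -y install luci
    run luci_admin init              # prompts for the admin password
    run /sbin/service luci start
    run /sbin/chkconfig luci on
}
```

Calling setup_cluster_node on lmd01 and lmd02 and setup_management_host on the management machine covers the four bullets above.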

At that point you should be able to log in to the cluster manager using a browser on port 8084. In our case: https://manage.aglt2.org:8084

  1. Enter your username and password to securely log in to the luci server.
  2. Go to the cluster tab.
  3. Click Create a New Cluster.
  4. Enter a cluster name of 15 characters or less.
  5. Add the fully qualified private hostname or IP address and root password for each cluster node. We used the "local" addresses for cluster communications: lmd01.local, lmd02.local. It is probably a good idea to have all hostnames in /etc/hosts - it would not be good if a DNS failure took down our HA cluster.
  6. NOTE: You may also select Check if node passwords are identical and only enter the password for the first node.
  7. Ensure that the option for Enable Shared Storage Support is selected and click Submit. NOTE: This is not strictly necessary for our cluster but the option was enabled when I set it up

If any problems are encountered in this initial setup, check these things:
  • Is ricci installed and running on each member of the cluster?
  • Do you have any old/duplicate packages installed relating to the cluster tools? Check your yum logs to see what luci installs, and try yum remove on it to see what dependencies are coming in. I had exactly this problem, which I did not realize until I had given up and set out to completely eradicate all traces of the Red Hat cluster software.

Fencing setup

If a problem with a node requires the standby to be brought up, the cluster tools will try to cleanly stop services on the active node and unmount the Lustre MDS (more on how this happens later). If the cluster cannot verify that the node is fully stopped, it will power that node off before bringing up the standby host. This is referred to as "fencing". We use two fencing methods: DRAC5 and our switchable APC PDUs.

Log in to the luci web interface and go to Cluster->Nodes. Select a node.

First we'll set up and test DRAC5 fencing by clicking on "Add a fence device" under "Main Fencing Method". Choose a "DRAC" fence device and fill in the appropriate information for the DRAC on that node. Do this for both nodes in the cluster. The name can be whatever you wish.
The stock "DRAC" device won't work for us because it can only be configured for telnet logins, which we normally do not enable. However, there is a fence agent updated for the DRAC5. It is just not offered in the web interface.
To use the right device we'll need to edit /etc/cluster/cluster.conf on a node. Pick either node, it doesn't matter, since there is a utility to propagate the config. Be sure you did the initial setup for the DRAC fencing method first so you have a template to work from.

Locate the relevant lines in the "fencedevices" section and change the agent from "fence_drac" to "fence_drac5". We also add the "secure" parameter.
<fencedevice agent="fence_drac5" ipaddr="x.x.x.x" login="xxx" name="lmd01_drac" passwd="xxx" secure="1"/>
<fencedevice agent="fence_drac5" ipaddr="x.x.x.x" login="xxx" name="lmd02_drac" passwd="xxx" secure="1"/>

Only do this edit on one host, and increment the "config_version" attribute at the top of the file. Then propagate:
  ccs_tool update /etc/cluster/cluster.conf 
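Forgetting to increment config_version is an easy mistake, so the bump can be scripted. The following is a sketch under the assumption that config_version sits on the opening "cluster" line, as it does in our file; bump_config_version is a name made up for this example. It keeps a .bak copy and prints the new version.

```shell
# Increment config_version="N" on the opening <cluster ...> line of a
# cluster.conf file. Keeps the previous file as FILE.bak.
# Usage: bump_config_version /etc/cluster/cluster.conf
bump_config_version() {
    local conf=$1 old new
    old=$(sed -n '/<cluster /s/.*config_version="\([0-9][0-9]*\)".*/\1/p' "$conf" | head -n 1)
    [ -n "$old" ] || { echo "config_version not found in $conf" >&2; return 1; }
    new=$((old + 1))
    cp -p "$conf" "$conf.bak" || return 1
    # Only touch the <cluster ...> line; <clusternode ...> lines do not match.
    sed "/<cluster /s/config_version=\"$old\"/config_version=\"$new\"/" "$conf.bak" > "$conf" || return 1
    echo "$new"
}
```

A typical use after editing would be: bump_config_version /etc/cluster/cluster.conf && ccs_tool update /etc/cluster/cluster.conf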

Next time you look in the web interface you will see that it doesn't display the DRAC info anymore. It does display a "remove this device" button but it apparently doesn't know how to display the info.

I recommend testing that this works before continuing. There is a menu of tasks at the top right of the node configuration screen; one of them is "fence this node". Try this while watching the log messages on the node that you are not fencing.

APC Fencing

This is straightforward. Once again on the node configuration page choose to add a Backup Fencing Method. Choose APC power strip from the choices. Fill in information for one strip. The outlet goes in the "Port" field. Then, choose to "Add an instance" and add the other power strip (if you have a 2-supply machine). Then click "update fence device properties". No further manual editing should be needed.

This should also be tested by disabling network access to the DRAC card and trying again to fence each node.

More fencing tests

It is a good idea to simulate a failure event to see that a node is properly fenced. For example, try disabling the network interface on one node. The cluster tools should automatically fence the node if it cannot be contacted.

Configuring the Lustre MDS service

At this point we should have a cluster of 2 nodes with fencing fully configured and tested. That's pretty useless to anyone without some kind of service.

A Lustre Meta Data server (MDS, aka LMD) is created simply by creating a lustre filesystem with the right parameters and mounting it. The filesystem is configured to be aware of both available LMD servers and will use the one that is online (see link to lustre setup here). As noted before, it is key that only one LMD machine be online and available. This is where the HA failover and fencing comes in.

In order to fail over a service, the cluster has to have a way to check the status of the service and dependencies.

TO-DO FIRST: You could perhaps do the final configuration before these customizations, but I think it will go much more smoothly if you first take care of the customizations Lustre needs.

NOTE: It is probably redundant to have both a service and a filesystem resource given modifications made to the filesystem resource script. They perform essentially the same checks, though a managed filesystem resource will check that the options are right when it mounts a filesystem. So could the init script. I recommend staying away from modifying RH packaged scripts since they are likely to be overwritten in updates. In that case, skip using a filesystem resource at all and simply write checks into the init script which return "1" under any failure.
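As an illustration of a check that returns "1" under any failure, here is a minimal sketch of a health test that an init script's status action could call. The function name lustre_healthy is invented for this example, and the path is a parameter only so the function can be tried without Lustre loaded. The strings "healthy" and "NOT HEALTHY" are the ones checked elsewhere on this page.

```shell
# Return 0 if the Lustre health file reports healthy, 1 under any failure.
# Usage: lustre_healthy [HEALTH_FILE]
lustre_healthy() {
    local health_file=${1:-/proc/fs/lustre/health_check}
    [ -r "$health_file" ] || return 1                  # file missing or unreadable
    grep -q "NOT HEALTHY" "$health_file" && return 1   # explicit failure report
    grep -q "healthy" "$health_file" || return 1       # empty or unexpected content
    return 0
}
```

The design choice here is to treat anything other than a positive "healthy" report as a failure, so an unreadable or empty file triggers a relocation rather than being silently accepted.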

  • The file /usr/share/cluster/fs.sh needs modifications to be able to successfully manage a Lustre fs. NOTE: You can use it unmodified if you add "quick_status=1" to the filesystem resource in /etc/cluster/cluster.conf, but then it won't really check anything. As noted already, we don't necessarily need to define a filesystem resource.

    root@lmd01 /usr/share/cluster# diff -urN fs.sh.28Mar2010 fs.sh
    --- fs.sh.28Mar2010     2010-03-28 10:37:32.000000000 -0400
    +++ fs.sh       2010-03-28 13:38:34.000000000 -0400
    @@ -384,7 +384,7 @@
            [ -z "$OCF_RESKEY_fstype" ] && return 0
            case $OCF_RESKEY_fstype in
    -       ext2|ext3|jfs|xfs|reiserfs|vfat|tmpfs|vxfs)
    +       ext2|ext3|jfs|xfs|reiserfs|vfat|tmpfs|vxfs|lustre)
                    return 0
    @@ -505,6 +505,23 @@
    +               lustre)
    +                       case $o in
    +                       flock|localflock|noflock|user_xattr|nouser_xattr)
    +                               continue
    +                               ;;
    +                       acl|noacl|nosvc|nomgs|exclude=*|abort_recov)
    +                               continue
    +                               ;;
    +                       md_stripe_cache_size|recovery_time_soft=*)
    +                               continue
    +                               ;;
    +                       recovery_time_hard=*)
    +                               continue
    +                               ;;
    +                       esac
    +                       ;;
                    echo Option $o not supported for $OCF_RESKEY_fstype
    @@ -643,7 +660,19 @@
            [ $OCF_CHECK_LEVEL -lt 10 ] && return $YES
    +#+SPM March 28, 2010 Add Lustre check/test
    +       fsmnt=`grep $mount_point /proc/mounts | awk '{print $3}'`
    +        if [ $fsmnt = "lustre" ]; then
    +            ocf_log debug "fs (isAlive): Found Lustre filesystem"
    +           lstatus=`cat /proc/fs/lustre/health_check`
    +            if [ $lstatus = "healthy" ]; then
    +              return $YES
    +            else
    +              return $NO
    +            fi
    +        fi
            # depth 10 test (read test)
            ls $mount_point > /dev/null 2> /dev/null
            if [ $? -ne 0 ]; then
    @@ -999,6 +1028,7 @@
             case "$fstype" in
             reiserfs) typeset fsck_needed="" ;;
             ext3)     typeset fsck_needed="" ;;
    +        lustre)   typeset fsck_needed="" ;;
             jfs)      typeset fsck_needed="" ;;
             xfs)      typeset fsck_needed="" ;;
             ext2)     typeset fsck_needed=yes ;;
  • Create the /etc/init.d/MountMDT script. It should behave like any other RH init script in response to stop|start|restart|status and return 0 in response to a "good" status check. The checks in the script for the existence of the mounted filesystem are probably redundant with the checks that come into play for a Filesystem Resource. However the script also checks /proc/fs/lustre/health_check and returns bad status if the file contains "NOT HEALTHY". It is a good idea to check for as many possible failure conditions as we can. In our experience the filesystem will not dismount in case of network interruption to the iSCSI storage but eventually the status will change in /proc. Script below:
    #!/bin/bash
    # chkconfig: - 26 74
    # description: mount/unmount lustre MDT filesystem in /etc/fstab
    # copied from GFS2 init script

    ### BEGIN INIT INFO
    # Provides: 
    ### END INIT INFO

    . /etc/init.d/functions
    [ -f /etc/sysconfig/cluster ] && . /etc/sysconfig/cluster

    # This script's behavior is modeled closely after the netfs script.  
    LUSTREFSTAB=$(LC_ALL=C awk '!/^#/ && $3 == "lustre" { print $2 }' /etc/fstab)
    LUSTREMTAB=$(LC_ALL=C awk '!/^#/ && $3 == "lustre" && $2 != "/" { print $2 }' /proc/mounts)
    NOTHEALTHY=`grep -o "NOT HEALTHY" /proc/fs/lustre/health_check 2> /dev/null`

    # Mount the MDT if a lustre entry is configured in /etc/fstab.
    function mount_lustre {
            if [ -n "$LUSTREFSTAB" ]
            then
                    action $"Mounting LUSTRE filesystem: " mount /mnt/mdt
            fi
    }

    # See how we were called.
    case "$1" in
      start)
            mount_lustre
            if [ $? -ne 0 ]
            then
                    echo "Mount failed, restarting iscsi service and retrying"
                    /sbin/service iscsi restart
                    sleep 10
                    mount_lustre
            fi
            touch /var/lock/subsys/lustre
            ;;
      stop)
            if [ -n "$LUSTREMTAB" ] 
            then
                    # sig/retry starting values assumed, as in the stock gfs2 init script
                    sig=
                    retry=6
                    remaining=`LC_ALL=C awk '!/^#/ && $3 == "lustre" && $2 != "/" {print $2}' /proc/mounts`
                    while [ -n "$remaining" -a "$retry" -gt 0 ]
                    do
                            action $"Unmounting LUSTRE filesystems: " umount /mnt/mdt
                            if [ $retry -eq 0 ] 
                            then
                                    action $"Unmounting lustre filesystems (lazy): " umount -l /mnt/mdt
                                    break
                            fi
                            sleep 2
                            remaining=`LC_ALL=C awk '!/^#/ && $3 == "lustre" && $2 != "/" {print $2}' /proc/mounts`
                            [ -z "$remaining" ] && break
                            # Kill processes still using the mount, harder on later passes.
                            /sbin/fuser -k -m $sig $remaining &> /dev/null
                            sleep 10
                            retry=$(($retry - 1))
                            sig=-9
                    done
            fi
            rm -f /var/lock/subsys/lustre
            ;;
      status)
            if [ -f /proc/mounts ]
            then
                    [ -n "$LUSTREFSTAB" ] && {
                         echo $"Configured lustre mountpoints: "
                         for fs in $LUSTREFSTAB; do echo $fs ; done
                    }
                    [ -n "$LUSTREMTAB" ] && {
                          echo $"Active lustre mountpoints: "
                          for fs in $LUSTREMTAB; do echo $fs ; done
                    }
                    # A non-zero status is what tells the cluster to relocate the service.
                    if [ "$NOTHEALTHY" == "NOT HEALTHY" ] 
                    then
                            echo "Filesystems show 'not healthy'"
                            exit 1
                    fi
            else
                    echo "/proc filesystem unavailable"
            fi
            ;;
      restart)
            $0 stop
            $0 start
            ;;
      reload)
            $0 start
            ;;
      *)
            echo $"Usage: $0 {start|stop|restart|reload|status}"
            exit 1
    esac

    exit 0

    Final Configuration:
    1. In the cluster manager interface go to the Cluster tab and under Services choose to "Add a Service". Name it MountMDT or whatever is appropriate.
    2. We definitely want to choose to "Run Exclusive", "Automatically Start", and choose recovery policy of "Relocate" (as in, relocate to other node).
    3. Now add a resource to the service. Choose type "Script Resource" and give the path to the init script we are going to put into place: /etc/init.d/MountMDT. Really it is just a copy of the RH GFS2 init script already in place, with modifications specific to Lustre.
    4. Before we save it, we may also want to add a child for the filesystem resource to the script resource. Click on "add a child" and choose a filesystem resource. For the device it is best to use "UUID=xxxx-xxxx-...". You won't be able to choose "lustre" as an fs, but we can modify that later in cluster.conf. If you modified /usr/share/cluster/fs.sh to accommodate Lustre then you can change the fstype to lustre. If not, just remove the filesystem type entirely and add the "quick_status=1" parameter to the resource. Remember to propagate: ccs_tool update /etc/cluster/cluster.conf

    Configuration file

    There's simply no way to accomplish all of this through the GUI. In fact, once you are familiar with the cluster.conf file it is probably best to just edit this file and leave the GUI aside. The GUI is good for establishing the initial config and for giving you a template to work from for devices and services, but the final setup has to be done by editing the file. Since the GUI has no options for some of these things, using it runs the risk of breaking this setup. As long as you know the custom parts and avoid them it will be fine. Here is what we ended up with:

    <?xml version="1.0"?>
    <cluster alias="LMD" config_version="45" name="LMD">
            <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
            <clusternodes>
                    <clusternode name="lmd02.local" nodeid="1" votes="1">
                            <fence>
                                    <method name="1">
                                            <device name="lmd02_drac"/>
                                    </method>
                                    <method name="2">
                                            <device name="lmd02_socket1" option="off" port="19"/>
                                            <device name="lmd02_socket2" option="off" port="19"/>
                                            <device name="lmd02_socket1" option="on" port="19"/>
                                            <device name="lmd02_socket2" option="on" port="19"/>
                                    </method>
                            </fence>
                    </clusternode>
                    <clusternode name="lmd01.local" nodeid="2" votes="1">
                            <fence>
                                    <method name="1">
                                            <device name="lmd01_drac"/>
                                    </method>
                                    <method name="2">
                                            <device name="lmd01_socket1" option="off" port="16"/>
                                            <device name="lmd01_socket2" option="off" port="16"/>
                                            <device name="lmd01_socket1" option="on" port="16"/>
                                            <device name="lmd01_socket2" option="on" port="16"/>
                                    </method>
                            </fence>
                    </clusternode>
            </clusternodes>
            <cman expected_votes="1" two_node="1"/>
            <rm>
                    <service autostart="1" exclusive="1" name="MountMDT" recovery="relocate">
                            <script file="/etc/init.d/MountMDT" name="MountMDT-init">
                                    <fs device="UUID=977bdf9c-0645-465e-9e72-1a1d1e5f93ca" force_fsck="0" force_unmount="1" fsid="9659" fstype="lustre" mountpoint="/mnt/mdt" name="LMD" self_fence="1"/>
                            </script>
                    </service>
            </rm>
            <fencedevices>
                    <fencedevice agent="fence_drac5" ipaddr="x.x.x.x" login="" name="lmd01_drac" passwd="" secure="1"/>
                    <fencedevice agent="fence_drac5" ipaddr="x.x.x.x" login="" name="lmd02_drac" passwd="" secure="1"/>
                    <fencedevice agent="fence_apc" ipaddr="x.x.x.x" login="" name="lmd01_socket1" passwd=""/>
                    <fencedevice agent="fence_apc" ipaddr="x.x.x.x" login="" name="lmd01_socket2" passwd=""/>
                    <fencedevice agent="fence_apc" ipaddr="x.x.x.x" login="" name="lmd02_socket1" passwd=""/>
                    <fencedevice agent="fence_apc" ipaddr="x.x.x.x" login="" name="lmd02_socket2" passwd=""/>
            </fencedevices>
    </cluster>
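
    When editing this file by hand it is easy to reference a fence device under a node that was never defined as a fencedevice. The following is a rough sketch of a consistency check; undefined_fence_devices is a name made up for this example, and it relies on simple pattern matching (one element per line, as in our file) rather than real XML parsing.

```shell
# List fence device names that a node references with <device name="...">
# but that have no matching <fencedevice ... name="..."> definition.
# Usage: undefined_fence_devices /etc/cluster/cluster.conf
undefined_fence_devices() {
    local conf=$1 name
    grep -o '<device [^>]*name="[^"]*"' "$conf" |
        sed 's/.*name="\([^"]*\)"/\1/' | sort -u |
    while read -r name; do
        grep -q "<fencedevice [^>]*name=\"$name\"" "$conf" || echo "$name"
    done
}
```

    No output means every device named under a node is defined. Running it before "ccs_tool update" is a cheap way to catch a typo in a device name.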

    -- BenMeekhof - 06 Apr 2010
Topic revision: r3 - 06 Apr 2010, BenMeekhof