Lustre Configuration and Setup for AGLT2

In March 2010 we revisited our exploration of Lustre for use at AGLT2. This was motivated in part by the release of Lustre 1.8.2, which allowed the "patchless" client to be built without modification on our newest UltraLight kernel (2.6.30-9UL1smp). Originally we had set up an HA MDT using lmd01.aglt2.org and lmd02.aglt2.org (using RHEL4 and Heartbeat). The details of why and how are below.

Overview of the Need

For AGLT2 we have a number of storage systems in place. The primary (large) storage is dCache, which currently has 1 petabyte of disk online. In addition we have our own AFS cell atlas.umich.edu, which hosts many user home areas as well as some application software. Because of issues using AFS as a "write" destination for grid jobs we have also set up an ever-increasing number of NFS servers, which host user data and OSG account home areas. This use of standalone NFS servers has a number of disadvantages:
  • Users must be assigned space on a specific server or servers
    • Users need to track their server location(s) and related space quotas
  • Space management is not easy, and if users fill their assigned space we either have to migrate them to a new NFS area or add a different location for them to use for new files.
  • Individual servers may alternately run "idle" and "overloaded" depending upon the current activity at AGLT2
  • NFS servers don't scale well when thousands of cores may potentially be accessing them
  • Performance is sometimes a bottleneck, depending upon the job mix

There is extensive experience in using Lustre at many HEP and grid sites, and we have robust hardware that could be used to enable this. Lustre is free to use (though most successful Lustre sites have some kind of support contract). Having an AGLT2-wide /lustre mount point would, in principle, address the shortcomings we have seen in our use of standalone NFS servers. To determine how this would work for AGLT2 we set about installing and configuring Lustre with the intent of eventually migrating all our NFS users/space into it.

Planning for Lustre

In our first Lustre install in 2008 we used some of our Dell storage nodes (PE2950+4xMD1000 shelves) as OSSs hosting 12 OSTs each (we had to split the storage into OST sizes that were usable by Lustre 1.6.7). The management (MGS) and metadata (MDT) servers were co-located on an HA/Heartbeat cluster comprised of two nodes, lmd01.aglt2.org and lmd02.aglt2.org. This initial HA cluster was not very robust, and the fail-over itself seemed to have (or perhaps cause) some problems.

In March 2010 we decided to revisit how we would architect this next generation of Lustre at AGLT2. Our first step was to redo the lmd01/02 MDT nodes:
  • Upgrade to the most current Scientific Linux v5.4 64-bit
  • Install the RedHat HA clustering system (conga/ricci/luci)
  • Install the needed iSCSI access and multipathing for Linux (back-end storage is a Dell MD3000i server with RAID-10 on 15K 300GB SAS disks)
  • Install all needed Lustre RPMs (V1.8.2 and using the ext4 version which allows up to 16TB OSTs)
  • Configure and test resiliency:
    • Bonded network setup (VLANs over bonded link; mode=1 (active/standby) to two different switch stacks)
    • iSCSI MDT location visible via Linux RDAC (multipathed)
    • Try various tests of fail-over (power cut; network cut; service stop; umount) and verify fencing works correctly

Details are below for these steps.

The next consideration was the Lustre management (MGS) configuration. Previously we co-located the MGS with the MDT on the HA cluster. Lustre best-practice recommendations are to separate this functionality off onto its own node. Because we have Enterprise VMware running here, and the Lustre MGS node requires very little disk (100-200 MB) and little network traffic (fast ethernet is fine), we decided to create a VM to host the MGS. We cloned one of the existing VMs (cache) and removed the SQUID services on that node. Our goal is to use VMware's Fault Tolerant capability for this VM to make sure it is always available. Details are below under the MGS section.

For the OSS (storage servers) we used UMFS05 and UMFS18 as the first OSSs. These systems were set up as:
  • Scientific Linux v5.4 64-bit
  • VLANs over bonded network mode 1 (active/standby) with the 10GE link being 'primary' and the 1GE being 'secondary'
  • MD1000 pools setup for RAID-6
    • There was a problem with UMFS18: a RAID-6 across 15 2TB disks is too large even for the ext4 version of Lustre.
  • Lustre v1.8.2 ext4 version installed

Details below.

Installation and Setup of Lustre at AGLT2

The outline of what we did is above. In the following sections we will describe the details of our configuration.

MGS (Management Node) for Lustre

As noted, our plan was to isolate the MGS on its own node. This is to ensure that issues with the MDT are decoupled from the MGS. Because we have VMware Enterprise deployed and the MGS resource needs are small, we decided to deploy the MGS on a VM.

The first step was to clone an existing VM in our cluster. We chose the cache VM as a starting point to clone. The cache node runs SQUID for our cluster and is fully updated to SL5.4, running the most recent patches and VMware Tools. Our intent is to eventually run the new MGS VM in Fault Tolerant mode. This has some implications for the VM configuration. A Fault Tolerant VM needs:
  • Only 1 CPU allowed
  • No para-virtualized drivers (e.g., vmxnet3)
  • Disks must be "thick" (no thin-provisioning)

The cache VM had 1 CPU and acceptable drivers but the disk was thin-provisioned. We had to convert to thick provisioning during the clone operation. The result was a much larger hard disk than we needed for this node...more on that later.

We re-assigned the original sunnas.aglt2.org DNS name and corresponding 192.41.230.140 IP to mgs.aglt2.org. We also assigned a new local (private) DNS mgs.local with IP 10.10.1.140. The VM was booted and the hostname and network configuration changes were made.

The first step after rebooting the new MGS node was to clean up the SQUID and log files. The original hard disk was configured for 40GB and had approximately 30GB used. After clean-up and the Lustre install the used space was under 8.4GB.

We downloaded all the needed Lustre RPMs from Oracle/SUN and put them in /afs/atlas.umich.edu/hardware/Lustre/. This will be the source for all our installs. On mgs.aglt2.org we installed the following RPMs:
  • kernel-devel-2.6.18-164.11.1.el5_lustre.1.8.2.x86_64
  • lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.2.x86_64
  • lustre-modules-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2.x86_64
  • kernel-2.6.18-164.11.1.el5_lustre.1.8.2.x86_64
  • lustre-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2.x86_64
  • kernel-headers-2.6.18-164.11.1.el5_lustre.1.8.2.x86_64
  • e2fsprogs-1.41.6.sun1-0redhat.x86_64 (Needed for mkfs.lustre)

NOTE: the versions above were the ext4 variants which support 16TB OSTs. The names listed above do not reflect this, but the original RPMs do have "ext4" in their filenames.
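As a rough sketch of how the install goes from the AFS staging area (the exact filenames here are assumptions; as noted, the real ext4 RPMs have "ext4" in their names, so adjust to match what is actually on disk):

cd /afs/atlas.umich.edu/hardware/Lustre/
# Kernel pieces first, then the Lustre modules/userspace, then the Sun e2fsprogs (needed for mkfs.lustre)
rpm -ivh kernel-2.6.18-164.11.1.el5_lustre.1.8.2.x86_64.rpm \
         kernel-devel-2.6.18-164.11.1.el5_lustre.1.8.2.x86_64.rpm \
         kernel-headers-2.6.18-164.11.1.el5_lustre.1.8.2.x86_64.rpm
rpm -ivh lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.2.x86_64.rpm \
         lustre-modules-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2.x86_64.rpm \
         lustre-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2.x86_64.rpm
rpm -Uvh e2fsprogs-1.41.6.sun1-0redhat.x86_64.rpm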

The first step in the Lustre installation is to add an lnet options line to /etc/modprobe.conf:

options lnet networks=tcp0(eth0),tcp1(eth1)

This tells Lustre what network interfaces it can use. In our case eth0 is the private network (10.10.0.0/20) and eth1 is the AGLT2 public network (192.41.230.0/24). We then run depmod -a to update the module information.
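A quick sanity check that LNET picked up both interfaces (a sketch; the NIDs shown in the comment are what we would expect given the IPs on this node):

modprobe lnet
lctl network up     # start LNET using the networks= line from /etc/modprobe.conf
lctl list_nids      # expect 10.10.1.140@tcp and 192.41.230.140@tcp1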

The next step in installing Lustre/MGS on this node is to format the MGS area. In the links below there is some discussion about what the MGS needs in terms of resources. It was mentioned that only 100 MBytes of space would be required. To host this we set up a "small" 2GB iSCSI storage area on our Dell MD3000i (UMVMSTOR01), located on our RAID-10 area composed of 15K 300GB SAS disks. We set up access to this new LUN so that the VMware servers could mount it. On VMware we created a new VMFS 3.33 datastore which was visible to all VMware nodes. Then we added a new hard disk to the mgs.aglt2.org VM, which shows up as 1503 MB on /dev/sdb. To format the MGS area for Lustre we do:
  • mkfs.lustre --fsname aglt2 --mgs /dev/sdb (see below):
[mgs:~]# mkfs.lustre --fsname aglt2 --mgs --reformat /dev/sdb
   Permanent disk data:
Target:     MGS
Index:      unassigned
Lustre FS:  aglt2
Mount type: ldiskfs
Flags:      0x74
              (MGS needs_index first_time update )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters:

device size = 1433MB
2 6 18
formatting backing filesystem ldiskfs on /dev/sdb
        target name  MGS
        4k blocks     367001
        options        -J size=56 -q -O dir_index,extents,uninit_groups -F
mkfs_cmd = mke2fs -j -b 4096 -L MGS  -J size=56 -q -O dir_index,extents,uninit_groups -F /dev/sdb 367001
Writing CONFIGS/mountdata
  • We must also run the following tune2fs command on it:
[mgs:~]# tune2fs -O uninit_bg /dev/sdb
tune2fs 1.41.6.sun1 (30-May-2009)

At this point we can get some information about the newly formatted MGS area so we can set up the /etc/fstab:
[mgs:~]# /lib/udev/vol_id /dev/sdb
ID_FS_USAGE=filesystem
ID_FS_TYPE=ext4
ID_FS_VERSION=1.0
ID_FS_UUID=32c64d26-a74d-45a6-90ce-bf0422196049
ID_FS_LABEL=MGS
ID_FS_LABEL_SAFE=MGS

Using this we can set up UUID mounting by adding this to /etc/fstab:
UUID=32c64d26-a74d-45a6-90ce-bf0422196049       /mnt/mgs        lustre  defaults   0 0

Test mount:
[mgs:~]# mount /mnt/mgs
[mgs:~]# dmesg
Linux version 2.6.18-164.11.1.el5_lustre.1.8.2 (lbuild@x86-build-0) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) #1 SMP Sat Jan 23 18:02:32 MST 2010
...
Lustre: OBD class driver, http://www.lustre.org/
Lustre:     Lustre Version: 1.8.2
Lustre:     Build Version: 1.8.2-20100125121550-PRISTINE-2.6.18-164.11.1.el5_lustre.1.8.2
Lustre: Added LNI 10.10.1.140@tcp [8/256/0/180]
Lustre: Added LNI 192.41.230.140@tcp1 [8/256/0/180]
Lustre: Accept secure, port 988
Lustre: Lustre Client File System; http://www.lustre.org/
LDISKFS-fs: barriers enabled
kjournald2 starting: pid 16787, dev sdb:8, commit interval 5 seconds
LDISKFS FS on sdb, internal journal on sdb:8
LDISKFS-fs: delayed allocation enabled
LDISKFS-fs: file extents enabled
LDISKFS-fs: mballoc enabled
LDISKFS-fs: mounted filesystem sdb with ordered data mode
LDISKFS-fs: mballoc: 0 blocks 0 reqs (0 success)
LDISKFS-fs: mballoc: 0 extents scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost
LDISKFS-fs: mballoc: 0 generated and it took 0
LDISKFS-fs: mballoc: 0 preallocated, 0 discarded
LDISKFS-fs: barriers enabled
kjournald2 starting: pid 16790, dev sdb:8, commit interval 5 seconds
LDISKFS FS on sdb, internal journal on sdb:8
LDISKFS-fs: delayed allocation enabled
LDISKFS-fs: file extents enabled
LDISKFS-fs: mballoc enabled
LDISKFS-fs: mounted filesystem sdb with ordered data mode
Lustre: MGS MGS started
Lustre: Server MGS on device /dev/sdb has started
Lustre: MGC10.10.1.140@tcp: Reactivating import
[mgs:~]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda2             39247276   8686544  28567068  24% /
/dev/sda1               101086     56602     39265  60% /boot
tmpfs                  1990700         0   1990700   0% /dev/shm
/dev/sdb               1444916     59520   1311996   5% /mnt/mgs

At this point the MGS node is ready and working. However, the node configuration is not optimized for its task. For one thing the hard disk (which must be thick to work with Fault Tolerant mode) is too large. We may also want to tune the assigned memory for this VM as well as clean up unneeded services which may be running.

Tuning the MGS node setup

The first task is to reduce the size of the system disk. To do this we will create a new hard disk to attach to the VM and set it up to be 15GB (which should be more than sufficient and allow for some log file growth). Once we create/attach the new disk to the VM we need to copy/resize the existing disks. To do this I had to use two tools: qtPartEd ISO (to shrink the existing partitions) and a Knoppix ISO to do dd if=/dev/sda of=/dev/sdc (copy the old data to the new disk). In addition to "shrinking" the disk size this also converted the system disk from "thin" provisioned to "thick" which is required for Fault Tolerant mode.
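For reference, the copy step from the Knoppix session was essentially the following (device names and the 15GB count are illustrative; the shrunken partitions must already fit within the new disk):

# Copy the first 15GB of the old (already shrunk) system disk onto the new thick disk.
# /dev/sda = original 40GB disk, /dev/sdc = new 15GB disk (names may differ under Knoppix).
dd if=/dev/sda of=/dev/sdc bs=1M count=15360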

We had a problem with the VMware infrastructure in activating the Fault Tolerant mode: one of the 10GE cables for the FT NICs seems bad and will need replacing. For now we have Fault Tolerant mode off but HA is on.

MDT Setup on LMD01/02

NOTE: These nodes are not currently in an HA cluster; all references to HA below are currently not correct.

Now that the MGS is configured and running we can move on to the MDT setup for Lustre. This will require some more work and planning. The MDT server for Lustre is a critical component: if we lose the MDT we lose all the data in the system! This service must be up at all times and as resilient as possible.

MDT Volume on umfs15 (Sun 7410)

The MDT volume is iSCSI-mounted from umfs15. The system should be accessed via a multipathed device. Here is the /etc/multipath.conf setup for that device:

defaults {
        udev_dir                /dev
        # invalid, version too old or too new?
        #find_multipaths yes
        user_friendly_names     yes
}
blacklist {
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^hd[a-z]"
        devnode "^sd[a-e]"
        device {
                vendor DELL
                product "PERC|Universal"
        }
}
devices {
        device {
                vendor                  DELL
                product                 MD3000i
                product_blacklist       "Universal Xport"
                features                "1 queue_if_no_path"
                path_grouping_policy    group_by_prio
                hardware_handler        "1 rdac"
                path_checker            rdac
#               prio                    "rdac"
                prio_callout            "/sbin/mpath_prio_rdac /dev/%n"
                failback                immediate
                getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        }

        device {
                vendor            "SUN"
                product            "Sun Storage 7410"
                getuid_callout         "/sbin/scsi_id -g -u -s /block/%n"
                prio_callout             "/sbin/mpath_prio_alua /dev/%n"
                hardware_handler    "0"
                path_grouping_policy     group_by_prio
                failback         immediate
                no_path_retry             queue
                rr_min_io         100
                path_checker         tur
                rr_weight         uniform
        }
}

multipaths {
        multipath {
                wwid    3600144f0bb72c85400004c73dd410001
                alias   sunmdt
        }
}

For an MDT node it is trivial to start Lustre...you just mount the MDT data area as filesystem type lustre. This makes it easy to set up the two nodes (lmd01.aglt2.org and lmd02.aglt2.org). We don't need to share any virtual IP or complex services. Instead we just make sure that only one node has the MDT filesystem mounted at any time, and that node is the Lustre MDT server. As you may have noticed, the MDT filesystem has to be somewhere that BOTH nodes can access it. We are using an iSCSI LUN set up on our Dell MD3000i system (on a RAID-10 configuration over a set of 15K 300GB disks). Each node runs both Linux RDAC and the iSCSI tools, which is documented here. The HA cluster makes sure that one and only one node mounts this area at a time.
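Before starting the MDT it is worth confirming that the multipathed LUN is visible under its alias and then mounting it on exactly one node; a sketch (the /mnt/mdt mount point is an assumption, not necessarily what we use):

multipath -ll sunmdt                          # both paths to the MDT LUN should be shown, one path group active
mkdir -p /mnt/mdt                             # hypothetical mount point for the MDT
mount -t lustre /dev/mapper/sunmdt /mnt/mdt   # only ever mounted on ONE of lmd01/lmd02 at a time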

The Lustre configuration is partly encoded when we create the Lustre filesystem. One of the options is to specify a failover node with a mkfs.lustre option. The various options and considerations for formatting the Lustre MDT filesystem are documented here.

Here is the format command we used:
root@lmd01 ~# mkfs.lustre --fsname aglt2 --mgsnode=10.10.1.140@tcp0 --mgsnode=192.41.230.140@tcp1 --reformat --mdt --failnode=10.10.1.49@tcp0,192.41.230.49@tcp1 /dev/sdd

   Permanent disk data:
Target:     aglt2-MDTffff
Index:      unassigned
Lustre FS:  aglt2
Mount type: ldiskfs
Flags:      0x71
              (MDT needs_index first_time update )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=10.10.1.140@tcp mgsnode=192.41.230.140@tcp1 failover.node=10.10.1.49@tcp,192.41.230.49@tcp1 mdt.group_upcall=/usr/sbin/l_getgroups

device size = 2035712MB
2 6 18
formatting backing filesystem ldiskfs on /dev/sdd
        target name  aglt2-MDTffff
        4k blocks     521142272
        options        -J size=400 -i 4096 -I 512 -q -O dir_index,extents,uninit_groups,mmp -F
mkfs_cmd = mke2fs -j -b 4096 -L aglt2-MDTffff  -J size=400 -i 4096 -I 512 -q -O dir_index,extents,uninit_groups,mmp -F /dev/sdd 521142272

OSS/OST setup on UMFS18 and UMFS05

The actual storage for Lustre is on OSTs (Object Storage Targets) hosted on OSSs (Object Storage Servers). In our case we are going to "seed" our Lustre configuration with two storage servers: umfs18 and umfs05. The UMFS18 node is a Dell R710 with 24 GB of RAM, dual quad-core Nehalem processors (E5520), a dual-port Myricom 10GE "Gen2" performance NIC, 2xPerc6/E w/512MB RAID cards and 4xMD1000 shelves each with 15 2TB/5400RPM disks. The UMFS05 system is a Dell PE2950 server with 32GB of RAM, dual quad-core Harpertown (E5440) processors, a dual-port (failover) Myricom PCI-X NIC, 2xPerc6/E w/512MB RAID cards and 4xMD1000 shelves each with 15 1TB/7200RPM disks.

To get these systems ready for Lustre there are a number of issues to check/resolve:
  • Verify BIOS/firmware is up-to-date for all components
  • Is the network set up to use a bonded configuration? (VLANs over bonds for 2 or more NICs with switch diversity)
  • Is the Dell OpenManage software operational with the Lustre kernel and system setup?
  • Are the drivers up-to-date? (check the Myricom 10GE driver version; see the spot-check commands below)
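A few commands that can be used to spot-check these items (a sketch, assuming OpenManage and ethtool are installed and working):

omreport system version        # BIOS, firmware and OpenManage component versions
omreport storage controller    # both Perc6/E controllers should be listed and healthy
ethtool -i eth2                # Myricom 10GE driver name and version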

For UMFS18 we initially had issues with Dell's OpenManage server not working. We (re)installed via ROCKS 5.3 and OpenManage worked. UMFS18 also had two kinds of cabling to its MD1000 shelves: on controller #1 each channel of the Perc6/E went to a separate shelf, but on controller #2 it was cabled "redundantly" and so has only 1 logical connector (#0). BIOS and firmware seem up-to-date. One disk failed (Ctrl #2, disk 0:0:8) during RAID-6 initialization but it will be replaced shortly.

For UMFS05, controller #1 had connector #0 in a "critical" state. We tried hot-swapping to a different EMM on this shelf but that hung the system. After power-cycling, the system was back and we were able to see all disks normally. BIOS and firmware seem up-to-date.

Network setup

For Lustre it is important that we configure resilient networking for our OSS'es via bonding. For UMFS18 and UMFS05 we have 4 physical connections each that can be used in a bonded setup. I will cover each node separately in the sections below.

UMFS18 networking setup

UMFS18 has a dual-port 10GE Myricom ("Gen2" PCI-e x8) NIC (eth2 and eth3) and two on-board Broadcom NICs (eth0, eth1). We want to set up these NICs in a bonded mode=1 (active/standby) configuration with ARP monitoring configured to test for path connectivity.

The network ports available on UMFS18 are connected as follows:
  • eth0 is connected to Nile port Gi3/39
  • eth1 is connected to SW1-2/G18
  • eth2 (Myricom) is connected to SW7-3/XG4
  • eth3 (Myricom) is connected to SW1-4/XG3
  • drac5 is connected to SW1-4/G21

The 'ethX' ports will be set up into a bond0 mode=1 (active/standby) configuration. Each NIC will participate in 'bond0' with 'eth2' as the primary NIC. The configuration in /etc/modprobe.conf looks like:
alias bond0 bonding
options bond0 mode=1 arp_interval=200 arp_ip_target=192.41.230.1,10.10.1.2 primary=eth2
Here the arp_ip_target values are the IPs of our network gateways and the arp_interval is in units of milliseconds (ms).

Each NIC is set up to be a "slave" of the 'bond0' master. For example, /etc/sysconfig/network-scripts/ifcfg-eth0 looks like:
DEVICE=eth0
BOOTPROTO=none
ONBOOT=no
TYPE=Ethernet
MASTER=bond0
SLAVE=yes
Likewise for the other NICs involved in the bond. Then we set up two other configuration files, ifcfg-bond0 and ifcfg-bond0.4001, configured as follows. For ifcfg-bond0 (untagged on VLAN 4010):
DEVICE=bond0
#HWADDR=00:26:b9:3d:87:ce
IPADDR=10.10.1.38
NETMASK=255.255.254.0
BOOTPROTO=static
ONBOOT=yes
MTU=1500

And for ifcfg-bond0.4001 (tagged on VLAN 4001):
DEVICE=bond0.4001
HWADDR=00:60:DD:46:85:8C
IPADDR=192.41.230.38
NETMASK=255.255.254.0
BOOTPROTO=static
ONBOOT=yes
MTU=1500
VLAN=yes
Notice the VLAN=yes line, which requires 802.1Q VLAN tags to be added to packets.
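Once the bond and VLAN interfaces are up, two quick checks (generic Linux status files, not specific to this setup) confirm the configuration took effect:

cat /proc/net/bonding/bond0   # shows the bonding mode, ARP targets/interval and the currently active slave (should be eth2)
cat /proc/net/vlan/config     # shows bond0.4001 tagged with VLAN id 4001 on top of bond0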

On the Dell switches ALL the ports are configured like this (using SW1-2/G18 as an example):
description ' UMFS18 eth1 bond vlan 4010 untagged,4001 tagged'
spanning-tree portfast
mtu 9216
switchport mode general
switchport general pvid 4010
switchport general allowed vlan add 4010
switchport general allowed vlan add 4001 tagged
lldp transmit-tlv port-desc sys-name sys-desc sys-cap
lldp transmit-mgmt
lldp notification
lldp med transmit-tlv location
lldp med transmit-tlv inventory
The switchport general mode allows trunked VLANs with both tagged and untagged participants. However, we also seem to require the switchport general pvid 4010 line to get the 4010 VLAN working (at least on the Dell M6220 switches).

On the Cisco 6509 the setup of port Gi3/39 looks like:
interface GigabitEthernet3/39
 description UMFS18 eth0 bond vlan 4001 tagged, 4010 untagged
 switchport
 switchport access vlan 4010
 switchport trunk encapsulation dot1q
 switchport trunk native vlan 4010
 switchport trunk allowed vlan 4001,4010
 switchport mode trunk
 mtu 9216
 spanning-tree portfast
end

UMFS05 networking setup

UMFS05 has a dual-port 10GE Myricom "redundant" NIC (only 'eth2' shows up though there are 2 physical connections) and two on-board Broadcom NICs (eth0, eth1). We want to set up these NICs in a bonded mode=1 (active/standby) configuration with ARP monitoring configured to test for path connectivity.

The network ports available on UMFS05 are connected as follows:
  • eth0 is connected to SW1-3/G2
  • eth1 is connected to Nile Gi3/12
  • eth2 (Myricom) is connected both to SW1-2/XG3 and to SW7-3/XG3
  • drac5 is connected to SW1-3/G1

The 'ethX' ports will be set up into a bond0 mode=1 (active/standby) configuration. Each NIC will participate in 'bond0' with 'eth2' as the primary NIC. The configuration in /etc/modprobe.conf looks like:
alias bond0 bonding
options bond0 mode=1 arp_interval=200 arp_ip_target=10.10.1.2,192.41.230.1 primary=eth2
Here the arp_ip_target values are the IPs of our network gateways and the arp_interval is in units of milliseconds (ms).

Each NIC is set up to be a "slave" of the 'bond0' master. For example, /etc/sysconfig/network-scripts/ifcfg-eth0 looks like the entries from UMFS18's network setup above.

On the Dell switches ALL the ports are configured like this (using SW7-3/XG3 as an example):
description ' UMFS05 eth2 bond vlan 4010 untagged,4001 tagged'
spanning-tree portfast
mtu 9216
switchport mode general
switchport general pvid 4010
switchport general allowed vlan add 4010
switchport general allowed vlan add 4001 tagged
lldp transmit-tlv port-desc sys-name sys-desc sys-cap
lldp transmit-mgmt
lldp notification
lldp med transmit-tlv location
lldp med transmit-tlv inventory
The switchport general mode allows trunked VLANs with both tagged and untagged participants. The switchport general pvid 4010 line seems to be required on the Dell M6220 switch.

Note that 'eth2' has two physical ports and connections but only one internal 10GE interface. The system will always start up with port 0 as the active one. If the link is lost it will quickly switch over to port 1 and continue operating. If port 0 has its link restored, this device will not switch back automatically (it only switches back if port 1 loses its link).
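To see the current link state of the failover pair from the host side (a generic check; it reports the logical interface, not which physical port is active):

ethtool eth2 | grep -i "link detected"   # should report "yes" as long as either physical port has link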

Configure OSTs on UMFS18 and UMFS05

Now that the network is set up we are ready to format the OSTs on UMFS18 and UMFS05. The Lustre RAID configuration guide has some recommendations about how to optimize your RAID setup. We are using Dell Perc6/E (w/512MB cache) RAID cards, 2 per system. The storage is 4xMD1000 disk shelves, each holding 15 disks. The very strong recommendation is to utilize RAID-6 over 6-10 disks (ideally either 6 or 10 so that the number of "data" disks is a power of 2). We chose to create two RAID-6 OSTs on each shelf (8 per OSS) as follows.

  • We name the OSTs on shelf n as ostnp, where n is the shelf number and p is the partition number; for example, the second partition on shelf 1 is ost12.
  • The raid configuration is done via a script stored in /afs/atlas.umich.edu/hardware/Lustre/setup_lustre_ost.sh
  • The RAID-6 creation uses Dell's OpenManage software as follows, first for partition 1 and then for partition 2:
omconfig storage controller action=createvdisk controller=$cntrl \
size=max raid=r6 \
pdisk=\
${enclosure}:0,\
${enclosure}:1,\
${enclosure}:2,\
${enclosure}:3,\
${enclosure}:4,\
${enclosure}:5,\
${enclosure}:6 \
stripesize=128kb readpolicy=ra writepolicy=wb name=$vdiskname1

omconfig storage controller action=createvdisk controller=$cntrl \
size=max raid=r6 \
pdisk=\
${enclosure}:7,\
${enclosure}:8,\
${enclosure}:9,\
${enclosure}:10,\
${enclosure}:11,\
${enclosure}:12,\
${enclosure}:13,\
${enclosure}:14 \
stripesize=128kb readpolicy=ra writepolicy=wb name=$vdiskname2

This gives us two RAID-6 partitions of 7 and 8 disks total (5 and 6 "data" disk equivalents). The stripesize=128kb was chosen based upon the Lustre recommendation of having (ndisks - 2)*stripesize <= 1MB (for RAID-6). Unfortunately we can't get exactly 1MB with this number of disks (6 or 10 disks would allow us to match exactly by using a stripesize of 256kb or 128kb respectively). The vdisk names were 'ost11','ost12','ost21',...'ost42' as above.

Once the RAID-6 areas were created we can now (finally) format the Lustre OSTs. The relevant parameters for the mkfs.lustre command are:
  • --ost (this is what we are formatting)
  • --mgsnode=nid, where nid = IP@interface, e.g. 10.10.1.140@tcp0 and 192.41.230.140@tcp1
  • --mkfsoptions to set the number of inodes. I just set this up for approximately 8-9 MB/file, so the number varies depending upon the size of the RAID-6
  • --mountfsoptions set up for the number of stripe blocks. Each block is 4096 bytes, so there are 32 of them in the "stripesize=128kb" we used when creating the RAID-6 array. Since the RAID-6 arrays are 7 or 8 total disks, we have either 5*32 or 6*32 for the "stripe=n" argument (see the short arithmetic check below).
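As a quick check of those stripe values (just the arithmetic, not part of our setup scripts):

# stripe (in 4KB blocks) = (total disks - 2) * stripesize / 4KB
echo $(( (7 - 2) * 128 / 4 ))   # 7-disk RAID-6 -> 160
echo $(( (8 - 2) * 128 / 4 ))   # 8-disk RAID-6 -> 192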

The example formatting commands used are:

UMFS05 "first" partitions (7x1TB total disks):
 mkfs.lustre --ost --mgsnode=10.10.1.140@tcp0,192.41.230.140@tcp1 --fsname=aglt2 --mkfsoptions="-i 524288" --mountfsoptions="stripe=160" /dev/sdb
UMFS05 "second" partitions (8x1TB total disks):
 mkfs.lustre --ost --mgsnode=10.10.1.140@tcp0,192.41.230.140@tcp1 --fsname=aglt2 --mkfsoptions="-i 629149" --mountfsoptions="stripe=192" /dev/sdc

UMFS18 "first" partitions (7x2TB total disks):
 mkfs.lustre --ost --mgsnode=10.10.1.140@tcp0,192.41.230.140@tcp1 --fsname=aglt2 --mkfsoptions="-i 1048576" --mountfsoptions="stripe=160" /dev/sdb
UMFS18 "second" partitions (8x2TB total disks):
 mkfs.lustre --ost --mgsnode=10.10.1.140@tcp0,192.41.230.140@tcp1 --fsname=aglt2 --mkfsoptions="-i 1258298" --mountfsoptions="stripe=192" /dev/sdc

Since shelves, cables or controllers can be moved around, and the OS may also reorder devices, it is a good idea to use either a label or UUID to create the needed /etc/fstab entries. I chose to use UUIDs. To set this up you can use the vol_id program from /lib/udev/vol_id to get the current UUID for your formatted devices. For example:
[root@umfs05 ~]# /lib/udev/vol_id /dev/sdb
ID_FS_USAGE=filesystem
ID_FS_TYPE=ext4
ID_FS_VERSION=1.0
ID_FS_UUID=f5b49285-4d48-4eee-8993-dbf38b593567
ID_FS_LABEL=aglt2-OSTffff
ID_FS_LABEL_SAFE=aglt2-OSTffff

Then, instead of referencing /dev/sdb in the /etc/fstab, you can add a line like:
UUID=f5b49285-4d48-4eee-8993-dbf38b593567  /mnt/ost11 lustre  defaults  0 0 

Once this is done we can mount all the Lustre OSTs via 'mount -a -t lustre'. You can check the status via:
[root@umfs05 ~]# cat /proc/fs/lustre/devices
  0 UP mgc MGC10.10.1.140@tcp ebf831c1-6810-c77e-3047-07dca5f22225 5
  1 UP ost OSS OSS_uuid 3
  2 UP obdfilter aglt2-OST0000 aglt2-OST0000_UUID 5
  3 UP obdfilter aglt2-OST0001 aglt2-OST0001_UUID 5
  4 UP obdfilter aglt2-OST0002 aglt2-OST0002_UUID 5
  5 UP obdfilter aglt2-OST0003 aglt2-OST0003_UUID 5
  6 UP obdfilter aglt2-OST0004 aglt2-OST0004_UUID 5
  7 UP obdfilter aglt2-OST0005 aglt2-OST0005_UUID 5
  8 UP obdfilter aglt2-OST0006 aglt2-OST0006_UUID 5
  9 UP obdfilter aglt2-OST0007 aglt2-OST0007_UUID 5
All 8 OSTs are there. You can also check 'dmesg' to see if there are relevant messages.

Setup the Lustre Client

Setting up a Lustre client is simple. You need a set of "patchless" client RPMs, which are easy to build. You need to:
  • Install the lustre and lustre-modules rpms
  • Add an 'options lnet networks=...' line to /etc/modprobe.conf (e.g. networks=tcp0(eth0),tcp1(eth1) as on the servers above)
  • Make the mount point: mkdir /lustre
  • Run 'depmod -a'
  • Mount Lustre: 'mount -t lustre 10.10.1.140@tcp0:/aglt2 /lustre'

That's it...you should have Lustre mounted and usable now.

A quick test on UMFS01 showed the following:
root@umfs01 /lustre# time dd if=/dev/zero of=./mybigfile count=1000000 bs=10000
1000000+0 records in
1000000+0 records out
10000000000 bytes (10 GB) copied, 88.253 seconds, 113 MB/s

real    1m28.686s
user    0m0.547s
sys     0m50.153s
Not bad considering the network is on a 1GE NIC.

Some useful URLs about Lustre

-- ShawnMcKee - 25 Mar 2010