Notes on setting up and configuring Lustre version 2.7

Source rpms

We have chosen to use the kernel distributed with the rpms from the Lustre repos. This is version 2.6.32-504.8.1, patched for Lustre in the case of the file and metadata servers, stock for the clients. All available Lustre rpms can be downloaded from here.

Component setup

Combined mgs/mdt

Created the host mdtmgs in VMware with
  • 4 cpus
  • 4GB RAM
  • 45GB boot disk
This was built via Cobbler, and then a second disk was created in VMware: a 1TB RAID-10 volume of 15k disks for the combined MGS/MDT. The combined mdtmgs volume is mounted at /mnt/mdtmgs (/dev/sdb).

Install the kernel rpms patched for Lustre. At the time of this build, those were located in /atlas/data08/ball/admin/LustreSL6/2.7/server/. Other required rpms have not been updated since Lustre 2.5, and are stored at /atlas/data08/ball/admin/LustreSL6/2.5/other/.
yum localinstall kernel-2.6.32-504.8.1.el6_lustre.x86_64.rpm kernel-devel-2.6.32-504.8.1.el6_lustre.x86_64.rpm \
kernel-firmware-2.6.32-504.8.1.el6_lustre.x86_64.rpm kernel-headers-2.6.32-504.8.1.el6_lustre.x86_64.rpm

yum localupdate e2fsprogs-1.42.12.wc1-7.el6.x86_64.rpm e2fsprogs-libs-1.42.12.wc1-7.el6.x86_64.rpm \
libcom_err-1.42.12.wc1-7.el6.x86_64.rpm libcom_err-devel-1.42.12.wc1-7.el6.x86_64.rpm \
libss-1.42.12.wc1-7.el6.x86_64.rpm

yum localinstall lustre-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm \
lustre-modules-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm \
lustre-osd-ldiskfs-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm \
lustre-osd-ldiskfs-mount-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm

Created the volume and mounted it via an fstab entry:
  • mkfs.lustre --fsname=umt3B --mgs --mdt --index=0 /dev/sdb
  • LABEL=umt3B:MDT0000 /mnt/mdtmgs lustre acl 0 0
Filesystem Size Used Avail Use% Mounted on
/dev/sdb 746G 8.8G 686G 2% /mnt/mdtmgs

Using the locally built rpms

When installing the locally built rpms, there is an extra step. History from the installation is shown here.

   34  cd /atlas/data08/ball/admin/LustreSL6/2.7.58/server
   39  yum localinstall kernel-2.6.32.504.16.2.el6_lustre-1.x86_64.rpm
   41  /sbin/new-kernel-pkg --package kernel --mkinitrd --dracut --depmod \
--install 2.6.32.504.16.2.el6_lustre
   42 cd ../../2.5/other
   51  yum localupdate e2fsprogs-1.42.12.wc1-7.el6.x86_64.rpm e2fsprogs-libs-1.42.12.wc1-7.el6.x86_64.rpm \
libcom_err-1.42.12.wc1-7.el6.x86_64.rpm libss-1.42.12.wc1-7.el6.x86_64.rpm
   52  yum -y localinstall libcom_err-devel-1.42.12.wc1-7.el6.x86_64.rpm
   53  reboot
   63  cd /atlas/data08/ball/admin/LustreSL6/2.7.58/server
   68  yum localinstall lustre-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm \
lustre-modules-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm \
lustre-osd-ldiskfs-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm \
lustre-osd-ldiskfs-mount-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm
   70  mkfs.lustre --fsname=T3test --mgs --mdt --index=0 /dev/mapper/vg0-lv_home

   Permanent disk data:
Target:     T3test:MDT0000
Index:      0
Lustre FS:  T3test
Mount type: ldiskfs
Flags:      0x65
              (MDT MGS first_time update )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters:

checking for existing Lustre data: not found
device size = 32768MB
formatting backing filesystem ldiskfs on /dev/mapper/vg0-lv_home
        target name  T3test:MDT0000
        4k blocks     8388608
        options        -J size=1310 -I 512 -i 2048 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg \
                           -E lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L T3test:MDT0000  -J size=1310 -I 512 -i 2048 -q -O \
                     dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init \
                    -F /dev/mapper/vg0-lv_home 8388608
Writing CONFIGS/mountdata


The fstab entry is
/dev/mapper/vg0-lv_home /mnt/mgs                lustre  acl             1 2
(This volume was created specially during the build of this machine, and so retains the /dev/mapper entry)

This is the empty, startup state (df -h and df -i output):
/dev/mapper/vg0-lv_home
                       23G  1.3G   21G   6% /mnt/mgs
/dev/mapper/vg0-lv_home
                        1523712   49207    1474505    4% /mnt/mgs

See below about making a kmod-openafs rpm.

A comment about the "acl" option.

We could have added "--mountfsoptions=acl" when creating the mdt and mgs, but for this version of Lustre the mountfsoptions are over-written rather than additive, so we would also have had to add back these defaults: errors=remount-ro,iopen_nopriv,user_xattr

BUT, the online web page says the default is instead "errors=remount-ro,user_xattr", which turns out to be correct. But who knew for certain?

Instead, just add -o acl to the mount.
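
For example, a manual mount on mdtmgs with the acl option would look something like this (a sketch using the device and mount point from above; with the fstab entry shown earlier, a plain "mount /mnt/mdtmgs" picks up the same option):
mount -t lustre -o acl /dev/sdb /mnt/mdtmgs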

Adding Lustre rpms to a file server such as umdist09

Pre-install zfs if it is to be used

See here for directions on installing zfs

NOTE: Lustre 2.7.0 DOES NOT WORK WITH ZFS 0.6.4, IT WORKS ONLY WITH 0.6.3. Various methods, described below, were tried to make the newer zfs work, and all ultimately failed. So, at this moment, zfs 0.6.3 rpms have been built from source and are located at
  • /atlas/data08/ball/admin/zfs_rpms

Changed rpms

This is similar to what we did on mdtmgs. The difference is that zfs is in use for the volumes of umdist09, so we replace these two rpms
  • lustre-osd-ldiskfs-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm
  • lustre-osd-ldiskfs-mount-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm
With these two rpms
  • lustre-osd-zfs-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64
  • lustre-osd-zfs-mount-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64
The servers cannot simultaneously use both ldiskfs and zfs.
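
A sketch of the corresponding install command on such an OSS (the kernel and e2fsprogs steps are the same as on mdtmgs):
yum localinstall lustre-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm \
lustre-modules-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm \
lustre-osd-zfs-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm \
lustre-osd-zfs-mount-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm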

For testing, these were also installed
yum install lustre-tests-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm \
lustre-iokit-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm \
perf-2.6.32-504.8.1.el6_lustre.x86_64.rpm

Building new Lustre rpms

See this URL.

To fully summarize the steps from this URL, including local details it omits (the requirements for the zfs and spl modules), here is the history of the needed commands. Note that it is NOT necessary to downgrade the epel repo from our standard version.

The problem encountered following this recipe is that the kernel does not boot, at least in our environment. Despite indications from the dracut directories, the lvm driver does not appear to correctly load into the initramfs. grub.conf is loaded from the sda partition, but when control transfers, the bootup just stops. These directions are therefore included here only for future reference and, unless you are academically interested, this section can be skipped. FYI, the assumption below is that the zfs rpms for 0.6.4.2 were installed.

   50  yum -y groupinstall "Development Tools"
   51  yum -y install xmlto asciidoc elfutils-libelf-devel zlib-devel binutils-devel newt-devel \
python-devel hmaccalc perl-ExtUtils-Embed bison elfutils-devel audit-libs-devel
   60  yum -y install quilt libselinux-devel
  109  yum install python-docutils
  143  yum install zfs-devel
  173  yum install libuuid-devel

  234  cd /usr/src/spl-0.6.4.2/
  243  ./configure --with-config=kernel
  244  make all
  245  cd ../zfs-0.6.4.2/
  246  ./configure --with-config=kernel
  247  make all

   74  useradd -m build
   75  su build

# Now, as user build

    1  cd $HOME
    3  git clone git://git.hpdd.intel.com/fs/lustre-release.git
(another variation is    git clone git://git.hpdd.intel.com/fs/lustre-release.git -b 'v2_7_0_0')
    4  cd lustre-release/
    5  sh ./autogen.sh
    6  cd $HOME
   10  mkdir -p kernel/rpmbuild/{BUILD,RPMS,SOURCES,SPECS,SRPMS}
   12  cd kernel
   15  echo '%_topdir %(echo $HOME)/kernel/rpmbuild' > ~/.rpmmacros
   19  rpm -ivh http://ftp.redhat.com/pub/redhat/linux/enterprise/6Server/en/os/SRPMS/kernel-2.6.32-504.16.2.el6.src.rpm \
2>&1 | grep -v mockb
   21  cd rpmbuild/
   23  rpmbuild -bp --target=`uname -m` ./SPECS/kernel.spec
#  Add a unique build id so we can be certain our kernel is booted. Do this by
#  editing ~/kernel/rpmbuild/BUILD/kernel-2.6.32-504.16.2.el6/linux-2.6.32-504.16.2.el6.x86_64/Makefile
#  and modifying line 4, the EXTRAVERSION, to read:  EXTRAVERSION = .504.16.2.el6_lustre
   26  cd BUILD/kernel-2.6.32-504.16.2.el6/linux-2.6.32-504.16.2.el6.x86_64/
   30  cp ~/lustre-release/lustre/kernel_patches/kernel_configs/kernel-2.6.32-2.6-rhel6-x86_64.config ./.config
#
#  Now, this step 30 failed to provide a bootable kernel.  I (Bob) was unable to figure out why.
#  Instead, I copied the .config file from umdist08 and used that below with make oldconfig:
#     /usr/src/kernels/2.6.32-504.16.2.el6.x86_64/.config
#
   33  ln -s ~/lustre-release/lustre/kernel_patches/series/2.6-rhel6.series series
   34  ln -s ~/lustre-release/lustre/kernel_patches/patches patches
   35  quilt push -av
   36  cd ~/kernel/rpmbuild/BUILD/kernel-2.6.32-504.16.2.el6/linux-2.6.32-504.16.2.el6.x86_64/
   37  make oldconfig || make menuconfig
   38  make include/asm
   39  make include/linux/version.h
   40  make SUBDIRS=scripts
   41  make include/linux/utsrelease.h
   42  make rpm

Now we can make the full set of rpms that include zfs support for the 0.6.4.2 distribution

WE HAVE LEARNED!  DISABLE THE ZFS REPO FROM FURTHER UPDATES

  127  cd lustre-release/
  128  ./configure --with-linux=/home/build/kernel/rpmbuild/BUILD/kernel-2.6.32.504.16.2.el6_lustre/ \
--with-zfs=/usr/src/zfs-0.6.4.2 --with-spl=/usr/src/spl-0.6.4.2
  129  make rpms

All of the rpms are in the directory ~build/kernel/rpmbuild/RPMS/x86_64/; below is a complete list.

These were also placed in /atlas/data08/ball/admin/LustreSL6/2.7.52/server

 310161168 May 14 13:18 kernel-2.6.32.504.16.2.el6_lustre-1.x86_64.rpm
    536456 May 18 12:37 lustre-2.7.52-2.6.32.504.16.2.el6_lustre_g96f566e.x86_64.rpm
  30254364 May 18 12:37 lustre-debuginfo-2.7.52-2.6.32.504.16.2.el6_lustre_g96f566e.x86_64.rpm
     42708 May 18 12:37 lustre-iokit-2.7.52-2.6.32.504.16.2.el6_lustre_g96f566e.x86_64.rpm
   3562124 May 18 12:37 lustre-modules-2.7.52-2.6.32.504.16.2.el6_lustre_g96f566e.x86_64.rpm
    359748 May 18 12:37 lustre-osd-ldiskfs-2.7.52-2.6.32.504.16.2.el6_lustre_g96f566e.x86_64.rpm
     13144 May 18 12:37 lustre-osd-ldiskfs-mount-2.7.52-2.6.32.504.16.2.el6_lustre_g96f566e.x86_64.rpm
     91776 May 18 12:37 lustre-osd-zfs-2.7.52-2.6.32.504.16.2.el6_lustre_g96f566e.x86_64.rpm
      7876 May 18 12:37 lustre-osd-zfs-mount-2.7.52-2.6.32.504.16.2.el6_lustre_g96f566e.x86_64.rpm
   7692592 May 18 12:37 lustre-source-2.7.52-2.6.32.504.16.2.el6_lustre_g96f566e.x86_64.rpm
   3810480 May 18 12:37 lustre-tests-2.7.52-2.6.32.504.16.2.el6_lustre_g96f566e.x86_64.rpm



Now, to install these rpms, copy them all somewhere safe, and do

yum localinstall kernel-2.6.32.504.16.2.el6_lustre-1.x86_64.rpm
/sbin/new-kernel-pkg --package kernel --mkinitrd --dracut --depmod --install 2.6.32.504.16.2.el6_lustre

Now reboot, install the helpers such as e2fsprogs, and then the needed lustre rpms.

As noted, the kernel would not boot, so, until some other pressing need brings us back here, these rpms are not useful.

Build openafs-kmod rpms

It's easiest to do this build on the same system where you built the lustre and kernel rpms in the previous steps. Get and install the latest openafs srpm from the SL mirror (I used root to build; the "build" user created earlier would probably also work). You will also need to be sure to install libcom_err-devel from the lustre repos in order to be able to install krb5-devel.
rpm -ivh http://mirror.lstn.net/scientific/6x/SRPMS/sl6/openafs.SLx-1.6.14-218.src.rpm
rpm -ivh https://downloads.hpdd.intel.com/public/e2fsprogs/1.42.12.wc1/el6/RPMS/x86_64/libcom_err-devel-1.42.12.wc1-7.el6.x86_64.rpm

We don't have the kernel-devel package the spec requires, so edit /root/rpmbuild/SPECS/openafs.SLx.spec to remove the requirement:
%package -n openafs%{?nsfx}-kmod
Summary: This is really just a dummy to get the build requirement
Group: Networking/Filesystems
# BuildRequires: %{kbuildreq}

Next we'll put a kernel tree in the standard place with a standard name. Some RPM macros later will expect it in /usr/src regardless of what we set ksrcdir to in the next step.
cp -rfa /home/build/kernel/rpmbuild/BUILD/kernel-2.6.32.504.16.2.el6_lustre /usr/src/kernels/2.6.32.504.16.2.el6_lustre

The lustre kernel rpm we build is a little non-standard because it doesn't include the .x86_64 in the /lib/modules directory name. Likewise for the source, we can't use the .x86_64 because some rpm build macros will be looking for the directory in /usr/src without the arch included. Hack around the automatic definitions by redefining some macros near the beginning of the spec. We also fix the version, since the "." in place of "-" in lustre kernel rpms confuses the definition of krelmajor in the script.
# Be sure your changes come after the openafs-sl-defs.sh line; that shell script is where the macros are defined and we want to override those definitions.

%{expand:%(%{_sourcedir}/openafs-sl-defs.sh %{?kernel})}

%define ksrcdir /usr/src/kernels/%{kernel}
%define kmoddir /lib/modules/%{kernel}/extra/openafs
%define kmoddst /lib/modules/%{kernel}/kernel/fs/openafs

# make the version make sense - comes out to kmod-openafs-2 if left at the automatically generated value.
%define krelmajor 504-lustre

Finally, we need to modify a templating macro to use the stripped (no .x86_64) macro we used to define our destination. This macro sets up the %files section which needs to correctly reference the installed location of the kmod:

Line 476 change:
%{expand:%(%{kmodtool} rpmtemplate %{kmod_name} %{unamer} %{depmod} %{kvariants} 2>/dev/null)}

To replace unamer with kernel:
%{expand:%(%{kmodtool} rpmtemplate %{kmod_name} %{kernel} %{depmod} %{kvariants} 2>/dev/null)}

Now let's build. You may find you're missing some -devel packages, which the build will immediately complain about; they can be installed as usual with yum from the SL repos. Take note of the requirement for libcom_err-devel: since we replaced the stock rpm with one from the lustre repos, our -devel will also need to be installed from there.

rpmbuild -bb SPECS/openafs.SLx.spec --define "build_kmod 1" 

....
Wrote: /root/rpmbuild/RPMS/x86_64/kmod-openafs-2-1.6.14-218.sl6.2.6.32.504.16.2.x86_64.rpm
Wrote: /root/rpmbuild/RPMS/x86_64/kmod-openafs-2-debuginfo-1.6.14-218.sl6.2.6.32.504.16.2.x86_64.rpm

Output at start of build should indicate the kernel version, correct path to source, and show build_kmod defined as 1.

Pools on multipath storage

The zfs pools on umdist09 were created using the mpathXY devices. These seem to map to different dm-Z devices on every reboot, but they always map to the correct physical disk in the MD3060e chassis. For example:

zpool create ost-012 raidz2 mapper/mpathbt mapper/mpathbu mapper/mpathcf \
mapper/mpathcg mapper/mpathcr mapper/mpathcs mapper/mpathdd mapper/mpathde \
mapper/mpathdp mapper/mpathdq

Each of these zpools uses two disks from each drawer, so the failure of an entire drawer would be bad, but not totally destructive to any of the pools, assuming no other failures occur at the same time.

The pool should also be set to automatically re-add a disk after replacement of a failed one:
zpool set autoreplace=on ost-012

The Lustre OST were then created with one OST per pool, eg
mkfs.lustre --backfstype=zfs --fsname=umt3B --mgsnode=10.10.2.173@tcp0 --ost --index=0 ost-001/ost0000

These were mounted at, eg, /mnt/ost-001, etc.

The zpools themselves were set to "legacy" mountpoints, so that zfs does not also try to mount them itself. This is per the advice of Andreas Dilger.
zfs set mountpoint=legacy ost-001

Pools on Storage Shelves such as the MD1000

zpools will be created from individual vdisks of 1 disk each, with each set up as a single-disk RAID-0. The naming convention for the vdisks will be
cMdNOO
where M is the controller number, N is the shelf number on the controller, and OO is the disk within that shelf. For example:

  • omconfig storage controller action=createvdisk controller=1 size=max raid=r0 pdisk=0:0:1 name=c1d001
This is a full set of vdisk creation commands for Controller 2, with two shelves of 15 disks each. In practice we did not do exactly this, as the shelves were split up.
for((i=0;i<10;i++)); do omconfig storage controller action=createvdisk controller=2 size=max raid=r0 pdisk=0:0:$i name=c2d00$i; done
for((i=10;i<15;i++)); do omconfig storage controller action=createvdisk controller=2 size=max raid=r0 pdisk=0:0:$i name=c2d0$i; done
for((i=0;i<10;i++)); do omconfig storage controller action=createvdisk controller=2 size=max raid=r0 pdisk=0:1:$i name=c2d10$i; done
for((i=10;i<15;i++)); do omconfig storage controller action=createvdisk controller=2 size=max raid=r0 pdisk=0:1:$i name=c2d1$i; done

Following the creation of the vdisks, the zpools can be created. The best option found is to do this from the "by-path" devices that will not change unless we reconfigure the hardware itself. The sd, dm and mpath devices (if multipath is installed) simply don't remain sufficiently static. For example, from umdist01:
 zpool create -f -m legacy ost-003 raidz2 \
disk/by-path/pci-0000:08:00.0-scsi-0:2:10:0 \
disk/by-path/pci-0000:08:00.0-scsi-0:2:11:0 \
disk/by-path/pci-0000:08:00.0-scsi-0:2:12:0 \
disk/by-path/pci-0000:08:00.0-scsi-0:2:13:0 \
disk/by-path/pci-0000:08:00.0-scsi-0:2:14:0 \
disk/by-path/pci-0000:08:00.0-scsi-0:2:25:0 \
disk/by-path/pci-0000:08:00.0-scsi-0:2:26:0 \
disk/by-path/pci-0000:08:00.0-scsi-0:2:27:0 \
disk/by-path/pci-0000:08:00.0-scsi-0:2:28:0 \
disk/by-path/pci-0000:08:00.0-scsi-0:2:29:0

Several scripts have been created to simplify this. All are in /root/tools. These are
  • full_control.sh This shows the overall order for sub-script execution
  • create_all_JBOD_vdisk.sh
  • map_sd_to_by-path.sh
  • create_all_zpools.sh
    • dev_list.sh This is a utility script used by create_all_zpools.sh
Several utility files are created in /root/zpoolFiles as these scripts execute, and are used by later scripts in the sequence.

For example, to output the full list of disk/by-path devices for the OST above, use the following, where the input arguments are all of the vdisks that will make up the zpool.
/root/tools/dev_list.sh c1d010 c1d011 c1d012 c1d013 c1d014 c1d110 c1d111 c1d112 c1d113 c1d114

Conventions for creating the ost and mounting them

  • zfs pools are numbered sequentially on each OSS, eg, ost-001, ost-002, etc
  • Mount points for the OST are named identically in /mnt, eg, /mnt/ost-001, etc
  • Each lustre file system is created on the zfs pool, with the decimal index as part of the name, for example
    • for index 12 on umdist01, mkfs.lustre uses --index=12 ost-001/ost0012
  • Each OSS has OST that are sequentially numbered via their mkfs.lustre index
  • The mdtmgs node, and the WN, generally know these by the hexadecimal equivalent of the decimal index
    • for index 12, the official OST name is then umt3B-OST000c (see the printf example just below this list)
    • lustre_one_control.sh works with the official name after the dash, eg, OST000c
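
For reference, the hexadecimal form of a decimal index can be generated with printf, for example:
printf 'umt3B-OST%04x\n' 12      # prints umt3B-OST000c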

Example, create lustre file systems and mount them

This set of example commands is taken from the OST creation on umdist01:

mkfs.lustre --backfstype=zfs --fsname=umt3B --mgsnode=10.10.2.173@tcp0 --ost --index=12 ost-001/ost0012
mkfs.lustre --backfstype=zfs --fsname=umt3B --mgsnode=10.10.2.173@tcp0 --ost --index=13 ost-002/ost0013
mkfs.lustre --backfstype=zfs --fsname=umt3B --mgsnode=10.10.2.173@tcp0 --ost --index=14 ost-003/ost0014
mkfs.lustre --backfstype=zfs --fsname=umt3B --mgsnode=10.10.2.173@tcp0 --ost --index=15 ost-004/ost0015
mkfs.lustre --backfstype=zfs --fsname=umt3B --mgsnode=10.10.2.173@tcp0 --ost --index=16 ost-005/ost0016
mkfs.lustre --backfstype=zfs --fsname=umt3B --mgsnode=10.10.2.173@tcp0 --ost --index=17 ost-006/ost0017

The corresponding /etc/fstab entries now are
ost-001/ost0012         /mnt/ost-001            lustre  _netdev         0 0
ost-002/ost0013         /mnt/ost-002            lustre  _netdev         0 0
ost-003/ost0014         /mnt/ost-003            lustre  _netdev         0 0
ost-004/ost0015         /mnt/ost-004            lustre  _netdev         0 0
ost-005/ost0016         /mnt/ost-005            lustre  _netdev         0 0
ost-006/ost0017         /mnt/ost-006            lustre  _netdev         0 0

Do not forget, make the mount points!
mkdir /mnt/ost-001
(etc)

NOTE: When re-creating an OST after it was destroyed for some reason, also add the parameter "--replace" 
in addition to the index number.
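
For example, re-creating the first OST from the umdist01 example above would look something like this (a sketch; same options as the original, plus --replace):
mkfs.lustre --backfstype=zfs --fsname=umt3B --mgsnode=10.10.2.173@tcp0 --ost --index=12 --replace ost-001/ost0012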

Rebuilding a client that was using the older Lustre rpms

This action consists of stopping lustre, unloading all the modules, erasing the lustre client rpms, updating the kernel and related rpms, and installing the new lustre client rpms. The system can then be rebooted. On a Worker node, where the rpms are in the Rocks repos, the following sequence was employed. This also makes sure all of the various grub conf files are updated.
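
The site script unload_lustre.sh used below is not reproduced here; as a rough, hypothetical equivalent, the unload step amounts to something like the following, assuming the stock lustre_rmmod helper shipped with the client rpms:
umount -a -t lustre     # unmount any remaining lustre client file systems
lustre_rmmod            # unload all Lustre and LNet kernel modules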

  957  service lustre_mount_umt3 stop
  958  /atlas/data08/ball/admin/unload_lustre.sh
  960  yum erase lustre-client lustre-client-modules
  964  yum update kernel kernel-devel kernel-doc kernel-firmware kernel-headers kmod-openafs*
  965  cd /boot/grub
  966  cp -p grub.conf grub-orig.conf
  967  cat grub.conf    [ pick out the new kernel entries, and add them after the "reinstall" part of rocks.conf ]
  968  vi rocks.conf
  971  yum install lustre-client lustre-client-modules

Rolling back zfs on a pool server

It became necessary to roll back the zfs version when an unexpected rpm update broke Lustre. Furthermore, zpools created under 0.6.4.1 used properties that were unknown and incompatible with zfs 0.6.3.1. Fortunately, no Lustre file systems had yet been created on these pools, so they were simply destroyed and re-created using the script above.

[root@umdist04 zfs_rpms]# zpool status ost-001
  pool: ost-001
 state: UNAVAIL
status: The pool cannot be accessed on this system because it uses the
        following feature(s) not supported on this system:
        com.delphix:hole_birth
        com.delphix:embedded_data

action: Access the pool from a system that supports the required feature(s),
        or restore the pool from backup.
  scan: none requested

The procedure to perform this rollback, short of a rebuild, is as follows.

  • service zfs stop
  • dkms uninstall -m zfs -v 0.6.4.1 -k 2.6.32-504.8.1.el6_lustre.x86_64
  • dkms uninstall -m spl -v 0.6.4.1 -k 2.6.32-504.8.1.el6_lustre.x86_64
  • yum erase libuutil1 libnvpair1 libzpool2 spl-dkms zfs-dkms spl libzfs2 zfs lsscsi zfs-test zfs-dracut
    • This has the side effect of uninstalling 3 lustre rpms, which must later be re-installed
  • for i in /var/lib/dkms/*/[^k]*/source; do [ -e "$i" ] || echo "$i";done
    • Delete the files found. This is because the zfs rpm removal is stupid
      • rm /var/lib/dkms/spl/0.6.4.1/source
      • rm /var/lib/dkms/zfs/0.6.4.1/source
  • More stupid zfs cleanup
    • cd /var/lib/dkms
    • /bin/rm -rf zfs spl
    • cd /lib/modules/2.6.32-504.16.2.el6.x86_64/weak-updates
    • /bin/rm -rf avl nvpair spl splat unicode zcommon zfs zpios
  • Make sure no more stupidities such as this are found
    • /etc/cron.daily/mlocate.cron
    • locate zfs
  • Unload old zfs modules
    • rmmod zfs zcommon znvpair spl zlib_deflate zavl zunicode
  • Re-install zfs
    • cd /atlas/data08/ball/admin/zfs_rpms
    • yum localinstall libnvpair1-0.6.3-1.3.el6.x86_64.rpm libuutil1-0.6.3-1.3.el6.x86_64.rpm libzfs2-0.6.3-1.3.el6.x86_64.rpm libzpool2-0.6.3-1.3.el6.x86_64.rpm spl-0.6.3-1.3.el6.x86_64.rpm spl-dkms-0.6.3-1.3.el6.noarch.rpm zfs-0.6.3-1.3.el6.x86_64.rpm zfs-dkms-0.6.3-1.3.el6.noarch.rpm zfs-dracut-0.6.3-1.3.el6.x86_64.rpm
  • Re-install lost lustre rpms
    • cd ../LustreSL6/2.7/server
    • yum localinstall lustre-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm lustre-modules-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm lustre-osd-zfs-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm lustre-osd-zfs-mount-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm
  • service zfs start

Now, destroy and re-create the zpools, then make the Lustre file systems

Updating the OSS to a new version of zfs and Lustre


Now, on the OSS, first save the zpool info for the sake of safety
1. cp /etc/zfs/zpool.cache /root
2. service zfs stop
3. chkconfig zfs off
4. yum erase lustre lustre-modules lustre-osd-zfs lustre-osd-zfs-mount
5. yum erase libnvpair1 libuutil1 libzfs2 libzpool2 spl spl-dkms zfs zfs-dkms zfs-dracut
6. cd /atlas/data08/ball/admin/LustreSL6/2.7.58/server
7. yum localinstall kernel-2.6.32.504.16.2.el6_lustre-1.x86_64.rpm
8. /sbin/new-kernel-pkg --package kernel --mkinitrd --dracut --depmod \
    --install 2.6.32.504.16.2.el6_lustre
9. Remove these files
for i in /var/lib/dkms/*/[^k]*/source; do [ -e "$i" ] || echo "$i";done
/var/lib/dkms/spl/0.6.3/source
/var/lib/dkms/zfs/0.6.3/source

10. Reboot to new kernel
11. mkdir -p /home/build/kernel/rpmbuild/BUILD
12. cd /home/build/kernel/rpmbuild/BUILD
13. tar xzf /atlas/data08/ball/admin/LustreSL6/2.7.58/server/lustre_2.7.58_headers.tgz

14. cd /atlas/data08/ball/admin/zfs_0.6.4_rpms
15. yum localinstall libnvpair1-0.6.4.2-1.el6.x86_64.rpm libuutil1-0.6.4.2-1.el6.x86_64.rpm \
   libzfs2-0.6.4.2-1.el6.x86_64.rpm libzpool2-0.6.4.2-1.el6.x86_64.rpm \
   spl-0.6.4.2-1.el6.x86_64.rpm spl-dkms-0.6.4.2-1.el6.noarch.rpm \
   zfs-0.6.4.2-1.el6.x86_64.rpm zfs-dkms-0.6.4.2-1.el6.noarch.rpm \
   zfs-dracut-0.6.4.2-1.el6.x86_64.rpm
16. cd ../LustreSL6/2.7.58/server
17. yum localinstall lustre-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm \
   lustre-modules-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm \
   lustre-osd-zfs-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm \
   lustre-osd-zfs-mount-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm
18. yum localinstall kmod-openafs-504-lustre-1.6.14-218.sl6.2.6.32.504.16.2.x86_64.rpm
19. depmod -a
20. Reboot
21. zpool upgrade -a
22. Uncomment the OST in /etc/fstab and mount them via "mount -av"

If/when a disk fails on the MD1000 or MD1200

I have not been able to get a disk to automatically re-import after a "pull the disk to simulate failure" event. The failure of the pdisk also results in the failure of the vdisk. If the machine is still online and the disk is replaced, the vdisk can be re-created.

The failing vdisk leaves preserved cache on the controller that must be cleared. See "Troubleshooting and Managing Preserved Cache on a Dell PowerEdge RAID Controller (PERC)" for details, but the bottom line is to follow this procedure, which has worked:

  • Pull the failed disk, eg, disk 0:0:1 on controller 2
  • omconfig storage controller action=discardpreservedcache controller=2 force=enabled
  • Insert the replacement disk

You can now proceed to re-create the vdisk and re-add it to the zpool.

  • omconfig storage controller action=clearforeignconfig controller=2
  • omconfig storage controller action=createvdisk controller=2 size=max raid=r0 pdisk=0:0:1 name=c2d001
When tests are performed with NO filesystem on the zpool, the zpool never "notices" the failed disk, and the replacement also does not seem to re-import into the pool. The import can be forced, using the same disk in the same cabled location:
  • zpool replace -f ost-004 pci-0000:0a:00.0-scsi-0:2:2:0

If there is an active file system on the pool, then as soon as a file is written to (probably also read from) the file system, the failed disk is noticed and marked failed.
  • pci-0000:0a:00.0-scsi-0:2:3:0 UNAVAIL 3 49 0 corrupted data
"zpool replace" as above successfully handles this.

After a random disk failure and replacement, it is easy to re-create the vdisk using the controller, shelf, and disk number as above; /root/tools/storage_details.sh is useful in helping to find the missing vdisk name as well. However, you must then find the affected zpool. If there is only a single failure, you can look at the zpool status of each pool in turn until you find the failed disk
  • zpool status ost-001
and so on. The failed disk will be obvious. The argument for the "zpool replace" command above can then be selected from the status output, just leaving off the extra characters in the listed device name; those can be determined by comparison to the device names of other, good disks in the zpool. If for any reason the disk is NOT shown as failed after all the zpool status commands are issued, see the next paragraph.
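
As a shortcut for that sequential check, "zpool status -x" reports only the pools that are exhibiting errors or are otherwise unhealthy:
zpool status -x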

The helper files in /root/zpoolFiles can be useful, particularly "sd_to_by-path.txt". A sample section, keyed by the corresponding virtual disk ID, is shown below. In the situation described in the paragraph just above, find the zpool containing this by-path identifier and do the "zpool replace".

ID : 0
Name : c1d000
Device Name : /dev/sdb
by-pathName : disk/by-path/pci-0000:08:00.0-scsi-0:2:0:0
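
To find the affected zpool from such a by-path name, a small loop works (a sketch, using the example device from the sample above; it relies only on the device names that zpool status already prints):
for p in $(zpool list -H -o name); do
    zpool status "$p" | grep -q 'pci-0000:08:00.0-scsi-0:2:0:0' && echo "$p"
done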

Finding file associated with a zpool error

It may happen that a zpool reports errors, looking something like this:
errors: Permanent errors have been detected in the following files:

ost-004/ost0027:<0x28697>
ost-004/ost0027:<0x257a1>
ost-004/ost0027:<0x286b1>
ost-004/ost0027:<0x444d0>
ost-004/ost0027:<0x285e5>
ost-004/ost0027:<0x288ef>
ost-004/ost0027:<0x4c2fb>

Let's take that first example. On the OSS, either make a zfs snapshot of the volume (as below), or umount the OST and re-mount it as type zfs, then use Linux find to look for the inode.
  • zfs snapshot ost-004/ost0027@Sep28
  • mount -t zfs ost-004/ost0027@Sep28 /mnt/tmp
  • find /mnt/tmp/O -inum 165527
    • This returned /mnt/tmp/O/0/d0/344992
The inode number 165527 is the decimal equivalent of the hexadecimal error pointer (to an inode) above, 0x28697.
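
The hex-to-decimal conversion can be done directly in the shell, for example:
printf '%d\n' 0x28697      # prints 165527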

The Lustre OID of this file is 344992. Translate it to something akin to the Lustre FID
  • ll_decode_filter_fid /mnt/tmp/O/0/d0/344992
    • This returned /mnt/tmp/O/0/d0/344992: parent=[0x200004bf0:0x3984:0x0] stripe=0
The bit between the pair of [] is what we are looking for. Go to some Lustre client machine and do this.
  • lfs fid2path /lustre/umt3 [0x200004bf0:0x3984:0x0]
    • This returned the name of the affected file, ie, /lustre/umt3/user/daits/data12/NTUP_ONIAMUMU/data12_8TeV.periodI.physics_Bphysics.PhysCont.NTUP_ONIAMUMU.repro14_v01.r4065_p1278_p1424_nmy19_pUM999999/NTUP_ONIAMUMU.nmy19.00213695._000022.root.1

Once the bad file is found, delete it from Lustre, repeat for all the permanent errors on the OST, and do "zpool clear" to clear the bad file report. Also, umount the snapshot and destroy it (zfs destroy ost-004/ost0027@Sep28).

Test results

Running the "obdfilter-survey" test on some OSS.

So it looks like we are getting the following performance on umdist09 with 2 threads per OST:
write 4781.18 [ 310.98, 412.99] rewrite 5881.79             SHORT read 3775.87 [ 268.99, 397.99]

Units are MB/s, with "aggregate [ min/OST, max/OST ]" detailed for various conditions.

The aggregate is faster than the network bandwidth out of the machine.

There is also a test with 2 objects per OST and one thread per object:
write 5915.34             SHORT rewrite 5844.71             SHORT read 3812.05 [ 284.99, 387.99] 

The test is run as follows:
nobjhi=2 thrhi=2 size=1024 case=disk sh /usr/bin/obdfilter-survey

As expected, the results are not as good on umdist01. The full output is as follows. The second line of this output corresponds to the "2 threads per OST" configuration quoted above for umdist09, and the third line to the second umdist09 test. Results from umdist03, with 6 MD1000 shelves and 9 OST, are also shown.

umdist01 and umdist03 are both PE2950.

Mon Jun  1 10:00:31 EDT 2015 Obdfilter-survey for case=disk from umdist01.aglt2.org
ost  6 sz  6291456K rsz 1024K obj    6 thr    6 write  793.36 [  65.99, 177.96] rewrite 1498.67 [ 119.99, 314.97] read  883.23 [  73.99, 247.91]
ost  6 sz  6291456K rsz 1024K obj    6 thr   12 write 1417.80 [ 179.96, 301.93] rewrite  738.33 [  41.99, 316.93] read 1088.38 [  89.99, 375.92]
ost  6 sz  6291456K rsz 1024K obj   12 thr   12 write 1900.24 [ 322.78, 369.97] rewrite  438.91 [   0.00, 567.84] read 1668.66 [ 228.99, 370.83]


Mon Jun  1 12:13:37 EDT 2015 Obdfilter-survey for case=disk from umdist03.aglt2.org
ost  9 sz  9437184K rsz 1024K obj    9 thr    9 write  595.45 [   0.00, 297.97] rewrite  754.19 [   0.00, 312.96] read 1710.89 [ 144.99, 280.98]
ost  9 sz  9437184K rsz 1024K obj    9 thr   18 write  611.77 [   0.00, 285.95] rewrite  621.57 [   0.00, 290.97] read 1962.69 [ 190.99, 270.96]
ost  9 sz  9437184K rsz 1024K obj   18 thr   18 write  535.08 [   0.00, 280.97] rewrite  562.50 [   0.00, 273.98] read 2020.20 [ 196.99, 287.99]

umdist02 is an R710

Mon Jun  1 12:24:34 EDT 2015 Obdfilter-survey for case=disk from umdist02.aglt2.org
ost  6 sz  6291456K rsz 1024K obj    6 thr    6 write 3409.24             SHORT rewrite 2489.33 [ 304.97, 389.96] read 3051.71             SHORT
ost  6 sz  6291456K rsz 1024K obj    6 thr   12 write 4378.68             SHORT rewrite 1456.43 [   0.00, 602.95] read 3834.28             SHORT
ost  6 sz  6291456K rsz 1024K obj   12 thr   12 write 1782.82             SHORT rewrite  625.03 [   0.00, 187.99] read 4303.14             SHORT

More test results, zfs vs ldiskfs

During the process of building Lustre rpms with zfs 0.6.4.2, it was decided to do several tests
  • IO tests using cp from/to /tmp of dc2-10-23 with zfs
  • IO tests using cp from/to /tmp of dc2-10-23 with ldiskfs
  • IO tests using stock 2.7.0 from/to /tmp of dc2-10-23 with zfs 0.6.3
  • Upgrade zfs and kernel to the 2.7.58 build with zfs 0.6.4.2
  • Repeat of first test with zfs

Size formatted ldiskfs
/dev/sdb 3.7T 69M 3.5T 1% /mnt/ost-001
Size formatted zfs
ost-001/ost0000 3.6T 3.8M 3.6T 1% /mnt/ost-001

This is the log of test results performed

Test results conclusions

  • Best single-machine read rate is from ldiskfs at ~16MB/s, otherwise range is from 9-12MB/s
  • Best single-machine write rate is to ldiskfs at 20MB/s, but zfs is statistically the same, not far behind at 19.7
  • Writes to zfs are expensive, in an iostat sense, 40% vs 10% for ldiskfs on dual threads
  • Reads from ldiskfs are expensive, in an iostat sense, 95% vs 60% for zfs on dual threads
  • Single thread read iostat on umdist10 is 33-48% on zfs; single thread ldiskfs was 60%
  • Single thread write iostat on umdist10 is 20-25% on zfs.

Some post-production update stats
  • Typical iostat on umdist09 during writes is 15-20% of the OST capability
  • Typical write rate on umdist09 is 10MB/s/OST

Table keys
  • The Version is that of Lustre
    • 0 = stock 2.7.0
    • 58-1 = Initial install is 2.7.58 with zfs 0.6.4.2
    • 58-2 = Install is 2.7.0 with zfs 0.6.3 upgraded to 2.7.58 with zfs 0.6.4.2
    • 58-3 = "zpool upgrade -a" run on the 58-2 pool
    • Final = Post-full-upgrade Production system
  • LD is ldiskfs formatted disk

Sequence Host 1 Host 2 dist10 NIC Version LD or ZFS Read Write iostat load_one
1 dc2-10-23 1Gb 58-1 ZFS 11.74
2 dc2-10-23 1Gb 58-1 ZFS 11.88
3 dc2-10-23 1Gb 58-1 LD 10.27
4 dc2-10-23 1Gb 58-1 LD 15.7
5 dc2-10-23 10Gb 58-1 LD 11.22
6 dc2-10-23 10Gb 58-1 LD 12.81
7 dc40-16-25 10Gb 58-1 LD 16.4 60
8 dc40-16-25 10Gb 58-1 LD 20.2
9 dc2-10-23 dc40-16-25 10Gb 58-1 LD 21.06
10 dc2-10-23 dc40-16-25 10Gb 58-1 LD 19.54 95
11 dc2-10-23 dc40-16-25 10Gb 58-1 LD 34 10
12 dc40-16-25 10Gb 0 ZFS 19.79 23
13 dc2-10-23 10Gb 0 ZFS 10.21 14
14 dc2-10-23 dc40-16-25 10Gb 0 ZFS 35 40 0.8
15 dc40-16-25 10Gb 0 ZFS 12.02 42 2.5
16 dc2-10-23 10Gb 0 ZFS 11.74 14
17 dc2-10-23 dc40-16-25 10Gb 0 ZFS 18 62 1.5
18 dc2-10-23 10Gb 58-2 ZFS 13.92 16.38 0.2
19 dc40-16-25 10Gb 58-2 ZFS 19.93 25.64
20 dc40-16-25 10Gb 58-2 ZFS 9.41 43.69 0.9
21 dc2-10-23 10Gb 58-2 ZFS 9.44 35.55 0.9
22 dc2-10-23 dc40-16-25 10Gb 58-2 ZFS 19 55.76 1.7
23 dc2-10-23 dc40-16-25 10Gb 58-2 ZFS 34 42.15 0.4
24 dc40-16-25 10Gb 58-3 ZFS 19.73 23.57 0.4
25 dc2-10-23 10Gb 58-3 ZFS 13.39 20.26 0.25
26 dc2-10-23 dc40-16-25 10Gb 58-3 ZFS 36 43.88 0.55
27 dc40-16-25 10Gb 58-3 ZFS 9.46 48.32 1.1
28 dc2-10-23 10Gb 58-3 ZFS 8.96 33.08 0.7
29 dc2-10-23 dc40-16-25 10Gb 58-3 ZFS 14.65 60.55 1.8
30 dc2-10-23 10Gb Final ZFS 12.67 0.8
31 dc40-16-25 10Gb Final ZFS 20.20 1.0
32 dc2-10-23 10Gb Final ZFS 26.0 1.0
33 dc40-16-25 10Gb Final ZFS 13.96 1.0

Random notes gleaned while reading up on the topic

An OSS in Lustre 2.4 maxes out at 16TB, but with 2.5 can apparently use up to 256TB; this assumes zfs beneath. For ldiskfs, the max is 128TB.

MDS should have 1-2% of the storage of the full system, so, for our 1PB, this would be 1-2TB.

The 49M inodes currently in use on the MDT, at 2kB each, amount to 98GB of space; double that to get roughly 200GB. The current MDT is 263GB.

For better performance, we recommend that you create RAID sets with 4 or 8 data disks plus one or two parity disks. Using larger RAID sets will negatively impact performance compared to having multiple independent RAID sets.

For RAID 5, RAID 6, or RAID 1+0 storage, specifying the following option to the --mkfsoptions parameter improves the layout of the file system metadata, ensuring that no single disk contains all of the allocation bitmaps: -E stride=chunk_blocks. The chunk_blocks variable is in units of 4096-byte blocks and represents the amount of contiguous data written to a single disk before moving to the next disk; this is alternately referred to as the RAID stripe size. This is applicable to both MDT and OST file systems.

For best results, use RAID 5 with 5 or 9 disks or RAID 6 with 6 or 10 disks, each on a different controller. The stripe width is the optimal minimum I/O size. Ideally, the RAID configuration should allow 1 MB Lustre RPCs to fit evenly on a single RAID stripe without an expensive read-modify-write cycle. Use this formula to determine the stripe_width, where number_of_data_disks does not include the RAID parity disks (1 for RAID 5 and 2 for RAID 6):

stripe_width_blocks = chunk_blocks * number_of_data_disks = 1 MB

If the RAID configuration does not allow chunk_blocks to fit evenly into 1 MB, select stripe_width_blocks such that it is close to 1 MB, but not larger. The stripe_width_blocks value must equal chunk_blocks * number_of_data_disks. Specifying the stripe_width_blocks parameter is only relevant for RAID 5 or RAID 6, and is not needed for RAID 1+0. Run --reformat on the file system device (/dev/sdc), specifying the RAID geometry to the underlying ldiskfs file system, where:

--mkfsoptions "other_options -E stride=chunk_blocks, stripe_width=stripe_width_blocks"

A RAID 6 configuration with 6 disks has 4 data and 2 parity disks. The chunk_blocks <= 1024KB/4 = 256KB. Because the number of data disks is a power of 2, the stripe width is exactly 1 MB.
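
Plugging those numbers into the --mkfsoptions template above (stride and stripe_width are in 4096-byte blocks, so 256KB = 64 blocks and 1MB = 256 blocks) gives, as a sketch:
--mkfsoptions "-E stride=64,stripe_width=256"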


For best performance, the journal should be placed somewhere other than on the OST itself. See page 34 of the manual. For example:

oss# mke2fs -b 4096 -O journal_dev /dev/sdb journal_size
The value of journal_size is specified in units of 4096-byte blocks; for example, 262144 for a 1 GB journal size.

[oss#] mkfs.lustre --mgsnode=mds@osib --ost --index=0 --mkfsoptions="-J device=/dev/sdb1" /dev/sdc

Growing an OST

It is possible to replace all disks on an OST with larger disks, and when all disks are replaced, the OST will grow to the new size. The procedure outlined below replaces one disk at a time in the OST, with 2 replacements per day possible. This was successfully done on umdist01, and a new purchase of 20 1TB disks will again be employed to grow two more OST, providing 750GB disk spares in the process.

  • zpool set autoexpand=on ost-001
  • zpool offline ost-001 pci-0000:08:00.0-scsi-0:2:16:0
  • omconfig storage vdisk action=deletevdisk controller=1 vdisk=16
  • Replace the disk
  • omconfig storage controller action=clearforeignconfig controller=1
  • omconfig storage controller action=createvdisk controller=1 size=max raid=r0 pdisk=0:1:1 name=c1d101
  • zpool replace -f ost-001 pci-0000:08:00.0-scsi-0:2:16:0
  • Wait for the resilver to complete, and move to the next disk in the list
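
While waiting, resilver progress can be checked with zpool status, for example:
zpool status ost-001 | grep -A 2 'scan:'     # shows resilver progress and completion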

The list of devices in a zpool can be obtained from the "zpool status ost-001" command. This in turn can be matched up to the output of "/root/tools/map_sd_for_iostat.sh", thus obtaining a full list of physical disks and vdisk names that correspond to those devices.

After the last disk is replaced and resilvered, make sure the new, grown OST is properly ensconced in the zfs configuration.

  • umount /mnt/ost-001
  • zpool export ost-001
  • zpool import ost-001
  • mount ost-001/ost0012

-- BobBall - 07 Apr 2015