Test results comparing zfs to ldiskfs

The tests below run a test Lustre system (mgs + umdist10) through its paces. They start with a straight-up zfs 0.6.4.2 install, with both ldiskfs and zfs partitions, then proceed to a stock Lustre 2.7.0 install with zfs 0.6.3, and finally to an upgrade to our home-built 2.7.58 rpms.


[root@c-10-23 lustre]# cat /root/copy_script.sh
#!/bin/bash
cp $1 .

Copy 25GB of files out of the old Lustre to /tmp/condor/lustre (keeping them there prevents them from being deleted).
[root@c-10-23 lustre]# time find /lustre/umt3/datadisk/mc09_7TeV/event -type f -print \
| xargs -n 1 /root/copy_script.sh
real    54m15.407s
user    0m41.144s
sys     3m4.177s

/tmp utilization is running around 15%, but is spiky rather than consistent.
wkB/s is around 9000kB/s

[root@c-10-23 condor]# du -s -x -m lustre
19622   lustre
[root@c-10-23 ~]# ls -1 /tmp/condor/lustre|wc -l
71344
Average file size is 0.275MB

So, 6.03MB/s average copy from old lustre this way to local disk /tmp
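For reference, the MB/s figures quoted throughout are simply the du total divided by the wall-clock ("real") time of the copy; for this first run:
awk 'BEGIN{ printf "%.2f MB/s\n", 19622 / (54*60 + 15.407) }'   # ~6.03 MB/s
awk 'BEGIN{ printf "%.3f MB\n",   19622 / 71344 }'              # average file size, ~0.275 MB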
-------------------------

Now, write this to the new Lustre. /tmp utilization is 72% and is steady.
Bytes out are about 11MB/s, rkB/s around 10000kB/s
[root@c-10-23 copiedTo]# time find /tmp/condor/lustre -type f -print | xargs -n 1 /root/copy_script.sh

real    27m51.289s
user    0m43.058s
sys     3m0.993s
So, 11.74MB/s average local disk /tmp to new lustre
--------------------

Now, copy from new Lustre to new /tmp/condor/bkfromnew.  /tmp util steady around 25-30%.
The load_one on umdist10 is higher now (0.6) than it was for the writing (mostly less than 0.25).
[root@c-10-23 lustre]# time find /lustre/T3test/copiedTo -type f -print | xargs -n 1 /root/copy_script.sh

real    27m31.534s
user    0m42.136s
sys     3m2.407s
So, 11.88MB/s average new lustre back to local disk /tmp
--------------------

umount zfs disk on umdist10, and ldiskfs /mnt/mgs on mgs.
Re-create mgs and mount it
mkfs.lustre --fsname=T3test --mgs --mdt --reformat --index=0 /dev/mapper/vg0-lv_home
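The mount itself isn't shown above; assuming the same /mnt/mgs mount point as before, it would look something like:
mount -t lustre /dev/mapper/vg0-lv_home /mnt/mgs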

Destroy zfs volume, delete RAID-0 vdisks, create RAID-6 volume, and initialize it (wait for completion)

  667  omconfig storage vdisk action=deletevdisk controller=0 vdisk=5
  668  omconfig storage vdisk action=deletevdisk controller=0 vdisk=4
  669  omconfig storage vdisk action=deletevdisk controller=0 vdisk=3
  670  omconfig storage vdisk action=deletevdisk controller=0 vdisk=2
  671  omconfig storage vdisk action=deletevdisk controller=0 vdisk=1
  672  omconfig storage vdisk action=deletevdisk controller=0 vdisk=0
  674  omconfig storage controller action=createvdisk controller=0 size=max raid=r6 pdisk=0:0:0,0:0:1,0:0:2,0:0:3,0:0:4,0:0:5 \
stripesize=256kb readpolicy=ra writepolicy=wb name=ost0
  676  omconfig storage vdisk action=initialize controller=0 vdisk=0
                  Wait for init to complete, about 15hrs
  699  yum erase lustre-osd-zfs lustre-osd-zfs-mount
  700  cd /atlas/data08/ball/admin/LustreSL6/2.7.58/server
  702  yum localinstall lustre-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm \
lustre-osd-ldiskfs-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm \
lustre-osd-ldiskfs-mount-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm
  705  mkfs.lustre --ost --mgsnode=10.10.1.140@tcp0 --fsname=T3test --index=0 --mountfsoptions="stripe=256" /dev/sdb
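The matching mount isn't in the history above; a sketch, where the mount point is an assumption:
mkdir -p /mnt/ost0
mount -t lustre /dev/sdb /mnt/ost0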

------------------

Copy to ldiskfs formatted storage
Load_one on umdist10 is <~0.15
[root@c-10-23 copiedTo]# time find /tmp/condor/lustre -type f -print | xargs -n 1 /root/copy_script.sh

real    31m51.338s
user    0m43.523s
sys     3m2.792s
This is 10.27MB/s to Lustre.
----------------------

Copy from Lustre back to /tmp
Load_one on umdist10 is <~0.55
[root@c-10-23 bkfromnew2]# time find /lustre/T3test/copiedTo -type f -print | xargs -n 1 /root/copy_script.sh

real    20m49.689s
user    0m42.474s
sys     3m3.575s
This is 15.70MB/s
-----------------------

Add 10Gb Myricom on umdist10, create /lustre/T3test/copiedTo10G and write to it.
Modified /etc/modprobe.d/lustre.conf and /etc/ganglia/gmond.conf, and created ifcfg-eth2.
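The exact edits aren't recorded; for lnet, pointing Lustre traffic at the new NIC typically amounts to a one-line networks option in /etc/modprobe.d/lustre.conf (eth2 comes from the ifcfg file above; the rest is an assumption):
options lnet networks=tcp0(eth2)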
Load on umdist10 unchanged.
[root@c-10-23 copiedTo10G]# time find /tmp/condor/lustre -type f -print | xargs -n 1 /root/copy_script.sh

real    29m8.506s
user    0m42.990s
sys     3m2.682s
This is 11.22MB/s
--------------------------

Reverse direction back to /tmp in bkfromnew3
[root@c-10-23 bkfromnew3]# time find /lustre/T3test/copiedTo -type f -print | xargs -n 1 /root/copy_script.sh

real    25m31.611s
user    0m42.181s
sys     3m4.468s
This is 12.81MB/s
---------------------------------

Switch to 10Gb host dc40-16-25 for some testing.  Copy to the node from Lustre.
umdist10 IO rate is about 17MB/s
[root@c-16-25 lustre]# time find /lustre/T3test/copiedTo -type f -print | xargs -n 1 /root/copy_script.sh

real    19m56.640s
user    0m30.901s
sys     2m28.088s
This is 16.40MB/s
-----------------------------

Copy to Lustre from dc40-16-25
umdist10 IO rate is about 22MB/s.
[root@c-16-25 from_16-25]# time find /tmp/condor/lustre -type f -print | xargs -n 1 /root/copy_script.sh

real    16m11.373s
user    0m24.455s
sys     1m44.448s
This is 20.20MB/s
-----------------------------

Now, let's run both dc40-16-25 and dc2-10-23 simultaneously.  Note the separation of
   source directories so that they are distinct.
ost utilization on umdist10 is about 90-95%.  Single stream was around 60% yesterday.
Bytes out from umdist10 around 22-24MB/s
[root@c-10-23 bkfromnew3]# time find /lustre/T3test/copiedTo -type f -print | xargs -n 1 /root/copy_script.sh

real    31m3.092s
user    0m41.830s
sys     3m0.513s
This is 10.53MB/s

[root@c-16-25 bkfromnew]# time find /lustre/T3test/copiedTo10G -type f -print | xargs -n 1 /root/copy_script.sh

real    27m50.670s
user    0m7.688s
sys     0m58.788s
This is 11.74MB/s

Aggregate over longest time is 21.06MB/s
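Here, and in the later two-client runs, the aggregate is just the total data moved by both clients divided by the longer of the two wall-clock times, e.g. for this pair:
awk 'BEGIN{ printf "%.2f MB/s\n", 2*19622 / (31*60 + 3.092) }'   # ~21.06 MB/s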
----------------------------

Repeat this, copying TO two distinct lustre directories.
ost utilization on umdist10 is about 10%
Bytes in to umdist10 around 34MB/s

[root@c-16-25 new_from_16-25]# time find /tmp/condor/lustre -type f -print | xargs -n 1 /root/copy_script.sh

real    16m10.288s
user    0m23.471s
sys     1m43.641s
This is very close to the single-stream write recorded above for this machine, i.e.,
20.22MB/s

[root@c-10-23 new_from_10-23]# time find /tmp/condor/lustre -type f -print | xargs -n 1 /root/copy_script.sh

real    32m0.445s
user    0m43.284s
sys     2m59.878s
This is 10.22MB/s, again very close to the single stream write rate above.

These two streams appear to be independent, and do not impact each other.
---------------------------------------------------

Repeat the "from lustre" copies one last time using both hosts.
Iostat, etc, all look about the same as before.

[root@c-16-25 bkfromnew4]# time find /lustre/T3test/copiedTo -type f -print | xargs -n 1 /root/copy_script.sh

real    24m35.714s
user    0m7.876s
sys     0m59.209s
This is 13.30MB/s

[root@c-10-23 bkfromnew4]# time find /lustre/T3test/copiedTo10G -type f -print | xargs -n 1 /root/copy_script.sh

real    33m28.274s
user    0m42.190s
sys     3m2.625s
This is 9.77MB/s
Aggregate over the longest time is 19.54MB/s

------------------------------------------------------------------------------------
------------------------- Now, the upgrade path ----------------------------

Rebuild both mgs and umdist10 with zfs 0.6.3 and the stock Lustre, make the file systems, and re-run the throughput tests.
After that, upgrade to our home-built rpms, and do it all again.

---------------------
Copy from dc40-16-25 to /lustre/T3test/copiedTo10G
[root@c-16-25 copiedTo10G]# time find /tmp/condor/lustre -type f -print | xargs -n 1 /root/copy_script.sh


real    16m31.722s
user    0m27.663s
sys     1m59.183s
This is 19.79MB/s

Repeat following reboot of dc40-16-25
iostat utilization on the 6 zfs sd devices of umdist10 averages about 23%
[root@c-16-25 copiedTo10G_b]# time find /tmp/condor/lustre -type f -print | xargs -n 1 /root/copy_script.sh

real    16m11.089s
user    0m24.273s
sys     1m45.823s
This is 20.21MB/s.  Nearly identical???  Why is this faster than before?  Is it a network bottleneck?  Or is it my rpms?
-------------------------------

[root@c-10-23 copiedTo1G]# time find /tmp/condor/lustre -type f -print | xargs -n 1 /root/copy_script.sh

real    32m0.230s
user    0m43.092s
sys     3m1.415s
This is more typical, 10.21MB/s

iostat utilization on the 6 zfs sd devices of umdist10 averaged about 14% during this time.
------------------------------

Run both writes simultaneously now
iostat utilization on the 6 zfs sd devices of umdist10 averaged about 40% with both running.  This seems
   to be a simple sum of the individual rates.
Peak IO around 35MB/s
load_one around 0.8

[root@c-16-25 copiedTo10G_c]# time find /tmp/condor/lustre -type f -print | xargs -n 1 /root/copy_script.sh

real    15m55.832s
user    0m20.618s
sys     1m34.171s

[root@c-10-23 copiedTo1G_c]# time find /tmp/condor/bkfromnew -type f -print | xargs -n 1 /root/copy_script.sh

real    30m44.877s
user    0m43.150s
sys     3m1.965s
-----------------------------------

Read from new Lustre
Start with dc40-16-25
Average iostat is 42-43%
load_one around 2.5
[root@c-16-25 T2back1]# time find /lustre/T3test/copiedTo1G -type f -print | xargs -n 1 /root/copy_script.sh

real    27m12.723s
user    0m8.614s
sys     1m4.618s
12.02MB/s
----------------------------------

Read on dc2-10-23
Average iostat is about 14%
[root@c-10-23 T2back1]# time find /lustre/T3test/copiedTo10G -type f -print | xargs -n 1 /root/copy_script.sh

real    27m51.404s
user    0m42.114s
sys     3m2.842s
11.74MB/s
-----------------------------
Now do simultaneous reads.

Average iostat of umdist10 is 60-63%
load_one around 1.5
IO rate around 18MB/s
[root@c-16-25 T2back2]# time find /lustre/T3test/copiedTo1G_c -type f -print | xargs -n 1 /root/copy_script.sh
date

real    33m56.085s
user    0m8.137s
sys     1m1.258s


[root@c-10-23 T2back2]# time find /lustre/T3test/copiedTo10G_b -type f -print | xargs -n 1 /root/copy_script.sh
date

real    37m39.361s
user    0m42.361s
sys     3m3.502s
Using aggregate time, get 17.37MB/s
-------------------------------------------------------------------------

Now, upgrade.  Start with mgs, move to umdist10
1. umount on both machines, umdist10 first.  Comment out in fstab.
2. yum remove lustre rpms
   yum erase lustre lustre-osd-ldiskfs-mount lustre-modules lustre-osd-ldiskfs
   yum erase kernel-firmware
3. upgrade kernel and reboot
4. install new Lustre rpms
5. reboot
6. mount mgs

Now, on umdist10
7. cp /etc/zfs/zpool.cache /root
8. service zfs stop
9. chkconfig zfs off
10. yum erase lustre lustre-modules lustre-osd-zfs lustre-osd-zfs-mount
11. yum erase libnvpair1 libuutil1 libzfs2 libzpool2 spl spl-dkms zfs zfs-dkms zfs-dracut
12. yum erase kernel-firmware (gets 573 version out of the way)
13. cd /atlas/data08/ball/admin/LustreSL6/2.7/server
14. yum localinstall kernel-firmware
15. upgrade kernel and reboot
16. Remove these stale files (the loop below lists the dangling dkms source symlinks left behind)
for i in /var/lib/dkms/*/[^k]*/source; do [ -e "$i" ] || echo "$i";done
/var/lib/dkms/spl/0.6.3/source
/var/lib/dkms/zfs/0.6.3/source
17. install zfs and lustre rpms.
See zfs install directions here:  https://www.aglt2.org/wiki/bin/view/AGLT2/ZFsforAFS#Install_ZFS
yum localinstall lustre-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm \
lustre-modules-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm \
lustre-osd-zfs-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm \
lustre-osd-zfs-mount-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm
18. Reboot
19. Mount the ost
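For that final mount (step 19), a zfs-backed OST is mounted by pool/dataset rather than by block device. A sketch, where the ost0 dataset name and the mount point are assumptions (the ost-001 pool name appears in the zpool output further down):
mount -t lustre ost-001/ost0 /mnt/ost-001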

-------------------------  All is working, run read/write tests ---------------

From dc2-10-23 to Lustre
The average iostat is 16.38%
IO rate started at 22MB/s, has dropped to 14MB/s
load_one of umdist10 is around 0.2
[root@c-10-23 copiedTo1G_e]# time find /tmp/condor/T2back2 -type f -print | xargs -n 1 /root/copy_script.sh


real    23m29.328s
user    0m42.869s
sys     3m1.129s
13.92MB/s
--------------------------

Ditto dc40-16-25
Average iostat 25.64%
IO rate pretty steady at ~23MB/s
[root@c-16-25 copiedTo10G_e]# time find /tmp/condor/T2back2 -type f -print | xargs -n 1 /root/copy_script.sh

real    16m24.367s
user    0m24.729s
sys     1m44.971s
19.93MB/s
--------------------------

Now, read back from Lustre

Read from Lustre dc40-16-25
Average iostat 43.69%
IO rate around 13MB/s
load_one around 0.9
[root@c-16-25 T2back3]# time find /lustre/T3test/copiedTo1G_c -type f -print | xargs -n 1 /root/copy_script.sh

real    34m44.445s
user    0m8.097s
sys     1m2.598s
9.41MB/s
-----------------------------

Read from Lustre dc2-10-23
Average iostat 35.55%
IO rate around 11MB/s
load_one around 0.9
[root@c-10-23 T2back3]# time find /lustre/T3test/copiedTo1G_c -type f -print | xargs -n 1 /root/copy_script.sh

real    34m37.604s
user    0m42.412s
sys     3m2.963s
9.44MB/s
---------------------------

Simultaneous reads
Average iostat 55.76%
IO rate around 17-19MB/s (10-11MB on 16-25 and 8-9MB on 10-23)
load_one around 1.7

[root@c-10-23 T2back4]# time find /lustre/T3test/copiedTo1G -type f -print | xargs -n 1 /root/copy_script.sh

real    42m57.068s
user    0m41.797s
sys     3m2.971s

[root@c-16-25 T3back4]# time find /lustre/T3test/copiedTo10G_b -type f -print | xargs -n 1 /root/copy_script.sh

real    34m38.139s
user    0m10.243s
sys     1m31.489s
Using the aggregate (longest time), we get 15.22MB/s
---------------------------

Simultaneous writes
Average iostat 42.15%
IO rate around 34MB/s (20-22MB on 16-25 and 14MB on 10-23)
load_one around 0.4

[root@c-16-25 copiedTo10G_f]# time find /tmp/condor/T2back1  -type f -print | xargs -n 1 /root/copy_script.sh

real    16m59.451s
user    0m26.940s
sys     1m49.630s

[root@c-10-23 copiedTo1G_f]# time find /tmp/condor/T2back1  -type f -print | xargs -n 1 /root/copy_script.sh

real    26m11.999s
user    0m43.154s
sys     3m3.011s

------------------------- Last of all, upgrade the zpool features -----------------

umount all OST

[root@umdist10 ~]# zpool upgrade
This system supports ZFS pool feature flags.

All pools are formatted using feature flags.


Some supported features are not enabled on the following pools. Once a
feature is enabled the pool may become incompatible with software
that does not support the feature. See zpool-features(5) for details.

POOL  FEATURE
---------------
ost-001
      spacemap_histogram
      enabled_txg
      hole_birth
      extensible_dataset
      embedded_data
      bookmarks
[root@umdist10 ~]# zpool upgrade -a
This system supports ZFS pool feature flags.

Enabled the following features on 'ost-001':
  spacemap_histogram
  enabled_txg
  hole_birth
  extensible_dataset
  embedded_data
  bookmarks

remount the OST
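To double-check which features ended up enabled on the pool, the feature properties can be listed, e.g.:
zpool get all ost-001 | grep 'feature@'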

------------------------ Repeat last set of tests ----------------------

From dc40-16-25 to Lustre
Average iostat 23.57%
IO rate around 22MB/s
load_one around 0.4

[root@c-16-25 copiedTo10G_g]# time find /tmp/condor/T2back1  -type f -print | xargs -n 1 /root/copy_script.sh

real    16m34.413s
user    0m23.646s
sys     1m44.393s
This is 19.73MB/s
-------------------------------------------------

From dc2-10-23 to Lustre
Average iostat 20.26%
IO rate around 15MB/s
load_one around 0.25

[root@c-10-23 copiedTo1G_g]# time find /tmp/condor/T2back1  -type f -print | xargs -n 1 /root/copy_script.sh

real    24m24.959s
user    0m43.396s
sys     3m0.664s
13.39MB/s
-------------------------------------------------

Simultaneous writes
Average iostat 43.88%
IO rate around 36MB/s (Individually, these machines are unchanged from single write rates)
load_one around 0.55

[root@c-16-25 copiedTo10G_h]# time find /tmp/condor/T2back2  -type f -print | xargs -n 1 /root/copy_script.sh

real    16m30.593s
user    0m25.605s
sys     1m48.979s

[root@c-10-23 copiedTo1G_h]# time find /tmp/condor/T2back2 -type f -print | xargs -n 1 /root/copy_script.sh

real    24m23.808s
user    0m43.325s
sys     3m1.776s
-------------------------------------------------------

Now, read tests
On dc40-16-25 first....
Average iostat 48.32%
IO rate around 8-10MB/s
load_one around 1.1

[root@c-16-25 T2back5]# time find /lustre/T3test/copiedTo1G  -type f -print | xargs -n 1 /root/copy_script.sh

real    34m32.438s
user    0m8.226s
sys     1m3.338s
9.46MB/s

On dc2-10-23 read from Lustre
Average iostat 33.08%
IO rate around 9-10MB/s
load_one around 0.7

[root@c-10-23 T2back5]# time find /lustre/T3test/copiedTo10G_g -type f -print | xargs -n 1 /root/copy_script.sh

real    36m30.948s
user    0m42.654s
sys     3m3.977s
This is 8.96MB/s
-------------------------------------

Now, do the combined reads from Lustre
Average iostat 60.55%
IO rate around 15MB/s (dc40 and dc2 each around 8-9)
load_one around 1.8

[root@c-16-25 T2back6]# time find /lustre/T3test/copiedTo1G_c  -type f -print | xargs -n 1 /root/copy_script.sh
date

real    41m4.598s
user    0m9.673s
sys     1m14.899s

[root@c-10-23 T2back6]# time find /lustre/T3test/copiedTo10G_b -type f -print | xargs -n 1 /root/copy_script.sh
date

real    44m39.748s
user    0m42.022s
sys     3m2.961s
Aggregate rate is 14.65MB/s

----------------------------------------
----------------------------------------
Following the upgrade, write from dc2-10-23 to /lustre/umt3

IO rate is steady and stable, with 10-23 load_one around 0.8
[root@c-10-23 copyTo1G]# time find /tmp/condor/T2back1 -type f -print | xargs -n 1 /root/copy_script.sh

real    25m48.471s
user    0m43.222s
sys     3m0.494s
12.67MB/s

---------------------
Now, write from dc40-16-25
IO rate is steady and stable, with 16-25 load_one around 1.0
[root@c-16-25 copyTo10G]# time find /tmp/condor/T3back4  -type f -print | xargs -n 1 /root/copy_script.sh

real    16m11.506s
user    0m28.363s
sys     2m3.528s
20.20MB/s

------------------------------

read on dc2-10-23
IO rate steady, load_one around 1.0
[root@c-10-23 final1]# time find /lustre/umt3/bobtest/copyTo10G -type f -print | xargs -n 1 /root/copy_script.sh

real    12m33.031s
user    0m43.726s
sys     3m9.278s
26.06MB/s

----------------------------------
read on dc40-16-25
IO rate steady, but also steadily decreasing, never exceeded about 20MB/s, load_one around 1.0
[root@c-16-25 T3back5]# time find /lustre/umt3/bobtest/copyTo1G  -type f -print | xargs -n 1 /root/copy_script.sh

real    23m25.403s
user    0m10.222s
sys     1m21.805s
13.96MB/s

-----------------
Interestingly, although disparate, both read rates are higher than in the mgs/umdist10 tests.

The write rates are compatible with the mgs/umdist10 tests.

-- BobBall - 02 Sep 2015