How to empty all OST on an OSS, then re-create the underlying Lustre file systems

Motivation

The full stripe width underlying a Lustre OST, as seen on the mailing list, MUST BE EXACTLY 1MB. So, for 4 data disks in the RAID, the per-disk stripe size should be 256kB; for 8 data disks, it should be 128kB; and so on.

For 15-disk MD1000 shelves, the only way to do this while using every disk is with two RAID-5 sets, one with 4 data disks and one with 8 data disks, each with one parity disk, plus one global hot spare for the shelf. The procedures below implement this.
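
The stripe arithmetic above can be sketched in shell. This is only an illustration, not one of the site scripts; the function name chunk_kb is made up here.

```shell
# Per-disk stripe size (kB) such that data_disks * stripe = 1 MB (1024 kB).
# chunk_kb is a hypothetical helper name, not one of the reformat scripts.
chunk_kb() {
    local data_disks=$1
    # Reject counts that do not divide 1024 kB evenly (e.g. 3, 5 or 6 data disks).
    if [ "$data_disks" -le 0 ] || [ $((1024 % data_disks)) -ne 0 ]; then
        echo "no exact 1MB stripe for $data_disks data disks" >&2
        return 1
    fi
    echo $((1024 / data_disks))
}

chunk_kb 4    # prints 256
chunk_kb 8    # prints 128
```

The shelf accounting is (4 data + 1 parity) + (8 data + 1 parity) + 1 hot spare = 15 disks.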

Procedures

Empty the OST

Follow the procedures presented in MigratingToNewOST, or equivalent, to empty all OST on the OSS of files. This can take days, even when split into 10 streams. I once tried 20 streams and the MGS crashed. I don't know for certain that this was related, but....

Specifically:

Run lfs_find on all OST of the server
Split the file list 10 ways, for 10 clients to work with
Run move_w_stat_to_new_ost.sh, which first checks that the file will stat before sending it to lfs_migrate
Repeat the 3 steps above
On any file that does not move during the second iteration, do "unlink" to force its removal
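
The "split the file list 10 ways" step could be done along the following lines. This is a sketch under assumptions: the helper name split_list and the PREFIX.NN naming of the part files are invented for illustration, and the real workflow may split the list differently.

```shell
# Deal the lines of a file list round-robin into N part files.
# split_list and the PREFIX.NN naming are assumptions made for this sketch.
split_list() {
    local list=$1 parts=$2 prefix=$3
    awk -v n="$parts" -v p="$prefix" '
        { out = sprintf("%s.%02d", p, (NR - 1) % n); print > out }
    ' "$list"
}

# Example (file names are placeholders):
#   split_list all_files.txt 10 part     # writes part.00 ... part.09
```

Each client then works through one part file. Dealing the lines round-robin, rather than cutting the list into contiguous blocks, keeps the parts within one line of each other in length.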

Perform the reconfiguration

All work on the OSS will be from the directory /root/reformat.
Scripts below are mostly located in svn at svn+ssh://ndt.aglt2.org/repos/rocks/trunk/tools/lustre_tools/reformat

Disable access on all workers, eg
- /root/tools/lustre_control.sh off umfs06
Save all the lustre info from the empty OST, eg, for each device, do
- ./save_OST_info.sh sdh
Determine the current Lustre structure, specifically
- Controller, ID within the controller, device, lustre name, lustre identifier and Card, eg
  - 1, 0, /dev/sdb, umt3-OST0008, 8, PERC 6/E Adapter (Slot 3)
- This information will be used to modify the scripts used below
Umount all partitions, eg
- umount /dev/sdb
Wipe the partitions:
- omconfig storage controller action=resetconfig controller=2 (or 1, or....)
Make new vdisk
- ./setup_lustre_ost.sh
- Modify the script before running using the correct information tabulated above, in the correct device order
Make lustre fs:
- ./format_lustre.sh
- Modify the script before running using the correct information tabulated above, in the correct device order
Check the umt3-OST00xx entry is correct in each fs
- look at files in /mnt/ost/CONFIGS for correct association
Fsck each partition
- e2fsck -fy /dev/sdN
Restore saved info for each OST
- ./restore_OST_info.sh
Remove old fstab entries:
- save/edit /etc/fstab
Make fstab entries
- ./make_lustre_fstab.sh
Mount disks
- mount -av
Restore access from the worker nodes, eg
- /root/tools/lustre_control.sh on umfs06

Reformat a single OST after emptying it of all files

This is an example from ost12 of umdist01 (OST0023) where the volume was throwing IO errors. After draining it of all files, and marking it unavailable on all WN, lustre-nfs, and interactive machines, the following items are taken from the root history file, detailing the procedure.

  289  mkdir reformat
  290  cd reformat
  292  mkdir -p /mnt/ost
  293  mount -t ldiskfs /dev/sdc /mnt/ost
  294  mkdir sdc
  295  pushd /mnt/ost
  296  cp -p last_rcvd /root/reformat/sdc
  297  cd O
  298  cd 0
  299  cp -p LAST_ID /root/reformat/sdc
  300  cd ../..
  301  cp -p CONFIGS/* /root/reformat/sdc
  304  umount /mnt/ost

At this point, the web interface was used to do a complete, slow initialization of the volume. The slow initialization is worth the time, because it is just too much work to drain and refill an OST again should new IO errors be encountered.

The index, inode count, and stripe are taken from the files above, as they were when the volumes were first created.

  309  mkfs.lustre --ost --mgsnode=10.10.1.140@tcp0 --fsname=umt3 --reformat --index=35 \
--mkfsoptions="-i 2000000" --reformat --mountfsoptions="errors=remount-ro,extents,mballoc,stripe=256" /dev/sdc

The UUID here is taken from /etc/fstab, where the entry has been commented out until we are ready to use the volume again.

  310  tune2fs -O uninit_bg -m 1 -U 02bcb3d2-ad48-4992-ba71-7b48787defea /dev/sdc
  311  e2fsck -fy /dev/sdc
  312  mount -t ldiskfs /dev/sdc /mnt/ost

Copy back all identifiers so that the volume can continue from where it was left off

  315  cd /root/reformat/sdc
  316  cp -v /mnt/ost/CONFIGS/mountdata mountdata.new2
  317  cp -fv mountdata /mnt/ost/CONFIGS
  319  cp last_rcvd /mnt/ost
  320  mkdir -p /mnt/ost/O/0
  321  chmod 700 /mnt/ost/O
  322  chmod 700 /mnt/ost/O/0
  323  cp -fv LAST_ID /mnt/ost/O/0
  324  umount /mnt/ost

Add the fstab entry back in again, and remount the disk

  325  vi /etc/fstab
  326  mount -av

Example from vdisk creation to lustre file system for an 8(9)-disk RAID-5 on an MD1000

omconfig storage controller action=createvdisk controller=1 size=max raid=r5 pdisk=0:1:5,0:1:6,0:1:7,0:1:8,0:1:9,0:1:10,0:1:11,0:1:12,0:1:13 stripesize=128kb readpolicy=ra writepolicy=wb name=ost22
mkfs.lustre --ost --mgsnode=10.10.1.140@tcp0 --fsname=umt3 --reformat --index=11 --mkfsoptions="-i 2000000" --reformat --mountfsoptions="errors=remount-ro,extents,mballoc,stripe=256" /dev/sde
tune2fs -O uninit_bg -m 1 -U cadf431a-6b03-4dd1-acc7-b6a3a0cbb69c /dev/sde

Example from vdisk creation to lustre file system for a 4(5)-disk RAID-5 on an MD1000

omconfig storage controller action=createvdisk controller=1 size=max raid=r5 pdisk=1:2:0,1:2:1,1:2:2,1:2:3,1:2:4 stripesize=256kb readpolicy=ra writepolicy=wb name=ost31
mkfs.lustre --ost --mgsnode=10.10.1.140@tcp0 --fsname=umt3 --reformat --index=32  --mkfsoptions="-i 1000000" --reformat --mountfsoptions="errors=remount-ro,extents,mballoc,stripe=256" /dev/sdf
tune2fs -O uninit_bg -m 1 -U ffc8fa63-e7d7-470f-9691-72b56839a6b3 /dev/sdf
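
The long pdisk= lists in the two omconfig examples above are easy to mistype. A small helper can generate them from a prefix and a disk range; this is a sketch, and the name pdisk_list is hypothetical rather than part of OpenManage or the reformat scripts.

```shell
# Build the comma-separated pdisk list for omconfig from a prefix
# (the leading two fields of each pdisk ID) and an inclusive disk range.
# pdisk_list is a hypothetical helper name.
pdisk_list() {
    local prefix=$1 first=$2 last=$3 out= n
    n=$first
    while [ "$n" -le "$last" ]; do
        out="${out:+$out,}${prefix}:${n}"   # add a comma only between entries
        n=$((n + 1))
    done
    echo "$out"
}

pdisk_list 1:2 0 4    # prints 1:2:0,1:2:1,1:2:2,1:2:3,1:2:4
```

For the 8(9)-disk example, pdisk_list 0:1 5 13 yields the nine IDs 0:1:5 through 0:1:13, that is, 8 data disks plus 1 parity disk.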

Creating a new LAST_ID for an OST

During the work after an OST failed, and the files were all drained, the "magic files" (LAST_ID, etc) were in trouble. In particular, LAST_ID was not available. So, in order to bring the OST back up after reformatting, I had to find a way to recreate this file.

On lmd02, find this:

[root@lmd02 ~]# lctl get_param osc.*.prealloc_next_id
...
osc.umt3-OST0025-osc.prealloc_next_id=6778336
An alternate value is the prealloc_last_id, which is larger (here it is 6778369), but considering the OST
was already completely drained, Andreas Dilger has suggested using the next_id value.

See these URLs:

http://wiki.lustre.org/manual/LustreManual20_HTML/LustreTroubleshooting.html
https://groups.google.com/forum/#!topic/lustre-discuss-list/NcDiutUirDg

So, we have to get this value into place as LAST_ID.  

     As an aside, we need to be sure we are talking about the correct index at all times.
     Following the directions:

     [root@umdist01 tmp]# od -Ax -td4 last_rcvd |less
     ...
     000080           0           0           0          37

     This matches the "lfs df" output, decimal 37 (hex 25, as in OST0025), and is the index used in the mkfs.lustre run.
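
Since the OST label carries the index in hex while mkfs.lustre --index and last_rcvd show it in decimal, a pair of conversion helpers can guard against mixing the two. This is a minimal sketch; both function names are hypothetical, and the label is assumed to end in the four hex digits (no trailing "-osc").

```shell
# Convert between an OST label (hex index) and the decimal index used by
# mkfs.lustre --index and "lfs df". Both helper names are hypothetical.
ost_label_to_index() {
    local hex=${1##*OST}        # "umt3-OST0025" -> "0025"
    printf '%d\n' "0x$hex"
}

ost_index_to_label() {
    local fsname=$1 index=$2
    printf '%s-OST%04x\n' "$fsname" "$index"
}

ost_label_to_index umt3-OST0025   # prints 37
ost_index_to_label umt3 35        # prints umt3-OST0023
```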

-----------------------------------------------------
Information found on the Internet indicates that, if we have the mds/mdt offline and mounted as ldiskfs,
we can do the following to find the value to use in LAST_ID

# extract last allocated object for all OSTs
# (debugfs needs the MDT device as its last argument; {mdtdev} is a placeholder)
mds# debugfs -c -R "dump lov_objids /tmp/lo" /dev/{mdtdev}

# cut out the last allocated object for this OST index
# (each entry is 8 bytes; NN here is the OST index in decimal)
mds# dd if=/tmp/lo of=/tmp/LAST_ID bs=8 skip=NN count=1

# verify value is the right one (LAST_ID = next_id - 1)
mds# lctl get_param osc.*OST00NN*.prealloc_next_id  # NN is OST index, in hex here
mds# od -td8 /tmp/LAST_ID

# get OST filesystem ready for this value and copy it in place
ossN# mount -t ldiskfs /dev/{ostdev} /mnt/tmp
ossN# mkdir -p /mnt/tmp/O/0
mds# scp /tmp/LAST_ID ossN:/mnt/tmp/O/0/LAST_ID

---------------------------------------------------------------------------------
Instead, we note that we can start with ANY LAST_ID file, and edit it to create the one desired.
Convert binary to text
xxd /tmp/LAST_ID /tmp/LAST_ID.asc

Fix it
vi /tmp/LAST_ID.asc
For example, 6513958 decimal is 0x636526, and appears in the LAST_ID.asc file like so:
0000000: 2665 6300 0000 0000                      &ec.....


Convert to binary
xxd -r /tmp/LAST_ID.asc /tmp/LAST_ID.new

Verify
od -Ax -td8 /tmp/LAST_ID.new
Copy it to the real LAST_ID
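
The od verification above prints the value in the byte order of the host it runs on. As a cross-check that does not depend on the host, the 8 bytes can be read one at a time and combined by hand. This is a sketch only; read_last_id is a hypothetical helper name, and it assumes the file holds a 64-bit value with the low byte first, as in the example dump above.

```shell
# Print the 64-bit value stored in a LAST_ID file as decimal, treating the
# first byte as the least significant. Reads the bytes individually, so the
# result does not depend on the host byte order.
# read_last_id is a hypothetical helper name.
read_last_id() {
    local value=0 bit=0 b
    for b in $(od -An -v -tu1 -N8 "$1"); do
        value=$(( value | (b << bit) ))
        bit=$(( bit + 8 ))
    done
    echo "$value"
}

# Example (path is a placeholder):
#   read_last_id /tmp/LAST_ID.new
```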

So, how do we do the edit? Use a calculator to convert the "prealloc_next_id" above to hex, for example:
[root@umdist01 ~]# echo "obase=16; 6778336" | bc
676DE0

When you edit the .asc file above, the value goes in little-endian byte order: the least significant byte comes first (lowest address) and the most significant byte sits at the highest address of the 8-byte value. So, when you edit the file, it goes in like so
e06d 6700 0000 0000

The verify sequence will now give consistent results. This LAST_ID.new can be copied back to the appropriate reformat directory as LAST_ID. Note that /mnt/ost/last_rcvd was NOT copied into place when the OST was ready to go; rather, we let the file system recreate it, per Andreas Dilger:

"Lustre is fairly robust about handling situations like this (e.g. recreating the last_rcvd file, the object heirarchy O/0/d{0..31}, etc). The one item that it will need help with is to recreate the LAST_ID file on the OST."

https://groups.google.com/forum/#!topic/lustre-discuss-list/2KEIrJl4YHg
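
The hand edit with xxd and vi described above could also be scripted, writing the 8 bytes directly from the decimal id. This is a sketch under the same assumption as the manual edit (an 8-byte value, least significant byte first); make_last_id is a hypothetical helper name, not a Lustre tool, and the result should still be checked with the verify step before use.

```shell
# Write a decimal object id to a file as 8 bytes, least significant first.
# make_last_id is a hypothetical helper name, not a Lustre tool.
make_last_id() {
    local id=$1 out=$2 i byte
    : > "$out"                                   # start with an empty file
    for i in 0 1 2 3 4 5 6 7; do
        byte=$(( (id >> (8 * i)) & 255 ))        # i-th byte, low byte first
        printf "\\$(printf '%03o' "$byte")" >> "$out"
    done
}

printf '%x\n' 6778336    # prints 676de0, same as the bc conversion

# Example (path is a placeholder):
#   make_last_id 6778336 /tmp/LAST_ID.new
#   od -Ax -td8 /tmp/LAST_ID.new
```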

Other files needed to restore an OST

The full list of files needed to restore an OST when reformatted is

LAST_ID
last_rcvd
mountdata
umt3-OST000b (for example, depending on the OST)

Andreas Dilger of OpenSFS says the following about files beyond the LAST_ID, where LAST_ID is detailed above.

last_rcvd -- Will be re-created during mount, or old one can be used. However, see below, if the LAST_ID has to be created from scratch, do not copy this file back.
mountdata -- Copy created during run of mkfs.lustre can be used. EXCEPT for us it can't. Must use instead the mountdata copied out previously. Not sure why.
umt3-OST000b -- would be recreated with a --writeconf, but it may also be created automatically during mount if missing (it is an OST-local copy of the MGS file of the same name so the OST can mount even if the MGS is offline).

Recreation of a ZFS OST

From above, we have the following items specified as requiring re-creation if an ldiskfs OST is re-created. Following an Email exchange with Andreas Dilger, we see there are several differences for zfs. Each ldiskfs step is listed below, followed by what must be done in the case of zfs.

cd backup_directory
- Ditto with saved content from the OST
cp -fv mountdata /mnt/ost/CONFIGS
- Stored as lustre.* files but I can't find these in any directory. Probably correctly re-created by mkfs.lustre
cp last_rcvd /mnt/ost
- Not strictly required, but doesn't hurt either.
mkdir -p /mnt/ost/O/0
chmod 700 /mnt/ost/O
- Perms now 755
chmod 700 /mnt/ost/O/0
- Perms now 755
cp -fv LAST_ID /mnt/ost/O/0
- Not needed, for 2.5.0 and later with LU-14 fixes, LAST_ID will be recreated based on info from the MDS.

The bottom line is that these should all be backed up as usual, but it is not strictly required to restore them.

Be sure to add "--replace" option in addition to "--index=n" during the recreation of the OST so that it does not re-register with the MGS as a new OST.
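
As an illustration of where "--replace" goes, the sketch below only prints a candidate command, modeled on the zfs mkfs.lustre line used elsewhere on this page, so it can be reviewed before being run by hand. The helper name ost_replace_cmd is hypothetical, and the index and pool/dataset values in the example are placeholders, not values from our systems.

```shell
# Print (do not run) a mkfs.lustre command for re-creating a ZFS OST in place.
# ost_replace_cmd is a hypothetical helper; review the output before running it.
ost_replace_cmd() {
    local fsname=$1 mgsnid=$2 index=$3 dataset=$4
    echo "mkfs.lustre --backfstype=zfs --fsname=$fsname --mgsnode=$mgsnid" \
         "--ost --index=$index --replace $dataset"
}

# Placeholder index and dataset, for illustration only:
ost_replace_cmd umt3B 10.10.2.173@tcp0 15 pool/ost0015
```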

Permanently Delete an OST

It happened that OST000f became corrupted in such a way that a maximum number of client mounts was attained, and then the processes crashed or locked up. The OST was totally drained of files, as detailed here, then re-created and allowed to stabilize with no mounted clients. It was then decided to destroy this OST and reformat it with a new index.

On mdtmgs, use the following command to permanently mark the OST as disabled

lctl conf_param umt3B-OST000f.osc.active=0

However, this leaves the problem of a new client mount occurring that will then hang Lustre on that client. Work around this by adding "lazystatfs" to the client mount options

mount -o localflock,lazystatfs -t lustre 10.10.2.173@tcp0:/umt3B /lustre/umt3

The same setting could also be applied on the client with the following command, but again this requires action at mount time

lctl set_param llite.umt3B*.lazystatfs=1

Another alternative was to use the /etc/sysconfig/lustre_mount_umt3.conf file and add the offline OST to the list, but this was not done either.

Finally, create and mount the new OST, using the already defined zfs pool.

mkfs.lustre --backfstype=zfs --fsname=umt3B --mgsnode=10.10.2.173@tcp0 --ost --index=63 ost-004/ost0063
mount /mnt/ost-004

Following this, the usual procedure for refilling an OST was followed.

IO Testing of an OST

After reformatting an OST, it should be IO tested for a while, to make sure no errors show up, before copying files back to it. The following was sent to me by Joe Landman of scalableinformatics.

wget http://download.scalableinformatics.com/disk_stress_tests/fio/sw_check.fio
wget http://download.scalableinformatics.com/disk_stress_tests/fio/loop_check.pl
yum install fio

Edit the sw_check.fio file to change the device, like so:
directory=/mnt/ost
Now mount the disk as type ldiskfs, as usual. Run the test in a screen session, because you want it to run for at least several hours, if not a whole day:

./loop_check.pl 100 > out 2>&1

then

grep crc out

Increase the 100 to 1000 for a long test. You should also look in /var/log/messages for errors thrown on the mounted device, for example, /dev/sde
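
The look through /var/log/messages can be narrowed to the device under test. The following is a rough sketch, assuming that relevant lines mention the short device name (for example sde) as a whole word and contain the word "error"; real kernel messages vary, so a zero count is not proof of a healthy disk. The helper name count_dev_errors is hypothetical.

```shell
# Count log lines that mention a device (whole word) and the word "error".
# count_dev_errors is a hypothetical helper name.
count_dev_errors() {
    local log=$1 dev=$2
    # "|| true" keeps the exit status clean when the count is 0
    grep -w -- "$dev" "$log" | grep -c -i 'error' || true
}

# Example:
#   count_dev_errors /var/log/messages sde
```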

-- BobBall - 22 Nov 2010

Topic revision: r9 - 22 Mar 2017, BobBall
