Ejected disk

When a disk has been ejected by one or more of the server instances, proceed as follows.

Summary:

IF disk model = M500:
BALANCE_OUT

Search tickets for eject events for the same disk
IF disk was ejected before:
BALANCE_OUT

IF test result != success:
BALANCE_OUT

add disk to the cluster (storpool_initdisk -r)

END

BALANCE_OUT:

Check placement group has enough free space
IF PG free space < disk size * 1.5:
notify customer
END

# Balance-out the disk
storpool disk <diskID> forget # removes the drive from all placement groups
/usr/lib/storpool/balancer.sh -R # attempts to restore the data redundancy
storpool balancer commit

Notify the customer the disk has to be replaced.
END
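The decision flow in the summary can be sketched as a small shell function. This is only an illustration: the function name eject_triage, its three arguments, and the substring match on "M500" are assumptions made for this sketch and are not part of any StorPool tool.

```shell
# Sketch of the triage decision from the summary (hypothetical helper).
#   $1  disk model string, e.g. "M500"
#   $2  "yes" if an earlier eject event was found in the tickets, otherwise "no"
#   $3  disk test result: "success" or anything else
eject_triage() {
    model=$1
    ejected_before=$2
    test_result=$3

    # Micron M500 disks are balanced out without further testing.
    # Assumption: the model string contains "M500".
    case $model in
        *M500*) echo BALANCE_OUT; return 0 ;;
    esac

    # A repeated eject of the same disk also means balance-out.
    if [ "$ejected_before" = yes ]; then
        echo BALANCE_OUT
        return 0
    fi

    # Only a disk that passed the test goes back into the cluster.
    if [ "$test_result" != success ]; then
        echo BALANCE_OUT
        return 0
    fi

    echo ADD_TO_CLUSTER
}
```

For example, eject_triage ST4000 no success prints ADD_TO_CLUSTER, while any M500 disk, repeated eject, or failed test prints BALANCE_OUT.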

1. Check disk type and repeated events

If this is a Micron M500 disk, go to 7. Balance-out.

Search the tickets for recent eject events for the same disk. Hint: search by serial number (storpool disk list info). If the disk has been ejected before, go to 7. Balance-out.

2. Notify customer

Notify the customer that the disk has been ejected and is being tested.

3. Check whether the disk is still accessible and what its current status is. On the storage server where the disk is installed, execute:

# storpool_initdisk --list


If the disk is listed and the /dev/sdX device file exists, continue with 5. Test disk.

4. If the /dev/sdX device does not exist and the disk is behind an LSI RAID controller

4.1. Megaraid 3108 (Dell PERC):

Get disk location:

# /usr/lib/storpool/storcli-helper.pl /dev/sdX


Note

For Dell PERC controllers, add the -p perccli64 argument to storcli-helper.

Check the disk status and get the disk ID:

# storcli64 /c0/e252/s2 show


Get the controller log with:

# storcli64 /c0 show termlog | less


Get the SMART status with:

# smartctl -a -d megaraid,<DID> /dev/sdX


Here <DID> is the device ID (DID) returned by the storcli64 show command above.

Todo

HP

Todo

Reset/restore disk status in the controller.

If the disk cannot be accessed, or there are critical errors in the RAID controller log or the SMART status, go to 7. Balance-out.

5. Test disk

Get the latest version of the disk-tester tool from lopi:/home/cust/_tools/disk-tester.sh

# screen -S disktests
# mkdir -p ~/disktests && cd ~/disktests


Add the drive to be tested to drives.txt:

# echo sdX1 > ./drives.txt
# ./disk-tester.sh --no-write


A failure to complete any of the tests, or results outside the limits listed here, indicates a possible disk failure:

• for HDD:

• IOPS < 150

• for SSD:

• IOPS < 10k

S.M.A.R.T. (for HDD and SSD):

Note

See step 4 above for how to get the SMART status of disks connected to a RAID controller.

• 5 Reallocated_Sector_Ct : increased during test

• 187 Reported_Uncorrect : >100 or increased during test

• 197 Current_Pending_Sector : >10 or increased during test

• 198 Offline_Uncorrectable : > 0

• 199 UDMA_CRC_Error_Count : > 0

If the test did not pass or the drive failed during the test, go to 7. Balance-out.
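The limits above can be checked mechanically. The following is a minimal sketch, not part of disk-tester.sh: the helper names smart_raw, smart_suspect and iops_suspect are made up for illustration. It assumes the usual smartctl -A attribute table, where RAW_VALUE is the 10th column, and that the raw values of these attributes are plain integers, which may not hold for every vendor.

```shell
# Print the RAW_VALUE (10th column) of one attribute.
# stdin: output of `smartctl -A ...`; $1: attribute ID, e.g. 5
smart_raw() {
    awk -v id="$1" '$1 == id { print $10; exit }'
}

# Succeed (exit 0) if the attribute breaches the limits listed above.
#   $1 attribute ID, $2 raw value before the test, $3 raw value after the test
smart_suspect() {
    id=$1
    before=$2
    after=$3
    case $id in
        5)       [ "$after" -gt "$before" ] ;;                           # increased during test
        187)     [ "$after" -gt 100 ] || [ "$after" -gt "$before" ] ;;   # >100 or increased
        197)     [ "$after" -gt 10 ] || [ "$after" -gt "$before" ] ;;    # >10 or increased
        198|199) [ "$after" -gt 0 ] ;;                                   # > 0
        *)       return 1 ;;                                             # attribute not covered here
    esac
}

# Succeed (exit 0) if the measured IOPS is below the floor for the drive type.
#   $1 "hdd" or "ssd", $2 measured IOPS as an integer
iops_suspect() {
    case $1 in
        hdd) [ "$2" -lt 150 ] ;;
        ssd) [ "$2" -lt 10000 ] ;;
        *)   return 2 ;;                                                 # unknown drive type
    esac
}
```

A possible use is to record smartctl -A /dev/sdX | smart_raw 197 before and after the test run and pass both numbers to smart_suspect 197; a zero exit status means the disk should be treated as failing.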

6. After a successful test, add the disk to the cluster:

# storpool_initdisk -r /dev/sdXN   # where X is the drive letter and N is the partition number


Check disk status with:

# storpool disk list
# less /var/log/messages


Update the customer.

End.

7. Balance-out the disk (Restore data redundancy)

7.1. Check there is enough space and performance for rebalancing.

Find the placement group the ejected disk belongs to:

# storpool placementGroup list
# storpool placementGroup <PG> list


Check that there is enough free space in the placementGroup:

# storpool template status


Check avail. {head|all|tail}, depending on which part of the template this PG participates in. The PG must have at least the disk size + 50% available (1.5 times the size of the ejected disk).

Note

Check the collected statistics for the drives in the placement group that the ejected drive belongs to, and make sure the disks are not already overloaded and would not become overloaded by the rebalancing.

If there is not enough free space available in the placement group:

• immediately notify the customer that the capacity has to be increased or additional space has to be freed to restore the redundancy

• End.
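The free-space rule from 7.1 (at least disk size + 50% available) is simple arithmetic. A minimal sketch, assuming both values have already been read manually from the storpool output and converted to the same unit; the helper name pg_has_room is hypothetical:

```shell
# Succeed (exit 0) if the placement group can absorb the ejected disk.
#   $1 free space in the placement group
#   $2 size of the ejected disk (same unit as $1, integers only)
pg_has_room() {
    # free >= 1.5 * disk  is the same as  2 * free >= 3 * disk,
    # which avoids fractions in shell integer arithmetic.
    [ $(( $1 * 2 )) -ge $(( $2 * 3 )) ]
}
```

For example, with a 4000 GB disk, pg_has_room 6000 4000 succeeds, while pg_has_room 5999 4000 fails and the customer has to be notified before any rebalancing.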

7.2. Restore the data redundancy.

On the management node execute:

# storpool disk <diskID> forget # removes the drive from all placement groups
# /usr/lib/storpool/balancer.sh -R # attempts to restore the data redundancy
# storpool balancer commit


It is also mandatory to mark the disk itself as ejected. This prevents the disk from reappearing the next time the server instance starts, which must be avoided because the disk is faulty. Log in to the node where the disk is physically installed and execute:

# storpool_initdisk -F <diskID> <device>


After that, the disk will still be shown by storpool_initdisk --list until it is physically removed, but with the EJECTED flag set.
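Because the commands in 7.2 run on two different nodes, it can help to print them for review before executing anything. The sketch below only echoes the commands already listed in this section with the disk ID and device filled in; the helper name balance_out_plan is hypothetical and it does not run any StorPool command itself.

```shell
# Print the 7.2 command sequence for review (dry run, nothing is executed).
#   $1 disk ID of the ejected disk, $2 device of the disk on its storage node
balance_out_plan() {
    disk_id=$1
    device=$2
    cat <<EOF
# on the management node
storpool disk $disk_id forget
/usr/lib/storpool/balancer.sh -R
storpool balancer commit
# on the node where the disk is physically installed
storpool_initdisk -F $disk_id $device
EOF
}
```

The printed plan can be pasted into the ticket, so that the customer update in 7.5 shows exactly which disk ID and device were handled.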

7.3. Remove the VD if the disk is attached to a RAID controller

The virtual drive should be removed; otherwise some controllers put all VDs in pass-through mode when the disk is physically removed. Example for Megaraid 3108:

# storcli64 /c0/v1234 del


7.4. Start the locate function, if supported

It can be convenient to start the locate function of the controller so that the disk is easier to find. Example for Megaraid 3108:

# storcli64 /c0/e252/s4 start locate


7.5. Update the customer that the disk has to be replaced. Note the serial number and whether the locate function has been started.