Ejected disk

When a disk is ejected by one of the server instances:

Summary:

IF disk model = M500:
    BALANCE_OUT

Search tickets for eject events for the same disk
IF disk was ejected before:
    BALANCE_OUT

Start read-only test
IF test result != success:
    BALANCE_OUT

Add disk to the cluster (storpool_initdisk -r)

END

BALANCE_OUT:

    Check placement group has enough free space
    IF PG free space < disk size * 1.5:
        notify customer
        END

    # Balance-out the disk
    storpool disk <diskID> forget # removes the drive from all placement groups
    /usr/lib/storpool/balancer.sh -R # attempts to restore the data redundancy
    storpool balancer commit

    Notify the customer the disk has to be replaced.
    END
  1. Check disk type and repeated events

    If this is a Micron M500 disk, go to 7. Balance-out.

    Search tickets for recent eject events for the same disk. Hint: search by serial number (storpool disk list info); see the sketch below. If the disk has been ejected before, go to 7. Balance-out.
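    A quick way to pull the row for the ejected disk, including its serial number (a sketch; <diskID> is a placeholder for the actual StorPool disk ID, and the column layout may vary between StorPool versions):

    # storpool disk list info | grep -w <diskID>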

  2. Notify customer

    Notify the customer that the disk has been ejected and is being tested.

  3. Check that the disk is still accessible and check its current status. On the storage server where the disk is installed, execute:

    # storpool_initdisk --list
    

    If the disk is listed and the /dev/sdX device file exists, continue with 5. Test disk.
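    A minimal presence check for the device file (a sketch; replace sdX with the actual device name):

    # test -b /dev/sdX && echo "device present" || echo "device missing"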

  4. If the /dev/sdX device does not exist and the disk is behind an LSI RAID controller:

    4.1. Megaraid 3108 (Dell PERC):

    Get disk location:

    # /usr/lib/storpool/storcli-helper.pl /dev/sdX
    

    Note

    For Dell PERC controllers, add the -p perccli64 argument to storcli-helper.

    Check the disk status and get the disk ID:

    # storcli64 /c0/e252/s2 show
    

    Get the controller log with:

    # storcli64 /c0 show termlog | less
    

    Get the SMART status with:

    # smartctl -a -d megaraid,<DID> /dev/sdX
    

    Here, <DID> is the device ID returned by the storcli64 show command above.
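    If it is unclear which DID corresponds to the /dev/sdX device, the serial numbers can be matched by probing each DID in turn (a sketch; the 0-23 range is an assumption, adjust it to the enclosure):

    # for did in $(seq 0 23); do echo "DID $did:"; smartctl -i -d megaraid,$did /dev/sdX | grep -i 'serial number'; done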

    Todo

    HP

    Todo

    Reset/restore disk status in the controller.

    If the disk cannot be accessed, or there are critical errors in the RAID controller log or the SMART status, go to 7. Balance-out.

  5. Test disk

    Get the latest version of the disk-tester tool from lopi:/home/cust/_tools/disk-tester.sh and prepare a working directory in a screen session:

    # screen -S disktests
    # mkdir ~/disktests && cd ~/disktests
    

    Add the drive to be tested to drives.txt and start the read-only test:

    # echo sdX1 > ./drives.txt
    # ./disk-tester.sh --no-write
    
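    Several drives can be tested in one run by listing them all in drives.txt (a sketch; one device name per line is an assumption about the tool's input format):

    # printf '%s\n' sdX1 sdY1 > ./drives.txt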

    A failure to complete any of the tests, as well as results lower than those detailed below, indicates a possible disk failure:

    • for HDD:

      • sequential read < 100MB/s

      • IOPS < 150

    • for SSD:

      • sequential read < 350MB/s

      • IOPS < 10k

    S.M.A.R.T. (for HDD and SSD):

    Note

    See 4. above for how to get the SMART status of disks connected to a RAID controller.

    • 5 Reallocated_Sector_Ct : increased during test

    • 187 Reported_Uncorrect : >100 or increased during test

    • 197 Current_Pending_Sector : >10 or increased during test

    • 198 Offline_Uncorrectable : > 0

    • 199 UDMA_CRC_Error_Count : > 0
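
    The attributes above can be extracted in one pass before and after the test, so that increases are easy to spot (a sketch; for disks behind a RAID controller add the -d megaraid,<DID> option):

    # smartctl -A /dev/sdX | egrep 'Reallocated_Sector_Ct|Reported_Uncorrect|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count'

    Compare the RAW_VALUE column of the two runs to see which attributes increased during the test.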

    If the test did not pass or the drive failed during the test, go to 7. Balance-out.

  6. After a successful test, add the disk back to the cluster:

    # storpool_initdisk -r /dev/sdXN   # where X is the drive letter and N is the partition number
    

    Check disk status with:

    # storpool disk list
    # less /var/log/messages
    
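    To narrow the log output down to the re-added disk (a sketch; the exact log line format depends on the OS and StorPool version):

    # grep -w <diskID> /var/log/messages | tail -n 20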

    Update the customer.

    End.


  7. Balance-out the disk (Restore data redundancy)

    7.1. Check that there is enough space and performance headroom for rebalancing.

    Find the placement group the ejected disk belongs to:

    # storpool placementGroup list
    # storpool placementGroup <PG> list
    

    Check that there is enough free space in the placementGroup:

    # storpool template status
    

    Check avail. {head|all|tail} depending on which part of the template this PG participates in. The placement group must have at least the disk size + 50% available.
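    For example, balancing out a 4 TB drive requires at least 4 * 1.5 = 6 TB available in the placement group. The threshold can be computed quickly in the shell (a sketch; set size_tb to the actual disk size):

    # awk -v size_tb=4 'BEGIN { printf "required free space: %.1f TB\n", size_tb * 1.5 }'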

    Note

    Check the collected statistics for the drives in the placement group where the ejected drive participates, and ensure that the remaining disks are not already overloaded and will not become overloaded by the rebalancing.

    If there is not enough free space available in the placement group:

    • immediately notify the customer that increasing the capacity or freeing additional space is required to restore the redundancy

    • End.

    7.2. Restore the data redundancy.

    On the management node execute:

    # storpool disk <diskID> forget # removes the drive from all placement groups
    # /usr/lib/storpool/balancer.sh -R # attempts to restore the data redundancy
    # storpool balancer commit
    

    It is also mandatory to mark the disk itself as ejected. This prevents the disk from reappearing the next time the server instance starts, which must be avoided given that the disk is faulty. Log in to the node where the disk is physically installed and execute:

    # storpool_initdisk -F <diskID> <device>
    

    After that, the disk will still be shown by storpool_initdisk --list until it is physically removed, but now with the EJECTED flag.

    7.3. Remove the VD, if the disk is attached to a RAID controller

    The virtual drive should be removed; otherwise, some controllers put all VDs in pass-through mode when a disk is physically removed. Example for Megaraid 3108:

    # storcli64 /c0/v1234 del
    

    7.4. Start the locate function, if supported

    It can be convenient to start the locate function of the controller so that the disk is easier to find physically. Example for Megaraid 3108:

    # storcli64 /c0/e252/s4 start locate
    
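    Once the disk has been physically replaced, the locate indicator can be switched off again (a sketch for the same slot):

    # storcli64 /c0/e252/s4 stop locate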

    7.5. Update the customer that the disk has to be replaced. Note the serial number and whether the locate function has been started.