Ejected disk

When a disk has been ejected by one or more of the server instances and does not return to the cluster automatically:

1. Check if the drive is actually available.

Check if the drive is missing with storpool_initdisk --list --json |jq '.disks[] | select(.id==DISKID)':

# storpool_initdisk --list --json |jq '.disks[] | select(.id==118)'
 {
   "id": 118,
   "device": "/dev/sdp1",
   "isStorPool": true,
   "cluster": "b7ug.b",
   "nvme": false,
   "sectors": 15628048384,
   "type": "storpool",
   "typeId": 4,
   "meta": {
     "version": 65545,
     "instance": 3,
     "ssd": false,
     "wbc": false,
     "noFua": true,
     "noFlush": false,
     "noTrim": false,
     "testOverride": null,
     "diskMagickValue": 1615840159737,
     "journalMagickValue": 0,
     "entriesCount": 30600000,
     "objectsCount": 7650000
   }
 }

If the drive is missing, proceed to 5. Removal & re-balance out.

2. Check if this is a repeated event

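The output of storpool -j disk list | jq '.data[] | select(.id==DISKID) | .testResults' shows whether the drive was ejected before and whether the last automated test done by the server found any problems. The same information is available in table form from storpool disk DISKID testInfo: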

[root@comp-node-022 ~]# storpool disk 103 testInfo
times tested  |   test pending  |  read speed   |  write speed  |  read max latency   |  write max latency  | failed
         40  |             no  |  478 MiB/sec  |  485 MiB/sec  |  2 msec             |  2 msec             |     no

In this case, the drive has been ejected and tested 40 times. If the count is above 100, go to 5. Removal & re-balance out.
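The threshold check above can be scripted. The following is a minimal sketch, assuming the table layout shown in the example (one header row, then one data row, with | as the column separator). The function names times_tested and tested_too_often are hypothetical and are not part of the StorPool tools.

```shell
# Hypothetical helper: read `storpool disk DISKID testInfo` output on stdin
# and print the value found under the "times tested" header.
# Assumes the header row / data row layout shown in the example above.
times_tested() {
    awk -F'|' '
        NR == 1 {
            # Locate the column whose header is "times tested".
            for (i = 1; i <= NF; i++) {
                h = $i
                gsub(/^[ \t]+|[ \t]+$/, "", h)
                if (h == "times tested") col = i
            }
            next
        }
        col {
            # First data row: print the trimmed value from that column.
            v = $col
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            print v
            exit
        }
    '
}

# Succeeds when COUNT is above LIMIT (default 100, as in this procedure).
tested_too_often() {
    count=$1
    limit=${2:-100}
    [ "$count" -gt "$limit" ]
}
```

A possible use: count=$(storpool disk 103 testInfo | times_tested), followed by tested_too_often "$count" && echo "go to step 5".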

3. Check if the drive is causing delays for the cluster operations

Check whether the drive has been stalling requests. Stalls are visible in syslog with grep 'storpool_server.*stalled' /var/log/messages. If this happens regularly, go to 5. Removal & re-balance out.

Note

If all drives on the controller have stalled at the same time, check whether the problem is with the controller itself.
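To judge whether stalls are a regular occurrence, it can help to count them per day. The following is a minimal sketch, assuming traditional syslog timestamps ("Mon DD HH:MM:SS ..."), so that the first two fields of each line identify the day. The function name stalls_per_day is hypothetical, and the message text after the timestamp is not interpreted.

```shell
# Hypothetical helper: count storpool_server "stalled" messages per day.
# Assumes traditional syslog timestamps, where fields 1 and 2 are the
# month and the day; adjust the key for ISO-style timestamps.
stalls_per_day() {
    logfile=${1:-/var/log/messages}
    grep 'storpool_server.*stalled' "$logfile" |
        awk '{
                 day = $1 " " $2
                 # Remember days in order of first appearance.
                 if (!(day in count)) order[++n] = day
                 count[day]++
             }
             END {
                 for (i = 1; i <= n; i++) print order[i], count[order[i]]
             }'
}
```

Running stalls_per_day with no argument reads /var/log/messages and prints one line per day with the number of matching messages. Days without stalls are not listed.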

Check whether the drive’s latency is getting progressively worse, especially compared to the other drives on the same node. This is visible in the Analytics platform (https://analytics.storpool.com), in the System disks -> SP disk stat section. If the drive’s latencies have started to exceed a few seconds, and this is new behavior, go to 5. Removal & re-balance out.

4. Return drive to the cluster

If none of the checks above has led to removal of the drive, return it to the cluster:

# storpool_initdisk -r /dev/sdXN   # where X is the drive letter and N is the partition number

After that, wait for the drive to get back into the cluster, either by checking storpool disk list or by using the wait_all_disks helper (requires access to the API), then wait for the recovery tasks to complete. If the drive is unable to get back for some reason, go to 5. Removal & re-balance out. Otherwise, the procedure ends here.

5. Removal & re-balance out

At this step it is clear that the drive in question cannot be trusted to store data reliably and has to be removed.

5.1. Collect information on the drive

For any drive we remove, we need to provide two types of information to the customer:

  • Location/drive information - node name, disk ID, model, serial number, controller, enclosure, and slot identifiers (if applicable). If the hardware has a location LED, enable it;

  • Error information - any detail on the problem with the drive, which is required for RMA to the vendor in order to get a replacement.

5.1.1. Location information

Some data for any drive in StorPool can be seen with the diskid-helper and related tools:

Disk attached to Dell 3108-based controller:

[root@hc1 ~]# /usr/lib/storpool/diskid-helper /dev/sdb
MODEL=ST4000NM0023
SERIAL=Z1Z5621F
METHOD=PERCCLI_CMD
MODULE=megaraid_sas
[root@hc1 ~]# /usr/lib/storpool/storcli-helper.pl -p perccli64 /dev/sdb
SMARTCMD='smartctl -a -d megaraid,34 /dev/sdb'
ID_SERIAL='Z1Z5621F'
ID_MODEL='ST4000NM0023'
ID_ENCLOSURE='32'
ID_SLOT='2'
ID_CTRL='0'
ID_VDID='1'
[root@hc1 ~]# smartctl -a -d megaraid,34 /dev/sdb
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-862.11.6.el7.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

Warning: DEFAULT entry missing in drive database file(s)
=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST4000NM0023
Revision:             GS0F
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c5005923ca73
Serial number:        Z1Z5621F
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Thu Sep 30 13:42:04 2021 BST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     28 C
Drive Trip Temperature:        60 C

Manufactured in week 31 of year 2014
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  101
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  1930
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 180993466
  Blocks received from initiator = 2497577725
  Blocks read from cache and sent to initiator = 3544629222
  Number of read and write commands whose size <= segment size = 1020357238
  Number of read and write commands whose size > segment size = 1206

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 43603.95
  number of minutes until next internal SMART test = 58

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1693677694      145         0  1693677839        145    1262332.018           0
write:         0        0         0         0          0     301934.463           0
verify: 17510550        0         0  17510550          0          0.000           0

Non-medium error count:     1328

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                  64   42181                 - [-   -    -]
# 2  Background short  Completed                  64      13                 - [-   -    -]
# 3  Reserved(7)       Completed                  48      13                 - [-   -    -]
# 4  Background short  Completed                  64       9                 - [-   -    -]
# 5  Background short  Completed                  64       7                 - [-   -    -]
# 6  Reserved(7)       Completed                  48       7                 - [-   -    -]
# 7  Background short  Completed                  64       3                 - [-   -    -]

Long (extended) Self Test duration: 32700 seconds [545.0 minutes]

In the example above, you can see that:

  • The drive is connected to a megaraid controller that is managed via perccli;

  • The drive’s model is ST4000NM0023 and its serial number is Z1Z5621F;

  • It is connected to controller 0, enclosure 32, slot 2;

  • There are no elements in the grown defect list, but the error counter log shows 145 delayed ECC corrections on reads.
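The location details for the customer can be assembled from the helper output. The following is a minimal sketch, assuming the KEY='value' lines printed by storcli-helper.pl in the example above. The function names helper_field and drive_location are hypothetical. Helpers that do not print a given key (for example, controller or slot) leave that part of the summary empty.

```shell
# Hypothetical helper: print the value of KEY from KEY='value' or
# KEY=value lines on stdin, with surrounding single quotes removed.
helper_field() {
    key=$1
    sed -n "s/^${key}=//p" | head -n 1 | sed "s/^'\(.*\)'\$/\1/"
}

# Hypothetical helper: turn storcli-helper style output on stdin into a
# one-line location summary.
drive_location() {
    helper_out=$(cat)   # keep the whole output for repeated lookups
    ctrl=$(printf '%s\n' "$helper_out" | helper_field ID_CTRL)
    encl=$(printf '%s\n' "$helper_out" | helper_field ID_ENCLOSURE)
    slot=$(printf '%s\n' "$helper_out" | helper_field ID_SLOT)
    model=$(printf '%s\n' "$helper_out" | helper_field ID_MODEL)
    serial=$(printf '%s\n' "$helper_out" | helper_field ID_SERIAL)
    printf 'controller %s, enclosure %s, slot %s, model %s, serial %s\n' \
        "$ctrl" "$encl" "$slot" "$model" "$serial"
}
```

A possible use: /usr/lib/storpool/storcli-helper.pl -p perccli64 /dev/sdb | drive_location.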

Below are two more examples with shortened output, one for a directly attached drive and one for a drive behind an HP controller:

[root@sof1 ~]# /usr/lib/storpool/diskid-helper /dev/sda
MODEL=INTEL_SSDSC2KB038T7
SERIAL=PHYS750100YN3P8EGN
METHOD=ATA_ID_CMD
MODULE=ahci

root@magnetic1.sjc:~# /usr/lib/storpool/diskid-helper /dev/sdo
MODEL=LOGICAL_VOLUME
SERIAL=PDNLL0CRHAN28K
METHOD=SCSI_ID_CMD
MODULE=hpsa
root@magnetic1.sjc:~# /usr/lib/storpool/hpssacli-helper /dev/sdo
ID_SERIAL='ZC1742B3'
ID_MODEL='ATA     MB4000GVYZK'
SMARTCMD='smartctl -a -d sat+cciss,8 /dev/sg17'

5.1.2. Error information

The customer needs this information to handle warranty claims and disk replacements. Not all customers require it, but any SMART error that could be useful is worth sharing with them.

As SMART reporting is not unified between SATA, SAS, and NVMe drives, there is no simple rule for which values mean “This drive is broken”. The following fields and thresholds can be used as indicators:

  • 5 Reallocated_Sector_Ct : > 100

  • 187 Reported_Uncorrect : > 100

  • 197 Current_Pending_Sector : > 10

  • 198 Offline_Uncorrectable : > 10

  • 199 UDMA_CRC_Error_Count : > 0

  • Elements in grown defect list: > 100
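The thresholds above can be checked against smartctl output automatically. The following is a minimal sketch, assuming the usual smartctl ATA attribute table (where the raw value is the 10th whitespace-separated field) and the "Elements in grown defect list" line that SAS drives report. The function name smart_suspects is hypothetical, and a match is only an indicator, not a verdict on the drive.

```shell
# Hypothetical helper: read `smartctl -a` output on stdin and print each
# indicator from the list above whose raw value exceeds its threshold.
smart_suspects() {
    awk '
        BEGIN {
            # Attribute ID -> threshold, taken from the list above.
            limit[5] = 100; limit[187] = 100
            limit[197] = 10; limit[198] = 10; limit[199] = 0
        }
        # ATA attribute rows: numeric ID first, raw value in field 10.
        # The NF check skips other tables whose rows start with a number.
        $1 ~ /^[0-9]+$/ && ($1 in limit) && NF >= 10 {
            if ($10 + 0 > limit[$1])
                print $1, $2, $10
        }
        # SAS drives report the grown defect list as a plain line.
        /^Elements in grown defect list:/ {
            if ($NF + 0 > 100)
                print "grown_defect_list", $NF
        }
    '
}
```

A possible use is to pipe the SMARTCMD printed by the helper tools into it, for example smartctl -a -d megaraid,34 /dev/sdb | smart_suspects. Empty output means that none of the listed thresholds was exceeded.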

5.2. Forget and start balancing

Removal of the drive is done with the following steps:

5.2.1. Remove drive from placement groups and from list in the cluster

From a node that has access to the API:

storpool disk XXX forget

5.2.2. Make sure the drive is flagged as ejected

Run on the node with the drive, if the drive is writable:

storpool_initdisk --bad XXX /dev/sdXN

This step is needed so that, if the node or the storpool_server process restarts before the drive is replaced, the storpool_server service will not try to add the drive back to the cluster.

5.2.3. Clean up RAID VDs, if applicable

For any drive that uses WBC on the RAID controller and has a journal, remove the VDs:

storcli64 /c0/v1235 del
storcli64 /c0/v1234 del

5.2.4. Start locate function, if available

storcli64 /c0/e252/s4 start locate
sas3ircu 0 locate 2:23 on

5.2.5. Balance out the drive (restore redundancy)

The basic steps are to run the balancer and commit, as per 18. Rebalancing the cluster. There are two extra things to look for:

  • There is not enough space to restore the redundancy. In this case you can add -f 95 to try filling the drives more;

  • The redundancy cannot be restored even with -f 95.

If any of the above is true, notify the customer immediately, especially when redundancy cannot be restored, and look into other options.

5.3. Notify customer on the event

Open a ticket with the customer to notify them that the drive needs a replacement, and to report any issues with restoring redundancy.

5.4. Monitor progress

Monitor the progress of the rebalance and watch for anything that prevents it from completing.

watch -n10 -d 'storpool relocator status; storpool task list groupBy disk'

5.5. Notify customer that the rebalance is complete and the new drive is expected.

5.6. After the drive is replaced, disable locator lights.

storcli64 /c0/e252/s4 stop locate
sas3ircu 0 locate 2:23 off

Note

In some clusters, you also need to update the firmware of the new drive before adding it to the cluster.

See Adding a drive to a running cluster for the remainder of the procedure when the drive is replaced.