Adding a drive to a running cluster
This procedure describes adding HDD, SSD and NVMe drives to a running cluster.
Drive stress tests
Note
The same procedure is followed by Ansible on any new deployment.
Attention
Make sure that all SSD drives and SAS/HBA controllers are running the latest recommended firmware or BIOS version. Some known-good versions are listed in our system requirements.
Any drive added to a StorPool cluster needs an initial burn-in test to make sure the drive has no serious defects.
The recommended steps are:
Note
Running the tester tool in screen leaves the test running if there's a connectivity issue.
screen -S disktests
mkdir ~/disktests && cd ~/disktests
Note
Add all drives to be used for StorPool to a file named drives.txt, without the /dev prefix:
echo sda sdb > ./drives.txt
disk_tester
Note
Wait for it to finish, then check the results in the disks/*-completed files.
How disk_tester works
Collects the S.M.A.R.T. output from each drive before the stress tests, so that it can be collected again afterwards and the differences shown once the tests complete.
Trims all flash-based drives.
Sequentially fills all drives in parallel with random data, then verifies the written data.
Writes random data to all drives at random offsets, then verifies the written data. The amount of data written differs between HDDs and SSDs.
Prints a summary of the results, including the minimum, maximum, and average for the sequential and random workload tests, as well as the differences between the S.M.A.R.T. output collected before and after the tests.
When more than one drive is tested, the sum of the sequential read and write bandwidth over all drives can be reviewed in results/all-completed to check for bottlenecks in the controller when reading from or writing to all drives in drives.txt.
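As an illustrative sketch of reviewing the output (assuming per-drive files under disks/ named after the devices from drives.txt), the result files can be inspected with:
less disks/sda-completed # per-drive summary and S.M.A.R.T. before/after differences
less results/all-completed # aggregate results over all tested drives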
How to read the results
Compare the results from the sequential and random write tests and verify that similar drives show similar speeds. For example, drives of the same model with very different results (lower performance, or spikes in the minimum IOPS or bandwidth) might be a sign of pending issues with either the drive or the controller.
Note
In case of large before/after differences for the Seek Error Rate, Raw Read Error Rate, and Hardware ECC Recovered SMART attributes, check here. High error rates after the first fill might be OK (notably on Seagate disks).
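If needed, the current values of these attributes can be re-checked manually with smartctl; this is an illustrative sketch where /dev/sdX is a placeholder for the tested drive:
smartctl -A /dev/sdX | egrep 'Raw_Read_Error_Rate|Seek_Error_Rate|Hardware_ECC_Recovered'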
Partition and init drive
HDD on a 3108 MegaRAID controller
Locating the disk
We'll assume that the disk is already physically installed and visible to the controller. First, locate the disk on the controller, either by S/N or by controller/enclosure/slot position. In this case we know from the customer that the disk is on controller 0, enclosure 252, slot 1.
Get the SP_OURID of this node with:
# storpool_confshow -e SP_OURID
Check for available disk IDs:
# storpool disk list
Select a new, non-existent disk ID according to the numbering convention, e.g. 212.
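A quick sketch for confirming that the chosen ID (212 in this example) is free, based on the storpool disk list output format shown later in this procedure; no output means the ID is not in use:
# storpool disk list | awk '$1 == 212'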
If we don't know the disk's location but we know its device name, we can use the storcli-helper.pl script:
# /usr/lib/storpool/storcli-helper.pl /dev/sdX
ID_SERIAL='ZC209BG90000C7168ZVB'
ID_MODEL='ST2000NM0135'
ID_ENCLOSURE='252'
ID_SLOT='1'
ID_CTRL='0'
The device name can be found with:
# lshw -c disk -short
Check the RAID controller type with:
# lshw -c storage -short
/0/100/1/0 scsi0 storage MegaRAID SAS-3 3108 [Invader]
If the serial number is known, we can verify the correct location with:
# storcli64 /c0/e252/s1 show all
...
Drive /c0/e252/s1 Device attributes :
===================================
SN = ZC209BG90000C7168ZVB
Manufacturer Id = SEAGATE
Model Number = ST2000NM0135
NAND Vendor = NA
Raw size = 1.819 TB [0xe8e088b0 Sectors]
Coerced size = 1.818 TB [0xe8d00000 Sectors]
Non Coerced size = 1.818 TB [0xe8d088b0 Sectors]
Logical Sector Size = 512B
...
From the data above we can calculate the disk's size in MiB: Coerced size x Logical Sector Size / 1024^2:
# echo 'ibase=16;E8D00000*200/400^2' | bc
1907200
We’ll need this later.
Note
All input to bc is hexadecimal and must be uppercase; the output is decimal, in MiB (2^20).
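The same calculation can also be done with plain shell arithmetic, which accepts hexadecimal input via the 0x prefix:
# echo $(( 0xE8D00000 * 512 / 1024 / 1024 ))
1907200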
Prepare HDD
All HDDs shall be connected to RAID controllers with battery backup or cachevault.
For 3108 controllers, a small WBC journal device is to be created for each hard drive. The physical HDD is split into two Virtual Disks (VDs) on the controller: the first VD is the main disk and the second VD is the journal. WBC is disabled on the first VD and enabled on the second VD (the journal).
Switch drive from JBOD to RAID mode
Check the disk state. If it is JBOD, it has to be switched to Online:
# storcli64 /c0 show
...
-----------------------------------------------------------------------
EID:Slt DID State DG Size Intf Med SED PI SeSz Model Sp
-----------------------------------------------------------------------
252:0 6 Onln 0 1.818 TB SAS HDD N N 512B ST2000NM0135 U
252:1 8 JBOD - 1.818 TB SAS HDD N N 512B ST2000NM0135 U
252:2 7 Onln 1 1.818 TB SAS HDD N N 512B ST2000NM0135 U
...
Change JBOD to Online with:
# storcli64 /c0/e252/s1 set good force
If this fails with Failed controller has data in cache for offline or missing virtual disks, there is unflushed cache for the replaced disk that needs to be cleared. It can be done with:
# storcli64 /c0 show preservedCache # shows the missing VD
Controller = 0
Status = Success
Description = None
-----------
VD State
-----------
9 Missing
-----------
# storcli64 /c0/v9 delete preservedcache # delete the cache for missing VD
Then repeat the previous command.
Create a Virtual Disk for the Journal
Define two virtual disks on the physical disk: the first is the main storage, and the second is a 100 MiB disk used as a journal with WBC enabled. The size of the first VD is the total disk size we calculated earlier minus 100 MiB. The second VD will use the remaining space (100 MiB).
# storcli64 /c0 add vd type=r0 Size=1907100 drives=252:1 pdcache=off direct nora wt
# storcli64 /c0 add vd type=r0 drives=252:1 pdcache=off direct nora wb
Check status with:
# storcli64 /c0 show
Look at the EID:Slt -> DG mapping in the PD LIST and the DG/VD column in the VD LIST.
Check /dev/sdX with lsblk. There should now be two block devices for this drive; in our case they are sdb and sdo.
Create Partitions
Create partitions on both virtual disks. Note the different parameters for the journal disk.
# parted -s --align optimal /dev/sdX mklabel gpt -- mkpart primary 2MiB 100% # where X is the drive letter
# parted -s --align optimal /dev/sdY mklabel gpt -- mkpart journal 2MiB 98MiB # where Y is the drive letter
Initialize the HDD
Initialize the disk with journal:
# storpool_initdisk --no-notify -i Z -j /dev/sdX 212 /dev/sdYM # where X and Y are the drive letters, Z is the storpool_server instance
# storpool_initdisk --list
...
/dev/sdb1, diskId 212, version 10007, server instance 0, cluster d.b, WBC, jmv 160E0BDE853
...
/dev/sdo1, diskId 212, version 10007, server instance 0, cluster -, journal mv 160E0BDE853
Note
All drives not attached to a RAID controller can be initialized with disk_init_helper. For details about this tool, see Storage devices.
Any SSD/HDD/NVMe drive should be visible with lsblk and should not have any partitions. Ensure the drives are not part of an LVM or device mapper setup.
Make sure that multipathd.service and multipathd.socket are disabled/stopped during the discovery/init phases of disk_init_helper.
The tool requires a valid symlink from /dev/disk/by-id to each drive.
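A quick pre-check sketch for these prerequisites, where sdX is a placeholder for the new drive:
# systemctl is-active multipathd.service multipathd.socket # both should report inactive
# systemctl is-enabled multipathd.service multipathd.socket # both should report disabled
# ls -l /dev/disk/by-id/ | grep sdX # a by-id symlink must point to the drive
# lsblk /dev/sdX # the drive should show no partitions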
SATA SSD drive
SSD drives shall be used in JBOD mode, without WBC, if connected to a RAID controller. An SSD drive can also be connected to a SATA port on the motherboard, as no special controller features are required for SSDs.
Note
This note is kept for backwards compatibility for clusters where disk_init_helper is not yet installed/available. We highly recommend using disk_init_helper, because it poses fewer risks than manually executing commands.
To manually create a partition on some of the disks use:
# parted -s --align optimal /dev/sdX mklabel gpt -- mkpart primary 2MiB 100% # where X is the drive letter
Same example, but for an NVMe device larger than 4TiB, split into two partitions:
# parted -s -a optimal /dev/nvmeXnY mklabel gpt -- mkpart primary 2MiB 50% mkpart primary 50% 100%
To manually initialize a disk as a StorPool device use:
Attention: check that the diskId is not already in use before you initialize the disk:
# storpool_initdisk --list
# storpool_initdisk --no-notify -i Z -s 201 /dev/sdXN # where X is the drive letter and N is the partition number, Z is the storpool_server instance
# storpool_initdisk --list
...
/dev/sdj1, diskId 201, version 10007, server instance 0, cluster d.b, SSD
NVMe drive
StorPool uses either the vfio-pci driver or its own storpool_pci driver to manage PCI devices.
The vfio-pci driver is preferred over storpool_pci for new installations, but it has some constraints; for example, when multiple NVMe drives are in the same IOMMU group, the storpool_pci driver is the only alternative.
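To see whether NVMe drives share an IOMMU group, the group of each NVMe controller can be inspected through sysfs. The following one-liner is a sketch; it assumes the IOMMU is enabled (otherwise the iommu_group links do not exist):
# for dev in $(lspci -D | awk '/Non-Volatile/ {print $1}'); do echo "$dev $(readlink /sys/bus/pci/devices/$dev/iommu_group)"; done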
The following table shows the supported kernel cmdline requirements for each:
| Kernel cmdline | storpool_pci | vfio-pci |
|----------------|--------------|----------|
| iommu=pt       | yes          | yes      |
| iommu=on       | no           | yes      |
| iommu=off      | yes          | no       |
The new NVMe drive must be present in the list returned by lsblk. If it's absent, the drive might be attached to the storpool_pci or vfio_pci driver. To check if this is the case, first determine the PCI ID of the NVMe drive with lspci:
[root@testcl-node2 ~]# lspci | grep "olatile"
3b:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [Optane]
3c:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [Optane]
5e:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
5f:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
60:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
61:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
For this example we select the NVMe drive with PCI ID 3b:00.0.
Next, we need to verify whether the NVMe drive is already initialized as a StorPool disk. This can be done via storpool_initdisk --list --json. Here's an example:
[root@testcl-node2 ~]# storpool_initdisk --list --json | jq -r '.disks[] | select(.typeId == 4 and .nvme)'
If it is not present in the output of storpool_initdisk, verify whether the NVMe drive of interest is bound to either storpool_pci or vfio_pci, and if so, bind it to the kernel's nvme driver.
If the drive is present in storpool_initdisk --list, the storpool_nvmed service and all the storpool_server instances serving NVMe drives must be stopped first. The following example uses storpool_pci, but the same applies for vfio_pci:
[root@cl8-svp1 ~]# lspci -k | grep -A 3 3b:00.0
3b:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [Optane]
Subsystem: Intel Corporation Device 3907
Kernel driver in use: storpool_pci
Kernel modules: nvme, storpool_pci
Next, we need to unbind the NVMe drive from storpool_pci or vfio_pci and bind it to the nvme driver. This requires the storpool_nvmed service and all the storpool_server instances that serve NVMe drives to be stopped. This is done with:
[root@testcl2-node2 ~]# storpool_ctl stop --servers --expose-nvme
After the NVMe drive has been rebound, it should appear in the output of lsblk.
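A quick way to confirm this is to list only whole NVMe block devices (illustrative; device and model names will differ on your system):
# lsblk -d -o NAME,SIZE,MODEL | grep -i nvme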
Partitioning the NVMe
The disk_init_helper tool takes care of partitioning all NVMe devices; drives larger than 4 TiB are automatically split into as many partitions as required, so that each chunk is not larger than the split size.
The split size can be adjusted to a larger or a smaller value, though there is a maximum chunk size of ~7 TiB, limited by the amount of memory that can be allocated for the metadata cache of each disk.
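As a rough sketch of the arithmetic used in the example below: with a --nvme-split-size of 2 TiB, a 6.98 TiB drive needs ceil(6.98 / 2) = 4 partitions, each about 6.98 / 4 ≈ 1.75 TiB, which matches the partition sizes shown in the discovery output.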
Example discovery on a node with a 6.98 TiB NVMe drive, split into four partitions of about 1.75 TiB each using a --nvme-split-size of 2 TiB.
First we check with the following command how the disks are going to be partitioned; MZQLB7T6HMLA is the model of the disk as listed under /dev/disk/by-id/nvme-…
The NVMe type may be one of:
n - NVMe drive
nj - NVMe drive with HDD journals
njo - NVMe drive with journals only (no StorPool data disk)
[root@sp001 by-id]# /usr/sbin/disk_init_helper discover --start 105 '*MZQLB7T6HMLA*:n' --nvme-split-size $((2*1024**4))
Disk /dev/disk/by-id/nvme-SAMSUNG_MZQLB7T6HMLA-00007_S4BGNC0T500213 matches pattern *MZQLB7T6HMLA*
Disk /dev/disk/by-id/nvme-SAMSUNG_MZQLB7T6HMLA-00007_S4BGNC0T500213 overriding nj to n
nvme-SAMSUNG_MZQLB7T6HMLA-00007_S4BGNC0T500213 (type: NVMe):
data partitions
/dev/disk/by-id/nvme-SAMSUNG_MZQLB7T6HMLA-00007_S4BGNC0T500213-part1 (105): 1788.49 GiB (mv: None)
/dev/disk/by-id/nvme-SAMSUNG_MZQLB7T6HMLA-00007_S4BGNC0T500213-part2 (106): 1788.49 GiB (mv: None)
/dev/disk/by-id/nvme-SAMSUNG_MZQLB7T6HMLA-00007_S4BGNC0T500213-part3 (107): 1788.49 GiB (mv: None)
/dev/disk/by-id/nvme-SAMSUNG_MZQLB7T6HMLA-00007_S4BGNC0T500213-part4 (108): 1788.49 GiB (mv: None)
Discovery done, use '-d'/'--dump-file' to provide a target for the discovery output.
If you're satisfied with the partitioning, dump it to a file:
[root@sp001 by-id]# /usr/sbin/disk_init_helper discover --start 105 '*MZQLB7T6HMLA*:n' --nvme-split-size $((2*1024**4)) -d new_disk
Disk /dev/disk/by-id/nvme-SAMSUNG_MZQLB7T6HMLA-00007_S4BGNC0T500213 matches pattern *MZQLB7T6HMLA*
Disk /dev/disk/by-id/nvme-SAMSUNG_MZQLB7T6HMLA-00007_S4BGNC0T500213 overriding nj to n
Success generating new_disk, proceed with 'init'
To finish, write the generated partitions to the disk using init <file> --exec:
/usr/sbin/disk_init_helper init new_disk --exec
Before proceeding with init, ensure that the automatically generated disk IDs are not already present in the cluster, so that there will be no ID collision when they are added by the server instances. Otherwise the cluster will not accept the disk, and it will be ejected.
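A sketch for checking that the automatically generated IDs from the example above (105 to 108) are not already present; any output line indicates an ID that is already in use:
# storpool disk list | awk '$1 >= 105 && $1 <= 108'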
Adjusting configuration
In order to attach the device to the vfio-pci or the storpool_pci driver, it just needs to be initialized and visible with storpool_initdisk --list.
The default driver is storpool_pci, but this can be changed with the SP_NVME_PCI_DRIVER configuration variable.
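The currently configured value can be checked in the same way as other configuration variables (it would then be changed in the StorPool configuration on the node, e.g. /etc/storpool.conf):
# storpool_confshow -e SP_NVME_PCI_DRIVER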
Any newly added devices will need their hugepages adjusted first.
Initializing the NVMe
The disk_init_helper init subcommand takes care of initializing all NVMe partitions defined during the discovery phase.
In order to distribute all initialized disks among a number of storpool_server instances, use /usr/lib/storpool/multi-server-helper.py to print the necessary commands.
Note that in a setup with already running instances, they can be stopped so that the devices can be redistributed equally among the instances.
For example:
[root@sp001 ~]# storpool_ctl stop --servers --expose-nvme
[snip]
[root@sp001 ~]# storpool_initdisk --list
/dev/nvme0n1p1, diskId 105, version 10009, server instance 0, cluster -, SSD, tester force test
/dev/nvme0n1p2, diskId 106, version 10009, server instance 0, cluster -, SSD, tester force test
/dev/nvme0n1p3, diskId 107, version 10009, server instance 0, cluster -, SSD, tester force test
/dev/nvme0n1p4, diskId 108, version 10009, server instance 0, cluster -, SSD, tester force test
Done.
[root@s33 ~]# /usr/lib/storpool/multi-server-helper.py -i 4
/usr/sbin/storpool_initdisk -r --no-notify -i 0 105 /dev/nvme0n1p1 # SSD
/usr/sbin/storpool_initdisk -r --no-notify -i 1 106 /dev/nvme0n1p2 # SSD
/usr/sbin/storpool_initdisk -r --no-notify -i 2 107 /dev/nvme0n1p3 # SSD
/usr/sbin/storpool_initdisk -r --no-notify -i 3 108 /dev/nvme0n1p4 # SSD
[root@s33 ~]# /usr/lib/storpool/multi-server-helper.py -i 4 | sh -x
[snip]
The drive can also be added to StorPool manually via the storpool_initdisk command directly. If there is more than one partition, the command must be issued for each partition, specifying a different diskId. Here's an example:
Note
If the drive has to be added to a server instance different from the first one, specify -i X before <diskId>.
[root@testcl-node2 ~]# storpool_initdisk --no-notify --ssd y <diskId> /dev/nvmeXnYpZ
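For instance, a hypothetical drive split into two partitions would require two invocations with two different, unused disk IDs; the device names and IDs below are placeholders:
[root@testcl-node2 ~]# storpool_initdisk --no-notify --ssd y 301 /dev/nvme2n1p1
[root@testcl-node2 ~]# storpool_initdisk --no-notify --ssd y 302 /dev/nvme2n1p2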
Adjusting hugepages
Note
If this is a replacement of a drive with the same size and parameters, you can skip this step.
Hugepages must be adjusted to allocate the necessary amount of memory to handle the NVMe drive. The following commands readjust the required number of hugepages:
[root@testcl-node2 ~]# storpool_hugepages -Nv # to print what would be done
[snip]
[root@testcl-node2 ~]# storpool_hugepages -v # to perform the reservation
For more information, see Hugepages.
Adjusting cgroups
Note
If this is a replacement of a drive with the same size and parameters, you can skip this step.
cgroups must be adjusted to allocate the necessary amount of memory. The cgroups memory limits can be set via the storpool_cg tool. First, retrieve the current configuration and compare it to the one that would get deployed:
Note
If the node is a hyper-converged one, add CONVERGED=1 to the conf command. Also, make sure you're not missing any other options needed on the node.
[root@testcl-node2 ~]# storpool_cg print
[root@testcl-node2 ~]# storpool_cg conf -NMEi
If the results don't match and there are more changes than expected, investigate before applying; if this is a new drive, there should be a difference only in the memory limits.
Attention
If the CACHE_SIZE for any storpool_server instance has changed, make a note to restart that instance.
If the results match and the changes you see are what’s expected, apply the configuration:
[root@testcl-node2 ~]# storpool_cg conf -ME
More information on setting up cgroups can be found here.
Restarting the required services (only for NVMe drives)
If in the previous step (Adjusting cgroups) the CACHE_SIZE for a storpool_server instance was changed, the service needs to be restarted:
[root@testcl-node2 ~]# service storpool_server_1 restart
For NVMe devices, the storpool_nvmed service needs to be restarted so that the device becomes visible to the StorPool servers. Restarting storpool_nvmed will restart all server instances that control NVMe devices, so restart all server instances directly with:
[root@testcl-node2 ~]# storpool_ctl restart --servers --wait
The --wait option will instruct storpool_ctl to wait for all disks visible on this node to join the cluster before exiting (requires configured access to the API service).
Manually adding a drive initialized with --no-notify in the cluster
When a disk was initialized manually with --no-notify, no server instance was notified that the disk should be added. In order for the drive to join the cluster, the relevant instance needs to be notified about it. This is done by using storpool_initdisk with the -r option; see the example below:
[root@sof11 ~]# storpool_initdisk --list
/dev/sdb1, diskId 1106, version 10009, server instance 2, cluster bbe7.b, SSD
/dev/sda1, diskId 1105, version 10009, server instance 2, cluster bbe7.b, SSD
0000:c2:00.0-p1, diskId 1101, version 10009, server instance 1, cluster bbe7.b, SSD
0000:c2:00.0-p2, diskId 1102, version 10009, server instance 1, cluster bbe7.b, SSD
0000:c1:00.0-p1, diskId 1103, version 10009, server instance 0, cluster bbe7.b, SSD
0000:c1:00.0-p2, diskId 1104, version 10009, server instance 0, cluster bbe7.b, SSD
[root@sof11 ~]# storpool_initdisk -r /dev/sdb1 # SATA SSD drive
[root@sof11 ~]# storpool_initdisk -r 0000:c2:00.0-p1 # First NVMe partition
The wait_all_disks tool will wait for all non-ejected disks visible with storpool_initdisk --list to join the cluster (requires access to the API).
Check that the disk is added to the cluster
Monitor the logs to check whether the drive was accepted by the relevant storpool_server instance; the example below is with storpool_server_7:
Jun 23 04:43:24 storpool1 kernel: [ 2054.378110] storpool_pci 0000:3b:00.0: probing
Jun 23 04:43:24 storpool1 kernel: [ 2054.378340] storpool_pci 0000:3b:00.0: device registered, physDev 0000:3b:00.0
Jun 23 04:43:32 storpool1 storpool_server_7[33638]: [info] 0000:3b:00.0-p1: adding as data disk 101 (ssd)
Jun 23 04:43:51 storpool1 storpool_server_7[33638]: [info] Disk 101 beginning tests
Jun 23 04:43:53 storpool1 storpool_server_7[33638]: [info] Disk 101 tests finished
Check with storpool disk list (this could take some time on clusters with a large number of volumes/snapshots):
# storpool disk list | fgrep -w 212
212 | 2.7 | 1.8 TiB | 2.6 GiB | 1.8 TiB | 0 % | 1918385 | 48 MiB | 40100 / 1920000 | 0 / 0
Adding the drive to a placement group and balancing data
The drive must be added to a placement group so StorPool can place data on it. Here’s an example:
[root@testcl-mgmt]# storpool placementGroup <placement-group-name> addDisk <diskId>
OK
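For example, for the HDD with ID 212 added earlier, and assuming a placement group named hdd already exists in the cluster:
[root@testcl-mgmt]# storpool placementGroup hdd addDisk 212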
Finally, start a balancing operation to put some of the existing data on the newly added drive:
[root@test-cl-mgmt ~]# cd ~/storpool/balancer
[root@test-cl-mgmt ~]# /usr/lib/storpool/balancer.sh -F
[root@test-cl-mgmt ~]# storpool balancer commit
For more information on how to balance a cluster, see Rebalancing the cluster.