Adding a drive to a running cluster

This procedure describes adding HDD, SSD and NVMe drives to a running cluster.

1. Drive stress tests

Note

The same procedure is followed by Ansible on any new deployment.

Attention

Make sure that all SSD drives and SAS/HBA controllers are running the latest recommended firmware or BIOS version.

Any drive added to a StorPool cluster needs an initial burn-in test to make sure the drive has no serious defects.

The recommended steps are:

Note

Running the tester in screen keeps the test running even if there is a connectivity issue.

screen -S disktests
mkdir ~/disktests && cd ~/disktests

Note

Add all drives to be used for StorPool to a file named drives.txt:

printf "%s\n" sd{X,Y,Z} > ./drives.txt
disk_tester

Note

Wait for it to finish, then check the results in the disks/*-completed files.

1.1. How disk_tester works

  • Collects the S.M.A.R.T. output from each drive before the stress tests, so it can be collected once again after the tests and the differences shown (a manual equivalent is sketched after this list).

  • Trims all flash-based drives.

  • Fills all drives sequentially in parallel with random data, then verifies the written data.

  • Writes random data to all drives with a random workload, then verifies the written data. The amount of data written differs between HDDs and SSDs.

  • Prints a summary of the results, including the minimum, maximum, and average for the sequential and random workload tests, as well as the differences between the S.M.A.R.T. output collected before the tests and the one collected after they completed.
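
For illustration, this is how the S.M.A.R.T. attributes could be captured and compared manually for a single drive, as a simplified equivalent of what disk_tester does; the file names are just examples:

smartctl -A /dev/sdX > smart-before-sdX.txt
# ... run the stress tests ...
smartctl -A /dev/sdX > smart-after-sdX.txt
diff smart-before-sdX.txt smart-after-sdX.txt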

When more than one drive is tested, the combined sequential read and write bandwidth of all drives can be checked in results/all-completed to verify that the controller is not a bottleneck when reading from or writing to all drives in drives.txt at the same time.

Drives of the same type/model can be tested in separate groups, so that a slower drive does not limit the results of the faster ones in the same group.
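
One way to see which drives share the same model before splitting them into per-group drives.txt files (the device names are placeholders):

lsblk -d -n -o NAME,MODEL /dev/sd{X,Y,Z} | sort -k 2
# then create one drives.txt per model and run disk_tester for each group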

1.2. How to read the results

Compare the results from the sequential and random writes and verify that the speeds of similar drives are similar. For example, drives of the same model showing very different results (lower performance, or spikes in the minimum IOPS or bandwidth) might be a sign of pending issues with either the drive or the controller.

Note

In case of huge differences in the before/after values for the Seek Error Rate, Raw Read Error Rate, and Hardware ECC Recovered SMART attributes, check here. High error rates after the first fill might be normal (notably on Seagate disks).

2. Partition and init drive

2.1. HDD on 3108 MegaRaid controller

2.1.1. Locating the disk

We’ll assume that the disk is already physically installed and visible to the controller. First, locate the disk on the controller, either by S/N or by controller/enclosure/slot position. In this case we know from the customer that the disk is on controller 0, enclosure 252, slot 1.

Get SP_OURID of this node with:

# storpool_confshow -e SP_OURID

Check for available disk IDs:

# storpool disk list

Select a new, non-existent disk ID according to the numbering convention, e.g. 212.

If we don’t know the disk’s location but we know its device name, we can use the storcli-helper.pl script:

# /usr/lib/storpool/storcli-helper.pl /dev/sdX
ID_SERIAL='ZC209BG90000C7168ZVB'
ID_MODEL='ST2000NM0135'
ID_ENCLOSURE='252'
ID_SLOT='1'
ID_CTRL='0'

Device name can be found with:

# lshw -c disk -short

Check the RAID controller type with:

# lshw -c storage -short

/0/100/1/0           scsi0      storage        MegaRAID SAS-3 3108 [Invader]

If the serial number is known, we can verify the correct location with:

# storcli64 /c0/e252/s1 show all
...
Drive /c0/e252/s1 Device attributes :
===================================
SN = ZC209BG90000C7168ZVB
Manufacturer Id = SEAGATE
Model Number = ST2000NM0135
NAND Vendor = NA
Raw size = 1.819 TB [0xe8e088b0 Sectors]
Coerced size = 1.818 TB [0xe8d00000 Sectors]
Non Coerced size = 1.818 TB [0xe8d088b0 Sectors]
Logical Sector Size = 512B
...

From the data above we can calculate the disk’s size in MiB: Coerced size x Logical Sector Size / 1024^2:

# echo 'ibase=16;E8D00000*200/400^2' | bc
1907200

We’ll need this later.

Note

All input is hex, output is dec in MiB (2^20). Hex input is uppercase.

2.1.2. Prepare HDD

All HDDs shall be connected to RAID controllers with battery backup or cachevault.

For 3108 controllers, a small WBC journal device is to be created for each hard drive. The physical HDD is split into two virtual disks (VDs) on the controller: the first VD is the main disk and the second VD is the journal. WBC is disabled on the first VD and enabled on the second VD (the journal).

2.1.2.1. Switch drive from JBOD to RAID mode

Check the disk state. If it is JBOD, it has to be switched to Online:

# storcli64 /c0 show
...
-----------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model            Sp
-----------------------------------------------------------------------
252:0     6 Onln   0 1.818 TB SAS  HDD N   N  512B ST2000NM0135     U
252:1     8 JBOD   - 1.818 TB SAS  HDD N   N  512B ST2000NM0135     U
252:2     7 Onln   1 1.818 TB SAS  HDD N   N  512B ST2000NM0135     U
...

Change JBOD to Online with:

# storcli64 /c0/e252/s1 set good force

If this fails with "Failed controller has data in cache for offline or missing virtual disks", there is unflushed cache for the replaced disk that needs to be cleared. This can be done with:

# storcli64 /c0 show preservedCache    # shows the missing VD
Controller = 0
Status = Success
Description = None

-----------
VD State
-----------
9  Missing
-----------

# storcli64 /c0/v9 delete preservedcache    # delete the cache for the missing VD

Then repeat the set good force command above.

2.1.2.2. Create a Virtual Disk for Journal

Define two virtual disks for the physical disk: the first is the main storage, and the second is a 100 MiB disk used for the journal with WBC enabled. The size of the first VD is the total disk size we calculated earlier minus 100 MiB. The second VD will use the remaining space (100 MiB).

# storcli64 /c0 add vd type=r0 Size=1907100 drives=252:1 pdcache=off direct nora wt
# storcli64 /c0 add vd type=r0 drives=252:1 pdcache=off direct nora wb
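
As a quick sanity check of the Size= value used for the first VD (plain shell arithmetic with the total size in MiB calculated earlier):

# echo $((1907200 - 100))
1907100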

Check status with:

# storcli64 /c0 show

Look at the EID:Slt -> DG mapping in the PD LIST section and at DG/VD in the VD LIST section.

Check /dev/sdX with lsblk. There should now be two new drives; in our case they are sdb and sdo.
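
For example, listing only whole block devices (output omitted; the device names will differ per system):

# lsblk -d -o NAME,SIZE,TYPE,MODEL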

2.1.3. Create Partitions

Create partitions on both virtual disks. Note the different parameters for the journal disk.

# parted -s --align optimal /dev/sdX mklabel gpt -- mkpart primary 2MiB 100%    # where X is the drive letter
# parted -s --align optimal /dev/sdY mklabel gpt -- mkpart journal 2MiB 98MiB   # where Y is the drive letter

2.1.4. Initialize the HDD

Initialize the disk with journal:

# storpool_initdisk --no-notify -i Z -j /dev/sdX 212 /dev/sdYM   # where X and Y are the drive letters, Z is the storpool_server instance
# storpool_initdisk --list
...
/dev/sdb1, diskId 212, version 10007, server instance 0, cluster d.b, WBC, jmv 160E0BDE853
...
/dev/sdo1, diskId 212, version 10007, server instance 0, cluster -, journal mv 160E0BDE853

2.2. SATA SSD drive

SSD disks shall be used in JBOD mode, with no WBC, when connected to a RAID controller. An SSD can also be connected to a SATA port on the motherboard, as no special controller features are required for SSD disks.

2.2.1. Create Partitions

Create partitions on the disk:

# parted -s --align optimal /dev/sdX mklabel gpt -- mkpart primary 2MiB 100%    # where X is the drive letter

2.2.2. Initialize the SSD

Attention

Check that the diskId is not already in use before you initialize the disk:

# storpool_initdisk --list
# storpool_initdisk --no-notify -i Z -s 201 /dev/sdXN   # where X is the drive letter and N is the partition number, Z is the storpool_server instance
# storpool_initdisk --list
...
/dev/sdj1, diskId 201, version 10007, server instance 0, cluster d.b, SSD

2.3. NVMe drive

StorPool uses either the vfio driver or its own driver storpool_pci to manage NVMe PCI devices. Before it can be partitioned and initialized, the NVMe in question must be present in the list lsblk returns. If it’s absent, the drive might be attached to storpool_pci. To check if this is the case, first determine the PCI ID of the NVMe with lspci, like so:

[root@testcl-node2 ~]# lspci | grep "olatile"
3b:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [Optane]
3c:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [Optane]
5e:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
5f:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
60:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
61:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983

For the purpose of this document, we select NVMe with PCI ID 3b:00.0. Next, we need to verify if the NVMe is already initialized as a StorPool disk. This can be done via storpool_initdisk --list --json. Here’s an example:

[root@testcl-node2 ~]# storpool_initdisk --list --json | jq -r '.disks[] | select(.typeId == 4 and .nvme)'

If it is not present in the output of storpool_initdisk, verify that the NVMe of interest is bound to either storpool_pci or vfio, and if so, bind it to the kernel’s nvme driver. If the drive is present in storpool_initdisk --list, storpool_nvmed and the storpool_server instances serving NVMes must first be stopped. The following example uses storpool_pci, but the same applies for vfio:

[root@cl8-svp1 ~]# lspci -k | grep -A 3 3b:00.0
3b:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [Optane]
     Subsystem: Intel Corporation Device 3907
     Kernel driver in use: storpool_pci
     Kernel modules: nvme, storpool_pci

Next, we need to unbind the NVMe from storpool_pci and bind it to the nvme driver. This requires stopping storpool_nvmed and the storpool_server instances that serve NVMe drives, like so:

[root@testcl2-node2 ~]# storpool_ctl stop --servers --expose

After the NVMe has been rebound, it should appear in the output of lsblk.
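
For reference, rebinding a PCI device between drivers goes through the standard Linux sysfs interface; a minimal manual sketch, assuming the 0000:3b:00.0 device from the example above and that the kernel nvme module is loaded:

# echo 0000:3b:00.0 > /sys/bus/pci/drivers/storpool_pci/unbind
# echo 0000:3b:00.0 > /sys/bus/pci/drivers/nvme/bind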

2.3.1. Partitioning the NVMe

After having successfully tested the NVMe, one must partition it. Here’s an example using parted:

[root@testcl-node2 ~]# parted -s -a optimal /dev/nvmeXnY mklabel gpt -- mkpart primary 2MiB 100%

StorPool doesn’t handle NVMe drives larger than 4TB, so if the NVMe in question is larger, one must split it into smaller partitions, each no larger than 4TB. Here’s an example using parted:

[root@testcl-node2 ~]# parted -s -a optimal /dev/nvmeXnY mklabel gpt -- mkpart primary 2MiB 50% mkpart primary 50% 100%

2.3.2. Adjusting configuration

If not done already, StorPool’s configuration must be updated so that the NVMe falls under the control of storpool_pci or vfio when storpool_nvmed starts. To do this, add the PCI ID of the NVMe to SP_NVME_PCI_ID under the node’s section in storpool.conf. First, find the PCI ID of the drive; this can be done by looking at the /dev/disk/by-path directory.

[root@testcl-node2 ~]# ls -lha /dev/disk/by-path
total 0
drwxr-xr-x 2 root root  80 May 19 13:25 .
drwxr-xr-x 5 root root 100 May 19 13:25 ..
lrwxrwxrwx 1 root root  13 May 17 11:09 pci-0000:ca:00.0-nvme-1 -> ../../nvmeX

Then add 0000:3b:00.0 to the SP_NVME_PCI_ID variable under the [testcl-node2] section in /etc/storpool.conf like so:

[root@testcl-node2 ~]# vi /etc/storpool.conf
...
[testcl-node2]
SP_NVME_PCI_ID=0000:3b:00.0 AAAA:BB:CC.D
...
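
To double-check that the new value is picked up, the same storpool_confshow tool shown earlier in this guide can be used with the variable we just set:

[root@testcl-node2 ~]# storpool_confshow -e SP_NVME_PCI_ID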

2.3.3. Initializing the NVMe

The drive has to be added to StorPool via the storpool_initdisk command. If there is more than one partition, the command must be issued for each partition, specifying a different diskId. Here’s an example:

Note

If the drive has to be added to a server instance different from the first one, specify -i X before <diskId>.

[root@testcl-node2 ~]# storpool_initdisk --no-notify --ssd y <diskId> /dev/nvmeXnYpZ
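
For a drive split into two partitions, this means one invocation per partition with distinct diskIds; a sketch using hypothetical diskIds 1101 and 1102 on the first server instance:

[root@testcl-node2 ~]# storpool_initdisk --no-notify --ssd y 1101 /dev/nvmeXnYp1
[root@testcl-node2 ~]# storpool_initdisk --no-notify --ssd y 1102 /dev/nvmeXnYp2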

2.3.4. Adjusting hugepages

Note

If this is a replacement of a drive with the same size and parameters, you can skip this step.

Hugepages must be adjusted to allocate the necessary amount of memory to handle the NVMe drive. Issuing the following command readjusts the required number of hugepages:

[root@testcl-node2 ~]# storpool_hugepages
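
The resulting allocation can be inspected through the standard kernel interface (a generic Linux check, not StorPool-specific):

[root@testcl-node2 ~]# grep -i hugepages /proc/meminfo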

3. Adjusting cgroups

Note

If this is a replacement of a drive with the same size and parameters, you can skip this step.

cgroups must be adjusted to allocate the necessary amount of memory. The cgroups memory limits can be set via the storpool_cg tool. First, retrieve the current configuration and compare it to the one that would be deployed:

Note

If the node is a hyper-converged one, add CONVERGED=1 to the conf command. Also, make sure you’re not missing any other options needed on the node.

[root@testcl-node2 ~]# storpool_cg print
[root@testcl-node2 ~]# storpool_cg conf -N

If the results don’t match and there are more changes than expected, investigate before applying anything; if this is a new drive, there should be a difference only in the memory limits.

Attention

If the CACHE_SIZEs for any storpool_server instance have changed, make a note to restart that instance.

If the results match and the changes you see are what’s expected, apply the configuration:

[root@testcl-node2 ~]# storpool_cg conf -ME

More information on setting up cgroups can be found here.

4. Restarting the required services (only for NVMe drives)

If the CACHE_SIZE for a storpool_server instance was changed in the previous step (Adjusting cgroups), that service needs to be restarted:

[root@testcl-node2 ~]# service storpool_server_1 restart

For NVMe devices, the storpool_nvmed service needs to be restarted so the device is visible to the StorPool servers:

[root@testcl-node2 ~]# service storpool_nvmed restart

Before moving forward, check that all services are running. The command below reports any service that should be running but isn’t:

[root@testcl-node2 ~]# storpool_ctl status

5. Adding the drive in the cluster

For the drive to join the cluster, the relevant instance needs to be notified about the drive. This is done by using storpool_initdisk with the -r option:

[root@sof11 ~]# storpool_initdisk --list
/dev/sdb1, diskId 1106, version 10009, server instance 2, cluster bbe7.b, SSD
/dev/sda1, diskId 1105, version 10009, server instance 2, cluster bbe7.b, SSD
0000:c2:00.0-p1, diskId 1101, version 10009, server instance 1, cluster bbe7.b, SSD
0000:c2:00.0-p2, diskId 1102, version 10009, server instance 1, cluster bbe7.b, SSD
0000:c1:00.0-p1, diskId 1103, version 10009, server instance 0, cluster bbe7.b, SSD
0000:c1:00.0-p2, diskId 1104, version 10009, server instance 0, cluster bbe7.b, SSD
[root@sof11 ~]# storpool_initdisk -r /dev/sdb1 # SATA SSD drive
[root@sof11 ~]# storpool_initdisk -r 0000:c2:00.0-p1 # First NVMe partition

6. Check disk is added to the cluster

Monitor the logs to check whether the drive was accepted by the storpool_server instance:

Jun 23 04:43:24 storpool1 kernel: [ 2054.378110] storpool_pci 0000:3b:00.0: probing
Jun 23 04:43:24 storpool1 kernel: [ 2054.378340] storpool_pci 0000:3b:00.0: device registered, physDev 0000:3b:00.0
Jun 23 04:43:32 storpool1 storpool_server_7[33638]: [info] 0000:3b:00.0-p1: adding as data disk 101 (ssd)
Jun 23 04:43:51 storpool1 storpool_server_7[33638]: [info] Disk 101 beginning tests
Jun 23 04:43:53 storpool1 storpool_server_7[33638]: [info] Disk 101 tests finished
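
One way to follow these messages live, assuming the node logs through journald (adjust to your logging setup):

# journalctl -f | grep -E 'storpool_(pci|server)'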

Check with storpool disk list (this could take some time on clusters with a large number of volumes/snapshots):

# storpool disk list
...
    212  |     2.0  |    1.8 TB  |    2.6 GB  |    1.8 TB  |    0 %  |       1918385  |         48 MB  |    40100 / 1920000 |   0 / 0
...

7. Adding drive to a placement group and balancing data

The drive must be added to a placement group so StorPool can place data on it. Here’s an example:

[root@testcl-mgmt]# storpool placementGroup <placement-group-name> addDisk <diskId>
OK

Finally, start a balancing operation to put some of the existing data on the newly added drive:

[root@test-cl-mgmt ~]# cd ~/storpool/balancer
[root@test-cl-mgmt ~]# /usr/lib/storpool/balancer.sh -F
[root@test-cl-mgmt ~]# storpool balancer commit

More information on how to balance a cluster can be found in the User Guide.