Adding a drive to a running cluster
This procedure describes adding HDD, SSD and NVMe drives to a running cluster.
1. Drives stress tests
Note
The same procedure is followed by ansible on any new deployment.
Attention
Make sure that all SSD drives and SAS/HBA controllers run the latest recommended firmware or BIOS version. Some known good versions are listed in our system requirements.
Any drive added to a StorPool cluster needs an initial burn-in test to make sure it has no serious defects.
The recommended steps are:
Note
Running the tester tool in screen leaves the test running if there’s a connectivity issue.
screen -S disktests
mkdir ~/disktests && cd ~/disktests
Note
Add all drives to be used for StorPool in a file drives.txt
printf "%s\n" sd{X,Y,Z} > ./drives.txt
disk_tester
Note
Wait for it to finish, then check the results in the disks/*-completed files.
1.1. How disk_tester works
Collects the S.M.A.R.T. output from each drive before the stress tests to be able to later collect it once again and show the differences after the tests.
Trims all flash-based drives.
Fills all drives sequentially, in parallel, with random data and then verifies the written data.
Writes random data to random locations on all drives and then verifies the written data. The amount of data written is different for HDDs and SSDs.
Prints a summary of the results including the minimum, maximum and average for the sequential and random workload tests, as well as the differences in the S.M.A.R.T. output collected before and the one collected after the tests completed.
When more than one drive is tested, the sum of the sequential read and write bandwidth across all drives can be inspected in results/all-completed to check whether the controller becomes a bottleneck when reading from or writing to all drives in drives.txt.
The same type/model groups of drives could be tested separately to not limit the test when a slower drive is in the same group with a faster one.
1.2. How to read the results
Compare the results from the sequential and random writes and verify that similar drives reach similar speeds. For example, drives of the same model with very different results (lower performance, or spikes in the minimum IOPS or bandwidth) might be a sign of pending issues with either the drive or the controller.
Note
In case of huge before/after differences in the Seek Error Rate, Raw Read Error Rate, and Hardware ECC Recovered SMART attributes, check here. High error rates after the first fill might be OK (notable on Seagate disks).
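The comparison between similar drives can be sketched as a small filter. This is only an illustration: the "name value" input format, the function name flag_outliers, and the percentage threshold are assumptions, and this is not the format disk_tester writes its results in.

```shell
# Hypothetical helper (not part of StorPool): read "name value" pairs on
# stdin (one drive per line, e.g. average bandwidth) and print the names
# whose value differs from the group mean by more than $1 percent.
flag_outliers() {
    awk -v pct="$1" '
        { name[NR] = $1; val[NR] = $2; sum += $2 }
        END {
            if (NR == 0) exit
            mean = sum / NR
            for (i = 1; i <= NR; i++) {
                d = val[i] - mean
                if (d < 0) d = -d
                if (d * 100 > pct * mean) print name[i]
            }
        }
    '
}

# Made-up numbers for three drives of the same model; prints: sdc
printf 'sda 200\nsdb 205\nsdc 120\n' | flag_outliers 20
```

Only drives of the same type and model should be compared this way, since a slower model in the group would shift the mean.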
2. Partition and init drive
2.1. HDD on 3108 MegaRaid controller
2.1.1. Locating the disk
We’ll assume that the disk is already physically installed and visible by the controller. First locate the disk on the controller, either by S/N or by controller/enclosure/slot position. In this case we know from the customer that the disk is on controller 0, enclosure 252 slot 1.
Get SP_OURID of this node with:
# storpool_confshow -e SP_OURID
Check for available disk IDs:
# storpool disk list
Select a new, unused disk ID according to the numbering convention, e.g. 212.
If we don’t know the disk’s location, but we know its device name, we can use the storcli-helper.pl script:
# /usr/lib/storpool/storcli-helper.pl /dev/sdX
ID_SERIAL='ZC209BG90000C7168ZVB'
ID_MODEL='ST2000NM0135'
ID_ENCLOSURE='252'
ID_SLOT='1'
ID_CTRL='0'
Device name can be found with:
# lshw -c disk -short
Check RAID controller type with:
# lshw -c storage -short
/0/100/1/0 scsi0 storage MegaRAID SAS-3 3108 [Invader]
Check SN if known. We can verify correct location with:
# storcli64 /c0/e252/s1 show all
...
Drive /c0/e252/s1 Device attributes :
===================================
SN = ZC209BG90000C7168ZVB
Manufacturer Id = SEAGATE
Model Number = ST2000NM0135
NAND Vendor = NA
Raw size = 1.819 TB [0xe8e088b0 Sectors]
Coerced size = 1.818 TB [0xe8d00000 Sectors]
Non Coerced size = 1.818 TB [0xe8d088b0 Sectors]
Logical Sector Size = 512B
...
From the data above we can calculate the disk’s size in MiB: Coerced size x Logical Sector Size / 1024^2:
# echo 'ibase=16;E8D00000*200/400^2' | bc
1907200
We’ll need this later.
Note
All input is hex and must be uppercase; the output is decimal, in MiB (2^20 bytes).
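As a cross-check, the same calculation can be sketched with plain shell arithmetic, which accepts hex constants with a 0x prefix. The function name sectors_to_mib is an assumption for illustration and is not a StorPool tool.

```shell
# Hypothetical helper (not part of StorPool): convert a sector count to MiB.
# $1 - sector count, decimal or hex with a 0x prefix (the "Coerced size")
# $2 - logical sector size in bytes
sectors_to_mib() {
    echo $(( $1 * $2 / 1024 / 1024 ))
}

# Values taken from the storcli64 output above; prints 1907200
sectors_to_mib 0xe8d00000 512
```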
2.1.2. Prepare HDD
All HDDs shall be connected to RAID controllers with battery backup or cachevault.
For 3108 controllers, a small WBC journal device is to be created for each hard drive. The physical HDD is split into two Virtual Disks (VD) on the controller. The first VD is the main disk and the second VD is the journal. WBC is disabled on the first VD and enabled on the second VD (journal).
2.1.2.1. Switch drive from JBOD to RAID mode
Check disk state. If it is JBOD it has to be switched to Online:
# storcli64 /c0 show
...
-----------------------------------------------------------------------
EID:Slt DID State DG Size Intf Med SED PI SeSz Model Sp
-----------------------------------------------------------------------
252:0 6 Onln 0 1.818 TB SAS HDD N N 512B ST2000NM0135 U
252:1 8 JBOD - 1.818 TB SAS HDD N N 512B ST2000NM0135 U
252:2 7 Onln 1 1.818 TB SAS HDD N N 512B ST2000NM0135 U
...
Change JBOD to Online with:
# storcli64 /c0/e252/s1 set good force
If this fails with "Failed controller has data in cache for offline or missing virtual disks", there is unflushed cache for the replaced disk that needs to be cleared. It can be done with:
# storcli64 /c0 show preservedCache # shows the missing VD
Controller = 0
Status = Success
Description = None
-----------
VD State
-----------
9 Missing
-----------
# storcli64 /c0/v9 delete preservedcache # delete the cache for missing VD
Then repeat the set good force command.
2.1.2.2. Create a Virtual Disk for Journal
Define 2 virtual disks for the physical disk: the first is the main storage, and the second is a 100 MiB disk used for the journal, with WBC enabled. The size of the first VD is the total disk size calculated earlier minus 100 MiB. The second VD uses the remaining space (100 MiB).
# storcli64 /c0 add vd type=r0 Size=1907100 drives=252:1 pdcache=off direct nora wt
# storcli64 /c0 add vd type=r0 drives=252:1 pdcache=off direct nora wb
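The size arithmetic for the two virtual disks can be sketched as follows. The helper name print_vd_cmds is an assumption for illustration; it only prints the two commands shown above for review and does not run them.

```shell
# Hypothetical helper (not part of StorPool): print the two "add vd"
# commands for one physical disk, leaving 100 MiB for the journal VD.
# $1 - controller number, $2 - EID:Slt of the drive,
# $3 - total (coerced) disk size in MiB
print_vd_cmds() {
    main_mib=$(( $3 - 100 ))   # the main VD gets everything except 100 MiB
    echo "storcli64 /c$1 add vd type=r0 Size=$main_mib drives=$2 pdcache=off direct nora wt"
    echo "storcli64 /c$1 add vd type=r0 drives=$2 pdcache=off direct nora wb"
}

# Controller 0, enclosure 252, slot 1, and the 1907200 MiB calculated earlier
print_vd_cmds 0 252:1 1907200
```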
Check status with:
# storcli64 /c0 show
In the output, look at the EID:Slt -> DG mapping in PD LIST and at DG/VD in VD LIST.
Check /dev/sdX with lsblk. There should now be 2 drives. In our case they are sdb and sdo.
2.1.3. Create Partitions
Create partitions on both virtual disks. Note the different parameters for journal disk.
# parted -s --align optimal /dev/sdX mklabel gpt -- mkpart primary 2MiB 100% # where X is the drive letter
# parted -s --align optimal /dev/sdY mklabel gpt -- mkpart journal 2MiB 98MiB # where Y is the drive letter
2.1.4. Initialize the HDD
Initialize the disk with journal:
# storpool_initdisk --no-notify -i Z -j /dev/sdX 212 /dev/sdYM # where X and Y are the drive letters, M is the partition number, Z is the storpool_server instance
# storpool_initdisk --list
...
/dev/sdb1, diskId 212, version 10007, server instance 0, cluster d.b, WBC, jmv 160E0BDE853
...
/dev/sdo1, diskId 212, version 10007, server instance 0, cluster -, journal mv 160E0BDE853
Note
All drives not attached to a RAID controller could be initialized with disk_init_helper. For details about this tool, see 7. Storage devices
Any SSD/HDD/NVMe drives should be visible with lsblk, and should not have any partitions. Ensure they are not a part of an LVM or a device mapper device setup.
Make sure that multipathd.service and multipathd.socket are disabled/stopped during the discovery/init phases of disk_init_helper.
The tool requires a valid symlink from /dev/disk/by-id to each drive to exist.
2.2. SATA SSD drive
SSD disks shall be used in JBOD mode, with no WBC, if connected to a RAID controller. An SSD can also be connected to a SATA port on the motherboard, as SSD disks require no special controller features.
Note
This note is kept just for backwards compatibility for clusters where disk_init_helper is not yet installed/available.
We highly recommend using disk_init_helper, because it poses fewer risks than executing commands manually.
To manually create a partition on some of the disks use:
# parted -s --align optimal /dev/sdX mklabel gpt -- mkpart primary 2MiB 100% # where X is the drive letter
Same example, but for an NVMe device larger than 4TiB, split into two partitions:
# parted -s -a optimal /dev/nvmeXnY mklabel gpt -- mkpart primary 2MiB 50% mkpart primary 50% 100%
To manually initialize a disk as a StorPool device use:
Attention: Make sure the disk ID is not already in use before you initialize the disk:
# storpool_initdisk --list
# storpool_initdisk --no-notify -i Z -s 201 /dev/sdXN # where X is the drive letter and N is the partition number, Z is the storpool_server instance
# storpool_initdisk --list
...
/dev/sdj1, diskId 201, version 10007, server instance 0, cluster d.b, SSD
2.3. NVMe drive
StorPool uses either the vfio-pci driver or its own driver storpool_pci to manage PCI devices.
The vfio-pci driver is preferred over storpool_pci in new installations, but it has some constraints. For example, when multiple NVMe drives are in the same IOMMU group, the storpool_pci driver is the only alternative.
The following table shows the supported kernel cmdline requirements for each:
# | storpool_pci | vfio-pci |
---|---|---|
iommu=pt | yes | yes |
iommu=on | no | yes |
iommu=off | yes | no |
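The table can be expressed as a small lookup against the kernel command line. This is a sketch under assumptions: the function name iommu_supported is hypothetical, and it only recognizes the three iommu= values listed in the table (other options, such as vendor-specific IOMMU parameters, are ignored).

```shell
# Hypothetical helper (not part of StorPool): report whether a driver is
# supported with the iommu= value found in a kernel command line, following
# the table above. Prints yes, no, or unknown.
# $1 - driver name: storpool_pci or vfio-pci
# $2 - kernel command line (defaults to the contents of /proc/cmdline)
iommu_supported() {
    cmdline=${2-$(cat /proc/cmdline)}
    mode=
    for word in $cmdline; do
        case $word in
            iommu=pt|iommu=on|iommu=off) mode=${word#iommu=} ;;
        esac
    done
    case "$1:$mode" in
        storpool_pci:pt|storpool_pci:off|vfio-pci:pt|vfio-pci:on) echo yes ;;
        storpool_pci:on|vfio-pci:off) echo no ;;
        *) echo unknown ;;
    esac
}

# Example with an explicit command line; prints yes
iommu_supported vfio-pci "ro quiet iommu=pt"
```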
The new NVMe drive must be present in the list lsblk returns.
If it’s absent, the drive might be attached to the storpool_pci or the vfio_pci driver. To check if this is the case, first determine the PCI ID of the NVMe with lspci:
[root@testcl-node2 ~]# lspci | grep "olatile"
3b:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [Optane]
3c:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [Optane]
5e:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
5f:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
60:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
61:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
For this example we select the NVMe with PCI ID 3b:00.0.
Next, we need to verify whether the NVMe is already initialized as a StorPool disk. This can be done via storpool_initdisk --list --json. Here’s an example:
[root@testcl-node2 ~]# storpool_initdisk --list --json | jq -r '.disks[] | select(.typeId == 4 and .nvme)'
If the drive is not present in the output of storpool_initdisk, verify whether it is bound to either storpool_pci or vfio_pci and, if so, bind it to the kernel’s nvme driver.
If the drive is present in storpool_initdisk --list, the storpool_nvmed service and all the storpool_server instances serving NVMes must be stopped first. The following example uses storpool_pci, but the same applies for vfio_pci:
[root@cl8-svp1 ~]# lspci -k | grep -A 3 3b:00.0
3b:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [Optane]
Subsystem: Intel Corporation Device 3907
Kernel driver in use: storpool_pci
Kernel modules: nvme, storpool_pci
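The bound driver can also be read from sysfs, where each PCI device directory contains a driver symlink while a driver is bound. A minimal sketch follows; the function name pci_driver and the optional sysfs root argument (used only to make the function testable) are assumptions.

```shell
# Hypothetical helper (not part of StorPool): print the name of the kernel
# driver bound to a PCI device, or "none" if no driver is bound.
# $1 - full PCI address including the domain, e.g. 0000:3b:00.0
# $2 - sysfs root (defaults to /sys)
pci_driver() {
    link="${2:-/sys}/bus/pci/devices/$1/driver"
    if [ -L "$link" ]; then
        # The symlink target ends in the driver name,
        # e.g. .../bus/pci/drivers/nvme
        basename "$(readlink "$link")"
    else
        echo none
    fi
}

pci_driver 0000:3b:00.0
```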
Next, we need to unbind the NVMe from storpool_pci or vfio_pci and bind it to the nvme driver. This requires the storpool_nvmed service and all the storpool_server instances that serve NVMe drives to be stopped. This is done with:
[root@testcl2-node2 ~]# storpool_ctl stop --servers --expose-nvme
After the NVMe drive has been rebound, it should appear in the output of lsblk.
2.3.1. Partitioning the NVMe
The disk_init_helper will take care of partitioning all NVMe devices, and all drives larger than 4TiB will be automatically split into as many partitions as required, so that no chunk is larger than the split size.
The split size could be adjusted to either a larger or a smaller value, though there is a maximum of ~7TiB chunk size limited by the amount of memory that could be allocated for the metadata cache for each disk.
Example discovery on a node with a 6.98 TiB NVMe drive, split into four partitions of about 1.75 TiB each by using a --nvme-split-size of 2 TiB:
First, check how the disks are going to be partitioned with the following command. MZQLB7T6HMLA is the model of the disk, as listed under /dev/disk/by-id/nvme-…
The NVMe type may be one of:
n - NVMe drive
nj - NVMe drive with HDD journals
njo - NVMe drive with journals only (no StorPool data disk)
[root@sp001 by-id]# /usr/sbin/disk_init_helper discover --start 105 '*MZQLB7T6HMLA*:n' --nvme-split-size $((2*1024**4))
Disk /dev/disk/by-id/nvme-SAMSUNG_MZQLB7T6HMLA-00007_S4BGNC0T500213 matches pattern *MZQLB7T6HMLA*
Disk /dev/disk/by-id/nvme-SAMSUNG_MZQLB7T6HMLA-00007_S4BGNC0T500213 overriding nj to n
nvme-SAMSUNG_MZQLB7T6HMLA-00007_S4BGNC0T500213 (type: NVMe):
data partitions
/dev/disk/by-id/nvme-SAMSUNG_MZQLB7T6HMLA-00007_S4BGNC0T500213-part1 (105): 1788.49 GiB (mv: None)
/dev/disk/by-id/nvme-SAMSUNG_MZQLB7T6HMLA-00007_S4BGNC0T500213-part2 (106): 1788.49 GiB (mv: None)
/dev/disk/by-id/nvme-SAMSUNG_MZQLB7T6HMLA-00007_S4BGNC0T500213-part3 (107): 1788.49 GiB (mv: None)
/dev/disk/by-id/nvme-SAMSUNG_MZQLB7T6HMLA-00007_S4BGNC0T500213-part4 (108): 1788.49 GiB (mv: None)
Discovery done, use '-d'/'--dump-file' to provide a target for the discovery output.
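The partition count in the discovery output above is consistent with simple ceiling division of the drive size by the split size. The sketch below shows only that arithmetic; the function name partition_count is an assumption, and whether disk_init_helper computes the count exactly this way is not stated here.

```shell
# Hypothetical helper (not part of StorPool): smallest number of equal
# partitions such that no partition exceeds the split size.
# $1 - drive size, $2 - split size (both in the same unit)
partition_count() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# About 7154 GiB (4 x 1788.49 GiB) with a 2 TiB (2048 GiB) split size;
# prints 4
partition_count 7154 2048
```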
If you’re satisfied with the partitioning, dump it to a file:
[root@sp001 by-id]# /usr/sbin/disk_init_helper discover --start 105 '*MZQLB7T6HMLA*:n' --nvme-split-size $((2*1024**4)) -d new_disk
Disk /dev/disk/by-id/nvme-SAMSUNG_MZQLB7T6HMLA-00007_S4BGNC0T500213 matches pattern *MZQLB7T6HMLA*
Disk /dev/disk/by-id/nvme-SAMSUNG_MZQLB7T6HMLA-00007_S4BGNC0T500213 overriding nj to n
Success generating new_disk, proceed with 'init'
To finish, write the generated partitions to the disk using init <file> --exec:
/usr/sbin/disk_init_helper init new_disk --exec
Before proceeding with init, ensure that the automatically generated disk IDs are not already in the cluster, so that there will be no ID collision when they are added by the server instances. Otherwise the cluster will not join the disk, and it will be ejected.
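A collision check can be sketched as a filter over the output of storpool disk list. The sketch assumes that the disk ID is the first "|"-separated column of each row, as in the sample row shown later in this document; the function name find_id_collisions is hypothetical.

```shell
# Hypothetical helper (not part of StorPool): read `storpool disk list`
# style rows on stdin and print every ID from the argument list that
# already appears in the first column.
find_id_collisions() {
    awk -F'|' -v ids="$*" '
        BEGIN {
            n = split(ids, want, " ")
            for (i = 1; i <= n; i++) wanted[want[i]] = 1
        }
        {
            id = $1
            gsub(/[[:space:]]/, "", id)   # strip padding around the ID
            if (id in wanted) print id
        }
    '
}

# Usage on a node with API access (prints nothing if no ID is taken):
#   storpool disk list | find_id_collisions 105 106 107 108
```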
2.3.2. Adjusting configuration
To be attached to the vfio-pci or the storpool_pci driver, the device only needs to be initialized and visible in storpool_initdisk --list.
The default driver is storpool_pci, but this could be adjusted with the SP_NVME_PCI_DRIVER configuration variable.
Any newly added devices will need their hugepages adjusted first.
2.3.3. Initializing the NVMe
The disk_init_helper init subcommand takes care of initializing all NVMe partitions defined during the discovery phase.
In order to distribute all initialized disks to a number of storpool_server instances, use /usr/lib/storpool/multi-server-helper.py to print the necessary commands.
Note that for a setup in which there are working instances, they could be stopped so that devices could be re-distributed equally among the instances.
For example:
[root@sp001 ~]# storpool_ctl stop --servers --expose-nvme
[snip]
[root@sp001 ~]# storpool_initdisk --list
/dev/nvme0n1p1, diskId 105, version 10009, server instance 0, cluster -, SSD, tester force test
/dev/nvme0n1p2, diskId 106, version 10009, server instance 0, cluster -, SSD, tester force test
/dev/nvme0n1p3, diskId 107, version 10009, server instance 0, cluster -, SSD, tester force test
/dev/nvme0n1p4, diskId 108, version 10009, server instance 0, cluster -, SSD, tester force test
Done.
[root@s33 ~]# /usr/lib/storpool/multi-server-helper.py -i 4
/usr/sbin/storpool_initdisk -r --no-notify -i 0 105 /dev/nvme0n1p1 # SSD
/usr/sbin/storpool_initdisk -r --no-notify -i 1 106 /dev/nvme0n1p2 # SSD
/usr/sbin/storpool_initdisk -r --no-notify -i 2 107 /dev/nvme0n1p3 # SSD
/usr/sbin/storpool_initdisk -r --no-notify -i 3 108 /dev/nvme0n1p4 # SSD
[root@s33 ~]# /usr/lib/storpool/multi-server-helper.py -i 4 | sh -x
[snip]
The drive could also be added manually to StorPool via the storpool_initdisk command directly. If there is more than one partition, the command must be issued for each partition, specifying a different diskId. Here’s an example:
Note
If the drive has to be added to a server instance different than the first one, specify -i X before <diskId>.
[root@testcl-node2 ~]# storpool_initdisk --no-notify --ssd y <diskId> /dev/nvmeXnYpZ
2.3.4. Adjusting hugepages
Note
If this is a replacement of a drive with the same size and parameters, you can skip this step.
Hugepages must be adjusted to allocate the necessary amount of memory to handle the NVMe drive. Issuing the following commands would readjust the required number of hugepages:
[root@testcl-node2 ~]# storpool_hugepages -Nv # to print what would be done
[snip]
[root@testcl-node2 ~]# storpool_hugepages -v # to perform the reservation
For more information, see Hugepages.
3. Adjusting cgroups
Note
If this is a replacement of a drive with the same size and parameters, you can skip this step.
Cgroups must be adjusted to allocate the necessary amount of memory. Cgroups memory limits can be set via the storpool_cg tool. First, retrieve the current configuration and compare it to the one that would get deployed:
Note
If the node is a hyper-converged one, add CONVERGED=1 to the conf command. Also, make sure you’re not missing any other options needed on the node.
[root@testcl-node2 ~]# storpool_cg print
[root@testcl-node2 ~]# storpool_cg conf -NMEi
If the results don’t match and there are more changes than expected (there should be a difference only in memory, if this is a new drive), do not apply the configuration before investigating the differences.
Attention
If the CACHE_SIZE for any storpool_server instance has changed, make a note to restart that instance.
If the results match and the changes you see are what’s expected, apply the configuration:
[root@testcl-node2 ~]# storpool_cg conf -ME
More information on setting up CGroups can be found here
4. Restarting the required services (only for NVMe drives)
If the CACHE_SIZE for a storpool_server instance was changed in the previous step (Adjusting cgroups), the service needs to be restarted:
[root@testcl-node2 ~]# service storpool_server_1 restart
For NVMe devices, the storpool_nvmed service needs to be restarted so the device is visible to the StorPool servers. Restarting storpool_nvmed also restarts all server instances that control NVMe devices, so it is simpler to restart all server instances directly with:
[root@testcl-node2 ~]# storpool_ctl restart --servers --wait
The --wait option instructs storpool_ctl to wait for all disks visible on this node to join the cluster before exiting (requires configured access to the API service).
5. Manually adding a drive initialized with --no-notify to the cluster
When a disk is initialized manually with --no-notify, no server instance is notified that the disk should be added. In order for the drive to join the cluster, the relevant instance needs to be notified about it. This is done by using storpool_initdisk with the -r option, as in the example below:
[root@sof11 ~]# storpool_initdisk --list
/dev/sdb1, diskId 1106, version 10009, server instance 2, cluster bbe7.b, SSD
/dev/sda1, diskId 1105, version 10009, server instance 2, cluster bbe7.b, SSD
0000:c2:00.0-p1, diskId 1101, version 10009, server instance 1, cluster bbe7.b, SSD
0000:c2:00.0-p2, diskId 1102, version 10009, server instance 1, cluster bbe7.b, SSD
0000:c1:00.0-p1, diskId 1103, version 10009, server instance 0, cluster bbe7.b, SSD
0000:c1:00.0-p2, diskId 1104, version 10009, server instance 0, cluster bbe7.b, SSD
[root@sof11 ~]# storpool_initdisk -r /dev/sdb1 # SATA SSD drive
[root@sof11 ~]# storpool_initdisk -r 0000:c2:00.0-p1 # First NVMe partition
The wait_all_disks tool will wait for all non-ejected disks visible with storpool_initdisk --list to join the cluster (requires access to the API).
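When several disks need to be notified, the commands can be generated from the storpool_initdisk --list output and reviewed before running them. The following sketch assumes the line format shown in the listing above ("device, diskId N, ..."); the function name print_notify_cmds is hypothetical, and it only prints the commands.

```shell
# Hypothetical helper (not part of StorPool): read `storpool_initdisk --list`
# output on stdin and print (not run) a notify command for each of the
# disk IDs given as arguments.
print_notify_cmds() {
    awk -F', ' -v ids="$*" '
        BEGIN {
            n = split(ids, a, " ")
            for (i = 1; i <= n; i++) want[a[i]] = 1
        }
        $2 ~ /^diskId [0-9]+$/ {
            split($2, f, " ")          # f[2] is the numeric disk ID
            if (f[2] in want) print "storpool_initdisk -r " $1
        }
    '
}

# Usage: review the printed commands first, then pipe them to sh if correct:
#   storpool_initdisk --list | print_notify_cmds 1106 1101
```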
6. Check disk is added to the cluster
Monitor the logs to check whether the drive was accepted by the relevant storpool_server instance. The example below is with storpool_server_7:
Jun 23 04:43:24 storpool1 kernel: [ 2054.378110] storpool_pci 0000:3b:00.0: probing
Jun 23 04:43:24 storpool1 kernel: [ 2054.378340] storpool_pci 0000:3b:00.0: device registered, physDev 0000:3b:00.0
Jun 23 04:43:32 storpool1 storpool_server_7[33638]: [info] 0000:3b:00.0-p1: adding as data disk 101 (ssd)
Jun 23 04:43:51 storpool1 storpool_server_7[33638]: [info] Disk 101 beginning tests
Jun 23 04:43:53 storpool1 storpool_server_7[33638]: [info] Disk 101 tests finished
Check with storpool disk list (this could take some time on clusters with a large number of volumes/snapshots):
# storpool disk list | fgrep -w 212
212 | 2.7 | 1.8 TB | 2.6 GB | 1.8 TB | 0 % | 1918385 | 48 MB | 40100 / 1920000 | 0 / 0
7. Adding drive to a placement group and balancing data
The drive must be added to a placement group so StorPool can place data on it. Here’s an example:
[root@testcl-mgmt]# storpool placementGroup <placement-group-name> addDisk <diskId>
OK
Finally, start a balancing operation to put some of the existing data on the newly added drive:
[root@test-cl-mgmt ~]# cd ~/storpool/balancer
[root@test-cl-mgmt ~]# /usr/lib/storpool/balancer.sh -F
[root@test-cl-mgmt ~]# storpool balancer commit
For more information on how to balance a cluster, see 18. Rebalancing the cluster.