Drive management

Introduction

When working a StorPool cluster there would be times when you need to add or remove drives from nodes. Here are a few examples:

Adding a new node with new drives.
Adding a new drive to a node.
Replacing an existing drive with a new one.
Removing a drive when it fails, or if it is not needed anymore.

Whenever you add or remove a drive you should take care to maintain or restore the redundancy levels. It is recommended to evade periods with decreased redundancy levels, or if not possible, to minimize their length.

Adding drives

After a new drive is added to a node you need to define partitions on the drive, and set a StorPool-specific ID. For details, see Storage devices.

Ejecting drives

In StorPool, marking a drive as temporarily unavailable is referred to as ejecting. When you eject a drive, the cluster will do the following:

Stop the data replication to it.
Keep the metadata about the placement groups in which it participated, and which volume objects it contained.

The following sections show some examples with ejected drives. For more information about when and how to eject a drive, see Disk.

Removing a drive without replacement

Follow this procedure when a drive has to be removed from the cluster. Data replicas would be restored on other available drives before the drive is removed from the cluster. This procedure will not leave any volumes with degraded redundancy at any moment. It assumes there is enough free space in the cluster to move the data contained in the drive to be replaced.

Note

If the drive is to be replaced by another drive and there are available slots for the new drive without the need to remove the old drive, follow the instructions to add the new drive first, and then continue with this procedure to remove the old drive. In this case there is no need to rebalance when adding the new drive. Rebalance can be done only once, with the removal of the old drive.

When removing multiple drives (for example, when decommissioning a server), the rebalance is to be done as one operation, with all drives marked as softEject before running the rebalancing..

To remove a drive and balance the load:

Balance-out the drive:
1. Mark the drive for removal (repeat for all drives to be removed):
  # storpool disk <diskID> softEject
2. Use the balancer to restore the replication level:
  # /usr/lib/storpool/balancer.sh -R
  
  In case you experience issues with the rebalancing - for example, lack of space for performing the operation, or other issues - you can refer to Rebalancing the cluster for details and examples on balancer usage.
3. Commit the changes and start the relocation:
  # storpool balancer commit
Note

Rebalancing operations can take multiple hours, up to few days. This depends on the size of the data that needs to be moved per disk, and also on the load of the cluster. As a rule of thumb, the main limiting factor is the the drive with most data to be moved to.
Wait for the data on the drive to be moved to other drives.
- Monitor the process:
  # storpool task list # storpool -j relocator status "data" : { "recoveringFromRemote" : 0, "status" : "on", "vagsToCleanup" : 0, "vagsToRelocate" : 443, "vagsToRelocateWaitingForBalancer" : 0, "volumesToRelocate" : 880 },
  
  The relevant field for the state of the rebalancing is vagsToRelocate. volumesToRelocate also includes the value of recoveringFromRemote.
- Monitor the remaining data to be moved:
  # storpool relocator disks
Attention

The storpool relocator disks command consumes a lot of resources. When executed on big clusters with a large number of volumes it may introduce latency and delays to queries to the management of the cluster (it cannot affect I/O operations). Avoid frequent usage of this command if possible.

When a drive is balanced-out it will go into ejected state. You can check in one of the following ways:

Using the storpool command:

# storpool disk list
  disk |  server  |    size    |    used    |  est.free  |      %  |  free entries  |  on-disk size  |  allocated objects |  errors  |  flags
  802  |     8.0  |         -  |         -  |         -  |    - %  |             -  |             -  |        - / -       |   - / -  |

The ejected drive is shown only with its ID and dashes in all other fields.

Using the storpool_initdisk command, until you see the “EJECTED” flag:

# storpool_initdisk --list
  0000:83:00.0-p2, diskId 812, version 10009, server instance 5, cluster baaa.b, EJECTED, SSD
  /dev/sdb2, diskId 912, version 10009, server instance 2, cluster baa.b, EJECTED, SSD

The ejected drive is marked with an EJECTED flag.

After ensuring the drive is ejected, remove it from the configuration:
# storpool disk <diskID> forget
Now you can physically unplug the drive.

Replacing a drive without balancing-out the old drive

This procedure will leave data with reduced redundancy during the replacement and following data recovery. It is not recommended to use it. It should be applied only when there are no other options - for example, when there is no space available on the cluster to restore redundancy before removing the old drive.

When replacing an SSD drive on hybrid clusters with one copy on SSD, this procedure will cause increased read latency until all data is fully recovered on the replaced SSD, as some read operations will be performed from HDDs. On all-SSD/NVMe clusters and hybrid clusters with two copies on SSD, performance will not be degraded during the replacement.

To replace a drive without impacting the redundancy or performance should first eject it and then add the new drive, as described in the sections below.

Ejecting the old drive

Make sure the new drive you would use meets the following requirements:
- It is of the same type; for example, replacing a SATA SSD with a SATA SSD or NVMe, a NVMe with a NVMe, or a HDD with a HDD.
- It is the same size, or it’s bigger.
Put the node on which the drive is to be ejected in Maintenance mode. Note the following:
- Make sure the maintenance period you set is long enough for completing the whole drive replacement procedure!
- If this cannot be done (for example, if there is a degraded volume), follow the procedure in Removing an ejected drive without replacement to restore full redundancy! After that you can proceed with ejecting the drive.
Eject the old drive (see Ejecting disks and internal server tests):
# storpool disk <diskID> eject

Adding the new drive

Remove the old drive physically, and insert the new one.
Follow the instructions in Adding a drive to a running cluster to create partitions on the new drive, and assign the same ID and server instance.

Note

Make sure the Disk ID and server instance ID are the same (and, optionally, other parameters). You can use the --list and -i options of the storpool_initdisk tool.
After the new drive is initialized it will be added automatically, and the cluster will start recovering data on it.
- To check drive’s status:
  # storpool disk list
- To check the data recovery process:
  # storpool task list
Complete the maintenance mode for the node.

Replacing an ejected drive

A drive can be ejected from a cluster as described in Removing a drive without replacement, or for other reasons (for example, failure). Follow the procedure below when you need to replace the ejected drive with a new one:

Make sure the new drive you would use meets the following requirements:
- It is of the same type.
- It is the same size, or it’s bigger.
Verify the old drive is ejected before physically unplugging it:
# storpool disk list
Proceed with the steps in the Adding the new drive section.

Removing an ejected drive without replacement

Refer to the procedure for removal of ejected drive at Removing and re-balancing out

Recovering drives

When you want to return an ejected disk back to the cluster:

storpool_initdisk -r <device>

Here, <device> should be in the /dev/sdXN format, where X is the drive letter and N is the partition number. For more information about the storpool_initdisk tool, see Device preparation options.

Wait for the drive to get back in the cluster (check with storpool disk list), and then wait for the recovery tasks to complete. If the drive is unable to get back for some reason, you can do one of the following:

Remove it, as described in Removing an ejected drive without replacement.
Replace it, as described in Replacing an ejected drive.

Note

If you’re returning a drive that was previously balanced-out from the cluster you should reformat it first.

More information

Whenever you add a drive to the cluster, consider also adding the drive to a placement group and then rebalancing it into the cluster. For details, see Placement groups and Rebalancing the cluster.