StorPool User Guide - version 21
Document version 2022-09-08
1. StorPool Overview
StorPool is distributed block storage software. It pools the attached storage (HDDs, SSDs or NVMe drives) of standard servers to create a single pool of shared storage. The StorPool software is installed on each server in the cluster. It combines the performance and capacity of all drives attached to the servers into one global namespace.
StorPool provides standard block devices. You can create one or more volumes through its sophisticated volume manager. StorPool is compatible with ext3 and XFS file systems and with any system designed to work with a block device, e.g. databases and cluster file systems (like OCFS and GFS). StorPool can also be used with no file system, for example when using volumes to store VM images directly or as LVM physical volumes.
Redundancy is provided by multiple copies (replicas) of the data written synchronously across the cluster. Users may set the number of replication copies or an erasure coding scheme. The replication level directly correlates with the number of servers that may be down without interruption in the service. For replication 3 and all N+2 erasure coding schemes the number of the servers (see Fault sets) that may be down simultaneously without losing access to the data is 2.
StorPool protects data and guarantees its integrity by a 64-bit checksum and version for each sector on a StorPool volume or snapshot. StorPool provides a very high degree of flexibility in volume management. Unlike other storage technologies, such as RAID or ZFS, StorPool does not rely on device mirroring (pairing drives for redundancy). So every disk that is added to a StorPool cluster adds capacity and improves the performance of the cluster, not just for new data but also for existing data. Provided that there are sufficient copies of the data, drives can be added or taken away with no impact to the storage service. Unlike rigid systems like RAID, StorPool does not impose any strict hierarchical storage structure dictated by the underlying disks. StorPool simply creates a single pool of storage that utilizes the full capacity and performance of a set of commodity drives.
2. Architecture
StorPool works on a cluster of servers in a distributed shared-nothing architecture. All functions are performed by all servers on an equal peer basis. It works on standard off-the-shelf servers running GNU/Linux.
Each storage node is responsible for data stored on its local drives. Storage nodes collaborate to provide the storage service. StorPool provides a shared storage pool combining all the available storage capacity. It uses synchronous replication across servers. The StorPool client communicates in parallel with all StorPool servers. The StorPool iSCSI target provides access to volumes exported through it to other initiators.
The software consists of two types of services - storage server services and storage client services - that are installed on each physical server (host, node). The storage client services are the native block device driver on Linux-based systems, and the iSCSI target or the NVMe/TCP target for other systems. Each host can be a storage server, a storage client, an iSCSI target, an NVMeOF controller, or any combination of these. To storage clients the StorPool volumes appear as block devices under the /dev/storpool/ directory and behave as normal disk devices. The data on the volumes can be read and written by all clients simultaneously; its consistency is guaranteed through a synchronous replication protocol. Volumes may be used by clients as they would use a local hard drive or disk array.
3. Feature Highlights
3.1. Scale-out, not Scale-Up
The StorPool solution is fundamentally about scaling out (by adding more drives or nodes) rather than scaling up (adding capacity by replacing a storage box with a larger storage box). This means StorPool can scale independently in IOPS, storage space, and bandwidth. There is no bottleneck or single point of failure. StorPool can grow without interruption and in small steps - a drive, a server, and/or a network interface at a time.
3.2. High Performance
StorPool combines the IOPS performance of all drives in the cluster and optimizes drive access patterns to provide low latency and handling of storage traffic bursts. The load is distributed equally between all servers through striping and sharding.
3.3. High Availability and Reliability
StorPool uses a replication mechanism that slices and stores copies of the data on different servers. For primary, high performance storage this solution has many advantages compared to RAID systems and provides considerably higher levels of reliability and availability. In case of a drive, server, or other component failure, StorPool uses some of the available copies of the data located on other nodes in the same or other racks significantly decreasing the risk of losing access to or losing data.
3.4. Commodity Hardware
StorPool supports drives and servers in a vendor-agnostic manner, allowing you to avoid vendor lock-in. This allows the use of commodity hardware, while preserving reliability and performance requirements. Moreover, unlike RAID, StorPool is drive agnostic - you can mix drives of various types, make, speed or size in a StorPool cluster.
3.6. Co-existence with hypervisor software
StorPool can utilize repurposed existing servers and can co-exist with hypervisor software on the same server. This means that there is no dedicated hardware for storage, and growing an IaaS cloud solution is achieved by simply adding more servers to the cluster.
3.7. Compatibility
StorPool is compatible with 64-bit Intel and AMD based servers. We support all Linux-based hypervisors and hypervisor management software. Any Linux software designed to work with a shared storage solution such as an iSCSI or FC disk array will work with StorPool. StorPool guarantees the functionality and availability of the storage solution at the Linux block device interface.
3.8. CLI interface and API
StorPool provides an easy to use yet powerful command-line interface (CLI) tool for administration of the data storage solution. It is simple and user-friendly - making configuration changes, provisioning and monitoring fast and efficient. StorPool also provides a RESTful JSON API, and python bindings exposing all the available functionality, so you can integrate it with any existing management system.
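As a brief illustration, here is a hedged sketch of working with the CLI and the JSON API. The volume name and size are hypothetical, and the /ctrl/1.0/ endpoint path and the Authorization header format are assumptions to be verified against the CLI and API reference for your release:
# storpool volume testvolume create size 100G replication 3   # hypothetical volume name and size
# storpool volume list
# curl -sS -H "Authorization: Storpool v1:${SP_AUTH_TOKEN}" http://${SP_API_HTTP_HOST}:81/ctrl/1.0/VolumesList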
3.9. Reliable Support
StorPool comes with reliable dedicated support: remote installation and initial configuration by StorPool’s specialists; 24x7 support; and live software updates without interruption of the service.
4. Hardware Requirements
All distributed storage systems are highly dependent on the underlying hardware. There are some aspects that will help achieve maximum performance with StorPool and are best considered in advance. Each node in the cluster can be used as server, client, iSCSI target or any combination; depending on the role, hardware requirements vary.
4.1. Minimum StorPool cluster
3 industry-standard x86 servers;
any x86-64 CPU with 4 threads or more;
32 GB ECC RAM per node (8+ GB used by StorPool);
any hard drive controller in JBOD mode;
3x SATA3 hard drives or SSDs;
dedicated 2x10GE LAN;
4.2. Recommended StorPool cluster
5 industry-standard x86 servers;
IPMI, iLO/LOM/iDRAC desirable;
Intel Nehalem generation (or newer) Xeon processor(s);
64GB ECC RAM or more in every node;
any hard drive controller in JBOD mode;
dedicated dual 25GE or faster LAN;
2+ NVMe drives per storage node;
4.3. How StorPool relies on hardware
4.3.1. CPU
When the system load increases, CPUs become saturated with system interrupts. To avoid the negative effects of this, StorPool’s server and client processes are given one or more dedicated CPU cores. This significantly improves the overall performance and the performance consistency.
4.3.2. RAM
ECC memory can detect and correct the most common kinds of in-memory data corruption, thus maintaining a memory system immune to single-bit errors. Using ECC memory is an essential requirement for improving the reliability of the node. In fact, StorPool is not designed to work with non-ECC memory.
4.3.3. Storage (HDDs / SSDs)
StorPool ensures the best drive utilization. Replication and data integrity are core functionality, so RAID controllers are not required and all storage devices can be connected as JBOD. All hard drives are journaled, for example on a small high-endurance NVMe drive (similar to the Intel Optane series) or on a power-loss protected device (see Journals). When a write-back cache is available on a RAID controller, it can be used in a StorPool-specific way in order to provide power-loss protection for the data written on the hard disks. This is not necessary for SATA SSD pools.
4.3.4. Network
StorPool is a distributed system, which means that the network is an essential part of it. Designed for efficiency, StorPool transfers data in parallel from multiple nodes in the cluster. This greatly improves the data throughput compared with access to local devices, even if they are SSDs or NVMe drives.
4.4. Software Compatibility
4.4.1. Operating Systems
Linux (various distributions)
Windows, VMware, and Citrix Xen through standard protocols (iSCSI)
4.4.2. File Systems
Developed and optimized for Linux, StorPool is very well tested on CentOS, Ubuntu and Debian. Compatible and well tested with ext4 and XFS file systems and with any system designed to work with a block device, e.g. databases and cluster file systems (like GFS2 or OCFS2). StorPool can also be used with no file system, for example when using volumes to store VM images directly. StorPool is compatible with other technologies from the Linux storage stack, such as LVM, dm-cache/bcache, and LIO.
4.4.3. Hypervisors & Cloud Management/Orchestration
KVM
LXC/Containers
OpenStack
OpenNebula
OnApp
CloudStack
any other technology compatible with the Linux storage stack.
5. Installation and Upgrade
Currently the installation and upgrade procedures are performed by the StorPool support team.
6. Node configuration options
To configure nodes StorPool uses a configuration file, which can be found at /etc/storpool.conf. Host-specific configuration can be placed in the /etc/storpool.conf.d/ directory.
6.1. Minimal Node Configuration
The minimum working configuration must specify the network interface, number of expected nodes, authentication tokens and unique ID of the node like in the following example:
#-
# Copyright (c) 2013 - 2017 StorPool.
# All rights reserved.
#
# Human readable name of the cluster, usually in the form "Company Name"-"Location", e.g. StorPoolLab-Sofia
#
# Mandatory for the monitoring
SP_CLUSTER_NAME= #<Company-Name-PoC>-<City-or-nearest-airport>
# Remote authentication token provided by StorPool support for data related to crashed services, collected
# vmcore-dmesg files after kernel panic, per-host monitoring alerts, storpool_iolatmon alerts, etc.
SP_RAUTH_TOKEN= <rauth-token>
# Computed from the StorPool Support ID and consists of location and cluster separated by a dot, e.g. nzkr.b
#
# Mandatory since version 16.02
SP_CLUSTER_ID= #Ask StorPool Support
# Interface for storpool communication
#
# Default: empty
SP_IFACE1_CFG=1:eth1.100:eth1:100:10.1.100.1:b:x:r
SP_IFACE2_CFG=1:eth2.100:eth2:100:10.2.100.1:b:x:r
# expected nodes for beacon operation
#
# !!! Must be specified !!!
#
SP_EXPECTED_NODES=3
# API authentication token
#
# 64bit random value
# generate for example with: 'od -vAn -N8 -tu8 /dev/random'
SP_AUTH_TOKEN=4306865639163977196
##########################
[spnode1.example.com]
SP_OURID = 1
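After editing the file, the effective configuration can be verified with the storpool_showconf tool (also used later in this guide); a minimal sketch, assuming it is invoked with one option name per call:
# storpool_showconf SP_CLUSTER_NAME
# storpool_showconf SP_OURID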
6.2. Full Configuration Options List
The following is a complete list of the configuration options, with a short explanation for each of them.
6.2.1. Cluster name
Required for the pro-active monitoring performed by the StorPool support team. Usually in the form <Company-Name>-<City-or-nearest-airport>:
SP_CLUSTER_NAME=StorPoolLab-Sofia
6.2.2. Cluster ID
The Cluster ID is computed from the StorPool Support ID and consists of two parts - location and cluster - separated by a dot ("."). Each location consists of one or more clusters:
SP_CLUSTER_ID=nzkr.b
6.2.3. Non-voting beacon node
Used for client-only nodes; the storpool_server service will refuse to start on a node with SP_NODE_NON_VOTING set. The default is 0:
SP_NODE_NON_VOTING=1
Attention
It is strongly recommended to configure SP_NODE_NON_VOTING in the per-host configuration sections in storpool.conf (see Per host configuration for more).
6.2.4. Communication interface for StorPool cluster
It is recommended to have two dedicated network interfaces for communication between the nodes:
SP_IFACE1_CFG=1:eth1.100:eth1:100:10.1.100.1:b:x:r
SP_IFACE2_CFG=1:eth2.100:eth2:100:10.2.100.1:b:x:r
For a full explanation of all options, please check /usr/share/doc/storpool/examples/storpool.conf.example
6.2.5. Address for the API management (storpool_mgmt)
Used by the CLI. Multiple clients can simultaneously send requests to the API. The management service is usually started on one or more nodes in the cluster. By default it is bound to localhost:
SP_API_HTTP_HOST=127.0.0.1
For cluster-wide access and automatic failover between the nodes, multiple nodes might have the API service started. The specified IP address is brought up only on one of the nodes in the cluster at a time - the so-called active API service. You may specify an available IP address (SP_API_HTTP_HOST), which will be brought up or down on the corresponding interface (SP_API_IFACE) when migrating the API service between the nodes.
To configure an interface (SP_API_IFACE) and address (SP_API_HTTP_HOST):
SP_API_HTTP_HOST=10.10.10.240
SP_API_IFACE=eth1
Note
The script that adds or deletes the SP_API_HTTP_HOST address is located at /usr/lib/storpool/api-ip and could be easily modified for other use cases (e.g. to configure routing, firewalls, etc.).
6.2.6. Port for the API management (storpool_mgmt)
Port for the API management service, the default is:
SP_API_HTTP_PORT=81
6.2.7. Ignore RX port option
Used to instruct the services that the network can preserve the selected port even when altering ports, default is:
SP_IGNORE_RX_PORT=0
6.2.8. Preferred port
Used to specify which port is preferred when two networks are specified but only one of them can actually be used, for whatever reason (in an active-backup bond style). The default value is:
SP_PREFERRED_PORT=0 # which is load-balancing
Supported values are:
SP_PREFERRED_PORT=1 # use SP_IFACE1_CFG by default
SP_PREFERRED_PORT=2 # use SP_IFACE2_CFG by default
6.2.9. Address for the bridge service (storpool_bridge)
Required for the local bridge service, this is the address where the bridge binds to:
SP_BRIDGE_HOST=180.220.200.8
6.2.10. Interface for the bridge address (storpool_bridge)
Expected when the SP_BRIDGE_HOST value is a floating IP address for the storpool_bridge service:
SP_BRIDGE_IFACE=bond0.900
6.2.11. Parallel requests per disk when recovering from remote (storpool_bridge)
Number of parallel requests to issue while performing remote recovery, between 1 and 64, default:
SP_REMOTE_RECOVERY_PARALLEL_REQUESTS_PER_DISK=2
6.2.12. Working directory
Used for fifos, sockets, core files, etc., default:
SP_WORKDIR=/var/run/storpool
Hint
On nodes with /var/run in RAM and a limited amount of memory, /var/spool/storpool/run is recommended.
6.2.13. Report directory
Location for collecting automated bug reports and shared memory dumps:
SP_REPORTDIR=/var/spool/storpool
6.2.14. Restart automatically in case of crash
Restart a service after a crash, provided there have been fewer than 3 crashes within this interval in seconds. If the value is 0, the service will not be restarted automatically and will have to be started manually. The default is 30 minutes:
SP_RESTART_ON_CRASH=1800
6.2.15. Expected nodes
Minimum expected nodes for beacon operation, usually equal to the number of nodes with storpool_server instances running:
SP_EXPECTED_NODES=3
6.2.16. Local user for debug data collection
User to change the ownership of the storpool_abrtsync service runtime. Unset by default:
SP_ABRTSYNC_USER=
Note
If not configured during installation, this user will be set to storpool by default.
6.2.17. Remote addresses for sending debug data
The defaults are shown below; they should be altered only in the unlikely case that a jumphost or custom collection nodes are used:
SP_ABRTSYNC_REMOTE_ADDRESSES=reports.storpool.com,reports1.storpool.com,reports2.storpool.com
6.2.18. Remote ports for sending debug data
The default port is shown below; it might be altered in case a jumphost or custom collection nodes are used:
SP_CRASH_REMOTE_PORT=2266
6.2.19. Group owner for the StorPool devices
The system group to use for the /dev/storpool directory and the /dev/sp-* raw disk devices:
SP_DISK_GROUP=disk
6.2.20. Permissions for the StorPool devices
The access mode to set on the /dev/sp-* raw disk devices:
SP_DISK_MODE=0660
6.2.21. Logging of non-read-only open/close for StorPool devices
If set to 0, the storpool_bd kernel module will not log anything about the opening and closing of StorPool devices:
SP_BD_LOG_OPEN_CLOSE=1
6.2.22. Exclude disks globally or per server instance
A list of paths to drives to be excluded at instance boot time:
SP_EXCLUDE_DISKS=/dev/sda1:/dev/sdb1
Can also be specified for each server instance individually:
SP_EXCLUDE_DISKS=/dev/sdc1
SP_EXCLUDE_DISKS_1=/dev/sda1
6.2.24. Cgroup setup
Tip
For more information on StorPool and cgroups, see Control Groups.
The following enables the usage of cgroups, default is on (1):
SP_USE_CGROUPS=1
Each StorPool process requires a specification of the cgroups it should be started into; there is a default configuration for each service. One or more processes may be placed in the same cgroup, or each one may be in a cgroup of its own, as appropriate. StorPool provides a tool for setting up cgroups called storpool_cg. It is able to automatically configure a system depending on the installed services, on all supported operating systems.
SP_RDMA_CGROUPS is required for setting the cgroups of the kernel threads started by the storpool_rdma module:
SP_RDMA_CGROUPS=-g cpuset:storpool.slice/rdma -g memory:storpool.slice/common
Set cgroups for the storpool_block service:
SP_BLOCK_CGROUPS=-g cpuset:storpool.slice/block -g memory:storpool.slice/common
Set cgroups for the storpool_bridge service:
SP_BRIDGE_CGROUPS=-g cpuset:storpool.slice/bridge -g memory:storpool.slice/alloc
Set cgroups for the storpool_server service:
SP_SERVER_CGROUPS=-g cpuset:storpool.slice/server -g memory:storpool.slice/common
SP_SERVER1_CGROUPS=-g cpuset:storpool.slice/server_1 -g memory:storpool.slice/common
SP_SERVER2_CGROUPS=-g cpuset:storpool.slice/server_2 -g memory:storpool.slice/common
SP_SERVER3_CGROUPS=-g cpuset:storpool.slice/server_3 -g memory:storpool.slice/common
SP_SERVER4_CGROUPS=-g cpuset:storpool.slice/server_4 -g memory:storpool.slice/common
SP_SERVER5_CGROUPS=-g cpuset:storpool.slice/server_5 -g memory:storpool.slice/common
SP_SERVER6_CGROUPS=-g cpuset:storpool.slice/server_6 -g memory:storpool.slice/common
Set cgroups for the storpool_beacon service:
SP_BEACON_CGROUPS=-g cpuset:storpool.slice/beacon -g memory:storpool.slice/common
Set cgroups for the storpool_mgmt service:
SP_MGMT_CGROUPS=-g cpuset:storpool.slice/mgmt -g memory:storpool.slice/alloc
Set cgroups for the storpool_controller service:
SP_CONTROLLER_CGROUPS=-g cpuset:system.slice -g memory:system.slice
Set cgroups for the storpool_iscsi target service:
SP_ISCSI_CGROUPS=-g cpuset:storpool.slice/iscsi -g memory:storpool.slice/alloc
Set cgroups for the storpool_nvmed service:
SP_NVMED_CGROUPS=-g cpuset:storpool.slice/beacon -g memory:storpool.slice/common
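To verify that a running service actually landed in the intended cgroups, standard Linux tooling can be used. A minimal sketch, assuming the systemd unit is named storpool_server.service as the service names above suggest:
# pid=$(systemctl show -p MainPID --value storpool_server.service)   # main PID of the server service
# grep -E 'cpuset|memory' /proc/$pid/cgroup                          # show its cpuset and memory cgroups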
6.2.25. Network and Storage controllers interrupts affinity
The setirqaff utility is started by cron every minute. It checks the CPU affinity settings of several classes of IRQs (network interfaces, HBA, RAID) and updates them if needed. The policy is built into the script and does not require any external configuration files, apart from a properly configured storpool.conf for the present node.
6.2.26. Cache size
Each storpool_server process allocates this amount of RAM (in MB) for caching. The size of the cache depends on the number of storage devices on each storpool_server instance and is taken care of by the storpool_cg tool during cgroup configuration. Example configuration for all storpool_server instances:
SP_CACHE_SIZE=4096
Note
A node with three storpool_server processes running will use 4096*3 = 12GB cache in total.
The cache size can be overridden for each of the storpool_server instances, which is useful when different instances control a different number of drives:
SP_CACHE_SIZE=1024
SP_CACHE_SIZE_1=1024
SP_CACHE_SIZE_2=4096
SP_CACHE_SIZE_3=8192
Set the internal write-back caching to on:
SP_WRITE_BACK_CACHE_ENABLED=1
Attention
A UPS is mandatory with WBC; a clean server shutdown is required before the UPS batteries are depleted.
6.2.27. API authentication token
This value must be a unique integer for each cluster:
SP_AUTH_TOKEN=0123456789
Hint
Generated with: od -vAn -N8 -tu8 /dev/random
6.2.28. NVMe SSD drives
The storpool_nvmed service automatically detects all initialized StorPool devices and attaches them to the configured SP_NVME_PCI_DRIVER.
To configure a driver different from the default storpool_pci driver for storpool_nvmed, use:
SP_NVME_PCI_DRIVER=vfio-pci
The vfio-pci driver requires the iommu=pt option on the kernel command line for both Intel and AMD CPUs, and in addition the intel_iommu=on option for Intel CPUs.
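As a hedged sketch of how these options are typically added on a GRUB-based EL system (the grub.cfg path differs on EFI systems and other distributions):
# Append the options to GRUB_CMDLINE_LINUX in /etc/default/grub, e.g.:
#   GRUB_CMDLINE_LINUX="... iommu=pt intel_iommu=on"
# then regenerate the GRUB configuration and reboot:
# grub2-mkconfig -o /boot/grub2/grub.cfg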
6.2.29. Per host configuration
Specific details per host. The value in the square brackets should be the name of the host as returned by the hostname command. The SP_OURID value for this node must be unique throughout the cluster:
[spnode1.example.com]
SP_OURID=1
The highest ID in this release is 62, which is also the maximum number of nodes in a single cluster.
Specific configuration details might be added for each host individually, e.g.:
[spnode1.example.com]
SP_OURID=1
SP_IFACE1_CFG=1:eth1.100:eth1:100:10.1.100.1:b:x:r
SP_IFACE2_CFG=1:eth2.100:eth2:100:10.2.100.1:b:x:r
SP_NODE_NON_VOTING=1
7. Storage devices
All storage devices that will be used by StorPool (HDD, SSD, or NVMe) must have one or more properly aligned partitions, and must have an assigned ID. Larger NVMe drives should be split into two or more partitions, which allows assigning them to different instances of the storpool_server service.
You can initialize the devices quickly using the disk_init_helper tool provided by StorPool. Alternatively, you can do this manually using the common parted tool.
7.1. Journals
All hard disk drives should have a journal provided in one of the following ways:
On a persistent memory device (/dev/pmemN)
On a small, high-endurance NVMe device (an Intel Optane or similar)
On a regular NVMe drive, on a small partition separate from its main data one
On a battery/cachevault power-loss protected virtual device (RAID controllers)
Having no journals on the HDDs is acceptable only in case of snapshot-only data (for example, a backup-only cluster).
7.2. Using disk_init_helper
The disk_init_helper tool is used in two steps:
Discovery and setup
The tool discovers all drives that do not have partitions and are not used anywhere (no LVM PV, device mapper RAID, StorPool data disks, and so on). It uses this information to generate a suggested configuration, which is stored as a configuration file. You can try different options until you get a configuration that suits your needs.
Initialization
You provide the configuration file from the first step to the tool, and it initializes the drives.
disk_init_helper is also used in the storpool-ansible playbook (see github.com/storpool/ansible), where it helps provide consistent defaults for known configurations and idempotency.
7.2.1. Example node
This is an example node with 7 x 960GB SSDs, 8 x 2TB HDDs, 1 x 100GB Optane NVMe, and 3 x 1TB NVMe disks:
[root@s25 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 894.3G 0 disk
sdb 8:16 0 894.3G 0 disk
sdc 8:32 0 894.3G 0 disk
sdd 8:48 0 894.3G 0 disk
sde 8:64 0 894.3G 0 disk
sdf 8:80 0 894.3G 0 disk
sdg 8:96 0 894.3G 0 disk
sdh 8:112 0 1.8T 0 disk
sdi 8:128 0 1.8T 0 disk
sdj 8:144 0 1.8T 0 disk
sdk 8:160 0 1.8T 0 disk
sdl 8:176 0 111.8G 0 disk
|-sdl1 8:177 0 11.8G 0 part
|-sdl2 8:178 0 100G 0 part /
`-sdl128 259:15 0 1.5M 0 part
sdm 8:192 0 1.8T 0 disk
sdn 8:208 0 1.8T 0 disk
sdo 8:224 0 1.8T 0 disk
sdp 8:240 0 1.8T 0 disk
nvme0n1 259:6 0 93.2G 0 disk
nvme1n1 259:0 0 931.5G 0 disk
nvme2n1 259:1 0 931.5G 0 disk
nvme3n1 259:4 0 931.5G 0 disk
This node is used in the examples below.
7.2.2. Discovering drives
7.2.2.1. Basic usage
To assign IDs for all disks on this node, run the tool with the --start argument:
[root@s25 ~]# disk_init_helper discover --start 2501 -d disks.json
sdl partitions: sdl1, sdl2, sdl128
Success generating disks.json, proceed with 'init'
Note
Note that the automatically generated IDs must be unique within the StorPool cluster. Allowed IDs are between 1 and 4000.
StorPool disk IDs will be assigned with an offset of 10 between the SSD, NVMe, and HDD drive groups, which can be further tweaked with parameters.
By default, the tool discovers all disks without partitions; the one where the OS is installed (/dev/sdl) is skipped. The tool does the following:
Prepares all SSD, NVMe, and HDD devices with a single large partition on each one.
Uses the Optane device as a journal-only device for the hard drive journals.
7.2.2.2. Viewing configuration
You can use the --show option to see what will be done:
[root@s25 ~]# disk_init_helper discover --start 2501 --show
sdl partitions: sdl1, sdl2, sdl128
/dev/sdb (type: SSD):
data partitions
/dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302126-part1 (2501): 894.25 GiB (mv: None)
/dev/sda (type: SSD):
data partitions
/dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302127-part1 (2502): 894.25 GiB (mv: None)
/dev/sdc (type: SSD):
data partitions
/dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302128-part1 (2503): 894.25 GiB (mv: None)
/dev/sdd (type: SSD):
data partitions
/dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302129-part1 (2504): 894.25 GiB (mv: None)
/dev/sde (type: SSD):
data partitions
/dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302137-part1 (2505): 894.25 GiB (mv: None)
/dev/sdf (type: SSD):
data partitions
/dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302138-part1 (2506): 894.25 GiB (mv: None)
/dev/sdg (type: SSD):
data partitions
/dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302139-part1 (2507): 894.25 GiB (mv: None)
/dev/sdh (type: HDD):
data partitions
/dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS00Y25-part1 (2521): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part1)
/dev/sdj (type: HDD):
data partitions
/dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS03YRJ-part1 (2522): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part2)
/dev/sdi (type: HDD):
data partitions
/dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS041FK-part1 (2523): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part3)
/dev/sdk (type: HDD):
data partitions
/dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS04280-part1 (2524): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part4)
/dev/sdp (type: HDD):
data partitions
/dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS05PTA-part1 (2525): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part5)
/dev/sdo (type: HDD):
data partitions
/dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS05PTB-part1 (2526): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part6)
/dev/sdm (type: HDD):
data partitions
/dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS05PTD-part1 (2527): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part7)
/dev/sdn (type: HDD):
data partitions
/dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS05PTJ-part1 (2528): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part8)
/dev/nvme0n1 (type: journal-only NVMe):
journal partitions
/dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part1 (None): 0.10 GiB (mv: None)
/dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part2 (None): 0.10 GiB (mv: None)
/dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part3 (None): 0.10 GiB (mv: None)
/dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part4 (None): 0.10 GiB (mv: None)
/dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part5 (None): 0.10 GiB (mv: None)
/dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part6 (None): 0.10 GiB (mv: None)
/dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part7 (None): 0.10 GiB (mv: None)
/dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part8 (None): 0.10 GiB (mv: None)
/dev/nvme3n1 (type: NVMe w/ journals):
data partitions
/dev/disk/by-id/nvme-INTEL_SSDPE2KX010T8_BTLJ849207E61P0FGN-part1 (2511): 931.51 GiB (mv: None)
/dev/nvme1n1 (type: NVMe w/ journals):
data partitions
/dev/disk/by-id/nvme-INTEL_SSDPE2KX010T8_BTLJ849207F91P0FGN-part1 (2512): 931.51 GiB (mv: None)
/dev/nvme2n1 (type: NVMe w/ journals):
data partitions
/dev/disk/by-id/nvme-INTEL_SSDPE2KX010T8_BTLJ84920JAJ1P0FGN-part1 (2513): 931.51 GiB (mv: None)
7.2.2.3. Recognizing SSDs
The SSDs and HDDs are auto-discovered by their rotational flag in the /sys/block hierarchy. There are however occasions when this flag might be misleading, and an SSD is visible as a rotational device.
For such cases there are overrides that can further help with proper configuration, as shown in the example below:
# disk_init_helper discover --start 101 '*Micron*M510*:s'
All devices matching the /dev/disk/by-id/*Micron*M510* pattern will be forced to be treated as SSD drives, regardless of how they were discovered by the tool.
7.2.2.4. Specifying a journal
Similarly, a journal may be specified for a device, for example:
# disk_init_helper discover --start 101 '*Hitachi*HUA7200*:h:njo'
Instructs the tool to use an NVMe journal-only device for keeping the journals for all Hitachi HUA7200 drives.
The overrides look like this:
<disk-serial-pattern>:<disk-type>[:<journal-type>]
The disk type may be one of:
s - SSD drive
sj - SSD drive with HDD journals (used for testing only)
n - NVMe drive
nj - NVMe drive with HDD journals
njo - NVMe drive with journals only (no StorPool data disk)
h - HDD drive
x - exclude this drive match, even if it is of the right size
The journal-type override is optional, and makes sense mostly when the device is an HDD:
nj - journal on an NVMe drive; requires at least one nj device
njo - journal on an NVMe drive; requires at least one njo device
sj - journal on an SSD drive (unusual, but useful for testing); requires at least one sj device.
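The overrides can be combined with the other discovery options. A hedged sketch re-using the two patterns shown above together with the -d option from the basic usage example (the serial patterns are examples only, and passing several override patterns in one invocation is an assumption):
# disk_init_helper discover --start 101 '*Micron*M510*:s' '*Hitachi*HUA7200*:h:njo' -d disks.json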
7.2.3. Initializing drives
To initialize the drives using an existing configuration file:
# disk_init_helper init disks.json
The above will apply the settings pre-selected during the discovery phase.
More options may be specified to either provide some visibility into what will be done (like --verbose and --noop), or to provide additional options to storpool_initdisk for the different disk types (like --ssd-args and --hdd-args).
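For example, a dry run that only reports what would be initialized, as a sketch combining the options mentioned above (the exact output varies by release):
# disk_init_helper init disks.json --noop --verbose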
7.3. Manual partitioning
A disk drive can be initialized manually as a StorPool data disk.
7.3.1. Creating partitions
First, an aligned partition should be created spanning the full volume of the disk drive. Here is an example command for creating a partition on the whole drive with the proper alignment:
# parted -s --align optimal /dev/sdX mklabel gpt -- mkpart primary 2MiB 100% # Here, X is the drive letter
For dual partitions on an NVMe drive that is larger than 4TB use:
# parted -s --align optimal /dev/nvmeXnY mklabel gpt -- mkpart primary 2MiB 50% # Here, X is the nvme device controller, and Y is the device's first namespace
# parted -s --align optimal /dev/nvmeXnY mkpart primary 50% 100%
Similarly, to split an even larger (for example, 8TB or larger) NVMe drive to four partitions use:
# parted -s --align optimal /dev/nvmeXnY mklabel gpt -- mkpart primary 2MiB 25% # Here, X is the nvme device controller, and Y is the device's first namespace
# parted -s --align optimal /dev/nvmeXnY mkpart primary 25% 50%
# parted -s --align optimal /dev/nvmeXnY mkpart primary 50% 75%
# parted -s --align optimal /dev/nvmeXnY mkpart primary 75% 100%
Hint
NVMe devices larger than 4TB should always be split into partitions of up to 4TiB each.
7.3.2. Initializing a drive
On a brand new cluster installation it is necessary to have one drive formatted with the “init” (-I) flag of the storpool_initdisk tool. This device is necessary only for the first start, and therefore it is best to pick the first drive in the cluster.
Initializing the first drive on the first server node with the init flag:
# storpool_initdisk -I {diskId} /dev/sdX # Here, X is the drive letter
Initializing an SSD or NVMe SSD device with the SSD flag set:
# storpool_initdisk -s {diskId} /dev/sdX # Here, X is the drive letter
Initializing an HDD drive with a journal device:
# storpool_initdisk {diskId} /dev/sdX --journal /dev/sdY # Here, X and Y are the drive letters
To list all initialized devices:
# storpool_initdisk --list
Example output:
0000:01:00.0-p1, diskId 2305, version 10007, server instance 0, cluster e.b, SSD, opened 7745
0000:02:00.0-p1, diskId 2306, version 10007, server instance 0, cluster e.b, SSD, opened 7745
/dev/sdr1, diskId 2301, version 10007, server instance 1, cluster e.b, SSD, opened 8080
/dev/sdq1, diskId 2302, version 10007, server instance 1, cluster e.b, SSD, opened 8080
/dev/sds1, diskId 2303, version 10007, server instance 1, cluster e.b, SSD, opened 8080
/dev/sdt1, diskId 2304, version 10007, server instance 1, cluster e.b, SSD, opened 8080
/dev/sda1, diskId 2311, version 10007, server instance 2, cluster e.b, WBC, jmv 160036C1B49, opened 8185
/dev/sdb1, diskId 2311, version 10007, server instance 2, cluster -, journal mv 160036C1B49, opened 8185
/dev/sdc1, diskId 2312, version 10007, server instance 2, cluster e.b, WBC, jmv 160036CF95B, opened 8185
/dev/sdd1, diskId 2312, version 10007, server instance 2, cluster -, journal mv 160036CF95B, opened 8185
/dev/sde1, diskId 2313, version 10007, server instance 3, cluster e.b, WBC, jmv 160036DF8DA, opened 8971
/dev/sdf1, diskId 2313, version 10007, server instance 3, cluster -, journal mv 160036DF8DA, opened 8971
/dev/sdg1, diskId 2314, version 10007, server instance 3, cluster e.b, WBC, jmv 160036ECC80, opened 8971
/dev/sdh1, diskId 2314, version 10007, server instance 3, cluster -, journal mv 160036ECC80, opened 8971
7.3.3. Drive initialization options
Other available options of the storpool_initdisk tool:
- --list: List all StorPool disks on this node.
- -i: Specify a server instance; used when more than one storpool_server instance is running on the same node.
- -r: Used to return an ejected disk back to the cluster or to change some of the flags.
- -F: Forget this disk and mark it as ejected; succeeds only if no running storpool_server instance has the drive opened.
- -s|--ssd y/n: Set the SSD flag - on new initialization only, not reversible with -r. Providing the y or n value forces a disk to be considered as flash-based or not.
- -j|--journal (<device>|none): Used for HDDs when a RAID controller with a working cachevault or battery is present, or when an NVMe device is used as a power-loss protected write-back journal cache.
- --bad: Marks the disk as bad. It will be treated as ejected by the servers.
- --good: Resets the disk to ejected if it was bad. Use with caution.
- --list-empty: List empty NVMe devices.
- --json: Output the list of devices as a JSON object.
- --nvme-smart nvme-pci-addr: Dump the NVMe S.M.A.R.T. counters; only for devices controlled by the storpool_nvmed service.
Advanced options (use with care):
- -e (entries_count): Initialize the disk, overriding the default number of entries (the default is based on the disk size).
- -o (objects_count): Initialize the disk, overriding the default number of objects (the default is based on the disk size).
- --wipe-all-data: Used when re-initializing an already initialized StorPool drive. Use with caution.
- --no-test: Disable the forced one-time test flag.
- --no-notify: Does not notify the servers of the changes, so they won’t immediately open the disk. Useful for changing a flag with -r without returning the disk back to the server.
- --no-fua (y|n): Used to forcefully disable FUA support for an SSD device. Use with caution, because it might lead to data loss if the device is powered off before issuing a FLUSH CACHE command.
- --no-flush (y|n): Used to forcefully disable FLUSH support for an SSD device.
- --no-trim (y|n): Used to forcefully disable TRIM support for an SSD device. Useful when the drive misbehaves with TRIM enabled.
- --test-override no/test/pass: Modify the “test override” flag (the default during disk init is “test”).
- --wbc (y|n): Used for HDDs when the internal write-back caching is enabled; requires SP_WRITE_BACK_CACHE_ENABLED to be set in order to have an effect. Turned off by default.
- --nvmed-rescan: Instruct the storpool_nvmed service to rescan after device changes.
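For example, the listing options can be combined to feed other tooling; a sketch assuming --json modifies the --list output as described above:
# storpool_initdisk --list --json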
8. Network interfaces
The recommended mode of operation is with hardware acceleration enabled for supported network interfaces.
Most NICs controlled by the i40e/ixgbe/ice (Intel), mlx4_core/mlx5_core (Nvidia/Mellanox), and bnx2x/bnxt (Broadcom) drivers do support hardware acceleration.
When hardware acceleration is enabled, the StorPool services use the NIC directly, bypassing the Linux kernel. This reduces CPU usage and processing latency, and StorPool traffic is not affected by issues in the Linux kernel (for example floods). Because the Linux kernel is bypassed, the entire network stack is implemented in the StorPool services.
8.1. Preparing Interfaces
There are two ways to configure the network interfaces for StorPool. One is automatic: you provide just the VLAN ID, the IP network(s), and the mode of operation, and leave the IP address selection to the helper tooling.
The other is semi-manual: you provide an explicit IP address configuration for each parameter on each of the nodes in the cluster.
8.2. net_helper
The fully automatic mode of operation selects addresses based on the SP_OURID of each node in the cluster. It requires a VLAN (default 0, i.e. untagged), an IP network for the storage, and a pre-defined mode of operation for the interfaces in the OS.
The supported modes, with some notes for each, are listed below.
Note
All modes relate only to the way the kernel network interfaces are configured. The storage services are always in active-active mode (unless configured differently), using the underlying interfaces directly. Any configuration of the kernel interfaces is solely for the purposes of other traffic, for example access to the API.
8.2.1. exclusive-ifaces
Simplest possible configuration, with just the two main raw interfaces, each configured with a different address.
This mode is used mainly when there is no need for redundancy on the kernel network, usually for multipath iSCSI.
It is not recommended for the storage network if the API is configured on top of it; in that case one of the bond modes is recommended.
8.2.2. active-backup-bond and bridge-active-backup-bond
In these two modes the underlying storage interfaces are added to an active-backup bond interface (named spbond0 by default), which uses an ARP monitor to select the active interface in the bond.
If the VLAN for the storage interfaces is tagged, an additional VLAN interface is created on top of the bond.
Simplest example with untagged VLAN (i.e. 0):

If the network uses a tagged VLAN, the VLAN interface is created on top of the spbond0 interface.
Example with tagged VLAN 100:

In the bridge-active-backup-bond modification, the final resolve interface is a slave of a bridge interface (named br-storage by default).
This is a tagged VLAN 100 on the bond:

Lastly, here is a more complex example with four interfaces (sp0, sp1, sp2, sp3): the first two, for the storage network, are in bridge-active-backup-bond mode; the other two, for the iSCSI network, are in exclusive-ifaces mode. There are two additional networks on top of the storage resolve interface (in this example 1100 and 1200).
There is also an additional multipath network on the iSCSI interfaces, with VLAN 1301 on the first and 1302 on the second iSCSI network interface.
The net_helper genconfig one-liner to prepare this configuration looks like this:
# net_helper genconfig sp0 sp1 sp2 sp3 \
--vlan 100 \
--sp-network 10.0.0.1/24 \
--sp-mode bridge-active-backup-bond \
--add-iface-net 1100,10.1.100.0/24 \
--add-iface-net 1200,10.1.200.0/24 \
--iscsi-mode exclusive-ifaces \
--iscsicfg 1301,10.130.1.0/24:1302,10.130.2.0/24

The tooling helps by automatically selecting addresses for the ARP monitoring targets, if they are not overridden, for better active network selection. These addresses are usually those of the other storage nodes in the cluster. For iSCSI in this mode it is best to provide explicit ARP monitoring addresses.
8.2.3. mlag-bond and bridge-mlag-bond
These two modes are very close to active-backup-bond and bridge-active-backup-bond, with the notable difference that the bond is LACP, both when specified for the main storage and for the iSCSI network interfaces.
With this bond type, no additional arp-monitoring addresses are required or being autogenerated by the tooling.
A quirk of this mode is that multipath networks for iSCSI are created on top of the bond interface, because there is no way to send traffic through a specific interface under the bond. Use the exclusive-ifaces mode for such cases.
8.3. Examples
A minimal example with the following parameters:
Interface names sp0 and sp1 (the order is important)
VLAN ID 42
IP Network 10.4.2.0/24
Predefined mode of operation - an active-backup bond on top of the storage interfaces
The example below is from a node with SP_OURID=11; it will just print an example config that could be stored on the filesystem of the node:
[root@s11 ~]# storpool_showconf SP_OURID
SP_OURID=11
[root@s11 ~]# net_helper genconfig sp0 sp1 --vlan 42 --sp-network 10.4.2.0/24 --sp-mode active-backup-bond
interfaces=sp0 sp1
addresses=10.4.2.11
sp_mode=active-backup-bond
vlan=42
add_iface=
sp_mtu=9000
iscsi_mtu=9000
iscsi_add_iface=
arp_ip_targets=10.4.2.12,10.4.2.13,10.4.2.14,10.4.2.15
config_path=/etc/storpool.conf.d/net_helper.conf
This tool just prints the configuration; to store it, use:
[root@s11 ~]# net_helper genconfig sp0 sp1 --vlan 42 --sp-network 10.4.2.0/24 --sp-mode active-backup-bond > /etc/storpool/autonets.conf
With this configuration, net_helper applyifcfg can be used to produce network configuration appropriate for the operating system. The example below is on CentOS 7 (--noop just prints what will be done):
[root@s11 ~]# net_helper applyifcfg --from-config /etc/storpool/autonets.conf --noop
Same resolve interface spbond0.42 for both nets, assuming bond
An active-backup bond interface detected
Will patch /etc/storpool.conf.d/net_helper.conf with:
________________
SP_IFACE1_CFG=1:spbond0.42:sp0:42:10.4.2.11:b:s:P
SP_IFACE2_CFG=1:spbond0.42:sp1:42:10.4.2.11:b:s:P
SP_ALL_IFACES=dummy0 sp0 sp1 spbond0 spbond0.42
________________
Executing command: iface-genconf --auto --overwrite --sp-mtu 9000 --iscsi-mtu 9000 --arp-ip-targets 10.4.2.12,10.4.2.13,10.4.2.14,10.4.2.15 --noop
Using /usr/lib/storpool, instead of the default /usr/lib/storpool
Same resolve interface spbond0.42 for both nets, assuming bond
An active-backup bond interface detected
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-spbond0.42
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:02:27
DEVICE=spbond0.42
ONBOOT=yes
TYPE=Vlan
BOOTPROTO=none
MTU=9000
VLAN=yes
PHYSDEV=spbond0
IPADDR=10.4.2.11
PREFIX=24
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-dummy0
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:02:27
DEVICE=dummy0
ONBOOT=yes
TYPE=dummy
MASTER=spbond0
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-spbond0
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:02:27
DEVICE=spbond0
ONBOOT=yes
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=active-backup arp_interval=500 arp_validate=active arp_all_targets=any arp_ip_target=10.4.2.12,10.4.2.13,10.4.2.14,10.4.2.15"
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-sp0
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:02:27
DEVICE=sp0
ONBOOT=yes
TYPE=Ethernet
MASTER=spbond0
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-sp1
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:02:27
DEVICE=sp1
ONBOOT=yes
TYPE=Ethernet
MASTER=spbond0
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
There are many additional options available; for example, the name of the bond could be customized, an additional set of VLAN interfaces could be created on top of the bond interface, etc.
A more advanced example with the following parameters:
Interface names sp0, sp1 for StorPool, and lacp0, lacp1 for the iSCSI service
VLAN ID 42 for the storage interfaces
IP Network 10.4.2.0/24
Additional VLAN ID 43 on the bond over the storage interfaces
Storage interfaces kernel mode of operation - a bridge with an MLAG bond on top of the storage interfaces
iSCSI dedicated interfaces kernel mode of operation - an MLAG bond
VLAN 100 and IP network 172.16.100.0/24 for a portal group in iSCSI
To prepare the configuration use:
[root@s11 ~]# net_helper genconfig \
    lacp0 lacp1 sp0 sp1 \
--vlan 42 \
--sp-network 10.4.2.0/24 \
--sp-mode bridge-mlag-bond \
--iscsi-mode mlag-bond \
--add-iface 43,10.4.3.0/24 \
--iscsicfg-net 100,172.16.100.0/24 | tee /etc/storpool/autonets.conf
interfaces=lacp0 lacp1 sp0 sp1
addresses=10.4.2.11
sp_mode=bridge-mlag-bond
vlan=42
iscsi_mode=mlag-bond
add_iface=43,10.4.3.11/24
sp_mtu=9000
iscsi_mtu=9000
iscsi_add_iface=100,172.16.100.11/24
iscsi_arp_ip_targets=
config_path=/etc/storpool.conf.d/net_helper.conf
Example output:
[root@s11 ~]# net_helper applyifcfg --from-config /etc/storpool/autonets.conf --noop
Same resolve interface br-storage for both nets, assuming bond
An 802.3ad bond interface detected
Will patch /etc/storpool.conf.d/net_helper.conf with:
________________
SP_RESOLVE_IFACE_IS_BRIDGE=1
SP_BOND_IFACE_NAME=spbond0.42
SP_IFACE1_CFG=1:br-storage:lacp0:42:10.4.2.11:b:s:v
SP_IFACE2_CFG=1:br-storage:lacp1:42:10.4.2.11:b:s:v
SP_ISCSI_IFACE=sp0,spbond1:sp1,spbond1:[lacp]
SP_ALL_IFACES=br-storage dummy0 dummy1 lacp0 lacp1 sp0 sp1 spbond0 spbond0.42 spbond0.43 spbond1 spbond1.100
________________
Executing command: iface-genconf --auto --overwrite --sp-mtu 9000 --iscsi-mtu 9000 --add-iface 43,10.4.3.11/24 --iscsicfg-explicit 100,172.16.100.11/24 --noop
Using /usr/lib/storpool, instead of the default /usr/lib/storpool
Same resolve interface br-storage for both nets, assuming bond
An 802.3ad bond interface detected
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-br-storage
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39
DEVICE=br-storage
ONBOOT=yes
TYPE=Bridge
BOOTPROTO=none
MTU=9000
IPADDR=10.4.2.11
PREFIX=24
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-spbond0.42
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39
DEVICE=spbond0.42
ONBOOT=yes
TYPE=Vlan
BRIDGE=br-storage
BOOTPROTO=none
MTU=9000
VLAN=yes
PHYSDEV=spbond0
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-dummy0
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39
DEVICE=dummy0
ONBOOT=yes
TYPE=dummy
MASTER=spbond0
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-spbond0
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39
DEVICE=spbond0
ONBOOT=yes
TYPE=Bond
BRIDGE=br-storage
BONDING_MASTER=yes
BONDING_OPTS="mode=802.3ad miimon=100 lacp_rate=1"
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-lacp0
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39
DEVICE=lacp0
ONBOOT=yes
TYPE=Ethernet
MASTER=spbond0
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-lacp1
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39
DEVICE=lacp1
ONBOOT=yes
TYPE=Ethernet
MASTER=spbond0
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-spbond0.43
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39
DEVICE=spbond0.43
ONBOOT=yes
TYPE=Vlan
BOOTPROTO=none
MTU=9000
VLAN=yes
PHYSDEV=spbond0
IPADDR=10.4.3.11
PREFIX=24
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-sp0
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39
DEVICE=sp0
ONBOOT=yes
TYPE=Ethernet
MASTER=spbond1
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-sp1
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39
DEVICE=sp1
ONBOOT=yes
TYPE=Ethernet
MASTER=spbond1
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-dummy1
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39
DEVICE=dummy1
ONBOOT=yes
TYPE=dummy
MASTER=spbond1
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-spbond1.100
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39
DEVICE=spbond1.100
ONBOOT=yes
TYPE=Vlan
BOOTPROTO=none
MTU=9000
VLAN=yes
PHYSDEV=spbond1
IPADDR=172.16.100.11
PREFIX=24
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-spbond1
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39
DEVICE=spbond1
ONBOOT=yes
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=802.3ad miimon=100 lacp_rate=1"
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
To actually apply the configuration use:
[root@s11 ~]# net_helper applyifcfg --from-config /etc/storpool/autonets.conf
...
Additional sub-commands available are:
up - execute ifup/nmcli connection up on all created interfaces
down - execute ifdown/nmcli connection down on all created interfaces
check - check whether there is a configuration, or whether there is a difference between the present one and a newly created one
cleanup - delete all network interfaces created by net_helper; useful when re-creating the same raw interfaces with a different mode
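A hedged usage sketch for the sub-commands; depending on the release, they may need the same --from-config argument as applyifcfg:
# net_helper check   # check for differences between the stored and a newly generated configuration
# net_helper up      # bring up all interfaces created from the stored configuration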
8.4. Manual config notes
The net_helper tool is merely a glue-like tool covering the following manual steps:
Construct the SP_IFACE1_CFG/SP_IFACE2_CFG/SP_ISCSI_IFACE and other configuration statements based on the provided parameters (for the first and second network interfaces for the storage/iSCSI).
Execute iface-genconf, which recognizes these configurations and dumps configuration in /etc/sysconfig (CentOS 7) or /etc/network/interfaces (Debian), or uses nmcli to configure Network Manager (Alma 8/Rocky 8/RHEL 8).
Execute /usr/lib/storpool/vf-genconf to prepare or re-create the configuration for the virtual function interfaces.
9. Background services
A StorPool installation provides the following daemons, which take care of different functionality on each participating node in the cluster.
9.1. storpool_beacon
The beacon must be the first process started on all nodes in the cluster. It informs all members about the availability of the node on which it is installed. If the number of visible nodes changes, every storpool_beacon service checks that its node still participates in the quorum, i.e. that it can communicate with more than half of the expected nodes, including itself (see SP_EXPECTED_NODES in the Full Configuration Options List section). If the storpool_beacon service has been started successfully, it will send messages such as the following to the system log (/var/log/messages, /var/log/syslog, or similar) for every node that comes up in the StorPool cluster:
[snip]
Jan 21 16:22:18 s01 storpool_beacon[18839]: [info] incVotes(1) from 0 to 1, voteOwner 1
Jan 21 16:23:10 s01 storpool_beacon[18839]: [info] peer 2, beaconStatus UP bootupTime 1390314187662389
Jan 21 16:23:10 s01 storpool_beacon[18839]: [info] incVotes(1) from 1 to 2, voteOwner 2
Jan 21 16:23:10 s01 storpool_beacon[18839]: [info] peer up 1
[snip]
9.2. storpool_server
The storpool_server service must be started on each node that provides its hard drives, SSDs, or NVMe drives to the cluster. If the service has started successfully, all the drives intended to be used as StorPool disks should be listed in the system log, e.g.:
Dec 14 09:54:19 s11 storpool_server[13658]: [info] /dev/sdl1: adding as data disk 1101 (ssd)
Dec 14 09:54:19 s11 storpool_server[13658]: [info] /dev/sdb1: adding as data disk 1111
Dec 14 09:54:20 s11 storpool_server[13658]: [info] /dev/sda1: adding as data disk 1114
Dec 14 09:54:20 s11 storpool_server[13658]: [info] /dev/sdk1: adding as data disk 1102 (ssd)
Dec 14 09:54:20 s11 storpool_server[13658]: [info] /dev/sdj1: adding as data disk 1113
Dec 14 09:54:22 s11 storpool_server[13658]: [info] /dev/sdi1: adding as data disk 1112
On a dedicated node, or on a node with a larger amount of spare resources, more than one storpool_server instance could be started (up to four instances).
9.3. storpool_block
The storpool_block service provides the client (initiator) functionality. StorPool volumes can be attached only to the nodes where this service is running. When attached to a node, a volume can be used and manipulated as a regular block device via the /dev/storpool/{volume_name} symlink:
# lsblk /dev/storpool/test
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sp-2 251:2 0 100G 0 disk
9.4. storpool_mgmt
The storpool_mgmt service should be started on the management node. It receives requests from user space tools (CLI or API), executes them in the StorPool cluster, and returns the results back to the sender. Two or more nodes in the same cluster can be used as API management servers, with only one node active at a time. An automatic failover mechanism is available: when the node with the active storpool_mgmt service fails, the SP_API_HTTP_HOST IP address is configured on the next node with the lowest SP_OURID that has a running storpool_mgmt service.
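To see which node currently holds the active API, the cluster services can be listed with the CLI; a sketch (the exact output format differs between releases):
# storpool service list   # lists the cluster services and their state, including the active storpool_mgmt instance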
9.5. storpool_bridge
The storpool_bridge service is started on one or more nodes in the cluster, with one of them acting as the active one, similar to the storpool_mgmt service. This service synchronizes snapshots between this cluster and one or more StorPool clusters in different locations, for the backup and disaster recovery use cases.
9.6. storpool_controller
The storpool_controller service is started on all nodes running the storpool_server service. It collects information from all storpool_server instances in order to provide statistics data to the API.
Note
The storpool_controller service requires port 47567 to be open on the nodes where the API (storpool_mgmt) service is running.
9.7. storpool_nvmed
The storpool_nvmed service is started on all nodes that have the storpool_server service and NVMe devices. It handles the management of the NVMe devices: unplugging them from the kernel’s nvme driver and passing them to the storpool_pci driver.
9.8. storpool_stat
The storpool_stat
service is started on all nodes and collects metrics about different aspects of the system:
on all nodes, CPU stats - queue run/wait, user, system, etc., per CPU;
on all nodes, memory usage stats per cgroup;
on all nodes, network stats for the StorPool services;
on all nodes, the I/O stats of the system drives;
on all nodes, per-host validating service checks (for example, whether there are processes in the root cgroup, whether the API is reachable if configured, etc.);
on all nodes with storpool_block, the I/O stats of all attached StorPool volumes;
on server nodes, stats for the communication of storpool_server with the drives.
The collected data can be viewed at https://analytics.storpool.com and can also be submitted to an InfluxDB instance of the customer, configurable in storpool.conf.
9.9. storpool_qos
The storpool_qos service tracks each volume that either:
is in a template whose name matches a defined storage tier, or
has a qc tag matching a defined storage tier,
and takes care of updating the settings for this (set of) volume(s).
By default it is started on all nodes running the API service and is active on the active API node.
By default there are no tiers defined, but an example is available in /usr/share/doc/storpool/examples/default-qos.json.
A tier consists of:
min_iops - Minimum IOPS regardless of the size of the volume or snapshot (int, IOPS)
min_bw - Minimum bandwidth regardless of the size of the volume or snapshot (int, MiB/s)
max_iops - Maximum IOPS regardless of the size of the volume or snapshot (int, IOPS)
max_bw - Maximum bandwidth regardless of the size of the volume or snapshot (int, MiB/s)
iops_gb - IOPS limit per GiB based on the size of the volume or the snapshot (float, IOPS)
mbps_gb - MiB/s limit per GiB based on the size of the volume or the snapshot (float, MiB/s)
A value of 0 means no limit; any other value results in a defined limit.
To define a (set of) storage tier(s) use:
# /usr/lib/storpool/storpool_qos config --load-from $path_to_json
To show the present configuration use:
# /usr/lib/storpool/storpool_qos config --dump
Note
A systemctl restart storpool_qos.service is required on the active API node in order to re-apply any changes in the tier settings.
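A possible workflow, using only the commands shown above, could look like the following sketch (the file path is an arbitrary example; start from the shipped default-qos.json and adjust the tier values to your needs):
# cp /usr/share/doc/storpool/examples/default-qos.json /root/qos-tiers.json
(edit /root/qos-tiers.json to define the desired tiers)
# /usr/lib/storpool/storpool_qos config --load-from /root/qos-tiers.json
# /usr/lib/storpool/storpool_qos config --dump
# systemctl restart storpool_qos.service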
10. Managing services with storpool_ctl
The storpool_ctl helper tool provides an easy way to perform an action on all installed services on a StorPool node. Example actions are start, stop, restart, and enabling or disabling starting them on boot. Other actions might be added later.
10.1. Supported actions
To list all supported actions use:
# storpool_ctl --help
usage: storpool_ctl [-h] {disable,start,status,stop,restart,enable} ...
Tool that controls all StorPool services on a node, taking care of
service dependencies, required checks before executing and action and
others.
positional arguments:
{disable,start,status,stop,restart,enable}
action
optional arguments:
-h, --help show this help message and exit
10.2. Getting status
List the status of all services installed on this node:
# storpool_ctl status
storpool_nvmed not_running
storpool_mgmt not_running
storpool_reaffirm not_running
storpool_flushwbc not_running
storpool_stat not_running
storpool_abrtsync not_running
storpool_block not_running
storpool_kdump not_running
storpool_hugepages not_running
storpool_bridge not_running
storpool_beacon not_running
storpool_cgmove not_running
storpool_iscsi not_running
storpool_server not_running
storpool_controller not_running
The status is always one of:
not_running - when the service is disabled or stopped
not_enabled - when the service is running but is not yet enabled
running - when the service is running and enabled
Note
The tool always prints the status after the selected action has been applied.
Alternatively, storpool_ctl status --problems shows only services that are not in the running state. Note that in this mode the utility exits with a non-zero exit status if any installed services are in the not_running or not_enabled state.
# storpool_ctl status --problems
storpool_mgmt not_running
storpool_server not_enabled
# echo $?
1
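Because of the non-zero exit status, this mode lends itself to simple scripted health checks; the snippet below is only an illustration built from standard shell constructs:
# storpool_ctl status --problems >/dev/null || echo "StorPool services need attention on $(hostname)"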
10.3. Starting services
To start all services:
# storpool_ctl start
cgconfig running
storpool_abrtsync not_enabled
storpool_cgmove not_enabled
storpool_block not_enabled
storpool_mgmt not_enabled
storpool_flushwbc not_enabled
storpool_server not_enabled
storpool_hugepages not_enabled
storpool_stat not_enabled
storpool_controller not_enabled
storpool_beacon not_enabled
storpool_kdump not_enabled
storpool_reaffirm not_enabled
storpool_bridge not_enabled
storpool_iscsi not_enabled
10.4. Enabling services
To enable all services:
# storpool_ctl enable
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_cgmove.service to /usr/lib/systemd/system/storpool_cgmove.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_bridge.service to /usr/lib/systemd/system/storpool_bridge.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_block.service to /usr/lib/systemd/system/storpool_block.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_reaffirm.service to /usr/lib/systemd/system/storpool_reaffirm.service.
Created symlink from /etc/systemd/system/storpool_beacon.service.wants/storpool_reaffirm.service to /usr/lib/systemd/system/storpool_reaffirm.service.
Created symlink from /etc/systemd/system/storpool_block.service.wants/storpool_reaffirm.service to /usr/lib/systemd/system/storpool_reaffirm.service.
Created symlink from /etc/systemd/system/storpool_mgmt.service.wants/storpool_reaffirm.service to /usr/lib/systemd/system/storpool_reaffirm.service.
Created symlink from /etc/systemd/system/storpool_server.service.wants/storpool_reaffirm.service to /usr/lib/systemd/system/storpool_reaffirm.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_controller.service to /usr/lib/systemd/system/storpool_controller.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_beacon.service to /usr/lib/systemd/system/storpool_beacon.service.
Created symlink from /etc/systemd/system/storpool_server.service.wants/storpool_beacon.service to /usr/lib/systemd/system/storpool_beacon.service.
Created symlink from /etc/systemd/system/storpool_block.service.wants/storpool_beacon.service to /usr/lib/systemd/system/storpool_beacon.service.
Created symlink from /etc/systemd/system/storpool_mgmt.service.wants/storpool_beacon.service to /usr/lib/systemd/system/storpool_beacon.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_kdump.service to /usr/lib/systemd/system/storpool_kdump.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_abrtsync.service to /usr/lib/systemd/system/storpool_abrtsync.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_mgmt.service to /usr/lib/systemd/system/storpool_mgmt.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_server.service to /usr/lib/systemd/system/storpool_server.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_flushwbc.service to /usr/lib/systemd/system/storpool_flushwbc.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_stat.service to /usr/lib/systemd/system/storpool_stat.service.
Created symlink from /etc/systemd/system/sysinit.target.wants/storpool_hugepages.service to /usr/lib/systemd/system/storpool_hugepages.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_iscsi.service to /usr/lib/systemd/system/storpool_iscsi.service.
storpool_cgmove running
storpool_bridge running
storpool_block running
storpool_reaffirm running
storpool_controller running
storpool_beacon running
storpool_kdump running
storpool_abrtsync running
storpool_mgmt running
storpool_server running
storpool_flushwbc running
storpool_stat running
storpool_hugepages running
storpool_iscsi running
10.5. Disabling services
To disable all services (without stopping them):
# storpool_ctl disable
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_cgmove.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_bridge.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_block.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_reaffirm.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_controller.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_beacon.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_kdump.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_abrtsync.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_mgmt.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_server.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_flushwbc.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_stat.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_iscsi.service.
Removed symlink /etc/systemd/system/sysinit.target.wants/storpool_hugepages.service.
Removed symlink /etc/systemd/system/storpool_beacon.service.wants/storpool_reaffirm.service.
Removed symlink /etc/systemd/system/storpool_block.service.wants/storpool_reaffirm.service.
Removed symlink /etc/systemd/system/storpool_block.service.wants/storpool_beacon.service.
Removed symlink /etc/systemd/system/storpool_mgmt.service.wants/storpool_reaffirm.service.
Removed symlink /etc/systemd/system/storpool_mgmt.service.wants/storpool_beacon.service.
Removed symlink /etc/systemd/system/storpool_server.service.wants/storpool_reaffirm.service.
Removed symlink /etc/systemd/system/storpool_server.service.wants/storpool_beacon.service.
storpool_kdump not_enabled
storpool_hugepages not_enabled
storpool_bridge not_enabled
storpool_controller not_enabled
storpool_beacon not_enabled
storpool_cgmove not_enabled
storpool_block not_enabled
storpool_mgmt not_enabled
storpool_stat not_enabled
storpool_reaffirm not_enabled
storpool_server not_enabled
storpool_flushwbc not_enabled
storpool_abrtsync not_enabled
storpool_iscsi not_enabled
10.6. Stopping services
To stop all services:
# storpool_ctl stop
storpool_server not_running
storpool_iscsi not_running
storpool_controller not_running
storpool_mgmt not_running
storpool_cgmove not_running
storpool_kdump not_running
storpool_reaffirm not_running
storpool_stat not_running
storpool_beacon not_running
storpool_hugepages not_running
storpool_flushwbc not_running
storpool_block not_running
storpool_bridge not_running
storpool_abrtsync not_running
Module storpool_pci version 6D0D7D6E357D24CBDF2D1BA
Module storpool_disk version D92BDA6C929615392EEAA7E
Module storpool_bd version C6EB4EEF1E0ABF1A4774788
Module storpool_rdma version 4F1FB67DF4617ECD6C472C4
Note
The stop action includes additional options:
--servers - to stop just the server instances
--expose-nvme - to expose any configured NVMe devices attached to the selected SP_NVME_PCI_DRIVER back to the nvme driver
11. CLI tutorial
StorPool provides an easy yet powerful Command Line Interface (CLI) for administering the data storage cluster, or multiple clusters in the same location (17. Multi-site and multi-cluster). It has an integrated help system that provides useful information at every step.
There are various ways to execute commands in the CLI, depending on the style and needs of the administrator. The StorPool CLI gets its configuration from the /etc/storpool.conf file and from command line options.
11.1. Examples
Type regular shell command with parameters:
# storpool service list
Use interactive StorPool shell:
# storpool
StorPool> service list
Pipe command output to StorPool CLI:
# echo "service list" | storpool
Redirect the standard input from a predefined file with commands:
# storpool < input_file
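Such an input file simply contains one CLI command per line; the file name and contents below are only an illustration using commands shown elsewhere in this guide:
# cat input_file
service list
disk list
# storpool < input_file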
Display the available command line options:
# storpool --help
An error message with the possible options will be displayed if the shell command is incomplete or wrong:
# storpool attach
Error: incomplete command! Expected:
list - list the current attachments
timeout - seconds to wait for the client to appear
volume - specify a volume to attach
here - attach here
noWait - do not wait for the client
snapshot - specify a snapshot to attach
mode - specify the read/write mode
client - specify a client to attach the volume to
# storpool attach volume
Error: incomplete command! Expected:
volume - the volume to attach
Interactive shell help can be invoked by pressing the question mark key (?):
# storpool
StorPool> attach <?>
client - specify a client to attach the volume to {M}
here - attach here {M}
list - list the current attachments
mode - specify the read/write mode {M}
noWait - do not wait for the client {M}
snapshot - specify a snapshot to attach {M}
timeout - seconds to wait for the client to appear {M}
volume - specify a volume to attach {M}
Shell autocomplete, invoked by double-pressing the Tab key, will show the available options for the current step:
StorPool> attach <tab> <tab>
client here list mode noWait snapshot timeout volume
The StorPool shell can detect incomplete lines and suggest options:
# storpool
StorPool> attach <enter>
.................^
Error: incomplete command! Expected:
volume - specify a volume to attach
client - specify a client to attach the volume to
list - list the current attachments
here - attach here
mode - specify the read/write mode
snapshot - specify a snapshot to attach
timeout - seconds to wait for the client to appear
noWait - do not wait for the client
To exit the shell, use the quit or exit commands, or directly use the Ctrl+C or Ctrl+D keyboard shortcuts of your terminal.
To enter MultiCluster mode use:
StorPool> multiCluster on
[MC] StorPool>
For non-interactive mode use:
# storpool -M <command>
Note
All commands not relevant to multicluster will silently fall back to non-multicluster mode. For example, storpool -M service list will list only local services; the same applies to storpool -M disk list and storpool -M net list.
11.2. More information
12. CLI reference
For introduction and examples, see 11. CLI tutorial.
12.1. Location
The location submenu is used for configuring other StorPool sub-clusters in the same or a different location (17. Multi-site and multi-cluster). The location ID is the first part (left of the .) of the SP_CLUSTER_ID configured in the remote cluster.
For example, to add a location with SP_CLUSTER_ID=nzkr.b use:
# storpool location add nzkr StorPoolLab-Sofia
OK
To list the configured locations use:
# storpool location list
-----------------------------------------------
| id | name | rxBuf | txBuf |
-----------------------------------------------
| nzkr | StorPoolLab-Sofia | 85 KiB | 128 KiB |
-----------------------------------------------
To rename a location use:
# storpool location rename StorPoolLab-Sofia name StorPoolLab-Amsterdam
OK
To remove a location use:
# storpool location remove StorPoolLab-Sofia
OK
Note
This command will fail if there is an existing cluster or a remote bridge configured for this location.
To update the send or receive buffer sizes to values different from the defaults, use:
# storpool location update StorPoolLab-Sofia recvBufferSize 16M
OK
# storpool location update StorPoolLab-Sofia sendBufferSize 1M
OK
# storpool location list
-----------------------------------------------
| id | name | rxBuf | txBuf |
-----------------------------------------------
| nzkr | StorPoolLab-Sofia | 16 MiB | 1.0 MiB |
-----------------------------------------------
12.2. Cluster
The cluster submenu is used for configuring a new cluster for an already configured location. The cluster ID is the second part (right of the .) of the SP_CLUSTER_ID configured in the remote cluster. For example, to add the cluster b for the remote location nzkr use:
# storpool cluster add StorPoolLab-Sofia b
OK
To list the configured clusters use:
# storpool cluster list
--------------------------------------------------
| name | id | location |
--------------------------------------------------
| StorPoolLab-Sofia-cl1 | b | StorPoolLab-Sofia |
--------------------------------------------------
To remove a cluster use:
# storpool cluster remove StorPoolLab-Sofia b
12.3. Remote Bridge
The remoteBridge submenu is used to register or deregister a remote bridge for a configured location.
To register a remote bridge, use storpool remoteBridge register <location-name> <IP address> <public-key>; for example:
# storpool remoteBridge register StorPoolLab-Sofia 10.1.100.10 ju9jtefeb8idz.ngmrsntnzhsei.grefq7kzmj7zo.nno515u6ftna6
OK
This will register the StorPoolLab-Sofia location with an IP address of 10.1.100.10 and the above public key.
In case of a change in the IP address or the public key of a remote location, the remote bridge can be de-registered and then registered again with the required parameters, e.g.:
# storpool remoteBridge deregister 10.1.100.10
OK
# storpool remoteBridge register StorPoolLab-Sofia 78.90.13.150 8nbr9q162tjh.ahb6ueg16kk2.mb7y2zj2hn1ru.5km8qut54x7z
OK
A remote bridge might be registered with noCrypto in case of a secure interconnect between the clusters; a typical use case is a 17.1. Multicluster setup with other sub-clusters in the same datacenter.
To enable deferred deletion on unexport from the remote site, the minimumDeleteDelay flag should also be set. The format of the command is storpool remoteBridge register <location-name> <IP address> <public-key> minimumDeleteDelay <minimumDeleteDelay>, where the last parameter is a time period provided as X[smhd] - X is an integer and s, m, h, and d are seconds, minutes, hours, and days respectively.
For example, registering the remote bridge for the StorPoolLab-Sofia location with a minimumDeleteDelay of one day would look like this:
# storpool remoteBridge register StorPoolLab-Sofia 78.90.13.150 8nbr9q162tjh.ahb6ueg16kk2.mb7y2zj2hn1ru.5km8qut54x7z minimumDeleteDelay 1d
OK
After this operation, all snapshots sent from the remote cluster can be unexported later with the deleteAfter parameter set (check the Remote snapshots section). Any deleteAfter values lower than the minimumDeleteDelay will be overridden by the bridge in the remote cluster. All such events will be logged on the node with the active bridge in the remote cluster.
Check the 17.2. Multi site section of this user guide for more on deferred deletion.
To list all registered remote bridges use:
# storpool remoteBridge list
------------------------------------------------------------------------------------------------------------------------------
| ip | remote | minimumDeleteDelay | publicKey | noCrypto |
------------------------------------------------------------------------------------------------------------------------------
| 10.1.200.10 | StorPoolLab-Sofia | | nonwtmwsgdr2p.fos2qus4h1qdk.pnt9ozj8gcktj.d7b2aa24gsegn | 0 |
| 10.1.200.11 | StorPoolLab-Sofia | | jtgeaqhsmqzqd.x277oefofxbpm.bynb2krkiwg54.ja4gzwqdg925j | 0 |
------------------------------------------------------------------------------------------------------------------------------
To show the status of remote bridges use:
# storpool remoteBridge status
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| ip | clusterId | connectionState | connectionTime | reconnectCount | receivedExports | sentExports | lastError | lastErrno | errorTime | bytesSentSinceStart | bytesRecvSinceStart | bytesSentSinceConnect | bytesRecvSinceConnect |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 10.1.200.11 | d.b | connected | 2021-02-07 18:08:25 | 2 | 5 | 2 | socket error | Operation not permitted | 2021-02-07 17:58:58 | 210370560 | 242443328 | 41300272 | 75088624 |
| 10.1.200.10 | d.d | connected | 2021-02-07 17:51:42 | 1 | 7 | 2 | no error | No error information | - | 186118480 | 39063648 | 186118480 | 39063648 |
| 10.1.200.4 | e.n | connected | 2021-02-07 17:51:42 | 1 | 5 | 0 | no error | No error information | - | 117373472 | 316784 | 117373472 | 316784 |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
12.4. Network
To list basic details about the cluster network use:
# storpool net list
------------------------------------------------------------
| nodeId | flags | net 1 | net 2 |
------------------------------------------------------------
| 11 | uU + AJ | F4:52:14:76:9C:B0 | F4:52:14:76:9C:B0 |
| 12 | uU + AJ | 02:02:C9:3C:E3:80 | 02:02:C9:3C:E3:81 |
| 13 | uU + AJ | F6:52:14:76:9B:B0 | F6:52:14:76:9B:B1 |
| 14 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
| 15 | uU + AJ | 1A:60:00:00:00:0F | 1E:60:00:00:00:0F |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
u - packets recently received from this node
d - no packets recently received from this node
U - this node has enough votes to be in the quorum
D - this node does not have enough votes to be in the quorum
M - this node is being damped by the rest of the nodes in the cluster
+ - this node considers itself in the quorum
B - the connection to this node is through a backup link; check the cabling
A - this node is using hardware acceleration
J - the node uses jumbo frames
N - a non-voting node
12.5. Server
To list the nodes that are configured as StorPool servers and their storpool_server
instances use:
# storpool server list
cluster running, mgmt on node 11
server 11.0 running on node 11
server 12.0 running on node 12
server 13.0 running on node 13
server 14.0 running on node 14
server 11.1 running on node 11
server 12.1 running on node 12
server 13.1 running on node 13
server 14.1 running on node 14
To get more information about which storage devices are provided by a particular server, use storpool server <ID> disk list:
# storpool server 11 disk list
disk | server | size | used | est.free | % | free entries | on-disk size | allocated objects | errors | flags
1103 | 11.0 | 447 GiB | 3.1 GiB | 424 GiB | 1 % | 1919912 | 20 MiB | 40100 / 480000 | 0 / 0 |
1104 | 11.0 | 447 GiB | 3.1 GiB | 424 GiB | 1 % | 1919907 | 20 MiB | 40100 / 480000 | 0 / 0 |
1111 | 11.0 | 465 GiB | 2.6 GiB | 442 GiB | 1 % | 494977 | 20 MiB | 40100 / 495000 | 0 / 0 |
1112 | 11.0 | 365 GiB | 2.6 GiB | 346 GiB | 1 % | 389977 | 20 MiB | 40100 / 390000 | 0 / 0 |
1125 | 11.0 | 931 GiB | 2.6 GiB | 894 GiB | 0 % | 974979 | 20 MiB | 40100 / 975000 | 0 / 0 |
1126 | 11.0 | 931 GiB | 2.6 GiB | 894 GiB | 0 % | 974979 | 20 MiB | 40100 / 975000 | 0 / 0 |
----------------------------------------------------------------------------------------------------------------------------------------
6 | 1.0 | 3.5 TiB | 16 GiB | 3.4 TiB | 0 % | 6674731 | 122 MiB | 240600 / 3795000 | 0 / 0 |
Note
Without specifying an instance, the first instance is assumed - 11.0 as in the above example. The second, third, and fourth storpool_server instances would be 11.1, 11.2, and 11.3 respectively.
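Assuming the instance notation above can be passed directly to the command, the disks of the second server instance on the same node could be listed like this (an illustration, not output from a real cluster):
# storpool server 11.1 disk list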
To list the servers that are blocked and could not join the cluster for some reason:
# storpool server blocked
cluster waiting, mgmt on node 12
server 11.0 waiting on node 11 missing:1201,1202,1203,1204,1212,1221,1222,1223,1224,1225,1226,1301,1302,1303,1304,1311,1312,1321,1322,1323,1324,1325,1326,4043 pending:1103,1104,1111,1112,1125,1126
server 12.0 down on node 12
server 13.0 down on node 13
server 14.0 waiting on node 14 missing:1201,1202,1203,1204,1212,1221,1222,1223,1224,1225,1226,1301,1302,1303,1304,1311,1312,1321,1322,1323,1324,1325,1326,4043 pending:1403,1404,1411,1412,1421,1423
server 11.1 waiting on node 11 missing:1201,1202,1203,1204,1212,1221,1222,1223,1224,1225,1226,1301,1302,1303,1304,1311,1312,1321,1322,1323,1324,1325,1326,4043 pending:1101,1102,1121,1122,1123,1124
server 12.1 down on node 12
server 13.1 down on node 13
server 14.1 waiting on node 14 missing:1201,1202,1203,1204,1212,1221,1222,1223,1224,1225,1226,1301,1302,1303,1304,1311,1312,1321,1322,1323,1324,1325,1326,4043 pending:1401,1402,1424,1425,1426
12.6. Fault sets
Fault sets are a way to instruct StorPool to use the drives in a group of nodes for only one replica of the data, when those nodes are expected to fail simultaneously. Some examples would be:
a multi-node chassis;
multiple nodes in the same rack backed by the same power supply;
nodes connected to the same set of switches, and so on.
To define a fault set, only a name and a set of server nodes are needed:
# storpool faultSet chassis_1 addServer 11 addServer 12
OK
To list defined fault sets use:
# storpool faultSet list
-------------------------------------------------------------------
| name | servers |
-------------------------------------------------------------------
| chassis_1 | 11 12 |
-------------------------------------------------------------------
To remove a fault set use:
# storpool faultSet chassis_1 delete chassis_1
Attention
A new fault set definition has effect only on newly created volumes. To change the configuration of already created volumes, a re-balance operation is required; see Balancer for more details on re-balancing a cluster after defining fault sets.
12.7. Services
Check the state of all services presently running in the cluster and their uptime:
# storpool service list
cluster running, mgmt on node 12
mgmt 11 running on node 11 ver 20.00.18, started 2022-09-08 18:23:36, uptime 1 day 00:53:43
mgmt 12 running on node 12 ver 20.00.18, started 2022-09-08 18:23:35, uptime 1 day 00:53:44 active
server 11.0 running on node 11 ver 20.00.18, started 2022-09-08 18:23:45, uptime 1 day 00:53:34
server 12.0 running on node 12 ver 20.00.18, started 2022-09-08 18:23:41, uptime 1 day 00:53:38
server 13.0 running on node 13 ver 20.00.18, started 2022-09-08 18:23:35, uptime 1 day 00:53:44
server 14.0 running on node 14 ver 20.00.18, started 2022-09-08 18:23:39, uptime 1 day 00:53:40
server 11.1 running on node 11 ver 20.00.18, started 2022-09-08 18:23:45, uptime 1 day 00:53:34
server 12.1 running on node 12 ver 20.00.18, started 2022-09-08 18:23:44, uptime 1 day 00:53:35
server 13.1 running on node 13 ver 20.00.18, started 2022-09-08 18:23:37, uptime 1 day 00:53:42
server 14.1 running on node 14 ver 20.00.18, started 2022-09-08 18:23:39, uptime 1 day 00:53:40
client 11 running on node 11 ver 20.00.18, started 2022-09-08 18:23:33, uptime 1 day 00:53:46
client 12 running on node 12 ver 20.00.18, started 2022-09-08 18:23:34, uptime 1 day 00:53:45
client 13 running on node 13 ver 20.00.18, started 2022-09-08 18:23:32, uptime 1 day 00:53:47
client 14 running on node 14 ver 20.00.18, started 2022-09-08 18:23:32, uptime 1 day 00:53:47
client 15 running on node 15 ver 20.00.18, started 2020-01-09 10:46:17, uptime 08:31:02
bridge 11 running on node 11 ver 20.00.18, started 2022-09-08 18:23:34, uptime 1 day 00:53:45 active
bridge 12 running on node 12 ver 20.00.18, started 2022-09-08 18:23:34, uptime 1 day 00:53:45
cntrl 11 running on node 11 ver 20.00.18, started 2022-09-08 18:23:35, uptime 1 day 00:53:44
cntrl 12 running on node 12 ver 20.00.18, started 2022-09-08 18:23:34, uptime 1 day 00:53:45
cntrl 13 running on node 13 ver 20.00.18, started 2022-09-08 18:23:31, uptime 1 day 00:53:48
cntrl 14 running on node 14 ver 20.00.18, started 2022-09-08 18:23:31, uptime 1 day 00:53:48
iSCSI 12 running on node 12 ver 20.00.18, started 2022-09-08 18:23:32, uptime 1 day 00:53:47
iSCSI 13 running on node 13 ver 20.00.18, started 2022-09-08 18:23:32, uptime 1 day 00:53:47
12.8. Disk
12.8.1. Disk list main info
The disk submenu is for querying or managing the available disks in the cluster.
To display all available disks in all server instances in the cluster:
# storpool disk list
disk | server | size | used | est.free | % | free entries | on-disk size | allocated objects | errors | flags
1101 | 11.1 | 893 GiB | 2.6 GiB | 857 GiB | 0 % | 3719946 | 664 KiB | 41000 / 930000 | 0 / 0 |
1102 | 11.1 | 446 GiB | 2.6 GiB | 424 GiB | 1 % | 1919946 | 664 KiB | 41000 / 480000 | 0 / 0 |
1103 | 11.0 | 893 GiB | 2.6 GiB | 857 GiB | 0 % | 3719948 | 660 KiB | 41000 / 930000 | 0 / 0 |
1104 | 11.0 | 446 GiB | 2.6 GiB | 424 GiB | 1 % | 1919946 | 664 KiB | 41000 / 480000 | 0 / 0 |
1105 | 11.0 | 446 GiB | 2.6 GiB | 424 GiB | 1 % | 1919947 | 664 KiB | 41000 / 480000 | 0 / 0 |
1111 | 11.1 | 930 GiB | 2.6 GiB | 893 GiB | 0 % | 974950 | 716 KiB | 41000 / 975000 | 0 / 0 |
1112 | 11.1 | 930 GiB | 2.6 GiB | 893 GiB | 0 % | 974949 | 736 KiB | 41000 / 975000 | 0 / 0 |
1113 | 11.1 | 930 GiB | 2.6 GiB | 893 GiB | 0 % | 974943 | 760 KiB | 41000 / 975000 | 0 / 0 |
1114 | 11.1 | 930 GiB | 2.6 GiB | 893 GiB | 0 % | 974937 | 844 KiB | 41000 / 975000 | 0 / 0 |
[snip]
1425 | 14.1 | 931 GiB | 2.6 GiB | 894 GiB | 0 % | 974980 | 20 MiB | 40100 / 975000 | 0 / 0 |
1426 | 14.1 | 931 GiB | 2.6 GiB | 894 GiB | 0 % | 974979 | 20 MiB | 40100 / 975000 | 0 / 0 |
----------------------------------------------------------------------------------------------------------------------------------------
47 | 8.0 | 30 TiB | 149 GiB | 29 TiB | 0 % | 53308967 | 932 MiB | 1844600 / 32430000 | 0 / 0 |
To mark a device as temporarily unavailable:
# storpool disk 1111 eject
OK
Ejecting a disk from the cluster will stop the data replication for this disk, but will keep the metadata about the placement groups in which it participated and which volume objects it contained.
Note
The command above will refuse to eject the disk if this operation would lead to volumes or snapshots going into the down state (usually when the last up-to-date copy for some parts of a volume/snapshot is on this disk).
This drive will be shown as missing in the storpool disk list output, e.g.:
# storpool disk list
disk | server | size | used | est.free | % | free entries | on-disk size | allocated objects | errors | flags
[snip]
1422 | 14.1 | - | - | - | - % | - | - | - / - | - / - |
[snip]
Attention
This operation leads to degraded redundancy for all volumes and snapshots that have data on the ejected disk.
Such a disk will not return to the cluster by itself, and will have to be manually reinserted by removing its EJECTED flag with storpool_initdisk -r /dev/$path.
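For example, assuming the ejected disk 1422 from the output above is backed by the partition /dev/sdh1 on its server (a placeholder path; check on the node which device actually holds this disk), returning it and confirming it is back could look like:
# storpool_initdisk -r /dev/sdh1
# storpool disk list | grep 1422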
12.8.2. Disk list additional info
To display additional info regarding disks:
# storpool disk list info
disk | server | device | model | serial | description | flags |
1101 | 11.1 | 0000:04:00.0-p1 | SAMSUNG MZQLB960HAJR-00007 | S437NF0M500149 | | S |
1102 | 11.1 | /dev/sdj1 | Micron_M500DC_MTFDDAK480MBB | 14250C6368E5 | | S |
1103 | 11.0 | /dev/sdi1 | SAMSUNG_MZ7LH960HAJR-00005 | S45NNE0M229767 | | S |
1104 | 11.0 | /dev/sdd1 | Micron_M500DC_MTFDDAK480MBB | 14250C63689B | | S |
1105 | 11.0 | /dev/sdc1 | Micron_M500DC_MTFDDAK480MBB | 14250C6368EC | | S |
1111 | 11.1 | /dev/sdl1 | Hitachi_HUA722010CLA330 | JPW9K0N13243ZL | | W |
1112 | 11.1 | /dev/sda1 | Hitachi_HUA722010CLA330 | JPW9J0N13LJEEV | | W |
1113 | 11.1 | /dev/sdb1 | Hitachi_HUA722010CLA330 | JPW9J0N13N694V | | W |
1114 | 11.1 | /dev/sdm1 | Hitachi_HUA722010CLA330 | JPW9K0N132R7HL | | W |
[snip]
1425 | 14.1 | /dev/sdm1 | Hitachi_HDS721050CLA360 | JP1532FR1BY75C | | W |
1426 | 14.1 | /dev/sdh1 | Hitachi_HUA722010CLA330 | JPW9K0N13RS95L | | W, J |
To set additional information for some of the disks, shown in the output of storpool disk list info:
# storpool disk 1111 description HBA2_port7
OK
# storpool disk 1104 description FAILING_SMART
OK
12.8.3. Disk list server internal info
To display internal statistics about each disk:
# storpool disk list internal
--------------------------------------------------------------------------------------------------------------------------------------------------------
| disk | server | aggregate scores | wbc pages | scrub bw | scrub ETA | last scrub completed |
--------------------------------------------------------------------------------------------------------------------------------------------------------
| 1101 | 11.1 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 18:23:07 |
| 1102 | 11.1 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 18:23:07 |
| 1103 | 11.0 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 18:23:08 |
| 1104 | 11.0 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 18:23:09 |
| 1105 | 11.1 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 18:23:10 |
| 1111 | 11.0 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 18:23:12 |
| 1112 | 11.1 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 18:23:15 |
| 1113 | 11.1 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 18:23:17 |
| 1114 | 11.1 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 18:23:13 |
[snip]
| 1425 | 14.1 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 18:23:15 |
| 1426 | 14.1 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 18:23:19 |
--------------------------------------------------------------------------------------------------------------------------------------------------------
The sections in this output explained:
aggregate scores - Internal values representing how much data is about to be defragmented on the particular drive. Usually between 0 and 1; on heavily loaded clusters the rightmost column might get into the hundreds or even thousands if some drives are severely loaded.
wbc pages - Internal statistics for each drive that has the write-back cache or journaling in StorPool enabled.
scrub bw - The scrubbing speed in MB/s.
scrub ETA - The approximate time/date when the scrubbing operation will complete for this drive.
last scrub completed - The last time/date when the drive was scrubbed.
Note
The default installation includes a cron job on the management nodes that starts a scrubbing job for one drive per node. You can increase the number of disks that are scrubbing in parallel per node (the example is for four drives) by running the following:
# . /usr/lib/storpool/storpool_confget.sh
# storpool_q -d '{"set":{"scrubbingDiskPerNode":"4"}}' KV/Set/conf
And you can see the number of drives that are scrubbing per node with:
# . /usr/lib/storpool/storpool_confget.sh
# storpool_q KV/Get/conf/scrubbingDiskPerNode | jq -re '.data.pairs.scrubbingDiskPerNode'
To configure a local or remote recovery override for this particular disk, different from the ones configured with mgmtConfig, use:
# storpool disk 1111 maxRecoveryRequestsOverride local 2
OK
# storpool disk 1111 maxRecoveryRequestsOverride remote 4
OK
To remove a configured override use:
# storpool disk 1111 maxRecoveryRequestsOverride remote clear
OK
This will remove the override, and the default configured for the whole cluster (see 12.22. Management Configuration) will take precedence.
12.8.4. Disk list performance info
To display performance-related in-server statistics for each disk, use:
# storpool disk list perf
| latencies | thresholds | times exceeded | flags
disk | disk (avg) | disk (max) | jrn (avg) | jrn (max) | disk | journal | disk | journal |
2301 | 0.299ms | 0.400ms | - | - | 0.000ms | - | 0 | - |
2302 | 0.304ms | 0.399ms | - | - | 0.000ms | - | 0 | - |
2303 | 0.316ms | 0.426ms | - | - | 0.000ms | - | 0 | - |
[snip]
2621 | 4.376ms | 4.376ms | 0.029ms | 0.029ms | 0.000ms | 0.000ms | 0 | 0 |
2622 | 4.333ms | 4.333ms | 0.025ms | 0.025ms | 0.000ms | 0.000ms | 0 | 0 |
Note
Global latency thresholds are configured through the mgmtConfig section.
To configure a single disk latency threshold override use:
# storpool disk 2301 latencyLimitOverride disk 500
OK
# storpool disk list perf
| latencies | thresholds | times exceeded | flags
disk | disk (avg) | disk (max) | jrn (avg) | jrn (max) | disk | journal | disk | journal |
2301 | 0.119ms | 0.650ms | - | - | 500.000ms | - | 0 | - | D
[snip]
The D flag means there is a disk latency override, visible in the thresholds section.
Similarly, to configure a single disk journal latency threshold override, use:
# storpool disk 2621 latencyLimitOverride journal 100
OK
# storpool disk list perf
| latencies | thresholds | times exceeded | flags
disk | disk (avg) | disk (max) | jrn (avg) | jrn (max) | disk | journal | disk | journal |
[snip]
2621 | 8.489ms | 13.704ms | 0.052ms | 0.669ms | 0.000ms | 100.000ms | 0 | 0 | J
The J flag means there is a disk journal latency override, visible in the thresholds section.
To override a single disk so that it no longer obeys the global limit, use:
# storpool disk 2301 latencyLimitOverride disk unlimited
OK
This will show the disk threshold as unlimited:
# storpool disk list perf
| latencies | thresholds | times exceeded | flags
disk | disk (avg) | disk (max) | jrn (avg) | jrn (max) | disk | journal | disk | journal |
2301 | 0.166ms | 0.656ms | - | - | unlimited | - | 0 | - | D
To clear an override and leave the global limit in effect, use:
# storpool disk 2601 latencyLimitOverride disk off
OK
# storpool disk 2621 latencyLimitOverride journal off
OK
If a disk was ejected due to excessive latency, the server will keep a log of the last 128 requests sent to the disk. To list them, use:
# storpool disk 2601 ejectLog
log creation time | time of first event
2022-03-31 17:50:16 | 2022-03-31 17:31:08 +761,692usec
req# | start | end | latency | addr | size | op
1 | +0us | +424us | 257us | 0x0000000253199000 | 128 KiB | DISK_OP_READ
[snip]
126 | +1147653582us | +1147653679us | 97us | 0x0000000268EBE000 | 12 KiB | DISK_OP_WRITE_FUA
127 | +1147654920us | +1147655192us | 272us | 0x0000000012961000 | 128 KiB | DISK_OP_WRITE_FUA
total | maxTotal | limit | generation | times exceeded (for this eject)
23335us | 100523us | 1280us | 15338 | 1
The same data is available if a disk journal was ejected after breaching the threshold; to list it, use:
# storpool disk 2621 ejectLog journal
[snip]
(the output is similar to that of the disk ejectLog above)
12.8.5. Ejecting disks and internal server tests
When the server controlling a disk notices issues with it (a write error, or a request stalled for longer than a predefined threshold), the disk is also marked as “test pending”. This is done because many such errors are transient, for example when a disk drive (or its controller) briefly stalls a request.
An eject option is available for manually initiating such a test; it will flag the disk as requiring a test and will eject it. The server instance will then perform a quick set of non-intrusive read-write tests on the disk and will return it to the cluster if all tests pass. Example:
# storpool disk 2331 eject test
OK
The tests usually take from a couple of seconds up to a minute. To check the results from the last test, use:
# storpool disk 2331 testInfo
times tested | test pending | read speed | write speed | read max latency | write max latency | failed
1 | no | 1.0 GiB/sec | 971 MiB/sec | 8 msec | 4 msec | no
If the disk was already marked for testing, the “now” option will skip the test on the next attempt to re-open the disk:
# storpool disk 2301 eject now
OK
Attention
Note that this is exactly the same as “eject”; the disk will have to be manually returned to the cluster.
To mark a disk as unavailable by first re-balancing all data out to the other disks in the cluster and only then ejecting it:
# storpool disk 1422 softEject
OK
Balancer auto mode currently OFF. Must be ON for soft-eject to complete.
Note
This option requires the StorPool balancer to be started after the above command was issued; see more in the Balancer section below.
To remove a disk from the list of reported disks and all placement groups it participates in:
# storpool disk 1422 forget
OK
To get detailed information about a given disk:
# storpool disk 1101 info
agAllocated | agCount | agFree | agFreeing | agFull | agMaxSizeFull | agMaxSizePartial | agPartial
7 | 462 | 455 | 1 | 0 | 0 | 1 | 1
entriesAllocated | entriesCount | entriesFree | sectorsCount
50 | 1080000 | 1079950 | 501215232
objectsAllocated | objectsCount | objectsFree | objectStates
18 | 270000 | 269982 | ok:18
serverId | 1
id | objectsCount | onDiskSize | storedSize | objectStates
#bad_id | 1 | 0 B | 0 B | ok:1
#clusters | 1 | 8.0 KiB | 768 B | ok:1
#drive_state | 1 | 8.0 KiB | 4.0 B | ok:1
#drives | 1 | 100 KiB | 96 KiB | ok:1
#iscsi_config | 1 | 12 KiB | 8.0 KiB | ok:1
[snip]
To get detailed information about the objects on a particular disk:
# storpool disk 1101 list
object name | stored size | on-disk size | data version | object state | parent volume
#bad_id:0 | 0 B | 0 B | 1480:2485 | ok (1) |
#clusters:0 | 768 B | 8.0 KiB | 711:992 | ok (1) |
#drive_state:0 | 4.0 B | 8.0 KiB | 1475:2478 | ok (1) |
#drives:0 | 96 KiB | 100 KiB | 1480:2484 | ok (1) |
[snip]
test:4094 | 0 B | 0 B | 0:0 | ok (1) |
test:4095 | 0 B | 0 B | 0:0 | ok (1) |
----------------------------------------------------------------------------------------------------
4115 objects | 394 KiB | 636 KiB | | |
To get detailed information about the active requests that the disk is performing at the moment:
# storpool disk 1101 activeRequests
-----------------------------------------------------------------------------------------------------------------------------------
| request ID | request IDX | volume | address | size | op | time active |
-----------------------------------------------------------------------------------------------------------------------------------
| 9226469746279625682:285697101441249070 | 9 | testvolume | 85276782592 | 4.0 KiB | read | 0 msec |
| 9226469746279625682:282600876697431861 | 13 | testvolume | 96372936704 | 4.0 KiB | read | 0 msec |
| 9226469746279625682:278097277070061367 | 19 | testvolume | 46629707776 | 4.0 KiB | read | 0 msec |
| 9226469746279625682:278660227023482671 | 265 | testvolume | 56680042496 | 4.0 KiB | write | 0 msec |
-----------------------------------------------------------------------------------------------------------------------------------
To issue retrim operation on a disk (available for SSD disks only):
# storpool disk 1101 retrim
OK
To start, pause or continue a scrubbing operation for a disk:
# storpool disk 1101 scrubbing start
OK
# storpool disk 1101 scrubbing pause
OK
# storpool disk 1101 scrubbing continue
OK
Note
Use storpool disk list internal to check the status of a running scrubbing operation, or to see when the last scrubbing operation for this disk completed.
12.9. Placement Groups
The placement groups are predefined sets of disks, over which volume objects will be replicated. It is possible to specify which individual disks to add to the group.
To display the defined placement groups in the cluster:
# storpool placementGroup list
name
default
hdd
ssd
To display details about a placement group:
# storpool placementGroup ssd list
type | id
disk | 1101 1201 1301 1401
Creating a new placement group or extending an existing one requires specifying its name and providing one or more disks to be added:
# storpool placementGroup ssd addDisk 1102
OK
# storpool placementGroup ssd addDisk 1202
OK
# storpool placementGroup ssd addDisk 1302 addDisk 1402
OK
# storpool placementGroup ssd list
type | id
disk | 1101 1102 1201 1202 1301 1302 1401 1402
To remove one or more disks from a placement group use:
# storpool placementGroup ssd rmDisk 1402
OK
# storpool placementGroup ssd list
type | id
disk | 1101 1102 1201 1202 1301 1302 1401
To rename a placement group:
# storpool placementGroup ssd rename M500DC
OK
Unused placement groups can be removed. To avoid accidents, the name of the group must be entered twice:
# storpool placementGroup ssd delete ssd
OK
12.10. Volumes
Volumes are the basic service of the StorPool storage system. A volume always has a name and a certain size. It can be read from and written to, and can be attached to hosts as a read-only or read-write block device under the /dev/storpool directory. A volume may have one or more tags created or changed in the form name=value. The volume name is a string consisting of one or more of the allowed characters - upper and lower Latin letters (a-z, A-Z), numbers (0-9), and the delimiters dot (.), colon (:), dash (-), and underscore (_). The same rules apply for the keys and values used for the volume tags. The volume name, including tags, cannot exceed 200 bytes.
When a volume is created, at minimum the <volumeName>, the <template> (or placement/replication details), and its size must be specified:
# storpool volume testvolume size 100G template hybrid
Additional parameters that can be used or overridden (a combined example follows this list):
placeAll - place all objects in placementGroup (default value: default)
placeTail - name of placementGroup for reader (default value: same as placeAll value)
placeHead - place the third replica in a different placementGroup (default value: same as placeAll value)
template - use a template with preconfigured placement, replication and/or limits (please check the Templates section for more); usage of templates is seriously encouraged due to easier tracking and capacity management
parent - use a snapshot as a parent for this volume
reuseServer - place multiple copies on the same server
baseOn - use a parent volume; this will create a transient snapshot used as a parent (please check the Snapshots section for more)
iops - set the maximum IOPS limit for this volume (in IOPS)
bw - set the maximum bandwidth limit (in MB/s)
tag - set a tag for this volume in the form name=value
create - create the volume, fail if it exists (optional for now)
update - update the volume, fail if it does not exist (optional for now)
limitType - specify whether the iops and bw limits ought to be for the total size of the block device or per each GiB (one of “total” or “perGiB”)
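As an illustration only (the volume name, tag, and limit values are arbitrary), several of these parameters can be combined in a single invocation:
# storpool volume vm-disk1 create size 40G template hybrid tag vm=web01 iops 5000 bw 200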
The create parameter is useful in scripts when you have to prevent an involuntary update of a volume:
# storpool volume test create template hybrid
OK
# storpool volume test create size 200G template hybrid
Error: Volume 'test' already exists
A statement with the update parameter will fail with an error if the volume does not exist:
# storpool volume test update template hybrid size +100G
OK
# storpool volume test1 update template hybrid
Error: volume 'test1' does not exist
To list all available volumes:
# storpool volume list
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| volume | size | rdnd. | placeHead | placeAll | placeTail | iops | bw | parent | template | flags | tags |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| testvolume | 100 GiB | 3 | ultrastar | ultrastar | ssd | - | - | testvolume@35691 | hybrid | | name=value |
| testvolume_8_2 | 100 GiB | 8+2 | nvme | nvme | nvme | - | - | testvolume_8_2@35693 | nvme | | name=value |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Flags:
R - allow placing two disks within a replication chain onto the same server
t - volume move target. Waiting for the move to finish
G - IOPS and bandwidth limits are per GiB and depends on volume/snapshot size
To list volumes exported to other sub-clusters in the multi-cluster:
# storpool volume list exports
---------------------------------
| remote | volume | globalId |
---------------------------------
| Lab-D-cl2 | test | d.n.buy |
---------------------------------
To list volumes exported from other sub-clusters to this one in a multi-cluster setup:
# storpool volume list remote
--------------------------------------------------------------------------
| location | remoteId | name | size | creationTimestamp | tags |
--------------------------------------------------------------------------
| Lab-D | d.n.buy | test | 137438953472 | 2020-05-27 11:57:38 | |
--------------------------------------------------------------------------
Note
Once attached, a remotely exported volume will no longer be visible with volume list remote, even if the export is still visible in the remote cluster with volume list exports. Every export invocation in the local cluster is used up by an attach in the remote cluster.
To get an overview of all volumes and snapshots and their state in the system use:
# storpool volume status
----------------------------------------------------------------------------------------------------------------------------------------------------
| volume | size | rdnd. | tags | alloc % | stored | on disk | syncing | missing | status | flags | drives down |
----------------------------------------------------------------------------------------------------------------------------------------------------
| testvolume | 100 GiB | 3 | name=value | 0.0 % | 0 B | 0 B | 0 B | 0 B | up | | |
| testvolume@35691 | 100 GiB | 3 | | 100.0 % | 100 GiB | 317 GiB | 0 B | 0 B | up | S | |
----------------------------------------------------------------------------------------------------------------------------------------------------
| 2 volumes | 200 GiB | | | 50.0 % | 100 GiB | 317 GiB | 0 B | 0 B | | | |
----------------------------------------------------------------------------------------------------------------------------------------------------
Flags:
S - snapshot
B - balancer blocked on this volume
D - decreased redundancy (degraded)
M - migrating data to a new disk
R - allow placing two disks within a replication chain onto the same server
t - volume move target. Waiting for the move to finish
C - disk placement constraints violated, rebalance needed
The columns in this output are:
volume - name of the volume or snapshot (see flags below)
size - provisioned volume size, the visible size inside a VM for example
rdnd. - number of copies for this volume or its erasure coding scheme
tags - all custom key=value tags configured for this volume or snapshot
alloc % - how much space was used on this volume, in percent
stored - space allocated on this volume
on disk - the size allocated on all drives in the cluster after replication and the overhead from data protection
syncing - how much data is not in sync after a drive or server was missing; the data is recovered automatically once the missing drive or server is back in the cluster
missing - how much data is not available for this volume when the volume is with status down, see status below
status - the status of the volume, which could be one of:
up - all copies are available
down - none of the copies are available for some parts of the volume
up soon - all copies are available and the volume will soon get up
flags - flags denoting features of this volume:
S - stands for snapshot, which is essentially a read-only (frozen) volume
B - used to denote that the balancer is blocked for this volume (usually when some of the drives are missing)
D - displayed when some of the copies are either not available or outdated and the volume is with decreased redundancy
M - displayed when changing the replication or when a cluster re-balance is in progress
R - displayed when the policy for keeping copies on different servers is overridden
C - displayed when the volume or snapshot placement constraints are violated
drives down - displayed when the volume is in the down state, listing the drives required to get the volume back up
Sizes are in B, KiB, MiB, GiB, TiB, or PiB.
To get just the status data from the storpool_controller services in the cluster, without any info about stored or on-disk size, etc., use:
# storpool volume quickStatus
----------------------------------------------------------------------------------------------------------------------------------------------------
| volume | size | rdnd. | tags | alloc % | stored | on disk | syncing | missing | status | flags | drives down |
----------------------------------------------------------------------------------------------------------------------------------------------------
| testvolume | 100 GiB | 3 | name=value | 0.0 % | 0 B | 0 B | 0 B | 0 B | up | | |
| testvolume@35691 | 100 GiB | 3 | | 0.0 % | 0 B | 0 B | 0 B | 0 B | up | S | |
----------------------------------------------------------------------------------------------------------------------------------------------------
| 2 volumes | 200 GiB | | | 0.0 % | 0 B | 0 B | 0 B | 0 B | | | |
----------------------------------------------------------------------------------------------------------------------------------------------------
Note
The quickStatus command has less of an impact on the storpool_server services, and thus on end-user operations, because the gathered data does not include the per-volume detailed storage stats provided with status.
To check the estimated space used by the volumes in the system, use:
# storpool volume usedSpace
-----------------------------------------------------------------------------------------
| volume | size | rdnd. | stored | used | missing info |
-----------------------------------------------------------------------------------------
| testvolume | 100 GiB | 3 | 1.9 GiB | 100 GiB | 0 B |
-----------------------------------------------------------------------------------------
The columns explained:
volume - name of the volume
size - the provisioned size of this volume
rdnd. - number of copies for this volume or its erasure coding scheme
stored - how much data is stored for this volume (without referring to any parent snapshots)
used - how much data has been written (including the data written in parent snapshots)
missing info - if this value is anything other than 0 B, probably some of the storpool_controller services in the cluster are not running correctly
Note
The used column shows how much data is accessible and reserved for this volume.
To list the target disk sets and objects of a volume:
# storpool volume testvolume list
volume testvolume
size 100 GiB
replication 3
placeHead hdd
placeAll hdd
placeTail ssd
target disk sets:
0: 1122 1323 1203
1: 1424 1222 1301
2: 1121 1324 1201
[snip]
object: disks
0: 1122 1323 1203
1: 1424 1222 1301
2: 1121 1324 1201
[snip]
Hint
In this example the volume has hybrid placement, with two copies on HDDs and one copy on SSDs (the rightmost disk sets column). The target disk sets are lists of triplets of drives in the cluster used as a template for the actual objects of the volume.
To get detailed info about the disks used for this volume and the number of objects on each of them use:
# storpool volume testvolume info
diskId | count
1101 | 200
1102 | 200
1103 | 200
[snip]
chain | count
1121-1222-1404 | 25
1121-1226-1303 | 25
1121-1226-1403 | 25
[snip]
diskSet | count
218-313-402 | 3
218-317-406 | 3
219-315-402 | 3
Note
The order of the diskSet is not by placeHead, placeAll, placeTail; check the actual order in the storpool volume <volumename> list output. The reason is to count similar diskSets with a different order in the same slot, i.e. [101, 201, 301] is accounted as the same diskSet as [201, 101, 301].
To rename a volume use:
# storpool volume testvolume rename newvolume
OK
Attention
Changing the name of a volume will not wait for clients that have this volume attached to update the name of the symlink. Always use client sync for all clients with the volume attached.
To add a tag for a volume:
# storpool volume testvolume tag name=value
To change a tag for a volume use:
# storpool volume testvolume tag name=newvalue
To remove a tag just set it to an empty value:
# storpool volume testvolume tag name=
To resize a volume up:
# storpool volume testvolume size +1G
OK
To shrink a volume (resize down):
# storpool volume testvolume size 50G shrinkOk
Attention
Shrinking a StorPool volume changes the size of the block device, but does not adjust the size of the LVM or filesystem contained in the volume. Failing to shrink the filesystem or LVM prior to shrinking the StorPool volume would result in data loss.
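A minimal sketch of the safe order of operations, assuming an ext4 filesystem sits directly on the volume and has been unmounted first (resize2fs and the sizes here are only illustrative; other filesystems and LVM have their own resize procedures):
# umount /dev/storpool/testvolume
# e2fsck -f /dev/storpool/testvolume
# resize2fs /dev/storpool/testvolume 50G
# storpool volume testvolume size 50G shrinkOk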
To delete a volume use:
# storpool volume vol1 delete vol1
Note
To avoid accidents, the volume name must be entered twice. As a safety precaution, attached volumes cannot be deleted even when not in use; more in Attachments.
A volume based on a snapshot can be converted to a stand-alone volume. For example, the testvolume below is based on an anonymous snapshot:
# storpool_tree
StorPool
`-testvolume@37126
`-testvolume
To rebase it against root (known also as “promote”) use:
# storpool volume testvolume rebase
OK
# storpool_tree
StorPool
`- testvolume@255 [snapshot]
`- testvolume [volume]
The rebase operation can also target a particular snapshot from a chain of parent snapshots of this volume:
# storpool_tree
StorPool
`- testvolume-snap1 [snapshot]
`- testvolume-snap2 [snapshot]
`- testvolume-snap3 [snapshot]
`- testvolume [volume]
# storpool volume testvolume rebase testvolume-snap2
OK
After the operation the volume is directly based on testvolume-snap2 and includes all changes from testvolume-snap3:
# storpool_tree
StorPool
`- testvolume-snap1 [snapshot]
`- testvolume-snap2 [snapshot]
|- testvolume [volume]
`- testvolume-snap3 [snapshot]
To back up a volume named testvolume to a configured remote location StorPoolLab-Sofia use:
# storpool volume testvolume backup StorPoolLab-Sofia
OK
After this operation a temporary snapshot will be created and transferred to the StorPoolLab-Sofia location. After the transfer completes, the local temporary snapshot will be deleted and the remote snapshot will be visible as exported from StorPoolLab-Sofia; see Remote Snapshots for more on working with snapshot exports.
When backing up a volume, one or more tags may be applied to the remote snapshot, as in the example below:
# storpool volume testvolume backup StorPoolLab-Sofia tag key=value # [tag key2=value2]
OK
To move a volume to a different cluster in a multicluster environment (more on clusters here) use:
# storpool volume testvolume moveToRemote Lab-D-cl2 # onAttached export
Note
Moving a volume to a remote cluster will fail if the volume is attached on a local host. What to do in this case can be specified with the onAttached parameter, as shown in the comment in the example above. More info on volume move is available in 17.13. Volume and snapshot move.
12.11. Snapshots
Snapshots are read-only point-in-time images of volumes. They are created once and cannot be changed. They can be attached to hosts as read-only block devices under /dev/storpool. Volumes and snapshots share the same name-space, thus their names are unique within a StorPool cluster. Volumes can be based on snapshots. Such volumes contain only the changes since the snapshot was taken. After a volume is created from a snapshot, writes are recorded within the volume. Reads from the volume may be served by the volume or by its parent snapshot, depending on whether the volume contains changed data for the read request or not.
To create an unnamed (known also as anonymous) snapshot of a volume use:
# storpool volume testvolume snapshot
OK
This will create a snapshot named testvolume@<ID>, where ID is a unique serial number. Note that any tags on the volume will not be propagated to the snapshot; to set tags on the snapshot at creation time use:
# storpool volume testvolume tag key=value snapshot
To create a named snapshot of a volume use:
# storpool volume testvolume snapshot testsnap
OK
Again to directly set tags:
# storpool volume testvolume snapshot testsnapplustags tag key=value
To remove a tag on a snapshot:
# storpool snapshot testsnapplustags tag key=
To list the snapshots use:
# storpool snapshot list
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| snapshot | size | rdnd. | placeHead | placeAll | placeTail | created on | volume | iops | bw | parent | template | flags | targetDeleteDate | tags |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| testsnap | 100 GiB | 3 | hdd | hdd | ssd | 2019-08-30 04:11:23 | testvolume | - | - | testvolume@1430 | hybrid-r3 | | - | key=value |
| testvolume@1430 | 100 GiB | 3 | hdd | hdd | ssd | 2019-08-30 03:56:58 | testvolume | - | - | | hybrid-r3 | A | - | |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Flags:
A - anonymous snapshot with auto-generated name
B - bound snapshot
D - snapshot currently in the process of deletion
T - transient snapshot (created during volume cloning)
R - allow placing two disks within a replication chain onto the same server
P - snapshot delete blocked due to multiple children
To list the snapshots only for a particular volume use:
# storpool volume testvolume list snapshots
[snip]
A volume can be directly converted to a snapshot; this operation is also known as freeze:
# storpool volume testvolume freeze
OK
Note that the operation will fail if the volume is attached read-write; see Attachments for details.
To create a bound snapshot on a volume use:
# storpool volume testvolume bound snapshot
OK
This snapshot will be automatically deleted when the last child volume created from it is deleted, which is useful for non-persistent images.
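For example (a sketch; testvolume@4711 stands for the auto-generated snapshot name, which can be taken from storpool snapshot list):
# storpool volume testvolume bound snapshot
OK
# storpool volume scratch1 parent testvolume@4711
OK
Once all child volumes of testvolume@4711 are deleted, the bound snapshot is removed automatically.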
To list the target disk sets and objects of a snapshot:
# storpool snapshot testsnap list
[snip]
The output is similar to that of storpool volume <volumename> list.
To get detailed info about the disks used for this snapshot and the number of objects on each of them use:
# storpool snapshot testsnap info
[snip]
The output is similar to that of storpool volume <volumename> info.
To create a volume based on an existing snapshot (cloning) use:
# storpool volume testvolume parent centos73-base-snap
OK
To revert a volume to an existing snapshot use:
# storpool volume testvolume revertToSnapshot centos73-working
OK
The same is possible through the use of templates with a parent snapshot (see Templates):
# storpool volume spd template centos73-base
OK
Create a volume based on another existing volume (cloning):
# storpool volume testvolume1 baseOn testvolume
OK
Note
This operation will first create an anonymous bound snapshot on testvolume
and will then create testvolume1
with the bound snapshot as parent. The snapshot will exist until both volumes are deleted and will be automatically deleted afterwards.
To delete a snapshot use:
# storpool snapshot spdb_snap1 delete spdb_snap1
OK
Note
To avoid accidents, the name of the snapshot must be entered twice.
A snapshot can also be bound to its child volumes; it will exist until all child volumes are deleted:
# storpool snapshot testsnap bind
OK
The opposite operation is also possible; to unbind such a snapshot use:
# storpool snapshot testsnap unbind
OK
To get the space that will be freed if a snapshot is deleted use:
# storpool snapshot space
----------------------------------------------------------------------------------------------------------------
| snapshot | on volume | size | rdnd. | stored | used | missing info |
----------------------------------------------------------------------------------------------------------------
| testsnap | testvolume | 100 GiB | 3 | 27 GiB | -135 GiB | 0 B |
| testvolume@3794 | testvolume | 100 GiB | 3 | 27 GiB | 1.9 GiB | 0 B |
| testvolume@3897 | testvolume | 100 GiB | 3 | 507 MiB | 432 KiB | 0 B |
| testvolume@3899 | testvolume | 100 GiB | 3 | 334 MiB | 224 KiB | 0 B |
| testvolume@4332 | testvolume | 100 GiB | 3 | 73 MiB | 36 KiB | 0 B |
| testvolume@4333 | testvolume | 100 GiB | 3 | 45 MiB | 40 KiB | 0 B |
| testvolume@4334 | testvolume | 100 GiB | 3 | 59 MiB | 16 KiB | 0 B |
| frozenvolume | - | 8 GiB | 2 | 80 MiB | 80 MiB | 0 B |
----------------------------------------------------------------------------------------------------------------
Used mainly for accounting purposes. The columns explained:
snapshot - name of the snapshot
on volume - the name of the child volume for this snapshot, if any; a frozen volume, for example, would have this field empty
size - the provisioned size of the snapshot
rdnd. - number of copies for this snapshot or its erasure coding scheme
stored - how much data is actually written
used - the amount of data that would be freed from the underlying drives (before redundancy) if the snapshot is removed
missing info - if this value is anything other than 0 B, probably some of the storpool_controller services in the cluster are not running correctly.
The used column could be negative in some cases when the snapshot has more than one child volume. In these cases deleting the snapshot would “free” negative space, i.e. it will end up taking more space on the underlying disks.
Similar to volumes, a snapshot can have different placement groups, templates, or other attributes:
# storpool snapshot testsnap template all-ssd
OK
Additional parameters that may be used:
placeAll - place all objects in this placementGroup (default value: default)
placeTail - name of the placementGroup for the reader (default value: same as the placeAll value)
placeHead - place the third replica in a different placementGroup (default value: same as the placeAll value)
reuseServer - place multiple copies on the same server
tag - set a tag in the form key=value
template - use a template with preconfigured placement, replication and/or limits (see the Templates section for details)
iops - set the maximum IOPS limit for this snapshot (in IOPS)
bw - set the maximum bandwidth limit (in MB/s)
limitType - specify whether the iops and bw limits are for the total size of the block device or per each GiB (one of “total” or “perGiB”)
Note
The bandwidth and IOPS limits apply only to the particular snapshot if it is attached; they do not limit any child volumes using this snapshot as parent.
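For example, a sketch of setting limits and a tag directly on a snapshot, using the parameter names above (the exact combination of parameters in one command is illustrative and may need to be split):
# storpool snapshot testsnap iops 1000 bw 100 tag tier=archive
OK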
Similar to the same operation for volumes, a snapshot can be renamed with:
# storpool snapshot testsnap rename ubuntu1604-base
OK
Attention
Changing the name of a snapshot will not wait for clients that have this snapshot attached to update the name of the symlink. Always use client sync for all clients with the snapshot attached.
A snapshot could also be rebased to root (promoted) or rebased to another parent snapshot in a chain:
# storpool snapshot testsnap rebase # [parent-snapshot-name]
OK
To delete a snapshot use:
# storpool snapshot testsnap delete testsnap
OK
Note
A snapshot sometimes will not get deleted immediately; during this period it will be visible with * in the output of storpool volume status or storpool snapshot list.
To set a snapshot for deferred deletion use:
# storpool snapshot testsnap deleteAfter 1d
OK
The above will set a target delete date for this snapshot exactly one day from the present time.
Note
The snapshot will be deleted at the desired point in time only if delayed snapshot delete is enabled in the local cluster; check the Management Configuration section of this guide.
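If deferred deletion is not enabled, it can presumably be turned back on via the management configuration (a sketch mirroring the delayedSnapshotDelete off command shown in 12.22. Management Configuration; on is assumed to be the counterpart value):
# storpool mgmtConfig delayedSnapshotDelete on
OK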
12.11.1. Remote snapshots
If multi-site or multicluster is enabled (the cluster has a bridge service running), a snapshot can be exported and become visible to other configured clusters.
For example to export a snapshot snap1
to a location named
StorPoolLab-Sofia
use:
# storpool snapshot snap1 export StorPoolLab-Sofia
OK
To list the presently exported snapshots use:
# storpool snapshot list exports
-------------------------------------------------------------------------------
| remote | snapshot | globalId | backingUp | volumeMove |
-------------------------------------------------------------------------------
| StorPoolLab-Sofia | snap1 | nzkr.b.cuj | false | false |
-------------------------------------------------------------------------------
To list the snapshots exported from remote sites use:
# storpool snapshot list remote
------------------------------------------------------------------------------------------
| location | remoteId | name | onVolume | size | creationTimestamp | tags |
------------------------------------------------------------------------------------------
| s02 | a.o.cxz | snapshot1 | | 107374182400 | 2019-08-20 03:21:42 | |
------------------------------------------------------------------------------------------
A single snapshot can be exported to multiple configured locations.
To create a clone of a remote snapshot locally use:
# storpool snapshot snapshot1-copy template hybrid-r3 remote s02 a.o.cxz # [tag key=value]
In this example the remote location
is s02
and the remoteId
is a.o.cxz
. Any key=value
pair tags may be configured at creation time.
To unexport a local snapshot use:
# storpool snapshot snap1 unexport StorPoolLab-Sofia
OK
The remote location can be replaced with the keyword all; this will attempt to unexport the snapshot from all locations it was previously exported to.
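For example, to remove all exports of snap1 at once:
# storpool snapshot snap1 unexport all
OK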
Note
If the snapshot is presently being transferred, the unexport operation will fail. It can be forced by adding force to the end of the unexport command; however, this is discouraged in favor of waiting for any active transfer to complete.
To unexport a remote snapshot use:
# storpool snapshot remote s02 a.o.cxz unexport
OK
The snapshot will no longer be visible with storpool snapshot list remote
.
To unexport a remote snapshot and also set for deferred deletion in the remote site:
# storpool snapshot remote s02 a.o.cxz unexport deleteAfter 1h
OK
This will attempt to set a target delete date for a.o.cxz in the remote site exactly one hour from the present time for this snapshot. If the minimumDeleteDelay in the remote site has a higher value, e.g. 1 day, the selected value will be overridden with the minimumDeleteDelay - in this example 1 day. For more info on deferred deletion check the 17.2. Multi site section of this guide.
To move a snapshot to a different cluster in a multicluster environment (more on clusters here) use:
# storpool snapshot snap1 moveToRemote Lab-D-cl2
Note
Moving a snapshot to a remote cluster is forbidden for attached snapshots. More info on snapshot move is available in 17.13. Volume and snapshot move.
12.12. Attachments
Attaching a volume or snapshot makes it accessible to a client under the /dev/storpool
and /dev/storpool-byid
directories. Volumes can be attached as read-only or read-write. Snapshots are always attached read-only.
To attach a volume testvolume to a client with ID 1, creating the block device /dev/storpool/testvolume, use:
# storpool attach volume testvolume client 1
OK
To attach a volume/snapshot to the node you are currently connected to use:
# storpool attach volume testvolume here
OK
# storpool attach snapshot testsnap here
OK
By default this command will block until the volume is attached to the client and the /dev/storpool/<volumename>
symlink is created. For example if the storpool_block
service has not been started the command will wait indefinitely. To set a timeout for this operation use:
# storpool attach volume testvolume here timeout 10 # seconds
OK
To completely disregard the readiness check use:
# storpool attach volume testvolume here noWait
OK
Note
The use of noWait
is discouraged in favor of the default behaviour of the attach
command.
Attaching a volume will create a read-write block device attachment by default. To attach it read-only use:
# storpool volume testvolume2 attach client 12 mode ro
OK
To list all attachments use:
# storpool attach list
-------------------------------------------------------------------
| client | volume | globalId | mode | tags |
-------------------------------------------------------------------
| 11 | testvolume | d.n.a1z | RW | vc-policy=no |
| 12 | testvolume1 | d.n.c2p | RW | vc-policy=no |
| 12 | testvolume2 | d.n.uwp | RO | vc-policy=no |
| 14 | testsnap | d.n.s1m | RO | vc-policy=no |
-------------------------------------------------------------------
To detach use:
# storpool detach volume testvolume client 1 # or 'here' if the command is being executed on client ID 1
If a volume is actively being written to or read from, a detach operation will fail:
# storpool detach volume testvolume client 11
Error: 'testvolume' is open at client 11
In this case the detach can be forced; be aware that forcing a detachment is discouraged:
# storpool detach volume testvolume client 11 force yes
OK
Attention
Any operations on the volume will receive an I/O error when it is forcefully detached. Some mounted filesystems may cause a kernel panic when a block device disappears while there are live operations on it, so be extra careful if such filesystems are mounted directly on a hypervisor node.
If a volume or snapshot is attached to more than one client, it can be detached from all nodes with a single CLI command:
# storpool detach volume testvolume all
OK
# storpool detach snapshot testsnap all
OK
12.13. Client
To check the status of the active storpool_block
services in the cluster use:
# storpool client status
-----------------------------------
| client | status |
-----------------------------------
| 11 | ok |
| 12 | ok |
| 13 | ok |
| 14 | ok |
-----------------------------------
To wait until a client is updated use:
# storpool client 13 sync
OK
This is a way to ensure that a volume with a changed size is visible with its new size to any client it is attached to.
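For example, after growing a volume that is attached on client 11 (illustrative ID), the new size can be propagated with:
# storpool volume testvolume size +1G
OK
# storpool client 11 sync
OK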
To show detailed information about the currently active requests on a particular client use:
# storpool client 13 activeRequests
------------------------------------------------------------------------------------------------------------------------------------
| request ID | request IDX | volume | address | size | op | time active |
------------------------------------------------------------------------------------------------------------------------------------
| 9224499360847016133:3181950 | 1044 | testvolume | 10562306048 | 128 KiB | write | 65 msec |
| 9224499360847016133:3188784 | 1033 | testvolume | 10562437120 | 32 KiB | read | 63 msec |
| 9224499360847016133:3188977 | 1029 | testvolume | 10562568192 | 128 KiB | read | 21 msec |
| 9224499360847016133:3189104 | 1026 | testvolume | 10596122624 | 128 KiB | read | 3 msec |
| 9224499360847016133:3189114 | 1035 | testvolume | 10563092480 | 128 KiB | read | 2 msec |
| 9224499360847016133:3189396 | 1048 | testvolume | 10629808128 | 128 KiB | read | 1 msec |
------------------------------------------------------------------------------------------------------------------------------------
12.14. Templates
Templates enable easy and consistent setup and usage tracking for a large collection of volumes and their snapshots with common attributes, e.g. replication, placement groups, and/or a common parent snapshot.
To create a template use:
# storpool template nvme replication 3 placeAll nvme
OK
# storpool template magnetic replication 3 placeAll hdd
OK
# storpool template hybrid replication 3 placeAll hdd placeTail ssd
OK
# storpool template ssd-hybrid replication 3 placeAll ssd placeHead hdd
OK
To list all created templates use:
# storpool template list
-------------------------------------------------------------------------------------------------------------------------------------
| template | size | rdnd. | placeHead | placeAll | placeTail | iops | bw | parent | flags |
-------------------------------------------------------------------------------------------------------------------------------------
| nvme | - | 3 | nvme | nvme | nvme | - | - | | |
| magnetic | - | 3 | hdd | hdd | hdd | - | - | | |
| hybrid | - | 3 | hdd | hdd | ssd | - | - | | |
| ssd-hybrid | - | 3 | hdd | ssd | ssd | - | - | | |
-------------------------------------------------------------------------------------------------------------------------------------
Please refer to 14. Redundancy for more info on replication and erasure coding schemes (shown in rdnd.
above).
To get the status of the templates, with detailed info on usage and the available space left for each placement, use:
# storpool template status
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| template | place head | place all | place tail | rdnd. | volumes | snapshots/removing | size | capacity | avail. | avail. all | avail. tail | avail. head | flags |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| magnetic | hdd | hdd | hdd | 3 | 115 | 631/0 | 28 TiB | 80 TiB | 52 TiB | 240 TiB | 240 TiB | 240 TiB | |
| hybrid | hdd | ssd | hdd | 3 | 208 | 347/9 | 17 TiB | 72 TiB | 55 TiB | 240 TiB | 72 TiB | 240 TiB | |
| ssd-hybrid | ssd | ssd | hdd | 3 | 40 | 7/0 | 4 TiB | 36 TiB | 36 TiB | 240 TiB | 72 TiB | 240 TiB | |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
To change template attributes directly use:
# storpool template hdd-only size 120G propagate no
OK
# storpool template hybrid size 40G iops 4000 propagate no
OK
Parameters that can be set:
replication - change the number of copies for volumes or snapshots created with this template
size - the default size used when none is specified for a volume created with this template
placeAll - place all objects in this placementGroup (default value: default)
placeTail - name of the placementGroup for the reader (default value: same as the placeAll value)
placeHead - place the third replica in a different placementGroup (default value: same as the placeAll value)
iops - set the maximum IOPS limit (in IOPS)
bw - set the maximum bandwidth limit (in MB/s)
parent - set a parent snapshot for all volumes created with this template
reuseServer - place multiple copies on the same server
limitType - specify whether the iops and bw limits are for the total size of the block device or per each GiB (one of “total” or “perGiB”)
When changing parameters of an already created template, the propagate parameter is required in order to specify whether the changes should also be applied to all existing volumes and/or snapshots created with this template. The parameter is required regardless of whether the template has any volumes and/or snapshots created with it.
For example, in order to change the bandwidth limit for all volumes and snapshots created with the already existing template magnetic:
# storpool template magnetic bw 100MB propagate yes
OK
Note
When using storpool template $TEMPLATE propagate yes
, all the parameters of $TEMPLATE
will be re-applied to all volumes and snapshots created with it.
Note
Changing template parameters with the propagate option will not automatically re-allocate the content of existing volumes on the disks. If replication or placement groups are changed, run the balancer to apply the new settings to the existing volumes. However, if the changes are made directly to the volume instead of to the template, running the balancer is not required.
Attention
Dropping the replication (e.g. from triple to dual) of a large number of volumes is an almost instant operation; however, returning them back to triple is similar to creating the third copy for the first time. This is why changing replication to less than the present level (e.g. from 3 to 2) requires using replicationReduce as a safety measure.
To rename a template use:
# storpool template magnetic rename backup
OK
To delete a template use:
# storpool template hdd-only delete hdd-only
OK
Note
The delete operation might fail if there are volumes/snapshots that are created with this template.
12.15. iSCSI
The StorPool iSCSI support is documented more extensively in the 16. StorPool iSCSI support section; these are the commands used to configure it and view the configuration.
To set the cluster’s iSCSI base IQN iqn.2019-08.com.example:examplename:
# storpool iscsi config setBaseName iqn.2019-08.com.example:examplename
OK
12.15.1. Create a Portal Group
To create a portal group examplepg used to group exported volumes for access by initiators using 192.168.42.247/24 (CIDR notation) as the portal IP address:
# storpool iscsi config portalGroup examplepg create addNet 192.168.42.247/24 vlan 42
OK
To create portal for the initiators to connect to (for example portal IP address 192.168.42.202 and StorPool’s SP_OURID 5):
# storpool iscsi config portal create portalGroup examplepg address 192.168.42.202 controller 5
OK
Note
This address will be handled by the storpool_iscsi process directly and will not be visible on the node with standard tools like ip or ifconfig; check 12.15.5. iscsi_tool for these purposes.
12.15.2. Register an Initiator
To define the iqn.2019-08.com.example:abcdefgh initiator that is allowed to connect from the 192.168.42.0/24 network (w/o authentication):
# storpool iscsi config initiator iqn.2019-08.com.example:abcdefgh create net 192.168.42.0/24
OK
To define the iqn.2019-08.com.example:client initiator that is allowed to connect from the 192.168.42.0/24 network and must authenticate using the standard iSCSI password-based challenge-response authentication method using the username user and the password secret:
# storpool iscsi config initiator iqn.2019-08.com.example:client create net 192.168.42.0/24 chap user secret
OK
12.15.3. Export a Volume
To specify that the existing StorPool volume tinyvolume should be exported to one or more initiators:
# storpool iscsi config target create tinyvolume
OK
Note
Please note that changing the volume name after creating a target will not change the target name. Re-creating (unexport/re-export) the target will use the new volume name.
To actually export the StorPool volume tinyvolume to the iqn.2019-08.com.example:abcdefgh initiator via the examplepg portal group (the StorPool iSCSI service will automatically pick a portal to export the volume through):
# storpool iscsi config export initiator iqn.2019-08.com.example:abcdefgh portalGroup examplepg volume tinyvolume
OK
Note
The volume will be visible to the initiator as IQN <BaseName>:<volume>
The same command without specifying an initiator will export the target to all registered initiators, and it will be visible as the * initiator:
# storpool iscsi config export portalGroup examplepg volume tinyvolume
OK
# storpool iscsi initiator list exports
-----------------------------------------------------------------------------------------------------------------
| name | volume | currentControllerId | portalGroup | initiator |
-----------------------------------------------------------------------------------------------------------------
| iqn.2019-08.com.example:examplename:tinyvolume | tinyvolume | 23 | examplepg | * |
-----------------------------------------------------------------------------------------------------------------
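On a Linux initiator, the exported target can then be discovered and logged into with the open-iscsi tools, for example (a sketch using the portal address and IQN from the examples above; CHAP configuration, if required, is omitted):
# iscsiadm -m discovery -t sendtargets -p 192.168.42.202:3260
# iscsiadm -m node -T iqn.2019-08.com.example:examplename:tinyvolume -p 192.168.42.202:3260 --login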
12.15.4. Get iSCSI Configuration
To view the iSCSI cluster base IQN:
# storpool iscsi basename
---------------------------------------
| basename |
---------------------------------------
| iqn.2019-08.com.example:examplename |
---------------------------------------
To view the portal groups:
# storpool iscsi portalGroup list
---------------------------------------------
| name | networksCount | portalsCount |
---------------------------------------------
| examplepg | 1 | 2 |
---------------------------------------------
To view the portals:
# storpool iscsi portalGroup list portals
--------------------------------------------------
| group | address | controller |
--------------------------------------------------
| examplepg | 192.168.42.246:3260 | 1 |
| examplepg | 192.168.42.202:3260 | 5 |
--------------------------------------------------
To view the defined initiators:
# storpool iscsi initiator list
---------------------------------------------------------------------------------------
| name | username | secret | networksCount | exportsCount |
---------------------------------------------------------------------------------------
| iqn.2019-08.com.example:abcdefgh | | | 1 | 1 |
| iqn.2019-08.com.example:client | user | secret | 1 | 0 |
---------------------------------------------------------------------------------------
To view the present state of the configured iSCSI interfaces:
# storpool iscsi interfaces list
--------------------------------------------------
| ctrlId | net 0 | net 1 |
--------------------------------------------------
| 23 | 2A:60:00:00:E0:17 | 2A:60:00:00:E0:17 |
| 24 | 2A:60:00:00:E0:18 | 2A:60:00:00:E0:18 |
| 25 | 2A:60:00:00:E0:19 | 2E:60:00:00:E0:19 |
| 26 | 2A:60:00:00:E0:1A | 2E:60:00:00:E0:1A |
--------------------------------------------------
Note
These are the same interfaces configured with SP_ISCSI_IFACE
in the order
of appearance:
# storpool_showconf SP_ISCSI_IFACE
SP_ISCSI_IFACE=sp0,spbond1:sp1,spbond1:[lacp]
In the above output the sp0
interface is net ID 0 and sp1
is net ID 1.
To view the volumes that may be exported to initiators:
# storpool iscsi target list
-------------------------------------------------------------------------------------
| name | volume | currentControllerId |
-------------------------------------------------------------------------------------
| iqn.2019-08.com.example:examplename:tinyvolume | tinyvolume | 65535 |
-------------------------------------------------------------------------------------
To view the volumes currently exported to initiators:
# storpool iscsi initiator list exports
--------------------------------------------------------------------------------------------------------------------------------------
| name | volume | currentControllerId | portalGroup | initiator |
--------------------------------------------------------------------------------------------------------------------------------------
| iqn.2019-08.com.example:examplename:tinyvolume | tinyvolume | 1 | | iqn.2019-08.com.example:abcdefgh |
--------------------------------------------------------------------------------------------------------------------------------------
To list the presently active sessions in the cluster use:
# storpool iscsi sessions list
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| id | target | initiator | portal addr | initiator addr | timeCreated | nopOut | scsi | task | dataOut | otherOut | nopIn | scsiRsp | taskRsp | dataIn | r2t | otherIn | t free | t dataOut | t queued | t processing | t dataResp | t aborted | ISID |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 23.0 | iqn.2020-04.com.storpool:autotest:s18-1-iscsi-test-hybrid-win-server2016 | iqn.1991-05.com.microsoft:s18 | 10.1.100.123:3260 | 10.1.100.18:49414 | 2020-07-07 09:25:16 / 00:03:54 | 209 | 89328 | 0 | 0 | 2 | 209 | | | 45736 | 0 | 2 | 129 | 0 | 0 | 0 | 0 | 0 | 1370000 |
| 23.1 | iqn.2020-04.com.storpool:autotest:s11-1-iscsi-test-hybrid-centos6 | iqn.2020-04.com.storpool:s11 | 10.1.100.123:3260 | 10.1.100.11:44392 | 2020-07-07 09:25:33 / 00:03:37 | 218 | 51227 | 0 | 0 | 1 | 218 | | | 25627 | 0 | 1 | 129 | 0 | 0 | 0 | 0 | 0 | 3d0002b8 |
| 24.0 | iqn.2020-04.com.storpool:autotest:s11-1-iscsi-test-hdd-centos6 | iqn.2020-04.com.storpool:s11 | 10.1.100.124:3260 | 10.1.100.11:51648 | 2020-07-07 09:27:27 / 00:01:43 | 107 | 424 | 0 | 0 | 1 | 107 | | | 224 | 0 | 1 | 129 | 0 | 0 | 0 | 0 | 0 | 3d0002b9 |
| 24.1 | iqn.2020-04.com.storpool:autotest:s18-1-iscsi-test-hdd-win-server2016 | iqn.1991-05.com.microsoft:s18 | 10.1.100.124:3260 | 10.1.100.18:49422 | 2020-07-07 09:28:22 / 00:00:48 | 43 | 39568 | 0 | 0 | 2 | 43 | | | 19805 | 0 | 2 | 128 | 0 | 0 | 1 | 0 | 0 | 1370000 |
| 25.0 | iqn.2020-04.com.storpool:autotest:s13-1-iscsi-test-hybrid-centos7 | iqn.2020-04.com.storpool:s13 | 10.1.100.125:3260 | 10.1.100.13:45120 | 2020-07-07 09:20:46 / 00:08:24 | 481 | 154086 | 0 | 0 | 1 | 481 | | | 78308 | 0 | 1 | 129 | 0 | 0 | 0 | 0 | 0 | 3d0000a8 |
| 26.0 | iqn.2020-04.com.storpool:autotest:s13-1-iscsi-test-hdd-centos7 | iqn.2020-04.com.storpool:s13 | 10.1.100.126:3260 | 10.1.100.13:43858 | 2020-07-07 09:22:52 / 00:06:18 | 369 | 147438 | 0 | 0 | 1 | 369 | | | 74883 | 0 | 1 | 129 | 0 | 0 | 0 | 0 | 0 | 3d0000a9 |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Where the fields are:
id - identifier for the node and connection; the first part matches the SP_OURID of the node on which the storpool_iscsi service is running, and the second is the export number
target - the target IQN
initiator - the initiator IQN
portal addr - the portal group floating address and port
initiator addr - the initiator address and port
timeCreated - the time when the session was created
Initiator:
nopOut - number of NOP-Out requests from the initiator
scsi - number of SCSI commands from the initiator for this session
task - number of SCSI Task Management Function Requests from the initiator
dataOut - number of SCSI Data-Out PDUs from the initiator
otherOut - number of non SCSI Data-Out PDUs sent to the target (Login/Logout/SNACK or Text)
ISID - the initiator part of the session identifier, explicitly specified by the initiator during login
Target:
nopIn - number of NOP-In PDUs from the target
scsiRsp - number of SCSI Response PDUs from the target
taskRsp - number of SCSI Task Management Function Response PDUs from the target
dataIn - number of SCSI Data-In PDUs from the target
r2t - number of Ready To Transfer (R2T) PDUs from the target
otherIn - number of non SCSI Data-In PDUs from the target (Login/Logout/SNACK or Text)
Task queue:
t free - number of free task queue slots
t dataOut - write requests waiting for data from TCP
t queued - number of IO requests received and ready to be processed
t processing - number of IO requests sent to the target for processing
t dataResp - read requests queued for sending over TCP
t aborted - number of aborted requests
To stop exporting the tinyvolume volume to the initiator with iqn iqn.2019-08.com.example:abcdefgh and the examplepg portal group:
# storpool iscsi config unexport initiator iqn.2019-08.com.example:abcdefgh portalGroup examplepg volume tinyvolume
OK
If a target was exported to all initiators (i.e. *
), not specifying an initiator will unexport from all:
# storpool iscsi config unexport portalGroup examplepg volume tinyvolume
OK
To remove an iSCSI definition for the tinyvolume volume:
# storpool iscsi config target delete tinyvolume
OK
To remove access for the iqn.2019-08.com.example:client iSCSI initiator:
# storpool iscsi config initiator iqn.2019-08.com.example:client delete
OK
To remove the portal 192.168.42.202 IP address:
# storpool iscsi config portal delete address 192.168.42.202
OK
To remove portal group examplepg after all the portals have been removed:
# storpool iscsi config portalGroup examplepg delete
OK
Note
Only portal groups without portals may be deleted.
12.15.5. iscsi_tool
With the hardware-accelerated iSCSI, all traffic from/to the initiators is handled by the storpool_iscsi service directly. For example, with the cluster setup above, the addresses exposed on each of the nodes can be queried with /usr/lib/storpool/iscsi_tool:
# /usr/lib/storpool/iscsi_tool
usage: /usr/lib/storpool/iscsi_tool change-port 0/1 ifaceName
usage: /usr/lib/storpool/iscsi_tool ip net list
usage: /usr/lib/storpool/iscsi_tool ip neigh list
usage: /usr/lib/storpool/iscsi_tool ip route list
To list the presently configured addresses use:
# /usr/lib/storpool/iscsi_tool ip net list
10.1.100.0/24 vlan 1100 ports 1,2
10.18.1.0/24 vlan 1801 ports 1,2
10.18.2.0/24 vlan 1802 ports 1,2
To list the neighbours and their last state use:
# /usr/lib/storpool/iscsi_tool ip neigh list
10.1.100.11 ok F4:52:14:76:9C:B0 lastSent 1785918292753 us, lastRcvd 918669 us
10.1.100.13 ok 0:25:90:C8:E5:AA lastSent 1785918292803 us, lastRcvd 178521 us
10.1.100.18 ok C:C4:7A:EA:85:4E lastSent 1785918292867 us, lastRcvd 178099 us
10.1.100.108 ok 1A:60:0:0:E0:8 lastSent 1785918293857 us, lastRcvd 857181794 us
10.1.100.112 ok 1A:60:0:0:E0:C lastSent 1785918293906 us, lastRcvd 1157179290 us
10.1.100.113 ok 1A:60:0:0:E0:D lastSent 1785918293922 us, lastRcvd 765392509 us
10.1.100.114 ok 1A:60:0:0:E0:E lastSent 1785918293938 us, lastRcvd 526084270 us
10.1.100.115 ok 1A:60:0:0:E0:F lastSent 1785918293954 us, lastRcvd 616948781 us
10.1.100.123 ours
[snip]
The above output also includes the portalGroup addresses residing on the node with the lowest ID in the cluster.
To list routing info use:
# /usr/lib/storpool/iscsi_tool ip route list
10.1.100.0/24 local
10.18.1.0/24 local
10.18.2.0/24 local
12.15.6. iscsi_targets
The /usr/lib/storpool/iscsi_targets tool is a helper for Linux-based initiators, showing all logged-in targets on the node:
# /usr/lib/storpool/iscsi_targets
/dev/sdn iqn.2020-04.com.storpool:autotest:s11-1-iscsi-test-hybrid-centos6
/dev/sdo iqn.2020-04.com.storpool:autotest:s11-1-iscsi-test-hdd-centos6
/dev/sdp iqn.2020-04.com.storpool:autotest:s11-2-iscsi-test-hybrid-centos6
/dev/sdq iqn.2020-04.com.storpool:autotest:s11-2-iscsi-test-hdd-centos6
12.16. Kubernetes
To register a Kubernetes cluster:
# storpool kubernetes add name cluster1
OK
To disable a Kubernetes cluster:
# storpool kubernetes update name cluster1 disable yes
OK
To enable a Kubernetes cluster:
# storpool kubernetes update name cluster1 disable no
OK
To delete a Kubernetes cluster:
# storpool kubernetes delete name cluster1
OK
To list registered Kubernetes clusters:
# storpool kubernetes list
-----------------------
| name | disabled |
-----------------------
| cluster1 | false |
-----------------------
To view the status of the registered Kubernetes clusters:
# storpool kubernetes status
--------------------------------------------------------------
| name | sc | w | pvc | noRsrc | noTempl | mode | noSC |
--------------------------------------------------------------
| cluster1 | 0 | 0/3 | 0 | 0 | 0 | 0 | 0 |
--------------------------------------------------------------
Fields:
sc - registered Storage Classes
w - watch connections to the kube adm
pvc - persistentVolumeClaims being provisioned
noRsrc - persistentVolumeClaims failed due to no resources
noTempl - persistentVolumeClaims failed due to missing template
mode - persistentVolumeClaims failed due to unsupported access mode
noSC - persistentVolumeClaims failed due to missing storage class
12.17. Relocator
The relocator is an internal StorPool service that takes care of data re-allocation in case of changes to a volume's replication or placement group parameters, or in case of any pending rebase operations. This service is turned on by default.
If needed, the relocator can be turned off with:
# storpool relocator off
OK
To turn it back on use:
# storpool relocator on
OK
To display the relocator status:
# storpool relocator status
relocator on, no volumes to relocate
The following additional relocator commands are available:
storpool relocator disks - returns the state of the disks after the relocator finishes all presently running tasks, as well as the quantity of objects and data each drive still needs to recover. The output is the same as with storpool balancer disks after the balancing task has been committed; see Balancer for more details.
storpool relocator volume <volumename> disks or storpool relocator snapshot <snapshotname> disks - shows the same information as storpool relocator disks, but only for the pending operations of the specific volume or snapshot.
12.18. Balancer
The balancer is used to redistribute data in case a disk or a set of disks (e.g. a new node) was added to or removed from the cluster. By default it is off; it has to be turned on after changes in the cluster configuration for the data to be redistributed.
To display the status of the balancer:
# storpool balancer status
balancer waiting, auto off
To load a re-balancing task, please refer to the rebalancing_storpool_20.0 section of this guide.
To discard the re-balancing operation use:
# storpool balancer stop
OK
To actually commit the proposed changes and start the relocations use:
# storpool balancer commit
OK
After the commit, the changes will be visible only with storpool relocator disks, and many volumes and snapshots will have the M flag in the output of storpool volume status until all relocations are completed. The progress can be followed with storpool task list (see Tasks).
12.19. Tasks
Tasks are all outstanding operations for recovering or relocating data within the present cluster or between two connected clusters.
For example if a disk with ID 1401
was not in the cluster for a period of time and is then returned, all outdated objects will be recovered from the other drives with the latest changes.
These recovery operations could be listed with:
# storpool task list
----------------------------------------------------------------------------------------
| disk | task id | total obj | completed | started | remaining | % complete |
----------------------------------------------------------------------------------------
| 2301 | RECOVERY | 73 | 5 | 1 | 68 | 6% |
| 2315 | balancer | 180 | 0 | 1 | 180 | 0% |
----------------------------------------------------------------------------------------
| total | | 73 | 5 | 1 | 68 | 6% |
----------------------------------------------------------------------------------------
Other cases when task operations may be listed are when a re-balancing operation has been committed and relocations are in progress, as well as when a cloning operation for a remote snapshot is in progress in the local cluster.
12.20. Maintenance Mode
The maintenance submenu is used to put one or more nodes in a cluster into maintenance state. A couple of checks are performed prior to entering maintenance state, in order to prevent a node with one or more live server instances from entering maintenance when, for example, the cluster is not yet fully recovered or has decreased redundancy for other reasons.
A node can be put into maintenance state with:
# storpool maintenance set node 23 duration 10m description kernel_update
OK
The above will put node ID 23 into maintenance state for 10 minutes and will set the description to “kernel_update”.
To list the present nodes in maintenance use:
# storpool maintenance list
------------------------------------------------------------
| nodeId | started | remaining | description |
------------------------------------------------------------
| 23 | 2020-09-30 12:55:20 | 00:09:50 | kernel_update |
------------------------------------------------------------
To complete a maintenance for a node use:
# storpool maintenance complete node 23
OK
Note
While the node or the cluster is in maintenance mode, non-cluster-threatening issues will not be sent by the monitoring system to external entities. All alerts will still be received by StorPool support and will be classified internally as “under maintenance”.
Attention
Any cluster-threatening issues will still trigger super-critical alerts to both StorPool support and any other configured endpoint. More on super-critical alerts here.
A full cluster maintenance mode is also available for occasions involving maintenance of the whole cluster. An example would be a scheduled restart of a network switch that would otherwise be reported as missing network for all nodes in the cluster.
This mode does not perform any checks and is mainly for informational purposes, in order to sync context between customers and StorPool's support teams. More on how to activate it is available here.
Full cluster maintenance mode can be used in addition to the per-node maintenance state explained above when necessary.
12.22. Management Configuration
The mgmtConfig
submenu is used to set some internal configuration parameters.
To list the presently configured parameters use:
# storpool mgmtConfig list
relocator on, interval 5.000 s
relocator transaction: min objects 320, max objects 4294967295
relocator recovery: max tasks per disk 2, max objects per disk 2400
relocator recovery objects trigger 32
relocator min free 150 GB
relocator max objects per HDD tail 0
balancer auto off, interval 5.000 s
snapshot delete interval 1.000 s
disks soft-eject interval 5.000 s
snapshot delayed delete off
snapshot dematerialize interval 1.000 s
mc owner check interval 2.000 s
mc autoreconcile interval 2.000 s
reuse server implicit on disk down disabled
max local recovery requests 1
max remote recovery requests 2
maintenance state production
max disk latency nvme 1000.000 ms
max disk latency ssd 1000.000 ms
max disk latency hdd 1000.000 ms
max disk latency journal 50.000 ms
backup template name backup_template
To disable the deferred snapshot deletion (default on) use:
# storpool mgmtConfig delayedSnapshotDelete off
OK
When enabled, all snapshots with a configured deletion time will be cleared at the configured date and time.
To change the default interval between periodic checks whether disks marked for ejection can actually be ejected (5 sec.) use:
# storpool mgmtConfig disksSoftEjectInterval 20000 # value in ms - 20 sec.
OK
Note
For individual per disk latency thresholds check 12.8.4. Disk list performance info section.
To define a global latency threshold before ejecting an HDD drive use:
# storpool mgmtConfig maxDiskLatencies hdd 1000 # value is in milliseconds
To define a global latency threshold before ejecting an SSD drive use:
# storpool mgmtConfig maxDiskLatencies ssd 1000 # value is in milliseconds
To define a global latency threshold before ejecting an NVMe drive use:
# storpool mgmtConfig maxDiskLatencies nvme 1000 # value is in milliseconds
To define a global latency limit before ejecting a journal device use:
# storpool mgmtConfig maxDiskLatencies journal 50 # value is in milliseconds
To change the default number of local and remote recovery requests for all disks use:
# storpool mgmtConfig maxLocalRecoveryRequests 1
OK
# storpool mgmtConfig maxRemoteRecoveryRequests 2
OK
To change the default interval (5 sec.) for the relocator to check if there is new work to be done use:
# storpool mgmtConfig relocatorInterval 20000 # value is in ms - 20 sec.
OK
To set a number of objects per disk in recovery at a time different from the default (3200):
# storpool mgmtConfig relocatorMaxRecoveryObjectsPerDisk 2000 # value in number of objects per disk
OK
To change the default maximum number of recovery tasks per disk (2 tasks) use:
# storpool mgmtConfig relocatorMaxRecoveryTasksPerDisk 4 # value is number of tasks per disk - will set 4 tasks
OK
To change the minimum (default 320) or the maximum (default 4294967295) number of objects per transaction for the relocator use:
# storpool mgmtConfig relocatorMaxTrObjects 2147483647
OK
# storpool mgmtConfig relocatorMinTrObjects 640
OK
To change the maximum number of objects per transaction per HDD tail drives use (0 is unset, 1+ is number of objects):
# storpool mgmtConfig relocatorMaxTrObjectsPerHddTail 2
To change the maximum number of objects in recovery for a disk to be usable by the relocator (default 32) use:
# storpool mgmtConfig relocatorRecoveryObjectsTrigger 64
To change the default interval for checking for new snapshots to delete use:
# storpool mgmtConfig snapshotDeleteInterval
To enable snapshot dematerialization or change the interval use:
# storpool mgmtConfig snapshotDematerializeInterval 30000 # sets the interval 30 seconds, 0 disables it
Snapshot dematerialization checks and removes all objects that do not refer to any data, i.e. no change in this object from the last snapshot (or ever). This helps to reduce the number of used objects per disk in clusters with a large number of snapshots and a small number of changed blocks between the snapshots in the chain.
To update the free space threshold in GB below which the relocator will not add new tasks use:
# storpool mgmtConfig relocatorGBFreeBeforeAdd 75 # value is in GB
To set or change the default MultiCluster owner check interval use:
# storpool mgmtConfig mcOwnerCheckInterval 2000 # sets the interval to 2 seconds, 0 disables it
To set or change the default MultiCluster auto-reconcile interval use:
# storpool mgmtConfig mcAutoReconcileInterval 2000 # sets the interval to 2 seconds, 0 disables it
If a disk is down and a new volume cannot be allocated, enabling this option will retry the volume allocation as if reuseServer was specified. This is helpful for minimum installations with 3 nodes when one of the nodes or a disk is down. To enable the option use:
# storpool mgmtConfig reuseServerImplicitOnDiskDown enable
In case of a planned maintenance the following will update the full cluster maintenance state to maintenance
:
# storpool mgmtConfig maintenanceState maintenance
OK
… and back into production
:
# storpool mgmtConfig maintenanceState production
OK
To change the default template used upon receiving a snapshot from a remote cluster through the storpool_bridge service (replacing the now deprecated SP_BRIDGE_TEMPLATE option), use:
# storpool mgmtConfig backupTemplateName all-flash # the all-flash template should exist
OK
Please consult with StorPool support before changing the management configuration defaults.
12.23. Mode
Support for a couple of different output modes is available, both in the interactive shell and when the CLI is invoked directly. Some custom format options are available only for some operations.
Available modes:
csv
- Semicolon-separated values for some commands
json
- Processed JSON output for some commands
pass
- Pass the JSON response through
raw
- Raw output (display the HTTP request and response)
text
- Human readable output (default)
Example with switching to csv
mode in the interactive shell:
StorPool> mode csv
OK
StorPool> net list
nodeId;flags;net 1;net 2
23;uU + AJ;22:60:00:00:F0:17;26:60:00:00:F0:17
24;uU + AJ;2A:60:00:00:00:18;2E:60:00:00:00:18
25;uU + AJ;F6:52:14:76:9C:C0;F6:52:14:76:9C:C1
26;uU + AJ;2A:60:00:00:00:1A;2E:60:00:00:00:1A
29;uU + AJ;52:6B:4B:44:02:FE;52:6B:4B:44:02:FF
The same applies when using the CLI directly:
# storpool -f csv net list # the output is the same as above
[snip]
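Assuming the -f flag accepts the other modes listed above in the same way, the JSON output can be piped to standard tools for further processing, for example:
# storpool -f json net list | python3 -m json.tool
[snip]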
13. Multi-server
The multi-server feature enables the use of up to seven separate storpool_server instances on a single node. This makes sense for dedicated storage nodes, or for a heavily loaded converged setup with more resources isolated for the storage system.
For example, a dedicated storage node with 36 drives would provide better peak performance with 4 server instances, each controlling 1/4th of all disks/SSDs, than with a single instance. Another good example would be a converged node with 16 SSDs/HDDs, which would provide better peak performance with two server instances each controlling half of the drives and running on separate CPU cores (or even on two threads of a single CPU core) compared to a single server instance.
13.1. Configuration
The configuration of the CPUs on which the different instances are running is done via cgroups, through the storpool_cg
tool. More details are available in 6.2.24. Cgroup setup.
Configuring which drive is handled by which instance is done with storpool_initdisk
. For example, if we have two drives with IDs of 1101
and 1102
, both controlled by the first server instance, the output from storpool_initdisk
would look like this:
# storpool_initdisk --list
/dev/sde1, diskId 1101, version 10007, server instance 0, cluster init.b, SSD
/dev/sdf1, diskId 1102, version 10007, server instance 0, cluster init.b, SSD
Setting the second SSD drive (1102
) to be controlled by the second server instance is done like this:
# storpool_initdisk -r -i 1 /dev/sdXN # where X is the drive letter and N is the partition number e.g. /dev/sdf1
Hint
The above command will fail if the storpool_server service is running; eject the disk prior to re-assigning it to another instance.
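A minimal sketch of the sequence (the disk ID and device name follow the example above; make sure the cluster can tolerate the disk being temporarily missing before ejecting it):
# storpool disk 1102 eject               # remove the disk from the running server instance
OK
# storpool_initdisk -r -i 1 /dev/sdf1    # re-assign it to the second server instance
# storpool_initdisk --list               # verify the new server instance assignment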
In some cases, if the first server instance was configured with a large amount of cache (check SP_CACHE_SIZE in 6. Node configuration options), first split the cache between the different instances (e.g. from 8192 to 4096 when migrating from one to two instances). These parameters are automatically taken care of by the storpool_cg tool; check 6.2.24. Cgroup setup for more details.
13.2. Helper
A helper tool for easy reconfiguration between different numbers of server instances can be used to print the required commands. Below is an example for a node with some SSDs and some HDDs, automatically assigned to three SSD-only server instances and one HDD-only server instance:
[root@s25 ~]# /usr/lib/storpool/multi-server-helper.py -i 4 -s 3
/usr/sbin/storpool_initdisk -r -i 0 2532 0000:01:00.0-p1 # SSD
/usr/sbin/storpool_initdisk -r -i 0 2534 0000:02:00.0-p1 # SSD
/usr/sbin/storpool_initdisk -r -i 0 2533 0000:06:00.0-p1 # SSD
/usr/sbin/storpool_initdisk -r -i 0 2531 0000:07:00.0-p1 # SSD
/usr/sbin/storpool_initdisk -r -i 1 2505 /dev/sde1 # SSD
/usr/sbin/storpool_initdisk -r -i 1 2506 /dev/sdf1 # SSD
/usr/sbin/storpool_initdisk -r -i 1 2507 /dev/sdg1 # SSD
/usr/sbin/storpool_initdisk -r -i 1 2508 /dev/sdh1 # SSD
/usr/sbin/storpool_initdisk -r -i 2 2501 /dev/sda1 # SSD
/usr/sbin/storpool_initdisk -r -i 2 2502 /dev/sdb1 # SSD
/usr/sbin/storpool_initdisk -r -i 2 2503 /dev/sdc1 # SSD
/usr/sbin/storpool_initdisk -r -i 2 2504 /dev/sdd1 # SSD
/usr/sbin/storpool_initdisk -r -i 3 2511 /dev/sdi1 # WBC
/usr/sbin/storpool_initdisk -r -i 3 2512 /dev/sdj1 # WBC
/usr/sbin/storpool_initdisk -r -i 3 2513 /dev/sdk1 # WBC
/usr/sbin/storpool_initdisk -r -i 3 2514 /dev/sdl1 # WBC
/usr/sbin/storpool_initdisk -r -i 3 2515 /dev/sdn1 # WBC
/usr/sbin/storpool_initdisk -r -i 3 2516 /dev/sdo1 # WBC
/usr/sbin/storpool_initdisk -r -i 3 2517 /dev/sdp1 # WBC
/usr/sbin/storpool_initdisk -r -i 3 2518 /dev/sdq1 # WBC
[root@s25 ~]# /usr/lib/storpool/multi-server-helper.py -h
usage: multi-server-helper.py [-h] [-i INSTANCES] [-s [SSD_ONLY]]
Prints relevant commands for dispersing the drives to multiple server
instances
optional arguments:
-h, --help show this help message and exit
-i INSTANCES, --instances INSTANCES
Number of instances
-s [SSD_ONLY], --ssd-only [SSD_ONLY]
Splits by type, 's' SSD-only instances plus i-s HDD
instances (default s: 1)
Note that the commands can be executed only when the relevant storpool_server* service instances are stopped, and a cgroup re-configuration will likely be required after the setup changes (see 6.2.24. Cgroup setup for more info on how to update cgroups).
14. Redundancy
StorPool provides two mechanisms for protecting data from unplanned events: replication and erasure coding.
14.1. Replication
With replication, redundancy is provided by having multiple copies (replicas) of the data written synchronously across the cluster. You can set the number of replication copies as needed. The replication level directly correlates with the number of servers that may be down without interruption in the service. For example, with triple replication the number of the servers that may be down simultaneously, without losing access to the data, is 2.
Each volume or snapshot can be replicated on a different set of drives. Each set of drives is configured through placement groups. A volume can either have all of its copies in a single set of drives (spread across different nodes), or have each copy in a different set of drives.
Tip
When using the replication mechanism, StorPool recommends having 3 copies as a standard for critical data.
There are many parameters through which you can manage replication. For details, see 12.10. Volumes.
14.1.1. Triple replication
The minimum requirement for triple replication is at least three nodes (five are recommended).
With triple replication each block of data is stored on three different storage nodes. This protects the data against two simultaneous failures - for example, one node is down for maintenance and a drive on another node fails.
14.1.2. Dual replication
Dual replication can be used for non-critical data, or for data that can be recreated from other sources. Dual-replicated data can tolerate a single failure without service interruption.
This type of replication is suitable for test and staging environments, and can be deployed on a single node cluster (not recommended for production deployments). Deployment can also be performed on larger HDD-based backup clusters.
14.2. Erasure Coding
As of release 21.0 revision 21.0.75.1e0880427 StorPool supports erasure coding on NVMe drives.
14.2.1. Features
Erasure coding reduces the amount of data stored on the same hardware set, while at the same time preserving the level of data protection. It provides the following advantages:
Cross-node data protection
Erasure-coded data is always protected across servers with two parity objects, so that any two servers can fail, and user data is safe.
Delayed batch-encoding
Incoming data is initially written with triple replication. The erasure coding mechanism is automatically applied later. This way the data processing overhead is significantly reduced, and the impact on latency for user I/O operations is minimized.
Designed for always-on operations
Up to two storage nodes can be rebooted or brought down for maintenance while the storage system keeps running, and all data is available and in use.
A pure software feature
The implementation requires no additional hardware components.
14.2.2. Redundancy schemes
StorPool supports three redundancy schemes for erasure coding - 2+2, 4+2, or 8+2 - depending on the size of the cluster. The naming of the schemes follows the k+m pattern:
k is the number of data blocks stored.
m is the number of parity blocks stored.
A redundancy scheme can recover data when up to m blocks are lost.
For example, 4+2 stores 4 data blocks and protects them with two parity blocks. It can operate and recover when any 2 drives or nodes are lost.
When planning, consider the minimum required number of nodes (or fault sets) for each scheme:
-------------------------------------------------
| Scheme | Nodes | Raw space used | Overhead |
-------------------------------------------------
| 2+2    | 5+    | 2.4x           | 140%     |
| 4+2    | 7+    | 1.8x           | 80%      |
| 8+2    | 11+   | 1.5x           | 50%      |
-------------------------------------------------
For example, storing 1TB of user data using the 8+2 scheme requires 1.5TB of raw storage capacity.
The nodes have to be relatively similar in size; mixing in a few much larger nodes may make it impossible to use their capacity efficiently.
Note
Erasure coding requires making snapshots on a regular basis. Make sure your cluster is configured to create snapshots regularly, for example using the VolumeCare service. A single periodic snapshot per volume is required; more snapshots are optional.
15. Volume management

Volumes are the basic service of the StorPool storage system. A volume always has a name, a global ID, and a certain size. It can be read from and written to, and can be attached to hosts as a read-only or read-write block device under the /dev/storpool directory (also available under /dev/storpool-byid).
The volume name is a string consisting of one or more of the allowed characters: upper and lower case latin letters (a-z, A-Z), numbers (0-9), and the delimiters dot (.), colon (:), dash (-) and underscore (_).
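As a quick illustration of the naming rule above, a volume name can be pre-checked with a simple shell test before creation; this is only a sketch and not part of the StorPool tooling:
# hypothetical pre-check: accept only latin letters, digits, '.', ':', '-' and '_'
name="db.volume-01"
if [[ "$name" =~ ^[A-Za-z0-9.:_-]+$ ]]; then echo "valid volume name"; else echo "invalid volume name" >&2; fi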
15.1. Creating a volume

(Diagram: creating a volume)
15.2. Deleting a volume

(Diagram: deleting a volume)
15.3. Renaming a volume

(Diagram: renaming a volume)
15.4. Resizing a volume

(Diagram: resizing a volume)
15.5. Snapshots

Snapshots are read-only point-in-time images of volumes. They are created once and cannot be changed. They can be attached to hosts as read-only block devices under /dev/storpool.
All volumes and snapshots share the same name-space. Names of volumes and snapshots are unique within a StorPool cluster. This diagram illustrates the relationship between a snapshot and a volume. Volume vol1 is based on snapshot snap1. vol1 contains only the changes since snap1 was taken. In the common case this is a small amount of data. Arrows indicate a child-parent relationship. Each volume or snapshot may have exactly one parent which it is based upon. Writes to vol1 are recorded within the volume. Reads from vol1 may be served by vol1 or by its parent snapshot snap1, depending on whether vol1 contains changed data for the read request or not.


Snapshots and volumes are completely independent. Each snapshot may have many children (volumes and snapshots). Volumes cannot have children.

snap1 contains a full image. snap2 contains only the changes since snap1 was taken. vol1 and vol2 contain only the changes since snap2 was taken.
15.6. Creating a snapshot of a volume
There is a volume named vol1.

After the first snapshot the state of vol1 is recorded in a new snapshot named snap1. vol1 does not occupy any space now, but will record any new writes which come in after the creation of the snapshot. Reads from vol1 may fall through to snap1.
Then the state of vol1 is recorded in a new snapshot named snap2. snap2 contains the changes between the moment snap1 was taken and the moment snap2 was taken. snap2's parent is the original parent of vol1.
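A minimal CLI sketch of the sequence above, assuming the volume vol1 already exists; the snapshot names are illustrative and the exact syntax should be checked against 12.10. Volumes:
# record the current state of vol1 in a named snapshot (assumed syntax)
storpool volume vol1 snapshot snap1
# later, record the state again in a second snapshot
storpool volume vol1 snapshot snap2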
15.7. Converting a volume to a snapshot (freeze)
There is a volume named vol1, based on a snapshot snap0. vol1 contains only the changes since snap0 was taken.

After the freeze operation the state of vol1 is recorded in a new snapshot with the same name. The snapshot vol1 contains the changes between the moment snap0 was taken and the moment vol1 was frozen.
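A hedged CLI sketch of the freeze operation; the subcommand shown here is an assumption, so check 12.10. Volumes for the exact form:
# convert the volume vol1 into a read-only snapshot with the same name (assumed syntax)
storpool volume vol1 freeze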
15.8. Creating a volume based on an existing snapshot (a.k.a. clone)
Before the creation of vol1 there is a snapshot named snap1.

A new volume named vol1 is created. vol1 is based on snap1. The newly created volume does not occupy any space initially. Reads from vol1 may fall through to snap1 or to snap1's parents (if any).
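A short sketch of the clone operation, using the same parent syntax that appears in the remote-restore example in 17.11. Restoring a volume from remote snapshot:
# create a new volume vol1 whose parent is the existing snapshot snap1
storpool volume vol1 parent snap1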
15.9. Deleting a snapshot
vol1 and vol2 are based on snap1. snap1 is based on snap0. snap1 contains the changes between the moment snap0 was taken and the moment snap1 was taken. vol1 and vol2 contain the changes since the moment snap1 was taken.

After the deletion, vol1 and vol2 are based on snap1's original parent (if any); in the example they are now based on snap0. When deleting a snapshot, the changes contained in it are not propagated to its children; instead, StorPool keeps snap1 in the deleting state to prevent an explosion of disk space usage.
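A minimal sketch of the delete operation; repeating the snapshot name as a confirmation is assumed here, following the convention the StorPool CLI uses for destructive commands:
# delete the snapshot snap1; its children are rebased onto snap1's parent (name repeated as confirmation, assumed)
storpool snapshot snap1 delete snap1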
15.10. Rebase to null (a.k.a. promote)
vol1 is based on snap1. snap1 is in turn based on snap0. snap1 contains the changes between the moment snap0 was taken and the moment snap1 was taken. vol1 contains the changes since the moment snap1 was taken.

After promotion vol1 is not based on a snapshot. vol1 now contains all data, not just the changes since snap1 was taken. Any relation between snap1 and snap0 is unaffected.
15.11. Rebase
vol1 is based on snap1. snap1 is in turn based on snap0. snap1 contains the changes between the moment snap0 was taken and the moment snap1 was taken. vol1 contains the changes since the moment snap1 was taken.

After the rebase operation vol1 is based on snap0. vol1 now contains all changes since snap0 was taken, not just since snap1. snap1 is unchanged.
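Both operations above map onto a volume rebase command; the exact syntax below is an assumption based on the operation names (rebase onto a given parent vs. rebase to null), so verify it against 12.10. Volumes before use:
# rebase vol1 one level down, onto snap0 (assumed syntax)
storpool volume vol1 rebase snap0
# rebase to null (promote): vol1 no longer has a parent and holds all of its data (assumed syntax)
storpool volume vol1 rebase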
15.12. Example use of snapshots

This is a semi-realistic example of how volumes and snapshots may be used. There is a snapshot called base.centos7. This snapshot contains a base CentOS 7 VM image, which was prepared carefully by the service provider. There are 3 customers with 4 virtual machines each. All virtual machine images are based on CentOS 7, but may contain custom data, which is unique to each VM.

This example shows another typical use of snapshots - restore points back in time for a volume. There is one base image for CentOS 7, three snapshot restore points, and one live volume cust123.v.1.
16. StorPool iSCSI support
If StorPool volumes need to be accessed by hosts that cannot run the StorPool client service (e.g. VMware hypervisors), they may be exported using the iSCSI protocol.
As of version 19, StorPool implements an internal user-space TCP/IP stack, which in conjunction with the NIC hardware acceleration (user-mode drivers) allows for higher performance and independence of the kernel’s TCP/IP stack and its inefficiencies.
16.1. A Quick Overview of iSCSI
The iSCSI remote block device access protocol, as implemented by the StorPool iSCSI service, is a client-server protocol allowing clients (referred to as “initiators”) to read and write data to disks (referred to as “targets”) exported by iSCSI servers.
iSCSI is implemented in StorPool with Portal Groups and Portals.
A portal is one instance of the storpool_iscsi service, which listens on a TCP port (usually 3260) on specified IP addresses. Every portal has its own set of "targets" (exported volumes) that it provides service for.
A portal group is the "entry point" to the iSCSI service - a "floating" IP address that is on the first storpool_iscsi service in the cluster and is always kept active (by automatically moving to the next instance if the one serving it is stopped/dies). All initiators connect to that IP and get redirected to the relevant instance to communicate with their target.
16.2. An iSCSI Setup in a StorPool Cluster
The StorPool implementation of iSCSI provides a way to mark StorPool volumes as accessible to iSCSI initiators, define iSCSI portals where hosts running the StorPool iSCSI service listen for connections from initiators, define portal groups over these portals, and export StorPool volumes (iSCSI targets) to iSCSI initiators in the portal groups. To simplify the configuration of the iSCSI initiators, and also to provide load balancing and failover, each portal group has a floating IP address that is automatically brought up on only a single StorPool service at a given moment; the initiators are configured to connect to this floating address, authenticating if necessary, and are then redirected to the portal of the StorPool service that actually exports the target (volume) that they need to access.
Note
As of version 19, you don't need to add the IP addresses on the nodes; those are handled directly by the StorPool TCP implementation and are not visible in ifconfig or ip.
If you're going to use multiple VLANs, those are configured in the CLI and do not require setting up VLAN interfaces on the host itself, except for debugging/testing or if a local initiator is required to access volumes through iSCSI.
In the simplest setup, there is a single portal group with a floating IP address, there is a single portal for each StorPool host that runs the iSCSI service, all the initiators connect to the floating IP address and are redirected to the correct host. For quality of service or fine-grained access control, more portal groups may be defined and some volumes may be exported via more than one portal group.
Before configuring iSCSI, the interfaces that will be used for it need to be described in storpool.conf. Here is the general config format:
SP_ISCSI_IFACE=IFACE1,RESOLVE:IFACE2,RESOLVE:[flags]
This row means that the first iSCSI network is on IFACE1 and the second one on IFACE2. The order is important for the configuration later. RESOLVE is the resolve interface, if different from the interfaces themselves, i.e. if it's a bond or a bridge. [flags] is not required and, more importantly, must be omitted if not needed. Currently the only supported value is [lacp] (brackets included), used when the interfaces are in a LACP trunk.
Examples:
Multipath, two separate interfaces used directly:
SP_ISCSI_IFACE=eth0:eth1
Active-backup bond named bond0:
SP_ISCSI_IFACE=eth0,bond0:eth1,bond0
LACP bond named bond0:
SP_ISCSI_IFACE=eth0,bond0:eth1,bond0:[lacp]
Bridge interface cloudbr0 on top of a LACP bond:
SP_ISCSI_IFACE=eth0,cloudbr0:eth1,cloudbr0:[lacp]
A trivial iSCSI setup can be brought up by the series of StorPool CLI commands below. See the CLI tutorial for more information about the commands themselves. The setup does the following:
has a baseName/IQN of iqn.2019-08.com.example:poc-cluster;
has a floating IP address of 192.168.42.247, which is in VLAN 42;
two nodes from the cluster will be able to export in this group:
node id 1, with IP address 192.168.42.246
node id 3, with IP address 192.168.42.202
one client is defined, with IQN iqn.2019-08.com.example:poc-cluster:hv1;
one volume, called tinyvolume, will be exported to the defined client in the portal group.
Note
You need to obtain the exact IQN of the initiator, available at:
Windows Server: iSCSI initiator; it is automatically generated upon installation
VMware vSphere: it is automatically assigned upon creating a software iSCSI adapter
Linux-based (XenServer, etc.): /etc/iscsi/initiatorname.iscsi (see the sketch below)
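On a Linux-based initiator the IQN can simply be read from that file; a minimal example follows (the value shown is illustrative):
# read the initiator IQN generated by open-iscsi
cat /etc/iscsi/initiatorname.iscsi
InitiatorName=iqn.2019-08.com.example:poc-cluster:hv1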
# storpool iscsi config setBaseName iqn.2019-08.com.example:poc-cluster
OK
# storpool iscsi config portalGroup poc create
OK
# storpool iscsi config portalGroup poc addNet 192.168.42.247/24 vlan 42
OK
# storpool iscsi config portal create portalGroup poc address 192.168.42.246 controller 1
OK
# storpool iscsi config portal create portalGroup poc address 192.168.42.202 controller 3
OK
# storpool iscsi portalGroup list
---------------------------------------
| name | networksCount | portalsCount |
---------------------------------------
| poc | 1 | 2 |
---------------------------------------
# storpool iscsi portalGroup list portals
--------------------------------------------
| group | address | controller |
--------------------------------------------
| poc | 192.168.42.246:3260 | 1 |
| poc | 192.168.42.202:3260 | 3 |
--------------------------------------------
# storpool iscsi config initiator iqn.2019-08.com.example:poc-cluster:hv1 create
OK
# storpool volume tinyvolume template tinytemplate create # assumes tinytemplate exists
OK
# storpool iscsi config target create tinyvolume
OK
# storpool iscsi config export volume tinyvolume portalGroup poc initiator iqn.2019-08.com.example:poc-cluster:hv1
OK
# storpool iscsi initiator list
----------------------------------------------------------------------------------------------
| name | username | secret | networksCount | exportsCount |
----------------------------------------------------------------------------------------------
| iqn.2019-08.com.example:poc-cluster:hv1 | | | 0 | 1 |
----------------------------------------------------------------------------------------------
# storpool iscsi initiator list exports
---------------------------------------------------------------------------------------------------------------------------------------------
| name | volume | currentControllerId | portalGroup | initiator |
---------------------------------------------------------------------------------------------------------------------------------------------
| iqn.2019-08.com.example:poc-cluster:tinyvolume | tinyvolume | 1 | poc | iqn.2019-08.com.example:poc-cluster:hv1 |
---------------------------------------------------------------------------------------------------------------------------------------------
Below is a setup with two separate networks that allows for multipath. It uses the 192.168.41.0/24 network on the first interface, 192.168.42.0/24 on the second interface, and the .247 IP for the floating IP in both networks:
# storpool iscsi config setBaseName iqn.2019-08.com.example:poc-cluster
OK
# storpool iscsi config portalGroup poc create
OK
# storpool iscsi config portalGroup poc addNet 192.168.41.247/24
OK
# storpool iscsi config portalGroup poc addNet 192.168.42.247/24
OK
# storpool iscsi config portal create portalGroup poc address 192.168.41.246 controller 1
OK
# storpool iscsi config portal create portalGroup poc address 192.168.42.246 controller 1
OK
# storpool iscsi config portal create portalGroup poc address 192.168.41.202 controller 3
OK
# storpool iscsi config portal create portalGroup poc address 192.168.42.202 controller 3
OK
# storpool iscsi portalGroup list
---------------------------------------
| name | networksCount | portalsCount |
---------------------------------------
| poc | 2 | 4 |
---------------------------------------
# storpool iscsi portalGroup list portals
--------------------------------------------
| group | address | controller |
--------------------------------------------
| poc | 192.168.41.246:3260 | 1 |
| poc | 192.168.41.202:3260 | 3 |
| poc | 192.168.42.246:3260 | 1 |
| poc | 192.168.42.202:3260 | 3 |
--------------------------------------------
Note
Please note that the order of adding the networks corresponds to the order in SP_ISCSI_IFACE: the first network will be bound to the first interface appearing in this configuration. For how to list the configured iSCSI interfaces and the addresses exposed by a particular node, see 12.15.5. iscsi_tool.
There is no difference in exporting volumes in multi-path setups.
16.3. Routed iSCSI setup
16.3.1. Overview
Layer-3/routed networks present some challenges to the operation of StorPool iSCSI, unlike flat layer-2 networks:
routes need to be resolved for destinations based on the kernel routing table, instead of ARP;
floating IP addresses for the portal groups need to be accessible to the whole network;
The first task is accomplished by monitoring the kernel's routing table, the second with an integrated BGP speaker in storpool_iscsi.
Note
StorPool's iSCSI does not support Linux's policy-based routing, and is not affected by iptables, nftables, or any kernel filtering/networking component.
An iSCSI deployment in a layer-3 network has the following general elements:
nodes with storpool_iscsi in one or multiple subnets;
allocated IP(s) for portal group floating IP addresses;
a local routing daemon (bird, frr);
access to the network's routing protocol.
The storpool_iscsi daemon connects to a local routing daemon via BGP and announces the floating IPs from the node they are active on. The local routing daemon talks to the network via its own protocol (BGP, OSPF or something else) and passes on the updates.
Note
In a fully routed network, the local routing daemon is also responsible for announcing the IP for cluster management (managed by storpool_mgmt).
16.3.2. Configuration
The following needs to be added to storpool.conf:
SP_ISCSI_ROUTED=1
In routed networks, when adding the portalGroup floating IP address, you need to specify it as /32.
Note
These are example configurations and may not be the exact fit for a particular setup. Handle with care.
Note
In the examples below, the ASN of the network is 65500, StorPool has been assigned 65512 and will need to announce 192.168.42.247.
To enable the BGP speaker in storpool_iscsi, the following snippet for storpool.conf is needed (the parameters are described in the comment above it):
# ISCSI_BGP_IP:BGP_DAEMON_IP:AS_FOR_ISCSI:AS_FOR_THE_DAEMON
SP_ISCSI_BGP_CONFIG=127.0.0.2:127.0.0.1:65512:65512
And here's a snippet from bird.conf for a BGP speaker that talks to StorPool's iSCSI:
# variables
myas = 65512;
remoteas = 65500;
neigh = 192.168.42.1;
# filter to only export our floating IP
filter spip {
if (net = 192.168.42.247/32) then accept;
reject;
}
# external gateway
protocol bgp sw100g1 {
local as myas;
neighbor neigh as remoteas;
import all;
export filter spip;
direct;
gateway direct;
allow local as;
}
# StorPool iSCSI
protocol bgp spiscsi {
local as myas;
neighbor 127.0.0.1 port 2179 as myas;
import all;
export all;
multihop;
next hop keep;
allow local as;
}
Note
For protocols other than BGP, please note that StorPool iSCSI exports the route to the floating IP with a next-hop of the IP address configured for the node's portal, and this information needs to be preserved when announcing the route.
16.4. Caveats with a Complex iSCSI Architecture
In iSCSI portal definitions, a TCP address/port pair must be unique; only a single portal within the whole cluster may be defined at a single IP address and port. Thus, if the same StorPool iSCSI service should be able to export volumes in more than one portal group, the portals should be placed either on different ports or on different IP addresses (although it is fine that these addresses will be brought up on the same network interface on the host).
Note
Even though StorPool supports reusing IPs, separate TCP ports, etc., the general recommendation for different portal groups is to have a separate VLAN and IP range for each one. Using non-standard ports brings a lot of unknowns, and having multiple customers in the same VLAN raises security issues.
The redirecting portal on the floating address of a portal group always listens on port 3260. Similarly to the above, different portal groups must have different floating IP addresses, although they are automatically brought up on the same network interfaces as the actual portals within the groups.
Some iSCSI initiator implementations (e.g. VMware vSphere) may only connect to TCP port 3260 for an iSCSI service. In a more complex setup where a StorPool service on a single host may export volumes in more than one portal group, this might mean that the different portals must reside on different IP addresses, since the port number is the same.
For technical reasons, currently a StorPool volume may only be exported by a single StorPool service (host), even though it may be exported in different portal groups. For this reason, some care should be taken in defining the portal groups so that they may have at least some StorPool services (hosts) in common.
17. Multi-site and multi-cluster
There are two sets of features allowing connections and operations to be performed on different clusters in the same datacenter (17.1. Multicluster) or in different locations (17.2. Multi site).
General distinction between the two:
multicluster covers closely packed clusters (e.g. pods or racks) with a fast and low-latency connection between them
multi-site covers clusters in separate locations connected through an insecure and/or high-latency connection
17.1. Multicluster
The main use case for multicluster is seamless scalability in the same datacenter. A volume can be live-migrated between different sub-clusters, so workloads can be balanced between multiple sub-clusters in a location; such an arrangement is generally referred to as a multicluster setup.
(Diagram: multicluster setup in Location A - sub-clusters A0, A1, and A2, each with its own bridge and nodes, with the bridges of all three sub-clusters interconnected.)
17.2. Multi site
Remotely connected clusters in different locations are referred to as multi site. When two remote clusters are connected, they can efficiently transfer snapshots between them. The usual use cases are remote backup and disaster recovery (DR).
(Diagram: multi-site setup - Location A with clusters A0, A1, A2 and Location B with clusters B0, B1, B2, connected through a bridge link between cluster A1 and cluster B1.)
17.3. Setup
Connecting clusters regardless of their locations requires the storpool_bridge
service to be running on at least two nodes in each cluster.
Each node running the storpool_bridge
needs the following parameters to be
configured in /etc/storpool.conf
or /etc/storpool.conf.d/*.conf
files:
SP_CLUSTER_NAME=<Human readable name of the cluster>
SP_CLUSTER_ID=<location ID>.<cluster ID>
SP_BRIDGE_HOST=<IP address>
The following is required when a single IP will be failed over between the bridges; see 17.5.2. Single IP failed over between the nodes:
SP_BRIDGE_IFACE=<interface> # optional with IP failover
The SP_CLUSTER_NAME is a mandatory human-readable name for this cluster.
The SP_CLUSTER_ID is a unique ID assigned by StorPool for each existing cluster (for example nmjc.b). The cluster ID consists of two parts: the part before the dot (nmjc) is the location ID, and the part after the dot (b) is the sub-cluster ID.
The SP_BRIDGE_HOST is the IP address to listen on for connections from other bridges. Note that port 3749 should be unblocked in the firewalls between the two locations.
A backup template should be configured through mgmtConfig (see 12.22. Management Configuration). The backup template is needed to instruct the local bridge which template should be used for incoming snapshots.
Warning
The backupTemplateName mgmtConfig option must be configured in the destination cluster for storpool volume XXX backup LOCATION to work; otherwise the transfer won't start.
The SP_BRIDGE_IFACE is required when two or more bridges are configured with the same public/private key pairs. In this case the SP_BRIDGE_HOST is a floating IP address and will be configured on the SP_BRIDGE_IFACE on the host with the active bridge.
17.4. Connecting two clusters
In this example there are two clusters, named Cluster_A and Cluster_B. To have these two connected through their bridge services, we have to introduce each of them to the other.
(Diagram: Cluster A with Bridge A - SP_CLUSTER_ID locationAId.aId, SP_BRIDGE_HOST 10.10.10.1, public key aaaa.bbbb.cccc.dddd - and Cluster B with Bridge B - SP_CLUSTER_ID locationBId.bId, SP_BRIDGE_HOST 10.10.20.1, public key eeee.ffff.gggg.hhhh. The two bridges are not yet connected.)
Note
In case of a multicluster setup the location will be the same for both clusters. The procedure is the same for both cases, with the slight difference that in case of multicluster the remote bridges are usually configured with noCrypto.
17.4.1. Cluster A
The following parameters from Cluster_B will be required:
The SP_CLUSTER_ID - locationBId.bId
The SP_BRIDGE_HOST IP address - 10.10.20.1
The public key located in /usr/lib/storpool/bridge/bridge.key.txt on the remote bridge host in Cluster_B - eeeeeeeeeeeee.ffffffffffff.ggggggggggggg.hhhhhhhhhhhhh
By using the CLI we can add Cluster_B's location with the following commands in Cluster_A:
user@hostA # storpool location add locationBId location_b
user@hostA # storpool cluster add location_b bId
user@hostA # storpool cluster list
--------------------------------------------
| name | id | location |
--------------------------------------------
| location_b-cl1 | bId | location_b |
--------------------------------------------
The remote name is location_b-cl1, where the clN number is automatically generated based on the cluster ID. The last step in Cluster_A is to register Cluster_B's bridge. The command looks like this:
user@hostA # storpool remoteBridge register location_b-cl1 10.10.20.1 eeeeeeeeeeeee.ffffffffffff.ggggggggggggg.hhhhhhhhhhhhh
Registered bridges in Cluster_A:
user@hostA # storpool remoteBridge list
----------------------------------------------------------------------------------------------------------------------------
| ip | remote | minimumDeleteDelay | publicKey | noCrypto |
----------------------------------------------------------------------------------------------------------------------------
| 10.10.20.1 | location_b-cl1 | | eeeeeeeeeeeee.ffffffffffff.ggggggggggggg.hhhhhhhhhhhhh | 0 |
----------------------------------------------------------------------------------------------------------------------------
Hint
The public key in /usr/lib/storpool/bridge/bridge.key.txt will be generated on the first run of the storpool_bridge service.
Note
The noCrypto option is usually set to 1 in case of a multicluster with a secure datacenter network, for higher throughput and lower latency during migrations.
(Diagram: Bridge A in Cluster A now points to Bridge B in Cluster B - the connection is registered in one direction only.)
17.4.2. Cluster B
Similarly, the parameters from Cluster_A will be required for registering the location, cluster and bridge(s) in Cluster B:
The SP_CLUSTER_ID - locationAId.aId
The SP_BRIDGE_HOST IP address in Cluster_A - 10.10.10.1
The public key in /usr/lib/storpool/bridge/bridge.key.txt on the remote bridge host in Cluster_A - aaaaaaaaaaaaa.bbbbbbbbbbbb.ccccccccccccc.ddddddddddddd
Similarly, the commands will be:
user@hostB # storpool location add locationAId location_a
user@hostB # storpool cluster add location_a aId
user@hostB # storpool cluster list
--------------------------------------------
| name | id | location |
--------------------------------------------
| location_a-cl1 | aId | location_a |
--------------------------------------------
user@hostB # storpool remoteBridge register location_a-cl1 10.10.10.1 aaaaaaaaaaaaa.bbbbbbbbbbbb.ccccccccccccc.ddddddddddddd
user@hostB # storpool remoteBridge list
----------------------------------------------------------------------------------------------------------------------------
| ip         | remote         | minimumDeleteDelay | publicKey                                              | noCrypto |
----------------------------------------------------------------------------------------------------------------------------
| 10.10.10.1 | location_a-cl1 |                    | aaaaaaaaaaaaa.bbbbbbbbbbbb.ccccccccccccc.ddddddddddddd | 0        |
----------------------------------------------------------------------------------------------------------------------------
At this point, provided network connectivity is working, the two bridges will be connected.
(Diagram: Bridge A in Cluster A and Bridge B in Cluster B now connected in both directions.)
17.5. Bridge redundancy
There are two ways to add redundancy for the bridge service by configuring and starting the storpool_bridge service on two (or more) nodes in each cluster. In both cases only one bridge is active at a time, and it is failed over when the node or the active service is restarted.
17.5.1. Separate IP addresses
Configure and start the storpool_bridge with a separate SP_BRIDGE_HOST address and a separate set of public/private key pairs on each node. In this case each of the bridge nodes has to be registered in the same way as explained in the 17.4. Connecting two clusters section. The SP_BRIDGE_IFACE parameter is left unset, and the SP_BRIDGE_HOST address is expected to be present on each node where the storpool_bridge service is started.
In this case each of the bridge nodes in ClusterA would have to be configured in ClusterB and vice versa.
17.5.2. Single IP failed over between the nodes
For this, configure and start the storpool_bridge service on the first node. Then distribute the /usr/lib/storpool/bridge/bridge.key and /usr/lib/storpool/bridge/bridge.key.txt files to the next node where the storpool_bridge service will be running.
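For example, the key files can be copied to the second bridge node with scp; the host name bridge-node2 is just a placeholder:
# copy the bridge key pair to the next node that will run storpool_bridge
scp /usr/lib/storpool/bridge/bridge.key /usr/lib/storpool/bridge/bridge.key.txt root@bridge-node2:/usr/lib/storpool/bridge/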
The SP_BRIDGE_IFACE is required and represents the interface where the SP_BRIDGE_HOST address will be configured. The SP_BRIDGE_HOST will be up only on the node where the active bridge service is running, until either the service or the node itself gets restarted.
With this configuration there will be only one bridge registered in the remote cluster(s), regardless of the number of nodes with running storpool_bridge in the local cluster.
The failover SP_BRIDGE_HOST is better suited for NAT/port-forwarding cases.
17.6. Bridge throughput performance
The throughput performance of a bridge connection depends on several factors - network throughput, network latency, CPU speed, and disk latency - not necessarily in that order of importance. Each can become a bottleneck and may require additional tuning in order to get higher throughput from the available link between the two sites.
17.6.1. Network
For high-throughput links, latency is the most important factor for achieving higher link utilization. For example, a low-latency 10 Gbps link will be easily saturated (provided crypto is off), but would require some tuning of the TCP window size when the latency is higher. The same applies to lower-bandwidth links with higher latency.
For these cases the send buffer size can be increased in small increments so that the TCP window is optimized. Check the 12.1. Location section for more info on how to update the send buffer size in each location.
Note
To find the best send buffer size for throughput from the primary to the backup site, fill a volume with data in the primary (source) site, then create a backup to the backup (remote) site. While observing the bandwidth utilized, increase the send buffers in small increments in the source and the destination cluster until the throughput either stops rising or stays at an acceptable level.
Note that increasing the send buffers above this value can lead to delays when recovering a backup in the opposite direction.
Further sysctl changes might be required, depending on the NIC driver; check /usr/share/doc/storpool/examples/bridge/90-StorPoolBridgeTcp.conf on the node with the storpool_bridge service for more info.
17.6.2. CPU
The CPU usually becomes a bottleneck only when crypto is on; sometimes it helps to move the bridge service to a node with a faster CPU.
If a faster CPU is not available in the same cluster, setting SP_BRIDGE_SLEEP_TYPE to hsleep or even no might help. Note that when this is configured, storpool_cg will attempt to isolate a full CPU core (i.e. with the second hardware thread free from other processes).
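A minimal storpool.conf sketch of the setting discussed above; the chosen value depends on the cluster and should be coordinated with the cgroup configuration:
# /etc/storpool.conf.d/bridge.conf (file name is illustrative)
# trade a dedicated CPU core for lower bridge latency
SP_BRIDGE_SLEEP_TYPE=hsleep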
17.6.3. Disks throughput
The default remote recovery setting (SP_REMOTE_RECOVERY_PARALLEL_REQUESTS_PER_DISK) is relatively low, especially for dedicated backup clusters, so when the underlying disks in the receiving cluster are underutilized (this does not happen with flash media) they become the bottleneck. This parameter can be tuned for higher parallelism. An example would be a small cluster of 3 nodes with 8 disks each, which translates to a default queue depth of 48 from the bridge, while there are 8 * 3 * 32 requests available from the underlying disks and (by default, with a 10 Gbps link) 2048 requests available from the bridge service (256 on a 1 Gbps link).
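A hedged storpool.conf sketch for raising this parallelism in the receiving cluster; the value 8 is purely illustrative and should be sized to the underlying drives:
# /etc/storpool.conf.d/bridge-recovery.conf (file name is illustrative)
# allow more in-flight remote recovery requests per disk (illustrative value)
SP_REMOTE_RECOVERY_PARALLEL_REQUESTS_PER_DISK=8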
Note
The storpool_server services require a restart in order for the changes to be applied.
17.7. Exports
A snapshot in one of the clusters can be exported and become visible in all clusters in the location it was exported to. For example, a snapshot called snap1 can be exported with:
user@hostA # storpool snapshot snap1 export location_b
It becomes visible in Cluster_B, which is part of location_b, and can be listed with:
user@hostB # storpool snapshot list remote
-------------------------------------------------------------------------------------------------------
| location | remoteId | name | onVolume | size | creationTimestamp | tags |
-------------------------------------------------------------------------------------------------------
| location_b | locationAId.aId.1 | snap1 | | 107374182400 | 2019-08-11 15:18:02 | |
-------------------------------------------------------------------------------------------------------
The snapshot may also be exported to the location of the source cluster where the snapshot resides. This way it becomes visible to all sub-clusters in this location.
17.8. Remote clones
Any snapshot export can be cloned locally. For example, to clone a remote snapshot with a globalId of locationAId.aId.1 locally, we can use:
user@hostB # storpool snapshot snap1-copy template hybrid remote location_a locationAId.aId.1
(Diagram: snapshot snap1 with globalId locationAId.aId.1 in Cluster A is transferred through the bridges to Cluster B, where its clone (snap1_clone in the diagram) keeps the same globalId.)
The clone of the snapshot in Cluster_B will be named snap1-copy (shown as snap1_clone in the diagram), with all parameters taken from the hybrid template.
Note
Note that the name of the snapshot in Cluster_B could also be exactly the same as in the source; this holds for all sub-clusters in a multicluster setup, as well as for clusters in different locations in a multi-site setup.
The transfer will start immediately. Only written parts of the snapshot will be transferred between the sites. If snap1 has a size of 100GB, but only 1GB of data was ever written to the volume when it was snapshotted, approximately this amount of data will eventually be transferred between the two (sub-)clusters.
If another snapshot in the remote cluster is already based on snap1 and is then exported, the actual transfer will again include only the differences between snap1 and snap2, since snap1 already exists in Cluster_B.
(Diagram: snap2 with globalId locationAId.aId.2, based on snap1, is transferred from Cluster A to Cluster B; only the differences between snap1 and snap2 cross the bridge, and the clones in Cluster B keep the same parent relationship.)
The globalId for this snapshot will be the same for all sites it has been transferred to.
17.9. Creating a remote backup on a volume
The volume backup feature is in essence a set of steps that automate the backup procedure for a particular volume.
For example, to back up a volume named volume1 in Cluster_A to Cluster_B, we use:
user@hostA # storpool volume volume1 backup Cluster_B
The above command will actually trigger the following set of events:
Creates a local temporary snapshot of volume1 in Cluster_A, to be transferred to Cluster_B
Exports the temporary snapshot to Cluster_B
Instructs Cluster_B to initiate the transfer for this snapshot
Exports the transferred snapshot in Cluster_B, so that it is visible from Cluster_A
Deletes the local temporary snapshot
For example, if a backup operation has been initiated for a volume called volume1 in Cluster_A, the progress of the operation can be followed with:
user@hostA # storpool snapshot list exports
-------------------------------------------------------------
| location | snapshot | globalId | backingUp |
-------------------------------------------------------------
| location_b | volume1@1433 | locationAId.aId.p | true |
-------------------------------------------------------------
Once this operation completes, the temporary snapshot will no longer be visible as an export, and a snapshot with the same globalId will be visible remotely:
user@hostA # storpool snapshot list remote
------------------------------------------------------------------------------------------------------
| location | remoteId | name | onVolume | size | creationTimestamp | tags |
------------------------------------------------------------------------------------------------------
| location_b | locationAId.aId.p | volume1 | volume1 | 107374182400 | 2019-08-13 16:27:03 | |
------------------------------------------------------------------------------------------------------
Note
You must have a template configured in mgmtConfig backupTemplateName in Cluster_B for this to work.
17.10. Creating an atomic remote backup for multiple volumes
Sometimes a set of volumes is used simultaneously in the same virtual machine; an example would be different filesystems for a database and its journal. To be able to restore all volumes to the same point in time, a group backup can be initiated:
user@hostA# storpool volume groupBackup Cluster_B volume1 volume2
Note
The same underlying feature is used by the VolumeCare for keeping consistent snapshots for all volumes on a virtual machine.
17.11. Restoring a volume from remote snapshot
Restoring the volume to a previous state from a remote snapshot requires the following steps:
Create a local snapshot from the remotely exported one:
user@hostA # storpool snapshot volume1-snap template hybrid remote location_b locationAId.aId.p
OK
There are some bits to explain in the above example, from left to right:
volume1-snap - the name of the local snapshot that will be created.
template hybrid - instructs StorPool what the replication and placement for the locally created snapshot will be.
remote location_b locationAId.aId.p - instructs StorPool where to look for this snapshot and what its globalId is.
If the bridges and the connection between the locations are operational, the transfer will begin immediately.
Next create a volume with the newly created snapshot as a parent:
user@hostA # storpool volume volume1-tmp parent volume1-snap
Finally, the volume clone has to be attached where it is needed.
The last two steps can be adjusted a bit: rename the old volume to something different and directly create a volume with the same name from the restored snapshot. This is handled differently in different orchestration systems. The procedure for restoring multiple volumes from a group backup requires the same set of steps.
See VolumeCare 5.5. revert for an example implementation.
Note
From 19.01 onwards, if the snapshot transfer hasn't completed yet when the volume is created, read operations on an object that is not yet transferred will be forwarded through the bridge and processed by the remote cluster.
17.12. Remote deferred deletion
Note
This feature is available for both multicluster and multi-site configurations. Note that the minimumDeleteDelay is per bridge, not per location, so all bridges to a remote location should be (re)registered with this setting.
The remote bridge can be registered with remote deferred deletion enabled. This feature allows a user in Cluster A to unexport remote snapshots and set them for deferred deletion in Cluster B.
An example for the case without deferred deletion enabled: Cluster_A and Cluster_B are two StorPool clusters in locations A and B connected with a bridge. A volume named volume1 in Cluster_A has two backup snapshots in Cluster_B, called volume1@281 and volume1@294.
(Diagram: volume1 in Cluster A and its two backup snapshots volume1@281 and volume1@294 in Cluster B, with the two clusters connected through their bridges.)
The remote snapshots can be unexported from Cluster_A with the deleteAfter flag, but it will be silently ignored in Cluster_B.
To enable this feature, the following steps have to be completed for the remote bridge for Cluster_A (the first step is shown in the sketch below):
The bridge in Cluster_A should be registered with minimumDeleteDelay in Cluster_B.
Deferred snapshot deletion should be enabled in Cluster_B; for details, see 12.22. Management Configuration.
This will enable setting the deleteAfter parameter on an unexport operation in Cluster_B initiated from Cluster_A.
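A hedged sketch of the first step, re-registering Cluster_A's bridge in Cluster_B with a minimum delete delay; the minimumDeleteDelay argument placement and the 1d value are assumptions, so verify them against the CLI reference before use:
user@hostB # storpool remoteBridge register location_a-cl1 10.10.10.1 aaaaaaaaaaaaa.bbbbbbbbbbbb.ccccccccccccc.ddddddddddddd minimumDeleteDelay 1d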
With the above example volume and remote snapshots, a user in Cluster_A can unexport the volume1@294 snapshot and set its deleteAfter flag to 7 days from the unexport with:
user@hostA # storpool snapshot remote location_b locationAId.aId.q unexport deleteAfter 7d
OK
After the completion of this operation the following events will occur:
The volume1@294 snapshot will immediately stop being visible in Cluster_A.
The snapshot will get a deleteAfter flag with a timestamp a week from the time of the unexport call.
A week later the snapshot will be deleted, but only if deferred snapshot deletion is still turned on.
17.13. Volume and snapshot move
17.13.1. Volume move
A volume can be moved to a neighbor sub-cluster in a multicluster environment either live (while attached) or offline (without attachment). This is available only for multicluster; it is not possible for multi site, where only snapshots can be transferred.
To move a volume use:
# storpool volume <volumeName> moveToRemote <clusterName>
The above command will succeed only if the volume is not attached on any of the nodes in this sub-cluster. To move the volume live while it is still attached, the additional option onAttached should instruct the cluster how to proceed. For example, this command:
Lab-D-cl1> volume test moveToRemote Lab-D-cl2 onAttached export
will move the volume to the Lab-D-cl2 sub-cluster and, if the volume is attached in the present cluster, will export it back to Lab-D-cl1.
This is equivalent to:
Lab-D-cl1> multiCluster on
[MC] Lab-D-cl1> cluster cmd Lab-D-cl2 attach volume test client 12
OK
or to directly executing the same CLI command in multicluster mode on a host in the Lab-D-cl2 cluster.
Note
Moving a volume will also trigger moving all of its snapshots. When there are parent snapshots with many child volumes, as a space-saving measure such snapshots might end up in each sub-cluster that their child volumes were moved to.
17.13.2. Snapshot move
Moving a snapshot is essentially the same as moving a volume, with the difference that it cannot be moved while attached.
For example:
Lab-D-cl1> snapshot testsnap moveToRemote Lab-D-cl2
This will succeed only if the snapshot is not attached locally.
Moving a snapshot that is part of a volume's snapshot chain will also trigger copying its parent snapshots; this is managed automatically by the cluster.
18. Rebalancing StorPool
18.1. Overview
In some situations the data in the StorPool cluster needs to be rebalanced. This is performed by the balancer and the relocator tools. The relocator is an integral part of the StorPool management service; the balancer is presently an external tool executed on some of the nodes with access to the API.
Note
Be advised that the balancer tool will create some files it needs in the present working directory.
The rebalancing operation is performed in the following steps:
The balancer tool is executed to calculate the new state of the cluster.
The results from the balancer are verified by a set of automated scripts.
The results are also manually reviewed to check whether they contain any inconsistencies and whether they achieve the intended goals. These results are available by running storpool balancer disks and will be printed at the end of balancer.sh.
If the result is not satisfactory, the balancer is executed with different parameters until a satisfactory result is obtained.
Once the proposed end result is satisfactory, the calculated state is loaded into the relocator tool by doing storpool balancer commit. Note that this step can be reversed only with the --restore-state option, which will revert to the initial state; if a balancing operation has run for a while and for some reason needs to be "cancelled", that is currently not supported.
The relocator tool performs the actual move of the data.
The progress of the relocator tool can be monitored with storpool task list for the currently running tasks, storpool relocator status for an overview of the relocator state, and storpool relocator disks (warning: slow command) for the full relocation state.
The balancer tool is executed via the /usr/lib/storpool/balancer.sh wrapper and accepts the following options:
-----------------------------------------------------------------------------------------------------------------------
| Option                     | Meaning                                                                                  |
-----------------------------------------------------------------------------------------------------------------------
| -g placementGroup          | Work only on the specified placement group                                              |
| -c factor                  | Factor for how much data to try to move around, from 0 to 10. No default, required.     |
| -f percent                 | Allow drives to be filled up to this percentage, from 0 to 99. Default 90.               |
| -M maxDataToAdd            | Limit the amount of data to copy to a single drive, to be able to rebalance "in pieces". |
| -m maxAgCount              | Limit the maximum allocation group count on drives to this (effectively their usable size). |
| -b placementGroup          | Use disks in the specified placement group to restore replication in critical conditions. |
| -F                         | Only move data from fuller to emptier drives (default -c factor is 3 when -F is used).   |
| -A                         | Don't only move data from fuller to emptier drives (default -c is 10 when -A is used).   |
| -R                         | Only restore replication for degraded volumes.                                           |
| -d diskId [-d diskId]      | Put data only on the selected disks.                                                     |
| -D diskId [-D diskId]      | Don't move data from those disks.                                                        |
| --only-empty-disk diskId   | Like -D for all other disks.                                                             |
| -V vagId [-V vagId]        | Skip balancing vagId.                                                                    |
| -S                         | Prefer tail SSD.                                                                         |
| -o overridesPgName         | Specify override placement group name (required only if ...).                           |
| --min-disk-full X          | Don't remove data from a disk if it is not at least X% full.                             |
| --ignore-src-pg-violations | Exactly what it says.                                                                    |
| --min-replication R        | Minimum replication required.                                                            |
| --restore-state            | Revert to the initial state of the disks (before the balancer commit execution).         |
| -v                         | Verbose output (shows how all drives in the cluster would be affected).                  |
-----------------------------------------------------------------------------------------------------------------------
-A and -F are the reverse of each other and mutually exclusive.
The -c value is essentially the trade-off between how well balanced the placement groups end up and the amount of data moved to accomplish that. A lower factor means less data to be moved around, but sometimes more inequality between the data on the disks; a higher one means more data to be moved, but sometimes a better result in terms of equal amounts of data on each drive.
On clusters with drives of unsupported size (HDDs > 4TB) the -m option is required. It will limit the data moved onto these drives to up to the set number of allocation groups. This is done because the performance per TB of larger drives is lower, which degrades the performance of the whole cluster in high-performance use cases.
The -M option is useful when a full rebalancing would involve many tasks until completion and could impact other operations (such as remote transfers, or the time required for a currently running recovery to complete). With the -M option the amount of data loaded by the balancer for each disk may be reduced, and a more rebalanced state is achieved through several smaller rebalancing operations.
The -f option is required on clusters whose drives are filled above 90%. Extreme care should be taken when balancing in such cases.
The -b option can be used to move data between placement groups (in most cases from SSDs to HDDs).
18.2. Restoring volume redundancy on a failed drive
Situation: we have lost drive 1802 in placementGroup ssd
. We want to remove it from the cluster and restore the redundancy of the data. We need to do the following:
storpool disk 1802 forget # this will also remove the drive from all placement groups it participated in
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -R
storpool balancer commit # to actually load the data into the relocator and start the re-balancing operation
18.3. Restoring volume redundancy for two failed drives (single-copy situation)
(Emergency) Situation: we have lost drives 1802 and 1902 in placementGroup ssd
. We want to remove them from the cluster and restore the redundancy of the data. We need to do the following:
storpool disk 1802 forget # this will also remove the drive from all placement groups it participated in
storpool disk 1902 forget # this will also remove the drive from all placement groups it participated in
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -F --min-replication 2 # first balancing run, to create a second copy of the data
storpool balancer commit # to actually load the data into the relocator and start the re-balancing operation
# wait for the balancing to finish
/usr/lib/storpool/balancer.sh -R # second balancing run, to restore full redundancy
storpool balancer commit # to actually load the data into the relocator and start the re-balancing operation
18.4. Adding new drives and rebalancing data on them
Situation: we have added SSDs 1201, 1202 and HDDs 1510, 1511, that need to go into placement groups ssd
and hdd
respectively, and we want to re-balance the cluster data so that it is re-dispersed onto the new disks as well. We have no other placement groups in the cluster.
storpool placementGroup ssd addDisk 1201 addDisk 1202
storpool placementGroup hdd addDisk 1510 addDisk 1511
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -F -c 0 # rebalance all placement groups, move data from fuller to emptier drives
storpool balancer commit # to actually load the data into the relocator and start the re-balancing operation
18.5. Restoring volume redundancy with rebalancing data on other placementGroup
Situation: we have to restore the redundancy of a hybrid cluster (2 copies on HDDs, one on SSDs) while the ssd
placementGroup is out of free space because a few SSDs have recently failed. We can’t replace the failed drives with new ones for the moment.
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -R -c 0 -b hdd # use placementGroup ``hdd`` as a backup and move some data from SSDs
storpool balancer commit # to actually load the data into the relocator and start the re-balancing operation
Note
The -f
argument could be further used in order to instruct the balancer how full to keep the cluster and thus control how much data will be moved in the backup placement group.
18.6. Decommissioning a live node
Situation: a node in the cluster needs to be decommissioned, so that the data on its drives needs to be moved away. The drive numbers on that node are 101
, 102
and 103
.
Note
You have to make sure you have enough space to restore the redundancy before proceeding.
storpool disk 101 softEject # mark all drives for evacuation
storpool disk 102 softEject
storpool disk 103 softEject
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -R -c 0 # rebalance all placement groups, -F has the same effect in this case
storpool balancer commit # to actually load the data into the relocator and start the re-balancing operation
18.7. Decommissioning a dead node
Situation: a node in the cluster needs to be decommissioned, as it has died and cannot be brought back. The drive numbers on that node are 101
, 102
and 103
.
Note
You have to make sure you have enough space to restore the redundancy before proceeding.
storpool disk 101 forget # remove the drives from all placement groups
storpool disk 102 forget
storpool disk 103 forget
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -R -c 0 # rebalance all placement groups
storpool balancer commit # to actually load the data into the relocator and start the re-balancing operation
18.8. Resolving imbalances in the drive usage
Situation: we have an imbalance in the drive usage in the whole cluster and we want to improve it.
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -F -c 0 # rebalance all placement groups
/usr/lib/storpool/balancer.sh -F -c 3 # retry to see if we get a better result with more data movements
storpool balancer commit # to actually load the data into the relocator and start the re-balancing operation
18.9. Resolving imbalances in the drive usage with three-node clusters
Situation: we have an imbalance in the drive usage in the whole cluster and we want to improve it. We have a three-node hybrid cluster and proper balancing requires larger moves of “unrelated” data:
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -F -c 0 # rebalance all placement groups
/usr/lib/storpool/balancer.sh -A -c 10 # retry to see if we get a better result with more data movements
storpool balancer commit # to actually load the data into the relocator and start the re-balancing operation
18.10. Reverting balancer to a previous state
Situation: we have committed a rebalancing operation, but want to revert back to the previous state:
cd ~/storpool/balancer # it's recommended to run the following commands in a screen/tmux session
ls # list all saved states and choose what to revert to
/usr/lib/storpool/balancer.sh --restore-state 2022-10-28-15-39-40 # revert to 2022-10-28-15-39-40
storpool balancer commit # to actually load the data into the relocator and start the re-balancing operation
18.11. Reading the output of storpool balancer disks
Here is an example output from storpool balancer disks:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| disk | server | size | stored | on-disk | objects |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 1 | 14.0 | 373 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 6.1 MB -> 6.1 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 405000 |
| 1101 | 11.0 | 447 GB | 16 GB -> 15 GB (-1.0 GB / 1.4 GB) | 18 GB -> 17 GB (-1.1 GB / 1.4 GB) | 11798 -> 10040 (-1758 / +3932) / 480000 |
| 1102 | 11.0 | 447 GB | 16 GB -> 15 GB (-268 MB / 1.3 GB) | 17 GB -> 17 GB (-301 MB / 1.4 GB) | 10843 -> 10045 (-798 / +4486) / 480000 |
| 1103 | 11.0 | 447 GB | 16 GB -> 15 GB (-1.0 GB / 1.8 GB) | 18 GB -> 16 GB (-1.2 GB / 1.9 GB) | 12123 -> 10039 (-2084 / +3889) / 480000 |
| 1104 | 11.0 | 447 GB | 16 GB -> 15 GB (-757 MB / 1.3 GB) | 17 GB -> 16 GB (-899 MB / 1.3 GB) | 11045 -> 10072 (-973 / +4279) / 480000 |
| 1111 | 11.0 | 466 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 5.1 MB -> 5.1 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 495000 |
| 1112 | 11.0 | 366 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 5.1 MB -> 5.1 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 390000 |
| 1121 | 11.0 | 931 GB | 22 GB -> 21 GB (-1009 MB / 830 MB) | 22 GB -> 21 GB (-1.0 GB / 872 MB) | 13713 -> 12698 (-1015 / +3799) / 975000 |
| 1122 | 11.0 | 931 GB | 21 GB -> 21 GB (-373 MB / 2.0 GB) | 22 GB -> 21 GB (-379 MB / 2.0 GB) | 13469 -> 12742 (-727 / +3801) / 975000 |
| 1123 | 11.0 | 931 GB | 22 GB -> 21 GB (-1.1 GB / 1.9 GB) | 22 GB -> 21 GB (-1.1 GB / 2.0 GB) | 14859 -> 12629 (-2230 / +4102) / 975000 |
| 1124 | 11.0 | 931 GB | 21 GB -> 21 GB (36 MB / 1.8 GB) | 21 GB -> 21 GB (92 MB / 1.9 GB) | 13806 -> 12743 (-1063 / +3389) / 975000 |
| 1201 | 12.0 | 447 GB | 18 GB -> 15 GB (-2.9 GB / 633 MB) | 19 GB -> 16 GB (-3.0 GB / 658 MB) | 14148 -> 10070 (-4078 / +3050) / 480000 |
| 1202 | 12.0 | 447 GB | 17 GB -> 15 GB (-2.1 GB / 787 MB) | 19 GB -> 16 GB (-2.3 GB / 815 MB) | 13243 -> 10067 (-3176 / +2576) / 480000 |
| 1203 | 12.0 | 447 GB | 17 GB -> 15 GB (-2.0 GB / 3.3 GB) | 19 GB -> 16 GB (-2.4 GB / 3.5 GB) | 12746 -> 10062 (-2684 / +3375) / 480000 |
| 1204 | 12.0 | 447 GB | 18 GB -> 15 GB (-2.7 GB / 1.1 GB) | 19 GB -> 16 GB (-2.9 GB / 1.1 GB) | 12835 -> 10075 (-2760 / +3248) / 480000 |
| 1212 | 12.0 | 366 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 6.0 MB -> 6.0 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 390000 |
| 1221 | 12.0 | 931 GB | 20 GB -> 21 GB (569 MB / 1.5 GB) | 21 GB -> 21 GB (587 MB / 1.6 GB) | 13115 -> 12616 (-499 / +3736) / 975000 |
| 1222 | 12.0 | 931 GB | 22 GB -> 21 GB (-979 MB / 307 MB) | 22 GB -> 21 GB (-1013 MB / 317 MB) | 12938 -> 12697 (-241 / +3291) / 975000 |
| 1223 | 12.0 | 931 GB | 22 GB -> 21 GB (-1.1 GB / 781 MB) | 22 GB -> 21 GB (-1.2 GB / 812 MB) | 13968 -> 12718 (-1250 / +3302) / 975000 |
| 1224 | 12.0 | 931 GB | 21 GB -> 21 GB (-784 MB / 332 MB) | 22 GB -> 21 GB (-810 MB / 342 MB) | 13741 -> 12692 (-1049 / +3314) / 975000 |
| 1225 | 12.0 | 931 GB | 21 GB -> 21 GB (-681 MB / 849 MB) | 22 GB -> 21 GB (-701 MB / 882 MB) | 13608 -> 12748 (-860 / +3420) / 975000 |
| 1226 | 12.0 | 931 GB | 22 GB -> 21 GB (-1.1 GB / 825 MB) | 22 GB -> 21 GB (-1.1 GB / 853 MB) | 13066 -> 12692 (-374 / +3817) / 975000 |
| 1301 | 13.0 | 447 GB | 13 GB -> 15 GB (2.6 GB / 4.2 GB) | 14 GB -> 17 GB (2.7 GB / 4.4 GB) | 7244 -> 10038 (+2794 / +6186) / 480000 |
| 1302 | 13.0 | 447 GB | 12 GB -> 15 GB (3.0 GB / 3.7 GB) | 13 GB -> 17 GB (3.1 GB / 3.9 GB) | 7507 -> 10063 (+2556 / +5619) / 480000 |
| 1303 | 13.0 | 447 GB | 14 GB -> 15 GB (1.3 GB / 3.2 GB) | 15 GB -> 17 GB (1.3 GB / 3.4 GB) | 7888 -> 10038 (+2150 / +5884) / 480000 |
| 1304 | 13.0 | 447 GB | 13 GB -> 15 GB (2.7 GB / 3.7 GB) | 14 GB -> 17 GB (2.8 GB / 3.9 GB) | 7660 -> 10045 (+2385 / +5870) / 480000 |
| 1311 | 13.0 | 466 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 6.1 MB -> 6.1 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 495000 |
| 1312 | 13.0 | 366 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 6.1 MB -> 6.1 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 390000 |
| 1321 | 13.0 | 931 GB | 21 GB -> 21 GB (-193 MB / 1.1 GB) | 21 GB -> 21 GB (-195 MB / 1.2 GB) | 13365 -> 12765 (-600 / +5122) / 975000 |
| 1322 | 13.0 | 931 GB | 22 GB -> 21 GB (-1.4 GB / 1.1 GB) | 23 GB -> 21 GB (-1.4 GB / 1.1 GB) | 12749 -> 12739 (-10 / +4651) / 975000 |
| 1323 | 13.0 | 931 GB | 21 GB -> 21 GB (-504 MB / 2.2 GB) | 22 GB -> 21 GB (-496 MB / 2.3 GB) | 13386 -> 12695 (-691 / +4583) / 975000 |
| 1325 | 13.0 | 931 GB | 21 GB -> 20 GB (-698 MB / 557 MB) | 22 GB -> 21 GB (-717 MB / 584 MB) | 13113 -> 12768 (-345 / +2668) / 975000 |
| 1326 | 13.0 | 931 GB | 21 GB -> 21 GB (-507 MB / 724 MB) | 22 GB -> 21 GB (-522 MB / 754 MB) | 13690 -> 12704 (-986 / +3327) / 975000 |
| 1401 | 14.0 | 223 GB | 8.3 GB -> 7.6 GB (-666 MB / 868 MB) | 9.3 GB -> 8.5 GB (-781 MB / 901 MB) | 3470 -> 5043 (+1573 / +2830) / 240000 |
| 1402 | 14.0 | 447 GB | 9.8 GB -> 15 GB (5.6 GB / 5.7 GB) | 11 GB -> 17 GB (5.8 GB / 6.0 GB) | 4358 -> 10060 (+5702 / +6667) / 480000 |
| 1403 | 14.0 | 224 GB | 8.2 GB -> 7.6 GB (-623 MB / 1.1 GB) | 9.3 GB -> 8.6 GB (-710 MB / 1.2 GB) | 4547 -> 5036 (+489 / +2814) / 240000 |
| 1404 | 14.0 | 224 GB | 8.4 GB -> 7.6 GB (-773 MB / 1.5 GB) | 9.4 GB -> 8.5 GB (-970 MB / 1.6 GB) | 4369 -> 5031 (+662 / +2368) / 240000 |
| 1411 | 14.0 | 466 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 6.0 MB -> 6.0 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 495000 |
| 1412 | 14.0 | 366 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 6.0 MB -> 6.0 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 390000 |
| 1421 | 14.0 | 931 GB | 19 GB -> 21 GB (1.9 GB / 2.6 GB) | 19 GB -> 21 GB (2.0 GB / 2.7 GB) | 10670 -> 12624 (+1954 / +6196) / 975000 |
| 1422 | 14.0 | 931 GB | 19 GB -> 21 GB (1.6 GB / 3.2 GB) | 20 GB -> 21 GB (1.6 GB / 3.3 GB) | 10653 -> 12844 (+2191 / +6919) / 975000 |
| 1423 | 14.0 | 931 GB | 19 GB -> 21 GB (1.9 GB / 2.5 GB) | 19 GB -> 21 GB (2.0 GB / 2.6 GB) | 10715 -> 12688 (+1973 / +5846) / 975000 |
| 1424 | 14.0 | 931 GB | 18 GB -> 20 GB (2.2 GB / 2.9 GB) | 19 GB -> 21 GB (2.3 GB / 3.0 GB) | 10723 -> 12686 (+1963 / +5505) / 975000 |
| 1425 | 14.0 | 931 GB | 19 GB -> 21 GB (1.3 GB / 2.5 GB) | 20 GB -> 21 GB (1.4 GB / 2.6 GB) | 10702 -> 12689 (+1987 / +5486) / 975000 |
| 1426 | 14.0 | 931 GB | 20 GB -> 21 GB (1.0 GB / 2.5 GB) | 20 GB -> 21 GB (1.0 GB / 2.6 GB) | 10737 -> 12609 (+1872 / +5771) / 975000 |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 45 | 4.0 | 29 TB | 652 GB -> 652 GB (512 MB / 69 GB) | 686 GB -> 685 GB (-240 MB / 72 GB) | 412818 -> 412818 (+0 / +159118) / 30885000 |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Let’s start with the last line. Here’s the meaning, field by field:
There are 45 drives in total.
There are 4 server instances.
The total disk capacity is 29 TB.
The stored data is currently 652 GB and will remain 652 GB after the rebalancing. The net change across all drives is 512 MB, and the total amount of changes for the drives is 69 GB (i.e. how much data they will "recover" from other drives).
The same is repeated for the on-disk size. Here the total amount of changes is roughly the amount of data that would need to be copied.
The total current number of objects will not change (i.e. from 412818 to 412818), 0 new objects will be created, the total amount of objects to be moved is 159118, and the total number of possible objects in the cluster is 30885000.
The difference between the "stored" and "on-disk" size is that the latter also includes the size of checksum blocks and other internal data.
For the rest of the lines, the data is basically the same, just per disk.
What needs to be taken into account is:
Are there drives that will have too much data on them? Here both data size and objects must be checked, and they should be close to the average percentage for the placement group.
Is the data stored on the drives balanced, i.e. are all the drives’ usages close to the average?
Are there drives that should have data on them, but nothing is scheduled to be moved? This usually happens because a drive was not added to the right placement group (see the sketch after this list).
Will there be too much data to be moved?
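As a rough aid for the placement group check above, drives for which no data changes are scheduled can be listed by filtering the output shown earlier (a simple sketch; journal-only or intentionally empty drives will also match, so cross-check the result against the expected placement group membership):
storpool balancer disks | grep '(0 B / 0 B)' # drives with no scheduled data changes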
To illustrate the difference in the amount of data to be moved, here is the output of storpool balancer disks from a run with -c 10:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| disk | server | size | stored | on-disk | objects |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 1 | 14.0 | 373 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 6.1 MB -> 6.1 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 405000 |
| 1101 | 11.0 | 447 GB | 16 GB -> 15 GB (-1.0 GB / 1.7 GB) | 18 GB -> 17 GB (-1.1 GB / 1.7 GB) | 11798 -> 10027 (-1771 / +5434) / 480000 |
| 1102 | 11.0 | 447 GB | 16 GB -> 15 GB (-263 MB / 1.7 GB) | 17 GB -> 17 GB (-298 MB / 1.7 GB) | 10843 -> 10000 (-843 / +5420) / 480000 |
| 1103 | 11.0 | 447 GB | 16 GB -> 15 GB (-1.0 GB / 3.6 GB) | 18 GB -> 16 GB (-1.2 GB / 3.8 GB) | 12123 -> 10005 (-2118 / +6331) / 480000 |
| 1104 | 11.0 | 447 GB | 16 GB -> 15 GB (-752 MB / 2.7 GB) | 17 GB -> 16 GB (-907 MB / 2.8 GB) | 11045 -> 10098 (-947 / +5214) / 480000 |
| 1111 | 11.0 | 466 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 5.1 MB -> 5.1 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 495000 |
| 1112 | 11.0 | 366 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 5.1 MB -> 5.1 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 390000 |
| 1121 | 11.0 | 931 GB | 22 GB -> 21 GB (-1003 MB / 6.4 GB) | 22 GB -> 21 GB (-1018 MB / 6.7 GB) | 13713 -> 12742 (-971 / +9712) / 975000 |
| 1122 | 11.0 | 931 GB | 21 GB -> 21 GB (-368 MB / 5.8 GB) | 22 GB -> 21 GB (-272 MB / 6.1 GB) | 13469 -> 12718 (-751 / +8929) / 975000 |
| 1123 | 11.0 | 931 GB | 22 GB -> 21 GB (-1.1 GB / 5.9 GB) | 22 GB -> 21 GB (-1.1 GB / 6.1 GB) | 14859 -> 12699 (-2160 / +8992) / 975000 |
| 1124 | 11.0 | 931 GB | 21 GB -> 21 GB (57 MB / 7.4 GB) | 21 GB -> 21 GB (113 MB / 7.7 GB) | 13806 -> 12697 (-1109 / +9535) / 975000 |
| 1201 | 12.0 | 447 GB | 18 GB -> 15 GB (-2.8 GB / 1.2 GB) | 19 GB -> 17 GB (-3.0 GB / 1.2 GB) | 14148 -> 10033 (-4115 / +4853) / 480000 |
| 1202 | 12.0 | 447 GB | 17 GB -> 15 GB (-2.0 GB / 1.6 GB) | 19 GB -> 16 GB (-2.2 GB / 1.7 GB) | 13243 -> 10055 (-3188 / +4660) / 480000 |
| 1203 | 12.0 | 447 GB | 17 GB -> 15 GB (-2.0 GB / 2.3 GB) | 19 GB -> 16 GB (-2.3 GB / 2.4 GB) | 12746 -> 10070 (-2676 / +4682) / 480000 |
| 1204 | 12.0 | 447 GB | 18 GB -> 15 GB (-2.7 GB / 2.1 GB) | 19 GB -> 16 GB (-2.8 GB / 2.2 GB) | 12835 -> 10110 (-2725 / +5511) / 480000 |
| 1212 | 12.0 | 366 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 6.0 MB -> 6.0 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 390000 |
| 1221 | 12.0 | 931 GB | 20 GB -> 21 GB (620 MB / 6.3 GB) | 21 GB -> 21 GB (805 MB / 6.7 GB) | 13115 -> 12542 (-573 / +9389) / 975000 |
| 1222 | 12.0 | 931 GB | 22 GB -> 21 GB (-981 MB / 2.9 GB) | 22 GB -> 21 GB (-1004 MB / 3.0 GB) | 12938 -> 12793 (-145 / +8795) / 975000 |
| 1223 | 12.0 | 931 GB | 22 GB -> 21 GB (-1.1 GB / 5.9 GB) | 22 GB -> 21 GB (-1.1 GB / 6.1 GB) | 13968 -> 12698 (-1270 / +10094) / 975000 |
| 1224 | 12.0 | 931 GB | 21 GB -> 21 GB (-791 MB / 4.5 GB) | 22 GB -> 21 GB (-758 MB / 4.7 GB) | 13741 -> 12684 (-1057 / +8616) / 975000 |
| 1225 | 12.0 | 931 GB | 21 GB -> 21 GB (-671 MB / 4.8 GB) | 22 GB -> 21 GB (-677 MB / 4.9 GB) | 13608 -> 12690 (-918 / +8559) / 975000 |
| 1226 | 12.0 | 931 GB | 22 GB -> 21 GB (-1.1 GB / 6.2 GB) | 22 GB -> 21 GB (-1.1 GB / 6.4 GB) | 13066 -> 12737 (-329 / +9386) / 975000 |
| 1301 | 13.0 | 447 GB | 13 GB -> 15 GB (2.6 GB / 4.5 GB) | 14 GB -> 17 GB (2.7 GB / 4.6 GB) | 7244 -> 10077 (+2833 / +6714) / 480000 |
| 1302 | 13.0 | 447 GB | 12 GB -> 15 GB (3.0 GB / 4.9 GB) | 13 GB -> 17 GB (3.2 GB / 5.2 GB) | 7507 -> 10056 (+2549 / +7011) / 480000 |
| 1303 | 13.0 | 447 GB | 14 GB -> 15 GB (1.3 GB / 3.2 GB) | 15 GB -> 17 GB (1.3 GB / 3.3 GB) | 7888 -> 10020 (+2132 / +6926) / 480000 |
| 1304 | 13.0 | 447 GB | 13 GB -> 15 GB (2.7 GB / 4.7 GB) | 14 GB -> 17 GB (2.8 GB / 4.9 GB) | 7660 -> 10075 (+2415 / +7049) / 480000 |
| 1311 | 13.0 | 466 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 6.1 MB -> 6.1 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 495000 |
| 1312 | 13.0 | 366 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 6.1 MB -> 6.1 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 390000 |
| 1321 | 13.0 | 931 GB | 21 GB -> 21 GB (-200 MB / 4.1 GB) | 21 GB -> 21 GB (-192 MB / 4.3 GB) | 13365 -> 12690 (-675 / +9527) / 975000 |
| 1322 | 13.0 | 931 GB | 22 GB -> 21 GB (-1.3 GB / 6.9 GB) | 23 GB -> 21 GB (-1.3 GB / 7.2 GB) | 12749 -> 12698 (-51 / +10047) / 975000 |
| 1323 | 13.0 | 931 GB | 21 GB -> 21 GB (-495 MB / 6.1 GB) | 22 GB -> 21 GB (-504 MB / 6.3 GB) | 13386 -> 12693 (-693 / +9524) / 975000 |
| 1325 | 13.0 | 931 GB | 21 GB -> 21 GB (-620 MB / 6.6 GB) | 22 GB -> 21 GB (-612 MB / 6.9 GB) | 13113 -> 12768 (-345 / +9942) / 975000 |
| 1326 | 13.0 | 931 GB | 21 GB -> 21 GB (-498 MB / 7.1 GB) | 22 GB -> 21 GB (-414 MB / 7.4 GB) | 13690 -> 12697 (-993 / +9759) / 975000 |
| 1401 | 14.0 | 223 GB | 8.3 GB -> 7.6 GB (-670 MB / 950 MB) | 9.3 GB -> 8.5 GB (-789 MB / 993 MB) | 3470 -> 5061 (+1591 / +3262) / 240000 |
| 1402 | 14.0 | 447 GB | 9.8 GB -> 15 GB (5.6 GB / 7.1 GB) | 11 GB -> 17 GB (5.8 GB / 7.5 GB) | 4358 -> 10052 (+5694 / +7092) / 480000 |
| 1403 | 14.0 | 224 GB | 8.2 GB -> 7.6 GB (-619 MB / 730 MB) | 9.3 GB -> 8.5 GB (-758 MB / 759 MB) | 4547 -> 5023 (+476 / +2567) / 240000 |
| 1404 | 14.0 | 224 GB | 8.4 GB -> 7.6 GB (-790 MB / 915 MB) | 9.4 GB -> 8.5 GB (-918 MB / 946 MB) | 4369 -> 5062 (+693 / +2483) / 240000 |
| 1411 | 14.0 | 466 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 6.0 MB -> 6.0 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 495000 |
| 1412 | 14.0 | 366 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 6.0 MB -> 6.0 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 390000 |
| 1421 | 14.0 | 931 GB | 19 GB -> 21 GB (2.0 GB / 6.8 GB) | 19 GB -> 21 GB (2.1 GB / 7.0 GB) | 10670 -> 12695 (+2025 / +10814) / 975000 |
| 1422 | 14.0 | 931 GB | 19 GB -> 21 GB (1.6 GB / 7.4 GB) | 20 GB -> 21 GB (1.7 GB / 7.7 GB) | 10653 -> 12702 (+2049 / +10414) / 975000 |
| 1423 | 14.0 | 931 GB | 19 GB -> 21 GB (2.0 GB / 7.4 GB) | 19 GB -> 21 GB (2.1 GB / 7.8 GB) | 10715 -> 12683 (+1968 / +10418) / 975000 |
| 1424 | 14.0 | 931 GB | 18 GB -> 21 GB (2.2 GB / 8.0 GB) | 19 GB -> 21 GB (2.3 GB / 8.3 GB) | 10723 -> 12824 (+2101 / +9573) / 975000 |
| 1425 | 14.0 | 931 GB | 19 GB -> 21 GB (1.3 GB / 5.8 GB) | 20 GB -> 21 GB (1.4 GB / 6.1 GB) | 10702 -> 12686 (+1984 / +10231) / 975000 |
| 1426 | 14.0 | 931 GB | 20 GB -> 21 GB (1.0 GB / 6.5 GB) | 20 GB -> 21 GB (1.2 GB / 6.8 GB) | 10737 -> 12650 (+1913 / +10974) / 975000 |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 45 | 4.0 | 29 TB | 652 GB -> 653 GB (1.2 GB / 173 GB) | 686 GB -> 687 GB (1.2 GB / 180 GB) | 412818 -> 412818 (+0 / +288439) / 30885000 |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
This time the total amount of data to be moved is 180 GB. The difference in the total amount of data to be moved between -c 0 and -c 10 can be an order of magnitude. Usually the best results are achieved by using -F directly, with rare occasions requiring a full re-balancing (i.e. no -F and higher -c values).
18.11.1. balancer tool output
Here’s an example of the output of the balancer tool, in non-verbose mode:
1 -== BEFORE BALANCE ==-
2 shards with decreased redundancy 0 (0, 0, 0)
3 server constraint violations 0
4 stripe constraint violations 6652
5 placement group violations 1250
6 pg hdd score 0.6551, objectsScore 0.0269
7 pg ssd score 0.6824, objectsScore 0.0280
8 pg hdd estFree 45T
9 pg ssd estFree 19T
10 Constraint violations detected, doing a replication-restore update first
11 server constraint violations 0
12 stripe constraint violations 7031
13 placement group violations 0
14 -== POST BALANCE ==-
15 shards with decreased redundancy 0 (0, 0, 0)
16 server constraint violations 0
17 stripe constraint violations 6592
18 placement group violations 0
19 moves 14387, (1864GiB) (tail ssd 14387)
20 pg hdd score 0.6551, objectsScore 0.0269, maxDataToSingleDrive 33 GiB
21 pg ssd score 0.6939, objectsScore 0.0285, maxDataToSingleDrive 76 GiB
22 pg hdd estFree 47T
23 pg ssd estFree 19T
The run of the balancer tool has multiple steps.
First, it shows the current state of the system (lines 2-9):
Shards (volume pieces) with decreased redundancy;
"server constraint violations" means that there are pieces of data that have two or more of their copies on the same server. This is an error condition;
"stripe constraint violations" means that specific pieces of data are not optimally striped across the drives of a specific server. This is NOT an error condition;
"placement group violations" means that some pieces of data are not placed according to their placement group. This is an error condition;
Lines 6 and 7 show the current average "score" (usage in %) of the placement groups, for data and objects;
Lines 8 and 9 show the estimated free space for the placement groups.
Then, in this run the tool has detected problems (in this case placement group violations, which in most cases means a missing drive) and has done a pre-run to restore the redundancy (line 10), printing the resulting state again on lines 11-13.
And last, it runs the balancing and reports the results. The main difference here is that for each placement group it also reports the maximum amount of data that will be added to a single drive. As the balancing happens in parallel on all drives, this is a handy measure of how long the rebalancing would take (in comparison with a different balancing run that might not add that much data to a single drive).
18.12. Errors from the balancer tool
If the balancer tool doesn't complete successfully, its output MUST be examined and the root cause fixed.
18.13. Miscellaneous
If for any reason the currently running rebalancing operation needs to be paused, this can be done with storpool relocator off. In such cases StorPool Support should also be contacted, as this should not normally be necessary. Re-enabling it is done with storpool relocator on.
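For reference, the pause/resume sequence is simply:
storpool relocator off # pause the currently running rebalancing operation (contact StorPool Support)
storpool relocator on # re-enable the relocator once the situation is resolved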
19. Troubleshooting Guide
This part outlines the different states of a StorPool cluster, what should be expected in each of them, and the recommended steps for handling them. It is intended as a guideline for the operations team(s) maintaining the production system provided by StorPool.
19.1. Normal state of the system
The normal behaviour of the StorPool storage system is when it is fully configured and in up-and-running state. This is the desired state of the system.
Characteristics of this state:
19.1.1. All nodes in the storage cluster are up and running
This can be checked by using the CLI with storpool service list on any node with access to the API service.
Note
The storpool service list command shows the status of all services running cluster-wide, not only the services running on the node itself.
19.1.2. All configured StorPool services are up and running
This is again easily checked with storpool service list. Recently restarted services can usually be spotted by their short uptime. A recent restart should be taken seriously if the reason for it is unknown, even if the service is running at the moment, as in the example with client ID 37 below:
# storpool service list
cluster running, mgmt on node 2
mgmt 1 running on node 1 ver 20.00.18, started 2022-09-08 19:28:37, uptime 144 days 22:45:51
mgmt 2 running on node 2 ver 20.00.18, started 2022-09-08 19:27:18, uptime 144 days 22:47:10 active
server 1 running on node 1 ver 20.00.18, started 2022-09-08 19:28:59, uptime 144 days 22:45:29
server 2 running on node 2 ver 20.00.18, started 2022-09-08 19:25:53, uptime 144 days 22:48:35
server 3 running on node 3 ver 20.00.18, started 2022-09-08 19:23:30, uptime 144 days 22:50:58
client 1 running on node 1 ver 20.00.18, started 2022-09-08 19:28:37, uptime 144 days 22:45:51
client 2 running on node 2 ver 20.00.18, started 2022-09-08 19:25:32, uptime 144 days 22:48:56
client 3 running on node 3 ver 20.00.18, started 2022-09-08 19:23:09, uptime 144 days 22:51:19
client 21 running on node 21 ver 20.00.18, started 2022-09-08 19:20:26, uptime 144 days 22:54:02
client 22 running on node 22 ver 20.00.18, started 2022-09-08 19:19:26, uptime 144 days 22:55:02
client 37 running on node 37 ver 20.00.18, started 2022-09-08 13:08:12, uptime 05:06:16
19.1.3. Working cgroup memory and cpuset isolation is properly configured
Use the storpool_cg tool with the check argument to ensure everything is as expected. The tool should not return any warnings. For more information, see Control Groups.
When properly configured, the sum of all memory limits on the node is less than the available memory on the node. This protects the running kernel, as well as all processes in the storpool.slice memory cgroup, from memory shortage, which ensures the stability of the storage service.
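A routine check might look like the following; the exact output differs between versions, but the important part is that no warnings are reported:
storpool_cg check # should complete without any warnings when the cgroup configuration is correct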
19.1.4. All network interfaces are properly configured
All network interfaces used by StorPool are up and properly configured with hardware acceleration enabled (where applicable); all network switches are configured with jumbo frames and flow control, and none of them experience any packet loss or delays. The output from storpool net list is a good start: all configured network interfaces should be seen as up, with the meaning of the flags explained at the end of the output. The desired state is uU with a + at the end for each network interface; if hardware acceleration is supported on an interface, the A flag should also be present:
storpool net list
------------------------------------------------------------
| nodeId | flags | net 1 | net 2 |
------------------------------------------------------------
| 23 | uU + AJ | 1A:00:01:00:00:17 | 1E:00:01:00:00:17 |
| 24 | uU + AJ | 1A:00:01:00:00:18 | 1E:00:01:00:00:18 |
| 25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
| 26 | uU + AJ | 12:00:01:00:F0:1A | 16:00:01:00:F0:1A |
| 27 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
u - packets recently received from this node
d - no packets recently received from this node
U - this node has enough votes to be in the quorum
D - this node does not have enough votes to be in the quorum
+ - this node considers itself in the quorum
B - the connection to this node is through a backup link; check the cabling
A - this node is using hardware acceleration
J - the node uses jumbo frames
N - a non-voting node
19.1.5. All drives are up and running
All drives in use for the storage system are performing at their specified speed, are joined in the cluster and serving requests.
This can be checked with storpool disk list internal; for example, in a normally loaded cluster all drives will report low aggregate scores. Below is an example output (trimmed for brevity):
# storpool disk list internal
--------------------------------------------------------------------------------------------------------------------------------------------------------
| disk | server | aggregate scores | wbc pages | scrub bw | scrub ETA | last scrub completed |
--------------------------------------------------------------------------------------------------------------------------------------------------------
| 2301 | 23.0 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:33:44 |
| 2302 | 23.0 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:48 |
| 2303 | 23.1 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:49 |
| 2304 | 23.1 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:50 |
| 2305 | 23.2 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:51 |
| 2306 | 23.2 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:51 |
| 2307 | 23.3 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:52 |
| 2308 | 23.3 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:53 |
| 2311 | 23.0 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:38 |
| 2312 | 23.0 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:43 |
| 2313 | 23.1 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:44 |
| 2314 | 23.1 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:45 |
| 2315 | 23.2 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:47 |
| 2316 | 23.2 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:39 |
| 2317 | 23.3 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:40 |
| 2318 | 23.3 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:42 |
[snip]
All drives are regularly scrubbed, so they should show a stable (not increasing) number of errors. The errors corrected for each drive are visible in the storpool disk list output. The last completed scrub is visible in storpool disk list internal, as in the example above.
A few notes on the desired state:
Some systems may have fewer than two network interfaces or a single backend switch. Although not recommended, this is still possible and sometimes used (usually in PoC setups or with a backup server) when a cluster is configured with a single-VLAN network redundancy scheme. A single-VLAN network redundancy configuration and an inter-switch connection are required for a cluster where only some of the nodes are connected to the cluster with a single interface.
If one or more of the points describing the state above is not in effect, the system should not be considered healthy. If there is any suspicion that the system is behaving erratically even though all of the above conditions are satisfied, the recommended steps to check if everything is in order are:
Check top and look for the state of each of the configured storpool_* services running on the node. A properly running service is usually in the S (sleeping) state and rarely seen in the R (running) state. The CPU usage is often reported at 100% when hardware sleep is enabled, due to the kernel misreporting it; the actual usage is much lower and can be tracked with cpupower monitor for the CPU cores.
A way to ensure all services on this node are running correctly is to use the /usr/lib/storpool/sdump tool, which reports CPU and network usage statistics for the running services on the node. Use the -l option for the long names of the statistics.
On some of the nodes with running workloads (i.e. VM instances, containers, etc.) iostat will show activity for processed requests on the block devices. The following example shows the normal disk activity on a node running VM instances. Note that the usage may vary greatly depending on the workload. The command used in the example is iostat -xm 1 /dev/sp-* | egrep -v " 0[.,]00$", which prints statistics for the StorPool devices each second, excluding devices that have no storage I/O activity:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sp-0 0.00 0.00 0.00 279.00 0.00 0.14 1.00 3.87 13.80 0.00 13.80 3.55 98.99
sp-11 0.00 0.00 165.60 114.10 19.29 14.26 245.66 5.97 20.87 9.81 36.91 0.89 24.78
sp-12 0.00 0.00 171.60 153.60 19.33 19.20 242.67 9.20 28.17 10.46 47.96 1.08 35.01
sp-13 0.00 0.00 6.00 40.80 0.04 5.10 225.12 1.75 37.32 0.27 42.77 1.06 4.98
sp-21 0.00 0.00 0.00 82.20 0.00 1.04 25.90 1.00 12.08 0.00 12.08 12.16 99.99
19.1.6. There are no hanging active requests
The output of /usr/lib/storpool/latthreshold.py is empty, i.e. it shows no hanging requests and no service or disk warnings.
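A periodic check can be as simple as running the tool and confirming it prints nothing:
/usr/lib/storpool/latthreshold.py # empty output means no hanging requests and no service or disk warnings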
19.2. Degraded state
In this state some system components are not fully operational and need attention. Some examples of a degraded state are given below.
19.2.1. Degraded state due to service issues
Note that this concerns only pools with triple replication; for dual replication this is considered a critical state, because there are parts of the system with only one available copy. This is an example output from storpool service list:
# storpool service list
cluster running, mgmt on node 2
mgmt 1 running on node 1 ver 20.00.18, started 2022-09-08 16:11:59, uptime 19:51:50
mgmt 2 running on node 2 ver 20.00.18, started 2022-09-08 16:11:58, uptime 19:51:51 active
mgmt 3 running on node 3 ver 20.00.18, started 2022-09-08 16:11:58, uptime 19:51:51
server 1 down on node 1 ver 20.00.18
server 2 running on node 2 ver 20.00.18, started 2022-09-08 16:12:03, uptime 19:51:46
server 3 running on node 3 ver 20.00.18, started 2022-09-08 16:12:04, uptime 19:51:45
client 1 running on node 1 ver 20.00.18, started 2022-09-08 16:11:59, uptime 19:51:50
client 2 running on node 2 ver 20.00.18, started 2022-09-08 16:11:57, uptime 19:51:52
client 3 running on node 3 ver 20.00.18, started 2022-09-08 16:11:57, uptime 19:51:52
If this is unexpected, i.e. no one has deliberately restarted or stopped the service for planned maintenance or upgrade, it is very important to first bring the service up and then investigate the root cause of the service outage. When the storpool_server service comes back up it will start recovering the outdated data on its drives. The recovery process can be monitored with storpool task list, which shows which disks are recovering, as well as how much data is left to be recovered. Example output of storpool task list:
# storpool task list
----------------------------------------------------------------------------------------
| disk | task id | total obj | completed | started | remaining | % complete |
----------------------------------------------------------------------------------------
| 2303 | RECOVERY | 1 | 0 | 1 | 1 | 0% |
----------------------------------------------------------------------------------------
| total | | 1 | 0 | 1 | 1 | 0% |
----------------------------------------------------------------------------------------
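To follow the recovery progress until the table becomes empty, the command can simply be re-run periodically, for example with the standard watch utility:
watch -n 10 storpool task list # refresh the recovery progress every 10 seconds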
Some of the volumes or snapshots will have the D flag (for degraded) visible in the storpool volume status output; it will disappear once all the data is fully recovered. An example situation would be a node reboot for a kernel or package upgrade where no kernel modules were installed for the new kernel, or where a service (in this example the storpool_server) was not configured to start on boot, among others.
These could be:
The storpool_block service on some of the storage-only nodes, without any attached volumes or snapshots.
A single storpool_server service, or multiple instances on the same node; note again that this is critical for systems with dual replication.
A single API (storpool_mgmt) service, with another active API still running.
The reasons for these could be the same as in the previous examples; usually the system log contains all the information needed to check why the service is not (getting) up.
19.2.2. Degraded state due to host OS misconfiguration
Some examples include:
This could prevent some of the services from running after a fresh boot, for instance due to changed names of the network interfaces used for the storage system after an upgrade, changed PCIe IDs for NVMe devices, etc.
If this occurs, it might be difficult to debug what has caused a kernel crash.
Some of the above cases will be difficult to catch prior to booting with the new environment (e.g. kernel or other updates), and sometimes they are only caught after an event that reveals the issue. Thus it is important to regularly test and ensure that the system is in a properly configured state and that crash data is collected normally.
19.2.3. Degraded state due to network interface issues
This could be checked with storpool net list, e.g.:
# storpool net list
------------------------------------------------------------
| nodeId | flags | net 1 | net 2 |
------------------------------------------------------------
| 23 | uU + AJ | | 1E:00:01:00:00:17 |
| 24 | uU + AJ | | 1E:00:01:00:00:18 |
| 25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
| 26 | uU + AJ | 12:00:01:00:F0:1A | 16:00:01:00:F0:1A |
| 27 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
u - packets recently received from this node
d - no packets recently received from this node
U - this node has enough votes to be in the quorum
D - this node does not have enough votes to be in the quorum
+ - this node considers itself in the quorum
B - the connection to this node is through a backup link; check the cabling
A - this node is using hardware acceleration
J - the node uses jumbo frames
N - a non-voting node
In the above example nodes 23 and 24 are not connected to the first network. This is the interface configured with SP_IFACE1_CFG in /etc/storpool.conf (check with storpool_showconf SP_IFACE1_CFG). Note that the beacons are up and running and the system is processing requests through the second network. The possible reasons could be misconfigured interfaces, a wrong StorPool configuration, or an issue with the backend switch/switches.
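To quickly compare the configured first-network interface with the actual state on the affected node, the configuration value can be printed directly (a simple sketch; interpret the result against the node's interface and switch configuration):
storpool_showconf SP_IFACE1_CFG # the interface(s) configured for the first StorPool network on this node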
This is once again checked with storpool net list:
# storpool net list
------------------------------------------------------------
| nodeId | flags | net 1 | net 2 |
------------------------------------------------------------
| 23 | uU + J | 1A:00:01:00:00:17 | 1E:00:01:00:00:17 |
| 24 | uU + J | 1A:00:01:00:00:18 | 1E:00:01:00:00:18 |
| 25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
| 26 | uU + AJ | 12:00:01:00:F0:1A | 16:00:01:00:F0:1A |
| 27 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
u - packets recently received from this node
d - no packets recently received from this node
U - this node has enough votes to be in the quorum
D - this node does not have enough votes to be in the quorum
+ - this node considers itself in the quorum
B - the connection to this node is through a backup link; check the cabling
A - this node is using hardware acceleration
J - the node uses jumbo frames
N - a non-voting node
In the above example nodes 23 and 24 are equipped with NICs qualified for hardware acceleration, but are running without it; the possible reason could be either a BIOS or an OS misconfiguration, misconfigured kernel parameters on boot, or a network interface misconfiguration. Note that when a system is configured for hardware-accelerated operation the cgroups configuration is also sized accordingly, thus running in this state is likely to cause performance issues, due to fewer CPU cores being isolated and reserved for the NIC interrupts and the storpool_rdma threads.
This can be seen with storpool net list; if one of the two networks has an MTU lower than 9k, the J flag will not be listed:
# storpool net list
-------------------------------------------------------------
| nodeId | flags | net 1 | net 2 |
-------------------------------------------------------------
| 23 | uU + A | 12:00:01:00:F0:17 | 16:00:01:00:F0:17 |
| 24 | uU + AJ | 1A:00:01:00:00:18 | 1E:00:01:00:00:18 |
| 25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
| 26 | uU + AJ | 1A:00:01:00:00:1A | 1E:00:01:00:00:1A |
-------------------------------------------------------------
Quorum status: 4 voting beacons up out of 4 expected
Flags:
u - packets recently received from this node
d - no packets recently received from this node
U - this node has enough votes to be in the quorum
D - this node does not have enough votes to be in the quorum
M - this node is being damped by the rest of the nodes in the cluster
+ - this node considers itself in the quorum
B - the connection to this node is through a backup link; check the cabling
A - this node is using hardware acceleration
J - the node uses jumbo frames
N - a non-voting node
If the node is not expected to be running without jumbo frames, this might be an indication of a misconfigured interface or an issue with applying the interface configuration on boot. Note that an OS interface configured for jumbo frames without the switch port being properly configured leads to severe performance issues.
This might affect the latency of some of the storage operations. Depending on the node where the losses occur, it might affect a single client, or affect operations in the whole cluster if the packet loss or delays happen on a server node. Stats for all interfaces per service are collected in the analytics platform (https://analytics.storpool.com) and can be used to investigate network performance issues. The /usr/lib/storpool/sdump tool will print the same statistics on each of the nodes with services. The usual causes for packet loss are:
hardware issues (cables, SFPs, etc.)
floods and DDoS attacks “leaking” into the storage network due to misconfiguration
saturation of the CPU cores that handle the interrupts for the network cards and others when hardware acceleration is not available
network loops leading to saturated switch ports or overloaded NICs
19.2.4. Drive/Controller issues
Attention
This concerns only pools with triple replication; for dual replication this is considered a critical state.
The missing drives may be seen using storpool disk list or storpool server <serverID> disk list; for example, in this output disk 543 is missing from the server with ID 54:
# storpool server 54 disk list
disk | server | size | used | est.free | % | free entries | on-disk size | allocated objects | errors | flags
541 | 54 | 207 GB | 61 GB | 136 GB | 29 % | 713180 | 75 GB | 158990 / 225000 | 0 / 0 |
542 | 54 | 207 GB | 56 GB | 140 GB | 27 % | 719526 | 68 GB | 161244 / 225000 | 0 / 0 |
543 | 54 | - | - | - | - % | - | - | - / - | -
544 | 54 | 207 GB | 62 GB | 134 GB | 30 % | 701722 | 76 GB | 158982 / 225000 | 0 / 0 |
545 | 54 | 207 GB | 61 GB | 135 GB | 30 % | 719993 | 75 GB | 161312 / 225000 | 0 / 0 |
546 | 54 | 207 GB | 54 GB | 142 GB | 26 % | 720023 | 68 GB | 158481 / 225000 | 0 / 0 |
547 | 54 | 207 GB | 62 GB | 134 GB | 30 % | 719996 | 77 GB | 179486 / 225000 | 0 / 0 |
548 | 54 | 207 GB | 53 GB | 143 GB | 26 % | 718406 | 70 GB | 179038 / 225000 | 0 / 0 |
The usual reason is that the drive was ejected from the cluster due to a write error, either by the kernel or by the running storpool_server instance. More information may be found using dmesg | tail and in the system log. The model and the serial number of the failed drive are shown by storpool disk list info.
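A minimal set of commands to collect this information on the affected server node:
dmesg | tail # recent kernel messages, e.g. I/O errors for the failed drive
storpool disk list info # model and serial number of the drives, including the failed one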
In normal conditions the server will flag the disk to be re-tested and will eject it for a quick test. Provided the disk is still working correctly and the test results do not breach any thresholds, the disk will be returned into the cluster to recover. Such a case might happen, for example, if a stalled request was caused by an intermittent issue, like a reallocated sector.
In case the disk breaches any sane latency or bandwidth thresholds, it will not be automatically returned and will have to be re-balanced out of the cluster. Such disks are marked as "bad" (more details are available at storpool_initdisk options).
When one or more drives are ejected (already marked as bad) and missing, multiple volumes and/or snapshots will be listed with the D flag (D as in Degraded) in the output of storpool volume status, due to the missing replicas for some of the data. This is normal and expected, and there are the following options in this situation:
The drive could still be working properly (e.g. a set of bad sectors was reallocated) even after it was tested; in order to re-test it, you can mark the drive as --good (more info on how at storpool_initdisk options) and attempt to get it back into the cluster.
On some occasions a disk might have lost its signatures and would have to be returned to the cluster to recover from scratch; it will be automatically re-tested upon an attempt to join. A full (read-write) stress-test is recommended to ensure it is working correctly (fio is a good tool for this kind of test, check the --verify option; see the sketch after this list). In case the stress test is successful (e.g. the drive has been written to and verified successfully), it may be reinitialized with storpool_initdisk with the same disk ID it had before. This will automatically return it to the cluster and it will fully recover all data from scratch, as if it were a brand new drive.
with the same disk ID it was before. This will automatically return it to the cluster and it will fully recover all data from scratch as if it was a brand new.The drive has failed irrecoverably and a replacement is available. The replacement drive is initialized with the diskID of the failed drive with
storpool_initdisk
. After returning it to the cluster it will fully recover all the data from the live replicas (please check rebalancing_storpool_20.0 for more).A replacement is not available. The only option is to re-balance the cluster without this drive (more details in rebalancing_storpool_20.0).
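A minimal fio stress-test sketch is shown below. The device name is a placeholder and the run is destructive, so only use it on a drive that is already out of the cluster and holds no needed data:
# DESTRUCTIVE: overwrites the whole device; replace /dev/sdX with the drive under test
fio --name=verify-test --filename=/dev/sdX --direct=1 --rw=write --bs=1M --verify=crc32c --verify_fatal=1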
Attention
Beware that in some cases with very full clusters it might be impossible to get the cluster back to full redundancy without overfilling some of the remaining drives. See the next section.
With proper planning this should rarely be an issue. A way to avoid it is to add more drives, or an additional server node with a full set of drives, into the cluster. Another option is to remove unused volumes or snapshots.
The storpool snapshot space command returns information about the space referred by each snapshot on the underlying drives. Note that snapshots with a negative value in their "used" column will not free up any space if they are removed and will remain in the deleting state, because they are parents of multiple cloned child volumes.
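The check itself is a single command:
storpool snapshot space # review the "used" column; snapshots with a negative value will not free space if deleted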
Note that depending on the speed with which the cluster is being populated with data by the end users this might also be considered a critical state.
This may be observed in the output of storpool disk list or storpool server <serverID> disk list; an example from the latter is shown below:
# storpool server 23 disk list
disk | server | size | used | est.free | % | free entries | on-disk size | allocated objects | errors | flags
2301 | 23.0 | 893 GiB | 2.6 GiB | 857 GiB | 0 % | 3719930 | 660 KiB | 17 / 930000 | 0 / 0 |
2302 | 23.0 | 893 GiB | 2.6 GiB | 857 GiB | 0 % | 3719929 | 668 KiB | 17 / 930000 | 0 / 0 |
2303 | 23.0 | 893 GiB | 2.6 GiB | 857 GiB | 0 % | 3719929 | 668 KiB | 17 / 930000 | 0 / 0 |
2304 | 23.0 | 893 GiB | 2.6 GiB | 857 GiB | 0 % | 3719931 | 668 KiB | 17 / 930000 | 0 / 0 |
2306 | 23.0 | 893 GiB | 2.6 GiB | 857 GiB | 0 % | 3719932 | 664 KiB | 17 / 930000 | 0 / 0 |
2305 | 23.0 | 893 GiB | 2.6 GiB | 857 GiB | 0 % | 3719930 | 660 KiB | 17 / 930000 | 0 / 0 |
2307 | 23.0 | 893 GiB | 2.6 GiB | 857 GiB | 0 % | 19934 | 664 KiB | 17 / 930000 | 0 / 0 |
--------------------------------------------------------------------------------------------------------------------------------------
7 | 1.0 | 6.1 TiB | 18 GiB | 5.9 TiB | 0 % | 26039515 | 4.5 MiB | 119 / 6510000 | 0 / 0 |
This usually happens after the system has been loaded for long periods of time with a sustained write workload on one or multiple volumes. If this is unexpected and the reason is an erratic workload, the recommended way to handle it is to set a limit (bandwidth, IOPS, or both) on the loaded volumes, for example with storpool volume <volumename> bw 100M iops 1000. The same can be set for multiple volumes/snapshots in a template with storpool template <templatename> bw 100M iops 1000 propagate. Please note that propagating changes for templates with a very large number of volumes and snapshots might not work. If the overloaded state is due to a normally occurring workload, it is best to expand the system with more drives and/or reformat the drives with a larger number of entries (this relates mainly to HDDs). The latter case might be caused by a lower number of hard drives in an HDD-only or a hybrid pool, and rarely by overloaded SSDs.
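For example, limiting a single volume and then all volumes and snapshots in a template (the volume and template names below are placeholders):
storpool volume testvolume bw 100M iops 1000 # limit a single overloaded volume
storpool template hybrid bw 100M iops 1000 propagate # apply the same limits to everything in the template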
Another case related to overloaded drives is when many volumes are created from the same template. This requires overrides in order to shuffle the objects where the journals reside, so that the same triplet of disks is not overloaded when all virtual machines spike for some reason (e.g. unattended upgrades, a syslog-intensive cron job, etc.).
A couple of notes on the degraded states: apart from the notes about the replication above, none of these should affect the stability of the system at this point. For the example with the missing disk in a hybrid system with a single failed SSD, all read requests on volumes with triple replication that have data on the failed drive will be served by some of the redundant copies on HDDs. This could slightly increase the read latencies for operations on the parts of the volumes that were on this exact SSD. This is usually negligible in medium to large systems; i.e. in a cluster with 20 SSDs or NVMe drives, these are 1/20th of all the read operations in the cluster. In case of dual replicas on SSDs and a third replica on HDDs there is no read latency penalty whatsoever, which is also the case for missing hard drives - they will not affect the system at all, and in fact some write operations are even faster, because they are not waiting for the missing drive.
19.3. Critical state
This is an emergency state that requires immediate attention and intervention from the operations team and/or StorPool Support. Some of the conditions that could lead to this state are:
partial or complete network outage
power loss for some nodes in the cluster
memory shortage leading to a service failure due to missing or incomplete cgroups configuration
The following states are an indication for critical conditions:
19.3.1. API service failure
Requests to the API from any of the nodes configured to access it either stall or cannot reach a working service. This is a critical state, because the status of the cluster is unknown (the cluster might even be down).
This might be caused by:
Misconfigured network for accessing the floating IP address - the address may be obtained by storpool_showconf http on any of the nodes with a configured storpool_mgmt service in the cluster:
# storpool_showconf http
SP_API_HTTP_HOST=10.3.10.78
SP_API_HTTP_PORT=81
Failed interfaces on the hosts that have the storpool_mgmt service running. To find the interface where the StorPool API should be running, use storpool_showconf api_iface:
# storpool_showconf api_iface
SP_API_IFACE=bond0.410
It is recommended to have the API on a redundant interface (e.g. an active-backup bond interface). Note that even without an API, provided the cluster is in quorum, there should be no impact on any running operations, but changes in the cluster (like creating/attaching/detaching/deleting volumes or snapshots) will be impossible. Running with no API in the cluster triggers the highest severity alert to StorPool Support (essentially a wake-up alert), due to the unknown state of the system.
The cluster is not in quorum - the cluster is in this state if the number of running voting storpool_beacon services is less than half of the expected nodes plus one ((expected / 2) + 1). The configured number of expected nodes in the cluster may be checked with storpool_showconf expected, but it is generally considered to be the number of server nodes (except when client nodes are configured as voting for some reason). In a system with 6 servers at least 4 voting beacons should be available to get the cluster back in running state:
# storpool_showconf expected
SP_EXPECTED_NODES=6
The current number of expected votes and the number of voting beacons are displayed in the output of storpool net list; check the example above (the Quorum status: line).
These API requests collect data from the running storpool_server services on each server node. Possible reasons are:
network loss or delays;
failing storpool_server services;
failing drives or hardware (CPU, memory, controllers, etc.);
overload.
19.3.2. Server service failure
Two storpool_server services or whole servers in different fault sets are down or not joined in the cluster. This is a very risky state, because there are parts of the volumes with only one live replica; if the latest writes land on a drive returning an I/O error or broken data (detected by StorPool), this will lead to data loss.
As in degraded state some of the read operations for parts of the volumes will be served from HDDs in a hybrid system and might raise read latencies. In this state it is very important to bring back the missing services/nodes as soon as possible, because a failure in any of the remaining drives in other nodes or another fault set will bring some of the volumes in down state and might lead to data loss in case of an error returned by a drive with the latest writes.
This state results in some volumes being in down state (storpool volume status), because some parts of them are only on the missing drives. The recommended action in this case is to check for the reasons for the degraded services or missing (unresponsive) nodes and get them back up.
Possible reasons are:
lost network connectivity
severe packet loss/delays/loops
partial or complete power loss
hardware instabilities, overheating
kernel or other software instabilities, crashes
19.3.3. Client service failure
If the client service (storpool_block) is down on some of the nodes depending on it (these could be either client-only nodes or converged hypervisors), all requests on that particular node will stall until the service is back up.
Possible reasons are again:
lost network connectivity
severe packet loss/delays/loops
bugs in the storpool_block service or the storpool_bd kernel module
In case of power loss or kernel crashes any virtual machine instances that were running on this node could be started on other available nodes.
19.3.4. Network interface or Switch failure
This means that the networks used for StorPool are down or are experiencing heavy packet loss or delays. In this case the quorum service will prevent a split-brain situation and will restart all services to ensure the cluster is fully connected on at least one network before it transitions to the running state again. Such issues might be alleviated by a single-VLAN setup when different nodes have partial network connectivity, but severe delays will still occur in case of severe packet loss.
19.3.5. Hard Drive/SSD failures
In this case multiple volumes may either experience degraded performance (hybrid placement) or will be in down state when more than two replicas are missing. All operations on volumes in down state are stalled until the redundancy is restored (i.e. at least one replica is available). The recommended steps are to immediately check for the reasons for the missing drives/services/nodes and return them into the cluster as soon as possible.
At some point all cluster operations will stall until either some of the data in the cluster is deleted or new drives/nodes are added. Adding drives requires the new drives to be stress tested and a re-balancing of the system to include them, which should be carefully planned (details in rebalancing_storpool_20.0).
Note
Cleaning up snapshots that have multiple cloned volumes and a negative value for used space in the output of storpool snapshot space will not free up any space.
This is usually caused by a heavily overloaded system. In this state the latencies for some operations might become very high (measured in seconds). Possible reasons are severely overloaded volumes being loaded for long periods of time without any configured bandwidth or IOPS limits. This can be checked by using iostat to look for volumes that are constantly 100% loaded with a large number of requests to the storage system. Another way to check for such volumes is to use the "Top volumes" view in the analytics platform to get information on the most loaded volumes and apply IOPS and/or bandwidth limits accordingly. Other causes are misbehaving (underperforming) drives or misbehaving HBA/SAS controllers. The recommended way to deal with these cases is to look for such drives: a good idea is to check the output from storpool disk list internal for higher aggregation scores on some drives or sets of drives (e.g. on the same server), or to use the analytics platform to check for abnormal latency on some of the backend nodes (i.e. drives with significantly higher operation latency compared to other drives of the same type). An example would be a failing controller causing the SATA speed to degrade to SATA 1.0 (1.5 Gb/s) instead of SATA 3.0 (6 Gb/s), worn-out batteries on a RAID controller whose cache is used to accelerate the writes on the HDDs, and others.
The circumstances leading a system to the critical state are rare and are usually preventable by taking measures to handle all issues at the first signs of a change from the normal to the degraded state.
In any of the above cases, if you feel that something is not as expected, a consultation with StorPool Support is the best course of action. StorPool Support receives notifications for all the cases detailed above and proactively takes action to address a system going into a degraded or critical state as soon as practically possible.
19.3.6. Hanging requests in the cluster
The output of /usr/lib/storpool/latthreshold.py shows hanging requests and/or missing services, as in the example below:
disk | reported by | peers | s | op | volume | requestId
-------------------------------------------------------------------------------------------------------------------
- | client 2 | client 2 -> server 2.1 | 15 | read | volume-name | 9223936579889289248:270215977642998472
- | client 2 | client 2 -> server 2.1 | 15 | read | volume-name | 9223936579889289248:270497452619709333
- | client 2 | client 2 -> server 2.1 | 15 | read | volume-name | 9223936579889289248:270778927596419790
- | client 2 | client 2 -> server 2.1 | 15 | write | volume-name | 9223936579889289248:271060402573130531
- | client 2 | client 2 -> server 2.1 | 15 | read | volume-name | 9223936579889289248:271341877549841211
- | client 2 | client 2 -> server 2.1 | 15 | read | volume-name | 9223936579889289248:271623352526551744
- | client 2 | client 2 -> server 2.1 | 15 | write | volume-name | 9223936579889289248:271904827503262450
server 2.1 connection status: established no_data timeout
disk 202 EXPECTED_UNKNOWN server 2.1
This could be caused by CPU starvation, hardware resets, misbehaving disks or network, or stalled services.
The disk field in the output and the service warnings after the requests table can be used as an indicator of the misbehaving component.
Note that the active requests API call has a timeout for each service to respond. The default timeout that the latthreshold tool uses is 10 seconds. This value can be altered by using latthreshold's --api-requests-timeout/-A option and passing it a numeric value with a time unit (m, s, ms or us), e.g. 100ms.
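For example, lowering the per-service timeout to 100 milliseconds:
/usr/lib/storpool/latthreshold.py -A 100ms # wait at most 100ms for each service to report its active requests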
Service connection will have one of the following statuses:
established done - the service reported its active requests as expected; this is not displayed in the regular output, only with --json
not_established - a connection with the service could not be made; this could indicate the service is down, but may also indicate the service version is too old or its stream was overfilled or not connected
established no_data timeout - the service did not respond and the connection was closed because the timeout was reached
established data timeout - the service responded, but the connection was closed because the timeout was reached before it could send all the data
established invalid_data - a message the service sent had invalid data in it
The latthreshold tool also reports disk statuses. Reported disk statuses will be one of the following:
EXPECTED_MISSING - the service response was good, but did not provide information about the disk
EXPECTED_NO_CONNECTION_TO_PEER - the connection to the service was not established
EXPECTED_NO_PEER - the service is not present
EXPECTED_UNKNOWN - the service response was invalid or a timeout occurred