StorPool User Guide - version 21
1. StorPool Overview
StorPool is distributed block storage software. It pools the attached storage (HDDs, SSDs or NVMe drives) of standard servers to create a single pool of shared storage. The StorPool software is installed on each server in the cluster. It combines the performance and capacity of all drives attached to the servers into one global namespace.
StorPool provides standard block devices. You can create one or more volumes through its sophisticated volume manager. StorPool is compatible with ext3 and XFS file systems and with any system designed to work with a block device, e.g. databases and cluster file systems (like OCFS and GFS). StorPool can also be used with no file system, for example when using volumes to store VM images directly or as LVM physical volumes.
Redundancy is provided by multiple copies (replicas) of the data written synchronously across the cluster. Users may set the number of replication copies or an erasure coding scheme. The replication level directly correlates with the number of servers that may be down without interruption in the service. For replication 3 and all N+2 erasure coding schemes the number of the servers (see 12.6. Fault sets) that may be down simultaneously without losing access to the data is 2.
StorPool protects data and guarantees its integrity by a 64-bit checksum and version for each sector on a StorPool volume or snapshot. StorPool provides a very high degree of flexibility in volume management. Unlike other storage technologies, such as RAID or ZFS, StorPool does not rely on device mirroring (pairing drives for redundancy). So every disk that is added to a StorPool cluster adds capacity and improves the performance of the cluster, not just for new data but also for existing data. Provided that there are sufficient copies of the data, drives can be added or taken away with no impact to the storage service. Unlike rigid systems like RAID, StorPool does not impose any strict hierarchical storage structure dictated by the underlying disks. StorPool simply creates a single pool of storage that utilizes the full capacity and performance of a set of commodity drives.
2. Architecture
StorPool works on a cluster of servers in a distributed shared-nothing architecture. All functions are performed by all servers on an equal peer basis. It works on standard off-the-shelf servers running GNU/Linux.
Each storage node is responsible for data stored on its local drives. Storage nodes collaborate to provide the storage service. StorPool provides a shared storage pool combining all the available storage capacity. It uses synchronous replication across servers. The StorPool client communicates in parallel with all StorPool servers. The StorPool iSCSI target provides access to volumes exported through it to other initiators.
The software consists of two different types of services - storage server services and storage client services - that are installed on each physical server (host, node). The storage client services provide the native block device on Linux-based systems, and the iSCSI target or the NVMe/TCP target for other systems. Each host can be a storage server, a storage client, an iSCSI target, an NVMe-oF controller, or any combination. StorPool volumes appear to storage clients as block devices under the /dev/storpool/ directory and behave as normal disk devices. The data on a volume can be read and written by all clients simultaneously; its consistency is guaranteed through a synchronous replication protocol. Volumes may be used by clients as they would use a local hard drive or disk array.
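As a quick illustration, a StorPool volume exposed by the native client can be formatted and mounted like any other block device (the volume name below is hypothetical):
# mkfs.ext4 /dev/storpool/vm-images
# mount /dev/storpool/vm-images /mnt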
3. Feature Highlights
3.1. Scale-out, not Scale-Up
The StorPool solution is fundamentally about scaling out (by adding more drives or nodes) rather than scaling up (adding capacity by replacing a storage box with a larger one). This means StorPool can scale independently in IOPS, storage space, and bandwidth. There is no bottleneck or single point of failure. StorPool can grow without interruption and in small steps - a drive, a server, and/or a network interface at a time.
3.2. High Performance
StorPool combines the IOPS performance of all drives in the cluster and optimizes drive access patterns to provide low latency and handling of storage traffic bursts. The load is distributed equally between all servers through striping and sharding.
3.3. High Availability and Reliability
StorPool uses a replication mechanism that slices and stores copies of the data on different servers. For primary, high performance storage this solution has many advantages compared to RAID systems and provides considerably higher levels of reliability and availability. In case of a drive, server, or other component failure, StorPool uses some of the available copies of the data located on other nodes in the same or other racks significantly decreasing the risk of losing access to or losing data.
3.4. Commodity Hardware
StorPool supports drives and servers in a vendor-agnostic manner, allowing you to avoid vendor lock-in. This allows the use of commodity hardware, while preserving reliability and performance requirements. Moreover, unlike RAID, StorPool is drive agnostic - you can mix drives of various types, make, speed or size in a StorPool cluster.
3.6. Co-existence with hypervisor software
StorPool can utilize repurposed existing servers and can co-exist with hypervisor software on the same server. This means that there is no dedicated hardware for storage, and growing an IaaS cloud solution is achieved by simply adding more servers to the cluster.
3.7. Compatibility
StorPool is compatible with 64-bit Intel and AMD based servers. We support all Linux-based hypervisors and hypervisor management software. Any Linux software designed to work with a shared storage solution such as an iSCSI or FC disk array will work with StorPool. StorPool guarantees the functionality and availability of the storage solution at the Linux block device interface.
3.8. CLI interface and API
StorPool provides an easy to use yet powerful command-line interface (CLI) tool for administration of the data storage solution. It is simple and user-friendly - making configuration changes, provisioning, and monitoring fast and efficient. For an introduction to the CLI’s functionality, see 11. CLI tutorial.
StorPool also provides a RESTful JSON API and Python bindings exposing all the available functionality, so you can integrate it with any existing management system. This is the basis on which the StorPool integrations are built.
3.9. Reliable support
StorPool comes with reliable dedicated support:
Remote installation and initial configuration by StorPool’s specialists
24x7 support
Live software updates without interruption in the service
4. Hardware requirements
All distributed storage systems are highly dependent on the underlying hardware. There are some aspects that will help achieve maximum performance with StorPool and are best considered in advance. Each node in the cluster can be used as server, client, iSCSI target or any combination; depending on the role, hardware requirements vary.
Note
The system parameters listed in the sections below are intended to serve as initial guidelines. For detailed information about the supported hardware and software, see the StorPool System Requirements document.
You can also contact StorPool’s Technical Account Management for a detailed hardware requirement assessment.
4.1. Minimum StorPool cluster
3 industry-standard x86 servers;
any x86-64 CPU with 4 threads or more;
32 GB ECC RAM per node (8+ GB used by StorPool);
any hard drive controller in JBOD mode;
3x SATA3 hard drives or SSDs;
dedicated 2x10GE LAN;
4.2. Recommended StorPool cluster
5 industry-standard x86 servers;
IPMI, iLO/LOM/iDRAC desirable;
Intel Nehalem generation (or newer) Xeon processor(s);
64GB ECC RAM or more in every node;
any hard drive controller in JBOD mode;
dedicated dual 25GE or faster LAN;
2+ NVMe drives per storage node;
4.3. How StorPool relies on hardware
4.3.1. CPU
When the system load increases, CPUs get saturated with system interrupts. To avoid the negative effects of this, StorPool’s server and client processes are given one or more dedicated CPU cores. This significantly improves the overall performance and performance consistency.
4.3.2. RAM
ECC memory can detect and correct the most common kinds of in-memory data corruption, thus maintaining a memory system immune to single-bit errors. Using ECC memory is an essential requirement for improving the reliability of the node. In fact, StorPool is not designed to work with non-ECC memory.
4.3.3. Storage (HDDs / SSDs)
StorPool ensures the best drive utilization. Replication and data integrity are core functionality, so RAID controllers are not required and all storage devices can be connected as JBOD. All hard drives are journaled, for example on a high-endurance NVMe drive similar to the Intel Optane series. When a write-back cache is available on a RAID controller, it can be used in a StorPool-specific way in order to provide power-loss protection for the data written on the hard disks. This is not necessary for SATA SSD pools.
4.3.4. Network
StorPool is a distributed system, which means that the network is an essential part of it. Designed for efficiency, StorPool combines data transfers from multiple nodes in the cluster. This greatly improves the data throughput, compared with access to local devices, even if they are SSD or NVMe drives.
4.4. Software Compatibility
4.4.1. Operating Systems
Linux (various distributions)
Windows, VMware, and Citrix Xen through standard protocols (iSCSI)
4.4.2. File Systems
Developed and optimized for Linux, StorPool is very well tested on CentOS, Ubuntu and Debian. Compatible and well tested with ext4 and XFS file systems and with any system designed to work with a block device, e.g. databases and cluster file systems (like GFS2 or OCFS2). StorPool can also be used with no file system, for example when using volumes to store VM images directly. StorPool is compatible with other technologies from the Linux storage stack, such as LVM, dm-cache/bcache, and LIO.
4.4.3. Hypervisors & Cloud Management/Orchestration
KVM
LXC/Containers
OpenStack
OpenNebula
OnApp
CloudStack
any other technology compatible with the Linux storage stack.
5. Installation and upgrade
Currently the installation and upgrade procedures are performed by the StorPool support team.
6. Node configuration options
You can configure a StorPool node by setting options in the /etc/storpool.conf configuration file. You can also define options in configuration files in the /etc/storpool.conf.d/ directory; these files must meet all of the following requirements:
The name ends with the .conf extension.
The name does not start with a dot (.).
When the system locates files with correct names (like local.conf or hugepages.conf) in the /etc/storpool.conf.d/ directory, it processes the options set in them. It ignores files with incorrect names, like .hidden.conf, local.confx, server.conf.bak, or storpool.conf~.
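For illustration, a drop-in file might look like this (the file name and the option shown are arbitrary examples; any option described in this guide can be placed in such a file):
# /etc/storpool.conf.d/local.conf - processed because its name ends in .conf and does not start with a dot
SP_CACHE_SIZE=4096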
6.1. Introduction
Here you can find a small example for the /etc/storpool.conf file, and also information about putting host-specific options in separate sections.
6.1.1. Minimal node configuration
The minimum working configuration must specify the network interface, number of expected nodes, authentication tokens, and unique ID of the node, like in the example below:
#-
# Copyright (c) 2013 - 2017 StorPool.
# All rights reserved.
#
# Human readable name of the cluster, usually in the "Company Name"-"Location"-"number" form
# For example: StorPool-Sofia-1
# Mandatory for monitoring
SP_CLUSTER_NAME= #<Company-Name-PoC>-<City-or-nearest-airport>-<number>
# Remote authentication token provided by StorPool support for data related to crashed services, collected
# vmcore-dmesg files after kernel panic, per-host monitoring alerts, storpool_iolatmon alerts, and so on.
SP_RAUTH_TOKEN= <rauth-token>
# Computed by the StorPool Support team; consists of location and cluster separated by a dot
# For example: nzkr.b
# Mandatory (since version 16.02)
SP_CLUSTER_ID= #Ask StorPool Support
# Interface for storpool communication
#
# Default: empty
SP_IFACE1_CFG=1:eth1.100:eth1:100:10.1.100.1:b:x:r
SP_IFACE2_CFG=1:eth2.100:eth2:100:10.2.100.1:b:x:r
# expected nodes for beacon operation
#
# !!! Must be specified !!!
#
SP_EXPECTED_NODES=3
# API authentication token
#
# 64bit random value
# For example, generate it with: od -vAn -N8 -tu8 /dev/random
SP_AUTH_TOKEN=4306865639163977196
##########################
[spnode1.example.com]
SP_OURID = 1
This section of the documentation describes all options. If you need more examples, check the /usr/share/doc/storpool/examples/storpool.conf.example file in your StorPool installation.
6.1.2. Per host configuration
Specific options per host. The value in the square brackets should be the name of the host, as returned by the hostname command:
[spnode1.example.com]
SP_OURID=1
Specific configuration details might be added for each host individually, like shown in the example below:
[spnode1.example.com]
SP_OURID=1
SP_IFACE1_CFG=1:eth1.100:eth1:100:10.1.100.1:b:x:r
SP_IFACE2_CFG=1:eth2.100:eth2:100:10.2.100.1:b:x:r
SP_NODE_NON_VOTING=1
Note
It is important that each node has valid configuration sections for all nodes in the cluster in its local /etc/storpool.conf file. Keep consistent /etc/storpool.conf files across all nodes in the cluster.
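One simple way to propagate the file after editing it (an illustration only; the host names are hypothetical, and any configuration management tool would do the same job):
# for node in spnode2.example.com spnode3.example.com; do scp /etc/storpool.conf "$node":/etc/storpool.conf; done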
6.2. Identification and voting
You can use the /etc/storpool.conf configuration file to set the ID of the node in the cluster, whether it is a voting node (see Add a Node), and the number of expected nodes.
6.2.1. Node ID
Use the SP_OURID option to set the ID of the node in the cluster. The value must be unique throughout the cluster.
SP_OURID=1
6.2.2. Non-voting beacon node
On client-only nodes (see 2. Architecture) the storpool_server service should not be started. To achieve this, on such nodes you should set the SP_NODE_NON_VOTING option to 1, as shown below (default value is 0):
SP_NODE_NON_VOTING=1
Attention
It is strongly recommended to configure SP_NODE_NON_VOTING in the per-host configuration sections in storpool.conf; for details, see 6.1.2. Per host configuration.
6.2.3. Expected nodes
Minimum number of expected nodes for beacon operation (see 9.1. storpool_beacon). Usually equal to the number of nodes with storpool_server instances running:
SP_EXPECTED_NODES=3
6.3. Network communication
You can use the /etc/storpool.conf configuration file to set how the network interfaces should be used, preferred ports, and other network-related options.
6.3.1. Interfaces for StorPool cluster communication
The network interface options should be defined in the following way:
SP_IFACE1_CFG='config-version':'resolve-interface':'raw-interface':'VLAN':'IP':'resolve':'shared':'mac'
The option names are in the SP_IFACE1_CFG format, where the number after SP_IFACE is the number of the interface. The option values are as follows:
- config-version
Version of the config format (currently 1).
- resolve-interface
Name of the kernel interface that has the IP address.
- raw-interface
Name of the raw interface to transmit and receive on.
- VLAN
VLAN tag to use, or 0 if the VLAN is untagged.
- IP
IP address to use.
- resolve
‘b’ - use broadcast for resolving (for Ethernet based networks); ‘k’ - use kernel address resolving (for IP-only networks).
- shared
‘s’ for shared, in case of a bond interface on top; ‘x’ for exclusive - if nothing else is supposed to use this interface.
- mac
‘r’ for using the unmodified MAC of the raw interface; ‘P’ for a privatized MAC; ‘v’ for using the MAC of the resolve interface.
It is recommended to have two dedicated network interfaces for communication between the nodes. Here are a few examples:
Single VLAN active-backup bond interface, storage and API on the same VLAN:
SP_IFACE1_CFG=1:bond0.900:eth2:900:10.9.9.1:b:s:P
SP_IFACE2_CFG=1:bond0.900:eth3:900:10.9.9.1:b:s:P
Single VLAN, LACP bond interface, storage and API on the same VLAN:
SP_IFACE1_CFG=1:bond0.900:eth2:900:10.9.9.1:b:s:v
SP_IFACE2_CFG=1:bond0.900:eth3:900:10.9.9.1:b:s:v
Two VLANs: one VLAN for storage, one VLAN for API, both over bond:
SP_IFACE1_CFG=1:bond0.101:eth2:101:10.2.1.1:b:s:P
SP_IFACE2_CFG=1:bond0.101:eth3:101:10.2.1.1:b:s:P
Three VLANs: two VLANs for storage, API on a separate physical interface:
SP_IFACE1_CFG=1:eth2.101:eth2:101:10.2.1.1:b:x:v
SP_IFACE2_CFG=1:eth3.201:eth3:201:10.2.2.1:b:x:v
6.3.2. Address for API management
Address on which the storpool_mgmt service receives requests from user-space tools. Multiple clients can simultaneously send requests to the API; for details about the management service, see 9.4. storpool_mgmt. By default, the address is bound on localhost:
SP_API_HTTP_HOST=127.0.0.1
For cluster-wide access and automatic failover between the nodes, multiple nodes might have the API service started. The specified IP address is brought up only on one of the nodes in the cluster at a time - the so-called active API service. You may specify an available IP address (SP_API_HTTP_HOST), which will be brought up or down on the corresponding interface (SP_API_IFACE) when migrating the API service between the nodes.
To configure an interface (SP_API_IFACE) and address (SP_API_HTTP_HOST):
SP_API_HTTP_HOST=10.10.10.240
SP_API_IFACE=eth1
Note
The script that adds or deletes the SP_API_HTTP_HOST address is located at /usr/lib/storpool/api-ip. It could be easily modified for other use cases, for example to configure routing, firewalls, and so on.
6.3.3. Port for API management
Port for the storpool_mgmt service for API management. The default value is 81:
SP_API_HTTP_PORT=81
6.3.4. API authentication token
This value must be a unique integer for each cluster:
SP_AUTH_TOKEN=0123456789
Hint
The token can be generated with: od -vAn -N8 -tu8 /dev/random
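As a sketch, the token can be generated and appended to the configuration in one go (the od invocation is the one from the hint above; the resulting value will differ on every run):
# token=$(od -vAn -N8 -tu8 /dev/random | tr -d ' ')
# echo "SP_AUTH_TOKEN=$token" >> /etc/storpool.conf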
6.3.5. Ignore RX port option
Used to instruct the services that the network can preserve the selected port even when altering ports. Default value is 0:
SP_IGNORE_RX_PORT=0
6.3.6. Preferred port
Used for setting which port is preferred when two networks are specified, but only one of them could actually be used for any reason (in an active-backup bond style). The default value is 0 (load-balancing):
SP_PREFERRED_PORT=0 # load-balancing
When you want to specify a port, you should use one of the following values:
SP_PREFERRED_PORT=1 # use SP_IFACE1_CFG by default
SP_PREFERRED_PORT=2 # use SP_IFACE2_CFG by default
6.3.7. Address for the bridge service
IP address for the storpool_bridge service (see 9. Background services) to bind to for inbound or outbound connections. If not specified, the wildcard address (0.0.0.0) will be used.
SP_BRIDGE_HOST=180.220.200.8
6.3.8. Interface for the bridge address
Expected when the SP_BRIDGE_HOST value is a floating IP address for the storpool_bridge service:
SP_BRIDGE_IFACE=bond0.900
6.3.9. Resolve interface is bridge
If set, the top interface in SP_IFACE1/2_CFG will be considered a bridge by the iface-genconf tool. Also, if set, the SP_BOND_IFACE_NAME option is required (see below). Default: (empty)
SP_RESOLVE_IFACE_IS_BRIDGE
6.3.10. Name of the bond interface
Required when the SP_RESOLVE_IFACE_IS_BRIDGE option is set (see above). Used by the iface-genconf tool to indicate what the name of the bond under the resolve interface should be. Default: (empty)
SP_BOND_IFACE_NAME
6.3.11. Bridge service network mask
IP network mask (in bits) for the bridge service. Will default to 32 if unset. Default: (empty)
SP_BRIDGE_NET
6.3.12. Path to iproute2
Path to the iproute2 utilities. Default: unset
SP_IPROUTE2_PATH=
6.3.13. Host and port for the Web interface
You can use the SP_GUI_HTTP_HOST option to set the HTTP server host for StorPool’s Web interface. By default it is unset, in which case the Web interface is not available.
SP_GUI_HTTP_HOST=127.0.0.1
You can also use the SP_GUI_HTTP_PORT option to set the port. Default: 443
SP_GUI_HTTP_PORT=443
6.4. Drives
You can use the /etc/storpool.conf configuration file to define a specific driver to be used for NVMe SSD drives, a group owner for the StorPool devices, and other drive-related options.
6.4.1. Exclude disks globally or per server instance
A list of paths to drives to be excluded at instance boot time:
SP_EXCLUDE_DISKS=/dev/sda1:/dev/sdb1
Can also be specified for each server instance individually:
SP_EXCLUDE_DISKS=/dev/sdc1
SP_EXCLUDE_DISKS_1=/dev/sda1
6.4.2. Disable drive ejection
Locally disable ejecting disks or journals on exceeding average request latency thresholds. Useful if the cluster cannot get up due to an incorrect maximum latencies configuration. Default: 0 (drive ejection is enabled)
SP_DISABLE_LATENCY_EJECT=0
6.4.3. Group owner for the StorPool devices
The system group to use for the /dev/storpool and /dev/storpool-byid/ directories and the /dev/sp-* raw disk devices:
SP_DISK_GROUP=disk
6.4.4. Permissions for the StorPool devices
The access mode to set on the /dev/sp-* raw disk devices:
SP_DISK_MODE=0660
6.4.5. Mirror directory
If set, a directory containing device nodes with the same names as the volume/snapshot names exposed in /dev/storpool, so that it may be (for example) bind-mounted in a container as a whole. Note that the directory must already exist! Default: unset
SP_MIRROR_DIR=
6.4.6. Mirror directory offset
A number to add to the owner ID and group ID of the devices in the mirror directory so that they may be accessed by processes running in an unprivileged container. Default: 0
SP_MIRROR_OWNER_OFFSET=0
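A minimal sketch of how the mirror directory might be used (the directory path, the offset value, and the container runtime below are illustrative assumptions):
# mkdir -p /run/storpool-mirror
Then, in /etc/storpool.conf or a drop-in file:
SP_MIRROR_DIR=/run/storpool-mirror
SP_MIRROR_OWNER_OFFSET=100000
The whole directory can then be bind-mounted into a container, for example:
# docker run --rm -v /run/storpool-mirror:/dev/storpool alpine ls /dev/storpool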
6.4.7. NVMe SSD drives
The storpool_nvmed service automatically detects all initialized StorPool devices, and attaches them to the configured SP_NVME_PCI_DRIVER.
To configure a driver for storpool_nvmed that is different than the default storpool_pci, use the SP_NVME_PCI_DRIVER option:
SP_NVME_PCI_DRIVER=vfio-pci
The vfio-pci driver requires the iommu=pt kernel command line option for both Intel and AMD CPUs, and additionally the intel_iommu=on option for Intel CPUs.
The ability to use the SP_NVME_PCI_DRIVER option is available starting with the 19.2 revision 19.01.1795.5b374e835 release.
6.4.8. Resetting stuck NVMes
Some NVMe devices occasionally stop responding. While resetting the PCI device sometimes resolves the issue, not all hardware handles this correctly, which can lead to server crashes and other unexpected behavior.
You can use the SP_NVMED_RESET_INTEL_NVMES option to set whether the nvmed service should try resetting stuck Intel NVMe drives. Default: (empty)
SP_NVMED_RESET_INTEL_NVMES=
6.5. Monitoring and debugging
You can use the /etc/storpool.conf configuration file to set the cluster name, cluster ID, and other options related to monitoring and debugging.
6.5.1. Cluster name
Required for the pro-active monitoring performed by the StorPool support team. Usually in the <Company-Name>-<City-or-nearest-airport>-<number> form, with numbering starting from 1.
SP_CLUSTER_NAME=StorPool-Sofia-1
6.5.2. Cluster ID
The cluster ID is computed from the StorPool Support ID and consists of two parts - location and cluster - separated by a dot ("."). Each location consists of one or more clusters:
SP_CLUSTER_ID=nzkr.b
6.5.3. Local user for debug data collection
User to change the ownership of the storpool_abrtsync service runtime (see 9. Background services). Unset by default:
SP_ABRTSYNC_USER=
Note
If not configured during installation, this user is set by default to storpool.
6.5.4. Remote addresses for sending debug data
The defaults are shown below. They could be altered (unlikely) in case a jump host or custom collection nodes are used.
SP_ABRTSYNC_REMOTE_ADDRESSES=reports.storpool.com,reports1.storpool.com,reports2.storpool.com
6.5.5. Deleting local reports
Use this option to set whether local report files should be deleted after they are synced to the remote. Default: 1
SP_DELETE_REPORTS=1
6.6. Cgroup options
You can use the /etc/storpool.conf configuration file to define several options related to cgroups. For more information on StorPool and cgroups, see Control groups.
6.6.1. Enabling cgroups
The following option enables the usage of cgroups. The default value is 1:
SP_USE_CGROUPS=1
Each StorPool process requires a specification of the cgroups it should be started in; there is a default configuration for each service. One or more processes may be placed in the same cgroup, or each one may be in a cgroup of its own, as appropriate. StorPool provides a tool for setting up cgroups called storpool_cg. It is able to automatically configure a system depending on the installed services, on all supported operating systems.
6.6.2. StorPool RDMA module
The SP_RDMA_CGROUPS option is required for setting the cgroups of the kernel threads started by the storpool_rdma module:
SP_RDMA_CGROUPS=-g cpuset:storpool.slice/rdma -g memory:storpool.slice/common
6.6.3. Options for StorPool services
Use the following options to configure the StorPool services (see 9. Background services):
storpool_block
Use the SP_BLOCK_CGROUPS option:
SP_BLOCK_CGROUPS=-g cpuset:storpool.slice/block -g memory:storpool.slice/common
storpool_bridge
Use the SP_BRIDGE_CGROUPS option:
SP_BRIDGE_CGROUPS=-g cpuset:storpool.slice/bridge -g memory:storpool.slice/alloc
storpool_server
Use the SP_SERVER_CGROUPS option:
SP_SERVER_CGROUPS=-g cpuset:storpool.slice/server -g memory:storpool.slice/common
SP_SERVER1_CGROUPS=-g cpuset:storpool.slice/server_1 -g memory:storpool.slice/common
SP_SERVER2_CGROUPS=-g cpuset:storpool.slice/server_2 -g memory:storpool.slice/common
SP_SERVER3_CGROUPS=-g cpuset:storpool.slice/server_3 -g memory:storpool.slice/common
SP_SERVER4_CGROUPS=-g cpuset:storpool.slice/server_4 -g memory:storpool.slice/common
SP_SERVER5_CGROUPS=-g cpuset:storpool.slice/server_5 -g memory:storpool.slice/common
SP_SERVER6_CGROUPS=-g cpuset:storpool.slice/server_6 -g memory:storpool.slice/common
storpool_beacon
Use the SP_BEACON_CGROUPS option:
SP_BEACON_CGROUPS=-g cpuset:storpool.slice/beacon -g memory:storpool.slice/common
storpool_mgmt
Use the SP_MGMT_CGROUPS option:
SP_MGMT_CGROUPS=-g cpuset:storpool.slice/mgmt -g memory:storpool.slice/alloc
storpool_controller
Use the SP_CONTROLLER_CGROUPS option:
SP_CONTROLLER_CGROUPS=-g cpuset:system.slice -g memory:system.slice
storpool_cgmove
Use SP_CGMOVE_SLICE to set where the service should move processes that it finds in the root cgroup. Default: /system.slice
SP_CGMOVE_SLICE=/system.slice
Use SP_CGMOVE_CGROUPS to set which cgroups storpool_cgmove should handle. Default: cpuset memory
SP_CGMOVE_CGROUPS=cpuset memory
storpool_nvmed
Use the SP_NVMED_CGROUPS option:
SP_NVMED_CGROUPS=-g cpuset:storpool.slice/beacon -g memory:storpool.slice/common
6.6.4. More information
For details about cgroups configuration for other StorPool services, see 6.8. iSCSI options.
6.7. NVMe service
You can use the /etc/storpool.conf configuration file to set options for the 9.7. storpool_nvmed service.
6.7.1. Routing for NVMe target
Enable routing for NVMe target. Default: (empty)
SP_NVMET_ROUTED=1
6.7.2. Network interface for NVME target
Using the SP_NVMET_IFACE option you can set the network interface names to be used for the NVMe target. Default: (empty)
Here are some examples:
SP_NVMET_IFACE=eth2:eth3
SP_NVMET_IFACE=eth2,bond0:eth3,bond0
SP_NVMET_IFACE=eth2,bond0:eth3,bond0:[lacp]
6.7.3. BGP speaker for routed NVMe target
Using the SP_NVMET_BGP_CONFIG option you can set a BGP speaker for the routed NVMe target:
Requires the SP_NVMET_ROUTED option to be enabled.
The value is in the ISCSI_BGP_IP:BGP_DAEMON_IP:AS_FOR_ISCSI:AS_FOR_THE_DAEMON format.
Default: (empty)
Here is an example:
SP_NVMET_BGP_CONFIG=127.0.0.1:127.0.0.2:65514:65513
6.8. iSCSI options
You can use the /etc/storpool.conf configuration file to set options for the 9.10. storpool_iscsi service. For more information on setting up iSCSI, see 16. Setting iSCSI targets.
6.8.1. Cgroups configuration
Use the SP_ISCSI_CGROUPS option to configure the storpool_iscsi service:
SP_ISCSI_CGROUPS=-g cpuset:storpool.slice/iscsi -g memory:storpool.slice/alloc
6.8.2. Network interface to use
You can set the network interface names to be used for iSCSI using the SP_ISCSI_IFACE option. Default: (empty)
SP_ISCSI_IFACE=eth2:eth3
SP_ISCSI_IFACE=eth2,bond0:eth3,bond0
SP_ISCSI_IFACE=eth2,bond0:eth3,bond0:[lacp]
The format is: physical interface1,resolve interface1:physical interface2,resolve interface2:[flags]
The only available flag is lacp, for when the iSCSI interfaces are part of a LACP bond.
6.8.3. Enabling routing
You can use the SP_ISCSI_ROUTED option to enable routing for iSCSI. Default: (empty)
SP_ISCSI_ROUTED=1
6.8.4. Configuring BGP speaker
Using the SP_ISCSI_BGP_CONFIG option you can configure a BGP speaker for the iSCSI routed setup:
Requires the SP_ISCSI_ROUTED option to be enabled.
The value is in the ISCSI_BGP_IP:BGP_DAEMON_IP:AS_FOR_ISCSI:AS_FOR_THE_DAEMON format.
Default: (empty)
Here is an example:
SP_ISCSI_BGP_CONFIG=127.0.0.1:127.0.0.2:65514:65513
6.9. Miscellaneous options
6.9.1. Working directory
Used for reports, shared memory dumps, sockets, core files, and so on. The default value is /var/run/storpool:
SP_WORKDIR=/var/run/storpool
6.9.2. Restart automatically in case of crash
The main StorPool services (see 9. Background services) are governed by a special storpool_daemon service. The SP_RESTART_ON_CRASH option specifies a period of time in seconds (the default is 1800). If a service crashes and the number of crashes within the specified period is less than 3, the service is restarted automatically.
SP_RESTART_ON_CRASH=1800
6.9.3. Logging of non-read-only open/close for StorPool devices
If set to 0, the storpool_bd kernel module will not log anything about opening or closing StorPool devices:
SP_BD_LOG_OPEN_CLOSE=1
6.9.4. Configuring the StorPool log daemon service
For details about the storpool_logd service, see 9.14. storpool_logd.
To configure an HTTP/S proxy for the service:
SP_HTTPS_PROXY=<proxy URL>
To override the URL for the service:
SP_LOGD_URL=<logd-URL>
Note
Custom instances require HTTPS with properly installed certificates, locally if necessary.
6.9.5. Cache size
Each storpool_server process allocates the amount of RAM (in MB) set using SP_CACHE_SIZE for caching. The size of the cache depends on the number of storage devices on each storpool_server instance, and is taken care of by the storpool_cg tool during cgroups configuration. Here is an example configuration for all storpool_server instances:
SP_CACHE_SIZE=4096
Note
A node with three storpool_server processes running will use 4096*3 = 12 GB of cache in total.
You can override the size of the cache for each of the storpool_server instances, as shown below. This is useful when different instances control a different number of drives:
SP_CACHE_SIZE=1024
SP_CACHE_SIZE_1=1024
SP_CACHE_SIZE_2=4096
SP_CACHE_SIZE_3=8192
These options are configured via the storpool_cg tool and don’t require manual setting in most cases.
6.9.6. Internal write-back caching
Enables the internal write-back caching:
SP_WRITE_BACK_CACHE_ENABLED=1
Attention
UPS is mandatory with write-back caching. A clean server shutdown is required before the UPS batteries are depleted.
6.9.7. Type of sleep
What kind of sleep to use when there are no requests. Default: (empty)
SP_SLEEP_TYPE=
Valid values are:
(empty) or ksleep: use the default kernel sleep.
hsleep: use the msleep kernel function directly. Processes will appear to be using 100% CPU time, but the processor will be put to sleep; this gives lower I/O latency than ksleep.
no: don’t sleep at all - lowest latency, but at the price of always keeping the CPU in a running state.
6.9.8. Type of sleep for the bridge service
What kind of sleep to use when there are no requests for the bridge service. Default: (empty)
SP_BRIDGE_SLEEP_TYPE=
The valid values are the same as those for the SP_SLEEP_TYPE option (see above).
6.9.9. C state latency
Highest C state latency allowed for the CPU, in microseconds. Default: 5
SP_CPU_DMA_LATENCY=5
6.9.10. Bug reports location
Location for collecting automated bug reports and shared memory dumps. Default: /var/spool/storpool
SP_REPORTDIR=/var/spool/storpool
See also the SP_REPORTS_FREE_SPACE_LIMIT option below.
6.9.11. Free space for reports
Based on the value of the SP_REPORTS_FREE_SPACE_LIMIT option, the system would check if there is enough free space in the directory specified with SP_REPORTDIR before creating a crash report:
If there is not enough space, no report would be created.
The value should be provided in the <number>[KMGTP] format, in bytes.
If the value is 0 (default), the check would not be performed.
SP_REPORTS_FREE_SPACE_LIMIT = 0
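For example, to require at least 10 GB of free space before a report is written (the value is illustrative; the format is the <number>[KMGTP] one described above):
SP_REPORTS_FREE_SPACE_LIMIT=10G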
6.9.12. CLI prompt string
As described in 11.2. Using the interactive shell, you can use StorPool’s Command Line Interface (CLI) in interactive mode. You can use the SP_CLI_PROMPT option to set the string that would appear as a prompt for this mode. Consider the following about the value you set:
You can use the ${SP_VAR} format, where SP_VAR is one of the options in the /etc/storpool.conf configuration file. As shown in the example below, this could be SP_CLUSTER_NAME.
Trailing spaces are stripped and “> ” is appended.
Default: StorPool
SP_CLI_PROMPT=${SP_CLUSTER_NAME}
7. Storage devices
All storage devices that will be used by StorPool (HDD, SSD, or NVMe) must have one or more properly aligned partitions, and must have an assigned ID. Larger NVMe drives should be split into two or more partitions, which allows assigning them to different instances of the storpool_server service.
You can initialize the devices quickly using the disk_init_helper tool provided by StorPool. Alternatively, you can do this manually using the common parted tool.
7.1. Journals
All hard disk drives should have a journal provided in one of the following ways:
On a persistent memory device (/dev/pmemN)
On a small, high-endurance NVMe device (an Intel Optane or similar)
On a regular NVMe drive, on a small partition separate from its main data partition
On a battery/cachevault power-loss protected virtual device (RAID controllers)
Having no journals on the HDDs is acceptable in case of snapshot-only data (for example, a backup-only cluster).
For persistent memory devices, see Persistent memory support.
7.2. Using disk_init_helper
The disk_init_helper tool is used in two steps:
Discovery and setup
The tool discovers all drives that do not have partitions and are not used anywhere (no LVM PV, device mapper RAID, StorPool data disks, and so on). It uses this information to generate a suggested configuration, which is stored as a configuration file. You can try different options until you get a configuration that suits your needs.
Initialization
You provide the configuration file from the first step to the tool, and it initializes the drives.
disk_init_helper is also used in the storpool-ansible playbook (see github.com/storpool/ansible), where it helps provide consistent defaults for known configurations and idempotency.
7.2.1. Example node
This is an example node with 7 x 960GB SSDs, 8 x 2TB HDDs, 1 x 100GB Optane NVMe, and 3 x 1TB NVMe disks:
[root@s25 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 894.3G 0 disk
sdb 8:16 0 894.3G 0 disk
sdc 8:32 0 894.3G 0 disk
sdd 8:48 0 894.3G 0 disk
sde 8:64 0 894.3G 0 disk
sdf 8:80 0 894.3G 0 disk
sdg 8:96 0 894.3G 0 disk
sdh 8:112 0 1.8T 0 disk
sdi 8:128 0 1.8T 0 disk
sdj 8:144 0 1.8T 0 disk
sdk 8:160 0 1.8T 0 disk
sdl 8:176 0 111.8G 0 disk
|-sdl1 8:177 0 11.8G 0 part
|-sdl2 8:178 0 100G 0 part /
`-sdl128 259:15 0 1.5M 0 part
sdm 8:192 0 1.8T 0 disk
sdn 8:208 0 1.8T 0 disk
sdo 8:224 0 1.8T 0 disk
sdp 8:240 0 1.8T 0 disk
nvme0n1 259:6 0 93.2G 0 disk
nvme1n1 259:0 0 931.5G 0 disk
nvme2n1 259:1 0 931.5G 0 disk
nvme3n1 259:4 0 931.5G 0 disk
This node is used in the examples below.
7.2.2. Discovering drives
7.2.2.1. Basic usage
To assign IDs for all disks on this node, run the tool with the --start argument:
[root@s25 ~]# disk_init_helper discover --start 2501 -d disks.json
sdl partitions: sdl1, sdl2, sdl128
Success generating disks.json, proceed with 'init'
Note
The automatically generated IDs must be unique within the StorPool cluster. Allowed IDs are between 1 and 4000.
StorPool disk IDs are assigned with an offset of 10 between the SSD, NVMe, and HDD drive groups, which could be further tweaked by parameters.
By default, the tool discovers all disks without partitions; the one where the OS is installed (/dev/sdl) is skipped. The tool does the following:
Prepares all SSD, NVMe, and HDD devices with a single large partition on each one.
Uses the Optane device as a journal-only device for the hard drive journals.
7.2.2.2. Viewing configuration
You can use the --show option to see what will be done:
[root@s25 ~]# disk_init_helper discover --start 2501 --show
sdl partitions: sdl1, sdl2, sdl128
/dev/sdb (type: SSD):
data partitions
/dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302126-part1 (2501): 894.25 GiB (mv: None)
/dev/sda (type: SSD):
data partitions
/dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302127-part1 (2502): 894.25 GiB (mv: None)
/dev/sdc (type: SSD):
data partitions
/dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302128-part1 (2503): 894.25 GiB (mv: None)
/dev/sdd (type: SSD):
data partitions
/dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302129-part1 (2504): 894.25 GiB (mv: None)
/dev/sde (type: SSD):
data partitions
/dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302137-part1 (2505): 894.25 GiB (mv: None)
/dev/sdf (type: SSD):
data partitions
/dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302138-part1 (2506): 894.25 GiB (mv: None)
/dev/sdg (type: SSD):
data partitions
/dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302139-part1 (2507): 894.25 GiB (mv: None)
/dev/sdh (type: HDD):
data partitions
/dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS00Y25-part1 (2521): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part1)
/dev/sdj (type: HDD):
data partitions
/dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS03YRJ-part1 (2522): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part2)
/dev/sdi (type: HDD):
data partitions
/dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS041FK-part1 (2523): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part3)
/dev/sdk (type: HDD):
data partitions
/dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS04280-part1 (2524): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part4)
/dev/sdp (type: HDD):
data partitions
/dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS05PTA-part1 (2525): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part5)
/dev/sdo (type: HDD):
data partitions
/dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS05PTB-part1 (2526): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part6)
/dev/sdm (type: HDD):
data partitions
/dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS05PTD-part1 (2527): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part7)
/dev/sdn (type: HDD):
data partitions
/dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS05PTJ-part1 (2528): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part8)
/dev/nvme0n1 (type: journal-only NVMe):
journal partitions
/dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part1 (None): 0.10 GiB (mv: None)
/dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part2 (None): 0.10 GiB (mv: None)
/dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part3 (None): 0.10 GiB (mv: None)
/dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part4 (None): 0.10 GiB (mv: None)
/dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part5 (None): 0.10 GiB (mv: None)
/dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part6 (None): 0.10 GiB (mv: None)
/dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part7 (None): 0.10 GiB (mv: None)
/dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part8 (None): 0.10 GiB (mv: None)
/dev/nvme3n1 (type: NVMe w/ journals):
data partitions
/dev/disk/by-id/nvme-INTEL_SSDPE2KX010T8_BTLJ849207E61P0FGN-part1 (2511): 931.51 GiB (mv: None)
/dev/nvme1n1 (type: NVMe w/ journals):
data partitions
/dev/disk/by-id/nvme-INTEL_SSDPE2KX010T8_BTLJ849207F91P0FGN-part1 (2512): 931.51 GiB (mv: None)
/dev/nvme2n1 (type: NVMe w/ journals):
data partitions
/dev/disk/by-id/nvme-INTEL_SSDPE2KX010T8_BTLJ84920JAJ1P0FGN-part1 (2513): 931.51 GiB (mv: None)
7.2.2.3. Recognizing SSDs
The SSDs and HDDs are auto-discovered by their rotational flag in the /sys/block hierarchy. There are, however, occasions when this flag might be misleading, and an SSD is visible as a rotational device.
# disk_init_helper discover --start 101 '*Micron*M510*:s'
All devices matching the /dev/disk/by-id/*Micron*M510* pattern will be forced as SSD drives, regardless of how they were discovered by the tool.
7.2.2.4. Specifying a journal
Similarly, a journal may be specified for a device, for example:
# disk_init_helper discover --start 101 '*Hitachi*HUA7200*:h:njo'
Instructs the tool to use an NVMe journal-only device for keeping the journals for all Hitachi HUA7200 drives.
The overrides look like this:
<disk-serial-pattern>:<disk-type>[:<journal-type>]
The disk type may be one of:
s - SSD drive
sj - SSD drive with HDD journals (used for testing only)
n - NVMe drive
nj - NVMe drive with HDD journals
njo - NVMe drive with journals only (no StorPool data disk)
h - HDD drive
x - exclude this drive match, even if it has the right size
The journal-type override is optional, and makes sense mostly when the device is an HDD:
nj - journal on an NVMe drive; requires at least one nj device
njo - journal on an NVMe drive; requires at least one njo device
sj - journal on an SSD drive (unusual, but useful for testing); requires at least one sj device
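For instance, an override from the examples above can be combined with the -d option shown earlier to store the resulting configuration (a sketch; whether all flags may be combined in a single invocation is an assumption):
# disk_init_helper discover --start 101 '*Hitachi*HUA7200*:h:njo' -d disks.json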
7.2.3. Initializing drives
To initialize the drives using an existing configuration file:
# disk_init_helper init disks.json
The above will apply the settings pre-selected during the discovery phase.
More options may be specified to either provide some visibility on what will be done (like --verbose and --noop), or provide additional options to storpool_initdisk for the different disk types (like --ssd-args and --hdd-args).
7.3. Manual partitioning
A disk drive can be initialized manually as a StorPool data disk.
7.3.1. Creating partitions
First, an aligned partition should be created spanning the full volume of the disk drive. Here is an example command for creating a partition on the whole drive with the proper alignment:
# parted -s --align optimal /dev/sdX mklabel gpt -- mkpart primary 2MiB 100% # Here, X is the drive letter
For dual partitions on a NVMe drive that is larger than 4TB use:
# parted -s --align optimal /dev/nvmeXnY mklabel gpt -- mkpart primary 2MiB 50% # Here, X is the nvme device controller, and Y is the device's first namespace
# parted -s --align optimal /dev/nvmeXnY mkpart primary 50% 100%
Similarly, to split an even larger (for example, 8TB or larger) NVMe drive to four partitions use:
# parted -s --align optimal /dev/nvmeXnY mklabel gpt -- mkpart primary 2MiB 25% # Here, X is the nvme device controller, and Y is the device's first namespace
# parted -s --align optimal /dev/nvmeXnY mkpart primary 25% 50%
# parted -s --align optimal /dev/nvmeXnY mkpart primary 50% 75%
# parted -s --align optimal /dev/nvmeXnY mkpart primary 75% 100%
Hint
NVMe devices larger than 4TB should always be split into chunks of up to 4 TiB.
7.3.2. Initializing a drive
On a brand new cluster installation it is necessary to have one drive formatted with the “init” (-I) flag of the storpool_initdisk tool. This device is necessary only for the first start, and therefore it is best to pick the first drive in the cluster.
Initializing the first drive on the first server node with the init flag:
# storpool_initdisk -I {diskId} /dev/sdX # Here, X is the drive letter
Initializing an SSD or NVME SSD device with the SSD flag set:
# storpool_initdisk -s {diskId} /dev/sdX # Here, X is the drive letter
Initializing an HDD drive with a journal device:
# storpool_initdisk {diskId} /dev/sdX --journal /dev/sdY # Here, X and Y are the drive letters
To list all initialized devices:
# storpool_initdisk --list
Example output:
0000:01:00.0-p1, diskId 2305, version 10007, server instance 0, cluster e.b, SSD, opened 7745
0000:02:00.0-p1, diskId 2306, version 10007, server instance 0, cluster e.b, SSD, opened 7745
/dev/sdr1, diskId 2301, version 10007, server instance 1, cluster e.b, SSD, opened 8080
/dev/sdq1, diskId 2302, version 10007, server instance 1, cluster e.b, SSD, opened 8080
/dev/sds1, diskId 2303, version 10007, server instance 1, cluster e.b, SSD, opened 8080
/dev/sdt1, diskId 2304, version 10007, server instance 1, cluster e.b, SSD, opened 8080
/dev/sda1, diskId 2311, version 10007, server instance 2, cluster e.b, WBC, jmv 160036C1B49, opened 8185
/dev/sdb1, diskId 2311, version 10007, server instance 2, cluster -, journal mv 160036C1B49, opened 8185
/dev/sdc1, diskId 2312, version 10007, server instance 2, cluster e.b, WBC, jmv 160036CF95B, opened 8185
/dev/sdd1, diskId 2312, version 10007, server instance 2, cluster -, journal mv 160036CF95B, opened 8185
/dev/sde1, diskId 2313, version 10007, server instance 3, cluster e.b, WBC, jmv 160036DF8DA, opened 8971
/dev/sdf1, diskId 2313, version 10007, server instance 3, cluster -, journal mv 160036DF8DA, opened 8971
/dev/sdg1, diskId 2314, version 10007, server instance 3, cluster e.b, WBC, jmv 160036ECC80, opened 8971
/dev/sdh1, diskId 2314, version 10007, server instance 3, cluster -, journal mv 160036ECC80, opened 8971
7.3.3. Drive initialization options
Other available options of the storpool_initdisk
tool:
- --list
List all StorPool disks on this node.
- -i
Specify the server instance; used when more than one storpool_server instance is running on the same node.
- -r
Used to return an ejected disk back to the cluster or change some of the flags.
- -F
Forget this disk and mark it as ejected; succeeds only without a running storpool_server instance that has the drive opened.
- -s|--ssd y/n
Set the SSD flag - on new initialization only, not reversible with -r. Providing the y or n value forces a disk to be considered as flash-based or not.
- -j|--journal (<device>|none)
Used for HDDs when a RAID controller with a working cachevault or battery is present, or when an NVMe device is used as a power-loss protected write-back journal cache.
- --bad
Marks the disk as bad. It will be treated as ejected by the servers.
- --good
Resets the disk to ejected if it was bad. Use with caution.
- --list-empty
List empty NVMe devices.
- --json
Output the list of devices as a JSON object.
- --nvme-smart nvme-pci-addr
Dump the NVMe S.M.A.R.T. counters; only for devices controlled by the storpool_nvmed service.
Advanced options (use with care):
- -e (entries_count)
Initialize the disk by overriding the default number of entries (the default is based on the disk size).
- -o (objects_count)
Initialize the disk by overriding the default number of objects (the default is based on the disk size).
- --wipe-all-data
Used when re-initializing an already initialized StorPool drive. Use with caution.
- --no-test
Disable the forced one-time test flag.
- --no-notify
Does not notify servers of the changes; they won’t immediately open the disk. Useful for changing a flag with -r without returning the disk back to the server.
- --no-fua (y|n)
Used to forcefully disable FUA support for an SSD device. Use with caution, because it might lead to data loss if the device is powered off before issuing a FLUSH CACHE command.
- --no-flush (y|n)
Used to forcefully disable FLUSH support for an SSD device.
- --no-trim (y|n)
Used to forcefully disable TRIM support for an SSD device. Useful when the drive misbehaves with TRIM enabled.
- --test-override no/test/pass
Modify the “test override” flag (the default during disk init is “test”).
- --wbc (y|n)
Used for HDDs when the internal write-back caching is enabled; requires SP_WRITE_BACK_CACHE_ENABLED to be set in order to have an effect. Turned off by default.
- --nvmed-rescan
Instruct the storpool_nvmed service to rescan after device changes.
8. Network interfaces
The recommended mode of operation is with hardware acceleration enabled for supported network interfaces. Most NICs controlled by the i40e/ixgbe/ice (Intel), mlx4_core/mlx5_core (Nvidia/Mellanox), and bnxt (Broadcom) drivers do support hardware acceleration.
When enabled, the StorPool services use the NIC directly, bypassing the Linux kernel. This reduces CPU usage and processing latency, and StorPool traffic is not affected by issues in the Linux kernel (for example, floods). Because the Linux kernel is bypassed, the entire network stack is implemented in the StorPool services.
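To check which driver controls a given NIC (the interface name here is an example), the standard ethtool utility can be used; the driver: line of its output should match one of the drivers listed above:
# ethtool -i eth2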
8.1. Preparing Interfaces
There are two ways to configure the network interfaces for StorPool. One is automatic: providing the VLAN ID, the IP network(s), and the mode of operation, and leaving the IP address selection to the helper tooling. The other is semi-manual: providing explicit IP address configuration for each parameter on each of the nodes in the cluster.
8.2. Automatic configuration
The automatic configuration is performed with the net_helper
tool provided
by StorPool. On running the tools it selects addresses based on the SP_OURID
of each node in the cluster. It requires VLAN (default 0, that is, untagged), an
IP network for the storage, and pre-defined mode of operation for the interfaces
in the OS. For a detailed overview, see Introduction to network interface helper.
The supported modes are exclusive-ifaces, active-backup-bond, bridge-active-backup-bond, mlag-bond, and bridge-mlag-bond. They are described in the sections below.
Note
All modes relate only to the way the kernel networks are configured. The storage services are always in active-active mode (unless configured differently) using directly the underlying interfaces. Any configuration of the kernel interfaces is solely for the purposes of other traffic, for example for access to the API.
8.2.1. Exclusive interfaces mode
The exclusive-ifaces
mode offers the simplest possible configuration, with
just the two main raw interfaces, each configured with a different address.
This mode is used mainly when there is no need for redundancy on the kernel network, usually for multipath iSCSI.
Note
Not recommended for the storage network if the API is configured on top. In such a situation it is recommended to use some of the bond modes.
8.2.2. Active backup bond modes
In the active-backup-bond and bridge-active-backup-bond modes, the underlying storage interfaces are added to an active-backup bond interface (named spbond0 by default), which uses an ARP monitor to select the active interface in the bond.
If the VLAN for the storage interfaces is tagged, an additional VLAN interface is created on top of the bond.
Here is a very simple example with untagged VLAN (that is, 0):
If the network uses a tagged VLAN, the VLAN interface is created on top of the spbond0 interface.
Here is an example with tagged VLAN 100:
In the bridge-active-backup-bond mode, the final resolve interface is a slave of a bridge interface (named br-storage by default).
This is a tagged VLAN 100 on the bond:
Lastly, here is a more complex example with four interfaces (sp0, sp1, sp2, sp3). The first two, for the storage network, are in bridge-active-backup-bond mode. The other two, for the iSCSI network, are in exclusive-ifaces mode. There are two additional networks on top of the storage resolve interface (in this example, 1100 and 1200). There is also an additional multipath network on the iSCSI interfaces with VLANs: 1301 on the first, and 1302 on the second iSCSI network interface.
Creating such a configuration with the net_helper tool should be done in the following way:
# net_helper genconfig sp0 sp1 sp2 sp3 \
--vlan 100 \
--sp-network 10.0.0.1/24 \
--sp-mode bridge-active-backup-bond \
--add-iface-net 1100,10.1.100.0/24 \
--add-iface-net 1200,10.1.200.0/24 \
--iscsi-mode exclusive-ifaces \
--iscsicfg 1301,10.130.1.0/24:1302,10.130.2.0/24
The tooling helps select the addresses for the ARP-monitoring targets automatically, if they are not overridden, for better active network selection. These addresses are usually those of the other storage nodes in the cluster. For the iSCSI in this mode, it is best to provide explicit ARP monitoring addresses.
8.2.3. LACP modes
The mlag-bond and bridge-mlag-bond modes are very close to the active-backup-bond and bridge-active-backup-bond modes described above, with the notable difference that they use LACP, whether specified for the main storage or the iSCSI network interfaces.
With these bond types, no additional ARP-monitoring addresses are required or auto-generated by the tooling.
A quirk with these modes is that multipath networks for iSCSI are created on top of the bond interface, because there is no way to send traffic through a specific interface under the bond. Use the exclusive-ifaces mode for such cases.
8.2.4. Creating the configuration
You can create a configuration and save it as a file using the net_helper tool with the genconfig option. For more information, see the examples provided below.
8.2.5. Applying the configuration
To actually apply the configuration stored in a file, use the applyifcfg option:
[root@s11 ~]# net_helper applyifcfg --from-config /etc/storpool/autonets.conf
...
Additional sub-commands available are:
- up
Execute ifup/nmcli connection up on all created interfaces.
- down
Execute ifdown/nmcli connection down on all created interfaces.
- check
Check whether there is a configuration, or if there is a difference between the present one and a newly created one.
- cleanup
Delete all network interfaces created by the net_helper tool. Useful when re-creating the same raw interfaces with a different mode.
8.2.6. Simple example
Here is a minimal example with the following parameters:
Interface names: sp0 and sp1 (the order is important)
VLAN ID: 42
IP Network: 10.4.2.0/24
Predefined mode of operation: active-backup bond on top of the storage interfaces
The example below is for a node with SP_OURID=11. Running net_helper genconfig this way will just print an example configuration:
[root@s11 ~]# storpool_showconf SP_OURID
SP_OURID=11
[root@s11 ~]# net_helper genconfig sp0 sp1 --vlan 42 --sp-network 10.4.2.0/24 --sp-mode active-backup-bond
interfaces=sp0 sp1
addresses=10.4.2.11
sp_mode=active-backup-bond
vlan=42
add_iface=
sp_mtu=9000
iscsi_mtu=9000
iscsi_add_iface=
arp_ip_targets=10.4.2.12,10.4.2.13,10.4.2.14,10.4.2.15
config_path=/etc/storpool.conf.d/net_helper.conf
To store the configuration on the file system of the node:
[root@s11 ~]# net_helper genconfig sp0 sp1 --vlan 42 --sp-network 10.4.2.0/24 --sp-mode active-backup-bond > /etc/storpool/autonets.conf
With this configuration, the net_helper applyifcfg command can be used to produce configuration for the network based on the operating system. This example is for CentOS 7 (--noop just prints what will be done):
[root@s11 ~]# net_helper applyifcfg --from-config /etc/storpool/autonets.conf --noop
Same resolve interface spbond0.42 for both nets, assuming bond
An active-backup bond interface detected
Will patch /etc/storpool.conf.d/net_helper.conf with:
________________
SP_IFACE1_CFG=1:spbond0.42:sp0:42:10.4.2.11:b:s:P
SP_IFACE2_CFG=1:spbond0.42:sp1:42:10.4.2.11:b:s:P
SP_ALL_IFACES=dummy0 sp0 sp1 spbond0 spbond0.42
________________
Executing command: iface-genconf --auto --overwrite --sp-mtu 9000 --iscsi-mtu 9000 --arp-ip-targets 10.4.2.12,10.4.2.13,10.4.2.14,10.4.2.15 --noop
Using /usr/lib/storpool, instead of the default /usr/lib/storpool
Same resolve interface spbond0.42 for both nets, assuming bond
An active-backup bond interface detected
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-spbond0.42
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:02:27
DEVICE=spbond0.42
ONBOOT=yes
TYPE=Vlan
BOOTPROTO=none
MTU=9000
VLAN=yes
PHYSDEV=spbond0
IPADDR=10.4.2.11
PREFIX=24
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-dummy0
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:02:27
DEVICE=dummy0
ONBOOT=yes
TYPE=dummy
MASTER=spbond0
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-spbond0
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:02:27
DEVICE=spbond0
ONBOOT=yes
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=active-backup arp_interval=500 arp_validate=active arp_all_targets=any arp_ip_target=10.4.2.12,10.4.2.13,10.4.2.14,10.4.2.15"
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-sp0
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:02:27
DEVICE=sp0
ONBOOT=yes
TYPE=Ethernet
MASTER=spbond0
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-sp1
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:02:27
DEVICE=sp1
ONBOOT=yes
TYPE=Ethernet
MASTER=spbond0
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
There are many additional options available, for example the name of the bond could be customized, an additional set of VLAN interfaces could be created on top of the bond interface, and so on.
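For instance, an extra VLAN interface on top of the storage bond can be requested with the --add-iface option (also used in the advanced example below). The sketch below keeps the sp0/sp1 setup from above and adds VLAN 43 with the 10.4.3.0/24 network; treat the exact combination of options as an illustration:
[root@s11 ~]# net_helper genconfig sp0 sp1 --vlan 42 --sp-network 10.4.2.0/24 --sp-mode active-backup-bond --add-iface 43,10.4.3.0/24 > /etc/storpool/autonets.conf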
8.2.7. Advanced example
Here is a more advanced example with the following parameters:
- Interface names: lacp0 and lacp1 for StorPool, and sp0 and sp1 for the iSCSI service (matching the order in the command below)
- VLAN ID: 42 for the storage interfaces
- IP network: 10.4.2.0/24
- Additional VLAN ID: 43 on the bond over the storage interfaces
- Storage interfaces kernel mode of operation: a bridge with an MLAG bond on top of the storage interfaces
- iSCSI dedicated interfaces kernel mode of operation: an MLAG bond
- VLAN 100 and IP network 172.16.100.0/24 for a portal group in iSCSI
To prepare the configuration:
[root@s11 ~]# net_helper genconfig \
lacp0 lacp1 sp0 sp1 \
--vlan 42 \
--sp-network 10.4.2.0/24 \
--sp-mode bridge-mlag-bond \
--iscsi-mode mlag-bond \
--add-iface 43,10.4.3.0/24 \
--iscsicfg-net 100,172.16.100.0/24 | tee /etc/storpool/autonets.conf
interfaces=lacp0 lacp1 sp0 sp1
addresses=10.4.2.11
sp_mode=bridge-mlag-bond
vlan=42
iscsi_mode=mlag-bond
add_iface=43,10.4.3.11/24
sp_mtu=9000
iscsi_mtu=9000
iscsi_add_iface=100,172.16.100.11/24
iscsi_arp_ip_targets=
config_path=/etc/storpool.conf.d/net_helper.conf
Applying this configuration with --noop produces the following example output:
[root@s11 ~]# net_helper applyifcfg --from-config /etc/storpool/autonets.conf --noop
Same resolve interface br-storage for both nets, assuming bond
An 802.3ad bond interface detected
Will patch /etc/storpool.conf.d/net_helper.conf with:
________________
SP_RESOLVE_IFACE_IS_BRIDGE=1
SP_BOND_IFACE_NAME=spbond0.42
SP_IFACE1_CFG=1:br-storage:lacp0:42:10.4.2.11:b:s:v
SP_IFACE2_CFG=1:br-storage:lacp1:42:10.4.2.11:b:s:v
SP_ISCSI_IFACE=sp0,spbond1:sp1,spbond1:[lacp]
SP_ALL_IFACES=br-storage dummy0 dummy1 lacp0 lacp1 sp0 sp1 spbond0 spbond0.42 spbond0.43 spbond1 spbond1.100
________________
Executing command: iface-genconf --auto --overwrite --sp-mtu 9000 --iscsi-mtu 9000 --add-iface 43,10.4.3.11/24 --iscsicfg-explicit 100,172.16.100.11/24 --noop
Using /usr/lib/storpool, instead of the default /usr/lib/storpool
Same resolve interface br-storage for both nets, assuming bond
An 802.3ad bond interface detected
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-br-storage
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39
DEVICE=br-storage
ONBOOT=yes
TYPE=Bridge
BOOTPROTO=none
MTU=9000
IPADDR=10.4.2.11
PREFIX=24
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-spbond0.42
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39
DEVICE=spbond0.42
ONBOOT=yes
TYPE=Vlan
BRIDGE=br-storage
BOOTPROTO=none
MTU=9000
VLAN=yes
PHYSDEV=spbond0
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-dummy0
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39
DEVICE=dummy0
ONBOOT=yes
TYPE=dummy
MASTER=spbond0
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-spbond0
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39
DEVICE=spbond0
ONBOOT=yes
TYPE=Bond
BRIDGE=br-storage
BONDING_MASTER=yes
BONDING_OPTS="mode=802.3ad miimon=100 lacp_rate=1"
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-lacp0
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39
DEVICE=lacp0
ONBOOT=yes
TYPE=Ethernet
MASTER=spbond0
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-lacp1
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39
DEVICE=lacp1
ONBOOT=yes
TYPE=Ethernet
MASTER=spbond0
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-spbond0.43
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39
DEVICE=spbond0.43
ONBOOT=yes
TYPE=Vlan
BOOTPROTO=none
MTU=9000
VLAN=yes
PHYSDEV=spbond0
IPADDR=10.4.3.11
PREFIX=24
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-sp0
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39
DEVICE=sp0
ONBOOT=yes
TYPE=Ethernet
MASTER=spbond1
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-sp1
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39
DEVICE=sp1
ONBOOT=yes
TYPE=Ethernet
MASTER=spbond1
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-dummy1
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39
DEVICE=dummy1
ONBOOT=yes
TYPE=dummy
MASTER=spbond1
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-spbond1.100
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39
DEVICE=spbond1.100
ONBOOT=yes
TYPE=Vlan
BOOTPROTO=none
MTU=9000
VLAN=yes
PHYSDEV=spbond1
IPADDR=172.16.100.11
PREFIX=24
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-spbond1
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39
DEVICE=spbond1
ONBOOT=yes
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=802.3ad miimon=100 lacp_rate=1"
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
8.2.8. iSCSI configuration
The iface-genconf tool accepts an additional --iscsicfg option in one of the following forms:
- VLAN,NET0_IP/NET0_PREFIX: for single network portal groups
- VLAN_NET0,NET0_IP/NET0_PREFIX:VLAN_NET1,NET1_IP/NET1_PREFIX: for multipath portal groups
This option is available only when also using the --auto option. In this case its usage is assumed to be complementary to the interface configurations initially created with the same tool, and it is required for auto-detecting configurations where the interfaces for the storage system and for iSCSI overlap.
An example of adding an additional single portal group with VLAN 100 and portal
group address 10.1.100.251/24
would look like this:
iface-genconf -a --noop --iscsicfg 100,10.1.100.251/24
The above will auto-detect the operating system and the type of interface configuration used for the storage system and iSCSI, and depending on the configuration type (for example, exclusive interfaces or a bond) will print the interface configuration on the console. Without the --noop option, non-existing interface configurations will be created, and ones that already exist will not be automatically replaced (unless iface-genconf is instructed to).
The IP address for each of the nodes is derived from SP_OURID, and can be adjusted with the --iscsi-ip-offset option, whose value is added to SP_OURID when constructing the IP address.
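As a minimal illustration of the arithmetic only (plain shell, not a StorPool tool), the portal address in a 10.1.100.0/24 network could be computed like this, assuming an offset of 0:
[root@s11 ~]# OURID="$(storpool_showconf SP_OURID | cut -d= -f2)"
[root@s11 ~]# echo "portal address: 10.1.100.$((OURID + 0))/24"
portal address: 10.1.100.11/24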
The most common case for a single network portal group configuration is with either an active-backup or a LACP bond configured on top of the interfaces configured as SP_ISCSI_IFACE. For example, with SP_ISCSI_IFACE=ens2,bond0:ens3,bond0 the additional interface will be bond0.100 with IP 10.1.100.1 for the node with SP_OURID=1, and so on.
The same example for a multipath portal group with VLAN 201 for the first network and 202 for the second:
iface-genconf -a --noop --iscsicfg 201,10.2.1.251/24:202,10.2.2.251/24
In case of exclusive interfaces (for example, SP_ISCSI_IFACE=ens2:ens3), or in case of an active-backup bond configuration (for example, SP_ISCSI_IFACE=ens2,bond0:ens3,bond0), the VLAN interfaces will be configured on top of each of the underlying interfaces accordingly:
- ens2.201 with IP 10.2.1.1
- ens3.202 with IP 10.2.2.1
The example assumes a controller node with SP_OURID=1.
In case of an LACP bond (for example,
SP_ISCSI_IFACE=ens2,bond0:ens3,bond0:[lacp]
), all VLAN interfaces will be
configured on top of the bond interface (for example bond0.201
and bond0.202
with the same addresses), but such peculiar configurations should be rare.
The --iscsicfg option can be provided multiple times for multiple portal group configurations.
You can find information about all available configuration options using iface-genconf --help. For further details and some examples, see the StorPool Ansible playbook.
This feature was initially added with the 19.1 revision 19.01.1548.00e5a5633 release.
8.3. Manual configuration
The net_helper tool is merely a glue-like tool covering the following manual steps:
- Construct the SP_IFACE1_CFG/SP_IFACE2_CFG/SP_ISCSI_IFACE and other configuration statements based on the provided parameters (for the first and second network interfaces for the storage/iSCSI); an illustrative snippet is shown after this list.
- Execute iface-genconf, which recognizes these configurations and dumps the configuration in /etc/sysconfig (CentOS 7) or /etc/network/interfaces (Debian), or uses nmcli to configure the interfaces with NetworkManager (AlmaLinux 8/Rocky Linux 8/RHEL 8).
- Execute /usr/lib/storpool/vf-genconf to prepare or re-create the configuration for virtual function interfaces.
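For illustration, the statements generated by net_helper in the simple example above could equally be written by hand to /etc/storpool.conf.d/net_helper.conf before running iface-genconf (the values are the ones from that example; treat this as a sketch rather than a complete recipe):
SP_IFACE1_CFG=1:spbond0.42:sp0:42:10.4.2.11:b:s:P
SP_IFACE2_CFG=1:spbond0.42:sp1:42:10.4.2.11:b:s:P
SP_ALL_IFACES=dummy0 sp0 sp1 spbond0 spbond0.42
After that, iface-genconf is invoked as in the earlier example output:
[root@s11 ~]# iface-genconf --auto --overwrite --sp-mtu 9000 --iscsi-mtu 9000 --noop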
8.4. Network and storage controllers interrupts affinity
The setirqaff utility is started by cron every minute. It checks the CPU affinity settings of several classes of IRQs (network interfaces, HBA, RAID), and updates them if needed. The policy is built into the script and does not require any external configuration files, apart from a properly configured storpool.conf (see 6. Node configuration options) for the present node.
9. Background services
A StorPool installation provides background services that take care of different functionality on each node participating in the cluster.
For details about how to control the services, see 10. Managing services with storpool_ctl.
9.1. storpool_beacon
The beacon must be the first StorPool process started on all nodes in the cluster. It
informs all members about the availability of the node on which it is installed.
If the number of visible nodes changes, every storpool_beacon service checks that its node still participates in the quorum, which means it can communicate with more than half of the expected nodes, including itself (see SP_EXPECTED_NODES in 6. Node configuration options).
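For example, in a cluster with SP_EXPECTED_NODES=5, a node remains in the quorum only while it can reach at least three voting beacons, counting its own. The expected number of nodes can be checked on any node with storpool_showconf (the output below is illustrative):
[root@s11 ~]# storpool_showconf SP_EXPECTED_NODES
SP_EXPECTED_NODES=5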
If the storpool_beacon service starts successfully, it will send to the system log (/var/log/messages, /var/log/syslog, or similar) messages like those shown below for every node that comes up in the StorPool cluster:
[snip]
Jan 21 16:22:18 s01 storpool_beacon[18839]: [info] incVotes(1) from 0 to 1, voteOwner 1
Jan 21 16:23:10 s01 storpool_beacon[18839]: [info] peer 2, beaconStatus UP bootupTime 1390314187662389
Jan 21 16:23:10 s01 storpool_beacon[18839]: [info] incVotes(1) from 1 to 2, voteOwner 2
Jan 21 16:23:10 s01 storpool_beacon[18839]: [info] peer up 1
[snip]
9.2. storpool_server
The storpool_server
service must be started on each node that provides its
storage devices (HDD, SSD, or NVMe drives) to the cluster. If the service starts
successfully, all the drives intended to be used as StorPool disks should be
listed in the system log, as shown in the example below:
Dec 14 09:54:19 s11 storpool_server[13658]: [info] /dev/sdl1: adding as data disk 1101 (ssd)
Dec 14 09:54:19 s11 storpool_server[13658]: [info] /dev/sdb1: adding as data disk 1111
Dec 14 09:54:20 s11 storpool_server[13658]: [info] /dev/sda1: adding as data disk 1114
Dec 14 09:54:20 s11 storpool_server[13658]: [info] /dev/sdk1: adding as data disk 1102 (ssd)
Dec 14 09:54:20 s11 storpool_server[13658]: [info] /dev/sdj1: adding as data disk 1113
Dec 14 09:54:22 s11 storpool_server[13658]: [info] /dev/sdi1: adding as data disk 1112
On a dedicated node (for example, one with a larger amount of spare resources) you can start more than one instance of the storpool_server service (up to seven); for details, see 13. Multi-server.
9.3. storpool_block
The storpool_block service provides the client (initiator) functionality. StorPool volumes can be attached only to the nodes where this service is running. When attached to a node, a volume can be used and manipulated as a regular block device via the /dev/storpool/{volume_name} symlink:
# lsblk /dev/storpool/test
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sp-2 251:2 0 100G 0 disk
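To make a volume appear on the local node it first has to be attached there. A minimal illustration, assuming a volume named test already exists (see 11. CLI tutorial and 12. CLI reference for the full attach syntax):
# storpool attach volume test here
OK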
9.4. storpool_mgmt
The storpool_mgmt
service should be started on at least two management nodes
in the cluster. It receives requests from user space tools (CLI or API),
executes them in the StorPool cluster, and returns the results back to the
sender. An automatic failover mechanism is available: when the node with the
active storpool_mgmt
service fails, the SP_API_HTTP_HOST
IP address is
automatically configured on the next node with the lowest SP_OURID
with a
running storpool_mgmt
service.
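Which management instance is currently active can be seen in the output of storpool service list (see 12.7. Services), where the active instance is marked with active:
# storpool service list
cluster running, mgmt on node 12
mgmt 11 running on node 11 ver 20.00.18, started 2022-09-08 18:23:36, uptime 1 day 00:53:43
mgmt 12 running on node 12 ver 20.00.18, started 2022-09-08 18:23:35, uptime 1 day 00:53:44 active
[snip]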
9.5. storpool_bridge
The storpool_bridge
service is started on two or more nodes in the cluster,
with one being active (similarly to the storpool_mgmt
service). This service
synchronizes snapshots for the backup and disaster recovery use cases between
the current cluster and one or more StorPool clusters in different locations.
9.6. storpool_controller
The storpool_controller
service is started on all nodes running the
storpool_server
service. It collects information from all
storpool_server
instances in order to provide statistics data to the API.
Note
The storpool_controller
service requires port 47567 to be open on
the nodes where the API (storpool_mgmt
) service is running.
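A minimal sketch for opening the port on a node that uses firewalld (adjust to whatever firewall management is actually in use on your nodes):
# firewall-cmd --permanent --add-port=47567/tcp
# firewall-cmd --reload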
9.7. storpool_nvmed
The storpool_nvmed service is started on all nodes that have the storpool_server service and NVMe devices. It handles the management of the NVMe devices: unplugging them from the kernel's NVMe driver and passing them to the storpool_pci or vfio_pci drivers. You can configure this using the SP_NVME_PCI_DRIVER option in the /etc/storpool.conf file. For more information about the configuration options for the storpool_nvmed service, see 6.7. NVMe service.
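For example, an /etc/storpool.conf entry selecting the vfio-pci driver might look like the line below; the exact accepted values and defaults are described in 6.7. NVMe service, so treat this as an illustration only:
SP_NVME_PCI_DRIVER=vfio_pci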
9.8. storpool_stat
The storpool_stat
service is started on all nodes. It collects the following
system metrics from all nodes:
- CPU stats - queue run/wait, user, system, and so on, per CPU
- Memory usage stats per cgroup
- Network stats for the StorPool services
- The I/O stats of the system drives
- Per-host validating service checks (for example, if there are processes in the root cgroup, the API is reachable if configured, and so on)
On some nodes it collects additional information:
- On all nodes with the storpool_block service: the I/O stats of all attached StorPool volumes
- On server nodes: stats for the communication of storpool_server with the drives
For more information, see Monitoring metrics collected.
The collected data can be viewed at https://analytics.storpool.com. It can also
be submitted to an InfluxDB instance run by your organization; this can be
configured in storpool.conf
, for details see 6.5. Monitoring and debugging.
9.9. storpool_qos
The storpool_qos service tracks changes for volumes that match certain criteria, and takes care of updating the I/O performance settings of the matching volumes. For details, see Quality of service.
9.10. storpool_iscsi
storpool_iscsi
is a client service (like storpool_block
)
that translates the operations received via iSCSI to the StorPool internal
protocol.
The service differs from the rest of the StorPool services in that it requires 4 network interfaces instead of the regular 2. Two of them are used for communication with the cluster, and the other two are used for providing iSCSI service to initiators.
The service itself needs a separate IP address for every portal and network (different than the ones used in the host kernel). These addresses are handled by the service’s TCP/IP stack and have their own MAC addresses.
Note
Currently, re-use of the host IP address for the iSCSI service is not possible.
Note that the iSCSI service cannot operate without hardware acceleration. For details, see 8. Network interfaces and the StorPool System Requirements document.
For more information on configuring and using the service, see 6.8. iSCSI options and 16. Setting iSCSI targets.
9.11. storpool_abrtsync
The storpool_abrtsync
service automatically sends reports about aborted
services to StorPool’s monitoring system.
9.12. storpool_cgmove
The storpool_cgmove service finds and moves all processes from the root cgroup into a slice, so that they:
- Cannot eat up memory in the root cgroup
- Are accounted in some of the slices
The service does this once, when the system boots. For more information about the configuration options for the service, see 6.6. Cgroup options.
If you need to further manage the cgroups of StorPool processes running on the machine, it is recommended to use the storpool_process tool. For details about cgroups-related alerts, see Monitoring alerts.
9.13. storpool_havm
The storpool_havm (highly available virtual machine tracking) service tracks the state of one or more virtual machines and keeps each of them active on one of the nodes in the cluster. The sole purpose of this service is to offload the orchestration responsibility for virtual machines where fast startup after a failover event is crucial.
A virtual machine is configured on all nodes where the StorPool API (storpool_mgmt service) is running, with a predefined VM XML definition and predefined volume names. The storpool_havm@<vm_name> service gets enabled on each API node in the cluster, then starts tracking the state of this virtual machine.
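A minimal sketch of enabling the tracking service for a hypothetical VM named nfs1, assuming systemd and that the VM definition and volumes are already in place on the API nodes:
# systemctl enable --now storpool_havm@nfs1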
The VM is kept active on the active
API node. In the typical case where the
active API changes due to service restart, the VM gets live-migrated to the new
active API node.
In case of a failure of the node where the active API was last running, the service takes care to fence the block devices on the old API node and to start the VM on the present active node.
The primary use case is virtual machines for NFS or S3.
This service is available starting with release 20.0 revision 20.0.19.1a208ffab.
9.14. storpool_logd
The StorPool log daemon (storpool_logd) receives log messages from all StorPool services working in the cluster, as well as the Linux kernel logs, for further analysis and advanced monitoring.
Tracking the storage service logs for the whole cluster enables more advanced monitoring, as well as safer maintenance operations. In the long term, it allows for:
- Better accountability
- Reduced times for investigating issues or incidents
- Logs inspection for longer periods
- Retroactive detection of issues identified in a production cluster in the whole installed base
The service reads messages from two log streams, enqueues them into a persistent backend, and sends them to StorPool’s infrastructure. Once the reception is confirmed, messages are removed from the backend.
The service tries its best to ensure the logs are delivered. Logs can survive
between process and entire node restarts. storpool_logd
prioritizes
persisted messages over incoming ones, so new messages will be dropped if the
persistent storage is full.
Monitoring cluster-wide logs allows raising alerts on cluster-wide events that are based either on specific messages or classified message frequency thresholds. The ultimate goal is to lower the risk of accidents and unplanned downtime by proactively detecting issues based on similarities in the observed services behavior.
The main use cases for the service are:
- Proactively detect abnormal and unexpected messages and raise alerts
- Proactively detect an abnormal rate of messages and raise alerts
- Enhance situational awareness by allowing operators to monitor the logs for a whole cluster in one view
- Allow for easier tracking of newly discovered failure scenarios
There are relevant configuration sections for the service for cases where a proxy is required for sending data, or where a custom instance is used to override the URL of the default instance provided in StorPool’s infrastructure.
This service is available starting with release 20.0 revision 20.0.372.4cd1679db.
10. Managing services with storpool_ctl
storpool_ctl
is a helper tool providing an easy way to perform an action for
all installed services on a StorPool node. You can use it to start, stop, or
restart services, or enable/disable starting them on boot. For more information
about the services themselves, see 9. Background services.
Nodes in a StorPool cluster might have different sets of services installed, depending on the environment and on whether the node is a client, a server, or a converged node (both), or whether services have been added or removed. Using storpool_ctl you can manage the set of services needed on each node. Here are some typical use cases:
- Querying for the status of all services installed on a node.
- Starting and enabling all services. This usually happens right after a new installation.
- The opposite case, where all services have to be disabled/stopped when a node is being uninstalled or moved to a different location.
- When a client-only node is promoted to a server (often referred to as a converged node). In this case, after initializing all the drives on this node, the tool will take care to start/enable all required additional services.
The storpool_ctl
tool is available in StorPool starting with the
19.2 revision 19.01.1991.f5ec6de23 release.
10.1. Supported actions
To list all supported actions use:
# storpool_ctl --help
usage: storpool_ctl [-h] {disable,start,status,stop,restart,enable} ...
Tool that controls all StorPool services on a node, taking care of
service dependencies, required checks before executing an action and
others.
positional arguments:
{disable,start,status,stop,restart,enable}
action
optional arguments:
-h, --help show this help message and exit
10.2. Getting status
List the status of all services installed on this node:
# storpool_ctl status
storpool_nvmed not_running
storpool_mgmt not_running
storpool_reaffirm not_running
storpool_flushwbc not_running
storpool_stat not_running
storpool_abrtsync not_running
storpool_block not_running
storpool_kdump not_running
storpool_hugepages not_running
storpool_bridge not_running
storpool_beacon not_running
storpool_cgmove not_running
storpool_iscsi not_running
storpool_server not_running
storpool_controller not_running
The status is always one of:
- not_running
  when the service is disabled or stopped
- not_enabled
  when the service is running but is not yet enabled
- running
  when the service is running and enabled
Note
The tool always prints the status after any selected action was applied.
You can also use storpool_ctl status --problems
, which shows only services
that are not in running state. Note that in this mode the utility exits with
non-zero exit status in case there are any installed services in not_running
or not_enabled
state.
# storpool_ctl status --problems
storpool_mgmt not_running
storpool_server not_enabled
# echo $?
1
10.3. Starting services
To start all services:
# storpool_ctl start
cgconfig running
storpool_abrtsync not_enabled
storpool_cgmove not_enabled
storpool_block not_enabled
storpool_mgmt not_enabled
storpool_flushwbc not_enabled
storpool_server not_enabled
storpool_hugepages not_enabled
storpool_stat not_enabled
storpool_controller not_enabled
storpool_beacon not_enabled
storpool_kdump not_enabled
storpool_reaffirm not_enabled
storpool_bridge not_enabled
storpool_iscsi not_enabled
10.4. Enabling services
Note
The services should be enabled after the configuration of cgroups is completed. For more information, see Control groups.
To enable all services:
# storpool_ctl enable
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_cgmove.service to /usr/lib/systemd/system/storpool_cgmove.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_bridge.service to /usr/lib/systemd/system/storpool_bridge.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_block.service to /usr/lib/systemd/system/storpool_block.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_reaffirm.service to /usr/lib/systemd/system/storpool_reaffirm.service.
Created symlink from /etc/systemd/system/storpool_beacon.service.wants/storpool_reaffirm.service to /usr/lib/systemd/system/storpool_reaffirm.service.
Created symlink from /etc/systemd/system/storpool_block.service.wants/storpool_reaffirm.service to /usr/lib/systemd/system/storpool_reaffirm.service.
Created symlink from /etc/systemd/system/storpool_mgmt.service.wants/storpool_reaffirm.service to /usr/lib/systemd/system/storpool_reaffirm.service.
Created symlink from /etc/systemd/system/storpool_server.service.wants/storpool_reaffirm.service to /usr/lib/systemd/system/storpool_reaffirm.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_controller.service to /usr/lib/systemd/system/storpool_controller.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_beacon.service to /usr/lib/systemd/system/storpool_beacon.service.
Created symlink from /etc/systemd/system/storpool_server.service.wants/storpool_beacon.service to /usr/lib/systemd/system/storpool_beacon.service.
Created symlink from /etc/systemd/system/storpool_block.service.wants/storpool_beacon.service to /usr/lib/systemd/system/storpool_beacon.service.
Created symlink from /etc/systemd/system/storpool_mgmt.service.wants/storpool_beacon.service to /usr/lib/systemd/system/storpool_beacon.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_kdump.service to /usr/lib/systemd/system/storpool_kdump.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_abrtsync.service to /usr/lib/systemd/system/storpool_abrtsync.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_mgmt.service to /usr/lib/systemd/system/storpool_mgmt.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_server.service to /usr/lib/systemd/system/storpool_server.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_flushwbc.service to /usr/lib/systemd/system/storpool_flushwbc.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_stat.service to /usr/lib/systemd/system/storpool_stat.service.
Created symlink from /etc/systemd/system/sysinit.target.wants/storpool_hugepages.service to /usr/lib/systemd/system/storpool_hugepages.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_iscsi.service to /usr/lib/systemd/system/storpool_iscsi.service.
storpool_cgmove running
storpool_bridge running
storpool_block running
storpool_reaffirm running
storpool_controller running
storpool_beacon running
storpool_kdump running
storpool_abrtsync running
storpool_mgmt running
storpool_server running
storpool_flushwbc running
storpool_stat running
storpool_hugepages running
storpool_iscsi running
10.5. Disabling services
To disable all services (without stopping them):
# storpool_ctl disable
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_cgmove.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_bridge.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_block.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_reaffirm.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_controller.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_beacon.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_kdump.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_abrtsync.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_mgmt.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_server.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_flushwbc.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_stat.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_iscsi.service.
Removed symlink /etc/systemd/system/sysinit.target.wants/storpool_hugepages.service.
Removed symlink /etc/systemd/system/storpool_beacon.service.wants/storpool_reaffirm.service.
Removed symlink /etc/systemd/system/storpool_block.service.wants/storpool_reaffirm.service.
Removed symlink /etc/systemd/system/storpool_block.service.wants/storpool_beacon.service.
Removed symlink /etc/systemd/system/storpool_mgmt.service.wants/storpool_reaffirm.service.
Removed symlink /etc/systemd/system/storpool_mgmt.service.wants/storpool_beacon.service.
Removed symlink /etc/systemd/system/storpool_server.service.wants/storpool_reaffirm.service.
Removed symlink /etc/systemd/system/storpool_server.service.wants/storpool_beacon.service.
storpool_kdump not_enabled
storpool_hugepages not_enabled
storpool_bridge not_enabled
storpool_controller not_enabled
storpool_beacon not_enabled
storpool_cgmove not_enabled
storpool_block not_enabled
storpool_mgmt not_enabled
storpool_stat not_enabled
storpool_reaffirm not_enabled
storpool_server not_enabled
storpool_flushwbc not_enabled
storpool_abrtsync not_enabled
storpool_iscsi not_enabled
10.6. Stopping services
To stop all services:
# storpool_ctl stop
storpool_server not_running
storpool_iscsi not_running
storpool_controller not_running
storpool_mgmt not_running
storpool_cgmove not_running
storpool_kdump not_running
storpool_reaffirm not_running
storpool_stat not_running
storpool_beacon not_running
storpool_hugepages not_running
storpool_flushwbc not_running
storpool_block not_running
storpool_bridge not_running
storpool_abrtsync not_running
Module storpool_pci version 6D0D7D6E357D24CBDF2D1BA
Module storpool_disk version D92BDA6C929615392EEAA7E
Module storpool_bd version C6EB4EEF1E0ABF1A4774788
Module storpool_rdma version 4F1FB67DF4617ECD6C472C4
The stop action features the following options:
- --servers
  Stop just the server instances.
- --expose-nvme
  Expose any configured NVMe devices attached to the selected SP_NVME_PCI_DRIVER back to the nvme driver.
11. CLI tutorial
StorPool provides an easy yet powerful Command Line Interface (CLI) for
administering the data storage cluster or multiple clusters in the same
location (multi-cluster). It has an integrated help system that provides useful
information on every step. There are various ways to execute commands in the
CLI, depending on the style and needs of the administrator. The StorPool CLI
gets its configuration from the /etc/storpool.conf
file and command line
options.
The current document provides an introduction to the CLI and a few useful examples. For more information, see the 12. CLI reference document.
11.1. Using the standard shell
Type a regular shell command with parameters:
# storpool service list
Pipe command output to StorPool CLI:
# echo "service list" | storpool
Redirect the standard input from a predefined file with commands:
# storpool < input_file
Display the available command line options:
# storpool --help
11.2. Using the interactive shell
To start the interactive StorPool shell:
# storpool
StorPool> service list
Interactive shell help can be invoked by pressing the question mark key (?
):
# storpool
StorPool> attach <?>
client - specify a client to attach the volume to {M}
here - attach here {M}
list - list the current attachments
mode - specify the read/write mode {M}
noWait - do not wait for the client {M}
snapshot - specify a snapshot to attach {M}
timeout - seconds to wait for the client to appear {M}
volume - specify a volume to attach {M}
Shell autocomplete (invoked by double-pressing the Tab key) will show the available options for the current step:
StorPool> attach <tab> <tab>
client here list mode noWait snapshot timeout volume
The StorPool shell can detect incomplete lines and suggest options:
# storpool
StorPool> attach <enter>
.................^
Error: incomplete command! Expected:
volume - specify a volume to attach
client - specify a client to attach the volume to
list - list the current attachments
here - attach here
mode - specify the read/write mode
snapshot - specify a snapshot to attach
timeout - seconds to wait for the client to appear
noWait - do not wait for the client
To exit the interactive shell use the quit
or exit
command, or directly
use the Ctrl+C
or Ctrl+D
keyboard shortcuts of your terminal.
To set the string that would appear as a prompt for the interactive shell, use
the SP_CLI_PROMPT
option in the /etc/storpool.conf
file. For details,
see 6.9.12. CLI prompt string.
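For example, a hypothetical /etc/storpool.conf entry (see 6.9.12. CLI prompt string for the exact behavior and accepted values):
SP_CLI_PROMPT=lab-cluster-1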
11.3. Error messages
If the shell command is incomplete or wrong, the system displays an error message that includes the possible options:
# storpool attach
Error: incomplete command! Expected:
list - list the current attachments
timeout - seconds to wait for the client to appear
volume - specify a volume to attach
here - attach here
noWait - do not wait for the client
snapshot - specify a snapshot to attach
mode - specify the read/write mode
client - specify a client to attach the volume to
# storpool attach volume
Error: incomplete command! Expected:
volume - the volume to attach
11.4. Multi-cluster mode
To enter multi-cluster mode (see 17. Multi-site and multi-cluster) while in interactive mode:
StorPool> multiCluster on
[MC] StorPool>
For non-interactive mode, use:
# storpool -M <command>
Note
All commands not relevant to multi-cluster will silently fall back to non-multi-cluster mode. For example, storpool -M service list will list only local services. The same applies to storpool -M disk list and storpool -M net list.
12. CLI reference
For introduction and examples, see 11. CLI tutorial.
12.1. Location
The location
submenu is used for configuring other StorPool sub-clusters in
the same or different location (see 17. Multi-site and multi-cluster). The
location ID is the first part (left of the .
) in the SP_CLUSTER_ID
configured in the remote cluster.
For example, to add a location with SP_CLUSTER_ID=nzkr.b
:
# storpool location add nzkr StorPoolLab-Sofia
OK
To list the configured locations:
# storpool location list
-----------------------------------------------
| id | name | rxBuf | txBuf |
-----------------------------------------------
| nzkr | StorPoolLab-Sofia | 85 KiB | 128 KiB |
-----------------------------------------------
To rename a location:
# storpool location rename StorPoolLab-Sofia name StorPoolLab-Amsterdam
OK
To remove a location:
# storpool location remove StorPoolLab-Sofia
OK
Note
This command will fail if there is an existing cluster or a remote bridge configured for this location.
To update the send or receive buffer sizes to values different from the defaults, use:
# storpool location update StorPoolLab-Sofia recvBufferSize 16M
OK
# storpool location update StorPoolLab-Sofia sendBufferSize 1M
OK
# storpool location list
-----------------------------------------------
| id | name | rxBuf | txBuf |
-----------------------------------------------
| nzkr | StorPoolLab-Sofia | 16 MiB | 1.0 MiB |
-----------------------------------------------
12.2. Cluster
The cluster submenu is used for configuring a new cluster for an already configured location. The cluster ID is the second part (right of the .) in the SP_CLUSTER_ID configured in the remote cluster. For example, to add the cluster b for the remote location nzkr (StorPoolLab-Sofia), use:
# storpool cluster add StorPoolLab-Sofia b
OK
To list the configured clusters use:
# storpool cluster list
--------------------------------------------------
| name | id | location |
--------------------------------------------------
| StorPoolLab-Sofia-cl1 | b | StorPoolLab-Sofia |
--------------------------------------------------
To remove a cluster use:
# storpool cluster remove StorPoolLab-Sofia b
12.3. Remote bridge
The remoteBridge
submenu is used to register or deregister a remote bridge
for a configured location.
12.3.1. Registering and de-registering
To register a remote bridge use storpool remoteBridge register <location-name>
<IP address> <public-key>
as shown in the example below:
# storpool remoteBridge register StorPool-Rome 10.1.100.10 ju9jtefeb8idz.ngmrsntnzhsei.grefq7kzmj7zo.nno515u6ftna6
OK
This will register the StorPool-Rome
location with an IP address of
10.1.100.10
and the above public key.
In case of a change in the IP address or the public key of a remote location, the remote bridge could be de-registered and then registered again with the required parameters; here is an example:
# storpool remoteBridge deregister 10.1.100.10
OK
# storpool remoteBridge register StorPool-Rome 78.90.13.150 8nbr9q162tjh.ahb6ueg16kk2.mb7y2zj2hn1ru.5km8qut54x7z
OK
A remote bridge might be registered with noCrypto in case of a secure interconnect between the clusters; a typical use case is a 17.1. Multi-cluster setup with other sub-clusters in the same datacenter.
12.3.2. Minimum deletion delay
To enable deferred deletion on unexport from the remote site, the minimumDeleteDelay flag should also be set. The format of the command is storpool remoteBridge register <location-name> <IP address> <public-key> minimumDeleteDelay <minimumDeleteDelay>, where the last parameter is a time period provided as X[smhd]; X is an integer and s, m, h, and d stand for seconds, minutes, hours, and days accordingly.
For example, if you want to register the remote bridge for the StorPool-Rome
location with a minimumDeleteDelay
of one day you can do it like this:
# storpool remoteBridge register StorPool-Rome 78.90.13.150 8nbr9q162tjh.ahb6ueg16kk2.mb7y2zj2hn1ru.5km8qut54x7z minimumDeleteDelay 1d
OK
After this operation, all snapshots sent from the remote cluster can later be unexported with the deleteAfter parameter set (see the 12.11.6. Remote snapshots section). Any deleteAfter parameter lower than the minimumDeleteDelay will be overridden by the bridge in the remote cluster. All such events will be logged on the node with the active bridge in the remote cluster.
For more information about deferred deletion, see 17.2. Multi site.
12.3.3. Listing registered remote bridges
To list all registered remote bridges use:
# storpool remoteBridge list
------------------------------------------------------------------------------------------------------------------------------
| ip | remote | minimumDeleteDelay | publicKey | noCrypto |
------------------------------------------------------------------------------------------------------------------------------
| 10.1.200.10 | StorPool-Rome | | nonwtmwsgdr2p.fos2qus4h1qdk.pnt9ozj8gcktj.d7b2aa24gsegn | 0 |
| 10.1.200.11 | StorPool-Rome | | jtgeaqhsmqzqd.x277oefofxbpm.bynb2krkiwg54.ja4gzwqdg925j | 0 |
------------------------------------------------------------------------------------------------------------------------------
12.3.4. Status of remote bridges
You can see the state of all registered bridges to clusters in other locations or sub-clusters in the same location:
# storpool remoteBridge status
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| ip | clusterId | connectionState | connectionTime | reconnectCount | receivedExports | sentExports | lastError | lastErrno | errorTime | bytesSentSinceStart | bytesRecvSinceStart | bytesSentSinceConnect | bytesRecvSinceConnect |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 10.1.200.11 | d.b | connected | 2021-02-07 18:08:25 | 2 | 5 | 2 | socket error | Operation not permitted | 2021-02-07 17:58:58 | 210370560 | 242443328 | 41300272 | 75088624 |
| 10.1.200.10 | d.d | connected | 2021-02-07 17:51:42 | 1 | 7 | 2 | no error | No error information | - | 186118480 | 39063648 | 186118480 | 39063648 |
| 10.1.200.4 | e.n | connected | 2021-02-07 17:51:42 | 1 | 5 | 0 | no error | No error information | - | 117373472 | 316784 | 117373472 | 316784 |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Initially added with the 19.2 revision 19.01.1813.f4697d8c2 release.
12.4. Network
To list basic details about the cluster network use:
# storpool net list
------------------------------------------------------------
| nodeId | flags | net 1 | net 2 |
------------------------------------------------------------
| 11 | uU + AJ | F4:52:14:76:9C:B0 | F4:52:14:76:9C:B0 |
| 12 | uU + AJ | 02:02:C9:3C:E3:80 | 02:02:C9:3C:E3:81 |
| 13 | uU + AJ | F6:52:14:76:9B:B0 | F6:52:14:76:9B:B1 |
| 14 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
| 15 | uU + AJ | 1A:60:00:00:00:0F | 1E:60:00:00:00:0F |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
u - packets recently received from this node
d - no packets recently received from this node
U - this node has enough votes to be in the quorum
D - this node does not have enough votes to be in the quorum
M - this node is being damped by the rest of the nodes in the cluster
+ - this node considers itself in the quorum
B - the connection to this node is through a backup link; check the cabling
A - this node is using hardware acceleration
J - the node uses jumbo frames
N - a non-voting node
12.5. Server
To list the nodes that are configured as StorPool servers and their
storpool_server
instances use:
# storpool server list
cluster running, mgmt on node 11
server 11.0 running on node 11
server 12.0 running on node 12
server 13.0 running on node 13
server 14.0 running on node 14
server 11.1 running on node 11
server 12.1 running on node 12
server 13.1 running on node 13
server 14.1 running on node 14
To get more information about which storage devices are provided by a particular server, use storpool server <ID> disk list:
# storpool server 11 disk list
disk | server | size | used | est.free | % | free entries | on-disk size | allocated objects | errors | flags
1103 | 11.0 | 447 GiB | 3.1 GiB | 424 GiB | 1 % | 1919912 | 20 MiB | 40100 / 480000 | 0 / 0 |
1104 | 11.0 | 447 GiB | 3.1 GiB | 424 GiB | 1 % | 1919907 | 20 MiB | 40100 / 480000 | 0 / 0 |
1111 | 11.0 | 465 GiB | 2.6 GiB | 442 GiB | 1 % | 494977 | 20 MiB | 40100 / 495000 | 0 / 0 |
1112 | 11.0 | 365 GiB | 2.6 GiB | 346 GiB | 1 % | 389977 | 20 MiB | 40100 / 390000 | 0 / 0 |
1125 | 11.0 | 931 GiB | 2.6 GiB | 894 GiB | 0 % | 974979 | 20 MiB | 40100 / 975000 | 0 / 0 |
1126 | 11.0 | 931 GiB | 2.6 GiB | 894 GiB | 0 % | 974979 | 20 MiB | 40100 / 975000 | 0 / 0 |
----------------------------------------------------------------------------------------------------------------------------------------
6 | 1.0 | 3.5 TiB | 16 GiB | 3.4 TiB | 0 % | 6674731 | 122 MiB | 240600 / 3795000 | 0 / 0 |
Note
Without specifying an instance, the first instance is assumed - 11.0 as in the above example. The second, third, and fourth storpool_server instances would be 11.1, 11.2, and 11.3 accordingly.
To list the servers that are blocked and could not join the cluster for some reason:
# storpool server blocked
cluster waiting, mgmt on node 12
server 11.0 waiting on node 11 missing:1201,1202,1203,1204,1212,1221,1222,1223,1224,1225,1226,1301,1302,1303,1304,1311,1312,1321,1322,1323,1324,1325,1326,4043 pending:1103,1104,1111,1112,1125,1126
server 12.0 down on node 12
server 13.0 down on node 13
server 14.0 waiting on node 14 missing:1201,1202,1203,1204,1212,1221,1222,1223,1224,1225,1226,1301,1302,1303,1304,1311,1312,1321,1322,1323,1324,1325,1326,4043 pending:1403,1404,1411,1412,1421,1423
server 11.1 waiting on node 11 missing:1201,1202,1203,1204,1212,1221,1222,1223,1224,1225,1226,1301,1302,1303,1304,1311,1312,1321,1322,1323,1324,1325,1326,4043 pending:1101,1102,1121,1122,1123,1124
server 12.1 down on node 12
server 13.1 down on node 13
server 14.1 waiting on node 14 missing:1201,1202,1203,1204,1212,1221,1222,1223,1224,1225,1226,1301,1302,1303,1304,1311,1312,1321,1322,1323,1324,1325,1326,4043 pending:1401,1402,1424,1425,1426
12.6. Fault sets
The fault sets are a way to instruct StorPool to use the drives in a group of nodes for only one replica of the data if they are expected to fail simultaneously. Some examples would be:
- Multinode chassis.
- Multiple nodes in the same rack backed by the same power supply.
- Nodes connected to the same set of switches.
To define a fault set, only a name and a set of server nodes are needed:
# storpool faultSet chassis_1 addServer 11 addServer 12
OK
To list defined fault sets:
# storpool faultSet list
-------------------------------------------------------------------
| name | servers |
-------------------------------------------------------------------
| chassis_1 | 11 12 |
-------------------------------------------------------------------
To remove a fault set:
# storpool faultSet chassis_1 delete chassis_1
Attention
A new fault set definition takes effect only for newly created volumes. To change the configuration of already created volumes, a re-balance operation is required. See 12.18. Balancer for more details on re-balancing a cluster after defining fault sets.
12.7. Services
You can check the state of all services presently running in the cluster and their uptime in the following way:
# storpool service list
cluster running, mgmt on node 12
mgmt 11 running on node 11 ver 20.00.18, started 2022-09-08 18:23:36, uptime 1 day 00:53:43
mgmt 12 running on node 12 ver 20.00.18, started 2022-09-08 18:23:35, uptime 1 day 00:53:44 active
server 11.0 running on node 11 ver 20.00.18, started 2022-09-08 18:23:45, uptime 1 day 00:53:34
server 12.0 running on node 12 ver 20.00.18, started 2022-09-08 18:23:41, uptime 1 day 00:53:38
server 13.0 running on node 13 ver 20.00.18, started 2022-09-08 18:23:35, uptime 1 day 00:53:44
server 14.0 running on node 14 ver 20.00.18, started 2022-09-08 18:23:39, uptime 1 day 00:53:40
server 11.1 running on node 11 ver 20.00.18, started 2022-09-08 18:23:45, uptime 1 day 00:53:34
server 12.1 running on node 12 ver 20.00.18, started 2022-09-08 18:23:44, uptime 1 day 00:53:35
server 13.1 running on node 13 ver 20.00.18, started 2022-09-08 18:23:37, uptime 1 day 00:53:42
server 14.1 running on node 14 ver 20.00.18, started 2022-09-08 18:23:39, uptime 1 day 00:53:40
client 11 running on node 11 ver 20.00.18, started 2022-09-08 18:23:33, uptime 1 day 00:53:46
client 12 running on node 12 ver 20.00.18, started 2022-09-08 18:23:34, uptime 1 day 00:53:45
client 13 running on node 13 ver 20.00.18, started 2022-09-08 18:23:32, uptime 1 day 00:53:47
client 14 running on node 14 ver 20.00.18, started 2022-09-08 18:23:32, uptime 1 day 00:53:47
client 15 running on node 15 ver 20.00.18, started 2020-01-09 10:46:17, uptime 08:31:02
bridge 11 running on node 11 ver 20.00.18, started 2022-09-08 18:23:34, uptime 1 day 00:53:45 active
bridge 12 running on node 12 ver 20.00.18, started 2022-09-08 18:23:34, uptime 1 day 00:53:45
cntrl 11 running on node 11 ver 20.00.18, started 2022-09-08 18:23:35, uptime 1 day 00:53:44
cntrl 12 running on node 12 ver 20.00.18, started 2022-09-08 18:23:34, uptime 1 day 00:53:45
cntrl 13 running on node 13 ver 20.00.18, started 2022-09-08 18:23:31, uptime 1 day 00:53:48
cntrl 14 running on node 14 ver 20.00.18, started 2022-09-08 18:23:31, uptime 1 day 00:53:48
iSCSI 12 running on node 13 ver 20.00.18, started 2022-09-08 18:23:32, uptime 1 day 00:53:47
iSCSI 13 running on node 13 ver 20.00.18, started 2022-09-08 18:23:32, uptime 1 day 00:53:47
12.8. Disk
12.8.1. Disk list main info
You can use the disk sub-menu to query and manage the available disks in the cluster.
To display all available disks in all server instances in the cluster:
# storpool disk list
disk | server | size | used | est.free | % | free entries | on-disk size | allocated objects | errors | flags
1101 | 11.1 | 893 GiB | 2.6 GiB | 857 GiB | 0 % | 3719946 | 664 KiB | 41000 / 930000 | 0 / 0 |
1102 | 11.1 | 446 GiB | 2.6 GiB | 424 GiB | 1 % | 1919946 | 664 KiB | 41000 / 480000 | 0 / 0 |
1103 | 11.0 | 893 GiB | 2.6 GiB | 857 GiB | 0 % | 3719948 | 660 KiB | 41000 / 930000 | 0 / 0 |
1104 | 11.0 | 446 GiB | 2.6 GiB | 424 GiB | 1 % | 1919946 | 664 KiB | 41000 / 480000 | 0 / 0 |
1105 | 11.0 | 446 GiB | 2.6 GiB | 424 GiB | 1 % | 1919947 | 664 KiB | 41000 / 480000 | 0 / 0 |
1111 | 11.1 | 930 GiB | 2.6 GiB | 893 GiB | 0 % | 974950 | 716 KiB | 41000 / 975000 | 0 / 0 |
1112 | 11.1 | 930 GiB | 2.6 GiB | 893 GiB | 0 % | 974949 | 736 KiB | 41000 / 975000 | 0 / 0 |
1113 | 11.1 | 930 GiB | 2.6 GiB | 893 GiB | 0 % | 974943 | 760 KiB | 41000 / 975000 | 0 / 0 |
1114 | 11.1 | 930 GiB | 2.6 GiB | 893 GiB | 0 % | 974937 | 844 KiB | 41000 / 975000 | 0 / 0 |
[snip]
1425 | 14.1 | 931 GiB | 2.6 GiB | 894 GiB | 0 % | 974980 | 20 MiB | 40100 / 975000 | 0 / 0 |
1426 | 14.1 | 931 GiB | 2.6 GiB | 894 GiB | 0 % | 974979 | 20 MiB | 40100 / 975000 | 0 / 0 |
----------------------------------------------------------------------------------------------------------------------------------------
47 | 8.0 | 30 TiB | 149 GiB | 29 TiB | 0 % | 53308967 | 932 MiB | 1844600 / 32430000 | 0 / 0 |
To mark a device as temporarily unavailable:
# storpool disk 1111 eject
OK
Ejecting a disk from the cluster will stop the data replication for this disk, but will keep the metadata about the placement groups in which it participated and the volume objects from these groups it contained.
Note
The command above will refuse to eject the disk if this operation
would lead to volumes or snapshots going into the down
state
(usually when the last up-to-date copy for some parts of a
volume/snapshot is on this disk).
This disk will be shown as missing in the storpool disk list
output, for
example:
# storpool disk list
disk | server | size | used | est.free | % | free entries | on-disk size | allocated objects | errors | flags
[snip]
1422 | 14.1 | - | - | - | - % | - | - | - / - | - / - |
[snip]
Attention
This operation leads to degraded redundancy for all volumes and snapshots that have data on the ejected disk.
Such a disk will not return to the cluster by itself, and has to be manually reinserted by removing its EJECTED flag with storpool_initdisk -r /dev/$path.
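For example, assuming the ejected disk is the partition /dev/sdl1 (the device path for a given disk ID can be seen in the output of storpool disk list info):
# storpool_initdisk -r /dev/sdl1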
12.8.2. Disk list additional information
To display additional information about the disks:
# storpool disk list info
disk | server | device | model | serial | description | flags |
1101 | 11.1 | 0000:04:00.0-p1 | SAMSUNG MZQLB960HAJR-00007 | S437NF0M500149 | | S |
1102 | 11.1 | /dev/sdj1 | Micron_M500DC_MTFDDAK480MBB | 14250C6368E5 | | S |
1103 | 11.0 | /dev/sdi1 | SAMSUNG_MZ7LH960HAJR-00005 | S45NNE0M229767 | | S |
1104 | 11.0 | /dev/sdd1 | Micron_M500DC_MTFDDAK480MBB | 14250C63689B | | S |
1105 | 11.0 | /dev/sdc1 | Micron_M500DC_MTFDDAK480MBB | 14250C6368EC | | S |
1111 | 11.1 | /dev/sdl1 | Hitachi_HUA722010CLA330 | JPW9K0N13243ZL | | W |
1112 | 11.1 | /dev/sda1 | Hitachi_HUA722010CLA330 | JPW9J0N13LJEEV | | W |
1113 | 11.1 | /dev/sdb1 | Hitachi_HUA722010CLA330 | JPW9J0N13N694V | | W |
1114 | 11.1 | /dev/sdm1 | Hitachi_HUA722010CLA330 | JPW9K0N132R7HL | | W |
[snip]
1425 | 14.1 | /dev/sdm1 | Hitachi_HDS721050CLA360 | JP1532FR1BY75C | | W |
1426 | 14.1 | /dev/sdh1 | Hitachi_HUA722010CLA330 | JPW9K0N13RS95L | | W, J |
To set additional information for some of the disks shown in the output of
storpool disk list info
:
# storpool disk 1111 description HBA2_port7
OK
# storpool disk 1104 description FAILING_SMART
OK
12.8.3. Disk list server internal information
To display internal statistics about each disk:
# storpool disk list internal
--------------------------------------------------------------------------------------------------------------------------------------------------------
| disk | server | aggregate scores | wbc pages | scrub bw | scrub ETA | last scrub completed |
--------------------------------------------------------------------------------------------------------------------------------------------------------
| 1101 | 11.1 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 18:23:07 |
| 1102 | 11.1 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 18:23:07 |
| 1103 | 11.0 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 18:23:08 |
| 1104 | 11.0 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 18:23:09 |
| 1105 | 11.1 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 18:23:10 |
| 1111 | 11.0 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 18:23:12 |
| 1112 | 11.1 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 18:23:15 |
| 1113 | 11.1 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 18:23:17 |
| 1114 | 11.1 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 18:23:13 |
[snip]
| 1425 | 14.1 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 18:23:15 |
| 1426 | 14.1 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 18:23:19 |
--------------------------------------------------------------------------------------------------------------------------------------------------------
The sections in this output are as follows:
aggregate scores
Internal values representing how much data is about to be defragmented on the particular drive. Usually between 0 and 1, on heavily loaded clusters the rightmost column might get into the hundreds or even thousands if some drives are severely loaded.
wbc pages
Internal statistics for each drive that has the write-back cache or journaling in StorPool enabled.
scrub bw
The scrubbing speed in MB/s.
scrub ETA
Approximate time/date when the scrubbing operation will complete for this drive.
last scrub completed
The last time/date when the drive was scrubbed.
Note
The default installation includes a cron job on the management nodes that starts a scrubbing job for one drive per node. You can increase the number of disks that are scrubbing in parallel per node (the example is for four drives) by running the following:
# . /usr/lib/storpool/storpool_confget.sh
# storpool_q -d '{"set":{"scrubbingDiskPerNode":"4"}}' KV/Set/conf
And you can see the number of drives that are scrubbing per node with:
# . /usr/lib/storpool/storpool_confget.sh
# storpool_q KV/Get/conf/scrubbingDiskPerNode | jq -re '.data.pairs.scrubbingDiskPerNode'
To configure a local or remote recovery override for a particular disk that is different from the overrides configured with mgmtConfig:
# storpool disk 1111 maxRecoveryRequestsOverride local 2
OK
# storpool disk 1111 maxRecoveryRequestsOverride remote 4
OK
To remove a configured override:
# storpool disk 1111 maxRecoveryRequestsOverride remote clear
OK
This will remove the override, and the default configured for the whole cluster at 12.22. Management configuration will take precedence.
12.8.4. Disk list performance information
To display performance-related in-server statistics for each disk, use:
# storpool disk list perf
| latencies | thresholds | times exceeded | flags
disk | disk (avg) | disk (max) | jrn (avg) | jrn (max) | disk | journal | disk | journal |
2301 | 0.299ms | 0.400ms | - | - | 0.000ms | - | 0 | - |
2302 | 0.304ms | 0.399ms | - | - | 0.000ms | - | 0 | - |
2303 | 0.316ms | 0.426ms | - | - | 0.000ms | - | 0 | - |
[snip]
2621 | 4.376ms | 4.376ms | 0.029ms | 0.029ms | 0.000ms | 0.000ms | 0 | 0 |
2622 | 4.333ms | 4.333ms | 0.025ms | 0.025ms | 0.000ms | 0.000ms | 0 | 0 |
Note
Global latency thresholds are configured through the mgmtConfig section.
To configure a single disk latency threshold override use:
# storpool disk 2301 latencyLimitOverride disk 500
OK
# storpool disk list perf
| latencies | thresholds | times exceeded | flags
disk | disk (avg) | disk (max) | jrn (avg) | jrn (max) | disk | journal | disk | journal |
2301 | 0.119ms | 0.650ms | - | - | 500.000ms | - | 0 | - | D
[snip]
The D flag means there is a disk latency override, visible in the thresholds section.
Similarly, to configure a single disk journal latency threshold override:
# storpool disk 2621 latencyLimitOverride journal 100
OK
# storpool disk list perf
| latencies | thresholds | times exceeded | flags
disk | disk (avg) | disk (max) | jrn (avg) | jrn (max) | disk | journal | disk | journal |
[snip]
2621 | 8.489ms | 13.704ms | 0.052ms | 0.669ms | 0.000ms | 100.000ms | 0 | 0 | J
The J flag means there is a disk journal latency override, visible in the thresholds section.
To override a single disk so that it no longer observes the global limit:
# storpool disk 2301 latencyLimitOverride disk unlimited
OK
This will show the disk threshold as unlimited:
# storpool disk list perf
| latencies | thresholds | times exceeded | flags
disk | disk (avg) | disk (max) | jrn (avg) | jrn (max) | disk | journal | disk | journal |
2301 | 0.166ms | 0.656ms | - | - | unlimited | - | 0 | - | D
To clear an override and leave the global limit:
# storpool disk 2601 latencyLimitOverride disk off
OK
# storpool disk 2621 latencyLimitOverride journal off
OK
If a disk was ejected due to excessive latency, the server keeps a log of the last 128 requests sent to the disk. To list them, use:
# storpool disk 2601 ejectLog
log creation time | time of first event
2022-03-31 17:50:16 | 2022-03-31 17:31:08 +761,692usec
req# | start | end | latency | addr | size | op
1 | +0us | +424us | 257us | 0x0000000253199000 | 128 KiB | DISK_OP_READ
[snip]
126 | +1147653582us | +1147653679us | 97us | 0x0000000268EBE000 | 12 KiB | DISK_OP_WRITE_FUA
127 | +1147654920us | +1147655192us | 272us | 0x0000000012961000 | 128 KiB | DISK_OP_WRITE_FUA
total | maxTotal | limit | generation | times exceeded (for this eject)
23335us | 100523us | 1280us | 15338 | 1
The same data is available if a disk journal was ejected after breaching its threshold:
# storpool disk 2621 ejectLog journal
[snip]
(the output is similar to that of the disk ejectLog above)
12.8.5. Ejecting disks and internal server tests
When a server controlling a disk notices issues with it (like a write error, or a stalled request above a predefined threshold), the disk is also marked as “test pending”. This happens when there are many transient errors, or when a disk drive (or its controller) stalls a request for more than a predefined threshold.
Note
See also Automatic drive tests.
An eject option is available for manually initiating such a test; it will flag the disk as requiring a test and will eject it. The server instance will then perform a quick set of non-intrusive read-write tests on this disk and will return it back into the cluster if all tests pass; for example:
# storpool disk 2331 eject test
OK
The tests usually take from a couple of seconds up to a minute. To check the results from the last test:
# storpool disk 2331 testInfo
times tested | test pending | read speed | write speed | read max latency | write max latency | failed
1 | no | 1.0 GiB/sec | 971 MiB/sec | 8 msec | 4 msec | no
If the disk was already marked for testing, the option “now” will skip the test on the next attempt to re-open the disk:
# storpool disk 2301 eject now
OK
Attention
Note that this is exactly the same as “eject”; the disk will have to be manually returned into the cluster.
To mark a disk as unavailable by first re-balancing all data out to the other disks in the cluster and only then eject it:
# storpool disk 1422 softEject
OK
Balancer auto mode currently OFF. Must be ON for soft-eject to complete.
Note
This option requires StorPool balancer to be started after the above was issued; for details, see 12.18. Balancer.
To remove a disk from the list of reported disks and all placement groups it participates in:
# storpool disk 1422 forget
OK
To get detailed information about a given disk:
# storpool disk 1101 info
agAllocated | agCount | agFree | agFreeing | agFull | agMaxSizeFull | agMaxSizePartial | agPartial
7 | 462 | 455 | 1 | 0 | 0 | 1 | 1
entriesAllocated | entriesCount | entriesFree | sectorsCount
50 | 1080000 | 1079950 | 501215232
objectsAllocated | objectsCount | objectsFree | objectStates
18 | 270000 | 269982 | ok:18
serverId | 1
id | objectsCount | onDiskSize | storedSize | objectStates
#bad_id | 1 | 0 B | 0 B | ok:1
#clusters | 1 | 8.0 KiB | 768 B | ok:1
#drive_state | 1 | 8.0 KiB | 4.0 B | ok:1
#drives | 1 | 100 KiB | 96 KiB | ok:1
#iscsi_config | 1 | 12 KiB | 8.0 KiB | ok:1
[snip]
To get detailed information about the objects on a particular disk:
# storpool disk 1101 list
object name | stored size | on-disk size | data version | object state | parent volume
#bad_id:0 | 0 B | 0 B | 1480:2485 | ok (1) |
#clusters:0 | 768 B | 8.0 KiB | 711:992 | ok (1) |
#drive_state:0 | 4.0 B | 8.0 KiB | 1475:2478 | ok (1) |
#drives:0 | 96 KiB | 100 KiB | 1480:2484 | ok (1) |
[snip]
test:4094 | 0 B | 0 B | 0:0 | ok (1) |
test:4095 | 0 B | 0 B | 0:0 | ok (1) |
----------------------------------------------------------------------------------------------------
4115 objects | 394 KiB | 636 KiB | | |
To get detailed information about the active requests that the disk is performing at the moment:
# storpool disk 1101 activeRequests
-----------------------------------------------------------------------------------------------------------------------------------
| request ID | request IDX | volume | address | size | op | time active |
-----------------------------------------------------------------------------------------------------------------------------------
| 9226469746279625682:285697101441249070 | 9 | testvolume | 85276782592 | 4.0 KiB | read | 0 msec |
| 9226469746279625682:282600876697431861 | 13 | testvolume | 96372936704 | 4.0 KiB | read | 0 msec |
| 9226469746279625682:278097277070061367 | 19 | testvolume | 46629707776 | 4.0 KiB | read | 0 msec |
| 9226469746279625682:278660227023482671 | 265 | testvolume | 56680042496 | 4.0 KiB | write | 0 msec |
-----------------------------------------------------------------------------------------------------------------------------------
To issue retrim operation on a disk (available for SSD disks only):
# storpool disk 1101 retrim
OK
To start, pause, or continue a scrubbing operation for a disk:
# storpool disk 1101 scrubbing start
OK
# storpool disk 1101 scrubbing pause
OK
# storpool disk 1101 scrubbing continue
OK
Note
Use storpool disk list internal to check the status of a running scrubbing operation, or to see when the last scrubbing operation completed for this disk.
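Scrubbing of several disks can also be started and monitored from the shell by combining the commands above. A minimal bash sketch, assuming disks 1101-1103 exist in the cluster:
# Start scrubbing on a few disks (illustrative IDs).
for d in 1101 1102 1103; do
    storpool disk "$d" scrubbing start
done
# Check progress in the "scrub bw" and "scrub ETA" columns.
storpool disk list internal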
12.9. Placement groups
The placement groups are predefined sets of disks over which volume objects will be replicated. It is possible to specify which individual disks to add to the group.
To display the defined placement groups in the cluster:
# storpool placementGroup list
name
default
hdd
ssd
To display details about a placement group:
# storpool placementGroup ssd list
type | id
disk | 1101 1201 1301 1401
Creating a new placement group or extending an existing one requires specifying its name and providing one or more disks to be added:
# storpool placementGroup ssd addDisk 1102
OK
# storpool placementGroup ssd addDisk 1202
OK
# storpool placementGroup ssd addDisk 1302 addDisk 1402
OK
# storpool placementGroup ssd list
type | id
disk | 1101 1102 1201 1202 1301 1302 1401 1402
To remove one or more disks from a placement group:
# storpool placementGroup ssd rmDisk 1402
OK
# storpool placementGroup ssd list
type | id
disk | 1101 1102 1201 1202 1301 1302 1401
To rename a placement group:
# storpool placementGroup ssd rename M500DC
OK
Unused placement groups can be removed. To avoid accidents, the name of the group must be entered twice:
# storpool placementGroup ssd delete ssd
OK
12.10. Volumes
The volumes are the basic service of the StorPool storage system. The basic features of a volume are as follows:
It always has a name and a certain size.
It can be read from and written to.
It can be attached to hosts as a read-only or read-write block device under the /dev/storpool directory.
It may have one or more tags created or changed using the name=value form.
The name of a volume is a string consisting of one or more of the allowed characters - upper and lower Latin letters (a-z, A-Z), numbers (0-9), and the delimiters dot (.), colon (:), dash (-), and underscore (_). The same rules apply for the keys and values used for the volume tags. The volume name including tags cannot exceed 200 bytes.
When creating a volume you must specify at minimum its name, the template or placement/replication details, and its size. Here is an example:
# storpool volume testvolume create size 100G template hybrid
12.10.1. Volume parameters
When performing volume operations you can use the following parameters:
placeAll
Place all objects in placementGroup (default value: default).
placeTail
Name of placementGroup for reader (default value: same as the placeAll value).
placeHead
Place the third replica in a different placementGroup (default value: same as the placeAll value).
template
Use a template with preconfigured placement, replication, and/or limits; for details, see 12.14. Templates. Usage of templates is strongly encouraged due to easier tracking and capacity management.
parent
Use a snapshot as a parent for this volume.
reuseServer
Place multiple copies on the same server.
baseOn
Use a parent volume; this will create a transient snapshot used as a parent. For details, see 12.11. Snapshots.
iops
Set the maximum IOPS limit for this volume (in IOPS).
bw
Set the maximum bandwidth limit (in MB/s).
tag
Set a tag for this volume in the form name=value.
create
Create the volume, fail if it exists. Mandatory when creating a new volume. Creating volumes without setting this option is deprecated.
update
Update the volume, fail if it does not exist. Mandatory for operations where a volume is modified; see the examples in 12.10.7. Managing volumes. Modifying volumes without setting this option is deprecated.
limitType
Specify whether the iops and bw limits ought to be for the total size of the block device or per each GiB (one of “total” or “perGiB”).
The create option is useful in scripts when you have to prevent an involuntary update of a volume:
# storpool volume test create template hybrid
OK
# storpool volume test create size 200G template hybrid
Error: Volume 'test' already exists
A statement with the update parameter will fail with an error if the volume does not exist:
# storpool volume test update template hybrid size +100G
OK
# storpool volume test1 update template hybrid
Error: volume 'test1' does not exist
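Several of the parameters from 12.10.1. Volume parameters can be combined in a single statement. The following is a sketch only - the template, tag, and limit values are illustrative and assume a template named hybrid exists:
# storpool volume db1 create size 200G template hybrid tag purpose=database iops 5000 bw 200
# storpool volume db1 update bw 400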
12.10.2. Listing all volumes
To list all available volumes:
# storpool volume list
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| volume | size | rdnd. | placeHead | placeAll | placeTail | iops | bw | parent | template | flags | tags |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| testvolume | 100 GiB | 3 | ultrastar | ultrastar | ssd | - | - | testvolume@35691 | hybrid | | name=value |
| testvolume_8_2 | 100 GiB | 8+2 | nvme | nvme | nvme | - | - | testvolume_8_2@35693 | nvme | | name=value |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Flags:
R - allow placing two disks within a replication chain onto the same server
t - volume move target. Waiting for the move to finish
G - IOPS and bandwidth limits are per GiB and depend on volume/snapshot size
12.10.3. Listing exported volumes
To list volumes exported to other sub-clusters in the multi-cluster:
# storpool volume list exports
---------------------------------
| remote | volume | globalId |
---------------------------------
| Lab-D-cl2 | test | d.n.buy |
---------------------------------
To list volumes exported in other sub-clusters to this one in a multi-cluster setup:
# storpool volume list remote
--------------------------------------------------------------------------
| location | remoteId | name | size | creationTimestamp | tags |
--------------------------------------------------------------------------
| Lab-D | d.n.buy | test | 137438953472 | 2020-05-27 11:57:38 | |
--------------------------------------------------------------------------
Note
Once attached, a remotely exported volume will no longer be visible with volume list remote, even if the export is still visible in the remote cluster with volume list exports. Every export invocation in the local cluster will be used up for every attach in the remote cluster.
12.10.4. Volume status
To get an overview of all volumes and snapshots and their state in the system:
# storpool volume status
----------------------------------------------------------------------------------------------------------------------------------------------------
| volume | size | rdnd. | tags | alloc % | stored | on disk | syncing | missing | status | flags | drives down |
----------------------------------------------------------------------------------------------------------------------------------------------------
| testvolume | 100 GiB | 3 | name=value | 0.0 % | 0 B | 0 B | 0 B | 0 B | up | | |
| testvolume@35691 | 100 GiB | 3 | | 100.0 % | 100 GiB | 317 GiB | 0 B | 0 B | up | S | |
----------------------------------------------------------------------------------------------------------------------------------------------------
| 2 volumes | 200 GiB | | | 50.0 % | 100 GiB | 317 GiB | 0 B | 0 B | | | |
----------------------------------------------------------------------------------------------------------------------------------------------------
Flags:
S - snapshot
B - balancer blocked on this volume
D - decreased redundancy (degraded)
M - migrating data to a new disk
R - allow placing two disks within a replication chain onto the same server
t - volume move target. Waiting for the move to finish
C - disk placement constraints violated, rebalance needed
The columns in this output are:
volume - name of the volume or snapshot (see flags below)
size - provisioned volume size, the visible size inside a VM for example
rdnd. - number of copies for this volume or its erasure coding scheme
tags - all custom key=value tags configured for this volume or snapshot
alloc % - how much space is used on this volume, in percent
stored - space allocated on this volume
on disk - the size allocated on all drives in the cluster after replication and the overhead from data protection
syncing - how much data is not in sync after a drive or server was missing; the data is recovered automatically once the missing drive or server is back in the cluster
missing - how much data is not available for this volume when the volume is with status down, see status below
status - the status of the volume, which could be one of:
  up - all copies are available
  down - none of the copies are available for some parts of the volume
  up soon - all copies are available and the volume will soon get up
flags - flags denoting features of this volume:
  S - stands for snapshot, which is essentially a read-only (frozen) volume
  B - denotes that the balancer is blocked for this volume (usually when some of the drives are missing)
  D - displayed when some of the copies are either not available or outdated and the volume is with decreased redundancy
  M - displayed when changing the replication or when a cluster re-balance is in progress
  R - displayed when the policy for keeping copies on different servers is overridden
  C - displayed when the volume or snapshot placement constraints are violated
drives down - displayed when the volume is in down state, listing the drives required to get the volume back up
Sizes are shown in B, KiB, MiB, GiB, TiB, or PiB.
To get just the status data from the storpool_controller services in the cluster, without any info for stored, on disk size, etc.:
# storpool volume quickStatus
----------------------------------------------------------------------------------------------------------------------------------------------------
| volume | size | rdnd. | tags | alloc % | stored | on disk | syncing | missing | status | flags | drives down |
----------------------------------------------------------------------------------------------------------------------------------------------------
| testvolume | 100 GiB | 3 | name=value | 0.0 % | 0 B | 0 B | 0 B | 0 B | up | | |
| testvolume@35691 | 100 GiB | 3 | | 0.0 % | 0 B | 0 B | 0 B | 0 B | up | S | |
----------------------------------------------------------------------------------------------------------------------------------------------------
| 2 volumes | 200 GiB | | | 0.0 % | 0 B | 0 B | 0 B | 0 B | | | |
----------------------------------------------------------------------------------------------------------------------------------------------------
Note
The quickStatus has less of an impact on the storpool_server services, and thus on end-user operations, because the gathered data does not include the per-volume detailed storage stats provided with status.
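Because quickStatus is the cheaper call, it is a reasonable basis for simple monitoring. Below is a rough bash sketch that prints every volume or snapshot whose status is not up; the awk field positions are assumptions based on the sample table above and may need adjusting:
# List volumes/snapshots that are not in the "up" state.
storpool volume quickStatus | awk -F'|' '
    NF > 10 {
        name = $2; status = $11
        gsub(/^ +| +$/, "", name); gsub(/^ +| +$/, "", status)
        if (status != "" && status != "status" && status != "up") print name, status
    }'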
12.10.5. Used space estimation
To check the estimated used space by the volumes in the system:
# storpool volume usedSpace
-----------------------------------------------------------------------------------------
| volume | size | rdnd. | stored | used | missing info |
-----------------------------------------------------------------------------------------
| testvolume | 100 GiB | 3 | 1.9 GiB | 100 GiB | 0 B |
-----------------------------------------------------------------------------------------
The columns are as follows:
volume - name of the volume
size - the provisioned size of this volume
rdnd. - number of copies for this volume or its erasure coding scheme
stored - how much data is stored for this volume (without referring to any parent snapshots)
used - how much data has been written (including the data written in parent snapshots)
missing info - if this value is anything other than 0 B, probably some of the storpool_controller services in the cluster are not running correctly
Note
The used column shows how much data is accessible and reserved for this volume.
12.10.6. Listing disk sets and objects
To list the target disk sets and objects of a volume:
# storpool volume testvolume list
volume testvolume
size 100 GiB
replication 3
placeHead hdd
placeAll hdd
placeTail ssd
target disk sets:
0: 1122 1323 1203
1: 1424 1222 1301
2: 1121 1324 1201
[snip]
object: disks
0: 1122 1323 1203
1: 1424 1222 1301
2: 1121 1324 1201
[snip]
Hint
In this example, the volume has a hybrid placement with two copies on HDDs and one copy on SSDs (the rightmost disk sets column). The target disk sets are lists of triplets of drives in the cluster used as a template for the actual objects of the volume.
To get detailed info about the disks used for this volume and the number of objects on each of them:
# storpool volume testvolume info
diskId | count
1101 | 200
1102 | 200
1103 | 200
[snip]
chain | count
1121-1222-1404 | 25
1121-1226-1303 | 25
1121-1226-1403 | 25
[snip]
diskSet | count
218-313-402 | 3
218-317-406 | 3
219-315-402 | 3
Note
The order of the diskSet is not by placeHead, placeAll, placeTail; check the actual order in the storpool volume <volumename> list output. The reason is to count similar diskSets with a different order in the same slot, i.e. [101, 201, 301] is accounted as the same diskSet as [201, 101, 301].
12.10.7. Managing volumes
To rename a volume:
# storpool volume testvolume update rename newvolume
OK
Attention
Changing the name of a volume will not wait for clients that have this volume attached to update the name of the symlink. Always use client sync for all clients with the volume attached.
To add a tag for a volume:
# storpool volume testvolume update tag name=value
To change a tag for a volume:
# storpool volume testvolume update tag name=newvalue
To remove a tag just set it to an empty value:
# storpool volume testvolume update tag name=
To resize a volume up:
# storpool volume testvolume update size +1G
OK
To shrink a volume (resize down):
# storpool volume testvolume update size 50G shrinkOk
Attention
Shrinking a StorPool volume changes the size of the block device, but does not adjust the size of an LVM volume or filesystem contained in the volume. Failing to adjust the size of the filesystem or LVM prior to shrinking the StorPool volume would result in data loss.
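As an illustration of the order of operations, here is a hedged sketch of shrinking a volume that holds an ext4 filesystem directly (no partition table or LVM), assuming it is attached locally as /dev/storpool/testvolume and is unmounted; the 50G size is illustrative and must be chosen so the filesystem still fits:
# e2fsck -f /dev/storpool/testvolume
# resize2fs /dev/storpool/testvolume 50G
# storpool volume testvolume update size 50G shrinkOk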
To delete a volume:
# storpool volume vol1 delete vol1
Note
To avoid accidents, the volume name must be entered twice. Attached volumes cannot be deleted even when not used as a safety precaution. For details, see 12.12. Attachments.
A volume based on a snapshot can be converted to a stand-alone volume. For example, the testvolume below is based on an anonymous snapshot:
# storpool_tree
StorPool
`-testvolume@37126
`-testvolume
To rebase it against root (known also as “promote”):
# storpool volume testvolume rebase
OK
# storpool_tree
StorPool
`- testvolume@255 [snapshot]
`- testvolume [volume]
The rebase operation can also target a particular snapshot from the chain of parent snapshots of this volume:
# storpool_tree
StorPool
`- testvolume-snap1 [snapshot]
`- testvolume-snap2 [snapshot]
`- testvolume-snap3 [snapshot]
`- testvolume [volume]
# storpool volume testvolume rebase testvolume-snap2
OK
After the operation the volume is directly based on testvolume-snap2 and includes all changes from testvolume-snap3:
# storpool_tree
StorPool
`- testvolume-snap1 [snapshot]
`- testvolume-snap2 [snapshot]
|- testvolume [volume]
`- testvolume-snap3 [snapshot]
To back up a volume named testvolume to a configured remote location LocationA-CityB:
# storpool volume testvolume backup LocationA-CityB
OK
After this operation a temporary snapshot will be created and transferred to the LocationA-CityB location. After the transfer completes, the local temporary snapshot will be deleted and the remote snapshot will be visible as exported from LocationA-CityB. For more information on working with snapshot exports, see 12.11.6. Remote snapshots.
When backing up a volume, the remote snapshot may have one or more tags applied; for example:
# storpool volume testvolume backup LocationA-CityB tag key=value # [tag key2=value2]
OK
To move a volume to a different cluster in a multicluster environment (see 12.2. Cluster):
# storpool volume testvolume moveToRemote Lab-D-cl2 # onAttached export
Note
Moving a volume to a remote cluster will fail if the volume is attached on a local host. You can specify what to do in such a case with the onAttached parameter, as in the comment in the example above. More info on volume move is available in 17.13. Volume and snapshot move.
12.11. Snapshots
Snapshots are read-only point-in-time images of volumes. They are created once and cannot be changed. They can be attached to hosts as read-only block devices under /dev/storpool. Volumes and snapshots share the same name-space, thus their names are unique within a StorPool cluster. Volumes can be based on snapshots. Such volumes contain only the changes since the snapshot was taken. After a volume is created from a snapshot, writes will be recorded within the volume. Reads from the volume may be served by the volume itself or by its parent snapshot, depending on whether the volume contains changed data for the read request. For more information, see 15. Volumes and snapshots.
12.11.1. Creating snapshots
To create an unnamed (known also as anonymous) snapshot of a volume:
# storpool volume testvolume snapshot
OK
This will create a snapshot named testvolume@<ID>, where ID is a unique serial number. Note that any tags on the volume will not be propagated to the snapshot; to set tags on the snapshot at creation time:
# storpool volume testvolume tag key=value snapshot
To create a named snapshot of a volume:
# storpool volume testvolume snapshot testsnap
OK
To directly set tags:
# storpool volume testvolume snapshot testsnapplustags tag key=value
To create a bound snapshot on a volume:
# storpool volume testvolume bound snapshot
OK
This snapshot will be automatically deleted when the last child volume created from it is deleted. Useful for non-persistent images.
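Named snapshots combine naturally with deferred deletion (see 12.11.4. Deleting snapshots) for simple periodic protection. A cron-style bash sketch, assuming delayed snapshot delete is enabled in the cluster; the volume name and the 7-day retention are illustrative:
# Create a date-stamped snapshot and let the cluster expire it after 7 days.
snap="testvolume-$(date +%Y%m%d-%H%M)"
storpool volume testvolume snapshot "$snap"
storpool snapshot "$snap" deleteAfter 7d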
12.11.2. Listing snapshots
To list the snapshots:
# storpool snapshot list
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| snapshot | size | rdnd. | placeHead | placeAll | placeTail | created on | volume | iops | bw | parent | template | flags | targetDeleteDate | tags |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| testsnap | 100 GiB | 3 | hdd | hdd | ssd | 2019-08-30 04:11:23 | testvolume | - | - | testvolume@1430 | hybrid-r3 | | - | key=value |
| testvolume@1430 | 100 GiB | 3 | hdd | hdd | ssd | 2019-08-30 03:56:58 | testvolume | - | - | | hybrid-r3 | A | - | |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Flags:
A - anonymous snapshot with auto-generated name
B - bound snapshot
D - snapshot currently in the process of deletion
T - transient snapshot (created during volume cloning)
R - allow placing two disks within a replication chain onto the same server
P - snapshot delete blocked due to multiple children
To list the snapshots only for a particular volume:
# storpool volume testvolume list snapshots
[snip]
To list the target disk sets and objects of a snapshot:
# storpool snapshot testsnap list
[snip]
The output is similar to that of storpool volume <volumename> list; for details, see 12.10.6. Listing disk sets and objects.
To get detailed info about the disks used for this snapshot and the number of objects on each of them:
# storpool snapshot testsnap info
[snip]
The output is similar to that of storpool volume <volumename> info.
12.11.3. Volume operations
To create a volume based on an existing snapshot (cloning):
# storpool volume testvolume parent centos73-base-snap
OK
To revert a volume to an existing snapshot:
# storpool volume testvolume revertToSnapshot centos73-working
OK
This is also possible through the use of templates with a parent snapshot (see 12.14. Templates):
# storpool volume spd template centos73-base
OK
Create a volume based on another existing volume (cloning):
# storpool volume testvolume1 baseOn testvolume
OK
Note
This operation will first create an anonymous bound snapshot on testvolume and will then create testvolume1 with the bound snapshot as parent. The snapshot will exist until both volumes are deleted and will be automatically deleted afterwards.
12.11.4. Deleting snapshots
To delete a snapshot:
# storpool snapshot spdb_snap1 delete spdb_snap1
OK
Note
To avoid accidents, the name of the snapshot must be entered twice.
Sometimes the system would not delete the snapshot immediately; during this period of time, it would be visible with * in the output of storpool volume status or storpool snapshot list.
To set a snapshot for deferred deletion:
# storpool snapshot testsnap deleteAfter 1d
OK
The above will set a target delete date for this snapshot in exactly one day from the present time.
Note
The snapshot will be deleted at the desired point in time only if delayed snapshot delete was enabled in the local cluster; see 12.22. Management configuration for details.
A snapshot can also be bound to its child volumes; it will then exist until all child volumes are deleted:
# storpool snapshot testsnap bind
OK
The opposite operation is also possible - to unbind such a snapshot:
# storpool snapshot testsnap unbind
OK
To get the space that will be freed if a snapshot is deleted:
# storpool snapshot space
----------------------------------------------------------------------------------------------------------------
| snapshot | on volume | size | rdnd. | stored | used | missing info |
----------------------------------------------------------------------------------------------------------------
| testsnap | testvolume | 100 GiB | 3 | 27 GiB | -135 GiB | 0 B |
| testvolume@3794 | testvolume | 100 GiB | 3 | 27 GiB | 1.9 GiB | 0 B |
| testvolume@3897 | testvolume | 100 GiB | 3 | 507 MiB | 432 KiB | 0 B |
| testvolume@3899 | testvolume | 100 GiB | 3 | 334 MiB | 224 KiB | 0 B |
| testvolume@4332 | testvolume | 100 GiB | 3 | 73 MiB | 36 KiB | 0 B |
| testvolume@4333 | testvolume | 100 GiB | 3 | 45 MiB | 40 KiB | 0 B |
| testvolume@4334 | testvolume | 100 GiB | 3 | 59 MiB | 16 KiB | 0 B |
| frozenvolume | - | 8 GiB | 2 | 80 MiB | 80 MiB | 0 B |
----------------------------------------------------------------------------------------------------------------
Used mainly for accounting purposes. The columns are as follows:
snapshot
Name of the snapshot.
on volume
The name of the volume child for this snapshot if any. For example, a frozen volume would have this field empty.
size
The size of the snapshot as provisioned.
rdnd.
Number of copies for this volume or its erasure coding scheme.
stored
How much data is actually written.
used
Stands for the amount of data that would be freed from the underlying drives (before redundancy) if the snapshot is removed.
missing info
If this value is anything other than 0 B, probably some of the storpool_controller services in the cluster are not running correctly.
The used column could be negative in some cases when the snapshot has more than one child volume. In these cases deleting the snapshot would “free” negative space, i.e. it will end up taking more space on the underlying disks.
12.11.5. Snapshot parameters
Similar to volumes, snapshots can have different placement groups or other parameters. You can use the following parameters:
placeAll
Place all objects in placementGroup; default value: default.
placeTail
Name of placementGroup for reader; default value: same as the value of placeAll.
placeHead
Place the third replica in a different placementGroup; default value: same as the value of placeAll.
reuseServer
Place multiple copies on the same server.
tag
Set a tag in the form key=value.
template
Use a template with preconfigured placement, replication, and/or limits (check 12.14. Templates for details).
iops
Set the maximum IOPS limit for this snapshot (in IOPS).
bw
Set the maximum bandwidth limit (in MB/s).
limitType
Specify whether the iops and bw limits ought to be for the total size of the block device, or per each GiB (one of “total” or “perGiB”).
Note
The bandwidth and IOPS limits concern only the particular snapshot if it is attached, and do not limit any child volumes using this snapshot as a parent.
Here are two examples - one for setting a template, and one for removing a tag on a snapshot:
# storpool snapshot testsnap template all-ssd
OK
# storpool snapshot testsnapplustags tag key=
Similar to the same operation with volumes, a snapshot can be renamed with:
# storpool snapshot testsnap rename ubuntu1604-base
OK
Attention
Changing the name of a snapshot will not wait for clients that have this snapshot attached to update the name of the symlink. Always use client sync for all clients with the snapshot attached.
A snapshot could also be rebased to root (promoted) or rebased to another parent snapshot in a chain:
# storpool snapshot testsnap rebase # [parent-snapshot-name]
OK
12.11.6. Remote snapshots
In case multi-site or multicluster is enabled (the cluster has a storpool_bridge service running), a snapshot could be exported and become visible to other configured clusters.
For example, to export a snapshot snap1 to a location named StorPool-Rome:
# storpool snapshot snap1 export StorPool-Rome
OK
To list the presently exported snapshots:
# storpool snapshot list exports
-------------------------------------------------------------------------------
| remote | snapshot | globalId | backingUp | volumeMove |
-------------------------------------------------------------------------------
| StorPool-Rome | snap1 | nzkr.b.cuj | false | false |
-------------------------------------------------------------------------------
To list the snapshots exported from remote sites:
# storpool snapshot list remote
------------------------------------------------------------------------------------------
| location | remoteId | name | onVolume | size | creationTimestamp | tags |
------------------------------------------------------------------------------------------
| s02 | a.o.cxz | snapshot1 | | 107374182400 | 2019-08-20 03:21:42 | |
------------------------------------------------------------------------------------------
A single snapshot can be exported to multiple configured locations.
To create a clone of a remote snapshot locally:
# storpool snapshot snapshot1-copy template hybrid-r3 remote s02 a.o.cxz # [tag key=value]
In this example, the remote location is s02 and the remoteId is a.o.cxz. Any key=value pair tags may be configured at creation time.
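A typical restore flow combines the listings and the clone command above: find the remoteId with snapshot list remote, clone it locally, then create a volume from the clone. A sketch using the illustrative names from this section (restoredvolume is hypothetical):
# storpool snapshot list remote
# storpool snapshot snapshot1-copy template hybrid-r3 remote s02 a.o.cxz
# storpool volume restoredvolume create parent snapshot1-copy template hybrid-r3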
To unexport a local snapshot:
# storpool snapshot snap1 unexport StorPool-Rome
OK
Instead of a specific remote location, you can use the all keyword. The system will attempt to unexport the snapshot from all locations it was previously exported to.
Note
If the snapshot is presently being transferred, the unexport operation will fail. It can be forced by adding force to the end of the unexport command; however, this is discouraged in favor of waiting for any active transfer to complete.
To unexport a remote snapshot:
# storpool snapshot remote s02 a.o.cxz unexport
OK
The snapshot will no longer be visible with storpool snapshot list remote.
To unexport a remote snapshot and also set for deferred deletion in the remote site:
# storpool snapshot remote s02 a.o.cxz unexport deleteAfter 1h
OK
This will attempt to set a target delete date for a.o.cxz in the remote site in exactly one hour from the present time for this snapshot. If the minimumDeleteDelay flag (see 12.3.2. Minimum deletion delay) in the remote site has a higher value (for example, 1 day), the selected value will be overwritten with the minimumDeleteDelay - in this example 1 day. For more information on deferred deletion, see 17.12. Remote deferred deletion.
To move a snapshot to a different cluster in a multi-cluster environment (see 12.2. Cluster):
# storpool snapshot snap1 moveToRemote Lab-D-cl2
Note
Moving a snapshot to a remote cluster is forbidden for attached snapshots. For more information on snapshot moving, see 17.13. Volume and snapshot move.
12.12. Attachments
Attaching a volume or snapshot makes it accessible to a client under the /dev/storpool and /dev/storpool-byid directories. Volumes can be attached as read-only or read-write. Snapshots are always attached read-only.
Here is an example for attaching a volume testvolume to a client with ID 1. This creates the block device /dev/storpool/testvolume:
# storpool attach volume testvolume client 1
OK
To attach a volume or snapshot to the node you are currently connected to:
# storpool attach volume testvolume here
OK
# storpool attach snapshot testsnap here
OK
By default, this command will block until the volume is attached to the client and the /dev/storpool/<volumename> symlink is created. For example, if the storpool_block service has not been started the command will wait indefinitely. To set a timeout for this operation:
# storpool attach volume testvolume here timeout 10 # seconds
OK
To completely disregard the readiness check:
# storpool attach volume testvolume here noWait
OK
Note
The use of noWait is discouraged in favor of the default behaviour of the attach command.
Attaching a volume will create a read-write block device attachment by default. To attach it read-only:
# storpool volume testvolume2 attach client 12 mode ro
OK
To list all attachments:
# storpool attach list
-------------------------------------------------------------------
| client | volume | globalId | mode | tags |
-------------------------------------------------------------------
| 11 | testvolume | d.n.a1z | RW | vc-policy=no |
| 12 | testvolume1 | d.n.c2p | RW | vc-policy=no |
| 12 | testvolume2 | d.n.uwp | RO | vc-policy=no |
| 14 | testsnap | d.n.s1m | RO | vc-policy=no |
-------------------------------------------------------------------
To detach:
# storpool detach volume testvolume client 1 # or 'here' if the command is being executed on client ID 1
If a volume is actively being written or read from, a detach operation will fail:
# storpool detach volume testvolume client 11
Error: 'testvolume' is open at client 11
In this case the detach could be forced; beware that forcing a detachment is discouraged:
# storpool detach volume testvolume client 11 force yes
OK
Attention
Any operations on the volume will receive an I/O error when it is forcefully detached. Some mounted filesystems can cause a kernel panic when a block device disappears while there are live operations, so be extra careful if such filesystems are mounted directly on a hypervisor node.
If a volume or snapshot is attached to more than one client it could be detached from all nodes with a single command:
# storpool detach volume testvolume all
OK
# storpool detach snapshot testsnap all
OK
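A common single-node maintenance workflow wraps the attach and detach operations above around a filesystem inspection. A sketch, assuming the volume holds an ext4 filesystem and the commands are run on the node itself:
# storpool attach volume testvolume here timeout 10
# mount /dev/storpool/testvolume /mnt
# ls /mnt
# umount /mnt
# storpool detach volume testvolume here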
12.13. Client
To check the status of the active storpool_block services (see 9.3. storpool_block) in the cluster:
# storpool client status
-----------------------------------
| client | status |
-----------------------------------
| 11 | ok |
| 12 | ok |
| 13 | ok |
| 14 | ok |
-----------------------------------
To wait until a client is updated:
# storpool client 13 sync
OK
This is a way to ensure a volume with changed size is visible with its new size to any clients it is attached to.
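For example, after growing a volume you can sync every client it is attached to, so that the new size becomes visible everywhere. A rough bash sketch; the awk field positions are assumptions based on the attach list sample output in 12.12. Attachments:
# Grow the volume, then sync all clients it is attached to.
storpool volume testvolume update size +10G
for c in $(storpool attach list | awk -F'|' '
        { v = $3; gsub(/ /, "", v)
          if (v == "testvolume") { gsub(/ /, "", $2); print $2 } }'); do
    storpool client "$c" sync
done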
To show detailed information about the active requests on a particular client at this moment:
# storpool client 13 activeRequests
------------------------------------------------------------------------------------------------------------------------------------
| request ID | request IDX | volume | address | size | op | time active |
------------------------------------------------------------------------------------------------------------------------------------
| 9224499360847016133:3181950 | 1044 | testvolume | 10562306048 | 128 KiB | write | 65 msec |
| 9224499360847016133:3188784 | 1033 | testvolume | 10562437120 | 32 KiB | read | 63 msec |
| 9224499360847016133:3188977 | 1029 | testvolume | 10562568192 | 128 KiB | read | 21 msec |
| 9224499360847016133:3189104 | 1026 | testvolume | 10596122624 | 128 KiB | read | 3 msec |
| 9224499360847016133:3189114 | 1035 | testvolume | 10563092480 | 128 KiB | read | 2 msec |
| 9224499360847016133:3189396 | 1048 | testvolume | 10629808128 | 128 KiB | read | 1 msec |
------------------------------------------------------------------------------------------------------------------------------------
12.14. Templates
Templates enable easy and consistent setup and usage tracking for a large number of volumes and their snapshots with common attributes, for example replication, placement groups, and/or a common parent snapshot.
12.14.1. Creating
To create a template:
# storpool template nvme replication 3 placeAll nvme
OK
# storpool template magnetic replication 3 placeAll hdd
OK
# storpool template hybrid replication 3 placeAll hdd placeTail ssd
OK
# storpool template ssd-hybrid replication 3 placeAll ssd placeHead hdd
OK
12.14.2. Listing
To list all created templates:
# storpool template list
-------------------------------------------------------------------------------------------------------------------------------------
| template | size | rdnd. | placeHead | placeAll | placeTail | iops | bw | parent | flags |
-------------------------------------------------------------------------------------------------------------------------------------
| nvme     | - | 3 | nvme | nvme | nvme | - | - | | |
| magnetic | - | 3 | hdd | hdd | hdd | - | - | | |
| hybrid | - | 3 | hdd | hdd | ssd | - | - | | |
| ssd-hybrid | - | 3 | hdd | ssd | ssd | - | - | | |
-------------------------------------------------------------------------------------------------------------------------------------
Please refer to 14. Redundancy for more info on replication and erasure coding schemes (shown in rdnd. above).
12.14.3. Getting status
To get the status of a template with detailed info on the usage and the available space left with this placement:
# storpool template status
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| template | place head | place all | place tail | rdnd. | volumes | snapshots/removing | size | capacity | avail. | avail. all | avail. tail | avail. head | flags |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| magnetic | hdd | hdd | hdd | 3 | 115 | 631/0 | 28 TiB | 80 TiB | 52 TiB | 240 TiB | 240 TiB | 240 TiB | |
| hybrid | hdd | ssd | hdd | 3 | 208 | 347/9 | 17 TiB | 72 TiB | 55 TiB | 240 TiB | 72 TiB | 240 TiB | |
| ssd-hybrid | ssd | ssd | hdd | 3 | 40 | 7/0 | 4 TiB | 36 TiB | 36 TiB | 240 TiB | 72 TiB | 240 TiB | |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
12.14.4. Changing parameters
To change a template’s parameters directly:
# storpool template hdd-only size 120G propagate no
OK
# storpool template hybrid size 40G iops 4000 propagate no
OK
Parameters that can be set:
bw
Set the maximum bandwidth limit (in MB/s).
iops
Set the maximum IOPS limit (in IOPS).
limitType
Specify whether the iops and bw limits ought to be for the total size of the block device or per each GiB (one of “total” or “perGiB”).
parent
Set a parent snapshot for all volumes created with this template.
placeAll
Place all objects in placementGroup; default: default.
placeHead
Place the third replica in a different placementGroup; default: same as the value of placeAll.
placeTail
Name of placementGroup for reader; default: same as the value of placeAll.
propagate
Required when changing parameters on an already created template. Used for specifying whether the changes should be applied to all existing volumes and/or snapshots created with this template. This parameter is required regardless of whether any volumes or snapshots have been created with the template. The values you can use are yes and no.
replication
Change the number of copies for volumes or snapshots created with this template.
reuseServer
Place multiple copies on the same server.
size
Default size if not specified for each volume created with this template.
Here is an example of how to change the bandwidth limit for all volumes and snapshots created with the already existing template magnetic:
# storpool template magnetic bw 100MB propagate yes
OK
When using the storpool template $TEMPLATE propagate yes command (as in the example above), all the parameters of $TEMPLATE will be re-applied to all volumes and snapshots created with it.
Note
Changing template parameters with the propagate option will not automatically re-allocate the content of the existing volumes on the disks. If replication or placement groups are changed, run the balancer to apply the new settings to the existing volumes. However, if the changes are made directly to the volume instead of to the template, running the balancer will not be required.
Attention
Dropping the replication (for example, from triple to dual) of a large number of volumes is an almost instant operation. However, returning them back to triple is similar to creating the third copy for the first time. This is why changing replication to less than the present (for example, from 3 to 2) will require using replicationReduce as a safety measure.
12.14.5. Renaming
To rename a template:
# storpool template magnetic rename backup
OK
12.14.6. Deleting
To delete a template:
# storpool template hdd-only delete hdd-only
OK
Note
The delete operation might fail if there are volumes/snapshots that are created with this template.
12.15. iSCSI
The StorPool iSCSI support is documented more extensively in the 16. Setting iSCSI targets section; these are the commands used to configure it and view the configuration.
To set the cluster’s iSCSI base IQN to iqn.2019-08.com.example:examplename:
# storpool iscsi config setBaseName iqn.2019-08.com.example:examplename
OK
12.15.1. Creating a portal group
To create a portal group examplepg used to group exported volumes for access by initiators using 192.168.42.247/24 (CIDR notation) as the portal IP address:
# storpool iscsi config portalGroup examplepg create addNet 192.168.42.247/24 vlan 42
OK
To create a portal for the initiators to connect to (for example, portal IP address 192.168.42.202 and StorPool’s SP_OURID 5):
# storpool iscsi config portal create portalGroup examplepg address 192.168.42.202 controller 5
OK
Note
This address will be handled by the storpool_iscsi process directly and will not be visible on the node with normal instruments like ip or ifconfig; see 12.15.7. Using iscsi_tool for these purposes.
12.15.2. Registering an initiator
To define the iqn.2019-08.com.example:abcdefgh initiator that is allowed to connect from the 192.168.42.0/24 network (without authentication):
# storpool iscsi config initiator iqn.2019-08.com.example:abcdefgh create net 192.168.42.0/24
OK
To define the iqn.2019-08.com.example:client initiator that is allowed to connect from the 192.168.42.0/24 network and must authenticate using the standard iSCSI password-based challenge-response authentication method using the user username and the secret password:
# storpool iscsi config initiator iqn.2019-08.com.example:client create net 192.168.42.0/24 chap user secret
OK
12.15.3. Exporting a volume
To specify that the existing StorPool volume tinyvolume should be exported to one or more initiators:
# storpool iscsi config target create tinyvolume
OK
Note
Please note that changing the volume name after creating a target will not change the target name. Re-creating (unexport/re-export) the target will use the new volume name.
To actually export the StorPool volume tinyvolume to the iqn.2019-08.com.example:abcdefgh initiator via the examplepg portal group (the StorPool iSCSI service will automatically pick a portal to export the volume through):
# storpool iscsi config export initiator iqn.2019-08.com.example:abcdefgh portalGroup examplepg volume tinyvolume
OK
Note
The volume will be visible to the initiator as IQN <BaseName>:<volume>
Using the same command without specifying an initiator will export the target to all registered initiators and will be visible as the * initiator:
# storpool iscsi config export portalGroup examplepg volume tinyvolume
OK
# storpool iscsi initiator list exports
-----------------------------------------------------------------------------------------------------------------
| name | volume | currentControllerId | portalGroup | initiator |
-----------------------------------------------------------------------------------------------------------------
| iqn.2019-08.com.example:examplename:tinyvolume | tinyvolume | 23 | examplepg | * |
-----------------------------------------------------------------------------------------------------------------
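The steps from 12.15.1 through 12.15.3 are usually executed together when bringing up a new export. Here they are collected into one sequence, reusing the example names and addresses from this section (all values are illustrative):
# storpool iscsi config setBaseName iqn.2019-08.com.example:examplename
# storpool iscsi config portalGroup examplepg create addNet 192.168.42.247/24 vlan 42
# storpool iscsi config portal create portalGroup examplepg address 192.168.42.202 controller 5
# storpool iscsi config initiator iqn.2019-08.com.example:abcdefgh create net 192.168.42.0/24
# storpool iscsi config target create tinyvolume
# storpool iscsi config export initiator iqn.2019-08.com.example:abcdefgh portalGroup examplepg volume tinyvolume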
12.15.4. Getting iSCSI configuration
To view the iSCSI cluster base IQN:
# storpool iscsi basename
---------------------------------------
| basename |
---------------------------------------
| iqn.2019-08.com.example:examplename |
---------------------------------------
To view the portal groups:
# storpool iscsi portalGroup list
---------------------------------------------
| name | networksCount | portalsCount |
---------------------------------------------
| examplepg | 1 | 2 |
---------------------------------------------
To view the portals:
# storpool iscsi portalGroup list portals
--------------------------------------------------
| group | address | controller |
--------------------------------------------------
| examplepg | 192.168.42.246:3260 | 1 |
| examplepg | 192.168.42.202:3260 | 5 |
--------------------------------------------------
To view the defined initiators:
# storpool iscsi initiator list
---------------------------------------------------------------------------------------
| name | username | secret | networksCount | exportsCount |
---------------------------------------------------------------------------------------
| iqn.2019-08.com.example:abcdefgh | | | 1 | 1 |
| iqn.2019-08.com.example:client | user | secret | 1 | 0 |
---------------------------------------------------------------------------------------
To view the present state of the configured iSCSI interfaces:
# storpool iscsi interfaces list
--------------------------------------------------
| ctrlId | net 0 | net 1 |
--------------------------------------------------
| 23 | 2A:60:00:00:E0:17 | 2A:60:00:00:E0:17 |
| 24 | 2A:60:00:00:E0:18 | 2A:60:00:00:E0:18 |
| 25 | 2A:60:00:00:E0:19 | 2E:60:00:00:E0:19 |
| 26 | 2A:60:00:00:E0:1A | 2E:60:00:00:E0:1A |
--------------------------------------------------
Note
These are the same interfaces configured with SP_ISCSI_IFACE, in the order of appearance:
# storpool_showconf SP_ISCSI_IFACE
SP_ISCSI_IFACE=sp0,spbond1:sp1,spbond1:[lacp]
In the above output, the sp0 interface is net ID 0 and sp1 is net ID 1.
To view the volumes that may be exported to initiators:
# storpool iscsi target list
-------------------------------------------------------------------------------------
| name | volume | currentControllerId |
-------------------------------------------------------------------------------------
| iqn.2019-08.com.example:examplename:tinyvolume | tinyvolume | 65535 |
-------------------------------------------------------------------------------------
To view the volumes currently exported to initiators:
# storpool iscsi initiator list exports
--------------------------------------------------------------------------------------------------------------------------------------
| name | volume | currentControllerId | portalGroup | initiator |
--------------------------------------------------------------------------------------------------------------------------------------
| iqn.2019-08.com.example:examplename:tinyvolume | tinyvolume | 1 | | iqn.2019-08.com.example:abcdefgh |
--------------------------------------------------------------------------------------------------------------------------------------
12.15.5. Getting active sessions
To list the presently active sessions in the cluster:
# storpool iscsi sessions list
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| id | target | initiator | portal addr | initiator addr | timeCreated | nopOut | scsi | task | dataOut | otherOut | nopIn | scsiRsp | taskRsp | dataIn | r2t | otherIn | t free | t dataOut | t queued | t processing | t dataResp | t aborted | ISID |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 23.0 | iqn.2020-04.com.storpool:autotest:s18-1-iscsi-test-hybrid-win-server2016 | iqn.1991-05.com.microsoft:s18 | 10.1.100.123:3260 | 10.1.100.18:49414 | 2020-07-07 09:25:16 / 00:03:54 | 209 | 89328 | 0 | 0 | 2 | 209 | | | 45736 | 0 | 2 | 129 | 0 | 0 | 0 | 0 | 0 | 1370000 |
| 23.1 | iqn.2020-04.com.storpool:autotest:s11-1-iscsi-test-hybrid-centos6 | iqn.2020-04.com.storpool:s11 | 10.1.100.123:3260 | 10.1.100.11:44392 | 2020-07-07 09:25:33 / 00:03:37 | 218 | 51227 | 0 | 0 | 1 | 218 | | | 25627 | 0 | 1 | 129 | 0 | 0 | 0 | 0 | 0 | 3d0002b8 |
| 24.0 | iqn.2020-04.com.storpool:autotest:s11-1-iscsi-test-hdd-centos6 | iqn.2020-04.com.storpool:s11 | 10.1.100.124:3260 | 10.1.100.11:51648 | 2020-07-07 09:27:27 / 00:01:43 | 107 | 424 | 0 | 0 | 1 | 107 | | | 224 | 0 | 1 | 129 | 0 | 0 | 0 | 0 | 0 | 3d0002b9 |
| 24.1 | iqn.2020-04.com.storpool:autotest:s18-1-iscsi-test-hdd-win-server2016 | iqn.1991-05.com.microsoft:s18 | 10.1.100.124:3260 | 10.1.100.18:49422 | 2020-07-07 09:28:22 / 00:00:48 | 43 | 39568 | 0 | 0 | 2 | 43 | | | 19805 | 0 | 2 | 128 | 0 | 0 | 1 | 0 | 0 | 1370000 |
| 25.0 | iqn.2020-04.com.storpool:autotest:s13-1-iscsi-test-hybrid-centos7 | iqn.2020-04.com.storpool:s13 | 10.1.100.125:3260 | 10.1.100.13:45120 | 2020-07-07 09:20:46 / 00:08:24 | 481 | 154086 | 0 | 0 | 1 | 481 | | | 78308 | 0 | 1 | 129 | 0 | 0 | 0 | 0 | 0 | 3d0000a8 |
| 26.0 | iqn.2020-04.com.storpool:autotest:s13-1-iscsi-test-hdd-centos7 | iqn.2020-04.com.storpool:s13 | 10.1.100.126:3260 | 10.1.100.13:43858 | 2020-07-07 09:22:52 / 00:06:18 | 369 | 147438 | 0 | 0 | 1 | 369 | | | 74883 | 0 | 1 | 129 | 0 | 0 | 0 | 0 | 0 | 3d0000a9 |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Here, the fields are:
id
Identifier for the node and connection. The first part matches the SP_OURID of the node with the storpool_iscsi service running, and the second is the export number.
target
The target IQN.
initiator
The initiator IQN.
portal addr
The portal group floating address and port.
initiator addr
The initiator address and port.
timeCreated
The time when the session was created.
Initiator:
nopOut
Number of NOP-out requests from the initiator.
scsi
Number of SCSI commands from the initiator for this session.
task
Number of SCSI Task Management Function Requests from the initiator.
dataOut
Number of SCSI Data-Out PDUs from the initiator.
otherOut
Number of non SCSI Data-Out PDUs sent to the target (Login/Logout/SNACK or Text).
ISID
The initiator part of the session identifier, explicitly specified by the initiator during login.
Target:
nopIn
Number of NOP-in PDUs from the target.
scsiRsp
Number of SCSI response PDUs from the target.
taskRsp
Number of SCSI Task Management Function Response PDUs from the target.
dataIn
Number of SCSI Data-In PDUs from the target.
r2t
Number of Ready To Transfer (R2T) PDUs from the target.
otherIn
Number of non SCSI Data-In PDUs from the target (Login/Logout/SNACK or Text).
Task queue:
t free
Number of free task queue slots.
t dataOut
Number of write requests waiting for data from TCP.
t queued
Number of IO requests received ready to be processed.
t processing
Number of IO requests sent to the target to process.
t dataResp
Number of read requests queued for sending over TCP.
t aborted
Number of aborted requests.
12.15.6. Operations
To stop exporting the tinyvolume volume to the initiator with iqn iqn.2019-08.com.example:abcdefgh and the examplepg portal group:
# storpool iscsi config unexport initiator iqn.2019-08.com.example:abcdefgh portalGroup examplepg volume tinyvolume
OK
If a target was exported to all initiators (with *), not specifying an initiator will unexport it from all:
# storpool iscsi config unexport portalGroup examplepg volume tinyvolume
OK
To remove an iSCSI definition for the tinyvolume volume:
# storpool iscsi config target delete tinyvolume
OK
To remove access for the iqn.2019-08.com.example:client iSCSI initiator:
# storpool iscsi config initiator iqn.2019-08.com.example:client delete
OK
To remove the portal 192.168.42.202 IP address:
# storpool iscsi config portal delete address 192.168.42.202
OK
To remove portal group examplepg after all the portals have been removed:
# storpool iscsi config portalGroup examplepg delete
OK
Note
Only portal groups without portals may be deleted.
12.15.7. Using iscsi_tool
With hardware-accelerated iSCSI, all traffic to and from the initiators is handled directly by the storpool_iscsi service. For example, with the above setup the addresses exposed on each of the nodes can be queried with /usr/lib/storpool/iscsi_tool:
# /usr/lib/storpool/iscsi_tool
usage: /usr/lib/storpool/iscsi_tool change-port 0/1 ifaceName
usage: /usr/lib/storpool/iscsi_tool ip net list
usage: /usr/lib/storpool/iscsi_tool ip neigh list
usage: /usr/lib/storpool/iscsi_tool ip route list
To list the presently configured addresses:
# /usr/lib/storpool/iscsi_tool ip net list
10.1.100.0/24 vlan 1100 ports 1,2
10.18.1.0/24 vlan 1801 ports 1,2
10.18.2.0/24 vlan 1802 ports 1,2
To list the neighbours and their last state:
# /usr/lib/storpool/iscsi_tool ip neigh list
10.1.100.11 ok F4:52:14:76:9C:B0 lastSent 1785918292753 us, lastRcvd 918669 us
10.1.100.13 ok 0:25:90:C8:E5:AA lastSent 1785918292803 us, lastRcvd 178521 us
10.1.100.18 ok C:C4:7A:EA:85:4E lastSent 1785918292867 us, lastRcvd 178099 us
10.1.100.108 ok 1A:60:0:0:E0:8 lastSent 1785918293857 us, lastRcvd 857181794 us
10.1.100.112 ok 1A:60:0:0:E0:C lastSent 1785918293906 us, lastRcvd 1157179290 us
10.1.100.113 ok 1A:60:0:0:E0:D lastSent 1785918293922 us, lastRcvd 765392509 us
10.1.100.114 ok 1A:60:0:0:E0:E lastSent 1785918293938 us, lastRcvd 526084270 us
10.1.100.115 ok 1A:60:0:0:E0:F lastSent 1785918293954 us, lastRcvd 616948781 us
10.1.100.123 ours
[snip]
The above output also includes the portalGroup addresses residing on the node with the lowest ID in the cluster.
To list routing information:
# /usr/lib/storpool/iscsi_tool ip route list
10.1.100.0/24 local
10.18.1.0/24 local
10.18.2.0/24 local
12.15.8. Using iscsi_targets
The /usr/lib/storpool/iscsi_targets tool is a helper for Linux-based initiators, showing all logged-in targets on the node:
# /usr/lib/storpool/iscsi_targets
/dev/sdn iqn.2020-04.com.storpool:autotest:s11-1-iscsi-test-hybrid-centos6
/dev/sdo iqn.2020-04.com.storpool:autotest:s11-1-iscsi-test-hdd-centos6
/dev/sdp iqn.2020-04.com.storpool:autotest:s11-2-iscsi-test-hybrid-centos6
/dev/sdq iqn.2020-04.com.storpool:autotest:s11-2-iscsi-test-hdd-centos6
12.16. Kubernetes
To register a Kubernetes cluster:
# storpool kubernetes add name cluster1
OK
To disable a Kubernetes cluster:
# storpool kubernetes update name cluster1 disable yes
OK
To enable a Kubernetes cluster:
# storpool kubernetes update name cluster1 disable no
OK
To delete Kubernetes cluster:
# storpool kubernetes delete name cluster1
OK
To list registered Kubernetes clusters:
# storpool kubernetes list
-----------------------
| name | disabled |
-----------------------
| cluster1 | false |
-----------------------
To view the status of the registered Kubernetes clusters:
# storpool kubernetes status
--------------------------------------------------------------
| name | sc | w | pvc | noRsrc | noTempl | mode | noSC |
--------------------------------------------------------------
| cluster1 | 0 | 0/3 | 0 | 0 | 0 | 0 | 0 |
--------------------------------------------------------------
Fields:
sc - registered Storage Classes
w - watch connections to the kube adm
pvc - persistentVolumeClaims being provisioned
noRsrc - persistentVolumeClaims failed due to no resources
noTempl - persistentVolumeClaims failed due to missing template
mode - persistentVolumeClaims failed due to unsupported access mode
noSC - persistentVolumeClaims failed due to missing storage class
12.17. Relocator
The relocator is an internal StorPool service that takes care of data reallocation when a volume's replication or placement group parameters are changed. This service is turned on by default. For details, see 18. Rebalancing the cluster.
12.17.1. Turning on and off
When needed, the relocator could be turned off with:
# storpool relocator off
OK
To turn it back on:
# storpool relocator on
OK
12.17.2. Displaying status
To display relocator’s status:
# storpool relocator status
relocator on, no volumes to relocate
12.17.3. Additional relocator commands
To return the state of the disks after the relocator finishes all presently running tasks, as well as the quantity of objects and data each drive still needs to recover:
# storpool relocator disks
The output of this command is the same as the one from storpool balancer
disks
after the balancing task has been committed; see 12.18. Balancer for
details.
To show the same information as storpool relocator disks
with the
pending operations for a specific volume:
# storpool relocator volume <volumename> disks
To show the same information as storpool relocator disks
with the
pending operations for a specific snapshot:
# storpool relocator snapshot <snapshotname> disks
12.18. Balancer
The balancer is used to redistribute data when a disk or a set of disks (for example, a new node) is added to or removed from a cluster. By default it is off. After changes in the cluster configuration it has to be turned on for the redistribution of data to occur. For details, see 18. Rebalancing the cluster.
To display the status of the balancer:
# storpool balancer status
balancer waiting, auto off
To discard the re-balancing operation:
# storpool balancer stop
OK
To actually commit the changes and start the relocations of the proposed changes:
# storpool balancer commit
OK
After the commit, all changes will be visible only with storpool relocator disks, and many volumes and snapshots will have the M flag in the output of storpool volume status until all relocations are completed. The progress can be followed with storpool task list (see 12.19. Tasks).
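For orientation, a typical sequence using only the commands described in this and the previous section might look like the sketch below; computing the proposed re-balancing itself is covered in 18. Rebalancing the cluster:
# storpool balancer status # check the balancer state
# storpool balancer disks # inspect the proposed data distribution
# storpool balancer commit # start the relocations
# storpool task list # follow the progress (see 12.19. Tasks)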
12.19. Tasks
The tasks are all outstanding operations on recovering or relocating data in the present cluster or between two connected clusters.
For example, if a disk with ID 1401
was not in the cluster for a period of
time and is then returned, all outdated objects will be recovered from the other
drives with the latest changes.
These recovery operations could be listed with:
# storpool task list
----------------------------------------------------------------------------------------
| disk | task id | total obj | completed | started | remaining | % complete |
----------------------------------------------------------------------------------------
| 2301 | RECOVERY | 73 | 5 | 1 | 68 | 6% |
| 2315 | balancer | 180 | 0 | 1 | 180 | 0% |
----------------------------------------------------------------------------------------
| total | | 73 | 5 | 1 | 68 | 6% |
----------------------------------------------------------------------------------------
Other cases where task operations may be listed are when a re-balancing operation has been committed and relocations are in progress, or when a remote snapshot is being cloned into the local cluster.
12.20. Maintenance mode
The maintenance submenu is used to put one or more nodes of a cluster into maintenance state. Several checks are performed before entering maintenance state, in order to prevent a node with one or more live server instances from entering maintenance when, for example, the cluster is not yet fully recovered or is running with decreased redundancy for other reasons.
A node could be configured in maintenance state with:
# storpool maintenance set node 23 duration 10m description kernel_update
OK
The above will configure node ID 23 in maintenance state for 10 minutes and will configure the description to “kernel_update”.
To list the present nodes in maintenance:
# storpool maintenance list
------------------------------------------------------------
| nodeId | started | remaining | description |
------------------------------------------------------------
| 23 | 2020-09-30 12:55:20 | 00:09:50 | kernel_update |
------------------------------------------------------------
To complete a maintenance for a node:
# storpool maintenance complete node 23
OK
Note
Non-cluster-threatening issues will not be sent by the monitoring system to external entities. All alerts will still be received by StorPool support and will be classified internally as “under maintenance” while the node or the cluster is in maintenance mode.
Attention
Cluster-threatening issues will still trigger super-critical alerts to both StorPool support and any other configured endpoint. For details, see Severity levels.
Note that a full cluster maintenance mode is also available. For more information on how to set it with storpool mgmtConfig maintenanceState, see 12.22.8. Cluster maintenance mode.
12.22. Management configuration
Tip
Please consult with StorPool support before changing the management configuration defaults.
The mgmtConfig
submenu is used to set some internal configuration
parameters.
12.22.1. Listing current configuration
To list the presently configured parameters:
# storpool mgmtConfig list
relocator on, interval 5.000 s
relocator transaction: min objects 320, max objects 4294967295
relocator recovery: max tasks per disk 2, max objects per disk 2400
relocator recovery objects trigger 32
relocator min free 150 GB
relocator max objects per HDD tail 0
balancer auto off, interval 5.000 s
snapshot delete interval 1.000 s
disks soft-eject interval 5.000 s
snapshot delayed delete off
snapshot dematerialize interval 1.000 s
mc owner check interval 2.000 s
mc autoreconcile interval 2.000 s
reuse server implicit on disk down disabled
max local recovery requests 1
max remote recovery requests 2
maintenance state production
max disk latency nvme 1000.000 ms
max disk latency ssd 1000.000 ms
max disk latency hdd 1000.000 ms
max disk latency journal 50.000 ms
backup template name backup_template
aggScoreSpace sameAg 99
aggScoreSpace suppress for disk full below 1%
aggScoreSpace restore for disk full above 2%
12.22.2. Local and remote recovery
Using the maxLocalRecoveryRequests
and maxRemoteRecoveryRequests
parameters you can set the number of parallel requests to issue while performing
local or remote recovery, respectively. The values of the parameters should be
between 1 and 64.
To set the default local and remote recovery requests for all disks:
StorPool> mgmtConfig maxLocalRecoveryRequests 1
OK
StorPool> mgmtConfig maxRemoteRecoveryRequests 2
OK
You can override the values per disk in the following way:
StorPool> disk 1111 maxRecoveryRequestsOverride local 1
OK
StorPool> disk 1111 maxRecoveryRequestsOverride remote 2
OK
You can also clear the overrides so that the defaults take precedence:
StorPool> disk 1111 maxRecoveryRequestsOverride local clear
OK
StorPool> disk 1111 maxRecoveryRequestsOverride remote clear
OK
An example use case would be the need to speed up or slow down a re-balancing or a remote transfer, based on the operational requirements at the time. For example, to lower the impact on latency-sensitive user operations, or decrease the time required for getting a cluster back to full redundancy, or getting a remote transfer completed faster.
These parameters were introduced with the 19.3 revision 19.01.2592.cf99471bd release.
12.22.3. Miscellaneous parameters
To disable the deferred snapshot deletion (default on):
# storpool mgmtConfig delayedSnapshotDelete off
OK
When enabled, all snapshots with a configured deletion time will be cleared at that date and time.
To change the default interval between periodic checks whether disks marked for ejection can actually be ejected (5 sec.):
# storpool mgmtConfig disksSoftEjectInterval 20000 # value in ms - 20 sec.
OK
To change the default interval (5 sec.) for the relocator to check if there is new work to be done:
# storpool mgmtConfig relocatorInterval 20000 # value is in ms - 20 sec.
OK
To change the default number of objects per disk (3200) that can be in recovery at a time:
# storpool mgmtConfig relocatorMaxRecoveryObjectsPerDisk 2000 # value in number of objects per disk
OK
To change the default maximum number of recovery tasks per disk (2 tasks):
# storpool mgmtConfig relocatorMaxRecoveryTasksPerDisk 4 # value is number of tasks per disk - will set 4 tasks
OK
To change the minimum (default 320) or the maximum (default 4294967295) number of objects per transaction for the relocator:
# storpool mgmtConfig relocatorMaxTrObjects 2147483647
OK
# storpool mgmtConfig relocatorMinTrObjects 640
OK
To change the maximum number of objects per transaction for HDD tail drives (0 means unset, 1 or more sets the number of objects):
# storpool mgmtConfig relocatorMaxTrObjectsPerHddTail 2
To change the maximum number of objects in recovery for a disk to be usable by the relocator (default 32):
# storpool mgmtConfig relocatorRecoveryObjectsTrigger 64
To change the default interval for checking for snapshots scheduled for deletion:
# storpool mgmtConfig snapshotDeleteInterval
12.22.4. Snapshot dematerialization
To enable snapshot dematerialization or change the interval:
# storpool mgmtConfig snapshotDematerializeInterval 30000 # sets the interval 30 seconds, 0 disables it
Snapshot dematerialization checks for and removes all objects that do not refer to any data, i.e. objects with no changes since the previous snapshot (or ever). This helps reduce the number of used objects per disk in clusters with a large number of snapshots and a small number of changed blocks between the snapshots in the chain.
To update the free space threshold in GB after which the relocator will not be adding new tasks:
# storpool mgmtConfig relocatorGBFreeBeforeAdd 75 # value is in GB
12.22.5. Multi-cluster parameters
To set or change the default multi-cluster owner check interval:
# storpool mgmtConfig mcOwnerCheckInterval 2000 # sets the interval to 2 seconds, 0 disables it
To set or change the default multi-cluster auto-reconcile interval:
# storpool mgmtConfig mcAutoReconcileInterval 2000 # sets the interval to 2 seconds, 0 disables it
12.22.6. Reusing server on disk failure
If a disk is down and a new volume cannot be allocated, enabling the reuseServerImplicitOnDiskDown option will retry the volume allocation as if the reuseServer parameter was specified. This is helpful for minimal installations with 3 nodes when one of the nodes or a disk is down.
To enable the option:
# storpool mgmtConfig reuseServerImplicitOnDiskDown enable
The only downside is that the volume will have two of its replicas on drives in the same server. When the missing node comes back, a re-balancing will be required so that replicas created on the same server are redistributed across all nodes. A needbalance alert will be raised in these cases.
This option is turned on by default for all new installations. Its history is as follows:
Introduced in 19.1 revision 19.01.1025.0baac06a6.
As of 19.3 revision 19.01.2318.10e55fce0, all volumes and snapshots that violate some placement constraints will be visible in the output of storpool volume status and storpool volume quickStatus with the flag C; for details, see 12.10.4. Volume status.
12.22.7. Changing default template
To change the default template used for snapshots received from a remote cluster through the storpool_bridge service (replaces the now deprecated SP_BRIDGE_TEMPLATE option):
# storpool mgmtConfig backupTemplateName all-flash # the all-flash template should exist
OK
12.22.8. Cluster maintenance mode
A full cluster maintenance mode is available for occasions involving full cluster related maintenance activities. An example would be a scheduled restart of a network switch that will be reported as missing network for all nodes in a cluster.
This mode does not perform any checks, and is mainly for informational purposes in order to sync context between customers and StorPool’s support teams. Full cluster maintenance mode could be used in addition to the per-node maintenance state explained above when necessary.
To change the full cluster maintenance state to maintenance
:
# storpool mgmtConfig maintenanceState maintenance
OK
To switch back into production
state:
# storpool mgmtConfig maintenanceState production
OK
In case you only need to do this for a single node you can use storpool
maintenance
, as described in 12.20. Maintenance mode.
12.22.9. Latency thresholds
Note
For individual per-disk latency thresholds, see the 12.8.4. Disk list performance information section.
To define a global latency threshold before ejecting an HDD drive:
# storpool mgmtConfig maxDiskLatencies hdd 1000 # value is in milliseconds
To define a global latency threshold before ejecting an SSD drive:
# storpool mgmtConfig maxDiskLatencies ssd 1000 # value is in milliseconds
To define a global latency threshold before ejecting an NVMe drive:
# storpool mgmtConfig maxDiskLatencies nvme 1000 # value is in milliseconds
To define a global latency limit before ejecting a journal device:
# storpool mgmtConfig maxDiskLatencies journal 50 # value is in milliseconds
12.22.10. Aggregate score parameters
To configure different defaults for the disk space aggregation algorithm, use:
# storpool mgmtConfig aggScoreSpace suppressEnd 1
# storpool mgmtConfig aggScoreSpace restore 2
# storpool mgmtConfig aggScoreSpace sameAg 99
Note
These settings will be gradually changed in all production installations to new defaults that are much less aggressive when a large amount of data is deleted. With the new defaults, the impact on user operations will be much smaller than with the previous ones, with a mostly linear relation to the amount of data written and then freed from a disk:
# storpool mgmtConfig aggScoreSpace suppressEnd 90
# storpool mgmtConfig aggScoreSpace restore 95
# storpool mgmtConfig aggScoreSpace sameAg 10
Added with 21.0 revision 21.0.841.983f5880c release.
12.23. Mode
Support for a couple of different output modes is available both in the interactive shell and when the CLI is invoked directly. Some custom format options are available only for certain operations.
The available modes are:
- csv
Semicolon-separated values for some commands.
- json
Processed JSON output for some commands.
- pass
Pass the JSON response through.
- raw
Raw output (display the HTTP request and response).
- text
Human-readable output (default).
Example with switching to csv
mode in the interactive shell:
StorPool> mode csv
OK
StorPool> net list
nodeId;flags;net 1;net 2
23;uU + AJ;22:60:00:00:F0:17;26:60:00:00:F0:17
24;uU + AJ;2A:60:00:00:00:18;2E:60:00:00:00:18
25;uU + AJ;F6:52:14:76:9C:C0;F6:52:14:76:9C:C1
26;uU + AJ;2A:60:00:00:00:1A;2E:60:00:00:00:1A
29;uU + AJ;52:6B:4B:44:02:FE;52:6B:4B:44:02:FF
The same applies when using the CLI directly:
# storpool -f csv net list # the output is the same as above
[snip]
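The non-interactive form can also be combined with standard tools for scripting. A minimal sketch, assuming the -f flag accepts the other listed modes the same way it accepts csv, and that jq is installed on the node:
# storpool -f json net list | jq . # pretty-print the processed JSON output
# storpool -f raw net list # show the raw HTTP request and response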
13. Multi-server
The multi-server feature enables the use of up to seven separate
storpool_server
instances on a single node. This makes sense for dedicated
storage nodes, or in the case of a heavily-loaded converged setup with more
resources isolated for the storage system.
For example, a dedicated storage node with 36 drives would provide better peak performance with 4 server instances each controlling 1/4th of all disks/SSDs than with a single instance. Another good example would be a converged node with 16 SSDs/HDDs, which would provide better peak performance with two server instances each controlling half of the drives and running on separate CPU cores, or even running on two threads on a single CPU core compared to a single server instance.
13.1. Configuration
The configuration of the CPUs on which the different instances are running is
done via cgroups, through the storpool_cg
tool; for details, see
6.6. Cgroup options.
Configuring which drive is handled by which instance is done with the
storpool_initdisk
tool. For example, if you have two drives whose IDs are
1101
and 1102
, both controlled by the first server instance, the output
from storpool_initdisk
would look like this:
# storpool_initdisk --list
/dev/sde1, diskId 1101, version 10007, server instance 0, cluster init.b, SSD
/dev/sdf1, diskId 1102, version 10007, server instance 0, cluster init.b, SSD
Setting the second SSD drive (1102
) to be controlled by the second server
instance is done like this (X
is the drive letter and N
is the partition
number, for example /dev/sdf1
):
# storpool_initdisk -r -i 1 /dev/sdXN
Hint
The above command will fail if the storpool_server service is running; eject the disk before re-assigning it to another instance.
In some cases, if the first server instance was configured with a large amount of cache (see SP_CACHE_SIZE in 6. Node configuration options), when migrating from one to two instances it is recommended to first split the cache between the instances (for example, from 8192 to 4096). These parameters are handled automatically by the storpool_cg tool; for details, see 6.6. Cgroup options.
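As an illustration only (the exact variable names and their precedence are described in 6. Node configuration options), halving the cache before switching to two instances could look like the following hypothetical /etc/storpool.conf.d/ snippet:
# /etc/storpool.conf.d/cache.conf (hypothetical file name)
# halve the cache so that two server instances fit in the same memory budget
SP_CACHE_SIZE=4096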
13.2. Helper
StorPool provides a tool for easy reconfiguration between different numbers of server instances. It can be used to print the required commands. For example, for a node with SSDs and HDDs automatically assigned to three SSD-only server instances and one HDD-only server instance:
[root@s25 ~]# /usr/lib/storpool/multi-server-helper.py -i 4 -s 3
/usr/sbin/storpool_initdisk -r -i 0 2532 0000:01:00.0-p1 # SSD
/usr/sbin/storpool_initdisk -r -i 0 2534 0000:02:00.0-p1 # SSD
/usr/sbin/storpool_initdisk -r -i 0 2533 0000:06:00.0-p1 # SSD
/usr/sbin/storpool_initdisk -r -i 0 2531 0000:07:00.0-p1 # SSD
/usr/sbin/storpool_initdisk -r -i 1 2505 /dev/sde1 # SSD
/usr/sbin/storpool_initdisk -r -i 1 2506 /dev/sdf1 # SSD
/usr/sbin/storpool_initdisk -r -i 1 2507 /dev/sdg1 # SSD
/usr/sbin/storpool_initdisk -r -i 1 2508 /dev/sdh1 # SSD
/usr/sbin/storpool_initdisk -r -i 2 2501 /dev/sda1 # SSD
/usr/sbin/storpool_initdisk -r -i 2 2502 /dev/sdb1 # SSD
/usr/sbin/storpool_initdisk -r -i 2 2503 /dev/sdc1 # SSD
/usr/sbin/storpool_initdisk -r -i 2 2504 /dev/sdd1 # SSD
/usr/sbin/storpool_initdisk -r -i 3 2511 /dev/sdi1 # WBC
/usr/sbin/storpool_initdisk -r -i 3 2512 /dev/sdj1 # WBC
/usr/sbin/storpool_initdisk -r -i 3 2513 /dev/sdk1 # WBC
/usr/sbin/storpool_initdisk -r -i 3 2514 /dev/sdl1 # WBC
/usr/sbin/storpool_initdisk -r -i 3 2515 /dev/sdn1 # WBC
/usr/sbin/storpool_initdisk -r -i 3 2516 /dev/sdo1 # WBC
/usr/sbin/storpool_initdisk -r -i 3 2517 /dev/sdp1 # WBC
/usr/sbin/storpool_initdisk -r -i 3 2518 /dev/sdq1 # WBC
[root@s25 ~]# /usr/lib/storpool/multi-server-helper.py -h
usage: multi-server-helper.py [-h] [-i INSTANCES] [-s [SSD_ONLY]]
Prints relevant commands for dispersing the drives to multiple server
instances
optional arguments:
-h, --help show this help message and exit
-i INSTANCES, --instances INSTANCES
Number of instances
-s [SSD_ONLY], --ssd-only [SSD_ONLY]
Splits by type: 's' SSD-only instances plus i-s HDD
instances (default s: 1)
Note that the commands could be executed only when the relevant
storpool_server*
service instances are stopped and a cgroup re-configuration
would likely be required after the setup changes (see
6.6. Cgroup options for more info on how to update cgroups).
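Since the helper only prints the commands, a cautious workflow is to save its output, review it, and run it only after the relevant server instances have been stopped; a sketch (the file name is arbitrary):
# /usr/lib/storpool/multi-server-helper.py -i 4 -s 3 > /tmp/resplit-disks.sh
# less /tmp/resplit-disks.sh # review the generated storpool_initdisk commands
# sh /tmp/resplit-disks.sh # run only with the storpool_server* instances stopped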
14. Redundancy
StorPool provides two mechanisms for protecting data from unplanned events: replication and erasure coding.
When planning replication and erasure coding schemes you can use the StorPool capacity planner tool. For details, see StorPool capacity planner.
14.1. Replication
With replication, redundancy is provided by having multiple copies (replicas) of the data written synchronously across the cluster. You can set the number of replication copies as needed. The replication level directly correlates with the number of servers that may be down without interruption in the service. For example, with triple replication the number of the servers that may be down simultaneously, without losing access to the data, is 2.
Each volume or snapshot could be replicated on a different set of drives. Each set of drives is configured through the placement groups. A volume would either have all of its copies in a single set of drives in a different set of nodes, or have different copies in a different set of drives. There are many parameters through which you can manage replication; for details, see 12.10. Volumes and 12.9. Placement groups.
Tip
When using the replication mechanism, StorPool recommends having 3 copies as a standard for critical data.
14.1.1. Triple replication
The minimum requirement for triple replication is three nodes (five are recommended).
With triple replication each block of data is stored on three different storage nodes. This protects the data against two simultaneous failures - for example, one node is down for maintenance, and a drive on another node fails.
14.1.2. Dual replication
Dual replication can be used for non-critical data, or for data that can be recreated from other sources. Dual-replicated data can tolerate a single failure without service interruption.
This type of replication is suitable for test and staging environments, and can be deployed on a single node cluster (not recommended for production deployments). Deployment can also be performed on larger HDD-based backup clusters.
14.2. Erasure Coding
As of release 21.0 revision 21.0.75.1e0880427 StorPool supports erasure coding on NVMe drives.
14.2.1. Features
The erasure coding mechanism reduces the amount of data stored on the same hardware set, while preserving the level of data protection. It provides the following advantages:
Cross-node data protection
Erasure-coded data is always protected across servers with two parity objects, so that any two servers can fail, and user data is safe.
Delayed batch-encoding
Incoming data is initially written with triple replication. The erasure coding mechanism is automatically applied later. This way the data processing overhead is significantly reduced, and the impact on latency for user I/O operations is minimized.
Designed for always-on operations
Up to two storage nodes can be rebooted or brought down for maintenance while the storage system keeps running, and all data is available and in use.
A pure software feature
The implementation requires no additional hardware components.
14.2.2. Redundancy schemes
StorPool supports three redundancy schemes for erasure coding: 2+2, 4+2, and 8+2. You can choose which one to use based on the size of your cluster. The naming of the schemes follows the k+m pattern:
- k is the number of data blocks stored.
- m is the number of parity blocks stored.
A redundancy scheme can recover the data when up to m blocks are lost.
For example, 4+2
stores 4 data blocks and protects them with two parity
blocks. It can operate and recover when any 2 drives or nodes are lost.
When planning, consider the minimum required number of nodes (or fault sets) for each scheme:
----------------------------------------------
| Scheme | Nodes | Raw space used | Overhead |
----------------------------------------------
| 2+2    | 5+    | 2.4x           | 140%     |
| 4+2    | 7+    | 1.8x           | 80%      |
| 8+2    | 11+   | 1.5x           | 50%      |
----------------------------------------------
For example, storing 1TB user data using the 8+2
scheme requires 1.5TB raw
storage capacity.
The nodes have to be relatively similar in size; a mixture that includes a few very large nodes could make it impossible to use their capacity efficiently.
Note
Erasure coding requires making snapshots on a regular basis. Make sure your cluster is configured to create snapshots regularly, for example using the VolumeCare service. A single periodic snapshot per volume is required; more snapshots are optional.
14.2.3. FAQ
Check the frequently asked questions about StorPool’s erasure coding implementation: Erasure coding
15. Volumes and snapshots
Volumes are the basic service of the StorPool storage system. A volume always has a name, a global ID, and a certain size. It can be read from and written to, and it can be attached to hosts as a read-only or read-write block device under the /dev/storpool directory (also available under /dev/storpool-byid).
The volume name is a string consisting of one or more of the allowed characters: upper- and lower-case Latin letters (a-z, A-Z), digits (0-9), and the delimiters dot (.), colon (:), dash (-), and underscore (_).
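As an illustration of the allowed character set only (a plain shell check, not a StorPool command):
# echo "db01.data:replica-1_a" | grep -Eq '^[A-Za-z0-9.:_-]+$' && echo valid || echo invalid
valid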
15.1. Creating a volume
15.2. Deleting a volume
15.3. Renaming a volume
15.4. Resizing a volume
15.5. Snapshots
Snapshots are read-only point-in-time images of volumes. They are created once
and cannot be changed. They can be attached to hosts as read-only block devices
under /dev/storpool
.
All volumes and snapshots share the same name-space. Names of volumes and
snapshots are unique within a StorPool cluster. This diagram illustrates the
relationship between a snapshot and a volume. Volume vol1
is based on
snapshot snap1
. vol1
contains only the changes since snap1
was
taken. In the common case this is a small amount of data. Arrows indicate a
child-parent relationship. Each volume or snapshot may have exactly one parent
which it is based upon. Writes to vol1
are recorded within the volume. Reads
from vol1
may be served by vol1
or by its parent snapshot - snap1
,
depending on whether vol1
contains changed data for the read request or not.
Snapshots and volumes are completely independent. Each snapshot may have many children (volumes and snapshots). Volumes cannot have children.
snap1
contains a full image. snap2
contains only the changes since
snap1
was taken. vol1
and vol2
contain only the changes since
snap2
was taken.
15.6. Creating a snapshot of a volume
There is a volume named vol1
.
After the first snapshot the state of vol1
is recorded in a new snapshot
named snap1
. vol1
does not occupy any space now, but will record any new
writes which come in after the creation of the snapshot. Reads from vol1
may
fall through to snap1
.
Then the state of vol1
is recorded in a new snapshot named snap2
.
snap2
contains the changes between the moment snap1
was taken and the
moment snap2
was taken. snap2
’s parent is the original parent of
vol1
.
15.7. Creating a volume based on an existing snapshot (a.k.a. clone)
Before the creation of vol1
there is a snapshot named snap1
.
A new volume, named vol1
is created. vol1
is based on snap1
. The
newly created volume does not occupy any space initially. Reads from the
vol1
may fall through to snap1
or to snap1
’s parents (if any).
15.8. Deleting a snapshot
vol1
and vol2
are based on snap1
. snap1
is based on snap0
.
snap1
contains the changes between the moment snap0
was taken and when
snap1
was taken. vol1
and vol2
contain the changes since the moment
snap1
was taken.
After the deletion, vol1
and vol2
are based on snap1
’s original
parent (if any). In the example they are now based on snap0
. When deleting a snapshot, the changes contained in it are not propagated to its children, and StorPool keeps snap1 in deleting state to prevent an explosion of disk space usage.
15.9. Rebase to null (a.k.a. promote)
vol1
is based on snap1
. snap1
is in turn based on snap0
.
snap1
contains the changes between the moment snap0
was taken and when
snap1
was taken. vol1
contains the changes since the moment snap1
was taken.
After promotion vol1
is not based on a snapshot. vol1
now contains all
data, not just the changes since snap1
was taken. Any relation between
snap1
and snap0
is unaffected.
15.10. Rebase
vol1
is based on snap1
. snap1
is in turn based on snap0
.
snap1
contains the changes between the moment snap0
was taken and when
snap1
was taken. vol1
contains the changes since the moment snap1
was taken.
After the rebase operation vol1
is based on snap0
. vol1
now contains
all changes since snap0
was taken, not just since snap1
. snap1
is
unchanged.
15.11. Example use of snapshots
This is a semi-realistic example of how volumes and snapshots may be used. There
is a snapshot called base.centos7
. This snapshot contains a base CentOS 7 VM
image, which was prepared carefully by the service provider. There are 3
customers with 4 virtual machines each. All virtual machine images are based on
CentOS 7, but may contain custom data, which is unique to each VM.
This example shows another typical use of snapshots: restore points back in time for a volume. There is one base image for CentOS 7, three snapshot restore points, and one live volume cust123.v.1.
15.12. More information
StorPool provides a tool for generating a visual representation of the volumes and snapshots. For details, see StorPool tree.
You can manage volumes and snapshots using the command line interface. For details, see 12.10. Volumes and 12.11. Snapshots.
16. Setting iSCSI targets
If StorPool volumes need to be accessed by hosts that cannot run the StorPool client service (e.g. VMware hypervisors), they may be exported using the iSCSI protocol.
As of version 19, StorPool implements an internal user-space TCP/IP stack, which in conjunction with the NIC hardware acceleration (user-mode drivers) allows for higher performance and independence of the kernel’s TCP/IP stack and its inefficiencies.
This document provides information on how to set up iSCSI targets in a StorPool
cluster. For more information on using the storpool
tool for this purpose,
see the 12.15. iSCSI section of the reference guide. For details on
configuring initiators (clients), see iSCSI overview.
16.1. A Quick Overview of iSCSI
The iSCSI remote block device access protocol, as implemented by the StorPool iSCSI service, is a client-server protocol allowing clients (referred to as “initiators”) to read and write data to disks (referred to as “targets”) exported by iSCSI servers.
iSCSI is implemented in StorPool with Portal Groups and Portals.
A portal is one instance of the 9.10. storpool_iscsi service, which listens on a TCP port (usually 3260), on specified IP addresses. Every portal has its own set of “targets” (exported volumes) that it provides service for.
A portal group is the “entry point” to the iSCSI service: a “floating” IP address that lives on the first storpool_iscsi service in the cluster and is always kept active (by automatically moving to the next instance if the one serving it is stopped or dies). All initiators connect to that IP and get redirected to the relevant instance to communicate with their target.
16.2. An iSCSI Setup in a StorPool Cluster
The StorPool implementation of iSCSI provides a way to mark StorPool volumes as accessible to iSCSI initiators, define iSCSI portals where hosts running the StorPool iSCSI service listen for connections from initiators, define portal groups over these portals, and export StorPool volumes (iSCSI targets) to iSCSI initiators in the portal groups. To simplify the configuration of the iSCSI initiators, and also to provide load balancing and failover, each portal group has a floating IP address that is automatically brought up on only a single StorPool service at a given moment; the initiators are configured to connect to this floating address, authenticating if necessary, and are then redirected to the portal of the StorPool service that actually exports the target (volume) that they need to access.
Note
You don’t need to add the IP addresses on the nodes.
Those are handled directly by the StorPool TCP implementation and are not visible in ifconfig or ip. If you are going to use multiple VLANs, they are configured in the CLI and do not require setting up VLAN interfaces on the host itself, except for debugging or testing, or if a local initiator is required to access volumes through iSCSI.
Exporting a volume to initiators through more than one portal group is not supported.
For example, you cannot do this:
volume-1 -> iqn.something.storpool:s09-1 -> portal-group-1
volume-1 -> iqn.something.storpool:s09-1 -> portal-group-2
In the simplest setup, there is a single portal group with a floating IP address, there is a single portal for each StorPool host that runs the iSCSI service, all the initiators connect to the floating IP address and are redirected to the correct host. For quality of service or fine-grained access control, more portal groups may be defined and some volumes may be exported via more than one portal group.
Before configuring iSCSI, the interfaces that would be used for
it need to be described in storpool.conf
. Here is the general
config format:
SP_ISCSI_IFACE=IFACE1,RESOLVE:IFACE2,RESOLVE:[flags]
This row means that the first iSCSI network is on IFACE1
and the
second one on IFACE2
. The order is important for the configuration
later. RESOLVE
is the resolve interface, if different than the
interfaces themselves, i.e. if it’s a bond or a bridge.
[flags] is optional and, if not needed, must be omitted. Currently the only supported value is [lacp] (brackets included), used when the interfaces are in a LACP trunk.
Examples:
Multipath, two separate interfaces used directly:
SP_ISCSI_IFACE=eth0:eth1
Active-backup bond named bond0
:
SP_ISCSI_IFACE=eth0,bond0:eth1,bond0
LACP bond named bond0
:
SP_ISCSI_IFACE=eth0,bond0:eth1,bond0:[lacp]
Bridge interface cloudbr0
on top of LACP bond:
SP_ISCSI_IFACE=eth0,cloudbr0:eth1,cloudbr0:[lacp]
A trivial iSCSI setup can be brought up by the following series of StorPool CLI commands below. See the CLI tutorial for more information about the commands themselves. The setup does the following:
has a baseName/IQN of iqn.2019-08.com.example:poc-cluster;
has a floating IP address of 192.168.42.247, which is in VLAN 42;
two nodes from the cluster will be able to export in this group:
node ID 1, with IP address 192.168.42.246
node ID 3, with IP address 192.168.42.202
one client is defined, with IQN iqn.2019-08.com.example:poc-cluster:hv1
one volume, called tinyvolume, will be exported to the defined client in the portal group.
Note
You need to obtain the exact IQN of the initiator, available at:
Windows Server: iSCSI initiator, it is automatically generated upon installation
VMWare vSphere: it is automatically assigned upon creating a software iSCSI adapter
Linux-based (XenServer, etc.): /etc/iscsi/initiatorname.iscsi
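For example, on a Linux-based initiator the IQN can be read directly from that file (the value shown is illustrative):
# cat /etc/iscsi/initiatorname.iscsi
InitiatorName=iqn.2019-08.com.example:poc-cluster:hv1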
# storpool iscsi config setBaseName iqn.2019-08.com.example:poc-cluster
OK
# storpool iscsi config portalGroup poc create
OK
# storpool iscsi config portalGroup poc addNet 192.168.42.247/24 vlan 42
OK
# storpool iscsi config portal create portalGroup poc address 192.168.42.246 controller 1
OK
# storpool iscsi config portal create portalGroup poc address 192.168.42.202 controller 3
OK
# storpool iscsi portalGroup list
---------------------------------------
| name | networksCount | portalsCount |
---------------------------------------
| poc | 1 | 2 |
---------------------------------------
# storpool iscsi portalGroup list portals
--------------------------------------------
| group | address | controller |
--------------------------------------------
| poc | 192.168.42.246:3260 | 1 |
| poc | 192.168.42.202:3260 | 3 |
--------------------------------------------
# storpool iscsi config initiator iqn.2019-08.com.example:poc-cluster:hv1 create
OK
# storpool volume tinyvolume template tinytemplate create # assumes tinytemplate exists
OK
# storpool iscsi config target create tinyvolume
OK
# storpool iscsi config export volume tinyvolume portalGroup poc initiator iqn.2019-08.com.example:poc-cluster:hv1
OK
# storpool iscsi initiator list
----------------------------------------------------------------------------------------------
| name | username | secret | networksCount | exportsCount |
----------------------------------------------------------------------------------------------
| iqn.2019-08.com.example:poc-cluster:hv1 | | | 0 | 1 |
----------------------------------------------------------------------------------------------
# storpool iscsi initiator list exports
---------------------------------------------------------------------------------------------------------------------------------------------
| name | volume | currentControllerId | portalGroup | initiator |
---------------------------------------------------------------------------------------------------------------------------------------------
| iqn.2019-08.com.example:poc-cluster:tinyvolume | tinyvolume | 1 | poc | iqn.2019-08.com.example:poc-cluster:hv1 |
---------------------------------------------------------------------------------------------------------------------------------------------
Below is a setup with two separate networks that allows for multipath. It uses the 192.168.41.0/24 network on the first interface, 192.168.42.0/24 on the second interface, and the .247 IP for the floating IP in both networks:
# storpool iscsi config setBaseName iqn.2019-08.com.example:poc-cluster
OK
# storpool iscsi config portalGroup poc create
OK
# storpool iscsi config portalGroup poc addNet 192.168.41.247/24
OK
# storpool iscsi config portalGroup poc addNet 192.168.42.247/24
OK
# storpool iscsi config portal create portalGroup poc address 192.168.41.246 controller 1
OK
# storpool iscsi config portal create portalGroup poc address 192.168.42.246 controller 1
OK
# storpool iscsi config portal create portalGroup poc address 192.168.41.202 controller 3
OK
# storpool iscsi config portal create portalGroup poc address 192.168.42.202 controller 3
OK
# storpool iscsi portalGroup list
---------------------------------------
| name | networksCount | portalsCount |
---------------------------------------
| poc | 2 | 4 |
---------------------------------------
# storpool iscsi portalGroup list portals
--------------------------------------------
| group | address | controller |
--------------------------------------------
| poc | 192.168.41.246:3260 | 1 |
| poc | 192.168.41.202:3260 | 3 |
| poc | 192.168.42.246:3260 | 1 |
| poc | 192.168.42.202:3260 | 3 |
--------------------------------------------
Note
Please note that the order of adding the networks corresponds to the order in SP_ISCSI_IFACE: the first network will be bound to the first interface appearing in this configuration. For how to list the configured iSCSI interfaces and the addresses exposed by a particular node, see 12.15.7. Using iscsi_tool.
There is no difference in exporting volumes in multi-path setups.
16.3. Routed iSCSI setup
16.3.1. Overview
Layer-3/routed networks present some challenges to the operation of StorPool iSCSI, unlike flat layer-2 networks:
routes need to be resolved for destinations based on the kernel routing table, instead of ARP;
floating IP addresses for the portal groups need to be accessible to the whole network;
The first task is accomplished by monitoring the kernel’s routing
table, the second with an integrated BGP speaker in storpool_iscsi
.
Note
StorPool’s iSCSI does not support Linux’s policy-based
routing, and is not affected by iptables
, nftables
, or any
kernel filtering/networking component.
An iSCSI deployment in a layer-3 network has the following general elements:
nodes with storpool_iscsi in one or multiple subnets;
allocated IP address(es) for the portal group floating IPs;
a local routing daemon (bird, frr);
access to the network's routing protocol.
The storpool_iscsi
daemon connects to a local routing daemon via
BGP and announces the floating IPs from the node those are active on.
The local routing daemon talks to the network via its own protocol
(BGP, OSPF or something else) and passes on the updates.
Note
In a fully routed network, the local routing daemon is also responsible for announcing the IP address used for cluster management (managed by storpool_mgmt).
16.3.2. Configuration
The following needs to be added to storpool.conf
:
SP_ISCSI_ROUTED=1
In routed networks, when adding the portalGroup floating IP address, you need to specify it as /32.
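For example, following the addNet syntax from 16.2, the floating address used in the earlier examples would be added like this in a routed setup (shown only as an illustration):
# storpool iscsi config portalGroup poc addNet 192.168.42.247/32
OK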
Note
These are example configurations and may not be the exact fit for a particular setup. Handle with care.
Note
In the examples below, the ASN of the network is 65500
,
StorPool has been assigned 65512
, and will need to announce
192.168.42.247
.
To enable the BGP speaker in storpool_iscsi
, the following
snippet for storpool.conf
is needed (the parameters are
described in the comment above it):
# ISCSI_BGP_IP:BGP_DAEMON_IP:AS_FOR_ISCSI:AS_FOR_THE_DAEMON
SP_ISCSI_BGP_CONFIG=127.0.0.2:127.0.0.1:65512:65512
And here’s a snippet from bird.conf
for a BGP speaker
that talks to StorPool’s iSCSI:
# variables
myas = 65512;
remoteas = 65500;
neigh = 192.168.42.1
# filter to only export our floating IP
filter spip {
if (net = 192.168.42.247/32) then accept;
reject;
}
# external gateway
protocol bgp sw100g1 {
local as myas;
neighbor neigh as remoteas;
import all;
export filter spip;
direct;
gateway direct;
allow local as;
}
# StorPool iSCSI
protocol bgp spiscsi {
local as myas;
neighbor 127.0.0.1 port 2179 as myas;
import all;
export all;
multihop;
next hop keep;
allow local as;
}
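Once bird is running with a configuration along these lines, the BGP sessions and the announced route can be checked with bird's own CLI (generic bird commands, not StorPool-specific):
# birdc show protocols
# birdc show route for 192.168.42.247 all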
Note
For protocols other than BGP, note that StorPool iSCSI exports the route to the floating IP with a next hop of the IP address configured for the node's portal, and this information needs to be preserved when announcing the route.
16.4. Caveats with a Complex iSCSI Architecture
In iSCSI portal definitions, a TCP address/port pair must be unique; only a single portal within the whole cluster may be defined at a single IP address and port. Thus, if the same StorPool iSCSI service should be able to export volumes in more than one portal group, the portals should be placed either on different ports or on different IP addresses (although it is fine that these addresses will be brought up on the same network interface on the host).
Note
Even though StorPool supports reusing IPs, separate TCP ports, and so on, the general recommendation for different portal groups is to have a separate VLAN and IP range for each one. There are many unknowns with different ports, security issues with multiple customers in the same VLAN, etc.
The redirecting portal on the floating address of a portal group always listens on port 3260. Similarly to the above, different portal groups must have different floating IP addresses, although they are automatically brought up on the same network interfaces as the actual portals within the groups.
Some iSCSI initiator implementations (e.g. VMware vSphere) may only connect to TCP port 3260 for an iSCSI service. In a more complex setup where a StorPool service on a single host may export volumes in more than one portal group, this might mean that the different portals must reside on different IP addresses, since the port number is the same.
For technical reasons, currently a StorPool volume may only be exported by a single StorPool service (host), even though it may be exported in different portal groups. For this reason, some care should be taken in defining the portal groups so that they may have at least some StorPool services (hosts) in common.
17. Multi-site and multi-cluster
There are two sets of features allowing connections and operations to be performed on different clusters in the same (17.1. Multi-cluster) datacenter or different locations (17.2. Multi site).
General distinction between the two:
Multi-cluster covers closely packed clusters (i.e. pods or racks) with a fast and low-latency connection between them
Multi-site covers clusters in separate locations connected through an insecure and/or high-latency connection
17.1. Multi-cluster
For a detailed overview, see Introduction to multi-cluster mode.
The main use case for multi-cluster mode is seamless scalability in the same datacenter. A volume can be live-migrated between different sub-clusters in a multi-cluster setup. This way workloads can be balanced between multiple sub-clusters in a location, which is generally referred to as a multi-cluster setup.
17.2. Multi site
Remotely connected clusters in different locations are referred to as multi site. When two remote clusters are connected, they could efficiently transfer snapshots between them. The usual case is remote backup and DR.
17.3. Setup
Connecting clusters regardless of their locations requires the
storpool_bridge
service to be running on at least two nodes in each cluster.
Each node running the storpool_bridge
needs the following parameters to be
configured in /etc/storpool.conf
or /etc/storpool.conf.d/*.conf
files:
SP_CLUSTER_NAME=<Human readable name of the cluster>
SP_CLUSTER_ID=<location ID>.<cluster ID>
SP_BRIDGE_HOST=<IP address>
The following is required when a single IP will be failed over between the bridges; see 17.5.2. Single IP failed over between the nodes:
SP_BRIDGE_IFACE=<interface> # optional with IP failover
The SP_CLUSTER_NAME is a mandatory human-readable name for this cluster.
The SP_CLUSTER_ID is a unique ID assigned by StorPool to each existing cluster (for example nmjc.b). The cluster ID consists of two parts separated by a dot: the part before the dot (nmjc) is the location ID, and the part after the dot (b) is the sub-cluster ID.
The SP_BRIDGE_HOST is the IP address on which to listen for connections from other bridges. Note that TCP port 3749 should be unblocked in the firewalls between the two locations.
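For example, on a host using firewalld the port could be opened as follows (adjust to the local firewall tooling; shown only as an illustration):
# firewall-cmd --permanent --add-port=3749/tcp
# firewall-cmd --reload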
A backup template should be configured through mgmtConfig (see 12.22. Management configuration). The backup template instructs the local bridge which template to use for incoming snapshots.
Warning
The backupTemplateName mgmtConfig option must be configured in the destination cluster for storpool volume XXX backup LOCATION to work (otherwise the transfer won't start).
The SP_BRIDGE_IFACE
is required when two or more bridges are configured with
the same public/private key pairs. The SP_BRIDGE_HOST
in this case is a
floating IP address and will be configured on the SP_BRIDGE_IFACE
on the
host with the active
bridge.
17.4. Connecting two clusters
In this example there are two clusters named Cluster_A and Cluster_B. To have them connected through their bridge services, we have to introduce each of them to the other.
Note
In case of a multi-cluster setup the location will be the same for both clusters. The procedure is the same in both cases, with the slight difference that in a multi-cluster setup the remote bridges are usually configured with noCrypto.
17.4.1. Cluster A
The following parameters from Cluster_B will be required:
- The SP_CLUSTER_ID - locationBId.bId
- The SP_BRIDGE_HOST IP address - 10.10.20.1
- The public key located in /usr/lib/storpool/bridge/bridge.key.txt on the remote bridge host in Cluster_B - eeeeeeeeeeeee.ffffffffffff.ggggggggggggg.hhhhhhhhhhhhh
By using the CLI we could add Cluster_B
’s location with the following
commands in Cluster_A
:
user@hostA # storpool location add locationBId location_b
user@hostA # storpool cluster add location_b bId
user@hostA # storpool cluster list
--------------------------------------------
| name | id | location |
--------------------------------------------
| location_b-cl1 | bId | location_b |
--------------------------------------------
The remote name is location_b-cl1
, where the clN
number is automatically
generated based on the cluster ID. The last step in Cluster_A
is to register
the Cluster_B
’s bridge. The command looks like this:
user@hostA # storpool remoteBridge register location_b-cl1 10.10.20.1 eeeeeeeeeeeee.ffffffffffff.ggggggggggggg.hhhhhhhhhhhhh
Registered bridges in Cluster_A
:
user@hostA # storpool remoteBridge list
----------------------------------------------------------------------------------------------------------------------------
| ip | remote | minimumDeleteDelay | publicKey | noCrypto |
----------------------------------------------------------------------------------------------------------------------------
| 10.10.20.1 | location_b-cl1 | | eeeeeeeeeeeee.ffffffffffff.ggggggggggggg.hhhhhhhhhhhhh | 0 |
----------------------------------------------------------------------------------------------------------------------------
Hint
The public key in /usr/lib/storpool/bridge/bridge.key.txt
will be
generated on the first run of the storpool_bridge
service.
Note
The noCrypto
is usually 1
in case of multi-cluster with a
secure datacenter network for higher throughput and lower latency
during migrations.
17.4.2. Cluster B
Similarly the parameters from Cluster_A
will be required for registering the
location, cluster and bridge(s) in Cluster B:
- The SP_CLUSTER_ID - locationAId.aId
- The SP_BRIDGE_HOST IP address in Cluster_A - 10.10.10.1
- The public key in /usr/lib/storpool/bridge/bridge.key.txt on the remote bridge host in Cluster_A - aaaaaaaaaaaaa.bbbbbbbbbbbb.ccccccccccccc.ddddddddddddd
Similarly the commands will be:
user@hostB # storpool location add locationAId location_a
user@hostB # storpool cluster add location_a aId
user@hostB # storpool cluster list
--------------------------------------------
| name | id | location |
--------------------------------------------
| location_a-cl1 | aId | location_a |
--------------------------------------------
user@hostB # storpool remoteBridge register location_a-cl1 1.2.3.4 aaaaaaaaaaaaa.bbbbbbbbbbbb.ccccccccccccc.ddddddddddddd
user@hostB # storpool remoteBridge list
-------------------------------------------------------------------------------------------------------------------------
| ip | remote | minimumDeleteDelay | publicKey | noCrypto |
-------------------------------------------------------------------------------------------------------------------------
| 1.2.3.4 | location_a-cl1 | | aaaaaaaaaaaaa.bbbbbbbbbbbb.ccccccccccccc.ddddddddddddd | 0 |
-------------------------------------------------------------------------------------------------------------------------
At this point, provided network connectivity is working, the two bridges will be connected.
17.5. Bridge redundancy
There are two ways to add redundancy for the bridge services by configuring and
starting the storpool_bridge
service on two (or more) nodes in each cluster.
In both cases only one bridge is active at a time; it is failed over when the node or the active service is restarted.
17.5.1. Separate IP addresses
Configure and start the storpool_bridge
with a separate SP_BRIDGE_HOST
address and a separate set of public/private key pairs. In this case each of the
bridge nodes would have to be registered in the same way as explained in the
17.4. Connecting two clusters section. The SP_BRIDGE_IFACE
parameter is
unset and the SP_BRIDGE_HOST
address is expected by the storpool_bridge
service on each of the nodes where it is started.
In this case each of the bridge nodes in ClusterA
would have to be
configured in ClusterB
and vice-versa.
17.5.2. Single IP failed over between the nodes
For this, configure and start the storpool_bridge
service on the first node.
Then distribute the /usr/lib/storpool/bridge/bridge.key
and the
/usr/lib/storpool/bridge/bridge.key.txt
files on the next node where the
storpool_bridge
service will be running.
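For example, the key files could be copied to the second bridge node with scp (the host name is illustrative):
# scp -p /usr/lib/storpool/bridge/bridge.key /usr/lib/storpool/bridge/bridge.key.txt node2:/usr/lib/storpool/bridge/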
The SP_BRIDGE_IFACE
is required and represents the interface where the
SP_BRIDGE_HOST
address will be configured. The SP_BRIDGE_HOST
will be
up only on the node where the active bridge service is running until either the
service or the node itself gets restarted.
With this configuration there will be only one bridge registered in the remote
cluster(s), regardless of the number of nodes with running storpool_bridge
in the local cluster.
The failover SP_BRIDGE_HOST
is better suited for NAT/port-forwarding cases.
17.6. Bridge throughput performance
The throughput performance of a bridge connection depends on a couple of factors (not in this exact sequence) - network throughput, network latency, CPU speed and disk latency. Each could become a bottleneck and could require additional tuning in order to get a higher throughput from the available link between the two sites.
17.6.1. Network
For high-throughput links, latency is the most important factor for achieving higher link utilization. For example, a low-latency 10 Gbps link will be easily saturated (provided crypto is off), but when the latency is higher some tuning of the TCP window size will be required. The same applies to lower-bandwidth links with higher latency.
For these cases the send buffer size could be bumped in small increments so that the TCP window is optimized. Check the 12.1. Location section for more info on how to update the send buffer size in each location.
Note
For testing what would be the best send buffer size for throughput performance from primary to backup site, fill a volume with data in the primary (source) site, then create a backup to the backup (remote) site. While observing the bandwidth utilized, increase the send buffers in small increments in the source and the destination cluster until the throughput either stops rising or stays at an acceptable level.
Note that increasing the send buffers above this value can lead to delays when recovering a backup in the opposite direction.
Further sysctl changes might be required depending on the NIC driver; for more info, check the /usr/share/doc/storpool/examples/bridge/90-StorPoolBridgeTcp.conf file on the node running the storpool_bridge service.
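One possible way to apply the shipped example (a sketch; adapt it to the distribution's conventions for persistent sysctl settings):
cp /usr/share/doc/storpool/examples/bridge/90-StorPoolBridgeTcp.conf /etc/sysctl.d/   # install the example TCP tuning
sysctl --system                                                                       # reload all sysctl configuration files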
17.6.2. CPU
The CPU usually becomes a bottleneck only when crypto is turned on; in that case it sometimes helps to move the bridge service to a node with a faster CPU.
If a faster CPU is not available in the same cluster, it could help to set the SP_BRIDGE_SLEEP_TYPE option (see 6.9.8. Type of sleep for the bridge service) to hsleep, or even to no. Note that when this is configured, the storpool_cg tool will attempt to isolate a full CPU core (with the second thread free from other processes).
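A minimal sketch of that change, assuming the option is set in /etc/storpool.conf on the bridge node (the storpool_bridge service typically needs to be restarted for it to take effect):
# /etc/storpool.conf on the bridge node
SP_BRIDGE_SLEEP_TYPE=hsleep
# Verify the effective value and re-check the cgroup layout afterwards
storpool_showconf SP_BRIDGE_SLEEP_TYPE
storpool_cg check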
17.6.3. Disks throughput
The default remote recovery setting maxRemoteRecoveryRequests (see 12.22.2. Local and remote recovery) is relatively low, especially for dedicated backup clusters. Thus the underlying disks in the receiving cluster may be left underutilized (this rarely happens with flash media) and the setting becomes the bottleneck; the parameter could be tuned for higher parallelism. Here is an example: a small cluster of 3 nodes with 8 disks each translates to a default queue depth of 48 from the bridge, while there are 8 * 3 * 32 requests available from the underlying disks and (by default with a 10 Gbps link) 2048 requests available from the bridge service (256 on a 1 Gbps link).
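To make the arithmetic in the example explicit (these are only the numbers already quoted above):
# default remote recovery queue depth from the bridge for this cluster -> 48
# 3 nodes x 8 disks x 32 requests per disk                             -> 768 requests the underlying disks could sustain
# bridge service on a 10 Gbps link                                     -> 2048 requests available (256 on a 1 Gbps link)
# i.e. the 48 in-flight recovery requests are the limiting factor until maxRemoteRecoveryRequests is raised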
17.7. Exports
A snapshot in one of the clusters could be exported and become visible at all
clusters in the location it was exported to, for example a snapshot called
snap1
could be exported with:
user@hostA # storpool snapshot snap1 export location_b
It becomes visible in Cluster_B
which is part of location_b
and could be
listed with:
user@hostB # storpool snapshot list remote
-------------------------------------------------------------------------------------------------------
| location | remoteId | name | onVolume | size | creationTimestamp | tags |
-------------------------------------------------------------------------------------------------------
| location_b | locationAId.aId.1 | snap1 | | 107374182400 | 2019-08-11 15:18:02 | |
-------------------------------------------------------------------------------------------------------
The snapshot may also be exported to the location of the source cluster where it resides. This way it will become visible to all sub-clusters in that location.
17.8. Remote clones
Any snapshot export could be cloned locally. For example, to clone a remote
snapshot with globalId
of locationAId.aId.1
locally we could use:
user@hostB # storpool snapshot snap1-copy template hybrid remote location_a locationAId.aId.1
The name of the clone of the snapshot in Cluster_B will be snap1-copy, with all parameters taken from the hybrid template.
Note
Note that the name of the snapshot in Cluster_B
could also be
exactly the same in all sub-clusters in a multi-cluster setup, as well
as in clusters in different locations in a multi site setup.
The transfer will start immediately. Only written parts of the snapshot will be transferred between the sites. If snap1 has a size of 100 GB, but only 1 GB of data was ever written to the volume before it was snapshotted, only approximately 1 GB will eventually be transferred between the two (sub-)clusters.
If another snapshot based on snap1 (for example snap2) is later exported, the actual transfer will again include only the differences between snap1 and snap2, since snap1 already exists in Cluster_B.
The globalId
for this snapshot will be the same for all sites it has been
transferred to.
17.9. Creating a remote backup on a volume
The volume backup feature is in essence a set of steps that automate the backup procedure for a particular volume.
For example to backup a volume named volume1
in Cluster_A
to
Cluster_B
we will use:
user@hostA # storpool volume volume1 backup Cluster_B
The above command will actually trigger the following set of events:
1. Creates a local temporary snapshot of volume1 in Cluster_A to be transferred to Cluster_B.
2. Exports the temporary snapshot to Cluster_B.
3. Instructs Cluster_B to initiate the transfer for this snapshot.
4. Exports the transferred snapshot in Cluster_B so that it is visible from Cluster_A.
5. Deletes the local temporary snapshot.
For example, if a backup operation has been initiated for a volume called
volume1
in Cluster_A
, the progress of the operation could be followed
with:
user@hostA # storpool snapshot list exports
-------------------------------------------------------------
| location | snapshot | globalId | backingUp |
-------------------------------------------------------------
| location_b | volume1@1433 | locationAId.aId.p | true |
-------------------------------------------------------------
Once this operation completes, the temporary snapshot will no longer be visible as an export, and a snapshot with the same globalId will be visible remotely:
user@hostA # storpool snapshot list remote
------------------------------------------------------------------------------------------------------
| location | remoteId | name | onVolume | size | creationTimestamp | tags |
------------------------------------------------------------------------------------------------------
| location_b | locationAId.aId.p | volume1 | volume1 | 107374182400 | 2019-08-13 16:27:03 | |
------------------------------------------------------------------------------------------------------
Note
You must have a template configured in mgmtConfig backupTemplateName
in Cluster_B
for this to work.
17.10. Creating an atomic remote backup for multiple volumes
Sometimes a set of volumes is used simultaneously in the same virtual machine; an example would be different filesystems for a database and its journal. In order to be able to restore all volumes to the same point in time, a group backup could be initiated:
user@hostA# storpool volume groupBackup Cluster_B volume1 volume2
Note
The same underlying feature is used by VolumeCare for keeping consistent snapshots of all volumes on a virtual machine.
17.11. Restoring a volume from remote snapshot
Restoring the volume to a previous state from a remote snapshot requires the following steps:
Create a local snapshot from the remotely exported one:
user@hostA # storpool snapshot volume1-snap template hybrid remote location_b locationAId.aId.p
OK
There are some bits to explain in the above example, from left to right:
- volume1-snap - the name of the local snapshot that will be created.
- template hybrid - instructs StorPool what the replication and placement of the locally created snapshot will be.
- remote location_b locationAId.aId.p - instructs StorPool where to look for this snapshot and what its globalId is.
Tip
If the bridges and the connection between the locations are operational, the transfer will begin immediately.
Next, create a volume with the newly created snapshot as a parent:
user@hostA # storpool volume volume1-tmp parent volume1-snap
Finally, the volume clone would have to be attached where it is needed.
The last two steps could be changed a bit: rename the old volume to something different, then directly create a volume with the original name from the restored snapshot. This is handled differently in different orchestration systems. The procedure for restoring multiple volumes from a group backup requires the same set of steps.
See VolumeCare node info for an example implementation.
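A hedged sketch of that rename-and-recreate flow follows; the volume rename sub-command and the attach invocation shown here are assumptions, so verify them against the CLI reference for the release in use:
user@hostA # storpool volume volume1 rename volume1-old    # assumed rename syntax; verify before use
user@hostA # storpool volume volume1 parent volume1-snap   # recreate the original name from the restored snapshot
user@hostA # storpool attach volume volume1 client 12      # example client ID; adjust to the orchestration in use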
Note
From 19.01 onwards, if the snapshot transfer hasn't completed yet when the volume is created, read operations on an object that is not yet transferred will be forwarded through the bridge and processed by the remote cluster.
17.12. Remote deferred deletion
Note
This feature is available for both multi-cluster and multi-site
configurations. Note that the minimumDeleteDelay
is per bridge,
not per location, thus all bridges to a remote location should be
(re)registered with the setting.
The remote bridge could be registered with remote deferred deletion enabled. This feature enables a user in Cluster_A to unexport remote snapshots in Cluster_B and set them for deferred deletion.
An example for the case without deferred deletion enabled - Cluster_A
and
Cluster_B
are two StorPool clusters in locations A
and B
connected
with a bridge. A volume named volume1
in Cluster_A
has two backup
snapshots in Cluster_B
called volume1@281
and volume1@294
.
The remote snapshots could be unexported from Cluster_A
with the
deleteAfter
flag, but it will be silently ignored in Cluster_B
.
To enable this feature the following steps would have to be completed in the
remote bridge for Cluster_A
:
1. The bridge in Cluster_A should be registered with minimumDeleteDelay in Cluster_B.
2. Deferred snapshot deletion should be enabled in Cluster_B; for details, see 12.22. Management configuration.
This will enable setting up the deleteAfter
parameter on an unexport
operation in Cluster_B
initiated from Cluster_A
.
With the above example volume and remote snapshots, a user in Cluster_A
could unexport the volume1@294
snapshot and set its deleteAfter
flag to
7 days from the unexport with:
user@hostA # storpool snapshot remote location_b locationAId.aId.q unexport deleteAfter 7d
OK
After the completion of this operation the following events will occur:
1. The volume1@294 snapshot will immediately stop being visible in Cluster_A.
2. The snapshot will get a deleteAfter flag with a timestamp a week from the time of the unexport call.
3. A week later the snapshot will be deleted, however only if deferred snapshot deletion is still turned on.
17.13. Volume and snapshot move
17.13.1. Volume move
A volume could be moved to a neighboring sub-cluster in a multi-cluster environment either while attached (live) or without attachment (offline). This is available only in multi-cluster setups and is not possible for multi-site, where only snapshots could be transferred.
To move a volume use:
# storpool volume <volumeName> moveToRemote <clusterName>
The above command will succeed only if the volume is not attached on any of the nodes in this sub-cluster. To move the volume live while it is still attached, the additional onAttached option should instruct the cluster how to proceed. For example, this command:
Lab-D-cl1> volume test moveToRemote Lab-D-cl2 onAttached export
will move the volume to the Lab-D-cl2 sub-cluster and, if the volume is attached in the present cluster, will export it back to Lab-D-cl1.
This is equivalent to:
Lab-D-cl1> multiCluster on
[MC] Lab-D-cl1> cluster cmd Lab-D-cl2 attach volume test client 12
OK
The same could also be achieved by directly executing the attach command in multi-cluster mode on a host in the Lab-D-cl2 cluster.
Note
Moving a volume will also trigger moving all of its snapshots. When a parent snapshot has many child volumes, it may end up with a copy in each sub-cluster to which its child volumes were moved, as a space-saving measure.
17.13.2. Snapshot move
Moving a snapshot is essentially the same as moving a volume, with the difference that it cannot be moved when attached.
For example:
Lab-D-cl1> snapshot testsnap moveToRemote Lab-D-cl2
This will succeed only if the snapshot is not attached locally.
Moving a snapshot that is part of a volume snapshot chain will also trigger copying of its parent snapshots, which is handled automatically by the cluster.
18. Rebalancing the cluster
18.1. Overview
In some situations the data in the StorPool cluster needs to be rebalanced. This is performed by the balancer and the relocator tools. The relocator is an integral part of the StorPool management service, while the balancer is presently an external tool, executed on one of the nodes with access to the API.
Note
Be advised that the balancer tool will create the files it needs in the current working directory.
18.2. Rebalancing procedure
The rebalancing operation is performed in the following steps (a typical run is sketched after this list):
1. The balancer tool is executed to calculate the new state of the cluster.
2. The results from the balancer are verified by a set of automated scripts.
3. The results are also manually reviewed to check whether they contain any inconsistencies and whether they achieve the intended goals. These results are available by running storpool balancer disks and will be printed at the end of balancer.sh.
4. If the result is not satisfactory, the balancer is executed with different parameters until a satisfactory result is obtained.
5. Once the proposed end result is satisfactory, the calculated state is loaded into the relocator tool by doing storpool balancer commit. Note that this step can be reversed only with the --restore-state option, which will revert to the initial state; "cancelling" a balancing operation that has already run for a while is currently not supported.
6. The relocator tool performs the actual move of the data.
7. The progress of the relocator tool can be monitored with storpool task list for the currently running tasks, storpool relocator status for an overview of the relocator state, and storpool relocator disks (warning: slow command) for the full relocation state.
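Putting these steps together, a typical interactive run (using only commands already shown in this chapter; the working directory is just the recommended convention) could look like this:
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer   # the balancer keeps its state files here
/usr/lib/storpool/balancer.sh -F -c 0                    # calculate a proposed new state; adjust options as needed
storpool balancer disks                                  # review the proposed per-disk changes
storpool balancer commit                                 # load the calculated state into the relocator
storpool relocator status                                # overview of the relocation progress
storpool task list                                       # currently running relocation tasks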
18.3. Options
The balancer tool is executed via the /usr/lib/storpool/balancer.sh
wrapper
and accepts the following options:
- -A
Don’t only move data from fuller to emptier drives.
- -b placementGroup
Use disks in the specified placement group to restore replication in critical conditions.
- -c factor
Factor for how much data to try to move around, from 0 to 10. The default is 0 (required parameter). In most cases -c is not needed; the main use case with -c 10 is for 3-node clusters, where data needs to be "rotated" through the cluster.
- -d diskId [-d diskId]
Put data only on the selected disks.
- -D diskId [-D diskId]
Don’t move data from those disks.
- --do-whatever-you-can
Emergency use only, and after balancer failed; will decrease the redundancy level.
- -E 0-99
Don’t empty if below, in percents
- --empty-down-disks
Proceed with balancing even when there are down disks, and remove all data from them.
- -f percent
Allow drives to be filled up to this percentage, from 0 to 99. Default 90.
- -F
Only move data from fuller to emptier drives.
- -g placementGroup
Work only on the specified placement group.
- --ignore-down-disks
Proceed with balancing even when there are down disks, and do not remove data from them.
- --ignore-src-pg-violations
Ignore placement group violations on the source disks (as the option name says).
- -m maxAgCount
Limit the maximum allocation group count on drives to this (effectively their usable size).
- -M maxDataToAdd
Limit the amount of data to copy to a single drive, to be able to rebalance “in pieces”.
- --max-disbalance-before-striping X
In percents.
- --min-disk-full X
Don’t remove data from disk if it is not at least this X% full.
- --min-replication R
Minimum replication required.
- -o overridesPgName
Specify the override placement group name (required only if the override template is not created). For more information, see Volume overrides.
- --only-empty-disk diskId
Like -D for all other disks.
- -R
Only restore replication for degraded volumes.
- --restore-state
Revert to the initial state of the disks (before the balancer commit execution).
- -S
Prefer tail SSD.
- -V vagId [-V vagId]
Skip balancing vagId.
- -v
Verbose output. Shows how all drives in the cluster would be affected according to the balancer. This differs from the later output from
storpool balancer disks
, which is the point of view ofstorpool_mgmt
, as that also takes into account all currently loaded relocations.
-A
and -F
are the reverse of each other and mutually exclusive.
The -c value is basically the trade-off between the uniformity of the data on the disks and the amount of data moved to accomplish that. A lower factor means less data to be moved around, but sometimes more inequality between the data on the disks; a higher one means more data to be moved, but sometimes with a better result in terms of how evenly the data is spread across the drives.
On clusters with drives of unsupported size (HDDs > 4 TB) the -m option is required. It limits the data moved onto these drives to the set number of allocation groups. This is done because the performance per TB of larger drives is lower, which degrades the performance of the whole cluster in high-performance use cases.
The -M
option is useful when a full rebalancing would involve many tasks
until completed and could impact other operations (such as remote transfers, or
the time required for a currently running recovery to complete). With the -M
option the amount of data loaded by the balancer for each disk may be reduced,
and a more rebalanced state is achieved through several smaller rebalancing
operations.
The -f
option is required on clusters whose drives are full above 90%.
Extreme care should be used when balancing in such cases.
The -b
option could be used to move data between placementGroups (in most
cases from SSDs to HDDs).
18.4. Restoring volume redundancy on a failed drive
Situation: we have lost drive 1802 in placementGroup ssd
. We want to remove
it from the cluster and restore the redundancy of the data. We need to do the
following:
storpool disk 1802 forget # this will also remove the drive from all placement groups it participated in
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -R
storpool balancer commit # to actually load the data into the relocator and start the re-balancing operation
18.5. Restoring volume redundancy for two failed drives (single-copy situation)
(Emergency) Situation: we have lost drives 1802 and 1902 in placementGroup
ssd
. We want to remove them from the cluster and restore the redundancy of
the data. We need to do the following:
storpool disk 1802 forget # this will also remove the drive from all placement groups it participated in
storpool disk 1902 forget # this will also remove the drive from all placement groups it participated in
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -F --min-replication 2 # first balancing run, to create a second copy of the data
storpool balancer commit # to actually load the data into the relocator and start the re-balancing operation
# wait for the balancing to finish
/usr/lib/storpool/balancer.sh -R # second balancing run, to restore full redundancy
storpool balancer commit # to actually load the data into the relocator and start the re-balancing operation
18.6. Adding new drives and rebalancing data on them
Situation: we have added SSDs 1201, 1202 and HDDs 1510, 1511, that need to go
into placement groups ssd
and hdd
respectively, and we want to
re-balance the cluster data so that it is re-dispersed onto the new disks as
well. We have no other placement groups in the cluster.
storpool placementGroup ssd addDisk 1201 addDisk 1202
storpool placementGroup hdd addDisk 1510 addDisk 1511
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -F -c 0 # rebalance all placement groups, move data from fuller to emptier drives
storpool balancer commit # to actually load the data into the relocator and start the re-balancing operation
18.7. Restoring volume redundancy with rebalancing data on other placementGroup
Situation: we have to restore the redundancy of a hybrid cluster (2 copies on
HDDs, one on SSDs) while the ssd
placementGroup is out of free space because
a few SSDs have recently failed. We can’t replace the failed drives with new
ones for the moment.
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -R -c 0 -b hdd # use placementGroup ``hdd`` as a backup and move some data from SSDs
storpool balancer commit # to actually load the data into the relocator and start the re-balancing operation
Note
The -f
argument could be further used in order to instruct the
balancer how full to keep the cluster and thus control how much data
will be moved in the backup placement group.
18.8. Decommissioning a live node
Situation: a node in the cluster needs to be decommissioned, so that the data on
its drives needs to be moved away. The drive numbers on that node are 101
,
102
and 103
.
Note
You have to make sure you have enough space to restore the redundancy before proceeding.
storpool disk 101 softEject # mark all drives for evacuation
storpool disk 102 softEject
storpool disk 103 softEject
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -R -c 0 # rebalance all placement groups, -F has the same effect in this case
storpool balancer commit # to actually load the data into the relocator and start the re-balancing operation
18.9. Decommissioning a dead node
Situation: a node in the cluster needs to be decommissioned, as it has died and
cannot be brought back. The drive numbers on that node are 101
, 102
and
103
.
Note
You have to make sure you have enough space to restore the redundancy before proceeding.
storpool disk 101 forget # remove the drives from all placement groups
storpool disk 102 forget
storpool disk 103 forget
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -R -c 0 # rebalance all placement groups
storpool balancer commit # to actually load the data into the relocator and start the re-balancing operation
18.10. Resolving imbalances in the drive usage
Situation: we have an imbalance in the drive usage in the whole cluster and we want to improve it.
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -F -c 0 # rebalance all placement groups
/usr/lib/storpool/balancer.sh -F -c 3 # retry to see if we get a better result with more data movements
storpool balancer commit # to actually load the data into the relocator and start the re-balancing operation
18.11. Resolving imbalances in the drive usage with three-node clusters
Situation: we have an imbalance in the drive usage in the whole cluster and we want to improve it. We have a three-node hybrid cluster and proper balancing requires larger moves of “unrelated” data:
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -F -c 0 # rebalance all placement groups
/usr/lib/storpool/balancer.sh -A -c 10 # retry to see if we get a better result with more data movements
storpool balancer commit # to actually load the data into the relocator and start the re-balancing operation
18.12. Reverting balancer to a previous state
Situation: we have committed a rebalancing operation, but want to revert back to the previous state:
cd ~/storpool/balancer # it's recommended to run the following commands in a screen/tmux session
ls # list all saved states and choose what to revert to
/usr/lib/storpool/balancer.sh --restore-state 2022-10-28-15-39-40 # revert to 2022-10-28-15-39-40
storpool balancer commit # to actually load the data into the relocator and start the re-balancing operation
18.13. Reading the output of storpool balancer disks
Here is an example output from storpool balancer disks
:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| disk | server | size | stored | on-disk | objects |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 1 | 14.0 | 373 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 6.1 MB -> 6.1 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 405000 |
| 1101 | 11.0 | 447 GB | 16 GB -> 15 GB (-1.0 GB / 1.4 GB) | 18 GB -> 17 GB (-1.1 GB / 1.4 GB) | 11798 -> 10040 (-1758 / +3932) / 480000 |
| 1102 | 11.0 | 447 GB | 16 GB -> 15 GB (-268 MB / 1.3 GB) | 17 GB -> 17 GB (-301 MB / 1.4 GB) | 10843 -> 10045 (-798 / +4486) / 480000 |
| 1103 | 11.0 | 447 GB | 16 GB -> 15 GB (-1.0 GB / 1.8 GB) | 18 GB -> 16 GB (-1.2 GB / 1.9 GB) | 12123 -> 10039 (-2084 / +3889) / 480000 |
| 1104 | 11.0 | 447 GB | 16 GB -> 15 GB (-757 MB / 1.3 GB) | 17 GB -> 16 GB (-899 MB / 1.3 GB) | 11045 -> 10072 (-973 / +4279) / 480000 |
| 1111 | 11.0 | 466 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 5.1 MB -> 5.1 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 495000 |
| 1112 | 11.0 | 366 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 5.1 MB -> 5.1 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 390000 |
| 1121 | 11.0 | 931 GB | 22 GB -> 21 GB (-1009 MB / 830 MB) | 22 GB -> 21 GB (-1.0 GB / 872 MB) | 13713 -> 12698 (-1015 / +3799) / 975000 |
| 1122 | 11.0 | 931 GB | 21 GB -> 21 GB (-373 MB / 2.0 GB) | 22 GB -> 21 GB (-379 MB / 2.0 GB) | 13469 -> 12742 (-727 / +3801) / 975000 |
| 1123 | 11.0 | 931 GB | 22 GB -> 21 GB (-1.1 GB / 1.9 GB) | 22 GB -> 21 GB (-1.1 GB / 2.0 GB) | 14859 -> 12629 (-2230 / +4102) / 975000 |
| 1124 | 11.0 | 931 GB | 21 GB -> 21 GB (36 MB / 1.8 GB) | 21 GB -> 21 GB (92 MB / 1.9 GB) | 13806 -> 12743 (-1063 / +3389) / 975000 |
| 1201 | 12.0 | 447 GB | 18 GB -> 15 GB (-2.9 GB / 633 MB) | 19 GB -> 16 GB (-3.0 GB / 658 MB) | 14148 -> 10070 (-4078 / +3050) / 480000 |
| 1202 | 12.0 | 447 GB | 17 GB -> 15 GB (-2.1 GB / 787 MB) | 19 GB -> 16 GB (-2.3 GB / 815 MB) | 13243 -> 10067 (-3176 / +2576) / 480000 |
| 1203 | 12.0 | 447 GB | 17 GB -> 15 GB (-2.0 GB / 3.3 GB) | 19 GB -> 16 GB (-2.4 GB / 3.5 GB) | 12746 -> 10062 (-2684 / +3375) / 480000 |
| 1204 | 12.0 | 447 GB | 18 GB -> 15 GB (-2.7 GB / 1.1 GB) | 19 GB -> 16 GB (-2.9 GB / 1.1 GB) | 12835 -> 10075 (-2760 / +3248) / 480000 |
| 1212 | 12.0 | 366 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 6.0 MB -> 6.0 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 390000 |
| 1221 | 12.0 | 931 GB | 20 GB -> 21 GB (569 MB / 1.5 GB) | 21 GB -> 21 GB (587 MB / 1.6 GB) | 13115 -> 12616 (-499 / +3736) / 975000 |
| 1222 | 12.0 | 931 GB | 22 GB -> 21 GB (-979 MB / 307 MB) | 22 GB -> 21 GB (-1013 MB / 317 MB) | 12938 -> 12697 (-241 / +3291) / 975000 |
| 1223 | 12.0 | 931 GB | 22 GB -> 21 GB (-1.1 GB / 781 MB) | 22 GB -> 21 GB (-1.2 GB / 812 MB) | 13968 -> 12718 (-1250 / +3302) / 975000 |
| 1224 | 12.0 | 931 GB | 21 GB -> 21 GB (-784 MB / 332 MB) | 22 GB -> 21 GB (-810 MB / 342 MB) | 13741 -> 12692 (-1049 / +3314) / 975000 |
| 1225 | 12.0 | 931 GB | 21 GB -> 21 GB (-681 MB / 849 MB) | 22 GB -> 21 GB (-701 MB / 882 MB) | 13608 -> 12748 (-860 / +3420) / 975000 |
| 1226 | 12.0 | 931 GB | 22 GB -> 21 GB (-1.1 GB / 825 MB) | 22 GB -> 21 GB (-1.1 GB / 853 MB) | 13066 -> 12692 (-374 / +3817) / 975000 |
| 1301 | 13.0 | 447 GB | 13 GB -> 15 GB (2.6 GB / 4.2 GB) | 14 GB -> 17 GB (2.7 GB / 4.4 GB) | 7244 -> 10038 (+2794 / +6186) / 480000 |
| 1302 | 13.0 | 447 GB | 12 GB -> 15 GB (3.0 GB / 3.7 GB) | 13 GB -> 17 GB (3.1 GB / 3.9 GB) | 7507 -> 10063 (+2556 / +5619) / 480000 |
| 1303 | 13.0 | 447 GB | 14 GB -> 15 GB (1.3 GB / 3.2 GB) | 15 GB -> 17 GB (1.3 GB / 3.4 GB) | 7888 -> 10038 (+2150 / +5884) / 480000 |
| 1304 | 13.0 | 447 GB | 13 GB -> 15 GB (2.7 GB / 3.7 GB) | 14 GB -> 17 GB (2.8 GB / 3.9 GB) | 7660 -> 10045 (+2385 / +5870) / 480000 |
| 1311 | 13.0 | 466 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 6.1 MB -> 6.1 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 495000 |
| 1312 | 13.0 | 366 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 6.1 MB -> 6.1 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 390000 |
| 1321 | 13.0 | 931 GB | 21 GB -> 21 GB (-193 MB / 1.1 GB) | 21 GB -> 21 GB (-195 MB / 1.2 GB) | 13365 -> 12765 (-600 / +5122) / 975000 |
| 1322 | 13.0 | 931 GB | 22 GB -> 21 GB (-1.4 GB / 1.1 GB) | 23 GB -> 21 GB (-1.4 GB / 1.1 GB) | 12749 -> 12739 (-10 / +4651) / 975000 |
| 1323 | 13.0 | 931 GB | 21 GB -> 21 GB (-504 MB / 2.2 GB) | 22 GB -> 21 GB (-496 MB / 2.3 GB) | 13386 -> 12695 (-691 / +4583) / 975000 |
| 1325 | 13.0 | 931 GB | 21 GB -> 20 GB (-698 MB / 557 MB) | 22 GB -> 21 GB (-717 MB / 584 MB) | 13113 -> 12768 (-345 / +2668) / 975000 |
| 1326 | 13.0 | 931 GB | 21 GB -> 21 GB (-507 MB / 724 MB) | 22 GB -> 21 GB (-522 MB / 754 MB) | 13690 -> 12704 (-986 / +3327) / 975000 |
| 1401 | 14.0 | 223 GB | 8.3 GB -> 7.6 GB (-666 MB / 868 MB) | 9.3 GB -> 8.5 GB (-781 MB / 901 MB) | 3470 -> 5043 (+1573 / +2830) / 240000 |
| 1402 | 14.0 | 447 GB | 9.8 GB -> 15 GB (5.6 GB / 5.7 GB) | 11 GB -> 17 GB (5.8 GB / 6.0 GB) | 4358 -> 10060 (+5702 / +6667) / 480000 |
| 1403 | 14.0 | 224 GB | 8.2 GB -> 7.6 GB (-623 MB / 1.1 GB) | 9.3 GB -> 8.6 GB (-710 MB / 1.2 GB) | 4547 -> 5036 (+489 / +2814) / 240000 |
| 1404 | 14.0 | 224 GB | 8.4 GB -> 7.6 GB (-773 MB / 1.5 GB) | 9.4 GB -> 8.5 GB (-970 MB / 1.6 GB) | 4369 -> 5031 (+662 / +2368) / 240000 |
| 1411 | 14.0 | 466 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 6.0 MB -> 6.0 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 495000 |
| 1412 | 14.0 | 366 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 6.0 MB -> 6.0 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 390000 |
| 1421 | 14.0 | 931 GB | 19 GB -> 21 GB (1.9 GB / 2.6 GB) | 19 GB -> 21 GB (2.0 GB / 2.7 GB) | 10670 -> 12624 (+1954 / +6196) / 975000 |
| 1422 | 14.0 | 931 GB | 19 GB -> 21 GB (1.6 GB / 3.2 GB) | 20 GB -> 21 GB (1.6 GB / 3.3 GB) | 10653 -> 12844 (+2191 / +6919) / 975000 |
| 1423 | 14.0 | 931 GB | 19 GB -> 21 GB (1.9 GB / 2.5 GB) | 19 GB -> 21 GB (2.0 GB / 2.6 GB) | 10715 -> 12688 (+1973 / +5846) / 975000 |
| 1424 | 14.0 | 931 GB | 18 GB -> 20 GB (2.2 GB / 2.9 GB) | 19 GB -> 21 GB (2.3 GB / 3.0 GB) | 10723 -> 12686 (+1963 / +5505) / 975000 |
| 1425 | 14.0 | 931 GB | 19 GB -> 21 GB (1.3 GB / 2.5 GB) | 20 GB -> 21 GB (1.4 GB / 2.6 GB) | 10702 -> 12689 (+1987 / +5486) / 975000 |
| 1426 | 14.0 | 931 GB | 20 GB -> 21 GB (1.0 GB / 2.5 GB) | 20 GB -> 21 GB (1.0 GB / 2.6 GB) | 10737 -> 12609 (+1872 / +5771) / 975000 |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 45 | 4.0 | 29 TB | 652 GB -> 652 GB (512 MB / 69 GB) | 686 GB -> 685 GB (-240 MB / 72 GB) | 412818 -> 412818 (+0 / +159118) / 30885000 |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Let’s start with the last line. Here’s the meaning, field by field:
There are 45 drives in total.
There are 4 server instances.
The total disk capacity is 29 TB.
The stored data is 652 GB and will change to 652 GB. The total change for all drives afterwards is 512 MB, and the total amount of changes for the drives is 69 GB (i.e. how much will they “recover” from other drives).
The same is repeated for the on-disk size. Here the total amount of changes is roughly the amount of data that would need to be copied.
The total current number of objects will not change (i.e. from 412818 to 412818), 0 new objects will be created, the total amount of objects to be moved is 159118, and the total number of possible objects in the cluster is 30885000.
The difference between "stored" and "on-disk" size is that the latter also includes the size of checksums and metadata.
For the rest of the lines, the data is basically the same, just per disk.
What needs to be taken into account is:
Are there drives that will have too much data on them? Here, both data size and objects must be checked, and they should be close to the average percentage for the placement group.
Is the data stored on the drives balanced, i.e. are all the drives’ usages close to the average?
Are there drives that should have data on them, but nothing is scheduled to be moved?
This usually happens because a drive wasn’t added to the right placement group.
Will there be too much data to be moved?
To illustrate the difference of amount to be moved, here is the output of
storpool balancer disks
from a run with -c 10
:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| disk | server | size | stored | on-disk | objects |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 1 | 14.0 | 373 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 6.1 MB -> 6.1 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 405000 |
| 1101 | 11.0 | 447 GB | 16 GB -> 15 GB (-1.0 GB / 1.7 GB) | 18 GB -> 17 GB (-1.1 GB / 1.7 GB) | 11798 -> 10027 (-1771 / +5434) / 480000 |
| 1102 | 11.0 | 447 GB | 16 GB -> 15 GB (-263 MB / 1.7 GB) | 17 GB -> 17 GB (-298 MB / 1.7 GB) | 10843 -> 10000 (-843 / +5420) / 480000 |
| 1103 | 11.0 | 447 GB | 16 GB -> 15 GB (-1.0 GB / 3.6 GB) | 18 GB -> 16 GB (-1.2 GB / 3.8 GB) | 12123 -> 10005 (-2118 / +6331) / 480000 |
| 1104 | 11.0 | 447 GB | 16 GB -> 15 GB (-752 MB / 2.7 GB) | 17 GB -> 16 GB (-907 MB / 2.8 GB) | 11045 -> 10098 (-947 / +5214) / 480000 |
| 1111 | 11.0 | 466 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 5.1 MB -> 5.1 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 495000 |
| 1112 | 11.0 | 366 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 5.1 MB -> 5.1 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 390000 |
| 1121 | 11.0 | 931 GB | 22 GB -> 21 GB (-1003 MB / 6.4 GB) | 22 GB -> 21 GB (-1018 MB / 6.7 GB) | 13713 -> 12742 (-971 / +9712) / 975000 |
| 1122 | 11.0 | 931 GB | 21 GB -> 21 GB (-368 MB / 5.8 GB) | 22 GB -> 21 GB (-272 MB / 6.1 GB) | 13469 -> 12718 (-751 / +8929) / 975000 |
| 1123 | 11.0 | 931 GB | 22 GB -> 21 GB (-1.1 GB / 5.9 GB) | 22 GB -> 21 GB (-1.1 GB / 6.1 GB) | 14859 -> 12699 (-2160 / +8992) / 975000 |
| 1124 | 11.0 | 931 GB | 21 GB -> 21 GB (57 MB / 7.4 GB) | 21 GB -> 21 GB (113 MB / 7.7 GB) | 13806 -> 12697 (-1109 / +9535) / 975000 |
| 1201 | 12.0 | 447 GB | 18 GB -> 15 GB (-2.8 GB / 1.2 GB) | 19 GB -> 17 GB (-3.0 GB / 1.2 GB) | 14148 -> 10033 (-4115 / +4853) / 480000 |
| 1202 | 12.0 | 447 GB | 17 GB -> 15 GB (-2.0 GB / 1.6 GB) | 19 GB -> 16 GB (-2.2 GB / 1.7 GB) | 13243 -> 10055 (-3188 / +4660) / 480000 |
| 1203 | 12.0 | 447 GB | 17 GB -> 15 GB (-2.0 GB / 2.3 GB) | 19 GB -> 16 GB (-2.3 GB / 2.4 GB) | 12746 -> 10070 (-2676 / +4682) / 480000 |
| 1204 | 12.0 | 447 GB | 18 GB -> 15 GB (-2.7 GB / 2.1 GB) | 19 GB -> 16 GB (-2.8 GB / 2.2 GB) | 12835 -> 10110 (-2725 / +5511) / 480000 |
| 1212 | 12.0 | 366 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 6.0 MB -> 6.0 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 390000 |
| 1221 | 12.0 | 931 GB | 20 GB -> 21 GB (620 MB / 6.3 GB) | 21 GB -> 21 GB (805 MB / 6.7 GB) | 13115 -> 12542 (-573 / +9389) / 975000 |
| 1222 | 12.0 | 931 GB | 22 GB -> 21 GB (-981 MB / 2.9 GB) | 22 GB -> 21 GB (-1004 MB / 3.0 GB) | 12938 -> 12793 (-145 / +8795) / 975000 |
| 1223 | 12.0 | 931 GB | 22 GB -> 21 GB (-1.1 GB / 5.9 GB) | 22 GB -> 21 GB (-1.1 GB / 6.1 GB) | 13968 -> 12698 (-1270 / +10094) / 975000 |
| 1224 | 12.0 | 931 GB | 21 GB -> 21 GB (-791 MB / 4.5 GB) | 22 GB -> 21 GB (-758 MB / 4.7 GB) | 13741 -> 12684 (-1057 / +8616) / 975000 |
| 1225 | 12.0 | 931 GB | 21 GB -> 21 GB (-671 MB / 4.8 GB) | 22 GB -> 21 GB (-677 MB / 4.9 GB) | 13608 -> 12690 (-918 / +8559) / 975000 |
| 1226 | 12.0 | 931 GB | 22 GB -> 21 GB (-1.1 GB / 6.2 GB) | 22 GB -> 21 GB (-1.1 GB / 6.4 GB) | 13066 -> 12737 (-329 / +9386) / 975000 |
| 1301 | 13.0 | 447 GB | 13 GB -> 15 GB (2.6 GB / 4.5 GB) | 14 GB -> 17 GB (2.7 GB / 4.6 GB) | 7244 -> 10077 (+2833 / +6714) / 480000 |
| 1302 | 13.0 | 447 GB | 12 GB -> 15 GB (3.0 GB / 4.9 GB) | 13 GB -> 17 GB (3.2 GB / 5.2 GB) | 7507 -> 10056 (+2549 / +7011) / 480000 |
| 1303 | 13.0 | 447 GB | 14 GB -> 15 GB (1.3 GB / 3.2 GB) | 15 GB -> 17 GB (1.3 GB / 3.3 GB) | 7888 -> 10020 (+2132 / +6926) / 480000 |
| 1304 | 13.0 | 447 GB | 13 GB -> 15 GB (2.7 GB / 4.7 GB) | 14 GB -> 17 GB (2.8 GB / 4.9 GB) | 7660 -> 10075 (+2415 / +7049) / 480000 |
| 1311 | 13.0 | 466 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 6.1 MB -> 6.1 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 495000 |
| 1312 | 13.0 | 366 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 6.1 MB -> 6.1 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 390000 |
| 1321 | 13.0 | 931 GB | 21 GB -> 21 GB (-200 MB / 4.1 GB) | 21 GB -> 21 GB (-192 MB / 4.3 GB) | 13365 -> 12690 (-675 / +9527) / 975000 |
| 1322 | 13.0 | 931 GB | 22 GB -> 21 GB (-1.3 GB / 6.9 GB) | 23 GB -> 21 GB (-1.3 GB / 7.2 GB) | 12749 -> 12698 (-51 / +10047) / 975000 |
| 1323 | 13.0 | 931 GB | 21 GB -> 21 GB (-495 MB / 6.1 GB) | 22 GB -> 21 GB (-504 MB / 6.3 GB) | 13386 -> 12693 (-693 / +9524) / 975000 |
| 1325 | 13.0 | 931 GB | 21 GB -> 21 GB (-620 MB / 6.6 GB) | 22 GB -> 21 GB (-612 MB / 6.9 GB) | 13113 -> 12768 (-345 / +9942) / 975000 |
| 1326 | 13.0 | 931 GB | 21 GB -> 21 GB (-498 MB / 7.1 GB) | 22 GB -> 21 GB (-414 MB / 7.4 GB) | 13690 -> 12697 (-993 / +9759) / 975000 |
| 1401 | 14.0 | 223 GB | 8.3 GB -> 7.6 GB (-670 MB / 950 MB) | 9.3 GB -> 8.5 GB (-789 MB / 993 MB) | 3470 -> 5061 (+1591 / +3262) / 240000 |
| 1402 | 14.0 | 447 GB | 9.8 GB -> 15 GB (5.6 GB / 7.1 GB) | 11 GB -> 17 GB (5.8 GB / 7.5 GB) | 4358 -> 10052 (+5694 / +7092) / 480000 |
| 1403 | 14.0 | 224 GB | 8.2 GB -> 7.6 GB (-619 MB / 730 MB) | 9.3 GB -> 8.5 GB (-758 MB / 759 MB) | 4547 -> 5023 (+476 / +2567) / 240000 |
| 1404 | 14.0 | 224 GB | 8.4 GB -> 7.6 GB (-790 MB / 915 MB) | 9.4 GB -> 8.5 GB (-918 MB / 946 MB) | 4369 -> 5062 (+693 / +2483) / 240000 |
| 1411 | 14.0 | 466 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 6.0 MB -> 6.0 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 495000 |
| 1412 | 14.0 | 366 GB | 4.7 MB -> 4.7 MB (0 B / 0 B) | 6.0 MB -> 6.0 MB (0 B / 0 B) | 26 -> 26 (+0 / +0) / 390000 |
| 1421 | 14.0 | 931 GB | 19 GB -> 21 GB (2.0 GB / 6.8 GB) | 19 GB -> 21 GB (2.1 GB / 7.0 GB) | 10670 -> 12695 (+2025 / +10814) / 975000 |
| 1422 | 14.0 | 931 GB | 19 GB -> 21 GB (1.6 GB / 7.4 GB) | 20 GB -> 21 GB (1.7 GB / 7.7 GB) | 10653 -> 12702 (+2049 / +10414) / 975000 |
| 1423 | 14.0 | 931 GB | 19 GB -> 21 GB (2.0 GB / 7.4 GB) | 19 GB -> 21 GB (2.1 GB / 7.8 GB) | 10715 -> 12683 (+1968 / +10418) / 975000 |
| 1424 | 14.0 | 931 GB | 18 GB -> 21 GB (2.2 GB / 8.0 GB) | 19 GB -> 21 GB (2.3 GB / 8.3 GB) | 10723 -> 12824 (+2101 / +9573) / 975000 |
| 1425 | 14.0 | 931 GB | 19 GB -> 21 GB (1.3 GB / 5.8 GB) | 20 GB -> 21 GB (1.4 GB / 6.1 GB) | 10702 -> 12686 (+1984 / +10231) / 975000 |
| 1426 | 14.0 | 931 GB | 20 GB -> 21 GB (1.0 GB / 6.5 GB) | 20 GB -> 21 GB (1.2 GB / 6.8 GB) | 10737 -> 12650 (+1913 / +10974) / 975000 |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 45 | 4.0 | 29 TB | 652 GB -> 653 GB (1.2 GB / 173 GB) | 686 GB -> 687 GB (1.2 GB / 180 GB) | 412818 -> 412818 (+0 / +288439) / 30885000 |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
This time the total amount of data to be moved is 180 GB. It's possible to have a difference of an order of magnitude in the total data to be moved between -c 0 and -c 10. Usually the best results are achieved by using -F directly, with rare occasions requiring full re-balancing (i.e. no -F and higher -c values).
18.13.1. Balancer tool output
Here’s an example of the output of the balancer tool, in non-verbose mode:
1 -== BEFORE BALANCE ==-
2 shards with decreased redundancy 0 (0, 0, 0)
3 server constraint violations 0
4 stripe constraint violations 6652
5 placement group violations 1250
6 pg hdd score 0.6551, objectsScore 0.0269
7 pg ssd score 0.6824, objectsScore 0.0280
8 pg hdd estFree 45T
9 pg ssd estFree 19T
10 Constraint violations detected, doing a replication-restore update first
11 server constraint violations 0
12 stripe constraint violations 7031
13 placement group violations 0
14 -== POST BALANCE ==-
15 shards with decreased redundancy 0 (0, 0, 0)
16 server constraint violations 0
17 stripe constraint violations 6592
18 placement group violations 0
19 moves 14387, (1864GiB) (tail ssd 14387)
20 pg hdd score 0.6551, objectsScore 0.0269, maxDataToSingleDrive 33 GiB
21 pg ssd score 0.6939, objectsScore 0.0285, maxDataToSingleDrive 76 GiB
22 pg hdd estFree 47T
23 pg ssd estFree 19T
The run of the balancer
tool has multiple steps.
First, it shows the current state of the system (lines 2-9):
Shards (volume pieces) with decreased redundancy.
Server constraint violations means that there are pieces of data which have two or more of their copies on the same server. This is an error condition.
“stripe constraint violation” means that specific pieces of data are not optimally striped across the drives of a specific server. This is NOT an error condition.
“placement group violations” means that pieces of data reside on drives outside their designated placement group; this is an error condition.
Lines 6 and 7 show the current average “score” (usage in %) of the placement groups, for data and objects;
Lines 8 and 9 show the estimated free space for the placement groups.
Then, in this run it has detected problems (in this case placement group violations, which in most cases means a missing drive) and has done a pre-run to correct the redundancy (line 10), printing the resulting state again on lines 11-13.
And last, it runs the balancing and reports the results. The main difference here is that for each placement group it also reports the maximum amount of data that will be added to a single drive. As the balancing happens in parallel on all drives, this is a handy measure of how long the rebalancing would take (in comparison with a different balancing run that might not add that much data to a single drive).
18.14. Errors from the balancer tool
If the balancer
tool doesn’t complete successfully, its output MUST be
examined and the root cause fixed.
18.15. Miscellaneous
If for any reason the currently running rebalancing operation needs to be
paused, it can be done via storpool relocator off
. In such cases StorPool
Support should also be contacted, as this shouldn’t need to happen. Re-enabling
it is done via storpool relocator on
.
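For reference, the pause/resume sequence is simply the following (coordinate with StorPool Support as noted above):
storpool relocator off      # pause the currently loaded relocations
storpool relocator status   # confirm the relocator state
storpool relocator on       # resume the relocation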
19. Troubleshooting
This part outlines the different states of a StorPool cluster, what should be expected in each of them, and the recommended steps to take. It is intended to be used as a guideline for the operations team(s) maintaining the production system provided by StorPool.
19.1. Normal state of the system
The normal state of the StorPool storage system is when it is fully configured, up, and running. This is the desired state of the system.
Characteristics of this state:
19.1.1. All nodes in the storage cluster are up and running
This can be checked by using the CLI with storpool service list
on any node
with access to the API service.
Note
The storpool service list command provides status for all services running cluster-wide, not only for the services running on the node itself.
19.1.2. All configured StorPool services are up and running
This is again easily checked with storpool service list. Recently restarted services are usually spotted by their low uptime. A recent restart should be taken seriously if the reason for it is unknown, even if the service is running at the moment, as in the example with client ID 37 below:
# storpool service list
cluster running, mgmt on node 2
mgmt 1 running on node 1 ver 20.00.18, started 2022-09-08 19:28:37, uptime 144 days 22:45:51
mgmt 2 running on node 2 ver 20.00.18, started 2022-09-08 19:27:18, uptime 144 days 22:47:10 active
server 1 running on node 1 ver 20.00.18, started 2022-09-08 19:28:59, uptime 144 days 22:45:29
server 2 running on node 2 ver 20.00.18, started 2022-09-08 19:25:53, uptime 144 days 22:48:35
server 3 running on node 3 ver 20.00.18, started 2022-09-08 19:23:30, uptime 144 days 22:50:58
client 1 running on node 1 ver 20.00.18, started 2022-09-08 19:28:37, uptime 144 days 22:45:51
client 2 running on node 2 ver 20.00.18, started 2022-09-08 19:25:32, uptime 144 days 22:48:56
client 3 running on node 3 ver 20.00.18, started 2022-09-08 19:23:09, uptime 144 days 22:51:19
client 21 running on node 21 ver 20.00.18, started 2022-09-08 19:20:26, uptime 144 days 22:54:02
client 22 running on node 22 ver 20.00.18, started 2022-09-08 19:19:26, uptime 144 days 22:55:02
client 37 running on node 37 ver 20.00.18, started 2022-09-08 13:08:12, uptime 05:06:16
19.1.3. Working cgroup memory and cpuset isolation is properly configured
Use the storpool_cg
tool with an argument check
to ensure everything is
as expected. The tool should not return any warnings. For more information, see
Control groups.
When properly configured, the sum of all memory limits on the node is less than the available memory on the node. This protects the running kernel from memory shortage, as well as all processes in the storpool.slice memory cgroup, which ensures the stability of the storage service.
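A quick verification sketch, run on each node; the check should complete without any warnings:
# storpool_cg check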
19.1.4. All network interfaces are properly configured
All network interfaces used by StorPool are up and properly configured with
hardware acceleration enabled (where applicable); all network switches are
configured with jumbo frames and flow control, and none of them experience any
packet loss or delays. The output from storpool net list is a good start: all configured network interfaces will be seen as up, with the relevant flags explained at the end. The desired state is uU with a + at the end for each network interface; if hardware acceleration is supported on an interface, the A flag should also be present:
storpool net list
------------------------------------------------------------
| nodeId | flags | net 1 | net 2 |
------------------------------------------------------------
| 23 | uU + AJ | 1A:00:01:00:00:17 | 1E:00:01:00:00:17 |
| 24 | uU + AJ | 1A:00:01:00:00:18 | 1E:00:01:00:00:18 |
| 25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
| 26 | uU + AJ | 12:00:01:00:F0:1A | 16:00:01:00:F0:1A |
| 27 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
u - packets recently received from this node
d - no packets recently received from this node
U - this node has enough votes to be in the quorum
D - this node does not have enough votes to be in the quorum
+ - this node considers itself in the quorum
B - the connection to this node is through a backup link; check the cabling
A - this node is using hardware acceleration
J - the node uses jumbo frames
N - a non-voting node
19.1.5. All drives are up and running
All drives in use for the storage system are performing at their specified speed, are joined in the cluster and serving requests.
This could be checked with storpool disk list internal; for example, in a normally loaded cluster all drives will report low aggregate scores. Below is an example output (trimmed for brevity):
# storpool disk list internal
--------------------------------------------------------------------------------------------------------------------------------------------------------
| disk | server | aggregate scores | wbc pages | scrub bw | scrub ETA | last scrub completed |
--------------------------------------------------------------------------------------------------------------------------------------------------------
| 2301 | 23.0 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:33:44 |
| 2302 | 23.0 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:48 |
| 2303 | 23.1 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:49 |
| 2304 | 23.1 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:50 |
| 2305 | 23.2 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:51 |
| 2306 | 23.2 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:51 |
| 2307 | 23.3 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:52 |
| 2308 | 23.3 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:53 |
| 2311 | 23.0 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:38 |
| 2312 | 23.0 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:43 |
| 2313 | 23.1 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:44 |
| 2314 | 23.1 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:45 |
| 2315 | 23.2 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:47 |
| 2316 | 23.2 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:39 |
| 2317 | 23.3 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:40 |
| 2318 | 23.3 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:42 |
[snip]
All drives are regularly scrubbed, so they should have a stable (not increasing) number of errors. The errors corrected for each drive are visible in the storpool disk list output. The last completed scrub is visible in storpool disk list internal, as in the example above.
Note that some systems may have fewer than two network interfaces or a single backend switch. Even though not recommended, this is still possible and sometimes used (usually in PoC setups or with a backup server) when a cluster is configured with a single-VLAN network redundancy scheme. A single-VLAN network redundancy configuration and an inter-switch connection are required for a cluster where only some of the nodes have a single interface connected to the cluster.
Also, if one or more of the points describing the state above is not in effect, the system should not be considered healthy. If there is any suspicion that the system is behaving erratically even though all of the above conditions are satisfied, the recommended steps to check if everything is in order are:
- Check top and look for the state of each of the configured storpool_* services running on the present node. A properly running service is usually in the S (sleeping) state and rarely seen in the R (running) state. The CPU usage is often reported as 100% when hardware sleep is enabled, due to the kernel misreporting it; the actual usage is much lower and could be tracked with cpupower monitor for the CPU cores.
- Another way to ensure all services on this node are running correctly is to use the /usr/lib/storpool/sdump tool, which reports some CPU and network usage statistics for the running services on the node. Use the -l option for the long names of the statistics.
- On some of the nodes with running workloads (like VM instances or containers), iostat will show activity for processed requests on the block devices. The following example shows normal disk activity on a node running VM instances; note that the usage may vary greatly depending on the workload. The command used in the example is iostat -xm 1 /dev/sp-* | egrep -v " 0[.,]00$", which will print statistics for the StorPool devices each second, excluding devices with no storage I/O activity:
Device:   rrqm/s  wrqm/s     r/s     w/s   rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sp-0        0.00    0.00    0.00  279.00    0.00   0.14      1.00      3.87  13.80     0.00    13.80   3.55  98.99
sp-11       0.00    0.00  165.60  114.10   19.29  14.26    245.66      5.97  20.87     9.81    36.91   0.89  24.78
sp-12       0.00    0.00  171.60  153.60   19.33  19.20    242.67      9.20  28.17    10.46    47.96   1.08  35.01
sp-13       0.00    0.00    6.00   40.80    0.04   5.10    225.12      1.75  37.32     0.27    42.77   1.06   4.98
sp-21       0.00    0.00    0.00   82.20    0.00   1.04     25.90      1.00  12.08     0.00    12.08  12.16  99.99
19.1.6. There are no hanging active requests
The output of /usr/lib/storpool/latthreshold.py is empty, i.e. it shows no hanging requests and no service or disk warnings.
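For example, on a healthy cluster running the tool simply returns with no output:
# /usr/lib/storpool/latthreshold.py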
19.2. Degraded state
In this state some system components are not fully operational and need attention. Some examples of a degraded state below.
19.2.1. Degraded state due to service issues
A single storpool_server service on one of the storage nodes is not available or not joined in the cluster
Note that this concerns only pools with triple replication; for dual replication this is considered a critical state, because there are parts of the system with only one available copy. This is an example output from storpool service list:
# storpool service list
cluster running, mgmt on node 2
mgmt 1 running on node 1 ver 20.00.18, started 2022-09-08 16:11:59, uptime 19:51:50
mgmt 2 running on node 2 ver 20.00.18, started 2022-09-08 16:11:58, uptime 19:51:51 active
mgmt 3 running on node 3 ver 20.00.18, started 2022-09-08 16:11:58, uptime 19:51:51
server 1 down on node 1 ver 20.00.18
server 2 running on node 2 ver 20.00.18, started 2022-09-08 16:12:03, uptime 19:51:46
server 3 running on node 3 ver 20.00.18, started 2022-09-08 16:12:04, uptime 19:51:45
client 1 running on node 1 ver 20.00.18, started 2022-09-08 16:11:59, uptime 19:51:50
client 2 running on node 2 ver 20.00.18, started 2022-09-08 16:11:57, uptime 19:51:52
client 3 running on node 3 ver 20.00.18, started 2022-09-08 16:11:57, uptime 19:51:52
If this is unexpected, i.e. no one has deliberately restarted or stopped the service for planned maintenance or upgrade, it is very important to first bring the service up and then investigate the root cause of the outage. When the storpool_server service comes back up it will start recovering outdated data on its drives. The recovery process could be monitored with storpool task list, which will show which disks are recovering, as well as how much data is left to be recovered. Example output of storpool task list:
# storpool task list
----------------------------------------------------------------------------------------
| disk | task id | total obj | completed | started | remaining | % complete |
----------------------------------------------------------------------------------------
| 2303 | RECOVERY | 1 | 0 | 1 | 1 | 0% |
----------------------------------------------------------------------------------------
| total | | 1 | 0 | 1 | 1 | 0% |
----------------------------------------------------------------------------------------
Some of the volumes or snapshots will have the D flag (for degraded) visible in the storpool volume status output, which will disappear once all the data is fully recovered. Example situations include a node being rebooted for a kernel or package upgrade without the StorPool kernel modules being installed for the new kernel, or a service (in this example storpool_server) not being configured to start on boot.
Some of the configured StorPool services have failed or are not running
These could be:
- The storpool_block service on some of the storage-only nodes, without any attached volumes or snapshots.
- A single storpool_server service or multiple instances on the same node; note again that this is critical for systems with dual replication.
- A single API (storpool_mgmt) service, while another active API is running.
The reason for these could be the same as in the previous examples; usually the system log contains all the information needed to check why the service is not (getting) up.
19.2.2. Degraded state due to host OS misconfiguration
Some examples include:
Changes in the OS configuration after a system update
This could prevent some of the services from running after a fresh boot. For instance, due to changed names of the network interfaces used for the storage system after an upgrade, changed PCIe IDs for NVMe devices, and so on.
Kdump is no longer collecting kernel dump data properly
If this occurs, it might be difficult to debug what has caused a kernel crash.
Some of the above cases will be difficult to catch prior to booting into the new environment (for example, after kernel or other updates), and sometimes they are only caught after an event that reveals the issue. Thus it is important to regularly test and ensure that the system is properly configured and that crash dumps are collected normally.
19.2.3. Degraded state due to network interface issues
Some of the interfaces used by StorPool are not up.
This could be checked with storpool net list
, like this:
# storpool net list
------------------------------------------------------------
| nodeId | flags | net 1 | net 2 |
------------------------------------------------------------
| 23 | uU + AJ | | 1E:00:01:00:00:17 |
| 24 | uU + AJ | | 1E:00:01:00:00:18 |
| 25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
| 26 | uU + AJ | 12:00:01:00:F0:1A | 16:00:01:00:F0:1A |
| 27 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
u - packets recently received from this node
d - no packets recently received from this node
U - this node has enough votes to be in the quorum
D - this node does not have enough votes to be in the quorum
+ - this node considers itself in the quorum
B - the connection to this node is through a backup link; check the cabling
A - this node is using hardware acceleration
J - the node uses jumbo frames
N - a non-voting node
In the above example nodes 23 and 24 are not connected to the first network.
This is the SP_IFACE1_CFG
interface configuration in /etc/storpool.conf
(check with storpool_showconf SP_IFACE1_CFG
). Note that the beacons are up
and running and the system is processing requests through the second network.
The possible reasons could be misconfigured interfaces, StorPool configuration,
or backend switch/switches.
A HW acceleration qualified interface is running without hardware acceleration
This is once again checked with storpool net list
:
# storpool net list
------------------------------------------------------------
| nodeId | flags | net 1 | net 2 |
------------------------------------------------------------
| 23 | uU + J | 1A:00:01:00:00:17 | 1E:00:01:00:00:17 |
| 24 | uU + J | 1A:00:01:00:00:18 | 1E:00:01:00:00:18 |
| 25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
| 26 | uU + AJ | 12:00:01:00:F0:1A | 16:00:01:00:F0:1A |
| 27 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
u - packets recently received from this node
d - no packets recently received from this node
U - this node has enough votes to be in the quorum
D - this node does not have enough votes to be in the quorum
+ - this node considers itself in the quorum
B - the connection to this node is through a backup link; check the cabling
A - this node is using hardware acceleration
J - the node uses jumbo frames
N - a non-voting node
In the above example, nodes 23 and 24 are equipped with NICs qualified for hardware acceleration, but are running without it; the possible reason could be a BIOS or OS misconfiguration, misconfigured kernel parameters on boot, or network interface misconfiguration. Note that when a system was configured for hardware-accelerated operation, the cgroups configuration was also sized accordingly, thus running in this state is likely to cause performance issues due to fewer CPU cores being isolated and reserved for the NIC interrupts and the storpool_rdma threads.
Jumbo frames are expected, but not working on some of the interfaces
This can be seen with storpool net list; if one of the two networks has an MTU lower than 9k, the J flag will not be listed:
# storpool net list
-------------------------------------------------------------
| nodeId | flags | net 1 | net 2 |
-------------------------------------------------------------
| 23 | uU + A | 12:00:01:00:F0:17 | 16:00:01:00:F0:17 |
| 24 | uU + AJ | 1A:00:01:00:00:18 | 1E:00:01:00:00:18 |
| 25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
| 26 | uU + AJ | 1A:00:01:00:00:1A | 1E:00:01:00:00:1A |
-------------------------------------------------------------
Quorum status: 4 voting beacons up out of 4 expected
Flags:
u - packets recently received from this node
d - no packets recently received from this node
U - this node has enough votes to be in the quorum
D - this node does not have enough votes to be in the quorum
M - this node is being damped by the rest of the nodes in the cluster
+ - this node considers itself in the quorum
B - the connection to this node is through a backup link; check the cabling
A - this node is using hardware acceleration
J - the node uses jumbo frames
N - a non-voting node
If the node is not expected to be running without jumbo frames, this might be an indication of a misconfigured interface or an issue with applying the interface configuration on boot. Note that an OS interface configured for jumbo frames without the switch port being properly configured leads to severe performance issues.
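A quick way to verify both the OS side and the switch side is sketched below; eth1 and <peer-storage-IP> are placeholders for the actual storage interface and the storage address of another node:
# ip link show eth1 | grep -o 'mtu [0-9]*'
# ping -M do -s 8972 -c 3 <peer-storage-IP>
The first command should report mtu 9000 when jumbo frames are expected; the second sends non-fragmentable 9000-byte packets (8972 bytes of payload plus headers), so it will fail if any device along the path is not configured for jumbo frames.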
Some network interfaces are experiencing network loss or delays on one of the networks
This might affect the latency of some of the storage operations. Depending on the node where the losses occur, it might affect a single client or, when the packet loss or delays happen on a server node, operations in the whole cluster. Stats for all interfaces per service are collected in the analytics platform (https://analytics.storpool.com) and could be used to investigate network performance issues. The /usr/lib/storpool/sdump tool will print the same statistics on each of the nodes with services; a few generic checks are also sketched after the list below. The usual causes for packet loss are:
Hardware issues (cables, SFPs, and so on).
Floods and DDoS attacks “leaking” into the storage network due to misconfiguration.
Saturation of the CPU cores that handle the interrupts for the network cards and others when hardware acceleration is not available.
Network loops leading to saturated switch ports or overloaded NICs.
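As a first check on a suspect node, the NIC counters are often enough; a sketch, assuming the storage NIC is named eth1 (placeholder):
# ethtool -S eth1 | grep -iE 'drop|discard|err'
# /usr/lib/storpool/sdump
Non-zero and growing drop or error counters point to the link or the NIC itself, while the sdump statistics show how the StorPool services see the network.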
19.2.4. Drive/Controller issues
One or more HDD or SSD drives are missing from a single server in the cluster or from servers in the same fault set
Attention
This concerns only pools with triple replication; for dual replication this is considered a critical state.
The missing drives may be seen using storpool disk list or storpool server <serverID> disk list; for example, in the output below disk 543 is missing from the server with ID 54:
# storpool server 54 disk list
disk | server | size | used | est.free | % | free entries | on-disk size | allocated objects | errors | flags
541 | 54 | 207 GB | 61 GB | 136 GB | 29 % | 713180 | 75 GB | 158990 / 225000 | 0 / 0 |
542 | 54 | 207 GB | 56 GB | 140 GB | 27 % | 719526 | 68 GB | 161244 / 225000 | 0 / 0 |
543 | 54 | - | - | - | - % | - | - | - / - | -
544 | 54 | 207 GB | 62 GB | 134 GB | 30 % | 701722 | 76 GB | 158982 / 225000 | 0 / 0 |
545 | 54 | 207 GB | 61 GB | 135 GB | 30 % | 719993 | 75 GB | 161312 / 225000 | 0 / 0 |
546 | 54 | 207 GB | 54 GB | 142 GB | 26 % | 720023 | 68 GB | 158481 / 225000 | 0 / 0 |
547 | 54 | 207 GB | 62 GB | 134 GB | 30 % | 719996 | 77 GB | 179486 / 225000 | 0 / 0 |
548 | 54 | 207 GB | 53 GB | 143 GB | 26 % | 718406 | 70 GB | 179038 / 225000 | 0 / 0 |
Usual reasons - the drive was ejected from the cluster due to a write error, either by the kernel or by the running storpool_server instance. More information may be found using dmesg | tail and in the system log. The model and the serial number of the failed drive are shown by storpool disk list info.
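The typical first checks on the affected node are, for example:
# dmesg | tail -n 50
# storpool disk list info
The kernel log shows whether the drive reported I/O errors or disappeared from the bus, and the disk list info output identifies the exact model and serial number in case the drive has to be physically replaced.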
In normal conditions the server will flag the disk to be re-tested and will eject it for a quick test. Provided the disk is still working correctly and the test results are not breaching any thresholds, the disk will be returned to the cluster to recover. Such a case might happen, for example, if the stalled request was caused by an intermittent issue, like a reallocated sector.
In case the disk is breaching any sane latency and bandwidth thresholds, it will not be automatically returned and will have to be re-balanced out of the cluster. Such disks are marked as “bad” (more details at storpool_initdisk options).
When one or more drives are ejected (already marked as bad) and missing, multiple volumes and/or snapshots will be listed with the D flag in the output of storpool volume status (D for Degraded), due to the missing replicas for some of the data. This is normal and expected, and there are the following options in this situation:
The drive could still be working properly (for example, a set of bad sectors was reallocated) even after it was tested; in order to re-test, you could mark the drive as --good (more info on how at storpool_initdisk options) and attempt to get it back into the cluster.
On some occasions a disk might have lost its signatures and would have to be returned to the cluster to recover from scratch - it will be automatically re-tested upon the attempt; a full (read-write) stress test is recommended to ensure it is working correctly (fio is a good tool for this kind of test, check its --verify option; a sketch of such a run is shown after this list). In case the stress test is successful (i.e. the drive has been written to and verified successfully), it may be reinitialized with storpool_initdisk with the same disk ID it had before. This will automatically return it to the cluster, and it will fully recover all data from scratch as if it were a brand new drive.
The drive has failed irrecoverably and a replacement is available. The replacement drive is initialized with the disk ID of the failed drive with storpool_initdisk. After returning it to the cluster it will fully recover all the data from the live replicas (please check 18. Rebalancing the cluster for more).
A replacement is not available. The only option is to re-balance the cluster without this drive (more details in 18. Rebalancing the cluster).
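A sketch of such a full read-write stress test with fio, assuming /dev/sdX is the drive being re-tested and contains no data that must be kept (the run is destructive):
# fio --name=verify-drive --filename=/dev/sdX --direct=1 --ioengine=libaio \
      --rw=write --bs=1M --iodepth=32 --verify=crc32c --do_verify=1
A run that completes without verification errors is a reasonable indication that the drive can be reinitialized with storpool_initdisk and returned to the cluster.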
Attention
Beware that in some cases with very full clusters it might be impossible to get the cluster back to full redundancy without overfilling some of the remaining drives. See the next section.
Some of the drives in the cluster are beyond 90% full (up to 96%)
With proper planning this should rarely be an issue. A way to avoid it is to add more drives or an additional server node with a full set of drives to the cluster. Another option is to remove unused volumes or snapshots.
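Drives approaching the threshold can be spotted from the “%” column of storpool disk list. The filter below is only a sketch and assumes the column layout shown in the disk list examples in this section (the used percentage is the sixth pipe-delimited field):
# storpool disk list | awk -F'|' '$6+0 >= 90'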
The storpool snapshot space command will return information about the space referred to by each snapshot on the underlying drives. Note that snapshots with a negative value in their “used” column will not free up any space if they are removed and will remain in the deleting state, because they are parents of multiple cloned child volumes.
Note that depending on the speed with which the cluster is being populated with data by the end users this might also be considered a critical state.
Some of the drives have fewer than 140k free entries (alert for an overloaded system)
This may be observed in the output of storpool disk list or storpool server <serverID> disk list; an example from the latter is shown below:
# storpool server 23 disk list
disk | server | size | used | est.free | % | free entries | on-disk size | allocated objects | errors | flags
2301 | 23.0 | 893 GiB | 2.6 GiB | 857 GiB | 0 % | 3719930 | 660 KiB | 17 / 930000 | 0 / 0 |
2302 | 23.0 | 893 GiB | 2.6 GiB | 857 GiB | 0 % | 3719929 | 668 KiB | 17 / 930000 | 0 / 0 |
2303 | 23.0 | 893 GiB | 2.6 GiB | 857 GiB | 0 % | 3719929 | 668 KiB | 17 / 930000 | 0 / 0 |
2304 | 23.0 | 893 GiB | 2.6 GiB | 857 GiB | 0 % | 3719931 | 668 KiB | 17 / 930000 | 0 / 0 |
2306 | 23.0 | 893 GiB | 2.6 GiB | 857 GiB | 0 % | 3719932 | 664 KiB | 17 / 930000 | 0 / 0 |
2305 | 23.0 | 893 GiB | 2.6 GiB | 857 GiB | 0 % | 3719930 | 660 KiB | 17 / 930000 | 0 / 0 |
2307 | 23.0 | 893 GiB | 2.6 GiB | 857 GiB | 0 % | 19934 | 664 KiB | 17 / 930000 | 0 / 0 |
--------------------------------------------------------------------------------------------------------------------------------------
7 | 1.0 | 6.1 TiB | 18 GiB | 5.9 TiB | 0 % | 26039515 | 4.5 MiB | 119 / 6510000 | 0 / 0 |
This usually happens after the system has been loaded for longer periods of time with a sustained write workload on one or multiple volumes. If this is unexpected and the reason is an erratic workload, the recommended way to handle this is to set a limit (bandwidth, IOPS, or both) on the loaded volumes, for example with storpool volume <volumename> bw 100M iops 1000. The same could be set for multiple volumes/snapshots in a template with storpool template <templatename> bw 100M iops 1000 propagate. Please note that propagating changes for templates with a very large number of volumes and snapshots might not work. If the overloaded state is due to a normally occurring workload, it is best to expand the system with more drives and/or reformat the drives with a larger number of entries (relates mainly to HDD drives). The latter case might be caused by a lower number of hard drives in an HDD-only or a hybrid pool, and rarely by overloaded SSDs.
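As a worked example, limiting a single overloaded volume and then applying the same limits through its template (the names vm-disk-17 and hybrid-r3 are placeholders):
# storpool volume vm-disk-17 bw 100M iops 1000
# storpool template hybrid-r3 bw 100M iops 1000 propagate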
Another case related to overloaded drives is when many volumes are created from the same template. This requires overrides in order to shuffle the objects where the journals reside, so that the same triplet of disks is not overloaded when all virtual machines spike for some reason (e.g. unattended upgrades, a syslog-intensive cron job, etc.).
A couple of notes on the degraded states - apart from the notes on replication above, none of these should affect the stability of the system at this point. For the example with the missing disk in hybrid systems with a single failed SSD, all read requests on volumes with triple replication that have data on the failed drive will be served by some of the redundant copies on HDDs. This could slightly raise the read latencies for the operations on the parts of the volume that were on this exact SSD. This is usually negligible in medium to large systems; e.g. in a cluster with 20 SSDs or NVMe drives, these are 1/20th of all the read operations in the cluster. In case of dual replicas on SSDs and a third replica on HDDs there is no read latency penalty whatsoever, which is also the case for missing hard drives - they will not affect the system at all, and in fact some write operations are even faster, because they are not waiting for the missing drive.
19.3. Critical state
This is an emergency state that requires immediate attention and intervention from the operations team and/or StorPool Support. Some of the conditions that could lead to this state are:
Partial or complete network outage.
Power loss for some nodes in the cluster.
Memory shortage leading to a service failure due to missing or incomplete cgroups configuration.
The following states are an indication for critical conditions:
19.3.1. API service failure
API not reachable on any of the configured nodes (the ones running the storpool_mgmt
service)
Requests to the API from any of the nodes configured to access it either stall or cannot reach a working service. This is a critical state, because the status of the cluster is unknown (the cluster might even be down).
This might be caused by:
Misconfigured network for accessing the floating IP address - the address may be obtained by storpool_showconf http on any of the nodes with a configured storpool_mgmt service in the cluster:
# storpool_showconf http
SP_API_HTTP_HOST=10.3.10.78
SP_API_HTTP_PORT=81
Failed interfaces on the hosts that have the storpool_mgmt service running. To find the interface where the StorPool API should be running, use storpool_showconf api_iface:
# storpool_showconf api_iface
SP_API_IFACE=bond0.410
It is recommended to have the API on a redundant interface (e.g. an active-backup bond interface). Note that even without an API, provided the cluster is in quorum, there should be no impact on any running operations, but changes in the cluster (like creating/attaching/detaching/deleting volumes or snapshots) will be impossible. Running with no API in the cluster triggers the highest severity alert to StorPool Support (essentially a wake-up alert) due to the unknown state of the system.
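A minimal reachability check from any node, using the host and port reported by storpool_showconf http (the values below are the example ones from above):
# ping -c 3 10.3.10.78
# curl -sv http://10.3.10.78:81/ >/dev/null
If ICMP works but the TCP connection fails, check the node that currently holds the floating IP address and the state of its API interface.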
The cluster is not in quorum
The cluster is in this state if the number of running voting storpool_beacon services is less than half of the expected nodes plus one ((expected / 2) + 1). The configured number of expected nodes in the cluster may be checked with storpool_showconf expected, but it is generally considered to be the number of server nodes (except when client nodes are configured as voting for some reason). In a system with 6 servers, at least 4 voting beacons should be available to get the cluster back into running state:
# storpool_showconf expected
SP_EXPECTED_NODES=6
The current number of expected votes and the number of voting beacons are displayed in the output of storpool net list; check the example above (the Quorum status: line).
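The quorum requirement follows directly from the (expected / 2) + 1 formula; as a quick sketch with the example value of SP_EXPECTED_NODES:
# EXPECTED=6
# echo $(( EXPECTED / 2 + 1 ))
4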
API requests are not returning for more than 30-60 seconds (e.g. storpool volume status
, storpool snapshot space
, storpool disk list
, etc.)
These API requests collect data from the running storpool_server
services on
each server node. Possible reasons are:
Network loss or delays;
Failing storpool_server services;
Failing drives or hardware (CPU, memory, controllers, etc.);
Overload.
19.3.2. Server service failure
Two storpool_server services or whole servers are down
Two storpool_server services or whole servers are down or not joined in the cluster in different fault sets. This is a very risky state, because there are parts of the volumes with only one live replica; if the latest writes land on a drive returning an I/O error or broken data (detected by StorPool), this will lead to data loss.
As in the degraded state, some of the read operations for parts of the volumes will be served from HDDs in a hybrid system and might raise read latencies. In this state it is very important to bring back the missing services/nodes as soon as possible, because a failure of any of the remaining drives in other nodes or another fault set will bring some of the volumes into down state and might lead to data loss in case of an error returned by a drive holding the latest writes.
More than two storpool_server services or whole servers are down
This state results in some volumes being in down state (storpool volume status), due to some parts of the data residing only on the missing drives. The recommended action in this case is to check for the reasons for the degraded services or missing (unresponsive) nodes and get them back up.
Possible reasons are:
Lost network connectivity
Severe packet loss/delays/loops
Partial or complete power loss
Hardware instabilities, overheating
Kernel or other software instabilities, crashes
19.3.3. Client service failure
If the client service (storpool_block) is down on some of the nodes depending on it (these could be either client-only or converged hypervisor nodes), all requests on that particular node will stall until the service is back up.
Possible reasons are again:
Lost network connectivity
Severe packet loss/delays/loops
Bugs in the storpool_block service or the storpool_bd kernel module
In case of power loss or kernel crashes, any virtual machine instances that were running on this node could be started on other available nodes.
19.3.4. Network interface or Switch failure
This means that the networks used by StorPool are down or are experiencing heavy packet loss or delays. In this case the quorum service will prevent a split-brain situation and will restart all services to ensure the cluster is fully connected on at least one network before it transitions again to running state. Such issues might be alleviated by a single-VLAN setup when different nodes have partial network connectivity, but severe packet loss will still cause severe delays.
19.3.5. Hard Drive/SSD failures
Drives from two or more different nodes (fault sets) in the cluster are missing (or from a single node/fault set for systems with dual replication pools)
In this case multiple volumes may either experience degraded performance (hybrid placement) or will be in down state when more than two replicas are missing. All operations on volumes in down state are stalled until the redundancy is restored (i.e. at least one replica is available). The recommended steps are to immediately check for the reasons for the missing drives/services/nodes and return them into the cluster as soon as possible.
Some of the drives are more than 97% full
At some point all cluster operations will stall until either some of the data in the cluster is deleted or new drives/nodes are added. Adding drives requires the new drives to be stress tested and a re-balancing of the system to include them, which should be carefully planned (details in 18. Rebalancing the cluster).
Note
Cleaning up snapshots that have multiple cloned volumes and a negative
value for used space in the output of storpool snapshot space
will
not free up any space.
Some of the drives have fewer than 100k free entries
This is usually caused by a heavily overloaded system. In this state the latencies for some operations might become very high (measured in seconds). Possible reasons are severely overloaded volumes for long periods of time without any configured bandwidth or IOPS limits. This could be checked by using iostat to look for volumes that are constantly 100% loaded with a large number of requests to the storage system. Another way to check for such volumes is to use “Top volumes” in the analytics in order to get info about the most loaded volumes and apply IOPS and/or bandwidth limits accordingly. Other causes are misbehaving (underperforming) drives or misbehaving HBA/SAS controllers. The recommended way to deal with these cases is to investigate for such drives; a good idea is to check the output of storpool disk list internal for higher aggregation scores on some drives or sets of drives (e.g. on the same server), or to use the analytics to check for abnormal latency on some of the backend nodes (i.e. drives with significantly higher operations latency compared to other drives of the same type). An example would be a failing controller causing the SATA speed to degrade to SATA 1.0 (1.5 Gb/s) instead of SATA 3.0 (6 Gb/s), worn-out batteries on a RAID controller when its cache is used to accelerate the writes on the HDDs, and others.
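A rough way to spot constantly saturated StorPool devices with iostat is sketched below; it assumes that the installed sysstat version prints %util as the last column and that the StorPool block devices appear with an sp- prefix on this node:
# iostat -x 5 3 | awk '$1 ~ /^sp-/ && $NF+0 > 95'
Devices that stay above roughly 95% utilization across several intervals are good candidates for the bandwidth/IOPS limits described above.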
The circumstances leading a system to the critical state are rare and are usually preventable by taking measures to handle all the issues at the first signs of a change from the normal to the degraded state.
In any of the above cases, if you feel that something is not as expected, a consultation with StorPool Support is the best course of action. StorPool Support receives notifications for all the cases detailed above and proactively takes action to prevent a system from going into degraded or critical state as soon as practically possible.
19.3.6. Hanging requests in the cluster
The output of /usr/lib/storpool/latthreshold.py
shows hanging requests
and/or missing services as in the example below:
disk | reported by | peers | s | op | volume | requestId
-------------------------------------------------------------------------------------------------------------------
- | client 2 | client 2 -> server 2.1 | 15 | read | volume-name | 9223936579889289248:270215977642998472
- | client 2 | client 2 -> server 2.1 | 15 | read | volume-name | 9223936579889289248:270497452619709333
- | client 2 | client 2 -> server 2.1 | 15 | read | volume-name | 9223936579889289248:270778927596419790
- | client 2 | client 2 -> server 2.1 | 15 | write | volume-name | 9223936579889289248:271060402573130531
- | client 2 | client 2 -> server 2.1 | 15 | read | volume-name | 9223936579889289248:271341877549841211
- | client 2 | client 2 -> server 2.1 | 15 | read | volume-name | 9223936579889289248:271623352526551744
- | client 2 | client 2 -> server 2.1 | 15 | write | volume-name | 9223936579889289248:271904827503262450
server 2.1 connection status: established no_data timeout
disk 202 EXPECTED_UNKNOWN server 2.1
This could be caused by CPU starvation, hardware resets, misbehaving disks or network, or stalled services. The disk field in the output and the service warnings after the requests table could be used as indicators of the misbehaving component.
Note that the active requests API call has a timeout for each service to respond. The default timeout that the latthreshold tool uses is 10 seconds. This value can be altered by using latthreshold's --api-requests-timeout/-A option and passing it a numeric value with a time unit (m, s, ms or us), e.g. 100ms.
Service connection will have one of the following statuses:
established done - service reported its active requests as expected; this is not displayed in the regular output, only with --json
not_established - did not make a connection with the service; this could indicate the server is down, but may also indicate the service version is too old or its stream was overfilled or not connected
established no_data timeout - service did not respond and the connection was closed because the timeout was reached
established data timeout - service responded but the connection was closed because the timeout was reached before it could send all the data
established invalid_data - a message the service sent had invalid data in it
The latthreshold
tool also reports disk statuses. Reported disk statuses
will be one of the following:
EXPECTED_MISSING - the service response was good, but did not provide information about the disk
EXPECTED_NO_CONNECTION_TO_PEER - the connection to the service was not established
EXPECTED_NO_PEER - the service is not present
EXPECTED_UNKNOWN - the service response was invalid or a timeout occurred