StorPool User Guide 19

Document version 2019-08-05

1. StorPool Overview

StorPool is distributed block storage software. It pools the attached storage (HDDs, SSDs or NVMe drives) of standard servers to create a single pool of shared storage. The StorPool software is installed on each server in the cluster. It combines the performance and capacity of all drives attached to the servers into one global namespace.
StorPool version 19 was released in November 2019. The new version brings numerous improvements to the leading block storage SDS. Notable features include support for large-scale deployments, Windows CSV support, lower latency for NVMe storage, and improved Kubernetes bare-metal support.

StorPool provides standard block devices. You can create one or more volumes through its sophisticated volume manager. StorPool is compatible with the ext4 and XFS file systems and with any system designed to work with a block device, e.g. databases and cluster file systems (like OCFS2 and GFS2). StorPool can also be used with no file system, for example when using volumes to store VM images directly or as LVM physical volumes.

Redundancy is provided by keeping multiple copies (replicas) of the data, written synchronously across the cluster. Users may set the number of replication copies; we recommend 3 copies as a standard, and 2 copies for less critical data. The replication level directly correlates with the number of servers that may be down without interruption of the service: with replication 3, the number of servers (see Fault sets) that may be down simultaneously without losing access to the data is 2.
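A quick way to reason about the replication level: with N synchronous copies, up to N - 1 servers (or fault sets) may be down while the data stays accessible. A minimal illustrative sketch:

```python
def tolerated_failures(replication: int) -> int:
    """With `replication` synchronous copies, up to replication - 1
    servers (or fault sets) may be down while data stays accessible."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return replication - 1

# 3 copies (recommended): survives 2 simultaneous server failures.
# 2 copies (less critical data): survives 1 failure.
print(tolerated_failures(3))
```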

StorPool protects data and guarantees its integrity with a 64-bit checksum and a version for each sector on a StorPool volume or snapshot. StorPool provides a very high degree of flexibility in volume management. Unlike other storage technologies, such as RAID or ZFS, StorPool does not rely on device mirroring (pairing drives for redundancy). Every disk added to a StorPool cluster therefore adds capacity and improves the performance of the cluster, not just for new data but also for existing data. Provided that there are sufficient copies of the data, drives can be added or removed with no impact on the storage service. Unlike rigid systems such as RAID, StorPool does not impose a strict hierarchical storage structure dictated by the underlying disks. It simply creates a single pool of storage that utilizes the full capacity and performance of a set of commodity drives.

2. Architecture

StorPool works on a cluster of servers in a distributed shared-nothing architecture. All functions are performed by all servers on an equal peer basis. It works on standard off-the-shelf servers running GNU/Linux.

Each storage node is responsible for data stored on its local drives. Storage nodes collaborate to provide the storage service. StorPool provides a shared storage pool combining all the available storage capacity. It uses synchronous replication across servers. The StorPool client communicates in parallel with all StorPool servers. The StorPool iSCSI target provides access to volumes exported through it to other initiators.

The software consists of two parts - a storage server and a storage client - that are installed on each physical server (host, node). The storage client can be the native block device driver on Linux-based systems, or the iSCSI target for other systems. Each host can be a storage server, a storage client, an iSCSI target, or any combination thereof. To storage clients, StorPool volumes appear as block devices under the /dev/storpool/ directory and behave as normal disk devices. The data on a volume can be read and written by all clients simultaneously; its consistency is guaranteed through a synchronous replication protocol. Clients may use volumes as they would use a local hard drive or disk array.

3. Feature Highlights

3.1. Scale-out, not Scale-Up

The StorPool solution is fundamentally about scaling out (adding more drives or nodes) rather than scaling up (adding capacity by replacing a storage box with a larger storage box). This means StorPool can scale independently in IOPS, storage capacity, and bandwidth. There is no bottleneck or single point of failure. StorPool can grow without interruption and in small steps - a drive, a server, and/or a network interface at a time.

3.2. High Performance

StorPool combines the IOPS performance of all drives in the cluster and optimizes drive access patterns to provide low latency and handle bursts of storage traffic. The load is distributed equally between all servers through striping and sharding.

3.3. High Availability and Reliability

StorPool uses a replication mechanism that slices and stores copies of the data on different servers. For primary, high-performance storage this solution has many advantages compared to RAID systems and provides considerably higher levels of reliability and availability. In case of a drive, server, or other component failure, StorPool uses some of the available copies of the data located on other nodes in the same or other racks, significantly decreasing the risk of losing access to the data or losing the data itself.

3.4. Commodity Hardware

StorPool supports drives and servers in a vendor-agnostic manner, allowing you to avoid vendor lock-in. This allows the use of commodity hardware while preserving reliability and performance requirements. Moreover, unlike RAID, StorPool is drive-agnostic - you can mix drives of various types, makes, speeds, or sizes in a StorPool cluster.

3.5. Shared Block Device

StorPool provides shared block devices with semantics identical to a shared iSCSI or FC disk array.

3.6. Co-existence with hypervisor software

StorPool can utilize repurposed existing servers and can co-exist with hypervisor software on the same server. This means that there is no dedicated hardware for storage, and growing an IaaS cloud solution is achieved by simply adding more servers to the cluster.

3.7. Compatibility

StorPool is compatible with 64-bit Intel and AMD based servers. We support all Linux-based hypervisors and hypervisor management software. Any Linux software designed to work with a shared storage solution such as an iSCSI or FC disk array will work with StorPool. StorPool guarantees the functionality and availability of the storage solution at the Linux block device interface.

3.8. CLI interface and API

StorPool provides an easy to use yet powerful command-line interface (CLI) tool for administration of the data storage solution. It is simple and user-friendly, making configuration changes, provisioning, and monitoring fast and efficient. StorPool also provides a RESTful JSON API and Python bindings exposing all the available functionality, so you can integrate it with any existing management system.
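As a sketch of how the HTTP API can be driven from Python without the official bindings, the snippet below builds an authenticated request. The endpoint layout (/ctrl/1.0/&lt;command&gt;) and the Authorization header format used here are assumptions that should be confirmed against the StorPool API reference for your release:

```python
import urllib.request

def build_api_request(host: str, port: int, token: str, command: str):
    # Endpoint layout and header format assumed here; confirm them
    # against the StorPool API reference for your release.
    url = f"http://{host}:{port}/ctrl/1.0/{command}"
    req = urllib.request.Request(url)
    req.add_header("Authorization", f"Storpool v1:{token}")
    return req

req = build_api_request("127.0.0.1", 81, "4306865639163977196", "ServicesList")
# The request can then be sent with urllib.request.urlopen(req) and the
# JSON body decoded with the json module.
```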

3.9. Reliable Support

StorPool comes with reliable, dedicated support: remote installation and initial configuration by StorPool’s specialists; 24x7 support; live software updates without interruption of the service.

4. Hardware Requirements

All distributed storage systems are highly dependent on the underlying hardware. There are some aspects that help achieve maximum performance with StorPool and are best considered in advance. Each node in the cluster can be used as a server, a client, an iSCSI target, or any combination; hardware requirements vary depending on the role.

4.1. Minimum StorPool cluster

  • 3 industry-standard x86 servers;

  • any x86-64 CPU with 4 threads or more;

  • 32 GB ECC RAM per node (8+ GB used by StorPool);

  • any hard drive controller in JBOD mode;

  • 3x SATA3 hard drives or SSDs;

  • dedicated 10GE LAN.

4.3. How StorPool relies on hardware

4.3.1. CPU

When the system load increases, CPUs become saturated with system interrupts. To avoid the negative effects of this, StorPool’s server and client processes are given one or more dedicated CPU cores. This significantly improves the overall performance and the performance consistency.

4.3.2. RAM

ECC memory can detect and correct the most common kinds of in-memory data corruption, thus keeping the memory system immune to single-bit errors. Using ECC memory is an essential requirement for the reliability of the node; in fact, StorPool is not designed to work with non-ECC memory.

4.3.3. Storage (HDDs / SSDs)

StorPool ensures the best drive utilization. Replication and data integrity are core functionality, so RAID controllers are not required and all storage devices can be connected as JBOD. Hard drives are journaled, for example on a power-loss-protected NVMe drive such as the Intel Optane series. When a write-back cache is available on a RAID controller, it can be used in a StorPool-specific way to provide power-loss protection for the data written on the hard disks. This is not necessary for SATA SSD pools.

4.3.4. Network

StorPool is a distributed system, which means the network is an essential part of it. Designed for efficiency, StorPool combines data transfers from multiple nodes in the cluster. This greatly improves data throughput compared with access to local devices, even if they are SSDs or NVMe drives.

4.4. Software Compatibility

4.4.1. Operating Systems

  • Linux (various distributions)

  • Windows, VMware, and Citrix Xen through standard protocols (iSCSI)

4.4.2. File Systems

Developed and optimized for Linux, StorPool is very well tested on CentOS, Ubuntu, and Debian. It is compatible and well tested with the ext4 and XFS file systems, and with any system designed to work with a block device, e.g. databases and cluster file systems (like GFS2 or OCFS2). StorPool can also be used with no file system, for example when using volumes to store VM images directly. StorPool is compatible with other technologies from the Linux storage stack, such as LVM, dm-cache/bcache, and LIO.

4.4.3. Hypervisors & Cloud Management/Orchestration

  • KVM

  • LXC/Containers

  • OpenStack

  • OpenNebula

  • OnApp

  • CloudStack

  • any other technology compatible with the Linux storage stack.

5. Installation and Upgrade

Currently the installation and upgrade procedures are performed by the StorPool support team.

6. Configuration Guide

6.1. Minimal Node Configuration

To configure nodes, StorPool uses a configuration file found at /etc/storpool.conf. Host-specific configuration can be placed in the /etc/storpool.conf.d/ directory. The minimum working configuration must specify the network interfaces, the number of expected nodes, the authentication tokens, and the unique ID of the node, as in the following example:

#-
# Copyright (c) 2013 - 2017  StorPool.
# All rights reserved.
#

# Human readable name of the cluster, usually in the form "Company Name"-"Location", e.g. StorPoolLab-Sofia
#
# Mandatory for the monitoring
SP_CLUSTER_NAME=  #<Company-Name-PoC>-<City-or-nearest-airport>

# Remote authentication token provided by StorPool support for data related to crashed services, collected
# vmcore-dmesg files after kernel panic, per-host monitoring alerts, storpool_iolatmon alerts, etc.
SP_RAUTH_TOKEN= <rauth-token>

# Computed from the StorPool Support ID; consists of location and cluster separated by a dot, e.g. nzkr.b
#
# Mandatory since version 16.02
SP_CLUSTER_ID=  #Ask StorPool Support

# Interface for storpool communication
#
# Default: empty
SP_IFACE1_CFG=1:eth1.100:eth1:100:10.1.100.1:b:x:r
SP_IFACE2_CFG=1:eth2.100:eth2:100:10.2.100.1:b:x:r


# expected nodes for beacon operation
#
# !!! Must be specified !!!
#
SP_EXPECTED_NODES=3


# API authentication token
#
# 64-bit random value
# generate for example with: 'od -vAn -N8 -tu8 /dev/random'
SP_AUTH_TOKEN=4306865639163977196


##########################


[spnode1.example.com]
SP_OURID = 1
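The override semantics of this file (global KEY=VALUE settings plus [hostname] sections) can be illustrated with a small parser. This is not the StorPool parser, just an approximation of the behaviour described above:

```python
def parse_storpool_conf(text: str, hostname: str) -> dict:
    """Minimal illustrative parser for storpool.conf-style files:
    global KEY=VALUE lines, '#' comments, and [hostname] sections
    whose values apply only to that host."""
    settings, section = {}, None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()      # enter a per-host section
            continue
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if section is None or section == hostname:
            settings[key] = value
    return settings

conf = """
SP_EXPECTED_NODES=3
[spnode1.example.com]
SP_OURID = 1
[spnode2.example.com]
SP_OURID = 2
"""
print(parse_storpool_conf(conf, "spnode1.example.com"))
```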

6.2. Full Configuration Options List

The following is a complete list of the configuration options with short explanation for each of them.

6.2.1. Cluster name

Required for the pro-active monitoring performed by StorPool support team. Usually in the form <Company-Name>-<City-or-nearest-airport>:

SP_CLUSTER_NAME=StorPoolLab-Sofia

6.2.2. Cluster ID

The Cluster ID is computed from the StorPool Support ID and consists of two parts - location and cluster separated by a dot ("."). Each location consists of one or more clusters:

SP_CLUSTER_ID=nzkr.b

6.2.3. Non-voting beacon node

Intended for client-only nodes; the storpool_server service will refuse to start on a node with SP_NODE_NON_VOTING set. The default is 0:

SP_NODE_NON_VOTING=1

Attention

It is strongly recommended to configure SP_NODE_NON_VOTING in the per-host configuration sections of storpool.conf (see Per host configuration for more details).

6.2.4. Communication interface for StorPool cluster

It is recommended to have two dedicated network interfaces for communication between the nodes:

SP_IFACE1_CFG=1:eth1.100:eth1:100:10.1.100.1:b:x:r
SP_IFACE2_CFG=1:eth2.100:eth2:100:10.2.100.1:b:x:r

For a full explanation of all options, please check /usr/share/doc/storpool/examples/storpool.conf.example.
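Structurally, each SP_IFACE*_CFG value is a colon-separated record. The helper below only splits it into fields; the field names used here are illustrative guesses - the authoritative description of each field is in the storpool.conf.example file mentioned above:

```python
# Field names are illustrative guesses; see storpool.conf.example for
# the authoritative description of each position.
FIELDS = ("format", "vlan_iface", "raw_iface", "vlan_id",
          "address", "flag1", "flag2", "flag3")

def split_iface_cfg(cfg: str) -> dict:
    """Split a colon-separated SP_IFACE*_CFG value into named fields."""
    return dict(zip(FIELDS, cfg.split(":")))

print(split_iface_cfg("1:eth1.100:eth1:100:10.1.100.1:b:x:r"))
```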

6.2.5. Address for the API management (storpool_mgmt)

Used by the CLI. Multiple clients can send requests to the API simultaneously. The management service may be started on one or more nodes in the cluster, but only one is active at a time. By default it is bound to localhost:

SP_API_HTTP_HOST=127.0.0.1

For cluster-wide access and automatic failover between the nodes, multiple nodes may have the API service started. The specified IP address is brought up on only one of the nodes in the cluster at a time - the so-called active API service. You may specify an available IP address (SP_API_HTTP_HOST), which will be brought up or down on the corresponding interface (SP_API_IFACE) when the API service migrates between the nodes.

To configure an interface (SP_API_IFACE) and address (SP_API_HTTP_HOST):

SP_API_HTTP_HOST=10.10.10.240
SP_API_IFACE=eth1

Note

The script that adds or deletes the SP_API_HTTP_HOST address is located at /usr/lib/storpool/api-ip and could be easily modified for other use cases (e.g. configure routing, firewalls, etc.).

6.2.6. Port for the API management (storpool_mgmt)

Port for the API management service, the default is:

SP_API_HTTP_PORT=81

6.2.7. Ignore RX port option

Used to instruct the services that the network can preserve the selected port even when altering ports. The default is:

SP_IGNORE_RX_PORT=0

6.2.8. Preferred port

Used to specify which port is preferred when two networks are specified but only one of them can actually be used for some reason (in an active-backup bond style). The default value is:

SP_PREFERRED_PORT=0 # which is load-balancing

Supported values are:

SP_PREFERRED_PORT=1 # use SP_IFACE1_CFG by default
SP_PREFERRED_PORT=2 # use SP_IFACE2_CFG by default

6.2.9. RDMA interface

The RDMA interface, when one is used for StorPool communication; example:

SP_RDMA_IFACE=qib0

6.2.10. Address for the bridge service (storpool_bridge)

Required for the local bridge service; this is the address the bridge binds to:

SP_BRIDGE_HOST=180.220.200.8

6.2.11. Interface for the bridge address (storpool_bridge)

Expected when the SP_BRIDGE_HOST value is a floating IP address for the storpool_bridge service:

SP_BRIDGE_IFACE=bond0.900

6.2.12. Parallel requests per disk when recovering from remote (storpool_bridge)

Number of parallel requests to issue while performing remote recovery, between 1 and 64. Default:

SP_REMOTE_RECOVERY_PARALLEL_REQUESTS_PER_DISK=2

6.2.13. Working directory

Used for FIFOs, sockets, core files, etc. The default is:

SP_WORKDIR=/var/run/storpool

Hint

On nodes with /var/run in RAM and a limited amount of memory, /var/spool/storpool/run is recommended.

6.2.14. Report directory

Location for collecting automated bug reports and shared memory dumps:

SP_REPORTDIR=/var/spool/storpool

6.2.15. Restart automatically in case of crash

Restart the service after a crash if there have been fewer than 3 crashes during this interval (in seconds). If the value is 0, the service will not be restarted at all and will have to be started manually. The default is 30 minutes:

SP_RESTART_ON_CRASH=1800

6.2.16. Expected nodes

Minimum expected nodes for beacon operation, usually equal to the number of nodes with storpool_server instances running:

SP_EXPECTED_NODES=3

6.2.17. Local user for debug data collection

User to change the ownership of the storpool_abrtsync service runtime. Unset by default:

SP_ABRTSYNC_USER=

Note

If this is not configured during installation, the user will be set to storpool by default.

6.2.18. Remote addresses for sending debug data

The defaults are below; they should be altered only in the unlikely event that a jumphost or custom collection nodes are used:

SP_ABRTSYNC_REMOTE_ADDRESSES=reports.storpool.com,reports1.storpool.com,reports2.storpool.com

6.2.19. Remote ports for sending debug data

The default port is below; it might be altered in case a jumphost or custom collection nodes are used:

SP_CRASH_REMOTE_PORT=2266

6.2.20. Group owner for the StorPool devices

The system group to use for the /dev/storpool directory and the /dev/sp-* raw disk devices:

SP_DISK_GROUP=disk

6.2.21. Permissions for the StorPool devices

The access mode to set on the /dev/sp-* raw disk devices:

SP_DISK_MODE=0660

6.2.22. Exclude disks globally or per server instance

A list of paths to drives to be excluded at instance boot time:

SP_EXCLUDE_DISKS=/dev/sda1:/dev/sdb1

Can also be specified for each server instance individually:

SP_EXCLUDE_DISKS=/dev/sdc1
SP_EXCLUDE_DISKS_1=/dev/sda1

6.2.23. Cgroup setup

More info on StorPool and cgroups is available in Control Groups.

The following enables the usage of cgroups; the default is on (1):

SP_USE_CGROUPS=1

Each StorPool process requires a specification of the cgroups it should be started in; there is a default configuration for each service. One or more processes may be placed in the same cgroup, or each may be in a cgroup of its own, as appropriate. The tool for setting up cgroups is storpool_cg; it can automatically configure a system depending on the installed services, on all supported operating systems. More info is available in Control Groups.

SP_RDMA_CGROUPS is required for setting the cgroups for the kernel threads started by the storpool_rdma module:

SP_RDMA_CGROUPS=-g cpuset:storpool.slice/rdma -g memory:storpool.slice/common

Set cgroups for the storpool_block service:

SP_BLOCK_CGROUPS=-g cpuset:storpool.slice/block -g memory:storpool.slice/common

Set cgroups for the storpool_bridge service:

SP_BRIDGE_CGROUPS=-g cpuset:storpool.slice/bridge -g memory:storpool.slice/alloc

Set cgroups for the storpool_server service:

SP_SERVER_CGROUPS=-g cpuset:storpool.slice/server -g memory:storpool.slice/common
SP_SERVER1_CGROUPS=-g cpuset:storpool.slice/server_1 -g memory:storpool.slice/common
SP_SERVER2_CGROUPS=-g cpuset:storpool.slice/server_2 -g memory:storpool.slice/common
SP_SERVER3_CGROUPS=-g cpuset:storpool.slice/server_3 -g memory:storpool.slice/common
SP_SERVER4_CGROUPS=-g cpuset:storpool.slice/server_4 -g memory:storpool.slice/common
SP_SERVER5_CGROUPS=-g cpuset:storpool.slice/server_5 -g memory:storpool.slice/common
SP_SERVER6_CGROUPS=-g cpuset:storpool.slice/server_6 -g memory:storpool.slice/common

Set cgroups for the storpool_beacon service:

SP_BEACON_CGROUPS=-g cpuset:storpool.slice/beacon -g memory:storpool.slice/common

Set cgroups for the storpool_mgmt service:

SP_MGMT_CGROUPS=-g cpuset:storpool.slice/mgmt -g memory:storpool.slice/alloc

Set cgroups for the storpool_controller service:

SP_CONTROLLER_CGROUPS=-g cpuset:system.slice -g memory:system.slice

Set cgroups for the storpool_iscsi target service:

SP_ISCSI_CGROUPS=-g cpuset:storpool.slice/iscsi -g memory:storpool.slice/alloc

Set cgroups for the storpool_nvmed service:

SP_NVMED_CGROUPS=-g cpuset:storpool.slice/beacon -g memory:storpool.slice/common
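Each SP_*_CGROUPS value above is a list of -g controller:group pairs. A small helper that turns such a value into a mapping - an illustrative sketch, not part of StorPool:

```python
def parse_cgroup_spec(spec: str) -> dict:
    """Turn a '-g controller:group' option string, as used by the
    SP_*_CGROUPS variables, into a {controller: cgroup} mapping."""
    tokens = spec.split()
    mapping = {}
    for flag, value in zip(tokens[::2], tokens[1::2]):
        if flag != "-g":
            raise ValueError(f"unexpected token: {flag}")
        controller, _, group = value.partition(":")
        mapping[controller] = group
    return mapping

print(parse_cgroup_spec(
    "-g cpuset:storpool.slice/iscsi -g memory:storpool.slice/alloc"))
```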

6.2.24. Network and Storage controllers interrupts affinity

The setirqaff utility is started by cron every minute. It checks the CPU affinity settings of several classes of IRQs (network interfaces, HBA, RAID) and updates them if needed. The policy is built into the script and does not require any external configuration files, apart from a properly configured storpool.conf on the present node.

6.2.25. Cache size

Each storpool_server process allocates this amount of RAM (in MB) for caching. The size of the cache depends on the number of storage devices handled by each storpool_server instance and is taken care of by the storpool_cg tool during cgroup configuration. Example configuration for all storpool_server instances:

SP_CACHE_SIZE=4096

Note

A node with three storpool_server processes running will use 4096*3 = 12 GB of cache in total.

Override the size of the cache for each storpool_server instance; useful when different instances control different numbers of drives:

SP_CACHE_SIZE=1024
SP_CACHE_SIZE_1=1024
SP_CACHE_SIZE_2=4096
SP_CACHE_SIZE_3=8192
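Assuming the fallback behaviour described above (a per-instance SP_CACHE_SIZE_&lt;n&gt; overrides the global SP_CACHE_SIZE), the effective per-instance and total cache sizes can be computed as follows:

```python
def cache_size_for(conf: dict, instance: int) -> int:
    """Per-instance SP_CACHE_SIZE_<n> overrides the global SP_CACHE_SIZE
    (values in MB); fall back to the global value when unset."""
    return int(conf.get(f"SP_CACHE_SIZE_{instance}", conf["SP_CACHE_SIZE"]))

conf = {"SP_CACHE_SIZE": "1024",
        "SP_CACHE_SIZE_2": "4096",
        "SP_CACHE_SIZE_3": "8192"}
total = sum(cache_size_for(conf, i) for i in (1, 2, 3))
print(total)  # instance 1 falls back to the global 1024 MB
```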

Enable the internal write-back caching:

SP_WRITE_BACK_CACHE_ENABLED=1

Attention

A UPS is mandatory when WBC is used; a clean server shutdown is required before the UPS batteries are depleted.

6.2.26. API authentication token

This value must be a unique integer for each cluster:

SP_AUTH_TOKEN=0123456789

Hint

Generated with: od -vAn -N8 -tu8 /dev/random
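The od command above reads 8 bytes from /dev/random and prints them as an unsigned 64-bit decimal integer. An equivalent sketch in Python:

```python
import os

# Read 8 random bytes and interpret them as an unsigned 64-bit integer,
# mirroring `od -vAn -N8 -tu8 /dev/random` (which uses the machine's
# native byte order; little-endian on x86).
token = int.from_bytes(os.urandom(8), "little")
print(token)
```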

6.2.27. NVMe SSD drives

To instruct the storpool_server service which PCIe ID the NVMe SSD has, configure the following:

SP_NVME_PCI_ID=0000:04:00.0

Hint

More than one PCIe NVMe device can be specified by separating their PCIe IDs with spaces:

SP_NVME_PCI_ID=0000:01:00.0 0000:02:00.0 0000:06:00.0 0000:07:00.0

6.2.28. Per host configuration

Specific details per host. The value in the square brackets should be the name of the host as returned by the hostname command. The SP_OURID value for each node must be unique throughout the cluster:

[spnode1.example.com]
SP_OURID=1

The highest ID in this release is 62, i.e. up to 62 nodes in a single cluster.

Specific configuration details might be added for each host individually, e.g.:

[spnode1.example.com]
SP_OURID=1
SP_IFACE1_CFG=1:eth1.100:eth1:100:10.1.100.1:b:x:r
SP_IFACE2_CFG=1:eth2.100:eth2:100:10.2.100.1:b:x:r
SP_NODE_NON_VOTING=1

7. Prepare Storage Devices

All hard drives, SSDs, or NVMe drives that will be used by StorPool must have one or more properly aligned partitions and must have an assigned ID. Larger NVMe drives can be split into two or more partitions so that they can be assigned to different storpool_server instances; this works around the bottleneck of a saturated CPU by using a separate hyperthread or isolated CPU core for each additional storpool_server instance.

The ID should be a number between 1 and 4000 and must be unique within the StorPool cluster. An example command for creating one partition spanning the whole drive with the proper alignment:

# parted -s --align optimal /dev/sdX mklabel gpt -- mkpart primary 2MiB 100%    # where X is the drive letter

For dual partitions on an NVMe drive use:

# parted -s --align optimal /dev/nvmeXnY mklabel gpt -- mkpart primary 2MiB 50%   # where X is the nvme device controller and Y is the device's first namespace
# parted -s --align optimal /dev/nvmeXnY mkpart primary 50% 100%

Similarly to split an even larger (e.g. 8TB+) NVMe drive to four partitions use:

# parted -s --align optimal /dev/nvmeXnY mklabel gpt -- mkpart primary 2MiB 25%   # where X is the nvme device controller and Y is the device's first namespace
# parted -s --align optimal /dev/nvmeXnY mkpart primary 25% 50%
# parted -s --align optimal /dev/nvmeXnY mkpart primary 50% 75%
# parted -s --align optimal /dev/nvmeXnY mkpart primary 75% 100%

Hint

NVMe devices larger than 4 TB should always be split into chunks of up to 4 TiB.
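One way to derive the parted percentage boundaries from this rule is to divide the drive into equal chunks of at most 4 TiB each. This is a sketch only; in practice more partitions may be used for per-instance CPU parallelism, as noted above:

```python
import math

def split_points(size_tib: float, max_chunk_tib: float = 4.0):
    """Number of equal partitions needed so that each chunk is at most
    max_chunk_tib, plus the parted percentage boundaries."""
    parts = max(1, math.ceil(size_tib / max_chunk_tib))
    bounds = [round(100 * i / parts) for i in range(parts + 1)]
    return parts, bounds

print(split_points(16))  # a 16 TiB drive needs 4 partitions
print(split_points(8))   # an 8 TiB drive needs 2 partitions
```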

On a brand new cluster installation it is necessary to have exactly one drive formatted with the “init” (-I) flag of storpool_initdisk. This device is needed only for the first start; it is therefore best to pick the first drive in the cluster.

Initializing the first drive on the first server node with the init flag:

# storpool_initdisk -I {diskId} /dev/sdXN   # where X is the drive letter and N is the partition number

Initializing an SSD or NVME SSD device with the SSD flag set:

# storpool_initdisk -s {diskId} /dev/sdXN   # where X is the drive letter and N is the partition number

Initializing an HDD drive with a journal device:

# storpool_initdisk {diskId} /dev/sdXN --journal /dev/sdYM   # where X and Y are the drive letters, and N and M are the partition numbers

List all initialized devices:

# storpool_initdisk --list

Example output:

0000:01:00.0-p1, diskId 2305, version 10007, server instance 0, cluster e.b, SSD, opened 7745
0000:02:00.0-p1, diskId 2306, version 10007, server instance 0, cluster e.b, SSD, opened 7745
/dev/sdr1, diskId 2301, version 10007, server instance 1, cluster e.b, SSD, opened 8080
/dev/sdq1, diskId 2302, version 10007, server instance 1, cluster e.b, SSD, opened 8080
/dev/sds1, diskId 2303, version 10007, server instance 1, cluster e.b, SSD, opened 8080
/dev/sdt1, diskId 2304, version 10007, server instance 1, cluster e.b, SSD, opened 8080
/dev/sda1, diskId 2311, version 10007, server instance 2, cluster e.b, WBC, jmv 160036C1B49, opened 8185
/dev/sdb1, diskId 2311, version 10007, server instance 2, cluster -, journal mv 160036C1B49, opened 8185
/dev/sdc1, diskId 2312, version 10007, server instance 2, cluster e.b, WBC, jmv 160036CF95B, opened 8185
/dev/sdd1, diskId 2312, version 10007, server instance 2, cluster -, journal mv 160036CF95B, opened 8185
/dev/sde1, diskId 2313, version 10007, server instance 3, cluster e.b, WBC, jmv 160036DF8DA, opened 8971
/dev/sdf1, diskId 2313, version 10007, server instance 3, cluster -, journal mv 160036DF8DA, opened 8971
/dev/sdg1, diskId 2314, version 10007, server instance 3, cluster e.b, WBC, jmv 160036ECC80, opened 8971
/dev/sdh1, diskId 2314, version 10007, server instance 3, cluster -, journal mv 160036ECC80, opened 8971

Other available options:

  • -i - Specify the server instance; used when more than one storpool_server instance is running on the same node

  • -r - Used to return an ejected disk back to the cluster or change some of the flags

  • -F - Forget this disk and mark it as ejected (succeeds only without a running storpool_server instance that has the drive opened)

  • -s - Set the SSD flag (on new initialization only; not reversible with -r).

  • -e (count) - Initialize the disk by overriding the default entries count.

  • -j|--journal (<device>|none) - Used for HDDs when a RAID controller with a working cachevault or battery is present or an NVMe device is used as a power loss protected write-back journal cache.

  • --bad - Marks disk as bad. Will be treated as ejected by the servers.

  • --good - Resets disk to ejected if it was bad. Use with caution.

  • --wbc (y|n) - Used for HDDs when the internal write-back caching is enabled; requires SP_WRITE_BACK_CACHE_ENABLED in order to have an effect. Turned off by default.

  • --nofua (y|n) - Used to forcefully disable FUA support for an SSD device. Use with caution, because it may lead to data loss if the device is powered off before a FLUSH CACHE command is issued.

  • --no-flush (y|n) - Used to forcefully disable FLUSH support for an SSD device.

  • --no-trim (y|n) - Used to forcefully disable TRIM support for an SSD device. Useful when the drive misbehaves with TRIM enabled.

  • --wipe-all-data - Used when re-initializing an already initialized StorPool drive. Use with caution.

  • --list-empty - List empty NVMe devices.

  • --json - Output the list of devices as a JSON object.

  • --no-test - Disable the forced one-time test flag.

  • --no-notify - Do not notify servers of the changes; they won’t immediately open the disk. Useful for changing a flag with -r without returning the disk back to the server.

8. Verify the Installation

A StorPool installation provides the following daemons, each taking care of different functionality on every participating node in the cluster.

8.1. storpool_beacon

The beacon must be the first process started on all nodes in the cluster. It informs all members about the availability of the node it runs on. If the number of visible nodes changes, every storpool_beacon service checks that its node still participates in the quorum, i.e. that it can communicate with more than half of the expected nodes, including itself (see SP_EXPECTED_NODES in the Full Configuration Options List section). If the storpool_beacon service has started successfully, it will send to the system log (/var/log/messages, /var/log/syslog, or similar) messages such as the following for every node that comes up in the StorPool cluster:

[snip]
Jan 21 16:22:18 s01 storpool_beacon[18839]: [info] incVotes(1) from 0 to 1, voteOwner 1
Jan 21 16:23:10 s01 storpool_beacon[18839]: [info] peer 2, beaconStatus UP bootupTime 1390314187662389
Jan 21 16:23:10 s01 storpool_beacon[18839]: [info] incVotes(1) from 1 to 2, voteOwner 2
Jan 21 16:23:10 s01 storpool_beacon[18839]: [info] peer up 1
[snip]
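The quorum rule above ("more than half of the expected nodes, including itself") can be expressed as a one-line check (illustrative only):

```python
def in_quorum(visible_nodes: int, expected_nodes: int) -> bool:
    """A node participates in the quorum when it can communicate with
    more than half of the expected nodes, counting itself."""
    return 2 * visible_nodes > expected_nodes

# With SP_EXPECTED_NODES=3, at least 2 visible nodes are needed.
print(in_quorum(2, 3))
```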

8.2. storpool_server

The storpool_server service must be started on each node that provides its hard drives, SSDs or NVMe drives to the cluster. If the service has started successfully, all the hard drives intended to be used as StorPool disks should be listed in the system log, e.g.:

Dec 14 09:54:19 s11 storpool_server[13658]: [info] /dev/sdl1: adding as data disk 1101 (ssd)
Dec 14 09:54:19 s11 storpool_server[13658]: [info] /dev/sdb1: adding as data disk 1111
Dec 14 09:54:20 s11 storpool_server[13658]: [info] /dev/sda1: adding as data disk 1114
Dec 14 09:54:20 s11 storpool_server[13658]: [info] /dev/sdk1: adding as data disk 1102 (ssd)
Dec 14 09:54:20 s11 storpool_server[13658]: [info] /dev/sdj1: adding as data disk 1113
Dec 14 09:54:22 s11 storpool_server[13658]: [info] /dev/sdi1: adding as data disk 1112

On a dedicated node, or a node with a larger amount of spare resources, more than one storpool_server instance can be started (up to four instances).

8.3. storpool_block

The storpool_block service provides the client (initiator) functionality. StorPool volumes can be attached only to nodes where this service is running. When attached to a node, a volume can be used and manipulated as a regular block device via the /dev/storpool/{volume_name} symlink:

# lsblk /dev/storpool/test
NAME MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sp-2 251:2    0  100G  0 disk

8.4. storpool_mgmt

The storpool_mgmt service should be started on the management node. It receives requests from user-space tools (CLI or API), executes them in the StorPool cluster, and returns the results to the sender. Two or more nodes in the same cluster can be used as API management servers, with only one node active at a time. An automatic failover mechanism is available: when the node with the active storpool_mgmt service fails, the SP_API_HTTP_HOST IP address is configured on the node with the lowest SP_OURID among the remaining nodes with a running storpool_mgmt service.
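The failover rule can be sketched as a simple selection over the alive nodes running storpool_mgmt (illustrative; SP_OURID values come from storpool.conf):

```python
def next_active_mgmt(alive_mgmt_ourids):
    """After a failover, the SP_API_HTTP_HOST address moves to the
    running storpool_mgmt node with the lowest SP_OURID (a sketch of
    the selection rule described above)."""
    return min(alive_mgmt_ourids) if alive_mgmt_ourids else None

print(next_active_mgmt([3, 1, 2]))  # node 1 becomes active
print(next_active_mgmt([3, 2]))     # if node 1 fails, node 2 takes over
```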

8.5. storpool_bridge

The storpool_bridge service is started on one or more nodes in the cluster, with one acting as the active bridge, similar to the storpool_mgmt service. The service synchronizes snapshots between this cluster and one or more StorPool clusters in different locations, for backup and disaster recovery use cases.

8.6. storpool_controller

The storpool_controller service is started on all nodes running the storpool_server service. It collects information from all storpool_server instances in order to provide statistics data to the API.

Note

The storpool_controller service requires port 47567 to be open on the nodes where the API (storpool_mgmt) service is running.

8.7. storpool_nvmed

The storpool_nvmed service is started on all nodes that have the storpool_server service and have NVMe devices. It handles the management of the NVMe devices, their unplugging from the kernel’s nvme driver and passing to the storpool_pci driver.

8.8. storpool_stat

The storpool_stat service is started on all nodes and collects metrics about different aspects of the system:

  • on all nodes, CPU stats - queue run/wait, user, system, etc., per CPU;

  • on all nodes, memory usage stats per cgroup;

  • on all nodes, network stats for the StorPool services;

  • on all nodes, the I/O stats of the system drives;

  • on all nodes, per-host validating service checks (for example, whether there are processes in the root cgroup, whether the API is reachable if configured, etc.);

  • on all nodes with storpool_block, the I/O stats of all attached StorPool volumes;

  • on server nodes, stats for the communication of storpool_server with the drives.

The collected data can be viewed at https://analytics.storpool.com and can also be submitted to an InfluxDB instance of the customer, configurable in storpool.conf.

9. CLI Tutorial

StorPool provides an easy yet powerful Command Line Interface (CLI) for administering a data storage cluster, or multiple clusters in the same location (13.  Multi site and Multicluster). It has an integrated help system that provides useful information at every step. There are various ways to execute commands in the CLI, depending on the style and needs of the administrator. The StorPool CLI reads its configuration from the /etc/storpool.conf file and from command-line options.

Type a regular shell command with parameters:

# storpool service list

Use interactive StorPool shell:

# storpool
StorPool> service list

Pipe command output to StorPool CLI:

# echo "service list" | storpool

Redirect the standard input from a predefined file with commands:

# storpool < input_file

Display the available command line options:

# storpool --help
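
The redirection form lends itself to simple batch scripts. The sketch below builds a command file for the CLI; the storpool invocation itself is left commented out, since it requires a running cluster:

```shell
#!/bin/sh
# Build a batch of CLI commands and feed it to storpool via the stdin
# redirection shown above.
cmds=$(mktemp)
cat > "$cmds" <<'EOF'
service list
disk list
net list
EOF
# storpool < "$cmds"    # requires a running StorPool cluster
n=$(wc -l < "$cmds")
echo "queued $n commands"
rm -f "$cmds"
```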

An error message with the possible options is displayed if the shell command is incomplete or wrong:

# storpool attach
Error: incomplete command! Expected:
    list - list the current attachments
    timeout - seconds to wait for the client to appear
    volume - specify a volume to attach
    here - attach here
    noWait - do not wait for the client
    snapshot - specify a snapshot to attach
    mode - specify the read/write mode
    client - specify a client to attach the volume to

# storpool attach volume
Error: incomplete command! Expected:
  volume - the volume to attach

Interactive shell help can be invoked by pressing the question mark key (?):

# storpool
StorPool> attach <?>
  client - specify a client to attach the volume to {M}
  here - attach here {M}
  list - list the current attachments
  mode - specify the read/write mode {M}
  noWait - do not wait for the client {M}
  snapshot - specify a snapshot to attach {M}
  timeout - seconds to wait for the client to appear {M}
  volume - specify a volume to attach {M}

Shell autocompletion, invoked by pressing the Tab key twice, shows the available options for the current step:

StorPool> attach <tab> <tab>
client    here      list      mode      noWait    snapshot  timeout   volume

The StorPool shell can detect incomplete lines and suggest options:

# storpool
StorPool> attach <enter>
.................^
Error: incomplete command! Expected:
    volume - specify a volume to attach
    client - specify a client to attach the volume to
    list - list the current attachments
    here - attach here
    mode - specify the read/write mode
    snapshot - specify a snapshot to attach
    timeout - seconds to wait for the client to appear
    noWait - do not wait for the client

To exit the shell, use the quit or exit commands, or the Ctrl+C or Ctrl+D keyboard shortcuts of your terminal.

To enter MultiCluster mode use:

StorPool> multiCluster on
[MC] StorPool>

For non-interactive mode use:

# storpool -M <command>

Note

All commands that are not relevant to multicluster mode silently fall back to non-multicluster mode. For example, storpool -M service list will list only local services; the same applies to storpool -M disk list and storpool -M net list.

9.1. Location

The location submenu is used for configuring other StorPool sub-clusters in the same or a different location (13.  Multi site and Multicluster). The location ID is the first part (left of the .) of the SP_CLUSTER_ID configured in the remote cluster.

For example to add a location with SP_CLUSTER_ID=nzkr.b use:

# storpool location add nzkr StorPoolLab-Sofia
OK

To list the configured locations use:

# storpool location list
-----------------------------------------------
| id   | name              | rxBuf  | txBuf   |
-----------------------------------------------
| nzkr | StorPoolLab-Sofia | 85 KiB | 128 KiB |
-----------------------------------------------

To rename a location use:

# storpool location rename StorPoolLab-Sofia name StorPoolLab-Amsterdam
OK

To remove a location use:

# storpool location remove StorPoolLab-Sofia
OK

Note

This command will fail if there is an existing cluster or a remote bridge configured for this location.

To update the send or receive buffer sizes to values different from the defaults, use:

# storpool location update StorPoolLab-Sofia recvBufferSize 16M
OK
# storpool location update StorPoolLab-Sofia sendBufferSize 1M
OK
# storpool location list
-----------------------------------------------
| id   | name              | rxBuf  | txBuf   |
-----------------------------------------------
| nzkr | StorPoolLab-Sofia | 16 MiB | 1.0 MiB |
-----------------------------------------------

9.2. Cluster

The cluster submenu is used for configuring a new cluster for an already configured location. The cluster ID is the second part (right of the .) of the SP_CLUSTER_ID configured in the remote cluster. For example, to add the cluster b for the remote location nzkr use:

# storpool cluster add StorPoolLab-Sofia b
OK

To list the configured clusters use:

# storpool cluster list
--------------------------------------------------
| name                  | id | location          |
--------------------------------------------------
| StorPoolLab-Sofia-cl1 | b  | StorPoolLab-Sofia |
--------------------------------------------------

To remove a cluster use:

# storpool cluster remove StorPoolLab-Sofia b

9.3. Remote Bridge

The remoteBridge submenu is used to register or deregister a remote bridge for a configured location.

To register a remote bridge use storpool remoteBridge register <location-name> <IP address> <public-key>. For example:

# storpool remoteBridge register StorPoolLab-Sofia 10.1.100.10 ju9jtefeb8idz.ngmrsntnzhsei.grefq7kzmj7zo.nno515u6ftna6
OK

This registers the StorPoolLab-Sofia location with IP address 10.1.100.10 and the above public key.

In case of a change in the IP address or the public key of a remote location, the remote bridge can be de-registered and then registered again with the required parameters, e.g.:

# storpool remoteBridge deregister 10.1.100.10
OK
# storpool remoteBridge register StorPoolLab-Sofia 78.90.13.150 8nbr9q162tjh.ahb6ueg16kk2.mb7y2zj2hn1ru.5km8qut54x7z
OK

A remote bridge may be registered with noCrypto in case of a secure interconnect between the clusters; a typical use case is a 13.1.  Multicluster setup with other sub-clusters in the same datacenter.

To enable deferred deletion on unexport from the remote site, the minimumDeleteDelay flag should also be set. The format of the command is storpool remoteBridge register <location-name> <IP address> <public-key> <minimumDeleteDelay>, where the last parameter is a time period provided as X[smhd]: X is an integer, and s, m, h, and d stand for seconds, minutes, hours, and days accordingly.

For example, if we register the remote bridge for the StorPoolLab-Sofia location with a minimumDeleteDelay of one day, the command would look like this:

# storpool remoteBridge register StorPoolLab-Sofia 78.90.13.150 8nbr9q162tjh.ahb6ueg16kk2.mb7y2zj2hn1ru.5km8qut54x7z 1d
OK

After this operation, all snapshots sent from the remote cluster can later be unexported with the deleteAfter parameter set (see the Remote snapshots section). Any deleteAfter parameter lower than the minimumDeleteDelay will be overridden by the bridge in the remote cluster. All such events are logged on the node with the active bridge in the remote cluster.

To list all registered remote bridges use:

# storpool remoteBridge list
------------------------------------------------------------------------------------------------------------------------------
| ip           | remote            | minimumDeleteDelay | publicKey                                               | noCrypto |
------------------------------------------------------------------------------------------------------------------------------
| 10.1.200.10  | StorPoolLab-Sofia |                    | nonwtmwsgdr2p.fos2qus4h1qdk.pnt9ozj8gcktj.d7b2aa24gsegn | 0        |
| 10.1.200.11  | StorPoolLab-Sofia |                    | jtgeaqhsmqzqd.x277oefofxbpm.bynb2krkiwg54.ja4gzwqdg925j | 0        |
------------------------------------------------------------------------------------------------------------------------------
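
The tabular output above is easy to post-process. The sketch below extracts the bridge IPs registered for a location, replaying sample rows (with hypothetical keys) through a heredoc so it runs without a cluster:

```shell
#!/bin/sh
# Extract the bridge IPs registered for a given location from the tabular
# "storpool remoteBridge list" output. Fields are split on the | borders;
# $2 is the ip column and $3 the remote (location) column.
bridges_for() {
    awk -F'|' -v loc="$1" 'NF > 2 && $3 ~ loc { gsub(/ /, "", $2); print $2 }'
}

bridges_for StorPoolLab-Sofia <<'EOF'
| ip           | remote            | minimumDeleteDelay | publicKey | noCrypto |
| 10.1.200.10  | StorPoolLab-Sofia |                    | key1      | 0        |
| 10.1.200.11  | StorPoolLab-Sofia |                    | key2      | 0        |
EOF
```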

Check the 13.2.  Multi site section of this user guide for more on deferred deletion.

9.4. Network

To list basic details about the cluster network use:

# storpool net list
------------------------------------------------------------
| nodeId | flags   | net 1             | net 2             |
------------------------------------------------------------
|     11 | uU + AJ | F4:52:14:76:9C:B0 | F4:52:14:76:9C:B0 |
|     12 | uU + AJ | 02:02:C9:3C:E3:80 | 02:02:C9:3C:E3:81 |
|     13 | uU + AJ | F6:52:14:76:9B:B0 | F6:52:14:76:9B:B1 |
|     14 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
|     15 | uU + AJ | 1A:60:00:00:00:0F | 1E:60:00:00:00:0F |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
  u - packets recently received from this node
  d - no packets recently received from this node

  U - this node has enough votes to be in the quorum
  D - this node does not have enough votes to be in the quorum
  M - this node is being damped by the rest of the nodes in the cluster
  + - this node considers itself in the quorum
  B - the connection to this node is through a backup link; check the cabling
  A - this node is using hardware acceleration
  J - the node uses jumbo frames

  N - a non-voting node
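
The flags column can be used to audit the cluster network from a script. The sketch below lists nodes whose flags lack J (no jumbo frames), replaying sample output through a heredoc so it runs without a cluster:

```shell
#!/bin/sh
# List nodes from "storpool net list" output whose flags column lacks J,
# i.e. nodes not using jumbo frames. Fields are split on the | borders;
# $2 is the nodeId column and $3 the flags column.
no_jumbo() {
    awk -F'|' 'NF > 2 && $2 ~ /[0-9]/ && $3 !~ /J/ { gsub(/ /, "", $2); print $2 }'
}

no_jumbo <<'EOF'
| nodeId | flags   | net 1             | net 2             |
|     11 | uU + AJ | F4:52:14:76:9C:B0 | F4:52:14:76:9C:B0 |
|     15 | uU + A  | 1A:60:00:00:00:0F | 1E:60:00:00:00:0F |
EOF
# prints 15
```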

9.5. Server

To list the nodes that are configured as StorPool servers and their storpool_server instances use:

# storpool server list
cluster running, mgmt on node 11
    server  11.0 running on node 11
    server  12.0 running on node 12
    server  13.0 running on node 13
    server  14.0 running on node 14
    server  11.1 running on node 11
    server  12.1 running on node 12
    server  13.1 running on node 13
    server  14.1 running on node 14

To get more information about the storage devices provided by a particular server, use storpool server <ID> disk list:

# storpool server 11 disk list
disk  |  server  |    size     |    used     |   est.free  |      %  |  free entries  |   on-disk size  |  allocated objects |  errors |   flags
1103  |    11.0  |    447 GiB  |    3.1 GiB  |    424 GiB  |    1 %  |       1919912  |         20 MiB  |    40100 / 480000  |   0 / 0 |
1104  |    11.0  |    447 GiB  |    3.1 GiB  |    424 GiB  |    1 %  |       1919907  |         20 MiB  |    40100 / 480000  |   0 / 0 |
1111  |    11.0  |    465 GiB  |    2.6 GiB  |    442 GiB  |    1 %  |        494977  |         20 MiB  |    40100 / 495000  |   0 / 0 |
1112  |    11.0  |    365 GiB  |    2.6 GiB  |    346 GiB  |    1 %  |        389977  |         20 MiB  |    40100 / 390000  |   0 / 0 |
1125  |    11.0  |    931 GiB  |    2.6 GiB  |    894 GiB  |    0 %  |        974979  |         20 MiB  |    40100 / 975000  |   0 / 0 |
1126  |    11.0  |    931 GiB  |    2.6 GiB  |    894 GiB  |    0 %  |        974979  |         20 MiB  |    40100 / 975000  |   0 / 0 |
----------------------------------------------------------------------------------------------------------------------------------------
   6  |     1.0  |    3.5 TiB  |     16 GiB  |    3.4 TiB  |    0 %  |       6674731  |        122 MiB  |   240600 / 3795000 |   0 / 0 |

Note

Without specifying an instance, the first instance is assumed - 11.0 as in the above example. The second, third, and fourth storpool_server instances would be 11.1, 11.2, and 11.3 accordingly.

To list the servers that are blocked and could not join the cluster for some reason:

# storpool server blocked
cluster waiting, mgmt on node 12
  server  11.0 waiting on node 11 missing:1201,1202,1203,1204,1212,1221,1222,1223,1224,1225,1226,1301,1302,1303,1304,1311,1312,1321,1322,1323,1324,1325,1326,4043 pending:1103,1104,1111,1112,1125,1126
  server  12.0    down on node 12
  server  13.0    down on node 13
  server  14.0 waiting on node 14 missing:1201,1202,1203,1204,1212,1221,1222,1223,1224,1225,1226,1301,1302,1303,1304,1311,1312,1321,1322,1323,1324,1325,1326,4043 pending:1403,1404,1411,1412,1421,1423
  server  11.1 waiting on node 11 missing:1201,1202,1203,1204,1212,1221,1222,1223,1224,1225,1226,1301,1302,1303,1304,1311,1312,1321,1322,1323,1324,1325,1326,4043 pending:1101,1102,1121,1122,1123,1124
  server  12.1    down on node 12
  server  13.1    down on node 13
  server  14.1 waiting on node 14 missing:1201,1202,1203,1204,1212,1221,1222,1223,1224,1225,1226,1301,1302,1303,1304,1311,1312,1321,1322,1323,1324,1325,1326,4043 pending:1401,1402,1424,1425,1426
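
The missing: lists above can get long; the sketch below condenses them into per-server counts, using a shortened sample of the output (the disk IDs in the heredoc are illustrative):

```shell
#!/bin/sh
# Summarize "storpool server blocked" output: for every waiting server
# instance, count how many disks it reports as missing.
blocked_summary() {
    awk 'match($0, /missing:[0-9,]+/) {
        # strip the "missing:" prefix, then count the comma-separated IDs
        miss = substr($0, RSTART + 8, RLENGTH - 8)
        printf "server %s missing %d disks\n", $2, split(miss, a, ",")
    }'
}

blocked_summary <<'EOF'
  server  11.0 waiting on node 11 missing:1201,1202,1203 pending:1103,1104
  server  12.0    down on node 12
EOF
```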

9.6. Fault sets

Fault sets are a way to instruct StorPool to use the drives in a group of nodes for only one replica of the data, when those nodes are expected to fail simultaneously. Some examples would be:

  • multinode chassis;

  • multiple nodes in the same rack backed by the same power supply;

  • nodes connected to the same set of switches, and so on.

To define a fault set, only a name and a set of servers are needed:

# storpool faultSet chassis_1 addServer 11 addServer 12
OK

To list defined fault sets use:

# storpool faultSet list
-------------------------------------------------------------------
| name                 |                                  servers |
-------------------------------------------------------------------
| chassis_1            |                                    11 12 |
-------------------------------------------------------------------

To remove a fault set use:

# storpool faultSet chassis_1 delete chassis_1

Attention

A new fault set definition has effect only on newly created volumes. To change the configuration on already created volumes a re-balance operation would be required, see Balancer for more details on re-balancing a cluster after defining fault sets.

9.7. Services

Check the state of all services presently running in the cluster and their uptime:

# storpool service list
cluster running, mgmt on node 12
  mgmt    11 running on node 11 ver 19.01.538, started 2020-01-08 18:23:36, uptime 1 day 00:53:43
  mgmt    12 running on node 12 ver 19.01.538, started 2020-01-08 18:23:35, uptime 1 day 00:53:44 active
server  11.0 running on node 11 ver 19.01.538, started 2020-01-08 18:23:45, uptime 1 day 00:53:34
server  12.0 running on node 12 ver 19.01.538, started 2020-01-08 18:23:41, uptime 1 day 00:53:38
server  13.0 running on node 13 ver 19.01.538, started 2020-01-08 18:23:35, uptime 1 day 00:53:44
server  14.0 running on node 14 ver 19.01.538, started 2020-01-08 18:23:39, uptime 1 day 00:53:40
server  11.1 running on node 11 ver 19.01.538, started 2020-01-08 18:23:45, uptime 1 day 00:53:34
server  12.1 running on node 12 ver 19.01.538, started 2020-01-08 18:23:44, uptime 1 day 00:53:35
server  13.1 running on node 13 ver 19.01.538, started 2020-01-08 18:23:37, uptime 1 day 00:53:42
server  14.1 running on node 14 ver 19.01.538, started 2020-01-08 18:23:39, uptime 1 day 00:53:40
client    11 running on node 11 ver 19.01.538, started 2020-01-08 18:23:33, uptime 1 day 00:53:46
client    12 running on node 12 ver 19.01.538, started 2020-01-08 18:23:34, uptime 1 day 00:53:45
client    13 running on node 13 ver 19.01.538, started 2020-01-08 18:23:32, uptime 1 day 00:53:47
client    14 running on node 14 ver 19.01.538, started 2020-01-08 18:23:32, uptime 1 day 00:53:47
client    15 running on node 15 ver 19.01.538, started 2020-01-09 10:46:17, uptime 08:31:02
bridge    11 running on node 11 ver 19.01.538, started 2020-01-08 18:23:34, uptime 1 day 00:53:45 active
bridge    12 running on node 12 ver 19.01.538, started 2020-01-08 18:23:34, uptime 1 day 00:53:45
 cntrl    11 running on node 11 ver 19.01.538, started 2020-01-08 18:23:35, uptime 1 day 00:53:44
 cntrl    12 running on node 12 ver 19.01.538, started 2020-01-08 18:23:34, uptime 1 day 00:53:45
 cntrl    13 running on node 13 ver 19.01.538, started 2020-01-08 18:23:31, uptime 1 day 00:53:48
 cntrl    14 running on node 14 ver 19.01.538, started 2020-01-08 18:23:31, uptime 1 day 00:53:48
 iSCSI    12 running on node 13 ver 19.01.538, started 2020-01-08 18:23:32, uptime 1 day 00:53:47
 iSCSI    13 running on node 13 ver 19.01.538, started 2020-01-08 18:23:32, uptime 1 day 00:53:47

9.8. Disk

The disk submenu is for querying or managing the available disks in the cluster.

To display all available disks in all server instances in the cluster:

# storpool disk list
disk  |  server  |    size     |     used    |   est.free  |      %  |  free entries  |   on-disk size  |   allocated objects |  errors |  flags
1101  |    11.1  |    893 GiB  |    2.6 GiB  |    857 GiB  |    0 %  |       3719946  |        664 KiB  |     41000 / 930000  |   0 / 0 |
1102  |    11.1  |    446 GiB  |    2.6 GiB  |    424 GiB  |    1 %  |       1919946  |        664 KiB  |     41000 / 480000  |   0 / 0 |
1103  |    11.0  |    893 GiB  |    2.6 GiB  |    857 GiB  |    0 %  |       3719948  |        660 KiB  |     41000 / 930000  |   0 / 0 |
1104  |    11.0  |    446 GiB  |    2.6 GiB  |    424 GiB  |    1 %  |       1919946  |        664 KiB  |     41000 / 480000  |   0 / 0 |
1105  |    11.0  |    446 GiB  |    2.6 GiB  |    424 GiB  |    1 %  |       1919947  |        664 KiB  |     41000 / 480000  |   0 / 0 |
1111  |    11.1  |    930 GiB  |    2.6 GiB  |    893 GiB  |    0 %  |        974950  |        716 KiB  |     41000 / 975000  |   0 / 0 |
1112  |    11.1  |    930 GiB  |    2.6 GiB  |    893 GiB  |    0 %  |        974949  |        736 KiB  |     41000 / 975000  |   0 / 0 |
1113  |    11.1  |    930 GiB  |    2.6 GiB  |    893 GiB  |    0 %  |        974943  |        760 KiB  |     41000 / 975000  |   0 / 0 |
1114  |    11.1  |    930 GiB  |    2.6 GiB  |    893 GiB  |    0 %  |        974937  |        844 KiB  |     41000 / 975000  |   0 / 0 |
[snip]
1425  |    14.1  |    931 GiB  |    2.6 GiB  |    894 GiB  |    0 %  |        974980  |         20 MiB  |     40100 / 975000  |   0 / 0 |
1426  |    14.1  |    931 GiB  |    2.6 GiB  |    894 GiB  |    0 %  |        974979  |         20 MiB  |     40100 / 975000  |   0 / 0 |
----------------------------------------------------------------------------------------------------------------------------------------
  47  |     8.0  |     30 TiB  |    149 GiB  |     29 TiB  |    0 %  |      53308967  |        932 MiB  |  1844600 / 32430000 |   0 / 0 |

To display additional info regarding disks:

# storpool disk list info
disk   |  server  |    device    |       model        |           serial            |             description          |         flags          |
 1101  |    11.1  |  0000:04:00.0-p1  |  SAMSUNG MZQLB960HAJR-00007  |  S437NF0M500149             |                                  |  S                     |
 1102  |    11.1  |  /dev/sdj1   |  Micron_M500DC_MTFDDAK480MBB  |  14250C6368E5               |                                  |  S                     |
 1103  |    11.0  |  /dev/sdi1   |  SAMSUNG_MZ7LH960HAJR-00005  |  S45NNE0M229767             |                                  |  S                     |
 1104  |    11.0  |  /dev/sdd1   |  Micron_M500DC_MTFDDAK480MBB  |  14250C63689B               |                                  |  S                     |
 1105  |    11.0  |  /dev/sdc1   |  Micron_M500DC_MTFDDAK480MBB  |  14250C6368EC               |                                  |  S                     |
 1111  |    11.1  |  /dev/sdl1   |  Hitachi_HUA722010CLA330  |  JPW9K0N13243ZL             |                                  |  W                     |
 1112  |    11.1  |  /dev/sda1   |  Hitachi_HUA722010CLA330  |  JPW9J0N13LJEEV             |                                  |  W                     |
 1113  |    11.1  |  /dev/sdb1   |  Hitachi_HUA722010CLA330  |  JPW9J0N13N694V             |                                  |  W                     |
 1114  |    11.1  |  /dev/sdm1   |  Hitachi_HUA722010CLA330  |  JPW9K0N132R7HL             |                                  |  W                     |
 [snip]
 1425  |    14.1  |  /dev/sdm1   |  Hitachi_HDS721050CLA360  |  JP1532FR1BY75C             |                                  |  W                     |
 1426  |    14.1  |  /dev/sdh1   |  Hitachi_HUA722010CLA330  |  JPW9K0N13RS95L             |                                  |  W, J                  |

To display internal statistics about each disk:

# storpool disk list internal

--------------------------------------------------------------------------------------------------------------------------------------------------------
| disk | server |        aggregate scores        |         wbc pages        |     scrub bw |                          scrub ETA | last scrub completed |
--------------------------------------------------------------------------------------------------------------------------------------------------------
| 1101 |   11.1 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2020-01-08 18:23:07 |
| 1102 |   11.1 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2020-01-08 18:23:07 |
| 1103 |   11.0 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2020-01-08 18:23:08 |
| 1104 |   11.0 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2020-01-08 18:23:09 |
| 1105 |   11.1 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2020-01-08 18:23:10 |
| 1111 |   11.0 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2020-01-08 18:23:12 |
| 1112 |   11.1 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2020-01-08 18:23:15 |
| 1113 |   11.1 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2020-01-08 18:23:17 |
| 1114 |   11.1 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2020-01-08 18:23:13 |
[snip]
| 1425 |   14.1 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2020-01-08 18:23:15 |
| 1426 |   14.1 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2020-01-08 18:23:19 |
--------------------------------------------------------------------------------------------------------------------------------------------------------

The sections in this output explained:

  • aggregate scores - Internal values representing how much data is about to be defragmented on the particular drive. Usually between 0 and 1; on heavily loaded clusters the rightmost column might get into the hundreds or even thousands if some drives are severely loaded.

  • wbc pages - Internal statistics for each drive that has write-back caching or journaling enabled in StorPool.

  • scrub bw - The scrubbing speed in MB/s.

  • scrub ETA - The approximate time/date when the scrubbing operation will complete for this drive.

  • last scrub completed - The last time/date when the drive was scrubbed.

Note

The default installation includes a cron job on the management nodes that starts a scrubbing job for all drives in the cluster once per week.
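
The effect of that cron job can be approximated with a short loop over the first column of storpool disk list. In the sketch below, the storpool shell function is a mock that replays sample output so the loop runs anywhere; on a real management node the mock would be removed and the actual CLI used:

```shell
#!/bin/sh
# Sketch of starting a scrub on every disk in the cluster. The storpool
# function is a stand-in replaying sample "disk list" output; drop it when
# running against a real cluster.
storpool() {
    if [ "$1 $2" = "disk list" ]; then
        cat <<'EOF'
disk  |  server  |    size     |
1101  |    11.1  |    893 GiB  |
1102  |    11.1  |    446 GiB  |
EOF
    else
        echo "OK ($*)"
    fi
}

# take the disk ID column, skipping the header row
disks() {
    storpool disk list |
        awk -F'|' '$1 ~ /[0-9]/ { gsub(/ /, "", $1); print $1 }'
}

for d in $(disks); do
    storpool disk "$d" scrubbing start
done
```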

To set additional information for a disk, shown in the description field of storpool disk list info:

# storpool disk 1111 description HBA2_port7
OK
# storpool disk 1104 description FAILING_SMART
OK

To mark a device as temporarily unavailable:

# storpool disk 1111 eject
OK

This will stop data replication for this disk, but will keep the information about the placement groups in which it participated and the volume objects it contained.

Note

The command above will refuse to eject the disk if the operation would lead to volumes or snapshots in the down state, usually when the last up-to-date copy of some parts of a volume or snapshot is on this disk.

This drive will be shown as missing in storpool disk list, e.g.:

# storpool disk list
    disk  |  server  |    size    |    used    |  est.free  |      %  |  free entries  |  on-disk size  |  allocated objects |  errors  |  flags
    [snip]
    1422  |    14.1  |         -  |         -  |         -  |    - %  |             -  |             -  |        - / -       |    - / - |
    [snip]

Attention

This operation leads to degraded redundancy for all volumes and snapshots that have data on the ejected disk.

Such a disk will not return to the cluster by itself; it has to be manually re-inserted by removing its EJECTED flag with storpool_initdisk -r /dev/$path.

When the server controlling a disk notices issues with it (a write error, or a request stalled for longer than a predefined threshold), the disk is also marked as "test pending". This guards against the many transient errors that occur when a disk drive or its controller stalls a request.

A test option to eject is available for manually initiating such a test: it will flag the disk as requiring a test and will eject it. The server instance will then perform a quick set of non-intrusive read-write tests on the disk and will return it to the cluster if all tests pass. Example:

# storpool disk 2331 eject test
OK

The tests usually take from a couple of seconds up to a minute. To check the results of the last test use:

# storpool disk 2331 testInfo
 times tested  |   test pending  |  read speed   |  write speed  |  read max latency   |  write max latency  | failed
            1  |             no  |  1.0 GiB/sec  |  971 MiB/sec  |  8 msec             |  4 msec             |     no

If the disk was already marked for testing, the now option will skip the test on the next attempt to re-open the disk:

# storpool disk 2301 eject now
OK

Attention

Note that this is exactly the same as eject: the disk has to be manually returned to the cluster.

To mark a disk as unavailable by first re-balancing all its data to the other disks in the cluster and only then ejecting it:

# storpool disk 1422 softEject
OK
Balancer auto mode currently OFF. Must be ON for soft-eject to complete.

Note

This option requires StorPool balancer to be started after the above was issued, see more in the Balancer section below.

To remove a disk from the list of reported disks and all placement groups it participates in:

# storpool disk 1422 forget
OK

To get detailed information about given disk:

# storpool disk 1101 info
agAllocated | agCount | agFree | agFreeing | agFull | agMaxSizeFull | agMaxSizePartial | agPartial
      7 |     462 |    455 |         1 |      0 |             0 |                1 |         1

entriesAllocated | entriesCount | entriesFree | sectorsCount
              50 |      1080000 |     1079950 |    501215232

objectsAllocated | objectsCount | objectsFree | objectStates
              18 |       270000 |      269982 | ok:18

serverId | 1

id                   | objectsCount | onDiskSize | storedSize | objectStates
#bad_id              |            1 |       0  B |       0  B | ok:1
#clusters            |            1 |    8.0 KiB |     768  B | ok:1
#drive_state         |            1 |    8.0 KiB |     4.0  B | ok:1
#drives              |            1 |    100 KiB |     96 KiB | ok:1
#iscsi_config        |            1 |     12 KiB |    8.0 KiB | ok:1
[snip]

To get detailed information about the objects on a particular disk:

# storpool disk 1101 list
object name         |  stored size | on-disk size | data version | object state | parent volume
#bad_id:0            |         0  B |         0  B |    1480:2485 |       ok (1) |
#clusters:0          |       768  B |      8.0 KiB |      711:992 |       ok (1) |
#drive_state:0       |       4.0  B |      8.0 KiB |    1475:2478 |       ok (1) |
#drives:0            |       96 KiB |      100 KiB |    1480:2484 |       ok (1) |
[snip]
test:4094            |         0  B |         0  B |          0:0 |       ok (1) |
test:4095            |         0  B |         0  B |          0:0 |       ok (1) |
----------------------------------------------------------------------------------------------------
4115 objects         |      394 KiB |      636 KiB |              |              |

To get detailed information about the active requests that the disk is performing at the moment:

# storpool disk 1101 activeRequests
-----------------------------------------------------------------------------------------------------------------------------------
| request ID                     |  request IDX |               volume |         address |       size |       op |    time active |
-----------------------------------------------------------------------------------------------------------------------------------
| 9226469746279625682:285697101441249070 |            9 |           testvolume |     85276782592 |     4.0 KiB |     read |         0 msec |
| 9226469746279625682:282600876697431861 |           13 |           testvolume |     96372936704 |     4.0 KiB |     read |         0 msec |
| 9226469746279625682:278097277070061367 |           19 |           testvolume |     46629707776 |     4.0 KiB |     read |         0 msec |
| 9226469746279625682:278660227023482671 |          265 |           testvolume |     56680042496 |     4.0 KiB |    write |         0 msec |
-----------------------------------------------------------------------------------------------------------------------------------
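
Long-running requests usually deserve attention. The sketch below filters the activeRequests table for entries above a latency threshold; the request IDs in the replayed sample are hypothetical:

```shell
#!/bin/sh
# Filter "storpool disk <ID> activeRequests" output for requests active
# longer than a threshold in msec. Fields are split on the | borders;
# $2 is the request ID column and $8 the "time active" column.
slow_requests() {
    awk -F'|' -v limit="$1" \
        'NF > 7 && $8 ~ /msec/ && $8 + 0 >= limit { gsub(/ /, "", $2); print $2 }'
}

slow_requests 1000 <<'EOF'
| request ID | request IDX | volume     | address     | size    | op    | time active |
| 42:1001    |           9 | testvolume | 85276782592 | 4.0 KiB | read  | 0 msec      |
| 42:1002    |          13 | testvolume | 96372936704 | 4.0 KiB | write | 2500 msec   |
EOF
# prints 42:1002
```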

To issue retrim operation on a disk (available for SSD disks only):

# storpool disk 1101 retrim
OK

To start, pause or continue a scrubbing operation for a disk:

# storpool disk 1101 scrubbing start
OK
# storpool disk 1101 scrubbing pause
OK
# storpool disk 1101 scrubbing continue
OK

Note

Use storpool disk list internal to check the status of a running scrub operation or when was the last completed scrubbing operation for this disk.

9.9. Placement Groups

Placement groups are predefined sets of disks over which volume objects will be replicated. It is possible to specify which individual disks to add to a group.

To display the defined placement groups in the cluster:

# storpool placementGroup list
name
default
hdd
ssd

To display details about a placement group:

# storpool placementGroup ssd list
type   | id
disk   | 1101 1201 1301 1401

Creating a new placement group or extending an existing one requires specifying its name and providing one or more disks to be added:

# storpool placementGroup ssd addDisk 1102
OK
# storpool placementGroup ssd addDisk 1202
OK
# storpool placementGroup ssd addDisk 1302 addDisk 1402
OK
# storpool placementGroup ssd list
type | id
disk   | 1101 1102 1201 1202 1301 1302 1401 1402

To remove one or more disks from a placement group use:

# storpool placementGroup ssd rmDisk 1402
OK
# storpool placementGroup ssd list
type | id
disk   | 1101 1102 1201 1202 1301 1302 1401

To rename a placement group:

# storpool placementGroup ssd rename M500DC
OK

Unused placement groups can be removed. To avoid accidents, the name of the group must be entered twice:

# storpool placementGroup ssd delete ssd
OK

9.10. Volumes

Volumes are the basic service of the StorPool storage system. A volume always has a name and a certain size. It can be read from and written to, and can be attached to hosts as a read-only or read-write block device under the /dev/storpool directory. A volume may have one or more tags, created or changed in the form name=value. The volume name is a string consisting of one or more of the allowed characters: upper- and lowercase Latin letters (a-z, A-Z), digits (0-9), and the delimiters dot (.), colon (:), dash (-) and underscore (_). The same rules apply to the keys and values used for volume tags. The volume name including tags cannot exceed 200 bytes.
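The naming rules above can be sketched as a small validation helper. This is purely an illustration (the function and the exact way tags count toward the byte limit are assumptions, not part of the StorPool CLI):

```python
import re

# Allowed characters per the rules above: Latin letters, digits,
# dot, colon, dash and underscore.
NAME_RE = re.compile(r'^[A-Za-z0-9.:_-]+$')

def valid_volume_name(name, tags=None):
    """Check a volume name and its tags against the documented rules."""
    tags = tags or {}
    parts = [name, *tags.keys(), *tags.values()]
    if not all(NAME_RE.match(p) for p in parts):
        return False
    # The 200-byte limit covers the name including tags; how exactly the
    # tags count toward it is an assumption made here for illustration.
    total = len(name.encode()) + sum(
        len(k.encode()) + len(v.encode()) + 1 for k, v in tags.items())
    return total <= 200

print(valid_volume_name("testvolume", {"name": "value"}))  # True
print(valid_volume_name("bad volume"))                     # False (space)
```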

When a volume is created, at minimum its name (<volumeName>), its size, and either a <template> or placement/replication details must be specified:

# storpool volume testvolume size 100G template hybrid

Additional parameters that can be used or overridden:

  • placeAll - place all objects in placementGroup (Default value: default)

  • placeTail - name of placementGroup for reader (Default value: same as placeAll value)

  • placeHead - place the third replica in a different placementGroup (Default value: same as placeAll value)

  • template - use a template with preconfigured placement, replication and/or limits (see the Templates section) - the use of templates is strongly encouraged, as it makes tracking and capacity management easier

  • parent - use a snapshot as a parent for this volume

  • reuseServer - place multiple copies on the same server

  • baseOn - use a parent volume; this will create a transient snapshot used as a parent (see the Snapshots section)

  • iops - set the maximum IOPS limit for this volume (in IOPS)

  • bw - set maximum bandwidth limit (in MB/s)

  • tag - set a tag for this volume in the form name=value

  • create - create the volume, fail if it exists (optional for now)

  • update - update the volume, fail if it does not exist (optional for now)

The create parameter is useful in scripts when you have to prevent an unintended update of an existing volume:

# storpool volume test create template hybrid
OK
# storpool volume test create size 200G template hybrid
Error: Volume 'test' already exists

A statement with the update parameter will fail with an error if the volume does not exist:

# storpool volume test update template hybrid size +100G
OK
# storpool volume test1 update template hybrid
Error: volume 'test1' does not exist
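The create/update semantics shown above can be modeled as follows. This is a sketch of the observable behaviour, not the actual implementation; the exception names and in-memory dictionary are illustrative stand-ins:

```python
class VolumeExists(Exception): pass
class VolumeMissing(Exception): pass

volumes = {}  # name -> dict of volume parameters (illustrative stand-in)

def volume_cmd(name, mode=None, **params):
    """Mimic the create/update semantics of 'storpool volume <name> ...':
    create fails if the volume exists, update fails if it does not."""
    if mode == "create" and name in volumes:
        raise VolumeExists("Volume '%s' already exists" % name)
    if mode == "update" and name not in volumes:
        raise VolumeMissing("volume '%s' does not exist" % name)
    volumes.setdefault(name, {}).update(params)

volume_cmd("test", mode="create", template="hybrid")   # first create succeeds
try:
    volume_cmd("test", mode="create", size="200G")     # second create fails
except VolumeExists as exc:
    print(exc)  # Volume 'test' already exists
```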

To list all available volumes:

# storpool volume list
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
| volume               |    size  | repl. | placeHead  | placeAll   | placeTail  |   iops  |    bw   | parent               | template             | tags       |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
| testvolume           |  100 GiB |     3 | ultrastar  | ultrastar  | ssd        |       - |       - | testvolume@35691     | hybrid               | name=value |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------

To list volumes exported to other sub-clusters in the multi-cluster:

# storpool volume list exports
---------------------------------
| remote    | volume | globalId |
---------------------------------
| Lab-D-cl2 | test   | d.n.buy  |
---------------------------------

To list volumes exported in other sub-clusters to this one in a multi-cluster setup:

# storpool volume list remote
--------------------------------------------------------------------------
| location | remoteId | name | size         | creationTimestamp   | tags |
--------------------------------------------------------------------------
| Lab-D    | d.n.buy  | test | 137438953472 | 2020-05-27 11:57:38 |      |
--------------------------------------------------------------------------

Note

Once attached, a remotely exported volume will no longer be visible with volume list remote, even if the export is still visible in the remote cluster with volume list exports. Each export invocation in the local cluster is used up by a single attach in the remote cluster.

To get an overview of all volumes and snapshots and their state in the system use:

# storpool volume status
----------------------------------------------------------------------------------------------------------------------------------------------------
| volume               |     size | repl. | tags       |  alloc % |   stored |  on disk | syncing | missing | status    | flags | drives down      |
----------------------------------------------------------------------------------------------------------------------------------------------------
| testvolume           |  100 GiB |     3 | name=value |    0.0 % |     0  B |     0  B |    0  B |    0  B | up        |       |                  |
| testvolume@35691     |  100 GiB |     3 |            |  100.0 % |  100 GiB |  317 GiB |    0  B |    0  B | up        | S     |                  |
----------------------------------------------------------------------------------------------------------------------------------------------------
| 2 volumes            |  200 GiB |       |            |   50.0 % |  100 GiB |  317 GiB |    0  B |    0  B |           |       |                  |
----------------------------------------------------------------------------------------------------------------------------------------------------

Flags:
  S - snapshot
  B - balancer blocked on this volume
  D - decreased redundancy (degraded)
  M - migrating data to a new disk
  R - allow placing two disks within a replication chain onto the same server

The columns in this output are:

  • volume - name of the volume or snapshot (see flags below)

  • size - provisioned volume size, the visible size inside a VM for example

  • repl. - number of copies for this volume

  • tags - all custom key=value tags configured for this volume or snapshot

  • alloc % - how much space was used on this volume in percent

  • stored - space allocated on this volume

  • on disk - the size allocated on all drives in the cluster after replication and the overhead from data protection

  • syncing - how much data is not in sync after a drive or server was missing, the data is recovered automatically once the missing drive or server is back in the cluster

  • missing - shows how much data is not available for this volume when the volume is in status down, see status below

  • status - shows the status of the volume, which could be one of:

      • up - all copies are available

      • down - none of the copies are available for some parts of the volume

      • up soon - all copies are available and the volume will soon come up

  • flags - flags denoting features of this volume:

      • S - stands for snapshot, which is essentially a read-only (frozen) volume

      • B - used to denote that the balancer is blocked for this volume (usually when some of the drives are missing)

      • D - displayed when some of the copies are either not available or outdated and the volume is running with decreased redundancy

      • M - displayed when changing the replication or a cluster re-balance is in progress

      • R - displayed when the policy for keeping copies on different servers is overridden

  • drives down - displayed when the volume is in down state, listing the drives required to get the volume back up.

Sizes are shown in B, KiB, MiB, GiB, TiB or PiB.
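As a rough illustration of how the stored and on disk columns relate: on disk is roughly stored times the replication level, plus data-protection overhead (in the example above, 100 GiB stored at replication 3 yields 317 GiB on disk, i.e. about 17 GiB of overhead). A sketch, with the helper names and the zero default overhead being assumptions for illustration:

```python
def on_disk_estimate(stored_bytes, replication, overhead=0.0):
    """Rough estimate: stored data times replication, plus a
    data-protection overhead factor (assumed value for illustration)."""
    return stored_bytes * replication * (1 + overhead)

def human(size):
    """Format a byte count in the units the CLI uses (B/KiB/MiB/GiB/TiB/PiB)."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024:
            return f"{size:.0f} {unit}"
        size /= 1024
    return f"{size:.0f} PiB"

stored = 100 * 1024**3  # 100 GiB stored once
# 300 GiB before overhead; the real figure above (317 GiB) includes it.
print(human(on_disk_estimate(stored, replication=3)))  # 300 GiB
```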

To get just the status data from the storpool_controller services in the cluster, without any info about stored or on-disk size, etc., use:

# storpool volume quickStatus
----------------------------------------------------------------------------------------------------------------------------------------------------
| volume               |     size | repl. | tags       |  alloc % |   stored |  on disk | syncing | missing | status    | flags | drives down      |
----------------------------------------------------------------------------------------------------------------------------------------------------
| testvolume           |  100 GiB |     3 | name=value |    0.0 % |     0  B |     0  B |    0  B |    0  B | up        |       |                  |
| testvolume@35691     |  100 GiB |     3 |            |    0.0 % |     0  B |     0  B |    0  B |    0  B | up        | S     |                  |
----------------------------------------------------------------------------------------------------------------------------------------------------
| 2 volumes            |  200 GiB |       |            |    0.0 % |     0  B |     0  B |    0  B |    0  B |           |       |                  |
----------------------------------------------------------------------------------------------------------------------------------------------------

Note

The quickStatus command might actually take longer than the normal status, but has less of an impact on the storpool_server services, because the status data is collected from the storpool_controller services.

To check the estimated space used by the volumes in the system use:

# storpool volume usedSpace
-----------------------------------------------------------------------------------------
| volume               |        size | repl. |      stored |        used | missing info |
-----------------------------------------------------------------------------------------
| testvolume           |     100 GiB |     3 |     1.9 GiB |     100 GiB |         0  B |
-----------------------------------------------------------------------------------------

The columns explained:

  • volume - name of the volume

  • size - the provisioned size of this volume

  • repl. - the replication of the volume

  • stored - how much data is stored for this volume (not counting data in its parent snapshots)

  • used - how much data has been written (including the data written in parent snapshots)

  • missing info - if this value is anything other than 0 B, some of the storpool_controller services in the cluster are probably not running correctly.

Note

The used column shows how much data is accessible and reserved for this volume.

To list the target disk sets and objects of a volume:

# storpool volume testvolume list
volume testvolume
size 100 GiB
replication 3
placeHead hdd
placeAll hdd
placeTail ssd
target disk sets:
       0: 1122 1323 1203
       1: 1424 1222 1301
       2: 1121 1324 1201
[snip]
  object: disks
       0: 1122 1323 1203
       1: 1424 1222 1301
       2: 1121 1324 1201
[snip]

Hint

In this example the volume uses hybrid placement, with two copies on HDDs and one copy on SSDs (the rightmost column of the disk sets). The target disk sets are lists of triplets of drives in the cluster, used as a template for the actual objects of the volume.

To get detailed info about the disks used for this volume and the number of objects on each of them use:

# storpool volume testvolume info
diskId | count
  1101 |   200
  1102 |   200
  1103 |   200
  [snip]
chain                | count
1121-1222-1404       |  25
1121-1226-1303       |  25
1121-1226-1403       |  25

To rename a volume use:

# storpool volume testvolume rename newvolume
OK

To add a tag for a volume:

# storpool volume testvolume tag name=value

To change a tag for a volume use:

# storpool volume testvolume tag name=newvalue

To remove a tag just set it to an empty value:

# storpool volume testvolume tag name=

To resize a volume up:

# storpool volume testvolume size +1G
OK

To shrink a volume (resize down):

# storpool volume testvolume size 50G shrinkOk

Attention

Shrinking a StorPool volume changes the size of the block device, but does not adjust the size of any LVM volume or filesystem contained in it. Failing to shrink the filesystem or LVM volume before shrinking the StorPool volume will result in data loss.

To delete a volume use:

# storpool volume vol1 delete vol1

Note

To avoid accidents, the volume name must be entered twice. As a safety precaution, attached volumes cannot be deleted even when not in use; see Attachments for details.

A volume based on a snapshot can be converted to a stand-alone volume. For example, the testvolume below is based on an anonymous snapshot:

# storpool_tree
StorPool
  `-testvolume@37126
     `-testvolume

To rebase it against root (an operation also known as “promote”) use:

# storpool volume testvolume rebase
OK
# storpool_tree
StorPool
  `-testvolume@37126
     `-testvolume

A volume can also be rebased onto a particular snapshot from its chain of parent snapshots:

# storpool_tree
StorPool
  `- testvolume-snap1 [snapshot]
     `- testvolume-snap2 [snapshot]
        `- testvolume-snap3 [snapshot]
           `- testvolume [volume]
# storpool volume testvolume rebase testvolume-snap2
OK

After the operation the volume is directly based on testvolume-snap2 and includes all changes from testvolume-snap3:

# storpool_tree
StorPool
  `- testvolume-snap1 [snapshot]
     `- testvolume-snap2 [snapshot]
        |- testvolume [volume]
        `- testvolume-snap3 [snapshot]
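The parent-chain manipulation above can be modeled with a simple parent map. This is an illustrative sketch, not StorPool's internal representation:

```python
# parent[x] is the parent snapshot of x; None means based on root.
parent = {
    "testvolume-snap1": None,
    "testvolume-snap2": "testvolume-snap1",
    "testvolume-snap3": "testvolume-snap2",
    "testvolume": "testvolume-snap3",
}

def rebase(name, new_parent=None):
    """Re-point a volume (or snapshot) to new_parent, or to root if None.
    StorPool folds the data of the skipped snapshots into the rebased
    volume so its contents stay the same; here we only track the chain."""
    parent[name] = new_parent

rebase("testvolume", "testvolume-snap2")
print(parent["testvolume"])        # testvolume-snap2
print(parent["testvolume-snap3"])  # testvolume-snap2 (sibling, unchanged)
```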

To back up a volume named testvolume to a configured remote location StorPoolLab-Sofia use:

# storpool volume testvolume backup StorPoolLab-Sofia
OK

After this operation a temporary snapshot will be created and transferred to the StorPoolLab-Sofia location. After the transfer completes, the local temporary snapshot will be deleted and the remote snapshot will be visible as exported from StorPoolLab-Sofia; see Remote Snapshots for more on working with snapshot exports.

When backing up a volume, the remote snapshot may have one or more tags applied, example below:

# storpool volume testvolume backup StorPoolLab-Sofia tag key=value # [tag key2=value2]
OK

To move a volume to a different cluster in a multicluster environment (more on clusters here) use:

# storpool volume testvolume moveToRemote Lab-D-cl2 # onAttached export

Note

Moving a volume to a remote cluster will fail if the volume is attached on a local host. What to do in such a case can be specified with the onAttached parameter, as in the comment in the example above. More info on volume move is available in 13.13. Volume and snapshot move.

9.11. Snapshots

Snapshots are read-only point-in-time images of volumes. They are created once and cannot be changed. They can be attached to hosts as read-only block devices under /dev/storpool. Volumes and snapshots share the same namespace, thus their names are unique within a StorPool cluster. Volumes can be based on snapshots; such volumes contain only the changes since the snapshot was taken. After a volume is created from a snapshot, writes are recorded within the volume. Reads from the volume may be served by the volume itself or by its parent snapshot, depending on whether the volume contains changed data for the read request.
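The read path described above can be sketched as a lookup through the parent chain. The dictionaries here are hypothetical in-memory stand-ins used only to illustrate the copy-on-write behaviour:

```python
# Each layer maps a block offset to its data: a parent snapshot, and the
# writes recorded in its child volume since the snapshot was taken.
snapshot_blocks = {0: b"base0", 1: b"base1"}
volume_writes = {1: b"new1"}

def read_block(offset):
    """Serve the read from the volume if it holds changed data for this
    block, otherwise fall through to the parent snapshot."""
    if offset in volume_writes:
        return volume_writes[offset]
    return snapshot_blocks.get(offset)

print(read_block(0))  # b'base0' -- served by the parent snapshot
print(read_block(1))  # b'new1'  -- served by the volume itself
```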

To create an unnamed (known also as anonymous) snapshot of a volume use:

# storpool volume testvolume snapshot
OK

This will create a snapshot named testvolume@<ID>, where ID is a unique serial number. Note that any tags on the volume will not be propagated to the snapshot; to set tags on the snapshot at creation time use:

# storpool volume testvolume tag key=value snapshot

To create a named snapshot of a volume use:

# storpool volume testvolume snapshot testsnap
OK

Again, tags can be set directly:

# storpool volume testvolume snapshot testsnapplustags tag key=value

To remove a tag on a snapshot:

# storpool snapshot testsnapplustags tag key=

To list the snapshots use:

# storpool snapshot list
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| snapshot          |    size  | repl. | placeHead | placeAll   | placeTail | created on          | volume      | iops  | bw   | parent          | template  | flags | targetDeleteDate | tags      |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| testsnap          |  100 GiB |     3 | hdd       | hdd        | ssd       | 2019-08-30 04:11:23 | testvolume  |     - |    - | testvolume@1430 | hybrid-r3 |       | -                | key=value |
| testvolume@1430   |  100 GiB |     3 | hdd       | hdd        | ssd       | 2019-08-30 03:56:58 | testvolume  |     - |    - |                 | hybrid-r3 | A     | -                |           |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Flags:
  A - anonymous snapshot with auto-generated name
  B - bound snapshot
  D - snapshot currently in the process of deletion
  T - transient snapshot (created during volume cloning)
  R - allow placing two disks within a replication chain onto the same server
  P - snapshot delete blocked due to multiple children

To list the snapshots only for a particular volume use:

# storpool volume testvolume list snapshots
[snip]

A volume can be converted directly to a snapshot; the operation is also known as freezing:

# storpool volume testvolume freeze
OK

Note that the operation will fail if the volume is attached read-write; see Attachments.

To create a bound snapshot on a volume use:

# storpool volume testvolume bound snapshot
OK

This snapshot will be automatically deleted when the last child volume created from it is deleted. Useful for non-persistent images.

To list the target disk sets and objects of a snapshot:

# storpool snapshot testsnap list
[snip]

The output is similar to that of storpool volume <volumename> list.

To get detailed info about the disks used for this snapshot and the number of objects on each of them use:

# storpool snapshot testsnap info
[snip]

The output is similar to the storpool volume <volumename> info.

To create a volume based on an existing snapshot (cloning) use:

# storpool volume testvolume parent centos73-base-snap
OK

The same is possible through the use of templates with a parent snapshot (see Templates):

# storpool volume spd template centos73-base
OK

Create a volume based on another existing volume (cloning):

# storpool volume testvolume1 baseOn testvolume
OK

Note

This operation will first create an anonymous bound snapshot on testvolume and will then create testvolume1 with the bound snapshot as parent. The snapshot will exist until both volumes are deleted and will be automatically deleted afterwards.

To delete a snapshot use:

# storpool snapshot spdb_snap1 delete spdb_snap1
OK

Note

To avoid accidents, the name of the snapshot must be entered twice.

A snapshot can also be bound to its child volumes; it will then exist until all child volumes are deleted:

# storpool snapshot testsnap bind
OK

The opposite operation is also possible; to unbind such a snapshot use:

# storpool snapshot testsnap unbind
OK

To get the space that will be freed if a snapshot is deleted use:

# storpool snapshot space
----------------------------------------------------------------------------------------------------------------
| snapshot             | on volume            |        size | repl. |      stored |        used | missing info |
----------------------------------------------------------------------------------------------------------------
| testsnap             | testvolume           |     100 GiB |     3 |      27 GiB |    -135 GiB |         0  B |
| testvolume@3794      | testvolume           |     100 GiB |     3 |      27 GiB |     1.9 GiB |         0  B |
| testvolume@3897      | testvolume           |     100 GiB |     3 |     507 MiB |     432 KiB |         0  B |
| testvolume@3899      | testvolume           |     100 GiB |     3 |     334 MiB |     224 KiB |         0  B |
| testvolume@4332      | testvolume           |     100 GiB |     3 |      73 MiB |      36 KiB |         0  B |
| testvolume@4333      | testvolume           |     100 GiB |     3 |      45 MiB |      40 KiB |         0  B |
| testvolume@4334      | testvolume           |     100 GiB |     3 |      59 MiB |      16 KiB |         0  B |
| frozenvolume         | -                    |       8 GiB |     2 |      80 MiB |      80 MiB |         0  B |
----------------------------------------------------------------------------------------------------------------

Used mainly for accounting purposes. The columns explained:

  • snapshot - name of the snapshot

  • on volume - the name of this snapshot's child volume, if any. For example, a frozen volume would have this field empty.

  • size - the size of the snapshot as provisioned

  • repl. - replication

  • stored - how much data is actually written

  • used - the amount of data that would be freed from the underlying drives (before replication) if the snapshot is removed.

  • missing info - if this value is anything other than 0 B, some of the storpool_controller services in the cluster are probably not running correctly.

The used column can be negative in some cases when the snapshot has more than one child volume. In these cases deleting the snapshot would “free” negative space, i.e. it would end up taking more space on the underlying disks.

Similar to volumes, a snapshot can have different placement groups, templates, or other attributes:

# storpool snapshot testsnap template all-ssd
OK

Additional parameters that may be used:

  • placeAll - place all objects in placementGroup (Default value: default)

  • placeTail - name of placementGroup for reader (Default value: same as placeAll value)

  • placeHead - place the third replica in a different placementGroup (Default value: same as placeAll value)

  • reuseServer - place multiple copies on the same server

  • tag - set a tag in the form key=value

  • template - use a template with preconfigured placement, replication and/or limits (see the Templates section)

  • iops - set the maximum IOPS limit for this snapshot (in IOPS)

  • bw - set maximum bandwidth limit (in MB/s)

Note

The bandwidth and IOPS limits concern only the particular snapshot if it is attached; they do not limit any child volumes using this snapshot as a parent.

Similar to the same operation with volumes, a snapshot can be renamed with:

# storpool snapshot testsnap rename ubuntu1604-base
OK

A snapshot can also be rebased to root (promoted) or rebased onto another parent snapshot in the chain:

# storpool snapshot testsnap rebase # [parent-snapshot-name]
OK

To delete a snapshot use:

# storpool snapshot testsnap delete testsnap
OK

Note

A snapshot sometimes will not be deleted immediately; during this period it will be visible with * in the output of storpool volume status or storpool snapshot list.

To set a snapshot for deferred deletion use:

# storpool snapshot testsnap deleteAfter 1d
OK

The above will set a target delete date for this snapshot exactly one day from the present time.

Note

The snapshot will be deleted at the desired point in time only if delayed snapshot delete is enabled in the local cluster; check the Management configuration section of this guide.
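The deleteAfter argument expresses a delay relative to the present time. A sketch of how such a duration maps to an absolute target delete date; the helper and the accepted suffixes are assumptions based only on the 1d/1h examples in this guide:

```python
from datetime import datetime, timedelta

UNITS = {"h": "hours", "d": "days"}  # suffixes seen in this guide's examples

def target_delete_date(spec, now):
    """Turn a duration such as '1d' or '1h' into an absolute delete date."""
    value, unit = int(spec[:-1]), spec[-1]
    return now + timedelta(**{UNITS[unit]: value})

print(target_delete_date("1d", datetime(2019, 8, 30, 12, 0)))
# 2019-08-31 12:00:00
```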

9.11.1. Remote snapshots

If multi-site or multicluster is enabled (the cluster has a bridge service running), a snapshot can be exported and become visible to other configured clusters.

For example to export a snapshot snap1 to a location named StorPoolLab-Sofia use:

# storpool snapshot snap1 export StorPoolLab-Sofia
OK

To list the presently exported snapshots use:

# storpool snapshot list exports
------------------------------------------------------------------
| location               | snapshot    | globalId    | backingUp |
------------------------------------------------------------------
| StorPoolLab-Sofia      | snap1       | nzkr.b.cuj  | false     |
------------------------------------------------------------------

To list the snapshots exported from remote sites use:

# storpool snapshot list remote
-----------------------------------------------------------------------------------
| location | remoteId | name      | onVolume | size         | creationTimestamp   |
-----------------------------------------------------------------------------------
| s02      | a.o.cxz  | snapshot1 |          | 107374182400 | 2019-08-20 03:21:42 |
-----------------------------------------------------------------------------------

A single snapshot can be exported to multiple configured locations.

To create a clone of a remote snapshot locally use:

# storpool snapshot snapshot1-copy template hybrid-r3 remote s02 a.o.cxz # [tag key=value]

In this example the remote location is s02 and the remoteId is a.o.cxz. Any key=value pair tags may be configured at creation time.

To unexport a local snapshot use:

# storpool snapshot snap1 unexport StorPoolLab-Sofia
OK

The remote location can be replaced with the keyword all, which will attempt to unexport the snapshot from all locations it was previously exported to.

Note

If the snapshot is presently being transferred, the unexport operation will fail. It can be forced by adding force to the end of the unexport command; however, this is discouraged in favor of waiting for any active transfer to complete.

To unexport a remote snapshot use:

# storpool snapshot remote s02 a.o.cxz unexport
OK

The snapshot will no longer be visible with storpool snapshot list remote.

To unexport a remote snapshot and also set it for deferred deletion in the remote site:

# storpool snapshot remote s02 a.o.cxz unexport deleteAfter 1h
OK

This will attempt to set a target delete date for a.o.cxz in the remote site exactly one hour from the present time. If the minimumDeleteDelay in the remote site has a higher value, e.g. 1 day, the selected value will be overridden by the minimumDeleteDelay - in this example 1 day. For more info on deferred deletion check the Multi Site section of this guide.

To move a snapshot to a different cluster in a multicluster environment (more on clusters here) use:

# storpool snapshot snap1 moveToRemote Lab-D-cl2

Note

Moving a snapshot to a remote cluster is forbidden for attached snapshots. More info on snapshot move is available in 13.13. Volume and snapshot move.

9.12. Attachments

Attaching a volume or snapshot makes it accessible to a client under the /dev/storpool and /dev/storpool-byid directories. Volumes can be attached as read-only or read-write. Snapshots are always attached read-only.

To attach a volume testvolume to a client with ID 1, creating the block device /dev/storpool/testvolume:

# storpool attach volume testvolume client 1
OK

To attach a volume/snapshot to the node you are currently connected to use:

# storpool attach volume testvolume here
OK
# storpool attach snapshot testsnap here
OK

By default this command will block until the volume is attached to the client and the /dev/storpool/<volumename> symlink is created. For example, if the storpool_block service has not been started, the command will wait indefinitely. To set a timeout for this operation use:

# storpool attach volume testvolume here timeout 10 # seconds
OK

To completely disregard the readiness check use:

# storpool attach volume testvolume here noWait
OK

Note

The use of noWait is discouraged in favor of the default behaviour of the attach command.

Attaching a volume will create a read-write block device attachment by default. To attach it read-only use:

# storpool volume testvolume2 attach client 12 mode ro
OK

To list all attachments use:

# storpool attach list
----------------------------------------
| client | volume               | mode |
----------------------------------------
|     11 | testvolume           | RW   |
|     12 | testvolume1          | RW   |
|     12 | testvolume2          | RO   |
|     14 | testsnap             | RO   |
----------------------------------------

To detach use:

# storpool detach volume testvolume client 1 # or 'here' if the command is being executed on client ID 1

If a volume is actively being written to or read from, a detach operation will fail:

# storpool detach volume testvolume client 11
Error: 'testvolume' is open at client 11

In this case the detach can be forced; beware that forcing a detachment is discouraged:

# storpool detach volume testvolume client 11 force yes
OK

Attention

Any operations on the volume will receive an I/O error when it is forcefully detached. Some mounted filesystems lead to a kernel panic when a block device disappears while there are live operations, so be extra careful if such filesystems are mounted directly on a hypervisor node.

If a volume or snapshot is attached to more than one client, it can be detached from all nodes with a single CLI command:

# storpool detach volume testvolume all
OK
# storpool detach snapshot testsnap all
OK

9.13. Client

To check the status of the active storpool_block services in the cluster use:

# storpool client status
-----------------------------------
|  client  |       status         |
-----------------------------------
|       11 | ok                   |
|       12 | ok                   |
|       13 | ok                   |
|       14 | ok                   |
-----------------------------------

To wait until a client is updated use:

# storpool client 13 sync
OK

This is a way to ensure that a volume whose size was changed is visible with its new size to all clients it is attached to.

To show detailed information about the active requests on a particular client at this moment use:

# storpool client 13 activeRequests
------------------------------------------------------------------------------------------------------------------------------------
| request ID                     |  request IDX |               volume |         address |        size |       op |    time active |
------------------------------------------------------------------------------------------------------------------------------------
| 9224499360847016133:3181950    |  1044        | testvolume           |     10562306048 |     128 KiB |    write |        65 msec |
| 9224499360847016133:3188784    |  1033        | testvolume           |     10562437120 |      32 KiB |     read |        63 msec |
| 9224499360847016133:3188977    |  1029        | testvolume           |     10562568192 |     128 KiB |     read |        21 msec |
| 9224499360847016133:3189104    |  1026        | testvolume           |     10596122624 |     128 KiB |     read |         3 msec |
| 9224499360847016133:3189114    |  1035        | testvolume           |     10563092480 |     128 KiB |     read |         2 msec |
| 9224499360847016133:3189396    |  1048        | testvolume           |     10629808128 |     128 KiB |     read |         1 msec |
------------------------------------------------------------------------------------------------------------------------------------

9.14. Templates

Templates enable easy and consistent setup and usage tracking for large collections of volumes and their snapshots with common attributes, e.g. replication, placement groups and/or a common parent snapshot.

To create a template use:

# storpool template magnetic replication 3 placeAll hdd
OK
# storpool template hybrid replication 3 placeAll hdd placeTail ssd
OK
# storpool template ssd-hybrid replication 3 placeAll ssd placeHead hdd
OK

To list all created templates use:

# storpool template list
-------------------------------------------------------------------------------------------------------------------------------------
| template             |   size  | repl. | placeHead   | placeAll   | placeTail  |   iops  |    bw   | parent               | flags |
-------------------------------------------------------------------------------------------------------------------------------------
| magnetic             |       - |     3 | hdd         | hdd        | hdd        |       - |       - |                      |       |
| hybrid               |       - |     3 | hdd         | hdd        | ssd        |       - |       - |                      |       |
| ssd-hybrid           |       - |     3 | hdd         | ssd        | ssd        |       - |       - |                      |       |
-------------------------------------------------------------------------------------------------------------------------------------

Hint

Understanding placement - Each volume or snapshot can be replicated on a different set of drives. Each set of drives is configured through a placement group. A volume either has all of its copies on a single set of drives on different nodes (only placeAll configured), or has its copies on different sets of drives, selected through the placeTail and placeHead parameters.

Hint

Understanding placement (dual replication) - With dual replication only the placeAll and placeTail parameters have an effect. If not provided, placeTail is the same as placeAll. If configured, placeTail overrides placeAll, and read operations will be served by the drives in this placement group. With this placement, an example volume with dual replication, placeAll in the hdd placement group and placeTail in the ssd group, would have one copy of the data on the drives in each placement group. If some of the drives in the ssd placement group are missing, reads will be served by a drive in the hdd placement group.

Hint

Understanding placement (triple replication) - With triple replication the placeAll/placeTail dependencies and policy are again in effect, but the placeHead parameter also comes into play. The placeHead parameter overrides the placeAll parameter, so a volume with placeAll configured in the ssd placement group and placeHead configured in the hdd placement group would have two copies on the drives in ssd and a third copy of each chunk of data on a drive in the hdd placement group. If a single drive (or a node with many drives) in the ssd placement group fails or is missing for other reasons, all reads will still come from the drives in the placeAll placement group; if, however, a drive from another node (or another node with many drives) is missing, reads will be served by the drives in the placeHead placement group.
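
The three hints above can be condensed into a toy model. The snippet below is purely illustrative (not StorPool code) and shows how the three copies of a triple-replicated volume map to placement groups when placeAll is ssd and placeHead is hdd:

```shell
# Toy model of triple-replication placement (illustrative only, not StorPool code).
# placeHead and placeTail default to placeAll; one copy goes to each group.
placeAll=ssd
placeHead=hdd                       # overrides placeAll for the head copy
placeTail=${placeTail:-$placeAll}   # not set here, so it defaults to placeAll
copies="$placeHead $placeAll $placeTail"
echo "$copies"                      # hdd ssd ssd - two copies on ssd, one on hdd
```

This matches the ssd-hybrid template created above: two copies on the drives in the ssd placement group and a third copy on a drive in the hdd group.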

To get the status of a template with detailed info on the usage and the available space left with this placement use:

# storpool template status
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| template             | place head |  place all | place tail | repl. | volumes | snapshots/removing |     size |  capacity |   avail. |  avail. all |  avail. tail |  avail. head | flags |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| magnetic             | hdd        | hdd        | hdd        |     3 |     115 |       631/0        |   28 TiB |    80 TiB |   52 TiB |     240 TiB |      240 TiB |      240 TiB |       |
| hybrid               | hdd        | hdd        | ssd        |     3 |     208 |       347/9        |   17 TiB |    72 TiB |   55 TiB |     240 TiB |       72 TiB |      240 TiB |       |
| ssd-hybrid           | hdd        | ssd        | ssd        |     3 |      40 |         7/0        |    4 TiB |    36 TiB |   36 TiB |      72 TiB |       72 TiB |      240 TiB |       |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

To change template attributes directly use:

# storpool template hdd-only size 120G propagate no
OK
# storpool template hybrid size 40G iops 4000 propagate no
OK

Parameters that can be set:

  • replication - change the number of copies for volumes or snapshots created with this template

  • size - default size if not specified for each volume created with this template

  • placeAll - place all objects in placementGroup (Default value: default)

  • placeTail - name of placementGroup for reader (Default value: same as placeAll value)

  • placeHead - place the third replica in a different placementGroup (Default value: same as placeAll value)

  • iops - set the maximum IOPS limit for volumes created with this template (in IOPS)

  • bw - set maximum bandwidth limit (in MB/s)

  • parent - set parent snapshot for all volumes created with this template

  • reuseServer - place multiple copies on the same server

When changing parameters of an already created template, the propagate parameter is required in order to specify whether the changes should be applied to all existing volumes and/or snapshots created with this template. The parameter is required regardless of whether the template has any volumes and/or snapshots created with it.

For example, to change the bandwidth limit for all volumes and snapshots created with the already existing template magnetic:

# storpool template magnetic bw 100MB propagate yes
OK

Note

When using storpool template $TEMPLATE propagate yes, all the parameters of $TEMPLATE will be re-applied to all volumes and snapshots created with it.

Note

Changing template parameters with the propagate option will not automatically re-allocate the content of existing volumes on the disks. If the replication or the placement groups are changed, run the balancer to apply the new settings to the existing volumes. However, if the changes are made directly to the volume instead of to the template, running the balancer is not required.

Attention

Dropping the replication (e.g. from triple to dual) of a large number of volumes is an almost instant operation; however, returning them back to triple is similar to creating the third copy for the first time. This is why changing the replication to less than the present value (e.g. from 3 to 2) requires using replicationReduce as a safety measure.

To rename a template use:

# storpool template magnetic rename backup
OK

To delete a template use:

# storpool template hdd-only delete hdd-only
OK

Note

The delete operation might fail if there are volumes/snapshots that are created with this template.

9.15. iSCSI

The StorPool iSCSI support is documented more extensively in the StorPool iSCSI support section; these are the commands used to configure it and view the configuration.

To set the cluster’s iSCSI base IQN iqn.2019-08.com.example:examplename:

# storpool iscsi config setBaseName iqn.2019-08.com.example:examplename
OK

To create a portal group examplepg used to group exported volumes for access by initiators using 192.168.42.247/24 (CIDR notation) as the portal IP address:

# storpool iscsi config portalGroup examplepg create addNet 192.168.42.247/24 vlan 42
OK

To create a portal for the initiators to connect to (for example, portal IP address 192.168.42.202 and StorPool SP_OURID 5):

# storpool iscsi config portal create portalGroup examplepg address 192.168.42.202 controller 5
OK

Note

This address will be handled by the storpool_iscsi process directly and will not be visible on the node with the usual instruments like ip or ifconfig; see 9.15.1. iscsi_tool for these purposes.

To define the iqn.2019-08.com.example:abcdefgh initiator that is allowed to connect from the 192.168.42.0/24 network (without authentication):

# storpool iscsi config initiator iqn.2019-08.com.example:abcdefgh create net 192.168.42.0/24
OK

To define the iqn.2019-08.com.example:client initiator that is allowed to connect from the 192.168.42.0/24 network and must authenticate using the standard iSCSI password-based challenge-response authentication method using the username user and the password secret:

# storpool iscsi config initiator iqn.2019-08.com.example:client create net 192.168.42.0/24 chap user secret
OK

To specify that the StorPool volume tinyvolume should be exported to one or more initiators:

# storpool iscsi config target create tinyvolume
OK

To actually export the StorPool volume tinyvolume to the iqn.2019-08.com.example:abcdefgh initiator via the examplepg portal group (the StorPool iSCSI service will automatically pick a portal to export the volume through):

# storpool iscsi config export initiator iqn.2019-08.com.example:abcdefgh portalGroup examplepg volume tinyvolume
OK

Note

The volume will be visible to the initiator as IQN <BaseName>:<volume>
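
For illustration, the target IQN seen by the initiator is simply the base name and the volume name joined with a colon:

```shell
# The IQN the initiator sees is <BaseName>:<volume> (values from the examples above).
base_iqn="iqn.2019-08.com.example:examplename"
volume="tinyvolume"
target_iqn="${base_iqn}:${volume}"
echo "$target_iqn"   # iqn.2019-08.com.example:examplename:tinyvolume
```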

To view the iSCSI cluster base IQN:

# storpool iscsi basename
---------------------------------------
| basename                            |
---------------------------------------
| iqn.2019-08.com.example:examplename |
---------------------------------------

To view the portal groups:

# storpool iscsi portalGroup list
---------------------------------------------
| name       | networksCount | portalsCount |
---------------------------------------------
| examplepg  |             1 |            0 |
---------------------------------------------

To view the portals:

# storpool iscsi portalGroup list portals
--------------------------------------------------
| group       | address             | controller |
--------------------------------------------------
| examplepg   | 192.168.42.246:3260 |          1 |
| examplepg   | 192.168.42.202:3260 |          5 |
--------------------------------------------------

To view the defined initiators:

# storpool iscsi initiator list
---------------------------------------------------------------------------------------
| name                             | username | secret | networksCount | exportsCount |
---------------------------------------------------------------------------------------
| iqn.2019-08.com.example:abcdefgh |          |        |             1 |            1 |
| iqn.2019-08.com.example:client   | user     | secret |             1 |            0 |
---------------------------------------------------------------------------------------

To view the present state of the configured iSCSI interfaces:

# storpool iscsi interfaces list
--------------------------------------------------
| ctrlId | net 0             | net 1             |
--------------------------------------------------
|     23 | 2A:60:00:00:E0:17 | 2A:60:00:00:E0:17 |
|     24 | 2A:60:00:00:E0:18 | 2A:60:00:00:E0:18 |
|     25 | 2A:60:00:00:E0:19 | 2E:60:00:00:E0:19 |
|     26 | 2A:60:00:00:E0:1A | 2E:60:00:00:E0:1A |
--------------------------------------------------

Note

These are the same interfaces configured with SP_ISCSI_IFACE in the order of appearance:

# storpool_showconf SP_ISCSI_IFACE
SP_ISCSI_IFACE=sp0,spbond1:sp1,spbond1:[lacp]

In the above output the sp0 interface is net ID 0 and sp1 is net ID 1.
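
The mapping can be sketched in plain shell (an illustration of how the value is structured, not a StorPool tool):

```shell
# SP_ISCSI_IFACE holds one comma-separated group per net ID, joined by colons;
# the first field of each group is the interface, the trailing [lacp] is a flag.
SP_ISCSI_IFACE='sp0,spbond1:sp1,spbond1:[lacp]'
IFS=: read -r net0 net1 flags <<EOF
$SP_ISCSI_IFACE
EOF
echo "net 0: ${net0%%,*}"   # net 0: sp0
echo "net 1: ${net1%%,*}"   # net 1: sp1
```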

To view the volumes that may be exported to initiators:

# storpool iscsi target list
-------------------------------------------------------------------------------------
| name                                           | volume     | currentControllerId |
-------------------------------------------------------------------------------------
| iqn.2019-08.com.example:examplename:tinyvolume | tinyvolume |               65535 |
-------------------------------------------------------------------------------------

To view the volumes currently exported to initiators:

# storpool iscsi initiator list exports
--------------------------------------------------------------------------------------------------------------------------------------
| name                                           | volume     | currentControllerId | portalGroup | initiator                        |
--------------------------------------------------------------------------------------------------------------------------------------
| iqn.2019-08.com.example:examplename:tinyvolume | tinyvolume |                   1 |             | iqn.2019-08.com.example:abcdefgh |
--------------------------------------------------------------------------------------------------------------------------------------

To stop exporting the tinyvolume volume to the initiator with iqn iqn.2019-08.com.example:abcdefgh and the examplepg portal group:

# storpool iscsi config unexport initiator iqn.2019-08.com.example:abcdefgh portalGroup examplepg volume tinyvolume
OK

To remove an iSCSI definition for the tinyvolume volume:

# storpool iscsi config target delete tinyvolume
OK

To remove access for the iqn.2019-08.com.example:client iSCSI initiator:

# storpool iscsi config initiator iqn.2019-08.com.example:client delete
OK

To remove the portal with IP address 192.168.42.202:

# storpool iscsi config portal delete address 192.168.42.202
OK

To remove portal group examplepg after all the portals have been removed:

# storpool iscsi config portalGroup examplepg delete
OK

Note

Only portal groups without portals may be deleted.

9.15.1. iscsi_tool

With hardware-accelerated iSCSI, all traffic from/to the initiators is handled by the storpool_iscsi service directly. For example, with the setup configured above, the addresses exposed on each of the nodes can be queried with /usr/lib/storpool/iscsi_tool:

# /usr/lib/storpool/iscsi_tool
usage: /usr/lib/storpool/iscsi_tool change-port 0/1 ifaceName
usage: /usr/lib/storpool/iscsi_tool ip net list
usage: /usr/lib/storpool/iscsi_tool ip neigh list
usage: /usr/lib/storpool/iscsi_tool ip route list

To list the presently configured addresses use:

# /usr/lib/storpool/iscsi_tool ip net list
10.1.100.0/24 vlan 1100 ports 1,2
10.18.1.0/24 vlan 1801 ports 1,2
10.18.2.0/24 vlan 1802 ports 1,2

To list the neighbours and their last state use:

# /usr/lib/storpool/iscsi_tool ip neigh list
10.1.100.11 ok F4:52:14:76:9C:B0 lastSent 1785918292753 us, lastRcvd 918669 us
10.1.100.13 ok 0:25:90:C8:E5:AA lastSent 1785918292803 us, lastRcvd 178521 us
10.1.100.18 ok C:C4:7A:EA:85:4E lastSent 1785918292867 us, lastRcvd 178099 us
10.1.100.108 ok 1A:60:0:0:E0:8 lastSent 1785918293857 us, lastRcvd 857181794 us
10.1.100.112 ok 1A:60:0:0:E0:C lastSent 1785918293906 us, lastRcvd 1157179290 us
10.1.100.113 ok 1A:60:0:0:E0:D lastSent 1785918293922 us, lastRcvd 765392509 us
10.1.100.114 ok 1A:60:0:0:E0:E lastSent 1785918293938 us, lastRcvd 526084270 us
10.1.100.115 ok 1A:60:0:0:E0:F lastSent 1785918293954 us, lastRcvd 616948781 us
10.1.100.123 ours
[snip]

The above output also includes the portalGroup addresses residing on the node with the lowest ID in the cluster.

To list routing info use:

# /usr/lib/storpool/iscsi_tool ip route list
10.1.100.0/24 local
10.18.1.0/24 local
10.18.2.0/24 local

9.15.2. iscsi_targets

The /usr/lib/storpool/iscsi_targets tool is a helper for Linux-based initiators, showing all logged-in targets on the node:

# /usr/lib/storpool/iscsi_targets
/dev/sdn      iqn.2020-04.com.storpool:autotest:s11-1-iscsi-test-hybrid-centos6
/dev/sdo      iqn.2020-04.com.storpool:autotest:s11-1-iscsi-test-hdd-centos6
/dev/sdp      iqn.2020-04.com.storpool:autotest:s11-2-iscsi-test-hybrid-centos6
/dev/sdq      iqn.2020-04.com.storpool:autotest:s11-2-iscsi-test-hdd-centos6

9.16. Kubernetes

To register a Kubernetes cluster:

# storpool kubernetes add name cluster1
OK

To disable a Kubernetes cluster:

# storpool kubernetes update name cluster1 disable yes
OK

To enable a Kubernetes cluster:

# storpool kubernetes update name cluster1 disable no
OK

To delete a Kubernetes cluster:

# storpool kubernetes delete name cluster1
OK

To list registered Kubernetes clusters:

# storpool kubernetes list
-----------------------
| name     | disabled |
-----------------------
| cluster1 | false    |
-----------------------

To view the status of the registered Kubernetes clusters:

# storpool kubernetes status
--------------------------------------------------------------
| name     | sc | w   | pvc | noRsrc | noTempl | mode | noSC |
--------------------------------------------------------------
| cluster1 |  0 | 0/3 |   0 |      0 |       0 |    0 |    0 |
--------------------------------------------------------------
Fields:
  sc      - registered Storage Classes
  w       - watch connections to the kube adm
  pvc     - persistentVolumeClaims being provisioned
  noRsrc  - persistentVolumeClaims failed due to no resources
  noTempl - persistentVolumeClaims failed due to missing template
  mode    - persistentVolumeClaims failed due to unsupported access mode
  noSC    - persistentVolumeClaims failed due to missing storage class

9.17. Relocator

The relocator is an internal StorPool service that takes care of data re-allocation in case of a change of a volume’s replication or placement group parameters, or in case of any pending rebase operations. This service is turned on by default.

If needed, the relocator can be turned off with:

# storpool relocator off
OK

To turn it back on use:

# storpool relocator on
OK

To display the relocator status:

# storpool relocator status
relocator on, no volumes to relocate

The following additional relocator commands are available:

  • storpool relocator disks - returns the state of the disks after the relocator finishes all presently running tasks, as well as the quantity of objects and data each drive still needs to recover. The output is the same as with storpool balancer disks after the balancing task has been committed, see Balancer for more details.

  • storpool relocator volume <volumename> disks or storpool relocator snapshot <snapshotname> disks - show the same information as storpool relocator disks, but only for the pending operations of the specified volume or snapshot.

9.18. Balancer

The balancer is used to redistribute data when a disk or a set of disks (e.g. a new node) is added to or removed from the cluster. By default it is off; it has to be turned on after changes in the cluster configuration for redistribution of the data to occur.

To display the status of the balancer:

# storpool balancer status
balancer waiting, auto off

To load a re-balancing task, please refer to the Rebalancing StorPool section of this guide.

To discard the re-balancing operation use:

# storpool balancer stop
OK

To commit the proposed changes and start the relocations use:

# storpool balancer commit
OK

After the commit the changes will be visible only with storpool relocator disks, and many volumes and snapshots will have the M flag in the output of storpool volume status until all relocations are completed. The progress can be followed with storpool task list (see Tasks).

9.19. Tasks

Tasks are all outstanding operations for recovering or relocating data, either within the present cluster or between two connected clusters.

For example if a disk with ID 1401 was not in the cluster for a period of time and is then returned, all outdated objects will be recovered from the other drives with the latest changes.

These recovery operations can be listed with:

# storpool task list
----------------------------------------------------------------------------------------
|     disk |  task id |  total obj |  completed |    started |  remaining | % complete |
----------------------------------------------------------------------------------------
|     2301 | RECOVERY |         73 |          5 |          1 |         68 |         6% |
----------------------------------------------------------------------------------------
|    total |          |         73 |          5 |          1 |         68 |         6% |
----------------------------------------------------------------------------------------
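
The columns of the output are related by simple arithmetic; for the row above:

```shell
# Reading the task list row above: remaining and "% complete" follow from the
# object counts ("% complete" uses integer division, hence the truncated 6%).
total=73
completed=5
remaining=$(( total - completed ))   # 68
pct=$(( completed * 100 / total ))   # 6 (5/73 is 6.8%, truncated)
echo "$remaining ${pct}%"            # 68 6%
```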

Other cases when tasks can be listed include when a re-balancing operation was committed and relocations are in progress, as well as when a cloning operation for a remote snapshot into the local cluster is in progress.

9.20. Maintenance Mode

The maintenance submenu is used to put one or more nodes of a cluster into maintenance state. Several checks are performed before entering maintenance state, to prevent a node with one or more live server instances from entering maintenance when, for example, the cluster is not yet fully recovered or is running with decreased redundancy for other reasons.

A node can be put into maintenance state with:

# storpool maintenance set node 23 duration 10m description kernel_update
OK

The above will put node ID 23 into maintenance state for 10 minutes and set the description to “kernel_update”.

To list the nodes currently in maintenance use:

# storpool maintenance list
------------------------------------------------------------
| nodeId | started             | remaining | description   |
------------------------------------------------------------
|     23 | 2020-09-30 12:55:20 | 00:09:50  | kernel_update |
------------------------------------------------------------
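
The remaining column is the requested duration minus the elapsed time; the value above (10 minutes requested, listed 10 seconds after the start) can be recomputed as:

```shell
# Recompute the "remaining" column from the example above.
duration_sec=$(( 10 * 60 ))   # duration 10m
elapsed_sec=10                # listed 10 seconds after the maintenance started
remaining=$(( duration_sec - elapsed_sec ))
printf '%02d:%02d:%02d\n' $(( remaining / 3600 )) $(( remaining % 3600 / 60 )) $(( remaining % 60 ))
# 00:09:50
```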

To complete a maintenance for a node before it expires use:

# storpool maintenance complete node 23
OK

Note

While the node or the cluster is in maintenance mode, non-cluster-threatening issues will not be sent by the monitoring system to external entities. All alerts will still be received by StorPool support and will be classified as “under maintenance” internally.

Attention

Cluster-threatening issues will still send super-critical alerts to both StorPool support and any other configured endpoint. More on super-critical alerts here.

A full cluster maintenance mode is also available for occasions involving cluster-wide maintenance. An example would be a scheduled restart of a network switch, which would be reported as missing network for all nodes in the cluster.

This mode does not perform any checks and is mainly for informational purposes, in order to sync context between customers and StorPool’s support teams. More on how to activate it is available here.

Full cluster maintenance mode can be used in addition to the per-node maintenance state explained above when necessary.

9.21. Management Configuration

The mgmtConfig submenu is used to set some internal configuration parameters.

To list the presently configured parameters use:

# storpool mgmtConfig list
relocator on, interval 5 sec
relocator transaction: min objects 320, max objects 4294967295
relocator recovery: max tasks per disk 2, max objects per disk 3200
relocator recovery objects trigger 32
relocator min free 80 GB
relocator max objects per HDD tail 0
balancer auto off, interval 5 sec
snapshot delete interval 1 sec
disks soft-eject interval 5 sec
snapshot delayed delete off
snapshot dematerialize interval 10 sec
mc owner check interval 2 sec
mc autoreconcile interval 2 sec
reuse server implicit on disk down disabled
maintenance state production

To enable deferred snapshot deletion (default off) use:

# storpool mgmtConfig delayedSnapshotDelete on
OK

When enabled, all snapshots with a configured deletion time will be deleted at the configured date and time.

To change the default interval between periodic checks whether disks marked for ejection can actually be ejected (5 sec.) use:

# storpool mgmtConfig disksSoftEjectInterval 20000 # value in ms - 20 sec.
OK

To change the default interval (5 sec.) for the relocator to check if there is new work to be done use:

# storpool mgmtConfig relocatorInterval 20000 # value is in ms - 20 sec.
OK
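
Note that the interval values are passed in milliseconds, while the defaults are quoted in seconds, so the conversion is just a factor of 1000:

```shell
# mgmtConfig intervals are set in milliseconds; e.g. 20 seconds is passed as 20000.
interval_sec=20
interval_ms=$(( interval_sec * 1000 ))
echo "$interval_ms"   # 20000
```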

To set a number of objects per disk in recovery at a time that is different from the default (3200):

# storpool mgmtConfig relocatorMaxRecoveryObjectsPerDisk 2000 # value in number of objects per disk
OK

To change the default maximum number of recovery tasks per disk (2 tasks) use:

# storpool mgmtConfig relocatorMaxRecoveryTasksPerDisk 4 # value is number of tasks per disk - will set 4 tasks
OK

To change the minimum (default 320) or the maximum (default 4294967295) number of objects per transaction for the relocator use:

# storpool mgmtConfig relocatorMaxTrObjects 2147483647
OK
# storpool mgmtConfig relocatorMinTrObjects 640
OK

To change the maximum number of objects per transaction for HDD tail drives (0 is unset, 1 or more is the number of objects) use:

# storpool mgmtConfig relocatorMaxTrObjectsPerHddTail 2

To change the maximum number of objects in recovery for a disk to be usable by the relocator (default 32) use:

# storpool mgmtConfig relocatorRecoveryObjectsTrigger 64

To change the default interval for checking for new snapshots to delete use:

# storpool mgmtConfig snapshotDeleteInterval

To enable snapshot dematerialization or change the interval use:

# storpool mgmtConfig snapshotDematerializeInterval 30000 # sets the interval 30 seconds, 0 disables it

Snapshot dematerialization checks for and removes all objects that do not refer to any data, i.e. objects with no changes since the last snapshot (or ever). This helps reduce the number of used objects per disk in clusters with a large number of snapshots and a small number of changed blocks between the snapshots in the chain.

To update the free space threshold in GB below which the relocator will not add new tasks use:

# storpool mgmtConfig relocatorGBFreeBeforeAdd 75 # value is in GB

To set or change the default MultiCluster owner check interval use:

# storpool mgmtConfig mcOwnerCheckInterval 2000 # sets the interval to 2 seconds, 0 disables it

To set or change the default MultiCluster auto-reconcile interval use:

# storpool mgmtConfig mcAutoReconcileInterval 2000 # sets the interval to 2 seconds, 0 disables it

If a disk is down and a new volume cannot be allocated, enabling this option will retry the volume allocation as if reuseServer was specified. This is helpful for minimum installations with 3 nodes, when one of the nodes or a disk is down. To enable the option use:

# storpool mgmtConfig reuseServerImplicitOnDiskDown enable

In case of a planned maintenance the following will update the full cluster maintenance state to maintenance:

# storpool mgmtConfig maintenanceState maintenance
OK

… and back into production:

# storpool mgmtConfig maintenanceState production
OK

Please consult with StorPool support before changing the management configuration defaults.

9.22. Mode

Support for a couple of different output modes is available, both in the interactive shell and when the CLI is invoked directly. Some custom format options are available only for some operations.

Available modes:

  • csv - Semicolon-separated values for some commands

  • json - Processed JSON output for some commands

  • pass - Pass the JSON response through

  • raw - Raw output (display the HTTP request and response)

  • text - Human readable output (default)

Example with switching to csv mode in the interactive shell:

StorPool> mode csv
OK
StorPool> net list
nodeId;flags;net 1;net 2
23;uU + AJ;22:60:00:00:F0:17;26:60:00:00:F0:17
24;uU + AJ;2A:60:00:00:00:18;2E:60:00:00:00:18
25;uU + AJ;F6:52:14:76:9C:C0;F6:52:14:76:9C:C1
26;uU + AJ;2A:60:00:00:00:1A;2E:60:00:00:00:1A
29;uU + AJ;52:6B:4B:44:02:FE;52:6B:4B:44:02:FF

The same applies when using the CLI directly:

# storpool -f csv net list # the output is the same as above
[snip]
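
The csv mode is convenient for scripting. As a sketch, the node IDs and the first-network MAC addresses could be extracted with awk; a sample of the output above is embedded here so the snippet is self-contained:

```shell
# Parse 'storpool -f csv net list'-style output with awk (sample embedded).
csv='nodeId;flags;net 1;net 2
23;uU + AJ;22:60:00:00:F0:17;26:60:00:00:F0:17
24;uU + AJ;2A:60:00:00:00:18;2E:60:00:00:00:18'
printf '%s\n' "$csv" | awk -F';' 'NR > 1 { print $1, $3 }'
# 23 22:60:00:00:F0:17
# 24 2A:60:00:00:00:18
```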

10. Multi server

The multi-server feature enables the use of up to seven separate storpool_server instances on a single node. This makes sense for dedicated storage nodes, or for a heavily loaded converged setup with more resources isolated for the storage system.

For example, a dedicated storage node with 36 drives would provide better peak performance with 4 server instances, each controlling a quarter of all disks/SSDs, than with a single instance. Another good example is a converged node with 16 SSDs/HDDs, which would provide better peak performance with two server instances each controlling half of the drives and running on separate CPU cores (or even on two threads of a single CPU core), compared to a single server instance.

The configuration of the CPUs on which the different instances are running is done via cgroups, through the storpool_cg tool. More details are available in Cgroup setup.

Configuring which drive is handled by which instance is done with storpool_initdisk. For example, if we have two drives with IDs of 1101 and 1102, both controlled by the first server instance, the output from storpool_initdisk would look like this:

# storpool_initdisk --list
/dev/sde1, diskId 1101, version 10007, server instance 0, cluster init.b, SSD
/dev/sdf1, diskId 1102, version 10007, server instance 0, cluster init.b, SSD

Setting the second SSD drive (1102) to be controlled by the second server instance is done like this:

# storpool_initdisk -r -i 1 /dev/sdXN   # where X is the drive letter and N is the partition number e.g. /dev/sdf1

Hint

The above command will fail if the storpool_server service is running; eject the disk prior to re-assigning it to another instance.

On some occasions, if the first server instance was configured with a large amount of cache (see SP_CACHE_SIZE in the Configuration Guide), first split the cache between the different instances (e.g. from 8192 to 4096 when migrating from one to two instances). These parameters are automatically taken care of by the storpool_cg tool; for more details see Cgroup setup.
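
The split itself is a plain division of the configured cache across the instances, e.g. when going from one to two instances:

```shell
# Splitting the cache from the example above between two server instances
# (units as used by SP_CACHE_SIZE; see the Configuration Guide).
total_cache=8192
instances=2
per_instance=$(( total_cache / instances ))
echo "$per_instance"   # 4096
```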

A helper tool can be used to print the commands required when changing the number of server instances. The example below is for a node with some SSDs and some HDDs, automatically assigned to 3 SSD-only server instances and one HDD-only server instance:

[root@s25 ~]# /usr/lib/storpool/multi-server-helper.py -i 4 -s 3
/usr/sbin/storpool_initdisk -r -i 0 2532 0000:01:00.0-p1  # SSD
/usr/sbin/storpool_initdisk -r -i 0 2534 0000:02:00.0-p1  # SSD
/usr/sbin/storpool_initdisk -r -i 0 2533 0000:06:00.0-p1  # SSD
/usr/sbin/storpool_initdisk -r -i 0 2531 0000:07:00.0-p1  # SSD
/usr/sbin/storpool_initdisk -r -i 1 2505 /dev/sde1  # SSD
/usr/sbin/storpool_initdisk -r -i 1 2506 /dev/sdf1  # SSD
/usr/sbin/storpool_initdisk -r -i 1 2507 /dev/sdg1  # SSD
/usr/sbin/storpool_initdisk -r -i 1 2508 /dev/sdh1  # SSD
/usr/sbin/storpool_initdisk -r -i 2 2501 /dev/sda1  # SSD
/usr/sbin/storpool_initdisk -r -i 2 2502 /dev/sdb1  # SSD
/usr/sbin/storpool_initdisk -r -i 2 2503 /dev/sdc1  # SSD
/usr/sbin/storpool_initdisk -r -i 2 2504 /dev/sdd1  # SSD
/usr/sbin/storpool_initdisk -r -i 3 2511 /dev/sdi1  # WBC
/usr/sbin/storpool_initdisk -r -i 3 2512 /dev/sdj1  # WBC
/usr/sbin/storpool_initdisk -r -i 3 2513 /dev/sdk1  # WBC
/usr/sbin/storpool_initdisk -r -i 3 2514 /dev/sdl1  # WBC
/usr/sbin/storpool_initdisk -r -i 3 2515 /dev/sdn1  # WBC
/usr/sbin/storpool_initdisk -r -i 3 2516 /dev/sdo1  # WBC
/usr/sbin/storpool_initdisk -r -i 3 2517 /dev/sdp1  # WBC
/usr/sbin/storpool_initdisk -r -i 3 2518 /dev/sdq1  # WBC

[root@s25 ~]# /usr/lib/storpool/multi-server-helper.py -h
usage: multi-server-helper.py [-h] [-i INSTANCES] [-s [SSD_ONLY]]

Prints relevant commands for dispersing the drives to multiple server
instances

optional arguments:
  -h, --help            show this help message and exit
  -i INSTANCES, --instances INSTANCES
                        Number of instances
  -s [SSD_ONLY], --ssd-only [SSD_ONLY]
                        Splits by type: 's' SSD-only instances plus i-s HDD
                        instances (default s: 1)

Note that these commands can be executed only when the relevant storpool_server* service instances are stopped, and a cgroup re-configuration will likely be required after the setup changes (see Cgroup setup for more info on how to update cgroups).

11. Volume management

Volume

Volumes are the basic service of the StorPool storage system. A volume always has a name, a global ID and a certain size. It can be read from and written to, and can be attached to hosts as a read-only or read-write block device under the /dev/storpool directory (also available at /dev/storpool-byid).

The volume name is a string consisting of one or more of the allowed characters: upper- and lower-case Latin letters (a-z, A-Z), numbers (0-9) and the delimiters dot (.), colon (:), dash (-) and underscore (_).
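The naming rule above can be expressed as a short validator. This is an illustrative sketch of the rule as stated, not code that StorPool ships:

```python
import re

# Allowed characters per the rule above: Latin letters, digits,
# and the delimiters '.', ':', '-' and '_'; at least one character.
VALID_VOLUME_NAME = re.compile(r'^[A-Za-z0-9.:_-]+$')

def is_valid_volume_name(name):
    return bool(VALID_VOLUME_NAME.match(name))
```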



11.1. Creating a volume

Creating a volume


11.2. Deleting a volume

Deleting a volume


11.3. Renaming a volume

Renaming a volume


11.4. Resizing a volume

Resizing a volume


11.5. Snapshots

Snapshot

Snapshots are read-only point-in-time images of volumes. They are created once and cannot be changed. They can be attached to hosts as read-only block devices under /dev/storpool.


All volumes and snapshots share the same namespace. Names of volumes and snapshots are unique within a StorPool cluster. This diagram illustrates the relationship between a snapshot and a volume. Volume vol1 is based on snapshot snap1. vol1 contains only the changes since snap1 was taken; in the common case this is a small amount of data. Arrows indicate a child-parent relationship. Each volume or snapshot may have exactly one parent which it is based upon. Writes to vol1 are recorded within the volume. Reads from vol1 may be served by vol1 or by its parent snapshot snap1, depending on whether vol1 contains changed data for the read request or not.

Namespace for volumes and snapshots
Volume snapshot relation

Snapshots and volumes are completely independent. Each snapshot may have many children (volumes and snapshots). Volumes cannot have children.






Volume snapshot chain

snap1 contains a full image. snap2 contains only the changes since snap1 was taken. vol1 and vol2 contain only the changes since snap2 was taken.
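The fall-through read behaviour along such a chain can be modelled roughly as follows. This is a toy model for illustration only, not the actual on-disk format or API:

```python
class Image:
    """Toy model of a volume/snapshot: local changes plus an optional parent."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent   # child -> parent relationship (arrows in the diagrams)
        self.blocks = {}       # offset -> data written to this image

    def write(self, offset, data):
        self.blocks[offset] = data

    def read(self, offset):
        # Serve from the local changes if present; otherwise fall
        # through the parent chain (vol1 -> snap2 -> snap1 above).
        if offset in self.blocks:
            return self.blocks[offset]
        return self.parent.read(offset) if self.parent else None

snap1 = Image("snap1"); snap1.write(0, "base")          # full image
snap2 = Image("snap2", parent=snap1); snap2.write(4096, "update")
vol1 = Image("vol1", parent=snap2)
vol1.write(8192, "new")                                  # only the changes since snap2
```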









11.6. Creating a snapshot of a volume

There is a volume named vol1.

Creating a snapshot

After the first snapshot the state of vol1 is recorded in a new snapshot named snap1. vol1 does not occupy any space now, but will record any new writes which come in after the creation of the snapshot. Reads from vol1 may fall through to snap1.







Then the state of vol1 is recorded in a new snapshot named snap2. snap2 contains the changes between the moment snap1 was taken and the moment snap2 was taken. snap2’s parent is the original parent of vol1.



11.7. Converting a volume to a snapshot (freeze)

There is a volume named vol1, based on a snapshot snap0. vol1 contains only the changes since snap0 was taken.


Volume freeze

After the freeze operation the state of vol1 is recorded in a new snapshot with the same name. The snapshot vol1 contains changes between the moment snap0 was taken and the moment vol1 was frozen.






11.8. Creating a volume based on an existing snapshot (a.k.a. clone)

Before the creation of vol1 there is a snapshot named snap1.

Snapshot clones

A new volume named vol1 is created, based on snap1. The newly created volume does not occupy any space initially. Reads from vol1 may fall through to snap1 or to snap1’s parents (if any).








11.9. Deleting a snapshot

vol1 and vol2 are based on snap1. snap1 is based on snap0. snap1 contains the changes between the moment snap0 was taken and when snap1 was taken. vol1 and vol2 contain the changes since the moment snap1 was taken.


Deleting a snapshot

After the deletion, vol1 and vol2 are based on snap1’s original parent (if any); in the example they are now based on snap0. When deleting a snapshot, the changes contained therein will not be propagated to its children; instead, StorPool will keep snap1 in the deleting state to prevent an explosion of disk space usage.





11.10. Rebase to null (a.k.a. promote)

vol1 is based on snap1. snap1 is in turn based on snap0. snap1 contains the changes between the moment snap0 was taken and when snap1 was taken. vol1 contains the changes since the moment snap1 was taken.

Rebase to null

After promotion vol1 is not based on a snapshot. vol1 now contains all data, not just the changes since snap1 was taken. Any relation between snap1 and snap0 is unaffected.








11.11. Rebase

vol1 is based on snap1. snap1 is in turn based on snap0. snap1 contains the changes between the moment snap0 was taken and when snap1 was taken. vol1 contains the changes since the moment snap1 was taken.


Rebase

After the rebase operation vol1 is based on snap0. vol1 now contains all changes since snap0 was taken, not just since snap1. snap1 is unchanged.









11.12. Example use of snapshots

Example use of snapshots

This is a semi-realistic example of how volumes and snapshots may be used. There is a snapshot called base.centos7. This snapshot contains a base CentOS 7 VM image, which was prepared carefully by the service provider. There are 3 customers with 4 virtual machines each. All virtual machine images are based on CentOS 7, but may contain custom data, which is unique to each VM.












Example use of snapshots

This example shows another typical use of snapshots - restore points back in time for a volume. There is one base image for CentOS 7, three snapshot restore points, and one live volume, cust123.v.1.














12. StorPool iSCSI support

If StorPool volumes need to be accessed by hosts that cannot run the StorPool client service (e.g. VMware hypervisors), they may be exported using the iSCSI protocol.

As of version 19, StorPool implements an internal user-space TCP/IP stack which, in conjunction with NIC hardware acceleration (user-mode drivers), allows for higher performance and independence from the kernel’s TCP/IP stack and its inefficiencies.

12.1. A Quick Overview of iSCSI

The iSCSI remote block device access protocol, as implemented by the StorPool iSCSI service, is a client-server protocol allowing clients (referred to as “initiators”) to read and write data to disks (referred to as “targets”) exported by iSCSI servers. The iSCSI servers listen on portals (TCP ports, usually 3260, on specific IP addresses); these portals may be grouped into so-called portal groups to provide fine-grained access control or load balancing for the iSCSI connections.

12.2. An iSCSI Setup in a StorPool Cluster

The StorPool implementation of iSCSI provides a way to mark StorPool volumes as accessible to iSCSI initiators, define iSCSI portals where hosts running the StorPool iSCSI service listen for connections from initiators, define portal groups over these portals, and export StorPool volumes (iSCSI targets) to iSCSI initiators in the portal groups. To simplify the configuration of the iSCSI initiators, and also to provide load balancing and failover, each portal group has a floating IP address that is automatically brought up on only a single StorPool service at a given moment; the initiators are configured to connect to this floating address, authenticating if necessary, and are then redirected to the portal of the StorPool service that actually exports the target (volume) that they need to access.

Note

As of version 19, you don’t need to add the IP addresses on the nodes; those are handled directly by the StorPool TCP implementation and are not visible in ifconfig or ip. If you’re going to use multiple VLANs, those are configured in the CLI and do not require setting up VLAN interfaces on the host itself, except for debugging/testing or if a local initiator is required to access volumes through iSCSI.

In the simplest setup, there is a single portal group with a floating IP address, there is a single portal for each StorPool host that runs the iSCSI service, all the initiators connect to the floating IP address and are redirected to the correct host. For quality of service or fine-grained access control, more portal groups may be defined and some volumes may be exported via more than one portal group.

Before configuring iSCSI, the interfaces that will be used for it need to be described in storpool.conf. The general config format is:

SP_ISCSI_IFACE=IFACE1,RESOLVE:IFACE2,RESOLVE:[flags]

This line means that the first iSCSI network is on IFACE1 and the second one is on IFACE2; the order is important for the configuration later. RESOLVE is the resolve interface, if different from the interface itself, i.e. if it is a bond or a bridge.

[flags] is optional and, if not needed, must be omitted entirely. Currently the only supported value is [lacp] (brackets included), used when the interfaces are in a LACP trunk.

Examples:

Multipath, two separate interfaces used directly:

SP_ISCSI_IFACE=eth0:eth1

Active-backup bond named bond0:

SP_ISCSI_IFACE=eth0,bond0:eth1,bond0

LACP bond named bond0:

SP_ISCSI_IFACE=eth0,bond0:eth1,bond0:[lacp]

Bridge interface cloudbr0 on top of LACP bond:

SP_ISCSI_IFACE=eth0,cloudbr0:eth1,cloudbr0:[lacp]
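For illustration, the SP_ISCSI_IFACE format can be parsed as follows. This is a hypothetical parser written against the description above, not StorPool’s own code:

```python
def parse_sp_iscsi_iface(value):
    """Parse SP_ISCSI_IFACE: colon-separated networks, each 'IFACE' or
    'IFACE,RESOLVE', with an optional trailing '[lacp]' flag."""
    parts = value.split(':')
    lacp = False
    if parts and parts[-1].startswith('['):
        if parts[-1] != '[lacp]':
            raise ValueError('unsupported flag: %s' % parts[-1])
        lacp = True
        parts = parts[:-1]
    nets = []
    for part in parts:
        iface, _, resolve = part.partition(',')
        # The resolve interface defaults to the interface itself.
        nets.append((iface, resolve or iface))
    return nets, lacp
```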

A trivial iSCSI setup can be brought up by the following series of StorPool CLI commands. See the CLI tutorial for more information about the commands themselves. The setup does the following:

  • has a baseName/IQN of iqn.2019-08.com.example:poc-cluster;

  • has a floating IP address of 192.168.42.247, which is in VLAN 42;

  • two nodes from the cluster will be able to export in this group:

    • node with ID 1, with IP address 192.168.42.246

    • node with ID 3, with IP address 192.168.42.202

  • one client is defined, with IQN iqn.2019-08.com.example:poc-cluster:hv1;

  • one volume, called tinyvolume, will be exported to the defined client in the portal group.

Note

You need to obtain the exact IQN of the initiator, available at:

  • Windows Server: iSCSI initiator, it is automatically generated upon installation

  • VMware vSphere: it is automatically assigned upon creating a software iSCSI adapter

  • Linux-based (XenServer, etc.): /etc/iscsi/initiatorname.iscsi

# storpool iscsi config setBaseName iqn.2019-08.com.example:poc-cluster
OK

# storpool iscsi config portalGroup poc create
OK

# storpool iscsi config portalGroup poc addNet 192.168.42.247/24 vlan 42
OK

# storpool iscsi config portal create portalGroup poc address 192.168.42.246 controller 1
OK

# storpool iscsi config portal create portalGroup poc address 192.168.42.202 controller 3
OK

# storpool iscsi portalGroup list
---------------------------------------
| name | networksCount | portalsCount |
---------------------------------------
| poc  |             1 |            2 |
---------------------------------------

# storpool iscsi portalGroup list portals
--------------------------------------------
| group | address             | controller |
--------------------------------------------
| poc   | 192.168.42.246:3260 |          1 |
| poc   | 192.168.42.202:3260 |          3 |
--------------------------------------------

# storpool iscsi config initiator iqn.2019-08.com.example:poc-cluster:hv1 create
OK

# storpool iscsi config target create tinyvolume
OK

# storpool iscsi config export volume tinyvolume portalGroup poc initiator iqn.2019-08.com.example:poc-cluster:hv1
OK

# storpool iscsi initiator list
----------------------------------------------------------------------------------------------
| name                                    | username | secret | networksCount | exportsCount |
----------------------------------------------------------------------------------------------
| iqn.2019-08.com.example:poc-cluster:hv1 |          |        |             0 |            1 |
----------------------------------------------------------------------------------------------

# storpool iscsi initiator list exports
---------------------------------------------------------------------------------------------------------------------------------------------
| name                                           | volume     | currentControllerId | portalGroup | initiator                               |
---------------------------------------------------------------------------------------------------------------------------------------------
| iqn.2019-08.com.example:poc-cluster:tinyvolume | tinyvolume |                   1 | poc         | iqn.2019-08.com.example:poc-cluster:hv1 |
---------------------------------------------------------------------------------------------------------------------------------------------

Below is a setup with two separate networks that allows for multipath. It uses the 192.168.41.0/24 network on the first interface, 192.168.42.0/24 on the second interface, and the .247 IP for the floating IP in both networks:

# storpool iscsi config setBaseName iqn.2019-08.com.example:poc-cluster
OK

# storpool iscsi config portalGroup poc create
OK

# storpool iscsi config portalGroup poc addNet 192.168.41.247/24
OK

# storpool iscsi config portalGroup poc addNet 192.168.42.247/24
OK

# storpool iscsi config portal create portalGroup poc address 192.168.41.246 controller 1
OK

# storpool iscsi config portal create portalGroup poc address 192.168.42.246 controller 1
OK

# storpool iscsi config portal create portalGroup poc address 192.168.41.202 controller 3
OK

# storpool iscsi config portal create portalGroup poc address 192.168.42.202 controller 3
OK

# storpool iscsi portalGroup list
---------------------------------------
| name | networksCount | portalsCount |
---------------------------------------
| poc  |             2 |            2 |
---------------------------------------

# storpool iscsi portalGroup list portals
--------------------------------------------
| group | address             | controller |
--------------------------------------------
| poc   | 192.168.41.246:3260 |          1 |
| poc   | 192.168.41.202:3260 |          3 |
| poc   | 192.168.42.246:3260 |          1 |
| poc   | 192.168.42.202:3260 |          3 |
--------------------------------------------

Note

Please note that the order in which the networks are added corresponds to the order in SP_ISCSI_IFACE: the first network will be bound to the first interface appearing in this configuration. More on how to list the configured iSCSI interfaces is available here; for how to list the addresses exposed by a particular node, see 9.15.1.  iscsi_tool.

There is no difference in exporting volumes in multi-path setups.

12.3. Routed iSCSI setup

12.3.1. Overview

Layer-3/routed networks present some challenges to the operation of StorPool iSCSI, unlike flat layer-2 networks:

  • routes need to be resolved for destinations based on the kernel routing table, instead of ARP;

  • floating IP addresses for the portal groups need to be accessible to the whole network;

The first task is accomplished by monitoring the kernel’s routing table; the second, with an integrated BGP speaker in storpool_iscsi.

Note

StorPool’s iSCSI does not support Linux’s policy-based routing, and is not affected by iptables, nftables, or any kernel filtering/networking component.

An iSCSI deployment in a layer-3 network has the following general elements:

  • nodes with storpool_iscsi in one or multiple subnets;

  • allocated IP(s) for portal group floating IP addresses;

  • local routing daemon (bird, frr)

  • access to the network’s routing protocol.

The storpool_iscsi daemon connects to a local routing daemon via BGP and announces the floating IPs from the node they are active on. The local routing daemon talks to the network via its own protocol (BGP, OSPF or something else) and passes on the updates.

Note

In a fully routed network, the local routing daemon is also responsible for announcing the IP address for cluster management (managed by storpool_mgmt).

12.3.2. Configuration

The following needs to be added to storpool.conf:

SP_ISCSI_ROUTED=1

In routed networks, when adding the portalGroup floating IP address, you need to specify it as /32.

Note

These are example configurations and may not be the exact fit for a particular setup. Handle with care.

Note

In the examples below, the ASN of the network is 65500, StorPool has been assigned 65512, and will need to announce 192.168.42.247.

To enable the BGP speaker in storpool_iscsi, the following snippet for storpool.conf is needed (the parameters are described in the comment above it):

# ISCSI_BGP_IP:BGP_DAEMON_IP:AS_FOR_ISCSI:AS_FOR_THE_DAEMON
SP_ISCSI_BGP_CONFIG=127.0.0.2:127.0.0.1:65512:65512
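The four colon-separated fields can be split like this (an illustrative helper following the comment above, not part of StorPool):

```python
def parse_iscsi_bgp_config(value):
    """Split SP_ISCSI_BGP_CONFIG into its four fields:
    ISCSI_BGP_IP:BGP_DAEMON_IP:AS_FOR_ISCSI:AS_FOR_THE_DAEMON."""
    iscsi_ip, daemon_ip, iscsi_as, daemon_as = value.split(':')
    return iscsi_ip, daemon_ip, int(iscsi_as), int(daemon_as)
```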

And here’s a snippet from bird.conf for a BGP speaker that talks to StorPool’s iSCSI:

# variables
myas = 65512;
remoteas = 65500;
neigh = 192.168.42.1;

# filter to only export our floating IP

filter spip {
    if (net = 192.168.42.247/32) then accept;
    reject;
}


# external gateway
protocol bgp sw100g1 {
        local as myas;
        neighbor neigh as remoteas;
        import all;
        export filter spip;
        direct;
        gateway direct;
        allow local as;
}

# StorPool iSCSI
protocol bgp spiscsi {
        local as myas;
        neighbor 127.0.0.1 port 2179 as myas;
        import all;
        export all;
        multihop;
        next hop keep;
        allow local as;
}

Note

For protocols other than BGP, please note that StorPool iSCSI exports the route to the floating IP with a next-hop of the IP address configured for the node’s portal, and this information needs to be preserved when announcing the route.

12.4. Caveats with a Complex iSCSI Architecture

In iSCSI portal definitions, a TCP address/port pair must be unique; only a single portal within the whole cluster may be defined at a single IP address and port. Thus, if the same StorPool iSCSI service should be able to export volumes in more than one portal group, the portals should be placed either on different ports or on different IP addresses (although it is fine that these addresses will be brought up on the same network interface on the host).

Note

Even though StorPool supports reusing IPs, separate TCP ports, and so on, the general recommendation is to have a separate VLAN and IP range for each portal group. There are many unknowns with non-standard ports, and security issues with multiple customers in the same VLAN.

The redirecting portal on the floating address of a portal group always listens on port 3260. Similarly to the above, different portal groups must have different floating IP addresses, although they are automatically brought up on the same network interfaces as the actual portals within the groups.

Some iSCSI initiator implementations (e.g. VMware vSphere) may only connect to TCP port 3260 for an iSCSI service. In a more complex setup where a StorPool service on a single host may export volumes in more than one portal group, this might mean that the different portals must reside on different IP addresses, since the port number is the same.

For technical reasons, currently a StorPool volume may only be exported by a single StorPool service (host), even though it may be exported in different portal groups. For this reason, some care should be taken in defining the portal groups so that they may have at least some StorPool services (hosts) in common.

13. Multi site and Multicluster

There are two sets of features allowing connections and operations to be performed on different clusters in the same datacenter (13.1.  Multicluster) or in different locations (13.2.  Multi site).

General distinction between the two:

  • multicluster covers closely packed clusters (i.e. pods or racks) with a fast and low-latency connection between them;

  • multi-site covers clusters in separate locations connected through an insecure and/or high-latency connection.

13.1. Multicluster

The main use case for multicluster is seamless scalability within the same datacenter. A volume can be live-migrated between the different sub-clusters in a location, which is generally referred to as a multicluster setup; this way workloads can be balanced between the sub-clusters.

digraph G {
  rankdir=LR;
  compound=true;
  ranksep=1;
  style=radial;
  bgcolor="white:gray";
  image=svg;
  label="Location A";
  subgraph cluster_a0 {
    style=filled;
    bgcolor="white:lightgrey";
    node [
        style=filled,
        shape=square,
    ];
    bridge0;
    a00 [label="a0.1"];
    a01 [label="a0.2"];
    space0 [label="..."];
    a03 [label="a0.N"];
    label = "Cluster A0";
  }

  subgraph cluster_a1 {
    style=filled;
    bgcolor="white:lightgrey";
    node [
        style=filled,
        shape=square,
    ];
    bridge1;
    a10 [label="a1.1"];
    a11 [label="a1.2"];
    space1 [label="..."];
    a13 [label="a1.N"];
    label = "Cluster A1";
  }

  subgraph cluster_a2 {
    style=filled;
    bgcolor="white:lightgrey";
    node [
        style=filled,
        shape=square,
    ];
    bridge2;
    a20 [label="a2.1"];
    a21 [label="a2.2"];
    space2 [label="..."];
    a23 [label="a2.N"];
    label = "Cluster A2";
  }

  bridge0 -> bridge1 [dir=both, lhead=cluster_a1, ltail=cluster_a0];
  bridge1 -> bridge2 [dir=both, lhead=cluster_a2, ltail=cluster_a1];
  bridge0 -> bridge2 [dir=both, lhead=cluster_a2, ltail=cluster_a0];
// was:
//   bridge0 -> bridge1 [color="red", lhead=cluster_a1, ltail=cluster_a0];
//   bridge1 -> bridge0 [color="blue", lhead=cluster_a0, ltail=cluster_a1];
//   bridge1 -> bridge2 [color="red", lhead=cluster_a2, ltail=cluster_a1];
//   bridge2 -> bridge1 [color="blue", lhead=cluster_a1, ltail=cluster_a2];
//   bridge0 -> bridge2 [color="red", lhead=cluster_a2, ltail=cluster_a0];
//   bridge2 -> bridge0 [color="blue", lhead=cluster_a0, ltail=cluster_a2];

}

13.2. Multi site

Remotely connected clusters in different locations are referred to as multi site. When two remote clusters are connected, they can efficiently transfer snapshots between each other. The usual use case is remote backup and disaster recovery (DR).

digraph G {
  rankdir=LR;
  compound=true;
  ranksep=2;
  image=svg;
  subgraph cluster_loc_a {
    style=radial;
    bgcolor="white:gray";
    node [
        style=filled,
        color="white:lightgrey",
        shape=square,
    ];
    a0 [label="Cluster A0"];
    a1 [label="Cluster A1"];
    a2 [label="Cluster A2"];
    label = "Location A";
  }

  subgraph cluster_loc_b {
    style=filled;
    color=grey;
    node [
        style=filled,
        color="white:grey",
        shape=square,
    ];
    b0 [label="Cluster B0"];
    b1 [label="Cluster B1"];
    b2 [label="Cluster B2"];
    label = "Location B";
  }

  a1 -> b1 [color="red", lhead=cluster_loc_b, ltail=cluster_loc_a];
  b1 -> a1 [color="blue", lhead=cluster_loc_a, ltail=cluster_loc_b];
}

13.3. Setup

Connecting clusters regardless of their locations requires the storpool_bridge service to be running on at least two nodes in each cluster.

Each node running the storpool_bridge needs the following parameters to be configured in /etc/storpool.conf or /etc/storpool.conf.d/*.conf files:

SP_CLUSTER_NAME=<Human readable name of the cluster>
SP_CLUSTER_ID=<location ID>.<cluster ID>
SP_BRIDGE_HOST=<IP address>
SP_BRIDGE_TEMPLATE=<template>
SP_BRIDGE_IFACE=<interface> # optional with IP failover

The SP_CLUSTER_NAME is a mandatory human-readable name for this cluster.

The SP_CLUSTER_ID is a unique ID assigned by StorPool to each existing cluster (example: nmjc.b). The cluster ID consists of two parts:

nmjc.b
|     `sub-cluster ID
`location ID

The first part before the dot (nmjc) is the location ID; the part after the dot (b) is the sub-cluster ID.
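The split can be sketched in a few lines (illustrative only; it assumes a single dot, as in the nmjc.b example):

```python
def split_cluster_id(cluster_id):
    """Split an SP_CLUSTER_ID such as 'nmjc.b' into (location ID, sub-cluster ID)."""
    location, _, sub = cluster_id.partition('.')
    if not location or not sub:
        raise ValueError('expected <location ID>.<cluster ID>, got %r' % cluster_id)
    return location, sub
```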

The SP_BRIDGE_HOST is the IP address on which to listen for connections from other bridges. Note that TCP port 3749 should be open in the firewalls between the two locations.

The SP_BRIDGE_TEMPLATE is needed to instruct the local bridge which template should be used for incoming snapshots.

The SP_BRIDGE_IFACE is required when two or more bridges are configured with the same public/private key pairs. The SP_BRIDGE_HOST in this case is a floating IP address and will be configured on the SP_BRIDGE_IFACE on the host with the active bridge.

13.4. Connecting two clusters

In this example there are two clusters, named Cluster_A and Cluster_B. To have these two connected through their bridge services we have to introduce each of them to the other.

digraph G {
  rankdir=LR;
  image=svg;
  subgraph cluster_a {
    style=filled;
    color=lightgrey;
    node [
        style=filled,
        color=white,
        shape=square,
        label="Bridge A\n\nSP_CLUSTER_ID = locationAId.aId\nSP_BRIDGE_HOST = 10.10.10.1\npublic_key: aaaa.bbbb.cccc.dddd\n",
    ];
    bridge0;
    label = "Cluster A";
  }

  subgraph cluster_b {
    style=filled;
    color=grey;
    node [
        style=filled,
        color=white,
        shape=square,
        label="Bridge B\n\nSP_CLUSTER_ID = locationBId.bId\nSP_BRIDGE_HOST = 10.10.20.1\npublic_key: eeee.ffff.gggg.hhhh\n"
    ];
    bridge1;
    label = "Cluster B";
  }
  bridge0 -> bridge1 [dir=none color=none]
}

Note

In case of a multicluster setup the location will be the same for both clusters. The procedure is the same in both cases, with the slight difference that in a multicluster setup the remote bridges are usually configured with noCrypto.

13.4.1. Cluster A

The following parameters from Cluster_B will be required:

  • The SP_CLUSTER_ID - locationBId.bId

  • The SP_BRIDGE_HOST IP address - 10.10.20.1

  • The public key located in /usr/lib/storpool/bridge/bridge.key.txt on the remote bridge host in Cluster_B - eeeeeeeeeeeee.ffffffffffff.ggggggggggggg.hhhhhhhhhhhhh

Using the CLI, we can add Cluster_B’s location with the following commands in Cluster_A:

user@hostA # storpool location add locationBId location_b
user@hostA # storpool cluster add location_b bId
user@hostA # storpool cluster list
--------------------------------------------
| name                 | id   | location   |
--------------------------------------------
| location_b-cl1       | bId  | location_b |
--------------------------------------------

The remote name is location_b-cl1, where the clN number is automatically generated based on the cluster ID. The last step in Cluster_A is to register the Cluster_B’s bridge. The command looks like this:

user@hostA # storpool remoteBridge register location_b-cl1 10.10.20.1 eeeeeeeeeeeee.ffffffffffff.ggggggggggggg.hhhhhhhhhhhhh

Registered bridges in Cluster_A:

user@hostA # storpool remoteBridge list
----------------------------------------------------------------------------------------------------------------------------
| ip             | remote         | minimumDeleteDelay | publicKey                                              | noCrypto |
----------------------------------------------------------------------------------------------------------------------------
| 10.10.20.1     | location_b-cl1 |                    | eeeeeeeeeeeee.ffffffffffff.ggggggggggggg.hhhhhhhhhhhhh | 0        |
----------------------------------------------------------------------------------------------------------------------------

Hint

The public key in /usr/lib/storpool/bridge/bridge.key.txt will be generated on the first run of the storpool_bridge service.

Note

The noCrypto option is usually set to 1 in a multicluster setup with a secure datacenter network, for higher throughput and lower latency during migrations.

digraph G {
  rankdir=LR;
  image=svg;
  compound=true;
  ranksep=2;
  subgraph cluster_a {
    style=filled;
    color=lightgrey;
    node [
        style=filled,
        color=white,
        shape=square,
        label="Bridge A",
    ];
    bridge0;
    label = "Cluster A";
  }

  subgraph cluster_b {
    style=filled;
    color=grey;
    node [
        style=filled,
        color=white,
        shape=square,
        label="Bridge B"
    ];
    bridge1;
    label = "Cluster B";
  }
  bridge0 -> bridge1 [color="red", lhead=cluster_b, ltail=cluster_a];
//   bridge1 -> bridge0 [color="blue", lhead=cluster_a, ltail=cluster_b];
}

13.4.2. Cluster B

Similarly the parameters from Cluster_A will be required for registering the location, cluster and bridge(s) in Cluster B:

  • The SP_CLUSTER_ID - locationAId.aId

  • The SP_BRIDGE_HOST IP address in Cluster_A - 10.10.10.1

  • The public key in /usr/lib/storpool/bridge/bridge.key.txt on the remote bridge host in Cluster_A - aaaaaaaaaaaaa.bbbbbbbbbbbb.ccccccccccccc.ddddddddddddd

Similarly, the commands will be:

user@hostB # storpool location add locationAId location_a
user@hostB # storpool cluster add location_a aId
user@hostB # storpool cluster list
--------------------------------------------
| name                 | id   | location   |
--------------------------------------------
| location_a-cl1       | aId  | location_a |
--------------------------------------------
user@hostB # storpool remoteBridge register location_a-cl1 10.10.10.1 aaaaaaaaaaaaa.bbbbbbbbbbbb.ccccccccccccc.ddddddddddddd
user@hostB # storpool remoteBridge list
----------------------------------------------------------------------------------------------------------------------------
| ip             | remote         | minimumDeleteDelay | publicKey                                              | noCrypto |
----------------------------------------------------------------------------------------------------------------------------
| 10.10.10.1     | location_a-cl1 |                    | aaaaaaaaaaaaa.bbbbbbbbbbbb.ccccccccccccc.ddddddddddddd | 0        |
----------------------------------------------------------------------------------------------------------------------------

At this point, provided network connectivity is working, the two bridges will be connected.

digraph G {
  rankdir=LR;
  image=svg;
  compound=true;
  ranksep=2;
  subgraph cluster_a {
    style=filled;
    color=lightgrey;
    node [
        style=filled,
        color=white,
        shape=square,
        label="Bridge A",
    ];
    bridge0;
    label = "Cluster A";
  }

  subgraph cluster_b {
    style=filled;
    color=grey;
    node [
        style=filled,
        color=white,
        shape=square,
        label="Bridge B"
    ];
    bridge1;
    label = "Cluster B";
  }
  bridge0 -> bridge1 [color="red", lhead=cluster_b, ltail=cluster_a];
  bridge1 -> bridge0 [color="blue", lhead=cluster_a, ltail=cluster_b];
}

13.5. Bridge redundancy

There are two ways to add redundancy for the bridge service, both of which involve configuring and starting the storpool_bridge service on two (or more) nodes in each cluster.

In both cases only one bridge is active at a time; it fails over when the node or the active service is restarted.

13.5.1. Separate IP addresses

Configure and start storpool_bridge with a separate SP_BRIDGE_HOST address and a separate public/private key pair on each node. In this case each of the bridge nodes has to be registered in the same way as explained in the 13.4.  Connecting two clusters section. The SP_BRIDGE_IFACE parameter is left unset, and the SP_BRIDGE_HOST address is expected to be present on each of the nodes where the service is started.

In this case each of the bridge nodes in ClusterA would have to be configured in ClusterB and vice-versa.

13.5.2. Single IP failed over between the nodes

For this, configure and start the storpool_bridge service on the first node. Then distribute the /usr/lib/storpool/bridge/bridge.key and /usr/lib/storpool/bridge/bridge.key.txt files to the next node where the storpool_bridge service will be running.

The SP_BRIDGE_IFACE is required and represents the interface on which the SP_BRIDGE_HOST address will be configured. The SP_BRIDGE_HOST will be up only on the node where the active bridge service is running, until either the service or the node itself gets restarted.

With this configuration there will be only one bridge registered in the remote cluster(s), regardless of the number of nodes with running storpool_bridge in the local cluster.

The failover SP_BRIDGE_HOST is better suited for NAT/port-forwarding cases.

13.6. Bridge throughput performance

The throughput performance of a bridge connection depends on several factors (in no particular order): network throughput, network latency, CPU speed and disk latency. Each could become a bottleneck and may require additional tuning in order to get higher throughput out of the available link between the two sites.

13.6.1. Network

For high-throughput links, latency is the most important factor for achieving high link utilization. For example, a low-latency 10 Gbps link is easily saturated (provided crypto is off), but when the latency is higher, some tuning of the TCP window size is required. The same applies to lower-bandwidth links with higher latency.

In these cases the send buffer size can be increased in small increments until the TCP window is optimal; check the Location section for more information on how to update the send buffer size in each location.

Note

To find the best send buffer size for throughput from the primary to the backup site, fill a volume with data in the primary (source) site, then create a backup to the backup (remote) site. While observing the bandwidth utilized, increase the send buffers in small increments in both the source and the destination cluster until the throughput either stops rising or reaches an acceptable level.

Note that increasing the send buffers above this value can lead to delays when recovering a backup in the opposite direction.

Further sysctl changes might be required depending on the NIC driver; for more information, check /usr/share/doc/storpool/examples/bridge/90-StorPoolBridgeTcp.conf on the node running the storpool_bridge service.

13.6.2. CPU

The CPU usually becomes a bottleneck only when crypto is on; in that case it may help to move the bridge service to a node with a faster CPU.

If a faster CPU is not available in the same cluster, setting SP_BRIDGE_SLEEP_TYPE to hsleep or even no might help. Note that with this setting, storpool_cg will attempt to isolate a full CPU core (i.e. with the second hardware thread free from other processes).

13.6.3. Disks throughput

The default remote recovery setting (SP_REMOTE_RECOVERY_PARALLEL_REQUESTS_PER_DISK) is relatively low, especially for dedicated backup clusters, so the underlying disks in the receiving cluster may be left underutilized (this does not happen with flash media) and the setting becomes the bottleneck. The parameter can be tuned for higher parallelism. For example, in a small cluster of 3 nodes with 8 disks each, the default translates to a queue depth of 48 from the bridge, while there are 8 * 3 * 32 requests available from the underlying disks and 2048 requests available from the bridge service (by default with a 10 Gbps link; 256 on a 1 Gbps link).
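The arithmetic in this example can be sketched as follows; the per-disk value of 2 parallel recovery requests is inferred from the 48-request figure above, not an authoritative default:

```shell
# 3 nodes with 8 disks each; per-disk recovery parallelism of 2 is inferred
# from the example above (24 disks * 2 = 48).
nodes=3; disks_per_node=8; per_disk=2
disks=$((nodes * disks_per_node))
echo "bridge recovery queue depth: $((disks * per_disk))"   # 48
echo "disk queue depth available:  $((disks * 32))"         # 768
```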

Note

The storpool_server services require a restart for the changes to take effect.

13.7. Exports

A snapshot in one of the clusters can be exported, making it visible to all clusters in the location it was exported to. For example, a snapshot called snap1 can be exported with:

user@hostA # storpool snapshot snap1 export location_b

The snapshot becomes visible in Cluster_B, which is part of location_b, and can be listed with:

user@hostB # storpool snapshot list remote
-------------------------------------------------------------------------------------------------------
| location   | remoteId             | name     | onVolume | size         | creationTimestamp   | tags |
-------------------------------------------------------------------------------------------------------
| location_b | locationAId.aId.1    | snap1    |          | 107374182400 | 2019-08-11 15:18:02 |      |
-------------------------------------------------------------------------------------------------------

A snapshot may also be exported to the location of the source cluster where it resides; this way it becomes visible to all sub-clusters in that location.

13.8. Remote clones

Any exported snapshot can be cloned locally. For example, to clone a remote snapshot with a globalId of locationAId.aId.1 locally, we could use:

user@hostB # storpool snapshot snap1_clone template hybrid remote location_a locationAId.aId.1
[Diagram: snap1 (locationAId.aId.1) in Cluster A is transferred through Bridge A and Bridge B to Cluster B, where it appears as snap1_clone with the same globalId.]

The name of the clone of the snapshot in Cluster_B will be snap1_clone with all parameters from the hybrid template.

Note

The name of the snapshot in Cluster_B can also be exactly the same in all sub-clusters in a multicluster setup, as well as in clusters in different locations in a multi-site setup.

The transfer will start immediately. Only written parts of the snapshot will be transferred between the sites. If snap1 has a size of 100 GB but only 1 GB of data was ever written to the volume before it was snapshotted, only approximately 1 GB will eventually be transferred between the two (sub-)clusters.

If another snapshot snap2 based on snap1 is later exported and cloned, the actual transfer will again include only the differences between snap1 and snap2, since snap1 already exists in Cluster_B.

[Diagram: Cluster A holds snap1 (locationAId.aId.1) and its child snap2 (locationAId.aId.2); Cluster B already holds snap1_clone, so only the differences between snap1 and snap2 are transferred through Bridge A and Bridge B to create snap2_clone.]

The globalId for this snapshot will be the same for all sites it has been transferred to.

13.9. Creating a remote backup on a volume

The volume backup feature is in essence a set of steps that automate the backup procedure for a particular volume.

For example, to back up a volume named volume1 from Cluster_A to Cluster_B we will use:

user@hostA # storpool volume volume1 backup Cluster_B

The above command will actually trigger the following set of events:

  1. Creates a local temporary snapshot of volume1 in Cluster_A to be transferred to Cluster_B

  2. Exports the temporary snapshot to Cluster_B

  3. Instructs Cluster_B to initiate the transfer for this snapshot

  4. Exports the transferred snapshot in Cluster_B to be visible from Cluster_A

  5. Deletes the local temporary snapshot

If a backup operation has been initiated for a volume called volume1 in Cluster_A, the progress of the operation can be followed with:

user@hostA # storpool snapshot list exports
-------------------------------------------------------------
| location   | snapshot     | globalId          | backingUp |
-------------------------------------------------------------
| location_b | volume1@1433 | locationAId.aId.p | true      |
-------------------------------------------------------------

Once this operation completes the temporary snapshot will no longer be visible as an export and a snapshot with the same globalId will be visible remotely:

user@hostA # storpool snapshot list remote
-----------------------------------------------------------------------------------------------
| location   | remoteId          | name    | onVolume    | size         | creationTimestamp   |
-----------------------------------------------------------------------------------------------
| location_b | locationAId.aId.p | volume1 | volume1     | 107374182400 | 2019-08-13 16:27:03 |
-----------------------------------------------------------------------------------------------

13.10. Creating an atomic remote backup for multiple volumes

Sometimes a set of volumes is used simultaneously in the same virtual machine; an example would be separate filesystems for a database and its journal. To be able to restore all volumes to the same point in time, a group backup can be initiated:

user@hostA # storpool volume groupBackup Cluster_B volume1 volume2

Note

The same underlying feature is used by VolumeCare for keeping consistent snapshots of all volumes on a virtual machine.

13.11. Restoring a volume from remote snapshot

Restoring the volume to a previous state from a remote snapshot requires the following steps:

  1. Create a local snapshot from the remotely exported one:

    user@hostA # storpool snapshot volume1-snap template hybrid remote location_b locationAId.aId.p
    OK
    

The parts of the above command, from left to right:

  • volume1-snap - name of the local snapshot that will be created.

  • template hybrid - instructs StorPool what will be the replication and placement for the locally created snapshot.

  • remote location_b locationAId.aId.p - instructs StorPool where to look for this snapshot and what its globalId is.

If the bridges and the connection between the locations are operational, the transfer will begin immediately.

  2. Next, create a volume with the newly created snapshot as a parent:

    user@hostA # storpool volume volume1-tmp parent volume1-snap

  3. Finally, attach the volume clone where it is needed.

The last two steps can be altered to rename the old volume to something different and create a volume with the original name directly from the restored snapshot; this is handled differently by different orchestration systems. Restoring multiple volumes from a group backup requires the same set of steps.
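As an illustration of that variant, assuming the CLI's rename subcommand and using hypothetical volume and snapshot names:

```
user@hostA # storpool volume volume1 rename volume1-old     # keep the old data around
user@hostA # storpool volume volume1 parent volume1-snap    # re-create under the original name
```

After the restored volume1 is attached and verified, volume1-old can be removed.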

See VolumeCare 5.4.  revert for an example implementation.

Note

From release 19.01 onwards, if the snapshot transfer has not completed when the volume is created, read operations on objects that are not yet transferred are forwarded through the bridge and processed by the remote cluster.

13.12. Remote deferred deletion

Note

This feature is available for both multicluster and multi-site configurations. Note that minimumDeleteDelay is per bridge, not per location; thus all bridges to a remote location should be (re)registered with this setting.

A remote bridge can be registered with remote deferred deletion enabled. This feature allows a user in Cluster_A to unexport remote snapshots in Cluster_B and set them for deferred deletion.

Consider first the case without deferred deletion enabled: Cluster_A and Cluster_B are two StorPool clusters in locations A and B connected with a bridge. A volume named volume1 in Cluster_A has two backup snapshots in Cluster_B, called volume1@281 and volume1@294.

[Diagram: volume1 in Cluster A; its backup snapshots volume1@281 and volume1@294 in Cluster B; the two clusters are connected through Bridge A and Bridge B.]

In this configuration the remote snapshots can be unexported from Cluster_A with the deleteAfter flag, but the flag will be silently ignored in Cluster_B.

To enable the feature, the following steps have to be completed for the remote bridge of Cluster_A:

  1. The bridge in Cluster_A should be registered with minimumDeleteDelay in Cluster_B.

  2. Deferred snapshot deletion should be enabled in Cluster_B (check Management configuration for more on this)

This enables setting the deleteAfter parameter on an unexport operation in Cluster_B initiated from Cluster_A.

With the above example volume and remote snapshots, a user in Cluster_A could unexport the volume1@294 snapshot and set its deleteAfter flag to 7 days from the unexport with:

user@hostA # storpool snapshot remote location_b locationAId.aId.q unexport deleteAfter 7d
OK

After the completion of this operation the following events will occur:

  • The volume1@294 snapshot will immediately stop being visible in Cluster_A.

  • The snapshot will get a deleteAfter flag with a timestamp one week after the unexport call.

  • A week later the snapshot will be deleted, but only if deferred snapshot deletion is still turned on.

13.13. Volume and snapshot move

13.13.1. Volume move

A volume can be moved to a neighboring sub-cluster in a multicluster environment either while attached (live) or without attachments (offline). This is available only in multicluster mode and is not possible between sites in a multi-site setup, where only snapshots can be transferred.

To move a volume use:

# storpool volume <volumeName> moveToRemote <clusterName>

The above command succeeds only if the volume is not attached on any of the nodes in the sub-cluster. To move the volume live while it is still attached, the additional onAttached option instructs the cluster how to proceed. For example, this command:

Lab-D-cl1> volume test moveToRemote Lab-D-cl2 onAttached export

will move the volume to the Lab-D-cl2 sub-cluster and, if the volume is attached in the present cluster, will export it back to Lab-D-cl1.

This is equivalent to:

Lab-D-cl1> multiCluster on
[MC] Lab-D-cl1> cluster cmd Lab-D-cl2 attach volume test client 12
OK

or to executing the same CLI command directly in multicluster mode on a host in the Lab-D-cl2 cluster.

Note

Moving a volume also triggers moving all of its snapshots. Parent snapshots with many child volumes may, as a space-saving measure, end up in each sub-cluster that their child volumes were moved to.

13.13.2. Snapshot move

Moving a snapshot is essentially the same as moving a volume, with the difference that it cannot be moved when attached.

For example:

Lab-D-cl1> snapshot testsnap moveToRemote Lab-D-cl2

will succeed only if the snapshot is not attached locally.

Moving a snapshot that is part of a volume's snapshot chain will also trigger copying its parent snapshots, which is managed automatically by the cluster.

14. Rebalancing StorPool

14.1. Overview

In some situations the data in the StorPool cluster needs to be rebalanced. This is performed by the balancer and the relocator tools. The relocator is an integral part of the StorPool management service, while the balancer is presently an external tool executed on one of the nodes with access to the API.

Note

Be advised that the balancer tool will create some files it needs in the current working directory.

The rebalancing operation is performed in the following steps:

  • The balancer tool is executed, to calculate the new state of the cluster;

  • The results from the balancer are verified by a set of automated scripts;

  • The results are also manually reviewed to check whether they contain any inconsistencies and whether they achieve the intended goals. These results are available by running storpool balancer disks and will be printed at the end of balancer.sh.

    • If the result is not satisfactory, the balancer is executed with different parameters, until a satisfactory result is obtained;

  • Once the proposed end result is satisfactory, the calculated state is loaded into the relocator tool by running storpool balancer commit;

    • Please note that this step IS NOT REVERSIBLE.

  • The relocator tool performs the actual move of the data.

    • The progress of the relocator tool can be monitored by storpool task list for the currently running tasks, storpool relocator status for an overview of the relocator state and storpool relocator disks (warning: slow command) for the full relocation state.

The balancer tool is executed via the /usr/lib/storpool/balancer.sh wrapper and accepts the following options:

option                          meaning

-g placementGroup               Work only on the specified placement group.
-c factor                       Factor for how much data to try to move around, from 0 to 10. No default; this parameter is required.
-f percent                      Allow drives to be filled up to this percentage, from 0 to 99. Default: 90.
-M maxDataToAdd                 Limit the amount of data to copy to a single drive, to be able to rebalance "in pieces".
-m maxAgCount                   Limit the maximum allocation group count on drives (effectively their used size).
-b placementGroup               Use disks in the specified placement group to restore replication in critical conditions.
-F                              Only move data from fuller to emptier drives (the maximum -c factor is 3 when -F is used).
-R                              Only restore replication for degraded volumes.
-d diskId [-d diskId]           Put data only on the selected disks.
-D diskId [-D diskId]           Don't move data from these disks.
--only-empty-disk diskId        Like -D for all other disks.
-V vagId [-V vagId]             Skip balancing vagId.
-S                              Prefer tail SSD.
-o overridesPgName              Specify the override placement group name (required only if the override template is not created).
--min-disk-full X               Don't remove data from a disk unless it is at least X% full.
--ignore-src-pg-violations      Exactly what it says.
--min-replication R             Minimum replication required.
--restore-state                 Revert to the initial state of the disks (before the balancer commit execution).

-R and -F are mutually exclusive.

The -c value controls the trade-off between how well balanced the placement groups end up and the amount of data moved to accomplish that. A lower factor means less data moved around but sometimes more inequality between the data on the disks; a higher factor means more data moved, but sometimes a better result in terms of how equally data is spread over the drives.

On clusters with drives of unsupported size (HDDs larger than 4 TB) the -m option is required. It limits the data moved onto these drives to the set number of allocation groups. This is needed because the performance per TB of larger drives is lower, which degrades the performance of the whole cluster in high-performance use cases.

The -M option helps if you want to limit the amount of data that the balancer would move around, to limit the load on the system created by the rebalancing.

The -f option is required on clusters whose drives are full above 95%. Extreme care should be used when balancing in such cases.

The -b option can be used to move data between placement groups (in most cases from SSDs to HDDs).

14.2. Restoring volume redundancy on a failed drive

Situation: we have lost drive 1802 in placementGroup ssd. We want to remove it from the cluster and restore the redundancy of the data. We need to do the following:

storpool disk 1802 forget                               # this will also remove the drive from all placement groups it participated in
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer  # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -R
storpool balancer commit                                # to actually load the data into the relocator and start the re-balancing operation

14.3. Adding new drives and rebalancing data on them

Situation: we have added SSDs 1201, 1202 and HDDs 1510, 1511, that need to go into placement groups ssd and hdd respectively, and we want to re-balance the cluster data so that it is re-dispersed onto the new disks as well. We have no other placement groups in the cluster.

storpool placementGroup ssd addDisk 1201 addDisk 1202
storpool placementGroup hdd addDisk 1510 addDisk 1511
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer  # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -F -c 0                   # rebalance all placement groups, move data from fuller to emptier drives
storpool balancer commit                                # to actually load the data into the relocator and start the re-balancing operation

14.4. Restoring volume redundancy with rebalancing data on other placementGroup

Situation: we have to restore the redundancy of a hybrid cluster (2 copies on HDDs, one on SSDs) while the ssd placementGroup is out of free space because a few SSDs have recently failed. We can’t replace the failed drives with new ones for the moment.

mkdir -p ~/storpool/balancer && cd ~/storpool/balancer  # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -R -c 0 -b hdd            # use placementGroup ``hdd`` as a backup and move some data from SSDs
storpool balancer commit                                # to actually load the data into the relocator and start the re-balancing operation

Note

The -f argument can further be used to instruct the balancer how full to keep the drives, and thus to control how much data will be moved to the backup placement group.

14.5. Decommissioning a live node

Situation: a node in the cluster needs to be decommissioned, so the data on its drives needs to be moved away. The drive numbers on that node are 101, 102 and 103.

Note

You have to make sure you have enough space to restore the redundancy before proceeding.

storpool disk 101 softEject                             # mark all drives for evacuation
storpool disk 102 softEject
storpool disk 103 softEject
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer  # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -R -c 0                   # rebalance all placement groups, -F has the same effect in this case
storpool balancer commit                                # to actually load the data into the relocator and start the re-balancing operation

14.6. Decommissioning a dead node

Situation: a node in the cluster needs to be decommissioned, as it has died and cannot be brought back. The drive numbers on that node are 101, 102 and 103.

Note

You have to make sure you have enough space to restore the redundancy before proceeding.

storpool disk 101 forget                                # remove the drives from all placement groups
storpool disk 102 forget
storpool disk 103 forget
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer  # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -R -c 0                   # rebalance all placement groups
storpool balancer commit                                # to actually load the data into the relocator and start the re-balancing operation

14.7. Resolving imbalances in the drive usage

Situation: we have an imbalance in the drive usage in the whole cluster and we want to improve it.

mkdir -p ~/storpool/balancer && cd ~/storpool/balancer  # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -F -c 0                   # rebalance all placement groups
/usr/lib/storpool/balancer.sh -F -c 3                   # retry to see if we get a better result with more data movements
storpool balancer commit                                # to actually load the data into the relocator and start the re-balancing operation

14.8. Reading the output of storpool balancer disks

Here is an example output from storpool balancer disks:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|     disk | server |   size   |                  stored                  |                 on-disk                  |                     objects                      |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|        1 |   14.0 |   373 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.1 MB -> 6.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 405000  |
|     1101 |   11.0 |   447 GB |    16 GB -> 15 GB    (-1.0 GB / 1.4 GB)  |    18 GB -> 17 GB    (-1.1 GB / 1.4 GB)  |   11798 -> 10040     (-1758 / +3932)   / 480000  |
|     1102 |   11.0 |   447 GB |    16 GB -> 15 GB    (-268 MB / 1.3 GB)  |    17 GB -> 17 GB    (-301 MB / 1.4 GB)  |   10843 -> 10045      (-798 / +4486)   / 480000  |
|     1103 |   11.0 |   447 GB |    16 GB -> 15 GB    (-1.0 GB / 1.8 GB)  |    18 GB -> 16 GB    (-1.2 GB / 1.9 GB)  |   12123 -> 10039     (-2084 / +3889)   / 480000  |
|     1104 |   11.0 |   447 GB |    16 GB -> 15 GB    (-757 MB / 1.3 GB)  |    17 GB -> 16 GB    (-899 MB / 1.3 GB)  |   11045 -> 10072      (-973 / +4279)   / 480000  |
|     1111 |   11.0 |   466 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   5.1 MB -> 5.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 495000  |
|     1112 |   11.0 |   366 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   5.1 MB -> 5.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 390000  |
|     1121 |   11.0 |   931 GB |    22 GB -> 21 GB    (-1009 MB / 830 MB)  |    22 GB -> 21 GB    (-1.0 GB / 872 MB)  |   13713 -> 12698     (-1015 / +3799)   / 975000  |
|     1122 |   11.0 |   931 GB |    21 GB -> 21 GB    (-373 MB / 2.0 GB)  |    22 GB -> 21 GB    (-379 MB / 2.0 GB)  |   13469 -> 12742      (-727 / +3801)   / 975000  |
|     1123 |   11.0 |   931 GB |    22 GB -> 21 GB    (-1.1 GB / 1.9 GB)  |    22 GB -> 21 GB    (-1.1 GB / 2.0 GB)  |   14859 -> 12629     (-2230 / +4102)   / 975000  |
|     1124 |   11.0 |   931 GB |    21 GB -> 21 GB      (36 MB / 1.8 GB)  |    21 GB -> 21 GB      (92 MB / 1.9 GB)  |   13806 -> 12743     (-1063 / +3389)   / 975000  |
|     1201 |   12.0 |   447 GB |    18 GB -> 15 GB    (-2.9 GB / 633 MB)  |    19 GB -> 16 GB    (-3.0 GB / 658 MB)  |   14148 -> 10070     (-4078 / +3050)   / 480000  |
|     1202 |   12.0 |   447 GB |    17 GB -> 15 GB    (-2.1 GB / 787 MB)  |    19 GB -> 16 GB    (-2.3 GB / 815 MB)  |   13243 -> 10067     (-3176 / +2576)   / 480000  |
|     1203 |   12.0 |   447 GB |    17 GB -> 15 GB    (-2.0 GB / 3.3 GB)  |    19 GB -> 16 GB    (-2.4 GB / 3.5 GB)  |   12746 -> 10062     (-2684 / +3375)   / 480000  |
|     1204 |   12.0 |   447 GB |    18 GB -> 15 GB    (-2.7 GB / 1.1 GB)  |    19 GB -> 16 GB    (-2.9 GB / 1.1 GB)  |   12835 -> 10075     (-2760 / +3248)   / 480000  |
|     1212 |   12.0 |   366 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.0 MB -> 6.0 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 390000  |
|     1221 |   12.0 |   931 GB |    20 GB -> 21 GB     (569 MB / 1.5 GB)  |    21 GB -> 21 GB     (587 MB / 1.6 GB)  |   13115 -> 12616      (-499 / +3736)   / 975000  |
|     1222 |   12.0 |   931 GB |    22 GB -> 21 GB    (-979 MB / 307 MB)  |    22 GB -> 21 GB    (-1013 MB / 317 MB)  |   12938 -> 12697      (-241 / +3291)   / 975000  |
|     1223 |   12.0 |   931 GB |    22 GB -> 21 GB    (-1.1 GB / 781 MB)  |    22 GB -> 21 GB    (-1.2 GB / 812 MB)  |   13968 -> 12718     (-1250 / +3302)   / 975000  |
|     1224 |   12.0 |   931 GB |    21 GB -> 21 GB    (-784 MB / 332 MB)  |    22 GB -> 21 GB    (-810 MB / 342 MB)  |   13741 -> 12692     (-1049 / +3314)   / 975000  |
|     1225 |   12.0 |   931 GB |    21 GB -> 21 GB    (-681 MB / 849 MB)  |    22 GB -> 21 GB    (-701 MB / 882 MB)  |   13608 -> 12748      (-860 / +3420)   / 975000  |
|     1226 |   12.0 |   931 GB |    22 GB -> 21 GB    (-1.1 GB / 825 MB)  |    22 GB -> 21 GB    (-1.1 GB / 853 MB)  |   13066 -> 12692      (-374 / +3817)   / 975000  |
|     1301 |   13.0 |   447 GB |    13 GB -> 15 GB     (2.6 GB / 4.2 GB)  |    14 GB -> 17 GB     (2.7 GB / 4.4 GB)  |    7244 -> 10038     (+2794 / +6186)   / 480000  |
|     1302 |   13.0 |   447 GB |    12 GB -> 15 GB     (3.0 GB / 3.7 GB)  |    13 GB -> 17 GB     (3.1 GB / 3.9 GB)  |    7507 -> 10063     (+2556 / +5619)   / 480000  |
|     1303 |   13.0 |   447 GB |    14 GB -> 15 GB     (1.3 GB / 3.2 GB)  |    15 GB -> 17 GB     (1.3 GB / 3.4 GB)  |    7888 -> 10038     (+2150 / +5884)   / 480000  |
|     1304 |   13.0 |   447 GB |    13 GB -> 15 GB     (2.7 GB / 3.7 GB)  |    14 GB -> 17 GB     (2.8 GB / 3.9 GB)  |    7660 -> 10045     (+2385 / +5870)   / 480000  |
|     1311 |   13.0 |   466 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.1 MB -> 6.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 495000  |
|     1312 |   13.0 |   366 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.1 MB -> 6.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 390000  |
|     1321 |   13.0 |   931 GB |    21 GB -> 21 GB    (-193 MB / 1.1 GB)  |    21 GB -> 21 GB    (-195 MB / 1.2 GB)  |   13365 -> 12765      (-600 / +5122)   / 975000  |
|     1322 |   13.0 |   931 GB |    22 GB -> 21 GB    (-1.4 GB / 1.1 GB)  |    23 GB -> 21 GB    (-1.4 GB / 1.1 GB)  |   12749 -> 12739       (-10 / +4651)   / 975000  |
|     1323 |   13.0 |   931 GB |    21 GB -> 21 GB    (-504 MB / 2.2 GB)  |    22 GB -> 21 GB    (-496 MB / 2.3 GB)  |   13386 -> 12695      (-691 / +4583)   / 975000  |
|     1325 |   13.0 |   931 GB |    21 GB -> 20 GB    (-698 MB / 557 MB)  |    22 GB -> 21 GB    (-717 MB / 584 MB)  |   13113 -> 12768      (-345 / +2668)   / 975000  |
|     1326 |   13.0 |   931 GB |    21 GB -> 21 GB    (-507 MB / 724 MB)  |    22 GB -> 21 GB    (-522 MB / 754 MB)  |   13690 -> 12704      (-986 / +3327)   / 975000  |
|     1401 |   14.0 |   223 GB |   8.3 GB -> 7.6 GB   (-666 MB / 868 MB)  |   9.3 GB -> 8.5 GB   (-781 MB / 901 MB)  |    3470 -> 5043      (+1573 / +2830)   / 240000  |
|     1402 |   14.0 |   447 GB |   9.8 GB -> 15 GB     (5.6 GB / 5.7 GB)  |    11 GB -> 17 GB     (5.8 GB / 6.0 GB)  |    4358 -> 10060     (+5702 / +6667)   / 480000  |
|     1403 |   14.0 |   224 GB |   8.2 GB -> 7.6 GB   (-623 MB / 1.1 GB)  |   9.3 GB -> 8.6 GB   (-710 MB / 1.2 GB)  |    4547 -> 5036       (+489 / +2814)   / 240000  |
|     1404 |   14.0 |   224 GB |   8.4 GB -> 7.6 GB   (-773 MB / 1.5 GB)  |   9.4 GB -> 8.5 GB   (-970 MB / 1.6 GB)  |    4369 -> 5031       (+662 / +2368)   / 240000  |
|     1411 |   14.0 |   466 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.0 MB -> 6.0 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 495000  |
|     1412 |   14.0 |   366 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.0 MB -> 6.0 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 390000  |
|     1421 |   14.0 |   931 GB |    19 GB -> 21 GB     (1.9 GB / 2.6 GB)  |    19 GB -> 21 GB     (2.0 GB / 2.7 GB)  |   10670 -> 12624     (+1954 / +6196)   / 975000  |
|     1422 |   14.0 |   931 GB |    19 GB -> 21 GB     (1.6 GB / 3.2 GB)  |    20 GB -> 21 GB     (1.6 GB / 3.3 GB)  |   10653 -> 12844     (+2191 / +6919)   / 975000  |
|     1423 |   14.0 |   931 GB |    19 GB -> 21 GB     (1.9 GB / 2.5 GB)  |    19 GB -> 21 GB     (2.0 GB / 2.6 GB)  |   10715 -> 12688     (+1973 / +5846)   / 975000  |
|     1424 |   14.0 |   931 GB |    18 GB -> 20 GB     (2.2 GB / 2.9 GB)  |    19 GB -> 21 GB     (2.3 GB / 3.0 GB)  |   10723 -> 12686     (+1963 / +5505)   / 975000  |
|     1425 |   14.0 |   931 GB |    19 GB -> 21 GB     (1.3 GB / 2.5 GB)  |    20 GB -> 21 GB     (1.4 GB / 2.6 GB)  |   10702 -> 12689     (+1987 / +5486)   / 975000  |
|     1426 |   14.0 |   931 GB |    20 GB -> 21 GB     (1.0 GB / 2.5 GB)  |    20 GB -> 21 GB     (1.0 GB / 2.6 GB)  |   10737 -> 12609     (+1872 / +5771)   / 975000  |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|       45 |    4.0 |    29 TB |   652 GB -> 652 GB    (512 MB / 69 GB)   |   686 GB -> 685 GB   (-240 MB / 72 GB)   |  412818 -> 412818       (+0 / +159118) / 30885000 |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Let’s start with the last line. Here’s the meaning, field by field:

  • There are 45 drives in total.

  • There are 4 server instances.

  • The total disk capacity is 29 TB.

  • The stored data is 652 GB now and will be 652 GB after the rebalance; the net change across all drives is 512 MB, and the total amount of data to be rewritten on the drives is 69 GB.

  • The same is repeated for the on-disk size. Here the total amount of changes is roughly the amount of data that would need to be copied.

  • The total current number of objects will not change (i.e. from 412818 to 412818), 0 new objects will be created, the total amount of objects to be moved is 159118, and the total number of possible objects in the cluster is 30885000.

The difference between “stored” and “on-disk” size is that the latter also includes the checksum blocks and other internal data.

The remaining lines contain the same data, but per disk.

What needs to be taken into account is:

  • Are there drives that will have too much data on them? Here both data size and objects must be checked, and they should be close to the average percentage for the placement group.

  • Is the data stored on the drives balanced, i.e. are all the drives’ usages close to the average?

  • Are there drives that should have data on them, but nothing is scheduled to be moved?

    • This usually happens because a drive wasn’t added to the right placement group.

  • Will there be too much data to be moved?
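The third check above can be sketched in a few lines of shell. This is a minimal sketch, assuming saved `storpool balancer disks` output in the format shown above: it lists the IDs of drives for which no object moves are planned at all (a possible sign of a drive missing from its placement group; drives that hold no user data will also match). The sample lines are embedded so the sketch is self-contained; on a live system redirect the real command output to the file instead.

```shell
# Two sample rows: one drive with planned moves, one without any.
cat > /tmp/balancer-out.txt <<'EOF'
|     1402 |   14.0 |   447 GB |   9.8 GB -> 15 GB     (5.6 GB / 5.7 GB)  |    11 GB -> 17 GB     (5.8 GB / 6.0 GB)  |    4358 -> 10060     (+5702 / +6667)   / 480000  |
|     1411 |   14.0 |   466 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.0 MB -> 6.0 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 495000  |
EOF
# Match the "(+0 / +0)" objects column and print the disk ID (second field).
idle=$(grep -F '(+0 / +0)' /tmp/balancer-out.txt | awk -F'|' '{ gsub(/ /, "", $2); print $2 }')
echo "$idle"
```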

To illustrate the difference in the amount of data to be moved, here is the output of storpool balancer disks from a run with -c 10:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|     disk | server |   size   |                  stored                  |                 on-disk                  |                     objects                      |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|        1 |   14.0 |   373 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.1 MB -> 6.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 405000  |
|     1101 |   11.0 |   447 GB |    16 GB -> 15 GB    (-1.0 GB / 1.7 GB)  |    18 GB -> 17 GB    (-1.1 GB / 1.7 GB)  |   11798 -> 10027     (-1771 / +5434)   / 480000  |
|     1102 |   11.0 |   447 GB |    16 GB -> 15 GB    (-263 MB / 1.7 GB)  |    17 GB -> 17 GB    (-298 MB / 1.7 GB)  |   10843 -> 10000      (-843 / +5420)   / 480000  |
|     1103 |   11.0 |   447 GB |    16 GB -> 15 GB    (-1.0 GB / 3.6 GB)  |    18 GB -> 16 GB    (-1.2 GB / 3.8 GB)  |   12123 -> 10005     (-2118 / +6331)   / 480000  |
|     1104 |   11.0 |   447 GB |    16 GB -> 15 GB    (-752 MB / 2.7 GB)  |    17 GB -> 16 GB    (-907 MB / 2.8 GB)  |   11045 -> 10098      (-947 / +5214)   / 480000  |
|     1111 |   11.0 |   466 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   5.1 MB -> 5.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 495000  |
|     1112 |   11.0 |   366 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   5.1 MB -> 5.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 390000  |
|     1121 |   11.0 |   931 GB |    22 GB -> 21 GB    (-1003 MB / 6.4 GB)  |    22 GB -> 21 GB    (-1018 MB / 6.7 GB)  |   13713 -> 12742      (-971 / +9712)   / 975000  |
|     1122 |   11.0 |   931 GB |    21 GB -> 21 GB    (-368 MB / 5.8 GB)  |    22 GB -> 21 GB    (-272 MB / 6.1 GB)  |   13469 -> 12718      (-751 / +8929)   / 975000  |
|     1123 |   11.0 |   931 GB |    22 GB -> 21 GB    (-1.1 GB / 5.9 GB)  |    22 GB -> 21 GB    (-1.1 GB / 6.1 GB)  |   14859 -> 12699     (-2160 / +8992)   / 975000  |
|     1124 |   11.0 |   931 GB |    21 GB -> 21 GB      (57 MB / 7.4 GB)  |    21 GB -> 21 GB     (113 MB / 7.7 GB)  |   13806 -> 12697     (-1109 / +9535)   / 975000  |
|     1201 |   12.0 |   447 GB |    18 GB -> 15 GB    (-2.8 GB / 1.2 GB)  |    19 GB -> 17 GB    (-3.0 GB / 1.2 GB)  |   14148 -> 10033     (-4115 / +4853)   / 480000  |
|     1202 |   12.0 |   447 GB |    17 GB -> 15 GB    (-2.0 GB / 1.6 GB)  |    19 GB -> 16 GB    (-2.2 GB / 1.7 GB)  |   13243 -> 10055     (-3188 / +4660)   / 480000  |
|     1203 |   12.0 |   447 GB |    17 GB -> 15 GB    (-2.0 GB / 2.3 GB)  |    19 GB -> 16 GB    (-2.3 GB / 2.4 GB)  |   12746 -> 10070     (-2676 / +4682)   / 480000  |
|     1204 |   12.0 |   447 GB |    18 GB -> 15 GB    (-2.7 GB / 2.1 GB)  |    19 GB -> 16 GB    (-2.8 GB / 2.2 GB)  |   12835 -> 10110     (-2725 / +5511)   / 480000  |
|     1212 |   12.0 |   366 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.0 MB -> 6.0 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 390000  |
|     1221 |   12.0 |   931 GB |    20 GB -> 21 GB     (620 MB / 6.3 GB)  |    21 GB -> 21 GB     (805 MB / 6.7 GB)  |   13115 -> 12542      (-573 / +9389)   / 975000  |
|     1222 |   12.0 |   931 GB |    22 GB -> 21 GB    (-981 MB / 2.9 GB)  |    22 GB -> 21 GB    (-1004 MB / 3.0 GB)  |   12938 -> 12793      (-145 / +8795)   / 975000  |
|     1223 |   12.0 |   931 GB |    22 GB -> 21 GB    (-1.1 GB / 5.9 GB)  |    22 GB -> 21 GB    (-1.1 GB / 6.1 GB)  |   13968 -> 12698     (-1270 / +10094)  / 975000  |
|     1224 |   12.0 |   931 GB |    21 GB -> 21 GB    (-791 MB / 4.5 GB)  |    22 GB -> 21 GB    (-758 MB / 4.7 GB)  |   13741 -> 12684     (-1057 / +8616)   / 975000  |
|     1225 |   12.0 |   931 GB |    21 GB -> 21 GB    (-671 MB / 4.8 GB)  |    22 GB -> 21 GB    (-677 MB / 4.9 GB)  |   13608 -> 12690      (-918 / +8559)   / 975000  |
|     1226 |   12.0 |   931 GB |    22 GB -> 21 GB    (-1.1 GB / 6.2 GB)  |    22 GB -> 21 GB    (-1.1 GB / 6.4 GB)  |   13066 -> 12737      (-329 / +9386)   / 975000  |
|     1301 |   13.0 |   447 GB |    13 GB -> 15 GB     (2.6 GB / 4.5 GB)  |    14 GB -> 17 GB     (2.7 GB / 4.6 GB)  |    7244 -> 10077     (+2833 / +6714)   / 480000  |
|     1302 |   13.0 |   447 GB |    12 GB -> 15 GB     (3.0 GB / 4.9 GB)  |    13 GB -> 17 GB     (3.2 GB / 5.2 GB)  |    7507 -> 10056     (+2549 / +7011)   / 480000  |
|     1303 |   13.0 |   447 GB |    14 GB -> 15 GB     (1.3 GB / 3.2 GB)  |    15 GB -> 17 GB     (1.3 GB / 3.3 GB)  |    7888 -> 10020     (+2132 / +6926)   / 480000  |
|     1304 |   13.0 |   447 GB |    13 GB -> 15 GB     (2.7 GB / 4.7 GB)  |    14 GB -> 17 GB     (2.8 GB / 4.9 GB)  |    7660 -> 10075     (+2415 / +7049)   / 480000  |
|     1311 |   13.0 |   466 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.1 MB -> 6.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 495000  |
|     1312 |   13.0 |   366 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.1 MB -> 6.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 390000  |
|     1321 |   13.0 |   931 GB |    21 GB -> 21 GB    (-200 MB / 4.1 GB)  |    21 GB -> 21 GB    (-192 MB / 4.3 GB)  |   13365 -> 12690      (-675 / +9527)   / 975000  |
|     1322 |   13.0 |   931 GB |    22 GB -> 21 GB    (-1.3 GB / 6.9 GB)  |    23 GB -> 21 GB    (-1.3 GB / 7.2 GB)  |   12749 -> 12698       (-51 / +10047)  / 975000  |
|     1323 |   13.0 |   931 GB |    21 GB -> 21 GB    (-495 MB / 6.1 GB)  |    22 GB -> 21 GB    (-504 MB / 6.3 GB)  |   13386 -> 12693      (-693 / +9524)   / 975000  |
|     1325 |   13.0 |   931 GB |    21 GB -> 21 GB    (-620 MB / 6.6 GB)  |    22 GB -> 21 GB    (-612 MB / 6.9 GB)  |   13113 -> 12768      (-345 / +9942)   / 975000  |
|     1326 |   13.0 |   931 GB |    21 GB -> 21 GB    (-498 MB / 7.1 GB)  |    22 GB -> 21 GB    (-414 MB / 7.4 GB)  |   13690 -> 12697      (-993 / +9759)   / 975000  |
|     1401 |   14.0 |   223 GB |   8.3 GB -> 7.6 GB   (-670 MB / 950 MB)  |   9.3 GB -> 8.5 GB   (-789 MB / 993 MB)  |    3470 -> 5061      (+1591 / +3262)   / 240000  |
|     1402 |   14.0 |   447 GB |   9.8 GB -> 15 GB     (5.6 GB / 7.1 GB)  |    11 GB -> 17 GB     (5.8 GB / 7.5 GB)  |    4358 -> 10052     (+5694 / +7092)   / 480000  |
|     1403 |   14.0 |   224 GB |   8.2 GB -> 7.6 GB   (-619 MB / 730 MB)  |   9.3 GB -> 8.5 GB   (-758 MB / 759 MB)  |    4547 -> 5023       (+476 / +2567)   / 240000  |
|     1404 |   14.0 |   224 GB |   8.4 GB -> 7.6 GB   (-790 MB / 915 MB)  |   9.4 GB -> 8.5 GB   (-918 MB / 946 MB)  |    4369 -> 5062       (+693 / +2483)   / 240000  |
|     1411 |   14.0 |   466 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.0 MB -> 6.0 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 495000  |
|     1412 |   14.0 |   366 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.0 MB -> 6.0 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 390000  |
|     1421 |   14.0 |   931 GB |    19 GB -> 21 GB     (2.0 GB / 6.8 GB)  |    19 GB -> 21 GB     (2.1 GB / 7.0 GB)  |   10670 -> 12695     (+2025 / +10814)  / 975000  |
|     1422 |   14.0 |   931 GB |    19 GB -> 21 GB     (1.6 GB / 7.4 GB)  |    20 GB -> 21 GB     (1.7 GB / 7.7 GB)  |   10653 -> 12702     (+2049 / +10414)  / 975000  |
|     1423 |   14.0 |   931 GB |    19 GB -> 21 GB     (2.0 GB / 7.4 GB)  |    19 GB -> 21 GB     (2.1 GB / 7.8 GB)  |   10715 -> 12683     (+1968 / +10418)  / 975000  |
|     1424 |   14.0 |   931 GB |    18 GB -> 21 GB     (2.2 GB / 8.0 GB)  |    19 GB -> 21 GB     (2.3 GB / 8.3 GB)  |   10723 -> 12824     (+2101 / +9573)   / 975000  |
|     1425 |   14.0 |   931 GB |    19 GB -> 21 GB     (1.3 GB / 5.8 GB)  |    20 GB -> 21 GB     (1.4 GB / 6.1 GB)  |   10702 -> 12686     (+1984 / +10231)  / 975000  |
|     1426 |   14.0 |   931 GB |    20 GB -> 21 GB     (1.0 GB / 6.5 GB)  |    20 GB -> 21 GB     (1.2 GB / 6.8 GB)  |   10737 -> 12650     (+1913 / +10974)  / 975000  |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|       45 |    4.0 |    29 TB |   652 GB -> 653 GB    (1.2 GB / 173 GB)  |   686 GB -> 687 GB    (1.2 GB / 180 GB)  |  412818 -> 412818       (+0 / +288439) / 30885000 |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------

This time the total amount of data to be moved is 180 GB. The total data to be moved can differ by an order of magnitude between -c 0 and -c 10. The best results are usually achieved by using -F directly; only on rare occasions is a full re-balancing (i.e. no -F and higher -c values) required.
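When comparing runs, the only number that usually matters is the on-disk total scheduled to move. The following is a minimal sketch of pulling that number out of the balancer totals line; the embedded line is the -c 10 totals line from the example above.

```shell
totals='|       45 |    4.0 |    29 TB |   652 GB -> 653 GB    (1.2 GB / 173 GB)  |   686 GB -> 687 GB    (1.2 GB / 180 GB)  |  412818 -> 412818       (+0 / +288439) / 30885000 |'
# Field 6 (split on "|") is the on-disk column; the value after the "/"
# inside the parentheses is the total amount of data to be moved.
moved=$(printf '%s\n' "$totals" | awk -F'|' '{
    n = split($6, parts, "/")
    gsub(/[ )]/, "", parts[n])
    print parts[n]
}')
echo "$moved"   # 180GB
```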

14.9. Errors from the balancer tool

If the balancer tool doesn’t complete successfully, its output MUST be examined and the root cause fixed.

14.9.1. placementGroup and other violations

Here’s a part of the output of the balancer:

-== POST BALANCE ==-
shards with decreased redundancy 0 (0, 0, 0)
server constraint violations 0
stripe constraint violations 160
placement group violations 0
  • A non-zero number of “server constraint violations” means that there are pieces of data that have two or more of their copies on the same server. This is an error condition;

  • A non-zero number of “stripe constraint violations” means that some pieces of data are not optimally striped across the drives of a specific server. This is NOT an error condition;

  • A non-zero number of “placement group violations” means there is an error condition.
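The distinction above can be automated. This is a minimal sketch, assuming saved balancer output: it sums only the error-level violation counters (stripe constraint violations are informational and are deliberately ignored), so a non-zero result means the balancer result must not be used as-is.

```shell
cat > /tmp/post-balance.txt <<'EOF'
shards with decreased redundancy 0 (0, 0, 0)
server constraint violations 0
stripe constraint violations 160
placement group violations 0
EOF
errors=$(awk '/^(server constraint|placement group) violations/ { s += $NF } END { print s + 0 }' /tmp/post-balance.txt)
echo "$errors"   # 0 means no error-level violations
```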

14.10. Miscellaneous

If for any reason the currently running rebalancing operation needs to be paused, this can be done via storpool relocator off. In such cases StorPool Support should also be contacted, as pausing should not normally be needed. Re-enabling it is done via storpool relocator on.

15. Troubleshooting Guide

This part outlines the different states of a StorPool cluster, what should be expected in each of them, and the recommended steps to take. It is intended as a guideline for the operations team(s) maintaining the production system provided by StorPool.

Legend:

StorPool CLI and other example commands are shown inline (e.g. storpool disk list, top, ping, etc.).

Example output from a command is listed below:

# storpool disk list
...

15.1. Normal state of the system

The StorPool storage system behaves normally when it is fully configured and all of its components are up and running. This is the desired state of the system.

Characteristics of this state:

15.1.1. All nodes in the storage cluster are up and running

This can be checked by using the CLI with storpool service list on any node with access to the API service.

Note

The storpool service list command provides status for all services running cluster-wide, not only for the services running on the node itself.

15.1.2. All configured StorPool services are up and running

This is again easily checked with storpool service list. Recently restarted services are usually spotted by their short uptime. A recently restarted service should be investigated if the reason for its restart is unknown, even if it is running at the moment, as in the example with client ID 37 below:

# storpool service list
cluster running, mgmt on node 2
    mgmt   1 running on node  1 ver 19.01.145, started 2019-08-10 19:28:37, uptime 144 days 22:45:51
    mgmt   2 running on node  2 ver 19.01.145, started 2019-08-10 19:27:18, uptime 144 days 22:47:10 active
    server   1 running on node  1 ver 19.01.145, started 2019-08-10 19:28:59, uptime 144 days 22:45:29
    server   2 running on node  2 ver 19.01.145, started 2019-08-10 19:25:53, uptime 144 days 22:48:35
    server   3 running on node  3 ver 19.01.145, started 2019-08-10 19:23:30, uptime 144 days 22:50:58
    client   1 running on node  1 ver 19.01.145, started 2019-08-10 19:28:37, uptime 144 days 22:45:51
    client   2 running on node  2 ver 19.01.145, started 2019-08-10 19:25:32, uptime 144 days 22:48:56
    client   3 running on node  3 ver 19.01.145, started 2019-08-10 19:23:09, uptime 144 days 22:51:19
    client  21 running on node 21 ver 19.01.145, started 2019-08-10 19:20:26, uptime 144 days 22:54:02
    client  22 running on node 22 ver 19.01.145, started 2019-08-10 19:19:26, uptime 144 days 22:55:02
    client  37 running on node 37 ver 19.01.145, started 2019-08-10 13:08:12, uptime 05:06:16
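Spotting recently restarted services can be done by eye, or mechanically. This is a minimal sketch, assuming saved `storpool service list` output in the format above: any service whose uptime field has no "days" component has been up for less than a day. The two sample lines are embedded so the sketch is self-contained.

```shell
cat > /tmp/services.txt <<'EOF'
client  22 running on node 22 ver 19.01.145, started 2019-08-10 19:19:26, uptime 144 days 22:55:02
client  37 running on node 37 ver 19.01.145, started 2019-08-10 13:08:12, uptime 05:06:16
EOF
# Print service type and ID for lines whose uptime lacks a day count.
recent=$(awk '/uptime/ && !/ days? / { print $1, $2 }' /tmp/services.txt)
echo "$recent"   # client 37
```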

15.1.3. Working cgroup memory and cpuset isolation is properly configured

Use the storpool_cg tool with the check argument to ensure everything is as expected. The tool should not return any warnings. More information is available in Control Groups.

When properly configured, the sum of all memory limits on the node is less than the available memory. This protects from memory shortage both the running kernel and all processes in the storpool.slice memory cgroup, which ensures the stability of the storage service.
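The sizing rule can be sketched as simple arithmetic. The numbers below are illustrative assumptions (only storpool.slice comes from the text; the RAM size and the other limits are made up for the example), in bytes:

```shell
total_ram=$((128 * 1024 * 1024 * 1024))        # RAM installed in the node (assumed)
storpool_slice=$((16 * 1024 * 1024 * 1024))    # storpool.slice memory limit (assumed)
other_slices=$((100 * 1024 * 1024 * 1024))     # all other cgroup limits combined (assumed)
sum=$((storpool_slice + other_slices))
if [ "$sum" -lt "$total_ram" ]; then
    echo "ok: $(( (total_ram - sum) / 1024 / 1024 )) MiB left for the kernel"
else
    echo "overcommitted: the kernel may run out of memory"
fi
```

With the numbers above the check passes, leaving 12 GiB of headroom; in practice storpool_cg check performs this validation against the node's real limits.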

15.1.4. All network interfaces are properly configured

All network interfaces used by StorPool are up and properly configured, with hardware acceleration enabled (where applicable); all network switches are configured with jumbo frames and flow control, and none of them experience any packet loss or delays. The output from storpool net list is a good start: all configured network interfaces will be seen as up, with the flags explained at the end of the output. The desired state is uU with a + at the end for each network interface; if hardware acceleration is supported on an interface, the A flag should also be present:

# storpool net list
------------------------------------------------------------
| nodeId | flags   | net 1             | net 2             |
------------------------------------------------------------
|     23 | uU + AJ | 1A:00:01:00:00:17 | 1E:00:01:00:00:17 |
|     24 | uU + AJ | 1A:00:01:00:00:18 | 1E:00:01:00:00:18 |
|     25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
|     26 | uU + AJ | 12:00:01:00:F0:1A | 16:00:01:00:F0:1A |
|     27 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
  u - packets recently received from this node
  d - no packets recently received from this node

  U - this node has enough votes to be in the quorum
  D - this node does not have enough votes to be in the quorum
  + - this node considers itself in the quorum
  B - the connection to this node is through a backup link; check the cabling
  A - this node is using hardware acceleration
  J - the node uses jumbo frames

  N - a non-voting node

15.1.5. All drives are up and running

All drives in use for the storage system are performing at their specified speed, are joined in the cluster and serving requests.

This can be checked with storpool disk list internal; for example, in a normally loaded cluster all drives will report low aggregate scores. Below is an example output (trimmed for brevity):

# storpool disk list internal
--------------------------------------------------------------------------------------------------------------------------------------------------------
| disk | server |        aggregate scores        |         wbc pages        |     scrub bw |                          scrub ETA | last scrub completed |
--------------------------------------------------------------------------------------------------------------------------------------------------------
| 2301 |   23.0 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2019-08-11 15:33:44 |
| 2302 |   23.0 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2019-08-11 15:28:48 |
| 2303 |   23.1 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2019-08-11 15:28:49 |
| 2304 |   23.1 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2019-08-11 15:28:50 |
| 2305 |   23.2 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2019-08-11 15:28:51 |
| 2306 |   23.2 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2019-08-11 15:28:51 |
| 2307 |   23.3 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2019-08-11 15:28:52 |
| 2308 |   23.3 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2019-08-11 15:28:53 |
| 2311 |   23.0 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2019-08-11 15:28:38 |
| 2312 |   23.0 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2019-08-11 15:28:43 |
| 2313 |   23.1 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2019-08-11 15:28:44 |
| 2314 |   23.1 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2019-08-11 15:28:45 |
| 2315 |   23.2 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2019-08-11 15:28:47 |
| 2316 |   23.2 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2019-08-11 15:28:39 |
| 2317 |   23.3 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2019-08-11 15:28:40 |
| 2318 |   23.3 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2019-08-11 15:28:42 |
[snip]

All drives are regularly scrubbed, so they should have a stable (not increasing) number of errors. The errors corrected for each drive are visible in the storpool disk list output. The last completed scrub is visible in storpool disk list internal, as in the example above.

A few notes on the desired state:

  • Some systems may have fewer than two network interfaces or a single backend switch. Even though not recommended, this is still possible and sometimes used (usually in a PoC or with a backup server) when the cluster is configured with a single-VLAN network redundancy scheme. A single-VLAN network redundancy configuration and an inter-switch connection are required for a cluster where only some of the nodes are connected with a single interface.

  • If one or more of the points describing the state above is not in effect, the system should not be considered healthy. If there is any suspicion that the system is behaving erratically even though all of the above conditions are satisfied, the recommended steps to check that everything is in order are:

  • Check top and look for the state of each of the configured storpool_* services running on the present node. A properly running service is usually in the S (sleeping) state and is rarely seen in the R (running) state. The CPU usage is often reported as 100% when hardware sleep is enabled, due to the kernel misreporting it; the actual usage is much lower and can be tracked with cpupower monitor for the CPU cores.

  • Another way to ensure all services on this node are running correctly is to use the /usr/lib/storpool/sdump tool, which reports CPU and network usage statistics for the running services on the node. Use the -l option for the long names of the statistics.

  • On some of the nodes with running workloads (i.e. VM instances, containers, etc.) iostat will show activity for processed requests on the block devices. The following example shows normal disk activity on a node running VM instances; note that the usage may vary greatly depending on the workload. The command used in the example is iostat -xm 1 /dev/sp-* | egrep -v " 0[.,]00$", which prints statistics for the StorPool devices each second, excluding devices with no storage I/O activity:

    Device:     rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    sp-0          0.00     0.00    0.00  279.00     0.00     0.14     1.00     3.87   13.80    0.00   13.80   3.55  98.99
    sp-11         0.00     0.00  165.60  114.10    19.29    14.26   245.66     5.97   20.87    9.81   36.91   0.89  24.78
    sp-12         0.00     0.00  171.60  153.60    19.33    19.20   242.67     9.20   28.17   10.46   47.96   1.08  35.01
    sp-13         0.00     0.00    6.00   40.80     0.04     5.10   225.12     1.75   37.32    0.27   42.77   1.06   4.98
    sp-21         0.00     0.00    0.00   82.20     0.00     1.04    25.90     1.00   12.08    0.00   12.08  12.16  99.99

15.1.6. There are no hanging active requests

The output of /usr/lib/storpool/latthreshold.py is empty - it shows no hanging requests and no service or disk warnings.

15.2. Degraded state

In this state some system components are not fully operational and need attention. Some examples of a degraded state are given below.

15.2.1. Degraded state due to service issues

15.2.1.1. A single storpool_server service on one of the storage nodes is not available or not joined in the cluster

Note that this concerns only pools with triple replication; for dual replication this is considered a critical state, because there are parts of the system with only one available copy. This is an example output from storpool service list:

# storpool service list
cluster running, mgmt on node 2
    mgmt   1 running on node  1 ver 19.01.145, started 2019-08-10 16:11:59, uptime 19:51:50
    mgmt   2 running on node  2 ver 19.01.145, started 2019-08-10 16:11:58, uptime 19:51:51 active
    mgmt   3 running on node  3 ver 19.01.145, started 2019-08-10 16:11:58, uptime 19:51:51
    server   1 down on node  1 ver 19.01.145
    server   2 running on node  2 ver 19.01.145, started 2019-08-10 16:12:03, uptime 19:51:46
    server   3 running on node  3 ver 19.01.145, started 2019-08-10 16:12:04, uptime 19:51:45
    client   1 running on node  1 ver 19.01.145, started 2019-08-10 16:11:59, uptime 19:51:50
    client   2 running on node  2 ver 19.01.145, started 2019-08-10 16:11:57, uptime 19:51:52
    client   3 running on node  3 ver 19.01.145, started 2019-08-10 16:11:57, uptime 19:51:52

If this is unexpected, i.e. no one has deliberately restarted or stopped the service for planned maintenance or upgrade, it is very important to first bring the service up and then investigate the root cause for the outage. When the storpool_server service comes back up, it will start recovering the outdated data on its drives. The recovery process can be monitored with storpool task list, which shows which disks are recovering, as well as how much data is left to be recovered. Example output of storpool task list:

# storpool task list
----------------------------------------------------------------------------------------
|     disk |  task id |  total obj |  completed |    started |  remaining | % complete |
----------------------------------------------------------------------------------------
|     2303 | RECOVERY |          1 |          0 |          1 |          1 |         0% |
----------------------------------------------------------------------------------------
|    total |          |          1 |          0 |          1 |          1 |         0% |
----------------------------------------------------------------------------------------

Some of the volumes or snapshots will have the D flag (for degraded) visible in the storpool volume status output; it will disappear once all the data is fully recovered. Example situations include a reboot of the node for a kernel or package upgrade when no kernel modules were installed for the new kernel, or a service (in this example the storpool_server) that was not configured to start on boot.
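Monitoring the recovery can also be scripted. This is a minimal sketch, assuming saved `storpool task list` output in the format above: it sums the remaining objects across all recovery tasks; the recovery is complete when the sum reaches zero. The sample line is embedded so the sketch is self-contained.

```shell
cat > /tmp/tasks.txt <<'EOF'
|     2303 | RECOVERY |          1 |          0 |          1 |          1 |         0% |
EOF
# Field 7 (split on "|") is the "remaining" column.
remaining=$(awk -F'|' '/RECOVERY/ { gsub(/ /, "", $7); s += $7 } END { print s + 0 }' /tmp/tasks.txt)
echo "$remaining"   # objects still to recover
```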

15.2.1.2. Some of the configured StorPool services have failed or are not running

These could be:

  • The storpool_block service on some of the storage-only nodes, without any attached volumes or snapshots.

  • A single storpool_server service or multiple instances on the same node, note again that this is critical for systems with dual replication.

  • A single API (storpool_mgmt) service, while another active API is running.

The reasons for these could be the same as in the previous examples; usually the system log contains all the information needed to check why the service is not running or not coming up.

15.2.2. Degraded state due to host OS misconfiguration

Some examples include:

15.2.2.1. Changes in the OS configuration after a system update

This could prevent some of the services from running after a fresh boot, for instance due to changed names of the network interfaces used for the storage system after an upgrade, changed PCIe IDs for NVMe devices, and so on.

15.2.2.2. Kdump is no longer collecting kernel dump data properly

If this occurs, it might be difficult to debug what has caused a kernel crash.

Some of the above cases are difficult to catch before booting into the new environment (e.g. kernel or other updates), and sometimes they are only caught after an event that reveals the issue. Thus it is important to regularly test and ensure the system is properly configured and collects crash dumps normally.

15.2.3. Degraded state due to network interface issues

15.2.3.1. Some of the interfaces used by StorPool are not up

This could be checked with storpool net list, e.g.:

# storpool net list
------------------------------------------------------------
| nodeId | flags   | net 1             | net 2             |
------------------------------------------------------------
|     23 | uU + AJ |                   | 1E:00:01:00:00:17 |
|     24 | uU + AJ |                   | 1E:00:01:00:00:18 |
|     25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
|     26 | uU + AJ | 12:00:01:00:F0:1A | 16:00:01:00:F0:1A |
|     27 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
  u - packets recently received from this node
  d - no packets recently received from this node

  U - this node has enough votes to be in the quorum
  D - this node does not have enough votes to be in the quorum
  + - this node considers itself in the quorum
  B - the connection to this node is through a backup link; check the cabling
  A - this node is using hardware acceleration
  J - the node uses jumbo frames

  N - a non-voting node

In the above example nodes 23 and 24 are not connected to the first network. This is the interface configured with SP_IFACE1_CFG in /etc/storpool.conf (check with storpool_showconf SP_IFACE1_CFG). Note that the beacons are up and the system is processing requests through the second network. Possible reasons include misconfigured interfaces, a misconfigured StorPool configuration, or the backend switch/switches.

15.2.3.2. A HW acceleration qualified interface is running without hardware acceleration

This is once again checked with storpool net list:

# storpool net list
------------------------------------------------------------
| nodeId | flags   | net 1             | net 2             |
------------------------------------------------------------
|     23 | uU +  J | 1A:00:01:00:00:17 | 1E:00:01:00:00:17 |
|     24 | uU +  J | 1A:00:01:00:00:18 | 1E:00:01:00:00:18 |
|     25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
|     26 | uU + AJ | 12:00:01:00:F0:1A | 16:00:01:00:F0:1A |
|     27 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
  u - packets recently received from this node
  d - no packets recently received from this node

  U - this node has enough votes to be in the quorum
  D - this node does not have enough votes to be in the quorum
  + - this node considers itself in the quorum
  B - the connection to this node is through a backup link; check the cabling
  A - this node is using hardware acceleration
  J - the node uses jumbo frames

  N - a non-voting node

In the above example nodes 23 and 24 are equipped with NICs qualified for hardware acceleration, but are running without it. The possible reasons include a BIOS or OS misconfiguration, misconfigured kernel parameters on boot, or a network interface misconfiguration. Note that when a system is configured for hardware accelerated operation, the cgroups configuration is also sized accordingly; thus running in this state is likely to cause performance issues, due to fewer CPU cores being isolated and reserved for the NIC interrupts and storpool_rdma threads.

15.2.3.3. Jumbo frames are expected, but not working on some of the interfaces

This can again be seen with storpool net list: if either of the two networks has an MTU lower than 9000 bytes, the J flag will not be listed:

# storpool net list
-------------------------------------------------------------
| nodeId | flags    | net 1             | net 2             |
-------------------------------------------------------------
|     23 | uU + A   | 12:00:01:00:F0:17 | 16:00:01:00:F0:17 |
|     24 | uU + AJ  | 1A:00:01:00:00:18 | 1E:00:01:00:00:18 |
|     25 | uU + AJ  | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
|     26 | uU + AJ  | 1A:00:01:00:00:1A | 1E:00:01:00:00:1A |
-------------------------------------------------------------
Quorum status: 4 voting beacons up out of 4 expected
Flags:
  u - packets recently received from this node
  d - no packets recently received from this node

  U - this node has enough votes to be in the quorum
  D - this node does not have enough votes to be in the quorum
  M - this node is being damped by the rest of the nodes in the cluster
  + - this node considers itself in the quorum
  B - the connection to this node is through a backup link; check the cabling
  A - this node is using hardware acceleration
  J - the node uses jumbo frames

  N - a non-voting node

If the node is not expected to run without jumbo frames, this might be an indication of a misconfigured interface or an issue with applying the interface configuration on boot. Note that an OS interface configured for jumbo frames without the switch port being properly configured leads to severe performance issues.
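A quick way to verify the OS side is to read the interface MTU from sysfs. A small sketch; the interface names are placeholders for the ones configured in SP_IFACE1_CFG/SP_IFACE2_CFG on the node:

```shell
# Report whether an interface's MTU allows jumbo frames (>= 9000 bytes).
check_mtu() {
    mtu=$(cat /sys/class/net/"$1"/mtu 2>/dev/null || echo 0)
    if [ "$mtu" -ge 9000 ]; then
        echo "$1: MTU $mtu - jumbo frames possible"
    else
        echo "$1: MTU $mtu - jumbo frames NOT active"
    fi
}
check_mtu eth2   # placeholder - substitute the actual storage interfaces
check_mtu eth3
```

Remember that the switch ports must be configured for jumbo frames as well; the sysfs value only reflects the OS side.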

15.2.3.4. Some network interfaces are experiencing network loss or delays on one of the networks

This might affect the latency of some storage operations. Depending on the node where the losses occur, it might affect a single client, or operations in the whole cluster if the packet loss or delays are happening on a server node. Stats for all interfaces are collected per service in the analytics platform (https://analytics.storpool.com) and can be used to investigate network performance issues. The /usr/lib/storpool/sdump tool prints the same statistics on each of the nodes with services. The usual causes for packet loss are:

  • hardware issues (cables, SFPs, etc.)

  • floods and DDoS attacks “leaking” into the storage network due to misconfiguration

  • saturation of the CPU cores that handle the interrupts for the network cards and others when hardware acceleration is not available

  • network loops leading to saturated switch ports or overloaded NICs
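When investigating suspected packet loss, the kernel's per-interface drop counters are a quick first check before digging into sdump or the analytics. A minimal sketch; the interface name is a placeholder for the node's storage interfaces, and note that these counters only cover drops visible to the OS, not losses inside the switches:

```shell
# Print the kernel's RX/TX drop counters for an interface from sysfs.
show_drops() {
    stats=/sys/class/net/"$1"/statistics
    echo "$1 rx_dropped=$(cat "$stats"/rx_dropped) tx_dropped=$(cat "$stats"/tx_dropped)"
}
show_drops lo   # placeholder - substitute the actual storage interfaces
```

Counters that keep increasing between runs point at the node side; clean counters with ongoing loss point at the cabling or the switches.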

15.2.4. Drive/Controller issues

15.2.4.1. One or more HDD or SSD drives are missing from a single server in the cluster or from servers in the same fault set

Attention

This concerns only pools with triple replication; for dual replication this is considered a critical state.

The missing drives may be seen using storpool disk list or storpool server <serverID> disk list; for example, in the output below disk 543 is missing from the server with ID 54:

# storpool server 54 disk list
disk  |   server  | size    |   used  |  est.free  |   %     | free entries | on-disk size |  allocated objects |  errors |  flags
541   |       54  | 207 GB  |  61 GB  |    136 GB  |   29 %  |      713180  |       75 GB  |   158990 / 225000  |   0 / 0 |
542   |       54  | 207 GB  |  56 GB  |    140 GB  |   27 %  |      719526  |       68 GB  |   161244 / 225000  |   0 / 0 |
543   |       54  |      -  |      -  |         -  |    - %  |           -  |           -  |        - / -       |    -
544   |       54  | 207 GB  |  62 GB  |    134 GB  |   30 %  |      701722  |       76 GB  |   158982 / 225000  |   0 / 0 |
545   |       54  | 207 GB  |  61 GB  |    135 GB  |   30 %  |      719993  |       75 GB  |   161312 / 225000  |   0 / 0 |
546   |       54  | 207 GB  |  54 GB  |    142 GB  |   26 %  |      720023  |       68 GB  |   158481 / 225000  |   0 / 0 |
547   |       54  | 207 GB  |  62 GB  |    134 GB  |   30 %  |      719996  |       77 GB  |   179486 / 225000  |   0 / 0 |
548   |       54  | 207 GB  |  53 GB  |    143 GB  |   26 %  |      718406  |       70 GB  |   179038 / 225000  |   0 / 0 |

The usual reason is that the drive was ejected from the cluster due to a write error, either by the kernel or by the running storpool_server instance. More information may be found using dmesg | tail and in the system log. The model and the serial number of the failed drive are shown by storpool disk list info.
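When looking through the kernel log for the eject reason, filtering for common I/O error messages narrows things down. A small helper sketch (the grep patterns are illustrative, not an exhaustive list of kernel error formats; dmesg may require root on kernels with dmesg_restrict enabled):

```shell
# Pull I/O-error-looking lines out of kernel log text.
io_errors() { grep -iE 'i/o error|medium error' | tail -n 5; }
dmesg 2>/dev/null | io_errors
```

The same helper can be fed archived logs, e.g. `io_errors < /var/log/messages`.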

Under normal conditions the server will flag the disk to be re-tested and will eject it for a quick test. Provided the disk is still working correctly and the test results do not breach any thresholds, the disk will be returned to the cluster to recover. This might happen, for example, if a stalled request was caused by an intermittent issue, like a reallocated sector.

If the disk breaches any of the latency or bandwidth thresholds, it will not be automatically returned and will have to be re-balanced out of the cluster. Such disks are marked as “bad” (more available at storpool_initdisk options).

When one or more drives are ejected (already marked as bad) and missing, multiple volumes and/or snapshots will be listed with the D flag (D as Degraded) in the output of storpool volume status, due to the missing replicas for some of the data. This is normal and expected, and there are the following options in this situation:

  • The drive could still be working properly (e.g. a set of bad sectors was reallocated) even after it was tested; in order to re-test, you could mark the drive as --good (more info on how at storpool_initdisk options) and attempt to get it back into the cluster.

  • On some occasions a disk might have lost its signatures and would have to be returned to the cluster to recover from scratch. It will be automatically re-tested upon the attempt; a full (read-write) stress test is recommended to ensure it is working correctly (fio is a good tool for this kind of test, check its --verify option). If the stress test is successful (e.g. the drive has been written to and verified successfully), it may be reinitialized with storpool_initdisk using the same disk ID it had before. This will automatically return it to the cluster, and it will fully recover all data from scratch as if it were a brand new drive.

  • The drive has failed irrecoverably and a replacement is available. The replacement drive is initialized with the diskID of the failed drive with storpool_initdisk. After returning it to the cluster it will fully recover all the data from the live replicas (please check Rebalancing StorPool for more).

  • A replacement is not available. The only option is to re-balance the cluster without this drive (more details in Rebalancing StorPool).

Attention

Beware that in some cases with very full clusters it might be impossible to get the cluster back to full redundancy without overfilling some of the remaining drives. See the next section.

15.2.4.2. Some of the drives in the cluster are beyond 90% (up to 96% full)

With proper planning this should rarely be an issue. One way to avoid it is to add more drives or an additional server node with a full set of drives to the cluster. Another option is to remove unused volumes or snapshots.

The storpool snapshot space command returns the referred space for each snapshot on the underlying drives. Note that snapshots with a negative value in their “used” column will not free up any space if they are removed and will remain in the deleting state, because they are parents of multiple cloned child volumes.
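To spot such snapshots quickly, the “used” column can be filtered for negative values. A sketch over an illustrative table layout (the column positions are assumptions; adjust the field number to the actual storpool snapshot space output before piping it in):

```shell
# Print snapshots whose "used" value is negative - deleting them frees no space.
awk -F'|' 'NF > 3 && $4 ~ /-[0-9]/ {gsub(/ /,"",$2); print $2 " will not free space if deleted"}' <<'EOF'
| snapshot     | on volume | used    |
| daily-snap   | vol1      |  12 GB  |
| base-image-p | -         | -7.2 GB |
EOF
```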

Note that depending on the speed with which the cluster is being populated with data by the end users this might also be considered a critical state.

15.2.4.3. Some of the drives have fewer than 140k free entries (alert for an overloaded system)

This may be observed in the output of storpool disk list or storpool server <serverID> disk list; an example of the latter is below:

# storpool server 23 disk list
  disk  |  server  |    size    |    used    |  est.free  |      %  |  free entries  |  on-disk size  |  allocated objects |  errors |  flags
  2301  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |       3719930  |       660 KiB  |       17 / 930000  |   0 / 0 |
  2302  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |       3719929  |       668 KiB  |       17 / 930000  |   0 / 0 |
  2303  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |       3719929  |       668 KiB  |       17 / 930000  |   0 / 0 |
  2304  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |       3719931  |       668 KiB  |       17 / 930000  |   0 / 0 |
  2306  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |       3719932  |       664 KiB  |       17 / 930000  |   0 / 0 |
  2305  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |       3719930  |       660 KiB  |       17 / 930000  |   0 / 0 |
  2307  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |         19934  |       664 KiB  |       17 / 930000  |   0 / 0 |
--------------------------------------------------------------------------------------------------------------------------------------
     7  |     1.0  |   6.1 TiB  |    18 GiB  |   5.9 TiB  |    0 %  |      26039515  |       4.5 MiB  |      119 / 6510000 |   0 / 0 |

This usually happens after the system has been loaded for longer periods of time with a sustained write workload on one or multiple volumes. If this is unexpected and the reason is an erratic workload, the recommended way to handle it is to set a limit (bandwidth, IOPS, or both) on the loaded volumes, for example with storpool volume <volumename> bw 100M iops 1000. The same could be set for multiple volumes/snapshots in a template with storpool template <templatename> bw 100M iops 1000 propagate. Please note that propagating changes for templates with a very large number of volumes and snapshots might not work. If the overloaded state is due to a normally occurring workload, it is best to expand the system with more drives and/or reformat the drives with a larger number of entries (this relates mainly to HDDs). The latter case is usually caused by a low number of hard drives in an HDD-only or hybrid pool, and rarely by overloaded SSDs.

Another case related to overloaded drives is when many volumes are created out of the same template. This requires overrides in order to shuffle the objects where the journals reside, so as to avoid overloading the same triplet of disks when all virtual machines spike for some reason (e.g. unattended upgrades, a syslog-intensive cron job, etc.).

A couple of notes on the degraded states: apart from the notes on replication above, none of these should affect the stability of the system at this point. In the example with the missing disk, in a hybrid system with a single failed SSD, all read requests on triple-replicated volumes that have data on the failed drive will be served by some of the redundant copies on HDDs. This could slightly increase read latencies for operations on the parts of the volumes that were on this exact SSD. This is usually negligible in medium to large systems; e.g. in a cluster with 20 SSDs or NVMe drives, these are 1/20th of all the read operations in the cluster. In case of dual replicas on SSDs and a third replica on HDDs there is no read latency penalty whatsoever. The same holds for missing hard drives - they will not affect the system at all, and in fact some write operations are even faster, because they are not waiting for the missing drive.

15.3. Critical state

This is an emergency state that requires immediate attention and intervention from the operations team and/or StorPool Support. Some of the conditions that could lead to this state are:

  • partial or complete network outage

  • power loss for some nodes in the cluster

  • memory shortage leading to a service failure due to missing or incomplete cgroups configuration

The following states are an indication for critical conditions:

15.3.1. API service failure

15.3.1.1. API not reachable on any of the configured nodes (the ones running the storpool_mgmt service)

Requests to the API from any of the nodes configured to access it either stall or cannot reach a working service. This is a critical state because the status of the cluster is unknown (it might even be down).

This might be caused by:

  • Misconfigured network for accessing the floating IP address - the address may be obtained by storpool_showconf http on any of the nodes with a configured storpool_mgmt service in the cluster:

    # storpool_showconf http
    SP_API_HTTP_HOST=10.3.10.78
    SP_API_HTTP_PORT=81
    

  • Failed interfaces on the hosts running the storpool_mgmt service. To find the interface where the StorPool API should be running, use storpool_showconf api_iface:

    # storpool_showconf api_iface
    SP_API_IFACE=bond0.410

It is recommended to have the API on a redundant interface (e.g. an active-backup bond interface). Note that even without an API, provided the cluster is in quorum, there is no impact on any running operations, but changes in the cluster (like creating/attaching/detaching/deleting volumes or snapshots) will be impossible. Running with no API in the cluster triggers a highest-severity alert to StorPool Support (essentially a wake-up alert) due to the unknown state of the system.

  • The cluster is not in quorum. The cluster is in this state if the number of running voting storpool_beacon services is less than half of the expected nodes plus one ((expected / 2) + 1). The configured number of expected nodes in the cluster may be checked with storpool_showconf expected; it is generally the number of server nodes (except when client nodes are configured as voting for some reason). In a system with 6 servers, at least 4 voting beacons are needed to get the cluster back into running state:

    # storpool_showconf expected
    SP_EXPECTED_NODES=6
    

The current number of expected votes and the number of voting beacons are displayed in the output of storpool net list, check the example above (the Quorum status: line).
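The quorum arithmetic above can be sketched as a one-line helper (integer division, as in the formula):

```shell
# Minimum number of voting beacons needed for quorum: (expected / 2) + 1.
quorum_needed() { echo $(( $1 / 2 + 1 )); }
quorum_needed 6   # a 6-server system needs at least 4 voting beacons
quorum_needed 5   # an odd-sized system of 5 needs at least 3
```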

15.3.1.2. API requests are not returning for more than 30-60 seconds (e.g. storpool volume status, storpool snapshot space, storpool disk list, etc.)

These API requests collect data from the running storpool_server services on each server node. Possible reasons are:

  • network loss or delays;

  • failing storpool_server services;

  • failing drives or hardware (CPU, memory, controllers, etc.);

  • overload

15.3.2. Server service failure

15.3.2.1. Two storpool_server services or whole servers are down

Two storpool_server services or whole servers are down or not joined in the cluster in different fault sets. This is a very risky state, because there are parts of the volumes with only one live replica; if the latest writes land on a drive returning an I/O error or broken data (detected by StorPool), this will lead to data loss.

As in the degraded state, some of the read operations for parts of the volumes will be served from HDDs in a hybrid system and might raise read latencies. In this state it is very important to bring back the missing services/nodes as soon as possible, because a failure of any of the remaining drives in other nodes or another fault set will bring some of the volumes into down state and might lead to data loss in case of an error returned by a drive holding the latest writes.

15.3.2.2. More than two storpool_server services or whole servers are down

This state results in some volumes being in down state (storpool volume status), due to some parts of their data residing only on the missing drives. The recommended action in this case is to check the reasons for the degraded services or missing (unresponsive) nodes and bring them back up.

Possible reasons are:

  • lost network connectivity

  • severe packet loss/delays/loops

  • partial or complete power loss

  • hardware instabilities, overheating

  • kernel or other software instabilities, crashes

15.3.3. Client service failure

If the client service (storpool_block) is down on some of the nodes depending on it (these could be either client-only nodes or converged hypervisors), all requests on that particular node will stall until the service is back up.

Possible reasons are again:

  • lost network connectivity

  • severe packet loss/delays/loops

  • bugs in the storpool_block service or the storpool_bd kernel module

In case of power loss or kernel crashes any virtual machine instances that were running on this node could be started on other available nodes.

15.3.4. Network interface or Switch failure

This means that the networks used by StorPool are down or are experiencing heavy packet loss or delays. In this case the quorum service will prevent a split-brain situation and will restart all services to ensure the cluster is fully connected on at least one network before it transitions again to running state. When different nodes have partial network connectivity, such issues might be alleviated by a single-VLAN setup, but severe packet loss will still cause severe delays.

15.3.5. Hard Drive/SSD failures

15.3.5.1. Drives from two or more different nodes (fault sets) in the cluster are missing (or from a single node/fault set for systems with dual replication pools).

In this case multiple volumes may either experience degraded performance (hybrid placement) or will be in down state when more than two replicas are missing. All operations on volumes in down state are stalled, until the redundancy is restored (i.e. at least one replica is available). The recommended steps are to immediately check for the reasons for the missing drives/services/nodes and return them into the cluster as soon as possible.

15.3.5.2. Some of the drives are more than 97% full.

At some point all cluster operations will stall until either some of the data in the cluster is deleted or new drives/nodes are added. Adding drives requires the new drives to be stress tested and a re-balancing of the system to include them, which should be carefully planned (details in Rebalancing StorPool).

Note

Cleaning up snapshots that have multiple cloned volumes and a negative value for used space in the output of storpool snapshot space will not free up any space.

15.3.5.3. Some of the drives have fewer than 100k free entries.

This is usually caused by a heavily overloaded system. In this state the latencies for some operations might become very high (measured in seconds). Possible reasons are severely overloaded volumes running for long periods of time without any configured bandwidth or IOPS limits. This could be checked by using iostat to look for volumes that are constantly 100% loaded with a large number of requests to the storage system. Another way to check for such volumes is to use “Top volumes” in the analytics to get info on the most loaded volumes, and then apply IOPS and/or bandwidth limits accordingly.

Other causes are misbehaving (underperforming) drives or misbehaving HBA/SAS controllers. The recommended way to deal with these cases is to look for such drives: a good idea is to check the output of storpool disk list internal for higher aggregation scores on some drives or sets of drives (e.g. on the same server), or to use the analytics to check for abnormal latency on some of the backend nodes (i.e. drives with significantly higher operation latency compared to other drives of the same type). Examples would be a failing controller causing the SATA speed to degrade to SATA 1.0 (1.5 Gb/s) instead of SATA 3.0 (6 Gb/s), worn-out batteries on a RAID controller whose cache is used to accelerate the writes on the HDDs, and others.
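To spot constantly saturated devices with iostat, the %util column (the last column of sysstat's extended report) can be filtered. A sketch over an illustrative report; pipe live `iostat -x 5` output instead of the sample:

```shell
# Print devices whose %util (last column) exceeds 90%.
# The header row yields 0 when coerced to a number and is skipped.
awk 'NF > 2 && $NF + 0 > 90 {print $1 " is " $NF "% utilized"}' <<'EOF'
Device   r/s    w/s   %util
sp-0     120.0  300.0 97.40
sp-1     10.0   20.0  12.00
EOF
```

Devices that show up run after run are candidates for the bandwidth/IOPS limits described above.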

The circumstances leading a system to the critical state are rare and are usually preventable by taking measures to handle all issues at the first signs of a change from the normal to the degraded state.

In any of the above cases if you feel that something is not as expected, a consultation with StorPool Support is the best course of action. StorPool Support receives notifications for all detailed cases and pro-actively takes actions to alleviate a system going into degraded or critical state as soon as practically possible.

15.3.6. Hanging requests in the cluster

The output of /usr/lib/storpool/latthreshold.py shows hanging requests and/or missing services as in the example below:

disk | reported by | peers                      |  s |   op  |      volume |                              requestId
-------------------------------------------------------------------------------------------------------------------
-    | client 2    | client 2    -> server 2.1  | 15 | read  | volume-name | 9223936579889289248:270215977642998472
-    | client 2    | client 2    -> server 2.1  | 15 | read  | volume-name | 9223936579889289248:270497452619709333
-    | client 2    | client 2    -> server 2.1  | 15 | read  | volume-name | 9223936579889289248:270778927596419790
-    | client 2    | client 2    -> server 2.1  | 15 | write | volume-name | 9223936579889289248:271060402573130531
-    | client 2    | client 2    -> server 2.1  | 15 | read  | volume-name | 9223936579889289248:271341877549841211
-    | client 2    | client 2    -> server 2.1  | 15 | read  | volume-name | 9223936579889289248:271623352526551744
-    | client 2    | client 2    -> server 2.1  | 15 | write | volume-name | 9223936579889289248:271904827503262450
server 2.1  connection status: established no_data timeout
disk 202 EXPECTED_UNKNOWN server 2.1

This could be caused by CPU starvation, hardware resets, misbehaving disks or network, or stalled services. The disk field in the output and the service warnings after the requests table can be used as indicators of the misbehaving component.

Note that the active requests API call has a timeout for each service to respond. The default timeout the latthreshold tool uses is 10 seconds. This value can be altered with latthreshold's --api-requests-timeout/-A option, passing it a numeric value with a time unit (m, s, ms or us), e.g. 100ms.

Service connection will have one of the following statuses:

  1. established done - service reported its active requests as expected; this is not displayed in the regular output, only with --json

  2. not_established - a connection with the service could not be made; this could indicate the server is down, but may also indicate that the service version is too old or that its stream was overfilled or not connected

  3. established no_data timeout - service did not respond and the connection was closed because the timeout was reached

  4. established data timeout - service responded but the connection was closed because the timeout was reached before it could send all the data

  5. established invalid_data - a message the service sent had invalid data in it

The latthreshold tool also reports disk statuses. Reported disk statuses will be one of the following:

  1. EXPECTED_MISSING - the service response was good, but did not provide information about the disk

  2. EXPECTED_NO_CONNECTION_TO_PEER - the connection to the service was not established

  3. EXPECTED_NO_PEER - the service is not present

  4. EXPECTED_UNKNOWN - the service response was invalid or a timeout occurred