StorPool User Guide - version 21

1. StorPool Overview

StorPool is distributed block storage software. It pools the attached storage (HDDs, SSDs or NVMe drives) of standard servers to create a single pool of shared storage. The StorPool software is installed on each server in the cluster. It combines the performance and capacity of all drives attached to the servers into one global namespace.

StorPool provides standard block devices. You can create one or more volumes through its sophisticated volume manager. StorPool is compatible with ext3 and XFS file systems and with any system designed to work with a block device, e.g. databases and cluster file systems (like OCFS and GFS). StorPool can also be used with no file system, for example when using volumes to store VM images directly or as LVM physical volumes.

Redundancy is provided by multiple copies (replicas) of the data written synchronously across the cluster. Users may set the number of replication copies or an erasure coding scheme. The replication level directly correlates with the number of servers that may be down without interruption in the service. For replication 3 and all N+2 erasure coding schemes, the number of servers (see 12.6.  Fault sets) that may be down simultaneously without losing access to the data is 2.

StorPool protects data and guarantees its integrity by a 64-bit checksum and version for each sector on a StorPool volume or snapshot. StorPool provides a very high degree of flexibility in volume management. Unlike other storage technologies, such as RAID or ZFS, StorPool does not rely on device mirroring (pairing drives for redundancy). So every disk that is added to a StorPool cluster adds capacity and improves the performance of the cluster, not just for new data but also for existing data. Provided that there are sufficient copies of the data, drives can be added or taken away with no impact to the storage service. Unlike rigid systems like RAID, StorPool does not impose any strict hierarchical storage structure dictated by the underlying disks. StorPool simply creates a single pool of storage that utilizes the full capacity and performance of a set of commodity drives.

2. Architecture

StorPool works on a cluster of servers in a distributed shared-nothing architecture. All functions are performed by all servers on an equal peer basis. It works on standard off-the-shelf servers running GNU/Linux.

Each storage node is responsible for data stored on its local drives. Storage nodes collaborate to provide the storage service. StorPool provides a shared storage pool combining all the available storage capacity. It uses synchronous replication across servers. The StorPool client communicates in parallel with all StorPool servers. The StorPool iSCSI target provides access to volumes exported through it to other initiators.

The software consists of two different types of services - storage server services and storage client services - that are installed on each physical server (host, node). The storage client services are the native block device driver on Linux-based systems, and the iSCSI target or the NVMe/TCP target for other systems. Each host can be a storage server, a storage client, an iSCSI target, an NVMe-oF controller, or any combination. The StorPool volumes appear to storage clients as block devices under the /dev/storpool/ directory and behave as normal disk devices. The data on the volumes can be read and written by all clients simultaneously; its consistency is guaranteed through a synchronous replication protocol. Volumes may be used by clients as they would use a local hard drive or disk array.

3. Feature Highlights

3.1. Scale-out, not Scale-Up

The StorPool solution is fundamentally about scaling out (by adding more drives or nodes) rather than scaling up (adding capacity by replacing a storage box with a larger storage box). This means StorPool can scale independently by IOPS, storage space and bandwidth. There is no bottleneck or single point of failure. StorPool can grow without interruption and in small steps - a drive, a server and/or a network interface at a time.

3.2. High Performance

StorPool combines the IOPS performance of all drives in the cluster and optimizes drive access patterns to provide low latency and handling of storage traffic bursts. The load is distributed equally between all servers through striping and sharding.

3.3. High Availability and Reliability

StorPool uses a replication mechanism that slices and stores copies of the data on different servers. For primary, high-performance storage this solution has many advantages compared to RAID systems and provides considerably higher levels of reliability and availability. In case of a drive, server, or other component failure, StorPool uses the available copies of the data located on other nodes in the same or other racks, significantly decreasing the risk of losing data or losing access to it.

3.4. Commodity Hardware

StorPool supports drives and servers in a vendor-agnostic manner, allowing you to avoid vendor lock-in. This allows the use of commodity hardware, while preserving reliability and performance requirements. Moreover, unlike RAID, StorPool is drive-agnostic - you can mix drives of various types, makes, speeds, or sizes in a StorPool cluster.

3.5. Shared Block Device

StorPool provides shared block devices with semantics identical to a shared iSCSI or FC disk array.

3.6. Co-existence with hypervisor software

StorPool can utilize repurposed existing servers and can co-exist with hypervisor software on the same server. This means that there is no dedicated hardware for storage, and growing an IaaS cloud solution is achieved by simply adding more servers to the cluster.

3.7. Compatibility

StorPool is compatible with 64-bit Intel and AMD based servers. We support all Linux-based hypervisors and hypervisor management software. Any Linux software designed to work with a shared storage solution such as an iSCSI or FC disk array will work with StorPool. StorPool guarantees the functionality and availability of the storage solution at the Linux block device interface.

3.8. CLI interface and API

StorPool provides an easy to use yet powerful command-line interface (CLI) tool for administration of the data storage solution. It is simple and user-friendly - making configuration changes, provisioning and monitoring fast and efficient.

StorPool also provides a RESTful JSON API, and python bindings exposing all the available functionality, so you can integrate it with any existing management system.

3.9. Reliable Support

StorPool comes with reliable dedicated support:

  • Remote installation and initial configuration by StorPool’s specialists

  • 24x7 support

  • Live software updates without interruption in the service

4. Hardware Requirements

All distributed storage systems are highly dependent on the underlying hardware. There are some aspects that will help achieve maximum performance with StorPool and are best considered in advance. Each node in the cluster can be used as server, client, iSCSI target or any combination; depending on the role, hardware requirements vary.

Tip

For detailed information about the supported hardware and software, see the StorPool System Requirements document.

4.1. Minimum StorPool cluster

  • 3 industry-standard x86 servers;

  • any x86-64 CPU with 4 threads or more;

  • 32 GB ECC RAM per node (8+ GB used by StorPool);

  • any hard drive controller in JBOD mode;

  • 3x SATA3 hard drives or SSDs;

  • dedicated 2x10GE LAN;

4.3. How StorPool relies on hardware

4.3.1. CPU

When the system load increases, CPUs get saturated with system interrupts. To avoid the negative effects of this, StorPool’s server and client processes are given one or more dedicated CPU cores. This significantly improves both the overall performance and the performance consistency.

4.3.2. RAM

ECC memory can detect and correct the most common kinds of in-memory data corruption, thus maintaining a memory system immune to single-bit errors. Using ECC memory is an essential requirement for improving the reliability of the node. In fact, StorPool is not designed to work with non-ECC memory.

4.3.3. Storage (HDDs / SSDs)

StorPool ensures the best drive utilization. Replication and data integrity are core functionality, so RAID controllers are not required and all storage devices can be connected as JBOD. All hard drives are journaled, typically on a high-endurance NVMe drive similar to the Intel Optane series. When a write-back cache is available on a RAID controller, it can be used in a StorPool-specific way to provide power-loss protection for the data written on the hard disks. This is not necessary for SATA SSD pools.

4.3.4. Network

StorPool is a distributed system, which means that the network is an essential part of it. Designed for efficiency, StorPool combines data transfers from the other nodes in the cluster. This greatly improves the data throughput compared with access to local devices, even if they are SSDs or NVMe drives.

4.4. Software Compatibility

4.4.1. Operating Systems

  • Linux (various distributions)

  • Windows, VMware, and Citrix Xen through standard protocols (iSCSI)

4.4.2. File Systems

Developed and optimized for Linux, StorPool is very well tested on CentOS, Ubuntu and Debian. Compatible and well tested with ext4 and XFS file systems and with any system designed to work with a block device, e.g. databases and cluster file systems (like GFS2 or OCFS2). StorPool can also be used with no file system, for example when using volumes to store VM images directly. StorPool is compatible with other technologies from the Linux storage stack, such as LVM, dm-cache/bcache, and LIO.

4.4.3. Hypervisors & Cloud Management/Orchestration

  • KVM

  • LXC/Containers

  • OpenStack

  • OpenNebula

  • OnApp

  • CloudStack

  • any other technology compatible with the Linux storage stack.

5. Installation and Upgrade

Currently the installation and upgrade procedures are performed by the StorPool support team.

6. Node configuration options

You can configure a StorPool node by setting options in the /etc/storpool.conf configuration file. You can also define options in configuration files in the /etc/storpool.conf.d/ directory; these files must meet all of the following requirements:

  • The name ends with the .conf extension.

  • The name does not start with a dot (.).

When the system locates files in the /etc/storpool.conf.d/ directory with correct names (like local.conf or hugepages.conf), it processes the options set in them. It ignores files with incorrect names, like .hidden.conf, local.confx, server.conf.bak, or storpool.conf~.
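
For example, a per-node override could be kept in a drop-in file with a valid name (the file name and contents below are purely illustrative):

# /etc/storpool.conf.d/local.conf - processed, because the name ends in .conf
[spnode1.example.com]
SP_OURID=1

A file named server.conf.bak in the same directory would be ignored, as noted above.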

6.1. Introduction

Here you can find a small example for the /etc/storpool.conf file, and also information about putting host-specific options in separate sections.

6.1.1. Minimal node configuration

The minimum working configuration must specify the network interface, number of expected nodes, authentication tokens, and unique ID of the node, like in the example below:

#-
# Copyright (c) 2013 - 2017  StorPool.
# All rights reserved.
#

# Human readable name of the cluster, usually in the "Company Name"-"Location"-"number" form
# For example: StorPool-Sofia-1
# Mandatory for monitoring
SP_CLUSTER_NAME=  #<Company-Name-PoC>-<City-or-nearest-airport>-<number>

# Remote authentication token provided by StorPool support for data related to crashed services, collected
# vmcore-dmesg files after kernel panic, per-host monitoring alerts, storpool_iolatmon alerts, and so on.
SP_RAUTH_TOKEN= <rauth-token>

# Computed by the StorPool Support team; consists of location and cluster separated by a dot
# For example: nzkr.b
# Mandatory (since version 16.02)
SP_CLUSTER_ID=  #Ask StorPool Support

# Interface for storpool communication
#
# Default: empty
SP_IFACE1_CFG=1:eth1.100:eth1:100:10.1.100.1:b:x:r
SP_IFACE2_CFG=1:eth2.100:eth2:100:10.2.100.1:b:x:r

# expected nodes for beacon operation
#
# !!! Must be specified !!!
#
SP_EXPECTED_NODES=3

# API authentication token
#
# 64bit random value
# For example, generate it with: od -vAn -N8 -tu8 /dev/random
SP_AUTH_TOKEN=4306865639163977196

##########################

[spnode1.example.com]
SP_OURID = 1

This section of the documentation describes all options. If you need more examples, check the /usr/share/doc/storpool/examples/storpool.conf.example file in your StorPool installation.

6.1.2. Per host configuration

Options can also be set for a specific host. The value in the square brackets should be the name of the host, as returned by the hostname command:

[spnode1.example.com]
SP_OURID=1

Specific configuration details might be added for each host individually, as shown in the example below:

[spnode1.example.com]
SP_OURID=1
SP_IFACE1_CFG=1:eth1.100:eth1:100:10.1.100.1:b:x:r
SP_IFACE2_CFG=1:eth2.100:eth2:100:10.2.100.1:b:x:r
SP_NODE_NON_VOTING=1

Note

It is important that each node has valid configuration sections for all nodes in the cluster in its local /etc/storpool.conf file. Keep consistent /etc/storpool.conf files across all nodes in the cluster.
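
For example, the same /etc/storpool.conf on every node of a three-node cluster would carry a per-host section for each node (the host names other than spnode1.example.com are illustrative):

[spnode1.example.com]
SP_OURID=1

[spnode2.example.com]
SP_OURID=2

[spnode3.example.com]
SP_OURID=3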

6.2. Identification and voting

You can use the /etc/storpool.conf configuration file to set the ID of the node in the cluster, whether it is a voting node, and the number of expected nodes.

6.2.1. Node ID

Use the SP_OURID option to set the ID of the node in the cluster. The value must be unique throughout the cluster.

SP_OURID=1

6.2.2. Non-voting beacon node

On client-only nodes (see 2.  Architecture) the storpool_server service should not be started. To achieve this, on such nodes you should set the SP_NODE_NON_VOTING option to 1, as shown below (default value is 0):

SP_NODE_NON_VOTING=1

Attention

It is strongly recommended to configure SP_NODE_NON_VOTING at the per-host configuration sections in storpool.conf; for details, see Per host configuration.

6.2.3. Expected nodes

Minimum number of expected nodes for beacon operation. Usually equal to the number of nodes with storpool_server instances running:

SP_EXPECTED_NODES=3
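
As a sketch combining the options from this section: in a cluster with three server nodes and one additional client-only node, SP_EXPECTED_NODES stays at 3 (the client-only node runs no storpool_server), and the client-only node gets a per-host section such as the one below (the host name is illustrative):

[spnode4.example.com]
SP_OURID=4
SP_NODE_NON_VOTING=1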

6.3. Network communication

You can use the /etc/storpool.conf configuration file to set how the network interfaces should be used, preferred ports, and other network-related options.

6.3.1. Interfaces for StorPool cluster communication

The network interface options should be defined in the following way:

SP_IFACE1_CFG='config-version':'resolve-interface':'raw-interface':'VLAN':'IP':'resolve':'shared':'mac'

The option names are in the SP_IFACE1_CFG format, where the number after SP_IFACE is the number of the interface. The option values are as follows:

config-version

Version of the config format (currently 1).

resolve-interface

Name of the kernel interface that has the IP address.

raw-interface

Name of the raw interface to transmit and receive on.

VLAN

VLAN tag to use or 0 if the VLAN is untagged.

IP

IP address to use.

resolve

‘b’ - use broadcast for resolving (for Ethernet based networks); ‘k’ - use kernel address resolving (for IP-only networks).

shared

‘s’ for shared in case of a bond interface on top; ‘x’ for exclusive - if nothing else is supposed to use this interface

mac

‘r’ for using the unmodified MAC address of the raw interface; ‘P’ for a privatized MAC address; ‘v’ for using the MAC address of the resolve interface.

It is recommended to have two dedicated network interfaces for communication between the nodes. Here are a few examples:

  • Single VLAN active-backup bond interface, storage and API on the same VLAN:

    SP_IFACE1_CFG=1:bond0.900:eth2:900:10.9.9.1:b:s:P
    SP_IFACE2_CFG=1:bond0.900:eth3:900:10.9.9.1:b:s:P
    
  • Single VLAN, LACP bond interface, storage and API on the same VLAN:

    SP_IFACE1_CFG=1:bond0.900:eth2:900:10.9.9.1:b:s:v
    SP_IFACE2_CFG=1:bond0.900:eth3:900:10.9.9.1:b:s:v
    
  • Two VLANs: one VLAN for storage, one VLAN for API, both over bond:

    SP_IFACE1_CFG=1:bond0.101:eth2:101:10.2.1.1:b:s:P
    SP_IFACE2_CFG=1:bond0.101:eth3:101:10.2.1.1:b:s:P
    
  • Three VLANs: two VLANs for storage, API on a separate physical interface:

    SP_IFACE1_CFG=1:eth2.101:eth2:101:10.2.1.1:b:x:v
    SP_IFACE2_CFG=1:eth3.201:eth3:201:10.2.2.1:b:x:v
    
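As a concrete reading of the first example above, the value 1:bond0.900:eth2:900:10.9.9.1:b:s:P breaks down as: config format version 1, resolve interface bond0.900, raw interface eth2, VLAN tag 900, IP address 10.9.9.1, broadcast resolving (b), shared (s) because a bond is configured on top, and a privatized MAC address (P).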

6.3.2. Address for API management (storpool_mgmt)

Address for receiving requests from user-space tools. Multiple clients can simultaneously send requests to the API; for details about the management service, see 9.  Background services. By default, the address is bound on localhost:

SP_API_HTTP_HOST=127.0.0.1

For cluster-wide access and automatic failover between the nodes, multiple nodes might have the API service started. The specified IP address is brought up only on one of the nodes in the cluster at a time - the so-called active API service. You may specify an available IP address (SP_API_HTTP_HOST), which will be brought up or down on the corresponding interface (SP_API_IFACE) when migrating the API service between the nodes.

To configure an interface (SP_API_IFACE) and address (SP_API_HTTP_HOST):

SP_API_HTTP_HOST=10.10.10.240
SP_API_IFACE=eth1

Note

The script that adds or deletes the SP_API_HTTP_HOST address is located at /usr/lib/storpool/api-ip. It can easily be modified for other use cases, for example to configure routing, firewalls, and so on.

6.3.3. Port for API management (storpool_mgmt)

Port for the API management service. The default value is 81:

SP_API_HTTP_PORT=81

6.3.4. API authentication token

This value must be a unique integer for each cluster:

SP_AUTH_TOKEN=0123456789

Hint

The token can be generated with: od -vAn -N8 -tu8 /dev/random

6.3.5. Ignore RX port option

Used to instruct the services that the network can preserve the selected port even when altering ports. Default value is 0:

SP_IGNORE_RX_PORT=0

6.3.6. Preferred port

Used for setting which port is preferred when two networks are specified, but only one of them could actually be used for any reason (in an active-backup bond style). The default value is 0:

SP_PREFERRED_PORT=0 # load-balancing

Supported values are:

SP_PREFERRED_PORT=1 # use SP_IFACE1_CFG by default
SP_PREFERRED_PORT=2 # use SP_IFACE2_CFG by default

6.3.7. Address for the bridge service (storpool_bridge)

Required for the local bridge service (see 9.  Background services), this is the address where the bridge binds to:

SP_BRIDGE_HOST=180.220.200.8

6.3.8. Interface for the bridge address (storpool_bridge)

Expected when the SP_BRIDGE_HOST value is a floating IP address for the storpool_bridge service:

SP_BRIDGE_IFACE=bond0.900

6.4. Drives

You can use the /etc/storpool.conf configuration file to define a specific driver to be used for NVMe SSD drives, a group owner for the StorPool devices, and other drive-related options.

6.4.1. NVMe SSD drives

The storpool_nvmed service (see 9.  Background services) automatically detects all initialized StorPool devices, and attaches them to the configured SP_NVME_PCI_DRIVER.

To configure a driver for storpool_nvmed that is different than the default storpool_pci:

SP_NVME_PCI_DRIVER=vfio-pci

The vfio-pci driver requires the iommu=pt kernel command line option for both Intel and AMD CPUs, and additionally the intel_iommu=on option for Intel CPUs.
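
As an illustration only (the exact file and regeneration step depend on the distribution and bootloader), on a GRUB-based system these parameters could be added via /etc/default/grub:

GRUB_CMDLINE_LINUX="... iommu=pt intel_iommu=on"   # "..." stands for the already existing parameters

After that, the GRUB configuration has to be regenerated (for example with grub2-mkconfig -o /boot/grub2/grub.cfg on CentOS-like systems) and the node rebooted.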

6.4.2. Exclude disks globally or per server instance

A list of paths to drives to be excluded at instance boot time:

SP_EXCLUDE_DISKS=/dev/sda1:/dev/sdb1

Can also be specified for each server instance individually:

SP_EXCLUDE_DISKS=/dev/sdc1
SP_EXCLUDE_DISKS_1=/dev/sda1

6.4.3. Parallel requests per disk when recovering from remote (storpool_bridge)

Number of parallel requests to issue while performing remote recovery. Should be between 1 and 64, the default value is 2:

SP_REMOTE_RECOVERY_PARALLEL_REQUESTS_PER_DISK=2

6.4.4. Group owner for the StorPool devices

The system group to use for the /dev/storpool and /dev/storpool-byid/ directories and the /dev/sp-* raw disk devices:

SP_DISK_GROUP=disk

6.4.5. Permissions for the StorPool devices

The access mode to set on the /dev/sp-* raw disk devices:

SP_DISK_MODE=0660

6.5. Monitoring and debugging

You can use the /etc/storpool.conf configuration file to set the cluster name, cluster ID, and other options related to monitoring and debugging.

6.5.1. Cluster name

Required for the pro-active monitoring performed by the StorPool support team. Usually in the <Company-Name>-<City-or-nearest-airport>-<number> form, with numbering starting from 1.

SP_CLUSTER_NAME=StorPool-Sofia-1

6.5.2. Cluster ID

The Cluster ID is computed from the StorPool Support ID and consists of two parts - location and cluster - separated by a dot ("."). Each location consists of one or more clusters:

SP_CLUSTER_ID=nzkr.b

6.5.3. Local user for debug data collection

The user to whom the ownership of the storpool_abrtsync service runtime is changed (see 9.  Background services). Unset by default:

SP_ABRTSYNC_USER=

Note

If not configured during installation, this user defaults to storpool.

6.5.4. Remote addresses for sending debug data

The defaults are shown below. They can be changed (rarely needed) when a jump host or custom collection nodes are used.

SP_ABRTSYNC_REMOTE_ADDRESSES=reports.storpool.com,reports1.storpool.com,reports2.storpool.com

6.5.5. Remote ports for sending debug data

The default port is shown below. It can be changed when a jump host or custom collection nodes are used.

SP_CRASH_REMOTE_PORT=2266

6.5.6. Report directory

Location for collecting automated bug reports and shared memory dumps:

SP_REPORTDIR=/var/spool/storpool

6.6. Cgroup options

You can use the /etc/storpool.conf configuration file to define several options related to cgroups. For more information on StorPool and cgroups, see Control groups.

6.6.1. Enabling cgroups

The following option enables the usage of cgroups. The default value is 1:

SP_USE_CGROUPS=1

Each StorPool process requires a specification of the cgroups it should be started into; there is a default configuration for each service. One or more processes may be placed in the same cgroup, or each one may be in a cgroup of its own, as appropriate. StorPool provides a tool for setting up cgroups called storpool_cg. It is able to automatically configure a system depending on the installed services on all supported operating systems.

6.6.2. storpool_rdma module

The SP_RDMA_CGROUPS option is required for setting the cgroups of the kernel threads started by the storpool_rdma module:

SP_RDMA_CGROUPS=-g cpuset:storpool.slice/rdma -g memory:storpool.slice/common

6.6.3. Options for StorPool services

Use the following options to configure the StorPool services:

  • storpool_block

    Use the SP_BLOCK_CGROUPS option:

    SP_BLOCK_CGROUPS=-g cpuset:storpool.slice/block -g memory:storpool.slice/common
    
  • storpool_bridge

    Use the SP_BRIDGE_CGROUPS option:

    SP_BRIDGE_CGROUPS=-g cpuset:storpool.slice/bridge -g memory:storpool.slice/alloc
    
  • storpool_server

    Use the SP_SERVER_CGROUPS option:

    SP_SERVER_CGROUPS=-g cpuset:storpool.slice/server -g memory:storpool.slice/common
    SP_SERVER1_CGROUPS=-g cpuset:storpool.slice/server_1 -g memory:storpool.slice/common
    SP_SERVER2_CGROUPS=-g cpuset:storpool.slice/server_2 -g memory:storpool.slice/common
    SP_SERVER3_CGROUPS=-g cpuset:storpool.slice/server_3 -g memory:storpool.slice/common
    SP_SERVER4_CGROUPS=-g cpuset:storpool.slice/server_4 -g memory:storpool.slice/common
    SP_SERVER5_CGROUPS=-g cpuset:storpool.slice/server_5 -g memory:storpool.slice/common
    SP_SERVER6_CGROUPS=-g cpuset:storpool.slice/server_6 -g memory:storpool.slice/common
    
  • storpool_beacon

    Use the SP_BEACON_CGROUPS option:

    SP_BEACON_CGROUPS=-g cpuset:storpool.slice/beacon -g memory:storpool.slice/common
    
  • storpool_mgmt

    Use the SP_MGMT_CGROUPS option:

    SP_MGMT_CGROUPS=-g cpuset:storpool.slice/mgmt -g memory:storpool.slice/alloc
    
  • storpool_controller

    Use the SP_CONTROLLER_CGROUPS option:

    SP_CONTROLLER_CGROUPS=-g cpuset:system.slice -g memory:system.slice
    
  • storpool_iscsi

    Use the SP_ISCSI_CGROUPS option:

    SP_ISCSI_CGROUPS=-g cpuset:storpool.slice/iscsi -g memory:storpool.slice/alloc
    
  • storpool_nvmed

    Use the SP_NVMED_CGROUPS option:

    SP_NVMED_CGROUPS=-g cpuset:storpool.slice/beacon -g memory:storpool.slice/common
    

6.7. Miscellaneous options

6.7.1. Working directory

Used for reports, shared memory dumps, sockets, core files, and so on. The default value is /var/run/storpool:

SP_WORKDIR=/var/run/storpool

Hint

On nodes with /var/run in RAM and a limited amount of memory, /var/spool/storpool/run is recommended.
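
For example, to follow this recommendation:

SP_WORKDIR=/var/spool/storpool/run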

6.7.2. Restart automatically in case of crash

The main StorPool services (see 9.  Background services) are governed by a special storpool_daemon service. The SP_RESTART_ON_CRASH option specifies a period of time in seconds (default is 1800). If a service crashes and the number of its crashes within the specified period is less than 3, the service is restarted automatically.

SP_RESTART_ON_CRASH=1800

6.7.3. Logging of non-read-only open/close for StorPool devices

If set to 0, the storpool_bd kernel module will not log anything about opening or closing StorPool devices:

SP_BD_LOG_OPEN_CLOSE=1

6.7.4. Configuring the storpool_logd service

For details about the storpool_logd service, see StorPool Log Daemon (storpool_logd).

To configure an HTTP/S proxy for the service:

SP_HTTPS_PROXY=<proxy URL>

To override the URL for the service:

SP_LOGD_URL=<logd-URL>

Note

Custom instances require HTTPS with properly installed certificates, locally if necessary.

6.7.5. Cache size

Each storpool_server process allocates the amount of RAM (in MB) set with SP_CACHE_SIZE for caching. The size of the cache depends on the number of storage devices handled by each storpool_server instance, and is taken care of by the storpool_cg tool during cgroups configuration. Here is an example configuration for all storpool_server instances:

SP_CACHE_SIZE=4096

Note

A node with three storpool_server processes running will use 4096*3 = 12GB cache total.

You can override the size of the cache for each of the storpool_server instances, as shown below. This is useful when different instances control different numbers of drives:

SP_CACHE_SIZE=1024
SP_CACHE_SIZE_1=1024
SP_CACHE_SIZE_2=4096
SP_CACHE_SIZE_3=8192

Turning on the internal write-back caching:

SP_WRITE_BACK_CACHE_ENABLED=1

Attention

UPS is mandatory with write-back caching. A clean server shutdown is required before the UPS batteries are depleted.

7. Storage devices

All storage devices that will be used by StorPool (HDD, SSD, or NVMe) must have one or more properly aligned partitions, and must have an assigned ID. Larger NVMe drives should be split into two or more partitions, which allows assigning them to different instances of the storpool_server service.

You can initialize the devices quickly using the disk_init_helper tool provided by StorPool. Alternatively, you can do this manually using the common parted tool.

7.1. Journals

All hard disk drives should have a journal provided in one of the following ways:

  • On a persistent memory device (/dev/pmemN)

  • On a small, high-endurance NVMe device (An Intel Optane or similar)

  • On a regular NVMe drive, on a small partition separate from its main data partition

  • On battery/cachevault power-loss protected virtual device (RAID controllers).

Having no journals on the HDDs is acceptable in case of snapshot-only data (for example, a backup-only cluster).

7.2. Using disk_init_helper

The disk_init_helper tool is used in two steps:

  1. Discovery and setup

    The tool discovers all drives that do not have partitions and are not used anywhere (no LVM PV, device mapper RAID, StorPool data disks, and so on). It uses this information to generate a suggested configuration, which is stored as a configuration file. You can try different options until you get a configuration that suits your needs.

  2. Initialization

    You provide the configuration file from the first step to the tool, and it initializes the drives.

disk_init_helper is also used in the storpool-ansible playbook (see github.com/storpool/ansible), where it helps provide consistent defaults for known configurations and idempotency.
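
Putting the two steps together, a typical session uses the commands shown in the subsections below:

# disk_init_helper discover --start 2501 -d disks.json   # Step 1: discover unused drives, store the proposed layout
# disk_init_helper init disks.json                       # Step 2: once satisfied with the proposal, initialize the drives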

7.2.1. Example node

This is an example node with 7 x 960GB SSDs, 8 x 2TB HDDs, 1 x 100GB Optane NVMe, and 3 x 1TB NVMe disks:

[root@s25 ~]# lsblk
NAME     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda        8:0    0 894.3G  0 disk
sdb        8:16   0 894.3G  0 disk
sdc        8:32   0 894.3G  0 disk
sdd        8:48   0 894.3G  0 disk
sde        8:64   0 894.3G  0 disk
sdf        8:80   0 894.3G  0 disk
sdg        8:96   0 894.3G  0 disk
sdh        8:112  0   1.8T  0 disk
sdi        8:128  0   1.8T  0 disk
sdj        8:144  0   1.8T  0 disk
sdk        8:160  0   1.8T  0 disk
sdl        8:176  0 111.8G  0 disk
|-sdl1     8:177  0  11.8G  0 part
|-sdl2     8:178  0   100G  0 part /
`-sdl128 259:15   0   1.5M  0 part
sdm        8:192  0   1.8T  0 disk
sdn        8:208  0   1.8T  0 disk
sdo        8:224  0   1.8T  0 disk
sdp        8:240  0   1.8T  0 disk
nvme0n1  259:6    0  93.2G  0 disk
nvme1n1  259:0    0 931.5G  0 disk
nvme2n1  259:1    0 931.5G  0 disk
nvme3n1  259:4    0 931.5G  0 disk

This node is used in the examples below.

7.2.2. Discovering drives

7.2.2.1. Basic usage

To assign IDs for all disks on this node run the tool with the --start argument:

[root@s25 ~]# disk_init_helper discover --start 2501 -d disks.json
sdl partitions: sdl1, sdl2, sdl128
Success generating disks.json, proceed with 'init'

Note

Note that the automatically generated IDs must be unique within the StorPool cluster. Allowed IDs are between 1 and 4000.

StorPool disk IDs are assigned with an offset of 10 between the SSD, NVMe, and HDD groups, which can be further tweaked with parameters.
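
For instance, in the example output in the next subsection, --start 2501 results in SSD IDs 2501-2507, NVMe data disk IDs 2511-2513, and HDD IDs 2521-2528.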

By default, the tool discovers all disks without partitions; the one where the OS is installed (/dev/sdl) is skipped. The tool does the following:

  • Prepares all SSD, NVMe, and HDD devices with a single large partition on each one.

  • Uses the Optane device as a journal-only device for the hard drive journals.

7.2.2.2. Viewing configuration

You can use the --show option to see what will be done:

[root@s25 ~]# disk_init_helper discover --start 2501 --show
sdl partitions: sdl1, sdl2, sdl128
/dev/sdb (type: SSD):
    data partitions
        /dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302126-part1 (2501): 894.25 GiB (mv: None)
/dev/sda (type: SSD):
    data partitions
        /dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302127-part1 (2502): 894.25 GiB (mv: None)
/dev/sdc (type: SSD):
    data partitions
        /dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302128-part1 (2503): 894.25 GiB (mv: None)
/dev/sdd (type: SSD):
    data partitions
        /dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302129-part1 (2504): 894.25 GiB (mv: None)
/dev/sde (type: SSD):
    data partitions
        /dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302137-part1 (2505): 894.25 GiB (mv: None)
/dev/sdf (type: SSD):
    data partitions
        /dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302138-part1 (2506): 894.25 GiB (mv: None)
/dev/sdg (type: SSD):
    data partitions
        /dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302139-part1 (2507): 894.25 GiB (mv: None)
/dev/sdh (type: HDD):
    data partitions
        /dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS00Y25-part1 (2521): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part1)
/dev/sdj (type: HDD):
    data partitions
        /dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS03YRJ-part1 (2522): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part2)
/dev/sdi (type: HDD):
    data partitions
        /dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS041FK-part1 (2523): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part3)
/dev/sdk (type: HDD):
    data partitions
        /dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS04280-part1 (2524): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part4)
/dev/sdp (type: HDD):
    data partitions
        /dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS05PTA-part1 (2525): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part5)
/dev/sdo (type: HDD):
    data partitions
        /dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS05PTB-part1 (2526): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part6)
/dev/sdm (type: HDD):
    data partitions
        /dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS05PTD-part1 (2527): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part7)
/dev/sdn (type: HDD):
    data partitions
        /dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS05PTJ-part1 (2528): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part8)
/dev/nvme0n1 (type: journal-only NVMe):
    journal partitions
        /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part1 (None): 0.10 GiB (mv: None)
        /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part2 (None): 0.10 GiB (mv: None)
        /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part3 (None): 0.10 GiB (mv: None)
        /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part4 (None): 0.10 GiB (mv: None)
        /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part5 (None): 0.10 GiB (mv: None)
        /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part6 (None): 0.10 GiB (mv: None)
        /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part7 (None): 0.10 GiB (mv: None)
        /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part8 (None): 0.10 GiB (mv: None)
/dev/nvme3n1 (type: NVMe w/ journals):
    data partitions
        /dev/disk/by-id/nvme-INTEL_SSDPE2KX010T8_BTLJ849207E61P0FGN-part1 (2511): 931.51 GiB (mv: None)
/dev/nvme1n1 (type: NVMe w/ journals):
    data partitions
        /dev/disk/by-id/nvme-INTEL_SSDPE2KX010T8_BTLJ849207F91P0FGN-part1 (2512): 931.51 GiB (mv: None)
/dev/nvme2n1 (type: NVMe w/ journals):
    data partitions
        /dev/disk/by-id/nvme-INTEL_SSDPE2KX010T8_BTLJ84920JAJ1P0FGN-part1 (2513): 931.51 GiB (mv: None)

7.2.2.3. Recognizing SSDs

The SSDs and HDDs are auto-discovered by their rotational flag in the /sys/block hierarchy. There are however occasions when this flag might be misleading, and an SSD is visible as a rotational device.

For such cases there are overrides that can further help with proper configuration, as shown in the example below:

# disk_init_helper discover --start 101 '*Micron*M510*:s'

All devices whose /dev/disk/by-id names match the *Micron*M510* pattern will be forced to the SSD type, regardless of how they were discovered by the tool.

7.2.2.4. Specifying a journal

Similarly, a journal may be specified for a device, for example:

# disk_init_helper discover --start 101 '*Hitachi*HUA7200*:h:njo'

This instructs the tool to use an NVMe journal-only device for keeping the journals of all matching Hitachi HUA7200 drives.

The overrides look like this: <disk-serial-pattern>:<disk-type>[:<journal-type>]

The disk type may be one of:

  • s - SSD drive

  • sj - SSD drive with HDD journals (used for testing only)

  • n - NVMe drive

  • nj - NVMe drive with HDD journals

  • njo - NVMe drive with journals only (no StorPool data disk)

  • h - HDD drive

  • x - Exclude this drive match, even if it has the right size.

The journal-type override is optional, and makes sense mostly when the device is an HDD (a combined example follows this list):

  • nj - journal on an NVMe drive - requires at least one nj device

  • njo - journal on an NVMe drive - requires at least one njo device

  • sj - journal on SSD drive (unusual, but useful for testing); requires at least one sj device.
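
Multiple overrides can be passed on the same command line; as a sketch combining the two examples above (assuming both drive models are present in the node):

# disk_init_helper discover --start 101 '*Micron*M510*:s' '*Hitachi*HUA7200*:h:njo'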

7.2.3. Initializing drives

To initialize the drives using an existing configuration file:

# disk_init_helper init disks.json

The above will apply the settings pre-selected during the discovery phase.

More options may be specified to either provide some visibility on what will be done (like --verbose and --noop), or provide additional options to storpool_initdisk for the different disk types (like --ssd-args and --hdd-args).
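
For instance, a dry run that only prints what would be done, without touching the drives, could look like this:

# disk_init_helper init disks.json --noop --verbose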

7.3. Manual partitioning

A disk drive can be initialized manually as a StorPool data disk.

7.3.1. Creating partitions

First, an aligned partition should be created spanning the full volume of the disk drive. Here is an example command for creating a partition on the whole drive with the proper alignment:

# parted -s --align optimal /dev/sdX mklabel gpt -- mkpart primary 2MiB 100%    # Here, X is the drive letter

For dual partitions on a NVMe drive that is larger than 4TB use:

# parted -s --align optimal /dev/nvmeXnY mklabel gpt -- mkpart primary 2MiB 50%   # Here, X is the nvme device controller, and Y is the device's first namespace
# parted -s --align optimal /dev/nvmeXnY mkpart primary 50% 100%

Similarly, to split an even larger (for example, 8TB or larger) NVMe drive to four partitions use:

# parted -s --align optimal /dev/nvmeXnY mklabel gpt -- mkpart primary 2MiB 25%   # Here, X is the nvme device controller, and Y is the device's first namespace
# parted -s --align optimal /dev/nvmeXnY mkpart primary 25% 50%
# parted -s --align optimal /dev/nvmeXnY mkpart primary 50% 75%
# parted -s --align optimal /dev/nvmeXnY mkpart primary 75% 100%

Hint

NVMe devices larger than 4TB should always be split into partitions of up to 4TiB each.

7.3.2. Initializing a drive

On a brand new cluster installation it is necessary to have one drive formatted with the “init” (-I) flag of the storpool_initdisk tool. This device is necessary only for the first start, and therefore it is best to pick the first drive in the cluster.

Initializing the first drive on the first server node with the init flag:

# storpool_initdisk -I {diskId} /dev/sdX   # Here, X is the drive letter

Initializing an SSD or NVME SSD device with the SSD flag set:

# storpool_initdisk -s {diskId} /dev/sdX   # Here, X is the drive letter

Initializing an HDD drive with a journal device:

# storpool_initdisk {diskId} /dev/sdX --journal /dev/sdY   # Here, X and Y are the drive letters

To list all initialized devices:

# storpool_initdisk --list

Example output:

0000:01:00.0-p1, diskId 2305, version 10007, server instance 0, cluster e.b, SSD, opened 7745
0000:02:00.0-p1, diskId 2306, version 10007, server instance 0, cluster e.b, SSD, opened 7745
/dev/sdr1, diskId 2301, version 10007, server instance 1, cluster e.b, SSD, opened 8080
/dev/sdq1, diskId 2302, version 10007, server instance 1, cluster e.b, SSD, opened 8080
/dev/sds1, diskId 2303, version 10007, server instance 1, cluster e.b, SSD, opened 8080
/dev/sdt1, diskId 2304, version 10007, server instance 1, cluster e.b, SSD, opened 8080
/dev/sda1, diskId 2311, version 10007, server instance 2, cluster e.b, WBC, jmv 160036C1B49, opened 8185
/dev/sdb1, diskId 2311, version 10007, server instance 2, cluster -, journal mv 160036C1B49, opened 8185
/dev/sdc1, diskId 2312, version 10007, server instance 2, cluster e.b, WBC, jmv 160036CF95B, opened 8185
/dev/sdd1, diskId 2312, version 10007, server instance 2, cluster -, journal mv 160036CF95B, opened 8185
/dev/sde1, diskId 2313, version 10007, server instance 3, cluster e.b, WBC, jmv 160036DF8DA, opened 8971
/dev/sdf1, diskId 2313, version 10007, server instance 3, cluster -, journal mv 160036DF8DA, opened 8971
/dev/sdg1, diskId 2314, version 10007, server instance 3, cluster e.b, WBC, jmv 160036ECC80, opened 8971
/dev/sdh1, diskId 2314, version 10007, server instance 3, cluster -, journal mv 160036ECC80, opened 8971

7.3.3. Drive initialization options

Other available options of the storpool_initdisk tool:

--list

List all StorPool disks on this node.

-i

Specify the server instance; used when more than one storpool_server instance is running on the same node.

-r

Used to return an ejected disk back to the cluster or change some of the flags.

-F

Forget this disk and mark it as ejected; succeeds only without a running storpool_server instance that has the drive opened.

-s|--ssd y/n

Set SSD flag - on new initialize only, not reversible with -r. Providing the y or n value forces a disk to be considered as flash-based or not.

-j|--journal (<device>|none)

Used for HDDs when a RAID controller with a working cachevault or battery is present or an NVMe device is used as a power loss protected write-back journal cache.

--bad

Marks disk as bad. Will be treated as ejected by the servers.

--good

Resets disk to ejected if it was bad. Use with caution.

--list-empty

List empty NVMe devices.

--json

Output the list of devices as a JSON object.

--nvme-smart nvme-pci-addr

Dump the NVMe S.M.A.R.T. counters; only for devices controlled by the storpool_nvmed service.

Advanced options (use with care):

-e (entries_count)

Initialize the disk overriding the default number of entries (the default is based on the disk size).

-o (objects_count)

Initialize the disk overriding the default number of objects (the default is based on the disk size).

--wipe-all-data

Used when re-initializing an already initialized StorPool drive. Use with caution.

--no-test

Disable forced one-time test flag.

--no-notify

Does not notify servers of the changes. They won’t immediately open the disk. Useful for changing a flag with -r without returning the disk back to the server.

--no-fua (y|n)

Used to forcefully disable FUA support for an SSD device. Use with caution because it might lead to data loss if the device is powered off before issuing a FLUSH CACHE command.

--no-flush (y|n)

Used to forcefully disable FLUSH support for an SSD device.

--no-trim (y|n)

Used to forcefully disable TRIM support for an SSD device. Useful when the drive is misbehaving when TRIM is enabled.

--test-override no/test/pass

Modify the “test override” flag (default during disk init is “test”).

--wbc (y|n)

Used for HDDs when the internal write-back caching is enabled; requires SP_WRITE_BACK_CACHE_ENABLED to be set in order to have an effect. Turned off by default.

--nvmed-rescan

Instruct the storpool_nvmed service to rescan after device changes.

8. Network interfaces

The recommended mode of operation is with hardware acceleration enabled for supported network interfaces. Most NICs controlled by the i40e/ixgbe/ice (Intel), mlx4_core/mlx5_core (Nvidia/Mellanox), and bnxt (Broadcom) drivers do support hardware acceleration.

When enabled, the StorPool services use the NIC directly, bypassing the Linux kernel. This reduces CPU usage and processing latency, and StorPool traffic is not affected by issues in the Linux kernel (for example, packet floods). Because the Linux kernel is bypassed, the entire network stack is implemented in the StorPool services.

8.1. Preparing Interfaces

There are two ways to configure the network interfaces for StorPool. One is automatic: providing the VLAN ID, the IP network(s), and the mode of operation, and leaving the IP address selection to the helper tooling. The other is semi-manual: providing explicit IP address configuration for each parameter on each of the nodes in the cluster.

8.2. Automatic configuration

The automatic configuration is performed with the net_helper tool provided by StorPool. When run, the tool selects addresses based on the SP_OURID of each node in the cluster. It requires a VLAN (default 0, that is, untagged), an IP network for the storage, and a pre-defined mode of operation for the interfaces in the OS.

The supported modes are exclusive-ifaces, active-backup-bond, bridge-active-backup-bond, mlag-bond, and bridge-mlag-bond. They are described in the sections below.

Note

All modes relate only to the way the kernel networks are configured. The storage services are always in active-active mode (unless configured differently), using the underlying interfaces directly. Any configuration of the kernel interfaces is solely for the purposes of other traffic, for example for access to the API.

8.2.1. Exclusive interfaces mode

The exclusive-ifaces mode offers the simplest possible configuration, with just the two main raw interfaces, each configured with a different address.

This mode is used mainly when there is no need for redundancy on the kernel network, usually for multipath iSCSI.

Note

Not recommended for the storage network if the API is configured on top. In such a situation it is recommended to use some of the bond modes.

8.2.2. Active backup bond modes

In the active-backup-bond and bridge-active-backup-bond modes, the underlying storage interfaces are added in an active-backup bond interface (named spbond0 by default), which uses an ARP monitor to select the active interface in the bond.

If the VLAN for the storage interfaces is tagged, an additional VLAN interface is created on top of the bond.

Here is a very simple example with untagged VLAN (that is, 0):

digraph G {
  rankdir=LR;
  image=svg;
  compound=true;
  sp0;
  sp1;
  spbond0;
  spbond0 -> sp1;
  spbond0 -> sp0;
}

If the network uses a tagged VLAN, the VLAN interface is created on top of the spbond0 interface.

Here is an example with tagged VLAN 100:

digraph G {
  rankdir=LR;
  image=svg;
  compound=true;
  sp0;
  sp1;
  spbond0;
  "spbond0.100";
  spbond0 -> sp1;
  spbond0 -> sp0;
  "spbond0.100" -> spbond0;
}

In the bridge-active-backup-bond mode, the final resolve interface is a slave of a bridge interface (named br-storage by default).

This is a tagged VLAN 100 on the bond:

digraph G {
  rankdir=LR;
  image=svg;
  compound=true;
  sp0;
  sp1;
  spbond0;
  "br-storage";
  "spbond0.100";
  spbond0 -> sp1;
  spbond0 -> sp0;
  "br-storage" -> "spbond0.100" -> spbond0;
}

Lastly, here is a more complex example with four interfaces (sp0, sp1, sp2, sp3). The first two for the storage network are in bridge-active-backup-bond mode. The other two for the iSCSI network are in exclusive-ifaces mode. There are two additional networks on top of the storage resolve interface (in this example, 1100 and 1200). There is also an additional multipath network on the iSCSI interfaces with VLANs: 1301 on the first, and 1302 on the second iSCSI network interface.

Creating such a configuration with the net_helper tool should be done in the following way:

# net_helper genconfig sp0 sp1 sp2 sp3 \
    --vlan 100 \
    --sp-network 10.0.0.1/24 \
    --sp-mode bridge-active-backup-bond \
    --add-iface-net 1100,10.1.100.0/24 \
    --add-iface-net 1200,10.1.200.0/24 \
    --iscsi-mode exclusive-ifaces \
    --iscsicfg 1301,10.130.1.0/24:1302,10.130.2.0/24

digraph G {
  rankdir=LR;
  image=svg;
  compound=true;
  sp0;
  sp1;
  spbond0;
  "spbond0.1100";
  "spbond0.1200";
  "br-storage";
  sp2;
  sp3;
  "sp2.1301";
  "sp3.1302";
  spbond0 -> sp1;
  spbond0 -> sp0;
  "br-storage" -> "spbond0.100";
  "spbond0.100" -> spbond0;
  "spbond0.1100" -> spbond0;
  "spbond0.1200" -> spbond0;
  "sp2.1301" -> sp2;
  "sp3.1302" -> sp3;
}

The tooling helps select the addresses for the ARP-monitoring targets automatically, if they are not overridden, for better active-interface selection. These addresses are usually those of the other storage nodes in the cluster. For the iSCSI interfaces in this mode, it is best to provide explicit ARP-monitoring addresses.

8.2.3. LACP modes

The mlag-bond and bridge-mlag-bond modes are very close to the active-backup-bond and bridge-active-backup-bond modes described above, with the notable difference that the bond is an LACP (802.3ad) bond, both when specified for the main storage interfaces and when specified for the iSCSI network interfaces.

With these bond types, no additional ARP-monitoring addresses are required or being auto-generated by the tooling.

A quirk of these modes is that multipath networks for iSCSI are created on top of the bond interface, because there is no way to send traffic through a specific interface under the bond. Use the exclusive-ifaces mode for such cases.

8.2.4. Creating the configuration

You can create a configuration and save it as a file using the net_helper tool with the genconfig option. For more information, see the examples provided below.

8.2.5. Applying the configuration

To actually apply the configuration stored in a file, use the applyifcfg option:

[root@s11 ~]# net_helper applyifcfg --from-config /etc/storpool/autonets.conf
...

Additional sub-commands available are:

up

Execute ifup/nmcli connection up on all created interfaces.

down

Execute ifdown/nmcli connection down on all created interfaces.

check

Check whether there is a configuration, or if there is a difference between the present one and a newly created one.

cleanup

Delete all network interfaces created by the net_helper tool. Useful when re-creating the same raw interfaces with a different mode.

8.2.6. Simple example

Here is a minimal example with the following parameters:

  • Interface names: sp0 and sp1 (the order is important)

  • VLAN ID: 42

  • IP Network: 10.4.2.0/24

  • Predefined mode of operation: active-backup bond on top of the storage interfaces

The example below is for a node with SP_OURID=11. Running net_helper genconfig this way will just print an example configuration:

[root@s11 ~]# storpool_showconf SP_OURID
SP_OURID=11
[root@s11 ~]# net_helper genconfig sp0 sp1 --vlan 42 --sp-network 10.4.2.0/24 --sp-mode active-backup-bond
interfaces=sp0 sp1
addresses=10.4.2.11
sp_mode=active-backup-bond
vlan=42
add_iface=
sp_mtu=9000
iscsi_mtu=9000
iscsi_add_iface=
arp_ip_targets=10.4.2.12,10.4.2.13,10.4.2.14,10.4.2.15
config_path=/etc/storpool.conf.d/net_helper.conf

To store the configuration on the file system of the node:

[root@s11 ~]# net_helper genconfig sp0 sp1 --vlan 42 --sp-network 10.4.2.0/24 --sp-mode active-backup-bond > /etc/storpool/autonets.conf

With this configuration, the net_helper applyifcfg command can be used to produce the network configuration appropriate for the operating system. This example is for CentOS 7 (--noop just prints what will be done):

[root@s11 ~]# net_helper applyifcfg --from-config /etc/storpool/autonets.conf --noop
Same resolve interface spbond0.42 for both nets, assuming bond
An active-backup bond interface detected
Will patch /etc/storpool.conf.d/net_helper.conf with:
________________
SP_IFACE1_CFG=1:spbond0.42:sp0:42:10.4.2.11:b:s:P
SP_IFACE2_CFG=1:spbond0.42:sp1:42:10.4.2.11:b:s:P
SP_ALL_IFACES=dummy0 sp0 sp1 spbond0 spbond0.42
________________

Executing command: iface-genconf --auto --overwrite --sp-mtu 9000 --iscsi-mtu 9000 --arp-ip-targets 10.4.2.12,10.4.2.13,10.4.2.14,10.4.2.15 --noop
Using /usr/lib/storpool, instead of the default /usr/lib/storpool
Same resolve interface spbond0.42 for both nets, assuming bond
An active-backup bond interface detected
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-spbond0.42
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:02:27

DEVICE=spbond0.42
ONBOOT=yes
TYPE=Vlan
BOOTPROTO=none
MTU=9000
VLAN=yes
PHYSDEV=spbond0
IPADDR=10.4.2.11
PREFIX=24
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-dummy0
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:02:27

DEVICE=dummy0
ONBOOT=yes
TYPE=dummy
MASTER=spbond0
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-spbond0
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:02:27

DEVICE=spbond0
ONBOOT=yes
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=active-backup arp_interval=500 arp_validate=active arp_all_targets=any arp_ip_target=10.4.2.12,10.4.2.13,10.4.2.14,10.4.2.15"
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-sp0
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:02:27

DEVICE=sp0
ONBOOT=yes
TYPE=Ethernet
MASTER=spbond0
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-sp1
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:02:27

DEVICE=sp1
ONBOOT=yes
TYPE=Ethernet
MASTER=spbond0
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_

There are many additional options available, for example the name of the bond could be customized, an additional set of VLAN interfaces could be created on top of the bond interface, and so on.

8.2.7. Advanced example

Here is a more advanced example:

  • Interface names: sp0 and sp1 for StorPool, and lacp0 and lacp1 for the iSCSI service

  • VLAN ID: 42 for the storage interfaces

  • IP Network: 10.4.2.0/24

  • Additional VLAN ID: 43 on the bond over the storage interfaces

  • Storage interfaces kernel mode of operation: with a bridge with MLAG bond on top of the storage interfaces

  • iSCSI dedicated interfaces kernel mode of operation: with an MLAG bond

  • VLAN 100, and IP network 172.16.100.0/24 for a portal group in iSCSI

To prepare the configuration:

[root@s11 ~]# net_helper genconfig \
        lacp0 lacp1 sp0 sp1 \
        --vlan 42 \
        --sp-network 10.4.2.0/24 \
        --sp-mode bridge-mlag-bond \
        --iscsi-mode mlag-bond \
        --add-iface 43,10.4.3.0/24 \
        --iscsicfg-net 100,172.16.100.0/24 | tee /etc/storpool/autonets.conf
interfaces=lacp0 lacp1 sp0 sp1
addresses=10.4.2.11
sp_mode=bridge-mlag-bond
vlan=42
iscsi_mode=mlag-bond
add_iface=43,10.4.3.11/24
sp_mtu=9000
iscsi_mtu=9000
iscsi_add_iface=100,172.16.100.11/24
iscsi_arp_ip_targets=
config_path=/etc/storpool.conf.d/net_helper.conf

Example output:

[root@s11 ~]# net_helper applyifcfg --from-config /etc/storpool/autonets.conf --noop
Same resolve interface br-storage for both nets, assuming bond
An 802.3ad bond interface detected
Will patch /etc/storpool.conf.d/net_helper.conf with:
________________
SP_RESOLVE_IFACE_IS_BRIDGE=1
SP_BOND_IFACE_NAME=spbond0.42
SP_IFACE1_CFG=1:br-storage:lacp0:42:10.4.2.11:b:s:v
SP_IFACE2_CFG=1:br-storage:lacp1:42:10.4.2.11:b:s:v
SP_ISCSI_IFACE=sp0,spbond1:sp1,spbond1:[lacp]
SP_ALL_IFACES=br-storage dummy0 dummy1 lacp0 lacp1 sp0 sp1 spbond0 spbond0.42 spbond0.43 spbond1 spbond1.100
________________

Executing command: iface-genconf --auto --overwrite --sp-mtu 9000 --iscsi-mtu 9000 --add-iface 43,10.4.3.11/24 --iscsicfg-explicit 100,172.16.100.11/24 --noop
Using /usr/lib/storpool, instead of the default /usr/lib/storpool
Same resolve interface br-storage for both nets, assuming bond
An 802.3ad bond interface detected
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-br-storage
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39

DEVICE=br-storage
ONBOOT=yes
TYPE=Bridge
BOOTPROTO=none
MTU=9000
IPADDR=10.4.2.11
PREFIX=24
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-spbond0.42
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39

DEVICE=spbond0.42
ONBOOT=yes
TYPE=Vlan
BRIDGE=br-storage
BOOTPROTO=none
MTU=9000
VLAN=yes
PHYSDEV=spbond0
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-dummy0
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39

DEVICE=dummy0
ONBOOT=yes
TYPE=dummy
MASTER=spbond0
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-spbond0
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39

DEVICE=spbond0
ONBOOT=yes
TYPE=Bond
BRIDGE=br-storage
BONDING_MASTER=yes
BONDING_OPTS="mode=802.3ad miimon=100 lacp_rate=1"
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-lacp0
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39

DEVICE=lacp0
ONBOOT=yes
TYPE=Ethernet
MASTER=spbond0
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-lacp1
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39

DEVICE=lacp1
ONBOOT=yes
TYPE=Ethernet
MASTER=spbond0
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-spbond0.43
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39

DEVICE=spbond0.43
ONBOOT=yes
TYPE=Vlan
BOOTPROTO=none
MTU=9000
VLAN=yes
PHYSDEV=spbond0
IPADDR=10.4.3.11
PREFIX=24
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-sp0
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39

DEVICE=sp0
ONBOOT=yes
TYPE=Ethernet
MASTER=spbond1
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-sp1
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39

DEVICE=sp1
ONBOOT=yes
TYPE=Ethernet
MASTER=spbond1
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-dummy1
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39

DEVICE=dummy1
ONBOOT=yes
TYPE=dummy
MASTER=spbond1
SLAVE='yes'
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-spbond1.100
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39

DEVICE=spbond1.100
ONBOOT=yes
TYPE=Vlan
BOOTPROTO=none
MTU=9000
VLAN=yes
PHYSDEV=spbond1
IPADDR=172.16.100.11
PREFIX=24
NM_CONTROLLED=no
_EOF_
# Noop selected, use the following commands to create the configuration file manually
cat <<_EOF_ > /etc/sysconfig/network-scripts/ifcfg-spbond1
# Autogenerated by /usr/sbin/iface-genconf on 2022-07-12 07:33:39

DEVICE=spbond1
ONBOOT=yes
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=802.3ad miimon=100 lacp_rate=1"
BOOTPROTO=none
MTU=9000
NM_CONTROLLED=no
_EOF_

8.3. Manual configuration

The net_helper tool is essentially glue that automates the following manual steps (a rough sketch of the manual equivalent is shown after the list):

  • Construct the SP_IFACE1_CFG/SP_IFACE2_CFG/SP_ISCSI_IFACE and other configuration statements for the first and second storage/iSCSI network interfaces, based on the provided parameters.

  • Execute iface-genconf, which recognizes these statements and writes the interface configuration to /etc/sysconfig/network-scripts (CentOS 7) or /etc/network/interfaces (Debian), or configures NetworkManager via nmcli (AlmaLinux 8/Rocky Linux 8/RHEL 8).

  • Execute /usr/lib/storpool/vf-genconf to prepare or re-create the configuration for virtual function interfaces.
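
As a rough sketch, the manual equivalent of the example above would be to add the same statements to /etc/storpool.conf.d/net_helper.conf by hand and then let iface-genconf render the interface configuration. The values below are taken from the earlier example output and are illustrative; adjust them for your node:

# cat <<_EOF_ >> /etc/storpool.conf.d/net_helper.conf
SP_RESOLVE_IFACE_IS_BRIDGE=1
SP_BOND_IFACE_NAME=spbond0.42
SP_IFACE1_CFG=1:br-storage:lacp0:42:10.4.2.11:b:s:v
SP_IFACE2_CFG=1:br-storage:lacp1:42:10.4.2.11:b:s:v
SP_ISCSI_IFACE=sp0,spbond1:sp1,spbond1:[lacp]
_EOF_
# /usr/sbin/iface-genconf --auto --noop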

8.4. Network and storage controllers interrupts affinity

The setirqaff utility is started by cron every minute. It checks the CPU affinity settings of several classes of IRQs (network interfaces, HBA, RAID) and updates them if needed. The policy is built into the script and does not require any external configuration files, apart from a properly configured storpool.conf (see 6.  Node configuration options) on the node.
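
To inspect the affinity that setirqaff maintains, you can read the standard kernel interfaces by hand (the driver name and IRQ number below are illustrative):

# grep -i mlx /proc/interrupts | head -n 3   # find the IRQ numbers used by a NIC
# cat /proc/irq/42/smp_affinity_list         # CPUs allowed to service IRQ 42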

9. Background services

A StorPool installation provides background services that handle different functions on each node participating in the cluster.

For details about how to control the services, see 10.  Managing services with storpool_ctl.

9.1. storpool_beacon

The beacon must be the first process started on all nodes in the cluster. It informs all members about the availability of the node on which it is installed. If the number of the visible nodes changes, every storpool_beacon service checks that its node still participates in the quorum - that is, it can communicate with more than half of the expected nodes, including itself (see SP_EXPECTED_NODES in 6.  Node configuration options).
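
For example, with SP_EXPECTED_NODES=5 a node remains in the quorum as long as it can see at least three voting beacons (more than half of five, counting itself). The current quorum state can be checked from any node with the CLI (see 12.4.  Network):

# storpool net list | grep -i quorum
Quorum status: 5 voting beacons up out of 5 expected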

If the storpool_beacon service starts successfully, it will log messages like those shown below to the system log (/var/log/messages, /var/log/syslog, or similar) for every node that comes up in the StorPool cluster:

[snip]
Jan 21 16:22:18 s01 storpool_beacon[18839]: [info] incVotes(1) from 0 to 1, voteOwner 1
Jan 21 16:23:10 s01 storpool_beacon[18839]: [info] peer 2, beaconStatus UP bootupTime 1390314187662389
Jan 21 16:23:10 s01 storpool_beacon[18839]: [info] incVotes(1) from 1 to 2, voteOwner 2
Jan 21 16:23:10 s01 storpool_beacon[18839]: [info] peer up 1
[snip]

9.2. storpool_server

The storpool_server service must be started on each node that provides its storage devices (HDD, SSD, or NVMe drives) to the cluster. If the service starts successfully, all the drives intended to be used as StorPool disks should be listed in the system log, as shown in the example below:

Dec 14 09:54:19 s11 storpool_server[13658]: [info] /dev/sdl1: adding as data disk 1101 (ssd)
Dec 14 09:54:19 s11 storpool_server[13658]: [info] /dev/sdb1: adding as data disk 1111
Dec 14 09:54:20 s11 storpool_server[13658]: [info] /dev/sda1: adding as data disk 1114
Dec 14 09:54:20 s11 storpool_server[13658]: [info] /dev/sdk1: adding as data disk 1102 (ssd)
Dec 14 09:54:20 s11 storpool_server[13658]: [info] /dev/sdj1: adding as data disk 1113
Dec 14 09:54:22 s11 storpool_server[13658]: [info] /dev/sdi1: adding as data disk 1112

On a dedicated node (for example, one with a larger amount of spare resources) you can start more than one instance of the storpool_server service (up to seven); for details, see 13.  Multi-server.

9.3. storpool_block

The storpool_block service provides the client (initiator) functionality. StorPool volumes can be attached only to nodes where this service is running. When attached to a node, a volume can be used and manipulated as a regular block device via the /dev/storpool/{volume_name} symlink:

# lsblk /dev/storpool/test
NAME MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sp-2 251:2    0  100G  0 disk

9.4. storpool_mgmt

The storpool_mgmt service should be started on at least two management nodes in the cluster. It receives requests from user space tools (CLI or API), executes them in the StorPool cluster, and returns the results to the sender. An automatic failover mechanism is available: when the node with the active storpool_mgmt service fails, the SP_API_HTTP_HOST IP address is automatically configured on the node with the lowest SP_OURID that has a running storpool_mgmt service.
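
To see which management instance is currently active, check the output of storpool service list; the active instance is marked with active (the excerpt below follows the example in 12.7.  Services):

# storpool service list
[snip]
  mgmt    11 running on node 11 ver 20.00.18, started 2022-09-08 18:23:36, uptime 1 day 00:53:43
  mgmt    12 running on node 12 ver 20.00.18, started 2022-09-08 18:23:35, uptime 1 day 00:53:44 active
[snip]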

9.5. storpool_bridge

The storpool_bridge service is started on two or more nodes in the cluster, with one being active (similarly to the storpool_mgmt service). This service synchronizes snapshots for the backup and disaster recovery use cases between the current cluster and one or more StorPool clusters in different locations.

9.6. storpool_controller

The storpool_controller service is started on all nodes running the storpool_server service. It collects information from all storpool_server instances in order to provide statistics data to the API.

Note

The storpool_controller service requires port 47567 to be open on the nodes where the API (storpool_mgmt) service is running.

9.7. storpool_nvmed

The storpool_nvmed service is started on all nodes that run the storpool_server service and have NVMe devices. It handles the management of the NVMe devices: detaching them from the kernel’s nvme driver and passing them to the storpool_pci or vfio_pci driver (configured through SP_NVME_PCI_DRIVER, see 6.4.1.  NVMe SSD drives).
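
To check which driver currently claims a given NVMe device, standard PCI tooling can be used (the PCI address below is illustrative, and the output is abbreviated):

# lspci -k -s 0000:04:00.0 | grep -i 'driver in use'
        Kernel driver in use: storpool_pci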

9.8. storpool_stat

The storpool_stat service is started on all nodes. It collects the following system metrics from all nodes:

  • CPU stats - queue run/wait, user, system, and so on, per CPU

  • Memory usage stats per cgroup

  • Network stats for the StorPool services

  • The I/O stats of the system drives

  • Per-host validation checks (for example, whether there are processes in the root cgroup, whether the API is reachable if configured, and so on)

On some nodes it collects additional information:

  • On all nodes with the storpool_block service: the I/O stats of all attached StorPool volumes;

  • On server nodes: stats for the communication of storpool_server with the drives.

For more information, see Monitoring data collected.

The collected data can be viewed at https://analytics.storpool.com. It can also be submitted to an InfluxDB instance run by your organization (configurable in storpool.conf).

9.9. storpool_qos

The storpool_qos service tracks changes for volumes that match certain criteria, and takes care of updating the I/O performance settings of the matching volumes. For details, see Quality of service.

9.10. storpool_abrtsync

The storpool_abrtsync service automatically sends reports about aborted services to StorPool’s monitoring system.

10. Managing services with storpool_ctl

storpool_ctl is a helper tool providing an easy way to perform an action for all installed services on a StorPool node. You can use it to start, stop, or restart services, or enable/disable starting them on boot.

For more information about the services themselves, see 9.  Background services.

10.1. Supported actions

To list all supported actions use:

# storpool_ctl --help
usage: storpool_ctl [-h] {disable,start,status,stop,restart,enable} ...

       Tool that controls all StorPool services on a node, taking care of
       service dependencies, required checks before executing an action and
       others.

positional arguments:
{disable,start,status,stop,restart,enable}
                        action

optional arguments:
-h, --help            show this help message and exit

10.2. Getting status

List the status of all services installed on this node:

# storpool_ctl status
storpool_nvmed          not_running
storpool_mgmt           not_running
storpool_reaffirm       not_running
storpool_flushwbc       not_running
storpool_stat           not_running
storpool_abrtsync       not_running
storpool_block          not_running
storpool_kdump          not_running
storpool_hugepages      not_running
storpool_bridge         not_running
storpool_beacon         not_running
storpool_cgmove         not_running
storpool_iscsi          not_running
storpool_server         not_running
storpool_controller     not_running

The status is always one of:

  • not_running when the service is disabled or stopped

  • not_enabled when the service is running but is not yet enabled

  • running when the service is running and enabled

Note

The tool always prints the status after the selected action has been applied.

You can also use storpool_ctl status --problems, which shows only services that are not in the running state. Note that in this mode the utility exits with a non-zero status if any installed service is in the not_running or not_enabled state.

# storpool_ctl status --problems
storpool_mgmt           not_running
storpool_server         not_enabled
# echo $?
1

10.3. Starting services

To start all services:

# storpool_ctl start
cgconfig                running
storpool_abrtsync       not_enabled
storpool_cgmove         not_enabled
storpool_block          not_enabled
storpool_mgmt           not_enabled
storpool_flushwbc       not_enabled
storpool_server         not_enabled
storpool_hugepages      not_enabled
storpool_stat           not_enabled
storpool_controller     not_enabled
storpool_beacon         not_enabled
storpool_kdump          not_enabled
storpool_reaffirm       not_enabled
storpool_bridge         not_enabled
storpool_iscsi          not_enabled

10.4. Enabling services

Note

The services should be enabled after the configuration of cgroups is completed. For more information, see Control groups.

To enable all services:

# storpool_ctl enable
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_cgmove.service to /usr/lib/systemd/system/storpool_cgmove.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_bridge.service to /usr/lib/systemd/system/storpool_bridge.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_block.service to /usr/lib/systemd/system/storpool_block.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_reaffirm.service to /usr/lib/systemd/system/storpool_reaffirm.service.
Created symlink from /etc/systemd/system/storpool_beacon.service.wants/storpool_reaffirm.service to /usr/lib/systemd/system/storpool_reaffirm.service.
Created symlink from /etc/systemd/system/storpool_block.service.wants/storpool_reaffirm.service to /usr/lib/systemd/system/storpool_reaffirm.service.
Created symlink from /etc/systemd/system/storpool_mgmt.service.wants/storpool_reaffirm.service to /usr/lib/systemd/system/storpool_reaffirm.service.
Created symlink from /etc/systemd/system/storpool_server.service.wants/storpool_reaffirm.service to /usr/lib/systemd/system/storpool_reaffirm.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_controller.service to /usr/lib/systemd/system/storpool_controller.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_beacon.service to /usr/lib/systemd/system/storpool_beacon.service.
Created symlink from /etc/systemd/system/storpool_server.service.wants/storpool_beacon.service to /usr/lib/systemd/system/storpool_beacon.service.
Created symlink from /etc/systemd/system/storpool_block.service.wants/storpool_beacon.service to /usr/lib/systemd/system/storpool_beacon.service.
Created symlink from /etc/systemd/system/storpool_mgmt.service.wants/storpool_beacon.service to /usr/lib/systemd/system/storpool_beacon.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_kdump.service to /usr/lib/systemd/system/storpool_kdump.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_abrtsync.service to /usr/lib/systemd/system/storpool_abrtsync.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_mgmt.service to /usr/lib/systemd/system/storpool_mgmt.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_server.service to /usr/lib/systemd/system/storpool_server.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_flushwbc.service to /usr/lib/systemd/system/storpool_flushwbc.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_stat.service to /usr/lib/systemd/system/storpool_stat.service.
Created symlink from /etc/systemd/system/sysinit.target.wants/storpool_hugepages.service to /usr/lib/systemd/system/storpool_hugepages.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/storpool_iscsi.service to /usr/lib/systemd/system/storpool_iscsi.service.
storpool_cgmove         running
storpool_bridge         running
storpool_block          running
storpool_reaffirm       running
storpool_controller     running
storpool_beacon         running
storpool_kdump          running
storpool_abrtsync       running
storpool_mgmt           running
storpool_server         running
storpool_flushwbc       running
storpool_stat           running
storpool_hugepages      running
storpool_iscsi          running

10.5. Disabling services

To disable all services (without stopping them):

# storpool_ctl disable
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_cgmove.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_bridge.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_block.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_reaffirm.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_controller.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_beacon.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_kdump.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_abrtsync.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_mgmt.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_server.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_flushwbc.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_stat.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/storpool_iscsi.service.
Removed symlink /etc/systemd/system/sysinit.target.wants/storpool_hugepages.service.
Removed symlink /etc/systemd/system/storpool_beacon.service.wants/storpool_reaffirm.service.
Removed symlink /etc/systemd/system/storpool_block.service.wants/storpool_reaffirm.service.
Removed symlink /etc/systemd/system/storpool_block.service.wants/storpool_beacon.service.
Removed symlink /etc/systemd/system/storpool_mgmt.service.wants/storpool_reaffirm.service.
Removed symlink /etc/systemd/system/storpool_mgmt.service.wants/storpool_beacon.service.
Removed symlink /etc/systemd/system/storpool_server.service.wants/storpool_reaffirm.service.
Removed symlink /etc/systemd/system/storpool_server.service.wants/storpool_beacon.service.
storpool_kdump          not_enabled
storpool_hugepages      not_enabled
storpool_bridge         not_enabled
storpool_controller     not_enabled
storpool_beacon         not_enabled
storpool_cgmove         not_enabled
storpool_block          not_enabled
storpool_mgmt           not_enabled
storpool_stat           not_enabled
storpool_reaffirm       not_enabled
storpool_server         not_enabled
storpool_flushwbc       not_enabled
storpool_abrtsync       not_enabled
storpool_iscsi          not_enabled

10.6. Stopping services

To stop all services:

# storpool_ctl stop
storpool_server         not_running
storpool_iscsi          not_running
storpool_controller     not_running
storpool_mgmt           not_running
storpool_cgmove         not_running
storpool_kdump          not_running
storpool_reaffirm       not_running
storpool_stat           not_running
storpool_beacon         not_running
storpool_hugepages      not_running
storpool_flushwbc       not_running
storpool_block          not_running
storpool_bridge         not_running
storpool_abrtsync       not_running
Module storpool_pci    version 6D0D7D6E357D24CBDF2D1BA
Module storpool_disk   version D92BDA6C929615392EEAA7E
Module storpool_bd     version C6EB4EEF1E0ABF1A4774788
Module storpool_rdma   version 4F1FB67DF4617ECD6C472C4

The stop action supports the following options (see the example after the list):

--servers

Stop just the server instances.

--expose-nvme

Return any configured NVMe devices attached to the selected SP_NVME_PCI_DRIVER back to the kernel’s nvme driver.
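
For example, to stop only the storpool_server instances on a node, leaving the other services running:

# storpool_ctl stop --servers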

11. CLI tutorial

StorPool provides an easy yet powerful Command Line Interface (CLI) for administering the data storage cluster, or multiple clusters in the same location (multi-cluster). It has an integrated help system that provides useful information at every step. There are various ways to execute commands in the CLI, depending on the style and needs of the administrator. The StorPool CLI gets its configuration from the /etc/storpool.conf file and from command line options.

This section provides an introduction to the CLI and a few useful examples. For more information, see 12.  CLI reference.

11.1. Using the standard shell

Type a regular shell command with parameters:

# storpool service list

Pipe command output to StorPool CLI:

# echo "service list" | storpool

Redirect the standard input from a predefined file with commands:

# storpool < input_file

Display the available command line options:

# storpool --help

11.2. Using the interactive shell

To start the interactive StorPool shell:

# storpool
StorPool> service list

Interactive shell help can be invoked by pressing the question mark key (?):

# storpool
StorPool> attach <?>
  client - specify a client to attach the volume to {M}
  here - attach here {M}
  list - list the current attachments
  mode - specify the read/write mode {M}
  noWait - do not wait for the client {M}
  snapshot - specify a snapshot to attach {M}
  timeout - seconds to wait for the client to appear {M}
  volume - specify a volume to attach {M}

Shell autocompletion (invoked by pressing the Tab key twice) shows the available options for the current step:

StorPool> attach <tab> <tab>
client    here      list      mode      noWait    snapshot  timeout   volume

The StorPool shell can detect incomplete lines and suggest options:

# storpool
StorPool> attach <enter>
.................^
Error: incomplete command! Expected:
    volume - specify a volume to attach
    client - specify a client to attach the volume to
    list - list the current attachments
    here - attach here
    mode - specify the read/write mode
    snapshot - specify a snapshot to attach
    timeout - seconds to wait for the client to appear
    noWait - do not wait for the client

To exit the interactive shell use the quit or exit command, or directly use the Ctrl+C or Ctrl+D keyboard shortcuts of your terminal.

11.3. Error messages

If the shell command is incomplete or wrong, the system displays an error message that includes the possible options:

# storpool attach
Error: incomplete command! Expected:
    list - list the current attachments
    timeout - seconds to wait for the client to appear
    volume - specify a volume to attach
    here - attach here
    noWait - do not wait for the client
    snapshot - specify a snapshot to attach
    mode - specify the read/write mode
    client - specify a client to attach the volume to

# storpool attach volume
Error: incomplete command! Expected:
  volume - the volume to attach

11.4. Multi-cluster mode

To enter multi-cluster mode (see 17.  Multi-site and multi-cluster) while in interactive mode:

StorPool> multiCluster on
[MC] StorPool>

For non-interactive mode, use:

# storpool -M <command>

Note

All commands not relevant to multi-cluster will silently fall back to non-multi-cluster mode. For example, storpool -M service list will list only local services. The same applies to storpool -M disk list and storpool -M net list.

12. CLI reference

For introduction and examples, see 11.  CLI tutorial.

12.1. Location

The location submenu is used for configuring other StorPool sub-clusters in the same or a different location (17.  Multi-site and multi-cluster). The location ID is the first part (left of the .) of the SP_CLUSTER_ID configured in the remote cluster.

For example, to add a location with SP_CLUSTER_ID=nzkr.b, use:

# storpool location add nzkr StorPoolLab-Sofia
OK

To list the configured locations use:

# storpool location list
-----------------------------------------------
| id   | name              | rxBuf  | txBuf   |
-----------------------------------------------
| nzkr | StorPoolLab-Sofia | 85 KiB | 128 KiB |
-----------------------------------------------

To rename a location:

# storpool location rename StorPoolLab-Sofia name StorPoolLab-Amsterdam
OK

To remove a location:

# storpool location remove StorPoolLab-Sofia
OK

Note

This command will fail if there is an existing cluster or a remote bridge configured for this location.

To update the send or receive buffer sizes to values different from the defaults, use:

# storpool location update StorPoolLab-Sofia recvBufferSize 16M
OK
# storpool location update StorPoolLab-Sofia sendBufferSize 1M
OK
# storpool location list
-----------------------------------------------
| id   | name              | rxBuf  | txBuf   |
-----------------------------------------------
| nzkr | StorPoolLab-Sofia | 16 MiB | 1.0 MiB |
-----------------------------------------------

12.2. Cluster

The cluster submenu is used for configuring a new cluster for an already configured location. The cluster ID is the second part (right of the .) of the SP_CLUSTER_ID configured in the remote cluster. For example, to add the cluster b for the remote location nzkr, use:

# storpool cluster add StorPoolLab-Sofia b
OK

To list the configured clusters use:

# storpool cluster list
--------------------------------------------------
| name                  | id | location          |
--------------------------------------------------
| StorPoolLab-Sofia-cl1 | b  | StorPoolLab-Sofia |
--------------------------------------------------

To remove a cluster use:

# storpool cluster remove StorPoolLab-Sofia b

12.3. Remote bridge

The remoteBridge submenu is used to register or deregister a remote bridge for a configured location.

12.3.1. Registering and de-registering

To register a remote bridge, use storpool remoteBridge register <location-name> <IP address> <public-key>; for example:

# storpool remoteBridge register StorPoolLab-Sofia 10.1.100.10 ju9jtefeb8idz.ngmrsntnzhsei.grefq7kzmj7zo.nno515u6ftna6
OK

This registers the StorPoolLab-Sofia location with an IP address of 10.1.100.10 and the above public key.

If the IP address or the public key of a remote location changes, the remote bridge can be de-registered and then registered again with the updated parameters; for example:

# storpool remoteBridge deregister 10.1.100.10
OK
# storpool remoteBridge register StorPoolLab-Sofia 78.90.13.150 8nbr9q162tjh.ahb6ueg16kk2.mb7y2zj2hn1ru.5km8qut54x7z
OK

A remote bridge may be registered with noCrypto when the interconnect between the clusters is secure; a typical use case is a 17.1.  Multicluster setup with other sub-clusters in the same datacenter.

12.3.2. Minimum deletion delay

To enable deferred deletion on unexport from the remote site, the minimumDeleteDelay flag should also be set. The format of the command is storpool remoteBridge register <location-name> <IP address> <public-key> minimumDeleteDelay <minimumDeleteDelay>, where the last parameter is a time period provided as X[smhd] - X is an integer and s, m, h, and d stand for seconds, minutes, hours, and days respectively.

For example, registering the remote bridge for the StorPoolLab-Sofia location with a minimumDeleteDelay of one day looks like this:

# storpool remoteBridge register StorPoolLab-Sofia 78.90.13.150 8nbr9q162tjh.ahb6ueg16kk2.mb7y2zj2hn1ru.5km8qut54x7z minimumDeleteDelay 1d
OK

After this operation, all snapshots sent from the remote cluster can later be unexported with the deleteAfter parameter set (see the 12.11.1.  Remote snapshots section). Any deleteAfter value lower than the minimumDeleteDelay will be overridden by the bridge in the remote cluster. All such events will be logged on the node with the active bridge in the remote cluster.

For more information about deferred deletion, see 17.2.  Multi site.

12.3.3. Listing registered remote bridges

To list all registered remote bridges use:

# storpool remoteBridge list
------------------------------------------------------------------------------------------------------------------------------
| ip           | remote            | minimumDeleteDelay | publicKey                                               | noCrypto |
------------------------------------------------------------------------------------------------------------------------------
| 10.1.200.10  | StorPoolLab-Sofia |                    | nonwtmwsgdr2p.fos2qus4h1qdk.pnt9ozj8gcktj.d7b2aa24gsegn | 0        |
| 10.1.200.11  | StorPoolLab-Sofia |                    | jtgeaqhsmqzqd.x277oefofxbpm.bynb2krkiwg54.ja4gzwqdg925j | 0        |
------------------------------------------------------------------------------------------------------------------------------

12.3.4. Status of remote bridges

To show the status of remote bridges use:

# storpool remoteBridge status
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| ip           | clusterId | connectionState | connectionTime      | reconnectCount | receivedExports | sentExports | lastError        | lastErrno               | errorTime           | bytesSentSinceStart | bytesRecvSinceStart | bytesSentSinceConnect | bytesRecvSinceConnect |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 10.1.200.11  | d.b       | connected       | 2021-02-07 18:08:25 |              2 |               5 |           2 | socket error     | Operation not permitted | 2021-02-07 17:58:58 |           210370560 |           242443328 |              41300272 |              75088624 |
| 10.1.200.10  | d.d       | connected       | 2021-02-07 17:51:42 |              1 |               7 |           2 | no error         | No error information    | -                   |           186118480 |            39063648 |             186118480 |              39063648 |
| 10.1.200.4   | e.n       | connected       | 2021-02-07 17:51:42 |              1 |               5 |           0 | no error         | No error information    | -                   |           117373472 |              316784 |             117373472 |                316784 |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

12.4. Network

To list basic details about the cluster network use:

# storpool net list
------------------------------------------------------------
| nodeId | flags   | net 1             | net 2             |
------------------------------------------------------------
|     11 | uU + AJ | F4:52:14:76:9C:B0 | F4:52:14:76:9C:B0 |
|     12 | uU + AJ | 02:02:C9:3C:E3:80 | 02:02:C9:3C:E3:81 |
|     13 | uU + AJ | F6:52:14:76:9B:B0 | F6:52:14:76:9B:B1 |
|     14 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
|     15 | uU + AJ | 1A:60:00:00:00:0F | 1E:60:00:00:00:0F |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
  u - packets recently received from this node
  d - no packets recently received from this node

  U - this node has enough votes to be in the quorum
  D - this node does not have enough votes to be in the quorum
  M - this node is being damped by the rest of the nodes in the cluster
  + - this node considers itself in the quorum
  B - the connection to this node is through a backup link; check the cabling
  A - this node is using hardware acceleration
  J - the node uses jumbo frames

  N - a non-voting node

12.5. Server

To list the nodes that are configured as StorPool servers and their storpool_server instances use:

# storpool server list
cluster running, mgmt on node 11
    server  11.0 running on node 11
    server  12.0 running on node 12
    server  13.0 running on node 13
    server  14.0 running on node 14
    server  11.1 running on node 11
    server  12.1 running on node 12
    server  13.1 running on node 13
    server  14.1 running on node 14

To get more information about which storage devices are provided by a particular server, use the storpool server <ID> submenu, for example with disk list:

# storpool server 11 disk list
disk  |  server  |    size     |    used     |   est.free  |      %  |  free entries  |   on-disk size  |  allocated objects |  errors |   flags
1103  |    11.0  |    447 GiB  |    3.1 GiB  |    424 GiB  |    1 %  |       1919912  |         20 MiB  |    40100 / 480000  |   0 / 0 |
1104  |    11.0  |    447 GiB  |    3.1 GiB  |    424 GiB  |    1 %  |       1919907  |         20 MiB  |    40100 / 480000  |   0 / 0 |
1111  |    11.0  |    465 GiB  |    2.6 GiB  |    442 GiB  |    1 %  |        494977  |         20 MiB  |    40100 / 495000  |   0 / 0 |
1112  |    11.0  |    365 GiB  |    2.6 GiB  |    346 GiB  |    1 %  |        389977  |         20 MiB  |    40100 / 390000  |   0 / 0 |
1125  |    11.0  |    931 GiB  |    2.6 GiB  |    894 GiB  |    0 %  |        974979  |         20 MiB  |    40100 / 975000  |   0 / 0 |
1126  |    11.0  |    931 GiB  |    2.6 GiB  |    894 GiB  |    0 %  |        974979  |         20 MiB  |    40100 / 975000  |   0 / 0 |
----------------------------------------------------------------------------------------------------------------------------------------
   6  |     1.0  |    3.5 TiB  |     16 GiB  |    3.4 TiB  |    0 %  |       6674731  |        122 MiB  |   240600 / 3795000 |   0 / 0 |

Note

Without specifying an instance, the first instance is assumed - 11.0 in the above example. The second, third, and fourth storpool_server instances would be 11.1, 11.2, and 11.3 respectively.
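
For example, to query the second instance directly (a sketch assuming the node.instance form shown in the listings is accepted in place of the plain node ID):

# storpool server 11.1 disk list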

To list the servers that are blocked and could not join the cluster for some reason:

# storpool server blocked
cluster waiting, mgmt on node 12
  server  11.0 waiting on node 11 missing:1201,1202,1203,1204,1212,1221,1222,1223,1224,1225,1226,1301,1302,1303,1304,1311,1312,1321,1322,1323,1324,1325,1326,4043 pending:1103,1104,1111,1112,1125,1126
  server  12.0    down on node 12
  server  13.0    down on node 13
  server  14.0 waiting on node 14 missing:1201,1202,1203,1204,1212,1221,1222,1223,1224,1225,1226,1301,1302,1303,1304,1311,1312,1321,1322,1323,1324,1325,1326,4043 pending:1403,1404,1411,1412,1421,1423
  server  11.1 waiting on node 11 missing:1201,1202,1203,1204,1212,1221,1222,1223,1224,1225,1226,1301,1302,1303,1304,1311,1312,1321,1322,1323,1324,1325,1326,4043 pending:1101,1102,1121,1122,1123,1124
  server  12.1    down on node 12
  server  13.1    down on node 13
  server  14.1 waiting on node 14 missing:1201,1202,1203,1204,1212,1221,1222,1223,1224,1225,1226,1301,1302,1303,1304,1311,1312,1321,1322,1323,1324,1325,1326,4043 pending:1401,1402,1424,1425,1426

12.6. Fault sets

Fault sets are a way to instruct StorPool to place no more than one replica of the data on the drives in a group of nodes that are expected to fail simultaneously. Some examples would be:

  • Multinode chassis

  • Multiple nodes in the same rack backed by the same power supply

  • Nodes connected to the same set of switches and so on.

To define a fault set, only a name and a set of server nodes are needed:

# storpool faultSet chassis_1 addServer 11 addServer 12
OK

To list defined fault sets:

# storpool faultSet list
-------------------------------------------------------------------
| name                 |                                  servers |
-------------------------------------------------------------------
| chassis_1            |                                    11 12 |
-------------------------------------------------------------------

To remove a fault set:

# storpool faultSet chassis_1 delete chassis_1

Attention

A new fault set definition takes effect only for newly created volumes. To apply the configuration to already created volumes, a re-balancing operation is required. See 12.18.  Balancer for more details on re-balancing a cluster after defining fault sets.

12.7. Services

Check the state of all services presently running in the cluster and their uptime:

# storpool service list
cluster running, mgmt on node 12
  mgmt    11 running on node 11 ver 20.00.18, started 2022-09-08 18:23:36, uptime 1 day 00:53:43
  mgmt    12 running on node 12 ver 20.00.18, started 2022-09-08 18:23:35, uptime 1 day 00:53:44 active
server  11.0 running on node 11 ver 20.00.18, started 2022-09-08 18:23:45, uptime 1 day 00:53:34
server  12.0 running on node 12 ver 20.00.18, started 2022-09-08 18:23:41, uptime 1 day 00:53:38
server  13.0 running on node 13 ver 20.00.18, started 2022-09-08 18:23:35, uptime 1 day 00:53:44
server  14.0 running on node 14 ver 20.00.18, started 2022-09-08 18:23:39, uptime 1 day 00:53:40
server  11.1 running on node 11 ver 20.00.18, started 2022-09-08 18:23:45, uptime 1 day 00:53:34
server  12.1 running on node 12 ver 20.00.18, started 2022-09-08 18:23:44, uptime 1 day 00:53:35
server  13.1 running on node 13 ver 20.00.18, started 2022-09-08 18:23:37, uptime 1 day 00:53:42
server  14.1 running on node 14 ver 20.00.18, started 2022-09-08 18:23:39, uptime 1 day 00:53:40
client    11 running on node 11 ver 20.00.18, started 2022-09-08 18:23:33, uptime 1 day 00:53:46
client    12 running on node 12 ver 20.00.18, started 2022-09-08 18:23:34, uptime 1 day 00:53:45
client    13 running on node 13 ver 20.00.18, started 2022-09-08 18:23:32, uptime 1 day 00:53:47
client    14 running on node 14 ver 20.00.18, started 2022-09-08 18:23:32, uptime 1 day 00:53:47
client    15 running on node 15 ver 20.00.18, started 2020-01-09 10:46:17, uptime 08:31:02
bridge    11 running on node 11 ver 20.00.18, started 2022-09-08 18:23:34, uptime 1 day 00:53:45 active
bridge    12 running on node 12 ver 20.00.18, started 2022-09-08 18:23:34, uptime 1 day 00:53:45
 cntrl    11 running on node 11 ver 20.00.18, started 2022-09-08 18:23:35, uptime 1 day 00:53:44
 cntrl    12 running on node 12 ver 20.00.18, started 2022-09-08 18:23:34, uptime 1 day 00:53:45
 cntrl    13 running on node 13 ver 20.00.18, started 2022-09-08 18:23:31, uptime 1 day 00:53:48
 cntrl    14 running on node 14 ver 20.00.18, started 2022-09-08 18:23:31, uptime 1 day 00:53:48
 iSCSI    12 running on node 13 ver 20.00.18, started 2022-09-08 18:23:32, uptime 1 day 00:53:47
 iSCSI    13 running on node 13 ver 20.00.18, started 2022-09-08 18:23:32, uptime 1 day 00:53:47

12.8. Disk

12.8.1. Disk list main info

The disk sub-menu is for querying or managing the available disks in the cluster.

To display all available disks in all server instances in the cluster:

# storpool disk list
disk  |  server  |    size     |     used    |   est.free  |      %  |  free entries  |   on-disk size  |   allocated objects |  errors |  flags
1101  |    11.1  |    893 GiB  |    2.6 GiB  |    857 GiB  |    0 %  |       3719946  |        664 KiB  |     41000 / 930000  |   0 / 0 |
1102  |    11.1  |    446 GiB  |    2.6 GiB  |    424 GiB  |    1 %  |       1919946  |        664 KiB  |     41000 / 480000  |   0 / 0 |
1103  |    11.0  |    893 GiB  |    2.6 GiB  |    857 GiB  |    0 %  |       3719948  |        660 KiB  |     41000 / 930000  |   0 / 0 |
1104  |    11.0  |    446 GiB  |    2.6 GiB  |    424 GiB  |    1 %  |       1919946  |        664 KiB  |     41000 / 480000  |   0 / 0 |
1105  |    11.0  |    446 GiB  |    2.6 GiB  |    424 GiB  |    1 %  |       1919947  |        664 KiB  |     41000 / 480000  |   0 / 0 |
1111  |    11.1  |    930 GiB  |    2.6 GiB  |    893 GiB  |    0 %  |        974950  |        716 KiB  |     41000 / 975000  |   0 / 0 |
1112  |    11.1  |    930 GiB  |    2.6 GiB  |    893 GiB  |    0 %  |        974949  |        736 KiB  |     41000 / 975000  |   0 / 0 |
1113  |    11.1  |    930 GiB  |    2.6 GiB  |    893 GiB  |    0 %  |        974943  |        760 KiB  |     41000 / 975000  |   0 / 0 |
1114  |    11.1  |    930 GiB  |    2.6 GiB  |    893 GiB  |    0 %  |        974937  |        844 KiB  |     41000 / 975000  |   0 / 0 |
[snip]
1425  |    14.1  |    931 GiB  |    2.6 GiB  |    894 GiB  |    0 %  |        974980  |         20 MiB  |     40100 / 975000  |   0 / 0 |
1426  |    14.1  |    931 GiB  |    2.6 GiB  |    894 GiB  |    0 %  |        974979  |         20 MiB  |     40100 / 975000  |   0 / 0 |
----------------------------------------------------------------------------------------------------------------------------------------
  47  |     8.0  |     30 TiB  |    149 GiB  |     29 TiB  |    0 %  |      53308967  |        932 MiB  |  1844600 / 32430000 |   0 / 0 |

To mark a device as temporarily unavailable:

# storpool disk 1111 eject
OK

Ejecting a disk from the cluster will stop the data replication for this disk, but will keep the metadata about the placement groups in which it participated and which volume objects it contained.

Note

The command above will refuse to eject the disk if this operation would lead to volumes or snapshots going into the down state (usually when the last up-to-date copy for some parts of a volume/snapshot is on this disk).

This drive will be shown as missing in the storpool disk list output, for example:

# storpool disk list
    disk  |  server  |    size    |    used    |  est.free  |      %  |  free entries  |  on-disk size  |  allocated objects |  errors  |  flags
    [snip]
    1422  |    14.1  |         -  |         -  |         -  |    - %  |             -  |             -  |        - / -       |    - / - |
    [snip]

Attention

This operation leads to degraded redundancy for all volumes and snapshots that have data on the ejected disk.

Such a disk will not return to the cluster by itself; it has to be manually re-inserted by removing its EJECTED flag with storpool_initdisk -r /dev/$path.
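
For example, to clear the EJECTED flag and return the drive (the device path below is illustrative; use the path of the ejected drive on the respective node):

# storpool_initdisk -r /dev/sdk1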

12.8.2. Disk list additional info

To display additional info regarding disks:

# storpool disk list info
disk   |  server  |    device    |       model        |           serial            |             description          |         flags          |
 1101  |    11.1  |  0000:04:00.0-p1  |  SAMSUNG MZQLB960HAJR-00007  |  S437NF0M500149             |                                  |  S                     |
 1102  |    11.1  |  /dev/sdj1   |  Micron_M500DC_MTFDDAK480MBB  |  14250C6368E5               |                                  |  S                     |
 1103  |    11.0  |  /dev/sdi1   |  SAMSUNG_MZ7LH960HAJR-00005  |  S45NNE0M229767             |                                  |  S                     |
 1104  |    11.0  |  /dev/sdd1   |  Micron_M500DC_MTFDDAK480MBB  |  14250C63689B               |                                  |  S                     |
 1105  |    11.0  |  /dev/sdc1   |  Micron_M500DC_MTFDDAK480MBB  |  14250C6368EC               |                                  |  S                     |
 1111  |    11.1  |  /dev/sdl1   |  Hitachi_HUA722010CLA330  |  JPW9K0N13243ZL             |                                  |  W                     |
 1112  |    11.1  |  /dev/sda1   |  Hitachi_HUA722010CLA330  |  JPW9J0N13LJEEV             |                                  |  W                     |
 1113  |    11.1  |  /dev/sdb1   |  Hitachi_HUA722010CLA330  |  JPW9J0N13N694V             |                                  |  W                     |
 1114  |    11.1  |  /dev/sdm1   |  Hitachi_HUA722010CLA330  |  JPW9K0N132R7HL             |                                  |  W                     |
 [snip]
 1425  |    14.1  |  /dev/sdm1   |  Hitachi_HDS721050CLA360  |  JP1532FR1BY75C             |                                  |  W                     |
 1426  |    14.1  |  /dev/sdh1   |  Hitachi_HUA722010CLA330  |  JPW9K0N13RS95L             |                                  |  W, J                  |

To set a description for a disk, which is shown in the output of storpool disk list info:

# storpool disk 1111 description HBA2_port7
OK
# storpool disk 1104 description FAILING_SMART
OK

12.8.3. Disk list server internal info

To display internal statistics about each disk:

# storpool disk list internal

--------------------------------------------------------------------------------------------------------------------------------------------------------
| disk | server |        aggregate scores        |         wbc pages        |     scrub bw |                          scrub ETA | last scrub completed |
--------------------------------------------------------------------------------------------------------------------------------------------------------
| 1101 |   11.1 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2022-09-08 18:23:07 |
| 1102 |   11.1 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2022-09-08 18:23:07 |
| 1103 |   11.0 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2022-09-08 18:23:08 |
| 1104 |   11.0 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2022-09-08 18:23:09 |
| 1105 |   11.1 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2022-09-08 18:23:10 |
| 1111 |   11.0 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2022-09-08 18:23:12 |
| 1112 |   11.1 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2022-09-08 18:23:15 |
| 1113 |   11.1 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2022-09-08 18:23:17 |
| 1114 |   11.1 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2022-09-08 18:23:13 |
[snip]
| 1425 |   14.1 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2022-09-08 18:23:15 |
| 1426 |   14.1 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2022-09-08 18:23:19 |
--------------------------------------------------------------------------------------------------------------------------------------------------------

The sections in this output are as follows:

aggregate scores

Internal values representing how much data is about to be defragmented on the particular drive. They are usually between 0 and 1; on heavily loaded clusters the rightmost column might get into the hundreds or even thousands if some drives are severely loaded.

wbc pages

Internal statistics for each drive that has the write-back cache or journaling enabled in StorPool.

scrub bw

The scrubbing speed in MB/s.

scrub ETA

Approximate time/date when the scrubbing operation will complete for this drive.

last scrub completed

The last time/date when the drive was scrubbed.

Note

The default installation includes a cron job on the management nodes that starts a scrubbing job for one drive per node. You can increase the number of disks that are scrubbing in parallel per node (the example is for four drives) by running the following:

# . /usr/lib/storpool/storpool_confget.sh
# storpool_q -d '{"set":{"scrubbingDiskPerNode":"4"}}' KV/Set/conf

And you can see the number of drives that are scrubbing per node with:

# . /usr/lib/storpool/storpool_confget.sh
# storpool_q KV/Get/conf/scrubbingDiskPerNode | jq -re '.data.pairs.scrubbingDiskPerNode'

To configure a local or remote recovery override for a particular disk, different from the values configured with mgmtConfig:

# storpool disk 1111 maxRecoveryRequestsOverride local 2
OK
# storpool disk 1111 maxRecoveryRequestsOverride remote 4
OK

To remove a configured override:

# storpool disk 1111 maxRecoveryRequestsOverride remote clear
OK

This removes the override, and the default configured for the whole cluster (see 12.22.  Management configuration) takes precedence.

12.8.4. Disk list performance information

To display performance-related in-server statistics for each disk, use:

# storpool disk list perf
       |                                         latencies |              thresholds |    times exceeded | flags
  disk | disk (avg) | disk (max) |  jrn (avg) |  jrn (max) |       disk |    journal |   disk |  journal |
  2301 |    0.299ms |    0.400ms |          - |          - |    0.000ms |          - |      0 |        - |
  2302 |    0.304ms |    0.399ms |          - |          - |    0.000ms |          - |      0 |        - |
  2303 |    0.316ms |    0.426ms |          - |          - |    0.000ms |          - |      0 |        - |
  [snip]
  2621 |    4.376ms |    4.376ms |    0.029ms |    0.029ms |    0.000ms |    0.000ms |      0 |        0 |
  2622 |    4.333ms |    4.333ms |    0.025ms |    0.025ms |    0.000ms |    0.000ms |      0 |        0 |

Note

Global latency thresholds are configured through the mgmtConfig section.

To configure a single disk latency threshold override use:

# storpool disk 2301 latencyLimitOverride disk 500
OK
# storpool disk list perf
       |                                         latencies |              thresholds |    times exceeded | flags
  disk | disk (avg) | disk (max) |  jrn (avg) |  jrn (max) |       disk |    journal |   disk |  journal |
  2301 |    0.119ms |    0.650ms |          - |          - |  500.000ms |          - |      0 |        - |     D
  [snip]

The D flag means there is a disk latency override, visible in the thresholds section.

Similarly, to configure a single disk journal latency threshold override:

# storpool disk 2621 latencyLimitOverride journal 100
OK
# storpool disk list perf
       |                                         latencies |              thresholds |    times exceeded | flags
  disk | disk (avg) | disk (max) |  jrn (avg) |  jrn (max) |       disk |    journal |   disk |  journal |
  [snip]
  2621 |    8.489ms |   13.704ms |    0.052ms |    0.669ms |    0.000ms |  100.000ms |      0 |        0 |     J

The J flag means there is a disk journal latency override, visible in the thresholds section.

To override a single disk so that it no longer obeys the global limit:

# storpool disk 2301 latencyLimitOverride disk unlimited
OK

This will show the disk threshold as unlimited:

# storpool disk list perf
        |                                         latencies |              thresholds |    times exceeded | flags
    disk | disk (avg) | disk (max) |  jrn (avg) |  jrn (max) |       disk |    journal |   disk |  journal |
    2301 |    0.166ms |    0.656ms |          - |          - |  unlimited |          - |      0 |        - |     D

To clear an override and leave the global limit:

# storpool disk 2601 latencyLimitOverride disk off
OK
# storpool disk 2621 latencyLimitOverride journal off
OK

If a disk was ejected due to excessive latency, the server keeps a log of the last 128 requests sent to the disk; to list them, use:

# storpool disk 2601 ejectLog
   log creation time |  time of first event
 2022-03-31 17:50:16 |  2022-03-31 17:31:08 +761,692usec
 req# |         start |           end |      latency |               addr |       size |                   op
    1 |          +0us |        +424us |        257us | 0x0000000253199000 |    128 KiB |         DISK_OP_READ
  [snip]
  126 | +1147653582us | +1147653679us |         97us | 0x0000000268EBE000 |     12 KiB |    DISK_OP_WRITE_FUA
  127 | +1147654920us | +1147655192us |        272us | 0x0000000012961000 |    128 KiB |    DISK_OP_WRITE_FUA
          total |        maxTotal |           limit |      generation | times exceeded (for this eject)
          23335us |        100523us |          1280us |           15338 |               1

The same data is available if a disk journal was ejected after breaching the threshold with:

# storpool disk 2621 ejectLog journal
[snip]

(the output is similar to that of the disk ejectLog above)

12.8.5. Ejecting disks and internal server tests

When the server controlling a disk notices issues with it (a write error, or a request stalled for longer than a predefined threshold), the disk is also marked as “test pending”. This accounts for the many transient errors that occur when a disk drive (or its controller) stalls a request for longer than the predefined threshold.

An eject option is available for manually initiating such a test: it flags the disk as requiring a test and ejects it. The server instance then performs a quick set of non-intrusive read-write tests on the disk and returns it to the cluster if all tests pass; for example:

# storpool disk 2331 eject test
OK

The tests usually take from a couple of seconds up to a minute. To check the results from the last test:

# storpool disk 2331 testInfo
 times tested  |   test pending  |  read speed   |  write speed  |  read max latency   |  write max latency  | failed
            1  |             no  |  1.0 GiB/sec  |  971 MiB/sec  |  8 msec             |  4 msec             |     no

If the disk was already marked for testing, the “now” option will skip the test on the next attempt to re-open the disk:

# storpool disk 2301 eject now
OK

Attention

Note that this is exactly the same as “eject”; the disk has to be manually returned to the cluster.

To mark a disk as unavailable by first re-balancing all its data to the other disks in the cluster and only then ejecting it:

# storpool disk 1422 softEject
OK
Balancer auto mode currently OFF. Must be ON for soft-eject to complete.

Note

This option requires the StorPool balancer to be started after the above command is issued; see the 12.18.  Balancer section for more details.

To remove a disk from the list of reported disks and all placement groups it participates in:

# storpool disk 1422 forget
OK

To get detailed information about a given disk:

# storpool disk 1101 info
agAllocated | agCount | agFree | agFreeing | agFull | agMaxSizeFull | agMaxSizePartial | agPartial
      7 |     462 |    455 |         1 |      0 |             0 |                1 |         1

entriesAllocated | entriesCount | entriesFree | sectorsCount
              50 |      1080000 |     1079950 |    501215232

objectsAllocated | objectsCount | objectsFree | objectStates
              18 |       270000 |      269982 | ok:18

serverId | 1

id                   | objectsCount | onDiskSize | storedSize | objectStates
#bad_id              |            1 |       0  B |       0  B | ok:1
#clusters            |            1 |    8.0 KiB |     768  B | ok:1
#drive_state         |            1 |    8.0 KiB |     4.0  B | ok:1
#drives              |            1 |    100 KiB |     96 KiB | ok:1
#iscsi_config        |            1 |     12 KiB |    8.0 KiB | ok:1
[snip]

To get detailed information about the objects on a particular disk:

# storpool disk 1101 list
object name         |  stored size | on-disk size | data version | object state | parent volume
#bad_id:0            |         0  B |         0  B |    1480:2485 |       ok (1) |
#clusters:0          |       768  B |      8.0 KiB |      711:992 |       ok (1) |
#drive_state:0       |       4.0  B |      8.0 KiB |    1475:2478 |       ok (1) |
#drives:0            |       96 KiB |      100 KiB |    1480:2484 |       ok (1) |
[snip]
test:4094            |         0  B |         0  B |          0:0 |       ok (1) |
test:4095            |         0  B |         0  B |          0:0 |       ok (1) |
----------------------------------------------------------------------------------------------------
4115 objects         |      394 KiB |      636 KiB |              |              |

To get detailed information about the active requests that the disk is performing at the moment:

# storpool disk 1101 activeRequests
-----------------------------------------------------------------------------------------------------------------------------------
| request ID                     |  request IDX |               volume |         address |       size |       op |    time active |
-----------------------------------------------------------------------------------------------------------------------------------
| 9226469746279625682:285697101441249070 |            9 |           testvolume |     85276782592 |     4.0 KiB |     read |         0 msec |
| 9226469746279625682:282600876697431861 |           13 |           testvolume |     96372936704 |     4.0 KiB |     read |         0 msec |
| 9226469746279625682:278097277070061367 |           19 |           testvolume |     46629707776 |     4.0 KiB |     read |         0 msec |
| 9226469746279625682:278660227023482671 |          265 |           testvolume |     56680042496 |     4.0 KiB |    write |         0 msec |
-----------------------------------------------------------------------------------------------------------------------------------

To issue a retrim operation on a disk (available for SSD disks only):

# storpool disk 1101 retrim
OK

To start, pause or continue a scrubbing operation for a disk:

# storpool disk 1101 scrubbing start
OK
# storpool disk 1101 scrubbing pause
OK
# storpool disk 1101 scrubbing continue
OK

Note

Use storpool disk list internal to check the status of a running scrubbing operation, or to see when the last scrubbing operation for this disk completed.

12.9. Placement groups

The placement groups are predefined sets of disks, over which volume objects will be replicated. It is possible to specify which individual disks to add to the group.

To display the defined placement groups in the cluster:

# storpool placementGroup list
name
default
hdd
ssd

To display details about a placement group:

# storpool placementGroup ssd list
type   | id
disk   | 1101 1201 1301 1401

Creating a new placement group or extending an existing one requires specifying its name and providing one or more disks to be added:

# storpool placementGroup ssd addDisk 1102
OK
# storpool placementGroup ssd addDisk 1202
OK
# storpool placementGroup ssd addDisk 1302 addDisk 1402
OK
# storpool placementGroup ssd list
type | id
disk   | 1101 1102 1201 1202 1301 1302 1401 1402
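
When a larger number of drives has to be added, the same command can be scripted. A minimal sketch (using a hypothetical group newpg and illustrative disk IDs, not part of the running example above):

# for d in 2101 2201 2301 2401; do storpool placementGroup newpg addDisk "$d"; done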

To remove one or more disks from a placement group:

# storpool placementGroup ssd rmDisk 1402
OK
# storpool placementGroup ssd list
type | id
disk   | 1101 1102 1201 1202 1301 1302 1401

To rename a placement group:

# storpool placementGroup ssd rename M500DC
OK

Unused placement groups can be removed. To avoid accidents, the name of the group must be entered twice:

# storpool placementGroup ssd delete ssd
OK

12.10. Volumes

Volumes are the basic service of the StorPool storage system. A volume always has a name and a certain size. It can be read from and written to, and it can be attached to hosts as a read-only or read-write block device under the /dev/storpool directory. A volume may have one or more tags created or changed in the form name=value. The volume name is a string consisting of one or more of the allowed characters - upper and lower Latin letters (a-z, A-Z), digits (0-9), and the delimiters dot (.), colon (:), dash (-) and underscore (_). The same rules apply for the keys and values used for the volume tags. The volume name including tags cannot exceed 200 bytes.

When a volume is created, at minimum the <volumeName>, its size, and a <template> (or explicit placement/replication details) must be specified:

# storpool volume testvolume size 100G template hybrid

Additional parameters that can be used or overridden:

placeAll

Place all objects in placementGroup (Default value: default).

placeTail

Name of placementGroup for reader (Default value: same as placeAll value).

placeHead

Place the third replica in a different placementGroup (Default value: same as placeAll value)

template

Use a template with preconfigured placement, replication, and/or limits; for details, see 12.14.  Templates. Using templates is strongly encouraged, as it simplifies tracking and capacity management.

parent

Use a snapshot as a parent for this volume.

reuseServer

Place multiple copies on the same server.

baseOn

Use a parent volume; this will create a transient snapshot used as a parent. For details, see 12.11.  Snapshots.

iops

Set the maximum IOPS limit for this volume (in IOPS).

bw

Set maximum bandwidth limit (in MB/s).

tag

Set a tag for this volume in the form name=value.

create

Create the volume, fail if it exists (optional for now).

update

Update the volume, fail if it does not exist (optional for now).

limitType

Specify whether iops and bw limits ought to be for the total size of the block device or per each GiB (one of “total” or “perGiB”)

The create option is useful in scripts where an unintended update of an existing volume must be prevented:

# storpool volume test create template hybrid
OK
# storpool volume test create size 200G template hybrid
Error: Volume 'test' already exists

A statement with the update parameter will fail with an error if the volume does not exist:

# storpool volume test update template hybrid size +100G
OK
# storpool volume test1 update template hybrid
Error: volume 'test1' does not exist
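
These semantics make simple idempotent provisioning scripts possible. A minimal sketch (the volume name and size are hypothetical, and it assumes the CLI exits with a non-zero status when it prints an error):

# storpool volume vm-disk-1 create size 100G template hybrid || echo "volume vm-disk-1 already exists, leaving it unchanged"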

12.10.1. Listing all volumes

To list all available volumes:

# storpool volume list
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| volume               |    size  | rdnd. | placeHead  | placeAll   | placeTail  |   iops  |    bw   | parent               | template             | flags     | tags       |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| testvolume           |  100 GiB |     3 | ultrastar  | ultrastar  | ssd        |       - |       - | testvolume@35691     | hybrid               |           | name=value |
| testvolume_8_2       |  100 GiB |   8+2 |       nvme |      nvme  | nvme       |       - |       - | testvolume_8_2@35693 | nvme                 |           | name=value |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Flags:
R - allow placing two disks within a replication chain onto the same server
t - volume move target. Waiting for the move to finish
G - IOPS and bandwidth limits are per GiB and depend on the volume/snapshot size

12.10.2. Listing exported volumes

To list volumes exported to other sub-clusters in the multi-cluster:

# storpool volume list exports
---------------------------------
| remote    | volume | globalId |
---------------------------------
| Lab-D-cl2 | test   | d.n.buy  |
---------------------------------

To list volumes exported from other sub-clusters to this one in a multi-cluster setup:

# storpool volume list remote
--------------------------------------------------------------------------
| location | remoteId | name | size         | creationTimestamp   | tags |
--------------------------------------------------------------------------
| Lab-D    | d.n.buy  | test | 137438953472 | 2020-05-27 11:57:38 |      |
--------------------------------------------------------------------------

Note

Once attached, a remotely exported volume is no longer visible in volume list remote, even if the export is still visible in the remote cluster with volume list exports. Each export invocation in the local cluster is consumed by a single attach in the remote cluster.

12.10.3. Volume status

To get an overview of all volumes and snapshots and their state in the system:

# storpool volume status
----------------------------------------------------------------------------------------------------------------------------------------------------
| volume               |     size | rdnd. | tags       |  alloc % |   stored |  on disk | syncing | missing | status    | flags | drives down      |
----------------------------------------------------------------------------------------------------------------------------------------------------
| testvolume           |  100 GiB |     3 | name=value |    0.0 % |     0  B |     0  B |    0  B |    0  B | up        |       |                  |
| testvolume@35691     |  100 GiB |     3 |            |  100.0 % |  100 GiB |  317 GiB |    0  B |    0  B | up        | S     |                  |
----------------------------------------------------------------------------------------------------------------------------------------------------
| 2 volumes            |  200 GiB |       |            |   50.0 % |  100 GiB |  317 GiB |    0  B |    0  B |           |       |                  |
----------------------------------------------------------------------------------------------------------------------------------------------------

Flags:
  S - snapshot
  B - balancer blocked on this volume
  D - decreased redundancy (degraded)
  M - migrating data to a new disk
  R - allow placing two disks within a replication chain onto the same server
  t - volume move target. Waiting for the move to finish
  C - disk placement constraints violated, rebalance needed

The columns in this output are:

  • volume - name of the volume or snapshot (see flags below)

  • size - provisioned volume size, the visible size inside a VM for example

  • rdnd. - number of copies for this volume or its erasure coding scheme

  • tags - all custom key=value tags configured for this volume or snapshot

  • alloc % - how much of the provisioned size is in use, in percent

  • stored - space allocated on this volume

  • on disk - the size allocated on all drives in the cluster after replication and the overhead from data protection

  • syncing - how much data is not in sync after a drive or server was missing; the data is recovered automatically once the missing drive or server is back in the cluster

  • missing - how much data is not available for this volume when its status is down (see status below)

  • status - shows the status of the volume, which could be one of:

    • up - all copies are available

    • down - none of the copies are available for some parts of the volume

    • up soon - all copies are available and the volume will come up shortly

  • flags - flags denoting features of this volume:

    • S - stands for snapshot, which is essentially a read-only (frozen) volume

    • B - used to denote that the balancer is blocked for this volume (usually when some of the drives are missing)

    • D - displayed when some of the copies are either unavailable or outdated and the volume is running with decreased redundancy

    • M - displayed when changing the replication or a cluster re-balance is in progress

    • R - displayed when the policy for keeping copies on different servers is overridden

    • C - displayed when the volume or snapshot placement constraints are violated

  • drives down - shown when the volume is in the down state; lists the drives required to bring the volume back up.

Sizes are shown in B, KiB, MiB, GiB, TiB, or PiB.

To get just the status data from the storpool_controller services in the cluster, without detailed information such as stored and on-disk sizes:

# storpool volume quickStatus
----------------------------------------------------------------------------------------------------------------------------------------------------
| volume               |     size | rdnd. | tags       |  alloc % |   stored |  on disk | syncing | missing | status    | flags | drives down      |
----------------------------------------------------------------------------------------------------------------------------------------------------
| testvolume           |  100 GiB |     3 | name=value |    0.0 % |     0  B |     0  B |    0  B |    0  B | up        |       |                  |
| testvolume@35691     |  100 GiB |     3 |            |    0.0 % |     0  B |     0  B |    0  B |    0  B | up        | S     |                  |
----------------------------------------------------------------------------------------------------------------------------------------------------
| 2 volumes            |  200 GiB |       |            |    0.0 % |     0  B |     0  B |    0  B |    0  B |           |       |                  |
----------------------------------------------------------------------------------------------------------------------------------------------------
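
For scripted health checks, one rough approach is to look for volumes whose status column contains down. This is only a sketch - parsing the human-readable table is fragile and the JSON output of the API is better suited for real monitoring:

# storpool volume quickStatus | awk -F'|' '$11 ~ /down/ {print $2}'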

Note

quickStatus has a smaller impact on the storpool_server services, and thus on end-user operations, because the gathered data does not include the detailed per-volume storage statistics provided by status.

12.10.4. Used space estimation

To check the estimated used space by the volumes in the system:

# storpool volume usedSpace
-----------------------------------------------------------------------------------------
| volume               |        size | rdnd. |      stored |        used | missing info |
-----------------------------------------------------------------------------------------
| testvolume           |     100 GiB |     3 |     1.9 GiB |     100 GiB |         0  B |
-----------------------------------------------------------------------------------------

The columns are as follows:

  • volume - name of the volume

  • size - the provisioned size of this volume

  • rdnd. - number of copies for this volume or its erasure coding scheme

  • stored - how much data is stored for this volume (not counting data in its parent snapshots)

  • used - how much data has been written (including the data written in parent snapshots)

  • missing info - if this value is anything other than 0  B, some of the storpool_controller services in the cluster are probably not running correctly.

Note

The used column shows how much data is accessible and reserved for this volume.

12.10.5. Listing disk sets and objects

To list the target disk sets and objects of a volume:

# storpool volume testvolume list
volume testvolume
size 100 GiB
replication 3
placeHead hdd
placeAll hdd
placeTail ssd
target disk sets:
       0: 1122 1323 1203
       1: 1424 1222 1301
       2: 1121 1324 1201
[snip]
  object: disks
       0: 1122 1323 1203
       1: 1424 1222 1301
       2: 1121 1324 1201
[snip]

Hint

In this example, the volume uses hybrid placement, with two copies on HDDs and one copy on SSDs (the rightmost column of the disk sets). The target disk sets are lists of triplets of drives in the cluster used as a template for the actual objects of the volume.

To get detailed info about the disks used for this volume and the number of objects on each of them:

# storpool volume testvolume info
    diskId | count
  1101 |   200
  1102 |   200
  1103 |   200
  [snip]

chain                | count
1121-1222-1404       |  25
1121-1226-1303       |  25
1121-1226-1403       |  25
[snip]

diskSet              | count
218-313-402          |   3
218-317-406          |   3
219-315-402          |   3

Note

The diskSet order does not follow placeHead, placeAll, placeTail; check the actual order in the storpool volume <volumename> list output. This is done so that diskSets containing the same disks in a different order are counted in the same slot, i.e. [101, 201, 301] is accounted as the same diskSet as [201, 101, 301].

12.10.6. Managing volumes

To rename a volume:

# storpool volume testvolume rename newvolume
OK

Attention

Changing the name of a volume does not wait for the clients that have it attached to update the name of the symlink. Always use client sync for all clients with the volume attached.
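
A rough sketch of such a sync, which finds every client with the renamed volume attached by parsing the storpool attach list output (see 12.12.  Attachments) and syncs each of them:

# storpool attach list | awk -F'|' '$3 ~ /newvolume/ {print $2}' | while read -r c; do storpool client "$c" sync; done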

To add a tag for a volume:

# storpool volume testvolume tag name=value

To change a tag for a volume:

# storpool volume testvolume tag name=newvalue

To remove a tag just set it to an empty value:

# storpool volume testvolume tag name=

To resize a volume up:

# storpool volume testvolume size +1G
OK

To shrink a volume (resize down):

# storpool volume testvolume size 50G shrinkOk

Attention

Shrinking a StorPool volume changes the size of the block device, but does not adjust the size of any LVM volume or filesystem contained in it. Failing to shrink the filesystem or LVM first will result in data loss.
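
For example, with an ext4 filesystem directly on the volume, a cautious order of operations might look like the sketch below (sizes are illustrative; the filesystem must be unmounted and shrunk to at most the new volume size before the volume itself is shrunk):

# umount /dev/storpool/testvolume
# e2fsck -f /dev/storpool/testvolume
# resize2fs /dev/storpool/testvolume 50G
# storpool volume testvolume size 50G shrinkOk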

To delete a volume:

# storpool volume vol1 delete vol1

Note

To avoid accidents, the volume name must be entered twice. As a safety precaution, attached volumes cannot be deleted even when not in use. For details, see 12.12.  Attachments.

A volume based on a snapshot can be converted to a stand-alone volume. For example, the testvolume below is based on an anonymous snapshot:

# storpool_tree
StorPool
  `-testvolume@37126
     `-testvolume

To rebase it against root (known also as “promote”):

# storpool volume testvolume rebase
OK
# storpool_tree
StorPool
  `- testvolume@255 [snapshot]
     `- testvolume [volume]

The rebase operation can also target a particular snapshot from the chain of parent snapshots above the volume:

# storpool_tree
StorPool
  `- testvolume-snap1 [snapshot]
     `- testvolume-snap2 [snapshot]
        `- testvolume-snap3 [snapshot]
           `- testvolume [volume]
# storpool volume testvolume rebase testvolume-snap2
OK

After the operation the volume is directly based on testvolume-snap2 and includes all changes from testvolume-snap3:

# storpool_tree
StorPool
  `- testvolume-snap1 [snapshot]
     `- testvolume-snap2 [snapshot]
        |- testvolume [volume]
        `- testvolume-snap3 [snapshot]

To back up a volume named testvolume to a configured remote location StorPoolLab-Sofia:

# storpool volume testvolume backup StorPoolLab-Sofia
OK

This operation creates a temporary snapshot and transfers it to the StorPoolLab-Sofia location. After the transfer completes, the local temporary snapshot is deleted and the remote snapshot becomes visible as exported from StorPoolLab-Sofia. For more information on working with snapshot exports, see 12.11.1.  Remote snapshots.

When backing up a volume, one or more tags may be applied to the remote snapshot, as in the example below:

# storpool volume testvolume backup StorPoolLab-Sofia tag key=value # [tag key2=value2]
OK
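
Backups of several volumes can be scripted in the same way. A minimal sketch (the volume names are hypothetical and the remote location must already be configured) that tags each remote snapshot with the date of the backup:

# for v in vm-disk-1 vm-disk-2; do storpool volume "$v" backup StorPoolLab-Sofia tag backup-date="$(date +%F)"; done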

To move a volume to a different cluster in a multicluster environment (for more about clusters, see the multicluster sections of this guide):

# storpool volume testvolume moveToRemote Lab-D-cl2 # onAttached export

Note

Moving a volume to a remote cluster will fail if the volume is attached on a local host. What to do in such a case can be specified with the onAttached parameter, as shown in the comment in the example above. More information on volume move is available in 17.13.  Volume and snapshot move.

12.11. Snapshots

Snapshots are read-only point-in-time images of volumes. They are created once and cannot be changed. They can be attached to hosts as read-only block devices under /dev/storpool. Volumes and snapshots share the same namespace, thus their names are unique within a StorPool cluster. Volumes can be based on snapshots. Such volumes contain only the changes since the snapshot was taken. After a volume is created from a snapshot, writes are recorded in the volume. Reads from the volume may be served by the volume itself or by its parent snapshot, depending on whether the volume contains changed data for the requested range.

To create an unnamed (known also as anonymous) snapshot of a volume:

# storpool volume testvolume snapshot
OK

This will create a snapshot named testvolume@<ID>, where ID is a unique serial number. Note that any tags on the volume will not be propagated to the snapshot; to set tags on the snapshot at creation time:

# storpool volume testvolume tag key=value snapshot

To create a named snapshot of a volume:

# storpool volume testvolume snapshot testsnap
OK

Again, to set tags directly:

# storpool volume testvolume snapshot testsnapplustags tag key=value

To remove a tag on a snapshot:

# storpool snapshot testsnapplustags tag key=

To list the snapshots:

# storpool snapshot list
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| snapshot          |    size  | rdnd. | placeHead | placeAll   | placeTail | created on          | volume      | iops  | bw   | parent          | template  | flags | targetDeleteDate | tags      |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| testsnap          |  100 GiB |     3 | hdd       | hdd        | ssd       | 2019-08-30 04:11:23 | testvolume  |     - |    - | testvolume@1430 | hybrid-r3 |       | -                | key=value |
| testvolume@1430   |  100 GiB |     3 | hdd       | hdd        | ssd       | 2019-08-30 03:56:58 | testvolume  |     - |    - |                 | hybrid-r3 | A     | -                |           |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Flags:
  A - anonymous snapshot with auto-generated name
  B - bound snapshot
  D - snapshot currently in the process of deletion
  T - transient snapshot (created during volume cloning)
  R - allow placing two disks within a replication chain onto the same server
  P - snapshot delete blocked due to multiple children

To list the snapshots only for a particular volume:

# storpool volume testvolume list snapshots
[snip]

To create a bound snapshot on a volume:

# storpool volume testvolume bound snapshot
OK

This snapshot will be automatically deleted when the last child volume created from it is deleted. Useful for non-persistent images.

To list the target disk sets and objects of a snapshot:

# storpool snapshot testsnap list
[snip]

The output is similar to that of storpool volume <volumename> list.

To get detailed info about the disks used for this snapshot and the number of objects on each of them:

# storpool snapshot testsnap info
[snip]

The output is similar to that of storpool volume <volumename> info.

To create a volume based on an existing snapshot (cloning):

# storpool volume testvolume parent centos73-base-snap
OK

To revert a volume to an existing snapshot:

# storpool volume testvolume revertToSnapshot centos73-working
OK

This is also possible through the use of templates with a parent snapshot (see 12.14.  Templates):

# storpool volume spd template centos73-base
OK

Create a volume based on another existing volume (cloning):

# storpool volume testvolume1 baseOn testvolume
OK

Note

This operation will first create an anonymous bound snapshot on testvolume and will then create testvolume1 with the bound snapshot as parent. The snapshot will exist until both volumes are deleted and will be automatically deleted afterwards.

To delete a snapshot:

# storpool snapshot spdb_snap1 delete spdb_snap1
OK

Note

To avoid accidents, the name of the snapshot must be entered twice.

A snapshot can also be bound to its child volumes; it will then exist until all of its child volumes are deleted:

# storpool snapshot testsnap bind
OK

The opposite operation, unbinding such a snapshot, is also possible:

# storpool snapshot testsnap unbind
OK

To get the space that will be freed if a snapshot is deleted:

# storpool snapshot space
----------------------------------------------------------------------------------------------------------------
| snapshot             | on volume            |        size | rdnd. |      stored |        used | missing info |
----------------------------------------------------------------------------------------------------------------
| testsnap             | testvolume           |     100 GiB |     3 |      27 GiB |    -135 GiB |         0  B |
| testvolume@3794      | testvolume           |     100 GiB |     3 |      27 GiB |     1.9 GiB |         0  B |
| testvolume@3897      | testvolume           |     100 GiB |     3 |     507 MiB |     432 KiB |         0  B |
| testvolume@3899      | testvolume           |     100 GiB |     3 |     334 MiB |     224 KiB |         0  B |
| testvolume@4332      | testvolume           |     100 GiB |     3 |      73 MiB |      36 KiB |         0  B |
| testvolume@4333      | testvolume           |     100 GiB |     3 |      45 MiB |      40 KiB |         0  B |
| testvolume@4334      | testvolume           |     100 GiB |     3 |      59 MiB |      16 KiB |         0  B |
| frozenvolume         | -                    |       8 GiB |     2 |      80 MiB |      80 MiB |         0  B |
----------------------------------------------------------------------------------------------------------------

Used mainly for accounting purposes. The columns are as follows:

  • snapshot - name of the snapshot

  • on volume - the name of the child volume of this snapshot, if any. For example, a frozen volume would have this field empty.

  • size - the size of the snapshot as provisioned

  • rdnd. - number of copies for this volume or its erasure coding scheme

  • stored - how much data is actually written

  • used - stands for the amount of data that would be freed from the underlying drives (before redundancy) if the snapshot is removed.

  • missing info - if this value is anything other than 0  B, some of the storpool_controller services in the cluster are probably not running correctly.

The used column can be negative in some cases when the snapshot has more than one child volume. In these cases deleting the snapshot would “free” negative space, i.e. it would end up taking more space on the underlying disks.

Similar to volumes, a snapshot can be assigned different placement groups, templates, or other attributes:

# storpool snapshot testsnap template all-ssd
OK

Additional parameters that may be used:

  • placeAll - place all objects in placementGroup (Default value: default)

  • placeTail - name of placementGroup for reader (Default value: same as placeAll value)

  • placeHead - place the third replica in a different placementGroup (Default value: same as placeAll value)

  • reuseServer - place multiple copies on the same server

  • tag - set a tag in the form key=value

  • template - use template with preconfigured placement, replication, and/or limits (check 12.14.  Templates for details).

  • iops - set the maximum IOPS limit for this snapshot (in IOPS)

  • bw - set maximum bandwidth limit (in MB/s)

  • limitType - specify whether iops and bw limits ought to be for the total size of the block device or per each GiB (one of “total” or “perGiB”)

Note

The bandwidth and IOPS limits apply only to the particular snapshot if it is attached, and do not limit any child volumes using this snapshot as a parent.

Similar to the same operation for volumes, a snapshot can be renamed with:

# storpool snapshot testsnap rename ubuntu1604-base
OK

Attention

Changing the name of a snapshot will not wait for clients that have this snapshot attached to update the name of the symlink. Always use client sync for all clients with the snapshot attached.

A snapshot could also be rebased to root (promoted) or rebased to another parent snapshot in a chain:

# storpool snapshot testsnap rebase # [parent-snapshot-name]
OK

To delete a snapshot:

# storpool snapshot testsnap delete testsnap
OK

Note

A snapshot sometimes will not be deleted immediately; during this period it will be visible with * in the output of storpool volume status or storpool snapshot list.

To set a snapshot for deferred deletion:

# storpool snapshot testsnap deleteAfter 1d
OK

The above will set a target delete date for this snapshot in exactly one day from the present time.

Note

The snapshot will be deleted at the desired point in time only if delayed snapshot deletion is enabled in the local cluster; for details, see 12.22.  Management configuration.
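
Combining snapshot creation with deferred deletion gives a simple snapshot-rotation sketch (the names are hypothetical, the 7d period assumes the same duration syntax as the 1d example above, and delayed deletion must be enabled as described in the note):

# snap="testvolume-$(date +%Y%m%d)"
# storpool volume testvolume snapshot "$snap"
# storpool snapshot "$snap" deleteAfter 7d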

12.11.1. Remote snapshots

If multi-site or multicluster is enabled (the cluster has a bridge service running), a snapshot can be exported and become visible to other configured clusters.

For example to export a snapshot snap1 to a location named StorPoolLab-Sofia:

# storpool snapshot snap1 export StorPoolLab-Sofia
OK

To list the presently exported snapshots:

# storpool snapshot list exports
-------------------------------------------------------------------------------
| remote                 | snapshot    | globalId    | backingUp | volumeMove |
-------------------------------------------------------------------------------
| StorPoolLab-Sofia      | snap1       | nzkr.b.cuj  | false     | false      |
-------------------------------------------------------------------------------

To list the snapshots exported from remote sites:

# storpool snapshot list remote
------------------------------------------------------------------------------------------
| location | remoteId | name      | onVolume | size         | creationTimestamp   | tags |
------------------------------------------------------------------------------------------
| s02      | a.o.cxz  | snapshot1 |          | 107374182400 | 2019-08-20 03:21:42 |      |
------------------------------------------------------------------------------------------

A single snapshot can be exported to multiple configured locations.

To create a clone of a remote snapshot locally:

# storpool snapshot snapshot1-copy template hybrid-r3 remote s02 a.o.cxz # [tag key=value]

In this example the remote location is s02 and the remoteId is a.o.cxz. Any key=value pair tags may be configured at creation time.

To unexport a local snapshot:

# storpool snapshot snap1 unexport StorPoolLab-Sofia
OK

The remote location can be replaced with the keyword all. This will attempt to unexport the snapshot from all locations it was previously exported to.

Note

If the snapshot is currently being transferred, the unexport operation will fail. It can be forced by adding force at the end of the unexport command; however, this is discouraged in favor of waiting for any active transfer to complete.

To unexport a remote snapshot:

# storpool snapshot remote s02 a.o.cxz unexport
OK

The snapshot will no longer be visible with storpool snapshot list remote.

To unexport a remote snapshot and also set for deferred deletion in the remote site:

# storpool snapshot remote s02 a.o.cxz unexport deleteAfter 1h
OK

This will attempt to set a target delete date for a.o.cxz in the remote site exactly one hour from the present time. If the minimumDeleteDelay in the remote site has a higher value, e.g. 1 day, the selected value will be overridden by the minimumDeleteDelay - in this example 1 day. For more information on deferred deletion, see the 17.2.  Multi site section of this guide.

To move a snapshot to a different cluster in a multicluster environment (for more about clusters, see the multicluster sections of this guide):

# storpool snapshot snap1 moveToRemote Lab-D-cl2

Note

Moving a snapshot to a remote cluster is forbidden for attached snapshots. More info on snapshot move is available in 17.13.  Volume and snapshot move.

12.12. Attachments

Attaching a volume or snapshot makes it accessible to a client under the /dev/storpool and /dev/storpool-byid directories. Volumes can be attached as read-only or read-write. Snapshots are always attached read-only.

To attach a volume testvolume to a client with ID 1, creating the block device /dev/storpool/testvolume:

# storpool attach volume testvolume client 1
OK

To attach a volume/snapshot to the node you are currently connected to:

# storpool attach volume testvolume here
OK
# storpool attach snapshot testsnap here
OK

By default, this command will block until the volume is attached to the client and the /dev/storpool/<volumename> symlink is created. For example if the storpool_block service has not been started the command will wait indefinitely. To set a timeout for this operation:

# storpool attach volume testvolume here timeout 10 # seconds
OK

To completely disregard the readiness check:

# storpool attach volume testvolume here noWait
OK

Note

The use of noWait is discouraged in favor of the default behaviour of the attach command.

Attaching a volume will create a read-write block device attachment by default. To attach it read-only:

# storpool volume testvolume2 attach client 12 mode ro
OK
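
Once a volume is attached read-write, it can be used like any local block device. A minimal sketch of creating and mounting a filesystem on it from the client node (the mount point is hypothetical; the device path follows from the attach examples above):

# storpool attach volume testvolume here
# mkfs.xfs /dev/storpool/testvolume
# mkdir -p /mnt/testvolume
# mount /dev/storpool/testvolume /mnt/testvolume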

To list all attachments:

# storpool attach list
-------------------------------------------------------------------
| client | volume               | globalId | mode | tags          |
-------------------------------------------------------------------
|     11 | testvolume           | d.n.a1z  | RW   | vc-policy=no  |
|     12 | testvolume1          | d.n.c2p  | RW   | vc-policy=no  |
|     12 | testvolume2          | d.n.uwp  | RO   | vc-policy=no  |
|     14 | testsnap             | d.n.s1m  | RO   | vc-policy=no  |
-------------------------------------------------------------------

To detach:

# storpool detach volume testvolume client 1 # or 'here' if the command is being executed on client ID 1

If a volume is actively being written to or read from, a detach operation will fail:

# storpool detach volume testvolume client 11
Error: 'testvolume' is open at client 11

In this case the detach can be forced; beware that forcing a detachment is discouraged:

# storpool detach volume testvolume client 11 force yes
OK

Attention

Any operations on the volume will receive an I/O error when it is forcefully detached. Some mounted filesystems may cause a kernel panic when a block device disappears while there are live operations on it, so be extra careful if such filesystems are mounted directly on a hypervisor node.

If a volume or snapshot is attached to more than one client it could be detached from all nodes with a single CLI command:

# storpool detach volume testvolume all
OK
# storpool detach snapshot testsnap all
OK

12.13. Client

To check the status of the active storpool_block services in the cluster:

# storpool client status
-----------------------------------
|  client  |       status         |
-----------------------------------
|       11 | ok                   |
|       12 | ok                   |
|       13 | ok                   |
|       14 | ok                   |
-----------------------------------

To wait until a client is updated:

# storpool client 13 sync
OK

This is a way to ensure that a volume with a changed size is visible with its new size on all clients it is attached to.

To show detailed information about the active requests on a particular client at this moment:

# storpool client 13 activeRequests
------------------------------------------------------------------------------------------------------------------------------------
| request ID                     |  request IDX |               volume |         address |        size |       op |    time active |
------------------------------------------------------------------------------------------------------------------------------------
| 9224499360847016133:3181950    |  1044        | testvolume           |     10562306048 |     128 KiB |    write |        65 msec |
| 9224499360847016133:3188784    |  1033        | testvolume           |     10562437120 |      32 KiB |     read |        63 msec |
| 9224499360847016133:3188977    |  1029        | testvolume           |     10562568192 |     128 KiB |     read |        21 msec |
| 9224499360847016133:3189104    |  1026        | testvolume           |     10596122624 |     128 KiB |     read |         3 msec |
| 9224499360847016133:3189114    |  1035        | testvolume           |     10563092480 |     128 KiB |     read |         2 msec |
| 9224499360847016133:3189396    |  1048        | testvolume           |     10629808128 |     128 KiB |     read |         1 msec |
------------------------------------------------------------------------------------------------------------------------------------

12.14. Templates

Templates enable easy and consistent setup and usage tracking for large collections of volumes and their snapshots with common attributes, for example replication, placement groups, and/or a common parent snapshot.

To create a template:

# storpool template nvme replication 3 placeAll nvme
OK
# storpool template magnetic replication 3 placeAll hdd
OK
# storpool template hybrid replication 3 placeAll hdd placeTail ssd
OK
# storpool template ssd-hybrid replication 3 placeAll ssd placeHead hdd
OK
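
A template can also carry a parent snapshot, so that every volume created with it starts as a clone of that snapshot. A sketch, assuming a snapshot named centos73-base-snap already exists (as in the cloning example in 12.11.  Snapshots), that the parameters can be combined in a single command as above, and using an illustrative placement group and size:

# storpool template centos73-base replication 3 placeAll ssd size 100G parent centos73-base-snap
# storpool volume vm-disk-1 template centos73-base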

To list all created templates:

# storpool template list
-------------------------------------------------------------------------------------------------------------------------------------
| template             |   size  | rdnd. | placeHead   | placeAll   | placeTail  |   iops  |    bw   | parent               | flags |
-------------------------------------------------------------------------------------------------------------------------------------
| nvme                 |       - |     3 | nvme        | nvme       | nvme       |       - |       - |                      |       |
| magnetic             |       - |     3 | hdd         | hdd        | hdd        |       - |       - |                      |       |
| hybrid               |       - |     3 | hdd         | hdd        | ssd        |       - |       - |                      |       |
| ssd-hybrid           |       - |     3 | hdd         | ssd        | ssd        |       - |       - |                      |       |
-------------------------------------------------------------------------------------------------------------------------------------

Please refer to 14.  Redundancy for more info on replication and erasure coding schemes (shown in rdnd. above).

To get the status of a template with detailed info on the usage and the available space left with this placement:

# storpool template status
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| template             | place head |  place all | place tail | rdnd. | volumes | snapshots/removing |     size |  capacity |   avail. |  avail. all |  avail. tail |  avail. head | flags |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| magnetic             | hdd        | hdd        | hdd        |     3 |     115 |       631/0        |   28 TiB |    80 TiB |   52 TiB |     240 TiB |      240 TiB |      240 TiB |       |
| hybrid               | hdd        | ssd        | hdd        |     3 |     208 |       347/9        |   17 TiB |    72 TiB |   55 TiB |     240 TiB |       72 TiB |      240 TiB |       |
| ssd-hybrid           | ssd        | ssd        | hdd        |     3 |      40 |         7/0        |    4 TiB |    36 TiB |   36 TiB |     240 TiB |       72 TiB |      240 TiB |       |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

To change template attributes directly:

# storpool template hdd-only size 120G propagate no
OK
# storpool template hybrid size 40G iops 4000 propagate no
OK

Parameters that can be set:

  • replication - change the number of copies for volumes or snapshots created with this template

  • size - default size if not specified for each volume created with this template

  • placeAll - place all objects in placementGroup (Default value: default)

  • placeTail - name of placementGroup for reader (Default value: same as placeAll value)

  • placeHead - place the third replica in a different placementGroup (Default value: same as placeAll value)

  • iops - set the default maximum IOPS limit for volumes created with this template (in IOPS)

  • bw - set the default maximum bandwidth limit for volumes created with this template (in MB/s)

  • parent - set parent snapshot for all volumes created in this template

  • reuseServer - place multiple copies on the same server

  • limitType - specify whether iops and bw limits ought to be for the total size of the block device or per each GiB (one of “total” or “perGiB”)

When changing parameters on an already created template, the propagate parameter is required in order to specify whether the changes should also be applied to all existing volumes and/or snapshots created with this template. The parameter is required regardless of whether the template has any volumes and/or snapshots created with it.

For example, to change the bandwidth limit for all volumes and snapshots created with the already existing template magnetic:

# storpool template magnetic bw 100MB propagate yes
OK

Note

When using storpool template $TEMPLATE propagate yes, all the parameters of $TEMPLATE will be re-applied to all volumes and snapshots created with it.

Note

Changing template parameters with the propagate option will not automatically re-allocate the content of the existing volumes on the disks. If the replication or placement groups are changed, run the balancer to apply the new settings to the existing volumes. However, if the changes are made directly to the volume instead of to the template, running the balancer is not required.

Attention

Dropping the replication (for example, from triple to dual) of a large number of volumes is an almost instant operation; however, returning them to triple replication is similar to creating the third copy for the first time. This is why lowering the replication (e.g. from 3 to 2) requires using replicationReduce as a safety measure.

To rename a template:

# storpool template magnetic rename backup
OK

To delete a template:

# storpool template hdd-only delete hdd-only
OK

Note

The delete operation might fail if there are volumes/snapshots that are created with this template.

12.15. iSCSI

The StorPool iSCSI support is documented more extensively in the 16.  Exporting volumes using iSCSI section; these are the commands used to configure it and view the configuration.

To set the cluster’s iSCSI base IQN iqn.2019-08.com.example:examplename:

# storpool iscsi config setBaseName iqn.2019-08.com.example:examplename
OK

12.15.1. Create a portal group

To create a portal group examplepg used to group exported volumes for access by initiators using 192.168.42.247/24 (CIDR notation) as the portal IP address:

# storpool iscsi config portalGroup examplepg create addNet 192.168.42.247/24 vlan 42
OK

To create a portal for the initiators to connect to (for example, with portal IP address 192.168.42.202 on the node with StorPool's SP_OURID 5):

# storpool iscsi config portal create portalGroup examplepg address 192.168.42.202 controller 5
OK

Note

This address is handled directly by the storpool_iscsi process and will not be visible on the node with standard tools like ip or ifconfig; see 12.15.5.  iscsi_tool for these purposes.

12.15.2. Register an initiator

To define the iqn.2019-08.com.example:abcdefgh initiator that is allowed to connect from the 192.168.42.0/24 network (w/o authentication):

# storpool iscsi config initiator iqn.2019-08.com.example:abcdefgh create net 192.168.42.0/24
OK

To define the iqn.2019-08.com.example:client initiator that is allowed to connect from the 192.168.42.0/24 network and must authenticate using the standard iSCSI password-based challenge-response authentication method using the username user and the password secret:

# storpool iscsi config initiator iqn.2019-08.com.example:client create net 192.168.42.0/24 chap user secret
OK

12.15.3. Export a volume

To specify that the existing StorPool volume tinyvolume should be exported to one or more initiators:

# storpool iscsi config target create tinyvolume
OK

Note

Please note that changing the volume name after creating a target will not change the target name. Re-creating (unexport/re-export) the target will use the new volume name.

To actually export the StorPool volume tinyvolume to the iqn.2019-08.com.example:abcdefgh initiator via the examplepg portal group (the StorPool iSCSI service will automatically pick a portal to export the volume through):

# storpool iscsi config export initiator iqn.2019-08.com.example:abcdefgh portalGroup examplepg volume tinyvolume
OK

Note

The volume will be visible to the initiator as IQN <BaseName>:<volume>

The same command without specifying an initiator will export the target to all registered initiators, and the export will be shown with * as the initiator:

# storpool iscsi config export portalGroup examplepg volume tinyvolume
OK
# storpool iscsi initiator list exports
-----------------------------------------------------------------------------------------------------------------
| name                                           | volume | currentControllerId | portalGroup       | initiator |
-----------------------------------------------------------------------------------------------------------------
| iqn.2019-08.com.example:examplename:tinyvolume | test   |                  23 | examplepg         | *         |
-----------------------------------------------------------------------------------------------------------------
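
From a Linux machine running open-iscsi, the exported target can then be discovered and logged in to with the standard iscsiadm tool. A sketch, assuming the portal address and IQNs from the examples above:

# iscsiadm -m discovery -t sendtargets -p 192.168.42.202
# iscsiadm -m node -T iqn.2019-08.com.example:examplename:tinyvolume -p 192.168.42.202 --login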

12.15.4. Get iSCSI configuration

To view the iSCSI cluster base IQN:

# storpool iscsi basename
---------------------------------------
| basename                            |
---------------------------------------
| iqn.2019-08.com.example:examplename |
---------------------------------------

To view the portal groups:

# storpool iscsi portalGroup list
---------------------------------------------
| name       | networksCount | portalsCount |
---------------------------------------------
| examplepg  |             1 |            2 |
---------------------------------------------

To view the portals:

# storpool iscsi portalGroup list portals
--------------------------------------------------
| group       | address             | controller |
--------------------------------------------------
| examplepg   | 192.168.42.246:3260 |          1 |
| examplepg   | 192.168.42.202:3260 |          5 |
--------------------------------------------------

To view the defined initiators:

# storpool iscsi initiator list
---------------------------------------------------------------------------------------
| name                             | username | secret | networksCount | exportsCount |
---------------------------------------------------------------------------------------
| iqn.2019-08.com.example:abcdefgh |          |        |             1 |            1 |
| iqn.2019-08.com.example:client   | user     | secret |             1 |            0 |
---------------------------------------------------------------------------------------

To view the present state of the configured iSCSI interfaces:

# storpool iscsi interfaces list
--------------------------------------------------
| ctrlId | net 0             | net 1             |
--------------------------------------------------
|     23 | 2A:60:00:00:E0:17 | 2A:60:00:00:E0:17 |
|     24 | 2A:60:00:00:E0:18 | 2A:60:00:00:E0:18 |
|     25 | 2A:60:00:00:E0:19 | 2E:60:00:00:E0:19 |
|     26 | 2A:60:00:00:E0:1A | 2E:60:00:00:E0:1A |
--------------------------------------------------

Note

These are the same interfaces configured with SP_ISCSI_IFACE in the order of appearance:

# storpool_showconf SP_ISCSI_IFACE
SP_ISCSI_IFACE=sp0,spbond1:sp1,spbond1:[lacp]

In the above output, the sp0 interface is net ID 0 and sp1 is net ID 1.

To view the volumes that may be exported to initiators:

# storpool iscsi target list
-------------------------------------------------------------------------------------
| name                                           | volume     | currentControllerId |
-------------------------------------------------------------------------------------
| iqn.2019-08.com.example:examplename:tinyvolume | tinyvolume |               65535 |
-------------------------------------------------------------------------------------

To view the volumes currently exported to initiators:

# storpool iscsi initiator list exports
--------------------------------------------------------------------------------------------------------------------------------------
| name                                           | volume     | currentControllerId | portalGroup | initiator                        |
--------------------------------------------------------------------------------------------------------------------------------------
| iqn.2019-08.com.example:examplename:tinyvolume | tinyvolume |                   1 |             | iqn.2019-08.com.example:abcdefgh |
--------------------------------------------------------------------------------------------------------------------------------------

To list the presently active sessions in the cluster use:

# storpool iscsi sessions list
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| id   | target                                                                   | initiator                     | portal addr       | initiator addr    | timeCreated                    | nopOut | scsi   | task | dataOut | otherOut | nopIn | scsiRsp | taskRsp | dataIn | r2t | otherIn | t free | t dataOut | t queued | t processing | t dataResp | t aborted | ISID     |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 23.0 | iqn.2020-04.com.storpool:autotest:s18-1-iscsi-test-hybrid-win-server2016 | iqn.1991-05.com.microsoft:s18 | 10.1.100.123:3260 | 10.1.100.18:49414 | 2020-07-07 09:25:16 / 00:03:54 |    209 |  89328 |    0 |       0 |        2 |   209 |         |         |  45736 |   0 |       2 |    129 |         0 |        0 |            0 |          0 |         0 | 1370000  |
| 23.1 | iqn.2020-04.com.storpool:autotest:s11-1-iscsi-test-hybrid-centos6        | iqn.2020-04.com.storpool:s11  | 10.1.100.123:3260 | 10.1.100.11:44392 | 2020-07-07 09:25:33 / 00:03:37 |    218 |  51227 |    0 |       0 |        1 |   218 |         |         |  25627 |   0 |       1 |    129 |         0 |        0 |            0 |          0 |         0 | 3d0002b8 |
| 24.0 | iqn.2020-04.com.storpool:autotest:s11-1-iscsi-test-hdd-centos6           | iqn.2020-04.com.storpool:s11  | 10.1.100.124:3260 | 10.1.100.11:51648 | 2020-07-07 09:27:27 / 00:01:43 |    107 |    424 |    0 |       0 |        1 |   107 |         |         |    224 |   0 |       1 |    129 |         0 |        0 |            0 |          0 |         0 | 3d0002b9 |
| 24.1 | iqn.2020-04.com.storpool:autotest:s18-1-iscsi-test-hdd-win-server2016    | iqn.1991-05.com.microsoft:s18 | 10.1.100.124:3260 | 10.1.100.18:49422 | 2020-07-07 09:28:22 / 00:00:48 |     43 |  39568 |    0 |       0 |        2 |    43 |         |         |  19805 |   0 |       2 |    128 |         0 |        0 |            1 |          0 |         0 | 1370000  |
| 25.0 | iqn.2020-04.com.storpool:autotest:s13-1-iscsi-test-hybrid-centos7        | iqn.2020-04.com.storpool:s13  | 10.1.100.125:3260 | 10.1.100.13:45120 | 2020-07-07 09:20:46 / 00:08:24 |    481 | 154086 |    0 |       0 |        1 |   481 |         |         |  78308 |   0 |       1 |    129 |         0 |        0 |            0 |          0 |         0 | 3d0000a8 |
| 26.0 | iqn.2020-04.com.storpool:autotest:s13-1-iscsi-test-hdd-centos7           | iqn.2020-04.com.storpool:s13  | 10.1.100.126:3260 | 10.1.100.13:43858 | 2020-07-07 09:22:52 / 00:06:18 |    369 | 147438 |    0 |       0 |        1 |   369 |         |         |  74883 |   0 |       1 |    129 |         0 |        0 |            0 |          0 |         0 | 3d0000a9 |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Here, the fields are:

  • id - identifier for the node and connection; the first part matches the SP_OURID of the node where the storpool_iscsi service is running, and the second is the export number.

  • target - the target IQN

  • initiator - the initiator IQN

  • portal addr - the portal group floating address and port

  • initiator addr - the initiator address and port

  • timeCreated - the time when the session was created

Initiator:

  • nopOut - number of NOP-out requests from the initiator

  • scsi - number of SCSI commands from the initiator for this session

  • task - number of SCSI Task Management Function Requests from the initiator

  • dataOut - number of SCSI Data-Out PDUs from the initiator

  • otherOut - number of non SCSI Data-Out PDUs sent to the target (Login/Logout/SNACK or Text)

  • ISID - the initiator part of the session identifier, explicitly specified by the initiator during login.

Target:

  • nopIn - number of NOP-in PDUs from the target

  • scsiRsp - number of SCSI response PDUs from the target

  • taskRsp - number of SCSI Task Management Function Response PDUs from the target

  • dataIn - number of SCSI Data-In PDUs from the target

  • r2t - number of Ready To Transfer (R2T) PDUs from the target

  • otherIn - number of non SCSI Data-In PDUs from the target (Login/Logout/SNACK or Text)

Task queue:

  • t free - number of free task queue slots

  • t dataOut - write request waiting for data from TCP

  • t queued - number of IO requests received ready to be processed

  • t processing - number of IO requests sent to the target to process

  • t dataResp - read request queued for sending over TCP

  • t aborted - number of aborted requests

To stop exporting the tinyvolume volume to the initiator with iqn iqn.2019-08.com.example:abcdefgh and the examplepg portal group:

# storpool iscsi config unexport initiator iqn.2019-08.com.example:abcdefgh portalGroup examplepg volume tinyvolume
OK

If a target was exported to all initiators (i.e. *), not specifying an initiator will unexport from all:

# storpool iscsi config unexport portalGroup examplepg volume tinyvolume
OK

To remove an iSCSI definition for the tinyvolume volume:

# storpool iscsi config target delete tinyvolume
OK

To remove access for the iqn.2019-08.com.example:client iSCSI initiator:

# storpool iscsi config initiator iqn.2019-08.com.example:client delete
OK

To remove the portal 192.168.42.202 IP address:

# storpool iscsi config portal delete address 192.168.42.202
OK

To remove portal group examplepg after all the portals have been removed:

# storpool iscsi config portalGroup examplepg delete
OK

Note

Only portal groups without portals may be deleted.

12.15.5. iscsi_tool

With hardware-accelerated iSCSI, all traffic from/to the initiators is handled directly by the storpool_iscsi service. For example, with the cluster setup configured above, the addresses exposed on each of the nodes can be queried with /usr/lib/storpool/iscsi_tool:

# /usr/lib/storpool/iscsi_tool
usage: /usr/lib/storpool/iscsi_tool change-port 0/1 ifaceName
usage: /usr/lib/storpool/iscsi_tool ip net list
usage: /usr/lib/storpool/iscsi_tool ip neigh list
usage: /usr/lib/storpool/iscsi_tool ip route list

To list the presently configured addresses:

# /usr/lib/storpool/iscsi_tool ip net list
10.1.100.0/24 vlan 1100 ports 1,2
10.18.1.0/24 vlan 1801 ports 1,2
10.18.2.0/24 vlan 1802 ports 1,2

To list the neighbours and their last state:

# /usr/lib/storpool/iscsi_tool ip neigh list
10.1.100.11 ok F4:52:14:76:9C:B0 lastSent 1785918292753 us, lastRcvd 918669 us
10.1.100.13 ok 0:25:90:C8:E5:AA lastSent 1785918292803 us, lastRcvd 178521 us
10.1.100.18 ok C:C4:7A:EA:85:4E lastSent 1785918292867 us, lastRcvd 178099 us
10.1.100.108 ok 1A:60:0:0:E0:8 lastSent 1785918293857 us, lastRcvd 857181794 us
10.1.100.112 ok 1A:60:0:0:E0:C lastSent 1785918293906 us, lastRcvd 1157179290 us
10.1.100.113 ok 1A:60:0:0:E0:D lastSent 1785918293922 us, lastRcvd 765392509 us
10.1.100.114 ok 1A:60:0:0:E0:E lastSent 1785918293938 us, lastRcvd 526084270 us
10.1.100.115 ok 1A:60:0:0:E0:F lastSent 1785918293954 us, lastRcvd 616948781 us
10.1.100.123 ours
[snip]

The above output also includes the portalGroup addresses residing on the node with the lowest ID in the cluster.

To list routing information:

# /usr/lib/storpool/iscsi_tool ip route list
10.1.100.0/24 local
10.18.1.0/24 local
10.18.2.0/24 local

12.15.6. iscsi_targets

The /usr/lib/storpool/iscsi_targets tool is a helper tool for Linux-based initiators, showing all logged-in targets on the node:

# /usr/lib/storpool/iscsi_targets
/dev/sdn      iqn.2020-04.com.storpool:autotest:s11-1-iscsi-test-hybrid-centos6
/dev/sdo      iqn.2020-04.com.storpool:autotest:s11-1-iscsi-test-hdd-centos6
/dev/sdp      iqn.2020-04.com.storpool:autotest:s11-2-iscsi-test-hybrid-centos6
/dev/sdq      iqn.2020-04.com.storpool:autotest:s11-2-iscsi-test-hdd-centos6

12.16. Kubernetes

To register a Kubernetes cluster:

# storpool kubernetes add name cluster1
OK

To disable a Kubernetes cluster:

# storpool kubernetes update name cluster1 disable yes
OK

To enable a Kubernetes cluster:

# storpool kubernetes update name cluster1 disable no
OK

To delete a Kubernetes cluster:

# storpool kubernetes delete name cluster1
OK

To list registered Kubernetes clusters:

# storpool kubernetes list
-----------------------
| name     | disabled |
-----------------------
| cluster1 | false    |
-----------------------

To view the status of the registered Kubernetes clusters:

# storpool kubernetes status
--------------------------------------------------------------
| name     | sc | w   | pvc | noRsrc | noTempl | mode | noSC |
--------------------------------------------------------------
| cluster1 |  0 | 0/3 |   0 |      0 |       0 |    0 |    0 |
--------------------------------------------------------------
Fields:
  sc      - registered Storage Classes
  w       - watch connections to the kube adm
  pvc     - persistentVolumeClaims being provisioned
  noRsrc  - persistentVolumeClaims failed due to no resources
  noTempl - persistentVolumeClaims failed due to missing template
  mode    - persistentVolumeClaims failed due to unsupported access mode
  noSC    - persistentVolumeClaims failed due to missing storage class

12.17. Relocator

The relocator is an internal StorPool service that takes care of data relocation when a volume’s replication or placement group parameters are changed, or when there are any pending rebase operations. This service is turned on by default.

When needed, the relocator could be turned off with:

# storpool relocator off
OK

To turn it back on:

# storpool relocator on
OK

To display the relocator status:

# storpool relocator status
relocator on, no volumes to relocate

The following additional relocator commands are available:

storpool relocator disks

Returns the state of the disks after the relocator finishes all presently running tasks, as well as the quantity of objects and data each drive still needs to recover. The output is the same as with storpool balancer disks after the balancing task has been committed, see 12.18.  Balancer for details.

storpool relocator volume <volumename> disks

Shows the same information as storpool relocator disks, but only for the pending operations of the specified volume.

storpool relocator snapshot <snapshotname> disks

Shows the same information as storpool relocator disks, but only for the pending operations of the specified snapshot.

12.18. Balancer

The balancer is used to redistribute data when a disk or a set of disks (for example, a new node) is added to or removed from the cluster. By default it is off, and it has to be turned on after changes in the cluster configuration for the data to be redistributed.

To display the status of the balancer:

# storpool balancer status
balancer waiting, auto off

To load a re-balancing task, please refer to the 18.  Rebalancing the cluster section of this guide.

To discard the re-balancing operation:

# storpool balancer stop
OK

To commit the proposed changes and actually start the relocations:

# storpool balancer commit
OK

After the commit, the changes are visible only with storpool relocator disks, and many volumes and snapshots will have the M flag in the output of storpool volume status until all relocations are completed. The progress can be followed with storpool task list (see 12.19.  Tasks).
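
For illustration, once a re-balancing task has been loaded (see 18.  Rebalancing the cluster), a typical sequence using only the commands described in this and the neighbouring sections might look like this (output omitted):

# storpool balancer disks    # review the proposed changes
# storpool balancer commit   # start the relocations
# storpool relocator disks   # expected state after the running tasks complete
# storpool task list         # follow the progress of the relocations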

12.19. Tasks

Tasks are all outstanding operations for recovering or relocating data, either within the present cluster or between two connected clusters.

For example, if a disk with ID 1401 was not in the cluster for a period of time and is then returned, all of its outdated objects will be recovered from the other drives holding the latest changes.

These recovery operations can be listed with:

# storpool task list
----------------------------------------------------------------------------------------
|     disk |  task id |  total obj |  completed |    started |  remaining | % complete |
----------------------------------------------------------------------------------------
|     2301 | RECOVERY |         73 |          5 |          1 |         68 |         6% |
|     2315 | balancer |        180 |          0 |          1 |        180 |         0% |
----------------------------------------------------------------------------------------
|    total |          |         73 |          5 |          1 |         68 |         6% |
----------------------------------------------------------------------------------------

Tasks are also listed when a re-balancing operation has been committed and relocations are in progress, as well as when a remote snapshot is being cloned into the local cluster.

12.20. Maintenance mode

The maintenance submenu is used to put one or more nodes of a cluster into maintenance state. Several checks are performed before entering maintenance state, in order to prevent a node with one or more live server instances from entering maintenance while, for example, the cluster is not yet fully recovered or is running with decreased redundancy for other reasons.

A node can be put into maintenance state with:

# storpool maintenance set node 23 duration 10m description kernel_update
OK

The above will put node ID 23 into maintenance state for 10 minutes and set the description to “kernel_update”.

To list the nodes currently in maintenance:

# storpool maintenance list
------------------------------------------------------------
| nodeId | started             | remaining | description   |
------------------------------------------------------------
|     23 | 2020-09-30 12:55:20 | 00:09:50  | kernel_update |
------------------------------------------------------------

To complete a maintenance for a node:

# storpool maintenance complete node 23
OK

Note

While a node or the cluster is in maintenance mode, non-cluster-threatening issues will not be sent by the monitoring system to external entities. All alerts will still be received by StorPool support and will be classified internally as “under maintenance”.

Attention

Cluster-threatening issues will still trigger super-critical alerts to both StorPool support and any other configured endpoint. More on super-critical alerts here.

Consider that a full cluster maintenance mode is also available. For more information on how to do this with storpool mgmtConfig maintenanceState, see 12.22.7.  Cluster maintenance mode.

12.21. Management menu

The mgmt menu is presently used only for reporting the current clients of the StorPool API. Example output:

# storpool mgmt clients

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| id     | name              | ip          | port  | queue                    | mode | command              | state   | clientState         | started             | runningFor | blockedBy | progress |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 283874 | 10.2.55.207:36770 | 10.2.55.207 | 36770 | clients                  | HTTP | GET AllClientsStatus | running | PAGER               | 2023-03-30 13:16:33 | 00:00:00   |           | ---      |
| 283871 | 10.2.55.207:36764 | 10.2.55.207 | 36764 | volumeSpaceClients.list  | HTTP | GET SnapshotsSpace   | running | LIST_SNAPSHOT_SPACE | 2023-03-30 13:16:33 | 00:00:00   |           | 20%      |
| 283872 | 10.2.55.207:36766 | 10.2.55.207 | 36766 | 283871.blockedClients    | HTTP | GET VolumesSpace     | blocked | STATE_ERROR         | 2023-03-30 13:16:33 | 00:00:00   |    283871 | 0%       |
| 283873 | 10.2.55.207:36768 | 10.2.55.207 | 36768 | volumeStatusClients.list | HTTP | GET VolumesGetStatus | running | LIST_VOLUME_STATUS  | 2023-03-30 13:16:33 | 00:00:00   |           | ---      |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The output includes the presently running tasks. Each task is identified by an id. This id is then shown as a blocker (in the blockedBy column) when another task depends on it to finish.

This is useful when debugging API-related issues, or as a tool for general API usage observability.

12.22. Management configuration

Tip

Please consult with StorPool support before changing the management configuration defaults.

The mgmtConfig submenu is used to set some internal configuration parameters.

12.22.1. Listing current configuration

To list the presently configured parameters:

# storpool mgmtConfig list
relocator on, interval 5.000 s
relocator transaction: min objects 320, max objects 4294967295
relocator recovery: max tasks per disk 2, max objects per disk 2400
relocator recovery objects trigger 32
relocator min free 150 GB
relocator max objects per HDD tail 0
balancer auto off, interval 5.000 s
snapshot delete interval 1.000 s
disks soft-eject interval 5.000 s
snapshot delayed delete off
snapshot dematerialize interval 1.000 s
mc owner check interval 2.000 s
mc autoreconcile interval 2.000 s
reuse server implicit on disk down disabled
max local recovery requests 1
max remote recovery requests 2
maintenance state production
max disk latency nvme 1000.000 ms
max disk latency ssd 1000.000 ms
max disk latency hdd 1000.000 ms
max disk latency journal 50.000 ms
backup template name backup_template

12.22.2. Miscellaneous parameters

To disable the deferred snapshot deletion (default on):

# storpool mgmtConfig delayedSnapshotDelete off
OK

When enabled, all snapshots with a configured deletion time will be cleared at the configured date and time.

To change the default interval (5 sec.) between the periodic checks whether disks marked for ejection can actually be ejected:

# storpool mgmtConfig disksSoftEjectInterval 20000 # value in ms - 20 sec.
OK

To change the default number of local and remote recovery requests for all disks:

# storpool mgmtConfig maxLocalRecoveryRequests 1
OK
# storpool mgmtConfig maxRemoteRecoveryRequests 2
OK

To change the default interval (5 sec.) for the relocator to check if there is new work to be done:

# storpool mgmtConfig relocatorInterval 20000 # value is in ms - 20 sec.
OK

To set a number of objects per disk in recovery at a time that is different from the default (3200):

# storpool mgmtConfig relocatorMaxRecoveryObjectsPerDisk 2000 # value in number of objects per disk
OK

To change the default maximum number of recovery tasks per disk (2 tasks):

# storpool mgmtConfig relocatorMaxRecoveryTasksPerDisk 4 # value is number of tasks per disk - will set 4 tasks
OK

To change the minimum (default 320) or the maximum (default 4294967295) number of objects per transaction for the relocator:

# storpool mgmtConfig relocatorMaxTrObjects 2147483647
OK
# storpool mgmtConfig relocatorMinTrObjects 640
OK

To change the maximum number of objects per transaction for HDD tail drives (0 is unset, 1 or more sets the number of objects):

# storpool mgmtConfig relocatorMaxTrObjectsPerHddTail 2

To change the maximum number of objects in recovery for a disk to be usable by the relocator (default 32):

# storpool mgmtConfig relocatorRecoveryObjectsTrigger 64

To change the default interval between checks for new snapshots to delete:

# storpool mgmtConfig snapshotDeleteInterval 2000 # value is in ms - 2 sec.

12.22.3. Snapshot dematerialization

To enable snapshot dematerialization or change the interval:

# storpool mgmtConfig snapshotDematerializeInterval 30000 # sets the interval to 30 seconds, 0 disables it

Snapshot dematerialization checks for and removes all objects that do not refer to any data, i.e. objects with no changes since the previous snapshot (or ever). This helps reduce the number of used objects per disk in clusters with a large number of snapshots and a small number of changed blocks between the snapshots in the chain.

To update the free space threshold (in GB) below which the relocator will not add new tasks:

# storpool mgmtConfig relocatorGBFreeBeforeAdd 75 # value is in GB

12.22.4. Multi-cluster parameters

To set or change the default MultiCluster owner check interval:

# storpool mgmtConfig mcOwnerCheckInterval 2000 # sets the interval to 2 seconds, 0 disables it

To set or change the default MultiCluster auto-reconcile interval:

# storpool mgmtConfig mcAutoReconcileInterval 2000 # sets the interval to 2 seconds, 0 disables it

12.22.5. Reusing server on disk failure

If a disk is down and a new volume cannot be allocated, enabling this option will retry the allocation as if reuseServer was specified. This is helpful for meeting the minimum installation requirements with 3 nodes when one of the nodes or a disk is down. To enable the option:

# storpool mgmtConfig reuseServerImplicitOnDiskDown enable

12.22.6. Changing default template

To change the default template used when receiving a snapshot from a remote cluster through the storpool_bridge service (this replaces the now deprecated SP_BRIDGE_TEMPLATE option):

# storpool mgmtConfig backupTemplateName all-flash # the all-flash template should exist
OK

12.22.7. Cluster maintenance mode

A full cluster maintenance mode is available for occasions involving maintenance activities that affect the whole cluster. An example would be a scheduled restart of a network switch, which will be reported as missing network for all nodes in the cluster.

This mode does not perform any checks, and is mainly for informational purposes in order to sync context between customers and StorPool’s support teams. Full cluster maintenance mode could be used in addition to the per-node maintenance state explained above when necessary.

To change the full cluster maintenance state to maintenance:

# storpool mgmtConfig maintenanceState maintenance
OK

To switch back into production state:

# storpool mgmtConfig maintenanceState production
OK

In case you only need to do this for a single node you can use storpool maintenance, as described in 12.20.  Maintenance mode.

12.22.8. Latency thresholds

Note

For individual per disk latency thresholds check 12.8.4.  Disk list performance information section.

To define a global latency threshold before ejecting an HDD drive:

# storpool mgmtConfig maxDiskLatencies hdd 1000 # value is in milliseconds

To define a global latency threshold before ejecting an SSD drive:

# storpool mgmtConfig maxDiskLatencies ssd 1000 # value is in milliseconds

To define a global latency threshold before ejecting an NVMe drive:

# storpool mgmtConfig maxDiskLatencies nvme 1000 # value is in milliseconds

To define a global latency limit before ejecting a journal device:

# storpool mgmtConfig maxDiskLatencies journal 50 # value is in milliseconds
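
After changing any of these thresholds, the currently effective values can be verified with the listing command shown earlier. A simple filter with grep (a standard tool assumed to be available on the node) narrows the output down to the latency lines:

# storpool mgmtConfig list | grep 'max disk latency'
max disk latency nvme 1000.000 ms
max disk latency ssd 1000.000 ms
max disk latency hdd 1000.000 ms
max disk latency journal 50.000 ms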

12.23. Mode

Support for a couple of different output modes is available, both in the interactive shell and when the CLI is invoked directly. Some custom format options are available only for some operations.

Available modes:

  • csv - Semicolon-separated values for some commands

  • json - Processed JSON output for some commands

  • pass - Pass the JSON response through

  • raw - Raw output (display the HTTP request and response)

  • text - Human readable output (default)

Example with switching to csv mode in the interactive shell:

StorPool> mode csv
OK
StorPool> net list
nodeId;flags;net 1;net 2
23;uU + AJ;22:60:00:00:F0:17;26:60:00:00:F0:17
24;uU + AJ;2A:60:00:00:00:18;2E:60:00:00:00:18
25;uU + AJ;F6:52:14:76:9C:C0;F6:52:14:76:9C:C1
26;uU + AJ;2A:60:00:00:00:1A;2E:60:00:00:00:1A
29;uU + AJ;52:6B:4B:44:02:FE;52:6B:4B:44:02:FF

The same applies when using the CLI directly:

# storpool -f csv net list # the output is the same as above
[snip]
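
Since the csv mode produces semicolon-separated values, the output lends itself to post-processing with standard tools. A minimal sketch that prints only the node IDs and the MAC address of the first network from the listing above (awk is assumed to be available on the node):

# storpool -f csv net list | awk -F';' 'NR > 1 { print $1, $3 }'
23 22:60:00:00:F0:17
24 2A:60:00:00:00:18
[snip]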

13. Multi-server

The multi-server feature enables the use of up to seven separate storpool_server instances on a single node. This makes sense for dedicated storage nodes, or in the case of a heavily-loaded converged setup with more resources isolated for the storage system.

For example, a dedicated storage node with 36 drives would provide better peak performance with 4 server instances, each controlling a quarter of the disks/SSDs, than with a single instance. Another good example would be a converged node with 16 SSDs/HDDs, which would provide better peak performance with two server instances, each controlling half of the drives and running on separate CPU cores (or even on two threads of a single CPU core), compared to a single server instance.

13.1. Configuration

The configuration of the CPUs on which the different instances are running is done via cgroups, through the storpool_cg tool; for details, see 6.6.  Cgroup options.

Configuring which drive is handled by which instance is done with the storpool_initdisk tool. For example, if you have two drives whose IDs are 1101 and 1102, both controlled by the first server instance, the output from storpool_initdisk would look like this:

# storpool_initdisk --list
/dev/sde1, diskId 1101, version 10007, server instance 0, cluster init.b, SSD
/dev/sdf1, diskId 1102, version 10007, server instance 0, cluster init.b, SSD

Setting the second SSD drive (1102) to be controlled by the second server instance is done like this (X is the drive letter and N is the partition number, for example /dev/sdf1):

# storpool_initdisk -r -i 1 /dev/sdXN

Hint

The above command will fail if the storpool_server service is running; eject the disk prior to re-assigning it to another instance.
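
For example, to move drive 1102 from the listing above (on /dev/sdf1) to the second server instance after ejecting it, and then verify the change, the commands would be the following; the second listing shows what the output should then look like:

# storpool_initdisk -r -i 1 /dev/sdf1
# storpool_initdisk --list
/dev/sde1, diskId 1101, version 10007, server instance 0, cluster init.b, SSD
/dev/sdf1, diskId 1102, version 10007, server instance 1, cluster init.b, SSD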

On some occasions, if the first server instance was configured with a large amount of cache (see SP_CACHE_SIZE in 6.  Node configuration options), when migrating from one to two instances it is recommended to first split the cache between the different instances (for example, from 8192 to 4096). These parameters are automatically taken care of by the storpool_cg tool; for more details, see 6.6.  Cgroup options.

13.2. Helper

StorPool provides a tool for easy reconfiguration between different numbers of server instances. It can be used to print the required commands. For example, for a node with some SSDs and some HDDs, automatically assigned to three SSD-only server instances and one HDD-only server instance:

[root@s25 ~]# /usr/lib/storpool/multi-server-helper.py -i 4 -s 3
/usr/sbin/storpool_initdisk -r -i 0 2532 0000:01:00.0-p1  # SSD
/usr/sbin/storpool_initdisk -r -i 0 2534 0000:02:00.0-p1  # SSD
/usr/sbin/storpool_initdisk -r -i 0 2533 0000:06:00.0-p1  # SSD
/usr/sbin/storpool_initdisk -r -i 0 2531 0000:07:00.0-p1  # SSD
/usr/sbin/storpool_initdisk -r -i 1 2505 /dev/sde1  # SSD
/usr/sbin/storpool_initdisk -r -i 1 2506 /dev/sdf1  # SSD
/usr/sbin/storpool_initdisk -r -i 1 2507 /dev/sdg1  # SSD
/usr/sbin/storpool_initdisk -r -i 1 2508 /dev/sdh1  # SSD
/usr/sbin/storpool_initdisk -r -i 2 2501 /dev/sda1  # SSD
/usr/sbin/storpool_initdisk -r -i 2 2502 /dev/sdb1  # SSD
/usr/sbin/storpool_initdisk -r -i 2 2503 /dev/sdc1  # SSD
/usr/sbin/storpool_initdisk -r -i 2 2504 /dev/sdd1  # SSD
/usr/sbin/storpool_initdisk -r -i 3 2511 /dev/sdi1  # WBC
/usr/sbin/storpool_initdisk -r -i 3 2512 /dev/sdj1  # WBC
/usr/sbin/storpool_initdisk -r -i 3 2513 /dev/sdk1  # WBC
/usr/sbin/storpool_initdisk -r -i 3 2514 /dev/sdl1  # WBC
/usr/sbin/storpool_initdisk -r -i 3 2515 /dev/sdn1  # WBC
/usr/sbin/storpool_initdisk -r -i 3 2516 /dev/sdo1  # WBC
/usr/sbin/storpool_initdisk -r -i 3 2517 /dev/sdp1  # WBC
/usr/sbin/storpool_initdisk -r -i 3 2518 /dev/sdq1  # WBC

[root@s25 ~]# /usr/lib/storpool/multi-server-helper.py -h
usage: multi-server-helper.py [-h] [-i INSTANCES] [-s [SSD_ONLY]]

Prints relevant commands for dispersing the drives to multiple server
instances

optional arguments:
  -h, --help            show this help message and exit
  -i INSTANCES, --instances INSTANCES
                        Number of instances
  -s [SSD_ONLY], --ssd-only [SSD_ONLY]
                        Splits by type, 's' SSD-only instances plusi-s HDD
                        instances (default s: 1)

Note that the commands can be executed only when the relevant storpool_server* service instances are stopped, and a cgroup re-configuration will likely be required after the setup changes (see 6.6.  Cgroup options for more information on how to update cgroups).
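
Since the helper only prints the commands, one possible way to apply them, after stopping the relevant storpool_server* instances and reviewing the printed output, is to pipe it to a shell. This is only a sketch of the workflow:

# /usr/lib/storpool/multi-server-helper.py -i 4 -s 3 | sh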

14. Redundancy

StorPool provides two mechanisms for protecting data from unplanned events: replication and erasure coding.

14.1. Replication

With replication, redundancy is provided by having multiple copies (replicas) of the data written synchronously across the cluster. You can set the number of replication copies as needed. The replication level directly correlates with the number of servers that may be down without interruption in the service. For example, with triple replication the number of the servers that may be down simultaneously, without losing access to the data, is 2.

Each volume or snapshot can be replicated on a different set of drives. Each set of drives is configured through placement groups. A volume can either keep all of its copies on a single set of drives spread across different nodes, or keep each copy on a different set of drives. There are many parameters through which you can manage replication; for details, see 12.10.  Volumes and 12.9.  Placement groups.

Tip

When using the replication mechanism, StorPool recommends having 3 copies as a standard for critical data.

14.1.1. Triple replication

The minimum requirement for triple replication is three nodes (five are recommended).

With triple replication each block of data is stored on three different storage nodes. This protects the data against two simultaneous failures - for example, one node is down for maintenance, and a drive on another node fails.

14.1.2. Dual replication

Dual replication can be used for non-critical data, or for data that can be recreated from other sources. Dual-replicated data can tolerate a single failure without service interruption.

This type of replication is suitable for test and staging environments, and can even be deployed on a single-node cluster (not recommended for production deployments). It can also be deployed on larger HDD-based backup clusters.

14.2. Erasure Coding

As of release 21.0 revision 21.0.75.1e0880427 StorPool supports erasure coding on NVMe drives.

14.2.1. Features

The erasure coding mechanism reduces the amount of data stored on the same hardware set, while at the same time preserves the level of data protection. It provides the following advantages:

  • Cross-node data protection

    Erasure-coded data is always protected across servers with two parity objects, so that any two servers can fail, and user data is safe.

  • Delayed batch-encoding

    Incoming data is initially written with triple replication. The erasure coding mechanism is automatically applied later. This way the data processing overhead is significantly reduced, and the impact on latency for user I/O operations is minimized.

  • Designed for always-on operations

    Up to two storage nodes can be rebooted or brought down for maintenance while the storage system keeps running, and all data is available and in use.

  • A pure software feature

    The implementation requires no additional hardware components.

14.2.2. Redundancy schemes

StorPool supports three redundancy schemes for erasure coding - 2+2, 4+2, or 8+2 schemes. You can choose which one to use based on the size of your cluster. The naming of the schemes follows the k+m pattern:

  • k is the number of data blocks stored.

  • m is the number of parity blocks stored.

  • A redundancy scheme can recover data when up to m blocks are lost.

For example, 4+2 stores 4 data blocks and protects them with two parity blocks. It can operate and recover when any 2 drives or nodes are lost.

When planning, consider the minimum required number of nodes (or fault sets) for each scheme:

----------------------------------------------
| Scheme | Nodes | Raw space used | Overhead |
----------------------------------------------
| 2+2    | 5+    | 2.4x           | 140%     |
| 4+2    | 7+    | 1.8x           | 80%      |
| 8+2    | 11+   | 1.5x           | 50%      |
----------------------------------------------

For example, storing 1TB user data using the 8+2 scheme requires 1.5TB raw storage capacity.

The nodes have to be relatively similar in size. A mixture that includes a few very large nodes may make it impossible to use their capacity efficiently.

Note

Erasure coding requires making snapshots on a regular basis. Make sure your cluster is configured to create snapshots regularly, for example using the VolumeCare service. A single periodic snapshot per volume is required; more snapshots are optional.

15. Volumes and snapshots

Volume

Volumes are the basic service of the StorPool storage system. A volume always has a name, a global ID, and a certain size. It can be read from and written to, and can be attached to hosts as a read-only or read-write block device under the /dev/storpool directory (also available under /dev/storpool-byid).

The volume name is a string consisting of one or more of the allowed characters: upper and lower case Latin letters (a-z, A-Z), digits (0-9), and the delimiters dot (.), colon (:), dash (-) and underscore (_).
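
For example, db01.data:test-1_a is a valid volume name. Using the same create syntax used elsewhere in this guide, such a volume could be created with the command below (the hybrid template is assumed to exist):

# storpool volume db01.data:test-1_a template hybrid create
OK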



15.1. Creating a volume

Creating a volume

15.2. Deleting a volume

Deleting a volume

15.3. Renaming a volume

Renaming a volume

15.4. Resizing a volume

Resizing a volume

15.5. Snapshots

Snapshot

Snapshots are read-only point-in-time images of volumes. They are created once and cannot be changed. They can be attached to hosts as read-only block devices under /dev/storpool.


All volumes and snapshots share the same name-space. Names of volumes and snapshots are unique within a StorPool cluster. This diagram illustrates the relationship between a snapshot and a volume. Volume vol1 is based on snapshot snap1. vol1 contains only the changes since snap1 was taken. In the common case this is a small amount of data. Arrows indicate a child-parent relationship. Each volume or snapshot may have exactly one parent which it is based upon. Writes to vol1 are recorded within the volume. Reads from vol1 may be served by vol1 or by its parent snapshot - snap1, depending on whether vol1 contains changed data for the read request or not.

Namespace for volumes and snapshots
Volume snapshot relation

Snapshots and volumes are completely independent. Each snapshot may have many children (volumes and snapshots). Volumes cannot have children.






Volume snapshot chain

snap1 contains a full image. snap2 contains only the changes since snap1 was taken. vol1 and vol2 contain only the changes since snap2 was taken.









15.6. Creating a snapshot of a volume

There is a volume named vol1.

Creating a snapshot

After the first snapshot the state of vol1 is recorded in a new snapshot named snap1. vol1 does not occupy any space now, but will record any new writes which come in after the creation of the snapshot. Reads from vol1 may fall through to snap1.







Then the state of vol1 is recorded in a new snapshot named snap2. snap2 contains the changes between the moment snap1 was taken and the moment snap2 was taken. snap2’s parent is the original parent of vol1.



15.7. Creating a volume based on an existing snapshot (a.k.a. clone)

Before the creation of vol1 there is a snapshot named snap1.

Snapshot clones

A new volume, named vol1 is created. vol1 is based on snap1. The newly created volume does not occupy any space initially. Reads from the vol1 may fall through to snap1 or to snap1’s parents (if any).








15.8. Deleting a snapshot

vol1 and vol2 are based on snap1. snap1 is based on snap0. snap1 contains the changes between the moment snap0 was taken and when snap1 was taken. vol1 and vol2 contain the changes since the moment snap1 was taken.


Deleting a snapshot

After the deletion, vol1 and vol2 are based on snap1’s original parent (if any). In the example they are now based on snap0. When deleting a snapshot, the changes contained therein are not propagated to its children; instead, StorPool keeps snap1 in deleting state to prevent an explosion of disk space usage.





15.9. Rebase to null (a.k.a. promote)

vol1 is based on snap1. snap1 is in turn based on snap0. snap1 contains the changes between the moment snap0 was taken and when snap1 was taken. vol1 contains the changes since the moment snap1 was taken.

Rebase to null

After promotion vol1 is not based on a snapshot. vol1 now contains all data, not just the changes since snap1 was taken. Any relation between snap1 and snap0 is unaffected.








15.10. Rebase

vol1 is based on snap1. snap1 is in turn based on snap0. snap1 contains the changes between the moment snap0 was taken and when snap1 was taken. vol1 contains the changes since the moment snap1 was taken.


Rebase

After the rebase operation vol1 is based on snap0. vol1 now contains all changes since snap0 was taken, not just since snap1. snap1 is unchanged.









15.11. Example use of snapshots

Example use of snapshots

This is a semi-realistic example of how volumes and snapshots may be used. There is a snapshot called base.centos7. This snapshot contains a base CentOS 7 VM image, which was prepared carefully by the service provider. There are 3 customers with 4 virtual machines each. All virtual machine images are based on CentOS 7, but may contain custom data, which is unique to each VM.












Example use of snapshots

This example shows another typical use of snapshots - as restore points back in time for a volume. There is one base image for CentOS 7, three snapshot restore points, and one live volume cust123.v.1.














16. Exporting volumes using iSCSI

If StorPool volumes need to be accessed by hosts that cannot run the StorPool client service (e.g. VMware hypervisors), they may be exported using the iSCSI protocol.

As of version 19, StorPool implements an internal user-space TCP/IP stack, which, in conjunction with the NIC hardware acceleration (user-mode drivers), allows for higher performance and independence from the kernel’s TCP/IP stack and its inefficiencies.

This document provides information on how to set up iSCSI targets in a StorPool cluster. For more information on using the storpool tool for this purpose, see the 12.15.  iSCSI section of the reference guide. For details on configuring initiators (clients), see Short overview of iSCSI.

16.1. A Quick Overview of iSCSI

The iSCSI remote block device access protocol, as implemented by the StorPool iSCSI service, is a client-server protocol allowing clients (referred to as “initiators”) to read and write data to disks (referred to as “targets”) exported by iSCSI servers.

iSCSI is implemented in StorPool with Portal Groups and Portals.

A portal is one instance of the storpool_iscsi service, which listens on a TCP port (usually 3260), on specified IP addresses. Every portal has its own set of “targets” (exported volumes) that it provides service for.

A portal group is the “entry point” to the iSCSI service - a “floating” IP address that’s on the first storpool_iscsi service in the cluster and is always kept active (by automatically moving to the next instance if the one serving it is stopped/dies). All initiators connect to that IP and get redirected to the relevant instance to communicate with their target.

16.2. An iSCSI Setup in a StorPool Cluster

The StorPool implementation of iSCSI provides a way to mark StorPool volumes as accessible to iSCSI initiators, define iSCSI portals where hosts running the StorPool iSCSI service listen for connections from initiators, define portal groups over these portals, and export StorPool volumes (iSCSI targets) to iSCSI initiators in the portal groups. To simplify the configuration of the iSCSI initiators, and also to provide load balancing and failover, each portal group has a floating IP address that is automatically brought up on only a single StorPool service at a given moment; the initiators are configured to connect to this floating address, authenticating if necessary, and are then redirected to the portal of the StorPool service that actually exports the target (volume) that they need to access.

Note

As of version 19, you don’t need to add the IP addresses on the nodes; they are handled directly by the StorPool TCP implementation and are not visible in ifconfig or ip. If you’re going to use multiple VLANs, they are configured in the CLI and do not require setting up VLAN interfaces on the host itself, except for debugging/testing, or if a local initiator is required to access volumes through iSCSI.

In the simplest setup, there is a single portal group with a floating IP address, there is a single portal for each StorPool host that runs the iSCSI service, all the initiators connect to the floating IP address and are redirected to the correct host. For quality of service or fine-grained access control, more portal groups may be defined and some volumes may be exported via more than one portal group.

Before configuring iSCSI, the interfaces that would be used for it need to be described in storpool.conf. Here is the general config format:

SP_ISCSI_IFACE=IFACE1,RESOLVE:IFACE2,RESOLVE:[flags]

This row means that the first iSCSI network is on IFACE1 and the second one is on IFACE2. The order is important for the configuration later. RESOLVE is the resolve interface, if different from the interfaces themselves, i.e. if it’s a bond or a bridge.

[flags] is optional and, more importantly, must be omitted if not needed. Currently the only supported value is [lacp] (brackets included), used when the interfaces are in a LACP trunk.

Examples:

Multipath, two separate interfaces used directly:

SP_ISCSI_IFACE=eth0:eth1

Active-backup bond named bond0:

SP_ISCSI_IFACE=eth0,bond0:eth1,bond0

LACP bond named bond0:

SP_ISCSI_IFACE=eth0,bond0:eth1,bond0:[lacp]

Bridge interface cloudbr0 on top of LACP bond:

SP_ISCSI_IFACE=eth0,cloudbr0:eth1,cloudbr0:[lacp]

A trivial iSCSI setup can be brought up by the following series of StorPool CLI commands below. See the CLI tutorial for more information about the commands themselves. The setup does the following:

  • has baseName/IQN of iqn.2019-08.com.example:poc-cluster;

  • has a floating IP address of 192.168.42.247, which is in VLAN 42;

  • two nodes from the cluster will be able to export in this group:

    • node id 1, with IP address 192.168.42.246

    • node id 3, with IP address 192.168.42.202

  • one client is defined, with IQN iqn.2019-08.com.example:poc-cluster:hv1

  • one volume, called tinyvolume, will be exported in the portal group to the defined client.

Note

You need to obtain the exact IQN of the initiator, available at:

  • Windows Server: iSCSI initiator, it is automatically generated upon installation

  • VMware vSphere: it is automatically assigned upon creating a software iSCSI adapter

  • Linux-based (XenServer, etc.): /etc/iscsi/initiatorname.iscsi
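
On a Linux-based initiator, for example, the IQN can be read directly from that file; the value shown below is the example initiator IQN used in the commands that follow:

# cat /etc/iscsi/initiatorname.iscsi
InitiatorName=iqn.2019-08.com.example:poc-cluster:hv1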

# storpool iscsi config setBaseName iqn.2019-08.com.example:poc-cluster
OK

# storpool iscsi config portalGroup poc create
OK

# storpool iscsi config portalGroup poc addNet 192.168.42.247/24 vlan 42
OK

# storpool iscsi config portal create portalGroup poc address 192.168.42.246 controller 1
OK

# storpool iscsi config portal create portalGroup poc address 192.168.42.202 controller 3
OK

# storpool iscsi portalGroup list
---------------------------------------
| name | networksCount | portalsCount |
---------------------------------------
| poc  |             1 |            2 |
---------------------------------------

# storpool iscsi portalGroup list portals
--------------------------------------------
| group | address             | controller |
--------------------------------------------
| poc   | 192.168.42.246:3260 |          1 |
| poc   | 192.168.42.202:3260 |          3 |
--------------------------------------------

# storpool iscsi config initiator iqn.2019-08.com.example:poc-cluster:hv1 create
OK

# storpool volume tinyvolume template tinytemplate create # assumes tinytemplate exists
OK

# storpool iscsi config target create tinyvolume
OK

# storpool iscsi config export volume tinyvolume portalGroup poc initiator iqn.2019-08.com.example:poc-cluster:hv1
OK

# storpool iscsi initiator list
----------------------------------------------------------------------------------------------
| name                                    | username | secret | networksCount | exportsCount |
----------------------------------------------------------------------------------------------
| iqn.2019-08.com.example:poc-cluster:hv1 |          |        |             0 |            1 |
----------------------------------------------------------------------------------------------

# storpool iscsi initiator list exports
---------------------------------------------------------------------------------------------------------------------------------------------
| name                                           | volume     | currentControllerId | portalGroup | initiator                               |
---------------------------------------------------------------------------------------------------------------------------------------------
| iqn.2019-08.com.example:poc-cluster:tinyvolume | tinyvolume |                   1 | poc         | iqn.2019-08.com.example:poc-cluster:hv1 |
---------------------------------------------------------------------------------------------------------------------------------------------
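
From a Linux initiator, the exported target can then be discovered and logged into with the standard open-iscsi tools. This is only a sketch, using the floating IP address and IQN from the example above:

# iscsiadm -m discovery -t sendtargets -p 192.168.42.247
# iscsiadm -m node -T iqn.2019-08.com.example:poc-cluster:tinyvolume -p 192.168.42.247 --login

After a successful login the volume appears as a regular block device on the initiator; see 12.15.6.  iscsi_targets for a way to list such devices.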

Below is a setup with two separate networks that allows for multipath. It uses the 192.168.41.0/24 network on the first interface, 192.168.42.0/24 on the second interface, and the .247 IP for the floating IP in both networks:

# storpool iscsi config setBaseName iqn.2019-08.com.example:poc-cluster
OK

# storpool iscsi config portalGroup poc create
OK

# storpool iscsi config portalGroup poc addNet 192.168.41.247/24
OK

# storpool iscsi config portalGroup poc addNet 192.168.42.247/24
OK

# storpool iscsi config portal create portalGroup poc address 192.168.41.246 controller 1
OK

# storpool iscsi config portal create portalGroup poc address 192.168.42.246 controller 1
OK

# storpool iscsi config portal create portalGroup poc address 192.168.41.202 controller 3
OK

# storpool iscsi config portal create portalGroup poc address 192.168.42.202 controller 3
OK

# storpool iscsi portalGroup list
---------------------------------------
| name | networksCount | portalsCount |
---------------------------------------
| poc  |             2 |            4 |
---------------------------------------

# storpool iscsi portalGroup list portals
--------------------------------------------
| group | address             | controller |
--------------------------------------------
| poc   | 192.168.41.246:3260 |          1 |
| poc   | 192.168.41.202:3260 |          3 |
| poc   | 192.168.42.246:3260 |          1 |
| poc   | 192.168.42.202:3260 |          3 |
--------------------------------------------

Note

Please note that the order of adding the networks corresponds to the order in SP_ISCSI_IFACE: the first network will be bound to the first interface appearing in this configuration. More on how to list configured iSCSI interfaces is available here; for how to list the addresses exposed by a particular node, see 12.15.5.  iscsi_tool.

There is no difference in exporting volumes in multi-path setups.

16.3. Routed iSCSI setup

16.3.1. Overview

Layer-3/routed networks present some challenges to the operation of StorPool iSCSI, unlike flat layer-2 networks:

  • routes need to be resolved for destinations based on the kernel routing table, instead of ARP;

  • floating IP addresses for the portal groups need to be accessible to the whole network;

The first task is accomplished by monitoring the kernel’s routing table, the second with an integrated BGP speaker in storpool_iscsi.

Note

StorPool’s iSCSI does not support Linux’s policy-based routing, and is not affected by iptables, nftables, or any kernel filtering/networking component.

An iSCSI deployment in a layer-3 network has the following general elements:

  • nodes with storpool_iscsi in one or multiple subnets;

  • allocated IP(s) for portal group floating IP addresses;

  • local routing daemon (bird, frr)

  • access to the network’s routing protocol.

The storpool_iscsi daemon connects to a local routing daemon via BGP and announces the floating IPs from the node on which they are currently active. The local routing daemon talks to the network via its own protocol (BGP, OSPF, or something else) and passes on the updates.

Note

In a fully routed network, the local routing daemon is also responsible for announcing the IP address for cluster management (managed by storpool_mgmt).

16.3.2. Configuration

The following needs to be added to storpool.conf:

SP_ISCSI_ROUTED=1

In routed networks, when adding the portalGroup floating IP address, you need to specify it as /32.
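
For example, with the portal group from the setup above, the floating address would be added like this (a sketch; only the prefix length differs from the flat layer-2 example):

# storpool iscsi config portalGroup poc addNet 192.168.42.247/32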

Note

These are example configurations and may not be the exact fit for a particular setup. Handle with care.

Note

In the examples below, the ASN of the network is 65500, StorPool has been assigned 65512, and will need to announce 192.168.42.247.

To enable the BGP speaker in storpool_iscsi, the following snippet for storpool.conf is needed (the parameters are described in the comment above it):

# ISCSI_BGP_IP:BGP_DAEMON_IP:AS_FOR_ISCSI:AS_FOR_THE_DAEMON
SP_ISCSI_BGP_CONFIG=127.0.0.2:127.0.0.1:65512:65512

And here’s a snippet from bird.conf for a BGP speaker that talks to StorPool’s iSCSI:

# variables
myas = 65512;
remoteas = 65500;
neigh = 192.168.42.1;

# filter to only export our floating IP

filter spip {
    if (net = 192.168.42.247/32) then accept;
    reject;
}


# external gateway
protocol bgp sw100g1 {
        local as myas;
        neighbor neigh as remoteas;
        import all;
        export filter spip;
        direct;
        gateway direct;
        allow local as;
}

# StorPool iSCSI
protocol bgp spiscsi {
        local as myas;
        neighbor 127.0.0.1 port 2179 as myas;
        import all;
        export all;
        multihop;
        next hop keep;
        allow local as;
}

Note

For protocols other than BGP, please note that StorPool iSCSI exports the route to the floating IP with a next hop of the IP address configured for the portal of the node, and this information needs to be preserved when announcing the route.

16.4. Caveats with a Complex iSCSI Architecture

In iSCSI portal definitions, a TCP address/port pair must be unique; only a single portal within the whole cluster may be defined at a single IP address and port. Thus, if the same StorPool iSCSI service should be able to export volumes in more than one portal group, the portals should be placed either on different ports or on different IP addresses (although it is fine that these addresses will be brought up on the same network interface on the host).

Note

Even though StorPool supports reusing IPs, separate TCP ports, and so on, the general recommendation is to have a separate VLAN and IP range for each portal group. There are lots of unknowns with different ports, security issues with multiple customers in the same VLAN, etc.

The redirecting portal on the floating address of a portal group always listens on port 3260. Similarly to the above, different portal groups must have different floating IP addresses, although they are automatically brought up on the same network interfaces as the actual portals within the groups.

Some iSCSI initiator implementations (e.g. VMware vSphere) may only connect to TCP port 3260 for an iSCSI service. In a more complex setup where a StorPool service on a single host may export volumes in more than one portal group, this might mean that the different portals must reside on different IP addresses, since the port number is the same.

For technical reasons, currently a StorPool volume may only be exported by a single StorPool service (host), even though it may be exported in different portal groups. For this reason, some care should be taken in defining the portal groups so that they may have at least some StorPool services (hosts) in common.

17. Multi-site and multi-cluster

There are two sets of features allowing connections and operations to be performed on different clusters in the same datacenter (17.1.  Multicluster) or in different locations (17.2.  Multi site).

General distinction between the two:

  • Multicluster covers closely packed clusters (i.e. pods or racks) with a fast and low-latency connection between them

  • Multi-site covers clusters in separate locations connected through an insecure and/or high-latency connection

17.1. Multicluster

The main use case for multicluster is seamless scalability within the same datacenter. A volume can be live-migrated between different sub-clusters in a multicluster setup. This way, workloads can be balanced between the multiple sub-clusters in a location; such a group of sub-clusters is generally referred to as a multicluster setup.

digraph G {
  rankdir=LR;
  compound=true;
  ranksep=1;
  style=radial;
  bgcolor="white:gray";
  image=svg;
  label="Location A";
  subgraph cluster_a0 {
    style=filled;
    bgcolor="white:lightgrey";
    node [
        style=filled,
        shape=square,
    ];
    bridge0;
    a00 [label="a0.1"];
    a01 [label="a0.2"];
    space0 [label="..."];
    a03 [label="a0.N"];
    label = "Cluster A0";
  }

  subgraph cluster_a1 {
    style=filled;
    bgcolor="white:lightgrey";
    node [
        style=filled,
        shape=square,
    ];
    bridge1;
    a10 [label="a1.1"];
    a11 [label="a1.2"];
    space1 [label="..."];
    a13 [label="a1.N"];
    label = "Cluster A1";
  }

  subgraph cluster_a2 {
    style=filled;
    bgcolor="white:lightgrey";
    node [
        style=filled,
        shape=square,
    ];
    bridge2;
    a20 [label="a2.1"];
    a21 [label="a2.2"];
    space2 [label="..."];
    a23 [label="a2.N"];
    label = "Cluster A2";
  }

  bridge0 -> bridge1 [dir=both, lhead=cluster_a1, ltail=cluster_a0];
  bridge1 -> bridge2 [dir=both, lhead=cluster_a2, ltail=cluster_a1];
  bridge0 -> bridge2 [dir=both, lhead=cluster_a2, ltail=cluster_a0];
// was:
//   bridge0 -> bridge1 [color="red", lhead=cluster_a1, ltail=cluster_a0];
//   bridge1 -> bridge0 [color="blue", lhead=cluster_a0, ltail=cluster_a1];
//   bridge1 -> bridge2 [color="red", lhead=cluster_a2, ltail=cluster_a1];
//   bridge2 -> bridge1 [color="blue", lhead=cluster_a1, ltail=cluster_a2];
//   bridge0 -> bridge2 [color="red", lhead=cluster_a2, ltail=cluster_a0];
//   bridge2 -> bridge0 [color="blue", lhead=cluster_a0, ltail=cluster_a2];

}

17.2. Multi site

Remotely connected clusters in different locations are referred to as multi site. When two remote clusters are connected, they can efficiently transfer snapshots between each other. The usual use cases are remote backup and disaster recovery (DR).

digraph G {
  rankdir=LR;
  compound=true;
  ranksep=2;
  image=svg;
  subgraph cluster_loc_a {
    style=radial;
    bgcolor="white:gray";
    node [
        style=filled,
        color="white:lightgrey",
        shape=square,
    ];
    a0 [label="Cluster A0"];
    a1 [label="Cluster A1"];
    a2 [label="Cluster A2"];
    label = "Location A";
  }

  subgraph cluster_loc_b {
    style=filled;
    color=grey;
    node [
        style=filled,
        color="white:grey",
        shape=square,
    ];
    b0 [label="Cluster B0"];
    b1 [label="Cluster B1"];
    b2 [label="Cluster B2"];
    label = "Location B";
  }

  a1 -> b1 [color="red", lhead=cluster_loc_b, ltail=cluster_loc_a];
  b1 -> a1 [color="blue", lhead=cluster_loc_a, ltail=cluster_loc_b];
}

17.3. Setup

Connecting clusters regardless of their locations requires the storpool_bridge service to be running on at least two nodes in each cluster.

Each node running the storpool_bridge needs the following parameters to be configured in /etc/storpool.conf or /etc/storpool.conf.d/*.conf files:

SP_CLUSTER_NAME=<Human readable name of the cluster>
SP_CLUSTER_ID=<location ID>.<cluster ID>
SP_BRIDGE_HOST=<IP address>

The following is required when a single IP will be failed over between the bridges; see 17.5.2.  Single IP failed over between the nodes:

SP_BRIDGE_IFACE=<interface> # optional with IP failover

The SP_CLUSTER_NAME is a mandatory human-readable name for this cluster.

The SP_CLUSTER_ID is a unique ID assigned by StorPool to each existing cluster (example nmjc.b). The cluster ID consists of two parts:

nmjc.b
|     `sub-cluster ID
`location ID

The part before the dot (nmjc) is the location ID, and the part after the dot (b) is the sub-cluster ID.

The SP_BRIDGE_HOST is the IP address on which the bridge listens for connections from other bridges. Note that TCP port 3749 should be open in the firewalls between the two locations.

A backup template should be configured through mgmtConfig (see 12.22.  Management configuration). The backup template is needed to instruct the local bridge which template should be used for incoming snapshots.

Warning

The backupTemplateName mgmtConfig option must be configured in the destination cluster for storpool volume XXX backup LOCATION to work (otherwise the transfer won’t start).

The SP_BRIDGE_IFACE is required when two or more bridges are configured with the same public/private key pairs. The SP_BRIDGE_HOST in this case is a floating IP address and will be configured on the SP_BRIDGE_IFACE on the host with the active bridge.
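
For illustration, a minimal storpool.conf snippet for the bridge node in Cluster A from the example in the next section might look like this; the values are the placeholders used there:

SP_CLUSTER_NAME=Cluster_A
SP_CLUSTER_ID=locationAId.aId
SP_BRIDGE_HOST=10.10.10.1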

17.4. Connecting two clusters

In this example there are two clusters, named Cluster_A and Cluster_B. To have them connected through their bridge services, we have to introduce each of them to the other.

digraph G {
  rankdir=LR;
  image=svg;
  subgraph cluster_a {
    style=filled;
    color=lightgrey;
    node [
        style=filled,
        color=white,
        shape=square,
        label="Bridge A\n\nSP_CLUSTER_ID = locationAId.aId\nSP_BRIDGE_HOST = 10.10.10.1\npublic_key: aaaa.bbbb.cccc.dddd\n",
    ];
    bridge0;
    label = "Cluster A";
  }

  subgraph cluster_b {
    style=filled;
    color=grey;
    node [
        style=filled,
        color=white,
        shape=square,
        label="Bridge B\n\nSP_CLUSTER_ID = locationBId.bId\nSP_BRIDGE_HOST = 10.10.20.1\npublic_key: eeee.ffff.gggg.hhhh\n"
    ];
    bridge1;
    label = "Cluster B";
  }
  bridge0 -> bridge1 [dir=none color=none]
}

Note

In case of a multicluster setup the location will be the same for both clusters. The procedure is the same in both cases, with the slight difference that in a multicluster setup the remote bridges are usually configured with noCrypto.

17.4.1. Cluster A

The following parameters from Cluster_B will be required:

  • The SP_CLUSTER_ID - locationBId.bId

  • The SP_BRIDGE_HOST IP address - 10.10.20.1

  • The public key located in /usr/lib/storpool/bridge/bridge.key.txt in the remote bridge host in Cluster_B - eeeeeeeeeeeee.ffffffffffff.ggggggggggggg.hhhhhhhhhhhhh

Using the CLI, we can add Cluster_B’s location with the following commands in Cluster_A:

user@hostA # storpool location add locationBId location_b
user@hostA # storpool cluster add location_b bId
user@hostA # storpool cluster list
--------------------------------------------
| name                 | id   | location   |
--------------------------------------------
| location_b-cl1       | bId  | location_b |
--------------------------------------------

The remote name is location_b-cl1, where the clN number is automatically generated based on the cluster ID. The last step in Cluster_A is to register Cluster_B’s bridge. The command looks like this:

user@hostA # storpool remoteBridge register location_b-cl1 10.10.20.1 eeeeeeeeeeeee.ffffffffffff.ggggggggggggg.hhhhhhhhhhhhh

Registered bridges in Cluster_A:

user@hostA # storpool remoteBridge list
----------------------------------------------------------------------------------------------------------------------------
| ip             | remote         | minimumDeleteDelay | publicKey                                              | noCrypto |
----------------------------------------------------------------------------------------------------------------------------
| 10.10.20.1     | location_b-cl1 |                    | eeeeeeeeeeeee.ffffffffffff.ggggggggggggg.hhhhhhhhhhhhh | 0        |
----------------------------------------------------------------------------------------------------------------------------

Hint

The public key in /usr/lib/storpool/bridge/bridge.key.txt will be generated on the first run of the storpool_bridge service.

Note

The noCrypto option is usually set to 1 in a multicluster setup with a secure datacenter network, for higher throughput and lower latency during migrations.

digraph G {
  rankdir=LR;
  image=svg;
  compound=true;
  ranksep=2;
  subgraph cluster_a {
    style=filled;
    color=lightgrey;
    node [
        style=filled,
        color=white,
        shape=square,
        label="Bridge A",
    ];
    bridge0;
    label = "Cluster A";
  }

  subgraph cluster_b {
    style=filled;
    color=grey;
    node [
        style=filled,
        color=white,
        shape=square,
        label="Bridge B"
    ];
    bridge1;
    label = "Cluster B";
  }
  bridge0 -> bridge1 [color="red", lhead=cluster_b, ltail=cluster_a];
//   bridge1 -> bridge0 [color="blue", lhead=cluster_a, ltail=cluster_b];
}

17.4.2. Cluster B

Similarly, the parameters from Cluster_A will be required for registering the location, cluster, and bridge(s) in Cluster_B:

  • The SP_CLUSTER_ID - locationAId.aId

  • The SP_BRIDGE_HOST IP address in Cluster_A - 10.10.10.1

  • The public key in /usr/lib/storpool/bridge/bridge.key.txt in the remote bridge host in Cluster_A - aaaaaaaaaaaaa.bbbbbbbbbbbb.ccccccccccccc.ddddddddddddd

Similarly the commands will be:

user@hostB # storpool location add locationAId location_a
user@hostB # storpool cluster add location_a aId
user@hostB # storpool cluster list
--------------------------------------------
| name                 | id   | location   |
--------------------------------------------
| location_a-cl1       | aId  | location_a |
--------------------------------------------
user@hostB # storpool remoteBridge register location_a-cl1 10.10.10.1 aaaaaaaaaaaaa.bbbbbbbbbbbb.ccccccccccccc.ddddddddddddd
user@hostB # storpool remoteBridge list
-------------------------------------------------------------------------------------------------------------------------
| ip          | remote         | minimumDeleteDelay | publicKey                                              | noCrypto |
-------------------------------------------------------------------------------------------------------------------------
| 10.10.10.1  | location_a-cl1 |                    | aaaaaaaaaaaaa.bbbbbbbbbbbb.ccccccccccccc.ddddddddddddd | 0        |
-------------------------------------------------------------------------------------------------------------------------

At this point, provided network connectivity is working, the two bridges will be connected.

digraph G {
  rankdir=LR;
  image=svg;
  compound=true;
  ranksep=2;
  subgraph cluster_a {
    style=filled;
    color=lightgrey;
    node [
        style=filled,
        color=white,
        shape=square,
        label="Bridge A",
    ];
    bridge0;
    label = "Cluster A";
  }

  subgraph cluster_b {
    style=filled;
    color=grey;
    node [
        style=filled,
        color=white,
        shape=square,
        label="Bridge B"
    ];
    bridge1;
    label = "Cluster B";
  }
  bridge0 -> bridge1 [color="red", lhead=cluster_b, ltail=cluster_a];
  bridge1 -> bridge0 [color="blue", lhead=cluster_a, ltail=cluster_b];
}

17.5. Bridge redundancy

There are two ways to add redundancy for the bridge services by configuring and starting the storpool_bridge service on two (or more) nodes in each cluster.

In both cases only one bridge is active at a time; it is failed over when the node or the active service is restarted.

17.5.1. Separate IP addresses

Configure and start the storpool_bridge service with a separate SP_BRIDGE_HOST address and a separate set of public/private key pairs on each node. In this case each of the bridge nodes has to be registered in the same way as explained in the 17.4.  Connecting two clusters section. The SP_BRIDGE_IFACE parameter is left unset, and the SP_BRIDGE_HOST address is expected by the storpool_bridge service on each of the nodes where it is started.

In this case each of the bridge nodes in Cluster_A would have to be configured in Cluster_B, and vice versa.

17.5.2. Single IP failed over between the nodes

For this, configure and start the storpool_bridge service on the first node. Then distribute the /usr/lib/storpool/bridge/bridge.key and /usr/lib/storpool/bridge/bridge.key.txt files to the next node where the storpool_bridge service will be running.
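
One way to distribute the key files is with scp; the hostname node2 below is only a placeholder for the second bridge node:

# scp /usr/lib/storpool/bridge/bridge.key /usr/lib/storpool/bridge/bridge.key.txt node2:/usr/lib/storpool/bridge/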

The SP_BRIDGE_IFACE is required and represents the interface where the SP_BRIDGE_HOST address will be configured. The SP_BRIDGE_HOST will be up only on the node where the active bridge service is running until either the service or the node itself gets restarted.

With this configuration there will be only one bridge registered in the remote cluster(s), regardless of the number of nodes with running storpool_bridge in the local cluster.

The failover SP_BRIDGE_HOST is better suited for NAT/port-forwarding cases.

17.6. Bridge throughput performance

The throughput performance of a bridge connection depends on several factors - network throughput, network latency, CPU speed, and disk latency (not necessarily in this order). Each of them can become a bottleneck and may require additional tuning in order to get higher throughput from the available link between the two sites.

17.6.1. Network

For high-throughput links, latency is the most important factor for achieving higher link utilization. For example, a low-latency 10 Gbps link will be easily saturated (provided crypto is off), but would require some tuning of the TCP window size when the latency is higher. The same applies to lower-bandwidth links with higher latency.

For these cases the send buffer size could be increased in small increments so that the TCP window is optimized. See the 12.1.  Location section for more info on how to update the send buffer size in each location.

Note

To find the best send buffer size for throughput from the primary to the backup site, fill a volume with data in the primary (source) site, then create a backup to the backup (remote) site. While observing the utilized bandwidth, increase the send buffers in small increments in both the source and the destination cluster until the throughput either stops rising or stays at an acceptable level.

Note that increasing the send buffers above this value can lead to delays when recovering a backup in the opposite direction.

Further sysctl changes might be required depending on the NIC driver; for more information, check /usr/share/doc/storpool/examples/bridge/90-StorPoolBridgeTcp.conf on the node running the storpool_bridge service.

17.6.2. CPU

The CPU usually becomes a bottleneck only when crypto is turned on; in this case it is sometimes helpful to move the bridge service to a node with a faster CPU.

If a faster CPU is not available in the same cluster, setting SP_BRIDGE_SLEEP_TYPE to hsleep or even no might help. Note that when this is configured, storpool_cg will attempt to isolate a full CPU core (i.e. with the second hardware thread free from other processes).

17.6.3. Disks throughput

The default remote recovery setting (SP_REMOTE_RECOVERY_PARALLEL_REQUESTS_PER_DISK) is relatively low, especially for dedicated backup clusters; thus, when the underlying disks in the receiving cluster are left underutilized by this low queue depth (this does not happen with flash media), they become the bottleneck. This parameter could be tuned for higher parallelism. An example would be a small cluster of 3 nodes with 8 disks each: the default translates to a queue depth of 48 from the bridge, while there are 8 * 3 * 32 requests available from the underlying disks and, by default, 2048 requests available from the bridge service on a 10 Gbps link (256 on a 1 Gbps link).
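
To illustrate the arithmetic above (the 2 requests per disk figure is an inference from the 48/24 ratio in this example, not a documented default):

3 nodes x 8 disks               = 24 disks in the receiving cluster
24 disks x 2 requests per disk  = 48 requests queued by the bridge (the default mentioned above)
24 disks x 32 requests per disk = 768 requests the underlying disks could accept
bridge service limit            = 2048 requests on a 10 Gbps link (256 on a 1 Gbps link)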

Note

The storpool_server services require a restart after the changes in order for them to be applied.

17.7. Exports

A snapshot in one of the clusters could be exported and become visible in all clusters in the location it was exported to. For example, a snapshot called snap1 could be exported with:

user@hostA # storpool snapshot snap1 export location_b

It becomes visible in Cluster_B, which is part of location_b, and could be listed with:

user@hostB # storpool snapshot list remote
-------------------------------------------------------------------------------------------------------
| location   | remoteId             | name     | onVolume | size         | creationTimestamp   | tags |
-------------------------------------------------------------------------------------------------------
| location_b | locationAId.aId.1    | snap1    |          | 107374182400 | 2019-08-11 15:18:02 |      |
-------------------------------------------------------------------------------------------------------

The snapshot may also be exported to the location of the source cluster where it resides. This way it will become visible to all sub-clusters in this location.
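
For example, assuming the source cluster from the examples above is in location_a:

user@hostA # storpool snapshot snap1 export location_a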

17.8. Remote clones

Any snapshot export could be cloned locally. For example, to clone a remote snapshot with globalId of locationAId.aId.1 locally we could use:

user@hostB # storpool snapshot snap1-copy template hybrid remote location_a locationAId.aId.1

[Diagram: snap1 (globalId locationAId.aId.1) in Cluster A is transferred through Bridge A and Bridge B to its local clone snap1_clone in Cluster B.]

The name of the clone of the snapshot in Cluster_B will be snap1_clone with all parameters from the hybrid template.

Note

Note that the name of the snapshot in Cluster_B could also be exactly the same as in the source cluster; this holds in all sub-clusters in a multicluster setup, as well as in clusters in different locations in a multi-site setup.

The transfer will start immediately. Only the written parts of the snapshot will be transferred between the sites. If snap1 has a size of 100GB, but only 1GB of data was ever written to the volume before it was snapshotted, approximately 1GB of data will eventually be transferred between the two (sub-)clusters.

If another snapshot (snap2) based on snap1 is later exported, the actual transfer will include only the differences between snap1 and snap2, since snap1 already exists in Cluster_B.

[Diagram: snap2 (locationAId.aId.2), a child of snap1 in Cluster A, is transferred through the bridges to snap2_clone in Cluster B, where snap1_clone (locationAId.aId.1) already exists.]

The globalId for this snapshot will be the same for all sites it has been transferred to.

17.9. Creating a remote backup on a volume

The volume backup feature is in essence a set of steps that automate the backup procedure for a particular volume.

For example, to back up a volume named volume1 in Cluster_A to Cluster_B we will use:

user@hostA # storpool volume volume1 backup Cluster_B

The above command will actually trigger the following set of events:

  1. Creates a local temporary snapshot of volume1 in Cluster_A to be transferred to Cluster_B

  2. Exports the temporary snapshot to Cluster_B

  3. Instructs Cluster_B to initiate the transfer for this snapshot

  4. Exports the transferred snapshot in Cluster_B to be visible from Cluster_A

  5. Deletes the local temporary snapshot

For example, if a backup operation has been initiated for a volume called volume1 in Cluster_A, the progress of the operation could be followed with:

user@hostA # storpool snapshot list exports
-------------------------------------------------------------
| location   | snapshot     | globalId          | backingUp |
-------------------------------------------------------------
| location_b | volume1@1433 | locationAId.aId.p | true      |
-------------------------------------------------------------

Once this operation completes the temporary snapshot will no longer be visible as an export and a snapshot with the same globalId will be visible remotely:

user@hostA # storpool snapshot list remote
------------------------------------------------------------------------------------------------------
| location   | remoteId          | name    | onVolume    | size         | creationTimestamp   | tags |
------------------------------------------------------------------------------------------------------
| location_b | locationAId.aId.p | volume1 | volume1     | 107374182400 | 2019-08-13 16:27:03 |      |
------------------------------------------------------------------------------------------------------

Note

You must have a template configured in mgmtConfig backupTemplateName in Cluster_B for this to work.

17.10. Creating an atomic remote backup for multiple volumes

Sometimes a set of volumes is used simultaneously by the same virtual machine; an example would be separate filesystems for a database and its journal. To be able to restore all volumes to the same point in time, a group backup could be initiated:

user@hostA # storpool volume groupBackup Cluster_B volume1 volume2

Note

The same underlying feature is used by the VolumeCare for keeping consistent snapshots for all volumes on a virtual machine.

17.11. Restoring a volume from remote snapshot

Restoring the volume to a previous state from a remote snapshot requires the following steps:

  1. Create a local snapshot from the remotely exported one:

    user@hostA # storpool snapshot volume1-snap template hybrid remote location_b locationAId.aId.p
    OK
    

    There are a few parts to explain in the above example, from left to right:

    • volume1-snap - name of the local snapshot that will be created.

    • template hybrid - instructs StorPool what will be the replication and placement for the locally created snapshot.

    • remote location_b locationAId.aId.p - instructs StorPool where to look for this snapshot and what its globalId is.

    Tip

    If the bridges and the connection between the locations are operational, the transfer will begin immediately.

  2. Next, create a volume with the newly created snapshot as a parent:

    user@hostA # storpool volume volume1-tmp parent volume1-snap

  3. Finally, the volume clone would have to be attached where it is needed.

The last two steps could be changed a bit: rename the old volume to something different, then create a volume with the original name directly from the restored snapshot (see the sketch below). This is handled differently in different orchestration systems. The procedure for restoring multiple volumes from a group backup requires the same set of steps.
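
A minimal sketch of this rename-and-recreate alternative, assuming the installed CLI release provides a volume rename subcommand (the volume1-old name is purely illustrative):

user@hostA # storpool volume volume1 rename volume1-old    # keep the current data under a different name
user@hostA # storpool volume volume1 parent volume1-snap   # recreate volume1 from the restored snapshot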

See VolumeCare 6.5.  node info for an example implementation.

Note

From 19.01 onwards if the snapshot transfer hasn’t completed yet when the volume is created, read operations on an object that is not yet transferred will be forwarded through the bridge and will be processed by the remote cluster.

17.12. Remote deferred deletion

Note

This feature is available for both multicluster and multi-site configurations. Note that the minimumDeleteDelay is per bridge, not per location, thus all bridges to a remote location should be (re)registered with the setting.

The remote bridge could be registered with remote deferred deletion enabled. This feature will enable a user in Cluster_A to unexport remote snapshots and set them for deferred deletion in Cluster_B.

An example for the case without deferred deletion enabled - Cluster_A and Cluster_B are two StorPool clusters in locations A and B connected with a bridge. A volume named volume1 in Cluster_A has two backup snapshots in Cluster_B called volume1@281 and volume1@294.

[Diagram: volume1 in Cluster A, and its backup snapshots volume1@281 and volume1@294 in Cluster B, with Bridge A and Bridge B connecting the two clusters.]

The remote snapshots could be unexported from Cluster_A with the deleteAfter flag, but the flag will be silently ignored in Cluster_B.

To enable this feature, the following steps would have to be completed in Cluster_B for the bridge to Cluster_A:

  1. The bridge in Cluster_A should be registered with minimumDeleteDelay in Cluster_B.

  2. Deferred snapshot deletion should be enabled in Cluster_B; for details, see 12.22.  Management configuration.

This will enable setting up the deleteAfter parameter on an unexport operation in Cluster_B initiated from Cluster_A.

With the above example volume and remote snapshots, a user in Cluster_A could unexport the volume1@294 snapshot and set its deleteAfter flag to 7 days from the unexport with:

user@hostA # storpool snapshot remote location_b locationAId.aId.q unexport deleteAfter 7d
OK

After the completion of this operation the following events will occur:

  • The volume1@294 snapshot will immediately stop being visible in Cluster_A.

  • The snapshot will get a deleteAfter flag with a timestamp a week from the time of the unexport call.

  • A week later the snapshot will be deleted, provided that deferred snapshot deletion is still turned on.

17.13. Volume and snapshot move

17.13.1. Volume move

A volume could be moved to a neighboring sub-cluster in a multicluster environment either while attached (live) or without an attachment (offline). This is available only for multicluster and is not possible for multi-site setups, where only snapshots could be transferred.

To move a volume use:

# storpool volume <volumeName> moveToRemote <clusterName>

The above command will succeed only if the volume is not attached on any of the nodes in this sub-cluster. To move the volume live while it is still attached, an additional onAttached option should instruct the cluster how to proceed. For example, this command:

Lab-D-cl1> volume test moveToRemote Lab-D-cl2 onAttached export

Will move the volume to the Lab-D-cl2 sub-cluster and, if the volume is attached in the present cluster, will export it back to Lab-D-cl1.

This is equivalent to:

Lab-D-cl1> multiCluster on
[MC] Lab-D-cl1> cluster cmd Lab-D-cl2 attach volume test client 12
OK

Or by directly executing the same CLI command in multicluster mode on a host in the Lab-D-cl2 cluster.

Note

Moving a volume will also trigger moving all of its snapshots. In case there are parent snapshots with many child volumes, they might end up in each sub-cluster that their child volumes were moved to, as a space-saving measure.

17.13.2. Snapshot move

Moving a snapshot is essentially the same as moving a volume, with the difference that it cannot be moved when attached.

For example:

Lab-D-cl1> snapshot testsnap moveToRemote Lab-D-cl2

Will succeed only if the snapshot is not attached locally.

Moving a snapshot that is part of a volume snapshot chain will also trigger copying of its parent snapshots, which is handled automatically by the cluster.

18. Rebalancing the cluster

18.1. Overview

In some situations the data in the StorPool cluster needs to be rebalanced. This is performed by the balancer and the relocator tools. The relocator is an integral part of the StorPool management service, while the balancer is presently an external tool executed on one of the nodes with access to the API.

Note

Be advised that the balancer tool will create some files it needs in the present working directory.

18.2. Rebalancing procedure

The rebalancing operation is performed in the following steps:

  • The balancer tool is executed to calculate the new state of the cluster.

  • The results from the balancer are verified by a set of automated scripts.

  • The results are also manually reviewed to check whether they contain any inconsistencies and whether they achieve the intended goals. These results are available by running storpool balancer disks and will also be printed at the end of the balancer.sh run.

    If the result is not satisfactory, the balancer is executed with different parameters, until a satisfactory result is obtained.

  • Once the proposed end result is satisfactory, the calculated state is loaded into the relocator tool, by doing storpool balancer commit.

    Note that this step can be reversed only with the --restore-state option, which will revert to the initial state. Cancelling a balancing operation that has already run for a while is currently not supported.

  • The relocator tool performs the actual move of the data.

    The progress of the relocator tool can be monitored with storpool task list for the currently running tasks, storpool relocator status for an overview of the relocator state, and storpool relocator disks (warning: slow command) for the full relocation state, as shown in the example after this list.
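
For example, a quick way to check on a running rebalance from any node with access to the API (output omitted here):

# storpool task list          # currently running relocation tasks
# storpool relocator status   # overview of the relocator state
# storpool relocator disks    # full relocation state (slow command)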

18.3. Options

The balancer tool is executed via the /usr/lib/storpool/balancer.sh wrapper and accepts the following options:

-A

Don’t only move data from fuller to emptier drives. (default -c is 10 when -A is used).

-b placementGroup

Use disks in the specified placement group to restore replication in critical conditions.

-c factor

Factor for how much data to try to move around, from 0 to 10. No default, required parameter.

-d diskId [-d diskId]

Put data only on the selected disks.

-D diskId [-D diskId]

Don’t move data from those disks.

--do-whatever-you-can

Emergency use only, and only after the balancer has failed; will decrease the redundancy level.

-E 0-99

Don’t empty drives that are filled below this value, in percent.

--empty-down-disks

Proceed with balancing even when there are down disks, and remove all data from them.

-f percent

Allow drives to be filled up to this percentage, from 0 to 99. Default 90.

-F

Only move data from fuller to emptier drives; default -c factor is 3 when -F is used.

-g placementGroup

Work only on the specified placement group.

--ignore-down-disks

Proceed with balancing even when there are down disks, and do not remove data from them.

--ignore-src-pg-violations

Exactly what it says

-m maxAgCount

Limit the maximum allocation group count on drives to this (effectively their usable size).

-M maxDataToAdd

Limit the amount of data to copy to a single drive, to be able to rebalance “in pieces”.

--max-disbalance-before-striping X

In percents.

--min-disk-full X

Don’t remove data from disk if it is not at least this X% full.

--min-replication R

Minimum replication required.

-o overridesPgName

Specify override placement group name (required only if override template is not created).

--only-empty-disk diskId

Like -D for all other disks.

-R

Only restore replication for degraded volumes.

--restore-state

Revert to the initial state of the disks (before the balancer commit execution).

-S

Prefer tail SSD.

-V vagId [-V vagId]

Skip balancing vagId.

-v

Verbose output (shows data how all drives in the cluster would be affected).

-A and -F are the reverse of each other and mutually exclusive.

The -c value is basically the trade-off between the uniformity of the data on the disks and the amount of data moved to accomplish that. A lower factor means less data to be moved around, but sometimes more inequality between the data on the disks; a higher one means more data to be moved, but sometimes with a better result in terms of how evenly the data is spread across the drives.

On clusters with drives of unsupported size (HDDs larger than 4 TB) the -m option is required. It will limit the data moved onto these drives to up to the set number of allocation groups. This is done because the performance per TB of larger drives is lower, which degrades the performance of the whole cluster in high-performance use cases.

The -M option is useful when a full rebalancing would involve many tasks until completed and could impact other operations (such as remote transfers, or the time required for a currently running recovery to complete). With the -M option the amount of data loaded by the balancer for each disk may be reduced, and a more rebalanced state is achieved through several smaller rebalancing operations.

The -f option is required on clusters whose drives are full above 90%. Extreme care should be used when balancing in such cases.

The -b option could be used to move data between placementGroups (in most cases from SSDs to HDDs).
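
As an illustration only (the factor and placement group name below are arbitrary; adapt them to the cluster at hand), a conservative run that only moves data from fuller to emptier drives within a single placement group could look like this:

mkdir -p ~/storpool/balancer && cd ~/storpool/balancer  # the balancer creates its working files here
/usr/lib/storpool/balancer.sh -F -c 3 -g hdd            # fuller-to-emptier moves only, factor 3, only the hdd placement group
storpool balancer disks                                 # review the proposed result
storpool balancer commit                                # load the calculated state into the relocator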

18.4. Restoring volume redundancy on a failed drive

Situation: we have lost drive 1802 in placementGroup ssd. We want to remove it from the cluster and restore the redundancy of the data. We need to do the following:

storpool disk 1802 forget                               # this will also remove the drive from all placement groups it participated in
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer  # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -R
storpool balancer commit                                # to actually load the data into the relocator and start the re-balancing operation

18.5. Restoring volume redundancy for two failed drives (single-copy situation)

(Emergency) Situation: we have lost drives 1802 and 1902 in placementGroup ssd. We want to remove them from the cluster and restore the redundancy of the data. We need to do the following:

storpool disk 1802 forget                               # this will also remove the drive from all placement groups it participated in
storpool disk 1902 forget                               # this will also remove the drive from all placement groups it participated in
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer  # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -F --min-replication 2    # first balancing run, to create a second copy of the data
storpool balancer commit                                # to actually load the data into the relocator and start the re-balancing operation
# wait for the balancing to finish

/usr/lib/storpool/balancer.sh -R                        # second balancing run, to restore full redundancy
storpool balancer commit                                # to actually load the data into the relocator and start the re-balancing operation

18.6. Adding new drives and rebalancing data on them

Situation: we have added SSDs 1201, 1202 and HDDs 1510, 1511, that need to go into placement groups ssd and hdd respectively, and we want to re-balance the cluster data so that it is re-dispersed onto the new disks as well. We have no other placement groups in the cluster.

storpool placementGroup ssd addDisk 1201 addDisk 1202
storpool placementGroup hdd addDisk 1510 addDisk 1511
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer  # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -F -c 0                   # rebalance all placement groups, move data from fuller to emptier drives
storpool balancer commit                                # to actually load the data into the relocator and start the re-balancing operation

18.7. Restoring volume redundancy with rebalancing data on other placementGroup

Situation: we have to restore the redundancy of a hybrid cluster (2 copies on HDDs, one on SSDs) while the ssd placementGroup is out of free space because a few SSDs have recently failed. We can’t replace the failed drives with new ones for the moment.

mkdir -p ~/storpool/balancer && cd ~/storpool/balancer  # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -R -c 0 -b hdd            # use placementGroup ``hdd`` as a backup and move some data from SSDs
storpool balancer commit                                # to actually load the data into the relocator and start the re-balancing operation

Note

The -f argument could further be used to instruct the balancer how full to keep the drives, and thus control how much data will be moved to the backup placement group.

18.8. Decommissioning a live node

Situation: a node in the cluster needs to be decommissioned, so the data on its drives needs to be moved away. The drive numbers on that node are 101, 102 and 103.

Note

You have to make sure you have enough space to restore the redundancy before proceeding.

storpool disk 101 softEject                             # mark all drives for evacuation
storpool disk 102 softEject
storpool disk 103 softEject
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer  # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -R -c 0                   # rebalance all placement groups, -F has the same effect in this case
storpool balancer commit                                # to actually load the data into the relocator and start the re-balancing operation

18.9. Decommissioning a dead node

Situation: a node in the cluster needs to be decommissioned, as it has died and cannot be brought back. The drive numbers on that node are 101, 102 and 103.

Note

You have to make sure you have enough space to restore the redundancy before proceeding.

storpool disk 101 forget                                # remove the drives from all placement groups
storpool disk 102 forget
storpool disk 103 forget
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer  # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -R -c 0                   # rebalance all placement groups
storpool balancer commit                                # to actually load the data into the relocator and start the re-balancing operation

18.10. Resolving imbalances in the drive usage

Situation: we have an imbalance in the drive usage in the whole cluster and we want to improve it.

mkdir -p ~/storpool/balancer && cd ~/storpool/balancer  # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -F -c 0                   # rebalance all placement groups
/usr/lib/storpool/balancer.sh -F -c 3                   # retry to see if we get a better result with more data movements
storpool balancer commit                                # to actually load the data into the relocator and start the re-balancing operation

18.11. Resolving imbalances in the drive usage with three-node clusters

Situation: we have an imbalance in the drive usage in the whole cluster and we want to improve it. We have a three-node hybrid cluster and proper balancing requires larger moves of “unrelated” data:

mkdir -p ~/storpool/balancer && cd ~/storpool/balancer  # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -F -c 0                   # rebalance all placement groups
/usr/lib/storpool/balancer.sh -A -c 10                  # retry to see if we get a better result with more data movements
storpool balancer commit                                # to actually load the data into the relocator and start the re-balancing operation

18.12. Reverting balancer to a previous state

Situation: we have committed a rebalancing operation, but want to revert back to the previous state:

cd ~/storpool/balancer                                             # it's recommended to run the following commands in a screen/tmux session
ls                                                                 # list all saved states and choose what to revert to
/usr/lib/storpool/balancer.sh --restore-state 2022-10-28-15-39-40  # revert to 2022-10-28-15-39-40
storpool balancer commit                                           # to actually load the data into the relocator and start the re-balancing operation

18.13. Reading the output of storpool balancer disks

Here is an example output from storpool balancer disks:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|     disk | server |   size   |                  stored                  |                 on-disk                  |                     objects                      |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|        1 |   14.0 |   373 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.1 MB -> 6.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 405000  |
|     1101 |   11.0 |   447 GB |    16 GB -> 15 GB    (-1.0 GB / 1.4 GB)  |    18 GB -> 17 GB    (-1.1 GB / 1.4 GB)  |   11798 -> 10040     (-1758 / +3932)   / 480000  |
|     1102 |   11.0 |   447 GB |    16 GB -> 15 GB    (-268 MB / 1.3 GB)  |    17 GB -> 17 GB    (-301 MB / 1.4 GB)  |   10843 -> 10045      (-798 / +4486)   / 480000  |
|     1103 |   11.0 |   447 GB |    16 GB -> 15 GB    (-1.0 GB / 1.8 GB)  |    18 GB -> 16 GB    (-1.2 GB / 1.9 GB)  |   12123 -> 10039     (-2084 / +3889)   / 480000  |
|     1104 |   11.0 |   447 GB |    16 GB -> 15 GB    (-757 MB / 1.3 GB)  |    17 GB -> 16 GB    (-899 MB / 1.3 GB)  |   11045 -> 10072      (-973 / +4279)   / 480000  |
|     1111 |   11.0 |   466 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   5.1 MB -> 5.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 495000  |
|     1112 |   11.0 |   366 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   5.1 MB -> 5.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 390000  |
|     1121 |   11.0 |   931 GB |    22 GB -> 21 GB    (-1009 MB / 830 MB)  |    22 GB -> 21 GB    (-1.0 GB / 872 MB)  |   13713 -> 12698     (-1015 / +3799)   / 975000  |
|     1122 |   11.0 |   931 GB |    21 GB -> 21 GB    (-373 MB / 2.0 GB)  |    22 GB -> 21 GB    (-379 MB / 2.0 GB)  |   13469 -> 12742      (-727 / +3801)   / 975000  |
|     1123 |   11.0 |   931 GB |    22 GB -> 21 GB    (-1.1 GB / 1.9 GB)  |    22 GB -> 21 GB    (-1.1 GB / 2.0 GB)  |   14859 -> 12629     (-2230 / +4102)   / 975000  |
|     1124 |   11.0 |   931 GB |    21 GB -> 21 GB      (36 MB / 1.8 GB)  |    21 GB -> 21 GB      (92 MB / 1.9 GB)  |   13806 -> 12743     (-1063 / +3389)   / 975000  |
|     1201 |   12.0 |   447 GB |    18 GB -> 15 GB    (-2.9 GB / 633 MB)  |    19 GB -> 16 GB    (-3.0 GB / 658 MB)  |   14148 -> 10070     (-4078 / +3050)   / 480000  |
|     1202 |   12.0 |   447 GB |    17 GB -> 15 GB    (-2.1 GB / 787 MB)  |    19 GB -> 16 GB    (-2.3 GB / 815 MB)  |   13243 -> 10067     (-3176 / +2576)   / 480000  |
|     1203 |   12.0 |   447 GB |    17 GB -> 15 GB    (-2.0 GB / 3.3 GB)  |    19 GB -> 16 GB    (-2.4 GB / 3.5 GB)  |   12746 -> 10062     (-2684 / +3375)   / 480000  |
|     1204 |   12.0 |   447 GB |    18 GB -> 15 GB    (-2.7 GB / 1.1 GB)  |    19 GB -> 16 GB    (-2.9 GB / 1.1 GB)  |   12835 -> 10075     (-2760 / +3248)   / 480000  |
|     1212 |   12.0 |   366 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.0 MB -> 6.0 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 390000  |
|     1221 |   12.0 |   931 GB |    20 GB -> 21 GB     (569 MB / 1.5 GB)  |    21 GB -> 21 GB     (587 MB / 1.6 GB)  |   13115 -> 12616      (-499 / +3736)   / 975000  |
|     1222 |   12.0 |   931 GB |    22 GB -> 21 GB    (-979 MB / 307 MB)  |    22 GB -> 21 GB    (-1013 MB / 317 MB)  |   12938 -> 12697      (-241 / +3291)   / 975000  |
|     1223 |   12.0 |   931 GB |    22 GB -> 21 GB    (-1.1 GB / 781 MB)  |    22 GB -> 21 GB    (-1.2 GB / 812 MB)  |   13968 -> 12718     (-1250 / +3302)   / 975000  |
|     1224 |   12.0 |   931 GB |    21 GB -> 21 GB    (-784 MB / 332 MB)  |    22 GB -> 21 GB    (-810 MB / 342 MB)  |   13741 -> 12692     (-1049 / +3314)   / 975000  |
|     1225 |   12.0 |   931 GB |    21 GB -> 21 GB    (-681 MB / 849 MB)  |    22 GB -> 21 GB    (-701 MB / 882 MB)  |   13608 -> 12748      (-860 / +3420)   / 975000  |
|     1226 |   12.0 |   931 GB |    22 GB -> 21 GB    (-1.1 GB / 825 MB)  |    22 GB -> 21 GB    (-1.1 GB / 853 MB)  |   13066 -> 12692      (-374 / +3817)   / 975000  |
|     1301 |   13.0 |   447 GB |    13 GB -> 15 GB     (2.6 GB / 4.2 GB)  |    14 GB -> 17 GB     (2.7 GB / 4.4 GB)  |    7244 -> 10038     (+2794 / +6186)   / 480000  |
|     1302 |   13.0 |   447 GB |    12 GB -> 15 GB     (3.0 GB / 3.7 GB)  |    13 GB -> 17 GB     (3.1 GB / 3.9 GB)  |    7507 -> 10063     (+2556 / +5619)   / 480000  |
|     1303 |   13.0 |   447 GB |    14 GB -> 15 GB     (1.3 GB / 3.2 GB)  |    15 GB -> 17 GB     (1.3 GB / 3.4 GB)  |    7888 -> 10038     (+2150 / +5884)   / 480000  |
|     1304 |   13.0 |   447 GB |    13 GB -> 15 GB     (2.7 GB / 3.7 GB)  |    14 GB -> 17 GB     (2.8 GB / 3.9 GB)  |    7660 -> 10045     (+2385 / +5870)   / 480000  |
|     1311 |   13.0 |   466 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.1 MB -> 6.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 495000  |
|     1312 |   13.0 |   366 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.1 MB -> 6.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 390000  |
|     1321 |   13.0 |   931 GB |    21 GB -> 21 GB    (-193 MB / 1.1 GB)  |    21 GB -> 21 GB    (-195 MB / 1.2 GB)  |   13365 -> 12765      (-600 / +5122)   / 975000  |
|     1322 |   13.0 |   931 GB |    22 GB -> 21 GB    (-1.4 GB / 1.1 GB)  |    23 GB -> 21 GB    (-1.4 GB / 1.1 GB)  |   12749 -> 12739       (-10 / +4651)   / 975000  |
|     1323 |   13.0 |   931 GB |    21 GB -> 21 GB    (-504 MB / 2.2 GB)  |    22 GB -> 21 GB    (-496 MB / 2.3 GB)  |   13386 -> 12695      (-691 / +4583)   / 975000  |
|     1325 |   13.0 |   931 GB |    21 GB -> 20 GB    (-698 MB / 557 MB)  |    22 GB -> 21 GB    (-717 MB / 584 MB)  |   13113 -> 12768      (-345 / +2668)   / 975000  |
|     1326 |   13.0 |   931 GB |    21 GB -> 21 GB    (-507 MB / 724 MB)  |    22 GB -> 21 GB    (-522 MB / 754 MB)  |   13690 -> 12704      (-986 / +3327)   / 975000  |
|     1401 |   14.0 |   223 GB |   8.3 GB -> 7.6 GB   (-666 MB / 868 MB)  |   9.3 GB -> 8.5 GB   (-781 MB / 901 MB)  |    3470 -> 5043      (+1573 / +2830)   / 240000  |
|     1402 |   14.0 |   447 GB |   9.8 GB -> 15 GB     (5.6 GB / 5.7 GB)  |    11 GB -> 17 GB     (5.8 GB / 6.0 GB)  |    4358 -> 10060     (+5702 / +6667)   / 480000  |
|     1403 |   14.0 |   224 GB |   8.2 GB -> 7.6 GB   (-623 MB / 1.1 GB)  |   9.3 GB -> 8.6 GB   (-710 MB / 1.2 GB)  |    4547 -> 5036       (+489 / +2814)   / 240000  |
|     1404 |   14.0 |   224 GB |   8.4 GB -> 7.6 GB   (-773 MB / 1.5 GB)  |   9.4 GB -> 8.5 GB   (-970 MB / 1.6 GB)  |    4369 -> 5031       (+662 / +2368)   / 240000  |
|     1411 |   14.0 |   466 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.0 MB -> 6.0 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 495000  |
|     1412 |   14.0 |   366 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.0 MB -> 6.0 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 390000  |
|     1421 |   14.0 |   931 GB |    19 GB -> 21 GB     (1.9 GB / 2.6 GB)  |    19 GB -> 21 GB     (2.0 GB / 2.7 GB)  |   10670 -> 12624     (+1954 / +6196)   / 975000  |
|     1422 |   14.0 |   931 GB |    19 GB -> 21 GB     (1.6 GB / 3.2 GB)  |    20 GB -> 21 GB     (1.6 GB / 3.3 GB)  |   10653 -> 12844     (+2191 / +6919)   / 975000  |
|     1423 |   14.0 |   931 GB |    19 GB -> 21 GB     (1.9 GB / 2.5 GB)  |    19 GB -> 21 GB     (2.0 GB / 2.6 GB)  |   10715 -> 12688     (+1973 / +5846)   / 975000  |
|     1424 |   14.0 |   931 GB |    18 GB -> 20 GB     (2.2 GB / 2.9 GB)  |    19 GB -> 21 GB     (2.3 GB / 3.0 GB)  |   10723 -> 12686     (+1963 / +5505)   / 975000  |
|     1425 |   14.0 |   931 GB |    19 GB -> 21 GB     (1.3 GB / 2.5 GB)  |    20 GB -> 21 GB     (1.4 GB / 2.6 GB)  |   10702 -> 12689     (+1987 / +5486)   / 975000  |
|     1426 |   14.0 |   931 GB |    20 GB -> 21 GB     (1.0 GB / 2.5 GB)  |    20 GB -> 21 GB     (1.0 GB / 2.6 GB)  |   10737 -> 12609     (+1872 / +5771)   / 975000  |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|       45 |    4.0 |    29 TB |   652 GB -> 652 GB    (512 MB / 69 GB)   |   686 GB -> 685 GB   (-240 MB / 72 GB)   |  412818 -> 412818       (+0 / +159118) / 30885000 |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Let’s start with the last line. Here’s the meaning, field by field:

  • There are 45 drives in total.

  • There are 4 server instances.

  • The total disk capacity is 29 TB.

  • The stored data is 652 GB and will remain 652 GB after the rebalancing. The net change across all drives is 512 MB, and the total amount of data to be moved onto the drives is 69 GB (i.e. how much they will “recover” from other drives).

  • The same is repeated for the on-disk size. Here the total amount of changes is roughly the amount of data that would need to be copied.

  • The total current number of objects will not change (i.e. from 412818 to 412818), 0 new objects will be created, the total amount of objects to be moved is 159118, and the total number of possible objects in the cluster is 30885000.

The difference between “stored” and “on-disk” size is that the latter also includes the size of checksums and metadata.

For the rest of the lines, the data is basically the same, just per disk.

What needs to be taken into account is:

  • Are there drives that will have too much data on them? Here, both data size and objects must be checked, and they should be close to the average percentage for the placement group.

  • Is the data stored on the drives balanced, i.e. are all the drives’ usages close to the average?

  • Are there drives that should have data on them, but nothing is scheduled to be moved?

    This usually happens because a drive wasn’t added to the right placement group.

  • Will there be too much data to be moved?

To illustrate the difference in the amount of data to be moved, here is the output of storpool balancer disks from a run with -c 10:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|     disk | server |   size   |                  stored                  |                 on-disk                  |                     objects                      |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|        1 |   14.0 |   373 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.1 MB -> 6.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 405000  |
|     1101 |   11.0 |   447 GB |    16 GB -> 15 GB    (-1.0 GB / 1.7 GB)  |    18 GB -> 17 GB    (-1.1 GB / 1.7 GB)  |   11798 -> 10027     (-1771 / +5434)   / 480000  |
|     1102 |   11.0 |   447 GB |    16 GB -> 15 GB    (-263 MB / 1.7 GB)  |    17 GB -> 17 GB    (-298 MB / 1.7 GB)  |   10843 -> 10000      (-843 / +5420)   / 480000  |
|     1103 |   11.0 |   447 GB |    16 GB -> 15 GB    (-1.0 GB / 3.6 GB)  |    18 GB -> 16 GB    (-1.2 GB / 3.8 GB)  |   12123 -> 10005     (-2118 / +6331)   / 480000  |
|     1104 |   11.0 |   447 GB |    16 GB -> 15 GB    (-752 MB / 2.7 GB)  |    17 GB -> 16 GB    (-907 MB / 2.8 GB)  |   11045 -> 10098      (-947 / +5214)   / 480000  |
|     1111 |   11.0 |   466 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   5.1 MB -> 5.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 495000  |
|     1112 |   11.0 |   366 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   5.1 MB -> 5.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 390000  |
|     1121 |   11.0 |   931 GB |    22 GB -> 21 GB    (-1003 MB / 6.4 GB)  |    22 GB -> 21 GB    (-1018 MB / 6.7 GB)  |   13713 -> 12742      (-971 / +9712)   / 975000  |
|     1122 |   11.0 |   931 GB |    21 GB -> 21 GB    (-368 MB / 5.8 GB)  |    22 GB -> 21 GB    (-272 MB / 6.1 GB)  |   13469 -> 12718      (-751 / +8929)   / 975000  |
|     1123 |   11.0 |   931 GB |    22 GB -> 21 GB    (-1.1 GB / 5.9 GB)  |    22 GB -> 21 GB    (-1.1 GB / 6.1 GB)  |   14859 -> 12699     (-2160 / +8992)   / 975000  |
|     1124 |   11.0 |   931 GB |    21 GB -> 21 GB      (57 MB / 7.4 GB)  |    21 GB -> 21 GB     (113 MB / 7.7 GB)  |   13806 -> 12697     (-1109 / +9535)   / 975000  |
|     1201 |   12.0 |   447 GB |    18 GB -> 15 GB    (-2.8 GB / 1.2 GB)  |    19 GB -> 17 GB    (-3.0 GB / 1.2 GB)  |   14148 -> 10033     (-4115 / +4853)   / 480000  |
|     1202 |   12.0 |   447 GB |    17 GB -> 15 GB    (-2.0 GB / 1.6 GB)  |    19 GB -> 16 GB    (-2.2 GB / 1.7 GB)  |   13243 -> 10055     (-3188 / +4660)   / 480000  |
|     1203 |   12.0 |   447 GB |    17 GB -> 15 GB    (-2.0 GB / 2.3 GB)  |    19 GB -> 16 GB    (-2.3 GB / 2.4 GB)  |   12746 -> 10070     (-2676 / +4682)   / 480000  |
|     1204 |   12.0 |   447 GB |    18 GB -> 15 GB    (-2.7 GB / 2.1 GB)  |    19 GB -> 16 GB    (-2.8 GB / 2.2 GB)  |   12835 -> 10110     (-2725 / +5511)   / 480000  |
|     1212 |   12.0 |   366 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.0 MB -> 6.0 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 390000  |
|     1221 |   12.0 |   931 GB |    20 GB -> 21 GB     (620 MB / 6.3 GB)  |    21 GB -> 21 GB     (805 MB / 6.7 GB)  |   13115 -> 12542      (-573 / +9389)   / 975000  |
|     1222 |   12.0 |   931 GB |    22 GB -> 21 GB    (-981 MB / 2.9 GB)  |    22 GB -> 21 GB    (-1004 MB / 3.0 GB)  |   12938 -> 12793      (-145 / +8795)   / 975000  |
|     1223 |   12.0 |   931 GB |    22 GB -> 21 GB    (-1.1 GB / 5.9 GB)  |    22 GB -> 21 GB    (-1.1 GB / 6.1 GB)  |   13968 -> 12698     (-1270 / +10094)  / 975000  |
|     1224 |   12.0 |   931 GB |    21 GB -> 21 GB    (-791 MB / 4.5 GB)  |    22 GB -> 21 GB    (-758 MB / 4.7 GB)  |   13741 -> 12684     (-1057 / +8616)   / 975000  |
|     1225 |   12.0 |   931 GB |    21 GB -> 21 GB    (-671 MB / 4.8 GB)  |    22 GB -> 21 GB    (-677 MB / 4.9 GB)  |   13608 -> 12690      (-918 / +8559)   / 975000  |
|     1226 |   12.0 |   931 GB |    22 GB -> 21 GB    (-1.1 GB / 6.2 GB)  |    22 GB -> 21 GB    (-1.1 GB / 6.4 GB)  |   13066 -> 12737      (-329 / +9386)   / 975000  |
|     1301 |   13.0 |   447 GB |    13 GB -> 15 GB     (2.6 GB / 4.5 GB)  |    14 GB -> 17 GB     (2.7 GB / 4.6 GB)  |    7244 -> 10077     (+2833 / +6714)   / 480000  |
|     1302 |   13.0 |   447 GB |    12 GB -> 15 GB     (3.0 GB / 4.9 GB)  |    13 GB -> 17 GB     (3.2 GB / 5.2 GB)  |    7507 -> 10056     (+2549 / +7011)   / 480000  |
|     1303 |   13.0 |   447 GB |    14 GB -> 15 GB     (1.3 GB / 3.2 GB)  |    15 GB -> 17 GB     (1.3 GB / 3.3 GB)  |    7888 -> 10020     (+2132 / +6926)   / 480000  |
|     1304 |   13.0 |   447 GB |    13 GB -> 15 GB     (2.7 GB / 4.7 GB)  |    14 GB -> 17 GB     (2.8 GB / 4.9 GB)  |    7660 -> 10075     (+2415 / +7049)   / 480000  |
|     1311 |   13.0 |   466 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.1 MB -> 6.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 495000  |
|     1312 |   13.0 |   366 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.1 MB -> 6.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 390000  |
|     1321 |   13.0 |   931 GB |    21 GB -> 21 GB    (-200 MB / 4.1 GB)  |    21 GB -> 21 GB    (-192 MB / 4.3 GB)  |   13365 -> 12690      (-675 / +9527)   / 975000  |
|     1322 |   13.0 |   931 GB |    22 GB -> 21 GB    (-1.3 GB / 6.9 GB)  |    23 GB -> 21 GB    (-1.3 GB / 7.2 GB)  |   12749 -> 12698       (-51 / +10047)  / 975000  |
|     1323 |   13.0 |   931 GB |    21 GB -> 21 GB    (-495 MB / 6.1 GB)  |    22 GB -> 21 GB    (-504 MB / 6.3 GB)  |   13386 -> 12693      (-693 / +9524)   / 975000  |
|     1325 |   13.0 |   931 GB |    21 GB -> 21 GB    (-620 MB / 6.6 GB)  |    22 GB -> 21 GB    (-612 MB / 6.9 GB)  |   13113 -> 12768      (-345 / +9942)   / 975000  |
|     1326 |   13.0 |   931 GB |    21 GB -> 21 GB    (-498 MB / 7.1 GB)  |    22 GB -> 21 GB    (-414 MB / 7.4 GB)  |   13690 -> 12697      (-993 / +9759)   / 975000  |
|     1401 |   14.0 |   223 GB |   8.3 GB -> 7.6 GB   (-670 MB / 950 MB)  |   9.3 GB -> 8.5 GB   (-789 MB / 993 MB)  |    3470 -> 5061      (+1591 / +3262)   / 240000  |
|     1402 |   14.0 |   447 GB |   9.8 GB -> 15 GB     (5.6 GB / 7.1 GB)  |    11 GB -> 17 GB     (5.8 GB / 7.5 GB)  |    4358 -> 10052     (+5694 / +7092)   / 480000  |
|     1403 |   14.0 |   224 GB |   8.2 GB -> 7.6 GB   (-619 MB / 730 MB)  |   9.3 GB -> 8.5 GB   (-758 MB / 759 MB)  |    4547 -> 5023       (+476 / +2567)   / 240000  |
|     1404 |   14.0 |   224 GB |   8.4 GB -> 7.6 GB   (-790 MB / 915 MB)  |   9.4 GB -> 8.5 GB   (-918 MB / 946 MB)  |    4369 -> 5062       (+693 / +2483)   / 240000  |
|     1411 |   14.0 |   466 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.0 MB -> 6.0 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 495000  |
|     1412 |   14.0 |   366 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.0 MB -> 6.0 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 390000  |
|     1421 |   14.0 |   931 GB |    19 GB -> 21 GB     (2.0 GB / 6.8 GB)  |    19 GB -> 21 GB     (2.1 GB / 7.0 GB)  |   10670 -> 12695     (+2025 / +10814)  / 975000  |
|     1422 |   14.0 |   931 GB |    19 GB -> 21 GB     (1.6 GB / 7.4 GB)  |    20 GB -> 21 GB     (1.7 GB / 7.7 GB)  |   10653 -> 12702     (+2049 / +10414)  / 975000  |
|     1423 |   14.0 |   931 GB |    19 GB -> 21 GB     (2.0 GB / 7.4 GB)  |    19 GB -> 21 GB     (2.1 GB / 7.8 GB)  |   10715 -> 12683     (+1968 / +10418)  / 975000  |
|     1424 |   14.0 |   931 GB |    18 GB -> 21 GB     (2.2 GB / 8.0 GB)  |    19 GB -> 21 GB     (2.3 GB / 8.3 GB)  |   10723 -> 12824     (+2101 / +9573)   / 975000  |
|     1425 |   14.0 |   931 GB |    19 GB -> 21 GB     (1.3 GB / 5.8 GB)  |    20 GB -> 21 GB     (1.4 GB / 6.1 GB)  |   10702 -> 12686     (+1984 / +10231)  / 975000  |
|     1426 |   14.0 |   931 GB |    20 GB -> 21 GB     (1.0 GB / 6.5 GB)  |    20 GB -> 21 GB     (1.2 GB / 6.8 GB)  |   10737 -> 12650     (+1913 / +10974)  / 975000  |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|       45 |    4.0 |    29 TB |   652 GB -> 653 GB    (1.2 GB / 173 GB)  |   686 GB -> 687 GB    (1.2 GB / 180 GB)  |  412818 -> 412818       (+0 / +288439) / 30885000 |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------

This time the total amount of data to be moved is 180 GB. It’s possible to have a difference of an order of magnitude in the total data to be moved between -c 0 and -c 10. Usually the best results are achieved by using -F directly, with rare occasions requiring a full rebalance (i.e. no -F and higher -c values).

18.13.1. Balancer tool output

Here’s an example of the output of the balancer tool, in non-verbose mode:

 1 -== BEFORE BALANCE ==-
 2 shards with decreased redundancy 0 (0, 0, 0)
 3 server constraint violations 0
 4 stripe constraint violations 6652
 5 placement group violations 1250
 6 pg hdd score 0.6551, objectsScore 0.0269
 7 pg ssd score 0.6824, objectsScore 0.0280
 8 pg hdd estFree  45T
 9 pg ssd estFree  19T
10 Constraint violations detected, doing a replication-restore update first
11 server constraint violations 0
12 stripe constraint violations 7031
13 placement group violations 0
14 -== POST BALANCE ==-
15 shards with decreased redundancy 0 (0, 0, 0)
16 server constraint violations 0
17 stripe constraint violations 6592
18 placement group violations 0
19 moves 14387, (1864GiB) (tail ssd 14387)
20 pg hdd score 0.6551, objectsScore 0.0269, maxDataToSingleDrive 33 GiB
21 pg ssd score 0.6939, objectsScore 0.0285, maxDataToSingleDrive 76 GiB
22 pg hdd estFree  47T
23 pg ssd estFree  19T

The run of the balancer tool has multiple steps.

First, it shows the current state of the system (lines 2-9):

  • Shards (volume pieces) with decreased redundancy.

  • Server constraint violations mean that there are pieces of data which have two or more of their copies on the same server. This is an error condition.

  • “Stripe constraint violations” mean that specific pieces of data are not optimally striped across the drives of a specific server. This is NOT an error condition.

  • “Placement group violations” indicate an error condition (in most cases caused by a missing drive, as noted below).

  • Lines 6 and 7 show the current average “score” (usage in %) of the placement groups, for data and objects;

  • Lines 8 and 9 show the estimated free space for the placement groups.

Then, in this run it has detected problems (in this case placement group violations, which in most cases are caused by a missing drive) and has done a pre-run to correct the redundancy (line 10), printing the resulting state again on lines 11-13.

And last, it runs the balancing and reports the results. The main difference here is that for the placement groups it also reports the maximum amount of data that will be added to a single drive. As the balancing happens in parallel on all drives, this is a handy measure of how long the rebalance would take (compared with a different balancing run that might not add that much data to a single drive).

18.14. Errors from the balancer tool

If the balancer tool doesn’t complete successfully, its output MUST be examined and the root cause fixed.

18.15. Miscellaneous

If for any reason the currently running rebalancing operation needs to be paused, it can be done via storpool relocator off. In such cases StorPool Support should also be contacted, as this shouldn’t need to happen. Re-enabling it is done via storpool relocator on.
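
For example:

# storpool relocator off    # pause the currently running relocation
# storpool relocator on     # re-enable it later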

19. Troubleshooting

This part outlines the different states of a StorPool cluster, what should be expected in each of them, and the recommended steps to take. It is intended as a guideline for the operations team(s) maintaining the production system provided by StorPool.

19.1. Normal state of the system

The normal state of the StorPool storage system is when it is fully configured and up and running. This is the desired state of the system.

Characteristics of this state:

19.1.1. All nodes in the storage cluster are up and running

This can be checked by using the CLI with storpool service list on any node with access to the API service.

Note

The storpool service list command provides status for all services running cluster-wide, rather than only for the services running on the node itself.

19.1.2. All configured StorPool services are up and running

This is again easily checked with storpool service list. Recently restarted services can usually be spotted by their short uptime. A recently restarted service should be taken seriously if the reason for the restart is unknown, even if the service is running at the moment, as in the example with client ID 37 below:

# storpool service list
cluster running, mgmt on node 2
      mgmt   1 running on node  1 ver 20.00.18, started 2022-09-08 19:28:37, uptime 144 days 22:45:51
      mgmt   2 running on node  2 ver 20.00.18, started 2022-09-08 19:27:18, uptime 144 days 22:47:10 active
    server   1 running on node  1 ver 20.00.18, started 2022-09-08 19:28:59, uptime 144 days 22:45:29
    server   2 running on node  2 ver 20.00.18, started 2022-09-08 19:25:53, uptime 144 days 22:48:35
    server   3 running on node  3 ver 20.00.18, started 2022-09-08 19:23:30, uptime 144 days 22:50:58
    client   1 running on node  1 ver 20.00.18, started 2022-09-08 19:28:37, uptime 144 days 22:45:51
    client   2 running on node  2 ver 20.00.18, started 2022-09-08 19:25:32, uptime 144 days 22:48:56
    client   3 running on node  3 ver 20.00.18, started 2022-09-08 19:23:09, uptime 144 days 22:51:19
    client  21 running on node 21 ver 20.00.18, started 2022-09-08 19:20:26, uptime 144 days 22:54:02
    client  22 running on node 22 ver 20.00.18, started 2022-09-08 19:19:26, uptime 144 days 22:55:02
    client  37 running on node 37 ver 20.00.18, started 2022-09-08 13:08:12, uptime 05:06:16

19.1.3. Working cgroup memory and cpuset isolation is properly configured

Use the storpool_cg tool with the check argument to ensure everything is as expected. The tool should not return any warnings. For more information, see Control groups.

When properly configured, the sum of all memory limits on the node is less than the available memory on the node. This protects the running kernel from memory shortage, as well as all processes in the storpool.slice memory cgroup, which ensures the stability of the storage service.
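
For example, on a correctly configured node the following check should complete without warnings:

# storpool_cg check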

19.1.4. All network interfaces are properly configured

All network interfaces used by StorPool are up and properly configured with hardware acceleration enabled (where applicable). All network switches are configured with jumbo frames and flow control, and none of them experience any packet loss or delays. The output from storpool net list is a good start: all configured network interfaces should be seen as up, with the flags explained at the end of the output. The desired state is uU with a + at the end for each network interface; if hardware acceleration is supported on an interface, the A flag should also be present:

# storpool net list
------------------------------------------------------------
| nodeId | flags   | net 1             | net 2             |
------------------------------------------------------------
|     23 | uU + AJ | 1A:00:01:00:00:17 | 1E:00:01:00:00:17 |
|     24 | uU + AJ | 1A:00:01:00:00:18 | 1E:00:01:00:00:18 |
|     25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
|     26 | uU + AJ | 12:00:01:00:F0:1A | 16:00:01:00:F0:1A |
|     27 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
  u - packets recently received from this node
  d - no packets recently received from this node

  U - this node has enough votes to be in the quorum
  D - this node does not have enough votes to be in the quorum
  + - this node considers itself in the quorum
  B - the connection to this node is through a backup link; check the cabling
  A - this node is using hardware acceleration
  J - the node uses jumbo frames

  N - a non-voting node

19.1.5. All drives are up and running

All drives in use for the storage system are performing at their specified speed, are joined in the cluster and serving requests.

This could be checked with storpool disk list internal; for example, in a normally loaded cluster all drives will report low aggregate scores. Below is an example output (trimmed for brevity):

# storpool disk list internal
--------------------------------------------------------------------------------------------------------------------------------------------------------
| disk | server |        aggregate scores        |         wbc pages        |     scrub bw |                          scrub ETA | last scrub completed |
--------------------------------------------------------------------------------------------------------------------------------------------------------
| 2301 |   23.0 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2022-09-08 15:33:44 |
| 2302 |   23.0 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2022-09-08 15:28:48 |
| 2303 |   23.1 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2022-09-08 15:28:49 |
| 2304 |   23.1 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2022-09-08 15:28:50 |
| 2305 |   23.2 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2022-09-08 15:28:51 |
| 2306 |   23.2 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2022-09-08 15:28:51 |
| 2307 |   23.3 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2022-09-08 15:28:52 |
| 2308 |   23.3 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2022-09-08 15:28:53 |
| 2311 |   23.0 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2022-09-08 15:28:38 |
| 2312 |   23.0 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2022-09-08 15:28:43 |
| 2313 |   23.1 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2022-09-08 15:28:44 |
| 2314 |   23.1 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2022-09-08 15:28:45 |
| 2315 |   23.2 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2022-09-08 15:28:47 |
| 2316 |   23.2 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2022-09-08 15:28:39 |
| 2317 |   23.3 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2022-09-08 15:28:40 |
| 2318 |   23.3 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2022-09-08 15:28:42 |
[snip]

All drives are regularly scrubbed, so the number of errors for each drive should be stable (not increasing). The errors corrected for each drive are visible in the storpool disk list output; the time of the last completed scrub is visible in storpool disk list internal, as in the example above.

Note that some systems may have fewer than two network interfaces or a single backend switch. Even though this is not recommended, it is still possible and sometimes used (usually in a PoC or with a backup server) when the cluster is configured with a single-VLAN network redundancy scheme. A single-VLAN network redundancy configuration and an inter-switch connection are required for a cluster where only some of the nodes are connected to the cluster with a single interface.

If one or more of the conditions describing the state above is not met, the system should not be considered healthy. If there is any suspicion that the system is behaving erratically even though all of the above conditions are satisfied, the recommended steps to check whether everything is in order are:

  • Check top and look at the state of each of the configured storpool_* services running on the node. A properly running service is usually in the S (sleeping) state and only rarely seen in the R (running) state. The CPU usage is often reported as 100% when hardware sleep is enabled, due to the kernel misreporting it; the actual usage is much lower and can be tracked with cpupower monitor for the relevant CPU cores (see the sketch after this list).

  • Use the /usr/lib/storpool/sdump tool to ensure all services on this node are running correctly; it reports CPU and network usage statistics for the running services on the node. Use the -l option for the long names of the statistics.

  • On nodes with running workloads (such as VM instances or containers), iostat will show activity for processed requests on the block devices.

    The following example shows normal disk activity on a node running VM instances. Note that the usage may vary greatly depending on the workload. The command used in the example is iostat -xm 1 /dev/sp-* | egrep -v " 0[.,]00$", which prints statistics for the StorPool devices each second, excluding drives with no storage I/O activity:

    Device:  rrqm/s   wrqm/s  r/s     w/s      rMB/s   wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
    sp-0     0.00     0.00    0.00    279.00   0.00    0.14   1.00      3.87      13.80  0.00     13.80    3.55   98.99
    sp-11    0.00     0.00    165.60  114.10   19.29   14.26  245.66    5.97      20.87  9.81     36.91    0.89   24.78
    sp-12    0.00     0.00    171.60  153.60   19.33   19.20  242.67    9.20      28.17  10.46    47.96    1.08   35.01
    sp-13    0.00     0.00    6.00    40.80    0.04    5.10   225.12    1.75      37.32  0.27     42.77    1.06   4.98
    sp-21    0.00     0.00    0.00    82.20    0.00    1.04   25.90     1.00      12.08  0.00     12.08    12.16  99.99
    

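The first two checks above can be combined into a quick look at the node; a minimal sketch (the output format of both tools may vary between StorPool and kernel versions):

# cpupower monitor               # actual sleep-state residency/usage of the CPU cores, despite top showing 100%
# /usr/lib/storpool/sdump -l     # per-service CPU and network usage statistics, with long statistic names
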
19.1.6. There are no hanging active requests

The output of /usr/lib/storpool/latthreshold.py is empty - it shows no hanging requests and no service or disk warnings.

19.2. Degraded state

In this state some system components are not fully operational and need attention. Some examples of a degraded state are given below.

19.2.1. Degraded state due to service issues

A single storpool_server service on one of the storage nodes is not available or not joined in the cluster

Note that this concerns only pools with triple replication; for dual replication this is considered a critical state, because there are parts of the system with only one available copy. This is an example output from storpool service list:

# storpool service list
cluster running, mgmt on node 2
      mgmt   1 running on node  1 ver 20.00.18, started 2022-09-08 16:11:59, uptime 19:51:50
      mgmt   2 running on node  2 ver 20.00.18, started 2022-09-08 16:11:58, uptime 19:51:51 active
      mgmt   3 running on node  3 ver 20.00.18, started 2022-09-08 16:11:58, uptime 19:51:51
    server   1 down on node     1 ver 20.00.18
    server   2 running on node  2 ver 20.00.18, started 2022-09-08 16:12:03, uptime 19:51:46
    server   3 running on node  3 ver 20.00.18, started 2022-09-08 16:12:04, uptime 19:51:45
    client   1 running on node  1 ver 20.00.18, started 2022-09-08 16:11:59, uptime 19:51:50
    client   2 running on node  2 ver 20.00.18, started 2022-09-08 16:11:57, uptime 19:51:52
    client   3 running on node  3 ver 20.00.18, started 2022-09-08 16:11:57, uptime 19:51:52

If this is unexpected, i.e. no one has deliberately restarted or stopped the service for planned maintenance or upgrade, it is very important to first bring the service up and then to investigate the root cause of the service outage. When the storpool_server service comes back up, it will start recovering outdated data on its drives. The recovery process can be monitored with storpool task list, which shows which disks are recovering, as well as how much data is left to be recovered. Example output of storpool task list:

# storpool task list
----------------------------------------------------------------------------------------
|     disk |  task id |  total obj |  completed |    started |  remaining | % complete |
----------------------------------------------------------------------------------------
|     2303 | RECOVERY |          1 |          0 |          1 |          1 |         0% |
----------------------------------------------------------------------------------------
|    total |          |          1 |          0 |          1 |          1 |         0% |
----------------------------------------------------------------------------------------

Some of the volumes or snapshots will have the D flag (for degraded) visible in the storpool volume status output; it will disappear once all the data is fully recovered. Example situations include a node rebooted for a kernel or package upgrade where no kernel modules were installed for the new kernel, a service (in this example storpool_server) that was not configured to start on boot, and others.
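
To follow the recovery progress, the two commands can be combined; a minimal sketch, assuming the degraded flag appears as a separate D column in the status output (the exact column layout may differ between versions):

# storpool task list                   # remaining recovery tasks per disk
# storpool volume status | grep -w D   # volumes/snapshots that still carry the degraded flag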

Some of the configured StorPool services have failed or are not running

These could be:

  • The storpool_block service on some of the storage-only nodes, without any attached volumes or snapshots.

  • A single storpool_server service, or multiple instances on the same node; note again that this is critical for systems with dual replication.

  • A single API (storpool_mgmt) service, while another active API is still running.

The reasons for these could be the same as in the previous examples; usually the system log contains all the information needed to check why the service is not coming up (see the sketch below).
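
A quick way to look for the root cause is to check the state of the service and the system log around the time of the failure; a sketch using standard tools, assuming a systemd-based host and a storpool_server unit name (the actual unit names may differ per installation):

# systemctl status storpool_server   # current state and the last few log lines of the service (unit name assumed)
# journalctl -b -u storpool_server   # full log of the service since the last boot
# dmesg -T | tail -n 50              # recent kernel messages (OOM kills, hardware errors, and so on)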

19.2.2. Degraded state due to host OS misconfiguration

Some examples include:

  • Changes in the OS configuration after a system update

    This could prevent some of the services from running after a fresh boot, for instance due to changed names of the network interfaces used for the storage system after an upgrade, changed PCIe IDs for NVMe devices, and so on.

  • Kdump is no longer collecting kernel dump data properly

    If this occurs, it might be difficult to debug what caused a kernel crash.

Some of the above cases will be difficult to catch prior to booting with the new environment (for example, kernel or other updates), and sometimes they are only caught after an event that reveals the issue. Thus it is important to regularly test and ensure that the system is properly configured and that kernel dumps are collected normally.
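
A simple periodic check could verify that the crash-dump mechanism is armed; a sketch assuming a RHEL-family node (on Debian/Ubuntu the service is typically named kdump-tools):

# systemctl is-active kdump            # the kdump service should report "active"
# cat /sys/kernel/kexec_crash_loaded   # 1 means a crash kernel is loaded and ready to capture a dump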

19.2.3. Degraded state due to network interface issues

Some of the interfaces used by StorPool are not up.

This could be checked with storpool net list, like this:

# storpool net list
------------------------------------------------------------
| nodeId | flags   | net 1             | net 2             |
------------------------------------------------------------
|     23 | uU + AJ |                   | 1E:00:01:00:00:17 |
|     24 | uU + AJ |                   | 1E:00:01:00:00:18 |
|     25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
|     26 | uU + AJ | 12:00:01:00:F0:1A | 16:00:01:00:F0:1A |
|     27 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
  u - packets recently received from this node
  d - no packets recently received from this node

  U - this node has enough votes to be in the quorum
  D - this node does not have enough votes to be in the quorum
  + - this node considers itself in the quorum
  B - the connection to this node is through a backup link; check the cabling
  A - this node is using hardware acceleration
  J - the node uses jumbo frames

  N - a non-voting node

In the above example nodes 23 and 24 are not connected to the first network. This is the SP_IFACE1_CFG interface configuration in /etc/storpool.conf (check with storpool_showconf SP_IFACE1_CFG). Note that the beacons are up and running and the system is processing requests through the second network. Possible reasons include misconfigured interfaces, a wrong StorPool configuration, or a misconfigured backend switch/switches.
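
To narrow down the cause, compare what StorPool expects with the actual OS state of the interface; a sketch (the interface name is a placeholder taken from the resolved configuration):

# storpool_showconf SP_IFACE1_CFG   # the interface configuration StorPool uses for net 1
# ip -d link show dev <iface>       # carrier state, MTU, and VLAN details of that interface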

An interface qualified for hardware acceleration is running without it

This is once again checked with storpool net list:

# storpool net list
------------------------------------------------------------
| nodeId | flags   | net 1             | net 2             |
------------------------------------------------------------
|     23 | uU +  J | 1A:00:01:00:00:17 | 1E:00:01:00:00:17 |
|     24 | uU +  J | 1A:00:01:00:00:18 | 1E:00:01:00:00:18 |
|     25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
|     26 | uU + AJ | 12:00:01:00:F0:1A | 16:00:01:00:F0:1A |
|     27 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
  u - packets recently received from this node
  d - no packets recently received from this node

  U - this node has enough votes to be in the quorum
  D - this node does not have enough votes to be in the quorum
  + - this node considers itself in the quorum
  B - the connection to this node is through a backup link; check the cabling
  A - this node is using hardware acceleration
  J - the node uses jumbo frames

  N - a non-voting node

In the above example, nodes 23 and 24 are equipped with NICs qualified for hardware acceleration, but are running without it; possible reasons include a BIOS or OS misconfiguration, misconfigured kernel parameters on boot, or a network interface misconfiguration. Note that when a system was configured for hardware-accelerated operation the cgroups configuration was also sized accordingly, so running in this state is likely to cause performance issues due to fewer CPU cores being isolated and reserved for the NIC interrupts and the storpool_rdma threads.
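
As a first pass, the kernel boot parameters and the NIC driver in use can be compared against the values from when the node was originally configured; a sketch using generic tools (the interface name is a placeholder):

# cat /proc/cmdline      # kernel boot parameters actually in effect for this boot
# ethtool -i <iface>     # driver, firmware version, and bus address of the storage NIC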

Jumbo frames are expected, but not working on some of the interfaces

This can be seen with storpool net list; if one of the two networks has an MTU lower than 9k, the J flag will not be listed:

# storpool net list
-------------------------------------------------------------
| nodeId | flags    | net 1             | net 2             |
-------------------------------------------------------------
|     23 | uU + A   | 12:00:01:00:F0:17 | 16:00:01:00:F0:17 |
|     24 | uU + AJ  | 1A:00:01:00:00:18 | 1E:00:01:00:00:18 |
|     25 | uU + AJ  | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
|     26 | uU + AJ  | 1A:00:01:00:00:1A | 1E:00:01:00:00:1A |
-------------------------------------------------------------
Quorum status: 4 voting beacons up out of 4 expected
Flags:
  u - packets recently received from this node
  d - no packets recently received from this node

  U - this node has enough votes to be in the quorum
  D - this node does not have enough votes to be in the quorum
  M - this node is being damped by the rest of the nodes in the cluster
  + - this node considers itself in the quorum
  B - the connection to this node is through a backup link; check the cabling
  A - this node is using hardware acceleration
  J - the node uses jumbo frames

  N - a non-voting node

If the node is not expected to be running without jumbo frames, this might be an indication of a misconfigured interface or an issue with applying the interface configuration on boot. Note that an OS interface configured for jumbo frames without the switch port being properly configured leads to severe performance issues.
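
Whether jumbo frames work end to end can be verified from the OS with a do-not-fragment ping carrying a near-9000-byte payload towards another node on the same storage network; a sketch (the interface name and the peer address are placeholders):

# ip link show dev <iface> | grep mtu    # the MTU currently configured on the interface
# ping -M do -s 8972 -c 3 <peer-ip>      # 8972 bytes of payload + 28 bytes of headers = 9000; failures point to a non-jumbo path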

Some network interfaces are experiencing network loss or delays on one of the networks

This might affect the latency of some storage operations. Depending on the node where the losses occur, it might affect a single client, or the whole cluster if the packet loss or delays are happening on a server node. Statistics for all interfaces per service are collected in the analytics platform (https://analytics.storpool.com) and can be used to investigate network performance issues; the /usr/lib/storpool/sdump tool prints the same statistics on each of the nodes with running services. The usual causes of packet loss are listed below, followed by a sketch for checking the interface counters:

  • Hardware issues (cables, SFPs, and so on).

  • Floods and DDoS attacks “leaking” into the storage network due to misconfiguration.

  • Saturation of the CPU cores that handle the interrupts for the network cards and others when hardware acceleration is not available.

  • Network loops leading to saturated switch ports or overloaded NICs.
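
A sketch for a first look at where the loss occurs, using generic Linux tools on each node (the interface name is a placeholder; NIC counter names vary by driver):

# ip -s link show dev <iface>                        # RX/TX errors and dropped packets as seen by the OS
# ethtool -S <iface> | egrep -i 'drop|discard|err'   # NIC/driver statistics for drops, discards, and errors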

19.2.4. Drive/Controller issues

One or more HDD or SSD drives are missing from a single server in the cluster or from servers in the same fault set

Attention

This concerns only pools with triple replication; for dual replication this is considered a critical state.

The missing drives may be seen using storpool disk list or storpool server <serverID> disk list; for example, in this output disk 543 is missing from the server with ID 54:

# storpool server 54 disk list
disk  |   server  | size    |   used  |  est.free  |   %     | free entries | on-disk size |  allocated objects |  errors |  flags
541   |       54  | 207 GB  |  61 GB  |    136 GB  |   29 %  |      713180  |       75 GB  |   158990 / 225000  |   0 / 0 |
542   |       54  | 207 GB  |  56 GB  |    140 GB  |   27 %  |      719526  |       68 GB  |   161244 / 225000  |   0 / 0 |
543   |       54  |      -  |      -  |         -  |    - %  |           -  |           -  |        - / -       |    -
544   |       54  | 207 GB  |  62 GB  |    134 GB  |   30 %  |      701722  |       76 GB  |   158982 / 225000  |   0 / 0 |
545   |       54  | 207 GB  |  61 GB  |    135 GB  |   30 %  |      719993  |       75 GB  |   161312 / 225000  |   0 / 0 |
546   |       54  | 207 GB  |  54 GB  |    142 GB  |   26 %  |      720023  |       68 GB  |   158481 / 225000  |   0 / 0 |
547   |       54  | 207 GB  |  62 GB  |    134 GB  |   30 %  |      719996  |       77 GB  |   179486 / 225000  |   0 / 0 |
548   |       54  | 207 GB  |  53 GB  |    143 GB  |   26 %  |      718406  |       70 GB  |   179038 / 225000  |   0 / 0 |

The usual reason is that the drive was ejected from the cluster due to a write error, either by the kernel or by the running storpool_server instance. More information may be found using dmesg | tail and in the system log. The model and the serial number of the failed drive are shown by storpool disk list info.
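
A short sketch for gathering the basic facts about an ejected drive (the grep pattern is only an example):

# storpool disk list info                                # model and serial number of the drives, including the ejected one
# dmesg -T | egrep -i 'error|fail|reset' | tail -n 20    # recent kernel-side I/O errors or controller resets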

Under normal conditions the server will flag the disk to be re-tested and will eject it for a quick test. Provided the disk is still working correctly and the test results do not breach any thresholds, the disk will be returned into the cluster to recover. Such a case might happen, for example, if the stalled request was caused by an intermittent issue, like a reallocated sector.

If the disk breaches any sane latency or bandwidth thresholds, it will not be automatically returned and will have to be re-balanced out of the cluster. Such disks are marked as “bad” (more details available at the storpool_initdisk options).

When one or more drives are ejected (already marked as bad) and missing, multiple volumes and/or snapshots will be listed with the D flag (D as Degraded) in the output of storpool volume status, due to the missing replicas for some of the data. This is normal and expected; the options in this situation are the following:

  • The drive could still be working properly (for example, a set of bad sectors was reallocated) even after it was tested. In order to re-test it, you could mark the drive as --good (more info on how at the storpool_initdisk options) and attempt to get it back into the cluster.

  • On some occasions a disk might have lost its signatures and would have to be returned to the cluster to recover from scratch. It will be automatically re-tested upon the attempt; a full (read-write) stress test is recommended beforehand to ensure it is working correctly (fio is a good tool for this kind of test, check its --verify option; see the sketch after this list). If the stress test is successful (i.e. the drive has been written to and verified successfully), the drive may be reinitialized with storpool_initdisk using the same disk ID it had before. This will automatically return it to the cluster, and it will fully recover all data from scratch as if it were brand new.

  • The drive has failed irrecoverably and a replacement is available. The replacement drive is initialized with the diskID of the failed drive with storpool_initdisk. After returning it to the cluster it will fully recover all the data from the live replicas (please check 18.  Rebalancing the cluster for more).

  • A replacement is not available. The only option is to re-balance the cluster without this drive (more details in 18.  Rebalancing the cluster).
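
For the re-test case above, a full read-write verification pass with fio could look like the sketch below. This destroys all data on the drive; /dev/sdX is a placeholder for the drive under test, and the block size and queue depth are only reasonable defaults:

# fio --name=verify-disk --filename=/dev/sdX --direct=1 --ioengine=libaio \
      --rw=write --bs=1M --iodepth=32 --verify=crc32c --verify_fatal=1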

Attention

Beware that in some cases with very full clusters it might be impossible to get the cluster back to full redundancy without overfilling some of the remaining drives. See the next section.

Some of the drives in the cluster are beyond 90% full (up to 96%)

With proper planning this should rarely be an issue. A way to avoid it is to add more drives, or an additional server node with a full set of drives, to the cluster. Another option is to remove unused volumes or snapshots.

The storpool snapshot space command returns information about the space referred by each snapshot on the underlying drives. Note that snapshots with a negative value in their “used” column will not free up any space if they are removed and will remain in the deleting state, because they are parents of multiple cloned child volumes.

Note that depending on the speed with which the cluster is being populated with data by the end users this might also be considered a critical state.

Some of the drives have fewer than 140k free entries (alert for an overloaded system)

This may be observed in the output of storpool disk list or storpool server <serverID> disk list; an example from the latter is shown below:

# storpool server 23 disk list
  disk  |  server  |    size    |    used    |  est.free  |      %  |  free entries  |  on-disk size  |  allocated objects |  errors |  flags
  2301  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |       3719930  |       660 KiB  |       17 / 930000  |   0 / 0 |
  2302  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |       3719929  |       668 KiB  |       17 / 930000  |   0 / 0 |
  2303  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |       3719929  |       668 KiB  |       17 / 930000  |   0 / 0 |
  2304  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |       3719931  |       668 KiB  |       17 / 930000  |   0 / 0 |
  2306  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |       3719932  |       664 KiB  |       17 / 930000  |   0 / 0 |
  2305  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |       3719930  |       660 KiB  |       17 / 930000  |   0 / 0 |
  2307  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |         19934  |       664 KiB  |       17 / 930000  |   0 / 0 |
--------------------------------------------------------------------------------------------------------------------------------------
     7  |     1.0  |   6.1 TiB  |    18 GiB  |   5.9 TiB  |    0 %  |      26039515  |       4.5 MiB  |      119 / 6510000 |   0 / 0 |

This usually happens after the system has been loaded for longer periods of time with a sustained write workload on one or multiple volumes. If this is unexpected and the reason is an erratic workload, the recommended way to handle it is to set a limit (bandwidth, IOPS, or both) on the loaded volumes, for example with storpool volume <volumename> bw 100M iops 1000. The same could be set for multiple volumes/snapshots in a template with storpool template <templatename> bw 100M iops 1000 propagate. Please note that propagating changes for templates with a very large number of volumes and snapshots might not work. If the overloaded state is due to a normally occurring workload, it is best to expand the system with more drives and/or reformat the drives with a larger number of entries (this relates mainly to HDD drives). The latter case might be caused by a low number of hard drives in an HDD-only or a hybrid pool, and rarely by overloaded SSDs.

Another case related to overloaded drives is when many volumes are created from the same template; this requires overrides in order to shuffle the objects where the journals reside, so that the same triplet of disks is not overloaded when all virtual machines spike for some reason (e.g. unattended upgrades, a syslog-intensive cron job, etc.).

A couple of notes on the degraded states: apart from the notes about replication above, none of these should affect the stability of the system at this point. For the example with the missing disk in a hybrid system with a single failed SSD, all read requests on volumes with triple replication that have data on the failed drive will be served from some of the redundant copies on HDDs. This could slightly raise the read latencies for the operations on the parts of the volumes that were on this exact SSD. This is usually negligible in medium to large systems; for instance, in a cluster with 20 SSDs or NVMe drives these are 1/20th of all read operations in the cluster. In the case of two replicas on SSDs and a third replica on HDDs there is no read latency penalty whatsoever, which is also the case for missing hard drives - they will not affect the system at all, and in fact some write operations are even faster, because they are not waiting for the missing drive.

19.3. Critical state

This is an emergency state that requires immediate attention and intervention from the operations team and/or StorPool Support. Some of the conditions that could lead to this state are:

  • Partial or complete network outage.

  • Power loss for some nodes in the cluster.

  • Memory shortage leading to a service failure due to missing or incomplete cgroups configuration.

The following states are an indication for critical conditions:

19.3.1. API service failure

API not reachable on any of the configured nodes (the ones running the storpool_mgmt service)

Requests to the API from any of the nodes configured to access it either stall or cannot reach a working service. This is a critical state because the status of the cluster is unknown (it might be down, for that matter).

This might be caused by:

  • Misconfigured network for accessing the floating IP address - the address may be obtained by storpool_showconf http on any of the nodes with a configured storpool_mgmt service in the cluster:

    # storpool_showconf http
    SP_API_HTTP_HOST=10.3.10.78
    SP_API_HTTP_PORT=81
    
  • Failed interfaces on the hosts that have the storpool_mgmt service running. To find the interface on which the StorPool API should be running, use storpool_showconf api_iface:

    # storpool_showconf api_iface
    SP_API_IFACE=bond0.410
    

    It is recommended to have the API on a redundant interface (e.g. an active-backup bond interface). Note that even without an API, provided the cluster is in quorum, there should be no impact on any running operations, but changes in the cluster (like creating/attaching/detaching/deleting volumes or snapshots) will be impossible. Running with no API in the cluster triggers a highest-severity alert to StorPool Support (essentially a wake-up alert) due to the unknown state of the system.

  • The cluster is not in quorum

    The cluster is in this state if the number of running voting storpool_beacon services is less than half of the expected nodes plus one ((expected / 2) + 1). The configured number of expected nodes in the cluster may be checked with storpool_showconf expected; it is generally the number of server nodes (except when client nodes are configured as voting for some reason). In a system with 6 servers, at least 4 voting beacons should be available to bring the cluster back to the running state:

    # storpool_showconf expected
    SP_EXPECTED_NODES=6
    

    The current number of expected votes and the number of voting beacons are displayed in the output of storpool net list; check the example above (the Quorum status: line).

API requests are not returning for more than 30-60 seconds (e.g. storpool volume status, storpool snapshot space, storpool disk list, etc.)

These API requests collect data from the running storpool_server services on each server node. Possible reasons are:

  • Network loss or delays;

  • Failing storpool_server services;

  • Failing drives or hardware (CPU, memory, controllers, etc.);

  • Overload

19.3.2. Server service failure

Two storpool_server services or whole servers are down

Two storpool_server services or whole servers in different fault sets are down or not joined in the cluster. This is a very risky state, because there are parts of the volumes with only one live replica; if the latest writes land on a drive returning an I/O error or broken data (detected by StorPool), this will lead to data loss.

As in the degraded state, some of the read operations for parts of the volumes will be served from HDDs in a hybrid system, which might raise read latencies. In this state it is very important to bring back the missing services/nodes as soon as possible, because a failure of any of the remaining drives in other nodes or another fault set will bring some of the volumes into the down state and might lead to data loss if an error is returned by a drive holding the latest writes.

More than two storpool_server services or whole servers are down

This state results in some volumes being in the down state (see storpool volume status), because some parts of their data are only on the missing drives. The recommended action in this case is to check for the reasons for the degraded services or missing (unresponsive) nodes and get them back up.

Possible reasons are:

  • Lost network connectivity

  • Severe packet loss/delays/loops

  • Partial or complete power loss

  • Hardware instabilities, overheating

  • Kernel or other software instabilities, crashes

19.3.3. Client service failure

If the client service (storpool_block) is down on some of the nodes that depend on it (these could be either client-only or converged hypervisor nodes), all requests on that particular node will stall until the service is back up.

Possible reasons are again:

  • Lost network connectivity

  • Severe packet loss/delays/loops

  • Bugs in the storpool_block service or the storpool_bd kernel module

In case of power loss or kernel crashes any virtual machine instances that were running on this node could be started on other available nodes.

19.3.4. Network interface or Switch failure

This means that the networks used for StorPool are down or are experiencing heavy packet loss or delays. In this case the quorum service will prevent a split-brain situation and will restart all services to ensure the cluster is fully connected on at least one network before it transitions again to the running state. Such issues might be alleviated by a single-VLAN setup when different nodes have partial network connectivity, but severe packet loss will again cause severe delays.

19.3.5. Hard Drive/SSD failures

Drives from two or more different nodes (fault sets) in the cluster are missing (or from a single node/fault set for systems with dual replication pools)

In this case multiple volumes may either experience degraded performance (hybrid placement) or be in the down state when more than two replicas are missing. All operations on volumes in the down state are stalled until the redundancy is restored (i.e. at least one replica is available). The recommended steps are to immediately check for the reasons for the missing drives/services/nodes and return them to the cluster as soon as possible.

Some of the drives are more than 97% full

At some point all cluster operations will stall until either some of the data in the cluster is deleted or new drives/nodes are added. Adding drives requires the new drives to be stress-tested and the system to be re-balanced to include them, which should be carefully planned (details in 18.  Rebalancing the cluster).

Note

Cleaning up snapshots that have multiple cloned volumes and a negative value for used space in the output of storpool snapshot space will not free up any space.

Some of the drives have fewer than 100k free entries

This is usually caused by a heavily overloaded system. In this state the latencies for some operations might become very high (measured in seconds). Possible reasons are severely overloaded volumes for long periods of time without any configured bandwidth or IOPS limits. This can be checked by using iostat to look for volumes that are constantly 100% loaded with a large number of requests to the storage system. Another way to check for such volumes is to use the “Top volumes” view in the analytics to get information about the most loaded volumes, and to apply IOPS and/or bandwidth limits accordingly. Other causes are misbehaving (underperforming) drives or misbehaving HBA/SAS controllers. The recommended way to deal with these cases is to investigate for such drives: check the output of storpool disk list internal for higher aggregate scores on some drives or sets of drives (e.g. on the same server), or use the analytics to check for abnormal latency on some of the backend nodes (i.e. drives with significantly higher operation latency compared to other drives of the same type). An example would be a failing controller causing the SATA speed to degrade to SATA 1.0 (1.5 Gb/s) instead of SATA 3.0 (6 Gb/s), worn-out batteries on a RAID controller whose cache is used to accelerate the writes on the HDDs, and others.

The circumstances leading a system to the critical state are rare and are usually preventable by taking measures to handle all issues at the first signs of a change from the normal to the degraded state.

In any of the above cases, if you feel that something is not as expected, a consultation with StorPool Support is the best course of action. StorPool Support receives notifications for all the cases described above and proactively takes action to prevent a system from going into a degraded or critical state as soon as practically possible.

19.3.6. Hanging requests in the cluster

The output of /usr/lib/storpool/latthreshold.py shows hanging requests and/or missing services as in the example below:

disk | reported by | peers                      |  s |   op  |      volume |                              requestId
-------------------------------------------------------------------------------------------------------------------
-    | client 2    | client 2    -> server 2.1  | 15 | read  | volume-name | 9223936579889289248:270215977642998472
-    | client 2    | client 2    -> server 2.1  | 15 | read  | volume-name | 9223936579889289248:270497452619709333
-    | client 2    | client 2    -> server 2.1  | 15 | read  | volume-name | 9223936579889289248:270778927596419790
-    | client 2    | client 2    -> server 2.1  | 15 | write | volume-name | 9223936579889289248:271060402573130531
-    | client 2    | client 2    -> server 2.1  | 15 | read  | volume-name | 9223936579889289248:271341877549841211
-    | client 2    | client 2    -> server 2.1  | 15 | read  | volume-name | 9223936579889289248:271623352526551744
-    | client 2    | client 2    -> server 2.1  | 15 | write | volume-name | 9223936579889289248:271904827503262450
server 2.1  connection status: established no_data timeout
disk 202 EXPECTED_UNKNOWN server 2.1

This could be caused by CPU starvation, hardware resets, misbehaving disks or network, or stalled services. The disk field in the output and the service warnings after the requests table can be used as an indicator of the misbehaving component.

Note that the active requests API call has a timeout for each service to respond. The default timeout the latthreshold tool uses is 10 seconds. This value can be altered with latthreshold's --api-requests-timeout/-A option, passing it a numeric value with a time unit (m, s, ms or us), e.g. 100ms; see the example below.
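
A minimal example, using the option described above with a 100 ms timeout:

# /usr/lib/storpool/latthreshold.py --api-requests-timeout 100ms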

Service connection will have one of the following statuses:

  • established done - service reported its active requests as expected; this is not displayed in the regular output, only with --json

  • not_established - did not make a connection with the service - this could indicate the server is down, but may also indicate the service version is too old or its stream was overfilled or not connected

  • established no_data timeout - service did not respond and the connection was closed because the timeout was reached

  • established data timeout - service responded but the connection was closed because the timeout was reached before it could send all the data

  • established invalid_data - a message the service sent had invalid data in it

The latthreshold tool also reports disk statuses, which will be one of the following:

  • EXPECTED_MISSING - the service response was good, but did not provide information about the disk

  • EXPECTED_NO_CONNECTION_TO_PEER - the connection to the service was not established

  • EXPECTED_NO_PEER - the service is not present

  • EXPECTED_UNKNOWN - the service response was invalid or a timeout occurred