StorPool User Guide 18

Document version 2018-05-31

1. StorPool Overview

StorPool is distributed storage software. It pools the attached storage (hard disks or SSDs) of standard servers to create a single pool of shared storage. The StorPool software is installed on each server in the cluster. It combines the performance and capacity of all drives attached to the servers into one global namespace.
StorPool version 18.02 was released in November 2018. The new version of the fast distributed block storage incorporates industry-leading capabilities, providing everything needed for an efficient, reliable and scalable storage solution for your business.

StorPool provides standard block devices. You can create one or more volumes through its sophisticated volume manager. StorPool is compatible with ext4 and XFS file systems and with any system designed to work with a block device, e.g. databases and cluster file systems (like OCFS and GFS). StorPool can also be used with no file system, for example when using volumes to store VM images directly or as LVM physical volumes.

Redundancy is provided by multiple copies (replicas) of the data written synchronously across the cluster. Users may set the number of replication copies. We recommend 3 copies as a standard and 2 copies for data that is less critical. The replication level directly correlates with the number of servers that may be down without interruption in the service: for replication 3, up to 2 servers may be down simultaneously without losing access to the data.

StorPool protects data and guarantees its integrity by a 64-bit checksum and version for each sector maintained by StorPool. StorPool provides a very high degree of flexibility in volume management. Unlike other storage technologies, such as RAID or ZFS, StorPool does not rely on device mirroring (pairing drives for redundancy). So every disk that is added to a StorPool cluster adds capacity to the cluster, not just for new data but also for existing data. Provided that there are sufficient copies of the data, drives can be added or taken away with no impact to the storage service. Unlike rigid systems like RAID, StorPool does not impose any strict hierarchical storage structure dictated by the underlying disks. StorPool simply creates a single pool of storage that utilises the full capacity and performance of a set of commodity drives.

2. Architecture

StorPool works on a cluster of servers in a distributed shared-nothing architecture. All functions are performed by all servers on an equal peer basis. It works on standard off-the-shelf servers running GNU/Linux.

Each storage node is responsible for data stored on its local hard drives. Storage nodes collaborate to provide the storage service. StorPool provides a shared storage pool combining all the available storage capacity. It uses synchronous replication across servers. The StorPool client communicates in parallel with all StorPool servers.

The software consists of two parts - a storage server and a storage client - that are installed on each physical server (host, node). Each host can be a storage server, a storage client, or both. To storage clients StorPool volumes appear as block devices under /dev/storpool/* and behave as normal disk devices. The data on the volumes can be read and written by all clients simultaneously; its consistency is guaranteed through a synchronous replication protocol. Volumes may be used by clients as they would use a local hard drive or disk array.

3. Feature Highlights

3.1. Scale-out, not Scale-Up

The StorPool solution is fundamentally about scaling out (scaling by adding more drives or nodes) rather than scaling up (adding capacity by replacing a storage box with a larger storage box). This means StorPool can scale independently in IOPS, storage space and bandwidth. There is no bottleneck or single point of failure. StorPool can grow without interruption and in small steps - one hard drive, one server and one network interface at a time.

3.2. High Performance

StorPool combines the IOPS performance of all drives in the cluster and optimizes drive access patterns to provide low latency and handling of storage traffic bursts. The load is distributed equally between all servers through striping and sharding.

3.3. High Availability and Reliability

StorPool uses a replication mechanism that slices and stores copies of the data on different servers. For primary, high performance storage this solution has many advantages compared to RAID systems and provides considerably higher levels of reliability and availability. In case of a drive, server, or other component failure, StorPool uses another copy of the data located on another server (or rack) and none of your data is lost or even temporarily unavailable.

3.4. Commodity Hardware

StorPool supports drives and servers in a vendor-agnostic manner, allowing you to avoid vendor lock-in. This allows the use of commodity hardware, while preserving reliability and performance requirements. Moreover, unlike RAID, StorPool is drive agnostic - you can mix drives of various types, make, speed or size in a StorPool cluster.

3.5. Shared Block Device

StorPool provides shared block devices with semantics identical to a shared iSCSI or FC disk array.

3.6. Co-existence with hypervisor software

StorPool can utilize repurposed existing servers and can co-exist with hypervisor software on the same server. This means that there is no dedicated hardware for storage, and growing an IaaS cloud solution is achieved by simply adding more servers to the cluster.

3.7. Compatibility

StorPool is compatible with 64-bit Intel and AMD based servers. We support all Linux-based hypervisors and hypervisor management software. Any Linux software designed to work with a shared storage solution such as an iSCSI or FC disk array will work with StorPool. StorPool guarantees the functionality and availability of the storage solution at the Linux block device interface.

3.8. CLI interface and API

StorPool provides an easy yet powerful command-line interface (CLI) tool for administration of the data storage solution. It is simple and user-friendly - making configuration changes, provisioning and monitoring fast and efficient. StorPool also provides a RESTful JSON API, exposing all the available functionality, so you can integrate it with any existing management system.

3.9. Reliable Support

StorPool comes with reliable dedicated support: remote installation and initial configuration by StorPool’s specialists; 24x7 support; software updates.

4. Hardware Requirements

All distributed storage systems are highly dependent on the underlying hardware. There are some aspects that will help achieve maximum performance with StorPool and are best considered in advance. Each node in the cluster can be used as a server, a client, or both; depending on the role, hardware requirements vary.

4.1. Minimum StorPool cluster

  • 3 industry-standard x86 servers;

  • any x86-64 CPU with 2 threads or more;

  • 32 GB ECC RAM per node (8+ GB used by StorPool);

  • any hard drive controller in JBOD mode;

  • 3x SATA2 hard drives;

  • dedicated 10GE LAN;

4.3. How StorPool relies on hardware

4.3.1. CPU

When the system load increases, CPUs can become saturated with system interrupts. To avoid the negative effects of this, StorPool’s server and client processes can be given one or more dedicated CPU cores. This significantly improves the overall performance and performance consistency.

4.3.2. RAM

ECC memory can detect and correct the most common kinds of in-memory data corruption, thus maintaining a memory system immune to single-bit errors. Using ECC memory is an essential requirement for improving the reliability of the node. In fact, StorPool is not designed to work with non-ECC memory.

4.3.3. Storage (HDDs / SSDs)

StorPool ensures the best drive utilization. Replication and data integrity are core functionality, so RAID controllers are not required and all storage devices can be connected as JBOD.

4.3.4. Network

StorPool is a distributed system, which means that the network is an essential part of it. Designed for efficiency, StorPool combines data transfers from other nodes in the cluster. This greatly improves data throughput, compared with access to local devices, even when they are SSDs.

4.4. Software Compatibility

4.4.1. Operating Systems

  • Linux

  • Windows and VMWare in our roadmap

4.4.2. File Systems

Developed and optimized for Linux, StorPool is best tested on CentOS and Ubuntu. Compatible with ext4 and XFS file systems and with any system designed to work with a block device, e.g. databases and cluster file systems (like GFS2 or OCFS2). StorPool can also be used with no file system, for example when using volumes to store VM images directly. StorPool is compatible with other technologies from the Linux storage stack, such as LVM, dm-cache/bcache, and LIO.
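As an illustration of this compatibility with the Linux storage stack, a StorPool volume can be initialized as an LVM physical volume like any other block device. This is only a sketch - the volume name test and the volume group name vg_test are examples, and the volume must already be created and attached to the node:

# pvcreate /dev/storpool/test
# vgcreate vg_test /dev/storpool/test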

4.4.3. Hypervisors & Cloud Management/Orchestration

  • KVM

  • LXC/Containers

  • OpenStack

  • OpenNebula

  • OnApp

  • CloudStack

  • any other technology compatible with the Linux storage stack.

5. Installation and Upgrade

Currently the installation and upgrade procedures are performed by the StorPool support team.

6. Configuration Guide

6.1. Minimal Node Configuration

To configure its nodes, StorPool uses a configuration file located at /etc/storpool.conf. Host-specific configuration can be placed in the /etc/storpool.conf.d/ directory. The minimum working configuration must specify the network interfaces, the number of expected nodes, the authentication token and the unique ID of each node, as in the following example:

#-
# Copyright (c) 2013 - 2017  StorPool.
# All rights reserved.
#

# Human-readable name of the cluster, usually in the form "Company Name"-"Location", e.g. StorPoolLab-Sofia
#
# Mandatory for the monitoring
SP_CLUSTER_NAME=  #<Company-Name-PoC>-<City-or-nearest-airport>

# Computed from the StorPool Support ID and consists of location and cluster separated by a dot, e.g. nzkr.b
#
# Mandatory since version 16.02
SP_CLUSTER_ID=  #Ask StorPool Support

# Interface for storpool communication
#
# Default: empty
SP_IFACE1_CFG=1:eth1.100:eth1:100:10.1.100.1:b:x:r
SP_IFACE2_CFG=1:eth2.100:eth2:100:10.2.100.1:b:x:r


# expected nodes for beacon operation
#
# !!! Must be specified !!!
#
SP_EXPECTED_NODES=3


# API authentication token
#
# 64bit random value
# generate for example with: 'od -vAn -N8 -tu8 /dev/random'
SP_AUTH_TOKEN=4306865639163977196


##########################


[spnode1.example.com]
SP_OURID = 1

6.2. Full Configuration Options List

The following is a complete list of the configuration options with short explanation for each of them.

6.2.1. Cluster name

Required for the pro-active monitoring performed by StorPool support team. Usually in the form <Company-Name>-<City-or-nearest-airport>:

SP_CLUSTER_NAME=StorPoolLab-Sofia

6.2.2. Cluster ID

The Cluster ID is computed from the StorPool Support ID and consists of two parts - location and cluster separated by a dot ("."). In this release each location consists of a single cluster. This will be extended to multiple clusters at a location in future releases:

SP_CLUSTER_ID=nzkr.b

6.2.3. Non-voting beacon node

Used for client-only nodes. The storpool_server service will refuse to start on a node with SP_NODE_NON_VOTING set. The default is 0:

SP_NODE_NON_VOTING=1

Attention

It is strongly recommended to configure SP_NODE_NON_VOTING in the per-host configuration sections in storpool.conf (see Per host configuration for more details).

6.2.4. Communication interface for StorPool cluster

It is recommended to have two dedicated network interfaces for communication between the nodes:

SP_IFACE1_CFG=1:eth1.100:eth1:100:10.1.100.1:b:x:r
SP_IFACE2_CFG=1:eth2.100:eth2:100:10.2.100.1:b:x:r

For a full explanation of all options, please check /usr/share/doc/storpool/examples/storpool.conf.example

6.2.5. Address for the API management (storpool_mgmt)

Used by the CLI. Multiple clients can send requests to the API simultaneously. The management service is usually started on one or more nodes in the cluster, with only one active at a time. By default it is bound to localhost:

SP_API_HTTP_HOST=127.0.0.1

For cluster wide access and automatic failover between the nodes, multiple nodes might have the API service started. The specified IP address is brought up only on one of the nodes in the cluster at a time - the so called active API service. You may specify an available IP address (SP_API_HTTP_HOST), which will be brought up or down on the corresponding interface (SP_API_IFACE) when migrating the API service between the nodes.

To configure an interface (SP_API_IFACE) and address (SP_API_HTTP_HOST):

SP_API_HTTP_HOST=10.10.10.240
SP_API_IFACE=eth1

Note

The script that adds or deletes the SP_API_HTTP_HOST address is located at /usr/lib/storpool/api-ip and could be easily modified for other use cases (e.g. configure routing, firewalls, etc.).

6.2.6. Port for the API management (storpool_mgmt)

Port for the API management service, the default is:

SP_API_HTTP_PORT=81
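As a minimal sketch of how the API can be reached (the /ctrl/1.0 endpoint prefix and the Storpool v1 authorization scheme are taken as assumptions here - consult the StorPool API reference for the authoritative format), a request to the active API using the address, port and token from the examples above could look like this:

# curl -H "Authorization: Storpool v1:4306865639163977196" http://10.10.10.240:81/ctrl/1.0/ServicesList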

6.2.7. Address for the bridge service (storpool_bridge)

Required for the local bridge service, this is the address where the bridge binds to:

SP_BRIDGE_HOST=180.220.200.8

6.2.8. Interface for the bridge address (storpool_bridge)

Expected when the SP_BRIDGE_HOST value is a floating IP address for the storpool_bridge service:

SP_BRIDGE_IFACE=bond0.900

6.2.9. Parallel requests per disk when recovering from remote (storpool_bridge)

Number of parallel requests to issue while performing remote recovery, between 1 and 64, default:

SP_REMOTE_RECOVERY_PARALLEL_REQUESTS_PER_DISK=2

6.2.10. Working directory

Used for fifos, sockets, core files, etc., default:

SP_WORKDIR=/var/run/storpool

Hint

On nodes with /var/run mounted in RAM and limited memory, it is better to use /var/spool/storpool/run.

6.2.11. Report directory

Location for collecting automated bug reports and shared memory dumps:

SP_REPORTDIR=/var/spool/storpool

6.2.12. Restart automatically in case of crash

Restart a service in case of a crash if there have been fewer than 3 crashes during this interval (in seconds). If this value is 0 the service will not be restarted at all and will have to be started manually. The default is 30 minutes:

SP_RESTART_ON_CRASH=1800

6.2.13. Expected nodes

Minimum expected nodes for beacon operation, usually equal to the number of nodes with storpool_server instances running:

SP_EXPECTED_NODES=3

6.2.14. Local user for report collection

User to change the ownership of reports and crashes. Unset by default:

SP_CRASH_USER=

Note

If not configured during installation, this user will be set to storpool by default.

6.2.15. Remote user for report collection

Remote user for sending reports to StorPool. Usually lowercase SP_CLUSTER_NAME. Used by rsync in the storpool_repsync utility:

SP_CRASH_REMOTE_USER=storpoollab-sofia

6.2.16. Remote host address for sending reports

The default remote is reports.storpool.com, which could be altered in case a jumphost or a custom collection node is used:

SP_CRASH_REMOTE_ADDRESS=reports.storpool.com

6.2.17. Port on the remote host for sending reports

The default port is 2266, might be altered in case a jumphost or a custom collection node is used:

SP_CRASH_REMOTE_PORT=2266

6.2.18. Group owner for the StorPool devices

The system group to use for the /dev/storpool directory and the /dev/sp-* raw disk devices:

SP_DISK_GROUP=disk

6.2.19. Permissions for the StorPool devices

The access mode to set on the /dev/sp-* raw disk devices:

SP_DISK_MODE=0660
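To verify the resulting group ownership and access mode on a node with attached volumes, the devices can simply be listed, for example:

# ls -l /dev/storpool/ /dev/sp-*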

6.2.20. Exclude disks globally or per server instance

A list of paths to drives to be excluded at instance boot time:

SP_EXCLUDE_DISKS=/dev/sda1:/dev/sdb1

Can also be specified for each server instance individually:

SP_EXCLUDE_DISKS=/dev/sdc1
SP_EXCLUDE_DISKS_1=/dev/sda1

6.2.21. Cgroup setup

Enable the use of cgroups; the default is on. Each StorPool process requires a specification of the cgroups it should be started in. There is a default configuration for each service, and example configurations are available in the /usr/share/doc/storpool/examples directory on a node where StorPool is installed. One or more processes may be placed in the same cgroup, or each one may be in a cgroup of its own, as appropriate:

SP_USE_CGROUPS=1

It is mandatory to specify a SP_RDMA_CGROUPS setting for the kernel threads started by the StorPool modules:

SP_RDMA_CGROUPS=-g memory:/storpool.slice -g cpuset:/storpool.slice/rdma
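The exact cpuset and memory values depend on the hardware and are taken from the example configurations shipped in /usr/share/doc/storpool/examples. Purely as an illustration of the cgconfig format (the CPU numbers and the memory limit below are placeholders, not recommendations), a storpool.slice definition placed in /etc/cgconfig.d/ might look like this:

# Example only - CPU numbers and memory limit are placeholders
group storpool.slice {
    cpuset {
        cpuset.cpus = "2-3";
        cpuset.mems = "0";
    }
    memory {
        memory.limit_in_bytes = "4G";
    }
}
group storpool.slice/rdma {
    cpuset {
        cpuset.cpus = "2";
        cpuset.mems = "0";
    }
}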

Set cgroups for the storpool_block service. With 2 isolated cores it usually runs on the same core as the RDMA kernel threads. Isolating storpool_block on a dedicated core further improves performance:

SP_BLOCK_CGROUPS=-g memory:/storpool.slice -g cpuset:/storpool.slice/block

Set cgroups for the storpool_bridge service. Depending on the load, it runs either on a dedicated core or on the one where the storpool_mgmt service is running:

SP_BRIDGE_CGROUPS=-g memory:/storpool.slice -g cpuset:storpool.slice/mgmt

Set cgroups for the storpool_server service. Isolating each storpool_server service on a dedicated core improves the performance of the storage operations significantly. With multiple server instances the defaults are:

SP_SERVER_CGROUPS=-g memory:/storpool.slice -g cpuset:/storpool.slice/server
SP_SERVER1_CGROUPS=-g memory:/storpool.slice -g cpuset:/storpool.slice/server_1
SP_SERVER2_CGROUPS=-g memory:/storpool.slice -g cpuset:/storpool.slice/server_2
SP_SERVER3_CGROUPS=-g memory:/storpool.slice -g cpuset:/storpool.slice/server_3

Set cgroups for the storpool_beacon service, usually on a thread where one of the server instances is running:

SP_BEACON_CGROUPS=-g memory:/storpool.slice -g cpuset:/storpool.slice/beacon

Set cgroups for the storpool_mgmt service:

SP_MGMT_CGROUPS=-g memory:/mgmt.slice -g cpuset:/storpool.slice/mgmt

Note

The storpool_mgmt service allocates memory, so the general recommendation is to keep it outside of storpool.slice. As of the 18.02 release the default is a separate slice called mgmt.slice. This is an example config with a limit of 1 gigabyte (place it as /etc/cgconfig.d/mgmt.slice.conf):

group mgmt.slice {
  memory {
      memory.limit_in_bytes="1G";
      memory.memsw.limit_in_bytes="1G";
  }
}

Set cgroups for the storpool_controller service, by default constrained by CPU and memory limits in system.slice:

SP_CONTROLLER_CGROUPS=-g cpuset:system.slice -g memory:system.slice

Set cgroups for the storpool_iscsi target service; isolating storpool_iscsi on a dedicated core improves its performance:

SP_ISCSI_CGROUPS=-g cpuset:storpool.slice/iscsi -g memory:storpool.slice

Set cgroups for the storpool_nvmed service, usually on a thread where some of the storpool_server instances are running:

SP_NVMED_CGROUPS=-g cpuset:storpool.slice/beacon -g memory:storpool.slice

6.2.22. Network and storage controller interrupt affinity

The setirqaff utility is started by cron every minute. It checks the IRQ affinity settings for optimum performance and updates them if needed. The policy is built into the script and does not require any external configuration files. The setirqaff utility pins the network interface IRQs used for the storage system to the first CPU of the storpool.slice cgroup, pins the HBA IRQs to the first CPU of storpool.slice/server (if present), and spreads the IRQs used for VM networking among the CPUs of machine.slice. The remaining IRQs are distributed among all CPUs that are not in storpool.slice.

6.2.23. Cache size

Each storpool_server process allocates this amount of RAM (in MB) for caching. The size of the cache depends on the number of storage devices on each storpool_server instance:

SP_CACHE_SIZE=4096

Note

A node with three storpool_server processes running will use 4096*3 = 12GB cache total.

The size of the cache can be overridden for each of the storpool_server instances, which is useful when different instances control a different number of drives:

SP_CACHE_SIZE=1024
SP_CACHE_SIZE_1=1024
SP_CACHE_SIZE_2=4096
SP_CACHE_SIZE_3=8192

Set the internal write-back caching to on:

SP_WRITE_BACK_CACHE_ENABLED=1

Attention

A UPS is mandatory when the write-back cache is enabled; a clean server shutdown is required before the UPS batteries are depleted.

6.2.24. API authentication token

This value must be a unique integer for each cluster:

SP_AUTH_TOKEN=0123456789

Hint

Generate with: od -vAn -N8 -tu8 /dev/random

6.2.25. NVMe SSD drives

To tell the storpool_server service the PCIe ID of the NVMe SSD, configure the following:

SP_NVME_PCI_ID=0000:04:00.0
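One way to find the PCIe ID of an NVMe drive is to list the PCI devices with their domain prefix and filter for NVMe controllers, for example:

# lspci -D | grep -i 'non-volatile memory'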

6.2.26. Per host configuration

Specific details per host. The value in the square brackets should be the name of the host as returned by the hostname command. The ID of the node must be unique throughout the cluster:

[spnode1.example.com]
SP_OURID=1

The highest ID in this release is 62, i.e. up to 62 nodes in a single cluster.

Specific configuration details might be added for each host individually, e.g.:

[spnode1.example.com]
SP_OURID=1
SP_IFACE1_CFG=1:eth1.100:eth1:100:10.1.100.1:b:x:r
SP_IFACE2_CFG=1:eth2.100:eth2:100:10.2.100.1:b:x:r
SP_NODE_NON_VOTING=1

7. Prepare Storage Devices

All hard drives, SSDs, or NVMe drives that will be used by StorPool must have one properly aligned partition and must have an assigned ID. Larger NVMe drives could be split into two partitions so that they can be assigned to different storpool_server instances, working around a bottleneck posed by a saturated CPU by using a separate hyperthread or another CPU core for the second storpool_server instance.

The ID should be a number between 1 and 4000 and must be unique within the StorPool cluster. An example command for creating a partition on the whole drive with the proper alignment:

# parted -s --align optimal /dev/sdX mklabel gpt -- mkpart primary 2MiB 100%    # where X is the drive letter

For dual partitions on an NVMe drive use:

# parted -s --align optimal /dev/nvmeXnY mklabel gpt -- mkpart primary 2MiB 50%   # where X is the nvme device controller and Y is the device's first namespace
# parted -s --align optimal /dev/nvmeXnY mkpart primary 50% 100%

Similarly, to split an even larger (e.g. 6 or 8 TB+) NVMe drive into four partitions use:

# parted -s --align optimal /dev/nvmeXnY mklabel gpt -- mkpart primary 2MiB 25%   # where X is the nvme device controller and Y is the device's first namespace
# parted -s --align optimal /dev/nvmeXnY mkpart primary 25% 50%
# parted -s --align optimal /dev/nvmeXnY mkpart primary 50% 75%
# parted -s --align optimal /dev/nvmeXnY mkpart primary 75% 100%

On a brand new cluster installation it is necessary to have one drive formatted with the “init” (-I) flag of storpool_initdisk. This device is necessary only for the first start and therefore it is best to pick the first drive in the cluster.

Initializing the first drive on the first server node with the init flag:

# storpool_initdisk -I {diskId} /dev/sdXN   # where X is the drive letter and N is the partition number

Initializing an SSD or NVME SSD device with the SSD flag set:

# storpool_initdisk -s {diskId} /dev/sdXN   # where X is the drive letter and N is the partition number

Initializing an HDD drive:

# storpool_initdisk {diskId} /dev/sdXN   # where X is the drive letter and N is the partition number

List all initialized devices:

# storpool_initdisk --list

Example output:

/dev/nvme0n1, diskId 2305, version 10007, server instance 0, cluster e.b, SSD
/dev/nvme1n1, diskId 2306, version 10007, server instance 0, cluster e.b, SSD
/dev/sdr1, diskId 2301, version 10007, server instance 1, cluster e.b, SSD
/dev/sdq1, diskId 2302, version 10007, server instance 1, cluster e.b, SSD
/dev/sds1, diskId 2303, version 10007, server instance 1, cluster e.b, SSD
/dev/sdt1, diskId 2304, version 10007, server instance 1, cluster e.b, SSD
/dev/sda1, diskId 2311, version 10007, server instance 2, cluster e.b, WBC, jmv 160036C1B49
/dev/sdb1, diskId 2311, version 10007, server instance 2, cluster -, journal mv 160036C1B49
/dev/sdc1, diskId 2312, version 10007, server instance 2, cluster e.b, WBC, jmv 160036CF95B
/dev/sdd1, diskId 2312, version 10007, server instance 2, cluster -, journal mv 160036CF95B
/dev/sde1, diskId 2313, version 10007, server instance 3, cluster e.b, WBC, jmv 160036DF8DA
/dev/sdf1, diskId 2313, version 10007, server instance 3, cluster -, journal mv 160036DF8DA
/dev/sdg1, diskId 2314, version 10007, server instance 3, cluster e.b, WBC, jmv 160036ECC80
/dev/sdh1, diskId 2314, version 10007, server instance 3, cluster -, journal mv 160036ECC80

Other available options (a combined usage example follows the list):

  • -i - Specify the server instance; used when more than one storpool_server instance is running on the same node

  • -r - Used to return an ejected disk back to the cluster or change some of the parameters

  • -F - Forget this disk and mark it as ejected (succeeds only without a running storpool_server instance that has the drive opened)

  • -s - Set the SSD flag (on initial initialization only; not revertible with -r).

  • -e (count) - Initialize the disk by overriding the default entries count.

  • -j|--journal (<device>|none) - Used for HDDs when a RAID controller with a working cachevault or battery is present, to configure a small write-back cache journal device on the same hard drive.

  • --wbc (y|n) - Used for HDDs when the internal write-back caching is enabled; requires SP_WRITE_BACK_CACHE_ENABLED to have an effect. Turned off by default.

  • --nofua (y|n) - Used to forcefully disable FUA support for an SSD device. Use with caution, because it may lead to data loss if the device is powered off before a FLUSH CACHE command is issued.

  • --no-flush (y|n) - Used to forcefully disable FLUSH support for an SSD device.

  • --no-trim (y|n) - Used to forcefully disable TRIM support for an SSD device. Useful when the drive is behind a RAID controller without support for TRIM pass-through.

  • --force - Used when re-initializing an already initialized StorPool drive. Use with caution.
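For illustration only (the instance numbers, {diskId} and device names below are placeholders), the options can be combined - for example, adding an SSD to server instance 2, or an HDD with the internal write-back caching enabled to server instance 3:

# storpool_initdisk -i 2 -s {diskId} /dev/sdXN   # where X is the drive letter and N is the partition number
# storpool_initdisk -i 3 --wbc y {diskId} /dev/sdXN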

8. Verify the Installation

A StorPool installation provides the following daemons, each taking care of different functionality on each participating node in the cluster.

8.1. storpool_beacon

The beacon must be the first process started on all nodes in the cluster. It informs all members about the availability of the node on which it is installed. If the number of visible nodes changes, every storpool_beacon service checks that its node still participates in the quorum, i.e. that it can communicate with more than half of the expected nodes, including itself (see SP_EXPECTED_NODES in the Full Configuration Options List section); for example, with SP_EXPECTED_NODES=3 a node must see at least 2 voting nodes, including itself, to remain in the quorum. If the storpool_beacon service has been started successfully, it will send messages such as the following to the system log (/var/log/messages, /var/log/syslog, or similar) for every node that comes up in the StorPool cluster:

[snip]
Jan 21 16:22:18 s01 storpool_beacon[18839]: [info] incVotes(1) from 0 to 1, voteOwner 1
Jan 21 16:23:10 s01 storpool_beacon[18839]: [info] peer 2, beaconStatus UP bootupTime 1390314187662389
Jan 21 16:23:10 s01 storpool_beacon[18839]: [info] incVotes(1) from 1 to 2, voteOwner 2
Jan 21 16:23:10 s01 storpool_beacon[18839]: [info] peer up 1
[snip]

8.2. storpool_server

The storpool_server service must be started on each node that provides its hard drives and SSDs to the cluster. If the service has started successfully, all the hard drives intended to be used as StorPool disks should be listed in the system log, e.g.:

Dec 14 09:54:19 s11 storpool_server[13658]: [info] /dev/sdl1: adding as data disk 1101 (ssd)
Dec 14 09:54:19 s11 storpool_server[13658]: [info] /dev/sdb1: adding as data disk 1111
Dec 14 09:54:20 s11 storpool_server[13658]: [info] /dev/sda1: adding as data disk 1114
Dec 14 09:54:20 s11 storpool_server[13658]: [info] /dev/sdk1: adding as data disk 1102 (ssd)
Dec 14 09:54:20 s11 storpool_server[13658]: [info] /dev/sdj1: adding as data disk 1113
Dec 14 09:54:22 s11 storpool_server[13658]: [info] /dev/sdi1: adding as data disk 1112

On a dedicated node or a node with a larger amount of spare resources, more than one storpool_server instance could be started (up to four instances).

8.3. storpool_block

The storpool_block service provides the client (initiator) functionality. StorPool volumes can be attached only to the nodes where this service is running. When attached to a node, a volume can be used and manipulated as a regular block device via the /dev/storpool/{volume_name} symlink:

# lsblk /dev/storpool/test
NAME MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sp-2 251:2    0  100G  0 disk
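For example, assuming a volume named test has already been created (see the Volumes section), it could be attached to the local node and then used like any regular block device - a minimal sketch:

# storpool attach volume test here
# mkfs.ext4 /dev/storpool/test
# mount /dev/storpool/test /mnt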

8.4. storpool_mgmt

The storpool_mgmt service should be started on the management node. It receives requests from user space tools (CLI or API), executes them in the StorPool cluster and returns the results back to the sender. Each node can be used as an API management server, with only one node active at a time. An automatic failover mechanism is available: when the node with the active storpool_mgmt service fails, the SP_API_HTTP_HOST IP address is configured on the next node with the lowest SP_OURID with a running storpool_mgmt service.

8.5. storpool_bridge

The storpool_bridge service is started on one of the nodes in the cluster. This service synchronizes snapshots for the backup and disaster recovery use cases between this and one or more StorPool clusters in different locations.

8.6. storpool_controller

The storpool_controller service is started on all nodes running the storpool_server service. It collects information from all storpool_server instances in order to provide statistics data to the API.

Note

The storpool_controller service requires port 47567 to be open on the nodes where the API (storpool_mgmt) service is running.

9. CLI Tutorial

StorPool provides an easy yet powerful Command Line Interface (CLI) for administering the data storage cluster. It has an integrated help system that provides useful information on every step. There are various ways to execute commands in the CLI, depending on the style and needs of the administrator. The StorPool CLI gets its configuration from the /etc/storpool.conf file and from command line options.

Type a regular shell command with parameters:

# storpool service list

Use the interactive StorPool shell:

# storpool
StorPool> service list

Pipe command output to StorPool CLI:

# echo "service list" | storpool

Redirect the standard input from a predefined file with commands:

# storpool < input_file

Display the available command line options:

# storpool --help

An error message with possible options will be displayed if the shell command is incomplete or wrong:

# storpool attach
Error: incomplete command! Expected:
    list - list the current attachments
    timeout - seconds to wait for the client to appear
    volume - specify a volume to attach
    here - attach here
    noWait - do not wait for the client
    snapshot - specify a snapshot to attach
    mode - specify the read/write mode
    client - specify a client to attach the volume to

# storpool attach volume
Error: incomplete command! Expected:
  volume - the volume to attach

Interactive shell help can be invoked by pressing the question mark key (?):

# storpool
StorPool> attach
  client - specify a client to attach the volume to {M}
  here - attach here {M}
  list - list the current attachments
  mode - specify the read/write mode {M}
  noWait - do not wait for the client {M}
  snapshot - specify a snapshot to attach {M}
  timeout - seconds to wait for the client to appear {M}
  volume - specify a volume to attach {M}

Shell autocomplete, invoked by double-pressing the Tab key, will show the available options for the current step:

StorPool> attach <tab> <tab>
client    here      list      mode      noWait    snapshot  timeout   volume

StorPool shell can detect incomplete lines and suggest options:

# storpool
StorPool> attach <enter>
.................^
Error: incomplete command! Expected:
    volume - specify a volume to attach
    client - specify a client to attach the volume to
    list - list the current attachments
    here - attach here
    mode - specify the read/write mode
    snapshot - specify a snapshot to attach
    timeout - seconds to wait for the client to appear
    noWait - do not wait for the client

To exit the shell use quit or exit commands or directly use the Ctrl+C or Ctrl+D keyboard shortcuts of your terminal.

9.1. Location

The location submenu is used for configuring other StorPool clusters for disaster recovery and backup purposes. The location ID is the first part (left of the ".") of the SP_CLUSTER_ID configured in the remote cluster. For example, to add a remote location with SP_CLUSTER_ID=nzkr.b use:

# storpool location add nzkr StorPoolLab-Sofia
OK

To list the configured remote locations use:

# storpool location list
----------------------------
| id   | name              |
----------------------------
| nzkr | StorPoolLab-Sofia |
----------------------------

To rename a location use:

# storpool location rename StorPoolLab-Sofia name StorPoolLab-Amsterdam
OK

To remove a location use:

# storpool location remove StorPoolLab-Sofia
OK

Note

This command will fail if there is an existing cluster for this location or if there is a remote bridge configured for this location

9.2. Cluster

The cluster submenu is used for configuring a cluster for an already configured remote location. The cluster ID is the second part (right of the ".") of the SP_CLUSTER_ID configured in the remote cluster. For example, to add the cluster b for the remote location nzkr use:

# storpool cluster add StorPoolLab-Sofia b
OK

To list the configured remote clusters use:

# storpool cluster list
--------------------------
| id | location          |
--------------------------
| b  | StorPoolLab-Sofia |
--------------------------

To remove a cluster use:

# storpool cluster remove StorPoolLab-Sofia b

9.3. Remote Bridge

The remoteBridge submenu is used to register or deregister a remote bridge for a configured remote location.

To register a remote bridge use storpool remoteBridge register <location-name> <IP address> <public-key>, for example:

# storpool remoteBridge register StorPoolLab-Sofia 10.1.100.10 ju9jtefeb8idz.ngmrsntnzhsei.grefq7kzmj7zo.nno515u6ftna6
OK

This will register the StorPoolLab-Sofia location with an IP address of 10.1.100.10 and the above public key.

In case of a change in the IP address or the public key of a remote location the remote bridge could be de-registered and then registered again with the required parameters, e.g.:

# storpool remoteBridge deregister 10.1.100.10
OK
# storpool remoteBridge register StorPoolLab-Sofia 78.90.13.150 8nbr9q162tjh.ahb6ueg16kk2.mb7y2zj2hn1ru.5km8qut54x7z
OK

To enable deferred deletion on unexport from the remote site, the minimumDeleteDelay flag should also be set. The format of the command is storpool remoteBridge register <location-name> <IP address> <public-key> <minimumDeleteDelay>, where the last parameter is a time period provided as X[smhd] - X is an integer and s, m, h, and d stand for seconds, minutes, hours and days respectively.

For example, registering the remote bridge for the StorPoolLab-Sofia location with a minimumDeleteDelay of one day would look like this:

# storpool remoteBridge register StorPoolLab-Sofia 78.90.13.150 8nbr9q162tjh.ahb6ueg16kk2.mb7y2zj2hn1ru.5km8qut54x7z 1d
OK

After this operation all snapshots sent from the remote cluster could also be unexported with the deleteAfter parameter set (check the Remote snapshots section). Any deleteAfter parameters lower than the minimumDeleteDelay will be overridden by the latter.

To list all registered remote bridges use:

# storpool remoteBridge list
-----------------------------------------------------------------------------------------------------------------------
| ip          | location               | minimumDeleteDelay | publicKey                                               |
-----------------------------------------------------------------------------------------------------------------------
| 10.1.100.10 | StorPoolLab-Sofia      |                    | 8nbr9q162tjh.ahb6ueg16kk2.mb7y2zj2hn1ru.5km8qut54x7z    |
-----------------------------------------------------------------------------------------------------------------------

Check the Multi Site section from this user guide for more on deferred delete.

9.4. Network

To list basic details about the cluster network use:

# storpool net list
------------------------------------------------------------------------------------------------------
| nodeId | flags   | MAC - net 1       | MAC - net 2       | rdma port 1        | rdma port 2        |
------------------------------------------------------------------------------------------------------
|     11 | uU + AJ | 00:1B:21:6E:D0:F0 | 00:1B:21:6E:D0:F0 | -                  | -                  |
|     12 | uU + AJ | 00:1B:21:97:D7:60 | 00:1B:21:97:D7:60 | -                  | -                  |
|     13 | uU + AJ | 00:1B:21:70:DB:8C | 00:1B:21:70:DB:8C | -                  | -                  |
|     14 | uU + AJ | 00:1B:21:7B:D3:14 | 00:1B:21:7B:D3:14 | -                  | -                  |
|     15 | uU + AJ | 00:1B:21:FA:07:12 | 00:1B:21:FA:07:12 | -                  | -                  |
------------------------------------------------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
  u - packets recently received from this node
  d - no packets recently received from this node

  U - this node has enough votes to be in the quorum
  D - this node does not have enough votes to be in the quorum
  M - this node is being damped by the rest of the nodes in the cluster
  + - this node considers itself in the quorum
  B - the connection to this node is through a backup link; check the cabling
  A - this node is using hardware acceleration
  O - this node is using old protocol version
  J - the node uses jumbo frames

  N - a non-voting node

With InfiniBand or RDMA over Converged Ethernet (RoCE) networks this output is a bit different, with only the rdma ports populated. This is an example output with a redundant IB network:

# storpool net list
----------------------------------------------------------------------------------------------------
| nodeId | flags | MAC - net 1       | MAC - net 2       | rdma port 1        | rdma port 2        |
----------------------------------------------------------------------------------------------------
|     11 |  uUN+ | -                 | -                 | O 0x2c9030030ca51  | O 0x2c9030030ca52  |
|     12 |  uUN+ | -                 | -                 | O 0x2c9030030aef1  | O 0x2c9030030aef2  |
|     13 |  uUN+ | -                 | -                 | O 0x2c9030030abe1  | O 0x2c9030030abe2  |
|     14 |  uUN+ | -                 | -                 | O 0x2c9030030fc21  | O 0x2c9030030fc22  |
----------------------------------------------------------------------------------------------------
[snip]

Note

The flags for the RDMA ports are I - “Idle”, C - “GidReceived”, C - “Connecting”, O - “Connected”, E - “pendingError” or “Error”.

9.5. Server

To list the nodes that are configured as StorPool servers and their storpool_server instances use:

# storpool server list
cluster running, mgmt on node 11
    server  11.0 running on node 11
    server  12.0 running on node 12
    server  13.0 running on node 13
    server  14.0 running on node 14
    server  11.1 running on node 11
    server  12.1 running on node 12
    server  13.1 running on node 13
    server  14.1 running on node 14

To get more information about which storage devices are provided by a particular server, use storpool server <ID> disk list:

# storpool server 11 disk list
disk  |  server  |    size    |    used    |  est.free  |      %  |  free entries  |  on-disk size  |  allocated objects |  errors
1103  |    11.0  |    447 GB  |    3.1 GB  |    424 GB  |    1 %  |       1919912  |         20 MB  |    40100 / 480000  |       0
1104  |    11.0  |    447 GB  |    3.1 GB  |    424 GB  |    1 %  |       1919907  |         20 MB  |    40100 / 480000  |       0
1111  |    11.0  |    465 GB  |    2.6 GB  |    442 GB  |    1 %  |        494977  |         20 MB  |    40100 / 495000  |       0
1112  |    11.0  |    365 GB  |    2.6 GB  |    346 GB  |    1 %  |        389977  |         20 MB  |    40100 / 390000  |       0
1125  |    11.0  |    931 GB  |    2.6 GB  |    894 GB  |    0 %  |        974979  |         20 MB  |    40100 / 975000  |       0
1126  |    11.0  |    931 GB  |    2.6 GB  |    894 GB  |    0 %  |        974979  |         20 MB  |    40100 / 975000  |       0
----------------------------------------------------------------------------------------------------------------------------------------
   6  |     1.0  |    3.5 TB  |     16 GB  |    3.4 TB  |    0 %  |       6674731  |        122 MB  |   240600 / 3795000 |       0

Note

Without specifying an instance, the first instance is assumed - 11.0 as in the above example. The second, third and fourth storpool_server instances would be 11.1, 11.2 and 11.3 respectively.

To list the servers that are blocked and could not join the cluster for some reason:

# storpool server blocked
cluster waiting, mgmt on node 12
  server  11.0 waiting on node 11 missing:1201,1202,1203,1204,1212,1221,1222,1223,1224,1225,1226,1301,1302,1303,1304,1311,1312,1321,1322,1323,1324,1325,1326,4043 pending:1103,1104,1111,1112,1125,1126
  server  12.0    down on node 12
  server  13.0    down on node 13
  server  14.0 waiting on node 14 missing:1201,1202,1203,1204,1212,1221,1222,1223,1224,1225,1226,1301,1302,1303,1304,1311,1312,1321,1322,1323,1324,1325,1326,4043 pending:1403,1404,1411,1412,1421,1423
  server  11.1 waiting on node 11 missing:1201,1202,1203,1204,1212,1221,1222,1223,1224,1225,1226,1301,1302,1303,1304,1311,1312,1321,1322,1323,1324,1325,1326,4043 pending:1101,1102,1121,1122,1123,1124
  server  12.1    down on node 12
  server  13.1    down on node 13
  server  14.1 waiting on node 14 missing:1201,1202,1203,1204,1212,1221,1222,1223,1224,1225,1226,1301,1302,1303,1304,1311,1312,1321,1322,1323,1324,1325,1326,4043 pending:1401,1402,1424,1425,1426

9.6. Fault sets

The fault sets are a way to instruct StorPool to use the drives in a group of nodes for only one replica of the data if they are expected to fail simultaneously. Some examples would be:

  • a multinode chassis

  • multiple nodes in the same rack backed by the same power supply

  • nodes connected to the same set of switches, and so on.

To define a fault set, only a name and a set of servers are needed:

# storpool faultSet chassis_1 addServer 11 addServer 12
OK

To list defined fault sets use:

# storpool faultSet list
-------------------------------------------------------------------
| name                 |                                  servers |
-------------------------------------------------------------------
| chassis_1            |                                    11 12 |
-------------------------------------------------------------------

To remove a fault set use:

# storpool faultSet chassis_1 delete chassis_1

Attention

A new fault set definition takes effect only on newly created volumes. To change the configuration of already created volumes a re-balance operation would be required; see Balancer for more details on re-balancing a cluster after defining fault sets.

9.7. Services

Check the state of all services presently running in the cluster and their uptime:

# storpool service list
cluster running, mgmt on node 12
      mgmt    11 running on node 11 ver 18.02.08, started 2018-01-17 14:23:53, uptime 18:30:09
      mgmt    12 running on node 12 ver 18.02.08, started 2018-01-17 14:23:49, uptime 18:30:13 active
      mgmt    13 running on node 13 ver 18.02.08, started 2018-01-17 14:23:47, uptime 18:30:15
      mgmt    14 running on node 14 ver 18.02.08, started 2018-01-18 05:06:53, uptime 03:47:09
    server  11.0 running on node 11 ver 18.02.08, started 2018-01-18 08:29:15, uptime 00:24:47
    server  12.0 running on node 12 ver 18.02.08, started 2018-01-18 08:29:35, uptime 00:24:27
    server  13.0 running on node 13 ver 18.02.08, started 2018-01-18 08:29:54, uptime 00:24:08
    server  14.0 running on node 14 ver 18.02.08, started 2018-01-18 08:28:21, uptime 00:25:41
    server  11.1 running on node 11 ver 18.02.08, started 2018-01-18 08:29:25, uptime 00:24:37
    server  12.1 running on node 12 ver 18.02.08, started 2018-01-18 08:29:41, uptime 00:24:21
    server  13.1 running on node 13 ver 18.02.08, started 2018-01-18 08:34:12, uptime 00:19:50
    server  14.1 running on node 14 ver 18.02.08, started 2018-01-18 08:28:24, uptime 00:25:38
    client    11 running on node 11 ver 18.02.08, started 2018-01-17 14:23:52, uptime 18:30:10
    client    12 running on node 12 ver 18.02.08, started 2018-01-17 14:23:48, uptime 18:30:14
    client    13 running on node 13 ver 18.02.08, started 2018-01-17 14:23:47, uptime 18:30:15
    client    14 running on node 14 ver 18.02.08, started 2018-01-18 05:06:54, uptime 03:47:08
    bridge    11 running on node 11 ver 18.02.08, started 2018-01-18 08:27:54, uptime 00:26:08

9.8. Disk

The disk submenu is for querying or managing the available disks in the cluster.

To display all available disks in all server instances in the cluster:

# storpool disk list
disk  |  server  |    size    |    used    |  est.free  |      %  |  free entries  |  on-disk size  |  allocated objects |  errors
1101  |    11.1  |    447 GB  |    3.6 GB  |    424 GB  |    1 %  |       1919976  |         20 MB  |    40100 / 480000  |       0
1102  |    11.1  |    447 GB  |    3.6 GB  |    424 GB  |    1 %  |       1919976  |         20 MB  |    40100 / 480000  |       0
1103  |    11.0  |    447 GB  |    3.6 GB  |    424 GB  |    1 %  |       1919976  |         20 MB  |    40100 / 480000  |       0
1104  |    11.0  |    447 GB  |    3.6 GB  |    424 GB  |    1 %  |       1919976  |         20 MB  |    40100 / 480000  |       0
1111  |    11.0  |    465 GB  |    3.1 GB  |    442 GB  |    1 %  |        494976  |         20 MB  |    40100 / 495000  |       0
1112  |    11.0  |    365 GB  |    3.1 GB  |    345 GB  |    1 %  |        389976  |         20 MB  |    40100 / 390000  |       0
[snip]
1425  |    14.1  |    931 GB  |    2.6 GB  |    894 GB  |    0 %  |        974980  |         20 MB  |    40100 / 975000  |       0
1426  |    14.1  |    931 GB  |    2.6 GB  |    894 GB  |    0 %  |        974979  |         20 MB  |    40100 / 975000  |       0
----------------------------------------------------------------------------------------------------------------------------------------
47  |     8.0  |     30 TB  |    149 GB  |     29 TB  |    0 %  |      53308967  |        932 MB  |  1844600 / 32430000 |       0

To display additional info regarding disks:

# storpool disk list info
disk   |  server  |    device    |       model        |           serial            |             description          |  SSD  |  WBC  |         flags          |
1101  |    11.1  |  /dev/sdf1   |  Micron_M500DC_MTFDDAK480MBB  |  14250C63689B               |                                  |  SSD  |       |                        |
1102  |    11.1  |  /dev/sde1   |  Micron_M500DC_MTFDDAK480MBB  |  14250C6368EC               |                                  |  SSD  |       |                        |
1103  |    11.0  |  /dev/sdk1   |  Micron_M500DC_MTFDDAK480MBB  |  14250C636872               |                                  |  SSD  |       |                        |
1104  |    11.0  |  /dev/sdl1   |  Micron_M500DC_MTFDDAK480MBB  |  14250C6368E5               |                                  |  SSD  |       |                        |
1111  |    11.0  |  /dev/sdj1   |  Hitachi_HDS721050CLA360  |  JP1512FR1GZBNK             |                                  |       |  WBC  |                        |
1112  |    11.0  |  /dev/sdi1   |  Hitachi_HDS721050CLA360  |  JP1532FR1HAU4K             |                                  |       |  WBC  |                        |
[snip]
1425  |    14.1  |  /dev/sdg1   |  Hitachi_HUA722010CLA330  |  JPW9K0N13RJMVL             |                                  |       |  WBC  |                        |
1426  |    14.1  |  /dev/sdf1   |  Hitachi_HUA722010CLA330  |  JPW9K0N13SJ3DL             |                                  |       |  WBC  |                        |

To display internal statistics about each disk:

# storpool disk list internal
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| disk | server |        aggregate scores        |         wbc pages        |   wbc iops   |   wbc bw.    |     scrub bw |                          scrub ETA | last scrub completed |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 1101 |   11.1 |        0 |        0 |        0 |        - + -     / -     |            - |            - |            - |                                  - |                    - |
| 1102 |   11.1 |        0 |        0 |        0 |        - + -     / -     |            - |            - |            - |                                  - |                    - |
| 1103 |   11.0 |        0 |        0 |        0 |        - + -     / -     |            - |            - |            - |                                  - |                    - |
| 1104 |   11.0 |        0 |        0 |        0 |        - + -     / -     |            - |            - |            - |                                  - |                    - |
| 1111 |   11.0 |        0 |        0 |        0 |        0 + 0     / 2560  |         8249 |   20 MB/sec. |            - |                                  - |                    - |
| 1112 |   11.0 |        0 |        0 |        0 |        0 + 0     / 2560  |         6499 |   20 MB/sec. |            - |                                  - |                    - |
[snip]
| 1425 |   14.1 |        0 |        0 |        0 |        0 + 0     / 2560  |        16249 |   20 MB/sec. |            - |                                  - |                    - |
| 1426 |   14.1 |        0 |        0 |        0 |        0 + 0     / 2560  |        16249 |   20 MB/sec. |            - |                                  - |                    - |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The columns in this output are as follows:
  • aggregate scores - Internal values representing how much data is about to be defragmented on the particular drive. Usually between 0 and 1, on heavily loaded clusters the rightmost column might get into the hundreds or even thousands if some drives are severely loaded.

  • wbc pages, wbc iops, wbc bw. - Internal statistics for each drive that has the write-back cache in StorPool enabled.

  • scrub bw - The scrubbing speed in MB/s

  • scrub ETA - Approximate time/date when the scrubbing operation will complete for this drive.

  • last scrub completed - The last time/date when the drive was scrubbed

Note

The default installation includes a cron job on the management nodes that starts a scrubbing job for all drives in the cluster once per week.

To set additional information for some of the disks, shown in the description field of storpool disk list info:

# storpool disk 1111 description HBA2_port7
OK
# storpool disk 1104 description FAILING_SMART
OK

To mark a device as temporarily unavailable:

# storpool disk 1111 eject
OK

This will stop data replication for this disk, but will keep info on the placement groups in which it participated and which volume objects it contained.

Note

The command above will refuse to eject the disk if this operation would lead to volumes or snapshots in the down state, usually when the last up-to-date copy of some parts of a volume/snapshot is on this disk.

This drive will be visible with storpool disk list as missing, e.g.:

# storpool disk list
    disk  |  server  |    size    |    used    |  est.free  |      %  |  free entries  |  on-disk size  |  allocated objects |  errors
    [snip]
    1422  |    14.1  |         -  |         -  |         -  |    - %  |             -  |             -  |        - / -       |       -
    [snip]

Attention

This operation leads to degraded redundancy for all volumes or snapshots that have data on the ejected disk.

To mark a disk as unavailable by first re-balancing all data out to the other disks in the cluster and only then ejecting it:

# storpool disk 1422 softEject
OK
Balancer auto mode currently OFF. Must be ON for soft-eject to complete.

Note

This option requires the StorPool balancer to be started after the above command is issued; see the Balancer section below for more details.

To remove a disk from the list of reported disks and all placement groups it participates in:

# storpool disk 1422 forget
OK

To get detailed information about a given disk:

# storpool disk 1101 info
OK

To get detailed information about the objects on a particular disk:

# storpool disk 1101 list

To get detailed information about the active requests that the disk is performing at the moment:

# storpool disk 1101 activeRequests
-----------------------------------------------------------------------------------------------------------------------------------
| request ID                     |  request IDX |               volume |         address |       size |       op |    time active |
-----------------------------------------------------------------------------------------------------------------------------------
| 9226469746279625682:285697101441249070 |            9 |           testvolume |     85276782592 |     4.0 KB |     read |         0 msec |
| 9226469746279625682:282600876697431861 |           13 |           testvolume |     96372936704 |     4.0 KB |     read |         0 msec |
| 9226469746279625682:278097277070061367 |           19 |           testvolume |     46629707776 |     4.0 KB |     read |         0 msec |
| 9226469746279625682:278660227023482671 |          265 |           testvolume |     56680042496 |     4.0 KB |    write |         0 msec |
-----------------------------------------------------------------------------------------------------------------------------------

To issue retrim operation on a disk (available for SSD disks only):

# storpool disk 1101 retrim
OK

To start, pause or continue a scrubbing operation for a disk:

# storpool disk 1101 scrubbing start
OK
# storpool disk 1101 scrubbing pause
OK
# storpool disk 1101 scrubbing continue
OK

Note

Use storpool disk list internal to check the status of a running scrub operation, or to see when the last scrubbing operation for this disk completed.

9.9. Placement Groups

The placement groups are predefined sets of disks, over which volume objects will be replicated. It is possible to specify which individual disks to add to the group.

To display the defined placement groups in the cluster:

# storpool placementGroup list
name
default
hdd
ssd

To display details about a placement group:

# storpool placementGroup ssd list
type   | id
disk   | 1101 1201 1301 1401

Creating a new placement group or extending an existing one requires specifying its name and providing one or more disks to be added:

# storpool placementGroup ssd addDisk 1102
OK
# storpool placementGroup ssd addDisk 1202
OK
# storpool placementGroup ssd addDisk 1302 addDisk 1402
OK
# storpool placementGroup ssd list
type | id
disk   | 1101 1102 1201 1202 1301 1302 1401 1402

To remove one or more disks from a placement group use:

# storpool placementGroup ssd rmDisk 1402
OK
# storpool placementGroup ssd list
type | id
disk   | 1101 1102 1201 1202 1301 1302 1401

To rename a placement group:

# storpool placementGroup ssd rename M500DC
OK

The unused placement groups can be removed. To avoid accidents, the name of the group must be entered twice:

# storpool placementGroup ssd delete ssd
OK

9.10. Volumes

The volumes are the basic service of the StorPool storage system. A volume always has a name and a certain size. It can be read from and written to, and can be attached to hosts as a read-only or read-write block device under the /dev/storpool directory. A volume may have one or more tags, created or changed in the form name=value. The volume name is a string consisting of one or more of the allowed characters - upper and lower case Latin letters (a-z, A-Z), digits (0-9) and the delimiters dot (.), colon (:), dash (-) and underscore (_). The same rules apply for the keys and values used for volume tags. The volume name including tags cannot exceed 200 bytes.

When a volume is created, at minimum the <volumeName>, the size and replication must be specified:

# storpool volume testvolume size 100G replication 3

Additional parameters that can be used:

  • placeAll - place all objects in placementGroup (Default value: default)

  • placeTail - name of placementGroup for reader (Default value: same as placeAll value)

  • placeHead - place the third replica in a different placementGroup (Default value: same as placeAll value)

  • template - use template with preconfigured placement, replication and/or limits (please check for more in Templates section)

  • parent - use a snapshot as a parent for this volume

  • reuseServer - place multiple copies on the same server

  • baseOn - create the volume as a clone of an existing volume; this will create a transient snapshot used as a parent (please check for more in the Snapshots section)

  • iops - set the maximum IOPS limit for this volume (in IOPS)

  • bw - set maximum bandwidth limit (in MB/s)

  • tag - set a tag for this volume in the form name=value

  • create - create the volume, fail if it exists (optional for now)

  • update - update the volume, fail if it does not exist (optional for now)
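
For example, several of these parameters can be combined in a single statement; the volume name, placement groups and tag below are purely illustrative:

# storpool volume vm1.disk1 size 100G replication 3 placeAll hdd placeTail ssd iops 2000 bw 100 tag vm=vm1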

The create keyword is useful in scripts when an unintended update of an existing volume must be prevented:

# storpool volume test create template hybrid-u-r3
OK
# storpool volume test create size 200G template hybrid-u-r3
Error: Volume 'test' already exists

A statement with the update parameter will fail with an error if the volume does not exist:

# storpool volume test update template hybrid-u-r3
OK
# storpool volume test1 update template hybrid-u-r3
Error: volume 'test1' does not exist

To list all available volumes:

# storpool volume list
----------------------------------------------------------------------------------------------------------------------------------------------------------------
| volume               |   size  | repl. | placeHead  | placeAll   | placeTail  |   iops  |    bw   | parent               | template             | tags       |
----------------------------------------------------------------------------------------------------------------------------------------------------------------
| testvolume           |  100 GB |     3 | ultrastar  | ultrastar  | ssd        |       - |       - | testvolume@35691     | hybrid-u-r3          | name=value |
----------------------------------------------------------------------------------------------------------------------------------------------------------------

To get an overview of all volumes and snapshots and their state in the system use:

# storpool volume status
-------------------------------------------------------------------------------------------------------------------------------------------------
| volume               |    size | repl. | tags       |  alloc % |  stored | on disk | syncing | missing | status    | flags | drives down      |
-------------------------------------------------------------------------------------------------------------------------------------------------
| testvolume           |  100 GB |     3 | name=value |    0.0 % |    0  B |    0  B |    0  B |    0  B | up        |       |                  |
| testvolume@35691     |  100 GB |     3 |            |  100.0 % |  100 GB |  317 GB |    0  B |    0  B | up        | S     |                  |
-------------------------------------------------------------------------------------------------------------------------------------------------
| 2 volumes            |  200 GB |       |            |   50.0 % |  100 GB |  317 GB |    0  B |    0  B |           |       |                  |
-------------------------------------------------------------------------------------------------------------------------------------------------

Flags:
  S - snapshot
  B - balancer blocked on this volume
  D - decreased redundancy (degraded)
  M - migrating data to a new disk
  R - allow placing two disks within a replication chain onto the same server

The columns in this output are:
  • volume - name of the volume or snapshot (see flags below)

  • size - provisioned volume size, the visible size inside a VM for example

  • repl. - number of copies for this volume

  • tags - all custom key=value tags configured for this volume or snapshot

  • alloc % - how much space was used on this volume in percent

  • stored - space allocated on this volume

  • on disk - the size allocated on all drives in the cluster after replication and the overhead from data protection

  • syncing - how much data is not in sync after a drive or server was missing, the data is recovered automatically once the missing drive or server is back in the cluster

  • missing - shows how much data is not available for this volume when the volume is in the down state, see status below

  • status - shows the status of the volume, which could be one of:

  • up - all copies are available

  • down - none of the copies are available for some parts of the volume

  • up soon - all copies are available and the volume will soon come up

  • flags - flags denoting features of this volume:

  • S - stands for snapshot, which is essentially a read-only (frozen) volume

  • B - used to denote that the balancer is blocked for this volume (usually when some of the drives are missing)

  • D - this flag is displayed when some of the copies are either not available or outdated and the volume is running with decreased redundancy

  • M - displayed when changing the replication or a cluster re-balance is in progress

  • R - displayed when the policy for keeping copies on different servers is overridden

  • drives down - displayed when the volume is in the down state; lists the drives required to get the volume back up.

Size is in B/KB/MB/GB or TB.

To check the estimated used space by the volumes in the system use:

# storpool volume usedSpace
--------------------------------------------------------------------------------------
| volume               |       size | repl. |     stored |       used | missing info |
--------------------------------------------------------------------------------------
| testvolume           |     100 GB |     3 |     1.9 GB |     100 GB |         0  B |
--------------------------------------------------------------------------------------

The columns explained:

  • volume - name of the volume

  • size - the provisioned size of this volume

  • repl. - the replication of the volume

  • stored - how much data is stored for this volume itself (not counting data in any parent snapshots)

  • used - how much data has been written (including the data written in parent snapshots)

  • missing info - if this value is anything other than 0  B, some of the storpool_controller services in the cluster is probably not running correctly.

Note

The used column shows how much data is accessible and reserved for this volume.

To list the target disk sets and objects of a volume:

# storpool volume testvolume list
volume testvolume
size 100 GB
replication 3
placeAll ultrastar
placeTail ssd
placeHead ultrastar
target disk sets:
       0: 1122 1323 1203
       1: 1424 1222 1301
       2: 1121 1324 1201
[snip]
  object: disks
       0: 1122 1323 1203
       1: 1424 1222 1301
       2: 1121 1324 1201
[snip]

Hint

In this example the volume uses hybrid placement, with two copies on HDDs and one copy on SSDs (the rightmost column). The target disk sets are a list of triplets of drives in the cluster used as a template for the actual objects of the volume.

To get detailed info about the disks used for this volume and the number of objects on each of them use:

# storpool volume testvolume info
    diskId | count
  1101 |   200
  1102 |   200
  1103 |   200
  [snip]
chain                | count
1121-1222-1404       |  25
1121-1226-1303       |  25
1121-1226-1403       |  25

To rename a volume use:

# storpool volume testvolume rename newvolume
OK

To add a tag for a volume:

# storpool volume testvolume tag name=value

To change a tag for a volume use:

# storpool volume testvolume tag name=newvalue

To remove a tag just set it to an empty value:

# storpool volume testvolume tag name=

To resize a volume up:

# storpool volume testvolume size +1G
OK

To shrink a volume (resize down):

# storpool volume testvolume size 50G shrinkOk

Attention

Shrinking a StorPool volume changes the size of the block device, but does not adjust the size of any LVM volume or filesystem contained in it. Failing to shrink the filesystem or LVM prior to shrinking the StorPool volume will result in data loss.
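
As a minimal illustration for a volume holding an ext4 filesystem directly (no LVM), assuming the filesystem is not mounted anywhere, the filesystem is shrunk first and only then the volume (note that XFS filesystems cannot be shrunk at all):

# e2fsck -f /dev/storpool/testvolume
# resize2fs /dev/storpool/testvolume 50G
# storpool volume testvolume size 50G shrinkOk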

To delete a volume use:

# storpool volume vol1 delete vol1

Note

To avoid accidents, the volume name must be entered twice. As a safety precaution, attached volumes cannot be deleted even when not in use; more in Attachments.

A volume based on a snapshot can be converted to a stand-alone volume. For example the testvolume below is based on an anonymous snapshot:

# storpool_tree
StorPool
  `-testvolume@37126
     `-testvolume

To rebase it against root (also known as “promote”) use:

# storpool volume testvolume rebase
OK
# storpool_tree
StorPool
  `-testvolume
  `-testvolume@37126

The rebase operation can also target a particular snapshot from a chain of parent snapshots of this volume:

# storpool_tree
StorPool
  `-testvolume-snap1
     `-testvolume-snap2
        `-testvolume-snap3
           `-testvolume
# storpool volume testvolume rebase testvolume-snap2
OK

After the operation the volume is directly based on testvolume-snap2 and includes all changes from testvolume-snap3:

# storpool_tree
StorPool
  `-testvolume-snap1
     `-testvolume-snap2
        `-testvolume
        `-testvolume-snap3

To back up a volume named testvolume to a configured remote location StorPoolLab-Sofia use:

# storpool volume testvolume backup StorPoolLab-Sofia
OK

This operation creates a temporary snapshot and transfers it to the StorPoolLab-Sofia location. After the transfer completes, the local temporary snapshot is deleted and the remote snapshot becomes visible as exported from StorPoolLab-Sofia; check Remote Snapshots for more on working with snapshot exports.

When backing up a volume, one or more tags may be applied to the remote snapshot, as in the example below:

# storpool volume testvolume backup StorPoolLab-Sofia tag key=value # [tag key2=value2]

9.11. Snapshots

Snapshots are read-only point-in-time images of volumes. They are created once and cannot be changed. They can be attached to hosts as read-only block devices under /dev/storpool. Volumes and snapshots share the same name-space, thus their names are unique within a StorPool cluster. Volumes can be based on snapshots. Such volumes contain only the changes since the snapshot was taken. After a volume is created from a snapshot, writes will be recorded within the volume. Reads from the volume may be served by the volume itself or by its parent snapshot, depending on whether the volume contains changed data for the read request or not.

To create an unnamed (known also as anonymous) snapshot of a volume use:

# storpool volume testvolume snapshot
OK

This will create a snapshot named testvolume@<ID>, where ID is a unique serial number. Note that any tags on the volume will not be propagated to the snapshot; to set tags on the snapshot at creation time use:

# storpool volume testvolume tag key=value snapshot

To create a named snapshot of a volume use:

# storpool volume testvolume snapshot testsnap
OK

Again to directly set tags:

# storpool volume testvolume snapshot testsnapplustags tag key=value

To remove a tag on a snapshot:

# storpool snapshot testsnapplustags tag key=

To list the snapshots use:

# storpool snapshot list
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| snapshot          |  size   | repl. | placeHead | placeAll   | placeTail | created on          | volume      | iops  | bw   | parent          | template  | flags | targetDeleteDate | tags      |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| testsnap          |  100 GB |     3 | hdd       | hdd        | ssd       | 2018-01-30 04:11:23 | testvolume  |     - |    - | testvolume@1430 | hybrid-r3 |       | -                | key=value |
| testvolume@1430   |  100 GB |     3 | hdd       | hdd        | ssd       | 2018-01-30 03:56:58 | testvolume  |     - |    - |                 | hybrid-r3 | A     | -                |           |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Flags:
  A - anonymous snapshot with auto-generated name
  B - bound snapshot
  D - snapshot currently in the process of deletion
  T - transient snapshot (created during volume cloning)
  R - allow placing two disks within a replication chain onto the same server
  P - snapshot delete blocked due to multiple children

To list the snapshots only for a particular volume use:

# storpool volume testvolume list snapshots
[snip]

A volume can be directly converted to a snapshot; this operation is also known as freeze:

# storpool volume testvolume freeze
OK

Note that the operation will fail if the volume is attached read-write; more in Attachments.

To create a bound snapshot on a volume use:

# storpool volume testvolume bound snapshot
OK

This snapshot will be automatically deleted when the last child volume created from it is deleted. Useful for non-persistent images.

To list the target disk sets and objects of a snapshot:

# storpool snapshot testsnap list
[snip]

The output is similar to that of storpool volume <volumename> list.

To get detailed info about the disks used for this snapshot and the number of objects on each of them use:

# storpool snapshot testsnap info
[snip]

The output is similar to the storpool volume <volumename> info.

To create a volume based on an existing snapshot (cloning) use:

# storpool volume testvolume parent centos73-base-snap
OK

The same is possible through the use of a template with a parent snapshot (see Templates):

# storpool volume spd template centos73-base
OK

Create a volume based on another existing volume (cloning):

# storpool volume testvolume1 baseOn testvolume
OK

Note

This operation will first create an anonymous bound snapshot on testvolume and then create testvolume1 with this snapshot as a parent. The snapshot will exist until both volumes are deleted and will be automatically deleted afterwards.
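
After the command above, the snapshot tree might look similar to this (illustrative; the actual <ID> is assigned automatically):

# storpool_tree
StorPool
  `-testvolume@<ID>
     `-testvolume
     `-testvolume1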

To delete a snapshot use:

# storpool snapshot spdb_snap1 delete spdb_snap1
OK

Note

To avoid accidents, the name of the snapshot must be entered twice.

A snapshot can also be bound to its child volumes; it will exist until all child volumes are deleted:

# storpool snapshot testsnap bind
OK

The opposite operation is also possible; to unbind such a snapshot use:

# storpool snapshot testsnap unbind
OK

To get the space that will be freed if a snapshot is deleted use:

# storpool snapshot space
-------------------------------------------------------------------------------------------------------------
| snapshot             | on volume            |       size | repl. |     stored |       used | missing info |
-------------------------------------------------------------------------------------------------------------
| testsnap             | testvolume           |     100 GB |     3 |      27 GB |    -135 GB |         0  B |
| testvolume@3794      | testvolume           |     100 GB |     3 |      27 GB |     1.9 GB |         0  B |
| testvolume@3897      | testvolume           |     100 GB |     3 |     507 MB |     432 KB |         0  B |
| testvolume@3899      | testvolume           |     100 GB |     3 |     334 MB |     224 KB |         0  B |
| testvolume@4332      | testvolume           |     100 GB |     3 |      73 MB |      36 KB |         0  B |
| testvolume@4333      | testvolume           |     100 GB |     3 |      45 MB |      40 KB |         0  B |
| testvolume@4334      | testvolume           |     100 GB |     3 |      59 MB |      16 KB |         0  B |
| frozenvolume         | -                    |       8 GB |     2 |      80 MB |      80 MB |         0  B |
-------------------------------------------------------------------------------------------------------------

Used mainly for accounting purposes. The columns explained:

  • snapshot - name of the snapshot

  • on volume - the name of the child volume for this snapshot, if any. For example a frozen volume would have this field empty.

  • size - the size of the snapshot as provisioned

  • repl. - replication

  • stored - how much data is actually written

  • used - stands for the amount of data that would be freed from the underlying drives (before replication) if the snapshot is removed.

  • missing info - if this value is anything other than 0  B, some of the storpool_controller services in the cluster is probably not running correctly.

The used column could be negative in some cases when the snapshot has more than one child volume. In these cases deleting the snapshot would “free” negative space, i.e. it will end up taking more space on the underlying disks.

Similar to volumes, a snapshot can have different placement groups or other attributes set, as well as a template:

# storpool snapshot testsnap template all-ssd
OK

Additional parameters that may be used:
  • placeAll - place all objects in placementGroup (Default value: default)

  • placeTail - name of placementGroup for reader (Default value: same as placeAll value)

  • placeHead - place the third replica in a different placementGroup (Default value: same as placeAll value)

  • reuseServer - place multiple copies on the same server

  • tag - set a tag in the form key=value

  • template - use template with preconfigured placement, replication and/or limits (please check for more in Templates section)

  • iops - set the maximum IOPS limit for this snapshot (in IOPS)

  • bw - set maximum bandwidth limit (in MB/s)
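
For example, setting a placement group and limits directly on a snapshot might look like this (illustrative values, using the parameters listed above):

# storpool snapshot testsnap placeAll ssd iops 1000 bw 100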

Note

The bandwidth and IOPS limits apply only to the particular snapshot if it is attached; they do not limit any child volumes using this snapshot as a parent.

Similar to the same operation with volumes, a snapshot can be renamed with:

# storpool snapshot testsnap rename ubuntu1604-base
OK

A snapshot could also be rebased to root (promoted) or rebased to another parent snapshot in a chain:

# storpool snapshot testsnap rebase # [parent-snapshot-name]
OK

To delete a snapshot use:

# storpool snapshot testsnap delete testsnap
OK

Note

A snapshot sometimes will not be deleted immediately; during this period it will be visible with * in the output of storpool volume status or storpool snapshot list.

To set a snapshot for deferred deletion use:

# storpool snapshot testsnap deleteAfter 1d
OK

The above will set a target delete date for this snapshot in exactly one day from the present time.

Note

The snapshot will be deleted at the desired point in time only if delayed snapshot delete is enabled in the local cluster; check the Management Configuration section of this guide.

9.11.1. Remote snapshots

If multi-site is enabled (the cluster has a bridge service running), a snapshot can be exported and become visible to other configured locations.

For example to export a snapshot snap1 to a remote location named StorPoolLab-Sofia use:

# storpool snapshot snap1 export StorPoolLab-Sofia
OK

To list the presently exported snapshots use:

# storpool snapshot list exports
------------------------------------------------------------------
| location               | snapshot    | globalId    | backingUp |
------------------------------------------------------------------
| StorPoolLab-Sofia      | snap1       | nzkr.b.cuj  | false     |
------------------------------------------------------------------

To list the snapshots exported from remote sites use:

# storpool snapshot list remote
-----------------------------------------------------------------------------------
| location | remoteId | name      | onVolume | size         | creationTimestamp   |
-----------------------------------------------------------------------------------
| s02      | a.o.cxz  | snapshot1 |          | 107374182400 | 2018-02-10 03:21:42 |
-----------------------------------------------------------------------------------

A single snapshot can be exported to multiple configured locations.

To create a clone of a remote snapshot locally use:

# storpool snapshot snapshot1-copy template hybrid-r3 remote s02 a.o.cxz # [tag key=value]

In this example the remote location is s02 and the remoteId is a.o.cxz. Any key=value pair tags may be configured at creation time.

To unexport a local snapshot use:

# storpool snapshot snap1 unexport StorPoolLab-Sofia
OK

The remote location can be replaced with the keyword all. This will attempt to unexport the snapshot from all locations it was previously exported to.

Note

If the snapshot is presently being transferred, the unexport operation will fail. It can be forced by adding force to the end of the unexport command; however, this is discouraged in favor of waiting for any active transfer to complete.

To unexport a remote snapshot use:

# storpool snapshot remote s02 a.o.cxz unexport
OK

The snapshot will no longer be visible with storpool snapshot list remote.

To unexport a remote snapshot and also set for deferred deletion in the remote site:

# storpool snapshot remote s02 a.o.cxz unexport deleteAfter 1h
OK

This will attempt to set a target delete date for a.o.cxz in the remote site exactly one hour from the present time. If the minimumDeleteDelay in the remote site has a higher value, e.g. 1 day, the selected value will be overridden by the minimumDeleteDelay - in this example 1 day. For more info on deferred deletion check the Multi Site section of this guide.

9.12. Attachments

Attaching a volume or snapshot makes it accessible to a client under /dev/storpool. Volumes can be attached as read-only or read-write. Snapshots are always attached read-only.

Attaching a volume testvolume to a client with ID 1 creates the block device /dev/storpool/testvolume:

# storpool attach volume testvolume client 1
OK

To attach a volume/snapshot to the node you are currently connected to use:

# storpool attach volume testvolume here
OK
# storpool attach snapshot testsnap here
OK

By default this command will block until the volume is attached to the client and the /dev/storpool/<volumename> symlink is created. For example if the storpool_block service has not been started the command will wait indefinitely. To set a timeout for this operation use:

# storpool attach volume testvolume here timeout 10 # seconds
OK

To completely disregard the readiness check use:

# storpool attach volume testvolume here noWait
OK

Note

The use of noWait is discouraged in favor of the default behaviour of the attach command.

Attaching a volume will create a read-write block device attachment by default. To attach it read-only use:

# storpool volume testvolume2 attach client 12 mode ro
OK

To list all attachments use:

# storpool attach list
----------------------------------------
| client | volume               | mode |
----------------------------------------
|     11 | testvolume           | RW   |
|     12 | testvolume1          | RW   |
|     12 | testvolume2          | RO   |
|     14 | testsnap             | RO   |
----------------------------------------

To detach use:

# storpool detach volume testvolume client 1 # or 'here' if you are on client 1

If a volume is actively being written or read from, a detach operation will fail:

# storpool detach volume testvolume client 11
Error: 'testvolume' is open at client 11

In this case the detach can be forced; however, beware that forcing a detachment is discouraged:

# storpool detach volume testvolume client 11 force yes
OK

If a volume or snapshot is attached to more than one client it could be detached from all nodes with a single CLI command:

# storpool detach volume testvolume all
OK
# storpool detach snapshot testsnap all
OK

9.13. Client

To check the status of the active storpool_block services in the cluster use:

# storpool client status
-----------------------------------
|  client  |       status         |
-----------------------------------
|       11 | ok                   |
|       12 | ok                   |
|       13 | ok                   |
|       14 | ok                   |
-----------------------------------

To wait until a client is updated use:

# storpool client 13 sync
OK

This is a way to ensure that a volume with a changed size is visible with its new size to any clients it is attached to.
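
For example, after growing a volume, the following sequence (using commands shown elsewhere in this guide) ensures that client 13 sees the new size before any filesystem on it is grown:

# storpool volume testvolume size +1G
OK
# storpool client 13 sync
OK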

To show detailed information about the active requests on a particular client at this moment use:

# storpool client 13 activeRequests
-----------------------------------------------------------------------------------------------------------------------------------
| request ID                     |  request IDX |               volume |         address |       size |       op |    time active |
-----------------------------------------------------------------------------------------------------------------------------------
| 9224499360847016133:3181950    |  1044        | testvolume           |     10562306048 |     128 KB |    write |        65 msec |
| 9224499360847016133:3188784    |  1033        | testvolume           |     10562437120 |      32 KB |     read |        63 msec |
| 9224499360847016133:3188977    |  1029        | testvolume           |     10562568192 |     128 KB |     read |        21 msec |
| 9224499360847016133:3189104    |  1026        | testvolume           |     10596122624 |     128 KB |     read |         3 msec |
| 9224499360847016133:3189114    |  1035        | testvolume           |     10563092480 |     128 KB |     read |         2 msec |
| 9224499360847016133:3189396    |  1048        | testvolume           |     10629808128 |     128 KB |     read |         1 msec |
-----------------------------------------------------------------------------------------------------------------------------------

9.14. Templates

Templates enable easy and consistent setup for a large number of volumes with common attributes, e.g. replication, placement groups or a common parent snapshot.

To create a template use:

# storpool template magnetic replication 3 placeAll hdd
OK
# storpool template hybrid replication 3 placeAll hdd placeTail ssd
OK
# storpool template ssd-hybrid replication 3 placeAll ssd placeHead hdd
OK

To list all created templates use:

# storpool template list
----------------------------------------------------------------------------------------------------------------------------
| template             |   size  | repl. | placeAll   | placeTail  | placeHead  |   iops  |    bw   | parent               |
----------------------------------------------------------------------------------------------------------------------------
| magnetic             |       - |     3 | hdd        | hdd        | hdd        |       - |       - |                      |
| hybrid               |       - |     3 | hdd        | ssd        | hdd        |       - |       - |                      |
| ssd-hybrid           |       - |     3 | ssd        | ssd        | hdd        |       - |       - |                      |
----------------------------------------------------------------------------------------------------------------------------

Hint

Understanding placement - Each volume/snapshot can be replicated on a different set of drives. Each set of drives is configured through a placement group. A volume either has all of its copies on a single set of drives on different nodes (only placeAll configured), or has its copies on different sets of drives by using the placeTail and placeHead parameters.

Hint

Understanding placement (dual replication) - With dual replication only the placeAll and placeTail parameters have effect. If not provided, placeTail is the same as the placeAll parameter. If configured, placeTail overrides placeAll and the read operations will be served by the drives in this placement group. With this placement, an example volume with dual replication, placeAll in the hdd placement group and placeTail in the ssd group would have one copy of the data on the drives in each placement group. If some of the drives in the ssd placement group are missing, reads will be served by a drive in the hdd placement group.

Hint

Understanding placement (triple replication) - With triple replication the placeAll - placeTail dependencies and policy are again in effect, but now the placeHead parameter also comes into play. The placeHead parameter overrides the placeAll parameter. So a volume with placeAll configured in the ssd and placeHead configured in the hdd placement group would have two copies on the drives in ssd and a third copy for each chunk of data on a drive in the hdd placement group. If a single drive (or node with drives) in the ssd placement group is missing, all reads would still come from the drives in the placeAll placement group; if, however, a drive from another node (or another node) is missing, reads will be served by the drives in the placeHead placement group.
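
For example, a volume created from the hybrid template defined above gets its replication and placement from the template (the volume name here is illustrative):

# storpool volume vm2.disk1 size 40G template hybrid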

To get the status of a template with detailed info on the usage and the available space left with this placement use:

# storpool template status
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| template             | place all  | place tail | place head | repl. | volumes | snapshots/removing |    size | capacity |  avail. | avail. all | avail. tail | avail. head |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| magnetic             | hdd        | hdd        | hdd        |     3 |     115 |       631/0        |   28 TB |    80 TB |   52 TB |     240 TB |      240 TB |      240 TB |
| hybrid               | hdd        | ssd        | hdd        |     3 |     208 |       347/9        |   17 TB |    72 TB |   55 TB |     240 TB |       72 TB |      240 TB |
| ssd-hybrid           | ssd        | ssd        | hdd        |     3 |      40 |         7/0        |    4 TB |    36 TB |   36 TB |     240 TB |       72 TB |      240 TB |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

To change template attributes directly use:

# storpool template hdd-only size 120G
OK
# storpool template hybrid size 40G iops 4000
OK

Parameters that can be set:
  • replication - change the number of copies for volumes or snapshots created with this template

  • size - default size if not specified for each volume created with this template

  • placeAll - place all objects in placementGroup (Default value: default)

  • placeTail - name of placementGroup for reader (Default value: same as placeAll value)

  • placeHead - place the third replica in a different placementGroup (Default value: same as placeAll value)

  • iops - set the maximum IOPS limit for volumes created with this template (in IOPS)

  • bw - set the maximum bandwidth limit for volumes created with this template (in MB/s)

  • parent - set a parent snapshot for all volumes created with this template

  • reuseServer - place multiple copies on the same server
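
Several of these parameters may be combined in a single statement; for example, a hypothetical template with both placement and limits:

# storpool template hybrid-capped replication 3 placeAll hdd placeTail ssd size 40G iops 4000 bw 200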

Changing the parameters of a template will affect only new volumes created with it. To propagate a template’s updated parameters to all existing volumes and snapshots already created with it, use propagate. The following example changes the bandwidth limit for all volumes and snapshots created with the template magnetic:

# storpool template magnetic bw 100MB propagate
OK

Note

When using storpool template $TEMPLATE propagate, all the parameters of $TEMPLATE will be re-applied to all volumes and snapshots created with it.

Note

Changing template parameters with the propagate option will not automatically re-allocate the content of the existing volumes on the disks. If the replication or placement groups are changed, run the balancer to apply the new settings to the existing volumes. However, if the changes are made directly to the volume instead of to the template, running the balancer is not required.

Attention

Dropping the replication (e.g. from triple to dual) of a large number of volumes is an almost instant operation; however, returning them back to triple is similar to creating the third copy for the first time.

To rename a template use:

# storpool template magnetic rename backup
OK

To delete a template use:

# storpool template hdd-only delete hdd-only
OK

Note

The delete operation might fail if there are volumes/snapshots that are created with this template.

9.15. iSCSI

The StorPool iSCSI support is documented more extensively in the StorPool iSCSI support section; these are the commands used to configure it and view the configuration.

To set the cluster’s iSCSI base IQN iqn.2018-02.com.example:examplename:

# storpool iscsi config setBaseName iqn.2018-02.com.example:examplename
OK

To create a portal group examplepg used to group exported volumes for access by initiators using 192.168.42.247/24 (CIDR notation) as the portal IP address:

# storpool iscsi config portalGroup examplepg create addNet 192.168.42.247/24
OK

To create a portal for the initiators to connect to (for example portal IP address 192.168.42.202 and StorPool’s SP_OURID 5):

# storpool iscsi config portal create portalGroup examplepg address 192.168.42.202 controller 5
OK

To define the iqn.2018-02.com.example:abcdefgh initiator that is allowed to connect from the 192.168.42.0/24 network (w/o authentication):

# storpool iscsi config initiator iqn.2018-02.com.example:abcdefgh create net 192.168.42.0/24
OK

To define the iqn.2018-02.com.example:client initiator that is allowed to connect from the 192.168.42.0/24 network and must authenticate using the standard iSCSI password-based challenge-response authentication method using the username user and the password secret:

# storpool iscsi config initiator iqn.2018-02.com.example:client create net 192.168.42.0/24 chap user secret
OK

To specify that the StorPool volume tinyvolume should be exported to one or more initiators:

# storpool iscsi config target create tinyvolume
OK

To actually export the StorPool volume tinyvolume to the iqn.2018-02.com.example:abcdefgh initiator via the examplepg portal group (the StorPool iSCSI service will automatically pick a portal to export the volume through):

# storpool iscsi config export initiator iqn.2018-02.com.example:abcdefgh portalGroup examplepg volume tinyvolume
OK

Note

The volume will be visible to the initiator as IQN <BaseName>:<volume>.
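
From the initiator side, assuming a Linux host with the open-iscsi tools and 192.168.42.247 as the portal group's floating address, discovery and login might look like this:

# iscsiadm -m discovery -t sendtargets -p 192.168.42.247:3260
# iscsiadm -m node -T iqn.2018-02.com.example:examplename:tinyvolume -p 192.168.42.247:3260 --login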

To view the iSCSI cluster base IQN:

# storpool iscsi basename
---------------------------------------
| basename                            |
---------------------------------------
| iqn.2018-02.com.example:examplename |
---------------------------------------

To view the portal groups:

# storpool iscsi portalGroup list
---------------------------------------------
| name       | networksCount | portalsCount |
---------------------------------------------
| examplepg  |             1 |            0 |
---------------------------------------------

To view the portals:

# storpool iscsi portalGroup list portals
--------------------------------------------------
| group       | address             | controller |
--------------------------------------------------
| examplepg   | 192.168.42.246:3260 |          1 |
| examplepg   | 192.168.42.202:3260 |          5 |
--------------------------------------------------

To view the defined initiators:

# storpool iscsi initiator list
---------------------------------------------------------------------------------------
| name                             | username | secret | networksCount | exportsCount |
---------------------------------------------------------------------------------------
| iqn.2018-02.com.example:abcdefgh |          |        |             1 |            1 |
| iqn.2018-02.com.example:client   | user     | secret |             1 |            0 |
---------------------------------------------------------------------------------------

To view the volumes that may be exported to initiators:

# storpool iscsi target list
-------------------------------------------------------------------------------------
| name                                           | volume     | currentControllerId |
-------------------------------------------------------------------------------------
| iqn.2018-02.com.example:examplename:tinyvolume | tinyvolume |               65535 |
-------------------------------------------------------------------------------------

To view the volumes currently exported to initiators:

# storpool iscsi initiator list exports
--------------------------------------------------------------------------------------------------------------------------------------
| name                                           | volume     | currentControllerId | portalGroup | initiator                        |
--------------------------------------------------------------------------------------------------------------------------------------
| iqn.2018-02.com.example:examplename:tinyvolume | tinyvolume |                   1 |             | iqn.2018-02.com.example:abcdefgh |
--------------------------------------------------------------------------------------------------------------------------------------

To stop exporting the tinyvolume volume to the initiator with iqn iqn.2018-02.com.example:abcdefgh and the examplepg portal group:

# storpool iscsi config unexport initiator iqn.2018-02.com.example:abcdefgh portalGroup examplepg volume tinyvolume
OK

To remove an iSCSI definition for the tinyvolume volume:

# storpool iscsi config target delete tinyvolume
OK

To remove access for the iqn.2018-02.com.example:client iSCSI initiator:

# storpool iscsi config initiator iqn.2018-02.com.example:client delete
OK

To remove the portal 192.168.42.202 IP address:

# storpool iscsi config portal delete address 192.168.42.202
OK

To remove portal group examplepg after all the portals have been removed:

# storpool iscsi config portalGroup examplepg delete
OK

Note

Only portal groups without portals may be deleted.

9.16. Kubernetes

To register a Kubernetes cluster:

# storpool kubernetes add name cluster1
OK

To disable Kubernetes cluster:

# storpool kubernetes update name cluster1 disable yes
OK

To enable Kubernetes cluster:

# storpool kubernetes update name cluster1 disable no
OK

To delete Kubernetes cluster:

# storpool kubernetes delete name cluster1
OK

To list registered Kubernetes clusters:

# storpool kubernetes list
-----------------------
| name     | disabled |
-----------------------
| cluster1 | false    |
-----------------------

To view the status of the registered Kubernetes clusters:

# storpool kubernetes status
--------------------------------------------------------------
| name     | sc | w   | pvc | noRsrc | noTempl | mode | noSC |
--------------------------------------------------------------
| cluster1 |  0 | 0/3 |   0 |      0 |       0 |    0 |    0 |
--------------------------------------------------------------
Fields:
  sc      - registered Storage Classes
  w       - watch connections to the kube adm
  pvc     - persistentVolumeClaims being provisioned
  noRsrc  - persistentVolumeClaims failed due to no resources
  noTempl - persistentVolumeClaims failed due to missing template
  mode    - persistentVolumeClaims failed due to unsupported access mode
  noSC    - persistentVolumeClaims failed due to missing storage class

9.17. Relocator

The relocator is an internal StorPool service that takes care of data re-allocation when a volume’s replication or placement group parameters are changed, or when there are any pending rebase operations. This service is turned on by default.

If needed, the relocator can be turned off with:

# storpool relocator off
OK

To turn back on use:

# storpool relocator on
OK

To display the relocator status:

# storpool relocator status
relocator on, no volumes to relocate

The following additional relocator commands are available:
  • storpool relocator disks - returns the state of the disks after the relocator finishes all presently running tasks, as well as the quantity of objects and data each drive still needs to recover. The output is the same as with storpool balancer disks after the balancing task has been committed, see Balancer for more details.

  • storpool relocator volume <volumename> disks or storpool relocator snapshot <snapshotname> disks - shows the same information as storpool relocator disks, but only for the pending operations of the specific volume or snapshot.

9.18. Balancer

The balancer is used to redistribute data when a disk or set of disks (e.g. a new node) is added to or removed from the cluster. By default it is off. It has to be turned on after changes in the cluster configuration for the redistribution of data to occur.

To display the status of the balancer:

# storpool balancer status
balancer waiting, auto off

To load a re-balancing task, please refer to the Rebalancing StorPool section of this guide.

To discard the re-balancing operation use:

# storpool balancer stop
OK

To actually commit the changes and start the relocations of the proposed changes use:

# storpool balancer commit
OK

After the commit, the changes will be visible only with storpool relocator disks, and many volumes and snapshots will have the M flag in the output of storpool volume status until all relocations are completed. The progress can be followed with storpool task list (see Tasks).

9.19. Tasks

The tasks are all outstanding operations for recovering or relocating data, either within the present cluster or between two connected clusters.

For example if a disk with ID 1401 was not in the cluster for a period of time and is then returned, all outdated objects will be recovered from the other drives with the latest changes.

These recovery operations can be listed with:

# storpool task list
----------------------------------------------------------------------------------------
|     disk |  task id | total size |  completed |    started |  remaining | % complete |
----------------------------------------------------------------------------------------
|     1401 |        0 |      51 GB |      45 GB |      32 MB |     6.7 GB |        86% |
|     1401 |        0 |      44 GB |       0  B |       0  B |      44 GB |         0% |
----------------------------------------------------------------------------------------
|    total |          |      96 GB |      45 GB |      32 MB |      51 GB |        46% |
----------------------------------------------------------------------------------------

Other cases in which task operations may be listed are when a re-balancing operation has been committed and relocations are in progress, as well as when a cloning operation for a remote snapshot into the local cluster is in progress.

9.20. Management Configuration

The mgmtConfig submenu is used to set some internal configuration parameters.

To list the presently configured parameters use:

# storpool mgmtConfig list
relocator on, interval 5 sec
relocator transaction: min objects 320, max objects 4294967295
relocator recovery: max tasks per disk 2, max objects per disk 3200
relocator recovery objects trigger 32
balancer auto off, interval 5 sec
snapshot delete interval 5 sec
disks soft-eject interval 5 sec
snapshot delayed delete off
snapshot dematerialize interval 10 sec

To enable deferred snapshot deletion (default off) use:

# storpool mgmtConfig delayedSnapshotDelete on
OK

When enabled, all snapshots with a configured deletion time will be cleared at the configured date and time.

To change the default interval between periodic checks whether disks marked for ejection can actually be ejected (5 sec.) use:

# storpool mgmtConfig disksSoftEjectInterval 20000 # value in ms - 20 sec.
OK

To change the default interval (5 sec.) for the relocator to check if there is new work to be done use:

# storpool mgmtConfig relocatorInterval 20000 # value is in ms - 20 sec.
OK

To set a number of objects per disk in recovery at a time that differs from the default (3200):

# storpool mgmtConfig relocatorMaxRecoveryObjectsPerDisk 2000 # value in number of objects per disk
OK

To change the default maximum number of recovery tasks per disk (2 tasks) use:

# storpool mgmtConfig relocatorMaxRecoveryTasksPerDisk 4 # value is number of tasks per disk - will set 4 tasks
OK

To change the minimum (default 320) or the maximum (default 4294967295) number of objects per transaction for the relocator use:

# storpool mgmtConfig relocatorMaxTrObjects 2147483647
OK
# storpool mgmtConfig relocatorMinTrObjects 640
OK

To change the maximum number of objects in recovery for a disk to be usable by the relocator (default 32) use:

# storpool mgmtConfig relocatorRecoveryObjectsTrigger 64

To change the default interval for checking for new snapshots to be deleted use:

# storpool mgmtConfig snapshotDeleteInterval

To enable snapshot dematerialization or change the interval use:

# storpool mgmtConfig snapshotDematerializeInterval 30000 # sets the interval to 30 seconds; 0 disables it

Snapshot dematerialization checks and removes all objects that do not refer to any data, i.e. no change in this object from the last snapshot (or ever). This helps to reduce the number of used objects per disk in clusters with a large number of snapshots and a small number of changed blocks between the snapshots in the chain.

Please consult with StorPool support before changing the management configuration defaults.

9.21. Mode

Support for a couple of different output modes is available, both in the interactive shell and when the CLI is invoked directly. Some custom format options are available only for some operations.

Available modes:
  • csv - Semicolon-separated values for some commands

  • json - Processed JSON output for some commands

  • pass - Pass the JSON response through

  • raw - Raw output (display the HTTP request and response)

  • text - Human readable output (default)

Example with switching to csv mode in the interactive shell:

StorPool> mode csv
OK
StorPool> net list
nodeId;flags;MAC - net 1;MAC - net 2;MAC - net 3;MAC - net 4
11;uU + AJ;00:25:90:49:1B:DA;00:25:90:3C:B7:06;
12;uU + AJ;00:25:90:3B:B7:FF;00:25:90:3B:B7:FE;
13;uU + AJ;90:E2:BA:52:1A:6D;90:E2:BA:52:1A:6D;
14;uU + AJ;00:25:90:48:1A:F0;00:25:90:48:1A:F0;

The same applies when using the CLI directly:

# storpool -f csv net list # the output is the same as above
[snip]
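
Assuming the -f option accepts the other modes listed above in the same way, a machine-readable variant would be:

# storpool -f json net list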

10. Multi server

The multi-server feature enables the use of up to four separate storpool_server instances on a single node. This makes sense for dedicated storage nodes or for heavily-loaded converged setups where more resources are isolated for networking and storage.

For example a dedicated storage node with 36 drives would provide better peak performance with 4 server instances each controlling 1/4th of all disks/SSDs than with a single instance. Another good example would be a converged node with 16 SSDs/HDDs, which would provide better peak performance with two server instances each controlling half of the drives and running on separate CPU cores or even running on two threads on a single CPU core compared to a single server instance.

Enabling multi-server is as easy as installing the multiserver module at install time.

To configure the CPUs on which the different instances run, configure the cpuset cgroups in which each of the instances is started; check the defaults in Cgroup setup.

Which server instance handles each of the drives in the node is controlled with storpool_initdisk.

For example, if we have two drives with IDs 1101 and 1102, both controlled by the first server instance, the output from storpool_initdisk would look like this:

# storpool_initdisk --list
/dev/sde1, diskId 1101, version 10007, server instance 0, cluster init.b, SSD
/dev/sdf1, diskId 1102, version 10007, server instance 0, cluster init.b, SSD

Setting an SSD drive to be controlled by the second server instance would look like this:

# storpool_initdisk -r -i 1 /dev/sdXN   # where X is the drive letter and N is the partition number e.g. /dev/sdf1

Hint

The above command will fail if the storpool_server service is running; please eject the disk prior to re-setting its server instance.

On some occasions, if the first server instance was configured with a large amount of cache (check SP_CACHE_SIZE in the Configuration Guide), first split the cache between the different instances (e.g. from 8192 to 4096 when migrating from one to two instances).
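
A minimal sketch of the idea, assuming the cache size is configured through SP_CACHE_SIZE in /etc/storpool.conf as described in the Configuration Guide:

# grep SP_CACHE_SIZE /etc/storpool.conf
SP_CACHE_SIZE=4096   # was 8192 with a single server instance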

11. Volume management

Volume

Volumes are the basic service of the StorPool storage system. A volume always has a name and a certain size. It can be read from and written to, and can be attached to hosts as a read-only or read-write block device under the /dev/storpool directory.

The volume name is a string consisting of one or more of the allowed characters - upper and lower latin letters (a-z, A-Z), numbers (0-9) and the delimiters dot (.), colon (:), dash (-) and underscore (_).

11.1. Creating a volume

Creating a volume

11.2. Deleting a volume

Deleting a volume

11.3. Renaming a volume

Renaming a volume

11.4. Resizing a volume

Resizing a volume

11.5. Snapshots

Snapshot

Snapshots are read-only point-in-time images of volumes. They are created once and cannot be changed. They can be attached to hosts as read-only block devices under /dev/storpool.



Namespace for volumes and snapshots

All volumes and snapshots share the same name-space. Names of volumes and snapshots are unique within a StorPool cluster. This diagram illustrates the relationship between a snapshot and a volume. Volume vol1 is based on snapshot snap1. vol1 contains only the changes since snap1 was taken. In the common case this is a small amount of data. Arrows indicate a child-parent relationship. Each volume or snapshot may have exactly one parent which it is based upon. Writes to vol1 are recorded within the volume. Reads from vol1 may be served by vol1 or by its parent snapshot - snap1, depending on whether vol1 contains changed data for the read request or not.



Volume snapshot relation

Snapshots and volumes are completely independent. Each snapshot may have many children (volumes and snapshots). Volumes cannot have children.

Volume snapshot chain

snap1 contains a full image. snap2 contains only the changes since snap1 was taken. vol1 and vol2 contain only the changes since snap2 was taken.

11.6. Creating a snapshot of a volume

There is a volume named vol1.

Creating a snapshot

After the first snapshot the state of vol1 is recorded in a new snapshot named snap1. vol1 does not occupy any space now, but will record any new writes which come in after the creation of the snapshot. Reads from vol1 may fall through to snap1.

Then the state of vol1 is recorded in a new snapshot named snap2. snap2 contains the changes between the moment snap1 was taken and the moment snap2 was taken. snap2’s parent is the original parent of vol1.

11.7. Converting a volume to a snapshot (freeze)

There is a volume named vol1, based on a snapshot snap0. vol1 contains only the changes since snap0 was taken.

Volume freeze

After the freeze operation the state of vol1 is recorded in a new snapshot with the same name. The snapshot vol1 contains changes between the moment snap0 was taken and the moment vol1 was frozen.

11.8. Creating a volume based on an existing snapshot (a.k.a. clone)

Before the creation of vol1 there is a snapshot named snap1.

Snapshot clones

A new volume named vol1 is created. vol1 is based on snap1. The newly created volume does not occupy any space initially. Reads from vol1 may fall through to snap1 or to snap1’s parents (if any).

11.9. Deleting a snapshot

vol1 and vol2 are based on snap1. snap1 is based on snap0. snap1 contains the changes between the moment snap0 was taken and when snap1 was taken. vol1 and vol2 contain the changes since the moment snap1 was taken.

Deleting a snapshot

After the deletion, vol1 and vol2 are based on snap1’s original parent (if any). In the example they are now based on snap0. When deleting a snapshot, the changes contained in it will not be propagated to its children; StorPool will keep snap1 in the deleting state to prevent an explosion of disk space usage.

11.10. Rebase to null (a.k.a. promote)

vol1 is based on snap1. snap1 is in turn based on snap0. snap1 contains the changes between the moment snap0 was taken and when snap1 was taken. vol1 contains the changes since the moment snap1 was taken.

Rebase to null

After promotion vol1 is not based on a snapshot. vol1 now contains all data, not just the changes since snap1 was taken. Any relation between snap1 and snap0 is unaffected.

11.11. Rebase

vol1 is based on snap1. snap1 is in turn based on snap0. snap1 contains the changes between the moment snap0 was taken and when snap1 was taken. vol1 contains the changes since the moment snap1 was taken.

Rebase

After the rebase operation vol1 is based on snap0. vol1 now contains all changes since snap0 was taken, not just since snap1. snap1 is unchanged.

11.12. Example use of snapshots

Example use of snapshots

This is a semi-realistic example of how volumes and snapshots may be used. There is a snapshot called base.centos7. This snapshot contains a base CentOS 7 VM image, which was prepared carefully by the service provider. There are 3 customers with 4 virtual machines each. All virtual machine images are based on CentOS 7, but may contain custom data, which is unique to each VM.

Example use of snapshots

This example shows another typical use of snapshots - restore points back in time for a volume. There is one base image for CentOS 7, three snapshot restore points and one live volume cust123.v.1.

12. StorPool iSCSI support

If StorPool volumes need to be accessed by hosts that cannot run the StorPool client service (e.g. VMware hypervisors), they may be exported using the iSCSI protocol.

12.1. A Quick Overview of iSCSI

The iSCSI remote block device access protocol, as implemented by the StorPool iSCSI service, is a client-server protocol allowing clients (referred to as “initiators”) to read and write data to disks (referred to as “targets”) exported by iSCSI servers. The iSCSI servers listen on portals (TCP ports, usually 3260, on specific IP addresses); these portals may be grouped into so-called portal groups to provide fine-grained access control or load balancing for the iSCSI connections.

12.2. An iSCSI Setup in a StorPool Cluster

The StorPool implementation of iSCSI provides a way to mark StorPool volumes as accessible to iSCSI initiators, define iSCSI portals where hosts running the StorPool iSCSI service listen for connections from initiators, define portal groups over these portals, and export StorPool volumes (iSCSI targets) to iSCSI initiators in the portal groups. To simplify the configuration of the iSCSI initiators, and also to provide load balancing and failover, each portal group has a floating IP address that is automatically brought up on only a single StorPool service at a given moment; the initiators are configured to connect to this floating address, authenticating if necessary, and are then redirected to the portal of the StorPool service that actually exports the target (volume) that they need to access.

In the simplest setup, there is a single portal group with a floating IP address, there is a single portal for each StorPool host that runs the iSCSI service, all the initiators connect to the floating IP address and are redirected to the correct host. For quality of service or fine-grained access control, more portal groups may be defined and some volumes may be exported via more than one portal group.

A trivial setup may be brought up by the following series of StorPool CLI commands; see the CLI tutorial for more information about the commands themselves:

# storpool iscsi config setBaseName iqn.2018-01.com.example:poc-cluster
OK

# storpool iscsi config portalGroup poc create
OK

# storpool iscsi config portalGroup poc create addNet 192.168.42.247/24
OK

# storpool iscsi config portal create portalGroup poc address 192.168.42.246 controller 1
OK

# storpool iscsi config portal create portalGroup poc address 192.168.42.202 controller 3
OK

# storpool iscsi portalGroup list
---------------------------------------
| name | networksCount | portalsCount |
---------------------------------------
| poc  |             1 |            2 |
---------------------------------------

# storpool iscsi portalGroup list portals
--------------------------------------------
| group | address             | controller |
--------------------------------------------
| poc   | 192.168.42.246:3260 |          1 |
| poc   | 192.168.42.202:3260 |          3 |
--------------------------------------------

# storpool iscsi config initiator iqn.2018-01.com.example:poc-cluster:hv1 create
OK

# storpool iscsi config target create tinyvolume
OK

# storpool iscsi config export volume tinyvolume portalGroup poc initiator iqn.2018-01.com.example:poc-cluster:hv1
OK

# storpool iscsi initiator list
----------------------------------------------------------------------------------------------
| name                                    | username | secret | networksCount | exportsCount |
----------------------------------------------------------------------------------------------
| iqn.2018-01.com.example:poc-cluster:hv1 |          |        |             0 |            1 |
----------------------------------------------------------------------------------------------

# storpool iscsi initiator list exports
---------------------------------------------------------------------------------------------------------------------------------------------
| name                                           | volume     | currentControllerId | portalGroup | initiator                               |
---------------------------------------------------------------------------------------------------------------------------------------------
| iqn.2018-01.com.example:poc-cluster:tinyvolume | tinyvolume |                   1 | poc         | iqn.2018-01.com.example:poc-cluster:hv1 |
---------------------------------------------------------------------------------------------------------------------------------------------

12.3. Caveats with a Complex iSCSI Architecture

In iSCSI portal definitions, a TCP address/port pair must be unique; only a single portal within the whole cluster may be defined at a single IP address and port. Thus, if the same StorPool iSCSI service should be able to export volumes in more than one portal group, the portals should be placed either on different ports or on different IP addresses (although it is fine for these addresses to be brought up on the same network interface on the host).

The redirecting portal on the floating address of a portal group always listens on port 3260. Similarly to the above, different portal groups must have different floating IP addresses, although they are automatically brought up on the same network interfaces as the actual portals within the groups.

Some iSCSI initiator implementations (e.g. VMware vSphere) may only connect to TCP port 3260 for an iSCSI service. In a more complex setup where a StorPool service on a single host may export volumes in more than one portal group, this might mean that the different portals must reside on different IP addresses, since the port number is the same.

For technical reasons, currently a StorPool volume may only be exported by a single StorPool service (host), even though it may be exported in different portal groups. For this reason, some care should be taken in defining the portal groups so that they may have at least some StorPool services (hosts) in common.

13. Multi site

13.1. Initial setup

With StorPool, different locations can be connected through the storpool_bridge service running on a public interface in each location, for disaster recovery and remote backup purposes.

To configure a bridge on one of the nodes in the cluster, the following parameters have to be configured in /etc/storpool.conf:

SP_CLUSTER_NAME=<Human readable name of the cluster>
SP_CLUSTER_ID=<location ID>.<cluster ID>
SP_BRIDGE_HOST=<IP address>
SP_BRIDGE_TEMPLATE=<template>
SP_BRIDGE_IFACE=<interface> # optional with IP failover

The SP_CLUSTER_NAME is optional and can be used as a name for the location/cluster at the remote site.

The SP_CLUSTER_ID is a unique ID assigned by StorPool to each existing cluster, for example nmjc.b. It consists of two parts: a location ID (the part before the dot, nmjc) and a cluster ID (the part after the dot, b).

The SP_BRIDGE_HOST is the public IP address on which the bridge listens for connections from other bridges. Note that port 3749 has to be open in the firewalls between the two locations.

The SP_BRIDGE_TEMPLATE is needed to instruct the local bridge which template should be used for incoming snapshots from remote sites.

The SP_BRIDGE_IFACE is required when two or more bridges are configured with the same public/private key pairs and the SP_BRIDGE_HOST is a floating IP address.
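
For illustration, the bridge configuration of Cluster_A from the example in the next section could look like this in /etc/storpool.conf (the template name hybrid is only an example):

SP_CLUSTER_NAME=Cluster_A
SP_CLUSTER_ID=a.b
SP_BRIDGE_HOST=1.2.3.4
SP_BRIDGE_TEMPLATE=hybrid
# SP_BRIDGE_IFACE is only needed when SP_BRIDGE_HOST is a floating IP address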

13.2. Connecting two locations

Let's take an example with two clusters named Cluster_A and Cluster_B. To have these two connected through their bridge services, we have to introduce each of them to the other.


13.2.1. Cluster A

The following parameters from Cluster_B will be required:

  • The SP_CLUSTER_ID - b.b

  • The SP_BRIDGE_HOST IP address - 5.6.7.8

  • The public key in /usr/lib/storpool/bridge/bridge.key.txt in the remote bridge host in Cluster_B - eeee.ffff.gggg.hhhh

  • The SP_CLUSTER_NAME - Cluster_B

By using the CLI we could add Cluster_B’s location with the following command in Cluster_A:

user@hostA # storpool location add b Cluster_B

Then add the cluster ID (which is b in this case) with:

user@hostA # storpool cluster add Cluster_B b

The last step in Cluster_A is to register Cluster_B’s bridge. The command will look like this:

user@hostA # storpool remoteBridge register Cluster_B 5.6.7.8 eeeeeeeeeeeee.ffffffffffff.ggggggggggggg.hhhhhhhhhhhhh

Registered bridges in Cluster_A:

user@hostA # storpool remoteBridge list
----------------------------------------------------------------------------------------------------------
| ip          | location   | minimumDeleteDelay | publicKey                                              |
----------------------------------------------------------------------------------------------------------
| 5.6.7.8     | Cluster_B  |                    | eeeeeeeeeeeee.ffffffffffff.ggggggggggggg.hhhhhhhhhhhhh |
----------------------------------------------------------------------------------------------------------

Hint

The public key in /usr/lib/storpool/bridge/bridge.key.txt will be generated on the first run of the storpool_bridge service.

13.2.2. Cluster B

Similarly here the parameters from Cluster_A will be required:

  • The SP_CLUSTER_ID - a.b

  • The SP_BRIDGE_HOST IP address in Cluster_A - 1.2.3.4

  • The public key in /usr/lib/storpool/bridge/bridge.key.txt in the remote bridge host in Cluster_A - aaaa.bbbb.cccc.dddd

  • The SP_CLUSTER_NAME - Cluster_A

Then, to add and register Cluster_A’s location and bridge in Cluster_B, the commands are:

user@hostB # storpool location add a Cluster_A

Then adding the cluster ID:

user@hostB # storpool cluster add Cluster_A b

And lastly, register Cluster_A’s bridge locally with:

user@hostB # storpool remoteBridge register Cluster_A 1.2.3.4 aaaaaaaaaaaaa.bbbbbbbbbbbb.ccccccccccccc.ddddddddddddd

Registered bridges in Cluster_B:

user@hostB # storpool remoteBridge list
----------------------------------------------------------------------------------------------------------
| ip          | location   | minimumDeleteDelay | publicKey                                              |
----------------------------------------------------------------------------------------------------------
| 1.2.3.4     | Cluster_A  |                    | aaaaaaaaaaaaa.bbbbbbbbbbbb.ccccccccccccc.ddddddddddddd |
----------------------------------------------------------------------------------------------------------

At this point, provided network connectivity is working, the two bridges will be connected.


13.3. Bridge redundancy

There are two ways to add redundancy for the bridge service; both involve configuring and starting the storpool_bridge service on two (or more) nodes in each cluster.

13.3.1. Separate IP addresses

Configure and start the storpool_bridge service with a separate SP_BRIDGE_HOST address and a separate public/private key pair on each node. In this case each of the bridge nodes has to be registered in the remote cluster in the same way as explained in the Connecting two locations section. The SP_BRIDGE_IFACE parameter is left unset, and the SP_BRIDGE_HOST address is expected to be already configured on each node where the storpool_bridge service is started.

Each of the bridge nodes in Cluster_A has to be registered in Cluster_B and vice versa. Only one bridge is active at a time; it is failed over to another node in case the node or the active service is restarted.

13.3.2. Single IP failed over between the nodes

Configure and start the storpool_bridge service on the first node. Then distribute the /usr/lib/storpool/bridge/bridge.key and /usr/lib/storpool/bridge/bridge.key.txt files to the other nodes where the storpool_bridge service will be running. The SP_BRIDGE_IFACE parameter is required; this is the interface on which the SP_BRIDGE_HOST address will be configured. The SP_BRIDGE_HOST address will be up only on the node where the active bridge service is running, until either the service or the node itself is restarted.
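
A minimal sketch of distributing the key files, assuming root SSH access and a hypothetical second bridge node named node2:

# scp -p /usr/lib/storpool/bridge/bridge.key /usr/lib/storpool/bridge/bridge.key.txt node2:/usr/lib/storpool/bridge/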

With this configuration there will be only one bridge registered in the remote cluster, regardless of the number of nodes used for the storpool_bridge service in the local cluster, because all the services are essentially running with the same IP address and public/private key pair configuration.

The failover SP_BRIDGE_HOST is better suited for NAT/port-forwarding cases.

13.4. Exports

Once the two clusters are connected, a snapshot in one of them can be exported and become visible at the remote location. For example, a snapshot called snap1 can be exported with:

user@hostA # storpool snapshot snap1 export Cluster_B

It becomes visible in Cluster_B and can be listed with:

user@hostB # storpool snapshot list remote
-----------------------------------------------------------------------------------
| location  | remoteId | name     | onVolume | size         | creationTimestamp   |
-----------------------------------------------------------------------------------
| Cluster_A | a.b.1    | snap1    |          | 107374182400 | 2018-01-03 15:18:02 |
----------------------------------------------------------------------------------

13.5. Remote clones

Any remote snapshot can be cloned locally. For example, to clone a remote snapshot with a globalId of a.b.1 locally, we can use:

user@hostB # storpool snapshot snap1-copy template hybrid remote Cluster_A a.b.1

The name of the clone of the snapshot in Cluster_B will be snap1-copy, with all parameters taken from the hybrid template. Please note that the different name in Cluster_B is just for the example; the snapshot could be named snap1 in both clusters.

The transfer will start immediately. Only written sectors of the snapshot will be transferred between the sites; for example, if snap1 has a size of 100 GB but only 1 GB of data was ever written, only this data will be transferred between the two sites.

If another snapshot based on snap1 (for example snap2) is later exported to the remote site, the actual transfer will include only the differences between snap1 and snap2, since snap1 has already been cloned in Cluster_B.
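
For example, with a hypothetical globalId of a.b.2 for snap2, the incremental transfer would be triggered in the same way as the first one:

user@hostA # storpool snapshot snap2 export Cluster_B

user@hostB # storpool snapshot snap2 template hybrid remote Cluster_A a.b.2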


The globalId for this snapshot will be the same for all sites it has been transferred to.

13.6. Creating a remote backup on a volume

The volume backup feature is in essence a set of steps that automate the backup procedure for a particular volume.

For example, to back up a volume named volume1 from Cluster_A to Cluster_B, we will use:

user@hostA # storpool volume volume1 backup Cluster_B

The above command will actually trigger the following set of events:

  1. Creates a local temporary snapshot of volume1 in Cluster_A to be transferred to Cluster_B

  2. Exports the temporary snapshot to Cluster_B

  3. Instructs Cluster_B to initiate the transfer for this snapshot

  4. Exports the transferred snapshot in Cluster_B to be visible from Cluster_A

  5. Deletes the local temporary snapshot

For example, if a backup operation has been initiated for a volume called volume1 in Cluster_A, the progress of the operation can be followed with:

user@hostA # storpool snapshot list exports
---------------------------------------------------
| location  | snapshot     | globalId | backingUp |
---------------------------------------------------
| Cluster_B | volume1@1433 | a.b.p    | true      |
---------------------------------------------------

Once this operation completes, the temporary snapshot will no longer be visible as an export, and a snapshot with the same globalId will be visible remotely:

user@hostA # storpool snapshot list remote
-------------------------------------------------------------------------------------
| location  | remoteId | name    | onVolume    | size         | creationTimestamp   |
-------------------------------------------------------------------------------------
| Cluster_B | a.b.p    | volume1 | volume1     | 107374182400 | 2018-01-13 16:27:03 |
-------------------------------------------------------------------------------------

13.7. Restoring a volume from remote snapshot

Restoring the volume to a previous state from a remote snapshot involves the following steps:

  1. The first step is to create a local snapshot from the remotely exported one:

    user@hostA # storpool snapshot volume1-restore template hybrid remote Cluster_B a.b.p
    OK
    

There are some bits to explain in the above example - from left to right:

  • volume1-restore is the name of the local snapshot that will be created.

  • The template hybrid part instructs StorPool which replication and placement to use for the locally created snapshot.

  • The last part, remote Cluster_B a.b.p, tells StorPool the remote location and the globalId, in this case Cluster_B and a.b.p.

If the bridges and the connection between the locations are operational, the transfer will begin immediately.

  2. Next, create a volume with the newly created snapshot as a parent:

    user@hostA # storpool volume volume1-tmp parent volume1-restore
    

Note

If the transfer hasn’t completed yet, some operations, e.g. a read from a sector that has not been completely transferred yet, might stall until the object being read is transferred. Request forwarding to the remote bridge is on our feature list and will be available in a future release.

  3. Finally, the volume clone has to be attached where it is needed.

The last step might involve reconfiguring a virtual machine to use this volume instead of the presently used one. This is handled differently in different orchestration systems.
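
As a hedged example (the attach syntax and the idea of attaching the clone on the local host are assumptions and depend on the environment and the orchestration system), attaching the clone on the host where it is needed might look like:

user@hostA # storpool attach volume volume1-tmp here    # assumed syntax: attach the clone on the node where the command is run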

13.8. Remote deferred deletion

The remote bridge can be registered with remote deferred deletion enabled. This feature enables a user in Cluster_A to unexport remote snapshots and set them for deferred deletion in Cluster_B.

An example for the case without deferred deletion enabled - Cluster_A and Cluster_B are two StorPool clusters connected with a bridge. There is a volume named volume1 in Cluster_A. This volume has two backup snapshots in Cluster_B called volume1@281 and volume1@294.


The remote snapshots can be unexported from Cluster_A with the deleteAfter flag; however, it will be silently ignored in Cluster_B.

To enable this feature, the following steps have to be completed for the remote bridge of Cluster_A:

  1. The bridge in Cluster_A should be registered with minimumDeleteDelay in Cluster_B.

  2. Enable deferred snapshot deletion in Cluster_B (please check Management configuration for more details).

This will enable setting up the deleteAfter parameter on unexport in Cluster_B.

With the above example volume and remote snapshots, if a user in Cluster_A unexports the volume1@294 snapshot and sets its deleteAfter flag to a week from now, e.g.:

user@hostA # storpool snapshot remote Cluster_B a.b.q unexport deleteAfter 7d
OK

After the completion of this operation the following events will occur:

  • The volume1@294 snapshot will immediately stop being visible in Cluster_A.

  • The snapshot will get a deleteAfter flag with a timestamp one week from this moment.

  • A week later the snapshot will be deleted, but only if deferred snapshot deletion is turned on.

14. Rebalancing StorPool

14.1. Overview

In some situations the data in the StorPool cluster needs to be rebalanced. This is performed by the balancer and relocator tools. The relocator is an integral part of the StorPool management service, the balancer is currently an external tool.

Note

Be advised that the balancer tool will create some files it needs in the current working directory.

The rebalancing operation is performed in the following steps:

  • The balancer tool is run to calculate the new state of the cluster;

  • The results from the balancer are verified by automated scripts;

  • The results are also manually reviewed to check whether they contain any inconsistencies and whether they achieve the intended goals. These results are available by running storpool balancer disks and will be printed at the end of balancer.sh.

    • If the result is not satisfactory, the balancer is run with different parameters until a satisfactory result is obtained;

  • The calculations of the balancer tool are loaded into the relocator tool by running storpool balancer commit;

    • Please note that this step IS NOT REVERSIBLE.

  • The relocator tool performs the actual move of the data.

    • The progress of the relocator tool can be monitored by storpool task list for the currently running tasks, storpool relocator status for an overview of the relocator state and storpool relocator disks (warning: slow command) for the full relocation state.

The balancer tool is run via the /usr/lib/storpool/balancer.sh wrapper and accepts the following options:

  • -g placementGroup: Work only on the specified placement group.

  • -c factor: Factor for how much data to try to move around, from 0 to 10. No default, required parameter.

  • -f percent: Allow drives to be filled up to this percentage, from 0 to 99. Default 90.

  • -M maxDataToAdd: Limit the amount of data to copy to a single drive, to be able to rebalance “in pieces”.

  • -m maxAgCount: Limit the maximum allocation group count on drives to this (effectively their used size).

  • -b placementGroup: Use disks in the specified placement group to restore replication in critical conditions.

  • -F: Only move data from fuller to emptier drives.

  • -R: Only restore replication for degraded volumes.

-R and -F are mutually exclusive.

The -c value is basically the trade-off between how well balanced the placement groups end up and the amount of data moved to accomplish that. A lower factor means less data to be moved around; a higher one means more data.

On clusters with drives of unsupported size (HDDs larger than 4 TB) the -m option is required. It limits the data moved onto these drives to at most the set number of allocation groups. This is done because the performance per TB of larger drives is lower, which degrades the performance of the whole cluster.

The -M option helps if you want to limit the amount of data that the balancer would move around, to limit the load on the system created by the rebalancing.

The -f option is required on clusters whose drives are more than 95% full. Extreme care should be taken when balancing in such cases.

The -b option can be used to move data between placementGroups (in most cases from SSDs to HDDs).

14.2. Restoring volume redundancy on a failed drive

Situation: we have lost drive 1802 in placementGroup ssd. We want to remove it from the cluster and restore the redundancy of the data. We need to do the following:

storpool disk 1802 forget                               # this will also remove the drive from all placement groups it participated in
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer  # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -R -g ssd -c 0
storpool balancer commit                                # to actually load the data into the relocator and start the re-balancing operation

14.3. Adding new drives and rebalancing data on them

Situation: we have added SSDs 1201, 1202 and HDDs 1510, 1511, that need to go into placement groups ssd and hdd respectively, and we want to re-balance the cluster data so that it is re-dispersed onto the new disks as well. We have no other placement groups in the cluster.

storpool placementGroup ssd addDisk 1201 addDisk 1202
storpool placementGroup hdd addDisk 1510 addDisk 1511
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer  # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -F -c 0                   # rebalance all placement groups, move data from fuller to emptier drives
storpool balancer commit                                # to actually load the data into the relocator and start the re-balancing operation

14.4. Restoring volume redundancy by rebalancing data to another placementGroup

Situation: we have to restore the redundancy of a hybrid cluster (2 copies on HDDs, one on SSDs) while the ssd placementGroup is out of free space because a few SSDs have recently failed. We can’t replace the failed drives with new ones for the moment.

mkdir -p ~/storpool/balancer && cd ~/storpool/balancer  # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -R -c 0 -b hdd            # use placementGroup ``hdd`` as a backup and move some data from SSDs
storpool balancer commit                                # to actually load the data into the relocator and start the re-balancing operation

14.5. Decommissioning a live node

Situation: a node in the cluster needs to be decommissioned, so the data on its drives needs to be moved away. The drive numbers on that node are 101, 102 and 103.

Note

You have to make sure you have enough space to restore the redundancy before proceeding.

storpool disk 101 softEject                             # mark all drives for evacuation
storpool disk 102 softEject
storpool disk 103 softEject
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer  # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -R -c 0                   # rebalance all placement groups
storpool balancer commit                                # to actually load the data into the relocator and start the re-balancing operation

14.6. Decommissioning a dead node

Situation: a node in the cluster needs to be decommissioned, as it has died and cannot be brought back. The drive numbers on that node are 101, 102 and 103.

Note

You have to make sure you have enough space to restore the redundancy before proceeding.

storpool disk 101 forget                                # remove the drives from all placement groups
storpool disk 102 forget
storpool disk 103 forget
mkdir -p ~/storpool/balancer && cd ~/storpool/balancer  # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -R -c 0                   # rebalance all placement groups
storpool balancer commit                                # to actually load the data into the relocator and start the re-balancing operation

14.7. Resolving imbalances in the drive usage

Situation: we have an imbalance in the drive usage in the whole cluster and we want to improve it.

mkdir -p ~/storpool/balancer && cd ~/storpool/balancer  # it's recommended to run the following commands in a screen/tmux session
/usr/lib/storpool/balancer.sh -c 0                      # rebalance all placement groups
/usr/lib/storpool/balancer.sh -c 6                      # retry to see if we get a better result
storpool balancer commit                                # to actually load the data into the relocator and start the re-balancing operation

14.8. Reading the output of storpool balancer disks

Here is an example output from storpool balancer disks:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|     disk | server |   size   |                  stored                  |                 on-disk                  |                     objects                      |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|        1 |   14.0 |   373 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.1 MB -> 6.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 405000  |
|     1101 |   11.0 |   447 GB |    16 GB -> 15 GB    (-1.0 GB / 1.4 GB)  |    18 GB -> 17 GB    (-1.1 GB / 1.4 GB)  |   11798 -> 10040     (-1758 / +3932)   / 480000  |
|     1102 |   11.0 |   447 GB |    16 GB -> 15 GB    (-268 MB / 1.3 GB)  |    17 GB -> 17 GB    (-301 MB / 1.4 GB)  |   10843 -> 10045      (-798 / +4486)   / 480000  |
|     1103 |   11.0 |   447 GB |    16 GB -> 15 GB    (-1.0 GB / 1.8 GB)  |    18 GB -> 16 GB    (-1.2 GB / 1.9 GB)  |   12123 -> 10039     (-2084 / +3889)   / 480000  |
|     1104 |   11.0 |   447 GB |    16 GB -> 15 GB    (-757 MB / 1.3 GB)  |    17 GB -> 16 GB    (-899 MB / 1.3 GB)  |   11045 -> 10072      (-973 / +4279)   / 480000  |
|     1111 |   11.0 |   466 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   5.1 MB -> 5.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 495000  |
|     1112 |   11.0 |   366 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   5.1 MB -> 5.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 390000  |
|     1121 |   11.0 |   931 GB |    22 GB -> 21 GB    (-1009 MB / 830 MB)  |    22 GB -> 21 GB    (-1.0 GB / 872 MB)  |   13713 -> 12698     (-1015 / +3799)   / 975000  |
|     1122 |   11.0 |   931 GB |    21 GB -> 21 GB    (-373 MB / 2.0 GB)  |    22 GB -> 21 GB    (-379 MB / 2.0 GB)  |   13469 -> 12742      (-727 / +3801)   / 975000  |
|     1123 |   11.0 |   931 GB |    22 GB -> 21 GB    (-1.1 GB / 1.9 GB)  |    22 GB -> 21 GB    (-1.1 GB / 2.0 GB)  |   14859 -> 12629     (-2230 / +4102)   / 975000  |
|     1124 |   11.0 |   931 GB |    21 GB -> 21 GB      (36 MB / 1.8 GB)  |    21 GB -> 21 GB      (92 MB / 1.9 GB)  |   13806 -> 12743     (-1063 / +3389)   / 975000  |
|     1201 |   12.0 |   447 GB |    18 GB -> 15 GB    (-2.9 GB / 633 MB)  |    19 GB -> 16 GB    (-3.0 GB / 658 MB)  |   14148 -> 10070     (-4078 / +3050)   / 480000  |
|     1202 |   12.0 |   447 GB |    17 GB -> 15 GB    (-2.1 GB / 787 MB)  |    19 GB -> 16 GB    (-2.3 GB / 815 MB)  |   13243 -> 10067     (-3176 / +2576)   / 480000  |
|     1203 |   12.0 |   447 GB |    17 GB -> 15 GB    (-2.0 GB / 3.3 GB)  |    19 GB -> 16 GB    (-2.4 GB / 3.5 GB)  |   12746 -> 10062     (-2684 / +3375)   / 480000  |
|     1204 |   12.0 |   447 GB |    18 GB -> 15 GB    (-2.7 GB / 1.1 GB)  |    19 GB -> 16 GB    (-2.9 GB / 1.1 GB)  |   12835 -> 10075     (-2760 / +3248)   / 480000  |
|     1212 |   12.0 |   366 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.0 MB -> 6.0 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 390000  |
|     1221 |   12.0 |   931 GB |    20 GB -> 21 GB     (569 MB / 1.5 GB)  |    21 GB -> 21 GB     (587 MB / 1.6 GB)  |   13115 -> 12616      (-499 / +3736)   / 975000  |
|     1222 |   12.0 |   931 GB |    22 GB -> 21 GB    (-979 MB / 307 MB)  |    22 GB -> 21 GB    (-1013 MB / 317 MB)  |   12938 -> 12697      (-241 / +3291)   / 975000  |
|     1223 |   12.0 |   931 GB |    22 GB -> 21 GB    (-1.1 GB / 781 MB)  |    22 GB -> 21 GB    (-1.2 GB / 812 MB)  |   13968 -> 12718     (-1250 / +3302)   / 975000  |
|     1224 |   12.0 |   931 GB |    21 GB -> 21 GB    (-784 MB / 332 MB)  |    22 GB -> 21 GB    (-810 MB / 342 MB)  |   13741 -> 12692     (-1049 / +3314)   / 975000  |
|     1225 |   12.0 |   931 GB |    21 GB -> 21 GB    (-681 MB / 849 MB)  |    22 GB -> 21 GB    (-701 MB / 882 MB)  |   13608 -> 12748      (-860 / +3420)   / 975000  |
|     1226 |   12.0 |   931 GB |    22 GB -> 21 GB    (-1.1 GB / 825 MB)  |    22 GB -> 21 GB    (-1.1 GB / 853 MB)  |   13066 -> 12692      (-374 / +3817)   / 975000  |
|     1301 |   13.0 |   447 GB |    13 GB -> 15 GB     (2.6 GB / 4.2 GB)  |    14 GB -> 17 GB     (2.7 GB / 4.4 GB)  |    7244 -> 10038     (+2794 / +6186)   / 480000  |
|     1302 |   13.0 |   447 GB |    12 GB -> 15 GB     (3.0 GB / 3.7 GB)  |    13 GB -> 17 GB     (3.1 GB / 3.9 GB)  |    7507 -> 10063     (+2556 / +5619)   / 480000  |
|     1303 |   13.0 |   447 GB |    14 GB -> 15 GB     (1.3 GB / 3.2 GB)  |    15 GB -> 17 GB     (1.3 GB / 3.4 GB)  |    7888 -> 10038     (+2150 / +5884)   / 480000  |
|     1304 |   13.0 |   447 GB |    13 GB -> 15 GB     (2.7 GB / 3.7 GB)  |    14 GB -> 17 GB     (2.8 GB / 3.9 GB)  |    7660 -> 10045     (+2385 / +5870)   / 480000  |
|     1311 |   13.0 |   466 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.1 MB -> 6.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 495000  |
|     1312 |   13.0 |   366 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.1 MB -> 6.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 390000  |
|     1321 |   13.0 |   931 GB |    21 GB -> 21 GB    (-193 MB / 1.1 GB)  |    21 GB -> 21 GB    (-195 MB / 1.2 GB)  |   13365 -> 12765      (-600 / +5122)   / 975000  |
|     1322 |   13.0 |   931 GB |    22 GB -> 21 GB    (-1.4 GB / 1.1 GB)  |    23 GB -> 21 GB    (-1.4 GB / 1.1 GB)  |   12749 -> 12739       (-10 / +4651)   / 975000  |
|     1323 |   13.0 |   931 GB |    21 GB -> 21 GB    (-504 MB / 2.2 GB)  |    22 GB -> 21 GB    (-496 MB / 2.3 GB)  |   13386 -> 12695      (-691 / +4583)   / 975000  |
|     1325 |   13.0 |   931 GB |    21 GB -> 20 GB    (-698 MB / 557 MB)  |    22 GB -> 21 GB    (-717 MB / 584 MB)  |   13113 -> 12768      (-345 / +2668)   / 975000  |
|     1326 |   13.0 |   931 GB |    21 GB -> 21 GB    (-507 MB / 724 MB)  |    22 GB -> 21 GB    (-522 MB / 754 MB)  |   13690 -> 12704      (-986 / +3327)   / 975000  |
|     1401 |   14.0 |   223 GB |   8.3 GB -> 7.6 GB   (-666 MB / 868 MB)  |   9.3 GB -> 8.5 GB   (-781 MB / 901 MB)  |    3470 -> 5043      (+1573 / +2830)   / 240000  |
|     1402 |   14.0 |   447 GB |   9.8 GB -> 15 GB     (5.6 GB / 5.7 GB)  |    11 GB -> 17 GB     (5.8 GB / 6.0 GB)  |    4358 -> 10060     (+5702 / +6667)   / 480000  |
|     1403 |   14.0 |   224 GB |   8.2 GB -> 7.6 GB   (-623 MB / 1.1 GB)  |   9.3 GB -> 8.6 GB   (-710 MB / 1.2 GB)  |    4547 -> 5036       (+489 / +2814)   / 240000  |
|     1404 |   14.0 |   224 GB |   8.4 GB -> 7.6 GB   (-773 MB / 1.5 GB)  |   9.4 GB -> 8.5 GB   (-970 MB / 1.6 GB)  |    4369 -> 5031       (+662 / +2368)   / 240000  |
|     1411 |   14.0 |   466 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.0 MB -> 6.0 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 495000  |
|     1412 |   14.0 |   366 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.0 MB -> 6.0 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 390000  |
|     1421 |   14.0 |   931 GB |    19 GB -> 21 GB     (1.9 GB / 2.6 GB)  |    19 GB -> 21 GB     (2.0 GB / 2.7 GB)  |   10670 -> 12624     (+1954 / +6196)   / 975000  |
|     1422 |   14.0 |   931 GB |    19 GB -> 21 GB     (1.6 GB / 3.2 GB)  |    20 GB -> 21 GB     (1.6 GB / 3.3 GB)  |   10653 -> 12844     (+2191 / +6919)   / 975000  |
|     1423 |   14.0 |   931 GB |    19 GB -> 21 GB     (1.9 GB / 2.5 GB)  |    19 GB -> 21 GB     (2.0 GB / 2.6 GB)  |   10715 -> 12688     (+1973 / +5846)   / 975000  |
|     1424 |   14.0 |   931 GB |    18 GB -> 20 GB     (2.2 GB / 2.9 GB)  |    19 GB -> 21 GB     (2.3 GB / 3.0 GB)  |   10723 -> 12686     (+1963 / +5505)   / 975000  |
|     1425 |   14.0 |   931 GB |    19 GB -> 21 GB     (1.3 GB / 2.5 GB)  |    20 GB -> 21 GB     (1.4 GB / 2.6 GB)  |   10702 -> 12689     (+1987 / +5486)   / 975000  |
|     1426 |   14.0 |   931 GB |    20 GB -> 21 GB     (1.0 GB / 2.5 GB)  |    20 GB -> 21 GB     (1.0 GB / 2.6 GB)  |   10737 -> 12609     (+1872 / +5771)   / 975000  |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|       45 |    4.0 |    29 TB |   652 GB -> 652 GB    (512 MB / 69 GB)   |   686 GB -> 685 GB   (-240 MB / 72 GB)   |  412818 -> 412818       (+0 / +159118) / 30885000 |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Let’s start with the last line. Here’s the meaning, field by field:

  • There are 45 drives in total.

  • There are 4 server instances.

  • The total disk capacity is 29 TB.

  • The stored data is 652 GB and will remain 652 GB after the rebalancing. The net change across all drives is 512 MB, and the total amount of changes on the drives is 69 GB.

  • The same is repeated for the on-disk size. Here the total amount of changes is roughly the amount of data that would need to be copied.

  • The total number of objects will not change (i.e. from 412818 to 412818), 0 new objects will be created, the total number of objects to be moved is 159118, and the total number of possible objects in the cluster is 30885000.

The difference between “stored” and “on-disk” size is that the latter also includes the size of checksum blocks and other internal data.

For the rest of the lines, the data is basically the same, just per disk. For example, on the line for disk 1201 the stored data will go from 18 GB to 15 GB, a net change of -2.9 GB, while about 633 MB of data will be written to that drive during the rebalancing.

What needs to be taken into account is:

  • Are there drives that will have too much data on them? Here both data size and objects must be checked, and they should be close to the average percentage for the placement group.

  • Is the data stored on the drives balanced, i.e. are all the drives’ usages close to the average?

  • Are there drives that should have data on them, but nothing is scheduled to be moved?

    • This usually happens because a drive wasn’t added to the right placement group.

  • Will there be too much data to be moved?

To illustrate the difference in the amount of data to be moved, here is the output of storpool balancer disks from a run with -c 10:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|     disk | server |   size   |                  stored                  |                 on-disk                  |                     objects                      |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|        1 |   14.0 |   373 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.1 MB -> 6.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 405000  |
|     1101 |   11.0 |   447 GB |    16 GB -> 15 GB    (-1.0 GB / 1.7 GB)  |    18 GB -> 17 GB    (-1.1 GB / 1.7 GB)  |   11798 -> 10027     (-1771 / +5434)   / 480000  |
|     1102 |   11.0 |   447 GB |    16 GB -> 15 GB    (-263 MB / 1.7 GB)  |    17 GB -> 17 GB    (-298 MB / 1.7 GB)  |   10843 -> 10000      (-843 / +5420)   / 480000  |
|     1103 |   11.0 |   447 GB |    16 GB -> 15 GB    (-1.0 GB / 3.6 GB)  |    18 GB -> 16 GB    (-1.2 GB / 3.8 GB)  |   12123 -> 10005     (-2118 / +6331)   / 480000  |
|     1104 |   11.0 |   447 GB |    16 GB -> 15 GB    (-752 MB / 2.7 GB)  |    17 GB -> 16 GB    (-907 MB / 2.8 GB)  |   11045 -> 10098      (-947 / +5214)   / 480000  |
|     1111 |   11.0 |   466 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   5.1 MB -> 5.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 495000  |
|     1112 |   11.0 |   366 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   5.1 MB -> 5.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 390000  |
|     1121 |   11.0 |   931 GB |    22 GB -> 21 GB    (-1003 MB / 6.4 GB)  |    22 GB -> 21 GB    (-1018 MB / 6.7 GB)  |   13713 -> 12742      (-971 / +9712)   / 975000  |
|     1122 |   11.0 |   931 GB |    21 GB -> 21 GB    (-368 MB / 5.8 GB)  |    22 GB -> 21 GB    (-272 MB / 6.1 GB)  |   13469 -> 12718      (-751 / +8929)   / 975000  |
|     1123 |   11.0 |   931 GB |    22 GB -> 21 GB    (-1.1 GB / 5.9 GB)  |    22 GB -> 21 GB    (-1.1 GB / 6.1 GB)  |   14859 -> 12699     (-2160 / +8992)   / 975000  |
|     1124 |   11.0 |   931 GB |    21 GB -> 21 GB      (57 MB / 7.4 GB)  |    21 GB -> 21 GB     (113 MB / 7.7 GB)  |   13806 -> 12697     (-1109 / +9535)   / 975000  |
|     1201 |   12.0 |   447 GB |    18 GB -> 15 GB    (-2.8 GB / 1.2 GB)  |    19 GB -> 17 GB    (-3.0 GB / 1.2 GB)  |   14148 -> 10033     (-4115 / +4853)   / 480000  |
|     1202 |   12.0 |   447 GB |    17 GB -> 15 GB    (-2.0 GB / 1.6 GB)  |    19 GB -> 16 GB    (-2.2 GB / 1.7 GB)  |   13243 -> 10055     (-3188 / +4660)   / 480000  |
|     1203 |   12.0 |   447 GB |    17 GB -> 15 GB    (-2.0 GB / 2.3 GB)  |    19 GB -> 16 GB    (-2.3 GB / 2.4 GB)  |   12746 -> 10070     (-2676 / +4682)   / 480000  |
|     1204 |   12.0 |   447 GB |    18 GB -> 15 GB    (-2.7 GB / 2.1 GB)  |    19 GB -> 16 GB    (-2.8 GB / 2.2 GB)  |   12835 -> 10110     (-2725 / +5511)   / 480000  |
|     1212 |   12.0 |   366 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.0 MB -> 6.0 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 390000  |
|     1221 |   12.0 |   931 GB |    20 GB -> 21 GB     (620 MB / 6.3 GB)  |    21 GB -> 21 GB     (805 MB / 6.7 GB)  |   13115 -> 12542      (-573 / +9389)   / 975000  |
|     1222 |   12.0 |   931 GB |    22 GB -> 21 GB    (-981 MB / 2.9 GB)  |    22 GB -> 21 GB    (-1004 MB / 3.0 GB)  |   12938 -> 12793      (-145 / +8795)   / 975000  |
|     1223 |   12.0 |   931 GB |    22 GB -> 21 GB    (-1.1 GB / 5.9 GB)  |    22 GB -> 21 GB    (-1.1 GB / 6.1 GB)  |   13968 -> 12698     (-1270 / +10094)  / 975000  |
|     1224 |   12.0 |   931 GB |    21 GB -> 21 GB    (-791 MB / 4.5 GB)  |    22 GB -> 21 GB    (-758 MB / 4.7 GB)  |   13741 -> 12684     (-1057 / +8616)   / 975000  |
|     1225 |   12.0 |   931 GB |    21 GB -> 21 GB    (-671 MB / 4.8 GB)  |    22 GB -> 21 GB    (-677 MB / 4.9 GB)  |   13608 -> 12690      (-918 / +8559)   / 975000  |
|     1226 |   12.0 |   931 GB |    22 GB -> 21 GB    (-1.1 GB / 6.2 GB)  |    22 GB -> 21 GB    (-1.1 GB / 6.4 GB)  |   13066 -> 12737      (-329 / +9386)   / 975000  |
|     1301 |   13.0 |   447 GB |    13 GB -> 15 GB     (2.6 GB / 4.5 GB)  |    14 GB -> 17 GB     (2.7 GB / 4.6 GB)  |    7244 -> 10077     (+2833 / +6714)   / 480000  |
|     1302 |   13.0 |   447 GB |    12 GB -> 15 GB     (3.0 GB / 4.9 GB)  |    13 GB -> 17 GB     (3.2 GB / 5.2 GB)  |    7507 -> 10056     (+2549 / +7011)   / 480000  |
|     1303 |   13.0 |   447 GB |    14 GB -> 15 GB     (1.3 GB / 3.2 GB)  |    15 GB -> 17 GB     (1.3 GB / 3.3 GB)  |    7888 -> 10020     (+2132 / +6926)   / 480000  |
|     1304 |   13.0 |   447 GB |    13 GB -> 15 GB     (2.7 GB / 4.7 GB)  |    14 GB -> 17 GB     (2.8 GB / 4.9 GB)  |    7660 -> 10075     (+2415 / +7049)   / 480000  |
|     1311 |   13.0 |   466 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.1 MB -> 6.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 495000  |
|     1312 |   13.0 |   366 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.1 MB -> 6.1 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 390000  |
|     1321 |   13.0 |   931 GB |    21 GB -> 21 GB    (-200 MB / 4.1 GB)  |    21 GB -> 21 GB    (-192 MB / 4.3 GB)  |   13365 -> 12690      (-675 / +9527)   / 975000  |
|     1322 |   13.0 |   931 GB |    22 GB -> 21 GB    (-1.3 GB / 6.9 GB)  |    23 GB -> 21 GB    (-1.3 GB / 7.2 GB)  |   12749 -> 12698       (-51 / +10047)  / 975000  |
|     1323 |   13.0 |   931 GB |    21 GB -> 21 GB    (-495 MB / 6.1 GB)  |    22 GB -> 21 GB    (-504 MB / 6.3 GB)  |   13386 -> 12693      (-693 / +9524)   / 975000  |
|     1325 |   13.0 |   931 GB |    21 GB -> 21 GB    (-620 MB / 6.6 GB)  |    22 GB -> 21 GB    (-612 MB / 6.9 GB)  |   13113 -> 12768      (-345 / +9942)   / 975000  |
|     1326 |   13.0 |   931 GB |    21 GB -> 21 GB    (-498 MB / 7.1 GB)  |    22 GB -> 21 GB    (-414 MB / 7.4 GB)  |   13690 -> 12697      (-993 / +9759)   / 975000  |
|     1401 |   14.0 |   223 GB |   8.3 GB -> 7.6 GB   (-670 MB / 950 MB)  |   9.3 GB -> 8.5 GB   (-789 MB / 993 MB)  |    3470 -> 5061      (+1591 / +3262)   / 240000  |
|     1402 |   14.0 |   447 GB |   9.8 GB -> 15 GB     (5.6 GB / 7.1 GB)  |    11 GB -> 17 GB     (5.8 GB / 7.5 GB)  |    4358 -> 10052     (+5694 / +7092)   / 480000  |
|     1403 |   14.0 |   224 GB |   8.2 GB -> 7.6 GB   (-619 MB / 730 MB)  |   9.3 GB -> 8.5 GB   (-758 MB / 759 MB)  |    4547 -> 5023       (+476 / +2567)   / 240000  |
|     1404 |   14.0 |   224 GB |   8.4 GB -> 7.6 GB   (-790 MB / 915 MB)  |   9.4 GB -> 8.5 GB   (-918 MB / 946 MB)  |    4369 -> 5062       (+693 / +2483)   / 240000  |
|     1411 |   14.0 |   466 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.0 MB -> 6.0 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 495000  |
|     1412 |   14.0 |   366 GB |   4.7 MB -> 4.7 MB      (0  B / 0  B)    |   6.0 MB -> 6.0 MB      (0  B / 0  B)    |      26 -> 26           (+0 / +0)      / 390000  |
|     1421 |   14.0 |   931 GB |    19 GB -> 21 GB     (2.0 GB / 6.8 GB)  |    19 GB -> 21 GB     (2.1 GB / 7.0 GB)  |   10670 -> 12695     (+2025 / +10814)  / 975000  |
|     1422 |   14.0 |   931 GB |    19 GB -> 21 GB     (1.6 GB / 7.4 GB)  |    20 GB -> 21 GB     (1.7 GB / 7.7 GB)  |   10653 -> 12702     (+2049 / +10414)  / 975000  |
|     1423 |   14.0 |   931 GB |    19 GB -> 21 GB     (2.0 GB / 7.4 GB)  |    19 GB -> 21 GB     (2.1 GB / 7.8 GB)  |   10715 -> 12683     (+1968 / +10418)  / 975000  |
|     1424 |   14.0 |   931 GB |    18 GB -> 21 GB     (2.2 GB / 8.0 GB)  |    19 GB -> 21 GB     (2.3 GB / 8.3 GB)  |   10723 -> 12824     (+2101 / +9573)   / 975000  |
|     1425 |   14.0 |   931 GB |    19 GB -> 21 GB     (1.3 GB / 5.8 GB)  |    20 GB -> 21 GB     (1.4 GB / 6.1 GB)  |   10702 -> 12686     (+1984 / +10231)  / 975000  |
|     1426 |   14.0 |   931 GB |    20 GB -> 21 GB     (1.0 GB / 6.5 GB)  |    20 GB -> 21 GB     (1.2 GB / 6.8 GB)  |   10737 -> 12650     (+1913 / +10974)  / 975000  |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|       45 |    4.0 |    29 TB |   652 GB -> 653 GB    (1.2 GB / 173 GB)  |   686 GB -> 687 GB    (1.2 GB / 180 GB)  |  412818 -> 412818       (+0 / +288439) / 30885000 |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------

This time the total amount of data to be moved is 180 GB. It’s possible to have a difference of an order of magnitude in the total data to be moved between -c 0 and -c 10.

14.9. Errors from the balancer tool

If the balancer tool doesn’t complete successfully, its output MUST be examined and the root cause fixed.

14.9.1. placementGroup and other violations

Here’s a part of the output of the balancer:

-== POST BALANCE ==-
shards with decreased redundancy 0 (0, 0, 0)
server constraint violations 0
stripe constraint violations 160
placement group violations 0

  • A non-zero number of “server constraint violations” means that there are pieces of data which have two or more of their copies on the same server. This is an error condition;

  • A non-zero number of “stripe constraint violations” means that some pieces of data are not optimally striped across the drives of a specific server. This is NOT an error condition;

  • A non-zero number of “placement group violations” means there is an error condition.

14.10. Miscellaneous

If for any reason the currently running rebalancing operation needs to be paused, it can be done via storpool relocator off. In such cases StorPool Support should also be contacted, as this shouldn’t need to happen. Re-enabling it is done via storpool relocator on.
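
For reference, pausing and later resuming the rebalancing uses exactly the commands named above:

# storpool relocator off
# storpool relocator on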

15. Troubleshooting Guide

This part outlines the different states of a StorPool cluster, what should be expected in each of them, and the recommended steps to take. It is intended as a guideline for the operations team(s) maintaining the production system provided by StorPool.

Legend:

StorPool CLI and other example commands are shown inline (e.g. storpool disk list, top, ping, etc.).

Example output from a command is listed below:

# storpool disk list
...

15.1. Normal state of the system

The normal state of the StorPool storage system is when it is fully configured and up and running. This is the desired state of the system.

Characteristics of this state:

15.1.1. All nodes in the storage cluster are up and running

This can be checked by using the CLI with storpool service list on any node with access to the API service.

Note

The storpool service list command provides the status of all services running cluster-wide, not only the services running on the node itself.

15.1.2. All configured StorPool services are up and running

This is again easily checked with storpool service list. Recently restarted services can usually be spotted by their uptime; they should be taken seriously if the reason for the restart is unknown, even if they are running at the moment, as in the example with client ID 37 below:

# storpool service list
cluster running, mgmt on node 2
    mgmt   1 running on node  1 ver 18.02.8, started 2018-01-01 19:28:37, uptime 144 days 22:45:51
    mgmt   2 running on node  2 ver 18.02.8, started 2018-01-01 19:27:18, uptime 144 days 22:47:10 active
    server   1 running on node  1 ver 18.02.8, started 2018-01-01 19:28:59, uptime 144 days 22:45:29
    server   2 running on node  2 ver 18.02.8, started 2018-01-01 19:25:53, uptime 144 days 22:48:35
    server   3 running on node  3 ver 18.02.8, started 2018-01-01 19:23:30, uptime 144 days 22:50:58
    client   1 running on node  1 ver 18.02.8, started 2018-01-01 19:28:37, uptime 144 days 22:45:51
    client   2 running on node  2 ver 18.02.8, started 2018-01-01 19:25:32, uptime 144 days 22:48:56
    client   3 running on node  3 ver 18.02.8, started 2018-01-01 19:23:09, uptime 144 days 22:51:19
    client  21 running on node 21 ver 18.02.8, started 2018-01-01 19:20:26, uptime 144 days 22:54:02
    client  22 running on node 22 ver 18.02.8, started 2018-01-01 19:19:26, uptime 144 days 22:55:02
    client  37 running on node 37 ver 18.02.8, started 2018-01-01 13:08:12, uptime 05:06:16

15.1.3. Working cgroup memory and cpuset isolation is properly configured

This is an example output from a node with two isolated cores for the storage system and properly configured cgroup memory limits:

# /usr/share/doc/storpool/examples/cgconfig/cgtool.sh
*** cpuset:/ exclusive:1 mems:0-1 cpus:0-39
*** cpuset:/user.slice exclusive:0 mems:0-1 cpus:3-19,23-39
*** cpuset:/system.slice exclusive:0 mems:0-1 cpus:3-19,23-39
*** cpuset:/storpool.slice exclusive:1 mems:0 cpus:0-2,20-22
*** cpuset:/machine.slice exclusive:0 mems:0-1 cpus:3-19,23-39
*** storpool.slice Currently in use 988M, Memory Limit 3037M of 257681M
*** user.slice Currently in use 498M, Memory Limit 1523M of 257681M
*** system.slice Currently in use 3674M, Memory Limit 4699M of 257681M
*** machine.slice Currently in use 80262M, Memory Limit 247400M of 257681M

In this case the sum of all memory limits on the node is less than the available memory on the node. This protects the running kernel from memory shortage, as well as all processes in the storpool.slice memory cgroup, which ensures the stability of the storage service.
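
A quick way to verify the limit of the storpool.slice memory cgroup, assuming cgroup v1 is mounted at the usual location:

# cat /sys/fs/cgroup/memory/storpool.slice/memory.limit_in_bytes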

15.1.4. All network interfaces are properly configured

All network interfaces used by StorPool are up and properly configured; all network switches are configured with jumbo frames and flow control, and none of them experience any packet loss or delays. The output from storpool net list is a good start: all configured network interfaces should be seen as up, with the flags explained at the end of the output. The desired state is uU with a + at the end for each network interface; if hardware acceleration is supported on an interface, the A flag should also be present:

# storpool net list
------------------------------------------------------------
| nodeId | flags   | net 1             | net 2             |
------------------------------------------------------------
|     23 | uU + AJ | 1A:00:01:00:00:17 | 1E:00:01:00:00:17 |
|     24 | uU + AJ | 1A:00:01:00:00:18 | 1E:00:01:00:00:18 |
|     25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
|     26 | uU + AJ | 12:00:01:00:F0:1A | 16:00:01:00:F0:1A |
|     27 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
  u - packets recently received from this node
  d - no packets recently received from this node

  U - this node has enough votes to be in the quorum
  D - this node does not have enough votes to be in the quorum
  + - this node considers itself in the quorum
  B - the connection to this node is through a backup link; check the cabling
  A - this node is using hardware acceleration
  J - the node uses jumbo frames

  N - a non-voting node

15.1.5. All drives are up and running

All drives in use for the storage system are performing at their specified speed, are joined in the cluster and serving requests.

This can be checked with storpool disk list internal. For example, in a normally loaded cluster all drives will report low aggregate scores and similar write-back cache iops (for HDDs with write-back cache enabled in StorPool). Below is an example output (trimmed for brevity):

# storpool disk list internal
--------------------------------------------------------------------------------------------------------------------------------------------------------
| disk | server |        aggregate scores        |         wbc pages        |     scrub bw |                          scrub ETA | last scrub completed |
--------------------------------------------------------------------------------------------------------------------------------------------------------
| 2301 |   23.0 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2018-01-24 15:33:44 |
| 2302 |   23.0 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2018-01-24 15:28:48 |
| 2303 |   23.1 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2018-01-24 15:28:49 |
| 2304 |   23.1 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2018-01-24 15:28:50 |
| 2305 |   23.2 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2018-01-24 15:28:51 |
| 2306 |   23.2 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2018-01-24 15:28:51 |
| 2307 |   23.3 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2018-01-24 15:28:52 |
| 2308 |   23.3 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2018-01-24 15:28:53 |
| 2311 |   23.0 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2018-01-24 15:28:38 |
| 2312 |   23.0 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2018-01-24 15:28:43 |
| 2313 |   23.1 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2018-01-24 15:28:44 |
| 2314 |   23.1 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2018-01-24 15:28:45 |
| 2315 |   23.2 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2018-01-24 15:28:47 |
| 2316 |   23.2 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2018-01-24 15:28:39 |
| 2317 |   23.3 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2018-01-24 15:28:40 |
| 2318 |   23.3 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2018-01-24 15:28:42 |
[snip]

All drives are regularly scrubbed, so they should have a stable (not increasing) number of errors. The errors corrected for each drive are visible in the storpool disk list output. The last completed scrub is visible in storpool disk list info, as well as in the example above.

A few notes on the desired state:

  • Some systems may have fewer than two network interfaces or a single backend switch. Even though this is not recommended, it is still possible and sometimes used (usually in PoC setups). A single-VLAN network redundancy configuration is required for a cluster where only some of the nodes are connected to the cluster with a single interface.

  • If one or more of the points describing the state above is not in effect, the system should not be considered healthy. If there is any suspicion that the system is behaving erratically even though all of the above conditions are satisfied, the recommended steps to check if everything is in order are:

  • Check top and look for the state of each of the configured storpool_* services running on the present node. A properly running service is usually in the S (sleeping) state and rarely seen in the R (running) state. The CPU usage varies; bursts or longer periods of CPU saturation are possible depending on the running workload. Usually the storpool_block service is way below all qemu/lxc processes listed by top when sorted by CPU usage.

  • On some of the nodes running VM instances or LXC containers, the statistics for processed requests on block devices shown by iostat will show some activity. The following example shows the normal disk activity on a node running VM instances. Note that the usage may vary greatly depending on the workload that the VM instances are producing. The command used in the example is iostat -xm 1 /dev/sp-* | egrep -v " 0[.,]00$", which will print statistics for the StorPool devices each second, excluding devices that have no storage I/O activity:

    Device:    rrqm/s   wrqm/s      r/s      w/s    rMB/s    wMB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
    sp-0         0.00     0.00     0.00   279.00     0.00     0.14     1.00     3.87    13.80     0.00    13.80     3.55    98.99
    sp-11        0.00     0.00   165.60   114.10    19.29    14.26   245.66     5.97    20.87     9.81    36.91     0.89    24.78
    sp-12        0.00     0.00   171.60   153.60    19.33    19.20   242.67     9.20    28.17    10.46    47.96     1.08    35.01
    sp-13        0.00     0.00     6.00    40.80     0.04     5.10   225.12     1.75    37.32     0.27    42.77     1.06     4.98
    sp-21        0.00     0.00     0.00    82.20     0.00     1.04    25.90     1.00    12.08     0.00    12.08    12.16    99.99
    
    

15.2. Degraded state

In this state some system components are not fully operational and need attention. Some examples of a degraded state below.

15.2.1. Degraded state due to service issues

15.2.1.1. A single storpool_server service on one of the storage nodes is not available or not joined in the cluster

Note that this concerns only pools with triple replication; for dual replication this is considered a critical state, because there are parts of the system with only one available copy. This is an example output from storpool service list:

# storpool service list
cluster running, mgmt on node 2
      mgmt   1 running on node  1 ver 18.02.8, started 2018-01-01 16:11:59, uptime 19:51:50
      mgmt   2 running on node  2 ver 18.02.8, started 2018-01-01 16:11:58, uptime 19:51:51 active
      mgmt   3 running on node  3 ver 18.02.8, started 2018-01-01 16:11:58, uptime 19:51:51
    server   1 down on node  1 ver 18.02.8
    server   2 running on node  2 ver 18.02.8, started 2018-01-01 16:12:03, uptime 19:51:46
    server   3 running on node  3 ver 18.02.8, started 2018-01-01 16:12:04, uptime 19:51:45
    client   1 running on node  1 ver 18.02.8, started 2018-01-01 16:11:59, uptime 19:51:50
    client   2 running on node  2 ver 18.02.8, started 2018-01-01 16:11:57, uptime 19:51:52
    client   3 running on node  3 ver 18.02.8, started 2018-01-01 16:11:57, uptime 19:51:52

If this is unexpected, i.e. no one has deliberately restarted or stopped the service for planned maintenance or an upgrade, it is very important to first bring the service up and then to investigate the root cause of the service outage. When the storpool_server service comes back up it will start recovering outdated data on its drives. The recovery process can be monitored with storpool task list, which shows which disks are recovering, as well as how much data is left to be recovered. Example output of storpool task list:

# storpool task list
----------------------------------------------------------------------------------------
|     disk |  task id | total size |  completed |    started |  remaining | % complete |
----------------------------------------------------------------------------------------
|      122 |        0 |      30 GB |      14 GB |      32 MB |      17 GB |        45% |
|      122 |    18503 |     128 MB |       0  B |       0  B |     128 MB |         0% |
|      123 |        0 |      27 GB |      13 GB |      32 MB |      13 GB |        50% |
|      123 |    18564 |      32 MB |       0  B |       0  B |      32 MB |         0% |
|      124 |        0 |      30 GB |      13 GB |      32 MB |      17 GB |        43% |
|      124 |    18503 |      64 MB |       0  B |       0  B |      64 MB |         0% |
|      221 |    18564 |     160 MB |      96 MB |      32 MB |      64 MB |        60% |
----------------------------------------------------------------------------------------
|    total |          |      87 GB |      40 GB |     128 MB |      47 GB |        46% |
----------------------------------------------------------------------------------------

Some of the volumes or snapshots will have the D flag (for degraded) visible in the storpool volume status output; it will disappear once all the data is fully recovered. An example situation would be a reboot of the node for a kernel or package upgrade where, for instance, no kernel modules were installed for the new kernel, or the service (in this example storpool_server) was not configured to start on boot.

15.2.1.2. Some of the configured StorPool services have failed or are not running

These could be:

  • The storpool_block service on some of the storage-only nodes, without any attached volumes or snapshots.

  • A single storpool_server service on some of the storage nodes; note that this is critical for systems with dual replication.

  • A single API (storpool_mgmt) service, while another active API is still running

The reasons for these could be the same as in the previous examples; usually the system log contains all the information needed to check why the service is not (getting) up.
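
A minimal sketch of such a check; the log file path depends on the distribution (/var/log/messages on RHEL-like systems, /var/log/syslog on Debian-like ones), and journalctl is available only on systems using systemd:

# grep -i storpool /var/log/messages | tail -n 50
# journalctl -b | grep -i storpool | tail -n 50    # alternative on systemd-based systems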

15.2.2. Degraded state due to host OS misconfiguration

Some examples include:

15.2.2.1. Changes in the OS configuration after a system update

This could prevent some of the services from running after a fresh boot, for instance if the names of the network interfaces used for the storage system changed after an upgrade.
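
A quick way to compare the interface names configured for StorPool with the ones the OS currently sees; SP_IFACE2_CFG is assumed by analogy with SP_IFACE1_CFG and may not be present on single-network setups:

# storpool_showconf SP_IFACE1_CFG
# storpool_showconf SP_IFACE2_CFG    # assumed second-network key, if configured
# ip -br link                        # interface names and state as seen by the OS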

15.2.2.2. Kdump is no longer collecting kernel dump data properly

If this occurs, it might be difficult to debug what has caused a kernel crash.
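
A minimal check, assuming a systemd-based distribution; the kdump service name varies between distributions (e.g. kdump on RHEL-like systems, kdump-tools on Debian-like ones):

# systemctl is-active kdump            # service name is distribution-dependent
# cat /sys/kernel/kexec_crash_loaded   # 1 means a crash kernel is currently loaded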

Some of the above cases will be difficult to catch prior to booting with the new environment (e.g. kernel or other updates) and sometimes they are only caught after an event that reveals the issue.

15.2.3. Degraded state due to network interface issues

15.2.3.1. Some of the interfaces used by StorPool are not up

This could be checked with storpool net list, e.g.:

# storpool net list
------------------------------------------------------------
| nodeId | flags   | net 1             | net 2             |
------------------------------------------------------------
|     23 | uU + AJ |                   | 1E:00:01:00:00:17 |
|     24 | uU + AJ |                   | 1E:00:01:00:00:18 |
|     25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
|     26 | uU + AJ | 12:00:01:00:F0:1A | 16:00:01:00:F0:1A |
|     27 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
  u - packets recently received from this node
  d - no packets recently received from this node

  U - this node has enough votes to be in the quorum
  D - this node does not have enough votes to be in the quorum
  + - this node considers itself in the quorum
  B - the connection to this node is through a backup link; check the cabling
  A - this node is using hardware acceleration
  J - the node uses jumbo frames

  N - a non-voting node

In the above example nodes 23 and 24 are not connected to the first network. This corresponds to the SP_IFACE1_CFG interface configuration in /etc/storpool.conf (check with storpool_showconf SP_IFACE1_CFG). Note that the beacons are up and running and the system is processing requests through the second network. The possible reasons could be misconfigured interfaces, a wrong StorPool configuration, or an issue with the backend switch/switches.
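
To check at the OS level whether the underlying interface of the first network has a link, something like the following can be used; the interface name eth1 is hypothetical - take the actual name from the SP_IFACE1_CFG value:

# ethtool eth1 | grep 'Link detected'   # eth1 is a placeholder interface name
# ip -br link show eth1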

15.2.3.2. An interface qualified for hardware acceleration is running without hardware acceleration

This is once again checked with storpool net list:

# storpool net list
------------------------------------------------------------
| nodeId | flags   | net 1             | net 2             |
------------------------------------------------------------
|     23 | uU +  J | 1A:00:01:00:00:17 | 1E:00:01:00:00:17 |
|     24 | uU +  J | 1A:00:01:00:00:18 | 1E:00:01:00:00:18 |
|     25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
|     26 | uU + AJ | 12:00:01:00:F0:1A | 16:00:01:00:F0:1A |
|     27 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
  u - packets recently received from this node
  d - no packets recently received from this node

  U - this node has enough votes to be in the quorum
  D - this node does not have enough votes to be in the quorum
  + - this node considers itself in the quorum
  B - the connection to this node is through a backup link; check the cabling
  A - this node is using hardware acceleration
  J - the node uses jumbo frames

  N - a non-voting node

In the above example nodes 23 and 24 are equipped with NICs qualified for hardware acceleration, but are running without it; the possible reasons could be an OS misconfiguration, misconfigured kernel parameters on boot, or a network interface misconfiguration.

15.2.3.3. Jumbo frames are expected, but not working on some of the interfaces

This can be seen with storpool net list: if either of the two networks has an MTU lower than 9k, the J flag will not be listed:

# storpool net list
-------------------------------------------------------------
| nodeId | flags    | net 1             | net 2             |
-------------------------------------------------------------
|     23 | uU + A   | 12:00:01:00:F0:17 | 16:00:01:00:F0:17 |
|     24 | uU + AJ  | 1A:00:01:00:00:18 | 1E:00:01:00:00:18 |
|     25 | uU + AJ  | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
|     26 | uU + AJ  | 1A:00:01:00:00:1A | 1E:00:01:00:00:1A |
-------------------------------------------------------------
Quorum status: 4 voting beacons up out of 4 expected
Flags:
  u - packets recently received from this node
  d - no packets recently received from this node

  U - this node has enough votes to be in the quorum
  D - this node does not have enough votes to be in the quorum
  M - this node is being damped by the rest of the nodes in the cluster
  + - this node considers itself in the quorum
  B - the connection to this node is through a backup link; check the cabling
  A - this node is using hardware acceleration
  J - the node uses jumbo frames

  N - a non-voting node

If the node is not expected to be running without jumbo frames, this might be an indication of a misconfigured interface or an issue with applying the interface configuration on boot.
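
The MTU actually applied on an interface can be confirmed at the OS level as in the sketch below; the interface name eth1 is a placeholder, and an MTU of 9000 is expected when jumbo frames are configured:

# ip link show eth1 | grep -o 'mtu [0-9]*'   # eth1 is a placeholder interface name
mtu 9000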

15.2.3.4. Some network interfaces are experiencing network loss or delays on one of the networks

This might affect the latency of some storage operations. Depending on the node where the losses occur, it might affect a single client, or it might affect operations in the whole cluster if the packet loss or delays are happening on a server node. Statistics for all interfaces used for the storage should be collected once per minute in /var/log/storpool/spiface_stats.log; the rx or tx counters of the underlying network interfaces will increase during periods with packet loss or errors (see the example check after this list). The usual causes for packet loss are:

  • hardware issues (cables, SFPs, etc.)

  • floods and DDoS attacks “leaking” into the storage network due to misconfiguration

  • saturation of the CPU cores that handle the interrupts for the network cards and others when hardware acceleration is not available
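
A minimal sketch for inspecting the error and drop counters of an underlying network interface; eth1 is a placeholder name, and the ethtool counters available vary by driver:

# ip -s link show eth1                        # RX/TX errors and drops per interface
# ethtool -S eth1 | egrep -i 'err|drop|disc'  # per-queue/driver counters, if supported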

15.2.4. Drive/Controller issues

15.2.4.1. One or more HDD or SSD drives are missing from a single server in the cluster or from servers in the same fault set

Note

This concerns only pools with triple replication; for dual replication this is considered a critical state.

The missing drives may be seen using storpool disk list or storpool server <serverID> disk list; for example, in the following output disk 543 is missing from the server with ID 54:

# storpool server 54 disk list
disk  |   server  | size    |   used  |  est.free  |   %     | free entries | on-disk size |  allocated objects |  errors
541   |       54  | 207 GB  |  61 GB  |    136 GB  |   29 %  |      713180  |       75 GB  |   158990 / 225000  |    0
542   |       54  | 207 GB  |  56 GB  |    140 GB  |   27 %  |      719526  |       68 GB  |   161244 / 225000  |    0
543   |       54  |      -  |      -  |         -  |    - %  |           -  |           -  |        - / -       |    -
544   |       54  | 207 GB  |  62 GB  |    134 GB  |   30 %  |      701722  |       76 GB  |   158982 / 225000  |    0
545   |       54  | 207 GB  |  61 GB  |    135 GB  |   30 %  |      719993  |       75 GB  |   161312 / 225000  |    0
546   |       54  | 207 GB  |  54 GB  |    142 GB  |   26 %  |      720023  |       68 GB  |   158481 / 225000  |    0
547   |       54  | 207 GB  |  62 GB  |    134 GB  |   30 %  |      719996  |       77 GB  |   179486 / 225000  |    0
548   |       54  | 207 GB  |  53 GB  |    143 GB  |   26 %  |      718406  |       70 GB  |   179038 / 225000  |    0

The usual reason is that the drive was ejected from the cluster due to a write error, either by the kernel or by the running storpool_server instance. More information may be found using dmesg | tail and in the system log. The model and serial number of the failed drive are shown by storpool disk list info.

When one or more drives are missing, multiple volumes and/or snapshots will be listed with the D flag (as in Degraded) in the output of storpool volume status, due to the missing replicas for some of the data. This is normal and expected, and there are the following options in this situation:

  • The drive could still be working correctly if the kernel falsely detected an error. To check if this is the case, the drive may be stress-tested to ensure it is working correctly (fio is a good tool for this kind of test; check its --verify option and see the example invocation after this list). If the stress test is successful (e.g. the drive has been written to and verified successfully), it may be reinitialized with storpool_initdisk with the same disk ID it had before. This will automatically return it to the cluster and it will fully recover all data from scratch, as if it were a brand new drive.

  • The drive has failed irrecoverably and a replacement is available. The replacement drive is stress-tested to ensure that it works correctly; it is then initialized with the disk ID of the failed drive using storpool_initdisk. After returning it to the cluster it will fully recover all the data from the live replicas (please check Rebalancing StorPool for more details).

  • A replacement is not available. The only option is to re-balance the cluster without this drive (more details in Rebalancing StorPool).
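
A sketch of such a stress test with fio is shown below. It is destructive - it overwrites the whole drive - so it must be run only on a drive that has already been ejected from the cluster; the device name /dev/sdX and the job parameters are illustrative assumptions:

# fio --name=drive-verify --filename=/dev/sdX --direct=1 --ioengine=libaio \
      --rw=write --bs=128k --iodepth=32 --verify=crc32c

If the drive completes the write and verify phases without errors, it may be reinitialized with storpool_initdisk using its previous disk ID and returned to the cluster.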

Attention

Beware that in some cases with very full clusters it might be impossible to get the cluster back to full redundancy without overfilling some of the remaining drives. See the next section.

15.2.4.2. Some of the drives in the cluster are more than 90% full (up to 96%)

With proper planning this should rarely be an issue. A way to avoid it is to add more drives, or an additional server node with a full set of drives, to the cluster. Another option is to remove unused volumes or snapshots.

The storpool snapshot space command returns information about the space referred by each snapshot on the underlying drives. Note that snapshots with a negative value in their “used” column will not free up any space if they are removed and will remain in the deleting state, because they are parents of multiple cloned child volumes.

Note that depending on the speed with which the cluster is being populated with data by the end users this might also be considered a critical state.

15.2.4.3. Some of the drives have fewer than 140k free entries (alert for an overloaded system)

This may be observed in the output of storpool disk list or storpool server <serverID> disk list; an example from the latter is shown below:

# storpool server 54 disk list
disk  |  server  | size    | used   |  est.free |      %  | free entries | on-disk size |  allocated objects |  errors
541   |      54  | 207 GB  | 54 GB  |   142 GB  |   26 %  |      137600  |       49 GB  |   158172 / 225000  |    0
542   |      54  | 207 GB  | 49 GB  |   147 GB  |   24 %  |      145197  |       44 GB  |   160542 / 225000  |    0
543   |      54  | 207 GB  | 49 GB  |   147 GB  |   24 %  |      141583  |       45 GB  |   181073 / 225000  |    0
544   |      54  | 207 GB  | 54 GB  |   142 GB  |   26 %  |      139556  |       48 GB  |   158128 / 225000  |    0
545   |      54  | 207 GB  | 56 GB  |   140 GB  |   27 %  |      141096  |       50 GB  |   160586 / 225000  |    0
546   |      54  | 207 GB  | 48 GB  |   148 GB  |   23 %  |      142752  |       43 GB  |   157735 / 225000  |    0
547   |      54  | 207 GB  | 55 GB  |   142 GB  |   26 %  |      131534  |       51 GB  |   178678 / 225000  |    0
548   |      54  | 207 GB  | 44 GB  |   152 GB  |   21 %  |      141673  |       41 GB  |   178192 / 225000  |    0

This usually happens after the system has been loaded for longer periods of time with a sustained random write workload on one or multiple volumes. If this is unexpected and the reason is an erratic workload, the recommended way to handle it is to set a limit (bandwidth, IOPS or both) on the loaded volumes, for example with storpool volume <volumename> bw 100M iops 1000. The same could be set for multiple volumes/snapshots in a template with storpool template <templatename> bw 100M iops 1000 propagate; both commands are shown below. Please note that propagating changes for templates with a very large number of volumes and snapshots might not work. If the overloaded state is due to a normally occurring workload, it is best to expand the system with more drives. The latter case might be caused by a lower number of hard drives in an HDD-only or a hybrid pool and rarely by overloaded SSDs.
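
For reference, the limiting commands mentioned above look like this (the volume and template names, as well as the limit values, are examples):

# storpool volume <volumename> bw 100M iops 1000
# storpool template <templatename> bw 100M iops 1000 propagate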

A couple of notes on the degraded states: apart from the notes about the replication above, none of these should affect the stability of the system at this point. For the example with the missing disk: in hybrid systems with a single failed SSD, all read requests on volumes with triple replication that have data on the failed drive will be served by some of the redundant copies on HDDs. This could slightly raise the read latencies for the operations on the parts of the volumes that were on this exact SSD, which is usually negligible in medium to large systems. In case of dual replicas on SSDs and a third replica on HDDs there is no latency change whatsoever. The same holds for missing hard drives - they will not affect the system at all; in fact some write operations are even faster, because they are not waiting for the missing drive.

15.3. Critical state

This is an emergency state that requires immediate attention and intervention from the operations team and/or StorPool Support. Some of the conditions that could lead to this state are:

  • partial or complete network outage

  • power loss for some nodes in the cluster

  • memory shortage leading to a service failure due to missing or incomplete cgroups configuration

The following states are an indication for critical conditions:

15.3.1. API service failure

15.3.1.1. API not reachable on any of the configured nodes (the ones running the storpool_mgmt service)

Requests to the API from any of the nodes configured to access it either stall or cannot reach a working service.

This might be caused by:

  • Misconfigured network for accessing the floating IP address - the address may be obtained with storpool_confshow http on any of the nodes with a configured storpool_mgmt service in the cluster (see also the reachability check after this list):

    # storpool_confshow http
    SP_API_HTTP_HOST=10.3.10.78
    SP_API_HTTP_PORT=81
    

  • Failed interfaces on the hosts that have the storpool_mgmt service running. To find the interface where the StorPool API should be running, use storpool_confshow api_iface:

    # storpool_confshow api_iface
    SP_API_IFACE=bond0.410

It is recommended to have the API on a redundant interface (e.g. an active-backup bond). Note that even without an API, provided the cluster is in quorum, there should be no impact on any running operations, but changes in the cluster (like creating/attaching/detaching/deleting volumes or snapshots) will be impossible.

  • The cluster is not in quorum - the cluster is in this state if the number of running voting storpool_beacon services is less than half of the expected nodes plus one ((expected / 2) + 1). The configured number of expected nodes in the cluster may be checked with storpool_confshow expected; so in a system with 6 servers, at least 4 voting beacons should be available to get the cluster back into the running state:

    # storpool_confshow expected
    SP_EXPECTED_NODES=6
    

The current number of expected votes and the number of voting beacons are displayed in the output of storpool net list; check the example above (the Quorum status: line).
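
For the first cause above, a quick way to check whether the floating API address is configured on the expected node and answering at all; the address and port are taken from the storpool_confshow http example, and the curl command is only a TCP/HTTP reachability probe, not an actual API call:

# ip -br addr | grep 10.3.10.78
# curl -s -o /dev/null -w '%{http_code}\n' --connect-timeout 3 http://10.3.10.78:81/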

15.3.1.2. API requests are not returning for more than 30-60 seconds (e.g. storpool volume status, storpool snapshot space, storpool disk list, etc.)

These API requests collect data from the running storpool_server services on each server node. Possible reasons are:

  • network loss or delays;

  • failing storpool_server services or whole server nodes;

  • failing drives;

  • overload

15.3.2. Server service failure

15.3.2.1. Two storpool_server services or whole servers are down

Two storpool_server services or whole servers, in different fault sets, are down or not joined in the cluster.

In this case some of the read operations for parts of the volumes are served from HDDs in a hybrid system, which might significantly raise read latencies. In this state it is very important to bring back the missing services/nodes as soon as possible, because a failure of a third node or another fault set will bring some of the volumes into the down state.

15.3.2.2. More than two storpool_server services or whole servers are down

This state results in some volumes being in the down state (storpool volume status), because some parts of their data are only on the missing drives. The recommended action in this case is to check the reasons for the degraded services or missing (unresponsive) nodes and get them back up.

Possible reasons are:

  • lost network connectivity

  • severe packet loss/delays

  • power loss

  • hardware instabilities, overheating, etc.

  • kernel or other software instabilities, crashes and others.

15.3.3. Client service failure

If the client service (storpool_block) is down on some of the nodes depending on it (these could be either client-only or converged hypervisor nodes), all requests on that particular node will stall until the service is back up.

Possible reasons are again:

  • lost network connectivity

  • severe packet loss/delays

  • hardware instabilities and more.

In case of power loss or kernel crashes any virtual machine instances that were running on this node could be started on other nodes in the cluster.

15.3.4. Network interface or Switch failure

This means that the networks used for StorPool are down or are experiencing heavy packet loss or delays. In this case the quorum service will prevent a split-brain situation and will restart all services to ensure the cluster is fully connected on at least one network before it transitions again to running state.

15.3.5. Hard Drive/SSD failures

15.3.5.1. Drives from two or more different nodes (fault sets) in the cluster are missing (or from a single node/fault set for systems with dual replication pools).

In this case multiple volumes may either experience degraded performance or, in rare cases with more than two drives missing at a time, be in the down state. The recommended steps are to immediately check the reasons for the missing drives and to return any working drives to the cluster as soon as possible.

15.3.5.2. Some of the drives are more than 97% full.

At some point all cluster operations will stall until either some of the data in the cluster is deleted or new drives/nodes are added. Adding drives requires re-balancing the system, which should be carefully planned (details in Rebalancing StorPool). Cleaning up snapshots that have multiple cloned volumes and a negative value for used space in the output of storpool snapshot space will not free up any space.

15.3.5.3. Some of the drives have fewer than 100k free entries.

This is a signal for a heavily overloaded system. In this state the latencies for some operations might become very high (measured in seconds). Possible reasons are severely overloaded volumes for long periods of time without any configured bandwidth or IOPS limits. This could be checked by using iostat to look for volumes that are constantly 100% loaded with a large number of requests to the storage system. Other causes are misbehaving (underperforming) drives or misbehaving HBA/SAS controllers. The recommended way to deal with these cases is to check the output of storpool disk list internal for higher aggregation scores on some drives or sets of drives (e.g. on the same server), or to use iostat to look for hard disk or SSD drives that stay loaded all the time and show lower performance compared to other drives of the same type. An example would be a failing controller causing the SATA speed to degrade to SATA 1.0 (1.5 Gb/s) instead of SATA 3.0 (6 Gb/s). Example checks are shown below.
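
A minimal sketch of such checks; the device names are placeholders, and the dmesg line relies on the standard kernel message for the negotiated SATA link speed:

# storpool disk list internal        # look for drives with higher aggregation scores
# iostat -xm 5 /dev/sp-*             # look for volumes staying at ~100% utilization
# iostat -xm 5 /dev/sd*              # look for backend drives slower than their peers
# dmesg | grep -i 'SATA link up'     # 1.5 Gbps instead of 6.0 Gbps indicates a degraded link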

The circumstances leading a system to the critical state are rare and are usually preventable with periodic maintenance and by taking measures to handle all issues at the first signs of a change from the normal to the degraded state.

In every case, if you feel that something is not as expected, you may consult StorPool Support. We receive notifications for all cases of degraded and critical state, as well as daily reports with events in the cluster, and we try very hard to prevent all threats by notifying customers of each case, along with the possible risks and the steps and improvements needed to resolve the issue.