Troubleshooting
This section outlines the different states of a StorPool cluster, what to expect in each of them, and the recommended steps to take. It is intended as a guideline for the operations team(s) maintaining a production system provided by StorPool.
Normal state of the system
The StorPool storage system is in its normal state when it is fully configured and up and running. This is the desired state of the system.
Characteristics of this state:
All nodes in the storage cluster are up and running
This can be checked using the CLI with storpool service list on any node with access to the API service.
Note
The storpool service list command provides status for all services running cluster-wide, rather than only the services running on the node itself.
All configured StorPool services are up and running
This is again easily checked with storpool service list. Recently restarted services are usually spotted by their short uptime. A recently restarted service should be taken seriously if the reason for the restart is unknown, even if the service is running at the moment, as in the example with client ID 37 below:
# storpool service list
cluster running, mgmt on node 2
mgmt 1 running on node 1 ver 20.00.18, started 2022-09-08 19:28:37, uptime 144 days 22:45:51
mgmt 2 running on node 2 ver 20.00.18, started 2022-09-08 19:27:18, uptime 144 days 22:47:10 active
server 1 running on node 1 ver 20.00.18, started 2022-09-08 19:28:59, uptime 144 days 22:45:29
server 2 running on node 2 ver 20.00.18, started 2022-09-08 19:25:53, uptime 144 days 22:48:35
server 3 running on node 3 ver 20.00.18, started 2022-09-08 19:23:30, uptime 144 days 22:50:58
client 1 running on node 1 ver 20.00.18, started 2022-09-08 19:28:37, uptime 144 days 22:45:51
client 2 running on node 2 ver 20.00.18, started 2022-09-08 19:25:32, uptime 144 days 22:48:56
client 3 running on node 3 ver 20.00.18, started 2022-09-08 19:23:09, uptime 144 days 22:51:19
client 21 running on node 21 ver 20.00.18, started 2022-09-08 19:20:26, uptime 144 days 22:54:02
client 22 running on node 22 ver 20.00.18, started 2022-09-08 19:19:26, uptime 144 days 22:55:02
client 37 running on node 37 ver 20.00.18, started 2022-09-08 13:08:12, uptime 05:06:16
Working cgroup memory and cpuset isolation is properly configured
Use the storpool_cg tool with the check argument to ensure everything is as expected. The tool should not return any warnings. For more information, see Control groups.
When properly configured, the sum of all memory limits on the node is less than the available memory on the node. This protects both the running kernel and all processes in the storpool.slice memory cgroup from memory shortage, which ensures the stability of the storage service.
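A minimal check, run on each node and assuming the tool is installed in its default location; the exact messages vary between releases, but any warning means the configuration deviates from what is expected:

# Validate the cgroup (memory and cpuset) configuration on this node.
# A clean run prints no warnings.
storpool_cg check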
All network interfaces are properly configured
All network interfaces used by StorPool are up and properly configured with hardware acceleration enabled (where applicable); all network switches are configured with jumbo frames and flow control, and none of them experience any packet loss or delays. The output from storpool net list is a good start: all configured network interfaces will be seen as up, with the meaning of the flags explained at the end of the output. The desired state is uU with a + at the end for each network interface; if hardware acceleration is supported on an interface, the A flag should also be present:
# storpool net list
------------------------------------------------------------
| nodeId | flags | net 1 | net 2 |
------------------------------------------------------------
| 23 | uU + AJ | 1A:00:01:00:00:17 | 1E:00:01:00:00:17 |
| 24 | uU + AJ | 1A:00:01:00:00:18 | 1E:00:01:00:00:18 |
| 25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
| 26 | uU + AJ | 12:00:01:00:F0:1A | 16:00:01:00:F0:1A |
| 27 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
u - packets recently received from this node
d - no packets recently received from this node
U - this node has enough votes to be in the quorum
D - this node does not have enough votes to be in the quorum
+ - this node considers itself in the quorum
B - the connection to this node is through a backup link; check the cabling
A - this node is using hardware acceleration
J - the node uses jumbo frames
N - a non-voting node
All drives are up and running
All drives in use for the storage system are performing at their specified speed, are joined in the cluster, and are serving requests. This can be checked with storpool disk list internal; for example, in a normally loaded cluster all drives will report low aggregate scores. Below is an example output (trimmed for brevity):
# storpool disk list internal
--------------------------------------------------------------------------------------------------------------------------------------------------------
| disk | server | aggregate scores | wbc pages | scrub bw | scrub ETA | last scrub completed |
--------------------------------------------------------------------------------------------------------------------------------------------------------
| 2301 | 23.0 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:33:44 |
| 2302 | 23.0 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:48 |
| 2303 | 23.1 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:49 |
| 2304 | 23.1 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:50 |
| 2305 | 23.2 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:51 |
| 2306 | 23.2 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:51 |
| 2307 | 23.3 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:52 |
| 2308 | 23.3 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:53 |
| 2311 | 23.0 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:38 |
| 2312 | 23.0 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:43 |
| 2313 | 23.1 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:44 |
| 2314 | 23.1 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:45 |
| 2315 | 23.2 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:47 |
| 2316 | 23.2 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:39 |
| 2317 | 23.3 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:40 |
| 2318 | 23.3 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:42 |
[snip]
All drives are regularly scrubbed, so they have a stable (not increasing) number of errors. The errors corrected for each drive are visible in the storpool disk list output. The last completed scrub is visible in storpool disk list internal, as in the example above.
Note that some systems may have fewer than two network interfaces or a single backend switch. Although not recommended, this is still possible and sometimes used (usually in PoC setups or with a backup server) when the cluster is configured with a single-VLAN network redundancy scheme. A single-VLAN network redundancy configuration and an inter-switch connection are required for a cluster where only some of the nodes are connected to the cluster with a single interface.
If one or more of the conditions described above is not met, the system should not be considered healthy. If there is any suspicion that the system is behaving erratically even though all of the above conditions are satisfied, the recommended steps to check whether everything is in order are:
Check top and look for the state of each of the configured storpool_* services running on the present node. A properly running service is usually in the S (sleeping) state and is rarely seen in the R (running) state. The CPU usage is often reported as 100% when hardware sleep is enabled, due to the kernel misreporting it; the actual usage is much lower and can be tracked with cpupower monitor for the CPU cores.
To ensure all services on this node are running correctly, use the /usr/lib/storpool/sdump tool, which reports CPU and network usage statistics for the running services on the node. Use the -l option for the long names of the statistics.
On nodes with running workloads (like VM instances or containers), iostat will show activity for processed requests on the block devices. The following example shows normal disk activity on a node running VM instances; note that the usage may vary greatly depending on the workload. The command used in the example is iostat -xm 1 /dev/sp-* | egrep -v " 0[.,]00$", which prints statistics for the StorPool devices each second, excluding drives that have no storage I/O activity:

Device:  rrqm/s  wrqm/s     r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sp-0       0.00    0.00    0.00  279.00    0.00    0.14     1.00     3.87  13.80    0.00   13.80   3.55  98.99
sp-11      0.00    0.00  165.60  114.10   19.29   14.26   245.66     5.97  20.87    9.81   36.91   0.89  24.78
sp-12      0.00    0.00  171.60  153.60   19.33   19.20   242.67     9.20  28.17   10.46   47.96   1.08  35.01
sp-13      0.00    0.00    6.00   40.80    0.04    5.10   225.12     1.75  37.32    0.27   42.77   1.06   4.98
sp-21      0.00    0.00    0.00   82.20    0.00    1.04    25.90     1.00  12.08    0.00   12.08  12.16  99.99
There are no hanging active requests
The output of /usr/lib/storpool/latthreshold.py is empty: it shows no hanging requests and no service or disk warnings.
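A quick way to confirm this, assuming the tool is available at the path above; an empty output means no requests are above the latency threshold and no service or disk warnings are present:

# Print any requests hanging above the latency threshold, plus service/disk warnings.
# Empty output indicates a healthy cluster in this respect.
/usr/lib/storpool/latthreshold.py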
Degraded state
In this state some system components are not fully operational and need attention. Some examples of a degraded state are given below.
Degraded state due to service issues
A single storpool_server service on one of the storage nodes is not available or not joined in the cluster
Note that this concerns only pools with triple replication; for dual replication this is considered a critical state, because there are parts of the system with only one available copy. This is an example output from storpool service list:
# storpool service list
cluster running, mgmt on node 2
mgmt 1 running on node 1 ver 20.00.18, started 2022-09-08 16:11:59, uptime 19:51:50
mgmt 2 running on node 2 ver 20.00.18, started 2022-09-08 16:11:58, uptime 19:51:51 active
mgmt 3 running on node 3 ver 20.00.18, started 2022-09-08 16:11:58, uptime 19:51:51
server 1 down on node 1 ver 20.00.18
server 2 running on node 2 ver 20.00.18, started 2022-09-08 16:12:03, uptime 19:51:46
server 3 running on node 3 ver 20.00.18, started 2022-09-08 16:12:04, uptime 19:51:45
client 1 running on node 1 ver 20.00.18, started 2022-09-08 16:11:59, uptime 19:51:50
client 2 running on node 2 ver 20.00.18, started 2022-09-08 16:11:57, uptime 19:51:52
client 3 running on node 3 ver 20.00.18, started 2022-09-08 16:11:57, uptime 19:51:52
If this is unexpected, i.e. no one has deliberately restarted or stopped the service for planned maintenance or an upgrade, it is very important to first bring the service up and then investigate the root cause of the service outage.
When the storpool_server service comes back up, it will start recovering the outdated data on its drives. The recovery process can be monitored with storpool task list, which shows which disks are recovering, as well as how much data is left to be recovered. Example output of storpool task list:
# storpool task list
----------------------------------------------------------------------------------------
| disk | task id | total obj | completed | started | remaining | % complete |
----------------------------------------------------------------------------------------
| 2303 | RECOVERY | 1 | 0 | 1 | 1 | 0% |
----------------------------------------------------------------------------------------
| total | | 1 | 0 | 1 | 1 | 0% |
----------------------------------------------------------------------------------------
Some of the volumes or snapshots will have the D flag (for degraded) visible in the storpool volume status output, which will disappear once all the data is fully recovered. An example situation would be a reboot of the node for a kernel or package upgrade that requires a reboot where no kernel modules were installed for the new kernel, or a service (in this example the storpool_server) was not configured to start on boot, and so on.
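To follow the recovery until the D flags disappear, the commands above can simply be polled periodically; a minimal sketch, assuming the standard watch utility is available:

# Refresh the list of recovery tasks every 10 seconds; an empty list, together
# with no D flags in "storpool volume status", means recovery has finished.
watch -n 10 storpool task list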
Some of the configured StorPool services have failed or are not running
These could be:
The storpool_block service on some of the storage-only nodes, without any attached volumes or snapshots.
A single storpool_server service, or multiple instances on the same node; note again that this is critical for systems with dual replication.
A single API (storpool_mgmt) service, with another active API running.
The reason for these could be the same as in the previous examples; usually the system log contains all the information needed to check why the service is not (getting) up.
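A starting point for such an investigation, assuming the services are managed as systemd units named after the StorPool service (adjust the unit name to the service that is down):

# Kernel messages often show OOM kills, NIC or disk errors around the time of the failure.
dmesg | tail -n 50

# Recent log messages from the failed service since the last boot
# (replace storpool_server with the service in question).
journalctl -b -u storpool_server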
Degraded state due to host OS misconfiguration
Some examples include:
Changes in the OS configuration after a system update
This could prevent some of the services from running after a fresh boot, for instance due to changed names of the network interfaces used for the storage system after an upgrade, changed PCIe IDs for NVMe devices, and so on.
Kdump is no longer collecting kernel dump data properly
If this occurs, it might be difficult to debug what has caused a kernel crash.
Some of the above cases will be difficult to catch prior to booting with the new environment (for example, kernel or other updates), and sometimes they are only caught after an event that reveals the issue. Thus it is important to regularly test and ensure that the system is properly configured and that kernel dumps are collected normally.
Degraded state due to network interface issues
Some of the interfaces used by StorPool are not up
This can be checked with storpool net list, like this:
# storpool net list
------------------------------------------------------------
| nodeId | flags | net 1 | net 2 |
------------------------------------------------------------
| 23 | uU + AJ | | 1E:00:01:00:00:17 |
| 24 | uU + AJ | | 1E:00:01:00:00:18 |
| 25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
| 26 | uU + AJ | 12:00:01:00:F0:1A | 16:00:01:00:F0:1A |
| 27 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
u - packets recently received from this node
d - no packets recently received from this node
U - this node has enough votes to be in the quorum
D - this node does not have enough votes to be in the quorum
+ - this node considers itself in the quorum
B - the connection to this node is through a backup link; check the cabling
A - this node is using hardware acceleration
J - the node uses jumbo frames
N - a non-voting node
In the above example nodes 23 and 24 are not connected to the first network. This is the SP_IFACE1_CFG interface configuration in /etc/storpool.conf (check with storpool_showconf SP_IFACE1_CFG). Note that the beacons are up and running and the system is processing requests through the second network. The possible reasons could be misconfigured interfaces, StorPool configuration, or the backend switch/switches.
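A sketch for narrowing down the cause on an affected node; eth0 is a placeholder for whatever interface the SP_IFACE1_CFG value points to, not an actual name from this cluster:

# Show which interface StorPool expects to use for the first network.
storpool_showconf SP_IFACE1_CFG

# Check the OS-level state (UP/DOWN, carrier) of that interface;
# replace eth0 with the interface reported by the command above.
ip -br link show dev eth0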
An interface qualified for hardware acceleration is running without hardware acceleration
This is once again checked with storpool net list:
# storpool net list
------------------------------------------------------------
| nodeId | flags | net 1 | net 2 |
------------------------------------------------------------
| 23 | uU + J | 1A:00:01:00:00:17 | 1E:00:01:00:00:17 |
| 24 | uU + J | 1A:00:01:00:00:18 | 1E:00:01:00:00:18 |
| 25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
| 26 | uU + AJ | 12:00:01:00:F0:1A | 16:00:01:00:F0:1A |
| 27 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
u - packets recently received from this node
d - no packets recently received from this node
U - this node has enough votes to be in the quorum
D - this node does not have enough votes to be in the quorum
+ - this node considers itself in the quorum
B - the connection to this node is through a backup link; check the cabling
A - this node is using hardware acceleration
J - the node uses jumbo frames
N - a non-voting node
In the above example, nodes 23 and 24 are equipped with NICs qualified for hardware acceleration, but are running without it. The possible reasons are a BIOS or OS misconfiguration, misconfigured kernel parameters on boot, or a network interface misconfiguration. Note that when a system is configured for hardware-accelerated operation, the cgroups configuration is also sized accordingly; thus running in this state is likely to cause performance issues, due to fewer CPU cores being isolated and reserved for the NIC interrupts and the storpool_rdma threads.
Jumbo frames are expected, but not working on some of the interfaces
This can be seen with storpool net list: if one of the two networks has an MTU lower than 9k, the J flag will not be listed:
# storpool net list
-------------------------------------------------------------
| nodeId | flags | net 1 | net 2 |
-------------------------------------------------------------
| 23 | uU + A | 12:00:01:00:F0:17 | 16:00:01:00:F0:17 |
| 24 | uU + AJ | 1A:00:01:00:00:18 | 1E:00:01:00:00:18 |
| 25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
| 26 | uU + AJ | 1A:00:01:00:00:1A | 1E:00:01:00:00:1A |
-------------------------------------------------------------
Quorum status: 4 voting beacons up out of 4 expected
Flags:
u - packets recently received from this node
d - no packets recently received from this node
U - this node has enough votes to be in the quorum
D - this node does not have enough votes to be in the quorum
M - this node is being damped by the rest of the nodes in the cluster
+ - this node considers itself in the quorum
B - the connection to this node is through a backup link; check the cabling
A - this node is using hardware acceleration
J - the node uses jumbo frames
N - a non-voting node
If the node is not expected to be running without jumbo frames, this might be an indication of a misconfigured interface or an issue with applying the interface configuration on boot. Note that an OS interface configured for jumbo frames without the switch port being properly configured leads to severe performance issues.
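To confirm whether the OS side of an interface is configured for jumbo frames, the MTU can be checked directly; eth0 is a placeholder for the interface used by StorPool, and the switch port MTU must be verified separately on the switch itself:

# The reported mtu should be the jumbo frame size used in this network
# (typically 9000); a value of 1500 explains a missing J flag.
ip link show dev eth0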
Some network interfaces are experiencing network loss or delays on one of the networks
This might affect the latency of some storage operations. Depending on the node where the losses occur, it might affect a single client, or it might affect operations in the whole cluster if the packet loss or delays are happening on a server node. Statistics for all interfaces per service are collected in the analytics platform (https://analytics.storpool.com) and can be used to investigate network performance issues. The /usr/lib/storpool/sdump tool prints the same statistics on each of the nodes with services. The usual causes for packet loss are:
Hardware issues (cables, SFPs, and so on).
Floods and DDoS attacks “leaking” into the storage network due to misconfiguration.
Saturation of the CPU cores that handle the interrupts for the network cards and others when hardware acceleration is not available.
Network loops leading to saturated switch ports or overloaded NICs.
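A sketch for spotting loss at the NIC level; eth0 is a placeholder for the interface used by StorPool, and the counter names differ between drivers:

# Dump NIC statistics and keep only counters that usually indicate loss;
# run it twice with some delay and compare - steadily growing values are a problem.
ethtool -S eth0 | egrep -i 'drop|discard|err|miss'

# Pause (flow control) settings should match the switch configuration.
ethtool -a eth0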
Drive/Controller issues
One or more HDD or SSD drives are missing from a single server in the cluster or from servers in the same fault set
Attention
This concerns only pools with triple replication; for dual replication this is considered a critical state.
The missing drives may be seen using storpool disk list or storpool server <serverID> disk list. For example, in this output disk 543 is missing from the server with ID 54:
# storpool server 54 disk list
disk | server | size | used | est.free | % | free entries | on-disk size | allocated objects | errors | flags
541 | 54 | 207 GB | 61 GB | 136 GB | 29 % | 713180 | 75 GB | 158990 / 225000 | 0 / 0 |
542 | 54 | 207 GB | 56 GB | 140 GB | 27 % | 719526 | 68 GB | 161244 / 225000 | 0 / 0 |
543 | 54 | - | - | - | - % | - | - | - / - | -
544 | 54 | 207 GB | 62 GB | 134 GB | 30 % | 701722 | 76 GB | 158982 / 225000 | 0 / 0 |
545 | 54 | 207 GB | 61 GB | 135 GB | 30 % | 719993 | 75 GB | 161312 / 225000 | 0 / 0 |
546 | 54 | 207 GB | 54 GB | 142 GB | 26 % | 720023 | 68 GB | 158481 / 225000 | 0 / 0 |
547 | 54 | 207 GB | 62 GB | 134 GB | 30 % | 719996 | 77 GB | 179486 / 225000 | 0 / 0 |
548 | 54 | 207 GB | 53 GB | 143 GB | 26 % | 718406 | 70 GB | 179038 / 225000 | 0 / 0 |
The usual reason is that the drive was ejected from the cluster due to a write error, either by the kernel or by the running storpool_server instance. More information may be found using dmesg | tail and in the system log. Details about the model and the serial number of the failed drive are shown by storpool disk list info.
In normal conditions the server will flag the disk to be re-tested and will eject it for a quick test. Provided the disk is still working correctly and the test results do not breach any thresholds, the disk will be returned into the cluster to recover. Such a case might happen, for example, if a stalled request was caused by an intermittent issue, like a reallocated sector.
If the disk breaches any sane latency or bandwidth thresholds, it will not be returned automatically and will have to be re-balanced out of the cluster. Such disks are marked as "bad" (more details at storpool_initdisk options).
When one or more drives are ejected (already marked as bad) and missing, multiple volumes and/or snapshots will be listed with the D flag in the output of storpool volume status (D as in Degraded), due to the missing replicas for some of the data. This is normal and expected, and there are the following options in this situation:
The drive could still be working properly (for example, a set of bad sectors was reallocated) even after it was tested; in order to re-test it, you could mark the drive as --good (more info on how at storpool_initdisk options) and attempt to get it back into the cluster.
On some occasions a disk might have lost its signatures and would have to be returned into the cluster to recover from scratch - it will be automatically re-tested upon the attempt. A full (read-write) stress test is recommended to ensure it is working correctly (fio is a good tool for this kind of test, check its --verify option; see the sketch after this list). If the stress test is successful (i.e. the drive has been written to and verified successfully), it may be reinitialized with storpool_initdisk with the same disk ID it had before. This will automatically return it to the cluster, and it will fully recover all data from scratch as if it were a brand new drive.
The drive has failed irrecoverably and a replacement is available. The replacement drive is initialized with the diskID of the failed drive using storpool_initdisk. After returning it to the cluster, it will fully recover all the data from the live replicas (see Rebalancing the cluster for more details).
A replacement is not available. The only option is to re-balance the cluster without this drive (more details in Rebalancing the cluster).
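A sketch of such a full read-write stress test with fio, assuming the drive under test is /dev/sdX; this is destructive for any data on the drive, so make sure the device node really is the ejected drive:

# Write the whole device, then verify the written data.
# WARNING: destroys all data on /dev/sdX.
fio --name=verify-test \
    --filename=/dev/sdX \
    --direct=1 \
    --ioengine=libaio \
    --rw=write \
    --bs=128k \
    --iodepth=32 \
    --verify=crc32c

If the test completes without verification errors, the drive can be reinitialized with storpool_initdisk using its previous disk ID, as described above.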
Attention
Beware that in some cases with very full clusters it might be impossible to get the cluster back to full redundancy without overfilling some of the remaining drives. See the next section.
Some of the drives in the cluster are beyond 90% (up to 96% full)
With proper planning this should rarely be an issue. A way to avoid it is to add more drives or an additional server node with a full set of drives into the cluster. Another option is to remove unused volumes or snapshots.
The storpool snapshot space command returns information about the referred space for each snapshot on the underlying drives. Note that snapshots with a negative value in their "used" column will not free up any space if they are removed and will remain in the deleting state, because they are parents of multiple cloned child volumes.
Note that depending on the speed with which the cluster is being populated with data by the end users this might also be considered a critical state.
Some of the drives have fewer than 140k free entries (alert for an overloaded system)
This may be observed in the output of storpool disk list or storpool server <serverID> disk list; an example from the latter is shown below:
# storpool server 23 disk list
disk | server | size | used | est.free | % | free entries | on-disk size | allocated objects | errors | flags
2301 | 23.0 | 893 GiB | 2.6 GiB | 857 GiB | 0 % | 3719930 | 660 KiB | 17 / 930000 | 0 / 0 |
2302 | 23.0 | 893 GiB | 2.6 GiB | 857 GiB | 0 % | 3719929 | 668 KiB | 17 / 930000 | 0 / 0 |
2303 | 23.0 | 893 GiB | 2.6 GiB | 857 GiB | 0 % | 3719929 | 668 KiB | 17 / 930000 | 0 / 0 |
2304 | 23.0 | 893 GiB | 2.6 GiB | 857 GiB | 0 % | 3719931 | 668 KiB | 17 / 930000 | 0 / 0 |
2306 | 23.0 | 893 GiB | 2.6 GiB | 857 GiB | 0 % | 3719932 | 664 KiB | 17 / 930000 | 0 / 0 |
2305 | 23.0 | 893 GiB | 2.6 GiB | 857 GiB | 0 % | 3719930 | 660 KiB | 17 / 930000 | 0 / 0 |
2307 | 23.0 | 893 GiB | 2.6 GiB | 857 GiB | 0 % | 19934 | 664 KiB | 17 / 930000 | 0 / 0 |
--------------------------------------------------------------------------------------------------------------------------------------
7 | 1.0 | 6.1 TiB | 18 GiB | 5.9 TiB | 0 % | 26039515 | 4.5 MiB | 119 / 6510000 | 0 / 0 |
This usually happens after the system has been loaded for longer periods of time with a sustained write workload on one or multiple volumes. If this is unexpected and the reason is an erratic workload, the recommended way to handle it is to set a limit (bandwidth, IOPS, or both) on the loaded volumes, for example with storpool volume <volumename> bw 100M iops 1000. The same can be set for multiple volumes/snapshots in a template with storpool template <templatename> bw 100M iops 1000 propagate. Please note that propagating changes for templates with a very large number of volumes and snapshots might not work. If the overloaded state is due to normally occurring workload, it is best to expand the system with more drives and/or reformat the drives with a larger number of entries (this relates mainly to HDDs). The latter case is usually caused by a low number of hard drives in an HDD-only or a hybrid pool, and rarely by overloaded SSDs.
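For example, applying the limits mentioned above (the volume and template names are placeholders):

# Limit a single overloaded volume to 100 MB/s and 1000 IOPS.
storpool volume <volumename> bw 100M iops 1000

# Apply the same limits to all volumes and snapshots in a template.
storpool template <templatename> bw 100M iops 1000 propagate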
Another case related to overloaded drives is when many volumes are created from the same template, which requires overrides in order to shuffle the objects where the journals reside, so as to avoid overloading the same triplet of disks when all virtual machines spike for some reason (e.g. unattended upgrades, a syslog-intensive cron job, etc.).
A couple of notes on the degraded states: apart from the notes about replication above, none of these should affect the stability of the system at this point. For the example with the missing disk, in a hybrid system with a single failed SSD all read requests on triple-replicated volumes that have data on the failed drive will be served by some of the redundant copies on HDDs. This could slightly increase the read latencies for the operations on the parts of the volumes that were on this exact SSD. This is usually negligible in medium to large systems; for example, in a cluster with 20 SSDs or NVMe drives these are 1/20th of all the read operations in the cluster. In case of dual replicas on SSDs and a third replica on HDDs there is no read latency penalty whatsoever, which is also the case for missing hard drives - they will not affect the system at all, and in fact some write operations are even faster, because they are not waiting for the missing drive.
Critical state
This is an emergency state that requires immediate attention and intervention from the operations team and/or StorPool Support. Some of the conditions that could lead to this state are:
Partial or complete network outage.
Power loss for some nodes in the cluster.
Memory shortage leading to a service failure due to missing or incomplete cgroups configuration.
The following states are an indication of critical conditions:
API service failure
API not reachable on any of the configured nodes (the ones running the storpool_mgmt service)
Requests to the API from any of the nodes configured to access it either stall or cannot reach a working service. This is a critical state, because the status of the cluster is unknown (the cluster might even be down).
This might be caused by:
Misconfigured network for accessing the floating IP address. The address may be obtained with storpool_showconf http on any of the nodes with a configured storpool_mgmt service in the cluster:

# storpool_showconf http
SP_API_HTTP_HOST=10.3.10.78
SP_API_HTTP_PORT=81

Failed interfaces on the hosts that have the storpool_mgmt service running. To find the interface where the StorPool API should be running, use storpool_showconf api_iface:

# storpool_showconf api_iface
SP_API_IFACE=bond0.410

It is recommended to have the API on a redundant interface (e.g. an active-backup bond interface). Note that even without an API, provided the cluster is in quorum, there should be no impact on any running operations, but changes in the cluster (like creating/attaching/detaching/deleting volumes or snapshots) will be impossible. Running with no API in the cluster triggers a highest-severity alert to StorPool Support (essentially a wake-up alert) due to the unknown state of the system.
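A sketch for a basic reachability check from any node that is supposed to use the API; the address and port come from storpool_showconf as shown above, and the exact HTTP response depends on the API version and authentication, so the point here is only whether a response comes back at all:

# Obtain the configured floating address and port.
storpool_showconf http

# Probe the API endpoint; any HTTP status code (even an error) proves the service
# is reachable over the network, while a timeout or "connection refused" points
# to a network or service problem. Replace the address and port as needed.
curl -sS -o /dev/null -w '%{http_code}\n' --max-time 5 http://10.3.10.78:81/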
The cluster is not in quorum
The cluster is in this state if the number of running voting storpool_beacon services is less than half of the expected nodes plus one ((expected / 2) + 1). The configured number of expected nodes in the cluster may be checked with storpool_showconf expected; it is generally the number of server nodes (except when client nodes are configured as voting for some reason). In a system with 6 servers, at least 4 voting beacons should be available to get the cluster back into running state:

# storpool_showconf expected
SP_EXPECTED_NODES=6

The current number of expected votes and the number of voting beacons are displayed in the output of storpool net list; check the example above (the Quorum status: line).
API requests are not returning for more than 30-60 seconds (e.g. storpool volume status, storpool snapshot space, storpool disk list, etc.)
These API requests collect data from the running storpool_server services on each server node. Possible reasons are:
Network loss or delays
Failing storpool_server services
Failing drives or hardware (CPU, memory, controllers, etc.)
Overload
Server service failure
Two storpool_server services or whole servers are down
Two storpool_server services or whole servers are down or not joined in the cluster, in different fault sets. This is a very risky state, because there are parts of the volumes with only one live replica; if the latest writes land on a drive returning an I/O error or broken data (detected by StorPool), this will lead to data loss.
As in the degraded state, some of the read operations for parts of the volumes will be served from HDDs in a hybrid system and might raise read latencies. In this state it is very important to bring back the missing services/nodes as soon as possible, because a failure of any of the remaining drives in other nodes or another fault set will bring some of the volumes into down state and might lead to data loss in case of an error returned by a drive holding the latest writes.
More than two storpool_server services or whole servers are down
This state results in some volumes being in down state (storpool volume status), due to some parts of them residing only on the missing drives. The recommended action in this case is to check the reasons for the degraded services or missing (unresponsive) nodes and get them back up.
Possible reasons are:
Lost network connectivity
Severe packet loss/delays/loops
Partial or complete power loss
Hardware instabilities, overheating
Kernel or other software instabilities, crashes
Client service failure
If the client service (storpool_block) is down on some of the nodes depending on it (these could be either client-only nodes or converged hypervisors), this will stall all requests on that particular node until the service is back up.
Possible reasons are again:
Lost network connectivity
Severe packet loss/delays/loops
Bugs in the storpool_block service or the storpool_bd kernel module
In case of power loss or kernel crashes any virtual machine instances that were running on this node could be started on other available nodes.
Network interface or Switch failure
This means that the networks used for StorPool are down or are experiencing heavy packet loss or delays. In this case the quorum service will prevent a split-brain situation and will restart all services to ensure the cluster is fully connected on at least one network before it transitions again to running state. Such issues might be alleviated by a single-VLAN setup when different nodes have partial network connectivity, but the cluster will still experience severe delays in case of severe packet loss.
Hard Drive/SSD failures
Drives from two or more different nodes (fault sets) in the cluster are missing (or from a single node/fault set for systems with dual replication pools)
In this case multiple volumes may either experience degraded performance (hybrid placement) or will be in down state when more than two replicas are missing. All operations on volumes in down state are stalled until the redundancy is restored (i.e. at least one replica is available). The recommended steps are to immediately check for the reasons for the missing drives/services/nodes and return them into the cluster as soon as possible.
Some of the drives are more than 97% full
At some point all cluster operations will stall until either some of the data in the cluster is deleted or new drives/nodes are added. Adding drives requires the new drives to be stress tested and a re-balancing of the system to include them, which should be carefully planned (details in Rebalancing the cluster).
Note
Cleaning up snapshots that have multiple cloned volumes and a negative value for used space in the output of storpool snapshot space will not free up any space.
Some of the drives have fewer than 100k free entries
This is usually caused by a heavily overloaded system. In this state the latencies for some operations might become very high (measured in seconds). Possible reasons are severely overloaded volumes for long periods of time without any configured bandwidth or IOPS limits. This could be checked by using iostat to look for volumes that are constantly 100% loaded with a large number of requests to the storage system. Another way to check for such volumes is to use the "Top volumes" view in the analytics, in order to get information about the most loaded volumes and apply IOPS and/or bandwidth limits accordingly.
Other causes are misbehaving (underperforming) drives or misbehaving HBA/SAS controllers. The recommended way to deal with these cases is to investigate for such drives; a good idea is to check the output from storpool disk list internal for higher aggregation scores on some drives or sets of drives (e.g. on the same server), or to use the analytics to check for abnormal latency on some of the backend nodes (i.e. drives with significantly higher operation latency compared to other drives of the same type). An example would be a failing controller causing the SATA speed to degrade to SATA 1.0 (1.5 Gb/s) instead of SATA 3.0 (6 Gb/s), worn-out batteries on a RAID controller whose cache is used to accelerate the writes on the HDDs, and others.
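A sketch for confirming the negotiated SATA link speed of a suspect drive; /dev/sdX is a placeholder, and smartmontools must be installed for the second command:

# The kernel logs the negotiated link speed when the drive is detected.
dmesg | grep -i 'SATA link up'

# smartctl reports both the maximum supported and the currently negotiated speed.
smartctl -i /dev/sdX | grep -i 'SATA Version'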
The circumstances leading a system to the critical state are rare and are usually preventable by taking measures to handle all issues at the first signs of a change from the normal to the degraded state.
In any of the above cases, if you feel that something is not as expected, a consultation with StorPool Support is the best course of action. StorPool Support receives notifications for all of the cases described above and proactively takes action to alleviate a system going into degraded or critical state as soon as practically possible.
Hanging requests in the cluster
The output of /usr/lib/storpool/latthreshold.py shows hanging requests and/or missing services, as in the example below:
disk | reported by | peers | s | op | volume | requestId
-------------------------------------------------------------------------------------------------------------------
- | client 2 | client 2 -> server 2.1 | 15 | read | volume-name | 9223936579889289248:270215977642998472
- | client 2 | client 2 -> server 2.1 | 15 | read | volume-name | 9223936579889289248:270497452619709333
- | client 2 | client 2 -> server 2.1 | 15 | read | volume-name | 9223936579889289248:270778927596419790
- | client 2 | client 2 -> server 2.1 | 15 | write | volume-name | 9223936579889289248:271060402573130531
- | client 2 | client 2 -> server 2.1 | 15 | read | volume-name | 9223936579889289248:271341877549841211
- | client 2 | client 2 -> server 2.1 | 15 | read | volume-name | 9223936579889289248:271623352526551744
- | client 2 | client 2 -> server 2.1 | 15 | write | volume-name | 9223936579889289248:271904827503262450
server 2.1 connection status: established no_data timeout
disk 202 EXPECTED_UNKNOWN server 2.1
This could be caused by a starving CPU, hardware resets, misbehaving disks or network, or stalled services. The disk field in the output and the service warnings after the requests table can be used as an indicator of the misbehaving component.
Note that the active requests API call has a timeout for each service to respond. The default timeout that the latthreshold tool uses is 10 seconds. This value can be altered by using latthreshold's --api-requests-timeout/-A option and passing it a numeric value with a time unit (m, s, ms, or us), e.g. 100ms.
Service connection will have one of the following statuses:
established done - service reported its active requests as expected; this is not displayed in the regular output, only with --json
not_established - did not make a connection with the service; this could indicate the server is down, but may also indicate the service version is too old or its stream was overfilled or not connected
established no_data timeout - service did not respond and the connection was closed because the timeout was reached
established data timeout - service responded, but the connection was closed because the timeout was reached before it could send all the data
established invalid_data - a message the service sent had invalid data in it
The latthreshold tool also reports disk statuses. Reported disk statuses will be one of the following:
EXPECTED_MISSING - the service response was good, but did not provide information about the disk
EXPECTED_NO_CONNECTION_TO_PEER - the connection to the service was not established
EXPECTED_NO_PEER - the service is not present
EXPECTED_UNKNOWN - the service response was invalid or a timeout occurred