Troubleshooting

This part outlines the different states of a StorPool cluster, what should be expected in each of them, and what the recommended steps are. It is intended as a guideline for the operations team(s) maintaining the production system provided by StorPool.

Normal state of the system

The normal behaviour of the StorPool storage system is when it is fully configured and in up-and-running state. This is the desired state of the system.

Characteristics of this state:

All nodes in the storage cluster are up and running

This can be checked by using the CLI with storpool service list on any node with access to the API service.

Note

The storpool service list command provides status for all services running cluster-wide, rather than only the services running on the node itself.

All configured StorPool services are up and running

This is again easily checked with storpool service list. Recently restarted services are usually spotted by their short uptime. They should be taken seriously if the reason for the restart is unknown, even if they are running at the moment, as in the example with client ID 37 below:

# storpool service list
cluster running, mgmt on node 2
      mgmt   1 running on node  1 ver 20.00.18, started 2022-09-08 19:28:37, uptime 144 days 22:45:51
      mgmt   2 running on node  2 ver 20.00.18, started 2022-09-08 19:27:18, uptime 144 days 22:47:10 active
    server   1 running on node  1 ver 20.00.18, started 2022-09-08 19:28:59, uptime 144 days 22:45:29
    server   2 running on node  2 ver 20.00.18, started 2022-09-08 19:25:53, uptime 144 days 22:48:35
    server   3 running on node  3 ver 20.00.18, started 2022-09-08 19:23:30, uptime 144 days 22:50:58
    client   1 running on node  1 ver 20.00.18, started 2022-09-08 19:28:37, uptime 144 days 22:45:51
    client   2 running on node  2 ver 20.00.18, started 2022-09-08 19:25:32, uptime 144 days 22:48:56
    client   3 running on node  3 ver 20.00.18, started 2022-09-08 19:23:09, uptime 144 days 22:51:19
    client  21 running on node 21 ver 20.00.18, started 2022-09-08 19:20:26, uptime 144 days 22:54:02
    client  22 running on node 22 ver 20.00.18, started 2022-09-08 19:19:26, uptime 144 days 22:55:02
    client  37 running on node 37 ver 20.00.18, started 2022-09-08 13:08:12, uptime 05:06:16

Working cgroup memory and cpuset isolation is properly configured

Use the storpool_cg tool with the check argument to ensure everything is as expected. The tool should not return any warnings. For more information, see Control groups.

When properly configured, the sum of all memory limits on the node is less than the available memory on the node. This protects the running kernel from memory shortage, as well as all processes in the storpool.slice memory cgroup, which ensures the stability of the storage service.
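
A minimal sketch of such a check on a node (assuming a cgroup v1 layout; the storpool.slice path below follows the naming used above, and any other slices configured on the node should be inspected the same way):

# should complete without printing any warnings
storpool_cg check

# total physical memory on the node, in bytes
free -b | awk '/^Mem:/ {print $2}'

# memory limit of the storpool.slice cgroup, in bytes (cgroup v1 path)
cat /sys/fs/cgroup/memory/storpool.slice/memory.limit_in_bytes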

All network interfaces are properly configured

All network interfaces used by StorPool are up and properly configured with hardware acceleration enabled (where applicable); all network switches are configured with jumbo frames and flow control, and none of them experience any packet loss or delays. The output from storpool net list is a good starting point: all configured network interfaces should be seen as up, with the flags explained at the end of the output. The desired state is uU with a + at the end for each network interface; if hardware acceleration is supported on an interface, the A flag should also be present:

storpool net list
------------------------------------------------------------
| nodeId | flags   | net 1             | net 2             |
------------------------------------------------------------
|     23 | uU + AJ | 1A:00:01:00:00:17 | 1E:00:01:00:00:17 |
|     24 | uU + AJ | 1A:00:01:00:00:18 | 1E:00:01:00:00:18 |
|     25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
|     26 | uU + AJ | 12:00:01:00:F0:1A | 16:00:01:00:F0:1A |
|     27 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
  u - packets recently received from this node
  d - no packets recently received from this node

  U - this node has enough votes to be in the quorum
  D - this node does not have enough votes to be in the quorum
  + - this node considers itself in the quorum
  B - the connection to this node is through a backup link; check the cabling
  A - this node is using hardware acceleration
  J - the node uses jumbo frames

  N - a non-voting node

All drives are up and running

All drives in use for the storage system are performing at their specified speed, are joined in the cluster and serving requests.

This can be checked with storpool disk list internal; for example, in a normally loaded cluster all drives will report low aggregate scores. Below is an example output (trimmed for brevity):

# storpool disk list internal
--------------------------------------------------------------------------------------------------------------------------------------------------------
| disk | server |        aggregate scores        |         wbc pages        |     scrub bw |                          scrub ETA | last scrub completed |
--------------------------------------------------------------------------------------------------------------------------------------------------------
| 2301 |   23.0 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2022-09-08 15:33:44 |
| 2302 |   23.0 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2022-09-08 15:28:48 |
| 2303 |   23.1 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2022-09-08 15:28:49 |
| 2304 |   23.1 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2022-09-08 15:28:50 |
| 2305 |   23.2 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2022-09-08 15:28:51 |
| 2306 |   23.2 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2022-09-08 15:28:51 |
| 2307 |   23.3 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2022-09-08 15:28:52 |
| 2308 |   23.3 |        0 |        0 |        0 |        - + -     / -     |            - |                                  - |  2022-09-08 15:28:53 |
| 2311 |   23.0 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2022-09-08 15:28:38 |
| 2312 |   23.0 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2022-09-08 15:28:43 |
| 2313 |   23.1 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2022-09-08 15:28:44 |
| 2314 |   23.1 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2022-09-08 15:28:45 |
| 2315 |   23.2 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2022-09-08 15:28:47 |
| 2316 |   23.2 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2022-09-08 15:28:39 |
| 2317 |   23.3 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2022-09-08 15:28:40 |
| 2318 |   23.3 |        0 |        0 |        0 |        0 + 0     / 2560  |            - |                                  - |  2022-09-08 15:28:42 |
[snip]

All drives are regularly scrubbed, so they should have a stable (not increasing) number of errors. The errors corrected for each drive are visible in the storpool disk list output. The last completed scrub is visible in storpool disk list internal, as in the example above.

Note that some systems may have fewer than two network interfaces or a single backend switch. Although not recommended, this is still possible and sometimes used (usually in PoC setups or with a backup server) when the cluster is configured with a single-VLAN network redundancy scheme. A single-VLAN network redundancy configuration and an inter-switch connection are required for a cluster where only some of the nodes are connected with a single interface.

If one or more of the points describing the state above are not in effect, the system should not be considered healthy. If there is any suspicion that the system is behaving erratically even though all of the above conditions are satisfied, the recommended steps to check whether everything is in order are:

  • Check top and look for the state of each of the configured storpool_* services running on the present node. A properly running service is usually in the S (sleeping) state and rarely seen in the R (running) state. The CPU usage is often reported as 100% when hardware sleep is enabled, due to the kernel misreporting it. The actual usage is much lower and can be tracked with cpupower monitor for the CPU cores; see the example after this list.

  • Another way to ensure all services on this node are running correctly is to use the /usr/lib/storpool/sdump tool, which reports CPU and network usage statistics for the running services on the node. Use the -l option for the long names of the statistics.

  • On some of the nodes with running workloads (like VM instances or containers) iostat will show activity for processed requests on the block devices.

    The following example shows the normal disk activity on a node running VM instances. Note that the usage may vary greatly depending on the workload. The command used in the example is iostat -xm 1 /dev/sp-* | egrep -v " 0[.,]00$", which prints statistics for the StorPool devices each second, excluding drives that have no storage I/O activity:

    Device:  rrqm/s   wrqm/s  r/s     w/s      rMB/s   wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
    sp-0     0.00     0.00    0.00    279.00   0.00    0.14   1.00      3.87      13.80  0.00     13.80    3.55   98.99
    sp-11    0.00     0.00    165.60  114.10   19.29   14.26  245.66    5.97      20.87  9.81     36.91    0.89   24.78
    sp-12    0.00     0.00    171.60  153.60   19.33   19.20  242.67    9.20      28.17  10.46    47.96    1.08   35.01
    sp-13    0.00     0.00    6.00    40.80    0.04    5.10   225.12    1.75      37.32  0.27     42.77    1.06   4.98
    sp-21    0.00     0.00    0.00    82.20    0.00    1.04   25.90     1.00      12.08  0.00     12.08    12.16  99.99
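
As mentioned in the first item above, the actual CPU usage of the polling services can be observed with cpupower monitor (a minimal sketch; the monitors shown depend on the CPU and kernel):

# sample the C-state residency and average frequency of all cores every 5 seconds
cpupower monitor -i 5

On most x86 CPUs the C0 column of the Mperf monitor shows the time a core actually spent executing instructions, which for a healthy, mostly idle service is far below the 100% reported by top.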
    

There are no hanging active requests

The output of /usr/lib/storpool/latthreshold.py is empty - it shows no hanging requests and no service or disk warnings.

Degraded state

In this state some system components are not fully operational and need attention. Some examples of a degraded state are given below.

Degraded state due to service issues

A single storpool_server service on one of the storage nodes is not available or not joined in the cluster

Note that this concerns only pools with triple replication; for dual replication this is considered a critical state, because there are parts of the system with only one available copy. This is an example output from storpool service list:

# storpool service list
cluster running, mgmt on node 2
      mgmt   1 running on node  1 ver 20.00.18, started 2022-09-08 16:11:59, uptime 19:51:50
      mgmt   2 running on node  2 ver 20.00.18, started 2022-09-08 16:11:58, uptime 19:51:51 active
      mgmt   3 running on node  3 ver 20.00.18, started 2022-09-08 16:11:58, uptime 19:51:51
    server   1 down on node     1 ver 20.00.18
    server   2 running on node  2 ver 20.00.18, started 2022-09-08 16:12:03, uptime 19:51:46
    server   3 running on node  3 ver 20.00.18, started 2022-09-08 16:12:04, uptime 19:51:45
    client   1 running on node  1 ver 20.00.18, started 2022-09-08 16:11:59, uptime 19:51:50
    client   2 running on node  2 ver 20.00.18, started 2022-09-08 16:11:57, uptime 19:51:52
    client   3 running on node  3 ver 20.00.18, started 2022-09-08 16:11:57, uptime 19:51:52

If this is unexpected, i.e. no one has deliberately restarted or stopped the service for planned maintenance or upgrade, it is very important to first bring the service up and then to investigate the root cause for the service outage. When the storpool_server service comes back up it will start recovering the outdated data on its drives. The recovery process can be monitored with storpool task list, which shows which disks are recovering, as well as how much data is left to be recovered. Example output of storpool task list:

# storpool task list
----------------------------------------------------------------------------------------
|     disk |  task id |  total obj |  completed |    started |  remaining | % complete |
----------------------------------------------------------------------------------------
|     2303 | RECOVERY |          1 |          0 |          1 |          1 |         0% |
----------------------------------------------------------------------------------------
|    total |          |          1 |          0 |          1 |          1 |         0% |
----------------------------------------------------------------------------------------

Some of the volumes or snapshots will have the D flag (for degraded) visible in the storpool volume status output; it will disappear once all the data is fully recovered. Example situations include a reboot of the node for a kernel or package upgrade where no kernel modules were installed for the new kernel, or a service (in this example storpool_server) that was not configured to start on boot, and others.
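
To keep an eye on the recovery progress, the command above can simply be re-run periodically, for example:

# refresh the list of remaining recovery tasks every 10 seconds
watch -n 10 storpool task list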

Some of the configured StorPool services have failed or are not running

These could be:

  • The storpool_block service on some of the storage-only nodes, without any attached volumes or snapshots.

  • A single storpool_server service, or multiple instances on the same node; note again that this is critical for systems with dual replication.

  • A single API (storpool_mgmt) service, while another active API is still running.

The reasons for these could be the same as in the previous examples; usually the system log contains all the information needed to check why the service is not (getting) up.
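
A possible starting point for such an investigation on the affected node (a sketch; the exact service unit names and log locations depend on the installation and distribution):

# recent kernel messages - OOM kills, hardware errors, and so on
dmesg -T | tail -n 100

# state and recent log entries of the affected service on a systemd-based node
# (storpool_server is used here as an example unit name)
systemctl status storpool_server
journalctl -u storpool_server --since "1 hour ago"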

Degraded state due to host OS misconfiguration

Some examples include:

  • Changes in the OS configuration after a system update

    This could prevent some of the services from running after a fresh boot. For instance, due to changed names of the network interfaces used for the storage system after an upgrade, changed PCIe IDs for NVMe devices, and so on.

  • Kdump is no longer collecting kernel dump data properly

    If this occurs, it might be difficult to debug what has caused a kernel crash.

Some of the above cases will be difficult to catch prior to booting with the new environment (for example, kernel or other updates), and sometimes they are only caught after an event that reveals the issue. Thus it is important to regularly test and ensure the system is properly configured and that kernel dumps are collected normally.

Degraded state due to network interface issues

Some of the interfaces used by StorPool are not up.

This could be checked with storpool net list, like this:

# storpool net list
------------------------------------------------------------
| nodeId | flags   | net 1             | net 2             |
------------------------------------------------------------
|     23 | uU + AJ |                   | 1E:00:01:00:00:17 |
|     24 | uU + AJ |                   | 1E:00:01:00:00:18 |
|     25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
|     26 | uU + AJ | 12:00:01:00:F0:1A | 16:00:01:00:F0:1A |
|     27 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
  u - packets recently received from this node
  d - no packets recently received from this node

  U - this node has enough votes to be in the quorum
  D - this node does not have enough votes to be in the quorum
  + - this node considers itself in the quorum
  B - the connection to this node is through a backup link; check the cabling
  A - this node is using hardware acceleration
  J - the node uses jumbo frames

  N - a non-voting node

In the above example nodes 23 and 24 are not connected to the first network. This is the SP_IFACE1_CFG interface configuration in /etc/storpool.conf (check with storpool_showconf SP_IFACE1_CFG). Note that the beacons are up and running and the system is processing requests through the second network. Possible reasons are misconfigured interfaces, a wrong StorPool configuration, or a misconfigured backend switch/switches.
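
To narrow this down on one of the affected nodes, the configured interface can be cross-checked against its OS state (a sketch; eth2 is a placeholder for the actual interface name from the configuration):

# which interface is configured as the first StorPool network on this node
storpool_showconf SP_IFACE1_CFG

# link state and MTU of that interface
ip link show dev eth2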

An interface qualified for hardware acceleration is running without it

This is once again checked with storpool net list:

# storpool net list
------------------------------------------------------------
| nodeId | flags   | net 1             | net 2             |
------------------------------------------------------------
|     23 | uU +  J | 1A:00:01:00:00:17 | 1E:00:01:00:00:17 |
|     24 | uU +  J | 1A:00:01:00:00:18 | 1E:00:01:00:00:18 |
|     25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
|     26 | uU + AJ | 12:00:01:00:F0:1A | 16:00:01:00:F0:1A |
|     27 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
  u - packets recently received from this node
  d - no packets recently received from this node

  U - this node has enough votes to be in the quorum
  D - this node does not have enough votes to be in the quorum
  + - this node considers itself in the quorum
  B - the connection to this node is through a backup link; check the cabling
  A - this node is using hardware acceleration
  J - the node uses jumbo frames

  N - a non-voting node

In the above example, nodes 23 and 24 are equipped with NICs qualified for hardware acceleration, but are running without it; the possible reasons could be a BIOS or OS misconfiguration, misconfigured kernel parameters on boot, or a network interface misconfiguration. Note that when a system was configured for hardware-accelerated operation the cgroups configuration was also sized accordingly, thus running in this state is likely to cause performance issues, due to fewer CPU cores being isolated and reserved for the NIC interrupts and the storpool_rdma threads.

Jumbo frames are expected, but not working on some of the interfaces

This can be seen with storpool net list; if one of the two networks has an MTU lower than 9k, the J flag will not be listed:

# storpool net list
-------------------------------------------------------------
| nodeId | flags    | net 1             | net 2             |
-------------------------------------------------------------
|     23 | uU + A   | 12:00:01:00:F0:17 | 16:00:01:00:F0:17 |
|     24 | uU + AJ  | 1A:00:01:00:00:18 | 1E:00:01:00:00:18 |
|     25 | uU + AJ  | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
|     26 | uU + AJ  | 1A:00:01:00:00:1A | 1E:00:01:00:00:1A |
-------------------------------------------------------------
Quorum status: 4 voting beacons up out of 4 expected
Flags:
  u - packets recently received from this node
  d - no packets recently received from this node

  U - this node has enough votes to be in the quorum
  D - this node does not have enough votes to be in the quorum
  M - this node is being damped by the rest of the nodes in the cluster
  + - this node considers itself in the quorum
  B - the connection to this node is through a backup link; check the cabling
  A - this node is using hardware acceleration
  J - the node uses jumbo frames

  N - a non-voting node

If the node is not expected to run without jumbo frames, this might be an indication of a misconfigured interface or an issue with applying the interface configuration on boot. Note that an OS interface configured for jumbo frames without the switch port being properly configured leads to severe performance issues.
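
A quick way to verify jumbo frames end to end from a node (a sketch; the interface name and the peer address on the storage network are placeholders):

# MTU currently configured on the OS interface
ip link show dev eth2 | grep -o 'mtu [0-9]*'

# send a full-size frame with fragmentation prohibited to a peer on the same network
# (8972 bytes of payload + 20 bytes IPv4 header + 8 bytes ICMP header = 9000 bytes)
ping -M do -s 8972 -c 3 10.1.0.24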

Some network interfaces are experiencing network loss or delays on one of the networks

This might affect the latency of some of the storage operations. Depending on the node where the losses occur, it might affect a single client, or affect operations in the whole cluster if the packet loss or delays happen on a server node. Stats for all interfaces per service are collected in the analytics platform (https://analytics.storpool.com) and can be used to investigate network performance issues. The /usr/lib/storpool/sdump tool prints the same statistics on each of the nodes with running services. The usual causes for packet loss are the following; a sketch for checking the interface counters on a node is shown after the list:

  • Hardware issues (cables, SFPs, and so on).

  • Floods and DDoS attacks “leaking” into the storage network due to misconfiguration.

  • Saturation of the CPU cores that handle the interrupts for the network cards and others when hardware acceleration is not available.

  • Network loops leading to saturated switch ports or overloaded NICs.
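
A sketch for checking whether a node is dropping packets at the NIC level (eth2 is a placeholder; the driver-specific counter names vary between NIC vendors):

# generic per-interface RX/TX error and drop counters
ip -s link show dev eth2

# driver-specific counters, filtered for anything that looks like loss
ethtool -S eth2 | grep -iE 'drop|discard|err|miss'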

Drive/Controller issues

One or more HDD or SSD drives are missing from a single server in the cluster or from servers in the same fault set

Attention

This concerns only pools with triple replication; for dual replication this is considered a critical state.

The missing drives may be seen using storpool disk list or storpool server <serverID> disk list; for example, in this output disk 543 is missing from the server with ID 54:

# storpool server 54 disk list
disk  |   server  | size    |   used  |  est.free  |   %     | free entries | on-disk size |  allocated objects |  errors |  flags
541   |       54  | 207 GB  |  61 GB  |    136 GB  |   29 %  |      713180  |       75 GB  |   158990 / 225000  |   0 / 0 |
542   |       54  | 207 GB  |  56 GB  |    140 GB  |   27 %  |      719526  |       68 GB  |   161244 / 225000  |   0 / 0 |
543   |       54  |      -  |      -  |         -  |    - %  |           -  |           -  |        - / -       |    -
544   |       54  | 207 GB  |  62 GB  |    134 GB  |   30 %  |      701722  |       76 GB  |   158982 / 225000  |   0 / 0 |
545   |       54  | 207 GB  |  61 GB  |    135 GB  |   30 %  |      719993  |       75 GB  |   161312 / 225000  |   0 / 0 |
546   |       54  | 207 GB  |  54 GB  |    142 GB  |   26 %  |      720023  |       68 GB  |   158481 / 225000  |   0 / 0 |
547   |       54  | 207 GB  |  62 GB  |    134 GB  |   30 %  |      719996  |       77 GB  |   179486 / 225000  |   0 / 0 |
548   |       54  | 207 GB  |  53 GB  |    143 GB  |   26 %  |      718406  |       70 GB  |   179038 / 225000  |   0 / 0 |

The usual reason is that the drive was ejected from the cluster due to a write error, either by the kernel or by the running storpool_server instance. More information may be found using dmesg | tail and in the system log. The model and the serial number of the failed drive are shown by storpool disk list info.
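
For example (a sketch using the commands mentioned above; 543 is the disk ID from the output shown earlier, and the grep simply filters the line for that ID, assuming it appears in the output):

# kernel messages around the time of the ejection - I/O errors, controller resets, etc.
dmesg -T | tail -n 100

# model and serial number of the failed drive, for locating and replacing it
storpool disk list info | grep -w 543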

Under normal conditions the server will flag the disk to be re-tested and will eject it for a quick test. Provided the disk is still working correctly and the test results are not breaching any thresholds, the disk will be returned into the cluster to recover. Such a case might happen, for example, if the stalled request was caused by an intermittent issue, like a reallocated sector.

In case the disk is breaching any sane latency or bandwidth thresholds it will not be automatically returned and will have to be re-balanced out of the cluster. Such disks are marked as "bad" (more details at storpool_initdisk options).

When one or more drives are ejected (already marked as bad) and missing, multiple volumes and/or snapshots will be listed with the D flag (for degraded) in the output of storpool volume status, due to the missing replicas for some of the data. This is normal and expected, and there are the following options in this situation:

  • The drive could still be working properly (for example, a set of bad sectors was reallocated) even after it was tested; in order to re-test it you could mark the drive as --good (more info on how at storpool_initdisk options) and attempt to return it to the cluster.

  • On some occasions a disk might have lost its signatures and would have to be returned to the cluster to recover from scratch - it will be automatically re-tested upon the attempt; a full (read-write) stress test is recommended to ensure it is working correctly (fio is a good tool for this kind of test, check its --verify option; see the sketch after this list). In case the stress test is successful (e.g. the drive has been written to and verified successfully), it may be reinitialized with storpool_initdisk with the same disk ID it had before. This will automatically return it to the cluster and it will fully recover all data from scratch as if it were brand new.

  • The drive has failed irrecoverably and a replacement is available. The replacement drive is initialized with the diskID of the failed drive with storpool_initdisk. After returning it to the cluster it will fully recover all the data from the live replicas (please check Rebalancing the cluster for more).

  • A replacement is not available. The only option is to re-balance the cluster without this drive (more details in Rebalancing the cluster).
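
A possible fio invocation for such a stress test (a sketch only; /dev/sdX is a placeholder for the drive under test, and the run is destructive - it overwrites the whole drive, so it must never be pointed at a drive that is still part of the cluster or holds needed data):

# write the whole drive with verification patterns, then read everything back and verify it
fio --name=drive-verify --filename=/dev/sdX --direct=1 --ioengine=libaio \
    --rw=write --bs=1M --iodepth=32 --verify=crc32c --do_verify=1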

Attention

Beware that in some cases with very full clusters it might be impossible to get the cluster back to full redundancy without overfilling some of the remaining drives. See the next section.

Some of the drives in the cluster are beyond 90% full (up to 96%)

With proper planning this should rarely be an issue. A way to avoid it is to add more drives or an additional server node with a full set of drives to the cluster. Another option is to remove unused volumes or snapshots.

The storpool snapshot space command returns information about the space referred by each snapshot on the underlying drives. Note that snapshots with a negative value in their "used" column will not free up any space if they are removed and will remain in the deleting state, because they are parents of multiple cloned child volumes.

Note that depending on the speed with which the cluster is being populated with data by the end users this might also be considered a critical state.

Some of the drives have fewer than 140k free entries (alert for an overloaded system)

This may be observed in the output of storpool disk list or storpool server <serverID> disk list; an example from the latter is below:

# storpool server 23 disk list
  disk  |  server  |    size    |    used    |  est.free  |      %  |  free entries  |  on-disk size  |  allocated objects |  errors |  flags
  2301  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |       3719930  |       660 KiB  |       17 / 930000  |   0 / 0 |
  2302  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |       3719929  |       668 KiB  |       17 / 930000  |   0 / 0 |
  2303  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |       3719929  |       668 KiB  |       17 / 930000  |   0 / 0 |
  2304  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |       3719931  |       668 KiB  |       17 / 930000  |   0 / 0 |
  2306  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |       3719932  |       664 KiB  |       17 / 930000  |   0 / 0 |
  2305  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |       3719930  |       660 KiB  |       17 / 930000  |   0 / 0 |
  2307  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |         19934  |       664 KiB  |       17 / 930000  |   0 / 0 |
--------------------------------------------------------------------------------------------------------------------------------------
     7  |     1.0  |   6.1 TiB  |    18 GiB  |   5.9 TiB  |    0 %  |      26039515  |       4.5 MiB  |      119 / 6510000 |   0 / 0 |

This usually happens after the system has been loaded for long periods of time with a sustained write workload on one or multiple volumes. If this is unexpected and the reason is an erratic workload, the recommended way to handle it is to set a limit (bandwidth, IOPS, or both) on the loaded volumes, for example with storpool volume <volumename> bw 100M iops 1000. The same can be set for multiple volumes/snapshots in a template with storpool template <templatename> bw 100M iops 1000 propagate. Please note that propagating changes for templates with a very large number of volumes and snapshots might not work. If the overloaded state is due to a normally occurring workload, it is best to expand the system with more drives and/or reformat the drives with a larger number of entries (this relates mainly to HDDs). The latter case is usually caused by a lower number of hard drives in an HDD-only or a hybrid pool, and rarely by overloaded SSDs.

Another case related to overloaded drives is when many volumes are created out of the same template, which requires overrides to shuffle the objects where the journals are residing, so that the same triplet of disks is not overloaded when all virtual machines spike for some reason (e.g. unattended upgrades, a syslog-intensive cron job, etc.).

A couple of notes on the degraded states: apart from the notes on replication above, none of these should affect the stability of the system at this point. For the example with the missing disk in hybrid systems with a single failed SSD, all read requests on volumes with triple replication that have data on the failed drive will be served by some of the redundant copies on HDDs. This could slightly increase the read latencies for the operations on the parts of the volume that were on this exact SSD. This is usually negligible in medium to large systems; i.e. in a cluster with 20 SSDs or NVMe drives, these are 1/20th of all the read operations in the cluster. In case of dual replicas on SSDs and a third replica on HDDs there is no read latency penalty whatsoever, which is also the case for missing hard drives - they will not affect the system at all and in fact some write operations are even faster, because they are not waiting for the missing drive.

Critical state

This is an emergency state that requires immediate attention and intervention from the operations team and/or StorPool Support. Some of the conditions that could lead to this state are:

  • Partial or complete network outage.

  • Power loss for some nodes in the cluster.

  • Memory shortage leading to a service failure due to missing or incomplete cgroups configuration.

The following states are an indication for critical conditions:

API service failure

API not reachable on any of the configured nodes (the ones running the storpool_mgmt service)

Requests to the API from any of the nodes configured to access it either stall or cannot reach a working service. This is a critical state, because the status of the cluster is unknown (it might be down, for that matter).

This might be caused by:

  • Misconfigured network for accessing the floating IP address - the address may be obtained by storpool_showconf http on any of the nodes with a configured storpool_mgmt service in the cluster:

    # storpool_showconf http
    SP_API_HTTP_HOST=10.3.10.78
    SP_API_HTTP_PORT=81
    
  • Failed interfaces on the hosts that have the storpool_mgmt service running. To find the interface where the StorPool API should be running, use storpool_showconf api_iface:

    # storpool_showconf api_iface
    SP_API_IFACE=bond0.410
    

    It is recommended to have the API on a redundant interface (e.g. an active-backup bond interface). Note that even without an API, provided the cluster is in quorum, there should be no impact on any running operations, but changes in the cluster (like creating/attaching/detaching/deleting volumes or snapshots) will be impossible. Running with no API in the cluster triggers a highest-severity alert to StorPool Support (essentially a wake-up alert) due to the unknown state of the system. A basic reachability check for the API address is sketched after this list.

  • The cluster is not in quorum

    The cluster is in this state if the number of running voting storpool_beacon services is less than half of the expected nodes plus one ((expected / 2) + 1). The configured number of expected nodes in the cluster may be checked with storpool_showconf expected; it is generally the number of server nodes (except when client nodes are configured as voting for some reason). In a system with 6 servers, at least 4 voting beacons should be available to get the cluster back into running state:

    # storpool_showconf expected
    SP_EXPECTED_NODES=6
    

    The current number of expected votes and the number of voting beacons are displayed in the output of storpool net list, check the example above (the Quorum status: line).
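
A basic reachability check for the floating API address from any node (a sketch; it only verifies that the address and TCP port answer, not that the StorPool API itself is healthy, and it reuses the address and port from the storpool_showconf http example above):

# is the floating address reachable over the API network?
ping -c 3 10.3.10.78

# does anything answer on the configured HTTP port?
curl -sS --max-time 5 -o /dev/null -w 'HTTP %{http_code}\n' http://10.3.10.78:81/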

API requests take more than 30-60 seconds to return (e.g. storpool volume status, storpool snapshot space, storpool disk list, etc.)

These API requests collect data from the running storpool_server services on each server node. Possible reasons are:

  • Network loss or delays;

  • Failing storpool_server services;

  • Failing drives or hardware (CPU, memory, controllers, etc.);

  • Overload

Server service failure

Two storpool_server services or whole servers are down

Two storpool_server services or whole servers are down or not joined in the cluster in different fault sets. This is a very risky state, because there are parts of the volumes with only one live replica; if the latest writes land on a drive returning an I/O error or broken data (detected by StorPool), this will lead to data loss.

As in the degraded state, some of the read operations for parts of the volumes will be served from HDDs in a hybrid system and might raise read latencies. In this state it is very important to bring back the missing services/nodes as soon as possible, because a failure of any of the remaining drives in other nodes or another fault set will bring some of the volumes into down state and might lead to data loss in case of an error returned by a drive holding the latest writes.

More than two storpool_server services or whole servers are down

This state results in some volumes being in down state (storpool volume status), because some parts of them reside only on the missing drives. The recommended action in this case is to check the reasons for the degraded services or missing (unresponsive) nodes and get them back up.

Possible reasons are:

  • Lost network connectivity

  • Severe packet loss/delays/loops

  • Partial or complete power loss

  • Hardware instabilities, overheating

  • Kernel or other software instabilities, crashes

Client service failure

If the client service (storpool_block) is down on some of the nodes depending on it (these could be either client-only nodes or converged hypervisors), all requests on that particular node will stall until the service is back up.

Possible reasons are again:

  • Lost network connectivity

  • Severe packet loss/delays/loops

  • Bugs in the storpool_block service or the storpool_bd kernel module

In case of power loss or kernel crashes any virtual machine instances that were running on this node could be started on other available nodes.

Network interface or Switch failure

This means that the networks used for StorPool are down or are experiencing heavy packet loss or delays. In this case the quorum service will prevent a split-brain situation and will restart all services to ensure the cluster is fully connected on at least one network before it transitions again to running state. Such issues might be alleviated by a single-VLAN setup when different nodes have partial network connectivity, but severe packet loss will still cause severe delays.

Hard Drive/SSD failures

Drives from two or more different nodes (fault sets) in the cluster are missing (or from a single node/fault set for systems with dual replication pools)

In this case multiple volumes may either experience degraded performance (hybrid placement) or will be in down state when more than two replicas are missing. All operations on volumes in down state are stalled until the redundancy is restored (i.e. at least one replica is available). The recommended steps are to immediately check the reasons for the missing drives/services/nodes and return them to the cluster as soon as possible.

Some of the drives are more than 97% full

At some point all cluster operations will stall until either some of the data in the cluster is deleted or new drives/nodes are added. Adding drives requires the new drives to be stress tested and a re-balancing of the system to include them, which should be carefully planned (details in Rebalancing the cluster).

Note

Cleaning up snapshots that have multiple cloned volumes and a negative value for used space in the output of storpool snapshot space will not free up any space.

Some of the drives have fewer than 100k free entries

This is usually caused by a heavily overloaded system. In this state the latencies for some operations might become very high (measured in seconds). Possible reasons are severely overloaded volumes for long periods of time without any configured bandwidth or IOPS limits. This can be checked by using iostat to look for volumes that are constantly 100% loaded with a large number of requests to the storage system. Another way to check for such volumes is to use the "Top volumes" view in the analytics platform to get information about the most loaded volumes and apply IOPS and/or bandwidth limits accordingly. Other causes are misbehaving (underperforming) drives or misbehaving HBA/SAS controllers; the recommended way to deal with these cases is to investigate for such drives. A good idea is to check the output from storpool disk list internal for higher aggregate scores on some drives or sets of drives (e.g. on the same server), or to use the analytics platform to check for abnormal latency on some of the backend nodes (i.e. drives with significantly higher operation latency compared to other drives of the same type). Examples would be a failing controller causing the SATA speed to degrade to SATA 1.0 (1.5 Gb/s) instead of SATA 3.0 (6 Gb/s), worn-out batteries on a RAID controller whose cache is used to accelerate the writes on the HDDs, and others.

The circumstances leading a system to the critical state are rare and are usually preventable by taking measures to handle all issues at the first signs of a change from the normal to the degraded state.

In any of the above cases, if you feel that something is not as expected, a consultation with StorPool Support is the best course of action. StorPool Support receives notifications for all the cases detailed above and proactively takes action to alleviate a system going into a degraded or critical state as soon as practically possible.

Hanging requests in the cluster

The output of /usr/lib/storpool/latthreshold.py shows hanging requests and/or missing services as in the example below:

disk | reported by | peers                      |  s |   op  |      volume |                              requestId
-------------------------------------------------------------------------------------------------------------------
-    | client 2    | client 2    -> server 2.1  | 15 | read  | volume-name | 9223936579889289248:270215977642998472
-    | client 2    | client 2    -> server 2.1  | 15 | read  | volume-name | 9223936579889289248:270497452619709333
-    | client 2    | client 2    -> server 2.1  | 15 | read  | volume-name | 9223936579889289248:270778927596419790
-    | client 2    | client 2    -> server 2.1  | 15 | write | volume-name | 9223936579889289248:271060402573130531
-    | client 2    | client 2    -> server 2.1  | 15 | read  | volume-name | 9223936579889289248:271341877549841211
-    | client 2    | client 2    -> server 2.1  | 15 | read  | volume-name | 9223936579889289248:271623352526551744
-    | client 2    | client 2    -> server 2.1  | 15 | write | volume-name | 9223936579889289248:271904827503262450
server 2.1  connection status: established no_data timeout
disk 202 EXPECTED_UNKNOWN server 2.1

This could be caused by a starving CPU, hardware resets, misbehaving disks or network, or stalled services. The disk field in the output and the service warnings after the requests table can be used as indicators of the misbehaving component.

Note that the active requests API call has a timeout for each service to respond. The default timeout that the latthreshold tool uses is 10 seconds. This value can be altered by using latthreshold's --api-requests-timeout/-A option and passing it a numeric value with a time unit (m, s, ms or us), e.g. 100ms.
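
For example (the timeout values are only illustrative):

# /usr/lib/storpool/latthreshold.py --api-requests-timeout 30s
# /usr/lib/storpool/latthreshold.py -A 100ms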

Service connection will have one of the following statuses:

  • established done - service reported its active requests as expected; this is not displayed in the regular output, only with --json

  • not_established - did not make a connection with the service - this could indicate the server is down, but may also indicate the service version is too old or its stream was overfilled or not connected

  • established no_data timeout - service did not respond and the connection was closed because the timeout was reached

  • established data timeout - service responded but the connection was closed because the timeout was reached before it could send all the data

  • established invalid_data - a message the service sent had invalid data in it

The latthreshold tool also reports disk statuses. Reported disk statuses will be one of the following:

  • EXPECTED_MISSING - the service response was good, but did not provide information about the disk

  • EXPECTED_NO_CONNECTION_TO_PEER - the connection to the service was not established

  • EXPECTED_NO_PEER - the service is not present

  • EXPECTED_UNKNOWN - the service response was invalid or a timeout occurred