Degraded state troubleshooting

The information provided here is about the degraded state of a StorPool cluster - what to expect, and what the recommended steps are. It is intended to be used as a guideline by the operations teams maintaining a StorPool system. For details on what to do when the system is in a different state, see Normal state troubleshooting and Critical state troubleshooting.

During a degraded state some system components are not fully operational and need attention. The following sections provide some examples of a degraded state, and suggest troubleshooting steps.

Service issues

For an overview of what the services do, see StorPool services.

storpool_server not available or not joined

A single storpool_server service on one of the storage nodes is not available, or has not joined in the cluster.

Note

This concerns only pools with triple replication (see Redundancy). For dual replication this is considered to be a critical state, because there are parts of the system with only one available copy.

You can check which service is down using the storpool service list command (see Services). Here is an example output:

# storpool service list
cluster running, mgmt on node 2
      mgmt   1 running on node  1 ver 20.00.18, started 2022-09-08 16:11:59, uptime 19:51:50
      mgmt   2 running on node  2 ver 20.00.18, started 2022-09-08 16:11:58, uptime 19:51:51 active
      mgmt   3 running on node  3 ver 20.00.18, started 2022-09-08 16:11:58, uptime 19:51:51
    server   1 down on node     1 ver 20.00.18
    server   2 running on node  2 ver 20.00.18, started 2022-09-08 16:12:03, uptime 19:51:46
    server   3 running on node  3 ver 20.00.18, started 2022-09-08 16:12:04, uptime 19:51:45
    client   1 running on node  1 ver 20.00.18, started 2022-09-08 16:11:59, uptime 19:51:50
    client   2 running on node  2 ver 20.00.18, started 2022-09-08 16:11:57, uptime 19:51:52
    client   3 running on node  3 ver 20.00.18, started 2022-09-08 16:11:57, uptime 19:51:52

If this is unexpected – that is, no one has deliberately restarted or stopped the service for planned maintenance or an upgrade – it is very important to first bring the service up and then investigate the root cause of the outage. When the storpool_server service comes back up, it will start recovering the outdated data on its drives. The recovery process can be monitored with the storpool task list command (see Tasks), which shows which disks are recovering and how much data is left to recover. Here is an example:

# storpool task list
----------------------------------------------------------------------------------------
|     disk |  task id |  total obj |  completed |    started |  remaining | % complete |
----------------------------------------------------------------------------------------
|     2303 | RECOVERY |          1 |          0 |          1 |          1 |         0% |
----------------------------------------------------------------------------------------
|    total |          |          1 |          0 |          1 |          1 |         0% |
----------------------------------------------------------------------------------------

Some of the volumes or snapshots will have the D (degraded) flag visible in the output of the storpool volume status command (see Volume status); the flag disappears once all the data is fully recovered. Typical examples of how this situation occurs are a node reboot for a kernel or package upgrade where no kernel modules were installed for the new kernel, or a service (in this example storpool_server) that was not configured to start on boot.
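
The exact way to start the service depends on how the node is set up; the following sequence is only an illustration, assuming the service is managed as a systemd unit named storpool_server.service (adjust to your environment):

# systemctl start storpool_server.service   # assumption: systemd-managed unit name
# storpool service list                     # confirm the server service is running and has joined
# storpool task list                        # follow the recovery progress
# storpool volume status                    # the D flags should clear once recovery completes

Once storpool task list shows no remaining recovery tasks and the D flags are gone, proceed with investigating the root cause in the system log.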

Services failed or are not running

Some of the configured StorPool services have failed or are not running. These could be:

  • The storpool_block service on some of the storage-only nodes, without any attached volumes or snapshots.

  • A single storpool_server service, or multiple instances on the same node; note again that this is critical for systems with dual replication.

  • A single API (storpool_mgmt) service, while another API instance is running and active.

The reasons for these could be the same as in the previous examples. Usually the system log contains all the information needed to check why the service is not running or not coming up.
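
As an illustration only, assuming the services are managed as systemd units (the unit name below is an assumption; adjust to your environment), the status and recent log messages of a failed service could be checked like this:

# systemctl status storpool_block.service              # assumption: systemd unit name
# journalctl -u storpool_block.service --since "1 hour ago"
# dmesg | tail                                         # related kernel messages, if any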

Host OS misconfiguration

Some examples include:

  • Changes in the OS configuration after a system update

    This could prevent some of the services from running after a fresh boot - for instance, due to changed names of the network interfaces used for the storage system after an upgrade, changed PCIe IDs of NVMe devices, and so on.

  • Kdump is no longer collecting kernel dump data properly

    If this occurs, it might be difficult to debug what has caused a kernel crash.

Some of the above cases are difficult to catch prior to booting into the new environment (for example, after kernel or other updates), and sometimes they are only noticed after an event that reveals the issue. Thus, it is important to regularly test and ensure that the system is properly configured and that crash dumps are collected normally.
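
The following checks are only an illustration of verifying the environment after a reboot or an update; SP_IFACE2_CFG is assumed by analogy with SP_IFACE1_CFG, and the kdump check assumes an Enterprise Linux style setup with a kdump service:

# storpool_showconf SP_IFACE1_CFG      # the interfaces StorPool expects to use
# storpool_showconf SP_IFACE2_CFG
# ip -br link                          # the interfaces actually present in the OS
# lsblk -d -o NAME,MODEL,SERIAL        # the block devices currently visible
# systemctl status kdump               # whether kernel dump collection is active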

Network interface issues

The storpool net list command can be of help when troubleshooting networking issues; for details, see Network.

Interfaces are down

Some of the interfaces used by StorPool are not up. You can check if this has happened using the storpool net list command, like this:

# storpool net list
------------------------------------------------------------
| nodeId | flags   | net 1             | net 2             |
------------------------------------------------------------
|     23 | uU + AJ |                   | 1E:00:01:00:00:17 |
|     24 | uU + AJ |                   | 1E:00:01:00:00:18 |
|     25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
|     26 | uU + AJ | 12:00:01:00:F0:1A | 16:00:01:00:F0:1A |
|     27 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
  u - packets recently received from this node
  d - no packets recently received from this node

  U - this node has enough votes to be in the quorum
  D - this node does not have enough votes to be in the quorum
  + - this node considers itself in the quorum
  B - the connection to this node is through a backup link; check the cabling
  A - this node is using hardware acceleration
  J - the node uses jumbo frames

  N - a non-voting node

In the example above, nodes 23 and 24 are not connected to the first network. This is the network defined by the SP_IFACE1_CFG option in the /etc/storpool.conf file (see Interfaces for StorPool cluster communication and StorPool block protocol); you can check it with the storpool_showconf SP_IFACE1_CFG command.

Note that in this example the beacons are up and running, and the system is processing requests through the second network. The possible reasons include misconfigured interfaces, a wrong StorPool configuration, or misconfigured backend switches.
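
For example, compare the configured interface with its actual state in the OS; the interface name below is hypothetical and should be taken from the SP_IFACE1_CFG value:

# storpool_showconf SP_IFACE1_CFG      # which OS interface is expected on the first network
# ip link show sp0                     # hypothetical interface name; check that it is UP
# ethtool sp0                          # link speed and the "Link detected" status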

No hardware acceleration

An interface qualified for hardware acceleration is running without it. Once again, you can check this with storpool net list:

# storpool net list
------------------------------------------------------------
| nodeId | flags   | net 1             | net 2             |
------------------------------------------------------------
|     23 | uU +  J | 1A:00:01:00:00:17 | 1E:00:01:00:00:17 |
|     24 | uU +  J | 1A:00:01:00:00:18 | 1E:00:01:00:00:18 |
|     25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
|     26 | uU + AJ | 12:00:01:00:F0:1A | 16:00:01:00:F0:1A |
|     27 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
  u - packets recently received from this node
  d - no packets recently received from this node

  U - this node has enough votes to be in the quorum
  D - this node does not have enough votes to be in the quorum
  + - this node considers itself in the quorum
  B - the connection to this node is through a backup link; check the cabling
  A - this node is using hardware acceleration
  J - the node uses jumbo frames

  N - a non-voting node

In the example above, nodes 23 and 24 are equipped with NICs qualified for hardware acceleration, but are running without it. The possible reasons include a BIOS or an OS misconfiguration, misconfigured kernel parameters on boot, or a network interface misconfiguration.

Note that when a system is configured for hardware-accelerated operation, the cgroups configuration is also sized accordingly. Thus, running in this state is likely to cause performance issues, because fewer CPU cores are isolated and reserved for the NIC interrupts and the storpool_rdma threads.
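
The checks below are only a rough illustration and assume that hardware acceleration depends on the IOMMU and the boot-time kernel parameters set during the initial deployment; the exact requirements are deployment specific:

# cat /proc/cmdline                    # the kernel boot parameters actually in effect
# dmesg | grep -i -e iommu -e dmar     # whether the IOMMU was initialized at boot
# storpool_showconf SP_IFACE1_CFG      # the interface configuration StorPool is using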

Jumbo frames issues

Jumbo frames are expected, but are not working on some of the interfaces. This can be observed with storpool net list: if one of the two networks has an MTU lower than 9k, the J flag will not be listed:

# storpool net list
-------------------------------------------------------------
| nodeId | flags    | net 1             | net 2             |
-------------------------------------------------------------
|     23 | uU + A   | 12:00:01:00:F0:17 | 16:00:01:00:F0:17 |
|     24 | uU + AJ  | 1A:00:01:00:00:18 | 1E:00:01:00:00:18 |
|     25 | uU + AJ  | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
|     26 | uU + AJ  | 1A:00:01:00:00:1A | 1E:00:01:00:00:1A |
-------------------------------------------------------------
Quorum status: 4 voting beacons up out of 4 expected
Flags:
  u - packets recently received from this node
  d - no packets recently received from this node

  U - this node has enough votes to be in the quorum
  D - this node does not have enough votes to be in the quorum
  M - this node is being damped by the rest of the nodes in the cluster
  + - this node considers itself in the quorum
  B - the connection to this node is through a backup link; check the cabling
  A - this node is using hardware acceleration
  J - the node uses jumbo frames

  N - a non-voting node

If the node is not expected to be running without jumbo frames, this might be an indication of a misconfigured interface or an issue with applying the interface configuration on boot. Note that an OS interface configured for jumbo frames without the switch port being properly configured leads to severe performance issues.
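
A quick end-to-end check is to verify the MTU of the interface and send a large, non-fragmentable ICMP packet to the storage address of another node; the interface name and the address below are hypothetical:

# ip link show sp0 | grep -o 'mtu [0-9]*'    # expect mtu 9000 on a jumbo frames interface
# ping -M do -s 8972 -c 3 10.3.1.24          # 8972 bytes of payload + 28 bytes of headers = 9000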

Network losses or delays

Some network interfaces are experiencing network loss or delays on one of the networks.

This might affect the latency of some of the storage operations. Depending on the node where the losses occur, this might affect a single client, or - if the packet loss or delays are happening on a server node - operations in the whole cluster. Per-service statistics for all interfaces are collected in the analytics platform and can be used to investigate network performance issues; the /usr/lib/storpool/sdump tool prints the same statistics locally on each of the nodes with services (see the example checks after the list below). The usual causes for packet loss are:

  • Hardware issues (cables, SFPs, and so on).

  • Floods and DDoS attacks “leaking” into the storage network due to misconfiguration.

  • Saturation of the CPU cores that handle the interrupts for the network cards (among other devices) when hardware acceleration is not available.

  • Network loops leading to saturated switch ports or overloaded NICs.
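
As an illustration of checking for losses on the OS side (the interface name is hypothetical), the standard counters can be inspected alongside the per-service statistics from sdump:

# ip -s link show sp0                           # RX/TX errors and dropped counters
# ethtool -S sp0 | grep -i -E 'drop|err|disc'   # NIC driver statistics, if supported by the driver
# /usr/lib/storpool/sdump                       # per-service network statistics on this node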

Drive/Controller issues

Missing drives

One or more HDD or SSD drives are missing from a single server in the cluster, or from servers in the same fault set.

Attention

This concerns only pools with triple replication; for dual replication this is considered a critical state.

The missing drives may be observed using the storpool disk list or storpool server <serverID> disk list commands (see Listing disks and Server). For example, in this output disk 543 is missing from the server with ID 54:

# storpool server 54 disk list
disk  |   server  | size    |   used  |  est.free  |   %     | free entries | on-disk size |  allocated objects |  errors |  flags
541   |       54  | 207 GB  |  61 GB  |    136 GB  |   29 %  |      713180  |       75 GB  |   158990 / 225000  |   0 / 0 |
542   |       54  | 207 GB  |  56 GB  |    140 GB  |   27 %  |      719526  |       68 GB  |   161244 / 225000  |   0 / 0 |
543   |       54  |      -  |      -  |         -  |    - %  |           -  |           -  |        - / -       |    -
544   |       54  | 207 GB  |  62 GB  |    134 GB  |   30 %  |      701722  |       76 GB  |   158982 / 225000  |   0 / 0 |
545   |       54  | 207 GB  |  61 GB  |    135 GB  |   30 %  |      719993  |       75 GB  |   161312 / 225000  |   0 / 0 |
546   |       54  | 207 GB  |  54 GB  |    142 GB  |   26 %  |      720023  |       68 GB  |   158481 / 225000  |   0 / 0 |
547   |       54  | 207 GB  |  62 GB  |    134 GB  |   30 %  |      719996  |       77 GB  |   179486 / 225000  |   0 / 0 |
548   |       54  | 207 GB  |  53 GB  |    143 GB  |   26 %  |      718406  |       70 GB  |   179038 / 225000  |   0 / 0 |

A drive is usually missing because it was ejected from the cluster due to a write error, either by the kernel or by the running storpool_server instance. More information may be found using dmesg | tail and in the system log. To obtain the model and the serial number of the failed drive, use the storpool disk list info command.
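
For example, to gather details about the ejected drive from this node (disk 543 in the listing above; the grep filter is only a convenience):

# dmesg | tail                             # recent kernel messages about the failed device
# storpool disk list info | grep -w 543    # model and serial number of the missing drive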

Under normal conditions, the server will flag the disk to be re-tested and will eject it for a quick test. Provided the disk is still working correctly and the test results are not breaching any thresholds, it will be returned to the cluster to recover. Such a case might happen, for example, if a stalled request was caused by an intermittent issue, like a reallocated sector.

If the disk is breaching the latency or bandwidth thresholds, it will not be returned automatically and will have to be re-balanced out of the cluster. Such disks are marked as “bad” (see Device preparation options).

When one or more drives are ejected (already marked as bad) and missing, multiple volumes and/or snapshots will be listed with the D (degraded) flag in the output of storpool volume status, due to the missing replicas for some of the data. This is normal and expected; the options in this situation are the following:

  • Re-testing

    The drive could still be working properly (for example, a set of bad sectors was reallocated) even after it was tested. To re-test it, you could mark the drive as --good (see Device preparation options) and attempt to get it back into the cluster.

  • Recovering from scratch

    On some occasions a disk might have lost its signatures and would have to be returned to the cluster to recover from scratch; it will be automatically re-tested upon the attempt. A full (read-write) stress test is recommended to ensure it is working correctly (fio is a good tool for this kind of test, check its --verify option; see the sketch after this list). If the stress test is successful (that is, the drive has been written to and verified successfully), the drive may be reinitialized with storpool_initdisk using the same disk ID it had before. This will automatically return it to the cluster, and it will fully recover all data from scratch as if it were brand new.

  • Replacing a drive

    The drive has failed irrecoverably and a replacement is available. The replacement drive is initialized with the diskID of the failed drive with storpool_initdisk (see Initializing a drive). After returning it to the cluster it will fully recover all the data from the live replicas (see Rebalancing the cluster).

  • No replacement available

    A replacement is not available. The only option is to re-balance the cluster without this drive (for more details, see Rebalancing the cluster).
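
Here is a sketch of such a stress test with fio, assuming the drive to be tested is /dev/sdX (a hypothetical device name); the test is destructive, so make sure the device is not used by anything else, and note that the exact storpool_initdisk invocation is described in Initializing a drive:

# fio --name=verify-test --filename=/dev/sdX --direct=1 --rw=write --bs=1M --verify=crc32c --verify_fatal=1
# storpool_initdisk 543 /dev/sdX           # assumption about the exact arguments; reuse the old disk ID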

Attention

Beware that in some cases with very full clusters it might be impossible to get the cluster back to full redundancy without overfilling some of the remaining drives. See the next section.

Full drives

Some of the drives in the cluster are beyond 90% full (up to 96% full). With proper planning this should rarely be an issue. You can get out of this situation by freeing up space - for example, by removing volumes or snapshots that are no longer needed.

The storpool snapshot space command returns information about the space referred by each snapshot on the underlying drives. Note that snapshots with a negative value in their “used” column will not free up any space if they are removed, and will remain in the deleting state, because they are parents of multiple cloned child volumes.

Note that, depending on how quickly the cluster is being populated with data by the end users, this might also be considered a critical state.
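
For example, to see which drives are approaching the limit and which snapshots would actually free up space if removed:

# storpool disk list               # check the "%" column for drives above ~90%
# storpool snapshot space          # snapshots with a negative "used" value will not free space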

Low number of free entries

Some of the drives have fewer than 140k free entries (an alert for an overloaded system). This may be observed in the output of storpool disk list or storpool server <serverID> disk list; here is an example from the latter:

# storpool server 23 disk list
  disk  |  server  |    size    |    used    |  est.free  |      %  |  free entries  |  on-disk size  |  allocated objects |  errors |  flags
  2301  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |       3719930  |       660 KiB  |       17 / 930000  |   0 / 0 |
  2302  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |       3719929  |       668 KiB  |       17 / 930000  |   0 / 0 |
  2303  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |       3719929  |       668 KiB  |       17 / 930000  |   0 / 0 |
  2304  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |       3719931  |       668 KiB  |       17 / 930000  |   0 / 0 |
  2306  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |       3719932  |       664 KiB  |       17 / 930000  |   0 / 0 |
  2305  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |       3719930  |       660 KiB  |       17 / 930000  |   0 / 0 |
  2307  |    23.0  |   893 GiB  |   2.6 GiB  |   857 GiB  |    0 %  |         19934  |       664 KiB  |       17 / 930000  |   0 / 0 |
--------------------------------------------------------------------------------------------------------------------------------------
     7  |     1.0  |   6.1 TiB  |    18 GiB  |   5.9 TiB  |    0 %  |      26039515  |       4.5 MiB  |      119 / 6510000 |   0 / 0 |

This usually happens after the system has been loaded for a long period of time with a sustained write workload on one or more volumes. If this is unexpected and the reason is an erratic workload, the recommended way to handle it is to set a limit (bandwidth, IOPS, or both) on the loaded volumes; for example, use a command like storpool volume <volumename> bw 100M iops 1000. The same limits could be set for multiple volumes/snapshots in a template with storpool template <templatename> bw 100M iops 1000 propagate. Note that propagating changes for templates with a very large number of volumes and snapshots might not work.
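
For example, with a hypothetical volume and template name:

# storpool volume vm-disk-42 bw 100M iops 1000               # hypothetical volume name
# storpool template hybrid-r3 bw 100M iops 1000 propagate    # hypothetical template name
# storpool server 23 disk list                               # the free entries should start climbing back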

If the overloaded state is due to a normally occurring workload, it is best to expand the system with more drives and/or reformat the drives with a larger number of entries (this relates mainly to HDDs). The latter case is usually caused by a low number of hard drives in an HDD-only or a hybrid pool, and rarely by overloaded SSDs.

Another case related to overloaded drives is when many volumes are created from the same template. This requires overrides to shuffle the objects where the journals reside; the goal is to avoid overloading the same triplet of disks when all virtual machines spike for some reason (for example, unattended upgrades, a syslog-intensive cron job, and so on).

More information

Apart from the notes on replication above, none of the degraded-state cases listed here should affect the stability of the system. For example, with a single failed SSD in a hybrid system, all read requests on triple-replicated volumes that have data on the failed drive will be served by one of the redundant copies on the HDDs. This could slightly increase the read latencies for the operations on the parts of the volume that were on this exact SSD. This is usually negligible in medium to large systems; for example, in a cluster with 20 SSDs or NVMe drives these are 1/20th of all the read operations in the cluster. In the case of dual replicas on SSDs and a third replica on HDDs there is no read latency penalty whatsoever, which is also the case for missing hard drives - they will not affect the system at all, and in fact some write operations are even faster, because they are not waiting for the missing drive.