Critical state troubleshooting

The information provided here is about the critical state of a StorPool cluster: what to expect, and what the recommended steps are. It is intended as a guideline for the operations teams maintaining a StorPool system. For details on what to do when the system is in a different state, see Normal state troubleshooting and Degraded state troubleshooting.

“Critical” is an emergency state that requires immediate attention and intervention from the operations team and/or StorPool Support. Some of the conditions that could lead to this state are:

  • Partial or complete network outage.

  • Power loss for some nodes in the cluster.

  • Memory shortage leading to a service failure due to missing or incomplete cgroups configuration.

The following sections provide some examples of a critical state, and suggest troubleshooting steps.

API service failure

API not reachable

The API is not reachable on any of the configured nodes (the ones running the storpool_mgmt service). Requests to the API from any of the nodes configured to access it either stall or cannot reach a working service. This is a critical state because the status of the cluster is unknown (the cluster might even be down).

This might be caused by:

  • Misconfigured network for accessing the floating IP address.

    The address may be obtained using the storpool_showconf http command on any of the nodes with a configured storpool_mgmt service in the cluster:

    # storpool_showconf http
    SP_API_HTTP_HOST=10.3.10.78
    SP_API_HTTP_PORT=81
    
  • Failed interfaces on the hosts that have the storpool_mgmt service running.

    To find the interface where the StorPool API should be running use the following command:

    # storpool_showconf api_iface
    SP_API_IFACE=bond0.410
    

    It is recommended to have the API on a redundant interface (for example, an active-backup bond interface). Note that even without an API – provided the cluster is in quorum – there should be no impact on any running operations, but changes in the cluster, like creating/attaching/detaching/deleting volumes or snapshots, will be impossible. Running with no API in the cluster triggers the highest severity alert to StorPool Support (essentially a wake-up alert) due to the unknown state of the system.

  • The cluster is not in quorum.

    The cluster is in this state if the number of running voting storpool_beacon services is less than half the number of expected nodes plus one: (expected / 2) + 1. The configured number of expected nodes in the cluster may be checked with storpool_showconf expected; it is generally the number of server nodes (except when client nodes are configured as voting for some reason). For example, in a system with 6 servers at least 4 voting beacons should be available to bring the cluster back to a running state:

    # storpool_showconf expected
    SP_EXPECTED_NODES=6
    

    The current number of expected votes and the number of voting beacons are displayed in the output of storpool net list on the Quorum status: line; for details, see Network.
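
A quick first check on a node that is expected to serve the API could look like the following sketch (the address and interface are the example values from above, and the systemd unit name storpool_mgmt is an assumption that may differ per installation):

    # ip addr show dev bond0.410 | grep 10.3.10.78
    # storpool net list
    # systemctl status storpool_mgmt

The first command shows whether the floating API address is currently configured on this node, the second shows the network and quorum status as seen by the beacons, and the third shows whether the management service is running and, if not, why it stopped.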

Delayed API responses

API requests take more than 30-60 seconds to return; for example, with commands like storpool volume status, storpool snapshot space, storpool disk list, and so on. These API requests collect data from the running storpool_server services on each server node. Possible reasons are:

  • Network loss or delays.

  • Failing storpool_server services.

  • Failing drives or hardware (CPU, memory, controllers, etc.).

  • Overload.
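
One simple way to quantify the delay is to time the affected requests and compare several runs; a minimal example using commands already mentioned above:

    # time storpool disk list
    # time storpool volume status

If the delays are consistently in the tens of seconds, or the commands do not return at all, proceed with checking the network, the storpool_server services, and the drives as described in the following sections.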

Server service failure

Two storpool_server services or whole servers are down

Two storpool_server services or whole servers are down or not joined in the cluster, in different fault sets. This is a very risky state, because there are parts of the volumes with only one live replica; if the latest writes land on a drive returning an I/O error or broken data (detected by StorPool), this will lead to data loss.

As in the degraded state, some of the read operations for parts of the volumes will be served from HDDs in a hybrid system, which might raise read latencies. In this state it is very important to bring back the missing services/nodes as soon as possible, because a failure of any of the remaining drives in another node or fault set will bring some of the volumes into the down state, and might lead to data loss if a drive holding the latest writes returns an error.
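
To confirm which nodes are currently seen in the cluster and how many volumes are affected, the two listings referenced elsewhere in this guide can be consulted:

    # storpool net list
    # storpool volume status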

More than two storpool_server services or whole servers are down

This state results in some volumes being in the down state (storpool volume status), because some of their data is only on the missing drives. The recommended action in this case is to check for the reasons for the degraded services or missing (unresponsive) nodes and get them back up. Possible reasons are:

  • Lost network connectivity.

  • Severe packet loss/delays/loops.

  • Partial or complete power loss.

  • Hardware instabilities, overheating.

  • Kernel or other software instabilities, crashes.
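
When an affected node can still be reached, a few generic checks can help narrow down the cause; a sketch (the storage-network address is a placeholder, and the systemd unit name storpool_server is an assumption):

    # ping -c 3 <storage-network address of the affected node>
    # journalctl -k --since "1 hour ago"
    # systemctl status storpool_server

The first command checks basic reachability over the storage network, the second looks for recent kernel messages (crashes, hardware errors), and the third shows whether the server service is running and why it might have stopped.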

Client service failure

If the client service (storpool_block) is down on some of the nodes – either client-only nodes or converged hypervisors – this will stall all storage requests on the affected node until the service is back up. Possible reasons are again:

  • Lost network connectivity.

  • Severe packet loss/delays/loops.

  • Bugs in the storpool_block service or the storpool_bd kernel module.

In case of power loss or kernel crashes, any virtual machine instances that were running on the affected node can be started on other available nodes.
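
On an affected node, the state of the client service and of the kernel module can be checked as follows (a sketch; the systemd unit name storpool_block is an assumption):

    # systemctl status storpool_block
    # lsmod | grep storpool_bd

If the service is down, investigate why it failed (logs, memory pressure, cgroups configuration) before bringing it back up, and consult StorPool Support if it keeps failing.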

Network interface or switch failure

This means that the networks used by StorPool are down or are experiencing heavy packet loss or delays. In this case the quorum service will prevent a split-brain situation and will restart all services to ensure the cluster is fully connected on at least one network before it transitions again to the running state. A single-VLAN setup might alleviate such issues when different nodes have only partial network connectivity, but the cluster will still experience severe delays in case of severe packet loss.
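
The state of the storage networks as seen by each node can be checked with storpool net list (see Network); at the operating system level, the interface state and error counters can hint at packet loss. A sketch (the interface name is just the example value used earlier in this guide):

    # storpool net list
    # ip -s link show dev bond0.410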

Hard drive/SSD failures

Missing drives from two or more nodes

Drives from two or more different nodes (fault sets) in the cluster are missing; or drives from a single node/fault set are missing in systems with dual-replication pools.

In this case multiple volumes may either experience degraded performance (hybrid placement) or be in the down state when more than two replicas are missing. All operations on volumes in the down state are stalled until the redundancy is restored (i.e. at least one replica is available). The recommended steps are to immediately check for the reasons for the missing drives/services/nodes and return them to the cluster as soon as possible.
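
To see which drives are missing and which volumes are affected, the listings below can be used (the grep assumes that affected volumes are reported with the word "down" in the status column):

    # storpool disk list
    # storpool volume status | grep -i down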

Some of the drives are more than 97% full

At some point all cluster operations will stall until either some of the data in the cluster is deleted (see Deleting volumes and Deleting snapshots), or new drives or nodes are added (see Storage devices and Adding nodes). Adding drives requires the new drives to be stress-tested and the system to be rebalanced to include them, which should be carefully planned (see Rebalancing the cluster).

Note

Cleaning up snapshots that have multiple cloned volumes and a negative value for used space in the output of storpool snapshot space will not free up any space.
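
The per-drive fill level and the space used by volumes and snapshots can be monitored with the commands below (both are referenced above; the exact columns in their output may vary between releases):

    # storpool disk list
    # storpool snapshot space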

Some of the drives have fewer than 100k free entries

This is usually caused by a heavily overloaded system. In this state the latencies for some operations might become very high (measured in seconds). Possible reasons are:

  • Severely overloaded volumes for long periods of time without any configured bandwidth or IOPS limits.

    This could be checked by using iostat to look for volumes that are constantly 100% loaded with a large number of requests to the storage system. Another way to check for such volumes is to use the “Top volumes” section in the analytics to get information about the most loaded volumes, and then apply IOPS and/or bandwidth limits accordingly (see the example after this list).

  • Misbehaving (under-performing) drives, or misbehaving HBA/SAS controllers.

    The recommended way to deal with these cases is to look for such drives. A good first step is to check the output of storpool disk list internal (see Listing disks) for higher aggregation scores on some drives or sets of drives (for example, on the same server). You could also use the analytics to check for abnormal latency on some of the backend nodes, such as drives with significantly higher operation latency compared to other drives of the same type. Examples would be a failing controller causing the SATA speed to degrade to SATA 1.0 (1.5 Gb/s) instead of SATA 3.0 (6 Gb/s), or worn-out batteries on a RAID controller whose cache is used to accelerate writes to the HDDs, among others.
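
A quick way to look for constantly loaded volumes and for outlier drives is sketched below (assuming the sysstat package is installed and that the StorPool block devices appear as sp-* in the iostat output, which may differ per setup):

    # iostat -x 5 3 | grep -E 'Device|^sp-'
    # storpool disk list internal

Volumes staying near 100% utilization with a high request rate are candidates for IOPS and/or bandwidth limits; drives with aggregation scores much higher than their peers are candidates for further hardware investigation.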

The circumstances leading a system to the critical state are rare and are usually preventable by taking measures to handle all issues at the first signs of a change from the normal to the degraded state.

In any of the above cases, if you feel that something is not as expected, a consultation with StorPool Support is the best course of action. StorPool Support receives notifications for all the cases detailed above and proactively takes action, as soon as practically possible, to help a system that is going into the degraded or critical state.