Critical state troubleshooting
The information provided here is about the critical state of a StorPool cluster: what to expect, and what the recommended steps are. This is intended to be used as a guideline for the operations teams maintaining a StorPool system. For details on what to do when the system is in a different state, see Normal state troubleshooting and Degraded state troubleshooting.
“Critical” is an emergency state that requires immediate attention and intervention from the operations team and/or StorPool Support. Some of the conditions that could lead to this state are:
Partial or complete network outage.
Power loss for some nodes in the cluster.
Memory shortage leading to a service failure due to missing or incomplete cgroups configuration.
The following sections provide some examples of a critical state, and suggest troubleshooting steps.
API service failure
API not reachable
The API is not reachable on any of the configured nodes (the ones running the storpool_mgmt service). Requests to the API from any of the nodes configured to access it either stall or cannot reach a working service. This is a critical state because the status of the cluster is unknown (it might even be down).
This might be caused by:
Misconfigured network for accessing the floating IP address.
The address may be obtained using the storpool_showconf http command on any of the nodes with a configured storpool_mgmt service in the cluster (see the example check after this list):

# storpool_showconf http
SP_API_HTTP_HOST=10.3.10.78
SP_API_HTTP_PORT=81
Failed interfaces on the hosts that have the storpool_mgmt service running. To find the interface where the StorPool API should be running, use the following command:
# storpool_showconf api_iface
SP_API_IFACE=bond0.410
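For example, a quick sanity check is to verify that the floating address is configured and reachable from the nodes that use the API; the interface and address below are the ones from the sample outputs above, so substitute your own values:

# ip -br addr show bond0.410
# ping -c 3 10.3.10.78

If the address is not present on any of the nodes running the storpool_mgmt service, or is not reachable from the hosts configured to use the API, the most likely cause is the network configuration for the floating IP address.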
It is recommended to have the API on a redundant interface (for example, an active-backup bond interface). Note that even without an API – provided the cluster is in quorum – there should be no impact on any running operations, but changes in the cluster like creating/attaching/detaching/deleting volumes or snapshots will be impossible. Running with no API in the cluster triggers the highest severity alert to StorPool Support (essentially a wake-up alert) due to the unknown state of the system.
The cluster is not in quorum
The cluster is in this state if the number of running voting storpool_beacon services is less than half of the expected nodes plus one: (expected / 2) + 1. The configured number of expected nodes in the cluster may be checked with storpool_showconf expected, and is generally the number of server nodes (except when client nodes are configured as voting for some reason). For example, in a system with 6 servers at least 4 voting beacons should be available to get the cluster back to a running state:

# storpool_showconf expected
SP_EXPECTED_NODES=6
The current number of expected votes and the number of voting beacons are displayed in the output of storpool net list on the Quorum status: line; for details, see Network.
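As a quick check (assuming the standard CLI output described above), both values can be obtained with:

# storpool_showconf expected
# storpool net list | grep 'Quorum status'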
Delayed API responses
API requests take more than 30-60 seconds to return; for example, commands like storpool volume status, storpool snapshot space, storpool disk list, and so on.
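A simple way to confirm the delay is to time one of these requests; storpool disk list below is just one example of an API-backed command:

# time storpool disk list

Completion times consistently above 30-60 seconds point to one of the causes listed below.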
These API requests collect data from the running storpool_server services on
each server node. Possible reasons are:
Network loss or delays.
Failing storpool_server services.
Failing drives or hardware (CPU, memory, controllers, etc.).
Overload.
Server service failure
Two storpool_server services or whole servers are down
Two storpool_server services or whole servers in different fault sets are down or not joined in the cluster. This is a very risky state, because there are parts of the volumes with only one live replica; if the latest writes land on a drive returning an I/O error or broken data (detected by StorPool), this will lead to data loss.
As in the degraded state, some of the read operations for parts of the volumes will be served from HDDs in a hybrid system and might increase read latencies. In this state it is very important to bring back the missing services/nodes as soon as possible, because a failure of any of the remaining drives in other nodes or another fault set will bring some of the volumes into down state and might lead to data loss in case of an error returned by a drive with the latest writes.
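To see exactly which services are down or not joined, the cluster-wide service listing can be consulted (a sketch assuming the standard storpool CLI):

# storpool service list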
More than two storpool_server services or whole servers are down
This state results in some volumes being in down state (storpool volume status), because some of their parts are only on the missing drives.
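For example, the affected volumes can be listed by filtering the status output (a sketch; the exact wording of the status column may differ between releases):

# storpool volume status | grep -w down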
Recommended action in this case: check for the reasons for the degraded services or missing (unresponsive) nodes and get them back up.
Possible reasons are:
Lost network connectivity.
Severe packet loss/delays/loops.
Partial or complete power loss.
Hardware instabilities, overheating.
Kernel or other software instabilities, crashes.
Client service failure
If the client service (storpool_block) is down on some of the nodes – either client-only or converged hypervisors – this will stall all requests on that particular node until the service is back up. Possible reasons are again:
Lost network connectivity
Severe packet loss/delays/loops
Bugs in the storpool_block service or the storpool_bd kernel module.
In case of power loss or kernel crashes any virtual machine instances that were running on this node could be started on other available nodes.
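A first check on such a node is whether the block service is running at all; the sketch below assumes a systemd-managed installation with the usual unit name, and uses the cluster-wide service listing as a second view:

# systemctl status storpool_block.service
# storpool service list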
Network interface or switch failure
This means that the networks used for StorPool are down or are experiencing heavy packet loss or delays. In this case the quorum service will prevent a split-brain situation and will restart all services to ensure the cluster is fully connected on at least one network before it transitions back to the running state. Such issues might be alleviated by a single-VLAN setup when different nodes have partial network connectivity, but severe packet loss will still cause severe delays.
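The link state of the interfaces on a node can be checked with standard tools, together with the cluster-wide network view (a sketch; use the interfaces configured for the StorPool storage network):

# ip -br link show
# storpool net list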
Hard drive/SSD failures
Missing drives from two or more nodes
Drives from two or more different nodes (fault sets) in the cluster are missing; or from a single node/fault set for systems with dual replication pools.
In this case multiple volumes may either experience degraded performance (hybrid placement) or be in down state when more than two replicas are missing.
All operations on volumes in down state are stalled until the redundancy is restored (i.e. at least one replica is available).
The recommended steps are to immediately check for the reasons for the missing drives/services/nodes and return them to the cluster as soon as possible.
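A quick overview of which drives and servers are currently affected can be obtained from the disk and service listings (a sketch; how missing drives are displayed may vary between releases):

# storpool disk list
# storpool service list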
Some of the drives are more than 97% full
At some point all cluster operations will stall until either some of the data in the cluster is deleted (see Deleting volumes and Deleting snapshots), or new drives or nodes are added (see Storage devices and Adding nodes). Adding drives requires the new drives to be stress tested and a re-balancing of the system to include them, which should be carefully planned (see Rebalancing the cluster).
Note
Cleaning up snapshots that have multiple cloned volumes and a negative value for used space in the output of storpool snapshot space will not free up any space.
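How close the cluster is to this limit can be checked per drive and per template (a sketch assuming the standard CLI; look at the used and free space columns):

# storpool disk list
# storpool template status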
Some of the drives have fewer than 100k free entries
This is usually caused by a heavily overloaded system. In this state the latencies for some operations might become very high (measured in seconds). Possible reasons are:
Severely overloaded volumes for long periods of time without any configured bandwidth or iops limits.
This could be checked by using iostat to look for volumes that are constantly 100% loaded with a large number of requests to the storage system, as in the example command at the end of this section. Another way to check for such volumes is to use the “Top volumes” view in the analytics to get info about the most loaded volumes, and apply IOPS and/or bandwidth limits accordingly.
Misbehaving (under-performing) drives, or misbehaving HBA/SAS controllers.
The recommended way to deal with these cases is to investigate for such drives. A good idea is to check the output from storpool disk list internal (see Listing disks) for higher aggregation scores on some drives or sets of drives (for example, on the same server). You could also use the analytics to check for abnormal latency on some of the backend nodes, like drives with significantly higher operation latency compared to other drives of the same type. Examples would be a failing controller causing the SATA speed to degrade to SATA 1.0 (1.5Gb/s) instead of SATA 3.0 (6Gb/s), worn-out batteries on a RAID controller when its cache is used to accelerate the writes on the HDDs, and others.
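For the overload case above, a hedged example of spotting constantly saturated volumes on a client node with iostat; the sp-* device name pattern for attached StorPool volumes is an assumption here, so adjust the filter to your setup:

# iostat -x 5 3 | grep -E '^Device|^sp-'

Devices that stay close to 100% in the %util column for long periods are candidates for IOPS or bandwidth limits.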
The circumstances leading a system to the critical state are rare, and are usually preventable by taking measures to handle all issues at the first signs of a change from the normal to the degraded state.
In any of the above cases, if you feel that something is not as expected, a consultation with StorPool Support is the best course of action. StorPool Support receives notifications for all the cases described above and proactively takes action to alleviate a system going into degraded or critical state as soon as practically possible.