Normal state troubleshooting
The information provided here describes the normal state of a StorPool cluster - what should be expected and what the recommended steps are. It is intended as a guideline for the operations teams maintaining a StorPool system. For details on what to do when the system is in a different state, see Degraded state troubleshooting and Critical state troubleshooting.
The StorPool storage system behaves normally when it is fully configured and in an up-and-running state. This is the desired state of the system; its characteristics are outlined in the sections below.
All nodes in the storage cluster are up and running
This can be checked by using the CLI with the storpool service list command (see Services) on any node with access to the API service.
Note
This command provides the status of all services running cluster-wide, not only of the services running on the node itself.
All configured StorPool services are up and running
This is again easily checked with the storpool service list command.
Recently restarted services can usually be spotted by their short uptime.
A recently restarted service should be taken seriously if the reason for the restart is unknown, even if the service is running at the moment, as in the example with client ID 37 below:
# storpool service list
cluster running, mgmt on node 2
mgmt 1 running on node 1 ver 20.00.18, started 2022-09-08 19:28:37, uptime 144 days 22:45:51
mgmt 2 running on node 2 ver 20.00.18, started 2022-09-08 19:27:18, uptime 144 days 22:47:10 active
server 1 running on node 1 ver 20.00.18, started 2022-09-08 19:28:59, uptime 144 days 22:45:29
server 2 running on node 2 ver 20.00.18, started 2022-09-08 19:25:53, uptime 144 days 22:48:35
server 3 running on node 3 ver 20.00.18, started 2022-09-08 19:23:30, uptime 144 days 22:50:58
client 1 running on node 1 ver 20.00.18, started 2022-09-08 19:28:37, uptime 144 days 22:45:51
client 2 running on node 2 ver 20.00.18, started 2022-09-08 19:25:32, uptime 144 days 22:48:56
client 3 running on node 3 ver 20.00.18, started 2022-09-08 19:23:09, uptime 144 days 22:51:19
client 21 running on node 21 ver 20.00.18, started 2022-09-08 19:20:26, uptime 144 days 22:54:02
client 22 running on node 22 ver 20.00.18, started 2022-09-08 19:19:26, uptime 144 days 22:55:02
client 37 running on node 37 ver 20.00.18, started 2022-09-08 13:08:12, uptime 05:06:16
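For a quick scan on larger clusters, the uptime column can be filtered for recently restarted services. A minimal sketch, assuming the uptime format shown above (services up for more than a day include the word "days"):

# show only services whose uptime is below one day
storpool service list | grep -Ev ' days? ' | grep -v '^cluster'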
Working cgroup memory and cpuset isolation is properly configured
Use the storpool_cg tool with the check argument to ensure everything is as expected.
The tool should not return any warnings; for more information, see storpool_cg check.
When properly configured, the sum of all memory limits on the node is lower than the node's available memory.
This protects the running kernel, as well as all processes in the storpool.slice memory cgroup, from memory shortage, which ensures the stability of the storage service.
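The sum of the limits can also be verified by hand. A minimal sketch, assuming a cgroup v1 layout with the memory controller mounted under /sys/fs/cgroup/memory and the usual slice names (adjust the paths and the slice list to the actual setup):

# compare the per-slice memory limits against the total memory on the node
grep MemTotal /proc/meminfo
for slice in storpool.slice machine.slice system.slice user.slice; do
    limit=/sys/fs/cgroup/memory/$slice/memory.limit_in_bytes
    # a slice without a configured limit reports a very large number here
    [ -r "$limit" ] && echo "$slice: $(( $(cat "$limit") / 1024 )) kB"
done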
All network interfaces are properly configured
All network interfaces used by StorPool are up and properly configured with hardware acceleration enabled (where applicable); all network switches are configured with jumbo frames and flow control, and none of them experience any packet loss or delays.
The output from the storpool net list command (see Network) is a good start; all configured network interfaces should be displayed as up, with the flags explained at the end of the output.
The desired state is uU with a + at the end for each network interface; if hardware acceleration is supported on an interface the A flag should also be present:
# storpool net list
------------------------------------------------------------
| nodeId | flags | net 1 | net 2 |
------------------------------------------------------------
| 23 | uU + AJ | 1A:00:01:00:00:17 | 1E:00:01:00:00:17 |
| 24 | uU + AJ | 1A:00:01:00:00:18 | 1E:00:01:00:00:18 |
| 25 | uU + AJ | 1A:00:01:00:00:19 | 1E:00:01:00:00:19 |
| 26 | uU + AJ | 12:00:01:00:F0:1A | 16:00:01:00:F0:1A |
| 27 | uU + AJ | 26:8A:07:43:EE:BC | 26:8A:07:43:EE:BD |
------------------------------------------------------------
Quorum status: 5 voting beacons up out of 5 expected
Flags:
u - packets recently received from this node
d - no packets recently received from this node
U - this node has enough votes to be in the quorum
D - this node does not have enough votes to be in the quorum
+ - this node considers itself in the quorum
B - the connection to this node is through a backup link; check the cabling
A - this node is using hardware acceleration
J - the node uses jumbo frames
N - a non-voting node
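For routine checks, the flags column can be scanned automatically. A minimal sketch, assuming the tabular layout shown above (node ID in the second column, flags in the third):

# report nodes whose flags lack the expected uU and + markers
storpool net list | awk -F'|' '$2 ~ /^ *[0-9]+ *$/ && $3 !~ /uU.*\+/ {print "check node" $2 "flags:" $3}'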
All drives are up and running
All drives in use for the storage system are performing at their specified speed, and are joined in the cluster and serving requests.
This can be checked with the storpool disk list internal command (see Listing disks).
For example, in a normally loaded cluster all drives will report low aggregate scores.
Below is an example output (trimmed for brevity):
# storpool disk list internal
--------------------------------------------------------------------------------------------------------------------------------------------------------
| disk | server | aggregate scores | wbc pages | scrub bw | scrub ETA | last scrub completed |
--------------------------------------------------------------------------------------------------------------------------------------------------------
| 2301 | 23.0 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:33:44 |
| 2302 | 23.0 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:48 |
| 2303 | 23.1 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:49 |
| 2304 | 23.1 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:50 |
| 2305 | 23.2 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:51 |
| 2306 | 23.2 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:51 |
| 2307 | 23.3 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:52 |
| 2308 | 23.3 | 0 | 0 | 0 | - + - / - | - | - | 2022-09-08 15:28:53 |
| 2311 | 23.0 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:38 |
| 2312 | 23.0 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:43 |
| 2313 | 23.1 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:44 |
| 2314 | 23.1 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:45 |
| 2315 | 23.2 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:47 |
| 2316 | 23.2 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:39 |
| 2317 | 23.3 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:40 |
| 2318 | 23.3 | 0 | 0 | 0 | 0 + 0 / 2560 | - | - | 2022-09-08 15:28:42 |
[snip]
All drives are regularly scrubbed, so they should show a stable (not increasing) number of errors.
The number of errors corrected for each drive is visible in the output of storpool disk list.
The time of the last completed scrub is visible in the output of storpool disk list internal, as shown in the example above.
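Scrubbing recency can be checked from the same output. A minimal sketch, assuming the column layout of storpool disk list internal shown above (disk ID in the second column, last scrub time in the last populated column), GNU date, and an arbitrary 30-day threshold:

# list disks whose last completed scrub is older than 30 days
cutoff=$(date -d '30 days ago' '+%Y-%m-%d')
storpool disk list internal | awk -F'|' -v c="$cutoff" \
    '$2 ~ /^ *[0-9]+ *$/ { s=$(NF-1); gsub(/^ +| +$/, "", s); if (s < c) print "disk" $2 ": last scrub " s }'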
Note that some systems may have fewer than two network interfaces, or a single backend switch. Even though this is not recommended, it is still possible, and it is sometimes used (usually in a PoC or with a backup server) when the cluster is configured with a single-VLAN network redundancy scheme. A single-VLAN network redundancy configuration and an inter-switch connection are required for a cluster where only some of the nodes are connected with a single interface.
If one or more of the points above does not hold, the system should not be considered healthy. If there is any suspicion that the system is behaving erratically even though all of the above conditions are satisfied, the recommended steps to check that everything is in order are:
Check top and look for the state of each of the configured storpool_* services running on the present node. A properly running service is usually in the S (sleeping) state and is only rarely seen in the R (running) state. The CPU usage is often reported as 100% when hardware sleep is enabled, due to the kernel misreporting it; the actual usage is much lower and can be tracked with cpupower monitor for the CPU cores. (A ps-based sketch for this check is shown after these steps.)

To ensure all services on this node are running correctly, use the /usr/lib/storpool/sdump tool. It reports CPU and network usage statistics for the running services on the node. Use the -l option for the long names of the statistics.

On nodes with running workloads (such as VM instances or containers), iostat will show activity for processed requests on the block devices. The following example shows normal disk activity on a node running VM instances; note that the usage may vary greatly depending on the workload. The command used in the example is iostat -xm 1 /dev/sp-* | egrep -v " 0[.,]00$", which prints statistics for the StorPool devices every second, excluding devices with no storage I/O activity:

Device:  rrqm/s wrqm/s    r/s    w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sp-0       0.00   0.00   0.00 279.00  0.00  0.14     1.00     3.87 13.80    0.00   13.80  3.55 98.99
sp-11      0.00   0.00 165.60 114.10 19.29 14.26   245.66     5.97 20.87    9.81   36.91  0.89 24.78
sp-12      0.00   0.00 171.60 153.60 19.33 19.20   242.67     9.20 28.17   10.46   47.96  1.08 35.01
sp-13      0.00   0.00   6.00  40.80  0.04  5.10   225.12     1.75 37.32    0.27   42.77  1.06  4.98
sp-21      0.00   0.00   0.00  82.20  0.00  1.04    25.90     1.00 12.08    0.00   12.08 12.16 99.99
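For the first step, the process states can also be collected without an interactive top session. A minimal sketch, assuming the service processes are named with a storpool_ prefix:

# show PID, state, CPU usage, and name of the StorPool processes on this node
ps -eo pid,stat,pcpu,comm | awk 'NR == 1 || $4 ~ /^storpool_/'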
There are no hanging active requests
The output of /usr/lib/storpool/latthreshold.py is empty; it shows no hanging requests and no service or disk warnings.
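This check lends itself to simple automation, since it only relies on the output being empty in the normal state. A minimal sketch:

# warn if latthreshold.py reports anything at all
out=$(/usr/lib/storpool/latthreshold.py)
if [ -n "$out" ]; then
    echo "WARNING: hanging requests or service/disk warnings detected:"
    echo "$out"
fi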