Hanging requests in the cluster

The output of /usr/lib/storpool/latthreshold.py shows hanging requests and/or missing services. Here is an example:

disk | reported by | peers                      |  s |   op  |      volume |                              requestId
-------------------------------------------------------------------------------------------------------------------
-    | client 2    | client 2    -> server 2.1  | 15 | read  | volume-name | 9223936579889289248:270215977642998472
-    | client 2    | client 2    -> server 2.1  | 15 | read  | volume-name | 9223936579889289248:270497452619709333
-    | client 2    | client 2    -> server 2.1  | 15 | read  | volume-name | 9223936579889289248:270778927596419790
-    | client 2    | client 2    -> server 2.1  | 15 | write | volume-name | 9223936579889289248:271060402573130531
-    | client 2    | client 2    -> server 2.1  | 15 | read  | volume-name | 9223936579889289248:271341877549841211
-    | client 2    | client 2    -> server 2.1  | 15 | read  | volume-name | 9223936579889289248:271623352526551744
-    | client 2    | client 2    -> server 2.1  | 15 | write | volume-name | 9223936579889289248:271904827503262450
server 2.1  connection status: established no_data timeout
disk 202 EXPECTED_UNKNOWN server 2.1

This could be caused by CPU starvation, hardware resets, misbehaving disks or network, or stalled services. The disk field in the output and the service warnings printed after the requests table can be used as indicators of the misbehaving component.
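To see which peer the hanging requests pile up on, the text report can be summarized with a short script. The following is a minimal sketch, not part of the tool: it assumes the report has the same layout as the example above (a piped header row, a dashed separator, piped request rows, then free-form warning lines), and the function names are hypothetical.

```python
import re
from collections import Counter

# Shortened copy of the example report shown above.
SAMPLE = """\
disk | reported by | peers                      |  s |   op  |      volume |                              requestId
-------------------------------------------------------------------------------------------------------------------
-    | client 2    | client 2    -> server 2.1  | 15 | read  | volume-name | 9223936579889289248:270215977642998472
-    | client 2    | client 2    -> server 2.1  | 15 | write | volume-name | 9223936579889289248:271060402573130531
server 2.1  connection status: established no_data timeout
disk 202 EXPECTED_UNKNOWN server 2.1
"""


def parse_report(text):
    """Split a report into request rows (dicts) and warning lines (strings)."""
    requests, warnings = [], []
    header = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or set(stripped) == {"-"}:
            continue  # skip blank lines and the dashed separator
        if "|" in line:
            cells = [cell.strip() for cell in line.split("|")]
            if header is None:
                header = cells  # the first piped line holds the column names
            else:
                requests.append(dict(zip(header, cells)))
        else:
            warnings.append(" ".join(line.split()))  # collapse repeated spaces
    return requests, warnings


def hanging_by_target(requests):
    """Count hanging requests per destination named in the 'peers' column."""
    return Counter(re.split(r"\s*->\s*", row["peers"])[-1] for row in requests)


requests, warnings = parse_report(SAMPLE)
print(hanging_by_target(requests))
print(warnings)
```

A destination with a high count that also appears in the warning lines is a reasonable first place to look.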

Note that the active requests API call has a timeout for each service to respond. The latthreshold tool uses a default timeout of 10 seconds. This value can be changed with the tool’s --api-requests-timeout/-A option, which takes a numeric value with a time unit (m, s, ms or us), for example 100ms.
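When the tool is started from a script, the timeout value can be checked before it is passed on. The sketch below is an illustration only: the helper names are hypothetical, it assumes the value is a whole number followed directly by one of the listed units, and it assumes the tool needs no other arguments. It builds the argument list and does not run the tool.

```python
import re

TOOL = "/usr/lib/storpool/latthreshold.py"

# Units listed in the text above, expressed in microseconds.
_MICROSECONDS_PER_UNIT = {"m": 60_000_000, "s": 1_000_000, "ms": 1_000, "us": 1}

# Assumption: a whole number immediately followed by the unit, e.g. "100ms".
_DURATION = re.compile(r"^(\d+)(ms|us|m|s)$")


def timeout_microseconds(value):
    """Convert a value such as '100ms' to microseconds, or raise ValueError."""
    match = _DURATION.match(value)
    if match is None:
        raise ValueError(f"expected a number with a unit (m, s, ms, us): {value!r}")
    number, unit = match.groups()
    return int(number) * _MICROSECONDS_PER_UNIT[unit]


def build_command(timeout):
    """Return the argument list for the tool with a validated timeout value."""
    timeout_microseconds(timeout)  # validation only; the tool gets the raw string
    return [TOOL, "--api-requests-timeout", timeout]


print(build_command("100ms"))
print(timeout_microseconds("10s"))
```

The returned list can be handed to subprocess.run() on a node where the tool is installed.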

Each service connection has one of the following statuses:

Status                      | Description
--------------------------------------------------------------------------------------------------------------------
established done            | Service reported its active requests as expected; this is not displayed in the regular output, only with --json.
not_established             | Did not make a connection with the service - this could indicate the server is down, but may also indicate the service version is too old, or its stream was overfilled or not connected.
established no_data timeout | Service did not respond and the connection was closed because the timeout was reached.
established data timeout    | Service responded but the connection was closed because the timeout was reached before it could send all the data.
established invalid_data    | A message the service sent had invalid data in it.
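A connection status line such as the one in the example output can be split into the service name and its status, and the status looked up in a table. This is a minimal sketch under the assumption that the line always has the form "<service>  connection status: <status>", as in the example; the function name and the shortened descriptions are not part of the tool.

```python
import re

# Short paraphrases of the status descriptions listed above.
CONNECTION_STATUSES = {
    "established done": "service reported its active requests as expected",
    "not_established": "no connection was made to the service",
    "established no_data timeout": "no response before the timeout was reached",
    "established data timeout": "incomplete response before the timeout was reached",
    "established invalid_data": "service sent a message with invalid data",
}

_STATUS_LINE = re.compile(
    r"^(?P<service>.+?)\s+connection status:\s+(?P<status>.+?)\s*$"
)


def parse_connection_line(line):
    """Return (service, status, meaning), or None if the line does not match."""
    match = _STATUS_LINE.match(line)
    if match is None:
        return None
    status = " ".join(match.group("status").split())  # collapse repeated spaces
    meaning = CONNECTION_STATUSES.get(status, "unrecognized status")
    return match.group("service"), status, meaning


print(parse_connection_line(
    "server 2.1  connection status: established no_data timeout"
))
```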

The latthreshold tool also reports disk statuses. Reported disk statuses will be one of the following:

Status                         | Description
---------------------------------------------------------------------------------------------------------------
EXPECTED_MISSING               | The service response was good, but did not provide information about the disk.
EXPECTED_NO_CONNECTION_TO_PEER | The connection to the service was not established.
EXPECTED_NO_PEER               | The service is not present.
EXPECTED_UNKNOWN               | The service response was invalid or a timeout occurred.
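Read together, the descriptions suggest that a disk status follows from the connection status of the service that owns the disk. The sketch below checks a disk line against that pairing. The pairing is an inference from the descriptions above, not confirmed tool behavior. The sketch also assumes numeric disk IDs and the line layout "disk <id> <status> <service>" seen in the example output. EXPECTED_NO_PEER is left out because no connection status corresponds to a service that is not present.

```python
import re

# Assumed pairing, inferred from the two sets of descriptions above.
EXPECTED_DISK_STATUS = {
    "established done": "EXPECTED_MISSING",
    "not_established": "EXPECTED_NO_CONNECTION_TO_PEER",
    "established no_data timeout": "EXPECTED_UNKNOWN",
    "established data timeout": "EXPECTED_UNKNOWN",
    "established invalid_data": "EXPECTED_UNKNOWN",
}

_DISK_LINE = re.compile(
    r"^disk\s+(?P<disk>\d+)\s+(?P<status>EXPECTED_[A-Z_]+)\s+(?P<service>.+?)\s*$"
)


def parse_disk_line(line):
    """Return a dict with 'disk', 'status' and 'service', or None if no match."""
    match = _DISK_LINE.match(line)
    return match.groupdict() if match else None


def matches_connection(disk_line, connection_status):
    """True if the disk status is the one assumed for the connection status."""
    parsed = parse_disk_line(disk_line)
    if parsed is None:
        return False
    return EXPECTED_DISK_STATUS.get(connection_status) == parsed["status"]


line = "disk 202 EXPECTED_UNKNOWN server 2.1"
print(parse_disk_line(line))
print(matches_connection(line, "established no_data timeout"))
```

In the example output, disk 202 is reported as EXPECTED_UNKNOWN while server 2.1 timed out without sending data, which agrees with the assumed pairing.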