Hanging requests in the cluster
The output of /usr/lib/storpool/latthreshold.py shows hanging requests and/or missing services.
Here is an example:
disk | reported by | peers | s | op | volume | requestId
-------------------------------------------------------------------------------------------------------------------
- | client 2 | client 2 -> server 2.1 | 15 | read | volume-name | 9223936579889289248:270215977642998472
- | client 2 | client 2 -> server 2.1 | 15 | read | volume-name | 9223936579889289248:270497452619709333
- | client 2 | client 2 -> server 2.1 | 15 | read | volume-name | 9223936579889289248:270778927596419790
- | client 2 | client 2 -> server 2.1 | 15 | write | volume-name | 9223936579889289248:271060402573130531
- | client 2 | client 2 -> server 2.1 | 15 | read | volume-name | 9223936579889289248:271341877549841211
- | client 2 | client 2 -> server 2.1 | 15 | read | volume-name | 9223936579889289248:271623352526551744
- | client 2 | client 2 -> server 2.1 | 15 | write | volume-name | 9223936579889289248:271904827503262450
server 2.1 connection status: established no_data timeout
disk 202 EXPECTED_UNKNOWN server 2.1
This could be caused by CPU starvation, hardware resets, misbehaving disks or network, or stalled services.
The disk field in the output and the service warnings after the requests table can help identify the misbehaving component.
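To see at a glance which peer pair accounts for the hanging requests, the request rows can be grouped by their peers and op fields. The following is a minimal sketch, not part of the latthreshold tool: the helper name `count_hanging_requests` is hypothetical, and it assumes the pipe-separated, seven-column layout shown in the example output above.

```python
from collections import Counter


def count_hanging_requests(output: str) -> Counter:
    """Count request rows per (peers, op) pair.

    Hypothetical helper: assumes the seven-column, pipe-separated
    layout shown in the example output.
    """
    counts = Counter()
    for line in output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # Skip the header, the separator, and the warning lines after the table.
        if len(fields) != 7 or fields[0] == "disk":
            continue
        _disk, _reporter, peers, _seconds, op, _volume, _request_id = fields
        counts[(peers, op)] += 1
    return counts


sample = """\
disk | reported by | peers | s | op | volume | requestId
--------------------------------------------------------
- | client 2 | client 2 -> server 2.1 | 15 | read | volume-name | 9223936579889289248:270215977642998472
- | client 2 | client 2 -> server 2.1 | 15 | write | volume-name | 9223936579889289248:271060402573130531
- | client 2 | client 2 -> server 2.1 | 15 | read | volume-name | 9223936579889289248:270497452619709333
server 2.1 connection status: established no_data timeout
disk 202 EXPECTED_UNKNOWN server 2.1
"""

hanging = count_hanging_requests(sample)
for (peers, op), count in hanging.items():
    print(f"{peers}: {count} hanging {op} request(s)")
```

If most rows point to the same server, that server and its disks are a reasonable place to start investigating.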
Note that the active requests API call has a timeout for each service to respond.
The default timeout that the latthreshold tool uses is 10 seconds.
This value can be changed with the tool's --api-requests-timeout/-A option, which takes a numeric value with a time unit (m, s, ms, or us); for example, 100ms.
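As an illustration of the value format, the sketch below converts such a string into seconds. It is not the tool's own parser: the function name `parse_timeout` is hypothetical, and accepting fractional numbers such as 1.5s is an assumption that the text above does not confirm.

```python
import re

# Units named in the text, expressed in seconds.
_UNIT_SECONDS = {"m": 60.0, "s": 1.0, "ms": 1e-3, "us": 1e-6}


def parse_timeout(value: str) -> float:
    """Convert a string such as '100ms' into seconds.

    Hypothetical helper; support for fractional numbers is an assumption.
    """
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(m|s|ms|us)", value.strip())
    if match is None:
        raise ValueError(f"expected a number followed by m, s, ms or us: {value!r}")
    number, unit = match.groups()
    return float(number) * _UNIT_SECONDS[unit]


for text in ("10s", "100ms", "2m"):
    print(text, "->", parse_timeout(text), "seconds")
```

A value without a unit, such as a bare 100, is rejected by this sketch because the text describes the unit as part of the value.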
Service connection will have one of the following statuses:
| Status | Description |
|---|---|
| | Service reported its active requests as expected; this is not displayed in the regular output, only with |
| | No connection was made with the service. This could indicate that the server is down, but may also indicate that the service version is too old, or that its stream was overfilled or not connected. |
| | Service did not respond and the connection was closed because the timeout was reached. |
| | Service responded but the connection was closed because the timeout was reached before it could send all the data. |
| | A message the service sent had invalid data in it. |
The latthreshold tool also reports disk statuses.
Reported disk statuses will be one of the following:
| Status | Description |
|---|---|
| EXPECTED_MISSING | The service response was good, but did not provide information about the disk. |
| EXPECTED_NO_CONNECTION_TO_PEER | The connection to the service was not established. |
| EXPECTED_NO_PEER | The service is not present. |
| EXPECTED_UNKNOWN | The service response was invalid or a timeout occurred. |
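When many disk warnings are reported, it can help to attach the description from the table to each warning line. The sketch below is not part of latthreshold: the helper `parse_disk_warning` is hypothetical, and it assumes every warning follows the shape of the single example line shown earlier (`disk <id> <STATUS> server <id>`), which may not hold for all statuses.

```python
from typing import Optional

# Descriptions copied from the disk status table.
DISK_STATUSES = {
    "EXPECTED_MISSING": "The service response was good, but did not provide information about the disk.",
    "EXPECTED_NO_CONNECTION_TO_PEER": "The connection to the service was not established.",
    "EXPECTED_NO_PEER": "The service is not present.",
    "EXPECTED_UNKNOWN": "The service response was invalid or a timeout occurred.",
}


def parse_disk_warning(line: str) -> Optional[dict]:
    """Split a disk warning line into its parts.

    Hypothetical helper: assumes the shape 'disk <id> <STATUS> server <id>'
    seen in the example output. Returns None for any other line.
    """
    parts = line.split()
    if len(parts) != 5 or parts[0] != "disk" or parts[3] != "server":
        return None
    status = parts[2]
    return {
        "disk": parts[1],
        "status": status,
        "service": f"server {parts[4]}",
        "meaning": DISK_STATUSES.get(status, "unknown status"),
    }


warning = parse_disk_warning("disk 202 EXPECTED_UNKNOWN server 2.1")
print(warning)
```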