Disk and journal performance tracking
The server instances keep an average latency for each drive and its journal (if one is configured).
This allows setting up millisecond-scale hard limits for each disk type (SSD, NVMe, HDD, and any drives with a journal), either globally or as a per-disk or per-journal limit for each drive in a cluster.
The collected average is based on a window of the last 128 requests to the drive. The maximum latency is the maximum average value recorded for the disk drive since the last time it came up in the cluster.
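The following is a minimal sketch of this bookkeeping, assuming a simple rolling window; the class name, the disk-type keys, and the threshold values are illustrative only and do not reflect the actual implementation or configuration defaults:

```python
from collections import deque

# Illustrative per-disk-type limits in milliseconds; the real values depend
# on the cluster and are NOT actual defaults.
THRESHOLD_MS = {"ssd": 100, "nvme": 50, "hdd": 1000, "journal": 20}

class DiskLatencyTracker:
    """Conceptual model of the per-drive latency bookkeeping: a rolling
    average over the last 128 requests and the maximum average observed
    since the drive last came up in the cluster."""

    WINDOW = 128

    def __init__(self, disk_type):
        self.disk_type = disk_type
        self.window = deque(maxlen=self.WINDOW)  # latencies of the last 128 requests, ms
        self.max_average_ms = 0.0                # reset when the drive re-enters the cluster

    def record_request(self, latency_ms):
        self.window.append(latency_ms)
        avg = sum(self.window) / len(self.window)
        self.max_average_ms = max(self.max_average_ms, avg)
        return avg

    def over_threshold(self):
        """True if the current rolling average exceeds the configured limit."""
        if not self.window:
            return False
        return sum(self.window) / len(self.window) > THRESHOLD_MS[self.disk_type]
```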
Why it is needed
The main idea behind this feature is to allow handling RAID controller issues or misbehaving disks in general, as well as tracking the performance of all drives over time.
How it works
Once a node is upgraded to a version with this feature enabled, the collected statistics become visible in the analytics platform.
Note that for a completely idle cluster the average value stays the same as it was for the last 128 requests. In practice, however, idle clusters are rare, and all production clusters run the I/O latency monitoring service, which regularly populates and updates the average values.
The statistics can also be viewed from the CLI; for details, see Disk list performance information.
After a while, there will be enough data to define a threshold for each disk type globally. This threshold usually depends on the use case and the actual workload in the cluster, the drives used, whether a drive is behind a controller with a journal on the controller’s cachevault/battery or on an NVMe device, and probably other factors as well.
For more information about configuring a global threshold, see Latency thresholds.
Note
As a safety measure, this mechanism acts on only one node in the cluster at a time. If a latency threshold event occurs on another disk or journal while there are already ejected drives, no further drives are ejected from the cluster, even if their latency is above the configured thresholds, so that redundancy never drops below two live replicas.
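A conceptual sketch of this safety check is shown below; the `drive` and `cluster` objects and their attributes are hypothetical and serve only to illustrate the rule:

```python
def may_eject(drive, cluster):
    """Conceptual safety check: a drive whose latency is above its threshold
    is ejected only if no other drives are already ejected for latency,
    so that redundancy never drops below two live replicas.
    The `drive` and `cluster` objects are hypothetical."""
    if not drive.latency_over_threshold():
        return False
    already_ejected = [d for d in cluster.drives if d.ejected_for_latency]
    if already_ejected:
        # A latency event has already taken drives out of the cluster;
        # do not eject more, even though this drive is above its threshold.
        return False
    return True
```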
Typical use cases
One example use case is a failed battery or cachevault unit in a RAID controller, in which case writes are no longer completed in the controller's RAM.
In this case, the latency of each drive behind the controller increases, because operations now complete on the HDDs themselves. When the configured threshold is reached, the server drops the journal for each disk and transparently starts completing writes in the server's RAM. This immediately raises a monitoring alert pointing to an issue with the controller's battery/cachevault or its configuration.
The default behavior of most LSI-based controllers is to flip all cache to write-through mode when some of the disks behind the controller fail. This also leads to abnormal write latency and is handled by this feature in the same way as the battery/cachevault example above.
As an additional benefit, the maximum thresholds may now be configured globally for each drive type across all drives in a cluster, lowering the impact on user operations when some drives misbehave. Without this feature, if a drive stalls, each request hitting it might take up to ten seconds before the drive is ejected. With this feature, the thresholds may be set as low as the use case and the hardware specification require.
As of release 19.4 revision 19.01.2930.57ca5627f, ejected disks are marked for testing, and the server automatically attempts to return them to the cluster in case the failure was intermittent. The server stops re-testing and returning a disk if there have been more than four failures in the past hour.
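The re-test limiting could be modelled roughly as follows; the class and its names are illustrative, not the actual server logic:

```python
import time

class RetestPolicy:
    """Conceptual sketch of the automatic re-test behaviour: an ejected drive
    is periodically re-tested and returned to the cluster, but re-testing
    stops once it has failed more than four times within the past hour."""

    MAX_FAILURES = 4
    WINDOW_SECONDS = 3600

    def __init__(self):
        self.failure_times = []  # timestamps of recent latency failures

    def record_failure(self, now=None):
        self.failure_times.append(time.time() if now is None else now)

    def should_retest(self, now=None):
        now = time.time() if now is None else now
        # Keep only failures from the past hour, then compare against the limit.
        self.failure_times = [t for t in self.failure_times
                              if now - t < self.WINDOW_SECONDS]
        return len(self.failure_times) <= self.MAX_FAILURES
```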
Lastly, tracking the performance of each disk over time provides visibility and makes it possible to spot subtle changes when comparing similar drives in the cluster. A latency-sensitive workload might benefit from tighter limits where necessary.
When a drive or a journal is ejected, the server instance keeps detailed information about the last 128 requests it handled right before the ejection.
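Conceptually, this retained history behaves like a bounded ring buffer snapshotted at ejection time; the sketch below uses hypothetical request fields for illustration:

```python
from collections import deque

class RequestHistory:
    """Illustrative ring buffer of the last 128 requests to a drive,
    snapshotted when the drive or its journal is ejected."""

    def __init__(self, size=128):
        self.requests = deque(maxlen=size)

    def record(self, op, offset, length, latency_ms):
        # Hypothetical per-request details; the real record format differs.
        self.requests.append({"op": op, "offset": offset,
                              "length": length, "latency_ms": latency_ms})

    def snapshot_on_eject(self):
        # Preserved for later inspection of what the drive was doing
        # right before it was ejected.
        return list(self.requests)
```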
History
This feature was initially added with release 19.4 revision 19.01.2877.2ee379917.
Updated in release 19.4 revision 19.01.2930.57ca5627f.
As of release 20.0 revision 20.0.93.78df908ec, there are new defaults for each disk type, which lessen the impact of a misbehaving disk by reacting earlier.
The new defaults are based on aggregated data from thousands of nodes and their drives, and are applied by the StorPool operations team in all clusters.