StorPool analytics
Overview
The metrics collected from StorPool clusters can be viewed at https://analytics.storpool.com/. This document guides customers through the information available there and some of its use cases.
Access credentials for the analytics system are provided during the initial setup of the cluster. For any issues related to those, please contact StorPool support.
The full sources of the dashboards are available on request.
Basics
The data is organized in two dimensions:
Granularity (per second, or per minute samples), and
Component/subsystem (CPU, memory, back-end, front-end, iSCSI, etc.)
The site uses the industry-standard Grafana software.
Example scenarios for use
The following scenarios show how the system can be used to track different types of issues. For more information, see Dashboards reference.
Single drive creating delays
A simple scenario is checking whether a single drive is causing problems in a cluster.
The starting point is All server stat, where the read or write latency graph for the affected server would look different from the others.
For that server, Per-host backend stat can be consulted to see whether all drives or only a group of them show different latencies. Take care not to compare SSD/NVMe drives with HDDs, as they have different performance characteristics.
If all (or a group of) drives on a node show the same level of latency, then the issue is most probably related to the node and the controllers on it. Otherwise, the drives can be compared in Per-disk backend stat with drives on other nodes, and any spikes in latency can be correlated with usage patterns on the drive to decide whether there is a problem.
The final step would be verification on the node itself, via SMART and other tools, to see if the latencies can be explained.
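The comparison described above can also be approximated offline once latency values have been read from the dashboards. The following is a minimal sketch and not part of StorPool: the function name, the input layout (drive ID mapped to drive type and mean latency in milliseconds), the sample values, and the factor of 3 are all assumptions chosen for illustration.

```python
from statistics import median


def find_slow_drives(latencies, factor=3.0):
    """Return IDs of drives slower than `factor` times the median of their type.

    `latencies` maps a drive ID to a (drive_type, latency_ms) tuple.
    The layout and the default factor are illustrative assumptions.
    """
    # Group by drive type so SSD/NVMe drives are never compared with HDDs.
    by_type = {}
    for drive_id, (drive_type, latency_ms) in latencies.items():
        by_type.setdefault(drive_type, []).append(latency_ms)

    medians = {drive_type: median(values) for drive_type, values in by_type.items()}

    return sorted(
        drive_id
        for drive_id, (drive_type, latency_ms) in latencies.items()
        if latency_ms > factor * medians[drive_type]
    )


# Hypothetical sample values, not taken from a real cluster.
sample = {
    "101": ("ssd", 0.2),
    "102": ("ssd", 0.25),
    "103": ("ssd", 1.9),
    "201": ("hdd", 8.0),
    "202": ("hdd", 9.5),
    "203": ("hdd", 8.7),
}
print(find_slow_drives(sample))
```

A drive flagged this way is only a candidate; the usage patterns on the drive still need to be checked before concluding that there is a problem.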
Balancing CPU usage on hypervisors
There is a specific dashboard for this purpose, called Total CPU non-SP stat, which tracks the usage for all CPUs that are not isolated to be used by StorPool.
The best indicator in the dashboard is the “Run queue wait per node” graph, which tracks the congestion of the node. The “wait” is the time a process was scheduled to run on a CPU, but had to wait for another process to complete. For KVM-based hypervisors, this translates directly to “steal” time inside the virtual machines. On a non-congested node, “wait” should be close to zero, so based on this information the overloaded nodes can be picked and some VMs moved away from them.
If there is no congestion and just a better balance of CPU usage is wanted, the “Run queue per node” graph can be observed. It also includes the actual time processes were running, plus the time processes waited, and can be used to decide which hosts are overloaded and when.
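Picking the overloaded nodes from the “wait” values can be sketched as follows. This is an illustration only: the function name, the assumption that “wait” is expressed as a fraction of wall-clock time per node, and the 0.05 threshold are hypothetical and should be adapted to the units the graph actually shows.

```python
def rank_congested_nodes(wait_by_node, threshold=0.05):
    """Return (node, wait) pairs above `threshold`, most congested first.

    `wait_by_node` maps a node name to its run-queue wait value.
    The unit and the default threshold are illustrative assumptions.
    """
    congested = [
        (node, wait) for node, wait in wait_by_node.items() if wait > threshold
    ]
    return sorted(congested, key=lambda item: item[1], reverse=True)


# Hypothetical values for four hypervisors.
sample = {"hv1": 0.01, "hv2": 0.22, "hv3": 0.09, "hv4": 0.0}
print(rank_congested_nodes(sample))
```

The nodes at the top of the returned list are the first candidates for moving VMs away.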
Checking for CPU starvation for StorPool services
As most StorPool deployments use the hsleep type of sleep (see 7. Why the StorPool processes seem to be at 100% CPU usage all the time?), on normal graphs/stats it looks like every StorPool process utilizes 100% of the CPU. As noted in the FAQ, this is not the case, and the actual CPU usage can be checked in the Network dashboard.
There are two relevant graphs, “Loops per second” and “Slept”. “Slept” is the amount of CPU time in milliseconds that the process spent sleeping/waiting, and “loops” is the number of iterations the process has performed. A zero or close-to-zero value for “Slept” combined with a fairly high number of “loops” means that the process has become CPU-bound. A low “Slept” value together with a low number of “loops” points to a different type of problem (for example, the CPU being throttled to a very low frequency).
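The interpretation of the two graphs can be written down as a small decision rule. This is a sketch under stated assumptions: the function name and both cut-off values are hypothetical, since what counts as “close to zero” or “high” depends on the service and the hardware.

```python
def classify_service(slept_ms, loops, slept_floor_ms=10.0, loops_floor=1000):
    """Classify one sample of the "Slept" and "Loops per second" graphs.

    Both floor values are illustrative assumptions, not StorPool defaults.
    """
    if slept_ms > slept_floor_ms:
        # The process still spends time sleeping/waiting, so it has headroom.
        return "ok"
    if loops >= loops_floor:
        # Almost no sleep while iterating a lot: the process is CPU-bound.
        return "cpu-bound"
    # Almost no sleep and few iterations: a different problem,
    # for example a CPU throttled to a very low frequency.
    return "other"


print(classify_service(slept_ms=500.0, loops=20000))
print(classify_service(slept_ms=0.0, loops=50000))
print(classify_service(slept_ms=1.0, loops=20))
```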
Dashboards reference
Home
The home dashboard contains a basic overview of the cluster and links to all other dashboards.
The first line contains “Annunciator” graphs for the basic system performance parameters - IOps, throughput, latency and queue size, for the front-end (volumes accessed via storpool_block) and back-end (for the storpool_server processes that talk to the drives). The graphs take their data from the per-second stats for the last 1 hour.
The second line contains the cluster name and some basic “inventory” parameters, like the number of attached volumes, available drives, and clients (storpool_block) currently processing operations.
Below that are the general dashboards for all the different categories.
CPU
The data in these dashboards is taken from the cpustat measurement and contains measurements for every CPU on every node of the system, once per minute or once per second.
Two parameters come from the kernel scheduler and form the “run-queue”. The parameters are run and wait, and mean “amount of time a process has been running on the CPU” and “amount of time process(es) waited to run on the CPU”. These two parameters, especially wait, describe very well how congested the CPU is.
Note that a CPU in this context is a CPU thread, not the whole chip or core. For example, if you have the following output in lscpu, it means that data would be collected for 40 CPUs:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
...
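The CPU count in this output follows from the topology fields: sockets times cores per socket times threads per core. As a quick check of the arithmetic (the variable names are only illustrative):

```python
# Values copied from the lscpu output above.
sockets = 2
cores_per_socket = 10
threads_per_core = 2

# Each hardware thread is reported as one CPU in the collected data.
logical_cpus = sockets * cores_per_socket * threads_per_core
print(logical_cpus)  # 40, matching the "CPU(s)" line
```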
CPU stat
This dashboard gives an overview of the CPU usage on a single node.
Per-CPU stat
This dashboard tracks basic parameters for a single (or multiple selected) CPU threads. The parameters tracked are the load type (system, user, irq, etc.) and the run-queue (both time spent executing and time spent waiting to execute).
Per-service CPU stat
This dashboard shows the CPU usage for CPU threads used by StorPool services.
Total CPU non-SP stat
This dashboard tracks all CPUs that are not in StorPool-related slices. This dashboard is useful to track hypervisor load and related issues.
Total CPU stat
This dashboard shows an overview of the CPU usage on all nodes.
Servers
The data in the following dashboards comes from the diskstat measurement. The data inside is the collected information from the /usr/lib/storpool/server_stat tool that provides metrics about a running storpool_server instance.
This data describes the communication between the storpool_server processes and the physical drives in the systems (hence the “back-end” term used in some of the graphs).
These dashboards are useful to track issues with load related to specific drives or increases in system-related load (like aggregations).
All server stat
This dashboard provides an overview of the general storpool_server performance metrics per node, composed in one view.
General server stat
This dashboard provides a general overview of the cluster’s back-end performance.
Per-disk backend stat
This dashboard provides an overview of the usage of a single drive by the storpool_server instances.
Per-host backend stat
This dashboard provides an overview of the usage of the storpool_server instances on a single node.
Clients
These dashboards track the operations on volumes from the clients connecting via storpool_block. Their data comes from the iostat measurement.
The main difference between this set of dashboards and the Volume ones is that this set is organized by client node, not by volume.
All client stat
This dashboard shows all client operations in the cluster, grouped by server.
General client stat
This dashboard shows a summary of all client operations in the cluster.
Per-host client stat
This dashboard shows the stats for client operations on a single host.
Memory
These dashboards track the memory usage in cgroups and on the nodes in general. Their data comes from the memstat measurement.
Cgroup memory
This dashboard shows the usage of all cgroup memory slices in the cluster.
Cgroup memory per node
This dashboard shows the usage of all cgroup memory slices on a single node.
Volume
These dashboards track the operations on specific volumes from the clients connecting via storpool_block. Their data comes from the iostat measurement.
Per-volume stat
This dashboard shows the I/O stats for a single volume.
Top volumes
This dashboard provides information on the volumes with most load, either in operations (read/write) or bandwidth.
Network
The only dashboard here tracks network and service stats for StorPool services. Its data comes from the servicestat and netstat measurements.
Network service stat
The first two rows describe per-service metrics:
Data transfers are the transfers done over the StorPool protocol
Failed transfers is the number of transfers that failed for some reason (and would be retried)
Loops is the number of processing loops a specific service has done
Slept is the amount of time a process has slept/waited for work to process
The rest of the graphs describe metrics for the network communication on the two network interfaces on every node. More information about them can be found in the netstat documentation.
System disk
Both dashboards in this section track I/O stats for all physical drives in the system, as seen by the kernel.
SP disk stat
This dashboard filters drives by their StorPool ID.
System disk stat
This dashboard shows all drives and partitions by device name.
iSCSI
These dashboards provide an overview of iSCSI traffic based on different filters. Their data comes from the iscsisession measurement.
iSCSI stats per initiator
The stats in this dashboard are grouped/filtered by initiator name and displayed for every session.
iSCSI stats per node/network
The stats in this dashboard are summed/grouped by node and interface, to give a general overview of the usage of the iSCSI network.
iSCSI stats per target
The stats in this dashboard are grouped/filtered by target name and displayed for every session.
iSCSI totals per initiator
The stats in this dashboard are grouped/filtered by initiator name and summed for all sessions.
iSCSI totals per target
The stats in this dashboard are grouped/filtered by target name and summed for all sessions.
Template
This section tracks StorPool templates and is useful to view usage and free space patterns. The data comes from the template measurement.
Template usage
This dashboard shows stored and free space per template and placement group.
Template usage - internal
This dashboard contains the same information as Template usage, with some additional internal parameters.
Disk
This section shows the general disk usage. The data comes from the disk measurement.
Disk usage
This dashboard shows the overall usage of HDD and SSD/NVMe drives in the cluster.
Disk usage - internal
This dashboard shows all internal StorPool parameters for all drives in the cluster, split by type (SSD/HDD).
Single disk usage - internal
This dashboard shows all internal StorPool parameters for a single drive in the cluster.
Custom
This section is usually empty. Customers can create dashboards with the tag custom, and those will show in that panel.