StorPool analytics
Overview
The metrics collected from StorPool clusters can be viewed at https://analytics.storpool.com/. The purpose of this document is to guide customers through the information available there and through some of its use cases.
Access credentials for the analytics system are provided during the initial setup of the cluster. For any issues related to those, please contact StorPool support.
The full sources of the dashboards are available on request.
Basics
The data is organized in two dimensions:
Granularity (per second, or per minute samples), and
Component/subsystem (CPU, memory, back-end, front-end, iSCSI, etc.)
The site uses the industry-standard Grafana software.
Example scenarios for use
Here are some scenarios showing how the system can be used to track different types of issues. For more information, see Dashboards reference.
Single drive creating delays
A simple scenario is checking whether a single drive is causing problems in a cluster.
The starting point for this would be All server stat. If a specific server is affected, its read or write latency graph will look different from those of the other servers.
For the specific server, Per-host backend stat can be consulted to see whether all drives, or a group of them, show different latencies. Care should be taken not to confuse SSD/NVMe drives with HDDs, as they have different performance characteristics.
If all (or a group of) drives on a node show the same level of latency, then the issue is most probably related to the node and its controllers. Otherwise, the drives can be compared in the Per-disk backend stat with drives on other nodes, and any spikes in latency can be correlated with usage patterns on the drive to decide whether there is a problem.
The final step would be verification on the node itself, via SMART and other tools, to see if the latencies can be explained.
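The comparison step can be sketched in code. The following is a minimal Python illustration using made-up cumulative per-disk counters (operations performed and busy time in milliseconds) sampled a minute apart; the disk names and numbers are hypothetical, but real per-disk stats expose counters in this spirit:

```python
# Hypothetical per-disk samples: cumulative (operations, busy time in ms),
# taken one interval apart, in the spirit of the per-disk backend stats.
def avg_latency_ms(ops_before, time_before, ops_after, time_after):
    """Average latency over the interval: time spent divided by ops done."""
    ops = ops_after - ops_before
    if ops == 0:
        return 0.0
    return (time_after - time_before) / ops

samples = {
    "disk_1101": ((1000, 500), (1600, 830)),   # 330 ms over 600 ops
    "disk_1102": ((1000, 500), (1600, 4100)),  # 3600 ms over 600 ops - outlier
}
for disk, ((o0, t0), (o1, t1)) in samples.items():
    print(disk, avg_latency_ms(o0, t0, o1, t1))
```

A drive whose per-operation latency stands far above its peers of the same media type (as disk_1102 does here) is the one to examine with SMART and other node-local tools.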
Balancing CPU usage on hypervisors
There is a specific dashboard for this purpose, called Total CPU non-SP stat, which tracks the usage for all CPUs that are not isolated to be used by StorPool.
The best indicator in the dashboard is the “Run queue wait per node” graph, which tracks the congestion of the node. The “wait” is the time a process was scheduled to run on a CPU, but had to wait for another process to complete. For KVM-based hypervisors, this translates directly to “steal” time inside the virtual machines. On a non-congested node, “wait” should be close to zero, so based on this information the overloaded nodes can be picked and some VMs moved away from them.
If there is no congestion and just a better balance of CPU usage is wanted, the “Run queue per node” graph can be observed. It also includes the actual time processes were running, plus the time processes waited, and can be used to decide which hosts are overloaded and when.
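The node-selection step can be sketched as follows, using made-up per-node wait values and an assumed threshold (the node names and the cut-off are illustrative, not official StorPool values):

```python
# Hypothetical per-node run-queue "wait" percentages averaged over a period;
# nodes whose wait stays well above zero are candidates for VM migration.
wait_per_node = {"hv1": 0.2, "hv2": 7.5, "hv3": 0.1, "hv4": 3.9}

THRESHOLD = 1.0  # percent; an assumed cut-off, tune for your environment
overloaded = sorted(
    (n for n, w in wait_per_node.items() if w > THRESHOLD),
    key=wait_per_node.get,
    reverse=True,
)
print(overloaded)  # most congested nodes first
```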
Checking for CPU starvation for StorPool services
As most StorPool deployments use the hsleep type of sleep (see Why the StorPool processes seem to be at 100% CPU usage all the time?), on normal graphs/stats it looks like every StorPool process utilizes 100% of the CPU. As noted in the FAQ, this is not the case, and the actual CPU usage can be checked in the Network dashboard.
There are two relevant graphs, “Loops per second” and “Slept”. “Slept” is the amount of CPU time in milliseconds that the process spent sleeping/waiting, and “loops” is the number of iterations the process has performed. A zero or close-to-zero value for “Slept” combined with a somewhat high number of loops means that the process has become CPU-bound. A low “Slept” value combined with a low number of loops points to a different type of problem (for example, the CPU being throttled to a very low frequency).
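A rough way to read these two graphs together can be sketched like this (the thresholds are illustrative assumptions, not official StorPool values):

```python
# Rough interpretation of the "Slept" and "Loops per second" graphs,
# using illustrative thresholds - not official StorPool values.
def service_state(slept_ms, loops):
    if slept_ms < 10 and loops > 100_000:
        return "CPU-bound"           # never sleeps, spins through many loops
    if slept_ms < 10 and loops < 1_000:
        return "possibly throttled"  # neither sleeping nor making progress
    return "normal"

print(service_state(0, 500_000))   # CPU-bound
print(service_state(2, 300))       # possibly throttled
print(service_state(800, 50_000))  # normal
```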
Dashboards reference
Home
The home dashboard contains a basic overview of the cluster and links to all other dashboards.
The first line contains “Annunciator” graphs for the basic system performance parameters - IOps, throughput, latency, and queue size - for the front-end (volumes accessed via storpool_block) and the back-end (the storpool_server processes that talk to the drives). The graphs take their data from the per-second stats for the last 1 hour.
The second line contains the cluster name and some basic “inventory” parameters, like the number of attached volumes, available drives, and clients (storpool_block) currently processing operations.
Below that are the general dashboards for all the different categories.
CPU
The data in these dashboards is taken from the cpustat measurement and contains measurements for every CPU on every node of the system, once per minute or once per second.
Two parameters come from the kernel scheduler and form the “run-queue”. The parameters are run and wait, meaning “amount of time a process has been running on the CPU” and “amount of time process(es) waited to run on the CPU”, respectively. These two parameters, especially wait, describe very well how congested the CPU is.
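On Linux, this run/wait pair corresponds to what the kernel exposes per CPU in /proc/schedstat, where the last three fields of each cpuN line are documented as time spent running (nanoseconds), time spent waiting to run, and the number of timeslices. A small Python sketch parsing a made-up sample:

```python
# Parse the run/wait pair from /proc/schedstat-formatted text.
# Per the kernel's sched-stats documentation, the last three fields of a
# "cpuN" line are: time running (ns), time waiting to run (ns), timeslices.
sample = """version 15
timestamp 4300000000
cpu0 0 0 0 0 0 0 1200000000 340000000 90000
cpu1 0 0 0 0 0 0 900000000 45000000 80000
"""

def run_wait(schedstat_text):
    stats = {}
    for line in schedstat_text.splitlines():
        if line.startswith("cpu"):
            fields = line.split()
            run_ns, wait_ns = int(fields[-3]), int(fields[-2])
            stats[fields[0]] = (run_ns, wait_ns)
    return stats

print(run_wait(sample))
```

In this sample, cpu0 has spent a large fraction of its time with tasks waiting to run, which is exactly the congestion signal the wait graphs surface.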
Note that a CPU in this context is a CPU thread, not the whole chip or core. For example, if you have the following output from lscpu, it means that data would be collected for 40 CPUs:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
...
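The 40 CPUs follow directly from the topology shown above: sockets times cores per socket times threads per core.

```python
# CPU thread count from the lscpu topology above:
# 2 sockets x 10 cores per socket x 2 threads per core = 40 CPU threads.
sockets, cores_per_socket, threads_per_core = 2, 10, 2
cpus = sockets * cores_per_socket * threads_per_core
print(cpus)  # 40
```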
CPU stat
This dashboard gives an overview of the CPU usage on a single node.
Per-CPU stat
This dashboard tracks basic parameters for a single (or multiple selected) CPU threads. The parameters tracked are the load type (system, user, irq, etc.) and the run-queue (both time spent executing and time spent waiting to execute).
Per-service CPU stat
This dashboard shows the CPU usage for CPU threads used by StorPool services.
Total CPU non-SP stat
This dashboard tracks all CPUs that are not in StorPool-related slices. This dashboard is useful to track hypervisor load and related issues.
Total CPU stat
This dashboard shows an overview of the CPU usage on all nodes.
Servers
The data in the following dashboards comes from the diskstat measurement. The data inside is the information collected from the /usr/lib/storpool/server_stat tool, which provides metrics about a running storpool_server instance.
This data describes the communication between the storpool_server processes and the physical drives in the systems (hence the “back-end” term used in some of the graphs).
These dashboards are useful to track issues with load related to specific drives or increases in system-related load (like aggregations).
All server stat
This dashboard provides an overview of the general storpool_server performance metrics per node, composed in one view.
General server stat
This dashboard provides a general overview of the cluster’s back-end performance.
Per-disk backend stat
This dashboard provides an overview of the usage of a single drive by the storpool_server instances.
Per-host backend stat
This dashboard provides an overview of the usage of the storpool_server instances on a single node.
Clients
These dashboards track the operations on volumes from the clients connecting via storpool_block. Their data comes from the iostat measurement.
The main difference between this set of dashboards and the Volume ones is that this set is structured around the client nodes, not the volumes themselves.
All client stat
This dashboard shows all client operations in the cluster, grouped by server.
General client stat
This dashboard shows a summary of all client operations in the cluster.
Per-host client stat
This dashboard allows viewing the stats for client operations on a single node.
Memory
These dashboards track the memory usage in cgroups and on the nodes in general. Their data comes from the memstat measurement.
Cgroup memory
This dashboard shows the usage of all cgroup memory slices in the cluster.
Cgroup memory per node
This dashboard shows the usage of all cgroup memory slices on a single node.
Volume
These dashboards track the operations on specific volumes from the clients connecting via storpool_block. Their data comes from the iostat measurement.
Per-volume stat
This dashboard shows the I/O stats for a single volume.
Top volumes
This dashboard provides information on the volumes with most load, either in operations (read/write) or bandwidth.
Network
The only dashboard here tracks network and service stats for StorPool services. Its data comes from the servicestat and netstat measurements.
Network service stat
The first two rows describe per-service metrics:
Data transfers are the transfers done over the StorPool protocol.
Failed transfers are the number of transfers that failed for some reason (and will be retried).
Loops is the number of processing loops a specific service has performed.
Slept is the amount of time a process has slept/waited for work to process.
The rest of the graphs describe metrics for the network communication on the two network interfaces on every node. More information about them can be found in the netstat documentation.
System disk
Both dashboards in this section track I/O stats for all physical drives in the system, as seen by the kernel.
SP disk stat
This dashboard filters drives by their StorPool ID.
System disk stat
This dashboard shows all drives and partitions by device name.
iSCSI
These dashboards provide an overview of iSCSI traffic based on different filters. Their data comes from the iscsisession measurement.
iSCSI stats per initiator
The stats in this dashboard are grouped/filtered by initiator name and displayed for every session.
iSCSI stats per node/network
The stats in this dashboard are summed/grouped by node and interface, to give a general overview of the usage of the iSCSI network.
iSCSI stats per target
The stats in this dashboard are grouped/filtered by target name and displayed for every session.
iSCSI totals per initiator
The stats in this dashboard are grouped/filtered by initiator name and summed for all sessions.
iSCSI totals per target
The stats in this dashboard are grouped/filtered by target name and summed for all sessions.
Template
This section tracks StorPool templates and is useful to view usage and free space patterns. The data comes from the template measurement.
Template usage
This dashboard shows stored and free space per template and placement group.
Template usage - internal
This dashboard contains the same information as Template usage, with some additional internal parameters.
Disk
This section shows the general disk usage. The data comes from the disk measurement.
Disk usage
This dashboard shows the overall usage of HDD and SSD/NVMe drives in the cluster.
Disk usage - internal
This dashboard shows all internal StorPool parameters for all drives in the cluster, split by type (SSD/HDD).
Single disk usage - internal
This dashboard shows all internal StorPool parameters for a single drive in the cluster.
Custom
This section is usually empty. Customers can create dashboards with the tag custom, and those will show up in this panel.