StorPool analytics

1. Overview

The metrics collected from StorPool clusters can be viewed at https://analytics.storpool.com/. The purpose of this document is to guide customers through the information available there and some of its use cases.

Access credentials for the analytics system are provided during the initial setup of the cluster. For any issues related to those, please contact StorPool support.

The full sources of the dashboards are available on request.

2. Basics

The data is organized in two dimensions:

  • granularity (per-second or per-minute samples), and

  • component/subsystem (CPU, memory, back-end, front-end, iSCSI, etc.)

The site uses the industry-standard Grafana software.

3. Example scenarios for use

Here are some scenarios showing how the system can be used to track different types of issues. For any extra information, please consult the Dashboards reference below.

3.1. Single drive creating delays

A simple scenario is checking whether a single drive is creating problems in the cluster.

The starting point for this would be the All server stat dashboard. A server with a problematic drive will show a read or write latency graph that looks different from the rest.

For the specific server, the Per-host backend stat dashboard can be consulted to see whether all drives, or only a group of them, show different latencies. Care should be taken not to mix up SSD/NVMe drives and HDDs, as they have different performance characteristics.

If all drives (or a group of them) on a node show the same level of latency, then the issue is most probably related to the node and its controllers. Otherwise, the drives can be compared in the Per-disk backend stat dashboard with drives on other nodes, and any spikes in latency can be correlated with usage patterns on the drive to decide whether there is a problem.

The final step would be verification on the node itself, via SMART and other tools, to see if the latencies can be explained.
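The peer comparison described above can be sketched as a small script. This is a minimal illustration, not a StorPool tool: the drive names, latency values, and the deviation factor are all hypothetical.

```python
from statistics import median

def flag_slow_drives(latencies_ms, factor=3.0):
    """Return the IDs of drives whose average latency exceeds `factor`
    times the median latency of their peers (same drive class)."""
    med = median(latencies_ms.values())
    return sorted(d for d, lat in latencies_ms.items() if lat > factor * med)

# Hypothetical per-drive average write latencies (ms) for one node's SSDs
ssd_latency = {"disk101": 0.4, "disk102": 0.5, "disk103": 4.2, "disk104": 0.45}
print(flag_slow_drives(ssd_latency))  # → ['disk103']
```

Comparing against the median of drives of the same class avoids the SSD-vs-HDD pitfall mentioned above: the two classes should never be mixed in one comparison set.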

3.2. Balancing CPU usage on hypervisors

There is a specific dashboard for this purpose, called Total CPU non-SP stat, which tracks the usage for all CPUs that are not isolated to be used by StorPool.

The best indicator in the dashboard is the “Run queue wait per node” graph, which tracks the congestion of the node. The “wait” is the time a process was scheduled to run on a CPU, but had to wait for another process to complete. For KVM-based hypervisors, this translates directly to “steal” time inside the virtual machines. On a non-congested node, “wait” should be close to zero, so based on this information the overloaded nodes can be picked and some VMs moved away from them.

If there is no congestion and the goal is just a better balance of CPU usage, the “Run queue per node” graph can be observed. It includes the actual time processes were running plus the time processes waited, and can be used to decide which hosts are overloaded and when.

3.3. Checking for CPU starvation for StorPool services

As most StorPool deployments use the hsleep type of sleep (see the FAQ entry “Why do the StorPool processes seem to be at 100% CPU usage all the time?”), on normal graphs/stats it looks like every StorPool process utilizes 100% of the CPU. As noted in the FAQ, this is not the case, and the actual CPU usage can be checked in the Network dashboard.

There are two relevant graphs, “Loops per second” and “Slept”. “Slept” is the amount of CPU time in milliseconds that the process spent sleeping/waiting, and “loops” is the number of iterations the process has performed. A zero or close-to-zero value for “Slept” together with a high number of “loops” means that the process has become CPU-bound. Low “Slept” and a low number of “loops” points to a different type of problem (for example, the CPU being throttled to a very low frequency).
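The interpretation above can be summarized as a small decision rule. The thresholds here are hypothetical placeholders chosen for illustration; real values depend on the service and hardware.

```python
def classify(loops_per_sec, slept_ms_per_sec):
    """Rough interpretation of the "Loops per second" and "Slept" graphs.
    The numeric thresholds are illustrative assumptions, not StorPool defaults."""
    if slept_ms_per_sec < 10 and loops_per_sec > 100_000:
        return "CPU-bound"
    if slept_ms_per_sec < 10 and loops_per_sec < 1_000:
        return "possibly throttled CPU"
    return "normal"

print(classify(500_000, 2))   # → CPU-bound
print(classify(400, 3))       # → possibly throttled CPU
print(classify(50_000, 800))  # → normal
```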

4. Dashboards reference

Categories:

4.1. Home

The home dashboard contains a basic overview of the cluster and links to all other dashboards.

The first line contains “Annunciator” graphs for the basic system performance parameters - IOps, throughput, latency and queue size, for the front-end (volumes accessed via storpool_block) and back-end (for the storpool_server processes that talk to the drives). The graphs take their data from the per-second stats for the last 1 hour.

The second line contains the cluster name and some basic “inventory” parameters, like the number of attached volumes, available drives, and clients (storpool_block) currently processing operations.

Below that are the general dashboards for all the different categories.

4.2. CPU

The data in these dashboards is taken from the cpustat measurement and contains samples for every CPU on every node of the system, once per minute or once per second.

Two parameters come from the kernel scheduler and form the “run queue”: run, the amount of time a process has been running on the CPU, and wait, the amount of time processes waited to run on the CPU. These two parameters, especially wait, describe very well how congested the CPU is.

Note that a CPU in this context is a CPU thread, not the whole chip or core. For example, if you have the following output in lscpu, it means that data would be collected for 40 CPUs:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
...
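The 40 CPUs in the lscpu output above follow directly from the topology fields; the arithmetic can be sketched as:

```python
def cpu_threads(sockets, cores_per_socket, threads_per_core):
    """Number of CPU threads the collector sees (one data series per thread)."""
    return sockets * cores_per_socket * threads_per_core

# Values from the lscpu output above: 2 sockets x 10 cores x 2 threads
print(cpu_threads(2, 10, 2))  # → 40
```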

4.2.1. CPU stat

This dashboard gives an overview of the CPU usage on a single node.

4.2.2. Per-CPU stat

This dashboard tracks basic parameters for a single (or multiple selected) CPU threads. The parameters tracked are the load type (system, user, irq, etc.) and the run-queue (both time spent executing and time spent waiting to execute).

4.2.3. Per-service CPU stat

This dashboard shows the CPU usage for CPU threads used by StorPool services.

4.2.4. Total CPU non-SP stat

This dashboard tracks all CPUs that are not in StorPool-related slices. This dashboard is useful to track hypervisor load and related issues.

4.2.5. Total CPU stat

This dashboard shows an overview of the CPU usage on all nodes.

4.3. Servers

The data in the following dashboards comes from the diskstat measurement. It is collected by the /usr/lib/storpool/server_stat tool, which provides metrics about a running storpool_server instance.

This data describes the communication between the storpool_server processes and the physical drives in the systems (hence the “back-end” term used in some of the graphs).

These dashboards are useful to track issues with load related to specific drives or increases in system-related load (like aggregations).

4.3.1. All server stat

This dashboard provides an overview of the general storpool_server performance metrics per node, composed in one view.

4.3.2. General server stat

This dashboard provides a general overview of the cluster’s back-end performance.

4.3.3. Per-disk backend stat

This dashboard provides an overview of the usage of a single drive by the storpool_server instances.

4.3.4. Per-host backend stat

This dashboard provides an overview of the usage of the storpool_server instances on a single node.

4.4. Clients

These dashboards track the operations on volumes from the clients connecting via storpool_block. Their data comes from the iostat measurement.

The main difference between this set of dashboards and the Volume ones is that this one is structured around the client nodes, not the volumes themselves.

4.4.1. All client stat

This dashboard shows all client operations in the cluster, grouped by server.

4.4.2. General client stat

This dashboard shows a summary of all client operations in the cluster.

4.4.3. Per-host client stat

This dashboard allows for viewing the stats for client operations on a single node.

4.5. Memory

These dashboards track the memory usage in cgroups and on the nodes in general. Their data comes from the memstat measurement.

4.5.1. Cgroup memory

This dashboard shows the usage of all cgroup memory slices in the cluster.

4.5.2. Cgroup memory per node

This dashboard shows the usage of all cgroup memory slices on a single node.

4.6. Volume

These dashboards track the operations on specific volumes from the clients connecting via storpool_block. Their data comes from the iostat measurement.

4.6.1. Per-volume stat

This dashboard shows the I/O stats for a single volume.

4.6.2. Top volumes

This dashboard provides information on the volumes with most load, either in operations (read/write) or bandwidth.

4.7. Network

The only dashboard here tracks network and service stats for StorPool services. Its data comes from the servicestat and netstat measurements.

4.7.1. Network service stat

The first two rows describe per-service metrics:

  • Data transfers are the transfers done over the StorPool protocol;

  • Failed transfers are the number of transfers that failed for some reason (and would be retried);

  • Loops is the number of processing loops a specific service has done;

  • Slept is the amount of time a process has slept/waited for work to process.

The rest of the graphs describe metrics for the network communication on the two network interfaces on every node. More information about them can be found in the netstat documentation.

4.8. System disk

Both dashboards in this section track I/O stats for all physical drives in the system, as seen by the kernel.

4.8.1. SP disk stat

This dashboard filters drives by their StorPool ID.

4.8.2. System disk stat

This dashboard shows all drives and partitions by device name.

4.9. iSCSI

These dashboards provide an overview of iSCSI traffic based on different filters. Their data comes from the iscsisession measurement.

4.9.1. iSCSI stats per initiator

The stats in this dashboard are grouped/filtered by initiator name and displayed for every session.

4.9.2. iSCSI stats per node/network

The stats in this dashboard are summed/grouped by node and interface, to give a general overview of the usage of the iSCSI network.

4.9.3. iSCSI stats per target

The stats in this dashboard are grouped/filtered by target name and displayed for every session.

4.9.4. iSCSI totals per initiator

The stats in this dashboard are grouped/filtered by initiator name and summed for all sessions.

4.9.5. iSCSI totals per target

The stats in this dashboard are grouped/filtered by target name and summed for all sessions.

4.10. Template

This section tracks StorPool templates and is useful for viewing usage and free space patterns. The data comes from the template measurement.

4.10.1. Template usage

This dashboard shows stored and free space per template and placement group.

4.10.2. Template usage - internal

This dashboard contains the same information as Template usage, with some additional internal parameters.

4.11. Disk

This section shows the general disk usage. The data comes from the disk measurement.

4.11.1. Disk usage

This dashboard shows the overall usage of HDD and SSD/NVMe drives in the cluster.

4.11.2. Disk usage - internal

This dashboard shows all internal StorPool parameters for all drives in the cluster, split by type (SSD/HDD).

4.11.3. Single disk usage - internal

This dashboard shows all internal StorPool parameters for a single drive in the cluster.

4.12. Custom

This section is usually empty. Customers can create dashboards with the tag custom, and those will show up in this panel.