StorPool’s monitoring system alerts & description

1. Introduction

StorPool monitors all deployments and notifies customers about issues with their clusters. This document describes the different types of notifications and their meaning, and gives pointers to other parts of the documentation for further debugging and resolving the problems.

The alerts are split into the following broad categories:

  • Cluster status alerts, based on data from the StorPool API;

  • Per-host alerts, based on data from the hosts;

  • Metrics-based alerts, based on metrics data from the hosts;

  • Others.

The alerts can also be composite, single-node, or cluster-wide:

  • A single-node alert describes a single service on a single node, for example an SSD/HDD drive;

  • A composite alert describes all alerts for nodes of a particular type, for example all disks in a cluster;

  • A cluster-wide alert is about the cluster itself, for example if the relocator service is not running.

The severities of the alerts are:

  • OK (no alarm); seeing an alert for this state is in most cases the result of a transition from either WARNING or CRITICAL;

  • WARNING (problems can be expected);

  • CRITICAL (problems imminent or already happening);

  • super-critical (critical with an [SC] tag) - the cluster is either down or in a similar state. This type of alert generates an automated call to StorPool support;

  • UNKNOWN (fresh information has not been received about the particular service or cluster).

There are currently two supported channels for sending alarms: email and Slack. For Slack, it is possible to push directly to the customer’s Slack workspace (with access provided by them), to a shared channel between the customer’s organization and StorPool, or to a channel in StorPool’s Slack provided for the customer.

Information on how to implement these checks independently of StorPool’s monitoring system can be found in Common monitoring via the StorPool API.
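
For illustration, below is a minimal sketch of such an independent check against the StorPool JSON API. The /ctrl/1.0/ServicesList endpoint and the Authorization header format follow the API reference; the response field names and the placeholder host, port, and token values are assumptions that need to be adapted to the specific deployment.

  # Minimal sketch of an independent "are all services running" check against
  # the StorPool API. The endpoint and Authorization header follow the API
  # reference; the response field names ("data", "status") are assumptions --
  # verify them against the API version in use.
  import json
  import urllib.request

  API_HOST = "127.0.0.1"       # value of SP_API_HTTP_HOST for the cluster
  API_PORT = 81                # value of SP_API_HTTP_PORT
  API_TOKEN = "<auth token>"   # value of SP_AUTH_TOKEN

  req = urllib.request.Request(
      "http://%s:%d/ctrl/1.0/ServicesList" % (API_HOST, API_PORT),
      headers={"Authorization": "Storpool v1:" + API_TOKEN},
  )
  with urllib.request.urlopen(req, timeout=10) as resp:
      services = json.load(resp)["data"]

  # Each service group (servers, clients, bridges, ...) is assumed to be a
  # mapping of service ID to a per-service description with a "status" field.
  for group, members in services.items():
      if not isinstance(members, dict):
          continue
      for sid, info in members.items():
          if isinstance(info, dict) and info.get("status") not in (None, "running"):
              print("WARNING: %s %s is %s" % (group, sid, info["status"]))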

2. Cluster status alerts

2.1. agscore-entries

Type: per placement group (cluster-wide)

Reports if the aggregate score for entries of multiple drives is above a certain threshold (different for hard drives and SSD/NVMes).

2.2. agscore-space

Type: per placement group (cluster-wide)

Reports if the aggregate score for space usage of multiple drives is above a certain threshold (different for hard drives and SSD/NVMes).

2.3. balancer

Type: cluster-wide

Tracks the state of the internal balancer. Currently unused.

2.4. bridge

Type: composite and per-host

Tracks the state of the storpool_bridge services in the cluster - whether some of them are down, or whether a bridge service is expected but missing.

2.5. client

Type: composite and per-host

Tracks the state of the storpool_block services in the cluster. All nodes are expected to have the service running. Reports:

  • if a service is down;

  • if a service is missing on a node that has a working network;

  • if the configuration update for the service has failed (a super-critical alarm, as this means there is a hard-to-detect network problem).

2.6. controller

Type: composite and per-host

Tracks the state of the storpool_controller service in the cluster.

2.7. dematerialization

Type: cluster-wide

Reports the state of snapshot dematerialization in the cluster. For more information, see the User Guides.

2.8. disbalance

Type: cluster-wide

Currently disabled. Intended to show the amount of disbalance in disk usage across the cluster.

2.9. disk

Type: per-disk

Reports information for a specific disk: device, serial number, and last scrubbing time. Raises an alert on a missing drive, or on too much usage of objects, entries, or space.

2.10. diskentries

Type: per placement group

Reports if drives in the placement group have entry usage above a certain threshold.

2.11. disk-journals

Type: composite only

Reports if drives that had journals are back in the cluster without journals. The lack of journals can lead to an order of magnitude slower writes.

2.12. disk-softeject

Type: composite only

Reports if drives are being softEjected (balanced out) from the cluster.

2.13. disk-test

Type: composite only

Reports if any drives are ejected and are pending testing or being tested.

2.14. diskobjects

Type: per placement group

Reports if drives in the placement group have object usage above 70%. This is especially relevant for VolumeCare deployments or systems that create a large number of snapshots, as a lack of available objects on any of the disks prevents the creation of new volumes or snapshots.
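
As a rough illustration, a check along these lines could be built on top of the DisksList API call; the assumption below is that the call returns a mapping of disk ID to per-disk data with objectsAllocated and objectsCount fields (verify the exact field names against the API version in use).

  # Rough sketch: flag disks whose object usage is above 70%, given the decoded
  # "data" part of a DisksList API response. The "objectsAllocated" and
  # "objectsCount" field names are assumptions.
  def disks_over_object_limit(disks_data, threshold=0.70):
      over = []
      for disk_id, disk in disks_data.items():
          total = disk.get("objectsCount") or 0
          used = disk.get("objectsAllocated") or 0
          if total and used / total > threshold:
              over.append((disk_id, round(used / total, 2)))
      return over

  # Example with made-up numbers:
  sample = {"101": {"objectsCount": 1000000, "objectsAllocated": 750000}}
  print(disks_over_object_limit(sample))   # -> [('101', 0.75)]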

2.15. disks

Type: composite

Reports composite information for all disks in the system. Will raise a super-critical alarm if drives on more than one server are missing.

Note

There have been discussions that this super-critical alarm should be triggered only for data left with a single copy in a triple-replicated setup, or for data without any copies left. Because of faultsets and balancer overrides, this is impossible to do with the existing monitoring data.

2.16. disk-softeject

Type: cluster-wide

Reports which drives have been softEjected (with rebalancing), or are in the process of being softEjected.

2.17. diskspace

Type: per placement group

Reports if drives in a specific placement group have space usage above 90%.

2.18. iscsi

Type: composite and per-host

Reports the state of iSCSI target services (storpool_iscsi) in the cluster.

2.19. latthreshold-cli

Type: cluster-wide

Reports abnormal latencies of I/O operations in the cluster performed by clients. Large latencies in this alert can trigger a super-critical alarm.

2.20. latthreshold-disk

Type: cluster-wide

Reports abnormal latencies of I/O operations in the cluster performed on disks (hardware).

2.21. maintenances

Type: per-cluster

Reports if any nodes are in scheduled maintenance, and the state of the maintenance (ongoing, expired, etc.). Non-super-critical alerts for nodes under maintenance are suppressed.

2.22. mgmt

Type: composite and per-host

Reports the state of storpool_mgmt services in the cluster. Will raise an alert if there is just one management node in a cluster of more than one node. Will raise a super-critical alert if no management services are available, as no data can be collected about the status of the cluster.

Note

Not being able to collect data is different from not receiving any data from the cluster. This alert is triggered by the collection script sending a message with errors and missing data.

2.23. mgmtConfig

Type: cluster-wide

Will report discrepancies in the internal cluster configuration. Handled exclusively by StorPool support.

2.24. monitoringdata

Type: cluster-wide

Reports if there was no monitoring data received for the last 15 minutes.

The main reasons for this alert are connectivity issues between the cluster and StorPool’s monitoring system. Those are easily detected by doing a traceroute from the active mgmt node in the cluster to mon1.storpool.com, and most of the time are related to DNS or routing.
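
A basic DNS and TCP reachability probe from the active mgmt node, along the lines of the traceroute suggestion above, might look like the following sketch; the use of port 443 is an assumption, so substitute the port the monitoring traffic actually uses.

  # Minimal DNS + TCP reachability probe towards StorPool's monitoring endpoint.
  # Port 443 is an assumption here -- adjust it to the port the monitoring
  # traffic actually uses in your deployment.
  import socket

  host = "mon1.storpool.com"
  try:
      addr = socket.gethostbyname(host)        # DNS resolution
      print("resolved %s to %s" % (host, addr))
      with socket.create_connection((addr, 443), timeout=5):
          print("TCP connection to %s:443 OK" % addr)
  except OSError as exc:
      print("connectivity problem towards %s: %s" % (host, exc))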

2.25. needbalance

Type: cluster-wide

Reports if there are volumes or snapshots with placement constraint violations, which in turn would require a rebalancing of the cluster to be fixed. Currently handled by StorPool support.

There are multiple reasons for this alert to become active:

  • Volumes were created when a node was missing from a three-node cluster;

  • A drive/node has died and redundancy hasn’t been restored.

2.26. network

Type: composite and per-host

Reports the state of a specific node’s network connectivity to the cluster. It will also report if a node does not consider itself part of the cluster, or if it is using a backup link.

Note

The network state is collected from the beacon on the active management node, so it is not fully representative of what each node sees.

2.27. onapp-bkp-vol

Type: cluster-wide

This will report if there are stale volumes used by OnApp to create backups. For more information, see the Stale backup volumes and snapshots cleanup procedure.

2.28. reformat

Type: cluster-wide

Internal use only. Reports if the entry usage on some drives warrants reformatting them with a larger number of entries.

2.29. relocator

Type: cluster-wide

Reports whether the internal relocator service (part of storpool_mgmt) is running.

2.30. server

Type: composite and per-host

Reports the status of all storpool_server instances in the cluster. It also reports if there are drives in recovery on an instance, if a storpool_controller service is missing on a node with server instance(s), or if their versions differ.

2.31. snaplen

Type: cluster-wide

Will report if there are snapshot chains longer than a certain limit. This is used to track whether the periodic chain-shortening task is working.

2.32. snaprepl

Type: cluster-wide

Reports if the replication factor of all snapshots in the cluster matches the replication factor of the template they’re in.

2.33. snaptargets

Type: cluster-wide

Reports if the disk placement for snapshots and their parents matches.

2.34. snaptmpl

Type: cluster-wide

Reports if all snapshots have a template.

2.35. tasks

Type: cluster-wide

Reports the tasks state in the cluster, including recoveries.

2.36. template

Type: per template type

Reports free space per template type. A template type is defined by the following 4 parameters: placeHead, placeAll, placeTail, and the replication factor (a small grouping sketch is shown at the end of this section).

This is the standard way of seeing the available space for a specific template in the cluster.

Note

The reported free space is a conservative estimate. Disbalance or different-sized drives in a placement group can skew this considerably.
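
To make the grouping concrete, here is a small sketch that derives template types from a list of templates. The sample dictionaries mimic the TemplatesList API output, but only the four parameters named above are relied upon, and the exact field names should be verified against the API version in use.

  # Sketch: group templates into "template types" by the four parameters that
  # define a type. The dictionary layout mirrors the TemplatesList API output
  # as an assumption; only the four keys named in the text are relied upon.
  from collections import defaultdict

  def template_types(templates):
      types = defaultdict(list)
      for tmpl in templates:
          key = (tmpl["placeHead"], tmpl["placeAll"],
                 tmpl["placeTail"], tmpl["replication"])
          types[key].append(tmpl["name"])
      return dict(types)

  sample = [
      {"name": "hybrid", "placeHead": "ssd", "placeAll": "ssd",
       "placeTail": "hdd", "replication": 3},
      {"name": "hybrid-backup", "placeHead": "ssd", "placeAll": "ssd",
       "placeTail": "hdd", "replication": 3},
  ]
  # Both sample templates share one template type: ('ssd', 'ssd', 'hdd', 3).
  print(template_types(sample))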

2.37. totalvolumes

Type: cluster-wide

Reports if the number of volumes in a cluster is too low (fewer than two, as at least that many are expected for the required storpool_iolatmon service) or too high (more than 10000).

2.38. volumerepl

Type: cluster-wide

Reports if the replication factor of all volumes in the cluster matches the replication factor of the template they’re in.

2.39. volumes

Type: cluster-wide

Reports if there are volumes named in a way that is not expected for the cluster, to track left-over tests. Does not send alerts.

2.40. volumesizes

Type: cluster-wide

Reports if there are volumes whose size is not aligned to 4096 bytes. Does not send alerts.

Note

Such volumes are a problem for some tools: their unaligned size leads to the use of smaller block sizes and bad performance. One such example is ntfsclone.

2.41. volumetargets

Type: cluster-wide

Reports if the disk placement for volumes and their parent snapshot matches.

2.42. volumetmpl

Type: cluster-wide

Reports if all volumes have a template.

3. Per-host alerts

3.1. apichecks

Type: composite (no single-node version is available)

This alert describes whether a specific host can access the IP address in SP_API_HTTP_HOST via ICMP ping and TCP. This is required so that most orchestrations and the storpool_controller service can communicate with the storpool_mgmt service.
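
A stripped-down version of the TCP part of this check might look like the sketch below; the /etc/storpool.conf path, the flat key=value parsing, and the fallback port of 81 are simplifying assumptions.

  # Sketch of the TCP half of the apichecks probe: read SP_API_HTTP_HOST (and
  # SP_API_HTTP_PORT, if set) from /etc/storpool.conf and try to connect.
  # The flat key=value parsing below ignores per-host sections and any
  # configuration overrides -- a simplification for illustration.
  import socket

  conf = {}
  with open("/etc/storpool.conf") as fh:
      for line in fh:
          line = line.strip()
          if line and not line.startswith("#") and "=" in line:
              key, _, value = line.partition("=")
              conf[key.strip()] = value.strip()

  host = conf.get("SP_API_HTTP_HOST", "127.0.0.1")
  port = int(conf.get("SP_API_HTTP_PORT", "81"))
  try:
      with socket.create_connection((host, port), timeout=5):
          print("API endpoint %s:%d reachable over TCP" % (host, port))
  except OSError as exc:
      print("CRITICAL: cannot reach API endpoint %s:%d: %s" % (host, port, exc))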

3.2. kernels

Type: composite (no single-node version is available)

The alert shows the kernels on the node that do not have StorPool modules installed.

This alert is currently handled by StorPool support, by installing the missing kernel modules and/or scheduling an upgrade where necessary.

3.3. rootcgprocess

Type: composite (no single-node version is available)

This alert shows all processes running in the root cgroup. For more information on cgroups, see Control Groups.
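
A local listing along these lines can be sketched as follows, assuming a cgroup v2 hierarchy (on cgroup v1 systems the /proc/<pid>/cgroup format differs).

  # Sketch: list processes attached to the root cgroup, assuming a cgroup v2
  # hierarchy (each /proc/<pid>/cgroup then contains a single "0::<path>" line,
  # and a path of "/" means the root cgroup).
  import os

  for pid in filter(str.isdigit, os.listdir("/proc")):
      try:
          with open("/proc/%s/cgroup" % pid) as fh:
              cgline = fh.read().strip()
          with open("/proc/%s/comm" % pid) as fh:
              comm = fh.read().strip()
      except OSError:
          continue          # the process exited while we were reading
      if cgline == "0::/":
          print("process %s (%s) is in the root cgroup" % (pid, comm))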

Note

This alert is important because processes in the root cgroup are not constrained in their memory usage, so they can cause the OOM killer to trigger and kill random processes.

4. Metrics-based alerts

4.1. cgroups

Type: composite and single-node

This alert tracks the memory usage of the StorPool memory cgroup (storpool.slice) and alerts if it goes above 80%. This gives advance warning if the memory needs to be expanded, if there are bugs (like memory leaks), or if there is something running in the cgroup that shouldn’t be there.
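
For illustration, the same threshold could be checked locally with a sketch like the one below, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup (on cgroup v1 systems the corresponding files are memory.usage_in_bytes and memory.limit_in_bytes under the memory controller).

  # Sketch: warn when storpool.slice memory usage exceeds 80% of its limit.
  # Assumes a cgroup v2 hierarchy mounted at /sys/fs/cgroup; on cgroup v1 the
  # corresponding files are memory.usage_in_bytes and memory.limit_in_bytes.
  base = "/sys/fs/cgroup/storpool.slice"

  with open(base + "/memory.current") as fh:
      current = int(fh.read())
  with open(base + "/memory.max") as fh:
      raw_max = fh.read().strip()

  if raw_max == "max":
      print("storpool.slice has no memory limit set")
  else:
      usage = current / int(raw_max)
      state = "WARNING" if usage > 0.80 else "OK"
      print("%s: storpool.slice memory usage %.1f%%" % (state, usage * 100))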

The alert hasn’t been extended to customer cgroups like machine.slice, as customer usage/overprovisioning vary wildly.

4.2. cpustat

Type: composite and single-node

This alert tracks the run-queue/wait time on CPUs used for StorPool services and notifies if it goes above 1.8. This is used to see if more than one service gets put on the same CPU thread, or if there are other processes/kernel tasks scheduled in the StorPool cpuset, which can result in latencies or similar issues.

The alert hasn’t been extended to customer cgroups like machine.slice, as customer usage/overprovisioning vary wildly.

4.3. dataholes

Type: composite (no single-node version is available)

This alert tracks the amount of dropped packets per interface. It is calculated based on the packets expected to be received for a request; a “data hole” is a gap in the received packet stream.

There can be different reasons for the alert:

  • CRC errors on an interface;

  • Bad cables;

  • Cross-switch link problems (in the case of a network with more than two switches).

Note

This error is a problem with receiving data, so an error on a lot of nodes might mean a problem with the sending from a specific node; i.e., the problem will always warrant investigation of all nodes/NICs in the cluster.

4.4. diskerrors

Type: composite (no single-node version is available)

This alert tracks the rate of new disk errors for specific disks and alerts on more than 10 per hour. This is useful as an advance warning about which drives are expected to fail.

Note

A disk error in this context is a read from a drive that returned data with an incorrect checksum.

4.5. iolatmon

Type: cluster-wide

This alert tracks the state of the storpool_iolatmon service, which does read/write operations on test volumes to probe the cluster’s health. The alert checks for the existence of the test volumes and whether they have any traffic.

A future update is being discussed to also track the latencies observed.

4.6. stats

Type: composite and single-node

This alert shows if metrics data has been received for a host, if it’s delayed, or if it’s in the future.

Possible reasons for the alert (see also the sketch after this list):

  • Lack of data might mean that there is an Internet connectivity/DNS problem on the node, that the storpool_stat service has stopped, or that there is a problem with the OS root drive (either full or remounted read-only);

  • Data in the future means a wrong clock, and is mostly resolved by fixing the clock with ntpdate and checking the chrony or ntpd configuration;

  • Delayed data might mean that the node is catching up, that its clock is wrong (see the previous point), or that it is going into the state described in the first point.
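
A simplified classification of the timestamp of the last received metrics batch might look like the sketch below; the delay and clock-skew tolerances are illustrative values, not the monitoring system’s actual thresholds.

  # Sketch: classify the timestamp of the last received metrics batch.
  # The delay/future tolerances are illustrative, not the monitoring
  # system's exact thresholds.
  import time

  def classify(last_ts, now=None, max_delay=900, max_skew=120):
      now = time.time() if now is None else now
      if last_ts > now + max_skew:
          return "data from the future (check the node's clock)"
      if now - last_ts > max_delay:
          return "data delayed (node catching up, clock wrong, or no data)"
      return "OK"

  print(classify(time.time() - 30))      # -> OK
  print(classify(time.time() - 3600))    # -> data delayed ...
  print(classify(time.time() + 600))     # -> data from the future ...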

4.7. status

Type: composite and single-node

This alert reports if status.json has been updated/transmitted by storpool_abrtsync to reports.storpool.com, which is the current way of receiving per-host checks.

5. Others

5.1. billingdata

Type: cluster-wide

Reports if billing data (daily report) has been received from the cluster. Handled exclusively by StorPool support.