Monitoring alerts

Introduction

StorPool monitors all deployments and notifies customers about issues with their clusters. This document describes the different types of notifications and their meaning, and points to other parts of the documentation for further debugging and resolving the problems.

Alert categories

The alerts are split into the following broad categories:

  • Cluster status alerts, based on data from the StorPool API

  • Per-host alerts, based on data from the hosts

  • Metrics-based alerts, based on metrics data from the hosts

  • Others

The alerts also can be composite, single-node, or cluster-wide:

  • A single-node alert describes a single service on a single node, for example an SSD/HDD drive.

  • A composite alert describes all alerts for nodes of particular type, for example all disks in a cluster.

  • A cluster-wide alert is about the cluster itself, for example if the relocator service is not running.

Severity levels

The severity levels of the alerts are:

  • OK: No alarm.

    Seeing an alert for this state in most cases is a result of a transition from either WARNING or CRITICAL.

  • WARNING: Problems can be expected.

  • CRITICAL: Problems imminent or already happening.

  • Super-critical (critical with [SC] tag): The cluster is either down or in a similar state. This type of alert generates an automated call to StorPool support.

  • UNKNOWN: Fresh information has not been received about the particular service or cluster.

There are currently two supported channels for sending alarms: email and Slack. For Slack, it’s possible to push directly to the customer’s Slack (with access provided by them), to a shared channel between the customer’s organization and StorPool, or to a channel provided in StorPool’s Slack for the customer.

Information on how to implement these checks independent from StorPool’s monitoring system can be found in Common monitoring via the StorPool API.

Cluster status alerts

agscore-entries

Type: per placement group (cluster-wide)

Reports if the aggregate score for entries of multiple drives is above a certain threshold (different for hard drives and SSD/NVMes).

Note

Both this and the alert below are related to either very full drives, or too many writes/trims happening simultaneously for longer periods of time. The alert by itself is a pointer that further investigation is needed into why the drives behave this way, be it from overload, general device slowness, higher loads in the cluster, and so on.

agscore-space

Type: per placement group (cluster-wide)

Reports if the aggregate score for space usage of multiple drives is above a certain threshold (different for hard drives and SSD/NVMes).

attachments-count

Type: cluster-wide

Reports if any host is close to the limit of possible volume attachments.

balancer

Type: cluster-wide

Tracks the state of the internal balancer. Currently unused.

bridge

Type: composite and per-host

Tracks the state of the storpool_bridge services in the cluster - if some of them are down, or if a bridge service is expected, but missing.

bridgestatus

Type: per remote location (cluster-wide)

Tracks the connections to remote clusters and their state. If there are no active connections, it will report all errors from all connections (note that there can be only one active connection at a time).

client

Type: composite and per-host

Tracks the state of the storpool_block services in the cluster. All nodes are expected to have it running:

  • If a service is down.

  • If a service is missing on a node that has a working network.

[SC] A super-critical alarm is raised if the configuration update for the service has failed (as this means there is a hard-to-detect network problem).

clusteruptime

Type: cluster-wide

Tracks if the cluster uptime is too low (if the cluster has flapped recently).

Currently, this alert generates only a warning. After a testing period, it should be promoted to super-critical.

controller

Type: composite and per-host

Tracks the state of the storpool_controller service in the cluster.

Note the following:

  • The controller service lives in the system.slice memory cgroup and can be killed because of memory pressure; this seems to be the most common case.

  • A controller could also be shown as “down” if it’s not connecting to the storpool_mgmt service. If the service is running, firewalls and connectivity are the obvious next things to check.

dematerialization

Type: cluster-wide

Reports the state of snapshot dematerialization in the cluster. For more information, see Management configuration.

disbalance

Type: cluster-wide

Currently disabled. Supposed to show the amount of disbalance of disk usage in the cluster.

disk

Type: per-disk

Reports information for a specific disk: device, serial number, last scrubbing time. Will report an alert on a missing drive, or on too much usage of objects, entries, or space.

diskentries

Type: per placement group

Reports if the available entries per disk have fallen below a certain threshold.

[SC] This alert goes super-critical if a drive is below a certain threshold.

Note that handling these types of problems includes:

  • Finding all volumes with too many writes and capping their IOPS.

  • Sometimes, in case of severely delayed operations, ejecting the offending drive to get some breathing room (note that this will probably quickly get another drive in trouble).

  • After the case is handled, investigating whether raising the number of entries or expanding the cluster is needed.

disk-missing-pg

Type: composite only

Reports if there are drives in the cluster that do not belong to any placement group.

disk-journals

Type: composite only

Reports if drives that had journals are back in the cluster without journals. Their lack can lead to an order of magnitude slower writes.

disk-pending-errors

Type: composite only

Reports if there are drives that have detected an incorrect checksum for a block of data and are currently in the process of recovering it. Recovery should happen immediately; the alert exists because a stalled recovery requires investigation.

disk-recoveries

Type: composite only

Reports if drives have a wrong configuration for maxLocalRecoveryRequests and maxRemoteRecoveryRequests. This is handled exclusively by StorPool support and should be removed when the related bugfix is deployed on all clusters.

disk-softeject

Type: composite only

Reports if drives are being softEjected (balanced out) of the cluster.

disk-test

Type: composite only

Reports if any drives are ejected and are pending tests or being tested. Will also report if drives have been tested too many times because of failures, which suggests that they should be replaced.

disk-to-test

Type: composite only

Reports if any drives are marked for testing but have not been tested. An explanation of the fix can be found here.

diskobjects

Type: per placement group

Reports if drives in the placement group have object usage above 70%. This is especially relevant for VolumeCare deployments or systems that create a large number of snapshots, as a lack of available objects on any of the disks prevents the creation of new volumes or snapshots.

disks

Type: composite

Reports composite information for all disks in the system. Will raise a super-critical alarm if drives on more than one server are missing.

[SC] This alert generates super-critical alarms if drives are missing on two or more nodes.

Note that there have been discussions that this super-critical alarm should be triggered only for data left with a single copy in triple-replicated setup or on data without any copies left. Because of faultsets and balancer overrides, this is impossible to do with the existing monitoring data.

disk-softeject

Type: cluster-wide

Reports which drives have been softEject-ed (with rebalancing), or are in the process of being softEject-ed.

diskspace

Type: per placement group

Reports if drives in a specific placement group have space usage above 90%.
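
As a rough illustration of this kind of check, the sketch below flags drives whose space usage exceeds a threshold. The input shape (a mapping of disk ID to used and total bytes) and the function name are assumptions made for this example and are not part of the StorPool API; only the 90% figure comes from the description above.

```python
from typing import Dict, List, Tuple

# Threshold taken from the description above; the data layout is hypothetical.
SPACE_THRESHOLD = 0.90


def disks_over_threshold(
    usage: Dict[int, Tuple[int, int]], threshold: float = SPACE_THRESHOLD
) -> List[int]:
    """Return the IDs of disks whose used/total ratio is above the threshold.

    `usage` maps a disk ID to a (used_bytes, total_bytes) pair.
    """
    flagged = []
    for disk_id, (used, total) in sorted(usage.items()):
        if total <= 0:
            # Skip entries that carry no usable capacity information.
            continue
        if used / total > threshold:
            flagged.append(disk_id)
    return flagged
```

The same function could be reused for the 70% object-usage limit described under diskobjects by passing threshold=0.70.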

features

Type: cluster-wide

Reports if specific StorPool features of newly deployed releases can be enabled or are blocked.

iscsi

Type: composite and per-host

Reports the state of iSCSI target services (storpool_iscsi) in the cluster.

iscsi-backup

Type: composite and per-host

Reports if any iSCSI initiators are connected via a “backup” link. This issue can happen if the iSCSI networks are swapped, and there is no cross-link between the two sides of the network.

iscsi-ctrlrs

Type: composite

Reports if any iSCSI portals (running instances of storpool_iscsi) are not added to any portal group, i.e. are unused.

latthreshold-cli

Type: cluster-wide

Reports abnormal latencies of I/O operations in the cluster performed by clients.

Note for this and the alert below: the initial starting point is always to run /usr/lib/storpool/latthreshold.py on a node with access to the management of the cluster, to see what is hanging. The follow-up depends on the common denominator for the problematic requests. More details available here.

The format of the alert is as follows:

[SC]CRITICAL:clients [(1 volume iolatmon:full:1) ] user ops [ 1 read, max 86352 ms ]

This means that on the node with ID 1, the volume iolatmon:full:1 had delayed user operations. One read operation was pending and had been in progress for around 86 seconds.
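
For anyone post-processing these notifications, the following is a minimal parsing sketch. The regular expression is inferred only from the sample lines shown in this document (for this alert and for latthreshold-disk), so it is an assumption about the format rather than a specification. It handles a single ID/volume pair and a single operation type; the function and field names are hypothetical.

```python
import re
from typing import Any, Dict, Optional

# Pattern inferred from the documented sample lines; not an official grammar.
ALERT_RE = re.compile(
    r"(?P<sc>\[SC\])?"
    r"(?P<severity>OK|WARNING|CRITICAL|UNKNOWN):"
    r"(?P<scope>\w+)\s+"
    r"\[\((?P<id>\d+) volume (?P<volume>[^)]+)\)\s*\]\s+"
    r"(?P<initiator>\w+) ops\s+"
    r"\[\s*(?P<count>\d+) (?P<op>\w+), max (?P<max_ms>\d+) ms\s*\]"
)


def parse_latthreshold(line: str) -> Optional[Dict[str, Any]]:
    """Extract the fields of a latthreshold alert line, or None if no match."""
    match = ALERT_RE.search(line)
    if match is None:
        return None
    return {
        "super_critical": match.group("sc") is not None,
        "severity": match.group("severity"),
        "scope": match.group("scope"),          # "clients" or "disks"
        "id": int(match.group("id")),           # node ID or drive ID
        "volume": match.group("volume"),
        "initiator": match.group("initiator"),  # e.g. "user"
        "count": int(match.group("count")),
        "operation": match.group("op"),         # e.g. "read"
        "max_ms": int(match.group("max_ms")),
    }
```

Lines that report several volumes or several operation types would need a more elaborate pattern than this one.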

[SC] This alert generates super-critical alarms on large latencies.

latthreshold-disk

Type: cluster-wide

Reports abnormal latencies of I/O operations in the cluster performed on disks (hardware).

The format of the alert is as follows:

latthreshold-disks is WARNING:disks [(256 volume iolatmon:full:1) ] user ops [ 1 read, max 20621 ms ]

This means that the drive with ID 256 has delayed operations for the volume iolatmon:full:1. The operations were user-initiated, one of them was in progress when the check ran, and the maximum delay for such an operation was 20621 ms (i.e. close to 20 seconds). The same can also be seen for write operations, and for more than one drive or volume.

[SC] This alert generates super-critical alarms on large latencies.

locations

Type: cluster-wide

Tracks the settings for the receive/transmit buffers of the storpool_bridge service for different locations. An alert here means that the buffers are too low (or unset), and should be corrected. More information can be found at Bridge throughput performance.

maintenances

Type: per-cluster

Reports if any nodes are in scheduled maintenance and the state of the maintenance (ongoing, expired, etc.). Non-supercritical alerts for nodes under maintenance are suppressed.

mgmt

Type: composite and per-host

Reports the state of storpool_mgmt services in the cluster. Will raise an alert if there’s just one management node in a cluster of more than one node.

[SC] Will raise a super-critical alert if no managements are available, as no data can be collected for the status of the cluster. Will also raise a super-critical alarm if some of the requests to the storpool_mgmt do not return data/time out.

Note

Not being able to collect data is different from not receiving any data from the cluster. For details, see monitoringdata.

mgmtConfig

Type: cluster-wide

Will report discrepancies in the internal cluster configuration. Handled exclusively by StorPool support.

monitoringdata

Type: cluster-wide

Reports if there was no monitoring data received for the last 15 minutes.

The main reasons for this alert are connectivity issues between the cluster and StorPool’s monitoring system. Those are easily detected by doing a traceroute from the active mgmt node in the cluster to mon1.storpool.com, and most of the time are related to DNS or routing.

needbalance

Type: cluster-wide

Reports if there are volumes or snapshots with placement constraint violations, which in turn would require a rebalancing of the cluster to be fixed. Currently handled by StorPool support.

There are multiple reasons for this alert to become active:

  • Volumes were created when a node was missing from a three-node cluster;

  • A drive/node has died and redundancy hasn’t been restored.

network

Type: composite and per-host

Reports the state of the network connectivity of a specific node to the cluster. It also reports if a node does not consider itself in the cluster or if it’s using a backup link.

Note that the network state is collected from the beacon on the active management node, thus it’s not fully representative of what each node sees.

onapp-bkp-vol

Type: cluster-wide

This will report if there are stale volumes used by OnApp to create backups. For more information see Stale backup volumes and snapshots cleanup procedure.

placementgroup-drives

Type: per placement group (cluster-wide)

This will report if there are drives mixed with and without SSD flag in the same placement group.

quorum

Type: cluster-wide

This will report if the expected number of voting nodes is in the cluster.

reformat

Type: cluster-wide

Internal use only. Reports if the entry usage on some drives warrants them to be reformatted with a larger amount.

relocator

Type: cluster-wide

Will report if the internal relocator service (part of storpool_mgmt) is running.

server

Type: composite and per-host

Reports the status of all storpool_server instances in the cluster. Will also report if there are drives in recovery on the instance. It will also report if a storpool_controller service is missing on the node with the server instance(s), or their versions differ.

Note that the versions of the storpool_server and storpool_controller services are checked against each other, as the controller needs to parse the memory structures of the server and there’s a possibility of mismatch between versions. In general, the easiest solution is a restart of the service with the lower version, taking into account what is installed, if other services need to be restarted, if this needs to be scheduled, etc.

snapfromremote

Type: cluster-wide

Reports if there are too many snapshots being transferred from a remote location, or too many snapshots of a single parent being transferred.

snaplen

Type: cluster-wide

Will report if there are snapshot chains above a certain limit. This is used to track if the chain shortening periodic task is working.

snaprepl

Type: cluster-wide

Reports if the replication factor of all snapshots in the cluster matches the replication factor of the template they’re in.

snaptargets

Type: cluster-wide

Reports if the disk placement for snapshots and their parents matches.

snaptmpl

Type: cluster-wide

Reports if all snapshots have a template.

tasks

Type: cluster-wide

Reports the tasks state in the cluster, including recoveries.

template

Type: per template type

Reports free space per template type. A template type is defined by the following 4 parameters: placeHead, placeAll, placeTail, and replication factor.

This is the standard way of seeing the available space for a specific template in the cluster.

Note

The reported free space is a conservative estimate. Disbalance or different-sized drives in a placement group can skew this considerably.
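
To make the notion of a template type concrete, the sketch below groups templates by the four parameters named above. The parameter names placeHead, placeAll, and placeTail come from the text; the input layout (a list of dictionaries with "name" and "replication" keys) is an assumption for illustration and may not match what the StorPool API returns.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping, NamedTuple


class TemplateType(NamedTuple):
    """The four parameters that define a template type."""
    place_head: str
    place_all: str
    place_tail: str
    replication: int


def group_by_template_type(
    templates: Iterable[Mapping[str, Any]]
) -> Dict[TemplateType, List[str]]:
    """Map each template type to the names of the templates that share it."""
    groups: Dict[TemplateType, List[str]] = defaultdict(list)
    for tmpl in templates:
        key = TemplateType(
            str(tmpl["placeHead"]),
            str(tmpl["placeAll"]),
            str(tmpl["placeTail"]),
            int(tmpl["replication"]),
        )
        groups[key].append(str(tmpl["name"]))
    return dict(groups)
```

Templates that land in the same group would share one free-space figure under this alert.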

totalvolumes

Type: cluster-wide

Reports if the number of volumes in a cluster is too low (fewer than two, as at least that many are expected for the required storpool_iolatmon service) or too high (more than 10000).
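
The bounds above can be expressed as a simple range check. This is only a sketch: the two limits come from the description, while the function name and the returned labels are made up for the example.

```python
# Limits as stated in the description of this alert.
MIN_VOLUMES = 2
MAX_VOLUMES = 10000


def total_volumes_state(count: int) -> str:
    """Classify a volume count as "too low", "too high", or "ok"."""
    if count < MIN_VOLUMES:
        return "too low"
    if count > MAX_VOLUMES:
        return "too high"
    return "ok"
```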

volumecare-local

Type: cluster-wide

Note

This and all other alerts that have volumecare in the name are related to the operations of VolumeCare.

Reports if there are any detected problems with the snapshot creation of VolumeCare. The following issues are detected:

  • No snapshots for a specific volume (that has a policy that requires them).

  • Newest snapshot for a specific volume is too old.

  • Stale (very old snapshot) detected for a volume.

  • Policy for a volume does not exist.

The first three are investigated in the logs of VolumeCare, the last is a configuration issue. The main cause is the service not running.

volumecare-policies

Type: cluster-wide

Reports if the volumes of a specific VM have different policies (which would prevent volumecare from making snapshots). This is resolved by correcting the policies of some of the volumes of the offending VM.

Note that this may also be caused by the volumes being on different templates (that have different volumecare policies).

volumecare-remote

Type: cluster-wide

Reports if there are any detected problems with the transfer of VolumeCare snapshots to a remote cluster. Issues detected:

  • No remote snapshots for a specific volume (that has a policy that requires them).

  • Newest remote snapshot for a specific volume is too old.

  • Stale (very old) remote snapshot detected for a volume.

The first two are investigated in the logs of volumecare, the last one is a configuration issue. The main cause is the service not running.

volumecare-svc

Type: cluster-wide

Currently reports only if there are volumecare snapshots in the target cluster and the monitoring has not been updated with the proper policy. Handled exclusively by StorPool support.

volumerepl

Type: cluster-wide

Reports if the replication factor of all volumes in the cluster matches the replication factor of the template they’re in.

volumes

Type: cluster-wide

Reports if there are volumes named in a way that is not expected for the cluster, to track left-over tests. Does not send alerts.

volumesizes

Type: cluster-wide

Reports if there are volumes whose size is not aligned to 4096 bytes. Does not send alerts.

Note

Such volumes are a problem for some tools: their unaligned size leads to the use of smaller block sizes and bad performance. One such example is ntfsclone.
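
An alignment check of this kind is a modulo operation on the size in bytes. The sketch below assumes a hypothetical mapping of volume name to size and also reports how many bytes each unaligned volume would need to grow to reach the next 4096-byte boundary.

```python
from typing import Dict

ALIGNMENT = 4096  # bytes, as stated in the description of this alert


def unaligned_volumes(sizes: Dict[str, int]) -> Dict[str, int]:
    """Return unaligned volumes mapped to the padding needed to align them.

    `sizes` maps a volume name to its size in bytes.
    """
    padding = {}
    for name, size in sizes.items():
        remainder = size % ALIGNMENT
        if remainder:
            padding[name] = ALIGNMENT - remainder
    return padding
```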

volumetargets

Type: cluster-wide

Reports if the disk placement for volumes and their parent snapshot matches.

volumetmpl

Type: cluster-wide

Reports if all volumes have a template.

Per-host alerts

apichecks

Type: composite (no single-node version is available)

This alert describes if a specific host can access the IP address in SP_API_HTTP_HOST, via ICMP Ping and TCP. This is required so most orchestrations and the storpool_controller service can communicate with the storpool_mgmt service.

configfile

Type: composite (no single-node version is available)

This check tracks if the /etc/storpool.conf file is different between the nodes in the cluster.
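
One way to spot such differences by hand is to compare a digest of the file from every node. The sketch below groups nodes by the SHA-256 of the file contents; how the contents are collected, and how StorPool's monitoring actually performs the comparison, are not covered here, and the function name is hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def group_nodes_by_config(configs: Dict[str, bytes]) -> List[List[str]]:
    """Group node names by the SHA-256 digest of their configuration file.

    `configs` maps a node name to the raw bytes of its /etc/storpool.conf.
    The largest group is returned first; more than one group means that
    the file differs between nodes.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for node, content in configs.items():
        groups[hashlib.sha256(content).hexdigest()].append(node)
    return sorted(
        (sorted(nodes) for nodes in groups.values()),
        key=lambda nodes: (-len(nodes), nodes),
    )
```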

configuration

Type: composite and single-node

Alerts on some common configuration problems in /etc/storpool.conf and /etc/storpool.conf.d/*:

  • Incorrect cluster ID;

  • Incorrect cluster name;

  • Difference between configured and loaded network configuration;

  • Missing API interface;

  • Existence of obsolete configuration options.

hw-ecc

Type: composite (no single-node version is available)

Reports if any node is using non-ECC memory (which is unsupported by StorPool).

initiators-iscsi

Type: composite (no single-node version is available)

This alert tracks the reachability of all connected iSCSI initiators from all target nodes. Its main purpose is to catch cases of partial connectivity loss.

These are resolved by fixing the underlying network issues.

kernels

Type: composite (no single-node version is available)

The alert shows the kernels on the node that do not have StorPool modules installed. If there’s a debug kernel installed in the node, the alert is CRITICAL, otherwise it’s a WARNING.

This alert is currently handled by StorPool support, by doing the installation of new kernel modules and/or scheduling an upgrade where necessary. In the case of a debug kernel, the customers are notified to remove it from the node, as StorPool does not support these kernels and they’re unsuited for use in production.

lldp

Type: composite (no single-node version is available)

The alert shows the sets of nodes for which the switch pair they’re connected to is swapped.

Example output:

[ sw100g1_(1)_4c:76:25:e8:49:40 sw100g2_(1)_3c:2c:30:38:43:80 => 3, 6 ][ sw100g2_(1)_3c:2c:30:38:43:80 sw100g1_(1)_4c:76:25:e8:49:40 => 20 ]

This means that nodes 3 and 6 have their first port connected to sw100g1_(1)_4c:76:25:e8:49:40 and second port to sw100g2_(1)_3c:2c:30:38:43:80, and for node 20 it’s reversed.

This is resolved by either:

  • Swapping the cables on the node.

  • Swapping the interfaces (SP_IFACE1_CFG and SP_IFACE2_CFG in /etc/storpool.conf).

Note

The switch naming is NAME_(CHASSIS_ID)_MAC, and all data is taken from the LLDP frames sent by the switch/router (i.e. the MAC is the MAC of the switch chassis).
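
The example output above can be split into (switch pair, node list) groups with a short regular expression. The pattern is inferred from that single example and is therefore an assumption about the format; the function name is hypothetical.

```python
import re
from typing import List, Tuple

# One group looks like: [ <first switch> <second switch> => <id>, <id> ]
GROUP_RE = re.compile(r"\[\s*(\S+)\s+(\S+)\s*=>\s*([\d,\s]+?)\s*\]")


def parse_lldp_groups(output: str) -> List[Tuple[Tuple[str, str], List[int]]]:
    """Parse the alert output into ((first_switch, second_switch), node_ids)."""
    groups = []
    for first, second, nodes in GROUP_RE.findall(output):
        node_ids = [int(part) for part in nodes.split(",") if part.strip()]
        groups.append(((first, second), node_ids))
    return groups
```

Comparing the switch order between groups then shows which nodes are cabled the other way around.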

rootcgprocess

Type: composite (no single-node version is available)

This alert shows all processes running in the root cgroup. For more information on cgroups, see Control groups.

The importance of this alert is that with processes in the root cgroup, it’s possible that the OOM killer will trigger because of them and kill random processes, as the ones in the root cgroup are not constrained in their memory usage.

portals-iscsi

Type: composite (no single-node version is available)

This alert tracks the reachability of all iSCSI portals on all nodes from all other nodes. Its main purpose is to catch cases of partial connectivity loss.

These are resolved by fixing the underlying network issues.

status

Type: composite (no single-node version is available)

This alert means that the status.json file generated by storpool_stat and transmitted by storpool_abrtsync has not been updated for a specific node. This leads to lack of per-host data for the host in question.

This is usually resolved by verifying that both storpool_stat and storpool_abrtsync are running and are able to send data.

Metrics-based alerts

cgroups

Type: composite and single-node

This alert tracks the memory usage of the StorPool memory cgroup (storpool.slice) and alerts if it goes above 80%. This gives advance warning if the memory needs to be expanded, if there are bugs (like memory leaks), or if there’s something running in the cgroup that shouldn’t be there.

The alert hasn’t been extended to customer cgroups like machine.slice, as customer usage/over-provisioning vary wildly.

cpustat

Type: composite and single-node

This alert tracks the run-queue/wait time on CPUs used for StorPool services and notifies if it goes above 1.8. This is used to see if more than one service gets put on the same CPU thread, or if there are other processes/kernel tasks scheduled in the StorPool cpuset, that can result in latencies or similar issues.

The alert hasn’t been extended to customer cgroups like machine.slice, as customer usage/over-provisioning vary wildly.

dataholes

Type: composite (no single-node version is available)

This alert tracks the amount of dropped packets per interface. It’s calculated based on the expected packets to be received for a request, i.e. when a “data hole” occurs, a gap in the received packet stream.

There can be different reasons for the alert:

  • CRC errors on an interface;

  • Bad cables;

  • Cross-switch link problems (in the case of a network with more than two switches).

Note

This error is a problem in receiving data. So an error on a lot of nodes might mean a problem with the sending from a specific node, i.e. the problem will always warrant investigation of all nodes/NICs in the cluster.

diskerrors

Type: composite (no single-node version is available)

This alert tracks the amount of new disk errors (their rate) for specific disks and alerts on more than 10 per hour. This is useful to have an advance warning which drives are expected to fail.

Note

A disk error in this context is a read from a drive that returned data with incorrect checksum.
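
A rate like this can be derived from two samples of a cumulative error counter. The sketch below assumes such a counter exists and is sampled together with a timestamp in seconds; the sample layout and function names are hypothetical, and only the limit of 10 errors per hour comes from the text.

```python
from typing import Tuple

Sample = Tuple[float, int]  # (timestamp in seconds, cumulative error count)


def errors_per_hour(prev: Sample, curr: Sample) -> float:
    """Return the rate of new errors between two samples, scaled to one hour."""
    (t0, c0), (t1, c1) = prev, curr
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    # A counter that went backwards (e.g. after a restart) counts as no new errors.
    new_errors = max(c1 - c0, 0)
    return new_errors * 3600.0 / elapsed


def disk_errors_alert(prev: Sample, curr: Sample, limit: float = 10.0) -> bool:
    """True if the error rate is above the limit (10 per hour in the text)."""
    return errors_per_hour(prev, curr) > limit
```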

io-latencies

Type: per template type

Reports if there is a significant increase between the mean of IO latencies for the last 24 hours and the typical IO latencies for the previous 28 days. A template type is defined by the following 4 parameters: placeHead, placeAll, placeTail, and replication factor.
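
The comparison can be sketched as follows. The text does not say how "typical" or "significant" are defined, so this example assumes the median of the 28-day samples as the typical value and a factor of 2 as the threshold; both are placeholders rather than the values used by StorPool's monitoring.

```python
from statistics import mean, median
from typing import Sequence


def latency_increased(
    recent_24h: Sequence[float],
    baseline_28d: Sequence[float],
    factor: float = 2.0,  # assumed threshold, not taken from the documentation
) -> bool:
    """True if the recent mean latency exceeds factor times the typical one."""
    if not recent_24h or not baseline_28d:
        raise ValueError("both sample sets must be non-empty")
    recent = mean(recent_24h)
    typical = median(baseline_28d)  # "typical" interpreted as the median
    return recent > factor * typical
```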

iolatmon

Type: cluster-wide

This alert tracks the state of the storpool_iolatmon service, which does read/write operations on test volumes to probe the cluster’s health. The alert checks for the existence of the test volumes and if they have any traffic.

A future update is being discussed to also track the latencies observed.

service-latency

Type: cluster-wide

This alert tracks the latencies for disk read/write operations and the latencies for StorPool read/write requests.

It reports if the latencies for StorPool operations become significantly higher than the disk operation latencies.

Some of the possible reasons for this issue are:

  • Storage network connectivity issues between nodes;

  • One or more of the StorPool services experiencing a high load.

service-load

Type: cluster-wide

This alert tracks if any StorPool service seems to be overloaded.

One specific case is the storpool_iscsi service (iSCSI controller), where an overload most probably means that it would benefit from a rebalance of its exported targets.

stats

Type: composite and single-node

This alert shows if metrics data has been received for a host, if it’s delayed, or if it’s in the future.

Possible reasons for the alert:

  • Lack of data might mean that there is either an Internet connectivity/DNS problem with the node, the storpool_stat service has stopped, or there is a problem with the OS root drive (either full or remounted read-only).

  • Data in the future means wrong clock and is mostly resolved by fixing the clock with ntpdate and checking the chrony or ntpd configuration.

  • Delayed data might mean that the node is either catching up, its clock is wrong (see the previous point), or it is going into the state described in the first point.
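
The three conditions above can be told apart by comparing the timestamp of the newest sample with the current time. The sketch below is illustrative only: the tolerances (five minutes of delay, one minute of clock skew) and all names are assumptions, since the document does not state the actual limits.

```python
from datetime import datetime, timedelta
from typing import Optional


def classify_metrics(
    last_sample: Optional[datetime],
    now: datetime,
    max_delay: timedelta = timedelta(minutes=5),         # assumed tolerance
    clock_tolerance: timedelta = timedelta(seconds=60),  # assumed tolerance
) -> str:
    """Return "missing", "future", "delayed", or "ok" for a host's metrics."""
    if last_sample is None:
        return "missing"
    if last_sample - now > clock_tolerance:
        # The newest sample is ahead of our clock: suspect the node's clock.
        return "future"
    if now - last_sample > max_delay:
        return "delayed"
    return "ok"
```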

status

Type: composite and single-node

This alert reports if status.json has been updated/transmitted by storpool_abrtsync to reports.storpool.com, which is the current way of receiving per-host checks.

Others

billingdata

Type: cluster-wide

Reports if billing data (daily report) has been received from the cluster. Handled exclusively by StorPool support.