StorPool’s monitoring system alerts & description

1. Introduction

StorPool monitors all deployments and notifies customers about issues with their clusters. This document describes the different types of notifications and their meaning, and gives pointers to other parts of the documentation for further debugging and resolving the problems.

The alerts are split in the following broad categories:

  • Cluster status alerts, based on data from the StorPool API;

  • Per-host alerts, based on data from the hosts;

  • Metrics-based alerts, based on metrics data from the hosts;

  • Others.

Alerts can also be single-node, composite, or cluster-wide:

  • A single-node alert describes a single service on a single node, for example an SSD/HDD drive;

  • A composite alert describes all alerts for nodes of a particular type, for example all disks in a cluster;

  • A cluster-wide alert is about the cluster itself, for example if the relocator service is not running.

The severities of the alerts are:

  • OK (no alarm);
    • Seeing an alert for this state in most cases is a result of a transition from either WARNING or CRITICAL.

  • WARNING (problems can be expected);

  • CRITICAL (problems imminent or already happening);

  • super-critical (critical with [SC] tag) - the cluster is either down or in a similar state. This type of alert generates an automated call to StorPool support;

  • UNKNOWN (fresh information has not been received about the particular service or cluster).

There are currently two supported channels for sending alarms: email and Slack. For Slack, it’s possible to push directly to the customer’s Slack (with access provided by them), to a channel shared between the customer’s organization and StorPool, or to a channel provided for the customer in StorPool’s Slack.

Information on how to implement these checks independently of StorPool’s monitoring system can be found in Common monitoring via the StorPool API.
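An independent check typically boils down to polling the StorPool HTTP API and mapping the results onto the severities above. The sketch below assumes the usual API conventions (the /ctrl/1.0/ path and the Storpool v1 Authorization header, with host, port, and token normally taken from /etc/storpool.conf); verify them against your installation.

```python
import json
import urllib.request

# Severity levels used by the monitoring system, ordered by urgency;
# UNKNOWN is last so that missing data dominates a composite result.
SEVERITIES = ("OK", "WARNING", "CRITICAL", "UNKNOWN")

def api_get(host, port, token, query):
    """Fetch one StorPool API query (e.g. 'ServicesList') as a dict.

    The path and Authorization header follow the public API
    conventions; check them against your installation.
    """
    req = urllib.request.Request(
        "http://%s:%s/ctrl/1.0/%s" % (host, port, query),
        headers={"Authorization": "Storpool v1:%s" % token},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["data"]

def worst_severity(levels):
    """Collapse a list of per-check severities into a single one."""
    if not levels:
        return "UNKNOWN"
    return max(levels, key=SEVERITIES.index)
```

For example, worst_severity(["OK", "WARNING"]) yields "WARNING", which is one plausible way a composite alert collapses into a single severity.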

2. Cluster status alerts

2.1. agscore-entries

Type: per placement group (cluster-wide)

Reports if the aggregate score for entries of multiple drives is above a certain threshold (different for hard drives and SSD/NVMes).

Note

Both this and the alert below are related either to very full drives, or to too many writes/trims happening simultaneously over longer periods of time. The alert by itself is a pointer that further investigation is needed into why the drives behave this way, be it overload, general device slowness, higher loads in the cluster, etc.

2.2. agscore-space

Type: per placement group (cluster-wide)

Reports if the aggregate score for space usage of multiple drives is above a certain threshold (different for hard drives and SSD/NVMes).

2.3. balancer

Type: cluster-wide

Tracks the state of the internal balancer. Currently unused.

2.4. bridge

Type: composite and per-host

Tracks the state of the storpool_bridge services in the cluster - if some of them are down, or if a bridge service is expected, but missing.

2.5. client

Type: composite and per-host

Tracks the state of the storpool_block services in the cluster. All nodes are expected to have it running. An alert is raised:

  • if a service is down;

  • if a service is missing on a node that has a working network;

[SC] A super-critical alarm is raised if the configuration update for the service has failed (as this means there is a hard-to-detect network problem).
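A rough version of this check can be built from the ServicesList API query. The response field names used below (clients, status) are assumptions about that query’s shape and should be checked against your API version:

```python
def check_clients(services, expected_ids):
    """Classify the storpool_block (client) state per node.

    'services' is assumed to be the ServicesList response data and
    'expected_ids' the client IDs that should exist (nodes with a
    working network). Returns {client ID: 'ok' | 'down' | 'missing'}.
    """
    clients = services.get("clients", {})
    result = {}
    for cid in expected_ids:
        entry = clients.get(str(cid))
        if entry is None:
            result[cid] = "missing"      # expected but not present
        elif entry.get("status") != "running":
            result[cid] = "down"         # present but not running
        else:
            result[cid] = "ok"
    return result
```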

2.6. controller

Type: composite and per-host

Tracks the state of the storpool_controller service in the cluster.

Note

The controller service lives in the system.slice memory cgroup and can be killed because of memory pressure; this seems to be the most common case.

A controller could also be shown as “down” if it’s not connecting to the storpool_mgmt service. If the service is running, firewalls and connectivity are the obvious next things to check.

2.7. dematerialization

Type: cluster-wide

Reports the state of snapshot dematerialization in the cluster. For more information, see the User Guides.

2.8. disbalance

Type: cluster-wide

Currently disabled. Intended to show the amount of disk usage disbalance in the cluster.

2.9. disk

Type: per-disk

Reports information for a specific disk: device, serial number, last scrubbing time. Will report an alert on a missing drive, or on excessive usage of objects, entries, or space.

2.10. diskentries

Type: per placement group

Reports if the available entries per disk have fallen below a certain threshold.

[SC] This alert goes super-critical if a drive falls below a certain threshold.

Note

Handling these types of problems includes:

  • finding all volumes that are writing too much and capping their number of IOPS;

  • sometimes, in case of severely delayed operations, ejecting the offending drive to get some breathing room (note that this will probably quickly get another drive in trouble);

  • after the case is handled, investigating whether raising the number of entries or expanding the cluster is needed.
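For independent monitoring, an approximation of the entries check can be derived from the DisksList API query. The statistic names used below (entriesAllocated, entriesCount) are assumptions and the threshold is illustrative, not the one the monitoring system actually uses:

```python
def low_entry_disks(disks, warn_pct=90.0):
    """Return disk IDs whose entry usage exceeds warn_pct percent.

    'disks' maps disk ID -> per-disk statistics as returned by
    DisksList; disks without statistics (ejected/missing) are skipped.
    """
    flagged = []
    for disk_id, stats in disks.items():
        total = stats.get("entriesCount", 0)
        used = stats.get("entriesAllocated", 0)
        if total and 100.0 * used / total > warn_pct:
            flagged.append(disk_id)
    return flagged
```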

2.11. disk-journals

Type: composite only

Reports if drives that had journals have come back into the cluster without them. The lack of journals can lead to an order of magnitude slower writes.

2.12. disk-recoveries

Type: composite only

Reports if drives have a wrong configuration for maxLocalRecoveryRequests and maxRemoteRecoveryRequests. This is handled exclusively by StorPool support and should be removed when the related bugfix is deployed on all clusters.

2.13. disk-softeject

Type: composite only

Reports if drives are being softEjected (balanced out of the cluster).

2.14. disk-test

Type: composite only

Reports if any drives are ejected and are pending tests or being tested. Will also report if drives have been tested too many times because of failures, which suggests they should be replaced.

2.15. diskobjects

Type: per placement group

Reports if drives in the placement group have object usage above 70%. This is especially relevant for VolumeCare deployments or systems that create a large number of snapshots, as a lack of available objects on any of the disks prevents the creation of new volumes or snapshots.

2.16. disks

Type: composite

Reports composite information for all disks in the system.

[SC] This alert generates super-critical alarms if drives are missing on two or more nodes.

Note

There have been discussions that this super-critical alarm should be triggered only for data left with a single copy in triple-replicated setup or on data without any copies left. Because of faultsets and balancer overrides, this is impossible to do with the existing monitoring data.

2.17. disk-softeject

Type: cluster-wide

Reports which drives have been softEject-ed (with rebalancing), or are in the process of being softEject-ed.

2.18. diskspace

Type: per placement group

Reports if drives in a specific placement group have space usage above 90%.
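Both this check and the diskobjects one above reduce to the same used/total ratio over per-disk statistics. A generic sketch, with the statistic names treated as unverified assumptions:

```python
def usage_pct(used, total):
    """Percent usage, tolerating empty or uninitialized disks."""
    return 100.0 * used / total if total else 0.0

def over_threshold(disks, used_key, total_key, limit_pct):
    """Disk IDs whose used/total ratio exceeds limit_pct percent."""
    return [disk_id for disk_id, stats in disks.items()
            if usage_pct(stats.get(used_key, 0),
                         stats.get(total_key, 0)) > limit_pct]

# diskobjects: object usage above 70%; diskspace: space usage above 90%.
# The statistic names here are assumed, not verified:
#   over_threshold(disks, "objectsAllocated", "objectsCount", 70.0)
#   over_threshold(disks, "agAllocated", "agCount", 90.0)
```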

2.19. iscsi

Type: composite and per-host

Reports the state of iSCSI target services (storpool_iscsi) in the cluster.

2.20. latthreshold-cli

Type: cluster-wide

Reports abnormal latencies of I/O operations in the cluster performed by clients.

[SC] This alert generates super-critical alarms on large latencies.

Note

For this and the alert below, the initial starting point is always to run /usr/lib/storpool/latthreshold.py on a management node of the cluster, to see what is hanging. The follow-up depends on the common denominator of the problematic requests. More details are available here.

2.21. latthreshold-disk

Type: cluster-wide

Reports abnormal latencies of I/O operations in the cluster performed on disks (hardware).

[SC] This alert generates super-critical alarms on large latencies.

2.22. maintenances

Type: per-cluster

Reports if any nodes are in scheduled maintenance, and the state of the maintenance (ongoing, expired, etc.). Non-super-critical alerts for nodes under maintenance are suppressed.

2.23. mgmt

Type: composite and per-host

Reports the state of storpool_mgmt services in the cluster. Will raise an alert if there’s just one management node in a cluster of more than one node.

[SC] Will raise a super-critical alert if no management services are available, as no data can be collected about the status of the cluster. Will also raise a super-critical alarm if some of the requests to storpool_mgmt time out or do not return data.

Note

Not being able to collect data is different from not receiving any data from the cluster. See monitoringdata below.

2.24. mgmtConfig

Type: cluster-wide

Will report discrepancies in the internal cluster configuration. Handled exclusively by StorPool support.

2.25. monitoringdata

Type: cluster-wide

Reports if there was no monitoring data received for the last 15 minutes.

The main reasons for this alert are connectivity issues between the cluster and StorPool’s monitoring system. These are easily detected by running a traceroute from the active mgmt node in the cluster to mon1.storpool.com, and most of the time are related to DNS or routing.
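A minimal local probe along the same lines can separate DNS failures from routing ones. The diagnose() mapping below is a simplification, and any port used for a TCP probe would be a placeholder rather than a documented endpoint:

```python
import socket

def dns_resolves(hostname):
    """True if the monitoring host resolves; failure points at DNS."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

def diagnose(dns_ok, tcp_ok):
    """Map the two probe results to the likely cause of the alert."""
    if not dns_ok:
        return "dns"      # resolver or DNS server misconfiguration
    if not tcp_ok:
        return "routing"  # name resolves but host is unreachable
    return "ok"

# Typical use (requires network access):
#   dns_ok = dns_resolves("mon1.storpool.com")
```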

2.26. needbalance

Type: cluster-wide

Reports if there are volumes or snapshots with placement constraint violations, which in turn would require a rebalancing of the cluster to be fixed. Currently handled by StorPool support.

There are multiple reasons for this alert to become active:

  • Volumes were created when a node was missing from a three-node cluster;

  • A drive/node has died and redundancy hasn’t been restored;

2.27. network

Type: composite and per-host

Reports the state of a specific node’s network connectivity to the cluster. It will also report if a node does not consider itself part of the cluster or if it’s using a backup link.

Note

The network state is collected from the beacon on the active management node, thus it’s not fully representative of what each node sees.

2.28. onapp-bkp-vol

Type: cluster-wide

This will report if there are stale volumes used by OnApp to create backups. For more information see Stale backup volumes and snapshots cleanup procedure.

2.29. reformat

Type: cluster-wide

Internal use only. Reports if the entry usage on some drives warrants reformatting them with a larger number of entries.

2.30. relocator

Type: cluster-wide

Reports whether the internal relocator service (part of storpool_mgmt) is running.

2.31. server

Type: composite and per-host

Reports the status of all storpool_server instances in the cluster. Will also report if there are drives in recovery on an instance, if a storpool_controller service is missing on a node with server instance(s), or if their versions differ.

Note

The versions of the storpool_server and storpool_controller services are checked against each other, as the controller needs to parse the memory structures of the server and there’s a possibility of a mismatch between versions. In general, the easiest solution is a restart of the service with the lower version, taking into account what is installed, whether other services need to be restarted, whether this needs to be scheduled, etc.

2.32. snapfromremote

Type: cluster-wide

Reports if there are too many snapshots being transferred from a remote location, or too many snapshots of a single parent being transferred.

2.33. snaplen

Type: cluster-wide

Will report if there are snapshot chains longer than a certain limit. This is used to track whether the periodic chain-shortening task is working.

2.34. snaprepl

Type: cluster-wide

Reports if the replication factor of all snapshots in the cluster matches the replication factor of the template they’re in.

2.35. snaptargets

Type: cluster-wide

Reports if the disk placement for snapshots and their parents matches.

2.36. snaptmpl

Type: cluster-wide

Reports if all snapshots have a template.

2.37. tasks

Type: cluster-wide

Reports the tasks state in the cluster, including recoveries.

2.38. template

Type: per template type

Reports free space per template type. A template type is defined by the following 4 parameters: placeHead, placeAll, placeTail and replication factor.

This is the standard way of seeing the available space for a specific template in the cluster.

Note

The reported free space is a conservative estimate. Disbalance or different-sized drives in a placement group can skew this considerably.

2.39. totalvolumes

Type: cluster-wide

Reports if the number of volumes in a cluster is too low (fewer than two, as at least that many are expected for the required storpool_iolatmon service) or too high (more than 10000).

2.40. volumecare-local

Type: cluster-wide

Note

This alert and all others that have volumecare in the name are related to the operations of VolumeCare.

Reports if there are any detected problems with the snapshot creation of volumecare. The following issues are detected:

  • No snapshots for a specific volume (that has a policy that requires them)

  • Newest snapshot for a specific volume is too old

  • Stale (very old snapshot) detected for a volume

  • Policy for a volume does not exist

The first three are investigated in the logs of volumecare; the last is a configuration issue. The main cause is the service not running.

2.41. volumecare-policies

Type: cluster-wide

Reports if the volumes of a specific VM have different policies (which would prevent volumecare from making snapshots). This is resolved by correcting the policies of some of the volumes of the offending VM.

Note that this may also be caused by the volumes being on different templates (that have different volumecare policies).

2.42. volumecare-remote

Type: cluster-wide

Reports if there are any detected problems with the transfer of volumecare snapshots to a remote cluster. Issues detected:

  • No remote snapshots for a specific volume (that has a policy that requires them)

  • Newest remote snapshot for a specific volume is too old

  • Stale (very old) remote snapshot detected for a volume

The first two are investigated in the logs of volumecare; the last one is a configuration issue. The main cause is the service not running.

2.43. volumecare-svc

Type: cluster-wide

Currently reports only if there are volumecare snapshots in the target cluster and the monitoring has not been updated with the proper policy. Handled exclusively by StorPool support.

2.44. volumerepl

Type: cluster-wide

Reports if the replication factor of all volumes in the cluster matches the replication factor of the template they’re in.

2.45. volumes

Type: cluster-wide

Reports if there are volumes named in a way that is not expected for the cluster, to track left-over tests. Does not send alerts.

2.46. volumesizes

Type: cluster-wide

Reports if there are volumes whose size is not aligned to 4096 bytes. Does not send alerts.

Note

Such volumes are a problem for some tools: their unaligned size leads to the use of smaller block sizes and bad performance. One such example is ntfsclone.

2.47. volumetargets

Type: cluster-wide

Reports if the disk placement for volumes and their parent snapshot matches.

2.48. volumetmpl

Type: cluster-wide

Reports if all volumes have a template.

3. Per-host alerts

3.1. apichecks

Type: composite (no single-node version is available)

Reports whether a specific host can access the IP address in SP_API_HTTP_HOST via ICMP ping and TCP. This is required so that most orchestrations and the storpool_controller service can communicate with the storpool_mgmt service.
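The TCP half of this probe can be reproduced with a plain socket connection; ICMP ping requires raw sockets (root) and is easier done by shelling out to ping. A sketch, with the host and port left to the caller:

```python
import socket

def tcp_reachable(host, port, timeout=3.0):
    """True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Typical use, with values taken from /etc/storpool.conf:
#   tcp_reachable(sp_api_http_host, sp_api_http_port)
```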

3.2. kernels

Type: composite (no single-node version is available)

The alert shows the kernels on the node that do not have StorPool modules installed.

This alert is currently handled by StorPool support, by doing the installation of new kernel modules and/or scheduling an upgrade where necessary.

3.3. rootcgprocess

Type: composite (no single-node version is available)

This alert shows all processes running in the root cgroup. For more information on cgroups, see Control Groups.

Note

The importance of this alert is that processes in the root cgroup are not constrained in their memory usage, so under memory pressure the OOM killer may trigger because of them and kill random processes.

4. Metrics-based alerts

4.1. cgroups

Type: composite and single-node

This alert tracks the memory usage of the StorPool memory cgroup (storpool.slice) and alerts if it goes above 80%. This gives advance warning if the memory needs to be expanded, if there are bugs (like memory leaks), or if there’s something running in the cgroup that shouldn’t be there.

The alert hasn’t been extended to customer cgroups like machine.slice, as customer usage/over-provisioning vary wildly.
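A self-hosted version of this check only needs two numbers from the cgroup filesystem. The paths below follow the cgroup v1 layout (under cgroup v2 the files are memory.current and memory.max instead) and are assumptions to verify per host:

```python
def read_first_line(path):
    """Read one value from a cgroup control file."""
    with open(path) as f:
        return f.readline().strip()

def cgroup_mem_pct(usage_bytes, limit_bytes):
    """Memory usage of a cgroup as a percentage of its limit."""
    return 100.0 * usage_bytes / limit_bytes if limit_bytes else 0.0

# cgroup v1 layout (assumed; verify per host):
#   base = "/sys/fs/cgroup/memory/storpool.slice"
#   usage = int(read_first_line(base + "/memory.usage_in_bytes"))
#   limit = int(read_first_line(base + "/memory.limit_in_bytes"))
#   alert = cgroup_mem_pct(usage, limit) > 80.0
```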

4.2. cpustat

Type: composite and single-node

This alert tracks the run-queue/wait time on CPUs used for StorPool services and notifies if it goes above 1.8. This is used to see if more than one service gets put on the same CPU thread, or if there are other processes/kernel tasks scheduled in the StorPool cpuset, that can result in latencies or similar issues.

The alert hasn’t been extended to customer cgroups like machine.slice, as customer usage/over-provisioning vary wildly.

4.3. dataholes

Type: composite (no single-node version is available)

This alert tracks the number of dropped packets per interface. It’s calculated from the packets expected to be received for a request, i.e. it detects when a “data hole” occurs: a gap in the received packet stream.

There can be different reasons for the alert:

  • CRC errors on an interface;

  • Bad cables;

  • cross-switch link problems (in the case of a network with more than two switches);

Note

This error is a problem in receiving data, but an error on a lot of nodes might mean a problem with the sending side on a specific node; the problem therefore always warrants investigation of all nodes/NICs in the cluster.

4.4. diskerrors

Type: composite (no single-node version is available)

This alert tracks the rate of new disk errors for specific disks and alerts on more than 10 per hour. This is useful as advance warning of which drives are expected to fail.

Note

A disk error in this context is a read from a drive that returned data with incorrect checksum.

4.5. iolatmon

Type: cluster-wide

This alert tracks the state of the storpool_iolatmon service, which does read/write operations on test volumes to probe the cluster’s health. The alert checks for the existence of the test volumes and whether they have any traffic.

A future update is being discussed to also track the latencies observed.

4.6. stats

Type: composite and single-node

This alert shows if metrics data has been received for a host, if it’s delayed, or if it’s in the future.

Possible reasons for the alert:

  • Lack of data might mean that there is either an Internet connectivity/DNS problem with the node, the storpool_stat service has stopped, or there’s a problem with the OS root drive (either full or remounted read-only);

  • Data in the future means wrong clock and is mostly resolved by fixing the clock with ntpdate and checking the chrony or ntpd configuration;

  • Delayed data might mean that the node is either catching up, its clock is wrong (see the previous point), or it is going into the state described in the first point.
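The classification above can be sketched as a small pure function over the latest metrics timestamp; the lag and skew thresholds below are illustrative, not the monitoring system’s actual values:

```python
def classify_stats(last_ts, now, max_lag=900.0, max_skew=60.0):
    """Classify a host's newest metrics timestamp (epoch seconds).

    max_lag and max_skew are illustrative thresholds, not the values
    the real monitoring system uses.
    """
    if last_ts is None:
        return "no-data"
    if last_ts > now + max_skew:
        return "future"   # host clock is ahead: fix NTP/chrony
    if now - last_ts > max_lag:
        return "delayed"  # catching up, wrong clock, or stopped
    return "ok"
```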

4.7. status

Type: composite and single-node

This alert reports if status.json has been updated/transmitted by storpool_abrtsync to reports.storpool.com, which is the current way of receiving per-host checks.

5. Others

5.1. billingdata

Type: cluster-wide

Reports if billing data (daily report) has been received from the cluster. Handled exclusively by StorPool support.