Monitoring alerts
Introduction
StorPool monitors all deployments and notifies customers about issues with their clusters. This document describes the different types of notifications and their meaning, and points to other parts of the documentation for further debugging and resolving the problems.
Alert categories
The alerts are split into the following broad categories:
Cluster status alerts, based on data from the StorPool API
Per-host alerts, based on data from the hosts
Metrics-based alerts, based on metrics data from the hosts
Others
Alerts can also be composite, single-node, or cluster-wide:
A single-node alert describes a single service on a single node, for example an SSD/HDD drive.
A composite alert describes all alerts for nodes of a particular type, for example all disks in a cluster.
A cluster-wide alert is about the cluster itself, for example if the relocator service is not running.
Severity levels
The severity levels of the alerts are:
OK: No alarm.
Seeing an alert for this state in most cases is a result of a transition from either WARNING or CRITICAL.
WARNING: Problems can be expected.
CRITICAL: Problems imminent or already happening.
Super-critical (critical with [SC] tag): The cluster is either down or in a similar state. This type of alert generates an automated call to StorPool support.
UNKNOWN: Fresh information has not been received about the particular service or cluster.
There are currently two supported channels for sending alarms: email and Slack. For Slack, it’s possible to push directly to the customer’s Slack (with access provided by them), to a shared channel between the customer’s organization and StorPool, or to a channel provided in StorPool’s Slack for the customer.
Information on how to implement these checks independently of StorPool’s monitoring system can be found in Common monitoring via the StorPool API.
Cluster status alerts
agscore-entries
Type: per placement group (cluster-wide)
Reports if the aggregate score for entries of multiple drives is above a certain threshold (different for hard drives and SSD/NVMes).
Note
Both this and the alert below are related to either very full drives, or too many writes/trims happening simultaneously for long periods of time. The alert by itself indicates that further investigation is needed into why the drives behave this way, be it from overload, general device slowness, higher loads in the cluster, and so on.
agscore-space
Type: per placement group (cluster-wide)
Reports if the aggregate score for space usage of multiple drives is above a certain threshold (different for hard drives and SSD/NVMes).
attachments-count
Type: cluster-wide
Reports if any host is close to the limit of possible volume attachments.
balancer
Type: cluster-wide
Tracks the state of the internal balancer. Currently unused.
bridge
Type: composite and per-host
Tracks the state of the storpool_bridge services in the cluster: if some of them are down, or if a bridge service is expected but missing.
bridgestatus
Type: per remote location (cluster-wide)
Tracks the connections to remote clusters and their state. If there are no active connections, it will report all errors from all connections (note that there can be only one active connection at a time).
client
Type: composite and per-host
Tracks the state of the storpool_block services in the cluster. All nodes are expected to have it running. An alert is raised in the following cases:
If a service is down.
If a service is missing on a node that has a working network.
[SC] A super-critical alarm is raised if the configuration update for the service has failed (as this means there is a hard-to-detect network problem).
clusteruptime
Type: cluster-wide
Tracks if the cluster uptime is too low (if the cluster has flapped recently).
Currently, this alert generates only a warning. After a testing period, it should be promoted to super-critical.
controller
Type: composite and per-host
Tracks the state of the storpool_controller service in the cluster.
Note the following:
The controller service lives in the system.slice memory cgroup and can be killed because of memory pressure; this seems to be the most common case.
A controller could also be shown as “down” if it’s not connecting to the storpool_mgmt service. If the service is running, firewalls and connectivity are the obvious next things to check.
dematerialization
Type: cluster-wide
Reports the state of snapshot dematerialization in the cluster. For more information, see Management configuration.
disbalance
Type: cluster-wide
Currently disabled. Supposed to show the amount of disbalance of disk usage in the cluster.
disk
Type: per-disk
Reports information for a specific disk: device, serial number, last scrubbing time. Will report an alert on a missing drive, or on too much usage of objects, entries, or space.
diskentries
Type: per placement group
Reports if the available entries per disk have fallen below a certain threshold.
[SC] This alert goes super-critical if a drive is below a certain threshold.
Note that handling these types of problems includes:
Finding all volumes that write too much and capping their IOPS.
Sometimes, in case of severely delayed operations, ejecting the offending drive to get some breathing room (note that this will probably quickly get another drive in trouble).
After the case is handled, investigating whether raising the number of entries or expanding the cluster is needed.
disk-missing-pg
Type: composite only
Reports if there are drives in the cluster that do not belong to any placement group.
disk-journals
Type: composite only
Reports if drives that had journals are back in the cluster without journals. Their lack can lead to an order of magnitude slower writes.
disk-pending-errors
Type: composite only
Reports if there are drives that have detected incorrect checksum for a block of data and are currently in the process of recovering it. The alert exists as this should happen immediately and if it stalls, it requires investigation.
disk-recoveries
Type: composite only
Reports if drives have a wrong configuration for maxLocalRecoveryRequests and maxRemoteRecoveryRequests (see Local and remote recovery). This is handled exclusively by StorPool support and should be removed when the related bugfix is deployed on all clusters.
disk-softeject
Type: composite only
Reports if drives are being softEjected (balanced out) of the cluster.
disk-test
Type: composite only
Reports if any drives are ejected and are pending tests or being tested. Will also report if drives have been tested too many times because of failures, which suggests that they should be replaced.
disk-to-test
Type: composite only
Reports if any drives are marked for testing, but have not been tested. Fix explanation can be found in Testing a StorPool drive.
diskobjects
Type: per placement group
Reports if drives in the placement group have object usage above 70%. This is especially relevant for VolumeCare deployments or systems that create a large number of snapshots, as a lack of available objects on any of the disks prevents the creation of new volumes or snapshots.
disks
Type: composite
Reports composite information for all disks in the system. Will raise a super-critical alarm if drives on more than one server are missing.
[SC] This alert generates super-critical alarms if drives are missing on two or more nodes.
Note that there have been discussions that this super-critical alarm should be triggered only for data left with a single copy in triple-replicated setup or on data without any copies left. Because of faultsets and balancer overrides, this is impossible to do with the existing monitoring data.
disk-softeject
Type: cluster-wide
Reports which drives have been softEject-ed (with rebalancing), or are in the process of being softEject-ed.
diskspace
Type: per placement group
Reports if drives in a specific placement group have space usage above 90%.
features
Type: cluster-wide
Reports if specific StorPool features of newly deployed releases can be enabled or are blocked.
iscsi
Type: composite and per-host
Reports the state of iSCSI target services (storpool_iscsi) in the cluster.
iscsi-backup
Type: composite and per-host
Reports if any iSCSI initiators are connected via a “backup” link. This issue can happen if the iSCSI networks are swapped, and there is no cross-link between the two sides of the network.
iscsi-ctrlrs
Type: composite
Reports if any iSCSI portals (running instances of storpool_iscsi) are not added to any portal group, i.e. are unused.
latthreshold-cli
Type: cluster-wide
Reports abnormal latencies of I/O operations in the cluster performed by clients.
Note for this and the alert below: the initial starting point is always to run /usr/lib/storpool/latthreshold.py on a node with access to the management of the cluster, to see what is hanging. The follow-up depends on the common denominator for the problematic requests. More details available here.
The format of the alert is as follows:
[SC]CRITICAL:clients [(1 volume iolatmon:full:1) ] user ops [ 1 read, max 86352 ms ]
This means that on node with ID 1, the volume iolatmon:full:1 had delayed user operations. There was one read operation pending that has been in progress for around 86 seconds.
[SC] This alert generates super-critical alarms on large latencies.
latthreshold-disk
Type: cluster-wide
Reports abnormal latencies of I/O operations in the cluster performed on disks (hardware).
The format of the alert is as follows:
latthreshold-disks is WARNING:disks [(256 volume iolatmon:full:1) ] user ops [ 1 read, max 20621 ms ]
This means that the drive with ID 256 has delayed operations for the volume iolatmon:full:1. The operations were user-initiated, one of them was in progress when the check ran, and the maximum delay for such an operation was 20621 ms (i.e. a little over 20 seconds). It’s possible to see the same also for write operations, and for more than one drive or volume.
[SC] This alert generates super-critical alarms on large latencies.
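When alert text needs to be processed automatically (for example, to extract the affected volume), the two sample lines shown for latthreshold-cli and latthreshold-disk can be parsed with a regular expression. The following is a minimal sketch in Python. The pattern is inferred only from those two samples, so real alerts may contain variations it does not handle, and the names LatencyAlert and parse_latency_alert are hypothetical, not part of any StorPool tool.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the two sample alerts in this document only;
# this is an assumption, not an official description of the format.
ALERT_RE = re.compile(
    r"(?P<sc>\[SC\])?\s*(?P<severity>OK|WARNING|CRITICAL|UNKNOWN):"
    r"(?P<scope>clients|disks)\s*"
    r"\[\((?P<id>\d+) volume (?P<volume>[^)]+)\)\s*\]\s*"
    r"(?P<origin>\w+) ops\s*"
    r"\[\s*(?P<count>\d+) (?P<op>\w+), max (?P<max_ms>\d+) ms\s*\]"
)


class LatencyAlert(NamedTuple):
    super_critical: bool  # True if the [SC] tag is present
    severity: str         # OK, WARNING, CRITICAL or UNKNOWN
    scope: str            # "clients" or "disks"
    source_id: int        # node ID for clients, drive ID for disks
    volume: str           # volume name, e.g. iolatmon:full:1
    origin: str           # who initiated the operations, e.g. "user"
    count: int            # number of pending operations
    operation: str        # operation type, e.g. "read"
    max_ms: int           # longest delay in milliseconds


def parse_latency_alert(line: str) -> Optional[LatencyAlert]:
    """Return the parsed alert, or None if the line does not match."""
    m = ALERT_RE.search(line)
    if m is None:
        return None
    return LatencyAlert(
        super_critical=m.group("sc") is not None,
        severity=m.group("severity"),
        scope=m.group("scope"),
        source_id=int(m.group("id")),
        volume=m.group("volume"),
        origin=m.group("origin"),
        count=int(m.group("count")),
        operation=m.group("op"),
        max_ms=int(m.group("max_ms")),
    )
```

The sketch only handles a single volume and a single operation group per line; alerts listing several drives or volumes would need a more permissive pattern.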
locations
Type: cluster-wide
Tracks the settings for the receive/transmit buffers of the storpool_bridge service for different locations. An alert here means that the buffers are too low (or unset), and should be corrected. More information can be found at Bridge throughput performance.
maintenances
Type: per-cluster
Reports if any nodes are in scheduled maintenance and the state of the maintenance (ongoing, expired, etc). Non-supercritical alerts for nodes under maintenance are suppressed.
mgmt
Type: composite and per-host
Reports the state of storpool_mgmt services in the cluster. Will raise an alert if there’s just one management node in a cluster of more than one node.
[SC] Will raise a super-critical alert if no management services are available, as no data can be collected for the status of the cluster. Will also raise a super-critical alarm if some of the requests to storpool_mgmt return no data or time out.
Note
Not being able to collect data is different from not receiving any data from the cluster. For details, see monitoringdata.
mgmtConfig
Type: cluster-wide
Will report discrepancies in the internal cluster configuration. Handled exclusively by StorPool support.
monitoringdata
Type: cluster-wide
Reports if there was no monitoring data received for the last 15 minutes.
The main reasons for this alert are connectivity issues between the cluster and StorPool’s monitoring system. Those are easily detected by doing a traceroute from the active mgmt node in the cluster to mon1.storpool.com, and most of the time are related to DNS or routing.
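Since DNS is one of the usual causes, a quick name-resolution check can complement the traceroute. The following is a minimal sketch in Python using the standard socket module. The function name resolve_host and its resolver parameter are hypothetical and exist only to make the sketch testable without network access; the sketch does not check routing, which still needs traceroute.

```python
import socket
from typing import Callable, List


def resolve_host(hostname: str, resolver: Callable = socket.getaddrinfo) -> List[str]:
    """Return the sorted unique IP addresses for hostname, or [] on failure."""
    try:
        results = resolver(hostname, None)
    except socket.gaierror:
        # Name resolution failed: points to a DNS problem on this node.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # the address is the first element of sockaddr.
    return sorted({info[4][0] for info in results})
```

Calling resolve_host("mon1.storpool.com") on the active management node and getting an empty list suggests looking at the resolver configuration first; a non-empty list moves the investigation on to routing.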
needbalance
Type: cluster-wide
Reports if there are volumes or snapshots with placement constraint violations, which in turn would require a rebalancing of the cluster to be fixed. Currently handled by StorPool support.
There are multiple reasons for this alert to become active:
Volumes were created when a node was missing from a three-node cluster.
A drive/node has died and redundancy hasn’t been restored.
network
Type: composite and per-host
Reports the state of the network connectivity to the cluster of a specific node. It will report also if a node does not consider itself in the cluster or if it’s using a backup link.
Note that the network state is collected from the beacon on the active management node, thus it’s not fully representative of what each node sees.
onapp-bkp-vol
Type: cluster-wide
This will report if there are stale volumes used by OnApp to create backups. For more information see Cleanup of stale backup volumes and snapshots.
placementgroup-drives
Type: per placement group (cluster-wide)
This will report if there are drives mixed with and without SSD flag in the same placement group.
quorum
Type: cluster-wide
This will report if the expected number of voting nodes is in the cluster.
reformat
Type: cluster-wide
Internal use only. Reports if the entry usage on some drives warrants them to be reformatted with a larger amount.
relocator
Type: cluster-wide
Will report if the internal relocator service (part of storpool_mgmt) is running.
server
Type: composite and per-host
Reports the status of all storpool_server instances in the cluster. Will also report if there are drives in recovery on the instance. It will also report if a storpool_controller service is missing on the node with the server instance(s), or if their versions differ.
Note that the versions of the storpool_server and storpool_controller services are checked against each other, as the controller needs to parse the memory structures of the server and there’s a possibility of mismatch between versions. In general, the easiest solution is a restart of the service with the lower version, taking into account what is installed, if other services need to be restarted, if this needs to be scheduled, etc.
snapfromremote
Type: cluster-wide
Reports if there are too many snapshots being transferred from a remote location, or if there are too many snapshots being transferred of a single parent.
snaplen
Type: cluster-wide
Will report if there are snapshot chains above a certain limit. This is used to track if the chain shortening periodic task is working.
snaprepl
Type: cluster-wide
Reports if the replication factor of all snapshots in the cluster matches the replication factor of the template they’re in.
snaptargets
Type: cluster-wide
Reports if the disk placement for snapshots and their parents matches.
snaptmpl
Type: cluster-wide
Reports if all snapshots have a template.
tasks
Type: cluster-wide
Reports the tasks state in the cluster, including recoveries.
template
Type: per template type
Reports free space per template type. A template type is defined by the following 4 parameters: placeHead, placeAll, placeTail, and replication factor.
This is the standard way of seeing the available space for a specific template in the cluster.
Note
The reported free space is a conservative estimate. Disbalance or different-sized drives in a placement group can skew this considerably.
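To see which templates share a template type (and therefore the same free-space figure), templates can be grouped by the four parameters listed above. The following is a minimal sketch in Python. The input is assumed to be a list of dictionaries with the keys name, placeHead, placeAll, placeTail and replication; that shape, and the names TemplateType and group_by_template_type, are assumptions for illustration and not a documented API response.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class TemplateType(NamedTuple):
    place_head: str
    place_all: str
    place_tail: str
    replication: int


def group_by_template_type(templates: Iterable[dict]) -> Dict[TemplateType, List[str]]:
    """Group template names by (placeHead, placeAll, placeTail, replication)."""
    groups = defaultdict(list)
    for tmpl in templates:
        # The dictionary keys used here are hypothetical field names.
        key = TemplateType(
            tmpl["placeHead"], tmpl["placeAll"], tmpl["placeTail"], tmpl["replication"]
        )
        groups[key].append(tmpl["name"])
    return dict(groups)
```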
totalvolumes
Type: cluster-wide
Reports if the number of volumes in a cluster is too low (fewer than two, as at least that many are expected for the required storpool_iolatmon service) or too high (more than 10000).
volumecare-local
Type: cluster-wide
Note
This and all other alerts that have volumecare in the name are related to the operations of VolumeCare.
Reports if there are any detected problems with the snapshot creation of VolumeCare. The following issues are detected:
No snapshots for a specific volume (that has a policy that requires them).
Newest snapshot for a specific volume is too old.
Stale (very old) snapshot detected for a volume.
Policy for a volume does not exist.
The first three are investigated in the logs of VolumeCare, the last is a configuration issue. The main cause is the service not running.
volumecare-policies
Type: cluster-wide
Reports if the volumes of a specific VM have different policies (which would prevent volumecare from making snapshots). This is resolved by correcting the policies of some of the volumes of the offending VM.
Note that this may also be caused by the volumes being on different templates (that have different volumecare policies).
volumecare-remote
Type: cluster-wide
Reports if there are any detected problems with the transfer of volumecare snapshots to a remote cluster. Issues detected:
No remote snapshots for a specific volume (that has a policy that requires them).
Newest remote snapshot for a specific volume is too old.
Stale (very old) remote snapshot detected for a volume.
The first two are investigated in the logs of volumecare, the last one is a configuration issue. The main cause is the service not running.
volumecare-svc
Type: cluster-wide
Currently reports only if there are volumecare snapshots in the target cluster and the monitoring has not been updated with the proper policy. Handled exclusively by StorPool support.
volumerepl
Type: cluster-wide
Reports if the replication factor of all volumes in the cluster matches the replication factor of the template they’re in.
volumes
Type: cluster-wide
Reports if there are volumes named in a way that is not expected for the cluster, to track left-over tests. Does not send alerts.
volumesizes
Type: cluster-wide
Reports if there are volumes whose size is not aligned to 4096 bytes. Does not send alerts.
Note
Such volumes are a problem for some tools: their unaligned size leads to the use of smaller block sizes and poor performance. One such example is ntfsclone.
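The alignment condition itself is a simple modulo check. The following is a minimal sketch in Python, assuming volume sizes are available in bytes as a name-to-size mapping; the function names are hypothetical.

```python
ALIGNMENT = 4096  # bytes, as stated for this alert


def unaligned_volumes(sizes: dict) -> list:
    """Return the sorted names of volumes whose size is not a multiple of 4096."""
    return sorted(name for name, size in sizes.items() if size % ALIGNMENT != 0)


def round_up_to_alignment(size: int) -> int:
    """Return the smallest multiple of 4096 that is >= size."""
    return (size + ALIGNMENT - 1) // ALIGNMENT * ALIGNMENT
```

The second helper shows the size an unaligned volume would need to grow to in order to become aligned.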
volumetargets
Type: cluster-wide
Reports if the disk placement for volumes and their parent snapshot matches.
volumetmpl
Type: cluster-wide
Reports if all volumes have a template.
Per-host alerts
apichecks
Type: composite (no single-node version is available)
This alert reports whether a specific host can access the IP address in SP_API_HTTP_HOST via ICMP ping and TCP. This is required so that most orchestrations and the storpool_controller service can communicate with the storpool_mgmt service.
configfile
Type: composite (no single-node version is available)
This check tracks if the /etc/storpool.conf file is different between the nodes in the cluster.
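A similar comparison can be done by hand by hashing the file contents collected from each node. The following is a minimal sketch in Python. It assumes the contents of /etc/storpool.conf have already been gathered into a mapping of node name to bytes (how they are collected is out of scope), and the function names are hypothetical; it is not a description of how the monitoring system implements the check.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def group_nodes_by_config(configs: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Map the SHA-256 digest of each config to the sorted nodes that have it."""
    groups = defaultdict(list)
    for node, content in configs.items():
        groups[hashlib.sha256(content).hexdigest()].append(node)
    return {digest: sorted(nodes) for digest, nodes in groups.items()}


def configs_differ(configs: Dict[str, bytes]) -> bool:
    """Return True if at least two nodes have different file contents."""
    return len(group_nodes_by_config(configs)) > 1
```

Grouping by digest rather than returning a plain boolean makes it easy to see which nodes form the odd set out.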
configuration
Type: composite and single-node
Alerts on some common configuration problems in /etc/storpool.conf and /etc/storpool.conf.d/*:
Incorrect cluster ID;
Incorrect cluster name;
Difference between configured and loaded network configuration;
Missing API interface;
Existence of obsolete configuration options.
hw-ecc
Type: composite (no single-node version is available)
Reports if any node is using non-ECC memory (which is unsupported by StorPool).
initiators-iscsi
Type: composite (no single-node version is available)
This alert tracks the reachability of all connected iSCSI initiators from all target nodes. Its main purpose is to catch cases of partial connectivity loss.
These are resolved by fixing the underlying network issues.
kernels
Type: composite (no single-node version is available)
The alert shows the kernels on the node that do not have StorPool modules installed. If there’s a debug kernel installed in the node, the alert is CRITICAL, otherwise it’s a WARNING.
This alert is currently handled by StorPool support, by doing the installation of new kernel modules and/or scheduling an upgrade where necessary. In the case of a debug kernel, the customers are notified to remove it from the node, as StorPool does not support these kernels and they’re unsuited for use in production.
lldp
Type: composite (no single-node version is available)
The alert shows the sets of nodes for which the switch pair they’re connected to is swapped.
Example output:
[ sw100g1_(1)_4c:76:25:e8:49:40 sw100g2_(1)_3c:2c:30:38:43:80 => 3, 6 ][ sw100g2_(1)_3c:2c:30:38:43:80 sw100g1_(1)_4c:76:25:e8:49:40 => 20 ]
This means that nodes 3 and 6 have their first port connected to sw100g1_(1)_4c:76:25:e8:49:40 and second port to sw100g2_(1)_3c:2c:30:38:43:80, and for node 20 it’s reversed.
This is resolved by either:
Swapping the cables on the node.
Swapping the interfaces (SP_IFACE1_CFG and SP_IFACE2_CFG in /etc/storpool.conf).
Note
The switch naming is NAME_(CHASSIS_ID)_MAC, and all data is taken from the LLDP frames sent by the switch/router (i.e. the MAC is the MAC of the switch chassis).
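The example output above can be split into its groups programmatically, which helps when many nodes are listed. The following is a minimal sketch in Python. The pattern is inferred from the single example in this section, so other variants of the output may not match, and parse_lldp_alert is a hypothetical name.

```python
import re
from typing import Dict, List, Tuple

# One group looks like: [ <first switch> <second switch> => <node>, <node> ]
GROUP_RE = re.compile(r"\[\s*(\S+)\s+(\S+)\s*=>\s*([\d,\s]+?)\s*\]")


def parse_lldp_alert(text: str) -> Dict[Tuple[str, str], List[int]]:
    """Map (first-port switch, second-port switch) to the list of node IDs."""
    result = {}
    for first, second, nodes in GROUP_RE.findall(text):
        result[(first, second)] = [int(n) for n in nodes.split(",")]
    return result
```

The group with the fewest nodes is usually the one to fix, since swapping cables or interfaces there affects the smallest part of the cluster.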
rootcgprocess
Type: composite (no single-node version is available)
This alert shows all processes running in the root cgroup. For more information on cgroups, see Control groups.
The importance of this alert is that with processes in the root cgroup, it’s possible that the OOM killer will trigger because of them and kill random processes, as the ones in the root cgroup are not constrained in their memory usage.
portals-iscsi
Type: composite (no single-node version is available)
This alert tracks the reachability of all iSCSI portals on all nodes from all other nodes. Its main purpose is to catch cases of partial connectivity loss.
These are resolved by fixing the underlying network issues.
status
Type: composite (no single-node version is available)
This alert means that the status.json file generated by storpool_stat and transmitted by storpool_abrtsync has not been updated for a specific node. This leads to lack of per-host data for the host in question.
This is usually resolved by verifying that both storpool_stat and storpool_abrtsync are running and are able to send data.
Metrics-based alerts
cgroups
Type: composite and single-node
This alert tracks the memory usage of the StorPool memory cgroup (storpool.slice) and alerts if it goes above 80%. This gives advance warning if the memory needs to be expanded, if there are bugs (like memory leaks), or if there’s something running in the cgroup that shouldn’t be there.
The alert hasn’t been extended to customer cgroups like machine.slice, as customer usage/over-provisioning vary wildly.
cpustat
Type: composite and single-node
This alert tracks the run-queue/wait time on CPUs used for StorPool services and notifies if it goes above 1.8. This is used to see if more than one service gets put on the same CPU thread, or if there are other processes/kernel tasks scheduled in the StorPool cpuset, that can result in latencies or similar issues.
The alert hasn’t been extended to customer cgroups like machine.slice, as customer usage/over-provisioning vary wildly.
dataholes
Type: composite (no single-node version is available)
This alert tracks the amount of dropped packets per interface. It’s calculated based on the expected packets to be received for a request, i.e. when a “data hole” occurs, a gap in the received packet stream.
There can be different reasons for the alert:
CRC errors on an interface;
Bad cables;
Cross-switch link problems (in the case of a network with more than two switches).
Note
This error is a problem in receiving data. So an error on a lot of nodes might mean a problem with the sending from a specific node, i.e. the problem will always warrant investigation of all nodes/NICs in the cluster.
diskerrors
Type: composite (no single-node version is available)
This alert tracks the rate of new disk errors for specific disks and alerts on more than 10 per hour. This gives advance warning of which drives are likely to fail.
Note
A disk error in this context is a read from a drive that returned data with incorrect checksum.
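The rate can be derived from two samples of a cumulative error counter. The following is a minimal sketch in Python, assuming each sample is an error count paired with a Unix timestamp in seconds; the sampling scheme and the function names are assumptions, and only the threshold of 10 per hour comes from the description above.

```python
ERRORS_PER_HOUR_THRESHOLD = 10.0  # from the alert description


def errors_per_hour(prev_count: int, prev_ts: float,
                    curr_count: int, curr_ts: float) -> float:
    """Return the error rate per hour between two counter samples."""
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (curr_count - prev_count) * 3600.0 / elapsed


def should_alert(rate: float, threshold: float = ERRORS_PER_HOUR_THRESHOLD) -> bool:
    """Alert only when the rate is strictly above the threshold."""
    return rate > threshold
```

For example, six new errors within 30 minutes correspond to 12 errors per hour, which is above the threshold.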
io-latencies
Type: per template type
Reports if there is a significant increase between the mean of IO latencies for the last 24 hours and the typical IO latencies for the previous 28 days. A template type is defined by the following 4 parameters: placeHead, placeAll, placeTail and replication factor.
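The comparison can be illustrated with a short Python sketch. The document does not define how the typical latency is computed or what counts as a significant increase, so the median of the 28-day samples and a factor of 2.0 are used here purely as assumptions; latency_increased is a hypothetical name.

```python
from statistics import mean, median
from typing import Sequence


def latency_increased(last_24h: Sequence[float],
                      previous_28d: Sequence[float],
                      factor: float = 2.0) -> bool:
    """Return True if the recent mean exceeds factor times the typical latency.

    Assumptions: "typical" is taken as the median of the 28-day samples,
    and "significant" as exceeding it by the given factor.
    """
    recent = mean(last_24h)
    typical = median(previous_28d)
    return recent > factor * typical
```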
iolatmon
Type: cluster-wide
This alert tracks the state of the storpool_iolatmon service, which does read/write operations on test volumes to probe the cluster’s health. The alert checks for the existence of the test volumes and if they have any traffic.
A future update is being discussed to also track the latencies observed.
service-latency
Type: cluster-wide
This alert tracks the latencies for disk read/write operations and the latencies for StorPool read/write requests.
It reports if the latencies for StorPool operations become significantly higher than the disk operation latencies.
Some of the possible reasons for this issue are:
Storage network connectivity issues between nodes;
One or more of the StorPool services experiencing a high load.
service-load
Type: cluster-wide
This alert tracks if any StorPool service seems to be overloaded.
One specific case is the storpool_iscsi services (iSCSI controllers), where an overload most probably means the service would benefit from a rebalance of its exported targets.
stats
Type: composite and single-node
This alert shows if metrics data has been received for a host, if it’s delayed, or if it’s in the future.
Possible reasons for the alert:
Lack of data might mean that there is either an Internet connectivity/DNS problem with the node, the storpool_stat service has stopped, or there is a problem with the OS root drive (either full or remounted read-only).
Data in the future means wrong clock and is mostly resolved by fixing the clock with ntpdate and checking the chrony or ntpd configuration.
Delayed data might mean that the node is either catching up, its clock is wrong (see the previous point), or it is going into the state described in the first point.
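The three cases above can be told apart by comparing the timestamp of the newest received sample with the current time. The following is a minimal sketch in Python. All thresholds (300 seconds for delayed, 900 seconds for missing, 60 seconds of tolerated clock skew) and the names DataState and classify_metrics are assumptions for illustration; the actual limits used by the monitoring system are not stated in this document.

```python
from enum import Enum
from typing import Optional


class DataState(Enum):
    FRESH = "fresh"
    DELAYED = "delayed"
    MISSING = "missing"
    FUTURE = "future"


def classify_metrics(last_sample_ts: Optional[float], now: float,
                     delayed_after: float = 300.0,
                     missing_after: float = 900.0,
                     max_skew: float = 60.0) -> DataState:
    """Classify a host by the age of its newest metrics sample (Unix seconds)."""
    if last_sample_ts is None:
        return DataState.MISSING      # nothing received at all
    age = now - last_sample_ts
    if age < -max_skew:
        return DataState.FUTURE       # sample is ahead of our clock
    if age > missing_after:
        return DataState.MISSING
    if age > delayed_after:
        return DataState.DELAYED
    return DataState.FRESH
```

A small tolerated skew avoids flagging hosts whose clocks are only a few seconds ahead.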
status
Type: composite and single-node
This alert reports if status.json has been updated/transmitted by storpool_abrtsync to reports.storpool.com, which is the current way of receiving per-host checks.
Others
billingdata
Type: cluster-wide
Reports if billing data (daily report) has been received from the cluster. Handled exclusively by StorPool support.