Monitoring metrics collected
Overview
StorPool provides a hosted monitoring system that allows all StorPool customers to easily and reliably monitor the health and performance of their StorPool storage clusters. The health and performance information is collected by the storage and client nodes and sent over an encrypted connection to the StorPool monitoring servers, running in StorPool’s own infrastructure in a data center with restricted physical access.
What data is collected
The data that is collected includes performance metrics of the storage nodes, storage devices, storage network traffic, and some metadata of the stored volumes. Following is an exhaustive list of the data collected and stored by the StorPool hosted monitoring system:
Cluster status
List and status of the disks
List and status of the StorPool services
List and status of the storage networks
List and status of the attachments
List and status of the relocation tasks running
List and status of the volumes and snapshots
List and status of the placementGroups and templates
Status of relocator and balancer services
Cluster (mgmtConfig) configuration
Maintenances that are set or in progress
List and status of the iSCSI sessions
Pending I/O requests that have been active for more than 10 seconds, per client
Unreachable initiator addresses in each portal group that have at least one export and have connected to some of the portals at least once. This allows continued tracking of connectivity to the initiators regardless of the presence of a live session; a sketch illustrating these tracking rules follows this list. The storpool_stat service tracks initiator addresses for each portal group until they have no exports in the portal group, or until they are deleted from the cluster configuration. This is relevant for all clusters configured to export targets through iSCSI. The cluster has to be configured with a kernel-visible address for each portal group in the cluster configuration on each of the nodes running the storpool_iscsi service (related change for adding interfaces). Initially added with the 19.2 revision 19.01.1946.0b0b05206 release.
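To make the tracking rules above concrete, here is a minimal sketch in Python, assuming hypothetical data structures and method names (not StorPool's actual implementation); it only mirrors the rules stated in the last item:

    # Illustrative sketch only: hypothetical structures mirroring the
    # initiator-tracking rules described above, not StorPool's implementation.
    from dataclasses import dataclass, field

    @dataclass
    class InitiatorState:
        has_exports: bool = False         # at least one export in this portal group
        has_ever_connected: bool = False  # connected to some portal at least once
        reachable: bool = False           # current connectivity status

    @dataclass
    class PortalGroup:
        name: str
        initiators: dict = field(default_factory=dict)  # address -> InitiatorState

        def tracked_unreachable(self):
            """Addresses reported as unreachable; a live session is not required."""
            return [
                addr
                for addr, st in self.initiators.items()
                if st.has_exports and st.has_ever_connected and not st.reachable
            ]

        def prune(self, deleted_addresses=()):
            """Stop tracking initiators with no exports or deleted from the config."""
            self.initiators = {
                addr: st
                for addr, st in self.initiators.items()
                if st.has_exports and addr not in deleted_addresses
            }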
Per-host status
Running kernel version and other installed kernels
Processes running in the root cgroup
Connectivity to the node with the active API
Crash reports of the StorPool processes (that include internal logs and core dumps)
Crash reports of the Linux kernel
Performance monitoring
List and status of the disks
List and status of the templates
List and status of the iSCSI sessions
Storage server performance metrics: reads and writes per second, bytes/s, network transfer time, processing time, disk read/write time, disk busy time, queue lengths, system task utilization
Stats from /proc/diskstats (I/O stats) for all attached StorPool and system disk drives - number of IOPS, bytes/s, busy time, and so on (a sketch deriving such values follows this list)
Stats for CPU usage and queue for every CPU
Memory usage for all cgroups in the system (cache, rss, and so on)
Network and other stats from the StorPool services on the node
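As an illustration of how some of the disk metrics above can be derived, the following minimal sketch computes per-device IOPS and bytes/s from two consecutive /proc/diskstats samples. It is not StorPool's actual collector; the field layout follows the Linux kernel documentation (sectors are 512 bytes):

    # Illustrative sketch only: derive per-device IOPS and bytes/s from two
    # consecutive /proc/diskstats samples; not StorPool's actual collector.
    import time

    def read_diskstats():
        """Return {device: (reads, writes, sectors_read, sectors_written)}."""
        stats = {}
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                dev = fields[2]
                reads = int(fields[3])            # reads completed
                sectors_read = int(fields[5])     # sectors read
                writes = int(fields[7])           # writes completed
                sectors_written = int(fields[9])  # sectors written
                stats[dev] = (reads, writes, sectors_read, sectors_written)
        return stats

    def rates(interval=1.0):
        """Yield (device, read_iops, write_iops, read_bps, write_bps)."""
        before = read_diskstats()
        time.sleep(interval)
        after = read_diskstats()
        for dev, (r1, w1, sr1, sw1) in after.items():
            r0, w0, sr0, sw0 = before.get(dev, (r1, w1, sr1, sw1))
            yield (
                dev,
                (r1 - r0) / interval,          # read IOPS
                (w1 - w0) / interval,          # write IOPS
                (sr1 - sr0) * 512 / interval,  # read bytes/s
                (sw1 - sw0) * 512 / interval,  # write bytes/s
            )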
Metadata
The collected information listed above contains some metadata about the volumes stored. This includes:
Volume name
Volume size
Volume utilization - used space
StorPool template name used to create the volume
Replication factor for the volume
QoS parameters - configured IOPS and bandwidth limits
Tags of the volume
The format and the information contained in the volume names depend on the cloud management system used. A volume name typically contains the UUID of the virtual disk (OpenStack) or disk sequence numbers (OpenNebula), for example “one-{VM_ID}-{Disk_ID}”.
Cloud management systems may store additional metadata in the volume tags, such as the ID of the virtual machine using the volume or the backup policy (used by the volumecare service).
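Purely as an illustration of the metadata fields listed above, a per-volume record could look like the following; all field names and values are hypothetical, not an actual monitoring payload:

    # Hypothetical example of the per-volume metadata fields listed above;
    # names and values are illustrative, not an actual monitoring payload.
    volume_metadata = {
        "name": "one-42-0",            # OpenNebula-style "one-{VM_ID}-{Disk_ID}" name
        "size": 107374182400,          # provisioned size in bytes (100 GiB)
        "storedSize": 32212254720,     # used space in bytes
        "template": "hybrid-r3",       # StorPool template used to create the volume
        "replication": 3,              # replication factor
        "iops": 5000,                  # QoS: configured IOPS limit
        "bw": 262144000,               # QoS: configured bandwidth limit, bytes/s
        "tags": {"vm": "42", "policy": "daily"},  # tags set by the cloud platform
    }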
User data
Attention
No user data is collected, processed, or transferred by the monitoring agents and StorPool monitoring servers. The content of the volumes and snapshots is never read or processed, except as required by the main function of the storage system - storing and retrieving user data on user requests.
The crash reports that are collected do not record any user information (such as data buffers) at all; no information beyond what is described above can be obtained from them. For the Linux kernel crash reports, only the backtrace/log information is sent; the full crash dump is not transferred.
How the data is collected
The data is collected by agents that are part of the StorPool software and run on the storage nodes and on client nodes with the StorPool client installed. The agents regularly collect the health-check and performance metrics, perform preliminary processing (validation, aggregation, and calculation of derived parameters), and send the data to the hosted StorPool monitoring system.
The agents that collect and send the information are implemented in scripting languages and are available for audit by the StorPool customers.
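A minimal sketch of such an agent loop is shown below; all function names are hypothetical and the transport is a placeholder. It only illustrates the collect, validate, derive, and send steps described above:

    # Minimal sketch of the collect -> validate -> derive -> send loop described
    # above; all names are hypothetical and the transport is a placeholder.
    import time

    def collect_raw():
        """Gather raw counters from local sources (placeholder values)."""
        return {"reads_completed": 123456, "ts": time.time()}

    def validate(sample):
        """Drop obviously invalid samples, e.g. negative counters."""
        return all(v >= 0 for k, v in sample.items() if k != "ts")

    def derive(prev, curr):
        """Turn cumulative counters into per-second rates (derived parameters)."""
        dt = curr["ts"] - prev["ts"]
        return {"read_iops": (curr["reads_completed"] - prev["reads_completed"]) / dt}

    def agent_loop(send, interval=1.0):
        """Collect regularly, validate, derive, and hand off to the transport."""
        prev = collect_raw()
        while True:
            time.sleep(interval)
            curr = collect_raw()
            if validate(curr):
                send(derive(prev, curr))
                prev = curr

    # Example: agent_loop(print) would print the derived rates once per second.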
How the data is sent
All monitoring and performance metrics data is sent by the agents over encrypted HTTP/TLS or SSH connections. The connections are established by the agents on the storage and client nodes with installed StorPool software, to a pool of redundant monitoring servers in StorPool’s own infrastructure. The TLS/SSH connections can be established over the Internet or using a purpose-built VPN between the storage cluster and the StorPool infrastructure. In both cases, TLS-encrypted communication is used between the agents and the monitoring servers.
The destination servers where the monitoring and performance data is sent are configured locally on the storage nodes and are under the customer’s control.
Agents sending the data are authenticated by the monitoring servers using individual per-cluster pre-shared keys before the data is accepted for processing.
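The following minimal sketch shows how such an authenticated upload could look, assuming a hypothetical endpoint URL and authentication header name; the real agents, endpoints, and authentication format may differ:

    # Minimal sketch, assuming a hypothetical HTTPS endpoint and header name;
    # the real agents, endpoints, and authentication format may differ.
    import json
    import urllib.request

    MONITORING_URL = "https://monitoring.example.storpool.com/ingest"  # hypothetical
    CLUSTER_PSK = "per-cluster-pre-shared-key"                         # hypothetical

    def send_metrics(payload):
        """POST one batch of metrics over TLS, authenticated with the cluster PSK."""
        req = urllib.request.Request(
            MONITORING_URL,
            data=json.dumps(payload).encode(),
            headers={
                "Content-Type": "application/json",
                "X-Cluster-Auth": CLUSTER_PSK,  # hypothetical header name
            },
            method="POST",
        )
        # urllib verifies the server's TLS certificate by default.
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status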
How the data is processed
StorPool monitoring servers store and process the received data locally on the StorPool hosted monitoring system. No data is sent to external systems or third parties for storage or processing. The health-check data is used to update the current health status of the monitored elements; it is not stored in raw format, but as the processed health status of the elements and historical data about events - changes in their status.
Performance metrics are stored in a raw high-resolution data format in a time-series database and in an aggregated 1-minute resolution format. The aggregated data is stored for 12 months, after which it is deleted. The high-resolution data is stored for 48 hours.
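As an illustration of downsampling high-resolution samples into 1-minute aggregates, here is a minimal sketch; the choice of aggregation functions (min/mean/max) is an assumption made for the example, not a description of the actual pipeline:

    # Illustrative sketch of downsampling high-resolution samples into 1-minute
    # aggregates, similar in spirit to the retention scheme described above.
    from collections import defaultdict
    from statistics import mean

    def aggregate_per_minute(samples):
        """samples: iterable of (unix_ts, value) -> {minute_ts: (min, mean, max)}."""
        buckets = defaultdict(list)
        for ts, value in samples:
            buckets[int(ts) // 60 * 60].append(value)
        return {m: (min(v), mean(v), max(v)) for m, v in buckets.items()}

    # Example: three sub-minute samples collapse into one 1-minute data point.
    print(aggregate_per_minute([(1700000000, 10), (1700000015, 30), (1700000030, 20)]))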
Information about monitoring events, such as a disk or node failure or a cluster health status change, can be sent to subscribed customers using third-party services. The notification services currently used are Slack (slack.com) and e-mail. Notifications are configured/enabled per cluster.
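A minimal sketch of a per-cluster notification via a Slack incoming webhook is shown below; the webhook URL is a placeholder and the message format is illustrative only:

    # Minimal sketch of a per-cluster notification via a Slack incoming webhook;
    # the webhook URL is a placeholder and the message format is illustrative.
    import json
    import urllib.request

    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

    def notify(cluster, event):
        """Post a one-line event notification to the configured Slack channel."""
        body = json.dumps({"text": "[{}] {}".format(cluster, event)}).encode()
        req = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=body,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=10):
            pass

    # Example: notify("lab-cluster", "disk 1203 on node 12 is missing")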
All servers where the monitoring and performance data is processed and stored are owned and managed by StorPool and are part of the infrastructure managed by StorPool. The servers are located in a certified data center with restricted access.