StorPool 19.01 Release Notes
This section covers new features and changes that need more explanation than the usual one-liners in the StorPool 19.01 Release Change Log.
Auto-interface configuration
The iface-genconf tool is now extended to cover complementary iSCSI
configuration as well.
The main reason for the extension is a planned change in the monitoring system
to periodically check if the portal group and portal addresses exposed by the
storpool_iscsi controller services are still reachable. This change will
require adding OS/kernel interfaces for each of the portal groups, so that they
can be used as a source IP endpoint for the periodic monitoring checks.
The iface-genconf tool now accepts an additional --iscsicfg option in one of
the following forms:
VLAN,NET0_IP/NET0_PREFIX - for single network portal groups
VLAN_NET0,NET0_IP/NET0_PREFIX:VLAN_NET1,NET1_IP/NET1_PREFIX - for multipath portal groups
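The two forms of the --iscsicfg value can be illustrated with a small parser. This is only a sketch of the format described above, not the code used by iface-genconf; the PortalNet class and the parse_iscsicfg function are hypothetical names, and IPv4 addresses are assumed.

```python
from dataclasses import dataclass
from ipaddress import IPv4Interface
from typing import List


@dataclass(frozen=True)
class PortalNet:
    vlan: int               # VLAN ID of this portal group network
    address: IPv4Interface  # portal group address with its prefix length


def parse_iscsicfg(value: str) -> List[PortalNet]:
    """Split an --iscsicfg value into one entry per network (IPv4 assumed)."""
    nets = []
    for part in value.split(":"):          # ':' separates the networks
        vlan, addr = part.split(",", 1)     # ',' separates VLAN and address
        nets.append(PortalNet(int(vlan), IPv4Interface(addr)))
    if len(nets) not in (1, 2):
        raise ValueError("expected one (single network) or two (multipath) entries")
    return nets


single = parse_iscsicfg("100,10.1.100.251/24")
multipath = parse_iscsicfg("201,10.2.1.251/24:202,10.2.2.251/24")
print(single)
print(multipath)
```

A single entry corresponds to a single network portal group, two entries to a multipath portal group.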
The additional option is available only together with --auto. It is meant to
complement the interface configurations initially created with the same tool,
and auto-detection is required to handle configurations where the interfaces
for the storage system and for iSCSI overlap.
An example of adding an additional single portal group with VLAN 100 and portal
group address 10.1.100.251/24 would look like this:
iface-genconf -a --noop --iscsicfg 100,10.1.100.251/24
The above will auto-detect the operating system, the type of interface
configuration used for the storage system and iSCSI, and depending on the
configuration type (i.e. exclusive interfaces or a bond) will print the
interface configuration on the console. Without the --noop option, interface
configurations that do not exist yet will be created, and ones that already
exist will not be automatically replaced (unless iface-genconf is instructed to).
The IP address for each of the nodes is derived from its SP_OURID and can be
adjusted with the --iscsi-ip-offset option, whose value is added to the
SP_OURID when constructing the IP address.
The most common case for single network portal group configuration is either
with an active-backup or LACP bond configured on top of the interfaces
configured as SP_ISCSI_IFACE.
For example with SP_ISCSI_IFACE=ens2,bond0;ens3,bond0 the additional
interface will be bond0.100 with IP of 10.1.100.1 for the node with
SP_OURID=1, etc.
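The address derivation can be sketched as follows. The sketch assumes that the node address is the network base address plus SP_OURID plus the optional offset, which matches the 10.1.100.1 example above but is not confirmed as the exact rule used by iface-genconf; node_portal_ip is a hypothetical helper name.

```python
import ipaddress


def node_portal_ip(portal_cidr: str, our_id: int, ip_offset: int = 0) -> str:
    """Derive a node address inside the portal group network (assumed rule)."""
    network = ipaddress.ip_interface(portal_cidr).network
    # Assumption: host part = SP_OURID + value of --iscsi-ip-offset.
    candidate = network.network_address + our_id + ip_offset
    reserved = (network.network_address, network.broadcast_address)
    if candidate not in network or candidate in reserved:
        raise ValueError(f"{candidate} is not a usable host address in {network}")
    return str(candidate)


print(node_portal_ip("10.1.100.251/24", our_id=1))
print(node_portal_ip("10.1.100.251/24", our_id=1, ip_offset=10))
```

The range check guards against an offset that would push the address outside of the portal group network.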
The same example for a multipath portal group with VLAN 201 for the first network and 202 for the second:
iface-genconf -a --noop --iscsicfg 201,10.2.1.251/24:202,10.2.2.251/24
In case of exclusive interfaces (e.g. SP_ISCSI_IFACE=ens2:ens3) or in case
of an active-backup bond configuration (e.g. SP_ISCSI_IFACE=ens2,bond0:ens3,bond0)
the interfaces will be configured on top of each of the underlying interfaces
accordingly:
ens2.201 with IP 10.2.1.1
ens2.202 with IP 10.2.2.1
The example is assumed for a controller node with SP_OURID=1.
In case of an LACP bond (e.g. SP_ISCSI_IFACE=ens2,bond0:ens3,bond0:[lacp])
all VLAN interfaces will be configured on top of the bond interface (for
example bond0.201 and bond0.202 with the same addresses), but such peculiar
configurations should be rare.
The --iscsicfg option can be provided multiple times for multiple portal group
configurations.
All configuration options are listed by iface-genconf --help; some examples
can be seen at https://github.com/storpool/ansible
Initially added with the 19.01.1511.0b533fb -> 19.01.1548.00e5a5633 release.
Volume overrides
Note
Presently overrides are added manually by StorPool support, and only in places that require them. At a later point they will be re-created periodically, so that new volume objects are regularly included in this analysis.
Volume disk set overrides (or in short just volume overrides) refer to replacing the target disk sets for particular object IDs of a volume with disks different from the ones inherited from a parent snapshot or created by the allocator.
This feature is useful when many volumes are created from the same parent snapshot, which is the usual case when a virtual machine template is used to create many virtual machines that end up with the same type of OS root disk. Such volumes usually share the same OS and filesystem type, as well as the same behaviour. In the common case the filesystem journal will be overwritten at the same block device offset for all such volumes. For example, a cron job running on all such virtual machines at the same time (e.g. an unattended upgrade) will lead to writes to the exact same object or set of objects, so the same couple of disks in the cluster end up processing all these writes. This causes an excessive load on this set of disks, which leads to degraded performance when these drives start to aggregate all the overwritten data, or just from the extra load.
A set of tools can now be used to collect metadata from the API for all objects on each of the disks, and to analyse which objects receive the most writes in the cluster. These tools will calculate proper overrides for such objects, so that even in case of an excessive load on these particular offsets on all volumes created out of the same parent, the writes will end up on different sets of disks instead of the ones inherited from the parent snapshot in the original virtual machine template.
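The selection step can be pictured with a short sketch. This is not the algorithm of the actual tooling, only an illustration of picking the most overwritten objects; the input format (pairs of an object index and a write count, one pair per disk and volume sample) and the function name are assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_overwritten(samples: Iterable[Tuple[int, int]],
                    max_obj_count: int = 9600) -> List[Tuple[int, int]]:
    """Sum the writes per object index and return the busiest ones first."""
    totals: Counter = Counter()
    for object_index, writes in samples:
        totals[object_index] += writes
    # most_common() sorts by the summed write count, highest first.
    return totals.most_common(max_obj_count)


# Hypothetical samples: object 0 is written to by two different volumes.
samples = [(0, 5), (7, 100), (0, 10), (3, 1)]
print(top_overwritten(samples, max_obj_count=2))
```

Objects returned by such a selection would then be the candidates for placement on a different set of disks.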
The tooling is designed to work by looking for a template called overrides,
whose placeTail parameter is used as a target placement group for the disks
used as replacement for the most overwritten objects. For example, if a
cluster has one template with hybrid placement (i.e. one or more replicas
on HDDs and tail on SSD or NVMe drives), the overrides template would have to
use the SSD or NVMe placement group as placeTail. An example:
--------------------------------------------------------------------------------------------
| template  | size | repl. | placeHead | placeAll | placeTail | iops | bw | parent | flags |
--------------------------------------------------------------------------------------------
| hybrid    |    - |     3 | hdd       | hdd      | ssd       |    - |  - |        |       |
| overrides |    - |     - | default   | default  | ssd       |    - |  - |        |       |
--------------------------------------------------------------------------------------------
Multiple templates will use the same overrides template placeTail
specification. An example would be a cluster with an SSD-only and an HDD-only
template, in which case the drives for the most overwritten objects will be
overridden with SSD disks.
The tool to collect and compute overrides is
/usr/lib/storpool/collect_override_data, and the resulting overrides.json
file can be loaded with:
# storpool balancer override add-from-file ./overrides.json
Note that once overrides are loaded, they will be re-calculated on future re-balancing operations (more details on the balancer in 14. Rebalancing StorPool).
The default number of top objects to be overridden is 9600 or 32 GiB of virtual space.
This can be changed by setting the MAX_OBJ_COUNT environment variable for the
collect_override_data tool.
First appears with the 19.01.1500.82af794 release.
In-server disk tester
This feature improves the handling of storage media and/or controller failures by automatically trying to return a drive that previously failed an operation back into the cluster, in case it is still available and has recovered from the failure.
On many occasions a disk write (or read) operation might time out after the drive's internal failure handling mechanisms kick in. Examples are a bad sector on an HDD being replaced, or a reset of the controller to which the drive is connected. In some of these occasions the operation times out and an I/O error is returned to StorPool, which triggers an eject for the failed disk drive. This might be an indication of a pending failure, but in most cases the disk might continue working without any issues for weeks, sometimes even months, before another failure occurs.
With this feature such failures will now be handled by automatically re-testing each such drive if it is still visible to the operating system. If the results from the tests are within expected thresholds the disk is returned back into the cluster.
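The decision described above can be sketched as a simple check. The metric names and limits below are made up for illustration; the actual tests and thresholds used by the disk tester are not documented here.

```python
from typing import Mapping


def should_return_disk(visible_to_os: bool,
                       results: Mapping[str, float],
                       limits: Mapping[str, float]) -> bool:
    """Return True when a re-tested drive may go back into the cluster."""
    if not visible_to_os:
        # A drive the operating system no longer sees cannot be re-tested.
        return False
    # Every limited metric must be present and within its threshold.
    return all(name in results and results[name] <= limit
               for name, limit in limits.items())


# Hypothetical thresholds and test results.
limits = {"max_latency_ms": 50.0, "io_errors": 0}
print(should_return_disk(True, {"max_latency_ms": 12.0, "io_errors": 0}, limits))
print(should_return_disk(True, {"max_latency_ms": 12.0, "io_errors": 3}, limits))
```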
A disk test can be triggered manually as well. The drive will automatically be returned back into the cluster if the test was successful. The last result from the test can be queried through the CLI.
First appears with the 19.01.1108.02703b8c5 -> 19.01.1217.1635af7 release.
Reuse server implicit on disk down
This feature allows a volume to be created even if a cluster at the minimum system requirement of three nodes has a single disk missing, or if one of its nodes is down at the moment.
Before this change all attempts to create a volume in this state would have resulted in “Not enough servers or fault sets for replication 3” error for volumes with replication 3.
With this feature enabled the volume will be created as if the volume was
created with reuseServer (more on this here - reuse_server_implicit_on_disk_down_cli).
The only downside is that the volume will have two of its replicas on drives in the same server. When the missing node comes back, a re-balancing will be required so that all replicas created on the same server are re-distributed back on all nodes. A new alert will be raised for these occasions; more on the new alert here.
More on how to enable in the CLI tutorial (link to section).
This change will be gradually enabled on all eligible clusters and will be turned on by default for all new installations.
Maintenance mode
Maintenance mode allows for a node or a full cluster to be configured in “maintenance” mode, which prevents all expected alerts from being raised by the monitoring. This excludes cluster availability threatening alerts, which will still be raised even if the node or the whole cluster is under maintenance.
The feature was developed to allow easier maintenance of running nodes and to automate most checks that services can be stopped or restarted without impact, which had to be done manually before. This also provides a quicker way of synchronization between the customer and StorPool support on routine maintenance work.
The feature is enabled through the CLI (per node maintenance or full cluster). It allows only a limited number of nodes to be under maintenance at the same time, and performs some basic checks before enabling it.
The main use case of this feature is for customers or support staff to set a node in maintenance before performing any operation that would interrupt or affect services on the node, for example reboot or hardware maintenance.
More information on the usage can be found in the user guide.
Active requests
StorPool has support for listing the active requests on disks and clients.
With this change this functionality is expanded to be able to get all such
data from other services (for example storpool_bridge and storpool_iscsi)
and to show some extra details. This is also expanded to be done with a
single call for the whole cluster, greatly simplifying monitoring and
debugging.
This feature was developed to replace the latthreshold tool that showed
all requests of clients and disks that were taking more than a set time to
complete. The present implementation gets all required data without the
need for sending a separate API call for each client/disk in the cluster.
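What the latthreshold tool reported can be reproduced from a single cluster-wide listing with a filter like the one below. The record layout (service, request and elapsed_ms keys) is hypothetical and does not reflect the actual API output.

```python
from typing import Dict, List


def slow_requests(requests: List[Dict], threshold_ms: float) -> List[Dict]:
    """Keep the requests running longer than threshold_ms, slowest first."""
    slow = [r for r in requests if r["elapsed_ms"] > threshold_ms]
    return sorted(slow, key=lambda r: r["elapsed_ms"], reverse=True)


# Hypothetical result of one call covering the whole cluster.
active = [
    {"service": "client 12", "request": "write", "elapsed_ms": 3.1},
    {"service": "disk 1103", "request": "read", "elapsed_ms": 250.0},
    {"service": "storpool_bridge", "request": "write", "elapsed_ms": 1200.0},
]
for req in slow_requests(active, threshold_ms=100.0):
    print(req["service"], req["request"], req["elapsed_ms"])
```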
Non-blocking NIC initialization
Non-blocking NIC initialization and configuration allows for StorPool services to continue operating normally during hardware reconfigurations or errors (for example link flaps, changing MTU or other NIC properties, adding/removing VFs to enable/disable hardware acceleration, etc.).
This was developed to handle delays from hardware-related reconfiguration or initialization issues. Most NICs require resets, recreation of filters, or other tasks which take time for the NIC to process; if the services simply busy-waited for them, they would not be able to process other requests in the meantime.
The feature works by sending requests to the NIC for specific changes and rechecking their progress periodically, between processing other tasks or when waiting for other events to complete.
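The difference from busy-waiting can be illustrated with a toy event loop. The FakeNic class is a stand-in that reports completion after a fixed number of progress checks; it does not model any real driver interface, and the loop is not the one used by the StorPool services.

```python
from collections import deque
from typing import Iterable, List


class FakeNic:
    """Stand-in for a NIC whose reconfiguration completes after a few checks."""

    def __init__(self, checks_needed: int) -> None:
        self._remaining = checks_needed

    def is_done(self) -> bool:
        self._remaining -= 1
        return self._remaining <= 0


def run(nic: FakeNic, tasks: Iterable[str]) -> List[str]:
    """Interleave progress checks on the NIC with processing of other tasks."""
    log: List[str] = []
    pending = deque(tasks)
    nic_ready = False
    while not nic_ready or pending:
        # Re-check the NIC once per iteration instead of spinning on it.
        if not nic_ready and nic.is_done():
            nic_ready = True
            log.append("nic ready")
        if pending:
            log.append(pending.popleft())
    return log


print(run(FakeNic(checks_needed=3), ["io-1", "io-2", "io-3", "io-4"]))
```

In this run the first two requests are served while the simulated NIC is still reconfiguring, which is the behaviour a busy-wait would prevent.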
Creation/access of volumes based on global IDs
StorPool now allows creating a volume without providing a name, by assigning it a globally unique ID, by which it can then be referred to.
Note
Globally-unique in this regard means world-wide global uniqueness, i.e. no two StorPool deployments can generate the same ID.
This feature was developed to handle some race conditions and retry-type cases in orchestration systems in a multi-cluster environment, which could lead to duplication and reuse of another volume. By allowing the storage system to ensure the uniqueness of volume identifiers, all these cases are solved.
MultiCluster mode
MultiCluster mode is a new set of features allowing management and interoperation of multiple StorPool clusters located in the same data center, which allows for much higher scalability and better reliability in large deployments. Instead of creating one big storage cluster with many hosts, the multi-clustering approach adds the flexibility to have multiple purpose-built clusters, each using fewer hosts. The set of features is built as an extension to Multi site (more at multi-site), and each location can now have multiple clusters. As an analogy, a multicluster relates to multi-site similarly to how a single data center relates to a city.
The first new concept is that of an exported volume. Just like snapshots can be
currently exported to another cluster through the bridge, now volumes can be
exported too for all nearby clusters. An exported volume can be attached to any
StorPool client in a target cluster and all read and write operations for that
volume will be forwarded by the storpool_bridge to the cluster where the
volume actually resides (how to export a volume here).
The second concept is the ability to move volumes and their snapshots between sub-clusters. The volume is moved by snapshotting it in the sub-cluster where it currently resides (i.e. the source sub-cluster), exporting that snapshot, and instructing the destination sub-cluster to re-create the volume with that same snapshot as parent. This all happens transparently with a single command. While the volume's data is being transferred, all read requests targeting data not yet moved to the destination sub-cluster are forwarded through the bridge service to the source sub-cluster. When moving a volume there is an option to also export it back to the source sub-cluster where it came from and update the attachments there, so that if the volume is attached to a running VM, the VM will not notice the move. More on volume and snapshot move here.
Both features have minimal impact on the user IOs. An exported volume adds a few tens of microseconds of delay to the IO operations, and a volume move usually causes a stall of between a few hundred milliseconds and a few seconds. So for everything except some extremely demanding workloads, a volume move would go practically unnoticed.
A target use case for volume move is live migration of a virtual machine or a container between hosts that are in different sub-clusters in the multicluster. The move to the destination sub-cluster also involves exporting the volume back and attaching it to the host in the source cluster, almost like a normal live migration between hosts in the same cluster. If the migration fails for some reason and will not be retried, the volume can be returned back to the original cluster. Its data, both in the volume and in its snapshots, will transparently move back to the source cluster.
The new functionality is implemented only in extensions to the API - there is no change in what all current API calls do.
Two extensions, explained below, allow using these features.
The first one is the ability to execute API commands in another sub-cluster in a multicluster setup, an example is available here.
The second extension is multicluster mode of operation. It is enabled by
specifying -M as a parameter to the storpool CLI command or executing
multiCluster on in interactive mode (example).
This mimics the single cluster operations in a multicluster environment without
placing additional burden on the integration.
For example, a multicluster attach volume command will first check if a volume is currently present in the local sub-cluster. If it is, the command proceeds as a normal non-multicluster attach operation. If the volume is not present in this sub-cluster, the API will search for the volume in all connected sub-clusters that are part of the multicluster and, if it is found, will issue a volume move to the local cluster (exporting it back to the source cluster if the volume is attached there) and will then attach the volume locally. The same is in effect for all other volume and snapshot related commands. The idea is that the multicluster mode can be used as a drop-in replacement of the non-multicluster mode, having no effect for local volumes. The only difference is that an attach operation must be targeted to the cluster with the target client in it. This is the main reason for the first extension, so that the integration doesn't need to connect to the different API endpoints in each cluster to do its job.
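The attach flow described above can be modelled with plain data structures. This is a sketch of the decision logic only; the function, its parameters and the step descriptions are hypothetical and do not correspond to actual API calls.

```python
from typing import Dict, List, Set


def multicluster_attach(volume: str, local: str,
                        clusters: Dict[str, Set[str]],
                        attached_in: Set[str]) -> List[str]:
    """Return the steps taken to attach `volume` to a client in `local`."""
    steps: List[str] = []
    if volume not in clusters[local]:
        # Search the other sub-clusters of the multicluster for the volume.
        source = next((name for name, vols in clusters.items()
                       if name != local and volume in vols), None)
        if source is None:
            raise LookupError(f"volume {volume!r} not found in any sub-cluster")
        clusters[source].remove(volume)
        clusters[local].add(volume)
        steps.append(f"move {volume}: {source} -> {local}")
        if source in attached_in:
            # Keep existing attachments in the source working after the move.
            steps.append(f"export {volume} back to {source}")
    steps.append(f"attach {volume} in {local}")
    return steps


clusters = {"A": {"vol1"}, "B": set()}
print(multicluster_attach("vol1", "B", clusters, attached_in={"A"}))
print(multicluster_attach("vol1", "B", clusters, attached_in=set()))
```

The second call finds the volume locally and degenerates to a plain attach, which is the drop-in behaviour for local volumes.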
A few words on the caveats: the main issue in multicluster mode is that of naming the volumes. The whole multicluster acts as a single namespace for volume names, however there is no StorPool enforced uniqueness constraint. We thought long and hard on this decision, but decided in the end against enforcing it, as there is no way to guarantee the uniqueness in cases of connectivity or other disruption in communication between clusters without blocking local cluster operations. We believe that having the ability to work with a single cluster in the multicluster in all cases is far more important.
This decision however places the burden of keeping the volume names unique on the integration. There are a few ways this can be handled. The first option is to never reuse a volume name, even when retrying a failed operation or an operation that has an unknown status (e.g. a timeout). If the integration can guarantee this, the uniqueness constraint is satisfied.
Another option, which we are in the process of moving all integrations to, is
to use another new feature: instead of providing a name for the volume create
operation, have StorPool assign a unique name to it. StorPool already maintains
globally unique identifiers for each volume in existence, the globalId in the
json output. So the new feature is the ability to use the globalId as an
identifier for a volume instead of a name, and the ability to create a volume
without specifying a name and have StorPool return the unique identifier. This
mode however is a significant change to the way an integration works. For our
integrations we have chosen to move to this mode, as they are plugged into
complex management systems that we can't guarantee will never reuse a volume
name, for example during a retry of a volume create operation. This also comes
with a new directory, /dev/storpool-byid, holding symlinks to the /dev/sp-X
devices, but using the globalIds instead of the volume names used in
/dev/storpool.
Violating the uniqueness constraint will eventually result in data corruption
for a given volume, so handling it is a must before considering the multicluster
features. A simple example of how data corruption will occur if there are two
volumes with the same name: let's assume the volume name in question is test,
and we have one volume test in the source sub-cluster A with a VM working
with it, and another volume test in the target sub-cluster B. When trying to
migrate the VM from cluster A to cluster B, the multicluster volume
attach command will find the wrong volume test already existing in the target
sub-cluster B and simply attach it to the client, so when the VM is migrated
the data in the volume test will be totally different, resulting in data
corruption and a VM crash.
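The collision can be shown with a small model of two sub-clusters. The globalId values below are made-up placeholders rather than real identifiers, and the lookup helpers are hypothetical.

```python
from typing import Dict, List

# Two unrelated volumes that happen to share the name "test".
volumes: List[Dict[str, str]] = [
    {"cluster": "A", "name": "test", "globalId": "example-id-1"},
    {"cluster": "B", "name": "test", "globalId": "example-id-2"},
]


def find_by_name(name: str) -> List[Dict[str, str]]:
    """A name lookup can match volumes in more than one sub-cluster."""
    return [v for v in volumes if v["name"] == name]


def find_by_global_id(global_id: str) -> Dict[str, str]:
    """A globalId lookup matches exactly one volume or fails."""
    matches = [v for v in volumes if v["globalId"] == global_id]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one volume for {global_id!r}")
    return matches[0]


print(len(find_by_name("test")))                     # ambiguous: two matches
print(find_by_global_id("example-id-1")["cluster"])  # unambiguous
```

Referring to volumes by globalId removes the ambiguity that makes the wrong volume test get attached in the example above.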
There are two more considerations, again naming related, but they are not as dangerous. The first one is that a volume move will recreate the volume in the target cluster with a template named as the one in the source cluster, so every cluster in a multicluster must have all the templates created with the same names.
The last one is that each cluster in the multicluster should be known by the
same name in all of the clusters, so that the remote cluster commands can
be executed from any API endpoint with the same effect. For completeness,
in each cluster the local cluster can also be (and should be) named, so that a
storpool cluster cmd NAME will work for it too.