Introduction to multi-cluster mode
Multi-cluster mode is a set of features for managing and interoperating multiple StorPool clusters located in the same datacenter, allowing for much higher scalability and better reliability in large deployments. Instead of creating one big storage cluster with many hosts, the multi-cluster approach adds the flexibility to have multiple purpose-built clusters, each using fewer hosts. The set of features is built as an extension to multi-site mode, and each location can now have multiple clusters. By way of analogy, a multi-cluster relates to a multi-site setup roughly the way a single datacenter relates to a city.
This document provides an overview of the multi-cluster mode. For more information about using this mode, see Multi-site and multi-cluster.
Basic concepts
The first concept is that of an exported volume.
Just like snapshots can currently be exported to another cluster through the bridge, volumes can now also be exported to all clusters in the same location.
An exported volume can be attached to any StorPool client in a target cluster, and all read and write operations for that volume will be forwarded via the storpool_bridge service to the cluster where the volume actually resides.
For more information, see Listing exported volumes.
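As an illustration only, the exported volumes visible in a cluster can be inspected from the CLI; the exact subcommand is an assumption here and should be taken from the Listing exported volumes section:

    # List the volumes that other sub-clusters have exported to this cluster
    # (illustrative subcommand; see Listing exported volumes for the exact syntax):
    storpool volume list exports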
The second concept is the ability to move volumes and their snapshots between sub-clusters. The volume is moved by snapshotting it in the sub-cluster where it currently resides (the source sub-cluster), exporting that snapshot, and instructing the destination sub-cluster to re-create the volume with that snapshot as parent. This all happens transparently with a single command. While the volume's data is being transferred, all read requests targeting data not yet moved to the destination sub-cluster are forwarded through the bridge service to the source sub-cluster. When moving a volume, there is an option to also export it back to the source sub-cluster and update the attachments there, so that if the volume is attached to a running VM, the VM will not notice the move. For more information, see Moving volumes and snapshots.
Both features have minimal impact on the user input and output (IO) operations. An exported volume adds 2-4 times the RTT of the storpool_bridge link (usually tens of microseconds) to each IO operation. So for everything except some extremely demanding workloads, a volume move will go practically unnoticed.
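For example, assuming a bridge-link RTT of about 50 microseconds between the sub-clusters, an IO request served through an exported volume would pick up roughly an extra 100-200 microseconds on top of the usual local IO latency; the exact numbers depend on the network between the clusters.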
Both features are triggered automatically when attaching a volume in a remote cluster, and do not need to be invoked explicitly via the API.
Use cases
A target use case for volume move is the live migration of a virtual machine between hosts that are in different sub-clusters of the multi-cluster. The move to the destination sub-cluster also involves exporting the volume back and leaving the existing attachment on the host in the source cluster intact, just like a normal live migration between hosts in the same cluster. If the migration fails for some reason and will not be retried, the volume can be returned to the original cluster. Its data, both in the volume and its snapshots, will transparently move back to the source cluster.
Implementation
The functionality is implemented only as extensions to the REST API; there is no change in the behavior of the existing API calls.
There are two extensions to allow using these features (explained below):
The first one is the ability to execute API commands in another sub-cluster in a multi-cluster setup. For an example, see Moving volumes and snapshots.
The second extension is the multi-cluster mode of operation. It is enabled by specifying -M as a parameter to the storpool CLI command, or by executing multiCluster on in interactive mode (example). This mimics the single-cluster operations in a multi-cluster environment without placing additional burden on the integration.
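For illustration, multi-cluster mode can be requested in either of the two ways mentioned above; the commands run after enabling it are just examples, and the interactive prompt shown is not literal:

    # Enable multi-cluster mode for a single CLI invocation:
    storpool -M volume list

    # Or turn it on for the current interactive session:
    storpool
    > multiCluster on
    > volume list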
For example, a multi-cluster “attach volume” command will first check if a volume is currently present in the local sub-cluster. If it is, the command proceeds as a normal non-multi-cluster attach operation. If the volume is not present in this sub-cluster, the API will search for the volume in all connected sub-clusters that are part of the multi-cluster, and if it is found will issue a volume move to the local cluster (exporting it back to the source cluster if the volume is attached there) and will then attach the volume locally.
The same applies to all other volume- and snapshot-related commands. The idea is that the multi-cluster mode can be used as a drop-in replacement for the non-multi-cluster mode, having no effect for local volumes. The only difference is that an attach operation must be targeted at the cluster containing the target client. This is the main reason for the first extension, so that the integration does not need to connect to a different API endpoint in each cluster to do its job.
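As a sketch of the flow described above (the volume name, client ID, and sub-cluster layout are examples, and the exact attach syntax should be checked against the CLI reference), attaching a volume that currently lives in another sub-cluster could look like this:

    # Issued against the API of the sub-cluster that hosts client 12.
    # If "vm-disk-1" is not local, it is found in a peer sub-cluster, moved here
    # (and exported back if it is still attached there), and then attached locally:
    storpool -M attach volume vm-disk-1 client 12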
Some functions (like storpool net list, see Network) do not make sense in a multi-cluster context and simply return the result for the local cluster.
For queries that are not multi-cluster ones, but still need to query all sub-clusters in the multi-cluster separately, there is extra functionality in the multi-cluster API called AllClusters. This is used in cases like storpool template status (see Templates) to query all sub-clusters for their space usage and availability.
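As an illustration, without the AllClusters functionality an integration would have to iterate over the sub-clusters itself; the cluster name below is made up, and the exact syntax for running a command in another sub-cluster should be checked against the CLI reference:

    # Space usage and availability for the local sub-cluster only:
    storpool template status

    # The same query executed in a specific peer sub-cluster:
    storpool cluster cmd cl-b template status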
Volume naming
When using the multi-cluster mode you should be careful with volume naming.
The whole multi-cluster acts as a single namespace for volume names; however, there is no uniqueness constraint enforced by StorPool. We thought long and hard about this decision, but in the end decided against enforcing such a constraint, as there is no way to guarantee uniqueness in case of connectivity loss or other disruption of the communication between clusters without blocking local cluster operations. We believe that having the ability to work with a single cluster in the multi-cluster in all cases is far more important.
Naming approaches
The decision not to enforce a uniqueness constraint places the burden of keeping volume names unique on the integration. This can be handled in one of the following ways:
Never reuse a volume name, even when retrying a failed operation or an operation with an unknown status (for example, a timeout). If the integration can guarantee this, the uniqueness constraint is satisfied.
Instead of providing a name during the "volume create" operation, let StorPool create a name using the unique ID (globalId) that it assigns and maintains for each volume. Such a volume is referred to as "nameless". It can be created only using the VolumeCreate call from StorPool's REST API, by providing an empty value for the name parameter ("name":""). Volumes created this way appear in the volume list with a name starting with the "~" character, followed by the globalId of the volume; for example, ~abcd.b.t. This approach, however, represents a significant change to the way an integration works. We have chosen to move to this mode for our integrations, as they are plugged into complex management systems for which we cannot guarantee that a volume name will never be reused; for example, during a retry of a volume create operation.
This also comes with a new directory, /dev/storpool-byid, holding symlinks to the /dev/sp-X devices, but using the globalId instead of the volume names as in /dev/storpool.
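Below is a minimal sketch of creating such a nameless volume through the REST API and then locating its block device by global ID. The API host, authentication token, size, and template are placeholders, and the globalId value simply reuses the example above:

    # Create a volume with an empty name; StorPool identifies it by the globalId it assigns:
    curl -s -H 'Authorization: Storpool v1:SP_AUTH_TOKEN' \
        -d '{"name": "", "size": 10737418240, "template": "hybrid"}' \
        http://SP_API_HOST:81/ctrl/1.0/VolumeCreate

    # If the assigned globalId is abcd.b.t, the volume is listed as ~abcd.b.t,
    # and once attached its device can be resolved by ID:
    ls -l /dev/storpool-byid/abcd.b.t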
Currently, all the integrations that support multi-cluster use global IDs, and no integration tries to guarantee name uniqueness.
Considerations
There are two more considerations related to naming. They have lower impact:
A volume move will recreate the volume in the target cluster using a template with the same name as the one in the source cluster. Therefore, every cluster in a multi-cluster must have all templates created with the same names.
The cluster names registered in each sub-cluster should be the same across the multi-cluster, so that remote cluster commands can be executed from any API endpoint with the same effect.
Just for completeness, in each cluster the local cluster can also be (and should be) named, so a storpool cluster cmd NAME command will work for it too.
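For example, if every sub-cluster registers the same cluster names, including a name for itself, the same command works regardless of which API endpoint the integration talks to (the cluster name and the command after it are examples):

    # Run a command in the sub-cluster named "cl-a", even when that is the local one:
    storpool cluster cmd cl-a volume status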
Example with several volumes having the same name
Violating the uniqueness constraint will eventually result in data corruption for a given volume, so handling it is a must before considering the multi-cluster features. Here is a simple example of how data corruption will occur if there are two volumes with the same name:
Let's assume the volume name in question is test.
There is one volume test in the source sub-cluster A with a VM working with it, and another volume test in the target sub-cluster B.
If you try to migrate the VM from cluster A to cluster B, the multi-cluster "volume attach" command will find the wrong volume test already existing in the target sub-cluster B, and will simply attach it to the client.
Thus, when the VM is migrated, the data in the test volume will be totally different, resulting in data corruption and a VM crash.