Common

1. What is the upgrade policy and compatibility between versions?

StorPool supports upgrades between the oldest production release and the latest release. *

That said, an upgrade from 19.01.xx to 19.01.yy is not very different from an upgrade from 19.01.xx to 20.0.xx.

As usual, our exceptional support team will fully handle the effort to apply any new best practices or enable new features after each upgrade.

Note

  • An exception to this rule might be a release that is still considered a technical preview and is not expected to get into production.

2. How to calculate the estimated usable disk space for a hybrid cluster?

For the common case with one replica (copy) in an SSD placement group, the usable capacity is approximately the combined capacity of all SSDs after accounting for 10% overhead:

\[\frac{\sum (SSD\ capacity)}{1.1} = usable\ space\]

The only exception is when the available space for the other two copies in the hard drive placement group is less than twice the available space in the SSD placement group, i.e. the hard drives are the limiting factor. In that case the calculation looks like this:

\[\frac{\sum (HDD\ capacity)}{2.2} = usable\ space\]

The same applies, with the logic reversed, to a cluster with two replicas on SSDs and a single replica on HDDs:

\[\frac{\sum (SSD\ capacity)}{2.2} = usable\ space\]
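The three formulas above can be combined into one rule: each placement group limits the usable space to its capacity divided by the number of replicas it holds and by 1.1, and the smaller limit wins. The following is a minimal sketch under that reading; the function name and the example drive sizes are made up for illustration, and applying the HDD limit of capacity / 1.1 in the reversed case is an assumption derived from "the reverse logic".

```python
from typing import Sequence

OVERHEAD = 1.1  # roughly 10% overhead, as in the formulas above


def usable_space(ssd: Sequence[float], hdd: Sequence[float],
                 ssd_replicas: int = 1, hdd_replicas: int = 2) -> float:
    """Estimate usable space, in the same unit as the drive capacities.

    Each placement group can hold at most capacity / (replicas * 1.1);
    the smaller of the two limits is the estimate.
    """
    ssd_limit = sum(ssd) / (ssd_replicas * OVERHEAD)
    hdd_limit = sum(hdd) / (hdd_replicas * OVERHEAD)
    return min(ssd_limit, hdd_limit)


# Hypothetical cluster: 8 x 2 TB SSDs and 12 x 4 TB HDDs (SSD-limited).
print(f"{usable_space([2.0] * 8, [4.0] * 12):.2f} TB")  # 14.55 TB
```

With only 5 x 4 TB HDDs in the same hypothetical cluster, the hard drives become the limiting factor and the result is 20 / 2.2.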

3. What is the current capacity, provisioned and available space in the cluster?

The needed information can be collected with the storpool template status CLI command, or from the statistics collected over time at https://analytics.storpool.com for the cluster at hand.

# storpool template status
----------------------------------------------------------------------------------------------------------------------------------------------------------------
| template             | place head | place all  | place tail | rdnd. | volumes | snapshots/removing |    size | capacity |  avail. | avail. all | avail. tail |
----------------------------------------------------------------------------------------------------------------------------------------------------------------
| hybrid               | hdd        | hdd        | ssd        |     3 |       2 |         0/0        |  8.0 TB |    11 TB |  9.2 TB |      56 TB |      9.2 TB |
----------------------------------------------------------------------------------------------------------------------------------------------------------------

The size column shows the space that is provisioned. The capacity column shows the full capacity that could be filled with the given template placement and replication.

The avail. column shows an approximation of the available space for storing new data. It shows how much additional data can be stored in the cluster using this template. The value in this column is calculated as the minimum free capacity of all disks in the placement group, reduced by a safety margin of 10GiB, multiplied by the number of disks in the placement group, and divided by the replication factor. I.e.:

\[avail = \frac{(\min(free) - 10\ GiB) \cdot N}{replicas}\]

Where free is the free capacity of each disk in the placement group, N is the number of disks in the placement group, replicas is the number of replicas stored in this placement group.

When a template uses multiple placement groups, as in the example above, the value is calculated for each placement group separately, and the smallest value is reported.

In most cases this approximates the available space well. There are two cases when the reported value can differ significantly from the actual available space:

  1. When disk usage in the placement group is not equally balanced. In this case the disk with the minimum amount of free space will determine the reported value.

  2. When disks of different capacity are used in the same placement group. StorPool tries to balance the usage of all disks to the same percentage, so the smaller disks will have less free capacity in absolute terms. In this case the reported available capacity will not correctly represent the total available space. Note that this does not affect the actual storage capacity of the system.

The estimate is done this way to be conservative and provide good enough early warning before the cluster runs out of free space.
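To make the calculation of the avail. column concrete, here is a minimal sketch that follows the formula above: the value is computed per placement group and the smallest result is reported. The function names and the free-space figures are hypothetical; this is not the code StorPool itself runs.

```python
from typing import List, Tuple

GIB = 1024 ** 3
SAFETY_MARGIN = 10 * GIB  # safety margin from the formula above


def group_avail(free_bytes: List[int], replicas: int) -> float:
    """(min(free) - 10 GiB) * N / replicas for one placement group."""
    n = len(free_bytes)
    return (min(free_bytes) - SAFETY_MARGIN) * n / replicas


def template_avail(groups: List[Tuple[List[int], int]]) -> float:
    """Smallest per-group value across all placement groups of a template."""
    return min(group_avail(free, replicas) for free, replicas in groups)


# Hypothetical hybrid template: one replica on three unevenly filled SSDs,
# two replicas on four evenly filled HDDs.
ssd_group = ([110 * GIB, 210 * GIB, 310 * GIB], 1)
hdd_group = ([510 * GIB] * 4, 2)
print(template_avail([ssd_group, hdd_group]) / GIB)  # 300.0
```

In this example the SSD with only 110 GiB free determines the result, even though the other two SSDs have much more free space, which is the first of the two cases described above.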

4. What is the current thin provisioning gain of the cluster?

The thin provisioning gain can be calculated from the values reported by the storpool template status CLI command (see the example output above) using the following formula:

\[\frac{size}{capacity - avail} = gain\]

For example:

\[\frac{8}{11 - 9.2} \approx 4.44\]

The calculated gain of the thin provisioning is x4.44.
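The same formula can be wrapped in a small helper when the values are collected by a script. This is a sketch only; the function name and the sample numbers (20 TB provisioned, 30 TB capacity, 25 TB available) are hypothetical and not taken from a real cluster.

```python
def thin_provisioning_gain(size: float, capacity: float, avail: float) -> float:
    """Return size / (capacity - avail); all values in the same unit."""
    used = capacity - avail
    if used <= 0:
        # Nothing is stored yet, so the ratio is undefined.
        raise ValueError("no space is in use, the gain is undefined")
    return size / used


gain = thin_provisioning_gain(size=20.0, capacity=30.0, avail=25.0)
print(f"x{gain:.2f}")  # x4.00
```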

5. How does StorPool handle writes?

StorPool uses a copy-on-write strategy for writing to the drives, which is needed to guarantee data consistency and allows very fast storage operations. However, this requires aggregation for random workloads, which usually happens in the background when no significant load is placed on the backend storage. This can be observed with iostat and with storpool disk {diskId} activeRequests, and is perfectly normal.

From what we have observed in production systems, and in blktraces gathered from real workloads, the higher IOPS demand comes in short bursts, with a much lower average IOPS demand. This allows the system to aggregate and cope with the workload without affecting storage operations.

When the storage system is hammered with unusually high artificial random workloads for long periods of time, it will start aggregating while serving the workload, which will slow down storage operations. Eventually the performance will settle at a level where the aggregation and the random workload are balanced, and it will stay at that level until the drives are full.

6. Why are there partitions on all SATA drives used by StorPool?

The first reason is that partitions allow proper alignment to be enforced. For example, starting the partition at 2M takes care of any internal alignment of the underlying device smaller than 2M, even when the disk is a virtual device exposed by a RAID controller.
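Whether an existing partition starts on a 2M boundary can be checked with simple arithmetic. The sketch below is illustrative only: the function name is made up, and it assumes the start offset is given in 512-byte sectors, which is the unit Linux uses in /sys/block/<disk>/<partition>/start.

```python
ALIGNMENT = 2 * 1024 * 1024  # 2M, as suggested above


def is_aligned(start_sector: int, sector_size: int = 512,
               alignment: int = ALIGNMENT) -> bool:
    """True when the partition's byte offset is a multiple of the alignment."""
    return (start_sector * sector_size) % alignment == 0


print(is_aligned(4096))  # True: 4096 * 512 bytes is exactly 2M
print(is_aligned(63))    # False: legacy DOS-style start sector
```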

Another reason is that some controllers change the order of drives on each boot. This is sometimes combined with the boot drive for the root device of the operating system being on the same controller. In such cases having GRUB installed on all other devices as well is the only workaround for getting consistent booting.

We have also seen the kernel getting confused by the data in the first few sectors, with the weird side effect of detecting a phantom partition in the last few gigabytes of the drive, which obstructs normal operations with the disk. In this case the best option is to re-create the same disk ID (i.e. re-balance out, then re-balance back in) on a partition on the same drive with proper alignment.

This rule does not apply to NVMe devices, due to the way they are managed (the kernel does not see the devices managed by StorPool because of kernel bypass). Even then, properly aligned partitions are required if the device is used for journals for other devices (e.g. an Optane drive in front of HDDs) or if the same drive needs to be split among multiple server instances for performance reasons.

7. Why do the StorPool processes seem to be at 100% CPU usage all the time?

TL;DR: They are not (using 100% CPU), but the kernel gets confused that they are.

Longer explanation: processes like storpool_server and storpool_block are sometimes reported by top at 100% CPU usage all the time. This is the expected behaviour when hardware sleep is enabled (the SP_SLEEP_TYPE parameter set to hsleep) and is related to the implementation of time-critical services in StorPool. The actual CPU usage is much lower than reported by top, and can be monitored with cpupower monitor by looking at the values of the Mperf C0 field for the CPUs dedicated to these processes. The only exception to this rule is for nodes/services configured with a non-default SP_SLEEP_TYPE: with ksleep the reported usage will vary, and with no sleep the processes will actually stay at 100% CPU usage.

8. What addresses does StorPool use for monitoring?

The IP addresses from which we access the servers and the servers send monitoring/statistic data back are 46.233.30.128/32, 78.90.13.150/32, 185.117.80.0/24 (IPv4) and 2a05:5e40:f00f::/48 (IPv6). The port used for sending monitoring/statistics is 443 (HTTPS). Because we look up the host by name, there should be an available DNS service configured on the nodes.

A simple test is to try connecting to mon1.storpool.com on port 443.
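Both checks can be scripted. The sketch below uses only the Python standard library: one helper tests whether an address falls inside the ranges listed above (useful when reviewing firewall rules), the other attempts a TCP connection to mon1.storpool.com on port 443. The function names and the 5 second timeout are assumptions made for illustration.

```python
import ipaddress
import socket

# Address ranges listed above.
MONITORING_NETWORKS = [
    ipaddress.ip_network(net)
    for net in ("46.233.30.128/32", "78.90.13.150/32",
                "185.117.80.0/24", "2a05:5e40:f00f::/48")
]


def is_monitoring_address(address: str) -> bool:
    """True when the address belongs to one of the listed ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in MONITORING_NETWORKS)


def can_connect(host: str = "mon1.storpool.com", port: int = 443,
                timeout: float = 5.0) -> bool:
    """Resolve the host by name and try to open a TCP connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, timeouts and refused connections
        return False
```

Calling can_connect() on a node returns False both when DNS resolution fails and when the port is blocked, so a failure points to either of the two requirements above.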

9. What is required when I add/change memory modules on a hypervisor?

In case of memory module(s) addition or changes the best current practice is to run a full memtest before the hypervisor is returned into production. We have a reduced set of memory tests, usually available for execution as ~/storpool/perform_memtest.sh which will re-validate the memory with parallel memtester executions. The usual run time is between 10 and 40 minutes, depending on the number of CPU cores and the amount of memory installed in the hypervisor.

If memory was added, the old cgroup limits will have to be updated, which in most cases is as easy as:

storpool_cg print # to see the presently configured limits
storpool_cg conf -NME # -N for noop to check what will be changed
storpool_cg conf -ME # to actually perform the changes live

The above example is for hypervisor-only nodes. Nodes that also expose disks to the cluster (i.e. run one or more storpool_server services) need the converged=1 parameter on the storpool_cg conf ... command line, because this cannot be auto-detected. Detailed information about the cgroups configuration and the storpool_cg tool is available at Control groups.

In any case if unsure, please open up a ticket with StorPool support.

Exceptions

1. StorPool not working on a VLAN interface on an I350 NIC

Generally, 1GE interfaces are not supported with StorPool, but they are occasionally useful for test installations. For I350-based NICs, VLAN offload must be disabled on the parent NIC.

Example that verifies the current state, disables VLAN offload, and confirms the change:

# verify
[root@s21 ~]# ethtool -k eth1 | grep vlan-offload
rx-vlan-offload: on
tx-vlan-offload: on
# disable
[root@s21 ~]# ethtool -K eth1 rxvlan off
Actual changes:
rx-vlan-offload: off
tx-vlan-offload: off [requested on]
# confirm
[root@s21 ~]# ethtool -k eth1 | grep vlan-offload
rx-vlan-offload: off
tx-vlan-offload: off [requested on]

This configuration works only without hardware acceleration, with CPU cores reserved for the NIC interrupts and the storpool_rdma user-space threads (the iface_acc=false flag in storpool_cg).

OnApp integration

Support for CloudBoot clients (a.k.a. SmartServers) is not available with the native StorPool protocol and the StorPool-OnApp integration, due to the many operational compatibility issues that might arise from such a configuration. Only OnApp KVM Static Boot hypervisors can be connected to StorPool using the StorPool native protocol and the StorPool-OnApp integration (with a separate volume per virtual disk).

The best alternative is native iSCSI and volume export for all CloudBoot clients, managed by OnApp as normal LVM. The iSCSI support in our latest releases improves the performance of the native iSCSI target, which is now nearly identical to that of the StorPool native protocol when the recommended settings are used.

The only downsides of using iSCSI instead of the StorPool native protocol are the lack of end-to-end data integrity (checksums) from the target to the initiator, and the lack of direct correspondence between a virtual disk and a StorPool volume (LUN), which is available only with the StorPool-OnApp integration and static hypervisors. What is gained with the iSCSI approach for OnApp Smart Servers is the absence of StorPool code on the hosts, i.e. no cgroups, no kernel parameters, and no StorPool dependencies to install and maintain (which are hard to keep up to date, especially in the CloudBoot/SmartServer case).

Erasure coding

1. Is Erasure Coding enabled for a whole cluster, or is it enabled on a per-volume basis?

When erasure coding (EC) is enabled in a cluster, the conversion from triple replication to an erasure coding redundancy scheme is done one volume-snapshot chain at a time. Volume-snapshot chains are converted individually, and the conversion is performed live, so all user operations continue to be processed while the conversion to erasure coding is being performed.

Conversion here means that at least one snapshot is on an EC template and is being encoded. Data in volumes remains triple replicated even when erasure coding is enabled.

2. Are there any expected performance issues that need to be considered?

  • During normal operations:

    Read and write performance is the same as you get from triple replication - same latency, max IOPS, and bandwidth.

  • Initial conversion:

    Based on our testing, there might be a slight latency increase during the initial conversion from triple replication to the erasure-coded state. Environments where EC is already deployed run without issues, but as a precaution we have introduced configurable conversion speed limits to prevent latency issues for user input/output operations.

  • Fault state:

    There is a read performance penalty when one or two nodes are down. With one node down, the average read latency increases by about 6 to 10 microseconds. With two nodes down, the average read latency increases by 20 to 30 microseconds compared to normal operations. In both cases, write operations are not affected.

3. What hardware configurations are supported?

Initially, erasure coding in StorPool will be supported only for all-NVMe SSD deployments.

4. What are the supported erasure coding schemes and their respective overheads?

The supported erasure coding schemes are:

  • 2+2 - supports clusters with 5 nodes or more, has approximately 2.4x overhead

  • 4+2 - supports clusters with 7 nodes or more, has approximately 1.8x overhead

  • 8+2 - supports clusters with 11 nodes or more, has approximately 1.5x overhead
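The list above can be turned into a small lookup for capacity planning. The sketch below only restates the node minimums and approximate overheads from the list; the names are hypothetical and the result is a rough estimate, not a sizing tool.

```python
from typing import Dict, List, NamedTuple


class Scheme(NamedTuple):
    min_nodes: int   # minimum cluster size for the scheme
    overhead: float  # approximate raw-to-usable space ratio


SCHEMES: Dict[str, Scheme] = {
    "2+2": Scheme(min_nodes=5, overhead=2.4),
    "4+2": Scheme(min_nodes=7, overhead=1.8),
    "8+2": Scheme(min_nodes=11, overhead=1.5),
}


def available_schemes(nodes: int) -> List[str]:
    """Schemes whose minimum node count is met by the cluster."""
    return [name for name, s in SCHEMES.items() if nodes >= s.min_nodes]


def raw_space_needed(data: float, scheme: str) -> float:
    """Approximate raw space needed to store `data` (same unit)."""
    return data * SCHEMES[scheme].overhead


print(available_schemes(7))  # ['2+2', '4+2']
```

For example, a hypothetical 100 TB of data on a 7-node cluster needs roughly 180 TB of raw space with 4+2, while 8+2 is not available until the cluster has 11 nodes.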

5. Do the required regular snapshots offset some of the space savings?

Each snapshot keeps the difference between the previous snapshot and the volume, so the overhead of each snapshot is proportional to the amount of writing that happened in the volume since the previous snapshot. The total snapshot overhead depends on the number of volumes, the frequency of snapshots, the change rate between each snapshot, and the retention settings of the snapshots.

As a result of the total snapshot overhead, the cluster will have slightly more stored data, but that increase should be negligible compared to the space savings achieved thanks to erasure coding. If needed, we can help you calculate the estimated total snapshot overhead in case you go with the default policy that keeps 24 hourly snapshots and 3 daily snapshots.

6. Would I need additional space during the conversion from triple replication to erasure coding?

No. There is no point in time when additional space is required for erasure coding compared to triple replication. The data on the drives is read once to calculate the parity blocks, and then a rebalance is initiated. As a result, data converted to erasure coding takes less space on the drives.

7. Is there an ideal minimum volume size?

No. Every volume in the system can benefit from the space-saving effect of erasure coding.

8. How frequently does StorPool recalculate the parity blocks - with each change of the data blocks, or with another method?

StorPool does not recalculate parity blocks. The calculation of parity blocks is done as a background task after creating each regular snapshot of an EC-enabled volume. Since new data is stored in the volumes, and the unique new data is kept in each newly created snapshot, there is no need to recalculate older parity blocks.

As older snapshots expire, the common parts between the expired and new snapshots are merged into newer snapshots as a background task.

9. What is the impact of erasure coding on the network load?

During initial conversion there is a slight increase in network traffic between nodes, which depends on the configured bandwidth limit for the 3N to EC conversion task. During normal operations with EC there’s an extra read of the data to perform the encoding of snapshots (when new snapshots are created).

10. Is there a difference between the performance of the different erasure coding schemes?

  • During normal operations:

    No, the performance is consistent between all erasure coding schemes.

  • During initial conversion:

    No, the slight latency increase is observed for all erasure coding schemes.

  • During fault states:

    Yes, for 8+2 the impact to latency is slightly higher compared to 4+2 or 2+2 for both 1-node-down and 2-nodes-down events. The numbers under question number 2 are for the 8+2 scheme, so you can expect less performance impact if you use another scheme.

11. Is there any delay in user operations when regular snapshots are being created?

No. StorPool performs snapshots instantly and with no performance impact.

12. What are the chances that erasure coding can’t recover from a disk failure?

The level of protection erasure coding provides is equivalent to the protection of triple replication. It can recover from any double failure (that is, failure of any number of storage devices in up to two fault sets). For example, an erasure coded cluster can fully recover if any of the following occurs:

  • Any two fault sets in the cluster are permanently lost

  • One fault set and one or more drives in another fault set fail

  • Multiple drives on two different fault sets fail

Note

When no fault sets are configured, a single node is a fault set.
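The rule above (any number of failed drives, as long as they are in at most two fault sets) can be expressed as a short check. This is an illustrative sketch with hypothetical drive, node, and fault set names; a node without a configured fault set is treated as its own fault set, as the note says.

```python
from typing import Dict, Iterable, Optional


def is_recoverable(failed_drives: Iterable[str],
                   drive_node: Dict[str, str],
                   node_fault_set: Optional[Dict[str, str]] = None) -> bool:
    """True when the failed drives span at most two fault sets."""
    node_fault_set = node_fault_set or {}
    affected = set()
    for drive in failed_drives:
        node = drive_node[drive]
        if node in node_fault_set:
            affected.add(("fault set", node_fault_set[node]))
        else:
            # No fault set configured: the node counts as a fault set.
            affected.add(("node", node))
    return len(affected) <= 2


drives = {"d1": "n1", "d2": "n1", "d3": "n2", "d4": "n3"}
print(is_recoverable(["d1", "d2", "d3"], drives))  # True: two nodes affected
print(is_recoverable(["d1", "d3", "d4"], drives))  # False: three nodes affected
```

If n2 and n3 were placed in the same fault set, the second case would also be recoverable, because the failures would then span only two fault sets.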