Cluster capacity

In StorPool, each drive is split into allocation groups. Writing data to a drive is performed by creating many entries in these allocation groups.

As a result of write operations, allocation groups can sometimes reach a state where they contain free space that cannot currently be used. That’s why StorPool features a background aggregation operation, which merges entries when needed and compacts the used space.

The following sections explain how you can use this information when determining the total, free, and used space of your cluster.

Available information and its meaning

StorPool disk list

Here is example output from the storpool disk list command (for details, see Disk):

# storpool disk list

disk  |  server  |    size    |    used    |  est.free  |      %  |  free entries  |  on-disk size  |  allocated objects |  errors |  flags
1001  |    10.0  |   3.5 TiB  |   3.0 TiB  |   468 GiB  |   86 %  |      12418867  |       2.8 TiB  |   808354 / 3675000 |   0 / 0 |
1002  |    10.0  |   3.5 TiB  |   3.0 TiB  |   486 GiB  |   86 %  |      12331687  |       2.8 TiB  |   807014 / 3675000 |   0 / 0 |
1003  |    10.0  |   3.5 TiB  |   3.0 TiB  |   478 GiB  |   86 %  |      12187140  |       2.8 TiB  |   794240 / 3675000 |   0 / 0 |
1004  |    10.1  |   3.5 TiB  |   3.0 TiB  |   462 GiB  |   87 %  |      11992145  |       2.8 TiB  |   799545 / 3675000 |   0 / 0 |
1005  |    10.1  |   3.5 TiB  |   3.0 TiB  |   464 GiB  |   86 %  |      12166754  |       2.8 TiB  |   805114 / 3675000 |   0 / 0 |
1006  |    10.1  |   3.5 TiB  |   3.0 TiB  |   483 GiB  |   86 %  |      12091472  |       2.8 TiB  |   807347 / 3675000 |   0 / 0 |
1007  |    10.2  |   3.5 TiB  |   3.0 TiB  |   480 GiB  |   86 %  |      12050469  |       2.8 TiB  |   803044 / 3675000 |   0 / 0 |
1008  |    10.2  |   3.5 TiB  |   3.0 TiB  |   466 GiB  |   86 %  |      12065849  |       2.8 TiB  |   804601 / 3675000 |   0 / 0 |
1009  |    10.2  |   3.5 TiB  |   3.0 TiB  |   470 GiB  |   86 %  |      12272321  |       2.8 TiB  |   844295 / 3675000 |   0 / 0 |

Here is the same example using the -j option for displaying the result in JSON format:

# storpool -j disk list

  "1001" : {
     "agAllocated" : 5635,
     "agCount" : 6934,
     "agFree" : 1299,
     "agFreeNotTrimmed" : 0,
     "agFreeing" : 0,
     "agFull" : 21,
     "agMaxSizeFull" : 3693,
     "agMaxSizePartial" : 12,
     "agPartial" : 1905,
     "aggregateScore" : {
        "entries" : 0,
        "space" : 0,
        "total" : 0
     },
     "applyingTransaction" : false,
     "description" : "",
     "device" : "0000:bf:00.0-p1",
     "ejectedReason" : "Mgmt normal eject",
     "empty" : false,
     "entriesAllocated" : 3875797,
     "entriesCount" : 14700000,
     "entriesFree" : 10824203,
     "generationLeft" : -1,
     "hadMisalignedMaxsizeTrims" : false,
     "id" : 1001,
     "isWbc" : false,
     "journaled" : false,
     "lastScrubCompleted" : 1738330398,
     "lastUsedDiskObjects" : 645149,
     "model" : "SAMSUNG MZQL27T6HBLA-00A07",
     "mustTest" : false,
     "noFlush" : false,
     "noFua" : false,
     "noTrim" : false,
     "objectsAllocated" : 639496,
     "objectsCount" : 3675000,
     "objectsFree" : 3035504,
     "objectsOnDiskSize" : 2762300813312,
     "pendingErrorRecoveries" : 0,
     "performance" : {
        "avgDiskLatency" : 40,
        "avgJournalLatency" : 0,
        "diskAvgLatencyLimitActual" : 25000,
        "diskLatencyLimitOverride" : "off",
        "diskTotalLatencyLimitActual" : 3200000,
        "journalAvgLatencyLimitActual" : "unlimited",
        "journalLatencyLimitOverride" : "off",
        "journalTotalLatencyLimitActual" : "unlimited",
        "maxAvgDiskLatency" : 744,
        "maxAvgJournalLatency" : 0,
        "maxTotalDiskLatency" : 95286,
        "maxTotalJournalLatency" : 0,
        "timesDiskExceededLatency" : 0,
        "timesJournalExceededLatency" : 0,
        "totalDiskLatency" : 5223,
        "totalJournalLatency" : 0
     },
     "preservedObjectsCount" : 3675000,
     "recovery" : {
        "ecCodingRequests" : 0,
        "maxLocalRecoveryRequests" : 0,
        "maxRemoteRecoveryRequests" : 0
     },
     "scrubbedBytes" : 0,
     "scrubbing" : false,
     "scrubbingBW" : 0,
     "scrubbingFinishAfter" : 0,
     "scrubbingPaused" : false,
     "scrubbingPausedFor" : 0,
     "scrubbingStartedBefore" : 0,
     "sectorsCount" : 7501461504,
     "serial" : "S6CKNN0W815414",
     "serverId" : 10,
     "serverIdString" : "10.0",
     "serverInstanceId" : 0,
     "softEject" : "off",
     "ssd" : true,
     "testResults" : {
        "dataCorruption" : false,
        "failed" : false,
        "readBandwidthThreshold" : false,
        "readBps" : 2043000000,
        "readError" : false,
        "readLatencyThreshold" : false,
        "readMaxLat" : 0,
        "stall" : false,
        "timesTested" : 1,
        "writeBandwidthThreshold" : false,
        "writeBps" : 2031000000,
        "writeError" : false,
        "writeLatencyThreshold" : false,
        "writeMaxLat" : 0
     },
     "totalErrorsDetected" : 0,
     "wbc" : null
  },

The values related to aggregation and capacity are:

  • The used and est. free values in storpool disk list are accounted in allocation groups.

  • The free allocation groups (agFree) have no data in them, and the used ones (agCount - agFree) can either be completely full or have some data in them (agFull, agPartial, and so on).

  • The on-disk size is the actual amount of user data stored, as if it were completely aggregated.

  • sectorsCount in the JSON output is the number of 512-byte sectors of the underlying device, i.e. raw disk space. That includes space for data and metadata.
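
For example, these values can be read directly from the JSON output. Below is a minimal sketch, assuming the disk entries are returned under the usual data key of the API response (as with the other storpool -Bj invocations used later in this section):

# Per-disk allocation groups, raw device size, and on-disk size
storpool -Bj disk list | jq -r '
  .data | to_entries[] |
  "disk \(.key): agCount=\(.value.agCount) agFree=\(.value.agFree) raw=\(.value.sectorsCount * 512) B on-disk=\(.value.objectsOnDiskSize) B"'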

StorPool template status

Here is example output from the storpool template status command (for details, see Getting status):

# storpool template status

        ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        | template             | place head | place all  | place tail | rdnd. |  volumes | snapshots/removing |     size | capacity |   avail. | avail. head |  avail. all | avail. tail | flags |
        ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        | nvme                 | nvme       | nvme       | nvme       |     3 |       14 |         3/0        |  2.9 TiB |   43 TiB |   10 TiB |      31 TiB |      31 TiB |      31 TiB |       |
        | tier0                | nvme       | nvme       | nvme       |     3 |        1 |         0/0        |  1.0 GiB |   43 TiB |   10 TiB |      31 TiB |      31 TiB |      31 TiB |       |
        | tier1                | nvme       | nvme       | nvme       |     3 |        1 |         0/0        |   45 GiB |   43 TiB |   10 TiB |      31 TiB |      31 TiB |      31 TiB |       |
        | tier2                | nvme       | nvme       | nvme       |     3 |        1 |         0/0        |  1.0 TiB |   43 TiB |   10 TiB |      31 TiB |      31 TiB |      31 TiB |       |
        ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Here is the same example using the -j option for displaying the result in JSON format:

  # storpool -j template status

  {
   "availablePlaceAll" : 34419276922165,
   "availablePlaceHead" : 34419276922165,
   "availablePlaceTail" : 34419276922165,
   "bw" : "-",
   "capacityPlaceAll" : 140944875985966,
   "capacityPlaceHead" : 140944875985966,
   "capacityPlaceTail" : 140944875985966,
   "ec" : "",
   "id" : 42,
   "iops" : "-",
   "limitType" : "total",
   "name" : "tier0",
   "objectsCount" : 32,
   "onDiskSize" : 318478208077776,
   "parentName" : "",
   "placeAll" : "nvme",
   "placeHead" : "nvme",
   "placeTail" : "nvme",
   "removingSnapshotsCount" : 0,
   "replication" : 3,
   "reuseServer" : false,
   "size" : 1073741824,
   "snapshotsCount" : 0,
   "snapshotsWithChildrenSize" : 0,
   "snapshotsWithoutChildrenSize" : 0,
   "stored" : {
      "capacity" : 46981625328655,
      "free" : 11473092307388,
      "internal" : {
         "u1" : 33276265370001,
         "u2" : 2075390420371,
         "u3" : 156877230893
      },
      "placeAll" : {
         "capacity" : 140944875985966,
         "free" : 34419276922165,
         "internal" : {
            "u1" : 99828796110006,
            "u2" : 6226171261112,
            "u3" : 470631692682
         }
      },
      "placeHead" : {
         "capacity" : 140944875985966,
         "free" : 34419276922165,
         "internal" : {
            "u1" : 99828796110006,
            "u2" : 6226171261112,
            "u3" : 470631692682
         }
      },
      "placeTail" : {
         "capacity" : 140944875985966,
         "free" : 34419276922165,
         "internal" : {
            "u1" : 99828796110006,
            "u2" : 6226171261112,
            "u3" : 470631692682
         }
      }
   },
   "storedSize" : 33276265370001,
   "totalSize" : 3221225472,
   "volumesCount" : 1,
   "volumesSize" : 1073741824
}

This is where the template data in StorPool analytics comes from, and it is what orchestration systems and operators use to judge the amount of free space in the system.

The capacity is calculated as follows:

  1. For each disk, take the number of allocation groups (agCount), and subtract 8.

  2. Sum all these allocation groups as bytes.

  3. Divide by the overhead factor (about 1.09).

  4. Multiply this by the number of drives, and divide by the replication factor.

The storedSize value (present in the JSON output, not visible in the StorPool CLI) is calculated as the sum of the on-disk size of the drives in the tail placement group, divided by the replication factor. For EC templates (see Erasure Coding) this value is wrong, since it also takes the coded pieces into account, and it should not be relied on in any way. Note that if there are multiple templates with different volumes but the same placement, the storedSize is the total for all of them, as the value is simply taken from the drives and not grouped by template.

The calculation of free space is somewhat non-intuitive. It begins by identifying the disk with the smallest agFree (available allocation groups) value within the placement group. From this value, 8 is subtracted as a safety buffer. The result is then divided by the overhead factor (approximately 1.09), multiplied by the total number of drives, and finally divided by the replication factor.

The constant value of 8 acts as a safety margin to ensure that each drive maintains at least that many free allocation groups. Dropping below this threshold could lead to a deadlock, preventing any further writes to the drive.

This method provides a conservative estimate, assuming that new data will be evenly distributed across all drives in the placement group. As it depends on expected load patterns, the result may either overestimate or underestimate the actual available space.

\[\frac{\min(agFree(DisksFromPlacementGroup)) - 8}{1.09} \times \frac{NumberOfDisksInThePlacementGroup}{ReplicationFactor}\]
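
Below is a rough sketch of this estimate, assuming that every drive returned by disk list belongs to the placement group in question and that one allocation group is AG_SIZE_BYTES large (a hypothetical value; the allocation group size is not stated in this document). The division by 1.09 is approximated with integer arithmetic:

AG_SIZE_BYTES=$((512 * 1024 * 1024))   # assumed allocation group size
REPLICATION=3

min_agfree=$(storpool -Bj disk list | jq '[ .data[].agFree ] | min')
num_disks=$(storpool -Bj disk list | jq '.data | length')

# (min agFree - 8) / 1.09 * number of disks / replication factor, in bytes
echo "estimated free: $(( (min_agfree - 8) * AG_SIZE_BYTES * 100 / 109 * num_disks / REPLICATION )) B"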

There is a value named u2 under the internal objects, which represents the difference between the estimated free space and the total number of free allocation groups (AGs) across the drives. In other words, it reflects the amount of space that would be considered available if the placement group were perfectly balanced.

Tip

It is expected and almost always the case that the combined total of free and used space is less than the reported capacity. This is because the free space estimate is intentionally conservative. Additionally, there’s no guarantee that including the u2 value will fully account for the difference.
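
To see how these values relate on a live system, the capacity, free, storedSize, and u2 values can be pulled from the JSON output. A minimal sketch, assuming the templates are returned under the data key:

storpool -Bj template status | jq -r '
  .data[] |
  [ .name, .stored.capacity, .stored.free, .storedSize, .stored.internal.u2 ] |
  @tsv'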

StorPool volume status

The VolumesGetStatus call in the REST API returns the status of volumes and snapshots. The main value in the output is storedSize, which represents the actual amount of user data written to the volume/snapshot (including user-overwritten data, and so on), all packed. It is used for billing purposes to ascertain the amount of user data on the system. This is almost always less than the sum of the object on-disk sizes available in the template status and disk list commands described above.
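
For example, the total amount of stored user data across all volumes and snapshots can be obtained by summing this field over all returned entries (the data wrapper follows the same pattern as the script example later in this section):

storpool -Bj volume status | jq '[ .data[].storedSize ] | add'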

Note

The value for a volume could be a lot less than the actual data, with most of the data residing in its snapshots. volume status does no summation; that is left to the user.

The storedSize values from this call are used to calculate the stored space for billing purposes for customers who pay per stored data.

The size values for volumes from either this call or volume list are used to calculate the provisioned size in the cluster for billing purposes for customers who pay based on provisioned size.
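
A minimal sketch of the provisioned-size sum, assuming the field is named size in the volume list JSON output:

storpool -Bj volume list | jq '[ .data[].size ] | add'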

Note

The results of this call contain information specific to volumes only.

StorPool volume usedSpace

The VolumesSpace API call shows the amount of allocated space in a volume: all space in the volume that was written to and not trimmed. This is also the amount of space the volume would store if all of its snapshots were removed.

The values from this call are used for billing purposes to calculate the snapshot overhead billed as tier 3.
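
For example, the allocated space per volume can be listed and sorted to find the biggest consumers; the name and spaceUsed fields are the same ones used by the script example later in this section:

storpool -Bj volume usedSpace | jq -r '.data[] | [ .name, .spaceUsed ] | @tsv' | sort -k2 -rn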

Note

The results of this call contain information specific to volumes only.

StorPool snapshot space

The value of spaceUsed returned from the SnapshotsSpace API call shows how much space will be freed if this – and only this – snapshot is removed. Summations of this field in the same chain produce a meaningless result, and the only use of this call is to know if a snapshot with multiple children is saving space (and so has positive, not negative spaceUsed) and it needs to be completely removed by rebasing its children.
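
A small sketch listing snapshots by this value; spaceUsed comes from this call, while the name field is an assumption about the output format:

storpool -Bj snapshot space | jq -r '.data[] | [ .name, .spaceUsed ] | @tsv' | sort -k2 -rn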

Note

The results of this call contain information specific to snapshots only, and its usability is restricted.

Key considerations

Consideration 1

StorPool’s relocator (see Relocator) does not put data on drives with (by default) less than 150 GiB of free space. This is between 3.6% (4 TB drives) and 7.3% (2 TB drives), so drives must have at least that amount of free space for rebalancing to work.

A system with erasure coding must always be able to balance out a node. The formula in Consideration 2 below can be adapted for this, and there is a monitoring alert for that situation.

Consideration 2

A system must always be able to balance out a drive, so the following amount of free space should be available on every drive:

\[availableSpaceOnAllDrives = 150\,GiB + \frac{driveSize}{numDrives - 1}\]
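
A purely illustrative calculation with made-up numbers, assuming 3.5 TiB drives (3584 GiB) and 24 drives in the placement group:

drive_size_gib=3584
num_drives=24
echo "free space needed per drive: $(( 150 + drive_size_gib / (num_drives - 1) )) GiB"   # about 305 GiB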

Frequently Asked Questions (FAQ)

How much free space do I have?

I need to know if I need to expand my storage

See Consideration 1. If you’re below that, then yes. If your usage stats show steady growth, a prediction can be made and extra space ordered in due course.

I see some values and they do not make sense.

Check the source of the values and read the sections above; that should clear things up.

How much more can I expand my usage before it’s a problem?

See Consideration 1. That amount of space should remain available, so the amount of user data that can still be put on the system is roughly capacity - storedSize * factor - safety. The safety margin is described in Consideration 1, and the factor for stored-to-unaggregated data should be determined on the live system, based on what disk list shows.
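
A purely illustrative calculation in GiB, with made-up numbers: 43 TiB capacity (44032 GiB), 10 TiB of storedSize, a stored-to-unaggregated factor of about 1.1 (roughly the used to on-disk ratio in the disk list example above), and a 5 TiB safety margin:

capacity=44032; stored=10240; safety=5120
echo "usable headroom: $(( capacity - stored * 11 / 10 - safety )) GiB"   # 27648 GiB, about 27 TiB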

What uses space in the system?

Which of my users are using too much space?

This is calculated by summing storedSize over the snapshot chains of the volumes, grouped by the tag that the orchestration uses to mark VMs. It can also be combined with the ratio of usedSpace to storedSize to see who the largest users are.

When done for a period of time, it can be used for the next question.
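
A hypothetical sketch of such a grouping, summing storedSize per value of an orchestration tag named vm; the tag name, and the presence of a tags object in the volume status entries, are assumptions (if the tags are not present there, they can be taken from volume list instead):

storpool -Bj volume status | jq -r '
  [ .data[] | { tag: (.tags.vm // "untagged"), storedSize } ]
  | group_by(.tag)
  | map({ tag: .[0].tag, stored: (map(.storedSize) | add) })
  | sort_by(-.stored)[]
  | "\(.tag)\t\(.stored) B"'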

Where does the growth in used space come from?

First, it’s possible that the cluster is getting imbalanced, so the growth of u2 should be checked (Template usage internal in analytics).

Then, the growth of storedSize should be investigated, to see whether it comes from a large spike somewhere or from smooth growth. Comparing the calculations from the question above over time can show where the change comes from.

Is there something taking space on the system that should not be there?

There are multiple possibilities here:

  • Snapshot chains that do not end in a volume - for example, VolumeCare snapshots left over for a volume that has been deleted. These might be leftovers that are not needed.

  • Snapshots or volumes created manually for different reasons outside of the orchestration.

  • Snapshots or volumes unknown to the orchestration and left over.

  • In the case of large volumes shared by multiple VMs (like VMware VMFS datastores or their XenServer equivalents), non-working reclaim/TRIM.

What is the snapshot overhead?

The sum of storedSize from volume status minus spaceUsed from volume usedSpace.

Here is an example:

#!/bin/bash
# Calculate the total stored data for a volume (including its snapshots), the
# allocated space of the volume itself, and the resulting snapshot overhead.
# Usage: pass the volume name as the first argument.

# Sum storedSize over the volume and all snapshots created from it (matched via onVolume).
storedSize=$(storpool -Bj volume status | jq -r --arg volumeName "$1" '[ .data[] | select(.name == $volumeName or .onVolume == $volumeName) | .storedSize ] | add // 0')
# Allocated (used) space of the volume itself.
volumeSize=$(storpool -Bj volume usedSpace | jq -r --arg volumeName "$1" '[ .data[] | select(.name == $volumeName) | .spaceUsed ] | add // 0')
snapshotOverhead=$(( storedSize - volumeSize ))

echo "Total stored space for volume $1 (including snapshots): $storedSize B"
echo "Used space of volume $1: $volumeSize B"
echo "Snapshot overhead for $1: $snapshotOverhead B"

Note

When calculating snapshot overhead for a single volume, it’s important to consider whether the snapshot chain has been rebased.

If the chain has not been rebased, you can simply follow the snapshot tree to identify all relevant snapshots and compute their overhead. However, if the chain has been rebased, this approach may be incomplete: some snapshots created from the same volume might not be included in the rebased chain.

To identify all related snapshots, you can use the onVolume attribute, which shows all snapshots associated with a volume regardless of rebase operations. However, this method also has limitations, particularly when a snapshot has multiple child volumes, in which case onVolume may reference only one of them.