Cluster capacity
In StorPool, a drive is split into allocation groups.
Writing data to a drive is performed by creating many entries in allocation groups.
As a result of write operations, allocation groups can sometimes reach a state where they contain free space that currently cannot be used.
That’s why StorPool features an aggregation background operation, which merges entries when needed and compacts the used space.
The following sections explain how you can use this information when determining the total, free, and used space of your cluster.
Available information and its meaning
StorPool disk list
Here is example output from the storpool disk list
command (for details, see Disk):
# storpool disk list
disk | server | size | used | est.free | % | free entries | on-disk size | allocated objects | errors | flags
1001 | 10.0 | 3.5 TiB | 3.0 TiB | 468 GiB | 86 % | 12418867 | 2.8 TiB | 808354 / 3675000 | 0 / 0 |
1002 | 10.0 | 3.5 TiB | 3.0 TiB | 486 GiB | 86 % | 12331687 | 2.8 TiB | 807014 / 3675000 | 0 / 0 |
1003 | 10.0 | 3.5 TiB | 3.0 TiB | 478 GiB | 86 % | 12187140 | 2.8 TiB | 794240 / 3675000 | 0 / 0 |
1004 | 10.1 | 3.5 TiB | 3.0 TiB | 462 GiB | 87 % | 11992145 | 2.8 TiB | 799545 / 3675000 | 0 / 0 |
1005 | 10.1 | 3.5 TiB | 3.0 TiB | 464 GiB | 86 % | 12166754 | 2.8 TiB | 805114 / 3675000 | 0 / 0 |
1006 | 10.1 | 3.5 TiB | 3.0 TiB | 483 GiB | 86 % | 12091472 | 2.8 TiB | 807347 / 3675000 | 0 / 0 |
1007 | 10.2 | 3.5 TiB | 3.0 TiB | 480 GiB | 86 % | 12050469 | 2.8 TiB | 803044 / 3675000 | 0 / 0 |
1008 | 10.2 | 3.5 TiB | 3.0 TiB | 466 GiB | 86 % | 12065849 | 2.8 TiB | 804601 / 3675000 | 0 / 0 |
1009 | 10.2 | 3.5 TiB | 3.0 TiB | 470 GiB | 86 % | 12272321 | 2.8 TiB | 844295 / 3675000 | 0 / 0 |
Here is the same example using the -j
option for displaying the result in JSON format:
# storpool -j disk list
"1001" : {
"agAllocated" : 5635,
"agCount" : 6934,
"agFree" : 1299,
"agFreeNotTrimmed" : 0,
"agFreeing" : 0,
"agFull" : 21,
"agMaxSizeFull" : 3693,
"agMaxSizePartial" : 12,
"agPartial" : 1905,
"aggregateScore" : {
"entries" : 0,
"space" : 0,
"total" : 0
},
"applyingTransaction" : false,
"description" : "",
"device" : "0000:bf:00.0-p1",
"ejectedReason" : "Mgmt normal eject",
"empty" : false,
"entriesAllocated" : 3875797,
"entriesCount" : 14700000,
"entriesFree" : 10824203,
"generationLeft" : -1,
"hadMisalignedMaxsizeTrims" : false,
"id" : 1001,
"isWbc" : false,
"journaled" : false,
"lastScrubCompleted" : 1738330398,
"lastUsedDiskObjects" : 645149,
"model" : "SAMSUNG MZQL27T6HBLA-00A07",
"mustTest" : false,
"noFlush" : false,
"noFua" : false,
"noTrim" : false,
"objectsAllocated" : 639496,
"objectsCount" : 3675000,
"objectsFree" : 3035504,
"objectsOnDiskSize" : 2762300813312,
"pendingErrorRecoveries" : 0,
"performance" : {
"avgDiskLatency" : 40,
"avgJournalLatency" : 0,
"diskAvgLatencyLimitActual" : 25000,
"diskLatencyLimitOverride" : "off",
"diskTotalLatencyLimitActual" : 3200000,
"journalAvgLatencyLimitActual" : "unlimited",
"journalLatencyLimitOverride" : "off",
"journalTotalLatencyLimitActual" : "unlimited",
"maxAvgDiskLatency" : 744,
"maxAvgJournalLatency" : 0,
"maxTotalDiskLatency" : 95286,
"maxTotalJournalLatency" : 0,
"timesDiskExceededLatency" : 0,
"timesJournalExceededLatency" : 0,
"totalDiskLatency" : 5223,
"totalJournalLatency" : 0
},
"preservedObjectsCount" : 3675000,
"recovery" : {
"ecCodingRequests" : 0,
"maxLocalRecoveryRequests" : 0,
"maxRemoteRecoveryRequests" : 0
},
"scrubbedBytes" : 0,
"scrubbing" : false,
"scrubbingBW" : 0,
"scrubbingFinishAfter" : 0,
"scrubbingPaused" : false,
"scrubbingPausedFor" : 0,
"scrubbingStartedBefore" : 0,
"sectorsCount" : 7501461504,
"serial" : "S6CKNN0W815414",
"serverId" : 10,
"serverIdString" : "10.0",
"serverInstanceId" : 0,
"softEject" : "off",
"ssd" : true,
"testResults" : {
"dataCorruption" : false,
"failed" : false,
"readBandwidthThreshold" : false,
"readBps" : 2043000000,
"readError" : false,
"readLatencyThreshold" : false,
"readMaxLat" : 0,
"stall" : false,
"timesTested" : 1,
"writeBandwidthThreshold" : false,
"writeBps" : 2031000000,
"writeError" : false,
"writeLatencyThreshold" : false,
"writeMaxLat" : 0
},
"totalErrorsDetected" : 0,
"wbc" : null
},
The values related to aggregation and capacity are:
- used and est. free in storpool disk list are in allocation groups.
- The free allocation groups (agFree) have NO data in them, and the used ones (agCount - agFree) can either be completely full or have some data in them (agFull, agPartial, etc.).
- The on-disk size means the actual user data stored, if it was aggregated completely.
- sectorsCount in the JSON output is the number of 512-byte sectors of the underlying device, i.e. raw disk space. That includes space for data and metadata.
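To pull just these fields for each drive, the JSON output can be filtered with jq. The sketch below is illustrative only; it assumes that the -j output wraps the per-disk objects in a data object, following the same convention as the scripts later on this page:
# Per-drive overview of the capacity-related fields (illustrative sketch).
storpool -j disk list | jq -r '
  .data | to_entries[] |
  [ .key,                          # disk ID
    .value.agCount,                # total allocation groups
    .value.agFree,                 # free allocation groups (no data in them)
    .value.objectsOnDiskSize       # actual user data stored, in bytes
  ] | @tsv'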
StorPool template status
Here is example output from the storpool template status
command (for details, see Getting status):
# storpool template status
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| template | place head | place all | place tail | rdnd. | volumes | snapshots/removing | size | capacity | avail. | avail. head | avail. all | avail. tail | flags |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| nvme | nvme | nvme | nvme | 3 | 14 | 3/0 | 2.9 TiB | 43 TiB | 10 TiB | 31 TiB | 31 TiB | 31 TiB | |
| tier0 | nvme | nvme | nvme | 3 | 1 | 0/0 | 1.0 GiB | 43 TiB | 10 TiB | 31 TiB | 31 TiB | 31 TiB | |
| tier1 | nvme | nvme | nvme | 3 | 1 | 0/0 | 45 GiB | 43 TiB | 10 TiB | 31 TiB | 31 TiB | 31 TiB | |
| tier2 | nvme | nvme | nvme | 3 | 1 | 0/0 | 1.0 TiB | 43 TiB | 10 TiB | 31 TiB | 31 TiB | 31 TiB | |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Here is the same example using the -j
option for displaying the result in JSON format:
# storpool -j template status
{
"availablePlaceAll" : 34419276922165,
"availablePlaceHead" : 34419276922165,
"availablePlaceTail" : 34419276922165,
"bw" : "-",
"capacityPlaceAll" : 140944875985966,
"capacityPlaceHead" : 140944875985966,
"capacityPlaceTail" : 140944875985966,
"ec" : "",
"id" : 42,
"iops" : "-",
"limitType" : "total",
"name" : "tier0",
"objectsCount" : 32,
"onDiskSize" : 318478208077776,
"parentName" : "",
"placeAll" : "nvme",
"placeHead" : "nvme",
"placeTail" : "nvme",
"removingSnapshotsCount" : 0,
"replication" : 3,
"reuseServer" : false,
"size" : 1073741824,
"snapshotsCount" : 0,
"snapshotsWithChildrenSize" : 0,
"snapshotsWithoutChildrenSize" : 0,
"stored" : {
"capacity" : 46981625328655,
"free" : 11473092307388,
"internal" : {
"u1" : 33276265370001,
"u2" : 2075390420371,
"u3" : 156877230893
},
"placeAll" : {
"capacity" : 140944875985966,
"free" : 34419276922165,
"internal" : {
"u1" : 99828796110006,
"u2" : 6226171261112,
"u3" : 470631692682
}
},
"placeHead" : {
"capacity" : 140944875985966,
"free" : 34419276922165,
"internal" : {
"u1" : 99828796110006,
"u2" : 6226171261112,
"u3" : 470631692682
}
},
"placeTail" : {
"capacity" : 140944875985966,
"free" : 34419276922165,
"internal" : {
"u1" : 99828796110006,
"u2" : 6226171261112,
"u3" : 470631692682
}
}
},
"storedSize" : 33276265370001,
"totalSize" : 3221225472,
"volumesCount" : 1,
"volumesSize" : 1073741824
}
This is where the data for templates in StorPool analytics comes from, and it is what orchestrations and people use to judge the amount of free space in the system.
The capacity is calculated as follows:
1. For each disk, take the number of allocation groups (agCount) and subtract 8.
2. Sum all these allocation groups as bytes.
3. Divide by the overhead factor (about 1.09).
4. Divide by the replication factor.
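As an illustration only, the steps above could be scripted roughly like this. The allocation group size (AG_SIZE_BYTES), the overhead factor, and the replication factor are inputs you have to supply for your own system, the jq filter assumes the same data wrapper as the other scripts on this page, and in practice the sum should be restricted to the drives of the placement group in question:
#!/bin/bash
# Rough capacity estimate from `storpool -j disk list` (illustrative sketch only).
AG_SIZE_BYTES="${AG_SIZE_BYTES:?set to the allocation group size of your drives, in bytes}"
OVERHEAD=1.09     # approximate overhead factor
REPLICATION=3     # replication factor of the placement group

# Sum (agCount - 8) over the drives, then apply the overhead and replication factors.
storpool -j disk list \
    | jq -r '[ .data[] | .agCount - 8 ] | add' \
    | awk -v ag="$AG_SIZE_BYTES" -v ov="$OVERHEAD" -v rep="$REPLICATION" \
          '{ printf "estimated capacity: %.0f bytes\n", $1 * ag / ov / rep }'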
The storedSize
(in the JSON output, not visible in the StorPool CLI) is
calculated as the sum of on-disk size
on the drives in the tail placement
group, divided by the replication factor. For EC templates (see Erasure Coding) it’s wrong, since the
storedSize
value takes into account coded pieces and should not be relied
on in any way. Note that if there are multiple templates with different volumes
and the same placement, the storedSize
is the total for all of them, as the
value is just taken from the drives and not grouped by template.
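To illustrate, the per-template storedSize could be approximated from the disk list output like this. The list of drive IDs in the tail placement group and the replication factor are inputs you have to provide (for example from your placement group configuration); the sketch does not discover them itself:
# Approximate storedSize: sum of on-disk size over the tail placement group's
# drives, divided by the replication factor (illustrative sketch only).
DISK_IDS="1001 1002 1003"   # hypothetical example; use the drives of the tail placement group
REPLICATION=3

storpool -j disk list \
    | jq -r --arg ids "$DISK_IDS" --argjson rep "$REPLICATION" '
        ($ids | split(" ")) as $want
        | [ .data
            | to_entries[]
            | select(.key as $k | $want | index($k))
            | .value.objectsOnDiskSize ]
        | add / $rep'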
The calculation of free space is somewhat non-intuitive. It begins by identifying
the disk with the smallest agFree
(available allocation groups) value within the
placement group. From this value, 8 is subtracted as a safety buffer. The result
is then divided by the overhead factor (approximately 1.09), multiplied by the
total number of drives, and finally divided by the replication factor.
The constant value of 8 acts as a safety margin to ensure that each drive maintains at least that many free allocation groups. Dropping below this threshold could lead to a deadlock, preventing any further writes to the drive.
This method provides a conservative estimate, assuming that new data will be evenly distributed across all drives in the placement group. As it depends on expected load patterns, the result may either overestimate or underestimate the actual available space.
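Under the same assumptions as the capacity sketch above (AG_SIZE_BYTES, overhead, replication, the data wrapper, and restricting the query to the drives of one placement group), the free space estimate could be reproduced roughly as follows:
#!/bin/bash
# Rough free space estimate (illustrative sketch only): smallest agFree in the
# placement group, minus the safety margin of 8, converted to bytes, divided by
# the overhead factor, multiplied by the number of drives, divided by replication.
AG_SIZE_BYTES="${AG_SIZE_BYTES:?set to the allocation group size of your drives, in bytes}"
OVERHEAD=1.09
REPLICATION=3

storpool -j disk list \
    | jq -r '[ .data[] | .agFree ] | "\(min) \(length)"' \
    | awk -v ag="$AG_SIZE_BYTES" -v ov="$OVERHEAD" -v rep="$REPLICATION" \
          '{ printf "estimated free space: %.0f bytes\n", ($1 - 8) * ag / ov * $2 / rep }'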
There is a value named u2
under the internal
objects, which represents the
difference between the estimated free space and the total number of free allocation
groups (AGs) across the drives. In other words, it reflects the amount of space
that would be considered available if the placement group were perfectly balanced.
Tip
It is expected and almost always the case that the combined total of free and
used space is less than the reported capacity. This is because the free space
estimate is intentionally conservative. Additionally, there’s no guarantee
that including the u2
value will fully account for the difference.
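These values can be followed over time directly from the JSON output of template status. This is a minimal sketch, assuming the data array layout used by the scripts later on this page:
# Capacity, estimated free space, and the internal u2 value per template
# (illustrative sketch only).
storpool -j template status \
    | jq -r '.data[] | [ .name, .stored.capacity, .stored.free, .stored.internal.u2 ] | @tsv'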
StorPool volume status
The VolumesGetStatus
call in the REST API returns the status of volumes and snapshots.
The main value in the output is storedSize, which represents the actual amount of user data written to the volume or snapshot, including user-overwritten data and so on, all packed. It is used for billing purposes to ascertain the amount of user data on the system. This is almost always less than the sum of the object on-disk size values available in the “template status” and “disk list” commands described above.
Note
The value for a volume could be a lot less than the actual data, as most of its data may be in its snapshots. volume status does no summations; that is up to the user to do.
The storedSize
values from this call are used to calculate the stored space for
billing purposes for the customers paying per stored data.
The size values for volumes from either this call or volume list
are used to
calculate the provisioned size in the cluster for billing purposes for the
customers that pay based on provisioned size.
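For example, the two billing figures described above could be collected like this (a sketch; the -B flag and the data layout follow the script at the end of this page, and the size field of volume list is assumed to mirror the one in volume status):
# Total stored data across all volumes and snapshots (billing per stored data):
storpool -Bj volume status | jq -r '[ .data[] | .storedSize ] | add'

# Total provisioned size across all volumes (billing per provisioned size):
storpool -Bj volume list | jq -r '[ .data[] | .size ] | add'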
Note
The results of this call contain information specific to volumes only.
StorPool volume usedSpace
The VolumesSpace
API call shows the amount of allocated space in a volume: all space in the
volume which was written to and not trimmed. This is also the amount of space the
volume would have stored if all its snapshots were removed.
The values from this call are used for billing purposes to calculate the snapshot overhead, which is billed as tier 3.
Note
The results of this call contain information specific to volumes only.
StorPool snapshot space
The value of spaceUsed
returned from the SnapshotsSpace
API call shows how much space will be freed if
this – and only this – snapshot is removed. Summations of this field in the same
chain produce a meaningless result, and the only use of this call is to know if
a snapshot with multiple children is saving space (and so has positive, not
negative spaceUsed
) and whether it needs to be completely removed by rebasing its children.
Note
The results of this call contain information specific to snapshots only, and its use cases are limited.
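For instance, snapshots whose removal would free space can be listed and sorted like this (a sketch; the snapshot space CLI command name follows the section title, and the data layout follows the other scripts on this page):
# Snapshots with a positive spaceUsed, i.e. whose removal would free space,
# sorted by the amount of space that would be freed (illustrative sketch only).
storpool -Bj snapshot space \
    | jq -r '.data[] | select(.spaceUsed > 0) | [ .name, .spaceUsed ] | @tsv' \
    | sort -t$'\t' -k2 -rn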
Key considerations
Consideration 1
StorPool’s relocator (see Relocator) does not put data on drives with (by default) less than 150 GiB of free space. This is between 3.6% (for 4 TB drives) and 7.3% (for 2 TB drives), so drives must have at least that amount of free space for rebalancing to work.
A system with erasure coding must always be able to balance out a node. The above formula can be adapted, and there’s a monitoring alert for that situation.
Consideration 2
A system must always be able to balance out a drive, so the remaining drives in its placement group should have enough free space to accommodate that drive’s data.
Frequently Asked Questions (FAQ)
How much free space do I have?
- I need to know if I need to expand my storage
See Consideration 1. If you’re below that, then yes. If your usage stats show steady growth, a prediction can be made and extra space ordered in due course.
- I see some values and they do not make sense.
Check the source of the values and read the sections above; that should clear things up.
- How much more can I expand my usage before it’s a problem?
See Consideration 1. There should be that amount of space available, so the user data that can be put on the system is capacity - storedSize * factor - safety. The safety margin is described in Consideration 1, and the factor for stored-to-unaggregated data is to be observed in the live system, based on what disk list shows.
What uses space in the system?
- Which of my users are using too much space?
This is calculated based on sums of storedSize for the snapshot chains of the volumes, grouped by the tag that the orchestration uses to mark VMs; a sketch is shown after this list. It can also be matched with the ratio of usedSpace to storedSize to see who the largest users are. When done for a period of time, it can be used for the next question.
- Where does the growth in used space come from?
First, it’s possible that the cluster is getting imbalanced, so the growth of u2 should be checked (Template usage internal in analytics). Then, the growth of storedSize should be investigated: is it a large spike somewhere, or a smooth growth? The comparison of the calculations from the question above over time can show where the change comes from.
- Is there something taking space on the system that should not be there?
There are multiple possibilities here:
Snapshot chains that do not end in a volume - for example, VolumeCare snapshots left over for a volume that has been deleted. These might be leftovers that are not needed.
Snapshots or volumes created manually for different reasons outside of the orchestration.
Snapshots or volumes unknown to the orchestration and left over.
In the case of large volumes for multiple VMs (like VMWare VMFSes or the XenServer ones), non-working reclaim/TRIM.
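Here is a sketch of the per-user grouping mentioned above. The tag name (vm in the example) is hypothetical and has to be replaced with the tag your orchestration actually sets, and the sketch assumes the volume status output includes the tags of each volume and snapshot; if it does not in your version, the tags can be joined from volume list instead:
# Sum storedSize per orchestration tag (illustrative sketch; the tag name "vm"
# is hypothetical - replace it with the tag your orchestration sets).
storpool -Bj volume status \
    | jq -r '
        [ .data[] | { tag: (.tags.vm // "untagged"), storedSize } ]
        | group_by(.tag)
        | map({ tag: .[0].tag, storedSize: (map(.storedSize) | add) })
        | sort_by(-.storedSize)
        | .[] | [ .tag, .storedSize ] | @tsv'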
What is the snapshot overhead?
The sum of storedSize from volume status minus spaceUsed from volume usedSpace.
Here is an example:
#!/bin/bash
# Calculate the total stored data for a volume, including its snapshots, the volume size alone, and the snapshot overhead:
storedSize=$(storpool -Bj volume status | jq -r --arg volumeName "$1" '[ .data[] | select(.name == $volumeName or .onVolume == $volumeName) | .storedSize ] | add')
volumeSize=$(storpool -Bj volume usedSpace | jq -r --arg volumeName "$1" '[ .data[] | select(.name == $volumeName) | .spaceUsed ] | add')
snapshotOverhead=$(( storedSize - volumeSize ))
echo "Total used space for volume $1: $storedSize B"
echo "Volume size $1: $volumeSize B"
echo "Snapshot Overhead for $1: $snapshotOverhead B"
Note
When calculating snapshot overhead for a single volume, it’s important to consider whether the snapshot chain has been rebased.
If the chain has not been rebased, you can simply follow the snapshot tree to identify all relevant snapshots and compute their overhead. However, if the chain has been rebased, this approach may be incomplete. Some snapshots created by the same volume might not be included in the rebased chain.
To identify all related snapshots, you can use the onVolume
tag, which shows
all snapshots associated with a volume regardless of rebase operations.
However, this method also has limitations, particularly when a snapshot has
multiple child volumes, in which case the onVolume
tag may reference only
one of them.