Storage devices

All storage devices that will be used by StorPool (HDD, SSD, or NVMe) must have one or more properly aligned partitions, and must have an assigned ID. Larger NVMe drives should be split into two or more partitions, which allows assigning them to different instances of the storpool_server service.

You can initialize the devices quickly using the disk_init_helper tool provided by StorPool. Alternatively, you can do this manually using the standard parted tool.

Journals

All hard disk drives should have a journal provided in one of the following ways:

  • On a persistent memory device (/dev/pmemN)

  • On a small, high-endurance NVMe device (an Intel Optane or similar)

  • On a small, separate partition of a regular NVMe device, apart from its main data partition

  • On a battery- or cachevault-backed (power-loss protected) virtual device provided by a RAID controller

Running HDDs without journals is acceptable only for snapshot-only data (for example, in a backup-only cluster).

For persistent memory devices, see Persistent memory support.
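
For example, pointing an HDD's journal at a persistent memory device uses the same --journal syntax shown in the manual initialization examples below (a sketch; the disk ID, drive letter, and pmem device number are placeholders):

# storpool_initdisk {diskId} /dev/sdX --journal /dev/pmem0   # Here, X is the drive letter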

Using disk_init_helper

The disk_init_helper tool is used in two steps:

  1. Discovery and setup

    The tool discovers all drives that do not have partitions and are not used anywhere (no LVM PV, device-mapper RAID, StorPool data disks, and so on). Based on this information, it generates a suggested configuration, which is stored as a configuration file. You can try different options until you get a configuration that suits your needs.

  2. Initialization

    You provide the configuration file from the first step to the tool, and it initializes the drives.
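
A minimal end-to-end run could look like this (the starting ID and the configuration file name are only illustrative; both commands are shown in detail below):

# disk_init_helper discover --start 2501 -d disks.json   # step 1: generate the configuration
# disk_init_helper init disks.json                       # step 2: apply it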

disk_init_helper is also used in the storpool-ansible playbook (see github.com/storpool/ansible), where it helps provide consistent defaults for known configurations and ensures idempotency.

Example node

This is an example node with 7 x 960GB SSDs, 8 x 2TB HDDs, 1 x 100GB Optane NVMe, and 3 x 1TB NVMe disks:

[root@s25 ~]# lsblk
NAME     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda        8:0    0 894.3G  0 disk
sdb        8:16   0 894.3G  0 disk
sdc        8:32   0 894.3G  0 disk
sdd        8:48   0 894.3G  0 disk
sde        8:64   0 894.3G  0 disk
sdf        8:80   0 894.3G  0 disk
sdg        8:96   0 894.3G  0 disk
sdh        8:112  0   1.8T  0 disk
sdi        8:128  0   1.8T  0 disk
sdj        8:144  0   1.8T  0 disk
sdk        8:160  0   1.8T  0 disk
sdl        8:176  0 111.8G  0 disk
|-sdl1     8:177  0  11.8G  0 part
|-sdl2     8:178  0   100G  0 part /
`-sdl128 259:15   0   1.5M  0 part
sdm        8:192  0   1.8T  0 disk
sdn        8:208  0   1.8T  0 disk
sdo        8:224  0   1.8T  0 disk
sdp        8:240  0   1.8T  0 disk
nvme0n1  259:6    0  93.2G  0 disk
nvme1n1  259:0    0 931.5G  0 disk
nvme2n1  259:1    0 931.5G  0 disk
nvme3n1  259:4    0 931.5G  0 disk

This node is used in the examples below.

Discovering drives

Basic usage

To assign IDs to all disks on this node, run the tool with the --start argument:

[root@s25 ~]# disk_init_helper discover --start 2501 -d disks.json
sdl partitions: sdl1, sdl2, sdl128
Success generating disks.json, proceed with 'init'

Note

The automatically generated IDs must be unique within the StorPool cluster. Allowed IDs are between 1 and 4000.

StorPool disk IDs are assigned in ranges offset by 10 for the SSD, NVMe, and HDD drives respectively, which can be further tweaked with parameters. For example, with --start 2501 on the example node, the SSDs receive IDs 2501-2507, the NVMe drives 2511-2513, and the HDDs 2521-2528.

By default, the tool discovers all disks without partitions; the drive where the OS is installed (/dev/sdl) is skipped. The tool does the following:

  • Prepares all SSD, NVMe, and HDD devices with a single large partition on each one.

  • Uses the Optane device as a journal-only device for the hard drive journals.

Viewing configuration

You can use the --show option to see what will be done:

[root@s25 ~]# disk_init_helper discover --start 2501 --show
sdl partitions: sdl1, sdl2, sdl128
/dev/sdb (type: SSD):
    data partitions
        /dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302126-part1 (2501): 894.25 GiB (mv: None)
/dev/sda (type: SSD):
    data partitions
        /dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302127-part1 (2502): 894.25 GiB (mv: None)
/dev/sdc (type: SSD):
    data partitions
        /dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302128-part1 (2503): 894.25 GiB (mv: None)
/dev/sdd (type: SSD):
    data partitions
        /dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302129-part1 (2504): 894.25 GiB (mv: None)
/dev/sde (type: SSD):
    data partitions
        /dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302137-part1 (2505): 894.25 GiB (mv: None)
/dev/sdf (type: SSD):
    data partitions
        /dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302138-part1 (2506): 894.25 GiB (mv: None)
/dev/sdg (type: SSD):
    data partitions
        /dev/disk/by-id/ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J302139-part1 (2507): 894.25 GiB (mv: None)
/dev/sdh (type: HDD):
    data partitions
        /dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS00Y25-part1 (2521): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part1)
/dev/sdj (type: HDD):
    data partitions
        /dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS03YRJ-part1 (2522): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part2)
/dev/sdi (type: HDD):
    data partitions
        /dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS041FK-part1 (2523): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part3)
/dev/sdk (type: HDD):
    data partitions
        /dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS04280-part1 (2524): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part4)
/dev/sdp (type: HDD):
    data partitions
        /dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS05PTA-part1 (2525): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part5)
/dev/sdo (type: HDD):
    data partitions
        /dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS05PTB-part1 (2526): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part6)
/dev/sdm (type: HDD):
    data partitions
        /dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS05PTD-part1 (2527): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part7)
/dev/sdn (type: HDD):
    data partitions
        /dev/disk/by-id/ata-ST2000NM0008-2F3100_ZDS05PTJ-part1 (2528): 1863.01 GiB (mv: /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part8)
/dev/nvme0n1 (type: journal-only NVMe):
    journal partitions
        /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part1 (None): 0.10 GiB (mv: None)
        /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part2 (None): 0.10 GiB (mv: None)
        /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part3 (None): 0.10 GiB (mv: None)
        /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part4 (None): 0.10 GiB (mv: None)
        /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part5 (None): 0.10 GiB (mv: None)
        /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part6 (None): 0.10 GiB (mv: None)
        /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part7 (None): 0.10 GiB (mv: None)
        /dev/disk/by-id/nvme-INTEL_SSDPE21K100GA_PHKE831600DM100EGN-part8 (None): 0.10 GiB (mv: None)
/dev/nvme3n1 (type: NVMe w/ journals):
    data partitions
        /dev/disk/by-id/nvme-INTEL_SSDPE2KX010T8_BTLJ849207E61P0FGN-part1 (2511): 931.51 GiB (mv: None)
/dev/nvme1n1 (type: NVMe w/ journals):
    data partitions
        /dev/disk/by-id/nvme-INTEL_SSDPE2KX010T8_BTLJ849207F91P0FGN-part1 (2512): 931.51 GiB (mv: None)
/dev/nvme2n1 (type: NVMe w/ journals):
    data partitions
        /dev/disk/by-id/nvme-INTEL_SSDPE2KX010T8_BTLJ84920JAJ1P0FGN-part1 (2513): 931.51 GiB (mv: None)

Recognizing SSDs

The SSDs and HDDs are auto-discovered by their rotational flag in the /sys/block hierarchy. There are, however, occasions when this flag is misleading and an SSD is reported as a rotational device.

For such cases there are overrides that can further help with proper configuration, as shown in the example below:

# disk_init_helper discover --start 101 '*Micron*M510*:s'

All devices whose /dev/disk/by-id links match the *Micron*M510* pattern will be forced to the SSD type, regardless of how they were discovered by the tool.

Specifying a journal

Similarly, a journal may be specified for a device, for example:

# disk_init_helper discover --start 101 '*Hitachi*HUA7200*:h:njo'

This instructs the tool to keep the journals for all Hitachi HUA7200 drives on an NVMe journal-only device.

The overrides look like this: <disk-serial-pattern>:<disk-type>[:<journal-type>]

The disk type may be one of:

  • s - SSD drive

  • sj - SSD drive with HDD journals (used for testing only)

  • n - NVMe drive

  • nj - NVMe drive with HDD journals

  • njo - NVMe drive with journals only (no StorPool data disk)

  • h - HDD drive

  • x - Exclude the matching drives, even if they have the right size.

The journal-type override is optional, and makes sense mostly when the device is an HDD:

  • nj - journal on an NVMe drive - requires at least one nj device

  • njo - journal on an NVMe drive - requires at least one njo device

  • sj - journal on SSD drive (unusual, but useful for testing); requires at least one sj device.
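
Putting the pieces together, an override that forces a set of drives to the HDD type and keeps their journals on a data-carrying NVMe device (an nj device) could look like this (the serial pattern is illustrative):

# disk_init_helper discover --start 101 '*ST2000NM*:h:nj'

As noted above, this requires at least one device discovered (or overridden) as nj.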

Initializing drives

To initialize the drives using an existing configuration file:

# disk_init_helper init disks.json

The above will apply the settings pre-selected during the discovery phase.

More options may be specified to either provide visibility into what will be done (like --verbose and --noop), or to pass additional options to storpool_initdisk for the different disk types (like --ssd-args and --hdd-args).
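
For example, a dry run that only reports what would be done might combine the options named above like this (a sketch; the exact option order is not significant):

# disk_init_helper init --noop --verbose disks.json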

Manual partitioning

A disk drive can be initialized manually as a StorPool data disk.

Creating partitions

First, an aligned partition should be created spanning the full volume of the disk drive. Here is an example command for creating a partition on the whole drive with the proper alignment:

# parted -s --align optimal /dev/sdX mklabel gpt -- mkpart primary 2MiB 100%    # Here, X is the drive letter

For dual partitions on an NVMe drive that is larger than 4TB, use:

# parted -s --align optimal /dev/nvmeXnY mklabel gpt -- mkpart primary 2MiB 50%   # Here, X is the nvme device controller, and Y is the device's first namespace
# parted -s --align optimal /dev/nvmeXnY mkpart primary 50% 100%

Similarly, to split an even larger (for example, 8TB or larger) NVMe drive into four partitions, use:

# parted -s --align optimal /dev/nvmeXnY mklabel gpt -- mkpart primary 2MiB 25%   # Here, X is the nvme device controller, and Y is the device's first namespace
# parted -s --align optimal /dev/nvmeXnY mkpart primary 25% 50%
# parted -s --align optimal /dev/nvmeXnY mkpart primary 50% 75%
# parted -s --align optimal /dev/nvmeXnY mkpart primary 75% 100%

Hint

NVMe devices larger than 4TB should always be split into chunks of up to 4TiB each. For example, a 7.68TB device split in two yields partitions of roughly 3.5TiB, which stays within the limit.

Initializing a drive

On a brand new cluster installation it is necessary to have one drive formatted with the “init” (-I) flag of the storpool_initdisk tool. This device is necessary only for the first start, and therefore it is best to pick the first drive in the cluster.

Initializing the first drive on the first server node with the init flag:

# storpool_initdisk -I {diskId} /dev/sdX   # Here, X is the drive letter

Initializing an SSD or NVMe SSD device with the SSD flag set:

# storpool_initdisk -s {diskId} /dev/sdX   # Here, X is the drive letter

Initializing an HDD drive with a journal device:

# storpool_initdisk {diskId} /dev/sdX --journal /dev/sdY   # Here, X and Y are the drive letters

To list all initialized devices:

# storpool_initdisk --list

Example output:

0000:01:00.0-p1, diskId 2305, version 10007, server instance 0, cluster e.b, SSD, opened 7745
0000:02:00.0-p1, diskId 2306, version 10007, server instance 0, cluster e.b, SSD, opened 7745
/dev/sdr1, diskId 2301, version 10007, server instance 1, cluster e.b, SSD, opened 8080
/dev/sdq1, diskId 2302, version 10007, server instance 1, cluster e.b, SSD, opened 8080
/dev/sds1, diskId 2303, version 10007, server instance 1, cluster e.b, SSD, opened 8080
/dev/sdt1, diskId 2304, version 10007, server instance 1, cluster e.b, SSD, opened 8080
/dev/sda1, diskId 2311, version 10007, server instance 2, cluster e.b, WBC, jmv 160036C1B49, opened 8185
/dev/sdb1, diskId 2311, version 10007, server instance 2, cluster -, journal mv 160036C1B49, opened 8185
/dev/sdc1, diskId 2312, version 10007, server instance 2, cluster e.b, WBC, jmv 160036CF95B, opened 8185
/dev/sdd1, diskId 2312, version 10007, server instance 2, cluster -, journal mv 160036CF95B, opened 8185
/dev/sde1, diskId 2313, version 10007, server instance 3, cluster e.b, WBC, jmv 160036DF8DA, opened 8971
/dev/sdf1, diskId 2313, version 10007, server instance 3, cluster -, journal mv 160036DF8DA, opened 8971
/dev/sdg1, diskId 2314, version 10007, server instance 3, cluster e.b, WBC, jmv 160036ECC80, opened 8971
/dev/sdh1, diskId 2314, version 10007, server instance 3, cluster -, journal mv 160036ECC80, opened 8971

Drive initialization options

Other available options of the storpool_initdisk tool:

--list

List all StorPool disks on this node.

-i

Specify the server instance; used when more than one storpool_server instance is running on the same node.
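
For instance, an SSD could be initialized for the second server instance (instance 1) like this (a sketch; the disk ID and drive letter are placeholders):

# storpool_initdisk -i 1 -s {diskId} /dev/sdX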

-r

Used to return an ejected disk back to the cluster or change some of the flags.

-F

Forget this disk and mark it as ejected; succeeds only if no running storpool_server instance has the drive open.
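
A hypothetical invocation, assuming the same {diskId} and device argument order as in the initialization examples above:

# storpool_initdisk -F {diskId} /dev/sdX1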

-s|--ssd y/n

Set the SSD flag; available on new initialization only, not reversible with -r. Providing the y or n value forces a disk to be considered as flash-based or not.

-j|--journal (<device>|none)

Used for HDDs when a RAID controller with a working cachevault or battery is present, or when an NVMe device is used as a power-loss protected write-back journal cache.

--bad

Marks the disk as bad; it will be treated as ejected by the servers.

--good

Resets a disk that was marked as bad back to ejected. Use with caution.

--list-empty

List empty NVMe devices.

--json

Output the list of devices as a JSON object.

--nvme-smart nvme-pci-addr

Dump the NVMe S.M.A.R.T. counters; only for devices controlled by the storpool_nvmed service.

Advanced options (use with care):

-e (entries_count)

Initialize the disk with an overridden number of entries (the default is based on the disk size).

-o (objects_count)

Initialize the disk with an overridden number of objects (the default is based on the disk size).

--wipe-all-data

Used when re-initializing an already initialized StorPool drive. Use with caution.

--no-test

Disable the forced one-time test flag.

--no-notify

Does not notify the servers of the changes, so they won’t immediately open the disk. Useful for changing a flag with -r without returning the disk back to the server.

--no-fua (y|n)

Used to forcefully disable FUA support for an SSD device. Use with caution: it might lead to data loss if the device is powered off before a FLUSH CACHE command is issued.

--no-flush (y|n)

Used to forcefully disable FLUSH support for an SSD device.

--no-trim (y|n)

Used to forcefully disable TRIM support for an SSD device. Useful when the drive misbehaves with TRIM enabled.

--test-override no/test/pass

Modify the “test override” flag (default during disk init is “test”).

--wbc (y|n)

Used for HDDs when the internal write-back caching is enabled; requires SP_WRITE_BACK_CACHE_ENABLED to be set in order to have an effect. Turned off by default.

--nvmed-rescan

Instruct the storpool_nvmed service to rescan after device changes.