This is the functional specification for the Dell EMC PowerFlex/ScaleIO storage plugin.
Github issue#
master
CloudStack currently supports Ceph/RBD storage which is a distributed block storage. PowerFlex (formerly ScaleIO/VxFlexOS) also provides distributed shared block storage. This proposed feature will add support for PowerFlex (v3.5 & above) storage as a primary storage (new Storage Plugin) in CloudStack.
Author | Description | Date |
---|---|---|
 | Added feature specification and design | 27 Aug 2020 |
This feature should be able to:
The basic ScaleIO architecture consists of the SDC, SDS and MDM components, detailed below.
The ScaleIO Data Client (SDC) is a lightweight block device driver that exposes ScaleIO shared block volumes to applications. The SDC runs on the same server as the application. This enables the application to issue an IO request and the SDC fulfills it regardless of where the particular blocks physically reside. The SDC communicates with other nodes (beyond its own local server) over a TCP/IP-based protocol, so it is fully routable. For this feature to work, all KVM hosts need the SDC installed and connected to the MDM.
The ScaleIO Data Server (SDS) owns local storage that contributes to the ScaleIO Storage Pools. An instance of the SDS runs on every server that contributes some or all of its local storage space (HDDs, SSDs, PCIe, NVMe and flash cards) to the aggregated pool of storage within the ScaleIO virtual SAN. Local storage may be disks, disk partitions, or even files. The role of the SDS is to actually perform the back-end IO operations as requested by an SDC.
The Meta Data Manager (MDM) manages the ScaleIO system. The MDM contains all the metadata required for system operation, such as configuration changes. The MDM also provides monitoring capabilities to assist users with most system management tasks. The MDM manages the metadata, SDCs, SDSs, device mappings, volumes, snapshots, system capacity (including device allocation and/or release of capacity), RAID protection, errors and failures, and system rebuild tasks including rebalancing. In addition, all user interaction with the system is handled by the MDM. This is similar to the Ceph monitor or manager.
The ScaleIO Gateway connects to a single MDM and services RESTful API requests by querying the MDM and reformatting the answers it receives into a RESTful response back to the REST client. Every ScaleIO scli command is also available through the ScaleIO REST API. Responses returned by the Gateway are formatted as JSON. The API is available as part of the ScaleIO Gateway package. For the integration to work with CloudStack, the gateway must be installed and accessible to the CloudStack control plane. There is also a GUI client for Windows, Mac and Linux for administrators to monitor and manage a cluster.
A Protection Domain is a set of SDSs. Each SDS belongs to one (and only one) Protection Domain. Thus, by definition, each Protection Domain is a unique set of SDSs. The ScaleIO Data Client (SDC) is not part of the Protection Domain.
Storage Pools allow the generation of different performance tiers in the ScaleIO system. A Storage Pool is a set of physical storage devices in a Protection Domain. Each storage device belongs to one (and only one) Storage Pool. When a Protection Domain is generated, it has one Storage Pool by default.
Datacenters are designed such that a unit of failure may consist of more than a single node. Fault sets prevent mirrored chunks from being placed in the same fault set. A minimum of 3 fault sets is required per protection domain, and therefore the basic ScaleIO setup requires a 3-node (SDS) cluster.
Single accessible/logical storage drive that can be accessed by hosts as block-based storage. A volume can be mapped and un-mapped on one or more SDCs, or for this feature can be mounted/unmounted on one or more KVM hosts appearing as a block-storage disk device.
The ScaleIO storage system enables users to take snapshots of existing volumes, up to 127 per volume. The snapshots are thinly provisioned and are extremely quick. Once a snapshot is generated, it becomes a new un-mapped “volume” in the system. Users manipulate snapshots in the same manner as any other volume exposed to the ScaleIO storage system.
All the snapshots resulting from one volume are referred to as a V-Tree (or Volume Tree). It is a tree spanning from the source volume as the root, whose nodes are either snapshots of the volume itself or snapshots of its descendants. Each volume therefore has a V-Tree that holds the volume and all snapshots associated with it. The limit on a V-Tree is 128 volumes and snapshots – one slot is taken by the original volume and the remaining 127 are available for snapshots [1].
A consistency group is created when a snapshot is taken of two or more volumes.
The ScaleIO counterparts above can be mapped and used with CloudStack and KVM as follows:
A ScaleIO storage pool can be mapped 1:1 with a CloudStack storage pool, storing the gateway host/IP, port, username, password and the ID/name of the ScaleIO storage pool in the CloudStack DB.
Templates can be of QCOW2 or RAW type, no changes in secondary storage or template/iso lifecycle are necessary.
At the time of root-disk/VM provisioning, the KVM host agent can convert a template from secondary storage or direct-download into a RAW disk and write it to a mounted block-storage device (i.e. the mapped ScaleIO volume), which is the spooled template on the primary pool.
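As an illustration of this flow, the sketch below converts a template to RAW format and writes it directly onto the mapped block device using qemu-img. It is a minimal sketch only; the paths, device name and helper class are hypothetical and not the actual agent implementation.

```java
import java.io.IOException;

public class TemplateToVolumeCopy {

    /**
     * Converts a QCOW2 or RAW template file to RAW format and writes it
     * directly onto the mapped ScaleIO block device (illustrative paths).
     */
    public static void copyTemplateToMappedVolume(String templatePath, String mappedDevicePath)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "qemu-img", "convert",
                "-O", "raw",            // write the output in RAW format straight to the block device
                templatePath,
                mappedDevicePath);
        pb.redirectErrorStream(true);
        Process process = pb.start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("qemu-img convert failed with exit code " + exitCode);
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical example: template spooled in the agent cache, written to the mapped volume.
        copyTemplateToMappedVolume(
                "/var/cache/cloudstack/agent/template.qcow2",
                "/dev/disk/by-id/emc-vol-1234567890abcdef-0000000100000001");
    }
}
```

The same block-based copy approach applies when backing up a snapshot/volume to mounted secondary storage, with the source and destination paths swapped accordingly.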
Root disk resize will cause resize of the related ScaleIO volume, similarly deletion of the root disk will cause deletion of the ScaleIO volume after unmapping it across all KVM hosts.
On provisioning, data disks are simply volumes created in ScaleIO that can be mapped to a KVM host and attached as “raw” disk(s) to a VM. The detach operation detaches the raw block-storage device from the VM and un-maps the volume from the KVM host. Data disk resize will cause a resize of the ScaleIO volume; similarly, deletion of the data disk will cause deletion of the disk on ScaleIO after unmapping it on all KVM hosts.
CloudStack volume snapshots can be mapped 1:1 with volume snapshots on the ScaleIO side. By default this does not require any backup operation to secondary storage, similar to Ceph. A backup to secondary storage is possible by mounting the secondary storage, mapping the ScaleIO snapshot/volume on the KVM host, and performing a block-based transfer (using dd or qemu-img).
Creating snapshots of more than one ScaleIO volume creates a consistency group on the ScaleIO side. For a running VM, VM snapshots with memory are not possible for root disks on ScaleIO storage. Only VM snapshots without memory are possible (a consistent snapshot of the root and data disks of a VM).
Any storage IOPS settings can be taken and applied to a ScaleIO volume based on the compute offering for root-disk and the disk offering for a data-disk.
Note: due to a ScaleIO limitation, disk sizes must be multiples of 8 GB; otherwise ScaleIO will round up and create the disk on the next 8 GB boundary (for example, a requested 10 GB disk results in a 16 GB ScaleIO volume).
This feature can be refactored in CloudStack so that a local scratch/cache space can be defined on the KVM hosts for hosting the config drive ISOs, with a global setting that changes the behaviour of where the config drive ISOs are hosted (secondary storage, primary storage, or a local/scratch path on the host).
This feature would require a caching/scratch space to download a template and then perform a block-based copy to a mapped/mounted ScaleIO volume before it can be used as a root disk.
ScaleIO allows migration of an entire VTree from one storage pool to another storage pool of the same system. Therefore, storage migration will be limited to storage pools managed by the same ScaleIO cluster gateway/manager.
Implement a new CloudStack storage plugin for ScaleIO storage. This will follow the design principles abstracted by CloudStack API for implementing a pluggable storage plugin.
Introduce a new storage pool type “PowerFlex” that associates with a PowerFlex/ScaleIO storage pool and allows for shared storage and over-provisioning. This type is used across various operations for storage-pool-specific handling, especially on the hypervisor (KVM agent) side. Implement a new storage volume/datastore plugin with the following [2]:
i. ScaleIO Datastore Driver: a primary datastore driver class that is responsible for lifecycle operations of a volume and snapshot resource such as to grant/revoke access, create/copy/delete data object, create/revert snapshot and return usage data.
ii. ScaleIO Datastore Lifecycle: a class that is responsible for managing lifecycle of a storage pool for example to create/initialise/update/delete a datastore, attach to a zone/cluster and handle maintenance of the storage pool.
iii. ScaleIO Datastore Provider: a class that is responsible for exporting the implementation as a datastore provider plugin for CloudStack storage sub-system to pick it up and use for the storage pools of type “PowerFlex”.
iv. ScaleIO gateway client and utilities: a thin ScaleIO Java SDK that provides helper classes for the driver and lifecycle classes to communicate with the ScaleIO gateway server using RESTful APIs. The new thin ScaleIO API client (Java client) will have the following functionality (a minimal sketch of such a client follows this list):
→ Secure authentication with provided URL and credentials
→ List all storage pools, find storage pool by ID/name
→ List all SDCs, find SDC by IP address
→ Map/unmap volume to SDC (a KVM host)
→ ScaleIO volume lifecycle operations
→ Other volume lifecycle operations supported in ScaleIO
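A minimal sketch of such a thin gateway client is given below. It assumes the PowerFlex gateway REST endpoints /api/login (which returns a session token that is then used as the password for subsequent Basic-auth requests) and /api/types/StoragePool/instances; JSON parsing, error handling and TLS trust configuration for self-signed gateway certificates are omitted, and the class and method names are illustrative rather than the final plugin API.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Minimal sketch of a thin ScaleIO/PowerFlex gateway REST client. */
public class ScaleIOGatewayClient {

    private final String gatewayUrl;   // e.g. https://gateway-host:443
    private final String username;
    private String sessionToken;       // token returned by /api/login
    private final HttpClient http = HttpClient.newHttpClient();

    public ScaleIOGatewayClient(String gatewayUrl, String username) {
        this.gatewayUrl = gatewayUrl;
        this.username = username;
    }

    private static String basicAuth(String user, String secret) {
        String credentials = user + ":" + secret;
        return "Basic " + Base64.getEncoder().encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }

    /** Secure authentication: the returned token is used as the password for later calls. */
    public void login(String password) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(gatewayUrl + "/api/login"))
                .header("Authorization", basicAuth(username, password))
                .GET().build();
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("Gateway login failed: HTTP " + response.statusCode());
        }
        // The gateway returns the session token as a quoted JSON string.
        sessionToken = response.body().replace("\"", "");
    }

    /** List all storage pools known to the gateway (raw JSON is returned; parsing omitted). */
    public String listStoragePools() throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(gatewayUrl + "/api/types/StoragePool/instances"))
                .header("Authorization", basicAuth(username, sessionToken))
                .GET().build();
        return http.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

The other listed operations (finding SDCs, mapping/unmapping volumes, volume lifecycle calls) would follow the same pattern of authenticated requests against the corresponding gateway endpoints.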
2. Hypervisor layer (KVM): The hypervisor layer would have the following design aspects:
ScaleIO StorageAdaptor and StoragePool: For handling of ScaleIO volumes and snapshots, a ScaleIO storage specific adaptor and pool management classes may need to be added. These classes will be responsible for managing storage operations and pool related tasks and metadata.
All storage related operations need to be handled by various Command handlers and hypervisor/storage processors (KVMStorageProcessor) as orchestrated by the KVM server resource class (LibvirtComputingResource) such as CopyCommand, AttachCommand, DetachCommand, CreateObjectCommand, DeleteCommand, SnapshotAndCopyCommand, DirectDownloadCommand, etc.
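For example, before a mapped volume can be attached to a VM, the ScaleIO StorageAdaptor needs to wait for the SDC to expose it as a local block device, honouring the storage.pool.disk.wait setting described later in this document. The sketch below assumes mapped ScaleIO volumes appear under /dev/disk/by-id/ with an emc-vol-<systemId>-<volumeId> name; the path pattern, class and method names are illustrative.

```java
import java.io.File;
import java.util.concurrent.TimeUnit;

/** Sketch of how a ScaleIO StorageAdaptor might wait for a freshly mapped volume
 *  to become visible as a local block device on the KVM host. */
public class ScaleIODeviceWait {

    // Assumed path pattern under which the SDC exposes mapped volumes.
    private static final String BY_ID_DIR = "/dev/disk/by-id";

    /**
     * Polls until the block device for the mapped volume appears, honouring the
     * storage.pool.disk.wait timeout (default 60 seconds).
     */
    public static String waitForMappedDevice(String systemId, String volumeId, int waitSeconds)
            throws InterruptedException {
        String expectedName = "emc-vol-" + systemId + "-" + volumeId;
        long deadline = System.currentTimeMillis() + TimeUnit.SECONDS.toMillis(waitSeconds);
        while (System.currentTimeMillis() < deadline) {
            File device = new File(BY_ID_DIR, expectedName);
            if (device.exists()) {
                return device.getAbsolutePath();   // ready to be handed to libvirt as a raw disk
            }
            Thread.sleep(1000);                    // re-check every second until the SDC exposes the device
        }
        throw new IllegalStateException("Mapped volume " + volumeId + " not visible on host within "
                + waitSeconds + " seconds");
    }
}
```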
Scratch/cache storage directory path on KVM host:
→ Define a new scratch/cache path in agent.properties with a default path, for example /var/cache/cloudstack/agent/
→ The cache directory will be used to host config drive ISOs for VMs and temporary cache for direct download templates
The configuration setting changes below are incorporated.
PowerFlex/ScaleIO Storage Pool:
Configuration | Description / Changes | Default Value |
storage.pool.disk.wait | New primary storage level configuration to set the custom wait time for ScaleIO disk availability in the host (currently supports ScaleIO only). | 60 secs |
storage.pool.client.timeout | New primary storage level configuration to set the ScaleIO REST API client connection timeout (currently supports ScaleIO only). | 60 secs |
custom.cs.identifier | New global configuration, which initially holds a randomly generated 4-character string. This parameter can be updated to provide a unique CloudStack installation identifier, which helps track the volumes of a specific CloudStack installation when the ScaleIO storage pool is shared. | random 4-character string |
Other settings added/updated:
Configuration | Description / Changes | Default Value |
vm.configdrive.primarypool.enabled | Scope changed from Global to Zone level | false |
vm.configdrive.use.host.cache.on.unsupported.pool | New zone level configuration to use host cache for config drives when storage pool doesn't support config drive. | true |
vm.configdrive.force.host.cache.use | New zone level configuration to force host cache for config drives. | false |
router.health.checks.failures.to.recreate.vr | New test "filesystem.writable.test" added, which checks whether the router filesystem is writable. If set to "filesystem.writable.test", the router is recreated when the disk is read-only. | <empty> |
The parameters below are introduced in the agent.properties file of the KVM host.
Parameter | Description | Default Value |
host.cache.location | New parameter to specify the host cache path. Config drives will be created in the "/config" directory under the host cache path. | /var/cache/cloud |
powerflex.sdc.home.dir | New parameter to specify the SDC home path if installed in a custom directory; required to rescan and query volumes (query_vols) in the SDC. | /opt/emc/scaleio/sdc |
The following naming conventions are used for CloudStack resources in the ScaleIO storage pool, which avoids naming conflicts when the same ScaleIO pool is shared across multiple CloudStack zones / installations.
where,
[pool-key] = 4 characters picked from the pool UUID. Example UUID: fd5227cb-5538-4fef-8427-4aa97786ccbc => fd52(27cb)-5538-4fef-8427-4aa97786ccbc. The 4 characters shown in parentheses are picked. The pool can be tracked with the UUID containing [pool-key].
[custom.cs.identifier] = value of the global configuration “custom.cs.identifier”, which holds 4 characters randomly generated initially. This parameter can be updated to suit the requirement of unique CloudStack installation identifier, which helps in tracking the volumes of a specific CloudStack installation.
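A small sketch of the [pool-key] derivation, assuming it is simply the 5th to 8th characters of the pool UUID as the example above suggests (the class name is illustrative):

```java
import java.util.UUID;

public class PoolKeyExample {

    /** Picks the 4-character [pool-key] from a storage pool UUID. */
    public static String poolKey(String poolUuid) {
        // "fd5227cb-5538-4fef-8427-4aa97786ccbc" -> "27cb"
        return poolUuid.substring(4, 8);
    }

    public static void main(String[] args) {
        System.out.println(poolKey("fd5227cb-5538-4fef-8427-4aa97786ccbc")); // prints 27cb
        System.out.println(poolKey(UUID.randomUUID().toString()));
    }
}
```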
N/A
KVM
N/A
N/A
[2] CloudStack Storage-subsystem design
[3] Getting to Know PowerFlex/ScaleIO