Dynamic scaling of CPU and RAM

Bug Reference

https://issues.apache.org/jira/browse/CLOUDSTACK-658

Branch

master, 4.1.0

Introduction

It is not always possible to identify the exact CPU and RAM requirements when deploying a VM, and for various reasons these resources may need to be scaled up later. Today the only way to do that is to restart the VM with increased resources. The dynamic scaling feature for CPU and RAM would allow these resources to be changed for a running VM, avoiding any downtime.

Currently CloudStack (CS) allows updating CPU/RAM for stopped VMs by changing to a different compute offering. This feature will enable the same for running VMs.

Purpose

This document describes the specifications and design of the feature.

References

relevant links

Document History

Glossary

Feature Specifications

Ability to scale up CPU and/or RAM for running user VMs based on predefined compute offerings
Ability to scale up CPU and/or RAM for running system VMs based on predefined system compute offerings
If scaling up requires a migration, it will be limited to the VM's cluster. If the host where the VM is currently running has capacity, the resources are simply scaled up; if not, the VM is live migrated within the cluster and then scaled up. If no suitable host is found, the operation fails. If live migration across clusters becomes possible in the future, this constraint can be relaxed.
Ability to mark the VM for scale up at creation time (see open issue#1)
Ability to scale down CPU and/or RAM (see open issue#2)
No check for guest OS compatibility (this can be considered only if there is a concise document/API listing the supported guest OSes for each hypervisor)
Only supported for newly created VMs; existing VMs (from prior CS releases) won't have the capability to scale up

Hypervisor support

The current plan is to implement this for VMware. Support for other HVs can be added based on HV capabilities.

For KVM, Marcus Sorrensun has offered to do this.

Use cases

End users can scale up CPU and/or RAM for running VMs

Architecture and Design description

A new command class needs to be introduced to actually change CPU/RAM at the agent layer. It needs to be handled for each of the supported HVs.

ReconfigureVMCommand: This will have the updated CPU and RAM values as members
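
As a rough illustration, the command could be a simple immutable value object. This is only a sketch: the field names (vmName, cpuCount, cpuSpeedMhz, ramMb), the accessors, and the validation are assumptions made for this example, and the real class would presumably extend the agent command base class rather than stand alone.

```java
// Hypothetical sketch of the command payload; names and units are assumptions,
// not the final design.
final class ReconfigureVMCommand {
    private final String vmName;   // assumed identifier of the VM on the hypervisor
    private final int cpuCount;    // updated number of vCPUs
    private final int cpuSpeedMhz; // updated per-vCPU speed in MHz
    private final long ramMb;      // updated memory size in MB

    ReconfigureVMCommand(String vmName, int cpuCount, int cpuSpeedMhz, long ramMb) {
        if (vmName == null || vmName.isEmpty()) {
            throw new IllegalArgumentException("vmName is required");
        }
        if (cpuCount <= 0 || cpuSpeedMhz <= 0 || ramMb <= 0) {
            throw new IllegalArgumentException("CPU and RAM values must be positive");
        }
        this.vmName = vmName;
        this.cpuCount = cpuCount;
        this.cpuSpeedMhz = cpuSpeedMhz;
        this.ramMb = ramMb;
    }

    String getVmName() { return vmName; }
    int getCpuCount() { return cpuCount; }
    int getCpuSpeedMhz() { return cpuSpeedMhz; }
    long getRamMb() { return ramMb; }
}
```

Keeping the command immutable means the agent for each HV sees exactly the values the management server reserved capacity for.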

Allocation logic: Refer to the flow chart below. There are 3 primary use cases:

VM's current host has capacity - If the VM's current host has capacity to scale up the VM, we put the VM in the Reconfiguring state and lock the delta capacity. We then send the ReconfigureVMCommand to the HV to reconfigure the VM and scale it to the new values. On either success or failure we put the VM back into the Running state, and in case of failure we also release the delta capacity.
VM's current host doesn't have capacity but the VM's cluster has - If the VM's current host doesn't have capacity, we call the planners to find a suitable host in the cluster that can take the scaled-up VM. Once a host is found, we lock the new required capacity on that host and migrate the VM to it. Once migrated, we send the ReconfigureVMCommand to the HV. If reconfiguring fails at this point, we release the delta capacity on the new host.
Cluster in which the VM is running doesn't have capacity - We simply return a failure to the end user saying that there is not enough capacity to scale up the VM.
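
The three cases above can be sketched in code as follows. This is a minimal in-memory model, not the actual planner or capacity code: the Host, Outcome, and Hypervisor types are hypothetical, capacity is tracked as plain free CPU MHz and RAM MB, the VM state transitions are omitted, and the live migration itself is assumed to always succeed.

```java
import java.util.List;

final class ScaleUpPlanner {
    /** Free capacity on one host; a hypothetical stand-in for real capacity tracking. */
    static final class Host {
        final String name;
        int freeCpuMhz;
        long freeRamMb;

        Host(String name, int freeCpuMhz, long freeRamMb) {
            this.name = name;
            this.freeCpuMhz = freeCpuMhz;
            this.freeRamMb = freeRamMb;
        }

        boolean canFit(int cpuMhz, long ramMb) {
            return freeCpuMhz >= cpuMhz && freeRamMb >= ramMb;
        }

        void reserve(int cpuMhz, long ramMb) { freeCpuMhz -= cpuMhz; freeRamMb -= ramMb; }

        void release(int cpuMhz, long ramMb) { freeCpuMhz += cpuMhz; freeRamMb += ramMb; }
    }

    enum Outcome { SCALED_IN_PLACE, MIGRATED_AND_SCALED, RECONFIGURE_FAILED, INSUFFICIENT_CAPACITY }

    /** Stands in for sending ReconfigureVMCommand to the HV; returns true on success. */
    interface Hypervisor {
        boolean reconfigure(Host host, int newCpuMhz, long newRamMb);
    }

    static Outcome scaleUp(Host current, List<Host> clusterHosts,
                           int curCpuMhz, long curRamMb,
                           int newCpuMhz, long newRamMb, Hypervisor hv) {
        int deltaCpu = newCpuMhz - curCpuMhz;
        long deltaRam = newRamMb - curRamMb;

        // Case 1: the current host can absorb the delta, so lock it and reconfigure.
        if (current.canFit(deltaCpu, deltaRam)) {
            current.reserve(deltaCpu, deltaRam);
            if (hv.reconfigure(current, newCpuMhz, newRamMb)) {
                return Outcome.SCALED_IN_PLACE;
            }
            current.release(deltaCpu, deltaRam); // failure: give the delta back
            return Outcome.RECONFIGURE_FAILED;
        }

        // Case 2: look for another host in the same cluster that fits the scaled-up VM.
        for (Host target : clusterHosts) {
            if (target == current || !target.canFit(newCpuMhz, newRamMb)) {
                continue;
            }
            target.reserve(newCpuMhz, newRamMb);   // lock the full new size on the target
            current.release(curCpuMhz, curRamMb);  // migration assumed to succeed here
            if (hv.reconfigure(target, newCpuMhz, newRamMb)) {
                return Outcome.MIGRATED_AND_SCALED;
            }
            target.release(deltaCpu, deltaRam);    // VM stays on target at its old size
            return Outcome.RECONFIGURE_FAILED;
        }

        // Case 3: no host in the cluster has enough capacity.
        return Outcome.INSUFFICIENT_CAPACITY;
    }
}
```

In this sketch the first host that fits is chosen; the real planners may apply their own ordering and heuristics.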

Web Services APIs

The following APIs need to be changed:

upgradeVirtualMachine - This is an existing API that takes vm_id and compute_offering_id as inputs. It is currently a sync call and will not be used anymore, since we need the call to be async. I plan to deprecate this API in the next major release (5.0).
scaleVirtualMachine - I will introduce this new API, which will be similar to upgradeVirtualMachine in every aspect except that it will be async.
For system VMs the same API can be used, with proper access checks. If a migration is needed, this will internally use the migrateVirtualMachine API logic.
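
To illustrate what the async behaviour means for callers, here is a minimal in-memory sketch: the call returns a job id immediately, and the caller checks the job status later. The AsyncScaleApi class, the JobStatus values, and the complete and queryJob methods are hypothetical names for this example only; they do not describe the actual async job framework.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of an async API call; not the real job framework.
final class AsyncScaleApi {
    enum JobStatus { PENDING, SUCCEEDED, FAILED }

    private final Map<Long, JobStatus> jobs = new ConcurrentHashMap<>();
    private final AtomicLong nextJobId = new AtomicLong(1);

    /** Accepts the request and returns a job id at once, without waiting for the scale-up. */
    long scaleVirtualMachine(String vmId, String computeOfferingId) {
        if (vmId == null || computeOfferingId == null) {
            throw new IllegalArgumentException("vm_id and compute_offering_id are required");
        }
        long jobId = nextJobId.getAndIncrement();
        jobs.put(jobId, JobStatus.PENDING);
        return jobId;
    }

    /** Called by the (assumed) background worker once the reconfigure has finished. */
    void complete(long jobId, boolean success) {
        if (jobs.replace(jobId, success ? JobStatus.SUCCEEDED : JobStatus.FAILED) == null) {
            throw new IllegalArgumentException("unknown job " + jobId);
        }
    }

    /** Lets the caller (for example the UI) poll for the result. */
    JobStatus queryJob(long jobId) {
        JobStatus status = jobs.get(jobId);
        if (status == null) {
            throw new IllegalArgumentException("unknown job " + jobId);
        }
        return status;
    }
}
```

A caller would invoke scaleVirtualMachine, keep the returned job id, and poll queryJob until the status is no longer PENDING.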

UI flow

The UI needs to offer the upgrade action when the VM is in the running state, just as it already does when the VM is stopped.

The UI needs to call the new scaleVirtualMachine API for this, keeping in mind that this API is async, unlike the previous one, which was sync. Other than that, all the parameters remain the same.

A subtask has been created for this: https://issues.apache.org/jira/browse/CLOUDSTACK-1041

Open Issues

Should scale down be allowed? It can be explicitly prevented, since none of the HVs/guest OSes support it.

There is also the option of having a custom compute offering where the user can specify values for CPU and RAM during deployment or scaling up. But I am not sure whether this option could be misused, since this is a user-level API. Another complexity is capturing usage, which is currently done based on the compute offering.

Test cases

TBD

Appendix

Appendix A:

Appendix B:
