
Container technologies are gaining momentum and changing the way applications are traditionally deployed in public and private clouds. Growing interest in microservices-based architectures is also fostering adoption of container technologies. Just as cloud orchestration platforms like CloudStack enabled provisioning of VMs and adjunct services, container orchestration platforms like Kubernetes, Docker Swarm, and Mesos are emerging to enable orchestration of containers. Container orchestration platforms can typically run anywhere and be used to provision containers. A popular choice has been to run containers on IaaS-provisioned VMs, and AWS and GCE provide native functionality to launch containers that abstracts the underlying consumption of VMs. There are a couple of efforts to provision container orchestration platforms on top of CloudStack, but they are not out-of-the-box solutions. Given the momentum of container technologies, microservices, etc., it makes sense to provide native functionality in CloudStack that is available out of the box for users.

Purpose

The purpose of this document is to present the functional requirements for native functionality in CloudStack to provision containers, and to detail the design aspects of how the functionality will be achieved.

Functional specification

Container Cluster

CloudStack container service shall introduce the notion of a container cluster. A 'container cluster' shall be a first-class CloudStack entity that is a composite of existing CloudStack entities like virtual machines, networks, network rules, etc. The container service shall stitch together the container cluster resources and deploy a chosen cluster manager like Kubernetes, Mesos, or Docker Swarm, to provide CloudStack users a container service similar to AWS ECS or Google Container Engine.

Cluster life-cycle management

The container service shall provide the following container cluster life-cycle operations, though for the v1 release only the create and delete life-cycle operations will be supported.

  • create container cluster: provisions the container cluster resources and brings the container cluster into an operationally ready state to launch containers. The resources provisioned depend on the cluster manager used. All the cluster VMs are launched into a network dedicated to the cluster, and the API endpoint of the cluster manager is exposed through a port-forwarding rule created on the source NAT IP of that network (see the sketch after this list).
  • delete container cluster: destroys all the resources provisioned for the container cluster. Post delete, no operations can be performed on the container cluster.
  • start container cluster: starts the cluster VMs and possibly the network. Start is not guaranteed to succeed even when performed after a stop; resources are provisioned afresh.
  • stop container cluster: shuts down all the resources consumed by the container cluster. The user can start the cluster at a later point with the start operation.
  • recover container cluster: due to possible faults (VMs stopped due to failures, a malfunctioning cluster manager, etc.) a container cluster can end up in the 'Alert' state. Recover is used to revive the container cluster to a sane running state.
  • cluster resizing (scale-in/out): increases or decreases the size of the cluster.
  • list container clusters: lists all container clusters.
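A minimal sketch of the create flow described above, assuming a hypothetical Provisioner interface that stands in for the corresponding CloudStack services (network creation, VM deployment, port-forwarding rules); none of these names are real CloudStack APIs:

public final class CreateClusterSketch {

    // Hypothetical stand-in for the CloudStack services involved.
    interface Provisioner {
        long createIsolatedNetwork(String clusterName);
        long deployVm(long networkId, long serviceOfferingId);
        void addPortForwardingRule(long networkId, int publicPort,
                                   long vmId, int privatePort);
    }

    static void createCluster(Provisioner p, String name,
                              long serviceOfferingId, int clusterSize) {
        // 1. Create the network dedicated to the cluster.
        long networkId = p.createIsolatedNetwork(name);

        // 2. Launch all cluster VMs into that network.
        long managerVm = -1;
        for (int i = 0; i < clusterSize; i++) {
            long vmId = p.deployVm(networkId, serviceOfferingId);
            if (i == 0) {
                managerVm = vmId; // assume node 0 runs the cluster manager
            }
        }

        // 3. Expose the cluster manager's API endpoint through a
        //    port-forwarding rule on the network's source NAT IP
        //    (port 443 is an assumption, e.g. the Kubernetes API server).
        p.addPortForwardingRule(networkId, 443, managerVm, 443);
    }
}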

Provisioning the container orchestrator

As part of container cluster creation, the container service shall be responsible for setting up the control plane of the chosen container orchestrator.

Design

API changes

  • createContainerCluster
  • deleteContainerCluster
  • startContainerCluster
  • stopContainerCluster
  • listContainerCluster
  • listContainerClusterCACert
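One possible shape for the service interface backing these API commands; a sketch only, and the signatures are assumptions rather than the final API:

import java.util.List;

// Hypothetical service interface behind the proposed API commands.
// Method names mirror the API list above; parameters are illustrative.
public interface ContainerClusterService {
    long createContainerCluster(String name, long zoneId,
                                long serviceOfferingId, int clusterSize);
    boolean deleteContainerCluster(long clusterId);
    boolean startContainerCluster(long clusterId);
    boolean stopContainerCluster(long clusterId);
    List<String> listContainerCluster(Long clusterId);   // cluster summaries
    String listContainerClusterCACert(long clusterId);   // PEM-encoded CA cert
}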

Each life-cycle operation is a workflow resulting in provisioning or deleting multiple CloudStack resources. Atomicity is not achievable: there is no guarantee that a life-cycle operation's workflow will succeed, because there is no 2PC-like model of resource reservation followed by provisioning, and there is no guarantee that a rollback will succeed either. For example, while provisioning a cluster of 10 VMs, the deployment may run out of capacity after provisioning 5 VMs; as a rollback, the provisioned VMs can be destroyed, but there can be cases where deleting a provisioned VM is temporarily not possible (e.g. disconnected hosts). The approach below is followed.

...
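A sketch of this best-effort approach, with hypothetical helper types: a failure mid-workflow triggers a rollback attempt, and if the rollback itself fails, the cluster is parked in the 'Expunge' state for the garbage collector described below.

public final class RollbackSketch {

    // Hypothetical helpers; not real CloudStack APIs.
    interface Provisioner {
        void provisionAll(long clusterId);   // network, VMs, rules
        void releaseAll(long clusterId);     // destroy whatever was provisioned
    }
    interface ClusterDao {
        void updateState(long clusterId, String newState);
    }

    static void createWithRollback(Provisioner p, ClusterDao dao, long clusterId) {
        try {
            p.provisionAll(clusterId);
            dao.updateState(clusterId, "Running");
        } catch (RuntimeException provisioningFailed) {
            try {
                p.releaseAll(clusterId);                 // best-effort rollback
                dao.updateState(clusterId, "Destroyed");
            } catch (RuntimeException cleanupFailed) {
                // e.g. a disconnected host: park the cluster for the
                // garbage collector to retry the cleanup later
                dao.updateState(clusterId, "Expunge");
            }
        }
    }
}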

The state machine below captures the state of a container cluster as it goes through the various life-cycle operations; not all states are necessarily visible to the end user. (A sketch of the states follows the diagram.)


...
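The cluster states named throughout this document, captured as a simple enum; a sketch only, with the role of each state summarized in comments:

// Container cluster states named in this document; transitions follow
// the flows described in the garbage collection and synchronization
// sections below.
public enum ClusterState {
    Starting,   // create/start operation in progress
    Running,    // cluster is in its desired, operationally ready state
    Stopped,    // all cluster resources are shut down
    Alert,      // actual state has diverged from the desired state
    Expunging,  // delete/cleanup in progress
    Expunge,    // cleanup failed; queued for the garbage collector
    Destroyed   // all resources have been freed
}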

Garbage collection

Garbage collection will be implemented as a background task to clean up the resources of a container cluster. Cluster resources are freed in the following cases:

  • starting a container cluster fails, resulting in cleanup of the provisioned resources (Starting → Expunging → Destroyed)
  • deleting a container cluster (Stopped → Expunging → Destroyed, and Alert → Expunging → Destroyed)

If there are failures in cleaning up resources and cleanup cannot proceed, the container cluster is moved from the 'Expunging' state to the 'Expunge' state. The garbage collector periodically loops through the list of container clusters in the 'Expunge' state and tries to free the resources held by them.
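A sketch of the collector loop, assuming a hypothetical DAO and cleanup helper; it periodically retries clusters parked in the 'Expunge' state:

import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Background garbage collector: retries resource cleanup for container
// clusters stuck in 'Expunge'. All helper types are hypothetical.
public final class ClusterGarbageCollector {

    interface ClusterDao {
        List<Long> listIdsInState(String state);
        void updateState(long clusterId, String newState);
    }
    interface Cleanup {
        boolean releaseAll(long clusterId);  // true if every resource was freed
    }

    private final ClusterDao dao;
    private final Cleanup cleanup;

    ClusterGarbageCollector(ClusterDao dao, Cleanup cleanup) {
        this.dao = dao;
        this.cleanup = cleanup;
    }

    void start() {
        // The sweep interval is an assumption; it would likely be configurable.
        Executors.newSingleThreadScheduledExecutor()
                 .scheduleWithFixedDelay(this::sweep, 5, 5, TimeUnit.MINUTES);
    }

    private void sweep() {
        for (long id : dao.listIdsInState("Expunge")) {
            dao.updateState(id, "Expunging");
            if (cleanup.releaseAll(id)) {
                dao.updateState(id, "Destroyed");
            } else {
                dao.updateState(id, "Expunge");  // retry on the next sweep
            }
        }
    }
}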

OPEN QUESTION

Should we implement rollback on failure during container cluster creation, or do a lazy cleanup, i.e. mark the container cluster as being in the 'Expunging' state and let the garbage collector do the cleanup? It is just a matter of when to do it; both flows may use the same cleanup module.

Cluster state synchronization

The state of the container cluster is the 'desired state' of the cluster as intended by the user, or the system's logical view of the container cluster. However, there are various scenarios where the desired state of the container cluster is not in sync with the state that can be inferred from the actual physical infrastructure. For example, consider a container cluster in the 'Running' state with a cluster size of 10 VMs, all running. Due to host failures, some of the VMs may get stopped at a later point. The desired state of the container cluster is still a cluster of 10 running VMs in an operationally ready state (w.r.t. container provisioning), but the state at the resource layer is different. So we need a mechanism to ensure:

  • the cluster is in the desired state at the resource/infrastructure layer, which could mean provisioning new VMs or deleting VMs in the cluster, etc., to ensure the desired state of the container cluster
  • conversely, when reconciliation cannot happen, the state of the cluster is reflected accordingly so it can be recovered at a later point

The following mechanism will be implemented (sketched after the list).

  • An 'Alert' state will be maintained that indicates the container cluster is not in its desired state.
  • A state synchronization background task will run periodically to infer whether the cluster is in its desired state; if not, the cluster will be marked as being in the 'Alert' state.
  • A recovery action will try to recover the cluster.
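A sketch of the synchronization check, again with hypothetical helper types; it compares the desired node count against the VMs actually running and flags divergence:

// Periodic desired-vs-actual check for one cluster; helper types
// are hypothetical stand-ins for the real DAO and VM inventory.
public final class ClusterStateSync {

    interface ClusterDao {
        int desiredNodeCount(long clusterId);
        String state(long clusterId);
        void updateState(long clusterId, String newState);
    }
    interface VmInventory {
        int runningNodeCount(long clusterId);
    }

    private final ClusterDao dao;
    private final VmInventory vms;

    ClusterStateSync(ClusterDao dao, VmInventory vms) {
        this.dao = dao;
        this.vms = vms;
    }

    void check(long clusterId) {
        boolean inDesiredState =
            vms.runningNodeCount(clusterId) == dao.desiredNodeCount(clusterId);
        if (!inDesiredState && "Running".equals(dao.state(clusterId))) {
            dao.updateState(clusterId, "Alert");  // recovery is a separate action
        }
    }
}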

State transitions in the FSM where a container cluster ends up in the 'Alert' state:

  • failure in the middle of a scale-in/out, resulting in a cluster size (number of VMs) not equal to the expected size
  • failure in stopping a cluster, leaving some VMs in the running state
  • a difference of states as detected by the state synchronization thread

Out-of-band changes

From a layering perspective, CCS is layered on top of CloudStack functionality, and there is no way to control the life-cycle of the individual resources that are part of a container cluster. For example, a user can delete VMs that are part of a container cluster.

OPEN QUESTION: are there hooks to restrict such actions?

The only design option is for cluster state synchronization to figure out missing entities (in the case of destroyed VMs) or conflicting states (a user can stop a VM that CCS expects to be running) and put the cluster in the 'Alert' state.

Policies can be defined on how to recover the cluster.

Re-use the 'cloud' DB vs. keep a separate DB

 

 
separate DB
  Pros:
    • clean separation; there is no specific advantage w.r.t. data integrity in keeping the CCS DB as part of the 'cloud' DB
    • no perceived side effects on the 'cloud' DB (although the CCS plug-in can still modify the 'cloud' DB)
    • avoids possible side effects on the CCS DB during CloudStack DB upgrades
  Cons:
    • a new ORM is needed to access the CCS DB
    • the CloudStack ORM is tied to the 'cloud' DB and is difficult to switch

reuse the 'cloud' DB and extend the schema
  Pros:
    • easiest path; leverages the existing ORM
    • foreign keys, delete cascades, etc. can be used for cross-table references where possible
  Cons:
    • side effects on upgrades

 

Handling out-of-band changes

 

CCS will keep the book-keeping tables below to store the CloudStack resources provisioned and used for a container cluster.

Note that there are no foreign keys or delete cascades: CCS should not lose book-keeping data about a resource even if the resource is deleted from the CloudStack DB.

CCS code needs to be defensive and verify that an entity exists in the CloudStack tables before using it.
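A sketch of that defensive pattern: because the book-keeping tables below have no foreign keys, a row may reference a VM that was already deleted out-of-band, so every lookup must verify existence first (helper names are assumptions):

import java.util.ArrayList;
import java.util.List;

// Defensive resolution of a cluster's VMs from the book-keeping table.
// Both DAO types are hypothetical.
public final class ClusterVmResolver {

    interface VmMapDao {
        List<Long> vmIdsForCluster(long clusterId);  // container_cluster_vm_map rows
    }
    interface VmDao {
        boolean exists(long vmId);                   // present in CloudStack tables?
    }

    static List<Long> liveVmIds(VmMapDao map, VmDao vms, long clusterId) {
        List<Long> live = new ArrayList<>();
        for (long vmId : map.vmIdsForCluster(clusterId)) {
            if (vms.exists(vmId)) {   // verify before use; no FK guarantees
                live.add(vmId);
            }
            // A missing VM is left to state synchronization, which moves
            // the cluster to 'Alert' rather than failing here.
        }
        return live;
    }
}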

 

 

CREATE TABLE IF NOT EXISTS `cloud`.`container_cluster` (
    `id` bigint unsigned NOT NULL auto_increment COMMENT 'id',
    `uuid` varchar(40),
    `name` varchar(255) NOT NULL,
    `description` varchar(4096) NULL COMMENT 'description',
    `zone_id` bigint unsigned NOT NULL COMMENT 'zone id',
    `service_offering_id` bigint unsigned COMMENT 'service offering id for the cluster VM',
    `template_id` bigint unsigned COMMENT 'vm_template.id',
    `network_id` bigint unsigned COMMENT 'network this public ip address is associated with',
    `node_count` bigint NOT NULL default '0',
    `account_id` bigint unsigned NOT NULL COMMENT 'owner of this cluster',
    `domain_id` bigint unsigned NOT NULL COMMENT 'owner of this cluster',
    `state` char(32) NOT NULL COMMENT 'current state of this cluster',
    `key_pair` varchar(40),
    `cores` bigint unsigned NOT NULL COMMENT 'number of cores',
    `memory` bigint unsigned NOT NULL COMMENT 'total memory',
    `endpoint` varchar(255) COMMENT 'url endpoint of the container cluster manager api access',
    `console_endpoint` varchar(255) COMMENT 'url for the container cluster manager dashboard',
 
--    CONSTRAINT `fk_cluster__zone_id` FOREIGN KEY `fk_cluster__zone_id` (`zone_id`) REFERENCES `data_center` (`id`) ON DELETE CASCADE,
--    CONSTRAINT `fk_cluster__service_offering_id` FOREIGN KEY `fk_cluster__service_offering_id` (`service_offering_id`) REFERENCES `service_offering`(`id`) ON DELETE CASCADE,
--    CONSTRAINT `fk_cluster__template_id` FOREIGN KEY `fk_cluster__template_id`(`template_id`) REFERENCES `vm_template`(`id`) ON DELETE CASCADE,
--    CONSTRAINT `fk_cluster__network_id` FOREIGN KEY `fk_cluster__network_id`(`network_id`) REFERENCES `networks`(`id`) ON DELETE CASCADE,
 
    PRIMARY KEY(`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
 
CREATE TABLE IF NOT EXISTS `cloud`.`container_cluster_vm_map` (
    `id` bigint unsigned NOT NULL auto_increment COMMENT 'id',
    `cluster_id` bigint unsigned NOT NULL COMMENT 'cluster id',
    `vm_id` bigint unsigned NOT NULL COMMENT 'vm id',
 
    PRIMARY KEY(`id`)
 
--    CONSTRAINT `container_cluster_vm_map_cluster__id` FOREIGN KEY `container_cluster_vm_map_cluster__id`(`cluster_id`) REFERENCES `container_cluster`(`id`) ON DELETE CASCADE
 
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

 

 
