This document describes the auto scaling strategy for the MXNet CI slave fleet.

1. Slave configuration

MXNet CI will make use of multiple slaves, distinguishable by their labels.

 

Current slave configurations:

1. mxnetlinux-cpu

Instance-Type: C5.18xlarge (72 vCPUs, 0 GPUs)

Executors: 4 (maybe variable, see “Further plans: Heavy job plugin”)

This instance is capable of building all Unix-based artifacts and executing all CPU-based tests on Ubuntu. The same instance is used for both tasks because neither generates permanent CPU utilization; both produce short spikes instead. Available resources can therefore be shared across as many tasks as possible, keeping efficiency high without sacrificing execution time.

 

2. mxnetlinux-gpu

Instance-Type: G3.8xlarge (32 vCPUs, 2 GPUs)

Executors: 1 (see “Further plans: Assign GPUs to jobs”)

This instance only executes GPU-based tests on Ubuntu. At this stage, only one GPU is used because no parallelization is possible with nvidia-docker (see “Further plans: Assign GPUs to jobs”). The 8xlarge size is chosen for its number of vCPUs: at this stage, a GPU test spends most of its time validating results with NumPy on the CPU (see “Further plans: Cache validation results”).

 

3. mxnetwindows-cpu

Instance-Type: C5.18xlarge (72 vCPUs, 0 GPUs)

Executors: 4 (maybe variable, see “Further plans: Heavy job plugin”)

This instance is capable of building all Windows-based artifacts and executing all CPU-based tests on Windows.

 

4. mxnetwindows-gpu

Instance-Type: G3.8xlarge (32 vCPUs, 2 GPUs)

Executors: 1 (see “Further plans: Assign GPUs to jobs”)

This instance only executes GPU-based tests on Windows. At this stage, only one GPU is used because the current CI setup does not support assigning GPUs to jobs (see “Further plans: Assign GPUs to jobs”).

5. mxnetlinux-gpu-p3

Instance-Type: P3.2xlarge (8 vCPUs, 1 GPU)

Executors: 1 (see “Further plans: Assign GPUs to jobs”)

This instance is used to validate features which are only available on the latest GPU generations on Ubuntu.


6. mxnetlinux-gpu-p3-8xlarge

Instance-Type: P3.8xlarge (32 vCPUs, 4 GPUs)

Executors: 1 (see “Further plans: Assign GPUs to jobs”)

This instance is used to validate features which are only available on the latest GPU generations on Ubuntu.

7. utility

Instance-Type: C5.large (2 vCPUs, 0 GPUs)

Executors: 30

The job of this slave is to execute Jenkins operations that need a node context. This could be loading a Groovy file, publishing a GitHub status, checking out the repository, and similar things.

This slave is NOT intended to run any workload. If you have to run a proper task, as small as it might seem, run it on mxnetlinux-cpu. The intention of this slave is to act as a high-priority queue for lifecycle operations that are required to execute a job properly. That is why this slave has so few CPU cores but so many executors: it is not meant to do any actual work.

Planned:

1. mxnetwindows-gpu-p3

Instance-Type: P3.2xlarge (8 vCPUs, 1 GPU)

Executors: 1 (see “Further plans: Assign GPUs to jobs”)

This instance will be used to validate features which are only available on the latest GPU generations on Windows.


2. mxnetwindows-gpu-p3-8xlarge

Instance-Type: P3.8xlarge (32 vCPUs, 4 GPUs)

Executors: 1 (see “Further plans: Assign GPUs to jobs”)

This instance will be used to validate features which are only available on the latest GPU generations on Windows.

2. Scaling metrics

System metrics (CPU, GPU, RAM, HDD/IO, network) will not be used as the basis for scaling decisions because their utilization, especially while multiple jobs are running, varies extensively and thus does not represent a reliable data source. Instead, more context-sensitive information should be used, such as the following Jenkins data:

  1. Queue: Details about jobs enqueued for a specific slave-type.
  2. Free & used executors/machines: Number of free and occupied executors per machine and slave-type.
  3. Remaining job duration: Estimated time (percentage and absolute) left until a job finishes.
  4. Successful & failing jobs: Number of successful and failing jobs per machine and in total.

In order to generate these metrics, the following two approaches can be used:

  1. AWS Lambda: Access the Jenkins REST API using Python, NodeJS or Java and aggregate the values in Lambda before pushing them to CloudWatch.
  2. Scripts on master: Run Python, NodeJS or Java locally on the master to access the Jenkins REST API, aggregate the values and push the data points to CloudWatch via the aws-cli.

The only difference between the two approaches is where the script is executed. To ensure separation of concerns, AWS Lambda is preferred: it allows proper execution and debugging without requiring access to the master. In both cases, the Jenkins REST API is used.
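
A minimal sketch of the Lambda approach follows. The Jenkins URL, the CloudWatch namespace and the assumption that the master is reachable without authentication are placeholders; the /queue/api/json and /computer/api/json endpoints and boto3's put_metric_data are standard.

```python
# Sketch: aggregate Jenkins queue/executor data and push it to CloudWatch.
import json
import urllib.request

import boto3

JENKINS_URL = "http://jenkins.example.com"   # assumption: internal master URL
NAMESPACE = "MXNetCI"                        # assumption: CloudWatch namespace


def fetch(path):
    with urllib.request.urlopen(JENKINS_URL + path) as response:
        return json.loads(response.read())


def handler(event, context):
    queue = fetch("/queue/api/json")
    computers = fetch("/computer/api/json")

    queued_jobs = len(queue.get("items", []))
    # Count executors on fully idle, online nodes only (kept simple for the sketch).
    idle_executors = sum(
        c["numExecutors"] for c in computers["computer"]
        if not c["offline"] and c["idle"]
    )

    boto3.client("cloudwatch").put_metric_data(
        Namespace=NAMESPACE,
        MetricData=[
            {"MetricName": "QueuedJobs", "Value": queued_jobs, "Unit": "Count"},
            {"MetricName": "IdleExecutors", "Value": idle_executors, "Unit": "Count"},
        ],
    )
```

Such a function could be triggered periodically (for example by a CloudWatch Events rule); the exact set of published metrics would follow the list above.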

3. Alarms

While some metrics may vary heavily during the lifetime of a slave, persistently high values could indicate poor health of a slave or of the GitHub repository. In order to catch these cases, the following alarms could be used:

  1. Queue waiting time too high: Based on “Queue”, this alarm could be triggered if the time a job waits for an executor exceeds a certain threshold. This would mean that auto scaling is not working properly or that other issues are preventing jobs from being scheduled (a minimal alarm sketch follows this list).
  2. Slave RAM/HDD exceeded: RAM and disk usage should always stay below a certain threshold to prevent out-of-memory or out-of-disk errors. A monitor running on all slaves would reveal memory leaks or missing clean-up tasks before they impact the running system.
  3. Master CPU/RAM/HDD/network/response-time exceeded: The master should always have a good amount of spare resources available. In the current setup, the master is not redundant (see “Further plans: Redundant master”) and thus a single-point-of-failure.
  4. Too many free executors: Too many free executors could indicate that the system is not scaling down properly.
  5. Job health: If a job on the same slave takes too long or fails multiple times, this could indicate poor health of that slave. As a countermeasure, the slave could be shut down or disconnected; a subsequent alarm is necessary to ensure that false errors do not make the system unstable.
  6. Budget: An alarm should be triggered if the maximum number of instances has been reached or too many instances have been started within a short amount of time. This could indicate a misbehavior of the auto-scaling system.  
  7. CloudWatch data missing: If a service fails to provide data, an alarm should be triggered. This could indicate failure on the Master, Lambda or the data aggregation script.
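
As an illustration of the first alarm (“Queue waiting time too high”), a minimal boto3 sketch is shown below. The metric name, namespace, thresholds and SNS topic are assumptions and would have to match whatever the aggregation script actually publishes; TreatMissingData="breaching" also covers the “CloudWatch data missing” case.

```python
# Sketch: create the "Queue waiting time too high" alarm.
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="mxnet-ci-queue-waiting-time-too-high",   # assumption
    Namespace="MXNetCI",                                # assumption
    MetricName="MaxQueueWaitingTimeSeconds",            # assumption
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=3,
    Threshold=1800,                                     # assumption: 30 minutes
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",   # missing data itself triggers the alarm (case 7)
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:mxnet-ci-alarms"],  # placeholder
)
```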

4. Upscaling

Before a slave is added, the current queue should be evaluated. Because Jenkins may require some time to schedule tasks, only jobs that have been in the queue for at least 2 minutes should be considered.
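
A sketch of that 2-minute filter is shown below. The Jenkins URL is an assumption; inQueueSince is the millisecond timestamp exposed by the queue API.

```python
# Sketch: only count queue items that have been waiting for at least two minutes.
import json
import time
import urllib.request

JENKINS_URL = "http://jenkins.example.com"  # assumption


def jobs_waiting_longer_than(min_age_seconds=120):
    with urllib.request.urlopen(JENKINS_URL + "/queue/api/json") as response:
        queue = json.loads(response.read())
    now_ms = time.time() * 1000
    return [
        item for item in queue.get("items", [])
        if now_ms - item["inQueueSince"] >= min_age_seconds * 1000
    ]
```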

Due to all instances being in the same VPC, anonymous authentication is used for JNLP. This allows to start a slave and let it connect automatically without the necessity to manage credentials. In order to improve debugging, an instance should be able to rename itself in the EC2 console by using the AWS-CLI command modify-instance-attribute http://docs.aws.amazon.com/cli/latest/reference/ec2/modify-instance-attribute.html. This step has to be security evaluated. An alternative could be to retrieve the slave’s IP on the master and thus create a mapping. This mapping could be evaluated by Lambda in order to rename the instances.
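
A possible sketch of the self-renaming step: the display name shown in the EC2 console is the "Name" tag, so this sketch uses create-tags (rather than modify-instance-attribute) together with the instance metadata service; the naming scheme is an assumption.

```python
# Sketch: a slave sets its own "Name" tag so it is identifiable in the EC2 console.
import urllib.request

import boto3


def rename_self(label, slot):
    # Instance id from the EC2 instance metadata service.
    instance_id = urllib.request.urlopen(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ).read().decode()
    boto3.client("ec2").create_tags(
        Resources=[instance_id],
        Tags=[{"Key": "Name", "Value": f"{label}-{slot}"}],  # e.g. mxnetlinux-cpu-3
    )
```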

In order to improve startup times and prevent a cold start during the execution of Jenkins jobs, an AMI should be created and the usage of a shared Docker cache (see “Further plans: Shared Docker cache”) considered.

By default, the Jenkins master declines connection attempts to already occupied slots. An instance could try to connect to every single existing slot and wait until a connection is accepted. It has to be evaluated whether this is a good approach and how race conditions are handled. An alternative solution could be to access the Jenkins API and retrieve a list of free slots. In order to reduce the number of slaves trying to connect to the same slot, the slot could be chosen at random; if the randomly chosen slot has been occupied in the meantime, the instance should simply try a different one.
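
A rough sketch of the randomized slot selection, assuming slots are pre-created node entries whose names start with the slave label (an assumption) and that an unconnected slot shows up as offline in the computer API:

```python
# Sketch: pick a random free (offline) slot for the given slave label.
import json
import random
import urllib.request

JENKINS_URL = "http://jenkins.example.com"  # assumption


def pick_free_slot(label_prefix):
    with urllib.request.urlopen(JENKINS_URL + "/computer/api/json") as response:
        computers = json.loads(response.read())["computer"]
    free = [
        c["displayName"] for c in computers
        if c["displayName"].startswith(label_prefix) and c["offline"]
    ]
    return random.choice(free) if free else None
```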

5. Downscaling

A few actions have to be taken before an instance is shut down as part of a downscale operation; otherwise, a running build could be interrupted and thus marked as failed.

By default, Jenkins distributes jobs evenly across all connected slaves. This results in all slaves being used only partially, so shutting an instance down would interrupt a build. Instead, a way has to be found to make Jenkins occupy all executors of a single slave first. In case this is not possible, the Naginator plugin (https://wiki.jenkins-ci.org/display/JENKINS/Naginator+Plugin) can be used to re-schedule a failed build. By evaluating how long each job has already been running, the slave with the smallest sum of durations can be turned off.
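
The selection of the slave to turn off could look roughly like the following sketch; the scoring (fewest running builds first, then highest build progress) is an assumption, since the paragraph above only requires some duration-based ordering. The executor "progress" field (percentage, -1 if unknown) comes from the Jenkins computer API at depth=1.

```python
# Sketch: choose the node of a given label with the least ongoing work.
import json
import urllib.request

JENKINS_URL = "http://jenkins.example.com"  # assumption


def slave_to_stop(label_prefix):
    with urllib.request.urlopen(JENKINS_URL + "/computer/api/json?depth=1") as r:
        computers = json.loads(r.read())["computer"]
    candidates = []
    for c in computers:
        if not c["displayName"].startswith(label_prefix) or c["offline"]:
            continue
        busy = [e for e in c["executors"] if not e["idle"]]
        # Fewer running builds and higher progress mean less work would be lost.
        score = (len(busy), -sum(max(e.get("progress", 0), 0) for e in busy))
        candidates.append((score, c["displayName"]))
    return min(candidates)[1] if candidates else None
```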

Before a slave is turned off, it should be marked as offline. This avoids race conditions caused by the Jenkins scheduler assigning a job just before the slave is disconnected. After waiting a few seconds, the auto scaling service should validate that no jobs are being executed. Only then can the instance be terminated.
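
A sketch of that shutdown sequence is shown below, with authentication and CSRF crumb handling omitted for brevity; the Jenkins URL and the node-name-to-instance-id mapping are assumptions.

```python
# Sketch: mark a node offline, verify it is idle, then terminate its EC2 instance.
import json
import time
import urllib.request

import boto3

JENKINS_URL = "http://jenkins.example.com"  # assumption


def shut_down_slave(node_name, instance_id):
    # Mark the node temporarily offline so the scheduler stops assigning jobs to it.
    urllib.request.urlopen(
        urllib.request.Request(
            f"{JENKINS_URL}/computer/{node_name}/toggleOffline"
            "?offlineMessage=autoscaler-shutdown",
            method="POST",
        )
    )
    time.sleep(10)  # give the scheduler a few seconds to settle

    with urllib.request.urlopen(f"{JENKINS_URL}/computer/{node_name}/api/json") as r:
        node = json.loads(r.read())
    if not node["idle"]:
        return False  # a build sneaked in; leave the node alone for now

    boto3.client("ec2").terminate_instances(InstanceIds=[instance_id])
    return True
```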

To avoid further race conditions, the job queue should be checked while evaluating how many slaves should be stopped. If a job for that slave-type is queued, this should be considered.

6. Further plans

This section describes thoughts which should be evaluated at a later point in time.

a) Heavy job plugin

This plugin allows weights to be assigned to jobs so that resource-intensive jobs can be marked as such. A single machine can then run either a few heavy jobs or many small ones, which improves resource utilization.

More information at https://wiki.jenkins.io/display/JENKINS/Heavy+Job+Plugin 

b) Spot instances

In order to reduce costs, Spot instances could be used. Depending on the region, up to 90% can be saved compared to On-Demand instances. To avoid builds being marked as failed because the market price rose above the bid and the instance was shut down, the Naginator plugin can be used to re-schedule the affected build.
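
A hedged sketch of such a spot request via boto3; the AMI, security group, key name and bid price are placeholders.

```python
# Sketch: request a one-time spot instance for a CPU slave.
import boto3

ec2 = boto3.client("ec2")
ec2.request_spot_instances(
    SpotPrice="0.90",                        # assumption: bid below the on-demand price
    InstanceCount=1,
    Type="one-time",
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",             # placeholder: pre-baked slave AMI
        "InstanceType": "c5.18xlarge",
        "SecurityGroupIds": ["sg-0123456789abcdef0"],   # placeholder
        "KeyName": "mxnet-ci",                          # placeholder
    },
)
```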

c) Assign GPUs to jobs

To improve the parallelisation of tests, all GPUs should be used to execute tests. At the moment, only one GPU is used per instance due to hardcoded GPU indices in the MXNet code and tests. Also, some tests like KVStore require more than one GPU and would fail otherwise.

1. Linux

On Ubuntu, nvidia-docker is used as a wrapper around docker in order to provide access to the GPUs inside the container. By default, nvidia-docker is not able to start multiple containers in this setup because the GPUs cannot be shared between the container instances.
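
One possible way to pin a specific GPU per executor with nvidia-docker 1.x is its NV_GPU environment variable; the executor-to-GPU mapping below is an assumption.

```python
# Sketch: run a GPU test container restricted to the GPU matching the executor number.
import os
import subprocess


def run_gpu_test(executor_number, image, command):
    # nvidia-docker 1.x reads NV_GPU to decide which GPUs to expose to the container.
    env = dict(os.environ, NV_GPU=str(executor_number))  # e.g. executor 0 -> GPU 0
    subprocess.run(
        ["nvidia-docker", "run", "--rm", image] + command,
        env=env,
        check=True,
    )
```

With nvidia-docker2, the equivalent would be passing NVIDIA_VISIBLE_DEVICES to the container.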

2. Windows

TODO: Evaluate

...

d) Cache validation results

Instead of computing validation data in GPU jobs using NumPy on CPUs, the results should be cached. This can be achieved by mocking the NumPy interface and checking whether the passed command chain has been executed previously.
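
A very rough sketch of the idea (not an actual implementation): hash the call and its arguments and reuse a previously computed result instead of recomputing it on the CPU.

```python
# Sketch: memoize NumPy-based validation calls by hashing the call and its arguments.
import hashlib
import pickle

import numpy as np

_cache = {}


def cached_call(func_name, *args, **kwargs):
    key = hashlib.sha256(pickle.dumps((func_name, args, kwargs))).hexdigest()
    if key not in _cache:
        _cache[key] = getattr(np, func_name)(*args, **kwargs)
    return _cache[key]

# usage: cached_call("dot", a, b) instead of np.dot(a, b)
```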

e) Latest-gen tests

Some operators require the latest GPU hardware and are skipped if they are executed on older hardware. In order to test these cases, a new test job can be introduced. Due to the high price of P3 instances, these tests should be kept to a bare minimum.

f) Redundant master

If the master fails, it has to be restarted and might not come back up. We should evaluate whether it is possible and worth the effort to develop a redundant Jenkins master setup.

g) Shared Docker cache

To reduce the cold-start time of jobs on new slaves, the used Docker images should be stored in a shared cache. 
