This document describes the next version of Autonomous Container Scheduling.

Basics

Currently, discussion of and implementation work on the future architecture of OW are in progress.

I agree with many parts of the future architecture, and IMHO it is the right direction to move forward.

(I hope I can also participate in its implementation.)

It seems, however, that it will take some time to bring those changes into OW and stabilize them.


What if we could improve the performance of OpenWhisk by more than 150 times with a relatively small amount of effort, by adopting a new scheduler based on the current architecture?

I think it would be worth taking it in until the new implementation is ready.

With the SPI abstraction, the new scheduler can even coexist with the existing one.

OW operators can choose the scheduler that best fits their needs.


In the Autonomous Container Scheduler (I will call it ACS in the rest of this document), there are the following major changes:

  • Activation is handled in a best-effort manner.
    • One container will try its best to handle activations.
    • When existing containers are not enough to properly handle incoming activations, more containers are created for the action.
  • Once action containers are initialized, they live longer than in the current implementation.
    • They wait for subsequent requests for about 5~10s (a configurable value).
  • The throttling criterion is no longer activations per minute; the throttler now cares about the number of containers for the given action (namespace) and the processing speed of existing containers.
    • This means that once a container is initialized, it is occupied by the given action.


The following diagram depicts the basic flow of container creation and activation invocation.

Each path is handled separately, so there are only 3 steps in the activation path.



The following sections describe each part of ACS in more detail.

1. Segregation of container creation and activation processing

Activations live in the world of one- to two-digit milliseconds.

On the other hand, container operations take three to four digits of milliseconds (hundreds of milliseconds to several seconds).

Since container operations are much slower than action invocation, if we process them in the same path, activations are delayed by container operations.


In ACS, there are two kinds of Kafka topics:


  1. invokerN: this topic is now used for container creation.
  2. actionTopic: this topic is used only to deliver activation requests.


So whenever a controller receives activation requests, it just sends them to the action topic; no other logic is involved.

At the same time, the controller asynchronously starts monitoring the processing speed for the given action.

If there is no container, or the existing containers are not enough to handle all incoming activations in time, it sends ContainerCreation requests to an invoker via the invokerN topic.
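
As an illustration, here is a minimal sketch of the controller-side send path, assuming a plain Kafka producer and JSON-serialized activation messages (the object and method names are hypothetical, not the actual ACS code):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object ActivationSender {
      private val props = new Properties()
      props.put("bootstrap.servers", "localhost:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

      private val producer = new KafkaProducer[String, String](props)

      // The whole controller-side activation path: publish the serialized activation
      // to the per-action topic. No scheduling logic is involved here.
      def send(actionTopic: String, activationJson: String): Unit =
        producer.send(new ProducerRecord[String, String](actionTopic, activationJson))
    }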

2. Location-independent scheduling.


It is critical for OW to reuse containers for better performance.

Since controllers and invokers work asynchronously and the status of running containers changes frequently, it is a huge overhead for controllers to track the location of all containers.

It would be better for controllers to stay out of it. In the current scheduler, a controller statically chooses the same invoker based on a hash function.

But if the hash values of two different actions are the same, they interfere with each other and performance degrades even though there are many idle invokers.


Here, other good points of the actionTopic come to the fore.


  1. Controllers do not need to care about the location of existing containers at all.
  2. Execution of one action does not interfere with others.
  3. It is possible to balance the load (containers) under the assumption that container creation is evenly distributed among invokers.


Since Kafka delivers requests to the invokers that subscribe to the action topic, the only thing controllers need to do is send the requests to the action topic.

No matter how many containers are running and no matter which invokers host them, controllers do not need to be aware of container locations.


Once controllers send activation requests to the action topic, Kafka will deliver the requests to target containers.

Since activations for different actions go to different Kafka topics, each action does not affect the others.

Also, since a controller does not care about existing container locations, it can schedule ContainerCreation to the invoker with the least load.

In turn, traffic is more evenly distributed among invokers with location-independent scheduling.

3. MessageDistributor


In Kafka, the unit of parallelism is the partition.

If you want 3 different consumers in a group to receive messages in parallel, you need at least 3 partitions.


In the previous version of ACS, each ContainerProxy subscribed to the action topic.

Now there is a MessageDistributor, which is dynamically created and subscribes to the given action topic.

This is because, if each ContainerProxy subscribes to the Kafka topic, we need as many partitions as concurrent containers.

That means, if we need 300 concurrent containers, the action topic needs 300 partitions.

With MessageDistributor, we can limit the number of partitions to the number of invokers.
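
For example, under the assumption that an action topic is provisioned with as many partitions as there are invokers, topic creation could look like this sketch using the Kafka AdminClient (the helper name is hypothetical):

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.admin.{AdminClient, NewTopic}

    object ActionTopicAdmin {
      // Create an action topic whose partition count equals the number of invokers,
      // so at most one MessageDistributor per invoker can consume it in parallel.
      def createActionTopic(topic: String, numInvokers: Int): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")
        val admin = AdminClient.create(props)
        try {
          admin.createTopics(Collections.singleton(new NewTopic(topic, numInvokers, 1.toShort))).all().get()
        } finally {
          admin.close()
        }
      }
    }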


When a container for the given action is created in an invoker for the first time, a MessageDistributor is created and starts receiving activation messages from the topic.

Since the MessageDistributor is aware of all containers for the given action in that invoker, it can monitor the status of each container and distribute messages to idle containers.

So a ContainerProxy now receives messages from the MessageDistributor and, once it finishes an invocation, asks the MessageDistributor for more messages.


If an invocation for another action arrives, a new MessageDistributor is created and subscribes to that action's topic.

So there will be one MessageDistributor per invoker and per action.

This guarantees that, even in the worst case, the number of running MessageDistributors is bounded by the number of invoker slots.
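
The following sketch shows the core idea of the MessageDistributor: one consumer per (invoker, action) that hands messages only to containers that have asked for more work. The ContainerRef trait and all names are hypothetical, and a real implementation would also pause the consumer or manage offsets when no container is idle:

    import java.time.Duration
    import java.util.{Collections, Properties}
    import scala.collection.mutable
    import org.apache.kafka.clients.consumer.KafkaConsumer

    // Hypothetical handle to a warm container in this invoker.
    trait ContainerRef { def handle(activation: String): Unit }

    class MessageDistributor(actionTopic: String, groupId: String) {
      private val props = new Properties()
      props.put("bootstrap.servers", "localhost:9092")
      props.put("group.id", groupId)
      props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
      props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

      private val consumer = new KafkaConsumer[String, String](props)
      consumer.subscribe(Collections.singletonList(actionTopic))

      // Containers in this invoker that have asked for more work.
      private val idleContainers = mutable.Queue.empty[ContainerRef]

      // Called by a ContainerProxy once it finishes an invocation.
      def requestMore(container: ContainerRef): Unit = idleContainers.enqueue(container)

      // Pull messages from the action topic and hand them only to idle containers.
      def distributeOnce(): Unit =
        consumer.poll(Duration.ofMillis(100)).forEach { record =>
          if (idleContainers.nonEmpty) idleContainers.dequeue().handle(record.value())
        }
    }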

4. ContainerProxy Lifecycle changes


Even pausing/unpausing a container can be an overhead because it is relatively slow compared to action invocation.

Under heavy load, I observed about 5~10s of client-side response time with many concurrent container creations/deletions.

So it would be better to keep containers alive in such a case, instead of pausing/terminating them.


In ACS, there is no Paused state in ContainerProxy.

A ContainerProxy waits for up to 10s (configurable) and, if there is no subsequent request, it is simply terminated.
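
A minimal sketch of this lifecycle using an Akka receive timeout, assuming a 10s idle timeout (the message type and class names are hypothetical):

    import scala.concurrent.duration._
    import akka.actor.{Actor, ActorSystem, Props, ReceiveTimeout}

    // Hypothetical message type for this sketch.
    final case class Run(activation: String)

    class WarmContainerProxy(idleTimeout: FiniteDuration) extends Actor {
      // Akka delivers ReceiveTimeout if no message arrives within idleTimeout.
      context.setReceiveTimeout(idleTimeout)

      def receive: Receive = {
        case Run(activation) =>
          // Invoke the warm container here (omitted), then keep waiting for more work.
          println(s"handled $activation")
        case ReceiveTimeout =>
          // No Paused state: remove the container and stop the proxy.
          context.stop(self)
      }
    }

    object WarmContainerProxyApp extends App {
      val system = ActorSystem("acs-sketch")
      system.actorOf(Props(new WarmContainerProxy(10.seconds)))
    }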


5. ETCD


In OpenWhisk, many cluster-wide resources and states are shared asynchronously.

There is no way to synchronously share the invoker pool and scheduling status among controllers.

In the current scheduler, this is neither required nor achievable because the resource status in an invoker changes very frequently (every 1~10ms) upon action invocation.

That is why controllers statically divide resources in advance and no controller is involved in the scheduling of the others.


But this results in a throttling problem.

For example, if one namespace has a limit of 100 concurrent invocations and there are 10 controllers running, each controller has a throttling limit of 10 for the namespace.

That means if more than 10 invocations are sent to one controller at any time, the controller rejects the requests with "Too many requests".

It seems reasonable to assume that the load will be evenly distributed by the load balancer in front of the controllers, but this issue happens quite frequently in practice.

So users are not guaranteed 100 concurrent invocations, and an OW operator has to increase the concurrent invocation limit to a number bigger than 100 to work around the issue.

(Though the throttling limit is a soft limit, I think this is not the desired behavior.)


To prevent this issue, scheduling status should be managed and shared cluster-wide so that controllers can take the scheduling information of other controllers into account.

In ACS, container status changes relatively less frequently (every 5s~10s), so synchronous information sharing becomes feasible.

So I introduced ETCD transactions to share scheduling status among controllers and invokers.

Once a ContainerProxy runs, it waits for requests for up to 10s (configurable), so the status of the ContainerPool changes less frequently. Invokers periodically (every 300ms) write their freePool status to ETCD.

Based on this information, controllers try to schedule ContainerCreation to the invoker with the least load (fewest containers) and write the container-creation information to ETCD within a transaction.

So other controllers consider both the current pool status of invokers and the creation-in-progress information when scheduling new ContainerCreations.


This is required because container creation can take anywhere from a few hundred milliseconds to a few seconds.

During this period, the container already occupies capacity in the invoker pool.

So controllers increase the creation-in-progress value when they schedule to an invoker and decrease it when they receive the ack for the ContainerCreation.

Since invokers are also periodically sharing their pool status, the newly created container will be considered for subsequent scheduling.

Of course, it is still possible in ACS for invokers to receive more ContainerCreation requests than they can handle; in that case they respond with a ReschedulingMessage.

Since controllers always try to send the request to the invoker with the least load, this indicates that there are not enough resources in the cluster and it is time to add more invokers.
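
To make the ETCD write described above concrete, here is a minimal compare-and-swap sketch using the jetcd client. The key layout (acs/creation-in-progress/...) and the plain-integer counter encoding are assumptions for illustration, not the actual ACS schema:

    import java.nio.charset.StandardCharsets.UTF_8
    import io.etcd.jetcd.{ByteSequence, Client}
    import io.etcd.jetcd.op.{Cmp, CmpTarget, Op}
    import io.etcd.jetcd.options.PutOption

    object CreationInProgress {
      private def bs(s: String): ByteSequence = ByteSequence.from(s, UTF_8)

      private val client = Client.builder().endpoints("http://localhost:2379").build()
      private val kv = client.getKVClient

      // Bump the creation-in-progress counter for an invoker, but only if no other
      // controller changed it since we read `observed`. Returns false when the
      // compare fails, in which case the caller should re-read and retry.
      def tryReserve(invoker: String, observed: Long): Boolean = {
        val key = bs(s"acs/creation-in-progress/$invoker")
        kv.txn()
          .If(new Cmp(key, Cmp.Op.EQUAL, CmpTarget.value(bs(observed.toString))))
          .Then(Op.put(key, bs((observed + 1).toString), PutOption.DEFAULT))
          .commit()
          .get()
          .isSucceeded
      }
    }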

6. ActionMonitor


In ACS, once a container is running, it handles the requests in a best-effort manner.

So if there is a moderate number of requests that one container can handle, a single container processes all of them.

In this situation, deciding when to add more containers is critical for performance.

Since a new container is not assigned for each invocation, we can consider the following rule:


  • Add more containers when the processing speed of existing containers < the incoming request rate.


There are a few more cases in which containers are added for an action, but basically we need to monitor the processing speed and the incoming rate.


ActionMonitor does exactly what I described above.

It monitors the consumer lag of action topics and decides whether to add more containers or not.
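
A minimal sketch of that decision, assuming we can observe the consumer lag and the two rates (the field names and the threshold are illustrative, not the actual ACS values):

    // Hypothetical snapshot of what the Leader ActionMonitor observes for one action.
    final case class ActionStats(consumerLag: Long,            // messages waiting in the action topic
                                 incomingRatePerSec: Double,   // observed incoming request rate
                                 processingRatePerSec: Double, // aggregate rate of existing containers
                                 runningContainers: Int,
                                 maxContainers: Int)

    object ScalingDecision {
      // Add a container when existing containers cannot keep up, as long as the
      // namespace has not reached its container limit yet.
      def shouldAddContainer(stats: ActionStats, lagThreshold: Long = 0L): Boolean =
        stats.runningContainers < stats.maxContainers &&
          (stats.processingRatePerSec < stats.incomingRatePerSec || stats.consumerLag > lagThreshold)
    }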

An ActionMonitor resides in a controller, so there can be multiple ActionMonitors for one action, one per controller.

To prevent duplicate ContainerCreations, all ActionMonitors elect one Leader using the ETCD leadership API.

Only the Leader monitors the status and decides whether to create more containers.

If the Leader fails for some reason, one of the Followers becomes the Leader.


When a new activation comes to a controller for the first time, an ActionMonitor is created for the given action and starts monitoring.

If there are no subsequent requests for the action for a given time (10s), it is terminated.

Since the activation path and the container creation path are completely separated and container creation happens asynchronously, even if there are some delays in container creation, there is no side effect on activation processing in existing containers.

Since a controller handles a limited number of distinct actions at the same time, we can safely assume there will be a limited number of ActionMonitors in each controller at any point in time.

7. Changes in Throttler.


In the current code, the throttler checks the number of invocations per minute and the number of concurrent invocations.

But in ACS, the number of invocations per minute is meaningless because containers are reserved for one action once they are initialized.

So the unit of resources becomes the number of containers instead of the number of invocations.

More precisely, the throttler cares about how many containers are assigned to a namespace.

(Since the actual resources in invokers are containers, this is more intuitive.)


Also, since activations are handled in a best-effort manner, the throttler should not reject requests solely because the number of existing containers has reached the limit for the given namespace.

It should consider the processing speed at the same time.

So the throttler now considers the following two rules:


  1. Is the given namespace already running the maximum number of containers?
  2. Is the processing speed of existing containers < the incoming request rate?


If a namespace hits both conditions, that means there are more requests than the existing containers can handle and there is no room to add more containers.

This is the time to reject the requests.
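
A minimal sketch of the combined check (the rates would come from the Leader ActionMonitor's lag monitoring; the names are illustrative):

    object ThrottleCheck {
      // Reject only when the namespace is already at its container limit AND
      // the existing containers cannot keep up with the incoming rate.
      def shouldReject(runningContainers: Int,
                       maxContainers: Int,
                       processingRatePerSec: Double,
                       incomingRatePerSec: Double): Boolean =
        runningContainers >= maxContainers &&
          processingRatePerSec < incomingRatePerSec
    }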


For the second rule, the throttler should check the consumer lag of action topics.

But it is quite an overhead to check consumer lag for every single request.

Instead, it can asynchronously check the status and reject the requests at some point.

Luckily, the ActionMonitor is already monitoring the consumer lag to decide whether to add more containers.

So the ActionMonitor can let the Throttler know when it should reject requests.


Only the Leader ActionMonitor monitors the status, but this information can be shared among all controllers via Akka Cluster.

I have seen performance issues with Akka Cluster many times, but it should be fine to share one flag (true/false) per action.

In the worst case, if it takes a long time to share that flag, the only side effect is that controllers accept/reject a few more requests.

In ACS, that's acceptable because containers handle requests in a best-effort manner.

And this only happens when the maximum number of containers is already running.


8. ETC

8.1 Handling of Action updates

An action can be updated at any time, even while it is being invoked.

In ACS, neither controllers nor invokers are involved in the action invocation path.

Only the MessageDistributor receives activations, and the ContainerProxy handles the requests.

If an action is updated while containers with the previous version are still running, there is no way for controllers and invokers to handle this.

There are two big cases:


  1. The code or runtime is changed.
  2. The memory limit is changed.

The first case can be handled in the ContainerProxy: if it recognizes the action update, it can just terminate the physical container and create a new one with the new code.

But if the memory limit is changed, it can induce a huge overcommit on the invoker side.

For example, if there are about 100 containers for an action with a 256MB memory limit and the memory limit for the action is changed to 512MB, memory usage can exceed the available memory in the invoker.

So ContainerProxy should not and cannot handle the memory update.

Ideally, all existing containers should be terminated and controllers should schedule new containers for the action.

But controllers just send the activation requests to the Kafka topic and asynchronously check existing containers.

If activations for the updated action have already been sent to the action topic and retrieved by the MessageFeed, there is no way to handle those requests again, because the MessageFeed commits the offset whenever it receives new messages.

So once the memory limit for an action is changed, activations should not be sent to the existing containers, and at the same time, controllers should still send activations as soon as possible.


To satisfy the above two requirements, ACS specifies the memory requirement of an action in the topic name, like this:


  • Action topic for actionA: {whisk.system}-{actionA}-{256MB}


Controllers send activations to the action topic as soon as they receive requests; if we encode the memory requirement in the action topic name like that, it is guaranteed that Kafka will not deliver new activations to the existing containers.
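
A tiny sketch of that topic-name scheme, based on the example above (the helper name is illustrative):

    object ActionTopicName {
      // Encode the memory requirement into the per-action topic name.
      def apply(namespace: String, action: String, memoryLimitMB: Int): String =
        s"$namespace-$action-${memoryLimitMB}MB"
    }

    // ActionTopicName("whisk.system", "actionA", 256) == "whisk.system-actionA-256MB"
    // After the limit changes to 512MB, new activations go to "whisk.system-actionA-512MB",
    // which the old containers never subscribed to.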


This kind of issue makes me feel the need for a routing component. Since OW relies on Kafka for routing, we are limited in routing activations with our own rules. This is one of the reasons why I became a fan of the future architecture of OW.


8.2 Drawbacks of ACS

ACS is not an all-round scheduler. It has a few drawbacks.

1. It is more effective for short-running actions.

For long-running actions, such as ones that run for 1 minute, it would be best to create a container for each invocation request.

But since container creation happens asynchronously, the controller (ActionMonitor) needs some time to recognize that it needs more containers.

It could take up to 300ms. But for an action that runs for 1 minute, a 300ms bootup overhead is relatively small compared to the execution time.

Still, it is obvious that there is room to optimize this.


2. Since containers are created asynchronously, the first invocation can take a little more time.

When the first ActionMonitor is created, there is no container yet, so the ActionMonitor immediately sends ContainerCreation to invokers.

So it does not take up to 300ms as above, but since the container is not created along with the invocation request, it can take a little more time to initialize a container.

But subsequent invocations are faster than with the current scheduler because there is no other logic in the activation path.


3. In a big cluster (e.g., # of controllers/invokers > 500~1000), it might not be effective because it relies on ETCD transactions and Kafka partitions.

As more and more partitions are used, it will take more time to rebalance the consumers in a group.

If an action requires huge concurrency and utilizes 400~500 invokers (MessageDistributors) at the same time, it might take some time to rebalance when a new invoker subscribes to the topic.

On the controller side, all controllers compete with each other to write data to ETCD.

Since the ETCD transaction API shows about 300~400 TPS with 5 nodes, it will take more time for controllers to schedule as more and more controllers join the cluster.


But even with all these drawbacks, I think it is worth having this until the new implementation is ready.

It will be useful and show better performance in a relatively small cluster.




Since the implementation is almost finished, I am willing to contribute it.

I will also describe each specific component in more detail in subpages to build consensus.

Any comments and feedback would be appreciated.
