
Architectural differences

We currently have essentially two ways of deploying OpenWhisk:

  1. On raw VMs or bare-metal machines, where we need to deploy an invoker to each worker node to create containers and manage their lifecycle locally. In this case, the system has very low-level access to container primitives and can make a lot of performance optimisations.
  2. Atop a container orchestrator, where we can transparently create containers across a cluster of worker nodes. The concept of the invoker feels foreign in this case, because its implementation is geared towards use case 1.

Yes, there are multiple flavors of both of those (there is an invoker-centric version of the Kubernetes deployment, for instance), but in essence it boils down to the two approaches described above.

The issue here is that the needs and abstractions of these two deployment types are different, which makes it hard for them to coexist in one shared codebase. This causes more friction than necessary.

Proposal

I believe we can (and should) converge on an architecture that abstracts the VM/bare-metal case away and gives the Controller direct access to the containers for each action. Lots of proposals out there already go in that direction (see: Autonomous Container Scheduling, Clustered Singleton Invoker for HA on Mesos) and it is a natural fit for the orchestrator-based deployments, which the whole community seems to be moving to.

The following is a proposal for where we could move in the future. Please note that not all of it is thought through to the end and there are open questions. I wanted to get it out into the public nevertheless, because I think there is some potential here for us to converge our topologies and arrive at an overall better and cleaner picture.

The main goal of this document is to help us discuss the overall architecture and then decide on a direction that we all think is viable. It is not meant to be an exhaustive specification of that future architecture; all of the finer details can be discussed in separate discussions. It is, however, meant to identify show-stoppers or concepts that we think cannot fly early on.

1. Overview

The overall architecture revolves around the Controller talking to user-containers directly via HTTP (similar to Autonomous Container Scheduling). The Controller orchestrates the full workflow of an activation (calling `/run`, getting the result, composing an activation record, writing that activation record). This eliminates both Kafka and the Invoker as we know it today from the critical path. Picking a container boils down to choosing a free (warm) container from a locally known list of containers for a specific action.
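
To make the warm path concrete, below is a rough Scala sketch (OpenWhisk itself is written in Scala) of how a Controller could pick a free container for an action and drive the activation over HTTP. All names (`WarmContainer`, `ActivationRecord`, the `http` function) are made up for illustration and are not the actual OpenWhisk types.

```scala
import scala.collection.mutable
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical types; the real system would carry a lot more information.
final case class WarmContainer(id: String, host: String, var busy: Boolean = false)
final case class ActivationRecord(activationId: String, action: String, result: String)

class Controller(http: (WarmContainer, String) => Future[String])(implicit ec: ExecutionContext) {

  // The only state the Controller needs: which warm containers it owns per action.
  private val containersByAction = mutable.Map.empty[String, mutable.Buffer[WarmContainer]]

  def addContainer(action: String, container: WarmContainer): Unit =
    containersByAction.getOrElseUpdate(action, mutable.Buffer.empty) += container

  /** The warm path: pick a free container, call `/run` via HTTP, compose the activation record. */
  def invoke(action: String, activationId: String, payload: String): Option[Future[ActivationRecord]] =
    containersByAction.getOrElse(action, mutable.Buffer.empty).find(!_.busy).map { container =>
      container.busy = true
      http(container, payload) // e.g. POST http://<container.host>/run
        .map(result => ActivationRecord(activationId, action, result))
        .andThen { case _ => container.busy = false } // free the slot, then write the record
    }
  // A `None` result means there is no warm capacity left for this action: the Controller then
  // requests more containers from the ContainerManager (see below).
}
```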

Container creation happens in a component called ContainerManager (which can reuse a lot of today's invoker code, like the container factories). The ContainerManager is a cluster singleton (see Clustered Singleton Invoker for HA on Mesos). It talks to the API of the underlying container orchestrator to create new containers. If the Controller has no more capacity for a certain action, that is, it has exhausted the containers it knows for that action, it requests more containers from the ContainerManager.
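
A minimal sketch of what the message protocol between a Controller and the ContainerManager could look like; the message names and fields are assumptions, not a finished design.

```scala
// Hypothetical message protocol between Controllers and the ContainerManager.
sealed trait ContainerManagerProtocol

/** Sent by a Controller once it has exhausted the containers it knows for an action. */
final case class NeedContainers(controller: String, action: String, count: Int, memoryMb: Int)
    extends ContainerManagerProtocol

/** Reply: the containers that were created (or re-assigned) for this Controller. */
final case class ContainersAllocated(action: String, containerIds: Seq[String])
    extends ContainerManagerProtocol

/** Sent by a Controller to hand a container back (idle timeout, failure). */
final case class ReleaseContainer(controller: String, containerId: String)
    extends ContainerManagerProtocol
```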

The ContainerManagerAgent is an extension of the ContainerManager. It is placed on each node in the system to either orchestrate the full container lifecycle (in the no-orchestrator case) or to provide the means necessary to fulfill some parts of that lifecycle (like pause/unpause in the Kubernetes case).
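
As an illustration, the ContainerManagerAgent could expose an interface roughly like the sketch below; the method names are assumptions. In the Kubernetes case only the operations the orchestrator does not cover (pause/unpause) would be backed by the agent, while creation and removal go through the orchestrator's API.

```scala
import scala.concurrent.Future

// Hypothetical agent interface; method names are assumptions.
trait ContainerManagerAgent {
  def create(image: String, memoryMb: Int): Future[String] // returns the new container's id
  def pause(containerId: String): Future[Unit]
  def unpause(containerId: String): Future[Unit]
  def remove(containerId: String): Future[Unit]
}
```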

Kafka is still in the picture to handle overload scenarios (see Invoker Activation Queueing Change).
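
One possible shape of that overflow handling, assuming the plain `kafka-clients` producer API: when neither warm capacity nor new containers are available, the Controller parks the activation in an overflow topic and replays it once capacity frees up. The topic name and message layout are assumptions.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Hypothetical overflow buffering; topic name and message layout are assumptions.
object OverflowQueue {
  private val props = new Properties()
  props.put("bootstrap.servers", "kafka:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  private val producer = new KafkaProducer[String, String](props)

  /** Park an activation until the ContainerManager can provide capacity again. */
  def buffer(action: String, activationJson: String): Unit =
    producer.send(new ProducerRecord[String, String]("overflow", action, activationJson))
}
```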

Diagram: General dependencies between components

Diagram: Full lifecycle of invocation and container creation

2. Loadbalancing:

The Controllers no longer need to do any guesstimates to optimise container reuse. Controllers are merely routers to containers they already know are warm; in fact, the Controllers only know about warm containers.
Loadbalancing becomes a concern at container creation time, where it is important to choose the right machine to place the container on. The ContainerManager can therefore exploit the container orchestrator's existing placement algorithms without reinventing the wheel, or do the placement itself when there is no orchestrator running underneath.

Since loadbalancing happens at container creation time, the operation is a lot less time-critical than scheduling at invocation time is today. That is not to say it doesn't matter how long it takes, but when the whole container creation needs to happen anyway (which I assume takes roughly 500 ms at least), we can easily afford to spend 10-20 ms deciding where to place that container. That opens the door for more sophisticated placement algorithms while not adding any latency to the critical warm path itself.
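
For the no-orchestrator case, even a very simple placement heuristic fits comfortably into that budget. The sketch below (all types and names are assumptions) just picks the node with the most free memory that can still fit the requested container; anything smarter can be slotted in later.

```scala
// Minimal placement sketch for the no-orchestrator case; all types/names are assumptions.
final case class Node(name: String, freeMemoryMb: Int)

object Placement {
  /** Pick the node with the most free memory that can still fit the requested container. */
  def place(nodes: Seq[Node], requiredMemoryMb: Int): Option[Node] = {
    val candidates = nodes.filter(_.freeMemoryMb >= requiredMemoryMb)
    if (candidates.isEmpty) None // no capacity left: evict something or reject (see throttling below)
    else Some(candidates.maxBy(_.freeMemoryMb))
  }
}
```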

3. Scalability:

The Controllers in this proposal need to be massively scalable, since they handle a lot more than they do today. The only state known to them is which containers exist for which action.
As each of those containers can only ever handle C concurrent requests (where C defaults to 1), it is of utmost importance that the system can **guarantee** that there will never be more than C concurrent requests to any given container. Shared state between the Controllers is therefore not feasible due to its eventually consistent nature.
Instead, the ContainerManager divides the list of containers for each action disjointly across all existing Controllers. If we have 2 Controllers and 2 containers for action A, each Controller gets exactly one of those containers to handle on its own.
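
A sketch of that division, assuming a simple round-robin split (names are made up): each container ends up owned by exactly one Controller, so the C-requests-per-container guarantee never depends on shared state.

```scala
// Distribute an action's containers disjointly across Controllers so that no container
// is ever owned by more than one Controller. Round-robin is just one possible strategy.
object ContainerDistribution {
  def distribute(controllers: Seq[String], containerIds: Seq[String]): Map[String, Seq[String]] =
    containerIds.zipWithIndex
      .groupBy { case (_, idx) => controllers(idx % controllers.size) }
      .map { case (controller, assigned) => controller -> assigned.map(_._1) }
}

// distribute(Seq("controller0", "controller1"), Seq("containerA1", "containerA2"))
//   => Map("controller0" -> Seq("containerA1"), "controller1" -> Seq("containerA2"))
```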

Edge case: If an action only has a very small number of containers (fewer than there are Controllers in the system), the method described above has a problem. Since the front-door schedules round-robin or least-connected, it cannot know which Controller a request needs to go to in order to hit one that has a container available.
In this case, the Controllers that didn't get a container act as proxies and send the request to a Controller that actually has one (maybe even via an HTTP redirect). The ContainerManager decides which Controllers act as proxies in this case, since it is the instance that distributes the containers.
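
Sketched below is what such a proxying Controller could look like (all names and the `forward` function are assumptions); whether to forward server-side or answer with an HTTP redirect is an open detail.

```scala
import scala.concurrent.Future

// Hypothetical edge-case handling: a Controller without a container for an action forwards
// the request to the Controller that owns one, based on the ContainerManager's assignment.
class ProxyingController(
    ownContainers: Map[String, Seq[String]],            // action -> containers owned locally
    proxyTargets: Map[String, String],                  // action -> Controller that owns containers
    invokeLocally: (String, String) => Future[String],  // (action, payload) -> result
    forward: (String, String, String) => Future[String] // (targetController, action, payload) -> result
) {
  def handle(action: String, payload: String): Future[String] =
    if (ownContainers.getOrElse(action, Seq.empty).nonEmpty) invokeLocally(action, payload)
    else proxyTargets.get(action) match {
      case Some(target) => forward(target, action, payload) // could also be an HTTP redirect
      case None         => Future.failed(new IllegalStateException(s"no capacity known for $action"))
    }
}
```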

Edge case diagram

4. Throttling/Limits:

Limiting in OpenWhisk should always revolve around resource consumption. The per-minute throttles we have today can be part of the front-door (like rate-limiting in nginx) but should not be part of the OpenWhisk core system.
Concurrency throttles and limits can be applied at container creation time. When a new container is requested, the ContainerManager checks the container consumption of that specific user. If the user has exhausted their concurrency limit, one of their own containers (of a different action) can be evicted to make room for the new one. If all of the user's containers already belong to the requested action, the call results in a `429` response.
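
One possible shape of that check at container-creation time, as a sketch (all names are assumptions; the extra `idle` condition on the eviction victim is my own addition and up for debate):

```scala
// Sketch of a per-user concurrency check at container-creation time.
sealed trait CreateDecision
case object Create extends CreateDecision                      // under the limit: just create
final case class EvictThenCreate(containerId: String) extends CreateDecision
case object Reject extends CreateDecision                      // surfaces as a 429 upstream

final case class UserContainer(id: String, action: String, idle: Boolean)

object ConcurrencyLimit {
  def decide(existing: Seq[UserContainer], requestedAction: String, limit: Int): CreateDecision =
    if (existing.size < limit) Create
    else existing.find(c => c.action != requestedAction && c.idle) match {
      case Some(victim) => EvictThenCreate(victim.id) // make room by evicting another action's container
      case None         => Reject                     // all containers already serve this action
    }
}
```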

This section isn't fully fleshed out just yet and only contains basic ideas on how we could approach the issue.

5. Logs:

Logs are collected via the container orchestrator's API (see KubernetesContainerFactory) or via the ContainerManagerAgent, which is placed on each node.
In general, OpenWhisk should move to more asynchronous log handling, where all the context needed is already part of each log line. Logs can then be forwarded asynchronously from the node where the container is running, rather than being threaded through a more centralized component like the Controller.
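
A sketch of what such a self-contained log line could look like (the exact fields and JSON layout are assumptions, and a real implementation would need proper escaping):

```scala
// Every log line carries the activation context it needs, so a node-local agent can
// forward it asynchronously without involving the Controller. Fields are assumptions.
final case class LogLine(
    time: String,
    stream: String,        // "stdout" or "stderr"
    activationId: String,
    namespace: String,
    action: String,
    message: String
) {
  // Naive JSON rendering for illustration only (no escaping).
  def toJson: String =
    s"""{"time":"$time","stream":"$stream","activationId":"$activationId",""" +
      s""""namespace":"$namespace","action":"$action","message":"$message"}"""
}
```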

In theory this is a separate discussion, but it is very relevant to the viability of this proposal.

6. Container Lifecycle:

Container creation is the responsibility of the ContainerManager. It talks to the underlying container orchestrator to create containers (via ContainerFactories, as today). In the Kubernetes/Mesos case this is very straightforward, as they expose APIs to transparently create containers across a cluster of nodes. For the case with no underlying orchestration layer, we build this layer ourselves by clustering the ContainerManagerAgents on all the nodes and talking to them to create containers on specific hosts. Unlike the Mesos/Kubernetes case, this also warrants writing a placement algorithm ourselves.

The rest of the container lifecycle is the responsibility of the Controller. It instructs the ContainerManagerAgent to pause/unpause containers. Removal (after a certain idle time or after a failure) is again requested by the Controller from the ContainerManager. The ContainerManager can also decide to remove containers itself when it needs to make room for other containers. If it comes to that conclusion, it first asks the Controller owning the container to give it back (effectively removing it from the Controller's list). Only then is the container removed.
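
The "give it back before removing it" handshake could look roughly like this sketch (all names are assumptions): the ContainerManager only removes a container once the owning Controller has dropped it from its list, so no request can race into a dying container.

```scala
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical reclaim handshake between the ContainerManager and the Controllers.
trait ControllerClient {
  /** Ask the Controller to stop routing to this container and drop it from its list. */
  def relinquish(containerId: String): Future[Unit]
}

class ContainerReclaimer(
    owners: Map[String, ControllerClient],      // containerId -> owning Controller
    removeContainer: String => Future[Unit]     // actual removal via orchestrator/agent
)(implicit ec: ExecutionContext) {

  /** Reclaim a container: the owner gives it back first, only then is it removed. */
  def reclaim(containerId: String): Future[Unit] =
    owners.get(containerId) match {
      case Some(owner) => owner.relinquish(containerId).flatMap(_ => removeContainer(containerId))
      case None        => removeContainer(containerId) // not owned by any Controller, safe to remove
    }
}
```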

This is not fully fleshed out either and mostly a suggestion. We need to make sure that the network traffic generated by the Controllers managing the lifecycle of all those containers is viable. Under bursts, though, that should not matter much because of the pause grace period.

Conclusion

The above gives us a unified architecture both for the cases where a container orchestrator is present and for the cases where one is absent. That should result in less friction while developing for different topologies, and it should be a more natural fit for container orchestrators than what we have today.

On top of that, it moves a lot of work off the critical path (loadbalancing, placement decisions, Kafka) and makes the Controllers a lot more scalable than today, which is crucial for this architecture.

The path to optimize gets a lot narrower, which should make performance bottlenecks easier to track down and leave less code to optimize for performance. In this architecture, all deployment cases exercise the very same path when it comes to optimizing burst performance, for instance.
