Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

And with this change, it is critical to decide the timing and the number of containers to create. Performance can considerably vary depending on this decision.


2.5 Rule out Kafka

This can be controversial. Kafka is a great solution. It supports horizontal scaling, high throughput with commodity hardware, and high reliability, and so on.  I have used it for many years for many purposes. But I reached a conclusion that Kafka does not fit well with OpenWhisk and it becomes a hurdle for better controllability and flexibility. This is because we cannot control the message routing. While Kafka supports some methods to partition messages, it's not enough for Openwhisk.

...

2.6 The meaning of At-most-once and circuit breaking

This can be also controversial. I have been thinking about the at-most-once semantic and the circuit breaking in OpenWhisk.

Let's think about the at-most-once semantic which is supported in OpenWhisk. Mostly, OpenWhisk relies on Kafka. As soon as invokers fetch activations from Kafka, it immediately commits offsets. Once offsets are committed, activations reside in the invoker memory only. If the invoker crashed for some reasons while it holds some activations, all "not-yet-finished" activations will be lost. But the invoker already committed offsets, no more activation is executed again. This is the at-most-once semantic which OpenWhisk supports. It is meaningful because it would be better to skip failed activations in some cases instead of running the same activations again.

For example, let's assume there is an action that increases an existing value and puts the results in DB. If it crashed right after the value is successfully stored in DB, multiple executions will store a totally different value. So precisely, activations can be failed at any time and OpenWhisk does not guarantee at-least-once semantic. Instead, OpenWhisk guarantees that a response is always sent to clients no matter whether it is successful or not. This is achieved by controllers. Controllers start a timer when it sends activation to an invoker. If there is no response from the invoker for the given time, the activation will be timed out and a failed response will be sent to the client(or stored in DB). This is a circuit breaking in OpenWhisk. No matter how many invokers are failed, clients can receive (failed) responses from controllers anyway. The failure of invokers should not make the entire system unresponsive. It would be the worst that clients never receive any responses from controllers and nothing is left in client-side about what happens on their requests. Now we can finally reach the following conclusion with these ideas.

  1. The system(OpenWhisk) should not be affected by the failure of one component.
  2. Clients should always be able to receive responses. At least they should be able to notice their requests are failed when the system failure happens.

With the above two requirements, now let's think of ruling out Kafka more. Kafka is being used for activation buffering and message delivery. Message delivery in OpenWhisk is a little bit different with message delivery in other traditional systems. Messages are delivered in realtime and no message is fetched again if it is already delivered(executed). It means activation messages are transient. We don't need to keep all data for all retention period. Messages just pass through the whole system and once they are delivered, they become stale and useless. This implies that our message delivery component does not necessarily need to support persistence against all messages. In case incoming messages overwhelm the system capacity and some messages should be buffered until resources become available, we definitely need to support persistence for them. But it is not critical to support persistence for all flowing messages I think. Even though we lose flowing messages, OpenWhisk already supports circuit breaking and clients will anyway receive some responses. I think this is the reason why Markus proposed to use Kafka only for overflowing workload in OpenWhisk future architecture. Even if we lose some messages, it would be the same level of guarantee with the current version as invokers can also lose already committed messages on failure.

Similarly, when I design a new component for message delivery I didn't consider the persistence.

In short, we do not necessarily support persistence for message delivery with at-most-once semantic and circuit breaking process in OpenWhisk.TBD