Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...


Given the closeness of Samzas architecture to Kafka Streams, IMHO it makes sense to take a quick look into the algorithm used by the Coordinator concerning load distribution and stateful tasks.

Heron

Twitter developed Heron in 2011 to overcome the shortcomings of Apache Storm attempting to handle the production load of data.
Specifically, the items Twitter needed to address are

  1. Resource isolation
  2. Resource efficiency
  3. Throughput

Heron claims to be compatible with Storm, but there are a few steps developers need to take https://apache.github.io/incubator-heron/docs/migrate-storm-to-heron/, so it's not entirely backward compatible out of the box.

Heron API

Heron is similar to Kafka Streams in that it offers a high-level DSL (https://apache.github.io/incubator-heron/docs/concepts/streamlet-api/) and a lower level API (https://apache.github.io/incubator-heron/docs/concepts/topologies/#the-topology-api) based on the original Storm API.


Topology

Heron topologies are composed of two main components Spouts and Bolts.

Spouts

Spouts inject data into Heron from external sources Kafka, Pulsar. So a spout is analogous to an input topic in Kafka Streams.

Bolts

Bolts apply the processing logic defined by developers on the data supplied by Spouts.

Streams of data connect spouts and Bolts.

Architecture

Heron needs to run on a cluster of machines. Once developers build a topology, they submit the topology to cluster, much like Spark, Flink in that you develop the application then distributed to the worker nodes


Resource Management

When using the high-level Streamlet API Heron allows you to specify resources for the topology:

  1. The number of containers into which the topology's physical plan divided into.
  2. The total number of CPUs allocated to be used by the topology
  3. The total amount of RAM allocated to be used by the topology


Operations

Heron offers stateless and stateful operations. There are windowing operations similar to Streams Sliding, Tumbling, and Time based windows. Heron also offers the concept of "counting" windows.

State Management

Heron uses either Zookeeper (https://apache.github.io/incubator-heron/docs/operators/deployment/statemanagers/zookeeper/) or the local file system (https://apache.github.io/incubator-heron/docs/operators/deployment/statemanagers/localfs/) for managing state. Details about state management are relatively sparse, but IMHO, I think the approach to state management is far enough from Kafka Streams that there is nothing we can gain Heron's state management.

The overall recommendation from the Heron documentation is that the local file system state management is for local development only, and Zookeeper is the preferred approach in a production environment.

Old Version of KIP-441

----------------------------------------------------------------------------------------------------------------------------

...