Status

Current state: [ UNDER DISCUSSION | ACCEPTED | REJECTED ]

...

JIRA: SAMZA-TBD

Released:

Problem

With the 0.13 release, the rich high-level APIs allow users to chain complex processing logic into one coherent, fluent application. With so much power comes a need for built-in support for easy testing. Currently, users have to get their hands dirty and understand some implementation details of Samza to write exhaustive integration tests. We want to tackle this problem in steps, and this SEP takes us one step closer to that goal by introducing an in-memory system in Samza.

Motivation

With an in-memory system, we will alleviate the following pain points.

Dependency on Kafka for intermediate streams for testing
Running time for tests (time spent on setting up and tearing down)
Ease of testing
Lack of collection-based input for testing (this SEP addresses this problem only partly)

Assumptions

The in-memory system is applicable only to jobs in the local execution environment. The remote execution environment isn’t supported.
The scope of the in-memory system and the data it handles is limited to a single container. That is, there is no support for process-to-process interaction or sharing.
Checkpointing is not supported, and consumers always start from the beginning after a restart.
The in-memory system doesn’t support persistence and is not the source of truth for the data. The data in the queue is lost when the job restarts or shuts down unexpectedly.
The input data source to the in-memory system is a bounded, immutable collection. Note that this assumption applies only to input streams, not to intermediate streams.

Design

Data Partitioning

Samza is a distributed stream processing framework that achieves parallelism with partitioned data. With a bounded data source, we need to decide how the data is going to be partitioned and how we map that data to a SystemStreamPartition in Samza.

Single Partition

We can take the simplest approach and assign the entire data source to a single partition. This is not a bad strategy: the primary use case for the in-memory system is testing, and the data volume is small enough that the effects of parallelism would be barely noticeable. The downside is that it constrains users to testing their job with only one task, which may not be a desirable or exhaustive testing strategy from a user’s perspective.

Multiple Partition

To exploit the parallelism that the Samza framework offers and to enable users to test their job with multiple tasks, we need to support multiple partitions. There are a couple of ways to do so.

A. Partitioning at source

In this approach, we push the partitioning to the source. For example, we can read a `Collection<Collection<T>>` and assign each inner collection to one partition. This is surprisingly simple yet powerful, since it eliminates the need for a repartitioning phase and lets the user group the data however they like. The downside of this approach is that the input collections can be skewed and Samza doesn’t control how evenly the data is distributed. Since the primary use case is testing, the skew should have negligible impact.
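The mapping from inner collections to partitions can be illustrated with a minimal sketch. The class `SourcePartitioner` and its method `partitionAtSource` are hypothetical names chosen for illustration and are not part of Samza; the sketch assumes the i-th inner collection, in iteration order, becomes partition i.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not a Samza class.
final class SourcePartitioner {

    // Assigns the i-th inner collection (in iteration order) to partition id i.
    static <T> Map<Integer, List<T>> partitionAtSource(Collection<? extends Collection<T>> input) {
        Map<Integer, List<T>> partitions = new LinkedHashMap<>();
        int partitionId = 0;
        for (Collection<T> group : input) {
            // Copy each group so later changes to the caller's collection are not visible here.
            partitions.put(partitionId++, new ArrayList<>(group));
        }
        return partitions;
    }

    public static void main(String[] args) {
        // A deliberately skewed input: the first partition gets three elements, the second gets one.
        List<List<String>> input = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("d"));
        System.out.println(partitionAtSource(input)); // prints {0=[a, b, c], 1=[d]}
    }
}
```

The partition count falls out of the input shape (the number of inner collections), so no extra configuration is needed in this sketch.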

B. Partitioning within Samza

In this approach, Samza takes a collection from the user and applies a partitioning strategy. The strategy could be as simple as round-robin or random assignment. How do we determine the partition count? We can either have the user specify the number of partitions (which introduces a new configuration in Samza) or derive the partition count automatically from the input data source. TBD
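A round-robin strategy of this kind could look like the following sketch. `RoundRobinPartitioner` is a hypothetical name, not a Samza class, and the partition count is simply passed in as a parameter because how it is determined is still an open question above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical helper for illustration only; not a Samza class.
final class RoundRobinPartitioner {

    // Distributes elements over partitionCount partitions in iteration order:
    // element 0 goes to partition 0, element 1 to partition 1, and so on, wrapping around.
    static <T> List<List<T>> partition(Collection<T> input, int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        List<List<T>> partitions = new ArrayList<>();
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
        int next = 0;
        for (T element : input) {
            partitions.get(next).add(element);
            next = (next + 1) % partitionCount;
        }
        return partitions;
    }

    public static void main(String[] args) {
        System.out.println(partition(Arrays.asList(1, 2, 3, 4, 5), 2)); // prints [[1, 3, 5], [2, 4]]
    }
}
```

Round-robin keeps partition sizes within one element of each other, which avoids the skew possible with partitioning at source, at the cost of the user no longer controlling which elements land together.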

C. Partitioning within Samza w/ configurable strategy

We follow a strategy similar to “Partitioning within Samza”, with the additional option of supporting user-specified groupers. With this approach, we sign up for introducing a public interface that the user has to implement and pass to Samza through config. The downside is that it introduces additional configuration and adds to our existing reflection-based class loading.
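To make the trade-off concrete, here is a rough sketch of what a user-supplied grouper loaded by reflection might look like. Everything in it is an assumption for illustration: the interface `PartitionGrouper`, the example implementation `HashGrouper`, and the loader `GrouperLoader` are hypothetical names, and this SEP does not yet define the actual public interface or the config key that would carry the class name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical public interface a user would implement; not defined by Samza.
interface PartitionGrouper<T> {
    // Returns a partition index in the range [0, partitionCount).
    int partitionFor(T element, int partitionCount);
}

// Example user implementation: groups strings by hash code.
final class HashGrouper implements PartitionGrouper<String> {
    @Override
    public int partitionFor(String element, int partitionCount) {
        // floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(element.hashCode(), partitionCount);
    }
}

// Hypothetical loader showing the reflection step that a config-supplied class name implies.
final class GrouperLoader {

    @SuppressWarnings("unchecked")
    static <T> PartitionGrouper<T> load(String className) throws ReflectiveOperationException {
        // Requires the grouper class to have a no-argument constructor.
        return (PartitionGrouper<T>) Class.forName(className).getDeclaredConstructor().newInstance();
    }

    // Applies the grouper to every element of the input collection.
    static <T> List<List<T>> partition(Collection<T> input, int partitionCount, PartitionGrouper<T> grouper) {
        List<List<T>> partitions = new ArrayList<>();
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
        for (T element : input) {
            partitions.get(grouper.partitionFor(element, partitionCount)).add(element);
        }
        return partitions;
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        // In a real job the class name would come from config; here it is taken from the class itself.
        PartitionGrouper<String> grouper = load(HashGrouper.class.getName());
        System.out.println(partition(Arrays.asList("a", "b", "c", "d"), 2, grouper)); // prints [[b, d], [a, c]]
    }
}
```

The unchecked cast in `load` is one illustration of the cost noted above: a class name in config cannot be type-checked at compile time, so a misconfigured grouper only fails at runtime.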
