Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

With 0.13 release, the rich high level APIs allows users to chain complex processing logic as one coherent and fluent application. With so much power, there is a need for inherent support for ease of testing. Currently, the users will have get their hands dirty and understand some implementation details of Samza to write exhaustive integration tests. We want to tackle this problem in steps and this SEP, will take us one step closer towards the goal by introducing an in-memory system in Samza.

Terminologies

  • IME - Incoming Message Envelope
  • EOS - End of Stream

Motivation

With in-memory system, we will alleviate the following pain points.

...

The input data source for the in-memory system can be broadly classified as bounded and unbounded data. We are limiting the scope of this SEP to only bounded data source that is immutable as the input source. It simplifies the view of the data and also the initialization step for the consumers. However, in-memory system for intermediate streams supports both bounded and unbounded data. The sink a.k.a output source is modeled to be mutable.

 

Data Type

Samza has a pluggable system design allowing users to implement their own system consumers. Typically, consumers consume raw message and wrap them using IME. However, it is possible for some systems to introduce subclass of IME and pass them to tasks. For this reason, we need to support for different data types within in-memory collection.

  1. Raw messages: In-memory system will behave like a typical consumer and wrap the raw message using IME. It takes care of populating offset and key fields for the message. Note, the offset is defined as the position of the data in the collection and the key is the hash code of the raw message. 
  2. Type of IME:  In-memory system acts as a pass through system consumer, passing the actual message envelope to the task without any wrapping.

Data Partitioning

Samza is a distributed stream processing framework that achieves parallelism with partitioned data. With a bounded data source, we need to think about how the data is going to be partitioned and how do we map data to SystemStreamPartition in Samza. . Partitioning is only interesting in the case when the input source is raw message. With IME, partitioning data is already part of it and in-memory system will respect the partition information within the IME.

Single Partition

We can use a trivial and simpler approach of associating all of our data source to a single partition. It is not a bad strategy since the primary use case for in-memory systems is testing and the volume of data is negligible that we can barely notice the effects of parallelism. Although it does come w/ a downside that it constraints the users to only test their job with only one task. It might not be a desirable and exhaustive testing strategy from a user’s perspective.

...