...

Samza today provides various apis: the high level (fluent/application) api for chaining complex jobs, the low level (task) api with configurations like Samza-Yarn, and Samza as a library (Samza standalone); in the near future Samza will also support an In Memory system for in-memory streams that a user can produce to and consume from. Still, Samza users have to get their hands dirty with too many configuration details and depend on external systems (like Kafka) to write integration tests. To alleviate this, we intend to provide a fluent test api that helps users set up quick tests against their system without depending on an underlying incoming stream, e.g. a Kafka topic. The goal of this SEP is to provide a standardized framework to easily set up continuous integration tests for the High Level Api and the Low Level Api, in both single-container and multi-container configurations (Samza standalone using Zookeeper). Users won't have to worry about the underlying Samza config details or understand implementation details. This abstraction will increase developer productivity and help clients make their Samza systems robust.

...

  1. Lack of a standardized way for users to set up integration tests

  2. Lack of brevity in test code, and the need to understand detailed low level configs of the Samza Api in order to set up Samza systems

  3. Lack of a Test System that can be plugged into any Samza single- or multi-container configuration

  4. Lack of a robust set of tests for the current Samza Api

Assumptions 

 

  1. The framework depends on the In Memory system to spin up in-memory consumers and producers for in-memory data streams

  2. Single Container mode runs with SingleContainerGrouperFactory, which means these tests are meant to run in a single-container environment

  3. Multi Container mode leverages Samza as a library (Samza standalone), using Zookeeper as the coordination service for setting up the test

  4. To provide a testing api with the least config management, basic configs are set up for users, but flexibility to add or override custom configs is also provided (see the sketch after this list)

  5. Testing is always supposed to be done using bounded streams, terminated with EndOfStreamMessage
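
The following is a minimal sketch of assumption 4, assuming the TestApplication factory proposed later in this document; the config keys shown are ordinary Samza properties and are only examples of what a test might override.

Code Block
languagejava
themeEclipse
titleConfigOverrideExample
import java.util.HashMap;

public class ConfigOverrideExample {
  public void createTestWithOverrides() {
    // The framework supplies sensible defaults; a test lists only the properties it wants to override.
    HashMap<String, String> overrides = new HashMap<>();
    overrides.put("task.window.ms", "100");      // e.g. shrink the window interval for a fast test
    overrides.put("job.container.count", "1");   // e.g. pin the test to a single container

    // TestApplication.create(...) is the factory proposed later in this document.
    TestApplication app = TestApplication.create("test-system", overrides);
    // ... configure input streams and run ...
  }
}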

Proposed Changes

We propose a three-step process for configuring an integration test for a Samza job. The three facets are data injection, data transformation and data validation, as sketched in the example after the list below.

  • Data Injection phase means configuring the input source for the Samza job.
  • Data Transformation phase means the api logic of the Samza job (low level/high level).
  • Data Validation phase asserts the expected results against the actual results computed after running the Samza job.
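
Below is a minimal sketch of the three phases, assuming the CollectionStream and TestApplication classes proposed later in this document; the CollectionStream.of(...) signature, the operator chain and the validation comment are illustrative only.

Code Block
languagejava
themeEclipse
titleThreePhaseExample
import java.util.Arrays;
import java.util.HashMap;

import org.apache.samza.operators.MessageStream;

public class ThreePhaseExample {
  public void testThreePhases() {
    // 1. Data Injection: a bounded, in-memory input stream built from a collection.
    CollectionStream<Integer> input = CollectionStream.of("numbers", Arrays.asList(1, 2, 3, 4));

    // 2. Data Transformation: the job logic under test, written here with the High Level Api.
    TestApplication app = TestApplication.create("test-system", new HashMap<>());
    MessageStream<Integer> numbers = app.getInputStream(input);
    numbers.map(n -> n * 2);   // illustrative operator

    // 3. Data Validation: run the bounded job, then assert on the output it produced.
    app.run();
  }
}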

Data Injection:

In the context of this test framework, an input stream means a stream that a Samza job may consume from, and an output stream means a stream that a Samza job can produce to. The initial source of the data stream is always supposed to be bounded, since this is a test, and it may originate from one of the sources listed below (a short sketch follows the list).

  • In Memory Data Stream
    Users have the ability to produce to & consume from in-memory system partitions after SEP-8. The In Memory Data system alleviates the need for serialization/deserialization of data since it does not persist data. We take advantage of this and provide very succinct stream classes to serve as input data sources, which are the following:
    • Collection Stream
      Users can plug in a collection (either a List or a Map) to create an in-memory input stream
      e.g.:  CollectionStream.of(...,{1,2,3,4})
    • Event Builder Stream
      The event builder helps a user mimic the runtime Samza processing environment in tests, for example adding an exception to the stream or advancing time for window functions.
  • File Stream
    Users can create an input stream reading from and writing to a local file
    e.g.:  FileStream.of("/path/to/file")
  • Local Kafka Stream
    Users can also consume bounded streams from a Kafka topic, which serves as the initial input. Samza already provides apis to consume from and produce to Kafka. For Kafka streams we will need serde configurations.
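
Below is a short sketch of two of these source types; the EventStream builder methods are assumed names for illustration (the builder api itself is not spelled out in this excerpt), while FileStream.of(...) matches the example above.

Code Block
languagejava
themeEclipse
titleInputSourceExamples
public class InputSourceExamples {
  public void buildInputStreams() {
    // Event Builder Stream: mimic runtime conditions such as an exception in the stream or
    // advancing time for window functions. The builder method names are assumptions.
    EventStream<String> events = EventStream.<String>builder()
        .addElement("page-view-1")
        .advanceTimeTo(1000L)                                  // e.g. fire a window
        .addException(new RuntimeException("injected failure"))
        .build();

    // File Stream: the bounded input is read from a local file.
    FileStream<String> lines = FileStream.of("/path/to/file");
  }
}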

For in-memory streams, the api initializes the in-memory stream and spins up a Samza producer using an InMemorySystemProducer to write the stream; this is how a collection of data or events is initialized as a stream. It also configures any output streams that the user sets up.

Data Transformation:

This is the Samza job you write; it can be implemented using either the Low Level Api or the fluent High Level Api. The test framework provides apis to set up tests for both Samza apis, and supports both single-container and multi-container mode. Application logic is written in the same way as for a regular Samza job, either with the Low Level Api (StreamTask and AsyncStreamTask) or with the High Level Api, but instead of writing verbose configs, users just pass the class instance implementing this logic to the api (see the sketch below).
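
As a sketch of the low level path: the task is implemented exactly as in a production job and handed to the framework; the TestTask factory mentioned in the trailing comment is an assumed name, since this excerpt only shows the High Level Api counterpart (TestApplication).

Code Block
languagejava
themeEclipse
titleMyStreamTask
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// A plain low level task, written the same way as for a production Samza job.
public class MyStreamTask implements StreamTask {
  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
      TaskCoordinator coordinator) {
    // business logic under test, e.g. transform the message and send it to an output stream
  }
}

// Handing the task instance to the framework (TestTask is an assumed low level counterpart
// of TestApplication, not defined in this excerpt):
//   TestTask.create("test-system", new MyStreamTask(), config).run();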

Data Validation: 

With the Low Level Api, once the user runs the job, they can assert on data from any intermediate stream the job produced to, as well as the final output stream. The Samza fluent api does job chaining, hence only the final expected output can be compared in that case (see the sketch below).
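
Below is a minimal sketch of the validation step using plain JUnit; consumeStream(...) is a hypothetical helper that drains a bounded output stream into a list once the job has completed, stubbed here only so the sketch is self-contained.

Code Block
languagejava
themeEclipse
titleOutputValidationTest
import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import java.util.List;
import org.junit.Test;

public class OutputValidationTest {
  @Test
  public void finalOutputMatchesExpectation() {
    // ... inject input and run the job as shown in the earlier sketches ...

    // Drain the bounded output stream and compare it against the expected result.
    List<Integer> actual = consumeStream("doubled-numbers");
    assertEquals(Arrays.asList(2, 4, 6, 8), actual);
  }

  // Hypothetical helper, stubbed so the example compiles on its own.
  private List<Integer> consumeStream(String streamName) {
    return Arrays.asList(2, 4, 6, 8);
  }
}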

...

Code Block
languagejava
themeEclipse
titleTestApplication
/**
* TestApplication provides a static factory for quick setup of a Samza environment for testing the High Level Api;
* users can configure input streams, apply various operators on them and run the application.
*/
 
public class TestApplication<T> {
  private TestApplication(HashMap<String, String> configs, Mode mode) {...}
  
  // Static factory to configure & create a runner for the High Level Api
  public static TestApplication create(String systemName, HashMap<String, String> config) {...}
  
  // Configure any kind of input stream for the Samza system, and get a handle on the message stream to apply operators on
  public <T> MessageStream<T> getInputStream(EventStream<T> stream) {...}
  public <T> MessageStream<T> getInputStream(CollectionStream<T> stream) {...}
  public <T> MessageStream<T> getInputStream(FileStream<T> stream) {...}
  
  // Run the app
  public void run() {...}
}
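
Below is a usage sketch for the factory above, exercising getInputStream with a FileStream and applying an operator before running the application; the file path and the filter are placeholders.

Code Block
languagejava
themeEclipse
titleTestApplicationUsage
import java.util.HashMap;

import org.apache.samza.operators.MessageStream;

public class TestApplicationUsage {
  public void runHighLevelTest() {
    // Create the test application with the framework's default configs.
    TestApplication app = TestApplication.create("test-system", new HashMap<>());

    // Get a handle on the input stream read from a local file (placeholder path).
    MessageStream<String> lines = app.getInputStream(FileStream.of("/path/to/file"));

    // Apply operators exactly as in a production StreamApplication.
    lines.filter(line -> !line.isEmpty());

    // Run the bounded job; output can then be validated as described above.
    app.run();
  }
}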

...

Implementation and Test Plan

Implementation

Compatibility, Deprecation, and Migration Plan

...