Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Samza today provides various api's like high level (fluent/application) for chaining complex jobs, low level (task) with configuration like Samza-Yarn, Samza as a library (Samza standalone) and in coming future Samza will support In-memory system for in memory streams that a user can produce/conusme from. Still Samza users have to get their hands dirty with too much configuration details and dependency on external system(like kafka) to write integration tests. In order to alleviate this we intend to provide a fluent test api to users. This will help in setting-up quick tests against their system without depending on underlying incoming stream from another system e.g Kafka Topic.

The goal of the proposed SEP we will provide a standardized framework to set-up continuous integration test for High Level Api, Low Level Api with Single Container configuration & Multi Container Configuration (Samza standalone using Zookeeper) with ease. Users won’t have to worry about the underlying Samza config details or even understand implementation details. This abstraction will increase developer productivity and help client's make their Samza system robust.

...

  • Samza Job Owners: To make their Samza jobs robust
  • Samza Dev Teams: To enable dev teams make their samza infra robust

Motivation

Addition of this Continuous Integration Test Framework will alleviate:

  1. Lack of a standardized way to set up integration tests for users

  2. Lack of brevity in code and much detailed understanding of low level config details of the Samza Api configs to set up samza systems

  3. Lack of a pluggable Test System with any Samza configuration (single & multi-container configuration)

  4. Set Comprehensive set of robust tests for current Samza api

Assumptions 

 

  1. System depends on In Memory system to spin up in-memory system consumers and producers for in memory data streams

  2. Single Container mode runs on SingleContainerGrouperFactory, that means these test are supposed to be ran on a single container environment

  3. Multi Container system leverages Samza as a library (Samza standalone) using Zookeeper as coordination service for setting up test

  4. In order to provide a testing api with least config management, basic configs are set up for users, but flexibility to add/override some custom config is also provided

  5. Testing is always supposed to be done using bounded streams using EndOfStreamMessage

  6. System should directly consume from File stream or Local Kafka stream without any additional configs, File Stream implementation is out of scope for this SEP, although this SEP suggests what the design for file stream should look like

  7. Bounded message streams from Kafka (either using Latch or EndOfStreamMessage) is beyond the scope of this SEP

  8. For Stateful testing system may use RocksDb as it ships with Samza or maintain the state in memory

...

For the low level api once user runs the job, users can assert data from any streams the job uses. Whereas Samza fluent api does job chaining hence only the final expected output can be compared in this case. StreamAssert spins up a consumer for the system under test, gets the messages in the stream and compares this result with the expected value. Various assertion functions provided are:

 

StreamAssertStateAssert
containscontains
containsInAnyOrdersatisfies
inWindowPane 
satifies 

 

Data Types & Partitions:

The framework will provide complete flexibility for usage of primitive and derived data types. Serdes are required for local kafka stream and file stream but in memory streams dont require any Serde configuration. Test framework will provide api's for initialization of input streams (both single and multi-partition), and also data validation on single partition and multi-partition of the bounded streams 

...

Code Block
languagejava
themeEclipse
titleEventStream
/**
* EventStream provides utilities to build an in memory input stream of events. It helps mimic run time environment of your job, 
* advancing * time for windowing functions  
*/
 
public class EventStream<T> {
  public static abstract class Builder<T> {
    public abstract Builder addElement();
    public abstract Builder addException();
    public abstract Builder advanceTimeTo(long time);
    public abstract EventStream<T> build(); 
 }

 public static <T> Builder<T> builder() {...}
}

...