
There were some discussions about which serde models to use before the tech preview release, and after the release the C3 team provided some feedback on the current serde models and possible ways to improve them. This page summarizes our past thoughts and discussions on the pros / cons of the different options, for clear illustration.

...

  1. Although it has the benefit of "only specifying the serde classes at the top instead of throughout the topology", it carries a risk for complex topologies: users tend to forget to keep the class → serde mapping in sync (see the sketch below). In other words, the question is whether it is worthwhile to simply enforce serdes throughout the code for safety.
  2. Asking users to learn a new "Type" class for parameterized types could be a programming burden.
  3. It is not clear whether extending the serialization interface with this new overloaded function would be free of misuse side effects.
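
To make point 1 concrete, below is a minimal sketch of a "declare the serdes at the top" registry. The SerdeRegistry class is hypothetical (it is not an existing Kafka Streams API); only the Serializer interface and StringSerializer come from org.apache.kafka.common.serialization.

import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.common.serialization.Serializer;
import org.apache.kafka.common.serialization.StringSerializer;

// Hypothetical "serdes at the top" registry: one central class -> serializer
// mapping instead of passing serdes at every operator.
public class SerdeRegistry {
    private final Map<Class<?>, Serializer<?>> serializers = new HashMap<>();

    public <T> void register(Class<T> cls, Serializer<T> serializer) {
        serializers.put(cls, serializer);
    }

    @SuppressWarnings("unchecked")
    public <T> Serializer<T> serializerFor(Class<T> cls) {
        Serializer<T> s = (Serializer<T>) serializers.get(cls);
        if (s == null) {
            // The failure mode of point 1: a forgotten or stale mapping is only
            // discovered here at runtime, far away from the operator that needs it.
            throw new IllegalStateException("No serializer registered for " + cls);
        }
        return s;
    }

    public static void main(String[] args) {
        SerdeRegistry registry = new SerdeRegistry();
        registry.register(String.class, new StringSerializer());
        // If some operator later produces Long values but nobody updates the
        // registry, the lookup below fails at runtime rather than at compile time.
        registry.serializerFor(Long.class);
    }
}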

 

About Open World vs. Closed World

Open World APIs

The current Kafka Streams high-level DSL is considered "Open World": arguments of the API functions are generically typed, and the provided (de-)serializers read / write those objects from / to byte arrays.

Serdes are provided when we 1) read data from Kafka, 2) write data to Kafka, or 3) materialize data to some persistent storage (like RocksDB), and they are always provided dynamically in the DSL.

 

Note that we will only materialize KTables that are read directly from a Kafka topic, i.e.:

 

// We will materialize the table read from "topic", and apply the filter / mapValues
// logic on that materialized data before the aggregation processor below.

builder.table("topic", keyDeserializer, valueDeserializer)
       .filter(...)
       .mapValues(...)
       .aggregate()

 

Given the above, we do not need to consider case 3) separately for KTables; for a KStream, however, we still need to require users to provide serdes whenever it is going to be materialized.

 

Its pros are:

1) Better programmability: users can work directly with whatever classes / primitives they want.

2) Type-safety: it provides compile-time type checking.

3) Flexibility to extend to other programming interfaces in the future.

 

Its cons are:

1) Users have to define a class for each intermediate result throughout the topology, and also a ser-de factory for any of them that is going to be written to disk files / sockets (i.e. Kafka or RocksDB); see the sketch below.

2) There is some overhead in creating transient objects at each stage of the topology (i.e. objects created between consecutive transformation stages).
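
As a sketch of con 1, here is roughly what a user has to write for a single intermediate type that crosses a byte boundary. The RegionCount class and its byte layout are made up for illustration; the Serializer / Deserializer interfaces are the existing ones from org.apache.kafka.common.serialization.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;

import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serializer;

// A user-defined class for one intermediate result in the topology.
public class RegionCount {
    public final String region;
    public final long count;

    public RegionCount(String region, long count) {
        this.region = region;
        this.count = count;
    }
}

// A hand-written serializer, needed only because this value is written to
// a byte boundary (Kafka topic or RocksDB store).
class RegionCountSerializer implements Serializer<RegionCount> {
    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {}

    @Override
    public byte[] serialize(String topic, RegionCount data) {
        if (data == null) return null;
        byte[] region = data.region.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + region.length + 8)
                         .putInt(region.length)
                         .put(region)
                         .putLong(data.count)
                         .array();
    }

    @Override
    public void close() {}
}

// ...and the matching deserializer.
class RegionCountDeserializer implements Deserializer<RegionCount> {
    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {}

    @Override
    public RegionCount deserialize(String topic, byte[] bytes) {
        if (bytes == null) return null;
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        byte[] region = new byte[buf.getInt()];
        buf.get(region);
        return new RegionCount(new String(region, StandardCharsets.UTF_8), buf.getLong());
    }

    @Override
    public void close() {}
}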

 

Closed World APIs

Another popular API category is "Closed World", where the data types are pre-defined as a single compound, JSON-like class (e.g. Storm uses "Tuple", Spark uses "DataFrame", Hyracks uses "Frame", etc.). These classes are essentially a map representation of the underlying byte buffer, and their getters / setters address fields by name and are translated into byte-buffer reads / writes.
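
As an illustration (this toy Tuple is not any particular framework's API, and it is backed by a HashMap for brevity rather than a byte buffer), field access by name looks roughly like this:

import java.util.HashMap;
import java.util.Map;

// A toy "closed world" record: every record in the pipeline is the same
// generic container, with fields addressed by name.
public class Tuple {
    private final Map<String, Object> fields = new HashMap<>();

    public void put(String field, Object value) {
        fields.put(field, value);
    }

    @SuppressWarnings("unchecked")
    public <T> T get(String field) {
        return (T) fields.get(field);   // no compile-time check on the requested type
    }

    public static void main(String[] args) {
        Tuple pageView = new Tuple();
        pageView.put("user", "alice");
        pageView.put("region", "emea");

        String region = pageView.get("region");  // fine
        Long views = pageView.get("user");       // compiles, but fails only at runtime
        System.out.println(region + " " + views);
    }
}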

 

Its pros are:

 

1) Users do not need to define a class and a serde for every intermediate data type.

2) Possibly reduced GC pressure from "transient" JVM objects, and reduced CPU overhead of ser-de / hashing if the library can mutate the incoming record bytes in place.

 

Its cons are:

1) Loss of type-safety, since every step just translates one "Tuple" into another "Tuple".

2) Users have to define the "schema" of the transformed tuple dynamically, which could possibly be registered in a schema registry service.

 

Examples of Closed-World APIs include: Storm, Spark, Hyracks, Giraph, etc.

 

Typing Needs in Practice

I think that for most stream processing applications, users would likely fall into one of the following categories:

 

1. Their data types are pure strings in some JSON format throughout the pipeline (see the JSON sketch below, after case 4). Cases that fall into this category include: a) the stream source data comes from files / logs produced by other systems whose format they cannot control; b) they use strings simply for ease of programming when interacting with other external systems.

 

2. Their data are encoded with Avro / Thrift / Protocol Buffers throughout the pipeline. Cases that fall into this category include: a) they have centralized schema management for the whole organization that everyone depends on; b) their stream source data comes from online services that generate messages in their own format.

 

3. Their source data are encoded differently for each data source, or all their data types are primitives. Cases that fall into this category include: a) the streaming source data are quite heterogeneous, each with its own SerDe factory, and there is no centralized data schema support; b) just for demos?

 

4. They want strong typing in their application for better programmability in terms of functional / object-oriented programming, and hence would rather define their own class types for all possible intermediate values than use a generic class as in case 1) or 2).
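
To make case 1 concrete, here is a minimal sketch of a "strings all the way" transformation step (assuming Jackson for JSON parsing; the topic-independent field names are made up): every value stays a String and each step re-parses the fields it needs, so user-defined classes and custom serdes never enter the picture.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonStringStep {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // One transformation step: parse the incoming JSON string, pick a field,
    // and emit a JSON string again so a plain string serializer keeps working.
    static String extractRegion(String jsonValue) throws Exception {
        JsonNode node = MAPPER.readTree(jsonValue);
        return "{\"region\":\"" + node.get("region").asText() + "\"}";
    }

    public static void main(String[] args) throws Exception {
        String in = "{\"user\":\"alice\",\"region\":\"emea\",\"views\":42}";
        System.out.println(extractRegion(in));   // prints {"region":"emea"}
    }
}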

 

For the above four cases, type-safety will be most helpful in cases 3) and 4), and possibly also in cases 1) and 2) if strong typing is desired there as well (think of SpecificRecord in Avro).

Summary

In summary, many of these pros / cons considerations are really programmability / user-facing issues where personal taste may play a big role. Hence we thought that whichever option we chose, we would not be able to make everyone happy (not sure whether it would be a half-half split, though). So before we collect more feedback that brings in factors we have not thought about before, we would like to keep the situation as-is and work on improving the current options for better programmability, but keep this discussion in mind so that we can always come back and revisit this decision before we remove the "API unstable" tag for Kafka Streams.