

Status

Current state: Under Discussion

Discussion thread: here

JIRA: KAFKA-15299

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

...

Reduce the gap between the semantics of relational databases and data streams.

It is a common data integration technique to capture RDBMS changes as they are made to data entities using a Change Data Capture (CDC) platform, and to publish these changes as messages to Kafka topics.

When a business entity is stored in a relational database, its data model is typically normalized. This normalization generally means that each data entity type is stored across several tables, with various relations between them. If foreign key relationships are used between the tables, the business entity data can easily be assembled from the corresponding tables when necessary, using SQL SELECT statements with foreign key joins.

Reading data entities from the application database is usually associated with some specific "business" event that occurs when the entity reaches a consistent "business" state, so these reads can be represented as a stream of business events resulting from business activity. For example, the Order data entity shown in the diagram below may only be meaningfully extracted for integration once its state is "COMPLETED", since before that it may still be in the process of creation.

Gliffy Diagram: DB Entities

In order to integrate an RDBMS data entity with consumers via an RDBMS->CDC->Kafka->Kafka Streams->Consumer pipeline, the following sequence of steps can be used:

  • Replicate each entity table to its own topic with CDC. Typically this is only possible using the table PK as the Kafka message key. We will call these topics "Data Topics" here.
  • Create Kafka Streams KTables from all "Data Topics".
  • Publish the business events that trigger data entity extraction to their own topic. We will call this topic the "Trigger Topic" here.
  • Use the existing stream-table join feature in Kafka Streams to join the "Trigger Topic" stream with the parent table of the data entity (Orders in our example).
  • Use the new Kafka Streams feature proposed in this KIP-955 to join the stream resulting from the previous step with the rest of the KTables using a left foreign key join.

So if we create a KStream from the stream of business events (for example, order processing events such as "ORDER COMPLETED") and create KTables from all database tables that store the business entity data (for example, using Change Data Capture), we can assemble the complete business entity using stream-table joins and then aggregating on the event key.

For the order completion example above, the sequence of order extraction from Kafka topics is as follows:

  • As the first step, the business event stream is joined with the business entity's parent table using a stream-table join on the PK. In the order management example in the diagram below, the Order Processing Events stream is joined with the Orders KTable on the OrderId key, resulting in a stream of completed orders.
  • As the second step, the stream of completed orders is left-joined with the Order Details table using the foreign key (OrderId), resulting in a stream of completed orders with their corresponding order details. Since the order-to-order-details relation is 1:n, an additional aggregation on OrderId is required.
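The two steps above can be sketched in plain Java over in-memory maps to make the intended semantics concrete. This is an illustrative model only (class, method, and field names are assumptions, not part of the proposal), not the Kafka Streams implementation:

```java
import java.util.*;

// Plain-Java sketch of the two join steps; names and data are illustrative.
public class OrderAssemblySketch {

    // A hypothetical order-details row: its own PK is the map key below,
    // orderId is the foreign key back to Orders.
    record Detail(int orderId, String description) {}

    // Step 1: join completion events (OrderId keys) with the Orders table
    // on the PK, analogous to the existing stream-table join.
    static Map<Integer, String> joinEventsWithOrders(List<Integer> completedOrderIds,
                                                     Map<Integer, String> orders) {
        Map<Integer, String> completed = new LinkedHashMap<>();
        for (int orderId : completedOrderIds) {
            String order = orders.get(orderId);
            if (order != null) {
                completed.put(orderId, order);
            }
        }
        return completed;
    }

    // Step 2: left-join the completed orders with Order Details on the FK
    // (OrderId) and aggregate the 1:n matches per order.
    static Map<Integer, List<String>> leftJoinAndAggregate(Map<Integer, String> completedOrders,
                                                           Map<Integer, Detail> orderDetails) {
        Map<Integer, List<String>> assembled = new LinkedHashMap<>();
        for (Integer orderId : completedOrders.keySet()) {
            List<String> details = new ArrayList<>();
            for (Detail d : orderDetails.values()) {
                if (d.orderId() == orderId) {
                    details.add(d.description());
                }
            }
            // Left join: an order with no details still appears, with an empty list.
            assembled.put(orderId, details);
        }
        return assembled;
    }

    public static void main(String[] args) {
        Map<Integer, String> orders = Map.of(1, "order-1", 2, "order-2");
        Map<Integer, Detail> details = new LinkedHashMap<>();
        details.put(10, new Detail(1, "2x widget"));
        details.put(11, new Detail(1, "1x gadget"));
        Map<Integer, String> completed = joinEventsWithOrders(List.of(1, 2), orders);
        System.out.println(leftJoinAndAggregate(completed, details));
    }
}
```

Note that step 2 both joins and aggregates: all detail rows matching an order's key are collected into one value, which is what the additional aggregation on OrderId accomplishes in the streaming pipeline.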

This KIP proposes a new Kafka Streams feature: the ability to join a KStream with a KTable using a left foreign key join.

This KIP makes data aggregation semantics consistent between SQL and Kafka Streams.

Public Interfaces

/**
 * Join records of this {@code KStream} with {@code KTable} records using a non-windowed left join.
 * <p>
 * This is a foreign key join, where the joining key is determined by the {@code foreignKeyExtractor}.
 *
 * @param rightTable          the {@code KTable} on the right side of the join, to be joined with
 *                            this {@code KStream}. Keyed by KO.
 * @param foreignKeyExtractor a {@link Function} that extracts the foreign key (of this stream's
 *                            key type K) from the right table's value (VO). If the result is
 *                            null, the record is ignored as invalid.
 * @param joiner              a {@link ValueJoiner} that computes the join result for a pair of matching records
 * @param <VR>                the value type of the result {@code KStream}
 * @param <KO>                the key type of the right {@code KTable}
 * @param <VO>                the value type of the right {@code KTable}
 * @return a {@code KStream} that contains the result of joining this stream with {@code rightTable}
 */
<VR, KO, VO> KStream<K, VR> leftJoin(final KTable<KO, VO> rightTable,
                                     final Function<VO, K> foreignKeyExtractor,
                                     final ValueJoiner<V, VO, VR> joiner);
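The contract of the proposed method can be modeled for a single stream record in plain Java. This is a sketch of the intended semantics only (assuming, per the Test Plan, that the foreign key extracted from the right table's value matches the stream's key); all names here are illustrative:

```java
import java.util.*;
import java.util.function.BiFunction;
import java.util.function.Function;

// Plain-Java model of the proposed left foreign-key join semantics.
public class FkLeftJoinSemantics {

    // For one stream record (streamKey, streamValue), return the join results
    // against the right table: one result per matching right record, or a
    // single result with a null right side if nothing matches (left join).
    static <K, V, KO, VO, VR> List<VR> joinOne(K streamKey,
                                               V streamValue,
                                               Map<KO, VO> rightTable,
                                               Function<VO, K> foreignKeyExtractor,
                                               BiFunction<V, VO, VR> joiner) {
        List<VR> results = new ArrayList<>();
        for (VO rightValue : rightTable.values()) {
            K fk = foreignKeyExtractor.apply(rightValue);
            if (fk == null) {
                continue;                                   // null FK: record ignored as invalid
            }
            if (fk.equals(streamKey)) {
                results.add(joiner.apply(streamValue, rightValue));
            }
        }
        if (results.isEmpty()) {
            results.add(joiner.apply(streamValue, null));   // left-join: null right side
        }
        return results;
    }

    public static void main(String[] args) {
        // Right table: detail id -> "orderId:description" (orderId is the FK).
        Map<Integer, String> rightTable = new LinkedHashMap<>();
        rightTable.put(10, "1:2x widget");
        rightTable.put(11, "1:1x gadget");
        Function<String, Integer> fkExtractor = v -> Integer.valueOf(v.split(":")[0]);
        BiFunction<String, String, String> joiner =
            (order, detail) -> order + "+" + (detail == null ? "null" : detail.split(":")[1]);

        System.out.println(joinOne(1, "order-1", rightTable, fkExtractor, joiner));
        System.out.println(joinOne(2, "order-2", rightTable, fkExtractor, joiner));
    }
}
```

The key behaviors are that a 1:n foreign key relation produces multiple output records per stream record, and an unmatched stream record still produces one output with a null right-side value, mirroring SQL LEFT JOIN.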

...

Gliffy Diagram: Workflow

Compatibility, Deprecation, and Migration Plan

There is no impact on existing users because this KIP only adds a new interface; no existing behavior changes and no migration is required.

Test Plan

The proposed new functionality should be tested with functional test cases ensuring that the results of the stream-table foreign key join are consistent with the results of a standard RDBMS SQL left join between two tables with a foreign key relation. The table that holds the foreign key field corresponds to the KTable on the right side of the left join, and the table whose primary key the foreign key references corresponds to the KStream on the left side of the join.
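One such functional test case can be sketched in plain Java: compute the join over in-memory stand-ins for the two tables and compare it with the hand-computed result of the equivalent SQL LEFT JOIN. Table names, row encodings, and the helper class are assumptions for illustration:

```java
import java.util.*;

// Sketch of one functional test case: the stream side plays the PK table
// (orders), the right KTable plays the FK table (order_details).
public class FkJoinSqlEquivalenceTest {

    // Expected rows of:
    //   SELECT o.id, d.description FROM orders o
    //   LEFT JOIN order_details d ON d.order_id = o.id
    static List<String> expectedSqlResult() {
        return List.of("1|2x widget", "1|1x gadget", "2|null");
    }

    // orders: id -> order value (stream side, PK = id);
    // details: detail id -> {orderId, description} (right table, FK = orderId).
    static List<String> streamTableFkLeftJoin(Map<Integer, String> orders,
                                              Map<Integer, String[]> details) {
        List<String> out = new ArrayList<>();
        for (Integer orderId : orders.keySet()) {
            boolean matched = false;
            for (String[] d : details.values()) {
                if (Integer.parseInt(d[0]) == orderId) {    // FK comparison
                    out.add(orderId + "|" + d[1]);
                    matched = true;
                }
            }
            if (!matched) {
                out.add(orderId + "|null");                 // left-join semantics
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, String> orders = new LinkedHashMap<>();
        orders.put(1, "order-1");
        orders.put(2, "order-2");
        Map<Integer, String[]> details = new LinkedHashMap<>();
        details.put(10, new String[]{"1", "2x widget"});
        details.put(11, new String[]{"1", "1x gadget"});
        System.out.println(streamTableFkLeftJoin(orders, details).equals(expectedSqlResult()));
    }
}
```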

Rejected Alternatives

None.