Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Discussion thread


Vote thread


ISSUE


Release

0.6

Motivation

Background

In data pipelines built with Flink and CDC technologies, there is often a requirement to enrich data in real time by merging multiple tables into a single wide table. Consider the scenario with three interconnected tables A, B, and C, forming a join chain:

...

Our goal is to update the aggregate view of tables A, B, and C in real time.

...

Flink Dual-Stream Join: Utilizing Flink for Join Operations Between Two Data Streams


Image Added

Core Challenge:

Growth of State Storage: When performing real-time stream joins, Flink needs to maintain a state that holds the data pending to be joined. If the data arrival rate of one stream far exceeds that of the other, this state can grow significantly large, leading to storage and performance issues.

Flink Lookup Join: Flink's Method for Searching and Joining Data

Image Added

Core Challenge:





The traditional lookup join approach is limited by its sole response to changes in the main data stream. This means that if a related dimension table (such as B or C) undergoes changes, the data that has already been joined cannot be dynamically updated. In other words, a lookup operation is only triggered when a new event enters the main stream, and updates in dimension tables do not result in changes to the join results.

New Strategy: Dynamic Dimension Table-Driven Lookup Join

...