Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

The main work in Timestamp Barrier  and the differences between Timestamp Barrier and existing Watermark in Flink are in the following table.


Timestamp Barrier

Watermark

Generation

JobManager must coordinate all source subtasks and generate a unified timestamp barrier from System Time or Event Time for them

Each source subtask generate timestamp barrier(watermark event) from System Time or Event Time

Checkpoint

Store <checkpoint, timestamp barrier> when the timestamp barrier is generated, so that the job can recover the same timestamp barrier for the uncompleted checkpoint.

None

Replay data

Store <timestamp barrier, offset> for source when it broadcast timestamp barrier, so that the source can replay the same data according to the same timestamp barrier.

None

Align data

Align data for stateful operator(aggregation, join and etc.) and temporal operator(window)

Align data for temporal operator(window)

Computation

Operator compute for a specific timestamp barrier based on the results of a previous timestamp barrier.

Window operator only computes results in the window range.

Output

Operator output or commit results when it collect all the timestamp barrier, including operators with data buffer or async operations.

Window operator support "emit" output


The main work in Flink and Table Store are as followed


ComponentMain Work






Table Store

MetaService
  1. Manage the relationship between ETL job and table in Table Store , including source table, sink table.
  2. Manage the finished timestamp barrier of each table in Table Store
  3. Interaction between Flink and MetaService, such as register ETL job, get consistency version of table and ect.

Catalog

  1. Register source and sink table with ETL job id.
  2. Create table based on a consistency version from MetaService

Source and SplitEnumerator
  1. SplitEnumerator managers the snapshot, split and timestamp barrier for specific table.
  2. Source read data and timestamp barrier from split, broadcast timestamp barrier
  3. Notify MetaService to update the completed timestamp barrier for tables.
  4. Notify MetaService to cleanup the information of the terminated ETL job.
Sink
  1. Write data to table store and commit data with timestamp barrier



Flink

Timestamp Barrier MechanismThe detailed and main work is in the above table
Planner
  1. Register job to MetaService to create relationship between source and sink tables.
  2. Create table based on a consistency version from MetaService 
JobManager
  1. Add a listener and call back it when the job ends

The Next Step

This is an overall FLIP for data consistency in streaming and batch ETL. Next, we would like to create FLIP for each functional module with detailed design. For example:

...