...
The main work for Timestamp Barrier, and the differences between Timestamp Barrier and the existing Watermark in Flink, are summarized in the following table.
| | Timestamp Barrier | Watermark |
|---|---|---|
| Generation | Each source subtask generates a timestamp barrier from system time or event time | Each source subtask generates a watermark event from system time or event time |
| Checkpoint | Stores <checkpoint, timestamp barrier> when the timestamp barrier is generated, so that the job can recover the same timestamp barrier for an uncompleted checkpoint | None |
| Replay data | Stores <timestamp barrier, offset> for the source when it broadcasts a timestamp barrier, so that the source can replay the same data for the same timestamp barrier | None |
| Align data | Aligns data for stateful operators (aggregation, join, etc.) and temporal operators (window) | Aligns data for temporal operators (window) only |
| Computation | An operator computes results for a specific timestamp barrier based on the results of the previous timestamp barrier | The window operator only computes results in the window range |
| Output | An operator outputs or commits results when it has collected all timestamp barriers, including operators with data buffers or async operations | The window operator supports "emit" output |
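The "Replay data" row above can be illustrated with a small sketch. This is not Flink's actual implementation; the class and method names (`BarrierOffsetLog`, `onBarrierBroadcast`, and so on) are hypothetical, and it only shows the bookkeeping idea: a source records the offset at which each timestamp barrier was broadcast, so that after failover it can replay exactly the same data for the same barrier, and it can prune entries once the corresponding checkpoint completes.

```java
import java.util.TreeMap;

/**
 * Illustrative sketch of <timestamp barrier, offset> bookkeeping on the
 * source side. Names are hypothetical, not part of Flink's API.
 */
class BarrierOffsetLog {
    // timestamp barrier -> source offset at the moment the barrier was broadcast
    private final TreeMap<Long, Long> barrierToOffset = new TreeMap<>();

    /** Record <timestamp barrier, offset> when the barrier is broadcast downstream. */
    void onBarrierBroadcast(long timestampBarrier, long offset) {
        barrierToOffset.put(timestampBarrier, offset);
    }

    /** After failover, look up the offset to replay from for an uncompleted barrier. */
    long replayOffsetFor(long timestampBarrier) {
        Long offset = barrierToOffset.get(timestampBarrier);
        if (offset == null) {
            throw new IllegalArgumentException("Unknown timestamp barrier: " + timestampBarrier);
        }
        return offset;
    }

    /** Drop entries for barriers whose checkpoints have completed. */
    void pruneUpTo(long completedBarrier) {
        // headMap(key, true) is a view; clearing it removes the entries
        barrierToOffset.headMap(completedBarrier, true).clear();
    }
}
```

On recovery, the job restores the same timestamp barrier from the <checkpoint, timestamp barrier> mapping, then calls something like `replayOffsetFor` to reposition the source, which is what makes the recomputed results deterministic.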
The main work in Flink and Table Store is as follows:

| Component | Module | Main Work |
|---|---|---|
| Table Store | MetaService | |
| | Catalog | |
| | Source and SplitEnumerator | |
| | Sink | |
| Flink | Timestamp Barrier Mechanism | The detailed main work is in the table above |
| | Planner | |
| | JobManager | |
The Next Step
This is an overall FLIP for data consistency in streaming and batch ETL. Next, we would like to create a FLIP for each functional module with a detailed design. For example:
...