Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

On the other hand, this proposal increases the cost of data operation and management. When the data of a table needs to be rolled back to the specified snapshot for some reason, each downstream table needs to be reset to a different snapshot. It's terrible.

For the above reasons, we choose the global checkpoint mechanism in the first stage. But there are some problems of Checkpoint for data consistency

  • Flink uses Checkpoint as a fault-tolerant mechanism, it supports aligned checkpoint, non-aligned checkpoint, and may even task local checkpoint in the future.
  • Even for Aligned Checkpoint, data consistency cannot be guaranteed for some operators, such as Temporal operators with timestamp or data buffer.

So after the implementation in the first stage, we need to upgrade the Watermark or implement a new Timestamp Barrier mechanism in Flink to support full semantics of Data Consistency. In a single ETL job

  1. Source generates Timestamp Barrier based on System Time or Event Time
  2. Timestamp Barriers have order relations. Aggregation and Temporal operator perform computation in each Timestamp Barrier, then between Timestamp Barrier.
  3. Aggregation and  Temporal operator outputs results for each Timestamp Barrier

ETL jobs write each Timestamp Barrier to snapshot in Table Store, and the downstream ETL can read the Timestamp Barrier, which makes the Timestamp Barrier be transferred between ETL jobs.

Finally, we can guarantee the consistency of data by Timestamp Barrier and exactly-once computation by Checkpoint independently.

Roadmap In Future


Data consistency of ETL Topology is our first phase of work. After completing this part, we plan to promote the capacity building and improvement of Flink + Table Store in future, mainly including the following aspects.

  1. Materialized View in SQL. Next, we hope to introduce materialized view syntax into Flink to improve user interaction experience. Queries can also be optimized based on materialized views to improve performance.

  2. Improve MetaService capabilities. ManagerService is a single point in the system, and it should supports failover. In the other way, MetaService supports managing Flink ETL jobs and tables in Table Store, accessed by other computing engines such as Spark and being an agent of Hive Metastore later.

  3. Improve data consistency semantics. As mentioned above, we need to implement "Timestamp Barrier" to support full semantics data consistency instead of "Aligned Checkpoint" in the first stage. 
  4. Improve OLAP performance. We have created issues in FLINK-25318] Improvement of scheduler and execution for Flink OLAP to manage improvement of OLAP in Flink. At the same time, we hope to continue to enhance the online query capability of Table Store and improve the OLAP performance of Flink + Table Store.

  5. Improvement of data real-time. At present, our consistency design is based on Flink checkpoint mechanism and supports minute level delay. In the future, we hope to support second level or even millisecond level data real-time on the premise of ensuring data consistency.

...