
...

  1. Each Flink ETL job is independent; it manages and performs checkpoints in its own JobManager.
  2. Each ETL job generates snapshot data in Table Store according to its own checkpoints.
  3. Flink OLAP/Batch jobs read snapshots of tables from Table Store and perform complex computations such as join and aggregation.
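Because each job checkpoints independently, the latest snapshot of each table may reflect a different source version at any given instant. The following is an illustrative simulation (not real Flink or Table Store code; the interval values are assumptions for the example) of how independent checkpoint intervals lead to mismatched snapshot versions:

```python
# Illustrative simulation: three ETL jobs commit snapshots only at their
# own checkpoint boundaries, so a batch query at a given moment may see a
# different source version in each table.

# Hypothetical per-job checkpoint intervals (in the same time unit as `now`).
checkpoint_interval = {"Table1": 1, "Table2": 2, "Table3": 3}

def latest_snapshot(table, now):
    """Source version of the last checkpoint `table`'s job completed by `now`."""
    interval = checkpoint_interval[table]
    return (now // interval) * interval

now = 5
snapshots = {t: latest_snapshot(t, now) for t in checkpoint_interval}
print(snapshots)  # {'Table1': 5, 'Table2': 4, 'Table3': 3}
```

A join over these tables at `now = 5` would therefore mix data derived from source versions 5, 4, and 3, which motivates the consistency questions below.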

Proposal

...

In the whole process of streaming data processing, there are mainly the following four "HOW" problems (for general streaming and batch ETL):

  • HOW to manage the relationship between ETL jobs?

Multiple ETL jobs and tables in Table Store form a topology. For example, there are dependencies among Table1, Table2, and Table3. Currently, users cannot get the relationship information between these tables, and when data in a table is late, it is difficult to trace the root cause based on job dependencies.
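The relationship management described above can be sketched as a lineage DAG. The following is a minimal illustration (a hypothetical model, not Table Store's actual API) of how recording job dependencies would let users trace a late table back to its upstream sources:

```python
# Illustrative sketch: model the lineage between ETL jobs and tables as a
# DAG, so a late table can be traced back to all of its upstream tables.

from collections import defaultdict

class Lineage:
    def __init__(self):
        self.upstream = defaultdict(set)  # sink table -> source tables

    def add_job(self, sources, sink):
        """Register an ETL job that reads `sources` and writes `sink`."""
        for src in sources:
            self.upstream[sink].add(src)

    def trace_root(self, table):
        """Return all transitive upstream tables of `table`."""
        seen, stack = set(), [table]
        while stack:
            for src in self.upstream[stack.pop()]:
                if src not in seen:
                    seen.add(src)
                    stack.append(src)
        return seen

lineage = Lineage()
lineage.add_job(["Table1"], "Table2")            # job: Table1 -> Table2
lineage.add_job(["Table1", "Table2"], "Table3")  # job: join -> Table3

print(sorted(lineage.trace_root("Table3")))  # ['Table1', 'Table2']
```

When data in Table3 is late, walking this graph immediately yields the candidate root causes (Table1, Table2) instead of requiring manual inspection of each job.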

  • HOW to define the data correctness in query?

[draw.io diagram: figure2]

As shown above, for V1 in CDC the Flink ETL jobs generate V11, V21, and V31 in Table1, Table2, and Table3 respectively. Consider the following case: there is a base table in a database, and a user creates cascaded views or materialized views Table1, Table2, and Table3 based on it. When the user executes complex queries on Table1, Table2, and Table3, the query results are consistent with the base table. When the user instead creates the three tables in Table Store and updates their data incrementally in real time with Flink ETL jobs, these tables can be regarded as materialized views that are incrementally updated in real time. How should data correctness be defined when users perform a join query on these three tables? The case in point is how to ensure that V11, V21, and V31 are either all read or all not read in one query.
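One way to reason about this all-or-nothing requirement is to pin every table scan in a query to a common snapshot version. The sketch below is a hypothetical illustration (not Table Store's actual protocol): it picks the greatest snapshot version that all queried tables have already committed.

```python
# Illustrative sketch: to guarantee that V11, V21 and V31 are either all
# read or all not read in one query, pin every table scan to the greatest
# snapshot version that ALL queried tables have committed.

committed = {
    "Table1": [1, 2, 3],  # snapshot versions each ETL job has committed
    "Table2": [1, 2],
    "Table3": [1],
}

def consistent_version(committed, tables):
    """Largest snapshot version committed by every queried table."""
    return min(max(committed[t]) for t in tables)

# A join over all three tables must read at version 1, so the query sees
# V11, V21 and V31 together rather than a mix of newer and older data.
print(consistent_version(committed, ["Table1", "Table2", "Table3"]))  # 1
```

A query over only Table1 and Table2 could safely read at version 2, so the pinned version depends on exactly which tables the query touches.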

...

  1. Materialized View in SQL. Next, we hope to introduce materialized view syntax into Flink to improve the user interaction experience. Queries can also be optimized based on materialized views to improve performance.

  2. Improve MetaService capabilities. MetaService is currently a single point in the system, and it should support failover. In addition, MetaService should support managing Flink ETL jobs and tables in Table Store, be accessible to other computing engines such as Spark, and later act as an agent of Hive Metastore.

  3. Improve OLAP performance. We have created issues under FLINK-25318 (Improvement of scheduler and execution for Flink OLAP) to track OLAP improvements in Flink. At the same time, we hope to continue to enhance the online query capability of Table Store and improve the OLAP performance of Flink + Table Store.

  4. Improve data freshness. At present, our consistency design is based on the Flink checkpoint mechanism and supports minute-level delay. In the future, we hope to support second-level or even millisecond-level data freshness on the premise of ensuring data consistency.

...