...

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Table of Contents

Motivation

The current syntax and features of Flink SQL are already quite complete in both stream mode and batch mode.

...

  1. Create a StagedTable based on the schema of the query result; it is not yet visible in the catalog.
  2. Execute the Spark job and write the result into the StagedTable.
  3. If all Spark tasks execute successfully, call StagedTable#commitStagedChanges(); the table then becomes visible in the catalog.
  4. If the execution fails, call StagedTable#abortStagedChanges().
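
For illustration, this staged-table flow can be sketched roughly as follows. The method names commitStagedChanges() and abortStagedChanges() follow the Spark API mentioned above, but the StagingCatalog and Query types here are simplified placeholders for this sketch, not Spark's actual signatures:

interface StagedTable {
    void commitStagedChanges();   // make the staged table visible in the catalog
    void abortStagedChanges();    // discard the staged table and its data
}

class StagedCtasSketch {

    // Placeholder types used only for this sketch.
    interface StagingCatalog { StagedTable stageCreate(Object querySchema); }
    interface Query { Object resultSchema(); void writeTo(StagedTable table); }

    static void runCtas(StagingCatalog catalog, Query query) {
        // 1. Stage a table from the query's result schema; not yet visible in the catalog.
        StagedTable staged = catalog.stageCreate(query.resultSchema());
        try {
            // 2. Execute the job and write the result into the staged table.
            query.writeTo(staged);
            // 3. All tasks succeeded: publish the table, making it visible in the catalog.
            staged.commitStagedChanges();
        } catch (Exception e) {
            // 4. Execution failed: roll back the staged changes.
            staged.abortStagedChanges();
            throw new RuntimeException("CTAS job failed, staged changes aborted", e);
        }
    }
}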

Research summary

We want to unify the semantics and implementation of streaming and batch, so we finally decided to follow the approach of Spark DataSource v1.

Reasons:

  • Streaming mode requires the table to be created first, so that downstream jobs can consume it in real time.
  • In most cases, streaming jobs do not need to be cleaned up even if the job fails.
  • Flink has a rich connector ecosystem, and the capabilities provided by external storage systems differ, so Flink needs to behave consistently across them.
  • Batch jobs should try to ensure atomicity of the final result.

Implementation Plan

Based on the research summary and analysis above, the overall implementation process is as follows:

Execution Flow

Steps:

  1. Create the sink table in the catalog based on the schema of the query result.
  2. Start the job and write the result to the target table.
  3. If the job executes successfully, make the data visible.
  4. If the job execution fails, drop the sink table or delete the data. (This capability requires runtime module support, such as a hook; SQL passes the relevant parameters to the runtime module. A minimal sketch follows this list.)
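
A minimal sketch of this flow is shown below. The CtasJobStatusHook callback and the simplified catalog methods are illustrative assumptions for this document, not the final Flink API:

// Hook handed to the runtime; invoked when the job reaches a terminal state.
interface CtasJobStatusHook {
    void onJobFinished();  // job succeeded: make the data visible
    void onJobFailed();    // job failed: drop the sink table or delete the data
}

class AtomicCtasSketch {

    // Simplified catalog interface used only for this sketch.
    interface Catalog {
        void createTable(String tablePath, Object schema);  // step 1
        void dropTable(String tablePath);                    // cleanup on failure
    }

    static CtasJobStatusHook prepare(Catalog catalog, String tablePath, Object querySchema) {
        // 1. Create the sink table in the catalog from the query's result schema.
        catalog.createTable(tablePath, querySchema);

        // SQL passes the catalog/table information to the runtime via the hook,
        // so the runtime can finalize or clean up when the job terminates.
        return new CtasJobStatusHook() {
            @Override
            public void onJobFinished() {
                // 3. Job succeeded: the data written to the target becomes visible.
                //    (For non-atomic connectors this may already be the case.)
            }

            @Override
            public void onJobFailed() {
                // 4. Job failed: drop the sink table (or delete the written data).
                catalog.dropTable(tablePath);
            }
        };
    }
}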

...