...
- Create a StagedTable based on the schema of the query result; it is not yet visible in the catalog.
- Execute the Spark tasks and write the result to the StagedTable.
- If all Spark tasks complete successfully, call StagedTable#commitStagedChanges(); the table then becomes visible in the catalog.
- If the execution fails, call StagedTable#abortStagedChanges().
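For reference, a minimal sketch of this flow from the driver side, assuming Spark 3.x's StagingTableCatalog API (runWriteTasks is a hypothetical placeholder for the actual distributed write execution):

import java.util.Collections;
import org.apache.spark.sql.connector.catalog.Identifier;
import org.apache.spark.sql.connector.catalog.StagedTable;
import org.apache.spark.sql.connector.catalog.StagingTableCatalog;
import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.types.StructType;

public final class StagedCtasSketch {

    // Stages the table, runs the write, then commits or aborts atomically.
    public static void runAtomicCtas(StagingTableCatalog catalog,
                                     Identifier ident,
                                     StructType querySchema) throws Exception {
        // 1. The staged table exists physically but is not yet in the catalog.
        StagedTable staged = catalog.stageCreate(
                ident, querySchema, new Transform[0],
                Collections.<String, String>emptyMap());
        try {
            runWriteTasks(staged);          // 2. Execute the Spark write tasks.
            staged.commitStagedChanges();   // 3. Success: table becomes visible.
        } catch (Exception e) {
            staged.abortStagedChanges();    // 4. Failure: nothing becomes visible.
            throw e;
        }
    }

    private static void runWriteTasks(StagedTable staged) {
        // Placeholder for Spark's actual task execution.
    }
}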
Implementation Plan
Execution Flow
Steps:
- Create the sink table in the catalog based on the schema of the query result.
- Start the job and write the result to the sink table.
- If the job executes successfully, make the data visible.
- If the job execution fails, drop the sink table or delete the data. (This capability requires runtime-module support, such as a hook; the SQL layer passes the relevant parameters to the runtime module. A sketch follows this list.)
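As an illustration of the hook idea, here is a sketch of a cleanup hook that drops the sink table on failure. The JobStatusHook interface shown is a hypothetical stand-in for whatever callback the runtime module exposes; only Catalog#dropTable is an existing Flink API:

import org.apache.flink.table.catalog.Catalog;
import org.apache.flink.table.catalog.ObjectPath;

// Hypothetical hook interface the runtime module would call back into.
interface JobStatusHook {
    void onFinished(String jobId);
    void onFailed(String jobId, Throwable cause);
}

// Drops the CTAS sink table if the job fails, so no partial table remains.
class CtasCleanupHook implements JobStatusHook {
    private final Catalog catalog;      // catalog holding the sink table
    private final ObjectPath sinkTable; // e.g. new ObjectPath("db", "t")

    CtasCleanupHook(Catalog catalog, ObjectPath sinkTable) {
        this.catalog = catalog;
        this.sinkTable = sinkTable;
    }

    @Override
    public void onFinished(String jobId) {
        // Success: keep the sink table; the data is already in place.
    }

    @Override
    public void onFailed(String jobId, Throwable cause) {
        try {
            // Failure: undo the CTAS by dropping the sink table.
            catalog.dropTable(sinkTable, true /* ignoreIfNotExists */);
        } catch (Exception e) {
            // Cleanup is best-effort; a real implementation would log this.
        }
    }
}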
Supported Job Mode
Both streaming and batch modes are supported.
The execution flows of streaming and batch are similar; the main differences are in atomicity and data visibility.
Streaming
Since streaming jobs are long-running, the table needs to be created first; data is usually consumed downstream in real time. Visibility is determined by the specific sink implementation:
- Data is visible after the checkpoint completes, or immediately after writing.
Under streaming semantics, the steps are:
- Create the sink table in the catalog based on the schema of the query result.
- Start the job and write the result to the sink table.
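For illustration, a streaming CTAS submitted through the Table API could look as follows once this feature lands (the orders source table and the connector options are made up for the example):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public final class StreamingCtasExample {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Assume a source table `orders` is already registered in the catalog.
        // The sink table `orders_sink` is created in the catalog first, then
        // the long-running streaming job keeps writing into it.
        tEnv.executeSql(
                "CREATE TABLE orders_sink "
                        + "WITH ('connector' = 'filesystem', "
                        + "      'path' = '/tmp/orders_sink', "
                        + "      'format' = 'json') "
                        + "AS SELECT * FROM orders");
    }
}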
Batch
A batch job eventually finishes. To guarantee atomicity, the results are usually written to a temporary directory first and moved to the final location only after the job succeeds, as sketched below.
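A generic sketch of that write-then-rename pattern, using plain java.nio.file and made-up paths; the staging directory sits on the same filesystem as the final location so the final move can be a single atomic rename:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public final class TempDirCommitSketch {
    public static void main(String[] args) throws IOException {
        Path finalDir = Path.of("/data/warehouse/t");
        Path staging  = Path.of("/data/warehouse/.staging-t");
        Files.createDirectories(staging);
        try {
            // 1. Write all job results under the staging directory.
            Files.writeString(staging.resolve("part-0"), "result rows\n");

            // 2. Success: publish everything with one atomic rename, so
            //    readers never observe a half-written table.
            Files.move(staging, finalDir, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            // 3. Failure: delete the staging directory; the final location
            //    is never touched, which gives the atomicity guarantee.
            Files.deleteIfExists(staging.resolve("part-0"));
            Files.deleteIfExists(staging);
            throw e;
        }
    }
}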
...
@PublicEvolving /**
Compatibility, Deprecation, and Migration Plan
...