...
Add CreatingTable to use CTAS for table API.
/** A table that need to be created for CTAS. */ |
Catalog
Providing method that are used to create stage tableserialize/deserialize Catalog and infer the option of CatalogBaseTable.
@PublicEvolving /** /** |
...
First of all, make it clear, CTAS command create table must go through catalog.
Implementation Plan
We will introduce new concepts: Stage Table.
Stage Table: Created through the Catalog#createStageTable API, stored in the memory of the Catalog, visible in the SQL compilation stage;
There will be a collection record stage table in the catalog, which is created in the catalog backend storage when the job status is CREATED;
Dropped from the catalog backend storage when the job status is FAILED or CANCELED.
Through the research summary and analysis, the overall implementation process is as follows:
Execution Flow
Through the research summary and analysis, the current status of CTAS in the field of big data is:
- Flink: Flink dialect does not support CTAS. ==> LEVEL-1
- Spark DataSource v1: is atomic (can roll back), but is not isolated. ==> LEVEL-2
- Spark DataSource v2: Guaranteed atomicity and isolation. ==> LEVEL-3
- Hive MR: Guaranteed atomicity and isolation. ==> LEVEL-3
Combining the current situation of Flink and the needs of our business, choosing a Level-2 implementation for Flink.
Execution Flow
The overall execution process is shown in the figure above.
...
- Create sink table(stage table) through Catalog's new API createStageTable.
- Construct CTASJobStatusHook with Catalog as a construction parameter, CTASJobStatusHook is an implementation of the JobStatusHook interface.
- Register CTASJobStatusHook with StreamGraph, then passed to JobGraph and serialized(Need to implement serialization/deserialization of Catalog and JobStatusHook).
- When the job starts and the status is CREATED, the runtime module will call the JobStatusHook#onCreated method, and we call the Catalog#createTable method in the CTASJobStatusHook#onCreated method.
- When the final status of the job is FAILED, the runtime module will call the JobStatusHook#onFailed method, we call the Catalog#dropTable method in the CTASJobStatusHook#onFailed method.
- When the final status of the job is CANCELED, the runtime module will call the JobStatusHook#onCanceled method, we call the Catalog#dropTable method in the CTASJobStatusHook#onCanceled method.
- When the final status of the job is FINISH, the runtime module will call the JobStatusHook#onFinished method, and we do not need to do any additional operations.
Hook design
Definition of JobStatusHookRuntime
/** |
Register JobStatusHookPlanner
public class StreamGraph implements Pipeline { /** Registers the JobStatusHook. */ |
...