...
- Flink: Flink dialect does not support CTAS. ==> LEVEL-1
- Flink: Hive dialect already supports CTAS but does not guarantee atomic(can not roll back). ==> LEVEL-2
- Spark DataSource v1: is atomic (can roll back), but is not isolated. ==> LEVEL-3
- Spark DataSource v2: Guaranteed atomicity and isolation. ==> LEVEL-4
- Hive MR: Guaranteed atomicity and isolation. ==> LEVEL-54
Hive SQL and Spark SQL are mainly used in offline(batch mode) scenarios; Flink SQL is suitable for both real-time(streaming mode) and offline(batch mode) scenarios. In a real-time scenario, we believe that the job is always running and does not stop, and the data is written in real time and visible in real time, so we do not think it is necessary no need to provide atomicity.
To ensure that Flink SQL is semantically consistent in Streaming mode and Batch mode, combining the current situation of Flink and the needs of our business, choosing LEVEL-2 atomicity as the default behavior for Flink streaming and batch mode. If the user requires LEVEL-3 atomicity, this ability can be achieved by enabling an atomicity , allowing users to enable option. In general, batch mode usually requires LEVEL-3 atomicity with an option. atomicity. In a nutshell, Flink provide two level atomicity guarantee, LEVEL-2 as the default behavior.
Syntax
We proposing the CREATE TABLE AS SELECT(CTAS) clause as following:
...
Regarding Catalog interface, to support Create Table As Select syntax(atomicity guarantee, if don't need atomicity, the second option is no need), two changes are needed here:
- Providing a new method inferTableOptions that is used to infer the options of CatalogBaseTable, these options will be used to compile the sql to JobGraph successfully. This method throw UnsupportedOperationException default.
- This interface should extands the java Serializable interface, then it can be serialized as a part of JobGraph and pass to JM side, so this require the essential options to construct a catalog object can be serialized.
- Providing a new method inferTableOptions that is used to infer the options of CatalogBaseTable, these options will be used to compile the sql to JobGraph successfully. This method throw UnsupportedOperationException default.
/** /**
|
...
The atomicity implementation of Flink CTAS requires two parts:
- Add Enabling the Enable atomicity support option.
- Catalog can be serialized, ensuring atomicity by performing created/dropped table on the JM side.
...