...
Hive MR is client mode, the client is responsible for parsing, compiling, optimizing, executing, and finally cleaning up.
Therefore, when Hive executes the CTAS command , it can execute Query as follows:
- Execute query first, and write the
...
- query result to the temporary directory.
...
- If all MR tasks are executed successfully, then create a
...
- table and load the data.
- If the execution fails, the table will not be created.
Spark(DataSource v1) : non-atomic
There is a role called driver in Spark, the driver is responsible for compiling tasks, applying for resources, scheduling task execution, tracking task operation, etc.
Spark executes CTAS steps as follows:
- Create a sink table based on the schema of the query result.
- Execute the spark task and write the result to a temporary directory.
- If all Spark tasks are executed successfully, use the Hive API to load data into the sink table created in the first step.
- If the execution fails, driver will drop the sink table created in the first step.
Spark(DataSource v2, Not yet completed, Hive Catalog is not supported yet) : optional atomic
Non-atomic
Non-atomic implementation is consistent with DataSource v1 logic. For details, see CreateTableAsSelectExec .
Atomic
Atomic implementation( for details, see AtomicCreateTableAsSelectExec), supported by StagingTableCatalog and StagedTable .
StagedTable supports commit and abort.
StagingTableCatalog is in memory, when executes CTAS steps as follows:
- Create a StagedTable based on the schema of the query result, but it is not visible in the catalog.
- Execute the spark task and write the result into StagedTable.
- If all Spark tasks are executed successfully, call StagedTable#commitStagedChanges(), then it is visible in the catalog.
- If the execution fails, call StagedTable#abortStagedChanges().
Flink Implementation plan
We will refer to spark DataSource v1 implementation.
After the job is successfully executed, move the data to the official directory and create a table, like hive/spark.
...