
Status

Current state: Accepted

Discussion thread: https://lists.apache.org/thread/n6nsvbwhs5kwlj5kjgv24by2tk5mh9xd

VOTE thread: https://lists.apache.org/thread/8c0wlp72kq0dhcbpy08nl1kb28q17kv3

JIRA: FLINK-32580

Released: 1.18

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

The CREATE TABLE AS SELECT (CTAS) statement has been supported since FLIP-218, but it is not atomic. The table is created before the job runs, and if the job execution fails or is cancelled, the table is not dropped.

We want Flink to support atomic CTAS, where the table is created only when the job succeeds.

We build on FLIP-218: Support SELECT clause in CREATE TABLE(CTAS), use the existing JobStatusHook mechanism, and extend Catalog's new API to implement atomic CTAS capabilities.

Public Interfaces

Introduce the interface SupportsStaging, which provides the applyStaging API. If a DynamicTableSink implements SupportsStaging, it indicates that the sink supports atomic operations.

Code Block

language	java

/**
 * Enables different staged operations to ensure atomicity in a {@link DynamicTableSink}.
 *
 * <p>By default, if this interface is not implemented, indicating that atomic operations are not
 * supported, then a non-atomic implementation is used.
 */
@PublicEvolving
public interface SupportsStaging {

    /**
     * Provides a {@link StagedTable} that provides transaction abstraction. StagedTable will be
     * combined with {@link JobStatusHook} to achieve atomicity support in the Flink framework. Call
     * the relevant API of StagedTable when the Job state is switched.
     *
     * <p>This method will be called at the compile stage.
     *
     * @param context Tells DynamicTableSink the operation type of this StagedTable; expandable
     * @return {@link StagedTable} that can be serialized and provides atomic operations
     */
    StagedTable applyStaging(StagingContext context);

    /**
     * The context is intended to tell DynamicTableSink the type of this operation. In this way,
     * DynamicTableSink can return the corresponding implementation of StagedTable according to the
     * specific operation. More types of operations can be extended in the future.
     */
    interface StagingContext {
        StagingPurpose getStagingPurpose();
    }

    enum StagingPurpose {
        CREATE_TABLE_AS,
        CREATE_TABLE_AS_IF_NOT_EXISTS
    }
}

Introduce the StagedTable interface, which supports atomic operations.

Code Block

language	java

/**
 * A {@link StagedTable} for atomic semantics using a two-phase commit protocol, combined with
 * {@link JobStatusHook} for atomic CTAS. {@link StagedTable} will be a member variable of
 * CtasJobStatusHook and can be serialized;
 *
 * <p>CtasJobStatusHook#onCreated will call the begin method of StagedTable;
 * CtasJobStatusHook#onFinished will call the commit method of StagedTable;
 * CtasJobStatusHook#onFailed and CtasJobStatusHook#onCanceled will call the abort method of
 * StagedTable;
 */
@PublicEvolving
public interface StagedTable extends Serializable {

    /**
     * This method will be called when the job is started. Similar to what it means to open a
     * transaction in a relational database; In Flink's atomic CTAS scenario, it is used to do some
     * initialization work; For example, initializing the client of the underlying service, the tmp
     * path of the underlying storage, or even call the start transaction API of the underlying
     * service, etc.
     */
    void begin();

    /**
     * This method will be called when the job succeeds. Similar to what it means to commit the
     * transaction in a relational database; In Flink's atomic CTAS scenario, it is used to do some
     * data visibility related work; For example, moving the underlying data to the target
     * directory, writing buffer data to the underlying storage service, or even call the commit
     * transaction API of the underlying service, etc.
     */
    void commit();

    /**
     * This method will be called when the job fails or is canceled. Similar to what it means to
     * rollback the transaction in a relational database; In Flink's atomic CTAS scenario, it is
     * used to do some data cleaning; For example, delete the data in tmp directory, delete the
     * temporary data in the underlying storage service, or even call the rollback transaction API
     * of the underlying service, etc.
     */
    void abort();
}

TableConfigOptions

Add the table.ctas.atomicity-enabled option so that users can enable atomicity for CREATE TABLE AS SELECT statements.

Code Block

language	java

@PublicEvolving
public class TableConfigOptions {
    @Documentation.TableOption(execMode = Documentation.ExecMode.BATCH_STREAMING)
    public static final ConfigOption<Boolean> TABLE_CTAS_ATOMICITY_ENABLED =
            key("table.ctas.atomicity-enabled")
                    .booleanType()
                    .defaultValue(false)
                    .withDescription(
                            "Specifies if the create table as select operation is executed atomically. "
                                    + "By default, the operation is non-atomic. The target table is created in Client side, and it will not be dropped even though the job fails or is cancelled. "
                                    + "If set this option to true and DynamicTableSink implements the SupportsStaging interface, the create table as select operation is expected to be executed atomically, "
                                    + "the behavior of which depends on the actual DynamicTableSink.");
}

Proposed Changes

First, we need a table abstraction that can be combined with transaction capability, so we introduce StagedTable, which can begin, commit, and abort a transaction.

The three APIs of StagedTable:

begin: Similar to opening a transaction. We can do some preparation work here, such as initializing the client, the data, or the directory.

commit: Similar to committing a transaction. We can do the data writing, make the data visible, create the table, etc.

abort: Similar to aborting a transaction. We can do data cleanup, data restoration, etc.

Note: StagedTable must be serializable, because it is used on the JM (JobManager).
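To illustrate the contract and the serializability requirement, here is a minimal, self-contained sketch. It declares a local stand-in for the StagedTable interface so that it compiles without Flink on the classpath. InMemoryStagedTable and the roundTrip helper are hypothetical names introduced only for this example. The sketch assumes that a non-serializable resource (a client, a connection) is marked transient and re-created in begin(), which is the same pattern HiveStagedTable uses for its metastore client.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class StagedTableSketch {

    /** Local stand-in for Flink's StagedTable, declared here so the sketch compiles alone. */
    interface StagedTable extends Serializable {
        void begin();

        void commit();

        void abort();
    }

    /** Hypothetical implementation that records which lifecycle methods were called. */
    static class InMemoryStagedTable implements StagedTable {
        private static final long serialVersionUID = 1L;

        private final String tableName;

        // Stand-in for a client or connection: not serialized, re-created in begin().
        private transient List<String> log;

        InMemoryStagedTable(String tableName) {
            this.tableName = tableName;
        }

        @Override
        public void begin() {
            log = new ArrayList<>();
            log.add("begin " + tableName);
        }

        @Override
        public void commit() {
            log.add("commit " + tableName);
        }

        @Override
        public void abort() {
            log.add("abort " + tableName);
        }

        List<String> log() {
            return log;
        }
    }

    /** Serializes and deserializes the table, mimicking it being shipped to the JM. */
    static InMemoryStagedTable roundTrip(InMemoryStagedTable table) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(table);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (InMemoryStagedTable) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        InMemoryStagedTable shipped = roundTrip(new InMemoryStagedTable("t1"));
        shipped.begin();
        shipped.commit();
        System.out.println(shipped.log()); // prints [begin t1, commit t1]
    }
}
```

The transient field is null after deserialization, so begin() is the only safe place to initialize it.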

Next, we need a place to create the StagedTable. Different table sinks need to perform different operations to implement atomic CTAS: for example, HiveTableSink needs to access the Hive Metastore and write to HDFS (or OSS, etc.), while JDBCTableSink needs to access the back-end database.

Therefore, we introduce the interface SupportsStaging. A DynamicTableSink that implements it supports atomic operations; one that does not implement it does not.

The Flink framework determines whether a DynamicTableSink supports atomic CTAS by checking whether it implements SupportsStaging. If it does, the framework obtains the StagedTable object through the applyStaging API; otherwise it uses the non-atomic CTAS implementation.

Identification of atomic CTAS

Normally, in stream mode, we consider the job to be long-running, and even if it fails it needs to be resumed afterwards, so atomic CTAS semantics are usually not needed.

In addition, many Flink jobs probably already use non-atomic CTAS, especially stream jobs. To keep Flink's behavior consistent and to give users maximum flexibility, even if the DynamicTableSink implements the SupportsStaging interface, users can still choose the non-atomic implementation according to their business needs.

So, in TableEnvironmentImpl we can infer whether atomic CTAS is used based on whether the user has enabled it and whether the DynamicTableSink implements the SupportsStaging interface, as follows:

Code Block

language	java

boolean isAtomicCtas = tableConfig.get(TableConfigOptions.TABLE_CTAS_ATOMICITY_ENABLED) && dynamicTableSink instanceof SupportsStaging;
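The decision above can be sketched as a small, self-contained example. The nested DynamicTableSink and SupportsStaging types are local stand-ins for the Flink interfaces (only the type relationship matters here), and PlainSink, StagingSink, and the isAtomicCtas helper are hypothetical names used for illustration.

```java
public class AtomicCtasDecisionSketch {

    /** Local stand-ins for the Flink interfaces; declared empty on purpose. */
    interface DynamicTableSink {}

    interface SupportsStaging {}

    /** Hypothetical sink without staging support. */
    static class PlainSink implements DynamicTableSink {}

    /** Hypothetical sink that opts in to staging. */
    static class StagingSink implements DynamicTableSink, SupportsStaging {}

    /** Atomic CTAS is used only if the user enabled it AND the sink supports staging. */
    static boolean isAtomicCtas(boolean atomicityEnabled, DynamicTableSink sink) {
        return atomicityEnabled && sink instanceof SupportsStaging;
    }

    public static void main(String[] args) {
        System.out.println(isAtomicCtas(true, new StagingSink()));  // prints true
        System.out.println(isAtomicCtas(false, new StagingSink())); // prints false
        System.out.println(isAtomicCtas(true, new PlainSink()));    // prints false
    }
}
```

Both conditions are required, so existing jobs keep the non-atomic behavior unless the user explicitly opts in.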

Integrate atomicity CTAS

Introduce CtasJobStatusHook (which implements the JobStatusHook interface), with StagedTable as its member variable.

The hook calls the StagedTable APIs as follows:

Code Block

language	java

/**
 * This Hook is used to implement the Flink CTAS atomicity semantics, calling the corresponding API
 * of {@link StagedTable} at different stages of the job.
 */
public class CtasJobStatusHook implements JobStatusHook {

    private final StagedTable stagedTable;

    public CtasJobStatusHook(StagedTable stagedTable) {
        this.stagedTable = stagedTable;
    }

    @Override
    public void onCreated(JobID jobId) {
        stagedTable.begin();
    }

    @Override
    public void onFinished(JobID jobId) {
        stagedTable.commit();
    }

    @Override
    public void onFailed(JobID jobId, Throwable throwable) {
        stagedTable.abort();
    }

    @Override
    public void onCanceled(JobID jobId) {
        stagedTable.abort();
    }
}
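To make the lifecycle concrete, the following self-contained sketch simulates how the hook drives a StagedTable around job execution. It is an illustration under stated assumptions, not Flink's actual runtime: the nested StagedTable and JobStatusHook types are simplified local stand-ins (the job id is a String rather than a JobID), and SetBackedStagedTable, runJob, and runCtas are hypothetical names. The "catalog" is just a Set of table names.

```java
import java.util.HashSet;
import java.util.Set;

public class CtasHookLifecycleSketch {

    /** Local stand-in for StagedTable (Serializable omitted for brevity). */
    interface StagedTable {
        void begin();

        void commit();

        void abort();
    }

    /** Local stand-in for JobStatusHook; the real interface takes a JobID. */
    interface JobStatusHook {
        void onCreated(String jobId);

        void onFinished(String jobId);

        void onFailed(String jobId, Throwable throwable);

        void onCanceled(String jobId);
    }

    /** Same delegation as the CtasJobStatusHook shown above. */
    static class CtasJobStatusHook implements JobStatusHook {
        private final StagedTable stagedTable;

        CtasJobStatusHook(StagedTable stagedTable) {
            this.stagedTable = stagedTable;
        }

        @Override
        public void onCreated(String jobId) {
            stagedTable.begin();
        }

        @Override
        public void onFinished(String jobId) {
            stagedTable.commit();
        }

        @Override
        public void onFailed(String jobId, Throwable throwable) {
            stagedTable.abort();
        }

        @Override
        public void onCanceled(String jobId) {
            stagedTable.abort();
        }
    }

    /** Hypothetical staged table: the table becomes visible in the catalog only on commit. */
    static class SetBackedStagedTable implements StagedTable {
        private final Set<String> catalog;
        private final String tableName;
        private boolean staged;

        SetBackedStagedTable(Set<String> catalog, String tableName) {
            this.catalog = catalog;
            this.tableName = tableName;
        }

        @Override
        public void begin() {
            staged = true;
        }

        @Override
        public void commit() {
            if (staged) {
                catalog.add(tableName);
            }
        }

        @Override
        public void abort() {
            staged = false;
        }
    }

    /** Hypothetical driver that mimics the hook calls around job execution. */
    static void runJob(JobStatusHook hook, Runnable job) {
        hook.onCreated("job-1");
        try {
            job.run();
        } catch (RuntimeException e) {
            hook.onFailed("job-1", e);
            return;
        }
        hook.onFinished("job-1");
    }

    /** Runs a simulated CTAS job and returns the resulting catalog contents. */
    static Set<String> runCtas(String tableName, Runnable job) {
        Set<String> catalog = new HashSet<>();
        runJob(new CtasJobStatusHook(new SetBackedStagedTable(catalog, tableName)), job);
        return catalog;
    }

    public static void main(String[] args) {
        System.out.println(runCtas("t_ok", () -> {})); // prints [t_ok]
        System.out.println(
                runCtas("t_fail", () -> { throw new IllegalStateException("boom"); })); // prints []
    }
}
```

In the failing run the table never appears in the catalog, which is the atomicity property this FLIP aims for.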

Compatibility with existing non-atomic CTAS

We can infer atomic CTAS support from whether the DynamicTableSink implements the SupportsStaging interface:

no: atomicity semantics are not supported, and the existing code logic is used;

yes: atomicity semantics are supported, so we create a CtasJobStatusHook and use the JobStatusHook mechanism to implement atomicity semantics, as described in the code in the previous section.

Code Block

language	java

Optional<DynamicTableSink> dynamicTableSinkOptional =
        getDynamicTableSink(
                catalogTable,
                tableIdentifier,
                createTableOperation.isTemporary(),
                catalogManager.getCatalog(catalogName)); 
if (tableConfig.get(TableConfigOptions.TABLE_CTAS_ATOMICITY_ENABLED)
        && dynamicTableSinkOptional.isPresent()
        && dynamicTableSinkOptional.get() instanceof SupportsStaging) {
    DynamicTableSink dynamicTableSink = dynamicTableSinkOptional.get();
    StagedTable stagedTable =
            ((SupportsStaging) dynamicTableSink)
                    .applyStaging(
                            new SupportsStaging.StagingContext() {
                                @Override
                                public SupportsStaging.StagingPurpose
                                        getStagingPurpose() {
                                    if (createTableOperation
                                            .isIgnoreIfExists()) {
                                        return SupportsStaging.StagingPurpose
                                                .CREATE_TABLE_AS_IF_NOT_EXISTS;
                                    }
                                    return SupportsStaging.StagingPurpose
                                            .CREATE_TABLE_AS;
                                }
                            });
    CtasJobStatusHook ctasJobStatusHook = new CtasJobStatusHook(stagedTable);
    mapOperations.add(
            ctasOperation.toStagedSinkModifyOperation(
                    createTableOperation.getTableIdentifier(),
                    catalogTable,
                    ctasCatalog,
                    dynamicTableSink));
    jobStatusHookList.add(ctasJobStatusHook);
} else {
    // execute CREATE TABLE first for non-atomic CTAS statements
    executeInternal(ctasOperation.getCreateTableOperation());
    mapOperations.add(ctasOperation.toSinkModifyOperation(catalogManager));
}

To avoid creating the DynamicTableSink a second time, we construct a StagedSinkModifyOperation that extends SinkModifyOperation and adds a DynamicTableSink member variable.

Current non-atomic CTAS implementations

Currently, Flink supports non-atomic CTAS operations. For a CreateTableASOperation, we create the target table first, and then compile and execute the insert operation.

The current approach has the following shortcomings:

First: if the insert operation fails, whether because of a compile failure or a job execution failure, Flink does not drop the created target table;

Second: before the job finishes, the target table already exists, but no data can be read from it.

Atomic CTAS demo

Implementing the atomic CTAS operation then requires only two steps:

1: The DynamicTableSink implements the interface SupportsStaging;

2: An implementation class of the StagedTable interface is introduced.

Hive demo

HiveTableSink implements the applyStaging API:

Code Block

language	java

@Override
public StagedTable applyStaging(StagingContext context) {
    Table hiveTable =
            HiveTableUtil.instantiateHiveTable(
                    identifier.toObjectPath(),
                    catalogTable,
                    HiveConfUtils.create(jobConf),
                    false);

    hiveStagedTable =
            new HiveStagedTable(
                    hiveVersion,
                    new JobConfWrapper(jobConf),
                    hiveTable,
                    context.getStagingPurpose()
                            == SupportsStaging.StagingPurpose.CREATE_TABLE_AS_IF_NOT_EXISTS);

    return hiveStagedTable;
}

HiveStagedTable implements the core logic

Code Block

language	java

/** An implementation of {@link StagedTable} for Hive to support atomic ctas. */
public class HiveStagedTable implements StagedTable {

    private static final long serialVersionUID = 1L;

    @Nullable private final String hiveVersion;
    private final JobConfWrapper jobConfWrapper;

    private final Table table;

    private final boolean ignoreIfExists;

    private transient HiveMetastoreClientWrapper client;

    private FileSystemFactory fsFactory;
    private TableMetaStoreFactory msFactory;
    private boolean overwrite;
    private Path tmpPath;
    private String[] partitionColumns;
    private boolean dynamicGrouped;
    private LinkedHashMap<String, String> staticPartitions;
    private ObjectIdentifier identifier;
    private PartitionCommitPolicyFactory partitionCommitPolicyFactory;

    public HiveStagedTable(
            String hiveVersion,
            JobConfWrapper jobConfWrapper,
            Table table,
            boolean ignoreIfExists) {
        this.hiveVersion = hiveVersion;
        this.jobConfWrapper = jobConfWrapper;
        this.table = table;
        this.ignoreIfExists = ignoreIfExists;
    }

    @Override
    public void begin() {
        // init hive metastore client
        client =
                HiveMetastoreClientFactory.create(
                        HiveConfUtils.create(jobConfWrapper.conf()), hiveVersion);
    }

    @Override
    public void commit() {
        try {
            // create table first
            client.createTable(table);

            try {
                List<PartitionCommitPolicy> policies = Collections.emptyList();
                if (partitionCommitPolicyFactory != null) {
                    policies =
                            partitionCommitPolicyFactory.createPolicyChain(
                                    Thread.currentThread().getContextClassLoader(),
                                    () -> {
                                        try {
                                            return fsFactory.create(tmpPath.toUri());
                                        } catch (IOException e) {
                                            throw new RuntimeException(e);
                                        }
                                    });
                }

                FileSystemCommitter committer =
                        new FileSystemCommitter(
                                fsFactory,
                                msFactory,
                                overwrite,
                                tmpPath,
                                partitionColumns.length,
                                false,
                                identifier,
                                staticPartitions,
                                policies);
                committer.commitPartitions();
            } catch (Exception e) {
                throw new TableException("Exception in two phase commit", e);
            } finally {
                try {
                    fsFactory.create(tmpPath.toUri()).delete(tmpPath, true);
                } catch (IOException ignore) {
                }
            }
        } catch (AlreadyExistsException alreadyExistsException) {
            if (!ignoreIfExists) {
                throw new FlinkHiveException(alreadyExistsException);
            }
        } catch (Exception e) {
            throw new FlinkHiveException(e);
        } finally {
            client.close();
        }
    }

    @Override
    public void abort() {
        // do nothing
    }

    public void setFsFactory(FileSystemFactory fsFactory) {
        this.fsFactory = fsFactory;
    }

    public void setMsFactory(TableMetaStoreFactory msFactory) {
        this.msFactory = msFactory;
    }

    public void setOverwrite(boolean overwrite) {
        this.overwrite = overwrite;
    }

    public void setTmpPath(Path tmpPath) {
        this.tmpPath = tmpPath;
    }

    public void setPartitionColumns(String[] partitionColumns) {
        this.partitionColumns = partitionColumns;
    }

    public void setDynamicGrouped(boolean dynamicGrouped) {
        this.dynamicGrouped = dynamicGrouped;
    }

    public void setStaticPartitions(LinkedHashMap<String, String> staticPartitions) {
        this.staticPartitions = staticPartitions;
    }

    public void setIdentifier(ObjectIdentifier identifier) {
        this.identifier = identifier;
    }

    public void setPartitionCommitPolicyFactory(
            PartitionCommitPolicyFactory partitionCommitPolicyFactory) {
        this.partitionCommitPolicyFactory = partitionCommitPolicyFactory;
    }

    public Table getTable() {
        return table;
    }
}

Jdbc Demo

JdbcTableSink implements the applyStaging API:

Code Block

language	java

@Override
public StagedTable applyStaging(StagingContext context) {
    ... ...
    StagedTable stagedTable = new JdbcStagedTable(
            new ObjectPath(tablePath.getDatabaseName(), tablePath.getObjectName() + "_" + System.currentTimeMillis()),
            tablePath,
            tableSchem,
            jdbcUrl,
            jdbcUserName,
            jdbcPassword);

    return stagedTable;
}

JdbcStagedTable implements the core logic

Code Block

language	java

/** An implementation of {@link StagedTable} for Jdbc to support atomic ctas. */
public class JdbcStagedTable implements StagedTable {

    private final ObjectPath tmpTablePath;

    private final ObjectPath finalTablePath;

    private final Map<String, String> schema;

    private final String jdbcUrl;

    private final String userName;

    private final String password;

    public JdbcStagedTable(
            ObjectPath tmpTablePath,
            ObjectPath finalTablePath,
            Map<String, String> schema,
            String jdbcUrl,
            String userName,
            String password) {
        this.tmpTablePath = tmpTablePath;
        this.finalTablePath = finalTablePath;
        this.schema = schema;
        this.jdbcUrl = jdbcUrl;
        this.userName = userName;
        this.password = password;
    }

    @Override
    public void begin() {
        // create tmp table, writing data to the tmp table
        Connection connection = getConnection();
        connection
                .prepareStatement("create table " + tmpTablePath.getFullName() + "( ... ... )")
                .execute();
    }

    @Override
    public void commit() {
        // Rename the tmp table to the final table name
        Connection connection = getConnection();
        connection
                .prepareStatement(
                        "rename table "
                                + tmpTablePath.getFullName()
                                + " to "
                                + finalTablePath.getFullName())
                .execute();
    }

    @Override
    public void abort() {
        // drop tmp table
        Connection connection = getConnection();
        connection.prepareStatement("drop table " + tmpTablePath.getFullName()).execute();
    }

    private Connection getConnection() {
        // get jdbc connection
        return JDBCDriver.getConnection();
    }
}

Compatibility, Deprecation, and Migration Plan

...
