...

2) When serializing the catalog, we only need to serialize and save the catalog name (my_catalog) and its properties, like this:

my_catalog

{'type'='jdbc', 'default-database'='...', 'username'='...', 'password'='...', 'base-url'='...'}

The advantages of this solution are a simple design, easy compatibility, and reduced implementation complexity for the user; it also does not require complex serialization and deserialization tools.
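The name-plus-properties idea can be illustrated with a minimal sketch. Everything here is hypothetical: `CatalogSnapshot`, `encode` and `decode` are illustrative names and not part of Flink, and the line-based encoding assumes that keys contain no `=` and that neither keys nor values contain newlines.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical holder for what is saved: the catalog name plus its WITH properties. */
final class CatalogSnapshot {
    final String catalogName;
    final Map<String, String> properties;

    CatalogSnapshot(String catalogName, Map<String, String> properties) {
        this.catalogName = catalogName;
        // Copy so later changes to the caller's map do not leak in.
        this.properties = new LinkedHashMap<>(properties);
    }

    /** First line is the catalog name, then one key=value line per property. */
    String encode() {
        StringBuilder sb = new StringBuilder(catalogName);
        for (Map.Entry<String, String> e : properties.entrySet()) {
            sb.append('\n').append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    /** Inverse of encode(); splits each property line at its first '='. */
    static CatalogSnapshot decode(String encoded) {
        String[] lines = encoded.split("\n");
        Map<String, String> props = new LinkedHashMap<>();
        for (int i = 1; i < lines.length; i++) {
            int eq = lines[i].indexOf('=');
            props.put(lines[i].substring(0, eq), lines[i].substring(eq + 1));
        }
        return new CatalogSnapshot(lines[0], props);
    }
}
```

On the receiving side, the decoded name and properties would be enough to recreate the catalog through whatever factory the `type` property selects; that step is left out because it depends on the runtime.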

...

void registerCatalog(String catalogName, CatalogDescriptor catalogDescriptor);

Note: This solution only works if the Catalog is created using DDL, because the Catalog properties can only be obtained from the WITH clause. For a Catalog registered through the TableEnvironment#registerCatalog method, these properties are not available. Therefore, jobs that use TableEnvironment#registerCatalog do not support CTAS for the time being.

Runtime

Provide a job status change hook mechanism on the JM side.

...

Rejected Alternatives

Catalog serialize

This solution only applies when the Catalog is created using DDL, because the Catalog options can only be obtained from the WITH clause. For a Catalog registered through the TableEnvironment#registerCatalog method, the options cannot be obtained. A large number of users register the Catalog with TableEnvironment#registerCatalog in production environments. Considering the above, we reject this plan.

For Catalog, we would add serialize and deserialize APIs, and each Catalog would implement serialization of its own properties. We would save the class name of the Catalog together with the serialized content, like this:

...

Since the Catalog class may not have a parameterless constructor, we cannot use Class#newInstance to initialize an object; the objenesis framework can solve this. After using objenesis to obtain an empty Catalog instance, we get the real Catalog instance through the Catalog#deserialize API. This solves the serialization/deserialization problem of CatalogBaseTable and Catalog.

For example, JdbcCatalog#serialize can save catalogName, defaultDatabase, username, pwd, baseUrl, and JdbcCatalog#deserialize can re-initialize a JdbcCatalog object through these parameters; HiveCatalog#serialize can save catalogName, defaultDatabase, hiveConf, hiveVersion, and HiveCatalog#deserialize can re-initialize a HiveCatalog object through these parameters; InMemoryCatalog#serialize only needs to save the catalogName and defaultDatabase, and InMemoryCatalog#deserialize can re-initialize an InMemoryCatalog object through these two parameters.
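A rough sketch of the class-name-plus-content scheme described above, for illustration only. `SerializableCatalog`, `ToyInMemoryCatalog` and `CatalogCodec` are hypothetical names, not Flink classes. To stay self-contained, the sketch replaces objenesis with a plain registry of suppliers that hand out empty instances; this is a swapped-in technique, not the one the proposal describes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical API from this alternative; not part of Flink's Catalog interface. */
interface SerializableCatalog {
    String serialize();
    SerializableCatalog deserialize(String content);
}

/** Toy stand-in for an in-memory catalog: only name and default database are saved. */
final class ToyInMemoryCatalog implements SerializableCatalog {
    final String catalogName;
    final String defaultDatabase;

    ToyInMemoryCatalog(String catalogName, String defaultDatabase) {
        this.catalogName = catalogName;
        this.defaultDatabase = defaultDatabase;
    }

    @Override
    public String serialize() {
        return catalogName + "\n" + defaultDatabase;
    }

    @Override
    public SerializableCatalog deserialize(String content) {
        // Called on an empty instance; returns the real, initialized catalog.
        String[] parts = content.split("\n", 2);
        return new ToyInMemoryCatalog(parts[0], parts[1]);
    }
}

final class CatalogCodec {
    // Assumption: stands in for objenesis by mapping a class name to an empty instance.
    static final Map<String, Supplier<SerializableCatalog>> EMPTY_INSTANCES = new HashMap<>();

    /** Returns {class name, serialized content}. */
    static String[] save(SerializableCatalog catalog) {
        return new String[] {catalog.getClass().getName(), catalog.serialize()};
    }

    static SerializableCatalog restore(String className, String content) {
        Supplier<SerializableCatalog> empty = EMPTY_INSTANCES.get(className);
        if (empty == null) {
            throw new IllegalArgumentException("Unknown catalog class: " + className);
        }
        return empty.get().deserialize(content);
    }
}
```

Even in this toy form, every catalog implementation has to supply its own serialize and deserialize logic, which is the per-catalog cost this alternative imposes.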

The tables in the InMemoryCatalog already exist in the external system. The metadata held in the InMemoryCatalog is only used by the job itself and is held only in memory. Therefore, none of the metadata in the InMemoryCatalog needs to be serialized and passed to the JM. On the JM side, we only need to initialize a new InMemoryCatalog.

The serialization tooling for this solution is more complex to implement, and user-defined Catalogs become more expensive to implement, so this solution is abandoned.

References

  1. Support SELECT clause in CREATE TABLE(CTAS)
  2. MySQL CTAS syntax
  3. Microsoft Azure Synapse CTAS
  4. LanguageManual DDL#Create/Drop/ReloadFunction
  5. Spark Create Table Syntax

...

  • Streaming mode requires the table to be created first (metadata sharing), so that downstream jobs can consume it in real time.
  • In most cases, streaming jobs do not need to be cleaned up even if the job fails (for example, Redis cannot be cleaned up unless all written keys are recorded).
  • Batch jobs try to ensure final atomicity (if the job succeeds, the data becomes visible; otherwise, the metadata is dropped and the temporary data is deleted).
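The batch behaviour in the last bullet can be sketched as a job status hook. This is a minimal illustration under assumptions: `JobStatusListener`, `ToyStore` and `BatchCtasHook` are hypothetical names, and the in-memory lists stand in for a real catalog and real storage.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical hook interface; names are illustrative, not a confirmed API. */
interface JobStatusListener {
    void onCreated();
    void onFinished();
    void onFailed(Throwable cause);
}

/** Hypothetical stand-in for a catalog plus temporary and final storage. */
final class ToyStore {
    final List<String> tables = new ArrayList<>();
    final List<String> tempData = new ArrayList<>();
    final List<String> visibleData = new ArrayList<>();
}

/** Creates the table up front, publishes data on success, cleans up on failure. */
final class BatchCtasHook implements JobStatusListener {
    private final ToyStore store;
    private final String tableName;

    BatchCtasHook(ToyStore store, String tableName) {
        this.store = store;
        this.tableName = tableName;
    }

    @Override
    public void onCreated() {
        store.tables.add(tableName);
    }

    @Override
    public void onFinished() {
        // Job succeeded: temporary data becomes visible.
        store.visibleData.addAll(store.tempData);
        store.tempData.clear();
    }

    @Override
    public void onFailed(Throwable cause) {
        // Job failed: delete temporary data and drop the table metadata.
        store.tempData.clear();
        store.tables.remove(tableName);
    }
}
```

A streaming job would typically use only the creation step, since cleanup after failure is often not possible for the sink.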

...