Status

Current state: "Under Discussion",

Discussion thread: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-64-Support-for-Temporary-Objects-in-Table-module-td32684.html

JIRA:

Released: <Flink Version>

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

As part of FLIP-30 a Catalog API was introduced that enables storing table meta objects permanently. At the same time the majority of current APIs create temporary objects that cannot be serialized. This FLIP aims to clarify the creation of meta objects (tables, views, functions) in a unified way.

Another current problem in the API is that all the temporary objects are stored in a special built-in catalog, which is not very intuitive for many users, as they must be aware of that catalog to reference temporary objects.

Lastly, different APIs have different ways of providing object paths. Either via

String path…,
String path, String pathContinued…
String name

We should choose one approach and unify it across all APIs.

Public Interfaces

registerTable & registerDataStream

The naming of Table objects is actually quite misleading. The Table object represents a relational query, which is actually a view rather than a Table. The difference between a view and a table is primarily that a Table is a physical storage of data. Whereas view is a virtual table on top of Tables that does not materialize data. Thus the flink org.apache.flink.table.api.Table object is actually a SQL View. The same applies to a DataStream, which is also a way to extract data from persistent storage and apply transformations on top of it.

Moreover for DataStream we should only support temporary Views, as there is no way (as of now) to persist them.

We use the “create” prefix rather than “register” to be closer to SQL DDL.

Suggested methods:

void createTemporaryView(String path, Table view);
void createTemporaryView(String path, DataStream view);

registerTableSink & registerTableSource

We suggest to drop those methods entirely as they are misleading what they actually do. TableSource & Sinks long-term are meant to support the physical representation of the data without the logical part as e.g. computed columns (watermarks etc.). Those will be part of the CatalogTable abstraction.

In other words the TableSource & TableSink are to physical to be exposed for the inline declaration. They are meant for predefined connectors. If a user wants to read from some inline source, it can be done with from/toDataStream.

This should be replaced with the properties approach (DDL, descriptor).

registerScalarFunction, registerAggregateFunction & registerTableFunction

As a result of the rework of the type system in UDFs, we will be able to merge the three methods into a single one.

We use the “create” prefix rather than “register” to be closer to SQL DDL.

Suggested methods:

void createTemporaryFunction(String path, UserDefinedFunction function);

ConnectTableDescriptor#registerTableSink,registerTableSource,registerTableSourceAndSink

The table descriptor describes properties of an external system, the physical data format and logical type of the data. Therefore it represents a Table concept. For queries that do not want to modify metastore permanently it makes sense to introduce a temporary table concept.

IMPORTANT: Flink does not own data, just the meta information. Temporary table means that the meta information is available for the session only. If some data was written to a temporary table, the data will not be dropped at the end of the session.

We use the “create” prefix rather than “register” to be closer to SQL DDL.

Suggested methods:

class ConnectTableDescriptor {
   void createTemporaryTable(String path);
}

The method would store the properties only. The source & sink lookup would happen later when reading or writing from/to table. Therefore we do not need to separate source & sink registration.

NOTE: We should not support CREATE TEMPORARY TABLE … AS SELECT syntax. As mentioned above flink does not own the data. Therefore this statement should not be supported in Flink. In Flink’s statement, such a query can be expressed with CREATE TEMPORARY VIEW.

Dropping temporary objects

The temporary objects can shadow permanent objects. Therefore it is vital to enable dropping them to switch from temporary (usually used for experimentations) to permanent objects. We suggest to introduce a separate methods for temporary objects to make the distinction really clear which objects are dropped. The dropTemporary* methods would remove only the temporary objects. They would not take permanent objects into consideration. The same should apply for the regular drop methods. They should only apply to permanent tables, but should throw an exception if a temporary object with same identifier exists. The methods would return true if an object existed under given path and was removed.

Suggested methods:

boolean dropTemporaryView(String path);
boolean dropTemporaryView(String path);
boolean dropTemporaryFunction(String path);
boolean dropTemporaryFunction(String path);

Summary:

Methods of TableEnvironment

Current call	Replacement	Comment
registerTable	createTemporaryView	For the non temporary part we need to make `QueryOperation` string serializable.
registerTableSink	(Deprecate)
registerTableSource	(Deprecate)
registerDataStream	createTemporaryView
registerScalarFunction/ registerAggergateFunction/ registerTableFunction	createTemporaryFunction	We can unify the 3 methods into one once we rework type inference for UDFs

New suggested methods:

Current call	Comment
dropTemporaryTable
dropTemporaryView
dropTemporaryFunction

Methods of ConnectTableDescriptor

Current call	Replacement	SQL equivalent	Comment
registerTableSource	(Deprecate)	-
registerTableSink	(Deprecate)	-
registerTableSourceAndSink	createTemporaryTable	CREATE TEMPORARY TABLE	We should not support CREATE TEMPORARY TABLE AS SELECT = CREATE TEMPORARY VIEW

Persistent API - not part of the FLIP

Implementation, nor design of those API calls is not part of the FLIP. It is just to show that the permanent API is a separate concept that requires further work.

Call to add	SQL equivalent	Comment
createView(Table)	CREATE VIEW	We need to make `QueryOperation`s string serializable.
createTable(TableDescriptor)	CREATE TABLE	We need to rework TableDescriptor. We can temporary use ConnectTableDescriptor#connect
createFunction	CREATE FUNCTION	We need serializable function representation

Referencing objects

I suggest to change the way we address objects in the API to unify it across SQL/Table API & different objects. We should always specify path as a single string and parse it into a catalog/database/object-name subparts.

Affected APIs

We should deprecate:

TableEnvironment

Table scan(String... tablePath);
void insertInto(Table table, String sinkPath, String... sinkPathContinued);

Table

void insertInto(QueryConfig conf, String tablePath, String... tablePathContinued);
void insertInto(String tablePath, String... tablePathContinued);

We suggest to replace those calls with

TableEnvironment

Table from(String tablePath)
void insertInto(String sinkPath, Table table);

Table

void insertInto(String sinkPath) -

we need to immediately drop the “void insertInto(String tablePath, String... tablePathContinued);” for this to work. Otherwise this call would be ambiguous:

Table t = …
t.insertInto(“db.sink”)

Parsing logic

Parsing logic should follow the SQL standard logic for identifiers

Identifier should be 1-3 part identifier
Parts should be delimited with a . dot
Users can escape parts of identifier with ` backtick
Users can escape backtick by duplicating it

Proposed Changes

Assumption:

All objects are identified with 3 part identifiers (catalog name, database name, object name).

We suggest to enable overriding catalog objects with temporary objects. This means it would be possible to register temporary table with identifier `cat1`.`db1`.`tmpTab`. Even if `tmpTab` exists in the `cat1` catalog in `db1` database. Moreover we suggest to allow registering temporary objects in a path that does not exist. This means a user can register e.g. a table in a catalog or database that does not exist.

The benefit of this approach is that it makes experiments way easier. For exploratory purposes a user can use a temporary table. Once the results are verified. The user can drop the temporary table and rerun the exact same query on top of the permanent table.

The tables & views are always identified with a 3 part path. Because it is not always the case for functions the function resolution is discussed separately.

If a user provides a not fully qualified identifier, it is first resolved to 3 part one and then the identifier is:

first evaluated in the temporary map
catalog & database

In case of functions as discussed in FLIP-57 they can have 1 or 3 part identifiers.

1 part for built-in functions
3 part for catalog functions

We suggest to always treat the temporary function in the 3 part category.

Therefore the resolution logic would be following:

built-in functions (1-part name)
temporary functions (3-part path, expanded if provided with less than 3 parts)
catalog functions (3-part path, expanded if provided with less than 3 parts)

For the write path, all temporary objects would always be expanded to 3 part identifiers, with the current catalog & current database if needed.

We suggest to introduce an in-memory maps in CatalogManager:

private final Map<ObjectIdentifier, CatalogView> temporaryViews;
private final Map<ObjectIdentifier, CatalogTable> temporaryTables;
private final Map<ObjectIdentifier, CatalogFunction> temporaryFunctions;

References:

How other systems handle temporary objects:

MySQL:

MySQL allows creating temporary tables in any schema (even if the target database does not exist). Those temporary tables take precedence over permanent tables with the same name (and schema). Therefore user cannot access a permanent table unless the temporary table is dropped.

MySQL adds also a special syntax for dropping temporary tables (DROP TEMPORARY TABLE) in order to prohibit dropping permanent tables when the intention was to drop a temporary one.

https://dev.mysql.com/doc/refman/8.0/en/create-temporary-table.html

http://www.mysqltutorial.org/mysql-temporary-table/

Hive:

Hive implements similar behavior to MySQL. The difference is that the database must exist. Hive also does not add the DROP TEMPORARY TABLE syntax.

In hive temporary functions must not have a database.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-TemporaryTables

SQL Server:

SQL server reserves a special schema for temporary tables (dbo). It also forces user to prefix table names with ‘#’ character. This needed to differentiate if the table should be temporary or permanent. Therefore, it is not possible to override a permanent table.

If you user includes a schema_name when he/she creates or accesses a temporary table, it is ignored. All temporary tables are created in the dbo schema.

https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-transact-sql?view=sql-server-2017

https://docs.microsoft.com/de-de/sql/relational-databases/databases/tempdb-database?view=sql-server-2017

Oracle:

Oracle implements similar behavior to SQL Server.

https://oracle-base.com/articles/18c/private-temporary-tables-18c

Postgres:

Postgres allows overriding permanent tables with temporary objects, but does not allow arbitrary schemas. Objects identifiers consist of only a single object name.

https://www.postgresql.org/docs/9.3/sql-createtable.html

Compatibility, Deprecation, and Migration Plan

We deprecate all methods mentioned above and remove them in the next release after the one when we deprecated them

Test Plan

Describe in a few sentences how the FLIP will be tested. We are mostly interested in system tests (since unit-tests are specific to implementation details). How will we know that the implementation works as expected? How will we know nothing broke?

Rejected Alternatives

If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other way.

Page tree

FLIP-64: Support for Temporary Objects in Table module

Status

Motivation

Public Interfaces

registerTable & registerDataStream

registerTableSink & registerTableSource

registerScalarFunction, registerAggregateFunction & registerTableFunction

ConnectTableDescriptor#registerTableSink,registerTableSource,registerTableSourceAndSink

Dropping temporary objects

Summary:

Methods of TableEnvironment

New suggested methods:

Methods of ConnectTableDescriptor

Persistent API - not part of the FLIP

Referencing objects

Affected APIs

Parsing logic

Proposed Changes

References:

MySQL:

Hive:

SQL Server:

Oracle:

Postgres:

Compatibility, Deprecation, and Migration Plan

Test Plan

Rejected Alternatives