Discussion thread: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-63-Rework-table-partition-support-td32770.html

JIRA:

...

Vote thread

JIRA: FLINK-14249

Release: 1.10

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

...

Although a queue may distinguish partitions using the partition concept of the underlying system (like Kafka partitions), supporting table partitions in streaming connectors for queues (such as Kafka) is not a goal of this ticket.
Bucket support, to cover hash partitioning in traditional databases and so on.

Background

Partition in traditional databases

...

At present, the partitioning we want to support is similar to that of Hive and only supports single-value list partitions.

Create Table

CREATE TABLE country_page_view(
    user STRING,
    cnt INT)
PARTITIONED BY (date STRING, country STRING);

The table will be partitioned by two fields.

Why like Hive:

We don't need to support rich partitioning criteria, because Hive buckets cover these concepts.
The partition column is a pseudocolumn, which unifies batch and streaming.
Such a grammar isolates the partition columns from the regular columns and avoids confusing users. (Otherwise, the two may intersect and make specific data look strange.)

static partitioning insert

Users can specify the partition values while inserting data:

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;

INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;

The PARTITION clause should contain all partition columns of the table.
The fields returned by the select statement should not contain any of the partition columns.
INSERT OVERWRITE will overwrite any existing data in the table or partition.
INSERT INTO will append to the table or partition, keeping the existing data intact.
Both INSERT INTO and INSERT OVERWRITE will create a new partition if the target static partition doesn't exist.

For example:

INSERT INTO TABLE country_page_view PARTITION (date='2019-8-30', country='china') SELECT user, cnt FROM country_page_view_source;

This will create a new partition in country_page_view and insert all data from country_page_view_source into this partition. Users can verify it with the command:

➜ ~ SHOW PARTITIONS country_page_view;

date='2019-8-30',country='china'
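The rule that a static PARTITION clause must name every partition column can be sketched as a small check. This is only an illustration: StaticPartitionSpec and its methods are hypothetical names, not part of Flink, and the rendering simply mimics the SHOW PARTITIONS line above.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Flink class): validates a static PARTITION clause.
final class StaticPartitionSpec {

    // Fails unless the clause names exactly the partition columns of the table.
    static void validate(List<String> partitionColumns, Map<String, String> partitionClause) {
        for (String column : partitionColumns) {
            if (!partitionClause.containsKey(column)) {
                throw new IllegalArgumentException("Missing partition column: " + column);
            }
        }
        for (String column : partitionClause.keySet()) {
            if (!partitionColumns.contains(column)) {
                throw new IllegalArgumentException("Not a partition column: " + column);
            }
        }
    }

    // Renders the spec in partition-column order, e.g. date='2019-8-30',country='china'.
    static String render(List<String> partitionColumns, Map<String, String> partitionClause) {
        validate(partitionColumns, partitionClause);
        StringBuilder sb = new StringBuilder();
        for (String column : partitionColumns) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(column).append("='").append(partitionClause.get(column)).append('\'');
        }
        return sb.toString();
    }
}
```

A planner-side check of this kind would reject a statement such as PARTITION (country='china') on country_page_view, because the date column is missing.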

dynamic partitioning insert

In dynamic partition inserts, users can give partial partition specifications, which means specifying only some of the partition column values in the PARTITION clause, or not providing a PARTITION clause at all. The engine then dynamically determines the partitions based on the values of the partition columns from the source table. This means that dynamic partition creation is determined by the values of the input columns.

...

In this method, the engine determines the distinct values that the partition columns (i.e. date and country) hold in the source table, and creates a partition for each value.
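As a rough illustration of how distinct partition values lead to partitions, the sketch below groups rows by their partition columns. The row layout ({user, cnt, date, country} as strings), the class name and the date=.../country=... key format are assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration (not Flink code): rows are laid out as
// {user, cnt, date, country}; the last two fields are the partition columns.
final class DynamicPartitioner {

    // Groups rows by partition values; each map key is one partition to create.
    static Map<String, List<String[]>> groupByPartition(List<String[]> rows) {
        Map<String, List<String[]>> partitions = new LinkedHashMap<>();
        for (String[] row : rows) {
            String spec = "date=" + row[2] + "/country=" + row[3];
            partitions.computeIfAbsent(spec, k -> new ArrayList<>()).add(row);
        }
        return partitions;
    }
}
```

Every key of the returned map corresponds to one partition that a dynamic insert would create or append to.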

Different from Hive 2.X and earlier: dynamic partition columns do not need to appear in the PARTITION clause. (Hive 3.0.0 also supports this, see HIVE-19083.)

In Hive 2.X, users need to define the dynamic partition columns in the PARTITION clause like this:

...

INSERT INTO TABLE country_page_view PARTITION (date='2019-8-30') SELECT user, cnt, country FROM country_page_view_source;

...

This is related to the implementation; for now we recommend Hive's behavior.

external partitioned tables

...

FLINK-5859, FLINK-12805 and FLINK-13115 already introduced PartitionableTableSource to Flink and implemented it in the Blink planner. The source interface is:

public interface PartitionableTableSource {
    // Gets all partitions, as a list of maps from partition column name to column value.
    List<Map<String, String>> getPartitions();

    // Gets the partition column names.
    List<String> getPartitionFieldNames();

    // Applies the remaining partitions to the table source.
    TableSource applyPartitionPruning(List<Map<String, String>> remainingPartitions);
}

Advantages and disadvantages:

The engine will automatically prune the partitions based on the filters and partition columns. The source doesn't need to do anything.
The table source needs to get all partition values.
The problem is that every partition pruning needs to get all partition values. When there are thousands of partitions, this puts a lot of pressure on the catalog (for example, with MySQL storage).

New PartitionableTableSource

public interface PartitionableTableSource {
    // Gets the partition column names.
    List<String> getPartitionFieldNames();

    // Applies the partition predicates to the table source.
    TableSource applyPartitionPruning(List<Expression> partitionPredicates);
}

How to do partition pruning depends entirely on the TableSource's own implementation:

The table source can use the catalog to do partition pruning. For example, the Hive table source can get access to its catalog when it is created by HiveTableFactory.
Without a catalog, the table source lists subdirectories and filters them by name.
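The catalog-free case, filtering subdirectories by name, could look roughly like the sketch below. It assumes a Hive-style directory layout (col=value/col=value) and uses a plain java.util.function.Predicate in place of Flink's List<Expression>; DirectoryPartitionPruner is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch (not Flink code): prune partitions by directory name,
// assuming Hive-style paths such as "date=2019-08-30/country=china".
final class DirectoryPartitionPruner {

    // Parses a relative partition path into an ordered column -> value map.
    static Map<String, String> parse(String relativePath) {
        Map<String, String> spec = new LinkedHashMap<>();
        for (String segment : relativePath.split("/")) {
            int eq = segment.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Not a partition directory: " + segment);
            }
            spec.put(segment.substring(0, eq), segment.substring(eq + 1));
        }
        return spec;
    }

    // Keeps only the directories whose partition values satisfy the predicate.
    static List<String> prune(List<String> partitionDirs, Predicate<Map<String, String>> predicate) {
        List<String> remaining = new ArrayList<>();
        for (String dir : partitionDirs) {
            if (predicate.test(parse(dir))) {
                remaining.add(dir);
            }
        }
        return remaining;
    }
}
```

A real source would obtain partitionDirs by listing the table path on the file system and would translate the pushed-down expressions into such a predicate.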

How partition pruning is done depends on the table:

Catalog table: the planner uses the catalog to do partition pruning.
Temporary table: the planner uses all partitions returned by the temporary table and filters them by name.

Add Catalog Api:

List<CatalogPartitionSpec> listPartitionsByFilter(ObjectPath tablePath, List<Expression> filters)

Without Partition pruning

If it is a partitioned catalog table, all partitions registered in the catalog will be read, so the data of all partitions is read out. Users can tell which partition a row belongs to from the partition columns in the data.

...

Small write buffers (like CSV/text): with dynamic partitioning, a sink task can write multiple files simultaneously.
Large write buffers (like ORC/Parquet): with dynamic partitioning, a sink task cannot write multiple files simultaneously; otherwise the memory consumption will lead to OOM.

Introduce new PartitionableTableSink:

public interface PartitionableTableSink {
    // Sets the static partition into the TableSink.
    void setStaticPartition(Map<String, String> partitions);

    // Gets the dynamic partition column names.
    List<String> getDynamicPartitionFieldNames();

    // If this returns true, the sink can trust that all records will be grouped by the
    // partition fields before being consumed, and can use its "grouped multi-partition writer".
    // If it returns false, no partition grouping is needed.
    // If this method is never invoked, the execution mode (streaming mode) does not support
    // grouping, and the sink should use its "ungrouped multi-partition writer" when there
    // are dynamic partitions.
    boolean enableDynamicPartitionGrouping();
}

Sink implementation (HiveTableSink) should provide three writers:

single-partition writer: writes data to a single partition (non-dynamic-partition writes).
grouped multi-partition writer: inputs are grouped by dynamic partitions, so only one partition is written at a time.
ungrouped multi-partition writer: writes multiple partitions at the same time, which consumes more memory.
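The choice between the three writers can be summarized in a few lines. This is a minimal sketch under the assumption that the sink knows its dynamic partition columns and whether the planner enabled grouping through enableDynamicPartitionGrouping(); WriterSelection and WriterKind are hypothetical names.

```java
import java.util.List;

// Hypothetical sketch (not Flink code): pick one of the three writers.
final class WriterSelection {

    enum WriterKind { SINGLE_PARTITION, GROUPED_MULTI_PARTITION, UNGROUPED_MULTI_PARTITION }

    // inputGrouped is true only if grouping was requested and the sink accepted it.
    static WriterKind choose(List<String> dynamicPartitionColumns, boolean inputGrouped) {
        if (dynamicPartitionColumns.isEmpty()) {
            // All partition values are static: every record goes to the same partition.
            return WriterKind.SINGLE_PARTITION;
        }
        return inputGrouped
                ? WriterKind.GROUPED_MULTI_PARTITION
                : WriterKind.UNGROUPED_MULTI_PARTITION;
    }
}
```

In streaming mode the grouping method is never invoked, so inputGrouped stays false and the ungrouped writer is used whenever there are dynamic partitions.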

Nice to have: the dynamic partitioning implementation is almost the same for all file formats (Hive/Flink CSV, Parquet, ORC), so we need a file format framework to abstract these things.

Streaming partition write

Scenes

There are many scenarios where data is written to a FileSink by a streaming job while, at the same time, the data is analyzed by batch jobs.

Static partition writing to the sink.
Dynamic partition writing:

Partitioned by window time, which may be event time or processing time. Without triggers, the partition column is monotonically increasing.
Partitioned by regular columns.

Exactly-once semantics

Like StreamingFileSink, the table sink should integrate with the checkpointing mechanism to provide exactly-once semantics.

A file can be in one of three states: in-progress, pending or finished. The file that is currently being written to is in-progress. Once a file is closed for writing, it becomes pending. When a checkpoint succeeds, the currently pending files are moved to finished.

StreamingFileSink does a lot of good work:

It decouples checkpoints from file size. It provides the RollingPolicy abstraction to determine file size. On snapshot, it stores not only the pending files but also the in-progress files. In case of a failure, it restores the pending files and the in-progress files too. (In-progress files are truncated to discard the content that does not belong to that checkpoint. This is achieved by using RecoverableWriter.)

Temp files and renaming versus recoverable writer:

Either way, file visibility still depends on the checkpoint finish time.
Complex formats, such as Hive, can hardly meet the requirements of a recoverable writer. (Hive only provides an abstract RecordWriter, which hardly supports the features above: flushing to the file system and recording the file offset on snapshot, and truncating redundant file contents on recovery.)

To simplify the current implementation, we only consider the case where file size depends on the checkpoint.

snapshotState(cpId): The file currently being written changes from the in-progress state to the pending state. The pending files (including the files of all unfinished checkpoints) are stored in operator state.
notifyCheckpointComplete(cpId): Move all pending files with a checkpoint ID less than or equal to cpId to the target directory; the corresponding files become finished.

HiveFormat's problem: at this stage, HiveFormat needs to access the Metastore to make a file visible. Only the task side can run logic in notifyCheckpointComplete, which leads to distributed access to the Metastore and puts pressure on it.

initializeState(restore): Copy the pending files from state to memory.
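The three callbacks above can be modelled as a small in-memory state machine. The sketch below is only an illustration of the file-state transitions: PendingFileTracker is a hypothetical class, files are represented by their names, and no real file system or Flink state API is involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model (not Flink code) of in-progress/pending/finished files.
final class PendingFileTracker {
    private String inProgressFile;
    // checkpoint id -> files that became pending at that checkpoint
    private final TreeMap<Long, List<String>> pendingFiles = new TreeMap<>();
    private final List<String> finishedFiles = new ArrayList<>();

    void startFile(String file) {
        inProgressFile = file;
    }

    // snapshotState(cpId): in-progress -> pending; returns a copy to put into operator state.
    Map<Long, List<String>> snapshotState(long cpId) {
        if (inProgressFile != null) {
            pendingFiles.computeIfAbsent(cpId, id -> new ArrayList<>()).add(inProgressFile);
            inProgressFile = null;
        }
        Map<Long, List<String>> state = new TreeMap<>();
        for (Map.Entry<Long, List<String>> e : pendingFiles.entrySet()) {
            state.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        return state;
    }

    // notifyCheckpointComplete(cpId): pending files with id <= cpId become finished.
    void notifyCheckpointComplete(long cpId) {
        Map<Long, List<String>> committed = pendingFiles.headMap(cpId, true);
        for (List<String> files : committed.values()) {
            finishedFiles.addAll(files); // a real sink would move the files to the target directory
        }
        committed.clear(); // headMap is a view, so this also removes the entries from pendingFiles
    }

    // initializeState(restore): copy the pending files from state back into memory.
    void initializeState(Map<Long, List<String>> restored) {
        pendingFiles.clear();
        for (Map.Entry<Long, List<String>> e : restored.entrySet()) {
            pendingFiles.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        inProgressFile = null;
    }

    List<String> finishedFiles() {
        return finishedFiles;
    }

    Map<Long, List<String>> pendingFiles() {
        return pendingFiles;
    }
}
```

After a restore, files of an already completed checkpoint may be committed again, so a real implementation would need the move to the target directory to be idempotent.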

Partition support

Stream writing supports both static partition tables and dynamic partition tables. A static partition table is simple: it behaves just like a regular table, except that the path is first determined by the static partition.

For a dynamic partition table:

Partitioned by a monotonic column (like window time): the implementation should be the same as the batch grouped multi-partition writer, and only one writer is open at a time.
Partitioned by regular columns: in streaming, the upstream cannot sort all data, so either:

Open multiple writers at the same time. If the file format is CSV or text, or the number of partitions is small, this is no problem. With Parquet or ORC, it consumes too much memory.
(Nice to have) Accumulate data within a single checkpoint, wait until the snapshot, sort all data, and write the partition data one by one.
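The nice-to-have option, buffering within a checkpoint and sorting at snapshot time, might be sketched as follows. SortOnSnapshotBuffer is a hypothetical name, records are plain strings, and the buffer is held entirely in memory, which a real implementation would have to bound or spill.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch (not Flink code): buffer records during a checkpoint interval,
// then sort by partition so that only one partition writer is open at a time.
final class SortOnSnapshotBuffer {
    private final List<String[]> buffered = new ArrayList<>(); // {partition, record}

    void add(String partition, String record) {
        buffered.add(new String[] {partition, record});
    }

    // Called at snapshot time; returns the order in which partition writers were opened.
    List<String> flush() {
        buffered.sort(Comparator.comparing((String[] r) -> r[0]));
        List<String> opened = new ArrayList<>();
        String current = null;
        for (String[] r : buffered) {
            if (!r[0].equals(current)) {
                current = r[0]; // close the previous writer, open one for the new partition
                opened.add(current);
            }
            // a real writer would append r[1] to the file of partition `current` here
        }
        buffered.clear();
        return opened;
    }
}
```

Because the input is sorted by partition before writing, each partition is opened exactly once per flush, which is the property the grouped multi-partition writer relies on.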

Catalog changes

Only the catalog should call HiveClient-related APIs; all other places in the Hive connector should go through HiveCatalog. So we should add more methods to cover these requirements.

HiveCatalog

CatalogTable and CatalogPartition should cover the requirements of HiveTableSource/HiveTableSink (like Hive's StorageDescriptor). HiveCatalog should add more properties to the map in CatalogPartition:

String location;
String inputFormat;
String outputFormat;
String serializationLib;
boolean compressed;

Partition statistics

First, the planner should support statistics of catalog tables.
The planner should read partition statistics and pass them to the query optimizer.
Related: FilterableTableSource needs to update statistics too.

Currently there is no mechanism for a source to update its statistics after partition pruning or filter push-down. We may need to modify the related TableSource interfaces.

Catalog interface

public interface Catalog {
    void renamePartition(ObjectPath tablePath, CatalogPartitionSpec spec, CatalogPartitionSpec newSpec);

    List<CatalogPartitionSpec> listPartitionsByFilter(ObjectPath tablePath, List<Expression> filters) throws TableNotExistException, TableNotPartitionedException, CatalogException;
}

Finally, the Hive table source and table sink should get rid of the Hive client.

Public Interfaces

DDL

CREATE TABLE country_page_view(
    user STRING,
    cnt INT,
    date STRING,
    country STRING)
PARTITIONED BY (date, country);

The table will be partitioned by two fields.

FileSystemSink

Considering stream writing and the mechanism of dynamic partitioning, we need to implement a FileSink to handle the relevant logic. Subsequent Flink file-related connectors and HiveSink can be unified into this sink. Formats only need to implement the relevant interfaces, without dealing with streaming exactly-once and partition-related logic.

Support single-partition writing
Support grouped multi-partition writing
Support non-grouped multi-partition writing
StreamingFileSystemSink supports streaming exactly-once

We do not recommend using StreamingFileSink to support partitioning in Table:

Its bucket concept seriously conflicts with SQL's bucket concept.
In Table, we need to support single-partition writing, grouped multi-partition writing, and non-grouped multi-partition writing.
We need a global role to commit files to the metastore.
We need an abstraction that supports both streaming and batch mode.
Table partitioning is simpler than StreamingFileSink: we only support partitioning by field references, rather than being as flexible as the runtime.

Flink FileSystem connector

The DDL can look like this:

CREATE TABLE USER_T
......
WITH (
    'connector.type' = 'filesystem',
    'connector.path' = 'hdfs:///tmp/xxx',
    'format.type' = 'csv',
    'update-mode' = 'append',
    'partition-support' = 'true'
)

The only difference from the previous FileSystem connector is that the partition-support attribute is required. We can use this identifier to indicate that the new connector supports partitions, without changing the previous connector. Other attributes can stay completely the same.

'partition-support' = 'true' can be removed once we fully support the CSV format.

And provide table factories:

Provide FileSystemTableFactory: the CSV format and Hive format will use it.
Provide FileSystemTableSink and FileSystemTableSource.
Provide BatchFileSystemSink and StreamingFileSystemSink.

Formats just need to implement:

InputFormat for reading.
RecordWriter and FileCommitter for writing.

A specific format implementation does not need to deal much with the partition concept; it only manages its own reading and writing.

Code prototype: https://github.com/JingsongLi/flink/tree/filesink/flink-table/flink-table-api-java-bridge/src/main/java/org/apache/flink/table/sink/filesystem

DML

static partition writing:

INSERT { INTO | OVERWRITE } TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;

dynamic partition writing:

INSERT { INTO | OVERWRITE } TABLE tablename1 select_statement1 FROM from_statement;

...

ALTER TABLE table_name DROP PARTITION partition_spec[, PARTITION partition_spec, ...]

partition_spec ::= (partition_column = partition_col_value, partition_column = partition_col_value, ...)

Show

SHOW PARTITIONS lists all the existing partitions for a given base table. Partitions are listed in alphabetical order.

...

SHOW TABLE EXTENDED [IN|FROM database_name] LIKE 'identifier_with_wildcards' [PARTITION(partition_spec)];

Describe

DESCRIBE [EXTENDED | FORMATTED] [db_name.]table_name [PARTITION partition_spec] [col_name];

TableSource

public interface PartitionableTableSource {
    // Gets the partition column names.
    List<String> getPartitionFieldNames();

    // Applies the partition predicates to the table source.
    TableSource applyPartitionPruning(List<Expression> partitionPredicates);
}

Catalog interface

public interface Catalog {
    …..

    void renamePartition(ObjectPath tablePath, CatalogPartitionSpec spec, CatalogPartitionSpec newSpec) throws PartitionNotExistException, PartitionAlreadyExistsException, CatalogException;

    void syncPartitions(ObjectPath tablePath) throws TableNotExistException, CatalogException;

    List<CatalogPartitionSpec> listPartitionsByFilter(ObjectPath tablePath, List<Expression> filters) throws TableNotExistException, TableNotPartitionedException, CatalogException;
}

Further discussion

Create DDL

Should we support partition grammar like Spark SQL? (Subsequent votes will be taken to determine.)

CREATE TABLE country_page_view(
    user STRING,
    cnt INT)
PARTITIONED BY (date STRING, country STRING);

The table will be partitioned by two fields.

Recover Partitions (MSCK REPAIR TABLE)

Flink stores a list of partitions for each table in its catalog. If, however, new partitions are directly added to HDFS (say by using hadoop fs -put command) or removed from HDFS, the catalog will not be aware of these changes to partition information unless the user runs ALTER TABLE table_name ADD/DROP PARTITION commands on each of the newly added or removed partitions, respectively.[3]

However, users can run a command with the repair table option:

MSCK REPAIR TABLE table_name;

which adds to the catalog any partitions for which catalog metadata does not already exist. The default option for the MSCK command is ADD PARTITIONS; with this option, it adds any partitions that exist on HDFS.
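Conceptually, the ADD PARTITIONS option is a set difference between what exists on the file system and what the catalog knows. The sketch below illustrates only that idea; PartitionRepair is a hypothetical name and partitions are represented by their relative paths.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch (not Flink code) of the ADD PARTITIONS repair step.
final class PartitionRepair {

    // Returns the partitions found on the file system that are missing from the catalog.
    static Set<String> partitionsToAdd(Set<String> onFileSystem, Set<String> inCatalog) {
        Set<String> missing = new LinkedHashSet<>(onFileSystem);
        missing.removeAll(inCatalog);
        return missing;
    }
}
```

A repair command would then register each returned partition in the catalog; partitions that exist only in the catalog are left untouched by this option.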

TableSink Interface

public interface PartitionableTableSink {

List<String> getPartitionFieldNames();

// set the static partition into the TableSink.

...

boolean enableDynamicPartitionGrouping();

}

Road map

...

Sync flink partition and hive partition

Modify DDL support.
Rework partition pruning
Rework dynamic partitioning
Introduce BatchFileSystemSink
Introduce StreamingFileSystemSink
Introduce FileSystemTableFactory and FileSystemTableSource and FileSystemTableSink
Introduce new CSV for FileSystemTableFactory
Integrate Hive to FileSystemTableFactory

Nice to have:

Integrate the CREATE TABLE DDL (with partitions) with Hive
Push down partition pruning to the Hive metastore
Introduce alter partitions commands
Introduce recover partitions commands
Introduce show/describe partitions commands
Integrate partition statistics to planner

...

Reference

[1] https://en.wikipedia.org/wiki/Partition_(database)

...
