
Status

Current state: Under Discussion

JIRA:

Released:

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

A partition is a division of a logical database or its constituent elements into distinct independent parts. Database partitioning is normally done for manageability, performance or availability reasons, or for load balancing[1].

Partitioning is widely used in Hive. In the ETL domain especially, most tables have partition attributes, which allow users to process data incrementally. Partitioning also makes data management more convenient; partitioning by time and by business dimension are both common.

Goals and non-goals

Goals

Table partitioning means dividing table data into parts based on the values of particular columns.

  • Partitioning in Flink only supports single-value list partitions, meaning a table can only be partitioned by the specific values of its partition columns. Hash, range, and other partitioning criteria are not supported.
  • Both regular tables and temporary tables support partitioning; with PartitionableTableSource and PartitionableTableSink, users can perform the reads and writes described below on temporary tables too.

Read:

  • Partition pruning: partitioned tables support partition pruning, which means users can specify which partitions to read and avoid scanning the entire table.
  • Regular read: without partition pruning, all partition data is read, and SELECT * includes the partition columns.

Write: Flink does not require users to create partitions in advance; partitions are created automatically during writing.

  • Static partition write: users can specify which partition to write to.
  • Dynamic partition write: partitions are determined by the data itself; many partitions may be generated from one write.
  • Streaming writes to partitions should support exactly-once semantics.

Connectors:

  • Introduce a file system connector that supports partitions.
  • Improve partition support in the Hive connector.

Non-goals

  • Although a queue connector could map table partitions to the partition concept of the underlying queue (like Kafka partitions), supporting table partitions in streaming queue connectors such as Kafka is not a goal of this ticket.

Background

Partition in traditional databases

Partitioning in traditional databases is very complex; they support rich partitioning criteria, including:

  • list partition
  • range partition
  • hash partition
  • subpartition

The DDL in traditional databases looks like:

CREATE TABLE pageview(
  user VARCHAR(100),
  cnt INT,
  date VARCHAR(100))
PARTITION BY LIST (date) (
  PARTITION day1 VALUES('2019-8-28'),
  PARTITION day2 VALUES('2019-8-29'),
  PARTITION day3 VALUES('2019-8-30')
);

Note:

  • date is a reference to a field defined in the DDL.
  • Partition values need to be stored in the real data, because these databases support rich partitioning criteria.

Partition in Hive

In today's big data systems, the notion of partitioning mainly comes from Hive. A Hive partition is similar to the single-value list partition of traditional databases; there is currently no need to support richer partitioning criteria.

The CREATE DDL looks like:

CREATE TABLE page_view(
  user STRING,
  cnt INT)
PARTITIONED BY (date STRING);

Users can query with WHERE date = '2019-8-28' to get high-performance partition pruning.

Note:

  • date is not a reference to a field defined in the DDL.
  • A partition field can not also appear in the table's declared fields; otherwise an error is raised.
  • Partition field data is not stored in the real data; it is only used in the directory structure.

Partition in Spark

Spark supports Hive's PARTITIONED BY when using the Hive catalog, and it also introduced its own PARTITIONED BY DDL for the in-memory catalog. (The two are mutually exclusive.)

In SPARK-7654, Spark introduced a partition interface to the Dataset API.

In SPARK-14954, Spark introduced PARTITIONED BY to the CREATE TABLE DDL.

The DDL looks like:

CREATE TABLE page_view(
  user STRING,
  cnt INT,
  date STRING)
PARTITIONED BY (date);

  • date is a reference to a field defined in the DDL.
  • But the partition field data is still not stored in the real data (the FileFormat cannot see the partition columns); it is only used in the directory structure. So the real data layout differs from the CREATE DDL definition.

Disadvantage: this disrupts the layout of the real data, and partition columns may sit in the middle of non-partition columns, which makes the stored data look strange.

Partition Pruning

Hive/Spark partition pruning

Hive and Spark use the catalog to do partition pruning. If MySQL is used as the catalog storage, the partition filter is pushed down into the MySQL query.

This is the most efficient pruning method, and it puts little pressure on the catalog and the client.

Databricks delta partition pruning

Databricks Delta is a transactional storage layer designed specifically for Apache Spark and Databricks File System. It has no catalog and focuses on transactions. It does partition pruning by launching a Spark SQL job: first it reads the checkpoint and change log to get the current list of readable files, then it filters that list according to the condition to get the final partitions.

The main reason is that partition pruning is heavy in Delta: it needs to merge the checkpoint and the change log, and there may be many small files, so a Spark SQL job is needed to complete it.

Proposed Change

Partition SQL

At present, the partitioning we want to support is similar to Hive's and only supports single-value list partitions.

Create Table

CREATE TABLE country_page_view(
  user STRING,
  cnt INT)
PARTITIONED BY (date STRING, country STRING);

The table will be partitioned by two fields.

Why like Hive:

  • We don't need to support rich partitioning criteria, because Hive buckets cover those concepts.
  • The partition column is a pseudo-column, which unifies batch and streaming.
  • Such a grammar isolates the partition columns from the regular columns and avoids confusing the two concepts. (Either way the two could otherwise intersect and make the stored data strange.)

static partitioning insert

Users can specify the value of partition while inserting the data:

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;

INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;

  • INSERT OVERWRITE will overwrite any existing data in the table or partition.
  • INSERT INTO will append to the table or partition, keeping the existing data intact.
  • Both INSERT INTO and INSERT OVERWRITE will create a new partition if the target static partition doesn't exist.

For example:

INSERT INTO TABLE country_page_view PARTITION (date='2019-8-30', country='china') SELECT user, cnt FROM country_page_view_source;

This will create a new partition in country_page_view and insert all data from country_page_view_source into this partition. Users can verify it with the command:

➜ ~ SHOW PARTITIONS country_page_view;

date='2019-8-30',country='china'

dynamic partitioning insert

In dynamic partition inserts, users can give a partial partition specification: specifying only some of the partition column values in the PARTITION clause, or providing no PARTITION clause at all. The engine then determines the partitions dynamically from the values of the partition columns in the source table; in other words, dynamic partition creation is driven by the input data.

INSERT INTO TABLE country_page_view SELECT user, cnt, date, country FROM country_page_view_source;

In this form, the engine determines the distinct values that the partition columns (i.e. date and country) hold in the source table, and creates a partition for each combination.

Different from Hive: Flink automatically infers the partition specification (Hive 3.0.0 also supports this, see HIVE-19083), so users don't need to list the partition columns in a PARTITION clause like this:

INSERT INTO TABLE country_page_view PARTITION (date, country) SELECT user, cnt, date, country FROM country_page_view_source;

Partially specified partition column values are also supported:

INSERT INTO TABLE country_page_view PARTITION (date='2019-8-30') SELECT user, cnt, country FROM country_page_view_source;

NOTE:

  • The dynamic partition columns must be specified last among the columns in the SELECT statement.
  • The dynamic partition columns must be in the same order in which they appear in the CREATE TABLE DDL.
  • Because of dynamic partitioning, both static and dynamic partition columns are stuffed into the Row, so the data received by the sink contains all partition columns.

Behavior of dynamic partition INSERT OVERWRITE:

  • Delete all partition directories that match the static partition values provided in the insert statement (Spark behavior).
  • Only delete the partition directories that actually have data written into them (Hive behavior).

This choice is implementation-dependent, and we don't have an implementation yet.

external partitioned tables

If we already have partition data on the file system and want to load it into the Flink catalog, we need ADD PARTITION grammar.

Suppose we have a table country_page_view: it is a file table and its location is '/user/flink/country_page_view'. Now we have some data for partition (2019-8-30, china) that we want to load into the Flink catalog. We can do:

  • File system operation: move the data to '/user/flink/country_page_view/2019-8-30/china/'
  • ALTER TABLE country_page_view ADD PARTITION (date='2019-8-30', country='china');

NOTE: Using external partitioned tables is optional. Files on the file system can also be loaded into a managed non-partitioned table, from which the data can be inserted into a partitioned table. But with external partitioned tables, users avoid reading and rewriting the real data, which can greatly improve performance.

Partition Read

Partition pruning

One of the main benefits of partitioning is partition pruning: users can select the partitions to read through standard filter conditions, which can greatly improve read efficiency.

Current Blink partition pruning:

FLINK-5859, FLINK-12805, and FLINK-13115 already introduced PartitionableTableSource to Flink and implemented it in the Blink planner. The source interface is:

public interface PartitionableTableSource {

  // Gets all partitions, as a list of partition-column-name to column-value maps.
  List<Map<String, String>> getPartitions();

  // Gets the partition column names.
  List<String> getPartitionFieldNames();

  // Applies the remaining partitions to the table source.
  TableSource applyPartitionPruning(List<Map<String, String>> remainingPartitions);
}

Advantages and disadvantages:

  • The engine automatically prunes the partitions based on the filters and the partition columns; the source doesn't need to do anything.
  • The table source has to fetch all partition values.
  • The problem is that every partition pruning needs all partition values. With thousands of partitions, this puts a lot of pressure on the catalog (for example, MySQL storage). The sketch below illustrates the pruning loop.
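
To make the catalog pressure concrete, here is a minimal sketch of the planner-side pruning loop under the current interface; the Predicate is a hypothetical stand-in for the evaluated pushed-down filter:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Minimal sketch of planner-side pruning with the current interface: it must
// materialize every partition before it can filter.
public final class LegacyPruningSketch {

  static List<Map<String, String>> prune(
      List<Map<String, String>> allPartitions,      // source.getPartitions(), possibly thousands
      Predicate<Map<String, String>> filter) {      // e.g. p -> "2019-8-30".equals(p.get("date"))
    List<Map<String, String>> remaining = new ArrayList<>();
    for (Map<String, String> partition : allPartitions) {
      if (filter.test(partition)) {
        remaining.add(partition);
      }
    }
    return remaining; // then: source.applyPartitionPruning(remaining)
  }
}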

New PartitionableTableSource

public interface PartitionableTableSource {

  // Gets the partition column names.
  List<String> getPartitionFieldNames();

  // Applies the partition predicates to the table source.
  TableSource applyPartitionPruning(List<Expression> partitionPredicates);
}

How partition pruning is done depends entirely on the TableSource's own implementation:

  • The table source can use the catalog to do partition pruning. For example, the Hive table source can obtain its catalog at creation time in HiveTableFactory.
  • Without a catalog, the table source can list subdirectories and filter them by name (see the sketch below).
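
As an illustration of the second option, here is a minimal sketch of catalog-less pruning that lists Hive-style "column=value" subdirectories with java.io.File; the Predicate is a hypothetical stand-in for the partitionPredicates handed to applyPartitionPruning:

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Minimal sketch of catalog-less pruning: list the partition subdirectories
// and filter them by name.
public final class DirectoryPruningSketch {

  static List<File> prunePartitionDirs(File tableLocation, Predicate<String> valueFilter) {
    List<File> remaining = new ArrayList<>();
    File[] children = tableLocation.listFiles(File::isDirectory);
    if (children == null) {
      return remaining;
    }
    for (File dir : children) {          // e.g. "date=2019-8-30"
      String[] kv = dir.getName().split("=", 2);
      if (kv.length == 2 && valueFilter.test(kv[1])) {
        remaining.add(dir);
      }
    }
    return remaining;
  }
}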

Without Partition pruning

The data of all partitions will be read. Users can tell which partition a row belongs to from the partition columns in the data.

Partition write

Static Partition

Static partition writing is basically the same as non-partitioned writing; the only difference is that the final file's directory must contain the partition's subdirectory.
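
For illustration, here is a minimal sketch of deriving that partition subdirectory from an ordered partition spec, assuming a Hive-style "column=value/column=value" layout:

import java.util.LinkedHashMap;

// Minimal sketch of deriving the partition subdirectory from an ordered
// partition spec (column order matters, hence LinkedHashMap).
public final class PartitionPathSketch {

  static String partitionPath(LinkedHashMap<String, String> spec) {
    StringBuilder path = new StringBuilder();
    spec.forEach((col, val) -> path.append(col).append('=').append(val).append('/'));
    return path.toString(); // e.g. "date=2019-8-30/country=china/"
  }
}

For example, the spec {date=2019-8-30, country=china} yields "date=2019-8-30/country=china/", which is appended to the table location.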

Dynamic Partition

We have already covered the grammar of dynamic partitioning; here we focus on its implementation and its impact on the sink interface.

There are two kinds of writing formats:

  • Small writing buffers, like CSV/Text: in this case, with dynamic partitioning, we can write multiple files simultaneously in one sink task.
  • Large writing buffers, like ORC/Parquet: in this case, with dynamic partitioning, we can not write multiple files simultaneously in one sink task; the buffered memory would lead to OOM.

Introduce a new PartitionableTableSink:

public interface PartitionableTableSink {

  // Sets the static partition into the TableSink.
  void setStaticPartition(Map<String, String> partitions);

  // Gets the dynamic partition column names.
  List<String> getDynamicPartitionFieldNames();

  // If this returns true, the sink can trust that all records will be grouped by
  // the partition fields before they reach the sink, so it can use its "grouped
  // multi-partition writer". If it returns false, no partition grouping is done.
  // If this method is never invoked, the execution mode (streaming mode) doesn't
  // support grouping, and the sink should use its "ungrouped multi-partition
  // writer" when there are dynamic partitions.
  boolean enableDynamicPartitionGrouping();
}
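
To show the intended call sequence, here is a minimal sketch of how the planner might drive this interface for INSERT INTO country_page_view PARTITION (date='2019-8-30') SELECT ...; the helper class is purely illustrative:

import java.util.HashMap;
import java.util.Map;

// Minimal sketch of planner-side sink setup, assuming the
// PartitionableTableSink interface defined above.
public final class SinkSetupSketch {

  static void configure(PartitionableTableSink sink) {
    Map<String, String> staticPart = new HashMap<>();
    staticPart.put("date", "2019-8-30"); // from the PARTITION clause; 'country' stays dynamic
    sink.setStaticPartition(staticPart);
    if (sink.enableDynamicPartitionGrouping()) {
      // The batch planner would add a sort/grouping on the dynamic partition
      // columns here, so the sink sees grouped input and can use its
      // grouped multi-partition writer.
    }
  }
}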

Sink implementations (e.g. HiveTableSink) should provide three writers:

  • single-partition writer: writes data to a single partition (non-dynamic-partition writes).
  • grouped multi-partition writer: inputs are grouped by the dynamic partitions, so only one partition is open at a time (see the sketch below).
  • ungrouped multi-partition writer: writes multiple partitions at the same time, which consumes more memory.
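
Here is a minimal sketch of the grouped multi-partition writer; FormatWriter and openWriter are hypothetical stand-ins for the real format integration:

import java.io.IOException;
import java.util.Objects;

// Minimal sketch of a grouped multi-partition writer: because the input is
// grouped by the dynamic partition columns, at most one file is open at a time.
public final class GroupedPartitionWriterSketch {

  interface FormatWriter {
    void write(Object row) throws IOException;
    void close() throws IOException;
  }

  private String currentPartition;
  private FormatWriter currentWriter;

  void write(String partition, Object row) throws IOException {
    if (!Objects.equals(partition, currentPartition)) {
      if (currentWriter != null) {
        currentWriter.close(); // grouped input: the previous partition is complete
      }
      currentWriter = openWriter(partition);
      currentPartition = partition;
    }
    currentWriter.write(row);
  }

  private FormatWriter openWriter(String partition) {
    throw new UnsupportedOperationException("format-specific, e.g. ORC/Parquet/CSV");
  }
}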

Nice to have: considering that the dynamic partitioning implementation is almost the same for all file formats (Hive/Flink CSV, Parquet, ORC), we need a file format framework to abstract these things.

Streaming partition write

Scenes

There are many scenarios where data is written to a file sink by a streaming job while the same data is analyzed and computed by batch jobs.

  • static partition writing to the sink.
  • dynamic partition writing
    • partitioned by window time, which may be event time or processing time. Without a trigger, the partition column is monotonically increasing.
    • partitioned by regular columns.

Exactly-once semantics

Like StreamingFileSink, the table sink should be integrated with the checkpointing mechanism to provide exactly-once semantics.

A file can be in one of three states: in-progress, pending, or finished. The file currently being written to is in-progress. Once a file is closed for writing, it becomes pending. When a checkpoint succeeds, the currently pending files are moved to finished.

StreamingFileSink already does a lot of great work:

  • It decouples checkpoints from file size: it provides a RollingPolicy abstraction to determine file size. On snapshot, it stores not only the pending files but also the in-progress files. In case of a failure, it restores the pending files and the in-progress files too. (In-progress files are truncated to discard the content that does not belong to that checkpoint; this is achieved with RecoverableWriter.)

Temp files plus renaming versus a recoverable writer:

  • Either way, file visibility still depends on when the checkpoint finishes.
  • Complex formats, such as Hive's, can hardly meet the requirements of a recoverable writer. (Hive only provides the abstract RecordWriter, which hardly supports the features above: flushing to the file system and recording the file offset on snapshot, and truncating redundant file contents on recovery.)

To simplify the current implementation, we only consider the case where file size depends on the checkpoint. The flow is as follows (a minimal code sketch follows the list):

  1. snapshotState(cpId): the file currently being written changes from the in-progress state to the pending state. Store the pending files (including the files of all unfinished checkpoints) in operator state.
  2. notifyCheckpointComplete(cpId): move all pending files whose checkpoint id is less than or equal to cpId to the target directory; those files become finished.
    1. HiveFormat's problem: at this stage, HiveFormat needs to access the Metastore to make the files visible. Only the task side can run logic in notifyCheckpointComplete, which leads to distributed access to the Metastore and causes pressure.
  3. initializeState(restore): copy the pending files from state into memory.
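
A minimal sketch of this flow, with file paths as plain strings and the move-to-target-directory step abstracted as commit (real code would keep pendingFiles in operator state):

import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Minimal sketch of the checkpoint-driven commit protocol described above.
public final class CommitProtocolSketch {

  // files currently being written (filled by the writer as it rolls files)
  private final List<String> inProgress = new ArrayList<>();

  // checkpoint id -> files that became pending at that checkpoint
  private final NavigableMap<Long, List<String>> pendingFiles = new TreeMap<>();

  void snapshotState(long checkpointId) {
    // in-progress -> pending; the map still holds all unfinished checkpoints
    pendingFiles.put(checkpointId, new ArrayList<>(inProgress));
    inProgress.clear();
  }

  void notifyCheckpointComplete(long checkpointId) {
    // finish every pending file of checkpoints <= checkpointId
    pendingFiles.headMap(checkpointId, true).values()
        .forEach(files -> files.forEach(this::commit));
    pendingFiles.headMap(checkpointId, true).clear();
  }

  void initializeState(NavigableMap<Long, List<String>> restored) {
    pendingFiles.putAll(restored); // copy the pending files from state to memory
  }

  private void commit(String pendingFile) {
    // rename/move the file to its final, visible location
  }
}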

Partition support

Streaming write supports both static and dynamic partition tables. The static partition table is simple and behaves just like a regular table; the only extra step is determining the path from the static partition first.

For a dynamic partition table:

  • partitioned by a monotonically increasing column (like window time): the implementation should be the same as the batch grouped multi-partition writer, and only one writer needs to be open at a time.
  • partitioned by regular columns: because in streaming the upstream can not sort all the data, either:
    • open multiple writers at the same time. If the file format is CSV or text, or the number of partitions is small, this is no problem; with Parquet or ORC it consumes too much memory.
    • (Nice to have) accumulate the data of a single checkpoint, sort it on snapshot, and write the partition data one partition at a time (sketched below).
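
A minimal sketch of that nice-to-have option, with rows simplified to a (partition, payload) pair:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Minimal sketch: buffer the rows of one checkpoint, sort them by partition on
// snapshot, then flush one partition at a time with a single open writer.
public final class SortOnSnapshotSketch {

  static final class Row {
    final String partition;
    final Object payload;
    Row(String partition, Object payload) {
      this.partition = partition;
      this.payload = payload;
    }
  }

  private final List<Row> buffer = new ArrayList<>();

  void write(Row row) {
    buffer.add(row); // accumulate within the current checkpoint
  }

  void snapshotState() {
    buffer.sort(Comparator.comparing(r -> r.partition)); // group rows per partition
    // the grouped multi-partition writer above can now flush the rows
    // partition by partition with bounded memory
    buffer.clear();
  }
}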

FileSystemSink

Considering streaming writes and the dynamic partitioning mechanism, we need to implement a FileSystemSink to handle the relevant logic. Subsequent Flink file-related connectors and the HiveSink can be unified on this sink. Formats then only need to implement the relevant interface (a possible shape is sketched after the list), without dealing with streaming exactly-once or partition-related logic.

  • Support single-partition writing
  • Support grouped multi-partition writing
  • Support ungrouped multi-partition writing
  • Support streaming exactly-once by extending the StreamingFileSink mechanism
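
One possible shape for the format-facing interface, purely as a sketch of the division of labor (the name and methods are hypothetical): formats write single files, while the sink owns partition paths, file rolling, and the exactly-once commit protocol.

import java.io.IOException;

// Purely hypothetical sketch of the format-facing interface of FileSystemSink.
public interface FileFormatWriterSketch<T> {

  void open(String path) throws IOException; // the sink decides the (partition) path

  void write(T record) throws IOException;

  void close() throws IOException; // after close, the sink turns the file into pending
}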

Catalog changes

Only the catalog should call the Hive-client-related API; all other places in the Hive connector should go through HiveCatalog. So we should add more methods to cover these requirements.

HiveCatalog

CatalogTable and CatalogPartition should cover the HiveTableSource/HiveTableSink requirements (like Hive's StorageDescriptor). HiveCatalog should add more properties to the map in CatalogPartition (a translation sketch follows the list):

  • String location;
  • String inputFormat;
  • String outputFormat;
  • String serializationLib;
  • boolean compressed;
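
As a sketch of that translation, HiveCatalog could fill the properties from the Hive partition's StorageDescriptor; the property keys below are illustrative, not a fixed contract:

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hive.metastore.api.Partition;
import org.apache.hadoop.hive.metastore.api.StorageDescriptor;

// Minimal sketch of mapping a Hive partition's StorageDescriptor into a
// CatalogPartition property map.
public final class HivePartitionPropertiesSketch {

  static Map<String, String> toProperties(Partition hivePartition) {
    StorageDescriptor sd = hivePartition.getSd();
    Map<String, String> props = new HashMap<>();
    props.put("location", sd.getLocation());
    props.put("input.format", sd.getInputFormat());
    props.put("output.format", sd.getOutputFormat());
    props.put("serialization.lib", sd.getSerdeInfo().getSerializationLib());
    props.put("compressed", String.valueOf(sd.isCompressed()));
    return props; // e.g. wrapped in a CatalogPartition implementation
  }
}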

Partition statistics

  • First, the planner should support statistics of catalog tables.
  • The planner should read partition statistics and feed them to the query optimizer.
  • Related: FilterableTableSource needs to update statistics too.

Right now we don't have a mechanism for a source with pruning or filter push-down to update its statistics; we may need to modify the related TableSource interfaces.

Catalog interface

public interface Catalog {

  void renamePartition(ObjectPath tablePath, CatalogPartitionSpec spec, CatalogPartitionSpec newSpec);

  List<CatalogPartitionSpec> listPartitionsByFilter(ObjectPath tablePath, List<Expression> filters)
      throws TableNotExistException, TableNotPartitionedException, CatalogException;
}

Finally, the Hive table source and table sink should get rid of the Hive client.
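
For example, a table source could resolve its partitions entirely through the extended catalog, assuming the listPartitionsByFilter method proposed above (imports follow Flink's table catalog packages):

import java.util.List;
import org.apache.flink.table.catalog.Catalog;
import org.apache.flink.table.catalog.CatalogPartitionSpec;
import org.apache.flink.table.catalog.ObjectPath;
import org.apache.flink.table.expressions.Expression;

// Minimal sketch of pruning via the extended Catalog instead of the Hive
// client; the catalog can push the filter into its own storage (e.g. the
// Metastore backed by MySQL).
public final class CatalogPruningSketch {

  static List<CatalogPartitionSpec> prune(
      Catalog catalog,
      ObjectPath tablePath,
      List<Expression> partitionPredicates) throws Exception {
    return catalog.listPartitionsByFilter(tablePath, partitionPredicates);
  }
}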

Public Interfaces

DDL

CREATE TABLE country_page_view(
  user STRING,
  cnt INT,
  date STRING,
  country STRING)
PARTITIONED BY (date, country);

The table will be partitioned by two fields.

DML

static partition writing:

INSERT INTO | OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;

dynamic partition writing:

INSERT INTO | OVERWRITE TABLE tablename1 select_statement1 FROM from_statement;

If no partition values are specified, or only some of them are, it is a dynamic partition write.

alter partitions

ALTER TABLE table_name ADD PARTITION partition_spec [, PARTITION partition_spec];

ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec;

-- Move a partition from table_name_1 to table_name_2
ALTER TABLE table_name_2 EXCHANGE PARTITION (partition_spec) WITH TABLE table_name_1;

-- multiple partitions
ALTER TABLE table_name_2 EXCHANGE PARTITION (partition_spec, partition_spec2, ...) WITH TABLE table_name_1;

ALTER TABLE table_name DROP PARTITION partition_spec[, PARTITION partition_spec, ...];

Show

SHOW PARTITIONS lists all the existing partitions of a given base table, in alphabetical order.

SHOW PARTITIONS table_name;

It is also possible to specify parts of a partition specification to filter the resulting list.

SHOW PARTITIONS table_name PARTITION(ds='2010-03-03', hr='12');

Nice to have:

SHOW TABLE EXTENDED [IN|FROM database_name] LIKE 'identifier_with_wildcards' [PARTITION(partition_spec)];

Describe

DESCRIBE [EXTENDED | FORMATTED] [db_name.]table_name [PARTITION partition_spec] [col_name];

TableSource

public interface PartitionableTableSource {

  // Gets the partition column names.
  List<String> getPartitionFieldNames();

  // Applies the partition predicates to the table source.
  TableSource applyPartitionPruning(List<Expression> partitionPredicates);
}

TableSink

public interface PartitionableTableSink {

  // Sets the static partition into the TableSink.
  void setStaticPartition(Map<String, String> partitions);

  // Gets the dynamic partition column names.
  List<String> getDynamicPartitionFieldNames();

  // If this returns true, the sink can trust that all records will be grouped by
  // the partition fields before they reach the sink, so it can use its "grouped
  // multi-partition writer". If it returns false, no partition grouping is done.
  // If this method is never invoked, the execution mode (streaming mode) doesn't
  // support grouping, and the sink should use its "ungrouped multi-partition
  // writer" when there are dynamic partitions.
  boolean enableDynamicPartitionGrouping();
}

Road map

  • Add/Modify DDL support.
  • Change Catalog
  • Rework partition pruning
  • Introduce FileSystemSink
  • Rework dynamic partitioning
  • Fix existing bugs
    • Sync flink partition and hive partition
  • Add tests to partition

Reference

[1] https://en.wikipedia.org/wiki/Partition_(database)

[2] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-DynamicPartitionInserts

[3] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

[4] https://resources.zaloni.com/blog/partitioning-in-hive

[5] https://issues.apache.org/jira/browse/FLINK-5859

Document

https://docs.google.com/document/d/15R3vZ1R_pAHcvJkRx_CWleXgl08WL3k_ZpnWSdzP7GY/edit?usp=sharing
