Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.


Motivation

In data streaming process there may be data errors and other issues, and we need to correct the data in the flow. This situation is very common and important. However, in this process, we do not want to affect existing data processing to avoid impact on users. We need to create a new data streaming process and wait for it to catch up with the data and replace the original data streaming process. The main operations can be divided into the following steps:

  1. Create a replica table based on the specified tag/snapshot of upstream and downstream Paimon Tables

  2. Resubmit all streaming jobs, incremental or full recovery starting from the specified offset

I think we We need to support branching in Paimon . Then for the above data correction progress, then we could create replica tables to avoid copying all data from specified tables and increase storage space.
Besides the above, branching in Paimon can also be used to enhance tag. for For Tag simulation of traditional Hive partition tables, provide data correction capabilities on the basis of Tag, which can be used to supplement data and achieve precise segmentation capabilities.
Above all, the branch we would like to introduce in Paimon has the following abilities:

  1. Each table only has one main branch, and other branches can only be created from the specified tag of the main branch

  2. Create Users can create or delete a branch for tables in Paimon, and create a tag tags for a specified branch.

  3. Update schema for the branch, such as altering tables to add/drop columns.

  4. Jobs can streaming/batch read from and write data to in the branch

  5. There are merge merging and replace replacing operations from branch to main, and after replace main with given branch, the previous main branch will be deleted.

...

draw.io Diagram
bordertrue
diagramName2
simpleViewerfalse
width
linksauto
tbstyletop
lboxtrue
diagramWidth591
revision12

There is a main branch file  in the branch base directory of table and it has the main branch name in the file. Besides that, there will be multiple branch directories and each branch has snapshot, tag and schema in its directory.

NOTICE: By default, the Snapshot、Schema and Tag of main branch will be in the base directory of table as previously. The main branch will be used to read and write when there's no specified branch or main branch file in the table.

Create Branch

There will be a series of snapshots, tags and schemas in the main branch of a Paimon table. We can create a new branch with branch - name from the tag for the table. To do that, Paimon will should create a new directory with the given branch name, copy the specified tag, snapshot and schema from the main branch to the new branch.

...

We need to support replacing the main branch with a branch without affecting streaming and batch data read and write on the branch. To achieve this, we need to do the following steps:

  1. Calculate and copy the snapshots, tags and schemas which should be copied from the main branch to target branch

  2. Update the Main Branch File to the target branch

  3. Drop the previous main branch, including snapshots, tags and schemas.

...

We propose to provide two Flink actions for users to control the creating, deleting, merging and replacing of branches.

action

argument

note

create-branch

--name <branch-name>: specify the name of the branch.
-- tag <tag-name>: specify the name of a tag.

create a branch based on the given tag.

delete-branch

--name <branch-name>: specify which branch will be deleted.

delete a branch.

merge-branch

--name <branch-name>: merge specified branch to main.

merge specified branch to main.

replace-main-branch

--name <branch-name>: replace main branch with specified branch.

replace the main branch with a specified branch.

Branch System Table

We propose introducing a system table $branches. The schema is:

Field Name

Field Type

Comment

name

string

The branch name

tag_name

string

The created tag for the branch

tagged_snapshot_id

bigint

The snapshot id for the tag.

Expiring Snapshot

We already had a mechanism to find deletion candidates when expiring snapshots. After adding branches for a Paimon table, it needs to traverse all branches to check whether a snapshot can be deleted.

...