Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.


Motivation

In data streaming process there may be data errors and other issues, and we need to correct the data in the flow. This situation is very common and important. However, in this process, we do not want to affect existing data processing to avoid impact on users. We need to create a new data streaming process and wait for it to catch up with the data and replace the original data streaming process. The main operations can be divided into the following steps:

...

  1. Each table only has one main branch, and other branches can only be created from the specified tag of the main branch

  2. Create Users can create or delete a branch for tables in Paimon, and create a tag tags for a specified branch.

  3. Update schema for the branch, such as altering tables to add/drop columns.

  4. Jobs can streaming/batch read from and write data to in the branch

  5. There are merge merging and replace replacing operations from branch to main, and after replace main with given branch, the previous main branch will be deleted.

...

draw.io Diagram
bordertrue
diagramName2
simpleViewerfalse
width
linksauto
tbstyletop
lboxtrue
diagramWidth591
revision12

There is a main branch file  in the branch base directory of table and it has the main branch name in the file. Besides that, there will be multiple branch directories and each branch has snapshot, tag and schema in its directory.

NOTICE: By default, the Snapshot、Schema and Tag of main branch will be in the base directory of table as previously. The main branch will be used to read and write when there's no specified branch or main branch file in the table.

Create Branch

There will be a series of snapshots, tags and schemas in the main branch of a Paimon table. We can create a new branch with branch - name from the tag for the table. To do that, Paimon will should create a new directory with the given branch name, copy the specified tag, snapshot and schema from the main branch to the new branch.

...

We need to support replacing the main branch with a branch without affecting streaming and batch data read and write on the branch. To achieve this, we need to do the following steps:

  1. Calculate and copy the snapshots, tags and schemas which should be copied from the main branch to target branch

  2. Update the Main Branch File to the target branch

  3. Drop the previous main branch, including snapshots, tags and schemas.

...

action

argument

note

create-branch

--name <branch-name>: specify the name of the branch.
-- tag <tag-name>: specify the name of a tag.

create a branch based on the given tag.

delete-branch

--name <branch-name>: specify which branch will be deleted.

delete a branch.

merge-branch

--name <branch-name>: merge specified branch to main.

merge specified branch to main.

replace-main-branch

--name <branch-name>: replace main branch with specified branch.

replace the main branch with a specified branch.

...