...

  1. Each table has exactly one main branch, and other branches can only be created from a specified tag of the main branch.

  2. Create or delete a branch for a table in Paimon, and create a tag for a specified branch.

  3. Update the schema of a branch, for example by altering the table to add or drop columns.

  4. Streaming and batch jobs can read data from and write data to a branch.

  5. A branch can be merged into the main branch or can replace it. After the main branch is replaced with a given branch, the previous main branch is deleted.

Architecture

The data storage structure in Paimon is divided into five components: schemas, data files, manifest files and manifest lists, snapshots, and tags.

[draw.io diagram: 1]

To support the branch capabilities above, we would like to introduce branches for snapshots, tags and schemas as follows.

[draw.io diagram: 2]

The branch directory contains a main branch file, which records the name of the main branch. In addition, there are multiple branch directories, and each branch keeps its own snapshots, tags and schemas in its directory.
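The layout described above can be sketched on a local filesystem as follows. This is only an illustration: the file name `MAIN-BRANCH`, the directory names `branch`, `snapshot`, `tag` and `schema`, the helper functions, and the choice to keep the main branch's own directory under the branch directory are all assumptions, since the proposal does not fix them in the text.

```python
import tempfile
from pathlib import Path

# Hypothetical names: the proposal does not specify file or directory names.
MAIN_BRANCH_FILE = "MAIN-BRANCH"
COMPONENTS = ("snapshot", "tag", "schema")


def init_branch_dir(table_path: Path, main_branch: str) -> Path:
    """Create the branch directory, one branch directory and the main branch file."""
    branch_root = table_path / "branch"
    for component in COMPONENTS:
        # Every branch has its own snapshot, tag and schema directories.
        (branch_root / main_branch / component).mkdir(parents=True, exist_ok=True)
    # The main branch file contains only the name of the main branch.
    (branch_root / MAIN_BRANCH_FILE).write_text(main_branch)
    return branch_root


def read_main_branch(table_path: Path) -> str:
    """Return the branch name recorded in the main branch file."""
    return (table_path / "branch" / MAIN_BRANCH_FILE).read_text().strip()


with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp) / "my_table"
    root = init_branch_dir(table, "main")
    main_name = read_main_branch(table)
    layout = sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))
```

Keeping the main branch name in a single small file means that switching the main branch only requires rewriting that file, not moving any branch directory.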

Create Branch

The main branch of a Paimon table holds a series of snapshots, tags and schemas. A new branch with a given branch name can be created from a tag of the table. To do so, Paimon creates a new directory with the given branch name and copies the specified tag, together with its snapshot and schema, from the main branch to the new branch.

[draw.io diagram: 3]

For example, when Branch-1 is created from tag-1, Paimon copies the corresponding snapshot-4 and schema-1 into Branch-1. Branch-2 and Branch-3 are created in the same way from tag-7 and tag-11.
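A minimal in-memory sketch of branch creation follows. The `Snapshot` and `Branch` classes and the `create_branch` function are hypothetical and are not Paimon APIs. Only the relation tag-1, snapshot-4, schema-1 comes from the example above; the remaining snapshots and the column names are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    schema_id: int


@dataclass
class Branch:
    snapshots: dict = field(default_factory=dict)  # snapshot id -> Snapshot
    tags: dict = field(default_factory=dict)       # tag name -> snapshot id
    schemas: dict = field(default_factory=dict)    # schema id -> column names


def create_branch(main: Branch, tag_name: str) -> Branch:
    """Copy the tag, its snapshot and that snapshot's schema into a new branch."""
    if tag_name not in main.tags:
        raise KeyError(f"tag {tag_name!r} does not exist in the main branch")
    snapshot = main.snapshots[main.tags[tag_name]]
    return Branch(
        snapshots={snapshot.snapshot_id: snapshot},
        tags={tag_name: snapshot.snapshot_id},
        # Copy the column list so later DDL on the branch does not touch main.
        schemas={snapshot.schema_id: list(main.schemas[snapshot.schema_id])},
    )


# Illustrative main branch: tag-1 points at snapshot-4, which uses schema-1.
main = Branch(
    snapshots={i: Snapshot(i, 1) for i in range(1, 6)},
    tags={"tag-1": 4},
    schemas={1: ["id", "name"]},
)
branch_1 = create_branch(main, "tag-1")
```

The new branch starts with exactly one tag, one snapshot and one schema, so it does not depend on later changes to the main branch.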

Operations In Branch

After a branch is created, streaming and batch jobs can read and write data in it. As with a regular table, streaming and batch jobs can also read data from the branch through time travel. Writing data to the branch creates new snapshots and tags. Users can also perform DDL on table branches, such as adding, dropping or altering columns. For example, we perform these operations in Branch-1 to create new schemas, snapshots and tags.

[draw.io diagram: 4]

Delete Branch

Deleting a branch is simple: delete the directory of the specified branch.
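As a rough sketch on a local filesystem, deletion reduces to a recursive directory removal. The `delete_branch` helper and the directory names below are hypothetical and only illustrate the idea.

```python
import shutil
import tempfile
from pathlib import Path


def delete_branch(branch_root: Path, branch_name: str) -> None:
    """Delete a branch by removing its directory and everything under it."""
    branch_dir = branch_root / branch_name
    if not branch_dir.is_dir():
        raise FileNotFoundError(f"branch {branch_name!r} does not exist")
    shutil.rmtree(branch_dir)


with tempfile.TemporaryDirectory() as tmp:
    branch_root = Path(tmp) / "branch"
    # Two illustrative branches, each with a snapshot directory.
    for name in ("branch-1", "branch-2"):
        (branch_root / name / "snapshot").mkdir(parents=True)
    delete_branch(branch_root, "branch-1")
    remaining = sorted(p.name for p in branch_root.iterdir())
```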

Merge Branch To Main

Merging a branch into the main branch takes two steps:

  1. Delete all snapshots, tags and schemas in the main branch that were created after the tag from which the branch was created.

  2. Copy the snapshots, tags and schemas from the branch to the main branch.

After the merge, jobs can still read from and write to the merged branch, and the data in the branch remains independent of the main branch.
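The two merge steps can be sketched in memory as follows. The `Branch` class and `merge_into_main` function are hypothetical. The sketch assumes that snapshot and schema ids grow in creation order, so "created after the tag" is approximated by "has a larger id than the tag's snapshot or schema". All ids and column names are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Branch:
    snapshots: dict = field(default_factory=dict)  # snapshot id -> schema id
    tags: dict = field(default_factory=dict)       # tag name -> snapshot id
    schemas: dict = field(default_factory=dict)    # schema id -> column names


def merge_into_main(main: Branch, branch: Branch, base_tag: str) -> Branch:
    """Return the main branch after merging `branch`, which was created from `base_tag`."""
    base_snapshot = main.tags[base_tag]
    base_schema = main.snapshots[base_snapshot]

    # Step 1: drop what the main branch created after the base tag.
    snapshots = {s: sch for s, sch in main.snapshots.items() if s <= base_snapshot}
    tags = {t: s for t, s in main.tags.items() if s <= base_snapshot}
    schemas = {i: list(c) for i, c in main.schemas.items() if i <= base_schema}

    # Step 2: copy the branch's snapshots, tags and schemas into main.
    snapshots.update(branch.snapshots)
    tags.update(branch.tags)
    schemas.update({i: list(c) for i, c in branch.schemas.items()})
    return Branch(snapshots, tags, schemas)


main = Branch(
    snapshots={1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2},
    tags={"tag-1": 4, "tag-6": 6},
    schemas={1: ["id"], 2: ["id", "main_col"]},
)
branch_1 = Branch(
    snapshots={4: 1, 5: 1, 6: 2},
    tags={"tag-1": 4, "branch-tag": 6},
    schemas={1: ["id"], 2: ["id", "branch_col"]},
)
merged = merge_into_main(main, branch_1, "tag-1")
```

Because the function copies from the branch and never modifies it, the branch stays readable and writable after the merge, in line with the independence described above.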

[draw.io diagram: Untitled drawing]

Replace Main With Branch


We need to support replacing the main branch with another branch without affecting streaming and batch reads and writes on that branch. This takes the following steps:

  1. Calculate the snapshots, tags and schemas that should be copied from the main branch to the target branch.

  2. Update the main branch file to point to the target branch.

  3. Drop the previous main branch, including its snapshots, tags and schemas.

After these steps, the main branch has been replaced with the target branch, and existing jobs can still read and write data in the branch.
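The three steps can be sketched in memory as follows. The `Branch` and `Table` classes and the `replace_main` function are hypothetical. The proposal does not say which items step 1 selects, so this sketch assumes one possible rule: copy everything the main branch holds from before the target branch's earliest snapshot. All ids and names are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Branch:
    snapshots: dict = field(default_factory=dict)  # snapshot id -> schema id
    tags: dict = field(default_factory=dict)       # tag name -> snapshot id
    schemas: dict = field(default_factory=dict)    # schema id -> column names


@dataclass
class Table:
    main_branch: str                              # stands in for the main branch file
    branches: dict = field(default_factory=dict)  # branch name -> Branch


def replace_main(table: Table, target_name: str) -> None:
    old_name = table.main_branch
    if target_name == old_name:
        raise ValueError("target branch is already the main branch")
    old_main, target = table.branches[old_name], table.branches[target_name]

    # Step 1 (assumed rule): copy the history that main holds from before
    # the target branch's earliest snapshot into the target branch.
    base_snapshot = min(target.snapshots)
    for snapshot_id, schema_id in old_main.snapshots.items():
        if snapshot_id < base_snapshot:
            target.snapshots[snapshot_id] = schema_id
            target.schemas.setdefault(schema_id, list(old_main.schemas[schema_id]))
    for tag_name, snapshot_id in old_main.tags.items():
        if snapshot_id < base_snapshot:
            target.tags.setdefault(tag_name, snapshot_id)

    # Step 2: update the main branch marker to the target branch.
    table.main_branch = target_name
    # Step 3: drop the previous main branch with its snapshots, tags and schemas.
    del table.branches[old_name]


table = Table(
    main_branch="main",
    branches={
        "main": Branch({1: 1, 2: 1, 3: 1, 4: 1, 5: 2},
                       {"tag-0": 2, "tag-1": 4},
                       {1: ["id"], 2: ["id", "x"]}),
        "branch-1": Branch({4: 1, 5: 1}, {"tag-1": 4}, {1: ["id"]}),
    },
)
replace_main(table, "branch-1")
new_main = table.branches[table.main_branch]
```

The target branch is never moved or renamed in this sketch; only the marker changes. That is one way to keep jobs that already read from or write to the branch unaffected.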

[draw.io diagram: 6]