Discussion thread


Vote thread
ISSUE: https://github.com/apache/incubator-paimon/issues/1795
Release: Paimon-0.6

Motivation


Data errors and other issues can occur in a streaming pipeline, and the data in the flow then needs to be corrected. This situation is common and important. While correcting the data, however, we do not want to disturb the existing pipeline, so that users are not affected. Instead, we create a new streaming pipeline, wait for it to catch up with the data, and then use it to replace the original pipeline. The main operations can be divided into the following steps:

  1. Create replica tables based on a specified tag/snapshot of the upstream and downstream Paimon tables.

  2. Resubmit all streaming jobs, performing an incremental or full recovery starting from the specified offset.


We therefore propose supporting branching in Paimon. With branches, replica tables can be created without copying all the data of the specified tables, which avoids the extra storage cost.
Branching can also be used to enhance tags. When tags are used to simulate traditional Hive partitioned tables, branches provide data correction on top of a tag, which can be used to supplement data and achieve precise segmentation.
In summary, the branch we would like to introduce in Paimon has the following capabilities:

  1. Each table has exactly one main branch, and other branches can only be created from a specified tag of the main branch.

  2. Branches can be created and deleted for tables in Paimon, and a tag can be created on a specified branch.

  3. The schema of a branch can be updated, for example by altering the table to add or drop columns.

  4. Jobs can read from and write to a branch in both streaming and batch mode.

  5. Merge and replace operations are supported from a branch to main. After main is replaced with a given branch, the previous main branch is deleted.
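
The branch rules listed above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not Paimon's API: the `Table` class, its method names, and the string snapshot labels are all hypothetical, and the handling of tags after a replace (the branch's tags become main's tags) is an assumption, since the text does not specify it. Merge is left out because its semantics are not described here.

```python
class BranchError(Exception):
    """Raised when an operation violates the branch rules."""


class Table:
    """Toy model of a table with one main branch (hypothetical, not Paimon's API)."""

    MAIN = "main"

    def __init__(self):
        # branch name -> ordered list of snapshot labels
        self.branches = {self.MAIN: []}
        # (branch name, tag name) -> number of snapshots visible at the tag
        self.tags = {}

    def commit(self, branch, snapshot):
        self.branches[branch].append(snapshot)

    def create_tag(self, branch, tag):
        self.tags[(branch, tag)] = len(self.branches[branch])

    def create_branch(self, name, from_tag):
        # Rule 1: a branch can only start from a tag of the main branch.
        key = (self.MAIN, from_tag)
        if name in self.branches:
            raise BranchError(f"branch {name!r} already exists")
        if key not in self.tags:
            raise BranchError(f"main has no tag {from_tag!r}")
        # The new branch sees main's history up to the tag. A real system
        # would reference the existing files rather than copy them.
        self.branches[name] = list(self.branches[self.MAIN][: self.tags[key]])

    def delete_branch(self, name):
        if name == self.MAIN:
            raise BranchError("the main branch cannot be deleted directly")
        del self.branches[name]
        self.tags = {k: v for k, v in self.tags.items() if k[0] != name}

    def replace_main(self, name):
        # Rule 5: the given branch becomes main and the previous main is dropped.
        if name == self.MAIN or name not in self.branches:
            raise BranchError(f"cannot replace main with {name!r}")
        self.branches[self.MAIN] = self.branches.pop(name)
        # Assumption: old main's tags are dropped, the branch's tags move to main.
        self.tags = {(self.MAIN, t): n for (b, t), n in self.tags.items() if b == name}


# Usage: correct a bad snapshot on a branch, then promote the branch to main.
t = Table()
t.commit("main", "s1")
t.commit("main", "s2")
t.create_tag("main", "tag-2")
t.commit("main", "s3-bad")
t.create_branch("fix", "tag-2")
t.commit("fix", "s3-corrected")
t.replace_main("fix")
```

In this model the original main keeps serving readers until `replace_main` is called, which mirrors the goal of letting the corrected pipeline catch up before it takes over.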

