...
- Vinoth Chandar : APPROVED
- Balaji Varadarajan : [APPROVED/REQUESTED_INFO/REJECTED]
- Nishith Agarwal : [APPROVED/REQUESTED_INFO/REJECTED]
Status
Current state: Under Discussion
...
To snapshot is to get the most up-to-date records from a Hudi dataset at a particular point in time. Note that this can take longer for MOR tables, as it involves merging the latest log files.
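To illustrate why a MOR snapshot costs more than copying base files, here is a minimal sketch of the merge, assuming a simplified model in which log entries are applied to the base records in commit order and a null value marks a delete. `MorSnapshotSketch` and `LogEntry` are hypothetical names for this illustration, not Hudi classes, and the real merge logic in Hudi is more involved.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of a MOR snapshot read; these are not Hudi classes.
final class MorSnapshotSketch {

    // One log record: an upsert when value is non-null, a delete when value is null.
    static final class LogEntry {
        final String recordKey;
        final String value;

        LogEntry(String recordKey, String value) {
            this.recordKey = recordKey;
            this.value = value;
        }
    }

    // Starts from the base file's records and applies log entries in commit order,
    // so the most recent write for each record key wins.
    static Map<String, String> snapshot(Map<String, String> baseRecords, List<LogEntry> logEntries) {
        Map<String, String> merged = new LinkedHashMap<>(baseRecords);
        for (LogEntry entry : logEntries) {
            if (entry.value == null) {
                merged.remove(entry.recordKey);
            } else {
                merged.put(entry.recordKey, entry.value);
            }
        }
        return merged;
    }
}
```

Every record key touched by a log file has to be reconciled this way before it can be exported, which is the extra work the note about MOR tables refers to.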
...
Argument | Description | Remark |
---|---|---|
--source-base-path | Base path for the source Hudi dataset to be snapshotted | required |
--target-base-path | Base path for the target output files (snapshots) | required |
--snapshot-prefix | Snapshot prefix or directory under the target base path, used to segregate different snapshots | optional; may default to a daily prefix generated at run time, such as 2019/11/12/ |
--output-format | "HUDI_COPY", "PARQUET" | required; when "HUDI_COPY", behaves the same as HoodieSnapshotCopier; may support more data formats in the future |
--output-partition-field | A field to be used by Spark repartitioning | optional; Ignored when "HUDI_COPY" |
--output-partitioner | A class to facilitate custom repartitioning | optional; Ignored when "HUDI_COPY" |
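The rules in the table can be sketched as a small argument holder. This is a hedged illustration only: `SnapshotExportArgs`, its `parse` method, and the `yyyy/MM/dd/` default pattern are assumptions made for this example (the table only says the prefix "may default" to a daily one), and the sketch uses plain `--name value` pairs rather than any particular CLI library.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;

// Hypothetical argument holder for illustration; not the exporter's actual class.
final class SnapshotExportArgs {
    final String sourceBasePath;
    final String targetBasePath;
    final String snapshotPrefix;
    final String outputFormat;
    final String outputPartitionField; // null when absent or ignored
    final String outputPartitioner;    // null when absent or ignored

    private SnapshotExportArgs(Map<String, String> opts, LocalDate runDate) {
        sourceBasePath = required(opts, "--source-base-path");
        targetBasePath = required(opts, "--target-base-path");
        outputFormat = required(opts, "--output-format");
        if (!outputFormat.equals("HUDI_COPY") && !outputFormat.equals("PARQUET")) {
            throw new IllegalArgumentException("Unsupported --output-format: " + outputFormat);
        }
        // Assumed default: a daily prefix derived from the run date, e.g. 2019/11/12/.
        snapshotPrefix = opts.getOrDefault("--snapshot-prefix",
                runDate.format(DateTimeFormatter.ofPattern("yyyy/MM/dd/")));
        // Repartitioning options only apply to PARQUET; HUDI_COPY ignores them.
        boolean repartitionable = outputFormat.equals("PARQUET");
        outputPartitionField = repartitionable ? opts.get("--output-partition-field") : null;
        outputPartitioner = repartitionable ? opts.get("--output-partitioner") : null;
    }

    // Parses "--name value" pairs; runDate is passed in so the default prefix is testable.
    static SnapshotExportArgs parse(String[] args, LocalDate runDate) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("Expected --name value pairs");
        }
        Map<String, String> opts = new HashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            opts.put(args[i], args[i + 1]);
        }
        return new SnapshotExportArgs(opts, runDate);
    }

    private static String required(Map<String, String> opts, String name) {
        String value = opts.get(name);
        if (value == null) {
            throw new IllegalArgumentException("Missing required argument: " + name);
        }
        return value;
    }
}
```

Passing the run date explicitly, instead of calling `LocalDate.now()` inside the parser, is a design choice of this sketch so that the default prefix stays deterministic in tests.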
Steps
- Read
  - Output format "PARQUET": Leverage the org.apache.hudi.common.table.view.HoodieTableFileSystemView logic to get the latest records (RT query)
  - Output format "HUDI_COPY": we don't need an RT query. Instead, we just use an RO query to copy the latest parquet files, like what the existing HoodieSnapshotCopier does
- Transform
  - Output format "PARQUET"
    - Strip Hudi metadata
    - Allow user to provide a field to do simple Spark repartitioning
    - Allow user to provide a class to do custom repartitioning
  - No transformation is needed for output format "HUDI_COPY"; just copy the original files, like what the existing HoodieSnapshotCopier does
- Write
  - Just need to provide the output directory and Spark shall handle the rest.
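The Transform step for output format "PARQUET" can be sketched in plain Java, with in-memory maps standing in for Spark rows. This is only a rough model under stated assumptions: `ParquetTransformSketch` is a hypothetical name, Hudi metadata columns are assumed to be the ones prefixed with `_hoodie_`, and grouping by a field value stands in for the simple Spark repartitioning on a user-provided field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the PARQUET transform step; records are plain maps, not Spark rows.
final class ParquetTransformSketch {

    // Assumption: Hudi's metadata columns are the ones whose names start with "_hoodie_".
    static Map<String, Object> stripHudiMetadata(Map<String, Object> record) {
        Map<String, Object> stripped = new LinkedHashMap<>();
        for (Map.Entry<String, Object> column : record.entrySet()) {
            if (!column.getKey().startsWith("_hoodie_")) {
                stripped.put(column.getKey(), column.getValue());
            }
        }
        return stripped;
    }

    // Strips metadata, then groups records by the value of partitionField.
    // A null partitionField keeps everything in a single group keyed by "".
    static Map<String, List<Map<String, Object>>> transform(
            List<Map<String, Object>> records, String partitionField) {
        Map<String, List<Map<String, Object>>> groups = new LinkedHashMap<>();
        for (Map<String, Object> record : records) {
            Map<String, Object> stripped = stripHudiMetadata(record);
            String group = partitionField == null ? "" : String.valueOf(stripped.get(partitionField));
            groups.computeIfAbsent(group, key -> new ArrayList<>()).add(stripped);
        }
        return groups;
    }
}
```

Output format "HUDI_COPY" would bypass this step entirely, since the original files are copied unchanged.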
...