...
- Vinoth Chandar : APPROVED
- Balaji Varadarajan : APPROVED
- Nishith Agarwal : APPROVED
Status
Current state: COMPLETED
Discussion thread: here
JIRA: HUDI-344
Released: 0.6.0
Abstract
A feature to snapshot a Hudi dataset and export the latest records to a set of external files (e.g., plain parquet files).
...
Option | Description | Remark
---|---|---
--source-base-path | Base path for the source Hudi dataset to be snapshotted | required
--target-output-path | Output path for storing a particular snapshot | required
--output-format | Output format for the exported dataset; accepted values: json, parquet, hudi | required; when "hudi", behaves the same as HoodieSnapshotCopier; may support more data formats in the future
--output-partition-field | A field to be used by Spark repartitioning | optional; ignored when the output format is "hudi". By default the output dataset's partition field is inherited from the source Hudi dataset. When this argument is specified, the provided value is used for both in-memory Spark repartitioning and output file partitioning. When more flexible repartitioning is needed, use --output-partitioner instead.
--output-partitioner | A class to facilitate custom repartitioning | optional; ignored when the output format is "hudi"
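Putting the options together, an invocation might look like the following. This is an illustrative sketch only: the class name, bundle jar, and paths below are assumptions, and the exact names depend on the Hudi release.

```shell
# Illustrative only: class/jar names and paths are hypothetical.
spark-submit \
  --class org.apache.hudi.utilities.HoodieSnapshotExporter \
  hudi-utilities-bundle.jar \
  --source-base-path hdfs://namenode/path/to/hudi-dataset \
  --target-output-path hdfs://namenode/path/to/snapshot \
  --output-format parquet \
  --output-partition-field date
```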
Steps
(Gliffy diagram)
- Read
  - Regardless of output format, always leverage org.apache.hudi.common.table.view.HoodieTableFileSystemView to perform a read-optimized query for reading. Specifically, the data to be read comes from the latest version of the columnar files in the source dataset, up to the latest commit time, as the existing HoodieSnapshotCopier does.
- Transform
  - Output format "parquet"
    - Strip Hudi metadata
    - Allow the user to provide a field for simple Spark repartitioning
    - Allow the user to provide a class for custom repartitioning
  - Output format "hudi"
    - No transformation is needed; just copy the original files, as the existing HoodieSnapshotCopier does
- Write
  - Provide the output directory and Spark shall handle the rest.
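The core of the read step, selecting the latest committed base file per file group up to the latest commit time, can be sketched in plain Java. This is a minimal sketch with hypothetical types; the real tool delegates this to HoodieTableFileSystemView rather than hand-rolling the selection.

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch: pick the latest base file per file group, bounded by a
// max commit time, mirroring what a read-optimized snapshot query returns.
public class LatestFileSliceSketch {

    // Stand-in for a columnar base file belonging to a file group.
    record BaseFile(String fileGroupId, String commitTime, String path) {}

    static Collection<BaseFile> latestPerFileGroup(List<BaseFile> files, String maxCommitTime) {
        return files.stream()
                // Only consider files committed at or before the cutoff.
                .filter(f -> f.commitTime().compareTo(maxCommitTime) <= 0)
                // Keep the file with the greatest commit time per file group.
                .collect(Collectors.toMap(
                        BaseFile::fileGroupId,
                        f -> f,
                        (a, b) -> a.commitTime().compareTo(b.commitTime()) >= 0 ? a : b))
                .values();
    }

    public static void main(String[] args) {
        List<BaseFile> files = List.of(
                new BaseFile("fg-1", "20191101", "fg-1_20191101.parquet"),
                new BaseFile("fg-1", "20191112", "fg-1_20191112.parquet"),
                new BaseFile("fg-2", "20191105", "fg-2_20191105.parquet"));
        // Exports the latest slice of each file group up to commit 20191112.
        latestPerFileGroup(files, "20191112").forEach(f -> System.out.println(f.path()));
    }
}
```

The exporter would then either copy these files verbatim (output format "hudi") or load them into Spark, strip the Hudi metadata columns, and rewrite them (output format "parquet").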
...