...
A snapshot captures the most up-to-date records in a Hudi dataset as of a particular point in time. Note that taking a snapshot can take longer for Merge-On-Read (MOR) tables, because it involves merging the latest log files.
Arguments
Argument | Description | Remark |
---|---|---|
--source-base-path | Base path for the source Hudi dataset to be snapshotted | required |
--target-base-path | Base path for the target output files (snapshots) | required |
--snapshot-prefix | Snapshot prefix or directory under the target base path, used to segregate different snapshots | optional; may default to a daily prefix generated at run time, such as 2019/11/12/ |
--output-format | "HUDI", "PARQUET" | required; when "HUDI", behaves the same as HoodieSnapshotCopier; may support more data formats in the future |
--output-partition-field | A field to be used by Spark repartitioning | optional; ignored when the output format is "HUDI" or when --output-partitioner is specified. By default, the output dataset's partition field is inherited from the source Hudi dataset. When this argument is specified, the provided value is used for both in-memory Spark repartitioning and output file partitioning. If more flexibility is needed for repartitioning, use --output-partitioner. |
--output-partitioner | A class to facilitate custom repartitioning | optional; ignored when the output format is "HUDI" |
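To make the interplay between these arguments concrete, here is a minimal Python sketch of how they might be parsed and resolved. It is not the exporter's implementation: the function names (`build_parser`, `resolve_output_path`, `resolve_partitioning`) are hypothetical, the `yyyy/MM/dd` layout of the default prefix is inferred from the `2019/11/12/` example, and the rule that `--output-partitioner` takes precedence over `--output-partition-field` is an assumption.

```python
import argparse
from datetime import date
from typing import Optional, Tuple

OUTPUT_FORMATS = ("HUDI", "PARQUET")


def build_parser() -> argparse.ArgumentParser:
    """Declare the arguments from the table above (illustrative only)."""
    parser = argparse.ArgumentParser(prog="snapshot-exporter-sketch")
    parser.add_argument("--source-base-path", required=True)
    parser.add_argument("--target-base-path", required=True)
    parser.add_argument("--snapshot-prefix", default=None)
    parser.add_argument("--output-format", required=True, choices=OUTPUT_FORMATS)
    parser.add_argument("--output-partition-field", default=None)
    parser.add_argument("--output-partitioner", default=None)
    return parser


def resolve_output_path(target_base_path: str,
                        snapshot_prefix: Optional[str],
                        today: Optional[date] = None) -> str:
    """Join the target base path with the snapshot prefix.

    Assumption: when no prefix is given, a daily yyyy/MM/dd prefix is used.
    """
    if snapshot_prefix:
        prefix = snapshot_prefix
    else:
        prefix = (today or date.today()).strftime("%Y/%m/%d")
    return target_base_path.rstrip("/") + "/" + prefix.strip("/")


def resolve_partitioning(output_format: str,
                         partition_field: Optional[str],
                         partitioner: Optional[str]) -> Tuple[str, Optional[str]]:
    """Decide which partitioning strategy applies.

    Returns a (strategy, value) pair where strategy is one of
    "source" (inherit from the source dataset), "partitioner", or "field".
    """
    if output_format == "HUDI":
        # Both partitioning arguments are ignored for HUDI output.
        return ("source", None)
    if partitioner:
        # Assumption: a custom partitioner overrides the partition field.
        return ("partitioner", partitioner)
    if partition_field:
        return ("field", partition_field)
    return ("source", None)


if __name__ == "__main__":
    # Hypothetical paths and field name, for demonstration only.
    args = build_parser().parse_args([
        "--source-base-path", "/data/hudi/trips",
        "--target-base-path", "/data/snapshots/trips",
        "--output-format", "PARQUET",
        "--output-partition-field", "year",
    ])
    print(resolve_output_path(args.target_base_path, args.snapshot_prefix))
    print(resolve_partitioning(args.output_format,
                               args.output_partition_field,
                               args.output_partitioner))
```

Keeping path resolution and partitioning resolution as separate pure functions makes the defaulting rules easy to check in isolation, before any Spark job is launched.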
Steps
[Gliffy diagram: snapshot export steps]
...