...

Argument | Description | Remark

--source-base-path | Base path of the source Hudi dataset to be snapshotted | required

--target-base-path | Base path for the target output files (snapshots) | required

--snapshot-prefix | Snapshot prefix or directory under the target base path, used to segregate different snapshots | optional; may default to a daily prefix generated at run time, such as 2019/11/12/

--output-format | "HUDI" or "PARQUET" | required; when "HUDI", behaves the same as HoodieSnapshotCopier; more data formats may be supported in the future

--output-partition-field | A field to be used by Spark repartitioning | optional; ignored when --output-format is "HUDI" or when --output-partitioner is specified

By default, the output dataset inherits its partition field from the source Hudi dataset.

When this argument is specified, the provided value is used for both the in-memory Spark repartitioning and the partitioning of the output files, as in the following snippet:

```java
// Partition field supplied through --output-partition-field
String partitionField = cfg.getPartitionField();
df.repartition(df.col(partitionField))  // in-memory Spark repartitioning
  .write()
  .partitionBy(partitionField)          // partitioning of the output files
  .parquet(outputPath);
```

If repartitioning needs more flexibility than a single field allows, use --output-partitioner.

--output-partitioner | A class to facilitate custom repartitioning | optional; ignored when --output-format is "HUDI"
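The precedence rules among these arguments can be easier to follow in code. The following is a minimal sketch in plain Java, not the exporter's actual implementation: the class `ExportArgs` and its methods are hypothetical names introduced only for illustration, and the daily default for the snapshot prefix is an assumption based on the "may default" remark above.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Objects;
import java.util.Optional;

/** Hypothetical holder for the arguments in the table; not a Hudi class. */
final class ExportArgs {
    enum OutputFormat { HUDI, PARQUET }

    // Assumed layout of the daily prefix, e.g. 2019/11/12/
    private static final DateTimeFormatter DAILY = DateTimeFormatter.ofPattern("yyyy/MM/dd/");

    private final String sourceBasePath;        // --source-base-path (required)
    private final String targetBasePath;        // --target-base-path (required)
    private final OutputFormat outputFormat;    // --output-format (required)
    private final String snapshotPrefix;        // --snapshot-prefix (optional, may be null)
    private final String outputPartitionField;  // --output-partition-field (optional, may be null)
    private final String outputPartitioner;     // --output-partitioner class name (optional, may be null)

    ExportArgs(String sourceBasePath, String targetBasePath, String outputFormat,
               String snapshotPrefix, String outputPartitionField, String outputPartitioner) {
        this.sourceBasePath = Objects.requireNonNull(sourceBasePath, "--source-base-path is required");
        this.targetBasePath = Objects.requireNonNull(targetBasePath, "--target-base-path is required");
        // valueOf throws IllegalArgumentException for anything other than HUDI or PARQUET
        this.outputFormat = OutputFormat.valueOf(
            Objects.requireNonNull(outputFormat, "--output-format is required"));
        this.snapshotPrefix = snapshotPrefix;
        this.outputPartitionField = outputPartitionField;
        this.outputPartitioner = outputPartitioner;
    }

    /** Returns the given prefix, or a daily prefix for runDate when none was supplied. */
    String resolveSnapshotPrefix(LocalDate runDate) {
        return snapshotPrefix != null ? snapshotPrefix : runDate.format(DAILY);
    }

    /** The partition field is ignored for HUDI output or when a custom partitioner is set. */
    Optional<String> effectivePartitionField() {
        if (outputFormat == OutputFormat.HUDI || outputPartitioner != null) {
            return Optional.empty();
        }
        // Empty here means: inherit the partition field of the source dataset
        return Optional.ofNullable(outputPartitionField);
    }

    /** The custom partitioner is ignored for HUDI output. */
    Optional<String> effectivePartitioner() {
        return outputFormat == OutputFormat.HUDI
            ? Optional.empty()
            : Optional.ofNullable(outputPartitioner);
    }
}
```

In this sketch, an empty result from `effectivePartitionField()` stands for "fall back to the source dataset's partition field", so the caller decides how that default is looked up.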

...
