Taking a snapshot means reading the most up-to-date records from a Hudi dataset as of a particular point in time. Note that this can take longer for MOR (merge-on-read) tables, because it involves merging the latest log files.
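The idea of "most up-to-date records as of a point in time" can be illustrated with a small, self-contained sketch. This is not how Hudi implements snapshotting; the `PointInTimeSketch` and `Version` types are hypothetical, and commit times are assumed to be fixed-width timestamp strings so that they can be compared lexicographically.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; none of these types exist in Hudi.
final class PointInTimeSketch {

    /** One written version of a record: its key, commit time, and payload. */
    static final class Version {
        final String key;
        final String commitTime; // assumed fixed-width, e.g. "20191112000000"
        final String value;

        Version(String key, String commitTime, String value) {
            this.key = key;
            this.commitTime = commitTime;
            this.value = value;
        }
    }

    /** Returns, per key, the newest value whose commit time is not after asOf. */
    static Map<String, String> snapshot(List<Version> versions, String asOf) {
        Map<String, Version> latest = new HashMap<>();
        for (Version v : versions) {
            if (v.commitTime.compareTo(asOf) > 0) {
                continue; // written after the snapshot instant, so not visible
            }
            Version seen = latest.get(v.key);
            if (seen == null || v.commitTime.compareTo(seen.commitTime) > 0) {
                latest.put(v.key, v); // keep only the newest visible version
            }
        }
        Map<String, String> result = new TreeMap<>();
        for (Version v : latest.values()) {
            result.put(v.key, v.value);
        }
        return result;
    }
}
```

In this sketch, merging several versions of the same key down to one is loosely analogous to the extra merge work mentioned above for MOR tables.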

Arguments


Argument | Description | Remark
--source-base-path | Base path for the source Hudi dataset to be snapshotted | required
--target-base-path | Base path for the target output files (snapshots) | required
--snapshot-prefix | Snapshot prefix or directory under the target base path, used to segregate different snapshots | optional; may default to a daily prefix generated at run time, such as 2019/11/12/
--output-format | "HUDI" or "PARQUET" | required; when "HUDI", behaves the same as HoodieSnapshotCopier; may support more data formats in the future
--output-partition-field | A field to be used by Spark repartitioning | optional; ignored when the output format is "HUDI" or when --output-partitioner is specified

By default, the output dataset inherits its partition field from the source Hudi dataset.

When this argument is specified, the provided value is used for both in-memory Spark repartitioning and output file partitioning.

See the code snippet below.

Code Block
languagejava
String partitionField = cfg.getPartitionField(); // from the argument
df.repartition(df.col(partitionField))
  .write()
  .partitionBy(partitionField)
  .parquet(outputPath);

If repartitioning needs more flexibility, use --output-partitioner.

--output-partitioner | A class to facilitate custom repartitioning | optional; ignored when the output format is "HUDI"
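The precedence among these arguments can be sketched in plain Java. This is a hedged illustration of the rules in the table, not the actual implementation: `SnapshotArgsSketch`, its methods, and the `Repartitioning` enum are hypothetical names, and the daily default prefix is an assumption based on the remark that the prefix "may default" to a value like 2019/11/12/.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper; names are illustrative and not part of Hudi.
final class SnapshotArgsSketch {

    /** Which repartitioning behaviour applies for a given set of arguments. */
    enum Repartitioning { NONE_HUDI_COPY, CUSTOM_PARTITIONER, PARTITION_FIELD, INHERIT_FROM_SOURCE }

    // Assumed default layout for --snapshot-prefix, e.g. 2019/11/12/
    private static final DateTimeFormatter DAILY_PREFIX =
        DateTimeFormatter.ofPattern("yyyy/MM/dd/");

    /** Uses the given prefix if present, otherwise a daily prefix for runDate. */
    static String resolvePrefix(String snapshotPrefix, LocalDate runDate) {
        if (snapshotPrefix != null && !snapshotPrefix.isEmpty()) {
            return snapshotPrefix;
        }
        return runDate.format(DAILY_PREFIX);
    }

    /** Applies the "ignored when ..." rules from the Remark column. */
    static Repartitioning resolveRepartitioning(
            String outputFormat, String partitionField, String partitionerClass) {
        if ("HUDI".equals(outputFormat)) {
            // Both partition arguments are ignored for HUDI output.
            return Repartitioning.NONE_HUDI_COPY;
        }
        if (!"PARQUET".equals(outputFormat)) {
            throw new IllegalArgumentException("Unsupported --output-format: " + outputFormat);
        }
        if (partitionerClass != null) {
            // --output-partitioner takes precedence over --output-partition-field.
            return Repartitioning.CUSTOM_PARTITIONER;
        }
        if (partitionField != null) {
            return Repartitioning.PARTITION_FIELD;
        }
        // Neither given: keep the partitioning of the source dataset.
        return Repartitioning.INHERIT_FROM_SOURCE;
    }
}
```

For example, under these assumptions, passing "PARQUET" with both a partition field and a partitioner class resolves to the custom partitioner, while passing "HUDI" ignores both.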

Steps

[Gliffy diagram: RFC-9 snapshotter overview]

...