
...

Taking a snapshot means reading the most up-to-date records from a Hudi dataset as of a particular point in time. Note that this can take longer for MOR (merge-on-read) tables because it involves merging the latest log files.

Arguments

Argument	Description	Remark
--source-base-path	Base path for the source Hudi dataset to be snapshotted	required
--target-base-path	Base path for the target output files (snapshots)	required
--snapshot-prefix	Snapshot prefix or directory under the target base path, used to segregate different snapshots	optional; may default to a daily prefix generated at run time, such as `2019/11/12/`
--output-format	"HUDI", "PARQUET"	required; when "HUDI", behaves the same as `HoodieSnapshotCopier`; may support more data formats in the future
--output-partition-field	A field to be used by Spark repartitioning	optional; ignored when "HUDI" or when `--output-partitioner` is specified. By default, the output dataset inherits its partition field from the source Hudi dataset. When this argument is specified, the provided value is used for both in-memory Spark repartitioning and output file partitioning.

See the code snippet below

String partitionField = cfg.getPartitionField(); // from the argument
df.repartition(df.col(partitionField))
  .write()
  .partitionBy(partitionField)
  .parquet(outputPath);

If repartitioning needs more flexibility, use `--output-partitioner`.

--output-partitioner	A class to facilitate custom repartitioning	optional; ignored when "HUDI"
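
The precedence rules in the table above can be sketched in plain Java. This is only an illustration under stated assumptions: the class `SnapshotArgsSketch`, the enum `Partitioning`, and both method names are hypothetical and not part of Hudi, and the `yyyy/MM/dd/` pattern is just one way to produce a daily prefix like `2019/11/12/`.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration only; none of these names exist in Hudi.
final class SnapshotArgsSketch {

    // How the output would be partitioned, per the argument table.
    enum Partitioning { SOURCE_LAYOUT, CUSTOM_PARTITIONER, PARTITION_FIELD, INHERIT_FROM_SOURCE }

    // Assumed default for --snapshot-prefix: a daily prefix such as "2019/11/12/".
    static String defaultPrefix(LocalDate runDate) {
        return runDate.format(DateTimeFormatter.ofPattern("yyyy/MM/dd/"));
    }

    // Resolves which partitioning argument takes effect.
    static Partitioning resolve(String outputFormat, String partitionField, String partitionerClass) {
        if ("HUDI".equals(outputFormat)) {
            // Both partitioning arguments are ignored when the output format is "HUDI".
            return Partitioning.SOURCE_LAYOUT;
        }
        if (!"PARQUET".equals(outputFormat)) {
            throw new IllegalArgumentException("Unsupported --output-format: " + outputFormat);
        }
        if (partitionerClass != null && !partitionerClass.isEmpty()) {
            // --output-partitioner overrides --output-partition-field.
            return Partitioning.CUSTOM_PARTITIONER;
        }
        if (partitionField != null && !partitionField.isEmpty()) {
            return Partitioning.PARTITION_FIELD;
        }
        // Neither argument given: inherit the partition field from the source dataset.
        return Partitioning.INHERIT_FROM_SOURCE;
    }

    public static void main(String[] args) {
        System.out.println(defaultPrefix(LocalDate.of(2019, 11, 12)));
        // "com.example.MyPartitioner" is a made-up class name.
        System.out.println(resolve("PARQUET", "year", "com.example.MyPartitioner"));
        System.out.println(resolve("PARQUET", "year", null));
        System.out.println(resolve("HUDI", "year", null));
    }
}
```

The check for `--output-partitioner` comes before the check for `--output-partition-field` because the table states that the field is ignored whenever a partitioner is specified.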

Steps

[Diagram: RFC-9 snapshotter overview]

...
