...

Proposers

Approvers

  • @<approver1 JIRA username> Vinoth Chandar  [APPROVED/REQUESTED_INFO/REJECTED]
  • @<approver2 JIRA username> Balaji Varadarajan  [APPROVED/REQUESTED_INFO/REJECTED]
  • ...

...

To snapshot is to capture the most up-to-date records in a Hudi dataset as of query time. Note that this can take longer for MOR (merge-on-read) tables, because the latest log files have to be merged with the base files.
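To make the extra cost for MOR tables concrete, here is a toy sketch of the merge step. It is a simplified model rather than Hudi's actual merge implementation: the function name `merge_snapshot`, the `key` field, and the assumption that log records arrive ordered oldest to newest are all hypothetical and chosen for illustration only.

```python
from typing import Dict, List

Record = Dict[str, object]


def merge_snapshot(
    base_records: List[Record],
    log_records: List[Record],
    key_field: str = "key",
) -> List[Record]:
    """Return the latest view of each record.

    Starts from the records in the base files, then applies the log
    records in order so that a later write for the same key wins.
    Assumes log_records are ordered oldest to newest (hypothetical).
    """
    latest = {record[key_field]: record for record in base_records}
    for record in log_records:
        # An update replaces the base record; a new key is appended.
        latest[record[key_field]] = record
    # Sort by key only to make the output deterministic.
    return sorted(latest.values(), key=lambda record: record[key_field])


base = [{"key": "a", "value": 1}, {"key": "b", "value": 2}]
log = [{"key": "b", "value": 3}, {"key": "c", "value": 4}]
snapshot = merge_snapshot(base, log)
```

A table without pending log files has nothing to apply in the loop, so only the base files are read. That is why the merge cost matters mainly for MOR tables.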

Arguments


| Argument | Description | Remark |
| --source-base-path | Base path for the source Hudi dataset to be snapshotted | required |
| --target-base-path | Base path for the target output files (snapshots) | required |
| --snapshot-prefix | Snapshot prefix or directory under the target base path, used to segregate different snapshots | optional; may default to a daily prefix at run time, like 2019/11/12/ |
| --output-format | "HUDI_COPY", "PARQUET" | required; when "HUDI_COPY", behaves the same as HoodieSnapshotCopier; may support more data formats in the future |
| --output-partition-field | A field to be used by Spark repartitioning | optional; ignored when "HUDI_COPY" |
| --output-partitioner | A class to facilitate custom repartitioning | optional; ignored when "HUDI_COPY" |
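
The argument rules above can be sketched as a small parser. This is an illustration in Python, not the proposed tool itself: the names `build_parser` and `resolve_config` are hypothetical, and the daily default for the snapshot prefix follows the "may default" remark, so treat it as an assumption rather than settled behavior.

```python
import argparse
from datetime import date
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Declare the six arguments listed in the table (sketch only)."""
    parser = argparse.ArgumentParser(prog="snapshotter-sketch")
    parser.add_argument("--source-base-path", required=True)
    parser.add_argument("--target-base-path", required=True)
    parser.add_argument("--snapshot-prefix", default=None)
    parser.add_argument(
        "--output-format", required=True, choices=["HUDI_COPY", "PARQUET"]
    )
    parser.add_argument("--output-partition-field", default=None)
    parser.add_argument("--output-partitioner", default=None)
    return parser


def resolve_config(
    argv: List[str], today: Optional[date] = None
) -> argparse.Namespace:
    """Parse argv and apply the rules from the Remark column."""
    args = build_parser().parse_args(argv)
    if args.snapshot_prefix is None:
        # Assumed default: a daily prefix such as 2019/11/12/
        args.snapshot_prefix = (today or date.today()).strftime("%Y/%m/%d/")
    if args.output_format == "HUDI_COPY":
        # Repartitioning options are ignored for HUDI_COPY.
        args.output_partition_field = None
        args.output_partitioner = None
    return args


config = resolve_config(
    [
        "--source-base-path", "/data/hudi/source",
        "--target-base-path", "/data/snapshots",
        "--output-format", "PARQUET",
        "--output-partition-field", "year",
    ],
    today=date(2019, 11, 12),
)
```

Passing `today` explicitly keeps the default prefix deterministic in tests; a real run would rely on the current date.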

Steps

[Diagram: RFC-9 snapshotter overview]

...