
...

Current state

Status: COMPLETED

Discussion thread: here

JIRA: HUDI-344

Released: 0.6.0


Abstract

A feature to snapshot a Hudi dataset and export the latest records to a set of external files (e.g., plain parquet files).

...

--source-base-path
Description: Base path for the source Hudi dataset to be snapshotted
Remark: required

--target-output-path
Description: Output path for storing a particular snapshot
Remark: required

--output-format
Description: Output format for the exported dataset; accepts these values: "json", "parquet", "hudi"
Remark: required; when "hudi", behaves the same as HoodieSnapshotCopier; may support more data formats in the future

--output-partition-field
Description: A field to be used by Spark repartitioning

Remark: optional; ignored when the output format is "hudi" or when --output-partitioner is specified.

By default, the output dataset's partition field is inherited from the source Hudi dataset. When this argument is specified, the provided value is used for both the in-memory Spark repartitioning and the output file partitioning.

Code Block (java)

String partitionField = ...; // from the --output-partition-field argument
df.repartition(df.col(partitionField))
  .write()
  .partitionBy(partitionField)
  .parquet(outputPath);

If more flexibility is needed for repartitioning, use --output-partitioner.

--output-partitioner
Description: A class to facilitate custom repartitioning
Remark: optional; ignored when the output format is "hudi"
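
To make the option rules above concrete, here is a minimal, hypothetical validation sketch. The class and method names are illustrative only and are not part of the actual exporter implementation:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper mirroring the option semantics described above.
public class ExporterConfigCheck {

    static final List<String> SUPPORTED_FORMATS = Arrays.asList("json", "parquet", "hudi");

    // --output-format must be one of the accepted values.
    public static boolean isSupportedFormat(String outputFormat) {
        return outputFormat != null && SUPPORTED_FORMATS.contains(outputFormat);
    }

    // Repartitioning options only apply to non-"hudi" formats;
    // for "hudi" the original files are copied as-is.
    public static boolean partitioningApplies(String outputFormat) {
        return isSupportedFormat(outputFormat) && !"hudi".equals(outputFormat);
    }

    // --output-partition-field is ignored when --output-partitioner is given.
    public static boolean usePartitionField(String outputFormat,
                                            String partitionField,
                                            String partitionerClass) {
        return partitioningApplies(outputFormat)
                && partitionField != null
                && partitionerClass == null;
    }
}
```

This is only a sketch of the precedence rules ("hudi" disables both options; a custom partitioner shadows the partition field), not the exporter's real argument parsing.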

Steps

Gliffy Diagram: RFC-9 snapshotter overview

  1. Read
    • Regardless of output format, always leverage org.apache.hudi.common.table.view.HoodieTableFileSystemView to perform a read-optimized (RO) query for reads
    • Specifically, the data to be read is the latest version of the columnar files in the source dataset, up to the latest commit time, like what the existing HoodieSnapshotCopier does
  2. Transform
    • Output format "parquet"
      • Strip Hudi metadata
      • Allow the user to provide a field for simple Spark repartitioning
      • Allow the user to provide a class for custom repartitioning
    • No transformation is needed for output format "hudi"; just copy the original files, like what the existing HoodieSnapshotCopier does
  3. Write
    • Provide the output directory; Spark handles the rest.
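
The read step above, selecting the latest columnar file per file group up to the latest commit time, can be sketched without Spark or the real HoodieTableFileSystemView API. All names below are illustrative, and the model is deliberately simplified:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of step 1: for each file group, keep only the base
// file written by the most recent commit at or before the snapshot's
// latest commit time.
public class LatestFileSelector {

    // A columnar (base) file: the file group it belongs to and the
    // commit that produced it. This loosely mirrors what the real
    // file-system view tracks per file slice.
    public static final class BaseFile {
        final String fileGroupId;
        final String commitTime; // sortable timestamp string, e.g. "20191112"
        final String path;

        public BaseFile(String fileGroupId, String commitTime, String path) {
            this.fileGroupId = fileGroupId;
            this.commitTime = commitTime;
            this.path = path;
        }
    }

    // Returns fileGroupId -> path of the latest base file for that group.
    public static Map<String, String> latestFiles(List<BaseFile> files, String latestCommit) {
        Map<String, String> latestCommitPerGroup = new LinkedHashMap<>();
        Map<String, String> pathPerGroup = new LinkedHashMap<>();
        for (BaseFile f : files) {
            if (f.commitTime.compareTo(latestCommit) > 0) {
                continue; // ignore files newer than the snapshot's commit time
            }
            String seen = latestCommitPerGroup.get(f.fileGroupId);
            if (seen == null || f.commitTime.compareTo(seen) > 0) {
                latestCommitPerGroup.put(f.fileGroupId, f.commitTime);
                pathPerGroup.put(f.fileGroupId, f.path);
            }
        }
        return pathPerGroup;
    }
}
```

In the actual exporter this selection is done by the file-system view's RO query; the sketch only shows the "latest version per file group, up to the latest commit" rule that both the exporter and HoodieSnapshotCopier rely on.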

...