...

To snapshot is to get the most up-to-date records from a Hudi dataset at a particular point in time. Note that the data exported from MOR tables may not be the most up-to-date, as the RO query is used for retrieval, which omits the latest data in the log files.

Arguments


Argument | Description | Remark
--source-base-path | Base path for the source Hudi dataset to be snapshotted | required
--target-base-path | Base path for the target output files (snapshots) | required
--snapshot-prefix | Snapshot prefix or directory under the target base path, used to segregate different snapshots | optional; may default to a daily prefix generated at run time, e.g. 2019/11/12/
--output-format | "HUDI" or "PARQUET" | required; when "HUDI", behaves the same as HoodieSnapshotCopier; may support more data formats in the future
--output-partition-field | A field to be used by Spark repartitioning | optional; ignored when the output format is "HUDI" or when --output-partitioner is specified

The output dataset's default partition field will be inherited from the source Hudi dataset.

When this argument is specified, the provided value is used for both in-memory Spark repartitioning and output file partitioning.

Code Block (java)
String partitionField = ...; // from the --output-partition-field argument
df.repartition(df.col(partitionField))
  .write()
  .partitionBy(partitionField)
  .parquet(outputPath);

If more flexibility is needed for repartitioning, use --output-partitioner.

--output-partitioner | A class to facilitate custom repartitioning | optional; ignored when the output format is "HUDI"
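As a hedged illustration of the --snapshot-prefix default mentioned above, a daily prefix like 2019/11/12/ could be derived at run time as follows (the class and method names, and the formatter pattern, are assumptions for illustration, not part of this RFC):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Sketch: derive a daily snapshot prefix such as "2019/11/12/" when
// --snapshot-prefix is omitted. Names and pattern are hypothetical.
class SnapshotPrefix {
  static String dailyPrefix(LocalDate date) {
    // "yyyy/MM/dd" yields zero-padded month and day, matching the example prefix
    return date.format(DateTimeFormatter.ofPattern("yyyy/MM/dd")) + "/";
  }

  public static void main(String[] args) {
    System.out.println(dailyPrefix(LocalDate.of(2019, 11, 12))); // prints 2019/11/12/
  }
}
```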

...

[Gliffy Diagram: RFC-9 snapshotter overview]

  1. Read
    • Regardless of output format, always leverage the org.apache.hudi.common.table.view.HoodieTableFileSystemView logic to perform an RO query for the read
    • Specifically, the data to be read is from the latest version of the columnar files in the source dataset, up to the latest commit time
  2. Transform
    • Output format "PARQUET"
      • Strip Hudi metadata
      • Allow user to provide a field to do simple Spark repartitioning
      • Allow user to provide a class to do custom repartitioning
    • No transformation is needed for output format "HUDI"; just copy the original files, like what the existing HoodieSnapshotCopier does
  3. Write
    • Provide the output directory and Spark will handle the rest.
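The "strip Hudi metadata" transform in step 2 amounts to dropping Hudi's well-known meta columns from the exported schema. A minimal sketch, without Spark, of that column filtering (the class and method names are hypothetical; the meta column names are Hudi's standard HoodieRecord meta fields):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch: remove Hudi's meta columns from a schema before exporting to
// plain PARQUET. In the real exporter this would be a Dataset.drop(...)
// over these same column names.
class StripHudiMeta {
  static final List<String> HOODIE_META_COLS = Arrays.asList(
      "_hoodie_commit_time", "_hoodie_commit_seqno",
      "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name");

  static List<String> stripMetaColumns(List<String> schemaColumns) {
    return schemaColumns.stream()
        .filter(c -> !HOODIE_META_COLS.contains(c))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> cols = Arrays.asList("_hoodie_commit_time", "id", "name");
    System.out.println(stripMetaColumns(cols)); // prints [id, name]
  }
}
```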

...