...

Status

Current state: Under Discussion

...

To snapshot is to get the most up-to-date records from a Hudi dataset as of a particular point in time. Note that this can take longer for MOR (Merge-on-Read) tables, as it involves merging the latest log files into the base files.
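To make the MOR merging cost concrete, here is a toy sketch (not Hudi's actual implementation) of what a snapshot read conceptually does: base-file records are merged with the latest log records by record key, with the newest version of each key winning.

```python
# Illustrative only: records are plain dicts keyed by "key"; Hudi's real
# merge operates on base parquet files and avro log blocks.

def snapshot_read(base_records, log_records):
    """Merge log records into base records; latest version per key wins.

    log_records is assumed ordered oldest-to-newest, so later entries
    overwrite earlier ones for the same key.
    """
    merged = {r["key"]: r for r in base_records}
    for r in log_records:
        merged[r["key"]] = r
    return list(merged.values())

base = [{"key": "a", "val": 1}, {"key": "b", "val": 2}]
logs = [{"key": "a", "val": 10}, {"key": "c", "val": 3}]
print(snapshot_read(base, logs))
```

For a COW table (or an RO query) there are no log files to merge, which is why the "HUDI_COPY" path below can simply copy the latest parquet files.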

...


Parameters (with description and remark):

--source-base-path: Base path for the source Hudi dataset to be snapshotted. (Required)
--target-base-path: Base path for the target output files (snapshots). (Required)
--snapshot-prefix: Snapshot prefix or directory under the target base path, used to segregate different snapshots. (Optional; may default to a daily prefix generated at run time, like 2019/11/12/)
--output-format: "HUDI_COPY" or "PARQUET". (Required; when "HUDI_COPY", behaves the same as HoodieSnapshotCopier; more data formats may be supported in the future)
--output-partition-field: A field to be used by Spark repartitioning. (Optional; ignored when the output format is "HUDI_COPY")
--output-partitioner: A class to facilitate custom repartitioning. (Optional; ignored when the output format is "HUDI_COPY")
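The default daily prefix for --snapshot-prefix could be generated at run time along these lines (a hypothetical helper, not part of the proposal's API; only the 2019/11/12/ example format comes from the table above):

```python
from datetime import date

def default_snapshot_prefix(run_date=None):
    """Hypothetical helper: build a daily snapshot prefix like '2019/11/12/'."""
    d = run_date or date.today()
    return d.strftime("%Y/%m/%d/")

print(default_snapshot_prefix(date(2019, 11, 12)))  # → 2019/11/12/
```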

Steps

[Diagram: RFC-9 snapshotter overview]

  1. Read
    • Output format "PARQUET": Leverage the org.apache.hudi.common.table.view.HoodieTableFileSystemView logic to get the latest records (RT query)
    • Output format "HUDI_COPY": No RT query is needed. Instead, use an RO query to copy the latest parquet files, like the existing HoodieSnapshotCopier does
  2. Transform
    • Output format "PARQUET"
      • Strip Hudi metadata
      • Allow the user to provide a field for simple Spark repartitioning
      • Allow the user to provide a class for custom repartitioning
    • No transformation is needed for output format "HUDI_COPY"; just copy the original files, like the existing HoodieSnapshotCopier does
  3. Write
    • Provide the output directory and Spark will handle the rest.
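The metadata-stripping part of the Transform step can be sketched as follows. The field names are Hudi's standard metadata columns; representing records as plain dicts (rather than Spark DataFrame rows) is a simplification for illustration.

```python
# Hudi's standard metadata columns, which snapshots in plain PARQUET
# format should not carry.
HOODIE_META_FIELDS = {
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
}

def strip_hoodie_metadata(record):
    """Drop Hudi metadata fields, keeping only the data columns."""
    return {k: v for k, v in record.items() if k not in HOODIE_META_FIELDS}

rec = {"_hoodie_commit_time": "20191112093000", "id": 7, "name": "x"}
print(strip_hoodie_metadata(rec))  # → {'id': 7, 'name': 'x'}
```

In the real pipeline this would be a column drop on the DataFrame, followed by the optional field-based or custom repartitioning before the write.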

...