...

Status

Current state: Under Discussion

...

To snapshot is to get the most up-to-date records from a Hudi dataset as of a particular point in time. Note that this can take longer for MOR (Merge-on-Read) tables, as it involves merging the latest log files into the base files.
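To make the MOR merging cost concrete, here is a toy sketch (not Hudi's actual implementation) of what a snapshot read conceptually does: base-file records are merged with the latest log records by record key, with the newest version of each key winning.

```python
# Illustrative only: records are plain dicts keyed by "key"; Hudi's real
# merge operates on base parquet files and avro log blocks.

def snapshot_read(base_records, log_records):
    """Merge log records into base records; latest version per key wins.

    log_records is assumed ordered oldest-to-newest, so later entries
    overwrite earlier ones for the same key.
    """
    merged = {r["key"]: r for r in base_records}
    for r in log_records:
        merged[r["key"]] = r
    return list(merged.values())

base = [{"key": "a", "val": 1}, {"key": "b", "val": 2}]
logs = [{"key": "a", "val": 10}, {"key": "c", "val": 3}]
print(snapshot_read(base, logs))
```

For a COW table (or an RO query) there are no log files to merge, which is why the "HUDI_COPY" path below can simply copy the latest parquet files.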

...


Parameters (with description and remark):

--source-base-path: Base path for the source Hudi dataset to be snapshotted. (Required)
--target-base-path: Base path for the target output files (snapshots). (Required)
--snapshot-prefix: Snapshot prefix or directory under the target base path, used to segregate different snapshots. (Optional; may default to a daily prefix generated at run time, like 2019/11/12/)
--output-format: "HUDI_COPY" or "PARQUET". (Required; when "HUDI_COPY", behaves the same as HoodieSnapshotCopier; more data formats may be supported in the future)
--output-partition-field: A field to be used by Spark repartitioning. (Optional; ignored when the output format is "HUDI_COPY")
--output-partitioner: A class to facilitate custom repartitioning. (Optional; ignored when the output format is "HUDI_COPY")
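The default daily prefix for --snapshot-prefix could be generated at run time along these lines (a hypothetical helper, not part of the proposal's API; only the 2019/11/12/ example format comes from the table above):

```python
from datetime import date

def default_snapshot_prefix(run_date=None):
    """Hypothetical helper: build a daily snapshot prefix like '2019/11/12/'."""
    d = run_date or date.today()
    return d.strftime("%Y/%m/%d/")

print(default_snapshot_prefix(date(2019, 11, 12)))  # → 2019/11/12/
```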

Steps

[Diagram: RFC-9 snapshotter overview]

  1. Read
    • Output format "PARQUET": Leverage the org.apache.hudi.common.table.view.HoodieTableFileSystemView logic to get the latest records (RT query)
    • Output format "HUDI_COPY": No RT query is needed. Instead, use an RO query to copy the latest parquet files, like the existing HoodieSnapshotCopier does
  2. Transform
    • Output format "PARQUET"
      • Strip Hudi metadata
      • Allow the user to provide a field for simple Spark repartitioning
      • Allow the user to provide a class for custom repartitioning
    • No transformation is needed for output format "HUDI_COPY"; just copy the original files, like the existing HoodieSnapshotCopier does
  3. Write
    • Provide the output directory and Spark will handle the rest.
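The metadata-stripping part of the Transform step can be sketched as follows. The field names are Hudi's standard metadata columns; representing records as plain dicts (rather than Spark DataFrame rows) is a simplification for illustration.

```python
# Hudi's standard metadata columns, which snapshots in plain PARQUET
# format should not carry.
HOODIE_META_FIELDS = {
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
}

def strip_hoodie_metadata(record):
    """Drop Hudi metadata fields, keeping only the data columns."""
    return {k: v for k, v in record.items() if k not in HOODIE_META_FIELDS}

rec = {"_hoodie_commit_time": "20191112093000", "id": 7, "name": "x"}
print(strip_hoodie_metadata(rec))  # → {'id': 7, 'name': 'x'}
```

In the real pipeline this would be a column drop on the DataFrame, followed by the optional field-based or custom repartitioning before the write.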

...