
RFC-9: Hudi Dataset Snapshotter

Proposers

Approvers

Status

Current state: Under Discussion

Discussion thread: -

JIRA: -

Released: <Hudi Version>

Abstract

A feature to snapshot a Hudi dataset and export the latest records to a set of external files (e.g., plain parquet files).

Background

The existing org.apache.hudi.utilities.HoodieSnapshotCopier performs a Hudi-to-Hudi copy for backup purposes. To broaden its usability, the copier could be extended to export data in formats other than a Hudi dataset, such as plain parquet files.

Implementation

The proposed class is org.apache.hudi.utilities.HoodieSnapshotter, which serves as the main entry point for snapshot-related work.

Definition of "Snapshot"

To snapshot is to get the most up-to-date records from a Hudi dataset at query time. Note that this can take longer for MOR (merge-on-read) tables, because it involves merging the latest log files with the base files.
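As a rough illustration of this definition, the sketch below merges base-file records with log-file records and keeps the latest version of each key. The record shape (`key`, `commit_time`, `value`) and the function name `latest_records` are hypothetical and are not Hudi APIs; deletes and custom payload merging are ignored.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def latest_records(base_records: Iterable[Record],
                   log_records: Iterable[Record]) -> List[Record]:
    """Keep one record per key: the one with the greatest commit time."""
    latest: Dict[Any, Record] = {}
    # Base records are visited first, so on equal commit times a log record wins.
    for record in [*base_records, *log_records]:
        current = latest.get(record["key"])
        if current is None or record["commit_time"] >= current["commit_time"]:
            latest[record["key"]] = record
    return list(latest.values())
```

For a table with no log files, as in the copy-on-write case, the log input is empty and the function simply returns the base records, which matches the note that MOR tables do the extra merging work.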

Arguments


| Argument | Description | Remark |
| --- | --- | --- |
| --source-base-path | Base path for the source Hudi dataset to be snapshotted | required |
| --target-base-path | Base path for the target output files (snapshots) | required |
| --snapshot-prefix | Snapshot prefix or directory under the target base path in order to segregate different snapshots | optional; may default to provide a daily prefix at run time like 2019/11/12/ |
| --output-format | "HUDI_COPY", "PARQUET" | required; When "HUDI_COPY", behaves the same as HoodieSnapshotCopier; may support more data formats in the future |
| --output-partition-field | A field to be used by Spark repartitioning | optional; Ignored when "HUDI_COPY" |
| --output-partitioner | A class to facilitate custom repartitioning | optional; Ignored when "HUDI_COPY" |
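The argument contract above can be sketched with Python's standard `argparse`. This is only an illustration of the required/optional rules and the daily default prefix. The proposed tool itself is a Java class, and the function name `parse_args` and its `today` parameter are assumptions made for this example.

```python
import argparse
from datetime import date
from typing import List, Optional

OUTPUT_FORMATS = ("HUDI_COPY", "PARQUET")


def parse_args(argv: List[str], today: Optional[date] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="HoodieSnapshotter")
    parser.add_argument("--source-base-path", required=True)
    parser.add_argument("--target-base-path", required=True)
    parser.add_argument("--snapshot-prefix", default=None)
    parser.add_argument("--output-format", required=True, choices=OUTPUT_FORMATS)
    parser.add_argument("--output-partition-field", default=None)
    parser.add_argument("--output-partitioner", default=None)
    args = parser.parse_args(argv)

    # Assumed default: a daily prefix such as 2019/11/12/ computed at run time.
    if args.snapshot_prefix is None:
        args.snapshot_prefix = (today or date.today()).strftime("%Y/%m/%d/")

    # Partitioning options are ignored for a plain Hudi-to-Hudi copy.
    if args.output_format == "HUDI_COPY":
        args.output_partition_field = None
        args.output_partitioner = None
    return args
```

Passing `today` explicitly keeps the default prefix deterministic in tests.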

Steps

Figure: RFC-9 snapshotter overview

  1. Read
    • Output format "PARQUET": Leverage the org.apache.hudi.common.table.view.HoodieTableFileSystemView logic to get the latest records (RT, i.e. real-time, query)
    • Output format "HUDI_COPY": no RT query is needed. Instead, use an RO (read-optimized) query to copy the latest parquet files, as the existing HoodieSnapshotCopier does
  2. Transform
    • Output format "PARQUET"
      • Strip Hudi metadata
      • Allow user to provide a field to do simple Spark repartitioning
      • Allow user to provide a class to do custom repartitioning
    • No transformation is needed for output format "HUDI_COPY"; just copy the original files, as the existing HoodieSnapshotCopier does
  3. Write
    • Only the output directory needs to be provided; Spark handles the rest.
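The transform step for "PARQUET" output can be sketched on plain dictionaries as follows. In the actual tool this would operate on a Spark dataset, so the grouping here only stands in for Spark repartitioning by a field. The function names are hypothetical; the column names are the standard `_hoodie_*` metadata fields that Hudi adds to each record.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Optional

# Metadata columns that Hudi adds to every record.
HOODIE_META_COLUMNS = (
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
)


def strip_hudi_metadata(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record without the Hudi metadata columns."""
    return {k: v for k, v in record.items() if k not in HOODIE_META_COLUMNS}


def transform(records: Iterable[Dict[str, Any]],
              partition_field: Optional[str] = None) -> Dict[str, List[Dict[str, Any]]]:
    """Strip metadata, then group records by the partition field's value.

    With no partition field, all records land in one group keyed by "".
    """
    groups: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for record in records:
        clean = strip_hudi_metadata(record)
        partition = str(clean[partition_field]) if partition_field else ""
        groups[partition].append(clean)
    return dict(groups)
```

A custom partitioner class, as allowed by --output-partitioner, would replace the grouping logic while the metadata stripping stays the same.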

Rollout/Adoption Plan

  • No impact to existing users as this is a new independent utility tool.
  • Once this feature is GA'ed, we can mark HoodieSnapshotCopier as deprecated and suggest that users switch to this tool, which provides equivalent copying features.

Test Plan

  • Write tests similar to those for HoodieSnapshotCopier
  • When testing end-to-end, verify that
    • the number of records matches
    • a later snapshot reflects the latest information from the original dataset
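The two end-to-end checks could be expressed roughly as below. This is a sketch on in-memory rows, assuming each record carries a unique key field; the helper name `verify_snapshot` is hypothetical and does not come from the existing HoodieSnapshotCopier tests.

```python
from typing import Any, Dict, List


def verify_snapshot(expected: List[Dict[str, Any]],
                    exported: List[Dict[str, Any]],
                    key_field: str) -> List[str]:
    """Compare the source's latest records with the exported snapshot.

    Returns a list of problem descriptions; an empty list means the checks pass.
    """
    problems: List[str] = []
    # Check 1: the number of records matches.
    if len(expected) != len(exported):
        problems.append(
            f"record count mismatch: expected {len(expected)}, got {len(exported)}"
        )
    # Check 2: every exported record reflects the latest state in the source.
    exported_by_key = {row[key_field]: row for row in exported}
    for row in expected:
        if exported_by_key.get(row[key_field]) != row:
            problems.append(f"stale or missing record for key {row[key_field]!r}")
    return problems
```

Running the check after each upsert to the source dataset, against a fresh snapshot, covers the "later snapshot" case.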