RFC-9: Hudi Dataset Snapshotter

Table of Contents

Proposers

Approvers

  • @<approver1 JIRA username> : [APPROVED/REQUESTED_INFO/REJECTED]
  • @<approver2 JIRA username> : [APPROVED/REQUESTED_INFO/REJECTED]
  • ...

...

Current state: Under Discussion

Discussion thread: here -

JIRA: here -

Released: <Hudi Version>

Abstract

A feature to snapshot a Hudi dataset and export the latest records to a set of external files (e.g., plain parquet files).

Background


The existing org.apache.hudi.utilities.HoodieSnapshotCopier performs a Hudi-to-Hudi copy that serves backup purposes. To broaden its usability, the Copier could potentially be extended to export to data formats other than a Hudi dataset, such as plain parquet files.

Implementation

The proposed class is org.apache.hudi.utilities.HoodieSnapshotter, which serves as the main entry point for snapshot-related work.

Definition of "Snapshot"

To snapshot is to get the most up-to-date records from a Hudi dataset at query time. Note that this could take longer for MOR tables, as it involves merging the latest log files.
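To make the definition concrete, the snippet below contrasts a snapshot (real-time) read with a read-optimized read through the Hudi Spark datasource. The query-type option names come from later Hudi releases and are shown purely to illustrate the definition; they are not part of this design.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SnapshotReadExample {
  public static void main(String[] args) {
    String sourceBasePath = args[0]; // base path of the Hudi dataset
    SparkSession spark = SparkSession.builder().appName("snapshot-read-example").getOrCreate();

    // Snapshot query: returns the latest records; for MOR tables this merges
    // base parquet files with the most recent log files at read time.
    Dataset<Row> latest = spark.read().format("hudi")
        .option("hoodie.datasource.query.type", "snapshot")
        .load(sourceBasePath);

    // Read-optimized query: reads only the base (parquet) files, skipping the log merge.
    Dataset<Row> readOptimized = spark.read().format("hudi")
        .option("hoodie.datasource.query.type", "read_optimized")
        .load(sourceBasePath);

    System.out.println("snapshot rows: " + latest.count()
        + ", read-optimized rows: " + readOptimized.count());
    spark.stop();
  }
}
```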

Arguments


Argument | Description | Remark
--source-base-path | Base path of the source Hudi dataset to be snapshotted | required
--target-base-path | Base path for the target output files (snapshots) | required
--snapshot-prefix | Snapshot prefix or directory under the target base path to segregate different snapshots | optional; may default to a daily prefix generated at run time, e.g. 2019/11/12/
--output-format | "HUDI_COPY" or "PARQUET" | required; with "HUDI_COPY", behaves the same as HoodieSnapshotCopier; more data formats may be supported in the future
--output-partition-field | A field used for Spark repartitioning | optional; ignored when "HUDI_COPY"
--output-partitioner | A class to facilitate custom repartitioning | optional; ignored when "HUDI_COPY"
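As a rough illustration of how these arguments might be wired up, here is a minimal sketch assuming the JCommander-based Config pattern used elsewhere in org.apache.hudi.utilities; the field names and class layout are assumptions, not the final API.

```java
import java.io.Serializable;

import com.beust.jcommander.Parameter;

// Illustrative sketch only: the actual HoodieSnapshotter config may differ.
public class Config implements Serializable {

  @Parameter(names = {"--source-base-path"},
      description = "Base path of the source Hudi dataset to snapshot", required = true)
  public String sourceBasePath;

  @Parameter(names = {"--target-base-path"},
      description = "Base path for the exported snapshot files", required = true)
  public String targetBasePath;

  @Parameter(names = {"--snapshot-prefix"},
      description = "Prefix (e.g. a daily directory) under the target base path")
  public String snapshotPrefix;

  @Parameter(names = {"--output-format"},
      description = "HUDI_COPY or PARQUET", required = true)
  public String outputFormat;

  @Parameter(names = {"--output-partition-field"},
      description = "Field used for Spark repartitioning; ignored for HUDI_COPY")
  public String outputPartitionField;

  @Parameter(names = {"--output-partitioner"},
      description = "Fully-qualified class for custom repartitioning; ignored for HUDI_COPY")
  public String outputPartitioner;
}
```

Invocation might then look like spark-submit --class org.apache.hudi.utilities.HoodieSnapshotter <utilities jar> --source-base-path ... --target-base-path ... --output-format PARQUET; the exact jar and bundle depend on the build and are left out here.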

Steps

[Gliffy diagram: RFC-9 snapshotter overview]

  1. Read
    • Output format "PARQUET": Leverage the org.apache.hudi.common.table.view.HoodieTableFileSystemView logic to get the latest records (RT query)
    • Output format "HUDI_COPY": No RT query is needed. Instead, use an RO query to copy the latest parquet files, as the existing HoodieSnapshotCopier does
  2. Transform
    • Output format "PARQUET"
      • Strip the Hudi metadata
      • Allow the user to provide a field for simple Spark repartitioning
      • Allow the user to provide a class for custom repartitioning
    • No transformation is needed for output format "HUDI_COPY"; just copy the original files, as the existing HoodieSnapshotCopier does
  3. Write
    • Just provide the output directory and Spark handles the rest (see the PARQUET sketch after this list).
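For the "PARQUET" path, the three steps above roughly translate to the following Spark sketch. The Hudi metadata column names are the standard meta fields; everything else (class name, argument handling, overwrite mode) is illustrative and not a final design.

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetExportSketch {
  public static void main(String[] args) {
    String sourceBasePath = args[0];
    String targetBasePath = args[1];
    String partitionField = args.length > 2 ? args[2] : null; // --output-partition-field

    SparkSession spark = SparkSession.builder().appName("hudi-snapshot-export").getOrCreate();

    // 1. Read: snapshot of the latest records (merges the latest log files for MOR tables).
    Dataset<Row> df = spark.read().format("hudi").load(sourceBasePath);

    // 2. Transform: strip the Hudi metadata columns, then optionally repartition.
    df = df.drop("_hoodie_commit_time", "_hoodie_commit_seqno",
        "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name");
    if (partitionField != null) {
      df = df.repartition(col(partitionField));
    }

    // 3. Write: let Spark write plain parquet files to the target directory.
    df.write().mode("overwrite").parquet(targetBasePath);

    spark.stop();
  }
}
```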

Rollout/Adoption Plan

  • No impact on existing users, as this is a new, independent utility tool.
  • Once this feature is GA'ed, we can mark HoodieSnapshotCopier as deprecated and suggest users switch to this tool, which provides equivalent copying features.

Test Plan (TODO)

...

<Describe in a few sentences how the RFC will be tested. How will we know that the implementation works as expected? How will we know nothing broke?>

...