RFC-9: Hudi Dataset Snapshotter

Table of Contents

Proposers

Approvers

  • @<approver1 JIRA username> : [APPROVED/REQUESTED_INFO/REJECTED]
  • @<approver2 JIRA username> : [APPROVED/REQUESTED_INFO/REJECTED]
  • ...

...

Current state: Under Discussion

Discussion thread: here -

JIRA: here -

Released: <Hudi Version>

Abstract

A feature to snapshot a Hudi dataset and export the latest records to a set of external files (e.g., plain parquet files).

Background


The existing org.apache.hudi.utilities.HoodieSnapshotCopier performs a Hudi-to-Hudi copy that serves backup purposes. To broaden its usability, the Copier could potentially be extended to export to data formats other than a Hudi dataset, such as plain parquet files.

Implementation

The proposed class is org.apache.hudi.utilities.HoodieSnapshotter, which serves as the main entry point for snapshot-related work.

Definition of "Snapshot"

To snapshot is to get the most up-to-date records from a Hudi dataset at query time. Note that this could take longer for MOR tables, as it involves merging the latest log files.
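To make the definition concrete, the snippet below contrasts a snapshot (real-time) read with a read-optimized read through the Hudi Spark datasource. The query-type option names come from later Hudi releases and are shown purely to illustrate the definition; they are not part of this design.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SnapshotReadExample {
  public static void main(String[] args) {
    String sourceBasePath = args[0]; // base path of the Hudi dataset
    SparkSession spark = SparkSession.builder().appName("snapshot-read-example").getOrCreate();

    // Snapshot query: returns the latest records; for MOR tables this merges
    // base parquet files with the most recent log files at read time.
    Dataset<Row> latest = spark.read().format("hudi")
        .option("hoodie.datasource.query.type", "snapshot")
        .load(sourceBasePath);

    // Read-optimized query: reads only the base (parquet) files, skipping the log merge.
    Dataset<Row> readOptimized = spark.read().format("hudi")
        .option("hoodie.datasource.query.type", "read_optimized")
        .load(sourceBasePath);

    System.out.println("snapshot rows: " + latest.count()
        + ", read-optimized rows: " + readOptimized.count());
    spark.stop();
  }
}
```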

Arguments


Argument | Description | Remark
--source-base-path | Base path of the source Hudi dataset to be snapshotted | required
--target-base-path | Base path for the target output files (snapshots) | required
--snapshot-prefix | Snapshot prefix or directory under the target base path to segregate different snapshots | optional; may default to a daily prefix generated at run time, e.g. 2019/11/12/
--output-format | "HUDI_COPY" or "PARQUET" | required; with "HUDI_COPY", behaves the same as HoodieSnapshotCopier; more data formats may be supported in the future
--output-partition-field | A field used for Spark repartitioning | optional; ignored when "HUDI_COPY"
--output-partitioner | A class to facilitate custom repartitioning | optional; ignored when "HUDI_COPY"
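As a rough illustration of how these arguments might be wired up, here is a minimal sketch assuming the JCommander-based Config pattern used elsewhere in org.apache.hudi.utilities; the field names and class layout are assumptions, not the final API.

```java
import java.io.Serializable;

import com.beust.jcommander.Parameter;

// Illustrative sketch only: the actual HoodieSnapshotter config may differ.
public class Config implements Serializable {

  @Parameter(names = {"--source-base-path"},
      description = "Base path of the source Hudi dataset to snapshot", required = true)
  public String sourceBasePath;

  @Parameter(names = {"--target-base-path"},
      description = "Base path for the exported snapshot files", required = true)
  public String targetBasePath;

  @Parameter(names = {"--snapshot-prefix"},
      description = "Prefix (e.g. a daily directory) under the target base path")
  public String snapshotPrefix;

  @Parameter(names = {"--output-format"},
      description = "HUDI_COPY or PARQUET", required = true)
  public String outputFormat;

  @Parameter(names = {"--output-partition-field"},
      description = "Field used for Spark repartitioning; ignored for HUDI_COPY")
  public String outputPartitionField;

  @Parameter(names = {"--output-partitioner"},
      description = "Fully-qualified class for custom repartitioning; ignored for HUDI_COPY")
  public String outputPartitioner;
}
```

Invocation might then look like spark-submit --class org.apache.hudi.utilities.HoodieSnapshotter <utilities jar> --source-base-path ... --target-base-path ... --output-format PARQUET; the exact jar and bundle depend on the build and are left out here.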

Steps

[Gliffy diagram: RFC-9 snapshotter overview]

  1. Read
    • Output format "PARQUET": Leverage the org.apache.hudi.common.table.view.HoodieTableFileSystemView logic to get the latest records (RT query)
    • Output format "HUDI_COPY": No RT query is needed. Instead, use an RO query to copy the latest parquet files, as the existing HoodieSnapshotCopier does
  2. Transform
    • Output format "PARQUET"
      • Strip the Hudi metadata
      • Allow the user to provide a field for simple Spark repartitioning
      • Allow the user to provide a class for custom repartitioning
    • No transformation is needed for output format "HUDI_COPY"; just copy the original files, as the existing HoodieSnapshotCopier does
  3. Write
    • Just provide the output directory and Spark handles the rest (see the PARQUET sketch after this list).
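For the "PARQUET" path, the three steps above roughly translate to the following Spark sketch. The Hudi metadata column names are the standard meta fields; everything else (class name, argument handling, overwrite mode) is illustrative and not a final design.

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetExportSketch {
  public static void main(String[] args) {
    String sourceBasePath = args[0];
    String targetBasePath = args[1];
    String partitionField = args.length > 2 ? args[2] : null; // --output-partition-field

    SparkSession spark = SparkSession.builder().appName("hudi-snapshot-export").getOrCreate();

    // 1. Read: snapshot of the latest records (merges the latest log files for MOR tables).
    Dataset<Row> df = spark.read().format("hudi").load(sourceBasePath);

    // 2. Transform: strip the Hudi metadata columns, then optionally repartition.
    df = df.drop("_hoodie_commit_time", "_hoodie_commit_seqno",
        "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name");
    if (partitionField != null) {
      df = df.repartition(col(partitionField));
    }

    // 3. Write: let Spark write plain parquet files to the target directory.
    df.write().mode("overwrite").parquet(targetBasePath);

    spark.stop();
  }
}
```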

Rollout/Adoption Plan

  • No impact on existing users, as this is a new, independent utility tool.
  • Once this feature is GA'ed, we can mark HoodieSnapshotCopier as deprecated and suggest users switch to this tool, which provides equivalent copying features.

Test Plan (TODO)

...

<Describe in a few sentences how the RFC will be tested. How will we know that the implementation works as expected? How will we know nothing broke?>

...