RFC-9: Hudi Dataset Snapshotter

Proposers

Raymond Xu

Approvers

Vinoth Chandar : [APPROVED/REQUESTED_INFO/REJECTED]
Balaji Varadarajan : APPROVED
Nishith Agarwal : [APPROVED/REQUESTED_INFO/REJECTED]

Status

Current state: COMPLETED

Discussion thread: - here

JIRA: HUDI-344

Released: 0.6.0

Abstract

A feature to snapshot a Hudi dataset and export the latest records to a set of external files (e.g., plain parquet files).

Background

The existing org.apache.hudi.utilities.HoodieSnapshotCopier performs a Hudi-to-Hudi copy for backup purposes. To broaden its usability, the copier could be extended to export to data formats other than a Hudi dataset, such as plain parquet files.

Implementation

The proposed class is org.apache.hudi.utilities.HoodieSnapshotExporter, which serves as the main entry point for snapshot-related work.

Definition of "Snapshot"

To snapshot is to get the records from a Hudi dataset at a particular point in time. Note that the data exported from MOR tables may not be the most up-to-date, because an RO query is used for retrieval and it omits the latest data in the log files.
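As a rough illustration of this definition, the sketch below picks, for each file group, the newest columnar base file committed at or before the snapshot instant and never looks at log files, which is why recent MOR updates can be missing from an export. The `SnapshotSketch` class, the `BaseFile` type, and the `latestBaseFiles` helper are hypothetical names used only for illustration, not Hudi APIs, and the sketch assumes commit times are fixed-width timestamp strings that sort lexicographically.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model for illustration only; not part of the Hudi API.
final class SnapshotSketch {

  /** A columnar base file that belongs to a file group and was written by one commit. */
  static final class BaseFile {
    final String fileGroupId;
    final String commitTime; // assumed fixed-width timestamp, e.g. yyyyMMddHHmmss
    final String path;

    BaseFile(String fileGroupId, String commitTime, String path) {
      this.fileGroupId = fileGroupId;
      this.commitTime = commitTime;
      this.path = path;
    }
  }

  /** Returns, per file group, the newest base file committed at or before snapshotTime. */
  static List<BaseFile> latestBaseFiles(List<BaseFile> files, String snapshotTime) {
    Map<String, BaseFile> latest = new TreeMap<>();
    for (BaseFile f : files) {
      if (f.commitTime.compareTo(snapshotTime) > 0) {
        continue; // written after the snapshot instant, so not part of the snapshot
      }
      BaseFile current = latest.get(f.fileGroupId);
      if (current == null || f.commitTime.compareTo(current.commitTime) > 0) {
        latest.put(f.fileGroupId, f);
      }
    }
    return new ArrayList<>(latest.values());
  }

  public static void main(String[] args) {
    List<BaseFile> files = List.of(
        new BaseFile("fg1", "20191110000000", "p1/fg1_20191110000000.parquet"),
        new BaseFile("fg1", "20191112000000", "p1/fg1_20191112000000.parquet"),
        new BaseFile("fg2", "20191111000000", "p1/fg2_20191111000000.parquet"));
    // Print the files that make up the snapshot as of the end of 2019-11-11.
    for (BaseFile f : latestBaseFiles(files, "20191111235959")) {
      System.out.println(f.path);
    }
  }
}
```

Log files are deliberately absent from this model: anything written only to logs after the chosen base file is invisible to an RO-style read.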

Arguments

Argument	Description	Remark
--source-base-path	Base path for the source Hudi dataset to be snapshotted	required
--target-output-path	Output path for storing a particular snapshot	required
--output-format	Output format for the exported dataset; accepts these values: json|parquet|hudi	required; when "hudi", behaves the same as `HoodieSnapshotCopier`; may support more data formats in the future
--output-partition-field	A field to be used by Spark repartitioning	optional; ignored when "hudi" or when --output-partitioner is specified (see the notes below)
--output-partitioner	A class to facilitate custom repartitioning	optional; ignored when "hudi"

Notes on --output-partition-field:

By default, the output dataset inherits its partition field from the source Hudi dataset.

When this argument is specified, the provided value is used for both in-memory Spark repartitioning and output file partitioning.

String partitionField = cfg.outputPartitionField; // from --output-partition-field; "cfg" is a placeholder for the parsed arguments
df.repartition(df.col(partitionField)) // in-memory repartitioning by the field
  .write()
  .partitionBy(partitionField)         // output directories partitioned by the same field
  .parquet(outputPath);

If more flexibility is needed for repartitioning, use --output-partitioner.
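The rules in the Remark column can be summarised in a few lines of code. The following is a minimal sketch that assumes arguments arrive as "--name value" pairs; `ExporterArgsSketch` and its methods are hypothetical and are not the tool's actual argument parser.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical argument holder for illustration; the actual tool may parse arguments differently.
final class ExporterArgsSketch {
  static final Set<String> FORMATS = Set.of("json", "parquet", "hudi");

  final String sourceBasePath;
  final String targetOutputPath;
  final String outputFormat;
  final String outputPartitionField; // optional, may be null
  final String outputPartitioner;    // optional, may be null

  private ExporterArgsSketch(Map<String, String> m) {
    sourceBasePath = require(m, "--source-base-path");
    targetOutputPath = require(m, "--target-output-path");
    outputFormat = require(m, "--output-format");
    if (!FORMATS.contains(outputFormat)) {
      throw new IllegalArgumentException("Unsupported --output-format: " + outputFormat);
    }
    outputPartitionField = m.get("--output-partition-field");
    outputPartitioner = m.get("--output-partitioner");
  }

  /** Parses "--name value" pairs and checks the required arguments. */
  static ExporterArgsSketch parse(String... argv) {
    if (argv.length % 2 != 0) {
      throw new IllegalArgumentException("Expected --name value pairs");
    }
    Map<String, String> m = new HashMap<>();
    for (int i = 0; i < argv.length; i += 2) {
      m.put(argv[i], argv[i + 1]);
    }
    return new ExporterArgsSketch(m);
  }

  private static String require(Map<String, String> m, String name) {
    String value = m.get(name);
    if (value == null) {
      throw new IllegalArgumentException("Missing required argument " + name);
    }
    return value;
  }

  /** The custom partitioner is ignored when the output format is "hudi". */
  boolean usesCustomPartitioner() {
    return !"hudi".equals(outputFormat) && outputPartitioner != null;
  }

  /** The partition field is ignored for "hudi" and when a custom partitioner is given. */
  boolean usesPartitionField() {
    return !"hudi".equals(outputFormat)
        && outputPartitioner == null
        && outputPartitionField != null;
  }
}
```

For example, `ExporterArgsSketch.parse("--source-base-path", "/src", "--target-output-path", "/out", "--output-format", "parquet", "--output-partition-field", "year")` yields an object for which `usesPartitionField()` is true; adding `--output-partitioner` makes it false.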

Steps

[Diagram: RFC-9 snapshotter overview]

Read
- Regardless of output format, always leverage org.apache.hudi.common.table.view.HoodieTableFileSystemView to perform an RO query for the read
- Specifically, the data to be read comes from the latest version of the columnar files in the source dataset, up to the latest commit time, like what the existing HoodieSnapshotCopier does
Transform
- Output format "parquet"
  - Strip Hudi metadata
  - Allow the user to provide a field for simple Spark repartitioning
  - Allow the user to provide a class for custom repartitioning
- No transformation is needed for output format "hudi"; just copy the original files, like what the existing HoodieSnapshotCopier does
Write
- Only the output directory needs to be provided; Spark handles the rest.
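The metadata-stripping part of the Transform step can be sketched as a filter over column names. The five `_hoodie_*` names below are the metadata columns Hudi adds to each record; the `StripMetadataSketch` class and its `dataColumns` helper are hypothetical and only show which columns would survive in a non-Hudi export.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not the exporter's actual code.
final class StripMetadataSketch {
  // Metadata columns that Hudi adds to every record.
  static final Set<String> HUDI_META_COLUMNS = Set.of(
      "_hoodie_commit_time",
      "_hoodie_commit_seqno",
      "_hoodie_record_key",
      "_hoodie_partition_path",
      "_hoodie_file_name");

  /** Returns the columns to keep in the export, preserving their original order. */
  static List<String> dataColumns(List<String> allColumns) {
    List<String> kept = new ArrayList<>();
    for (String column : allColumns) {
      if (!HUDI_META_COLUMNS.contains(column)) {
        kept.add(column);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> schema = List.of(
        "_hoodie_commit_time", "_hoodie_record_key", "id", "ts", "partition");
    System.out.println(dataColumns(schema));
  }
}
```

In a Spark job, the same effect could presumably be achieved by dropping these columns from the DataFrame before writing; the list-based version here simply keeps the sketch free of a Spark dependency.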

Rollout/Adoption Plan

No impact on existing users, as this is a new, independent utility tool.
Once this feature is generally available, we can mark HoodieSnapshotCopier as deprecated and suggest that users switch to this tool, which provides equivalent copying features.

Test Plan

Write tests similar to those for HoodieSnapshotCopier.
When testing end-to-end, we are to verify that
- the number of records matches
- a later snapshot reflects the latest information from the original dataset
