Definition

A single item on the `Hudi` ingestion processing timeline.

Design details

At its core, Hudi maintains a timeline of all actions performed on the dataset at different instants of time. The timeline helps provide instantaneous views of the dataset while also efficiently supporting retrieval of data in order of arrival. A Hudi `timeline instant` consists of the following components:

  • Action type: the type of action performed on the dataset.
  • Instant time: typically a timestamp (e.g. 20190117010349) that increases monotonically in the order of each action’s begin time.
  • Instant state: the current state of the instant.

Hudi guarantees that the actions performed on the timeline are atomic and timeline-consistent based on the instant time.
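The three components can be pictured as a small record. The following is a minimal sketch, not Hudi's actual class: the `Instant` type, its field names, and `sort_timeline` are hypothetical, and the `yyyyMMddHHmmss` layout is an assumption inferred from the example timestamp 20190117010349.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Instant:
    """Hypothetical model of a timeline instant; not Hudi's real API."""
    action: str        # action type, e.g. "COMMITS" or "CLEANS"
    instant_time: str  # e.g. "20190117010349"
    state: str         # "REQUESTED", "INFLIGHT" or "COMPLETED"

    def begin_time(self) -> datetime:
        # Assumes a yyyyMMddHHmmss layout, as the example timestamp suggests.
        return datetime.strptime(self.instant_time, "%Y%m%d%H%M%S")


def sort_timeline(instants):
    # Fixed-width digit strings sort lexicographically in chronological order.
    return sorted(instants, key=lambda i: i.instant_time)


timeline = sort_timeline([
    Instant("CLEANS", "20190117010512", "COMPLETED"),
    Instant("COMMITS", "20190117010349", "COMPLETED"),
])
```

Because the instant time is a fixed-width string of digits, ordering the timeline needs only a plain string comparison, with no date parsing.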

Key action types include:

  • COMMITS - A commit denotes an atomic write of a batch of records into a dataset.
  • CLEANS - Background activity that removes older versions of files in the dataset that are no longer needed.
  • DELTA_COMMIT - A delta commit is an atomic write of a batch of records into a Merge On Read (MOR) storage type of dataset, where some or all of the data may be written only to delta logs.
  • COMPACTION - Background activity that reconciles differential data structures within Hudi, e.g. moving updates from row-based delta log files to columnar file formats. Internally, compaction manifests as a special commit on the timeline.
  • ROLLBACK - Indicates that a commit or delta commit was unsuccessful and was rolled back, removing any partial files produced during the write.
  • SAVEPOINT - Marks certain file groups as “saved” so that the cleaner will not delete them. This helps restore the dataset to a point on the timeline in disaster or data recovery scenarios.
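As a rough illustration of the list above, the action types can be modelled as an enumeration and grouped by role. This is a sketch only: the `Action` enum, the two groupings, and `writes_only` are hypothetical names, and the grouping simply follows the descriptions above (commits and delta commits write batches of records; cleans and compactions are background activities).

```python
from enum import Enum


class Action(Enum):
    """Hypothetical enumeration of the action types listed above."""
    COMMITS = "COMMITS"
    CLEANS = "CLEANS"
    DELTA_COMMIT = "DELTA_COMMIT"
    COMPACTION = "COMPACTION"
    ROLLBACK = "ROLLBACK"
    SAVEPOINT = "SAVEPOINT"


# Grouping derived from the descriptions, not from Hudi's code.
WRITE_ACTIONS = {Action.COMMITS, Action.DELTA_COMMIT}
BACKGROUND_ACTIONS = {Action.CLEANS, Action.COMPACTION}


def writes_only(timeline):
    # Keep only the (action, instant_time) pairs that write record batches.
    return [(a, t) for a, t in timeline if a in WRITE_ACTIONS]


sample = [
    (Action.COMMITS, "20190117010349"),
    (Action.CLEANS, "20190117010512"),
    (Action.DELTA_COMMIT, "20190117010730"),
    (Action.ROLLBACK, "20190117010901"),
]
write_times = [t for _, t in writes_only(sample)]
```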

Any given instant can be in one of the following instant states:

  • REQUESTED - Denotes that an action has been scheduled but has not yet started.
  • INFLIGHT - Denotes that the action is currently being performed.
  • COMPLETED - Denotes completion of an action on the timeline.
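Read in order, the three states suggest a simple linear lifecycle. The sketch below assumes that an instant only ever moves forward from REQUESTED to INFLIGHT to COMPLETED; the `InstantState` enum and `advance` function are hypothetical and not part of Hudi.

```python
from enum import Enum


class InstantState(Enum):
    """Hypothetical enumeration of the instant states listed above."""
    REQUESTED = 1
    INFLIGHT = 2
    COMPLETED = 3


# Assumed forward-only transitions: REQUESTED -> INFLIGHT -> COMPLETED.
_NEXT = {
    InstantState.REQUESTED: InstantState.INFLIGHT,
    InstantState.INFLIGHT: InstantState.COMPLETED,
}


def advance(state: InstantState) -> InstantState:
    # COMPLETED is terminal in this sketch, so it has no successor.
    if state not in _NEXT:
        raise ValueError(f"{state.name} has no next state")
    return _NEXT[state]


state = InstantState.REQUESTED
history = [state]
while state is not InstantState.COMPLETED:
    state = advance(state)
    history.append(state)
```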

Design decisions

  1. #todo

Related concepts

  1. file format
  2. commit

Status (draft)

