Proposers

Approvers

Status

Current state: "Under Discussion"

JIRA:   HUDI-86

Released: TBD

Abstract

Hudi allows us to store multiple versions of data in a dataset over time while providing `snapshot isolation`. The number of versions is configurable and depends on how much storage one is willing to spend on older versions. Hudi therefore has the capability to expose the state of a table (dataset) at a certain point/instant in time. This can be done by adding read mechanisms for a Hudi table (dataset) that go back and read the version of the data that was written at a particular point/instant in time (see instant time).

Background

Hudi provides 3 different query/view types to access data written in 2 different storage types: Copy On Write (COW) and Merge On Read (MOR).

...

We would like to build on these existing capabilities of Hudi to provide point-in-time queries, that is, to access the state and contents of the table at a particular instant in time in the past.

Implementation

Introduce another type of snapshot query scan mode, POINT_IN_TIME. This will enable scanning the Hudi dataset for files created on or before the supplied point_in_time. The challenge, as described above, is to avoid inconsistent results when a user runs the same POINT_IN_TIME query multiple times.

...

We might need a way to standardize how users think about querying Hudi tables using Spark/Hive/Presto. At the moment, I'm proposing that another VIEW_TYPE (query/view type?), POINT_IN_TIME, be introduced in Spark. It will work in conjunction with the already present config END_INSTANTTIME_OPT_KEY to provide the snapshot view of a table at a particular instant in time (similar to select * from table where _hoodie_commit_time <= timeAsOf (commit_time)).
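
The intended semantics can be illustrated with a small model in plain Python. This is only a sketch under stated assumptions, not the proposed reader implementation: records are modeled as dictionaries carrying the `_hoodie_commit_time` and `_hoodie_record_key` meta columns, commit times are assumed to be strings that sort chronologically, deletes are ignored, and `snapshot_as_of` is a hypothetical helper name. For each record key it keeps the newest version written at or before the requested instant.

```python
from typing import Dict, Iterable, List


def snapshot_as_of(records: Iterable[Dict[str, str]], as_of: str) -> List[Dict[str, str]]:
    """Return the latest version of each record written at or before `as_of`.

    Hypothetical helper for illustration only. Assumes commit times are
    strings whose lexicographic order matches chronological order.
    """
    latest: Dict[str, Dict[str, str]] = {}
    for rec in records:
        commit_time = rec["_hoodie_commit_time"]
        if commit_time > as_of:
            # Written after the requested instant: not visible to this query.
            continue
        key = rec["_hoodie_record_key"]
        current = latest.get(key)
        if current is None or commit_time > current["_hoodie_commit_time"]:
            latest[key] = rec
    # Sort by key so repeated runs return rows in the same order.
    return sorted(latest.values(), key=lambda r: r["_hoodie_record_key"])
```

Because the result depends only on the retained versions and the supplied instant, running the same query twice yields the same rows as long as the cleaner has not removed the versions involved.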

Caveats/Open Items

  • The number of versions to keep should match the number of commits the client wants to travel back through. We need a way to enforce this.
  • The proposed approach requires the client to perform some work and imposes some limitations:
    • Time travel is only possible based on the commit times of a Hudi dataset. Clients have to figure out a way to map the timestamp they want to travel to onto the commit time that matches it most closely.
    • Clients have to get a list of the valid timestamps (Hudi commit times) to time travel against.
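
The client-side mapping described in the caveats above could look roughly like the following. It is a minimal sketch, assuming the client has already obtained the list of retained commit times and that they are strings in `yyyyMMddHHmmss` form; `resolve_commit_time` and `COMMIT_TIME_FORMAT` are hypothetical names, not part of Hudi. The sketch picks the newest commit at or before the requested timestamp and fails loudly when no such version has been retained.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Iterable

# Assumed commit time layout (yyyyMMddHHmmss); adjust if the table differs.
COMMIT_TIME_FORMAT = "%Y%m%d%H%M%S"


def resolve_commit_time(retained_commits: Iterable[str], target: datetime) -> str:
    """Map a wall-clock timestamp to the newest retained commit at or before it.

    Hypothetical helper for illustration only. Raises ValueError when the
    target is older than every retained commit.
    """
    commits = sorted(retained_commits)
    target_key = target.strftime(COMMIT_TIME_FORMAT)
    # Number of commits whose time is <= target_key.
    idx = bisect_right(commits, target_key)
    if idx == 0:
        raise ValueError(
            f"no retained commit at or before {target_key}; "
            "the requested version may have been cleaned"
        )
    return commits[idx - 1]
```

Choosing the commit at or before the timestamp, rather than the nearest one in either direction, avoids returning data written after the moment the user asked about.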

Rollout/Adoption Plan

  • This change introduces new configs to enable time travel for Hudi tables and does not change the existing features.

Test Plan

  • Unit tests, including ones that cover different cases of time travel to ensure consistent results.
  • Unit tests, including ones that cover what happens if the requested versions are not present.
  • Testing with the Hudi test suite by running a test workflow for a few days, with the cleaner etc. running alongside queries.

...