
select * from table where _hoodie_commit_time <= timeAsOf (commit_time)

This RFC does not propose implementation/changes for Presto.


Spark queries

Incremental pull using Spark works through a custom implementation of org.apache.spark.sql.sources.BaseRelation (called IncrementalRelation), which is selected based on the VIEW_TYPE that the user passes through the Spark DataSource. This is slightly different from the way users query Hive/Presto tables. Hive/Presto tables don't expose the RO/RT/IC views to users directly; instead they expose InputFormats that build on those views. As part of those input formats, we support different scan modes such as SNAPSHOT or INCREMENTAL (and now our proposed POINT_IN_TIME_SNAPSHOT).

We might need a way to standardize how users think about querying hudi tables using Spark/Hive/Presto. At the moment, I'm proposing that another VIEW_TYPE, POINT_IN_TIME, be introduced in Spark. It will work in conjunction with the already present config END_INSTANTTIME_OPT_KEY to provide the snapshot view of a table at a particular instant in time (similar to select * from table where _hoodie_commit_time <= timeAsOf (commit_time)).
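As a rough illustration of how a client might pass the proposed view type, the sketch below builds the reader options as a plain dictionary. The option key strings are assumptions based on the names VIEW_TYPE and END_INSTANTTIME_OPT_KEY used above, the POINT_IN_TIME value is only the value proposed in this RFC, and point_in_time_read_options is a hypothetical helper, not part of hudi.

```python
# Assumed option keys; check them against DataSourceReadOptions in the
# hudi version in use before relying on them.
VIEW_TYPE_OPT_KEY = "hoodie.datasource.view.type"
END_INSTANTTIME_OPT_KEY = "hoodie.datasource.read.end.instanttime"

# Value proposed by this RFC; it does not exist until the RFC is implemented.
POINT_IN_TIME_VIEW = "POINT_IN_TIME"


def point_in_time_read_options(end_instant_time):
    """Build reader options for a snapshot as of end_instant_time.

    Hypothetical helper: end_instant_time is a hudi commit time given as a
    string of digits, for example "20200102100000".
    """
    if not end_instant_time.isdigit():
        raise ValueError("commit time must contain only digits")
    return {
        VIEW_TYPE_OPT_KEY: POINT_IN_TIME_VIEW,
        END_INSTANTTIME_OPT_KEY: end_instant_time,
    }


# With a SparkSession named spark, the options could be applied roughly as:
#   opts = point_in_time_read_options("20200102100000")
#   df = spark.read.format("org.apache.hudi").options(**opts).load(base_path)
```

Keeping the options in one helper means the proposed view type and the end instant are always set together, which is what the proposal relies on.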

Caveats

  • The number of versions to keep should match the number of commits the client wants to travel back. We need a way to enforce this.
  • The proposed approach pushes some work onto the client and enforces some limitations:
    • Clients can only time-travel based on the commit times of a hudi dataset. They have to figure out a way to map the timestamp they want to travel to against the commit time that matches closest to it.
    • Clients have to get a list of the valid timestamps (hudi commit times) to time travel against.
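The client-side mapping described in these caveats could be sketched as follows. This assumes commit times are fixed-width strings in yyyyMMddHHmmss form and that the client has already obtained the list of valid commit times; resolve_commit_time is a hypothetical helper. It interprets "closest" as the latest commit at or before the requested timestamp, which is consistent with the <= filter shown earlier but is only one possible choice.

```python
from bisect import bisect_right
from datetime import datetime

# Assumption: commit times are formatted as yyyyMMddHHmmss.
COMMIT_TIME_FORMAT = "%Y%m%d%H%M%S"


def resolve_commit_time(commit_times, as_of):
    """Return the latest commit time <= as_of, or None if there is none.

    commit_times: iterable of commit time strings, in any order.
    as_of: datetime the client wants to travel to.
    """
    ordered = sorted(commit_times)
    target = as_of.strftime(COMMIT_TIME_FORMAT)
    # Fixed-width digit strings sort lexicographically in time order,
    # so a binary search over the strings is enough.
    idx = bisect_right(ordered, target)
    return ordered[idx - 1] if idx > 0 else None


commits = ["20200103100000", "20200101100000", "20200102100000"]
resolved = resolve_commit_time(commits, datetime(2020, 1, 2, 12, 0, 0))
```

A None result means the requested time is older than every commit in the list, which is the case a client hits when the versions it wants have already been cleaned.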

Rollout/Adoption Plan

  • This change introduces new configs to enable time travel for hudi tables and does not change any existing features.

Test Plan

  • Unit tests including ones that cover different cases of time-travel to ensure consistent results.
  • Unit tests including ones that cover what happens if the versions are not present.
  • Testing with the hudi test suite by running a test workflow for a few days, with the cleaner etc. and queries running.