...

Hudi stores multiple versions of data in a dataset over time while providing snapshot isolation. The number of versions is configurable and depends on how much space one is willing to trade for keeping those older versions. Hence, as one can imagine, Hudi has the ability to expose the state of a table at a certain instant in time. This can be done by allowing read mechanisms on a Hudi table to go back and read the version of the data that was written at a particular instant in time.
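For example, the version-retention tradeoff is governed by the cleaner configs; a minimal sketch (the values shown are illustrative, not recommendations; consult the Hudi configuration reference for defaults):

```properties
# Keep the file versions produced by the last N commits so that reads can
# travel back up to N commits. Larger N means more history but more storage.
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained=24
```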

...

  • Snapshot queries return the latest state of a dataset/table. The default scan mode is set to LATEST.

...

All of these views are exposed by intercepting the planning stage through the HoodieInputFormat. Before we dive further into incremental changes, I want to bring to everyone's attention what incremental changes from Hudi really mean. For example, if you have a log table (say log events from Kafka), incremental_pull (t1, t2] would provide you all the records that were ingested between those instants.

If you have a Hudi table/dataset into which you are ingesting database change logs (updates), then incremental_pull (t1, t2] would provide you all the records that were either inserted or updated between those time instants. These records are a full representation of the updated record, meaning, updated_record = (unchanged_columns + changed_columns).
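The (t1, t2] semantics above can be sketched in plain Java; `Record` and `commitTime` here are hypothetical stand-ins for a Hudi record and its `_hoodie_commit_time` metadata column, not the actual Hudi API:

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch only: models the half-open interval (t1, t2] used by
// incremental pull. Commit times compare lexicographically, matching Hudi's
// yyyyMMddHHmmss-style instant strings.
public class IncrementalPullSketch {

    static class Record {
        final String key;
        final String commitTime; // stand-in for _hoodie_commit_time
        Record(String key, String commitTime) {
            this.key = key;
            this.commitTime = commitTime;
        }
    }

    // Returns records whose commit time lies in (t1, t2]: strictly after t1,
    // at or before t2.
    static List<Record> incrementalPull(List<Record> records, String t1, String t2) {
        return records.stream()
                .filter(r -> r.commitTime.compareTo(t1) > 0
                          && r.commitTime.compareTo(t2) <= 0)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Record> records = List.of(
                new Record("a", "001"), new Record("b", "002"), new Record("c", "003"));
        // (001, 003] excludes the record written at instant 001 itself.
        System.out.println(incrementalPull(records, "001", "003").size()); // prints 2
    }
}
```

Note the left bound is exclusive: a record written exactly at t1 was already delivered by the previous pull ending at t1.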

...

Introduce another type of snapshot scan mode, POINT_IN_TIME_SCAN. This will enable scanning the Hudi dataset for files created on or before the supplied point_in_time. The challenge, as described above, is to provide a solution for the inconsistent results a user can get when running the same POINT_IN_TIME query multiple times. There are 2 options here:
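Mechanically, whichever option is chosen, the scan mode amounts to picking, per file group, the latest file version created at or before the supplied instant. A minimal sketch, where `FileVersion` and `filesAsOf` are hypothetical stand-ins for a file slice and the planning-time filter, not the actual Hudi API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: for each file group, keep the newest version
// whose commit time is <= the requested point_in_time, and ignore anything
// written after it.
public class PointInTimeScanSketch {

    static class FileVersion {
        final String fileGroupId;
        final String commitTime;
        FileVersion(String fileGroupId, String commitTime) {
            this.fileGroupId = fileGroupId;
            this.commitTime = commitTime;
        }
    }

    // Returns the visible file version per file group as of pointInTime.
    static Map<String, FileVersion> filesAsOf(List<FileVersion> all, String pointInTime) {
        Map<String, FileVersion> latest = new HashMap<>();
        for (FileVersion fv : all) {
            if (fv.commitTime.compareTo(pointInTime) <= 0) {
                latest.merge(fv.fileGroupId, fv,
                        (a, b) -> a.commitTime.compareTo(b.commitTime) >= 0 ? a : b);
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        List<FileVersion> files = List.of(
                new FileVersion("fg1", "001"), new FileVersion("fg1", "003"),
                new FileVersion("fg2", "002"));
        Map<String, FileVersion> visible = filesAsOf(files, "002");
        // As of instant 002: fg1's version from 001, fg2's version from 002.
        System.out.println(visible.get("fg1").commitTime + "," + visible.get("fg2").commitTime); // prints 001,002
    }
}
```

The consistency question the two options below address is which point_in_time values are safe to pass into such a filter.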

1) Hudi provides a list of timestamps that the user can supply as the point_in_time to query against. Hudi writes the commit/instant times to a timeline metadata folder and provides APIs to read the timeline. At the moment there are 2 ways to read the timeline:
  a) The HoodieActiveTimeline class can be instantiated on the client to get a list of the completed commits. Hudi also exposes a HoodieReadClient that lets users read certain contents of, and perform certain read-level operations on, the Hudi dataset. I'm proposing we add another method, getCompletedInstants(), that allows users to programmatically get a list of all the timestamps against which they can query the table and get consistent results.
  b) Users of Hive queries who are more SQL friendly can run a query such as: select _hoodie_commit_time from table sort by _hoodie_commit_time, which provides an ordered list of all the commit times.
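The proposed getCompletedInstants() could be sketched as follows; `Instant`, `State`, and the surrounding class are hypothetical stand-ins for Hudi's timeline types, shown only to pin down the intended contract:

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch only: completed instants are the only ones a
// point-in-time query can safely target, since inflight commits may still
// change what a scan at that instant would see.
public class ReadClientSketch {

    enum State { REQUESTED, INFLIGHT, COMPLETED }

    static class Instant {
        final String timestamp;
        final State state;
        Instant(String timestamp, State state) {
            this.timestamp = timestamp;
            this.state = state;
        }
    }

    // Returns the sorted timestamps of completed commits on the timeline.
    static List<String> getCompletedInstants(List<Instant> timeline) {
        return timeline.stream()
                .filter(i -> i.state == State.COMPLETED)
                .map(i -> i.timestamp)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Instant> timeline = List.of(
                new Instant("003", State.INFLIGHT),
                new Instant("001", State.COMPLETED),
                new Instant("002", State.COMPLETED));
        System.out.println(getCompletedInstants(timeline)); // prints [001, 002]
    }
}
```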

...

  • The number of versions to keep should match the number of commits the client wants to travel back. Need a way to enforce this.
  • The proposed approach requires the client to perform some work and imposes some limitations:
    • Time travel is only possible against the commit times of a Hudi dataset. Clients have to figure out a way to map the timestamp they want to travel to onto the commit time that matches it most closely.
    • Clients have to get a list of the valid timestamps (Hudi commit times) to time travel against.
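One way a client could do that mapping, assuming it has already fetched the completed commit times: take the greatest commit time at or before the desired timestamp. A hedged sketch using the JDK's TreeSet (the class and method names here are hypothetical, not part of Hudi):

```java
import java.util.List;
import java.util.TreeSet;

// Illustrative sketch only: snap an arbitrary wall-clock timestamp onto the
// closest valid Hudi commit time at or before it.
public class CommitMappingSketch {

    // Returns the greatest commit time <= desired, or null if the desired
    // timestamp predates the oldest retained commit.
    static String closestCommitAtOrBefore(TreeSet<String> commitTimes, String desired) {
        return commitTimes.floor(desired);
    }

    public static void main(String[] args) {
        TreeSet<String> commits = new TreeSet<>(List.of("001", "003", "005"));
        System.out.println(closestCommitAtOrBefore(commits, "004")); // prints 003
        System.out.println(closestCommitAtOrBefore(commits, "000")); // prints null
    }
}
```

The null case is exactly the retention limitation noted above: if the cleaner has removed the versions for that window, no valid commit exists to travel to.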

...