Proposers

Approvers

Status

Current state: "Under Discussion"

JIRA:   HUDI-86

Released: TBD

Abstract

Hudi lets us store multiple versions of data in a table over time while providing snapshot isolation. The number of versions is configurable and depends on how much space one is willing to trade for retaining older versions. Hudi therefore has the capability to access the state of a table at a certain instant in time. This could be done by providing read mechanisms that go back and read the version of the data that was written at a particular instant in time.

Background

Hudi provides 3 different views to access data written in 2 different table types, COW (copy-on-write) and MOR (merge-on-read). 1) The RO (read-optimized) view reads data written in columnar format with columnar read performance. 2) The RT (real-time) view combines Avro and Parquet data and optimizes for latency over read performance (right now applicable only to MOR). 3) The incremental view lets one extract incremental change logs from a Hudi table. The incremental view builds on top of the other 2 views. Hudi supports 2 types of queries, Snapshot and Incremental:

  • Snapshot queries read the latest state of a dataset/table.
  • Incremental queries provide the incremental changes in the time range (t1, t2].

All of these views are exposed by intercepting the planning stage through the HoodieInputFormat. Before diving further into incremental changes, it is worth spelling out what incremental changes from Hudi really mean. For example, if you have a log table (say, log events from Kafka), incremental_pull (t1, t2] would return all the records that were ingested between those instants.

If you have a Hudi table/dataset into which you are ingesting database change logs (updates), then incremental_pull (t1, t2] would return all the records that were either inserted or updated between those instants. Each returned record is a full representation of the updated record, that is, updated_record = (unchanged_columns + changed_columns).
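To make the (t1, t2] semantics concrete, here is a minimal in-memory sketch in Python. It is not Hudi code: the `Record` type and the `incremental_pull` function are hypothetical names used only for illustration, the table is modeled as a simple list of written record versions, and instant times are assumed to be fixed-width timestamp strings so that string comparison matches time order.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Record:
    """Hypothetical stand-in for one written version of a row."""
    key: str
    commit_time: str          # plays the role of _hoodie_commit_time
    columns: Dict[str, str]   # the full row, changed and unchanged columns


def incremental_pull(table: List[Record], t1: str, t2: str) -> List[Record]:
    """Return the newest full row image of every key written in (t1, t2]."""
    latest: Dict[str, Record] = {}
    for rec in table:
        # Exclusive lower bound, inclusive upper bound.
        if t1 < rec.commit_time <= t2:
            prev = latest.get(rec.key)
            if prev is None or rec.commit_time > prev.commit_time:
                latest[rec.key] = rec
    return sorted(latest.values(), key=lambda r: r.key)


table = [
    Record("a", "20190101000000", {"name": "x", "city": "SF"}),
    Record("b", "20190102000000", {"name": "y", "city": "LA"}),
    # Only "city" changed for key "a", but the whole row is stored.
    Record("a", "20190103000000", {"name": "x", "city": "NYC"}),
]

changes = incremental_pull(table, "20190101000000", "20190103000000")
```

In this sketch the insert at t1 itself is excluded, while the insert of "b" and the update of "a" are both returned, and the updated record carries the unchanged `name` column alongside the changed `city` column.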

Hive queries

We set certain configurations so that Hudi can tell whether a query is a Snapshot query or an Incremental query. If it is an incremental query, the user can provide the (min, max) lastInstant values to pull incremental changes. For the RO view, we are able to figure out exactly which files changed and can limit the number of files read (and hence the amount of data) during the planning phase. For the RT view, since we cannot easily figure out which log files hold data from which commits/instants, we rely on Hive to read the entire data and on the user to apply filters in the query, such as: select * from table where _hoodie_commit_time > min and _hoodie_commit_time < max.
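The planning-time pruning for the RO view can be pictured with the following sketch. It is a simplified model, not the HoodieInputFormat implementation: `commit_to_files` is a hypothetical mapping from each completed instant to the data files it wrote, the file names are made up, and the range is assumed to follow the (t1, t2] convention used earlier in this page.

```python
from typing import Dict, List, Set


def files_changed_between(
    commit_to_files: Dict[str, List[str]],
    min_instant: str,
    max_instant: str,
) -> Set[str]:
    """Collect only the files written by instants in (min_instant, max_instant]."""
    changed: Set[str] = set()
    for instant, files in commit_to_files.items():
        if min_instant < instant <= max_instant:
            changed.update(files)
    return changed


# Hypothetical commit metadata: instant time -> files written by that commit.
commit_to_files = {
    "20190101000000": ["p1/file_a_v1.parquet", "p2/file_b_v1.parquet"],
    "20190102000000": ["p1/file_a_v2.parquet"],
    "20190103000000": ["p3/file_c_v1.parquet"],
}

to_read = files_changed_between(
    commit_to_files, "20190101000000", "20190103000000"
)
```

Only the two files written after the first instant are handed to the query engine, which is what limits the amount of data read. Rows inside those files would still need a commit-time filter, since a rewritten file can also contain records that did not change.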

Presto queries

For both RO and RT views, we rely on Presto to read the entire data and on the user to apply filters in the query, such as: select * from table where _hoodie_commit_time > min and _hoodie_commit_time < max.

Spark queries

For Spark-based incremental pull on RO views, Hudi has a special technique that avoids the planning phase through the HoodieInputFormat and directly accesses the data that changed. A similar implementation is still required for the RT view; for now, RT views fall back to applying query filters as described above.

-----------------

As you can imagine, there is already a "hacky" way to achieve time travel: select * from table where _hoodie_commit_time <= point_in_time. This is not very efficient, and it can also return incorrect results because point_in_time may differ from a Hudi instant time. If another commit is inflight between point_in_time and the instant time closest to it, this query will return different results on different runs.
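The repeatability problem can be illustrated with a small sketch. This is not the proposed implementation (which this RFC has not yet specified): the timeline is modeled as a hypothetical list of (instant_time, state) pairs, `resolve_instant` and `visible_commits` are illustrative names, and the sketch assumes commits complete in instant-time order.

```python
from typing import List, Optional, Tuple

COMPLETED = "COMPLETED"
INFLIGHT = "INFLIGHT"

Timeline = List[Tuple[str, str]]  # (instant_time, state)


def visible_commits(timeline: Timeline, upper_bound: str) -> List[str]:
    """Commits whose data a `_hoodie_commit_time <= upper_bound` filter returns."""
    return sorted(
        t for t, state in timeline if state == COMPLETED and t <= upper_bound
    )


def resolve_instant(timeline: Timeline, point_in_time: str) -> Optional[str]:
    """Latest completed instant at or before point_in_time, or None."""
    candidates = [
        t for t, state in timeline if state == COMPLETED and t <= point_in_time
    ]
    return max(candidates) if candidates else None


point_in_time = "20190102120000"

# First run: the second commit is still inflight.
run1 = [("20190101000000", COMPLETED), ("20190102000000", INFLIGHT)]
# Second run: the same commit has completed in the meantime.
run2 = [("20190101000000", COMPLETED), ("20190102000000", COMPLETED)]

# Filtering directly on point_in_time sees a different set of commits per run.
naive_run1 = visible_commits(run1, point_in_time)
naive_run2 = visible_commits(run2, point_in_time)

# Resolving once to a concrete instant and reusing it gives the same set.
pinned = resolve_instant(run1, point_in_time)
pinned_run1 = visible_commits(run1, pinned)
pinned_run2 = visible_commits(run2, pinned)
```

Here the naive filter returns one commit on the first run and two on the second, while the query pinned to the resolved instant returns the same single commit both times.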

Note that these queries could be improved by predicate pushdown on the _hoodie_commit_time field, which is out of scope for this RFC.

We would like to build on these existing capabilities of Hudi to provide point-in-time queries, that is, access to the state and contents of the table at a particular instant in time in the past.

Implementation (IN_PROGRESS)


Hive queries

Presto queries

Spark queries








