...

Status

Current state: "Under Discussion" / INACTIVE

JIRA:   HUDI-86

Released: TBD

...

Hudi provides 3 different def~query-type to access data written in 2 different def~table-type - def~copy-on-write (COW) and def~merge-on-read (MOR).

  1. def~read-optimized-query allows reading data written in columnar format with columnar read performance
  2. def~snapshot-query reads a combination of avro and parquet data, trading read performance for lower data latency (right now applicable only to MOR)
  3. def~incremental-query allows one to extract incremental change logs from a Hudi table; it builds on top of the other 2 query types
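To make the distinction concrete, here is an illustrative sketch (plain Python, not Hudi code) contrasting the three query types over a single MOR file slice consisting of a columnar base file plus a row-based log of later updates:

```python
# Illustrative sketch, not Hudi's implementation: one MOR file slice.
# base: columnar (parquet) file from commit "001".
# log:  row-based (avro) log records from a later commit "002".
base = {"a": {"val": 1, "commit": "001"}, "b": {"val": 2, "commit": "001"}}
log = {"b": {"val": 20, "commit": "002"}, "c": {"val": 30, "commit": "002"}}

def read_optimized():
    # Columnar read performance: base file only; log updates not yet visible.
    return dict(base)

def snapshot():
    # Latest view: merge log records on top of the base file (lower latency,
    # higher read cost).
    merged = dict(base)
    merged.update(log)
    return merged

def incremental(since):
    # Change logs: records in the snapshot committed after instant `since`.
    return {k: v for k, v in snapshot().items() if v["commit"] > since}
```

A read-optimized query still sees the old value of `b`, a snapshot query sees the merged latest values, and an incremental query since commit "001" returns only `b` and `c`.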

At scan time, Hudi supports 2 query modes - Snapshot & Incremental.

  • def~query scan mode: snapshot
  • def~query scan mode: incremental

All of these views are exposed by intercepting the planning stage through the HoodieInputFormat. Before we dive further into incremental changes, I want to bring to everyone's attention what incremental changes from Hudi really mean. For example, if you have a log table (say log events from Kafka), incremental_pull (t1, t2] would provide you all the records that were ingested between those instants (excluding t1, including t2).
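The incremental_pull (t1, t2] semantics above can be sketched as a half-open interval filter on each record's commit instant (a hypothetical helper, not Hudi's actual API):

```python
# Hypothetical sketch of incremental_pull(t1, t2]: a half-open interval
# filter on the commit instant attached to each record.
def incremental_pull(records, t1, t2):
    """Return records ingested at an instant in (t1, t2]: t1 excluded, t2 included."""
    return [r for r in records if t1 < r["_hoodie_commit_time"] <= t2]

events = [
    {"key": "e1", "_hoodie_commit_time": "20190101"},
    {"key": "e2", "_hoodie_commit_time": "20190102"},
    {"key": "e3", "_hoodie_commit_time": "20190103"},
]
# Records ingested after instant 20190101, up to and including 20190103.
changed = incremental_pull(events, "20190101", "20190103")
```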

...

We set certain configurations for `Hudi` to understand whether this is a Snapshot based query (def~query scan mode: snapshot) or an Incremental query (def~query scan mode: incremental). If it is an incremental query, the user can provide the (min, max) lastInstant range to pull incremental changes. For def~read-optimized-query, we are able to figure out exactly which files changed and can limit the number of files read (and hence the amount of data) during the planning phase. For def~snapshot-query, since we are not able to easily figure out which log files have data from which commits/instants (def~commit / def~timeline), we rely on Hive to read the entire data set and on the user to apply filters on def~commit-time in their query, such as: select * from table where _hoodie_commit_time > min and _hoodie_commit_time < max.
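The planning-time difference described above can be sketched as follows (assumed names, not Hudi's actual classes): in incremental mode the planner prunes whole files by commit time, while in snapshot mode on MOR every file is scanned and filtering happens per record via the _hoodie_commit_time predicate:

```python
# Sketch of the planning-phase decision (hypothetical helper, not Hudi code).
def plan_files(files, mode, min_instant=None, max_instant=None):
    """files: list of (path, commit_time) pairs. Returns the paths to scan."""
    if mode == "incremental":
        # Incremental on read-optimized data: keep only files written in
        # (min_instant, max_instant], limiting data read at planning time.
        return [p for p, c in files if min_instant < c <= max_instant]
    # Snapshot mode: scan every file; the _hoodie_commit_time filter is
    # applied per record inside the query instead.
    return [p for p, _ in files]

files = [("f1.parquet", "001"), ("f2.parquet", "002"), ("f3.parquet", "003")]
pruned = plan_files(files, "incremental", "001", "003")
everything = plan_files(files, "snapshot")
```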

...
