...

Status

Current state: "Under Discussion" / INACTIVE

JIRA:   HUDI-86

Released: TBD

...

Hudi provides 3 different def~query-type to access data written in 2 different def~table-type - def~copy-on-write (COW) and def~merge-on-read (MOR).

  1. def~read-optimized-query allows reading data written in columnar format with columnar read performance
  2. def~snapshot-query reads a combination of avro and parquet data, trading read performance for lower data latency (right now applicable only to MOR)
  3. def~incremental-query allows one to extract incremental change logs from a Hudi table; it builds on top of the other 2 query types
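To make the distinction concrete, here is an illustrative sketch (plain Python, not Hudi code) contrasting the three query types over a single MOR file slice consisting of a columnar base file plus a row-based log of later updates:

```python
# Illustrative sketch, not Hudi's implementation: one MOR file slice.
# base: columnar (parquet) file from commit "001".
# log:  row-based (avro) log records from a later commit "002".
base = {"a": {"val": 1, "commit": "001"}, "b": {"val": 2, "commit": "001"}}
log = {"b": {"val": 20, "commit": "002"}, "c": {"val": 30, "commit": "002"}}

def read_optimized():
    # Columnar read performance: base file only; log updates not yet visible.
    return dict(base)

def snapshot():
    # Latest view: merge log records on top of the base file (lower latency,
    # higher read cost).
    merged = dict(base)
    merged.update(log)
    return merged

def incremental(since):
    # Change logs: records in the snapshot committed after instant `since`.
    return {k: v for k, v in snapshot().items() if v["commit"] > since}
```

A read-optimized query still sees the old value of `b`, a snapshot query sees the merged latest values, and an incremental query since commit "001" returns only `b` and `c`.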

At scan time, Hudi supports 2 query modes - Snapshot & Incremental.

  • def~query scan mode: snapshot
  • def~query scan mode: incremental

All of these views are exposed by intercepting the planning stage through the HoodieInputFormat. Before we dive further into incremental changes, I want to bring to everyone's attention what incremental changes from Hudi really mean. For example, if you have a log table (say log events from Kafka), incremental_pull (t1, t2] would provide you all the records that were ingested between those instants (excluding t1, including t2).
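The incremental_pull (t1, t2] semantics above can be sketched as a half-open interval filter on each record's commit instant (a hypothetical helper, not Hudi's actual API):

```python
# Hypothetical sketch of incremental_pull(t1, t2]: a half-open interval
# filter on the commit instant attached to each record.
def incremental_pull(records, t1, t2):
    """Return records ingested at an instant in (t1, t2]: t1 excluded, t2 included."""
    return [r for r in records if t1 < r["_hoodie_commit_time"] <= t2]

events = [
    {"key": "e1", "_hoodie_commit_time": "20190101"},
    {"key": "e2", "_hoodie_commit_time": "20190102"},
    {"key": "e3", "_hoodie_commit_time": "20190103"},
]
# Records ingested after instant 20190101, up to and including 20190103.
changed = incremental_pull(events, "20190101", "20190103")
```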

...

We set certain configurations for `Hudi` to understand whether this is a Snapshot based query (def~query scan mode: snapshot) or an Incremental query (def~query scan mode: incremental). If it is an incremental query, the user can provide the (min, max) lastInstant range to pull incremental changes. For def~read-optimized-query, we are able to figure out exactly which files changed and can limit the number of files read (and hence the amount of data) during the planning phase. For def~snapshot-query, since we are not able to easily figure out which log files have data from which commits/instants (def~commit / def~timeline), we rely on Hive to read the entire data set and on the user to apply filters on def~commit-time in their query, such as: select * from table where _hoodie_commit_time > min and _hoodie_commit_time < max.
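The planning-time difference described above can be sketched as follows (assumed names, not Hudi's actual classes): in incremental mode the planner prunes whole files by commit time, while in snapshot mode on MOR every file is scanned and filtering happens per record via the _hoodie_commit_time predicate:

```python
# Sketch of the planning-phase decision (hypothetical helper, not Hudi code).
def plan_files(files, mode, min_instant=None, max_instant=None):
    """files: list of (path, commit_time) pairs. Returns the paths to scan."""
    if mode == "incremental":
        # Incremental on read-optimized data: keep only files written in
        # (min_instant, max_instant], limiting data read at planning time.
        return [p for p, c in files if min_instant < c <= max_instant]
    # Snapshot mode: scan every file; the _hoodie_commit_time filter is
    # applied per record inside the query instead.
    return [p for p, _ in files]

files = [("f1.parquet", "001"), ("f2.parquet", "002"), ("f3.parquet", "003")]
pruned = plan_files(files, "incremental", "001", "003")
everything = plan_files(files, "snapshot")
```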

...
