...
Current state: " Status colour YellowBlue title In ProgressINACTIVE
Released: TBD
...
Hudi provides 3 different def~query-type to access data written in 2 different def~table-type: def~copy-on-write (COW) and def~merge-on-read (MOR).
- def~read-optimized-query reads data written in columnar format, with columnar read performance.
- def~snapshot-query reads a combination of Avro and Parquet data and favors data latency over read performance (currently applicable only to MOR).
- def~incremental-query extracts incremental change logs from a Hudi table. The incremental view builds on top of the other 2 views.
Hudi supports 2 types of queries - Snapshot & Incremental.
- Snapshot: query def~query scan mode snapshot
- Incremental: query def~query scan model mode incremental
All of these views are exposed by intercepting the planning stage through the HoodieInputFormat. Before we dive further into incremental changes, I want to clarify what incremental changes from Hudi really mean. For example, if you have a log table (say, log events from Kafka), incremental_pull (t1, t2] would return all the records that were ingested between those instants.
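To make the (t1, t2] semantics concrete, here is a minimal sketch in plain Python. It is not Hudi code: the `incremental_pull` function, the record shape, and the sample instants are hypothetical, and it assumes instants are timestamp strings that sort lexicographically. Only the `_hoodie_commit_time` field name comes from this page.

```python
from typing import Dict, List

# Hypothetical record shape: each ingested record carries the instant
# (commit time) at which it was written, as a sortable timestamp string.
Record = Dict[str, str]


def incremental_pull(records: List[Record], t1: str, t2: str) -> List[Record]:
    """Return records whose commit time falls in the half-open range (t1, t2]."""
    return [r for r in records if t1 < r["_hoodie_commit_time"] <= t2]


# Illustrative log table with three ingested events (made-up data).
log_table = [
    {"_hoodie_commit_time": "20200101100000", "event": "login"},
    {"_hoodie_commit_time": "20200101110000", "event": "click"},
    {"_hoodie_commit_time": "20200101120000", "event": "logout"},
]

pulled = incremental_pull(log_table, "20200101100000", "20200101120000")
events = [r["event"] for r in pulled]
```

The record written exactly at t1 is excluded and the one written exactly at t2 is included, so consecutive pulls that reuse the previous t2 as the next t1 neither skip nor repeat records.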
...
We set certain configurations for `Hudi` to understand whether this is a Snapshot based query (query def~query scan mode snapshot) or an Incremental query (query def~query scan model mode incremental). If it is an incremental query, the user can provide the (min, max) lastInstant to pull incremental changes. For def~read-optimized-query, we can determine exactly which files changed and limit the number of files read (and hence the amount of data) during the planning phase. For def~snapshot-query, since we cannot easily determine which log files hold data from which commits/instants (def~commit / def~timeline), we rely on Hive to read the entire data and on the user to apply filters in their query, such as: `select * from table where _hoodie_commit_time > min and _hoodie_commit_time < max` (def~commit-time).
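The planning-time pruning described for def~read-optimized-query can be sketched as follows. This is an illustrative model only, not the HoodieInputFormat implementation: `files_changed_between`, the `timeline` mapping from instant to files written, and the file names are all hypothetical, and the range is assumed to be half-open (min, max] to match incremental_pull above.

```python
from typing import Dict, List, Set


def files_changed_between(
    commit_files: Dict[str, List[str]], min_instant: str, max_instant: str
) -> Set[str]:
    """Collect the files written by commits in the range (min_instant, max_instant]."""
    changed: Set[str] = set()
    for instant, files in commit_files.items():
        # Instants are assumed to be sortable strings, so string comparison suffices.
        if min_instant < instant <= max_instant:
            changed.update(files)
    return changed


# Hypothetical timeline: which files each commit wrote.
timeline = {
    "001": ["part-a.parquet"],
    "002": ["part-b.parquet"],
    "003": ["part-c.parquet"],
    "004": ["part-d.parquet"],
}

# Only files touched by commits 002 and 003 need to be scanned.
to_scan = files_changed_between(timeline, "001", "003")
```

For def~snapshot-query no such mapping from log files to commits is readily available, which is why the whole table is read and the `_hoodie_commit_time` filter is left to the user's query.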
...