Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Hudi provides 3 different views to access data written in 2 different table types - COW and MOR. 1)

  1. RO view allows to reading data written in columnar format with columnar read performance

...

  1. RT view is a combination of avro and parquet and optimizes for latency vs read performance (right now applicable only for MOR)

...

  1. Incremental view allows one to extract incremental change logs from a Hudi table. The incremental view really builds on top of the other 2 views.

Hudi supports 2 types of queries - Snapshot & Incremental.

  • Snapshot queries as the latest state of a dataset/table. The default scan mode is set to LATEST.
  • Incremental queries provides incremental changes between time (t1,t2], can control this by using a scan mode called INCREMENTAL

All of these view are exposed by intercepting the planning stage through the HoodieInputFormat. Before we dive further into incremental changes, I want to bring to everyone's attention what incremental changes from Hudi really mean. For eg, if you have a log table (say log events from Kafka), incremental_pull (t1, t2] would provide you all the records that were ingested between those instants.

...

Introduce another type of snapshot scan mode, POINT_IN_TIME_SCAN. This will enable  scanning the Hudi dataset for files created on or before the supplied point_in_time. The challenge as described above is to have a solution around inconsistent results when a user runs the same POINT_IN_TIME query multiple times. There are 2 options here : 

...

Hive queries

Users can pass a new scan mode via the following config in JobConf → https://github.com/apache/incubator-hudi/blob/master/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieHiveUtil.java#L30

...

Incremental pull using Spark works with a custom implementation of import org.apache.spark.sql.sources.BaseRelation (called IncrementalRelation) which based on the VIEW_TYPE that the user can pass from a Spark DataSource. This is slightly different from the way users query Hive/Presto tables. The Hive/Presto tables don't really expose the RO/RT/IC views to users but instead expose InputFormats that build on those views. As part of those input formats, we support different scan modes modes such as SNAPSHOT or INCREMENTAL (and now our proposed POINT_IN_TIME_SNAPSHOT). 

...