Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • Indexing : A big part of Hudi's efficiency comes from indexing the mapping from record keys to the file ids, to which they belong to. This index also helps the `HoodieWriteClient` separate upserted records into inserts and updates, so they can be treated differently. `HoodieReadClient` supports operations such as `filterExists` (used for de-duplication of table) and an efficient batch `read(keys)` api, that can read out the records corresponding to the keys using the index much quickly, than a typical scan via a query. The index is also atomically updated each commit, and is also rolled back when commits are rolled back.
  • Storage : The storage part of the DAG is responsible for taking an `RDD[HoodieRecord]`, that has been tagged as an insert or update via index lookup, and writing it out efficiently onto storage.

File Layout

<WIP>

Indexing

<WIP>

Hudi currently provides two choices for indexes : `BloomIndex` and `HBaseIndex` to map a record key into the file id to which it belongs to. This enables us to speed up upserts significantly, without scanning over every record in the dataset. Hudi Indices can be classified based on their ability to lookup records across partition. A `global` index does not need partition information for finding the file-id for a record key but a `non-global` does.

...

This index is built by adding bloom filters with a very high false positive tolerance (e.g: 1/10^9), to the parquet file footers. The advantage of this index over HBase is the obvious removal of a big external dependency, and also nicer handling of rollbacks & partial updates since the index is part of the data file itself.

At runtime, checking the Bloom Index for a given set of record keys effectively amounts to checking all the bloom filters within a given partition, against the incoming records, using a Spark join. Much of the engineering effort towards the Bloom index has gone into scaling this join by caching the incoming RDD[HoodieRecord] and dynamically tuning join parallelism, to avoid hitting Spark limitations like 2GB maximum for partition size. As a result, Bloom Index implementation has been able to handle single upserts upto 5TB, in a reliable manner.

DAG with Range Pruning:

draw.io Diagram
bordertrue
viewerToolbartrue
fitWindowfalse
diagramNamehoodie-bloom-index-dag
simpleViewerfalse
width
diagramWidth1003
revision2

Compaction

<WIP>

File Sizing

<WIP>

Querying

<WIP>

Snapshot Queries

<WIP>

Incremental Queries

<WIP>

Read Optimized

<WIP>