You are viewing an old version of this page. View the current version.

Compare with Current View Page History

Version 1 Next »

Hudi currently provides two choices for indexes : `BloomIndex` and `HBaseIndex` to map a record key into the file id to which it belongs to. This enables us to speed up upserts significantly, without scanning over every record in the table. Hudi Indices can be classified based on their ability to lookup records across partition. A `global` index does not need partition information for finding the file-id for a record key but a `non-global` does.

HBase Index (global)

Here, we just use HBase in a straightforward way to store the mapping above. The challenge with using HBase (or any external key-value store for that matter) is performing rollback of a commit and handling partial index updates.
Since the HBase table is indexed by record key and not commit Time, we would have to scan all the entries which will be prohibitively expensive. Instead, we store the commit time with the value and discard its value if it does not belong to a valid commit.

Bloom Index (non-global)

This index is built by adding bloom filters with a very high false positive tolerance (e.g: 1/10^9), to the parquet file footers. The advantage of this index over HBase is the obvious removal of a big external dependency, and also nicer handling of rollbacks & partial updates since the index is part of the data file itself.

At runtime, checking the Bloom Index for a given set of record keys effectively amounts to checking all the bloom filters within a given partition, against the incoming records, using a Spark join. Much of the engineering effort towards the Bloom index has gone into scaling this join by caching the incoming RDD[HoodieRecord] and dynamically tuning join parallelism, to avoid hitting Spark limitations like 2GB maximum for partition size. As a result, Bloom Index implementation has been able to handle single upserts upto 5TB, in a reliable manner.

DAG with Range Pruning:

draw.io

Diagram attachment access error: cannot display diagram

  • No labels