Page History

...

This index is built by adding bloom filters with a very high false positive tolerance (e.g: 1/10^9), to the parquet file footers. The advantage of this index over HBase is the obvious removal of a big external dependency, and also nicer handling of rollbacks & partial updates since the index is part of the data file itself.

At runtime, checking the Bloom Index for a given set of record keys effectively amounts to checking all the bloom filters within a given partition, against the incoming records, using a Spark join. Much of the engineering effort towards the Bloom index has gone into scaling this join by caching the incoming RDD[HoodieRecord] and dynamically tuning join parallelism, to avoid hitting Spark limitations like 2GB maximum for partition size. As a result, Bloom Index implementation has been able to handle single upserts upto 5TB, in a reliable manner.

DAG with Range Pruning:

draw.io Diagram

border	true

diagramName	hoodie-bloom-index-dag
simpleViewer	false
width	600
links	auto
tbstyle	top
pageId	103093742
lbox	true
diagramWidth	1002.0542713885977

Space shortcuts

Page tree

Versions Compared

Old Version 1

New Version 2

Key