Bloom Index
This index is built by adding bloom filters with a very high false positive tolerance (e.g: 1/10^9), to the parquet file footers. The advantage of this index over HBase is the obvious removal of a big external dependency, and also nicer handling of rollbacks & partial updates since the index is part of the data file itself.
At runtime, checking the Bloom Index for a given set of record keys effectively amounts to checking all the bloom filters within a given partition, against the incoming records, using a Spark join. Much of the engineering effort towards the Bloom index has gone into scaling this join by caching the incoming RDD[HoodieRecord] and dynamically tuning join parallelism, to avoid hitting Spark limitations like 2GB maximum for partition size. As a result, Bloom Index implementation has been able to handle single upserts upto 5TB, in a reliable manner.
DAG with Range Pruning:
eyJleHRTcnZJbnRlZ1R5cGUiOiIiLCJnQ2xpZW50SWQiOiIiLCJjcmVhdG9yTmFtZSI6IlZpbm90aCBDaGFuZGFyIiwib3V0cHV0VHlwZSI6ImJsb2NrIiwibGFzdE1vZGlmaWVyTmFtZSI6IlZpbm90aCBDaGFuZGFyIiwibGFuZ3VhZ2UiOiJlbiIsImRpYWdyYW1EaXNwbGF5TmFtZSI6IiIsInNGaWxlSWQiOiIiLCJhdHRJZCI6IjExMzcwOTAxNCIsImRpYWdyYW1OYW1lIjoiaG9vZGllLWJsb29tLWluZGV4LWRhZyIsImFzcGVjdCI6IiIsImxpbmtzIjoiYXV0byIsImNlb05hbWUiOiJkZWZ+Ymxvb20taW5kZXgiLCJ0YnN0eWxlIjoidG9wIiwiY2FuQ29tbWVudCI6ZmFsc2UsImRpYWdyYW1VcmwiOiIiLCJjc3ZGaWxlVXJsIjoiIiwiYm9yZGVyIjp0cnVlLCJtYXhTY2FsZSI6IjEiLCJvd25pbmdQYWdlSWQiOjEwMzA5Mzc0MiwiZWRpdGFibGUiOmZhbHNlLCJjZW9JZCI6MTQ0NTEwMDI2LCJwYWdlSWQiOiIiLCJsYm94Ijp0cnVlLCJzZXJ2ZXJDb25maWciOnsiZW1haWxwcmV2aWV3IjoiMSJ9LCJvZHJpdmVJZCI6IiIsInJldmlzaW9uIjowLCJtYWNyb0lkIjoiMjZlN2M4ODItMTZlNC00Mzc0LWJiN2UtZDEyNzI5OWViZWQ5IiwicHJldmlld05hbWUiOiJob29kaWUtYmxvb20taW5kZXgtZGFnLnBuZyIsImxpY2Vuc2VTdGF0dXMiOiJPSyIsInNlcnZpY2UiOiIiLCJpc1RlbXBsYXRlIjoiIiwid2lkdGgiOiI2MDAiLCJzaW1wbGVWaWV3ZXIiOmZhbHNlLCJsYXN0TW9kaWZpZWQiOjE1NzcwODk4NTMwMDAsImV4Y2VlZFBhZ2VXaWR0aCI6ZmFsc2UsIm9DbGllbnRJZCI6IiJ9