Hudi provides efficient upserts by consistently mapping a def~record-key + def~partition-path combination to a def~file-id via an indexing mechanism. This mapping between record key and file group/file id never changes once the first version of a record has been written to a file group. In short, the mapped file group contains all versions of a group of records.
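The stable key-to-file mapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Hudi's implementation: `ToyIndex` and `tag` are hypothetical names, the index lives in a plain dictionary, and every new key gets its own freshly generated file id, whereas a real file group holds many records.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class ToyIndex:
    """Hypothetical in-memory index: (partition_path, record_key) -> file_id."""

    mapping: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def tag(self, partition_path: str, record_key: str) -> Tuple[str, str]:
        """Return (operation, file_id) for an incoming record."""
        key = (partition_path, record_key)
        if key in self.mapping:
            # Key already indexed: route the new version to the same file id.
            return "update", self.mapping[key]
        # First version of this record: choose a file id and pin the mapping.
        # (Assumption: one new file id per key, purely to keep the toy small.)
        file_id = str(uuid.uuid4())
        self.mapping[key] = file_id
        return "insert", file_id


index = ToyIndex()
op1, fid1 = index.tag("2024/01/01", "user-42")
op2, fid2 = index.tag("2024/01/01", "user-42")
print(op1, op2, fid1 == fid2)  # insert update True
```

Because the mapping is never rewritten after the first write, every later version of `user-42` in that partition lands in the same file id, which is what lets an upsert touch only the affected file group.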
Hudi currently provides two choices for indexes, `BloomIndex` (def~bloom-index) and `HBaseIndex` (def~hbase-index), with a few more in the works.
A `global` index does not need partition information to find the file-id for a record key, but a `non-global` index does.

HBase Index (global): Here, we simply use HBase to store the mapping above. The challenge with using HBase (or any external key-value store, for that matter) is performing rollback of a commit and handling partial index updates.

Bloom Index (non-global): This index is built by adding bloom filters with a very low false-positive rate (e.g. 1/10^9) to the parquet file footers. The advantage of this index over HBase is the removal of a big external dependency, and also nicer handling of rollbacks and partial updates, since the index is part of the data file itself.
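To make the Bloom index idea concrete, here is a minimal Python sketch of a non-global lookup that consults a per-file bloom filter before reading any keys. Everything in it is an assumption made for illustration: `BloomFilter`, `ToyDataFile`, and `lookup` are hypothetical names, the filter uses textbook sizing formulas with SHA-256 based double hashing, and the "footer" is just an attribute on an in-memory object rather than a parquet footer.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter sized for n keys at false-positive rate p."""

    def __init__(self, expected_keys: int, fp_rate: float):
        # Textbook sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = max(1, math.ceil(-expected_keys * math.log(fp_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / expected_keys * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive k bit positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


class ToyDataFile:
    """Stand-in for a data file whose footer carries a bloom filter over its keys."""

    def __init__(self, file_id, partition_path, keys):
        self.file_id = file_id
        self.partition_path = partition_path
        self.keys = set(keys)  # in a real file, checking these means reading the file
        self.footer_filter = BloomFilter(expected_keys=max(1, len(self.keys)), fp_rate=1e-9)
        for k in self.keys:
            self.footer_filter.add(k)


def lookup(files, partition_path, record_key):
    """Non-global lookup: only files in the record's partition are considered."""
    for f in files:
        if f.partition_path != partition_path:
            continue
        if not f.footer_filter.might_contain(record_key):
            continue  # definitely absent: skip without reading the keys
        if record_key in f.keys:  # confirm, since a bloom filter can give false positives
            return f.file_id
    return None


files = [
    ToyDataFile("file-a", "2024/01/01", ["k1", "k2"]),
    ToyDataFile("file-b", "2024/01/02", ["k3"]),
]
print(lookup(files, "2024/01/01", "k2"))  # file-a
print(lookup(files, "2024/01/01", "k3"))  # None: k3 lives in another partition
```

A global variant of this sketch would drop the partition check and probe every file, which is why a global index can resolve a key without partition information while a non-global one cannot. Since the filter travels with the file, discarding a file on rollback discards its index entries too.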
[Diagram: hoodie-bloom-index-dag]