Hudi provides efficient upserts by consistently mapping a def~record-key + def~partition-path combination to a def~file-id via an indexing mechanism. This mapping between record key and file group/file id never changes once the first version of a record has been written to a file group. In short, the mapped file group contains all versions of a group of records. Hudi currently provides two index choices, def~bloom-index and def~hbase-index (with a few more in the works: HUDI-466, HUDI-407), to map a record key to the file id it belongs to. This speeds up upserts significantly, because the writer does not have to scan every record in the table.
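The sticky mapping described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyIndex`, `tag`, and `lookup` are hypothetical names invented for illustration and are not part of Hudi's API; a real index (bloom or HBase based) stores and probes this mapping very differently.

```python
from typing import Dict, Optional, Tuple


class ToyIndex:
    """Hypothetical in-memory stand-in for an index.

    Maps (partition path, record key) -> file id. Not Hudi code.
    """

    def __init__(self) -> None:
        self._location: Dict[Tuple[str, str], str] = {}

    def lookup(self, partition_path: str, record_key: str) -> Optional[str]:
        # A single dictionary probe; no scan over the table's records.
        return self._location.get((partition_path, record_key))

    def tag(self, partition_path: str, record_key: str,
            new_file_id: str) -> Tuple[str, str]:
        """Classify an incoming record as an insert or an update."""
        key = (partition_path, record_key)
        existing = self._location.get(key)
        if existing is not None:
            # The mapping never changes once the first version was written,
            # so later versions go to the same file group.
            return ("update", existing)
        self._location[key] = new_file_id
        return ("insert", new_file_id)


index = ToyIndex()
first = index.tag("2020/01/01", "uuid-1", new_file_id="file-A")
second = index.tag("2020/01/01", "uuid-1", new_file_id="file-B")
print(first)   # ('insert', 'file-A')
print(second)  # ('update', 'file-A')
```

The second write is routed to `file-A` even though a different file id was offered, which is the property that keeps all versions of a record in one file group.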


Hudi indices can be classified by their ability to look up records across partitions.

  • A global index does not need partition information to find the file-id for a record key, i.e. the writer can pass in null or any string as the def~partition-path and the index lookup will find the location of the def~record-key nonetheless. A global index can be very useful in cases where the uniqueness of the record key needs to be guaranteed across the entire def~table. The cost of the index lookup, however, grows as a function of the size of the entire table.
  • A non-global index, on the other hand, relies on the partition path and only looks for a given def~record-key among the files belonging to the corresponding def~table-partition. This is suitable in cases where it is always possible to generate the partition path associated with a record key, and it scales better, since the cost of indexing only grows as a function of the set of def~table-partitions actually written to.
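The difference between the two lookup styles can be sketched with a toy table. Everything here is an assumption made for illustration: the `table` layout and the functions `global_lookup` and `non_global_lookup` are hypothetical and do not correspond to Hudi classes or methods.

```python
from typing import Dict, Optional

# Hypothetical table layout: partition path -> {record key -> file id}.
table: Dict[str, Dict[str, str]] = {
    "2020/01/01": {"uuid-1": "file-A"},
    "2020/01/02": {"uuid-2": "file-B"},
}


def non_global_lookup(partition_path: Optional[str],
                      record_key: str) -> Optional[str]:
    """Search only the given partition; cost depends on that partition."""
    if partition_path is None:
        # Toy choice: without a partition path there is nothing to search.
        return None
    return table.get(partition_path, {}).get(record_key)


def global_lookup(partition_path: Optional[str],
                  record_key: str) -> Optional[str]:
    """Ignore the partition path and search every partition.

    Cost grows with the size of the whole table.
    """
    for keys_to_files in table.values():
        if record_key in keys_to_files:
            return keys_to_files[record_key]
    return None


wrong_partition = "1999/12/31"
global_hit = global_lookup(wrong_partition, "uuid-2")
non_global_miss = non_global_lookup(wrong_partition, "uuid-2")
non_global_hit = non_global_lookup("2020/01/02", "uuid-2")
print(global_hit, non_global_miss, non_global_hit)  # file-B None file-B
```

With a wrong or missing partition path the global lookup still finds the record, while the non-global lookup only succeeds when the writer supplies the correct partition path.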