
IndexFile -> A file that stores key-value pairs in an efficient file format that allows quick random lookups using indexes, bloom filters, etc., with as few seeks as possible.

Option 3 (discarded)


Some suggestions came up in the discussion thread. A UUID is generally composed of multiple components, one of them being a timestamp. One could use timestamp ordering to generate a (min, max) range for each file, which would help determine which file contains a given UUID without maintaining a separate indexing system (some details here: https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/).
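The range idea can be illustrated with a minimal sketch. It assumes every record key is a version 1 (time-based) UUID, so that java.util.UUID.timestamp() is available. The class UuidRangePruning, its nested FileRange type, and the helper timeBasedUuid are hypothetical names used only for illustration and are not part of Hudi.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Hypothetical sketch: none of these names come from Hudi.
class UuidRangePruning {

    // (min, max) range over the timestamps embedded in the UUIDs of one file.
    static final class FileRange {
        final String fileName;
        final long minTimestamp;
        final long maxTimestamp;

        FileRange(String fileName, long minTimestamp, long maxTimestamp) {
            this.fileName = fileName;
            this.minTimestamp = minTimestamp;
            this.maxTimestamp = maxTimestamp;
        }

        // True if the key's timestamp falls inside this file's range.
        // UUID.timestamp() throws UnsupportedOperationException for non-version-1 UUIDs.
        boolean mayContain(UUID key) {
            long ts = key.timestamp();
            return minTimestamp <= ts && ts <= maxTimestamp;
        }
    }

    // Computes the range for a file from the keys it stores.
    static FileRange rangeOf(String fileName, List<UUID> keys) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (UUID key : keys) {
            long ts = key.timestamp();
            min = Math.min(min, ts);
            max = Math.max(max, ts);
        }
        return new FileRange(fileName, min, max);
    }

    // Returns the names of the files whose range covers the key's timestamp.
    static List<String> candidateFiles(List<FileRange> ranges, UUID key) {
        List<String> result = new ArrayList<>();
        for (FileRange range : ranges) {
            if (range.mayContain(key)) {
                result.add(range.fileName);
            }
        }
        return result;
    }

    // Builds a version 1 UUID carrying the given 60-bit timestamp (illustration only).
    static UUID timeBasedUuid(long timestamp, long node) {
        long timeLow = timestamp & 0xFFFFFFFFL;
        long timeMid = (timestamp >>> 32) & 0xFFFFL;
        long timeHigh = (timestamp >>> 48) & 0x0FFFL;
        long msb = (timeLow << 32) | (timeMid << 16) | (1L << 12) | timeHigh;
        long lsb = 0x8000000000000000L | (node & 0x3FFFFFFFFFFFFFFFL); // IETF variant bits
        return new UUID(msb, lsb);
    }
}
```

A lookup would build one FileRange per file at write time and call candidateFiles at read time. Pruning of this kind is only effective when the ranges of different files overlap little, which in turn requires keys to arrive roughly in timestamp order.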

Update: On looking into this further, we found that there are multiple types and variants of UUID generation, the popular ones being a) random and b) time-based. The commonly used java.util.UUID.randomUUID() is actually a random generator (version 4) and hence does not guarantee any ordering of the parts of the UUID. Although there is a time-based variant of UUID, the UUID may be generated upstream (outside of Hudi), in which case UUID ordering would be broken anyway. Hence, we are discarding this solution.
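The difference between the two variants can be checked directly with java.util.UUID. The following is a small sketch; the class name UuidVersionCheck and its helper methods are hypothetical, while version(), timestamp(), randomUUID() and fromString() are standard JDK methods.

```java
import java.util.UUID;

// Hypothetical helper class, for illustration only.
class UuidVersionCheck {

    // Only version 1 (time-based) UUIDs carry an embedded timestamp.
    static boolean hasEmbeddedTimestamp(UUID id) {
        return id.version() == 1;
    }

    // Returns true if reading the timestamp fails, as it does for random UUIDs.
    static boolean timestampUnsupported(UUID id) {
        try {
            id.timestamp();
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }
}
```

For example, hasEmbeddedTimestamp(UUID.randomUUID()) returns false because randomUUID() produces version 4 UUIDs, whereas hasEmbeddedTimestamp(UUID.fromString("123e4567-e89b-12d3-a456-426614174000")) returns true because the version digit of that string is 1. Since keys generated upstream may be of either kind, a timestamp range cannot be assumed for every dataset.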

Rollout/Adoption Plan

  • Existing users will have to perform a one-time migration of their existing index (bloom, user-driven partition path, etc.) to this index. We will look into writing a tool for this.
  • The contracts of the new index and the old index will not change, so the switch should be seamless to the client.
  • We will need a migration tool to bootstrap the global index from the existing dataset.

...