...

Let's say that, based on initial estimates, 1000 buckets were pre-allocated. If a bucket reaches the tipping point, i.e. 1M entries, we can cap that file group and start a new one. Remember that with compaction in place, all HFiles in a file group will be compacted into one HFile. So, if we don't expand to another file group, at some point a single HFile could hold 2M entries or more, which might give us bad performance. Hence, we cap a file group at, say, 1M entries and start another file group. What this essentially means is that the bucket of interest is now equivalent to two virtual buckets. The hashing and the number of buckets remain the same, but the index is capable of scaling to a larger number of entries.
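
To make this concrete, here is a minimal sketch (the class, field, and file group names are hypothetical, not part of the design) of how record keys could be hashed to a fixed set of buckets while a bucket that hits the ~1M entry cap spills into an additional file group, i.e. a virtual bucket:

  import java.util.ArrayList;
  import java.util.List;

  // Hypothetical sketch: maps record keys to a fixed set of buckets and lets a
  // bucket spill into additional ("virtual") file groups once it hits the cap.
  public class BucketIndexSketch {

    private static final int NUM_BUCKETS = 1000;                   // pre-allocated buckets
    private static final long MAX_ENTRIES_PER_FILE_GROUP = 1_000_000L;

    // One list of file group ids per bucket; each bucket starts with a single group.
    private final List<List<String>> bucketToFileGroups = new ArrayList<>();
    private final List<Long> entriesInLatestGroup = new ArrayList<>();

    public BucketIndexSketch() {
      for (int i = 0; i < NUM_BUCKETS; i++) {
        List<String> groups = new ArrayList<>();
        groups.add("bucket-" + i + "-group-0");
        bucketToFileGroups.add(groups);
        entriesInLatestGroup.add(0L);
      }
    }

    // The hash function and bucket count never change: a key always maps to the same bucket.
    public int bucketFor(String recordKey) {
      return Math.floorMod(recordKey.hashCode(), NUM_BUCKETS);
    }

    // Returns the file group the next index entry for this key should go to,
    // capping the current group and starting a new one once it reaches the limit.
    public String fileGroupForNewEntry(String recordKey) {
      int bucket = bucketFor(recordKey);
      List<String> groups = bucketToFileGroups.get(bucket);
      if (entriesInLatestGroup.get(bucket) >= MAX_ENTRIES_PER_FILE_GROUP) {
        groups.add("bucket-" + bucket + "-group-" + groups.size());  // new virtual bucket
        entriesInLatestGroup.set(bucket, 0L);
      }
      entriesInLatestGroup.set(bucket, entriesInLatestGroup.get(bucket) + 1);
      return groups.get(groups.size() - 1);
    }
  }

On the read side, a key lookup would consult every file group belonging to its bucket, since the key may have been written either before or after the bucket spilled.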

Implementation specifics

As mentioned in the sections above, we need compaction from time to time to compact all HFiles for a given bucket. In order to reuse compaction without changing much of what exists today, we have come up with an InlineFileSystem which is capable of inlining any file format (Parquet, HFile, etc.) within a given file. In this context, the inlined format is HFile. At a high level, the InlineFileSystem makes it possible to embed/inline HFiles within any generic file. A URL/path is generated for each such embedded HFile, and that path can be used with the HFile reader as though it were a standalone HFile. Here is a pictorial representation of how a file with embedded HFiles might look.


[Figure: a generic file with multiple embedded HFiles]

In cloud data stores like S3, which do not support appends, we might have only one embedded HFile per data file.
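
As a rough illustration of the idea (the scheme name, query parameters, and helper below are assumptions for this sketch, not the final API), an embedded HFile can be addressed by combining the outer file's path with the byte offset and length of the inlined block, producing a synthetic path that the InlineFileSystem can resolve and hand to the HFile reader as if it were a standalone file:

  // Illustrative only: builds a synthetic path that identifies one HFile embedded
  // inside a larger outer file by its byte offset and length. The scheme name and
  // query parameters are assumptions for this sketch, not the final API.
  public class InlinePathSketch {

    private static final String INLINE_SCHEME = "inlinefs";

    public static String inlinePath(String outerFilePath, long startOffset, long length) {
      return INLINE_SCHEME + "://" + outerFilePath
          + "?start_offset=" + startOffset + "&length=" + length;
    }

    public static void main(String[] args) {
      // The resulting path would be resolved by the InlineFileSystem, which reads
      // only that byte range of the outer file and presents it as a whole HFile.
      System.out.println(inlinePath("bucket-0/delta-file-1", 123, 456));
    }
  }

Resolving such a path amounts to opening the outer file, seeking to the start offset, and exposing exactly that many bytes, which is what allows the standard HFile reader to be reused unchanged.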

With this in context, let's take a look at how the data layout might look.

Think of each bucket in this indexing scheme as analogous to a file group (which contains the actual data) in a hoodie partition. A typical partition in an MOR dataset might have one base file and N small delta files; envision a similar structure for each bucket in this index. Each bucket is expected to have one base file and N smaller delta files, each with an embedded HFile. Each new batch of ingestion will either append a new HFile to an existing delta file as a new data block, or create a new delta file and write the new HFile as its first data block. At regular intervals, compaction will pick up the base HFile and all the delta HFiles to create a new base file (with an embedded HFile) as the compacted version.
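
The following sketch (with made-up file names and no actual I/O) models this per-bucket lifecycle: each ingestion batch adds one embedded HFile, either as a new data block in the latest delta file or as the first block of a new delta file, and compaction collapses everything back into a single base file:

  import java.util.ArrayList;
  import java.util.List;

  // Hypothetical model of the per-bucket layout described above: one base file
  // plus N delta files, each holding one or more embedded HFile data blocks.
  public class IndexBucketLayoutSketch {

    static class IndexFile {
      final String name;
      int embeddedHFileBlocks;        // number of HFiles inlined in this file
      IndexFile(String name) { this.name = name; }
    }

    private IndexFile baseFile;                      // compacted, single embedded HFile
    private final List<IndexFile> deltaFiles = new ArrayList<>();
    private final boolean storageSupportsAppends;    // false for stores like S3

    IndexBucketLayoutSketch(boolean storageSupportsAppends) {
      this.storageSupportsAppends = storageSupportsAppends;
    }

    // Each ingestion batch writes one new HFile: appended as a new data block to the
    // latest delta file, or written as the first block of a brand new delta file.
    void onIngestionBatch(String commitTime) {
      if (storageSupportsAppends && !deltaFiles.isEmpty()) {
        deltaFiles.get(deltaFiles.size() - 1).embeddedHFileBlocks++;
      } else {
        IndexFile delta = new IndexFile("delta-" + commitTime);
        delta.embeddedHFileBlocks = 1;
        deltaFiles.add(delta);
      }
    }

    // Compaction merges the base HFile and all delta HFiles into a new base file.
    void compact(String compactionTime) {
      baseFile = new IndexFile("base-" + compactionTime);
      baseFile.embeddedHFileBlocks = 1;
      deltaFiles.clear();
    }
  }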

Here is an illustration of how the index might look within a single bucket, pre and post compaction.

[Figure: a single index bucket before and after compaction]

Here is the same illustration for cloud stores like S3

[Figure: index layout within a single bucket on cloud stores like S3, before and after compaction]


This structure gives us many benefits. Since async compaction has been battle tested, we can reuse it with minimal changes; instead of data files, it operates on embedded HFiles in this case. With this layout, it is easy to reason about rollbacks and commits. We also get the same file system views as a hoodie partition, for instance fetching new index entries after a given commit time, fetching the delta between two commit timestamps, point-in-time queries, and so on.
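
As a toy illustration of those views (this is not Hudi's actual file system view API, just a sketch assuming each embedded HFile is tagged with the commit time that produced it), such queries reduce to simple filters over a bucket's index files:

  import java.util.List;
  import java.util.stream.Collectors;

  // Toy illustration, not Hudi's actual file system view API: assuming each embedded
  // HFile is tagged with the commit time that produced it, timeline-style queries
  // become simple filters over a bucket's index files.
  public class IndexTimelineQueriesSketch {

    static class IndexSlice {
      final String fileName;
      final String commitTime;        // commit instant that wrote this HFile
      IndexSlice(String fileName, String commitTime) {
        this.fileName = fileName;
        this.commitTime = commitTime;
      }
    }

    // Index entries added strictly after a given commit time.
    static List<IndexSlice> newSlicesAfter(List<IndexSlice> slices, String commitTime) {
      return slices.stream()
          .filter(s -> s.commitTime.compareTo(commitTime) > 0)
          .collect(Collectors.toList());
    }

    // Point-in-time view: everything committed at or before the given instant.
    static List<IndexSlice> slicesAsOf(List<IndexSlice> slices, String instant) {
      return slices.stream()
          .filter(s -> s.commitTime.compareTo(instant) <= 0)
          .collect(Collectors.toList());
    }
  }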

Rollout/Adoption Plan

  • Existing users will have to perform a one-time migration of the existing index (bloom, user-driven partition path, etc.) to this index. We will look into writing a tool for this.
  • The contract of the new index will match that of the old index, so the change should be seamless to clients.
  • We will need a migration tool to bootstrap the global index from an existing dataset.

...