...

This section is still being fleshed out. Listing the possible options for now.

In general, it is good practice to provision about 30% more buckets than we anticipate needing, so as to avoid scaling beyond the initial number of buckets. There is some trade-off or overhead involved in scaling past the number of buckets initialized up front.
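As a quick illustration of the sizing rule above, the bucket count can be derived from the expected total entry count and the per-bucket cap, then padded by 30%. This is only a sketch; the class and method names are hypothetical, and the 1M-entries-per-bucket cap is taken from the discussion below.

```java
public class BucketSizing {
    // Per-bucket tipping point discussed in Option 1 below.
    static final long ENTRIES_PER_BUCKET = 1_000_000L;

    // Raw bucket count to hold the expected entries, padded by ~30% so the
    // index can absorb growth without scaling past the initial bucket count.
    static long initialBuckets(long expectedTotalEntries) {
        long raw = (expectedTotalEntries + ENTRIES_PER_BUCKET - 1) / ENTRIES_PER_BUCKET;
        // raw * 1.3, rounded up, using integer math to avoid float rounding.
        return (raw * 13 + 9) / 10;
    }
}
```

For example, an estimate of 1B records gives 1000 raw buckets, so roughly 1300 would be pre-allocated.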

Option 1: Adding file groups to existing buckets

In this approach, we create multiple file groups within one bucket to scale further. A file group represents a group of HFiles that can be compacted into a single base HFile; each file group therefore has one base HFile.

Let's say that, with initial estimates, 1000 buckets were pre-allocated. If at some point a bucket reaches the tipping point, i.e. 1M entries, we can cap its current file group and start a new one. Remember that with compaction in place, all HFiles in a file group will be compacted into one HFile. So if we do not expand into another file group, at some point we could have 2M entries or more in one HFile, which might give us bad performance. So we ought to cap a file group at, say, 1M entries and start another. What this essentially means is that the bucket of interest becomes equivalent to two virtual buckets: the hashing and the number of buckets remain the same, but the index is capable of scaling to more entries. Whenever a new file group is created, the older one is sealed, i.e. it will not take any new writes; in other words, writes append to the latest file group. Reads do not change either, and all HFiles can be looked up in parallel.
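The cap-and-roll behavior above can be sketched as follows. The class and field names are illustrative, not Hudi's actual API; the point is just that writes always land in the latest file group, a group is sealed once it hits the cap, and reads fan out across all groups.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of one bucket that grows by adding file groups.
public class Bucket {
    static final long MAX_ENTRIES_PER_FILE_GROUP = 1_000_000L;

    static class FileGroup {
        long entries = 0;
        boolean sealed = false;  // sealed groups take no new writes
    }

    final List<FileGroup> fileGroups = new ArrayList<>();

    Bucket() {
        fileGroups.add(new FileGroup());
    }

    // Writes append to the latest file group; when the cap would be
    // exceeded, seal it and roll over to a fresh group (a "virtual bucket").
    void append(long numEntries) {
        FileGroup latest = fileGroups.get(fileGroups.size() - 1);
        if (latest.entries + numEntries > MAX_ENTRIES_PER_FILE_GROUP) {
            latest.sealed = true;
            latest = new FileGroup();
            fileGroups.add(latest);
        }
        latest.entries += numEntries;
    }

    // Reads must consult every file group; the probes can run in parallel.
    int groupsToRead() {
        return fileGroups.size();
    }
}
```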

Option 2: Multiple hash lookups and bucket groups

The first hash can result in bucket indices 1 to 1000 (let's call this a bucket group). Once we hit 80% load on this bucket group, we could come up with a new hash that results in bucket indices 1001 to 2000. During index lookup, every record then goes through two lookups. Each returns one bucket index per bucket group, and at most one of them should return a value, if the key exists at all. New writes will go into bucket indices 1001 to 2000.
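The two-level lookup can be sketched as below. The hash function, group size, and class names are assumptions for illustration; the key property is that each bucket group maps a record key to exactly one candidate bucket in its own index range, and lookups probe every existing group.

```java
// Illustrative sketch of bucket groups with one hash per group.
public class BucketGroups {
    static final int GROUP_SIZE = 1000;

    // Group g covers bucket indices [g*GROUP_SIZE + 1, (g+1)*GROUP_SIZE].
    // Salting the key with the group id stands in for "a new hash per group".
    static int bucketInGroup(String recordKey, int group) {
        int h = Math.floorMod((recordKey + "#" + group).hashCode(), GROUP_SIZE);
        return group * GROUP_SIZE + h + 1;
    }

    // Lookup probes every group: one candidate bucket per group, and at most
    // one of them should actually contain the key. Writes target the newest group.
    static int[] candidateBuckets(String recordKey, int numGroups) {
        int[] candidates = new int[numGroups];
        for (int g = 0; g < numGroups; g++) {
            candidates[g] = bucketInGroup(recordKey, g);
        }
        return candidates;
    }
}
```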

Con: One drawback of this approach is the chance of skew, wherein a few buckets in the first bucket group grow beyond 1M entries while others hold 100k entries or fewer. So we can't really wait until 80% load to create a new bucket group.

To be filled: Compare and contrast both approaches. 

Implementation specifics

As mentioned in the sections above, we need compaction from time to time to compact all HFiles for a given bucket. In order to reuse compaction without changing much of what exists today, we have come up with an InlineFileSystem that is capable of inlining any file format (Parquet, HFile, etc.) within a given file; in this context it will be HFile. At a high level, the InlineFileSystem makes it possible to embed/inline an HFile within any generic file. A URL/path is generated for each such embedded HFile, and that path can be used with the HFile reader as though it were a standalone HFile. Here is a pictorial representation of how a file with embedded HFiles might look.
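To make the addressing scheme concrete, a sketch of how such a path might be generated is shown below. The scheme name, parameter names, and format here are assumptions for illustration only, not the final design: the essential idea is that an embedded HFile is fully identified by the outer file plus a byte offset and length, so a reader can open it as if it were standalone.

```java
// Illustrative only: the URL scheme and parameter names are hypothetical.
public class InlinePathSketch {
    // Address one HFile embedded inside an outer file by offset and length.
    static String inlinePath(String outerFilePath, long startOffset, long length) {
        return "inlinefs://" + outerFilePath
            + "?start_offset=" + startOffset
            + "&length=" + length;
    }
}
```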

...

This structure gives us many benefits. Since async compaction has been battle-tested, we can reuse it with minimal changes; instead of data files, it compacts embedded HFiles in this case. With this layout it is easy to reason about rollbacks and commits, and we get the same file system views as for a Hoodie partition. For instance, fetching new index entries after a given commit time, fetching the delta between two commit timestamps, point-in-time queries, and so on are all possible with this layout.

Rollout/Adoption Plan

  • Existing users will have to perform a one-time migration of the existing index (bloom, user-driven partition path, etc.) to this index. We will look into writing a tool for this.
  • We can tee the writes and reads and do a controlled rollout until we gain enough confidence. Until then, we could compare read results from both the old and new index and ensure there are no surprises.
  • Or we could write a tool for async index comparison. I am yet to look into the internals of the existing index, but if it supports incremental fetch, we could write a background script that fetches records from both old and new indices at regular intervals, only for the delta commits, and compares them.
  • The contracts of the new index and the old index will not change, so this should be seamless to the client.
  • We will need a migration tool to bootstrap the global index from the existing dataset.

...