...

Please note that on cloud stores that do not support log append operations, Hudi is forced to create a new archive file each time it archives old metadata operations. You can increase hoodie.commits.archival.batch to increase the number of commits archived per archive file. In addition, you can widen the gap between the two watermark configurations: hoodie.keep.max.commits (default: 30) and hoodie.keep.min.commits (default: 20). Together, these reduce the number of archive files created while increasing the amount of metadata archived per archive file. Note that after the 0.7.0 release, which adds consolidated Hudi metadata (RFC-15), follow-up work will involve re-organizing the archival metadata so that periodic compactions can control the file sizing of these archive files.
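For illustration, here is a minimal sketch (Spark Scala) of passing these archival knobs through the Hudi datasource writer. Only the hoodie.* option keys come from the discussion above; the table name, path, field names, and chosen values are hypothetical placeholders, not recommendations.

  import org.apache.spark.sql.{SaveMode, SparkSession}

  object ArchivalTuningSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("hudi-archival-tuning").getOrCreate()
      val df = spark.read.json("/tmp/input") // hypothetical input

      df.write.format("hudi")
        .option("hoodie.table.name", "my_table") // hypothetical
        .option("hoodie.datasource.write.recordkey.field", "uuid") // hypothetical
        .option("hoodie.datasource.write.precombine.field", "ts") // hypothetical
        // Widen the gap between the two archival watermarks so each run
        // archives a larger batch of commits.
        .option("hoodie.keep.min.commits", "40")
        .option("hoodie.keep.max.commits", "60")
        // Pack more commits into each archive file; this helps on cloud
        // stores that cannot append to existing archive files.
        .option("hoodie.commits.archival.batch", "20")
        .mode(SaveMode.Append)
        .save("/tmp/hudi/my_table") // hypothetical path
    }
  }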

How do I configure the bloom filter (when a BLOOM/GLOBAL_BLOOM index is used)?

Bloom filters are used in bloom indexes to look up the location of record keys in the write path. They are used only when the index type is chosen as “BLOOM” or “GLOBAL_BLOOM”. Hudi has a few config knobs that users can use to tune their bloom filters.

At a high level, Hudi has two types of bloom filters: simple and dynamic.

Simple, as the name suggests, is simple: its size is statically allocated based on a few configs.

“hoodie.bloom.index.filter.type”: SIMPLE

“hoodie.index.bloom.num_entries” is the total number of entries per bloom filter, where each bloom filter covers one file slice. Default value: 60000.

“hoodie.index.bloom.fpp” is the false positive probability of the bloom filter. Default value: 1*10^-9.

The size of the bloom filter depends on these two values and is statically allocated using the standard bloom filter sizing formula, sketched below. As long as the total number of entries added to the bloom filter stays within the configured “hoodie.index.bloom.num_entries” value, the fpp is honored: with the default values of 60k entries and an fpp of 1*10^-9, the serialized bloom filter size is about 430KB. If more entries are added than configured, the false positive probability is no longer honored and more false positives may be returned. So, users are expected to set the right values for both num_entries and fpp.
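Assuming the sizing follows the standard formula for an optimally sized bloom filter (an assumption on my part), the number of bits m for n expected entries and false positive probability p, and the resulting size for the defaults, work out as:

  m = \left\lceil \frac{-n \ln p}{(\ln 2)^{2}} \right\rceil
  \qquad\Rightarrow\qquad
  m = \frac{60000 \cdot \ln(10^{9})}{(\ln 2)^{2}} \approx 2.59 \times 10^{6} \text{ bits} \approx 316\,\text{KB}

The raw bit array for the defaults is therefore roughly 316KB; the ~430KB serialized figure quoted above is larger, presumably due to serialization overhead on top of the raw bits.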

Hudi suggests keeping data files at roughly 100MB to 120MB for better query performance. So, based on the record size, one can determine how many records fit into one data file.

Let's say your data file max size is 128MB and the average record size is 1024 bytes. That translates to roughly 130k entries per data file, so for this config you should set num_entries to ~130k, as in the sketch below.
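A hedged sketch of the corresponding writer options, reusing the hypothetical writer skeleton from the archival example above (only the hoodie.* keys are from this discussion):

  df.write.format("hudi")
    .option("hoodie.table.name", "my_table") // hypothetical
    .option("hoodie.index.type", "BLOOM")
    .option("hoodie.bloom.index.filter.type", "SIMPLE")
    // Sized for ~128MB files with ~1KB records: ~130k entries per file.
    .option("hoodie.index.bloom.num_entries", "130000")
    .option("hoodie.index.bloom.fpp", "0.000000001") // 1*10^-9
    .mode(SaveMode.Append)
    .save("/tmp/hudi/my_table") // hypothetical path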

Dynamic bloom filter:

“hoodie.bloom.index.filter.type”: DYNAMIC_V0

This is an advanced version of the bloom filter that grows dynamically as the number of entries grows. Users are expected to set two values for the entry count: “hoodie.index.bloom.num_entries” determines the starting size of the bloom filter, and “hoodie.bloom.index.filter.dynamic.max.entries” determines the maximum size to which it can grow. The fpp is set the same way as for the simple bloom filter. The initial size is allotted based on “hoodie.index.bloom.num_entries”; once the number of entries reaches this value, the bloom filter dynamically grows to 2x its size, and this continues until the size reaches the “hoodie.bloom.index.filter.dynamic.max.entries” value. Until the number of entries reaches this max value, the fpp is honored; if the entries added exceed it, the fpp may not be honored. A config sketch follows.
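Again a hedged sketch, with the same hypothetical writer skeleton as above; the max-entries value here is illustrative, not a recommendation:

  df.write.format("hudi")
    .option("hoodie.table.name", "my_table") // hypothetical
    .option("hoodie.index.type", "BLOOM")
    .option("hoodie.bloom.index.filter.type", "DYNAMIC_V0")
    .option("hoodie.index.bloom.num_entries", "60000") // starting size
    .option("hoodie.bloom.index.filter.dynamic.max.entries", "500000") // growth cap (illustrative)
    .option("hoodie.index.bloom.fpp", "0.000000001")
    .mode(SaveMode.Append)
    .save("/tmp/hudi/my_table") // hypothetical path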

Contributing to the FAQ

A good, usable FAQ should be community-driven, crowd-sourcing questions and thoughts from everyone.

...