...
- 2. Multiple file groups per bucket: useful when writes are skewed or the data grows substantially.
Comparison

| | Pattern 1 | Pattern 2 |
| --- | --- | --- |
| Number of file groups per bucket | 1 | >1 |
| Can avoid random access | yes | no |
| Implementation complexity | simple | complex |
| Can avoid data skew when writing | no | yes |
| Good support for data growth | bad | great |
This proposal will implement pattern 1.
...
Because the number of buckets is calculated from the estimated data volume, rapid data growth can make each bucket too large, which degrades read and write performance.
Similar to the hashmap expansion process, expanding the number of buckets by a multiple (e.g. doubling) is recommended: existing data can be redistributed in a lightweight manner by rehashing each bucket locally, without shuffling the whole table. A non-multiple expansion, by contrast, requires rewriting the table with a full re-shuffle.
For example, when expanding from 2 buckets to 4, the 1st bucket is split by rehashing its data into two smaller buckets, the 1st and the 3rd, while the former 2nd bucket becomes the new 2nd and 4th.
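The doubling case above can be sketched as follows. This assumes buckets are assigned by `hash(indexKey) % numBuckets` (0-based bucket ids here; the helper names are illustrative, not Hudi's API). The point is that after doubling, every record lands either in its old bucket or in `old + oldNumBuckets`, so each bucket splits in two locally and no cross-bucket shuffle is needed.

```java
public class BucketExpansionSketch {

    // Hypothetical bucket assignment: hash modulo bucket count.
    static int bucketId(int hash, int numBuckets) {
        return Math.floorMod(hash, numBuckets);
    }

    public static void main(String[] args) {
        int oldNum = 2;
        int newNum = 4; // multiple expansion: newNum = 2 * oldNum

        for (int hash = 0; hash < 8; hash++) {
            int oldBucket = bucketId(hash, oldNum);
            int newBucket = bucketId(hash, newNum);
            // Doubling preserves locality: the new bucket is either the old
            // one or the old one shifted by oldNum, so old bucket 0 splits
            // into new buckets 0 and 2, and old bucket 1 into 1 and 3.
            if (newBucket != oldBucket && newBucket != oldBucket + oldNum) {
                throw new AssertionError("unexpected split for hash " + hash);
            }
            System.out.println("hash=" + hash
                    + " old=" + oldBucket + " new=" + newBucket);
        }
    }
}
```

A non-multiple expansion (say 2 to 3 buckets) breaks this property: `hash % 3` bears no fixed relation to `hash % 2`, so records from every old bucket can move to any new bucket, forcing a full rewrite.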
- Data skew
Data skew means that some buckets are much smaller or larger than others, producing a long tail in reads and writes and increasing end-to-end latency.
It is difficult to solve this problem well on the engine side.
Configuration
hoodie.index.type=BUCKET_HASH_INDEX
hoodie.hash.index.bucket.num=1024
hoodie.datasource.write.indexkey.field=colA (the index key must be a superset of the record key)
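A rough way to pick the bucket count for this configuration, as a sketch under assumptions: divide the expected table size by a target bucket size (e.g. ~128 MB) and round up to the next power of two, so that later multiple expansion stays a cheap doubling. The method name and the target size are illustrative, not part of Hudi.

```java
public class BucketNumEstimator {

    // Estimate a bucket count from expected table size and target bucket
    // size, rounded up to the next power of two to keep doubling-based
    // expansion available. Hypothetical helper, not a Hudi API.
    static int estimateBucketNum(long totalBytes, long targetBucketBytes) {
        long raw = (totalBytes + targetBucketBytes - 1) / targetBucketBytes; // ceil
        int n = 1;
        while (n < raw) {
            n <<= 1;
        }
        return n;
    }

    public static void main(String[] args) {
        // A 100 GB table with 128 MB target buckets needs ceil(800) buckets,
        // rounded up to the 1024 used in the sample configuration.
        System.out.println(estimateBucketNum(100L << 30, 128L << 20));
    }
}
```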
...