...
- 2. Multiple file groups per bucket: useful when writes are skewed or the data grows substantially.
Comparison

| | Pattern 1 | Pattern 2 |
| --- | --- | --- |
| Number of file groups per bucket | 1 | >1 |
| Can avoid random access | yes | no |
| Implementation complexity | simple | complex |
| Can avoid data skew when writing | no | yes |
| Good support for data growth | bad | great |
This proposal will implement pattern 1.
...
Because the number of buckets is calculated from the estimated data volume, rapid data growth can make each bucket too large, which degrades read and write performance.
Similar to the hashmap expansion process, expanding the number of buckets by a multiple (e.g. doubling) is recommended: existing data can be redistributed in a lightweight manner by rehashing each bucket locally, without shuffling the whole table. A non-multiple expansion, by contrast, requires rewriting the table with a full re-shuffle.
For example, when expanding from 2 buckets to 4, the 1st bucket is split by rehashing its data into two smaller buckets, the 1st and the 3rd, while the former 2nd bucket becomes the new 2nd and 4th.
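The doubling case above can be sketched as follows. This assumes buckets are assigned by `hash(indexKey) % numBuckets` (0-based bucket ids here; the helper names are illustrative, not Hudi's API). The point is that after doubling, every record lands either in its old bucket or in `old + oldNumBuckets`, so each bucket splits in two locally and no cross-bucket shuffle is needed.

```java
public class BucketExpansionSketch {

    // Hypothetical bucket assignment: hash modulo bucket count.
    static int bucketId(int hash, int numBuckets) {
        return Math.floorMod(hash, numBuckets);
    }

    public static void main(String[] args) {
        int oldNum = 2;
        int newNum = 4; // multiple expansion: newNum = 2 * oldNum

        for (int hash = 0; hash < 8; hash++) {
            int oldBucket = bucketId(hash, oldNum);
            int newBucket = bucketId(hash, newNum);
            // Doubling preserves locality: the new bucket is either the old
            // one or the old one shifted by oldNum, so old bucket 0 splits
            // into new buckets 0 and 2, and old bucket 1 into 1 and 3.
            if (newBucket != oldBucket && newBucket != oldBucket + oldNum) {
                throw new AssertionError("unexpected split for hash " + hash);
            }
            System.out.println("hash=" + hash
                    + " old=" + oldBucket + " new=" + newBucket);
        }
    }
}
```

A non-multiple expansion (say 2 to 3 buckets) breaks this property: `hash % 3` bears no fixed relation to `hash % 2`, so records from every old bucket can move to any new bucket, forcing a full rewrite.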
- Data skew
Data skew means that some buckets are much smaller or larger than others, producing a long tail in reads and writes and increasing end-to-end latency.
It is difficult to solve this problem well on the engine side.
Configuration
hoodie.index.type=BUCKET_HASH_INDEX
hoodie.hash.index.bucket.num=1024
hoodie.datasource.write.indexkey.field=colA (the index key must be a superset of the record key)
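A rough way to pick the bucket count for this configuration, as a sketch under assumptions: divide the expected table size by a target bucket size (e.g. ~128 MB) and round up to the next power of two, so that later multiple expansion stays a cheap doubling. The method name and the target size are illustrative, not part of Hudi.

```java
public class BucketNumEstimator {

    // Estimate a bucket count from expected table size and target bucket
    // size, rounded up to the next power of two to keep doubling-based
    // expansion available. Hypothetical helper, not a Hudi API.
    static int estimateBucketNum(long totalBytes, long targetBucketBytes) {
        long raw = (totalBytes + targetBucketBytes - 1) / targetBucketBytes; // ceil
        int n = 1;
        while (n < raw) {
            n <<= 1;
        }
        return n;
    }

    public static void main(String[] args) {
        // A 100 GB table with 128 MB target buckets needs ceil(800) buckets,
        // rounded up to the 1024 used in the sample configuration.
        System.out.println(estimateBucketNum(100L << 30, 128L << 20));
    }
}
```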
...