...
This is very similar to the COW table path. For a MOR table, inserts can go into either parquet files or log files, and this approach will continue to support both modes. The output of clustering is always in parquet format. Note that compaction and clustering cannot run at the same time on the same file groups; compaction also needs changes to ignore file groups that are already clustered.
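The concurrency constraint above amounts to rejecting a clustering plan whose file groups overlap with a pending compaction plan (and vice versa). A minimal sketch of that check, assuming file groups are identified by string ids; the class and method names here are illustrative, not actual Hudi APIs:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: clustering and compaction must not run concurrently on
// the same file groups, so a proposed clustering plan is rejected when any of
// its target file groups already appear in a pending compaction plan.
public class ClusteringConflictCheck {

  /** Returns file group ids present in both plans (empty set means no conflict). */
  static Set<String> conflictingFileGroups(List<String> pendingCompactionFileGroups,
                                           List<String> proposedClusteringFileGroups) {
    Set<String> conflicts = new HashSet<>(pendingCompactionFileGroups);
    conflicts.retainAll(new HashSet<>(proposedClusteringFileGroups));
    return conflicts;
  }

  public static void main(String[] args) {
    List<String> compacting = List.of("fg-1", "fg-2");
    List<String> clustering = List.of("fg-2", "fg-3");
    Set<String> conflicts = conflictingFileGroups(compacting, clustering);
    if (!conflicts.isEmpty()) {
      System.out.println("Rejecting clustering plan; conflicting file groups: " + conflicts);
    }
  }
}
```

The same set-intersection check can be run in the opposite direction when scheduling compaction, so that it skips file groups targeted by a pending clustering plan.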
Performance numbers
Time for reading metadata
A test was done to measure the time taken to read 'replace' metadata using the code here. Here are the results:
Partitions | Total file groups replaced (divide by Partitions to get file groups per partition) | Serialization cost (ms) | Deserialization cost (ms) | Memory utilization (HoodieReplaceMetadata object size + serialized byte[] size in memory) |
1 | 300 | 55 | 41 | 60KB |
1 | 3,000 | 55 | 42 | 570KB |
1 | 30,000 | 93 | 120 | 5.7MB |
1 | 300,000 | 103 | 130 | 57MB |
10 | 300 | 53 | 32 | 60KB |
10 | 3,000 | 68 | 52 | 574KB |
10 | 30,000 | 87 | 104 | 5.7MB |
10 | 300,000 | 97 | 114 | 57MB |
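A rough sketch of how such a measurement can be structured, assuming the real test serializes replace metadata with Avro; here plain Java serialization stands in, and the partition-to-file-group map is a simplified stand-in for the HoodieReplaceMetadata object:

```java
import java.io.*;
import java.util.*;

// Illustrative benchmark shape for the table above: build a map of
// partition -> replaced file group ids, then time serialization and
// deserialization and report the serialized size.
public class ReplaceMetadataBenchmark {

  static byte[] serialize(Map<String, List<String>> replaced) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
      oos.writeObject(replaced);
    }
    return bos.toByteArray();
  }

  @SuppressWarnings("unchecked")
  static Map<String, List<String>> deserialize(byte[] bytes)
      throws IOException, ClassNotFoundException {
    try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
      return (Map<String, List<String>>) ois.readObject();
    }
  }

  public static void main(String[] args) throws Exception {
    int partitions = 10, fileGroupsPerPartition = 300;
    Map<String, List<String>> replaced = new HashMap<>();
    for (int p = 0; p < partitions; p++) {
      List<String> groups = new ArrayList<>();
      for (int f = 0; f < fileGroupsPerPartition; f++) {
        groups.add(UUID.randomUUID().toString()); // file group id
      }
      replaced.put("partition-" + p, groups);
    }

    long t0 = System.nanoTime();
    byte[] bytes = serialize(replaced);
    long serializeMillis = (System.nanoTime() - t0) / 1_000_000;

    long t1 = System.nanoTime();
    Map<String, List<String>> roundTripped = deserialize(bytes);
    long deserializeMillis = (System.nanoTime() - t1) / 1_000_000;

    System.out.printf("serialize=%dms deserialize=%dms size=%dKB%n",
        serializeMillis, deserializeMillis, bytes.length / 1024);
  }
}
```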
We plan to store this metadata in avro files, similar to clean metadata. After consolidated metadata is launched, we can come up with a plan to migrate this to leverage consolidated metadata (this will likely reduce the memory required in cases where a partition has a large number of files replaced).
Rollout/Adoption Plan
- No impact on existing users, because this only adds new functionality
...