...
This is very similar to the COW table path. For a MOR table, inserts can go into either parquet files or log files, and this approach will continue to support both modes. The output of clustering is always in parquet format. Note that compaction and clustering cannot run at the same time on the same file groups; compaction also needs changes to ignore file groups that are already clustered.
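The concurrency constraint above amounts to rejecting a clustering plan whose file groups overlap with a pending compaction plan (and vice versa). A minimal sketch of that check, assuming file groups are identified by string ids; the class and method names here are illustrative, not actual Hudi APIs:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: clustering and compaction must not run concurrently on
// the same file groups, so a proposed clustering plan is rejected when any of
// its target file groups already appear in a pending compaction plan.
public class ClusteringConflictCheck {

  /** Returns file group ids present in both plans (empty set means no conflict). */
  static Set<String> conflictingFileGroups(List<String> pendingCompactionFileGroups,
                                           List<String> proposedClusteringFileGroups) {
    Set<String> conflicts = new HashSet<>(pendingCompactionFileGroups);
    conflicts.retainAll(new HashSet<>(proposedClusteringFileGroups));
    return conflicts;
  }

  public static void main(String[] args) {
    List<String> compacting = List.of("fg-1", "fg-2");
    List<String> clustering = List.of("fg-2", "fg-3");
    Set<String> conflicts = conflictingFileGroups(compacting, clustering);
    if (!conflicts.isEmpty()) {
      System.out.println("Rejecting clustering plan; conflicting file groups: " + conflicts);
    }
  }
}
```

The same set-intersection check can be run in the opposite direction when scheduling compaction, so that it skips file groups targeted by a pending clustering plan.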
Performance numbers
Time for reading metadata
A test was done to measure the time taken to read 'replace' metadata using the code here. Here are the results:
Partitions | Total file groups replaced (divide by Partitions to get file groups per partition) | Serialization cost (ms) | Deserialization cost (ms) | Memory utilization (HoodieReplaceMetadata object size + serialized byte[] size in memory) |
1 | 300 | 55 | 41 | 60KB |
1 | 3,000 | 55 | 42 | 570KB |
1 | 30,000 | 93 | 120 | 5.7MB |
1 | 300,000 | 103 | 130 | 57MB |
10 | 300 | 53 | 32 | 60KB |
10 | 3,000 | 68 | 52 | 574KB |
10 | 30,000 | 87 | 104 | 5.7MB |
10 | 300,000 | 97 | 114 | 57MB |
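A rough sketch of how such a measurement can be structured, assuming the real test serializes replace metadata with Avro; here plain Java serialization stands in, and the partition-to-file-group map is a simplified stand-in for the HoodieReplaceMetadata object:

```java
import java.io.*;
import java.util.*;

// Illustrative benchmark shape for the table above: build a map of
// partition -> replaced file group ids, then time serialization and
// deserialization and report the serialized size.
public class ReplaceMetadataBenchmark {

  static byte[] serialize(Map<String, List<String>> replaced) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
      oos.writeObject(replaced);
    }
    return bos.toByteArray();
  }

  @SuppressWarnings("unchecked")
  static Map<String, List<String>> deserialize(byte[] bytes)
      throws IOException, ClassNotFoundException {
    try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
      return (Map<String, List<String>>) ois.readObject();
    }
  }

  public static void main(String[] args) throws Exception {
    int partitions = 10, fileGroupsPerPartition = 300;
    Map<String, List<String>> replaced = new HashMap<>();
    for (int p = 0; p < partitions; p++) {
      List<String> groups = new ArrayList<>();
      for (int f = 0; f < fileGroupsPerPartition; f++) {
        groups.add(UUID.randomUUID().toString()); // file group id
      }
      replaced.put("partition-" + p, groups);
    }

    long t0 = System.nanoTime();
    byte[] bytes = serialize(replaced);
    long serializeMillis = (System.nanoTime() - t0) / 1_000_000;

    long t1 = System.nanoTime();
    Map<String, List<String>> roundTripped = deserialize(bytes);
    long deserializeMillis = (System.nanoTime() - t1) / 1_000_000;

    System.out.printf("serialize=%dms deserialize=%dms size=%dKB%n",
        serializeMillis, deserializeMillis, bytes.length / 1024);
  }
}
```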
We plan to store this metadata in avro files, similar to clean metadata. After consolidated metadata is launched, we can come up with a plan to migrate this to leverage consolidated metadata (this will likely reduce the memory required in cases where a partition has a large number of files replaced).
Rollout/Adoption Plan
- No impact on existing users, because this only adds new functionality
...