This is very similar to the COW table. For a MOR table, inserts can go into either parquet files or log files, and this approach will continue to support both modes. The output of clustering is always in parquet format. Compaction and clustering cannot run at the same time on the same file groups. Compaction also needs to be changed so that it ignores file groups that have already been clustered.
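The exclusion rule described above can be sketched as a simple filter. This is only an illustration of the idea, not Hudi's implementation: `FileGroupId`, `select_compaction_candidates`, and the two input sets are hypothetical names, and the sketch assumes the scheduler can already list file groups that are pending clustering or have been replaced by a completed clustering.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FileGroupId:
    """Hypothetical identifier for a file group: partition path plus file id."""
    partition: str
    file_id: str


def select_compaction_candidates(
    file_groups_with_logs: Iterable[FileGroupId],
    pending_clustering: Iterable[FileGroupId],
    replaced_by_clustering: Iterable[FileGroupId],
) -> List[FileGroupId]:
    """Return the file groups that compaction may pick up.

    A file group is skipped if clustering is currently scheduled on it
    (the two operations must not overlap) or if a completed clustering
    has already replaced it (there is nothing left to compact).
    """
    excluded = set(pending_clustering) | set(replaced_by_clustering)
    return [fg for fg in file_groups_with_logs if fg not in excluded]


if __name__ == "__main__":
    fg1 = FileGroupId("2020/09/01", "fg-1")
    fg2 = FileGroupId("2020/09/01", "fg-2")
    fg3 = FileGroupId("2020/09/02", "fg-3")
    candidates = select_compaction_candidates(
        file_groups_with_logs=[fg1, fg2, fg3],
        pending_clustering=[fg2],
        replaced_by_clustering=[fg3],
    )
    print(candidates)
```

The same check would apply in the other direction: clustering should not be scheduled on file groups that have a pending compaction.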


Performance numbers

Time for reading metadata

A test was done to measure the time taken to read 'replace' metadata using the code here. Here are the results:


| Partitions | Total FileGroups replaced (divide by column 1 to get number of file groups per partition) | Serialization cost (millis) | Deserialization cost (millis) | Memory utilization (HoodieReplaceMetadata object size + serialized byte[] size in memory) |
|---|---|---|---|---|
| 1 | 300 | 55 | 41 | 60KB |
| 1 | 3,000 | 55 | 42 | 570KB |
| 1 | 30,000 | 93 | 120 | 5.7MB |
| 1 | 300,000 | 103 | 130 | 57MB |
| 10 | 300 | 53 | 32 | 60KB |
| 10 | 3,000 | 68 | 52 | 574KB |
| 10 | 30,000 | 87 | 104 | 5.7MB |
| 10 | 300,000 | 97 | 114 | 57MB |
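The memory figures above grow roughly linearly with the number of replaced file groups. The sketch below works out the per-file-group cost from the single-partition rows and uses it for a rough estimate. It assumes KB and MB are 1024-based (the results do not say), and the function names and the 200-byte default are illustrative only.

```python
# Memory figures copied from the results above for the single-partition runs:
# (file groups replaced, reported size, unit).
MEASUREMENTS = [
    (300, 60, "KB"),
    (3_000, 570, "KB"),
    (30_000, 5.7, "MB"),
    (300_000, 57, "MB"),
]

# Assumption: binary units. With 1000-based units the results shift by a few percent.
UNIT_BYTES = {"KB": 1024, "MB": 1024 * 1024}


def bytes_per_file_group(file_groups: int, size: float, unit: str) -> float:
    """Average in-memory cost of one replaced file group for a measurement."""
    return size * UNIT_BYTES[unit] / file_groups


def estimate_memory_bytes(file_groups: int, per_group_bytes: int = 200) -> int:
    """Rough linear estimate; 200 bytes per file group is a rounded assumption."""
    return file_groups * per_group_bytes


if __name__ == "__main__":
    for file_groups, size, unit in MEASUREMENTS:
        cost = bytes_per_file_group(file_groups, size, unit)
        print(f"{file_groups:>7} file groups: {cost:.1f} bytes per file group")
    print(estimate_memory_bytes(1_000_000), "bytes estimated for 1,000,000 file groups")
```

Under these assumptions each replaced file group costs on the order of 200 bytes, which is why a partition with a very large number of replaced files is the case where memory becomes a concern.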

We plan to store this metadata in avro files, similar to clean metadata. After consolidated metadata is launched, we can come up with a plan to migrate this to leverage consolidated metadata. (This will likely reduce the memory required in cases where a partition has a large number of files replaced.)


Rollout/Adoption Plan

  • No impact on existing users, because this only adds new functionality.
