...

  1. Identify files that are eligible for clustering
    1. Filter specific partitions (based on config to prioritize latest vs older partitions)
    2. Any files that have size > targetFileSize are not eligible for clustering
    3. Any files that have pending compaction/clustering scheduled are not eligible for clustering
    4. Any file groups that have log files are not eligible for clustering (we could remove this restriction at a later stage).
  2. Group files that are eligible for clustering based on specific criteria. Each group is expected to have a data size that is a multiple of ‘targetFileSize’. Grouping is done as part of the ‘strategy’.
    1. If sort columns are specified,
      1. Among the files that are eligible for clustering, it is better to group together files that have overlapping data for the specified columns.
        1. We have to read data to find this, which is expensive given the way ingestion works. We could consider storing value ranges as part of ingestion (we already do this for the record_key). This requires more discussion; in the short term, we can probably focus on strategy 2b below (no support for custom sortBy columns).
        2. Example: say the target of clustering is to produce 1GB files. A partition initially has 8 * 512MB files. (After clustering, we expect the data to be present in 4 * 1GB files.)
      2. Assume that, among the 8 files, only 2 have overlapping data for the ‘sort column’; then these 2 files will be part of one group. The output of this group after clustering is one 1GB file.
      3. Assume that, among the 8 files, 4 have overlapping data for the ‘sort column’; then these 4 files will be part of one group. The output of this group after clustering is two 1GB files.
    2. If sort columns are not specified, we could consider grouping files based on other criteria (all of these can be exposed as different strategies):
      1. Group files based on record key ranges. This is useful because the key range is stored in the parquet footer and can be used for certain queries/updates.
      2. Group files based on commit time.
      3. Random grouping of files.
    3. We could put a cap on group size to improve parallelism and avoid shuffling large amounts of data.
  3. Filter groups based on specific criteria (akin to orderAndFilter in CompactionStrategy).
  4. Finally, the clustering plan is saved to the timeline. A minimal sketch of this scheduling flow appears below, followed by the structure of the metadata.
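To make the scheduling flow concrete, here is a minimal sketch in Java of steps 1 and 2, assuming a simple size-based grouping. All class and method names here (FileSliceInfo, eligibleFiles, groupFiles) are illustrative placeholders, not actual Hudi APIs.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.stream.Collectors;

    final class ClusteringScheduleSketch {

      /** Simplified stand-in for a Hudi file slice within one partition. */
      static class FileSliceInfo {
        final String fileId;
        final long sizeBytes;
        final boolean hasLogFiles;
        final boolean hasPendingAction; // pending compaction or clustering

        FileSliceInfo(String fileId, long sizeBytes, boolean hasLogFiles, boolean hasPendingAction) {
          this.fileId = fileId;
          this.sizeBytes = sizeBytes;
          this.hasLogFiles = hasLogFiles;
          this.hasPendingAction = hasPendingAction;
        }
      }

      static final long TARGET_FILE_SIZE = 1024L * 1024 * 1024; // 1GB
      static final long MAX_GROUP_SIZE = 4 * TARGET_FILE_SIZE;  // cap, for parallelism

      /** Step 1: drop files that are not eligible for clustering. */
      static List<FileSliceInfo> eligibleFiles(List<FileSliceInfo> slices) {
        return slices.stream()
            .filter(s -> s.sizeBytes < TARGET_FILE_SIZE) // files already at target size are skipped
            .filter(s -> !s.hasPendingAction)            // pending compaction/clustering scheduled
            .filter(s -> !s.hasLogFiles)                 // restriction that may be lifted later
            .collect(Collectors.toList());
      }

      /**
       * Step 2, simplest strategy: greedily pack eligible files into groups
       * whose total size stays under the cap.
       */
      static List<List<FileSliceInfo>> groupFiles(List<FileSliceInfo> eligible) {
        List<List<FileSliceInfo>> groups = new ArrayList<>();
        List<FileSliceInfo> current = new ArrayList<>();
        long currentSize = 0;
        for (FileSliceInfo s : eligible) {
          if (!current.isEmpty() && currentSize + s.sizeBytes > MAX_GROUP_SIZE) {
            groups.add(current); // close the group once the cap would be exceeded
            current = new ArrayList<>();
            currentSize = 0;
          }
          current.add(s);
          currentSize += s.sizeBytes;
        }
        if (!current.isEmpty()) {
          groups.add(current);
        }
        return groups;
      }
    }

A sort-aware strategy would only replace groupFiles, for example by ordering files on the value ranges of the sort columns before packing, while keeping the same overall shape.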

...

In the ‘metrics’ element, we could store the ‘min’ and ‘max’ values for each column in the file, to help with debugging and operations.
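For illustration only, this is one way the per-column metrics could be populated for a single file; the map layout and field names are assumptions, not the final schema.

    import java.util.HashMap;
    import java.util.Map;

    class ColumnMetricsExample {
      public static void main(String[] args) {
        // column name -> {min, max}, encoded as strings (an assumption)
        Map<String, Map<String, String>> columnMetrics = new HashMap<>();
        Map<String, String> tripDistance = new HashMap<>();
        tripDistance.put("min", "0.4");  // smallest value of 'trip_distance' in this file
        tripDistance.put("max", "87.9"); // largest value of 'trip_distance' in this file
        columnMetrics.put("trip_distance", tripDistance);
        System.out.println(columnMetrics);
      }
    }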

Running clustering

...


...

  1. Read the clustering plan and look at the number of ‘clusteringGroups’; this determines the parallelism.
  2. Create the inflight clustering file.
  3. For each group:
    1. Create a new 'CombineHandle' based on the parameters (sortColumns for the initial case).
    2. If no sort order is specified, we could simply combine the records and write them to new buckets using existing logic, similar to bulk_insert/insert.
    3. If a sort order is specified, we need to add new logic: essentially, do a merge sort across the files within the group and write the records to target file groups honoring ‘targetFileSize’ (see the sketch after this list).
  4. Create a replacecommit. Contents are in HoodieReplaceCommitMetadata:
    1. operationType is set to ‘clustering’.
    2. We can extend the metadata and add additional metrics, including the range of values for each column in each file, etc.
    3. TODO: see if any additional metadata is needed.
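Here is a minimal sketch, in Java, of the sorted case in step 3 above: a k-way merge across per-file iterators that are each already sorted on the sort columns, rolling over to a new target file group once ‘targetFileSize’ is reached. FileGroupWriter and WriterFactory are hypothetical placeholders for Hudi's write handles, not real APIs.

    import java.util.AbstractMap.SimpleEntry;
    import java.util.Comparator;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;
    import java.util.PriorityQueue;

    final class SortedGroupExecutor {

      /** Placeholder for a write handle bound to one new target file group. */
      interface FileGroupWriter<R> {
        void write(R record);
        long bytesWritten();
        void close();
      }

      /** Opens the next target file group; placeholder for a write-handle factory. */
      interface WriterFactory<R> {
        FileGroupWriter<R> open();
      }

      static <R> void mergeSortAndWrite(List<Iterator<R>> sortedFiles,
                                        Comparator<R> bySortColumns,
                                        long targetFileSize,
                                        WriterFactory<R> writers) {
        // Each heap entry pairs the next record with the iterator it came from.
        PriorityQueue<Map.Entry<R, Iterator<R>>> heap =
            new PriorityQueue<>((a, b) -> bySortColumns.compare(a.getKey(), b.getKey()));
        for (Iterator<R> file : sortedFiles) {
          if (file.hasNext()) {
            heap.add(new SimpleEntry<>(file.next(), file));
          }
        }
        FileGroupWriter<R> writer = writers.open();
        while (!heap.isEmpty()) {
          Map.Entry<R, Iterator<R>> smallest = heap.poll();
          writer.write(smallest.getKey());
          if (writer.bytesWritten() >= targetFileSize && !heap.isEmpty()) {
            writer.close();          // current file group reached targetFileSize
            writer = writers.open(); // start the next target file group
          }
          Iterator<R> source = smallest.getValue();
          if (source.hasNext()) {
            heap.add(new SimpleEntry<>(source.next(), source));
          }
        }
        writer.close();
      }
    }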

...

  • Is the ClusteringPlan extensible enough for future use cases?
    • With the above approach, executing a clustering plan basically depends on two parameters: ‘targetFileSize’ and ‘sortColumns’. Based on these parameters, we create different partitioners/write data differently to new locations. Because the Avro schema is extensible, we could add new fields and support any other use cases that might come up (see the sketch after this list).
  • Can we store sortColumns in hoodie.properties instead of storing them in the clustering plan?
    • This is reasonable only if we never expect sortColumns to change. If the data pattern changes for any reason, or if there are use cases for sorting different partitions by different columns, this would not work.
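To illustrate the extensibility point, strategy-specific parameters could be carried as opaque key/value pairs inside the serialized plan, so a new strategy only adds keys rather than schema fields, and values such as sortColumns can vary per plan (unlike a single value in hoodie.properties). The key names below are assumptions, not the final schema.

    import java.util.HashMap;
    import java.util.Map;

    class StrategyParamsExample {
      public static void main(String[] args) {
        // Hypothetical per-plan strategy parameters; stored with the plan, they
        // can differ across partitions or across clustering runs.
        Map<String, String> strategyParams = new HashMap<>();
        strategyParams.put("targetFileSize", String.valueOf(1024L * 1024 * 1024));
        strategyParams.put("sortColumns", "begin_lat,begin_lon");
        System.out.println(strategyParams);
      }
    }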



Rollout/Adoption Plan

  • No impact on existing users, because this only adds new functionality.

...