...

  1. Identify files that are eligible for clustering
    1. Filter specific partitions (based on config to prioritize latest vs older partitions)
    2. Any files larger than targetFileSize are not eligible for clustering
    3. Any files that have pending compaction/clustering scheduled are not eligible for clustering
    4. Any file groups that have log files are not eligible for clustering (we could remove this restriction at a later stage)
  2. Group files that are eligible for clustering based on specific criteria. Each group is expected to have a data size in multiples of ‘targetFileSize’. Grouping is done as part of the ‘strategy’ defined in the plan (a sketch of this flow follows the schema below). We can provide multiple strategies:
    1. Group files that have overlapping values for custom columns
      1. As part of clustering, we want to sort data by column(s) in the schema (other than row_key). Among the files that are eligible for clustering, it is better to group files that have overlapping data for the custom columns together.
    2. If sort columns are specified:
      1. Among the files that are eligible for clustering, it is better to group files that have overlapping data for the specified columns together.
        1. We have to read data to find this, which is expensive given the way ingestion works. We can consider storing value ranges as part of ingestion (we already do this for record_key). This requires more discussion. Probably, in the short term, we can focus on grouping by record key ranges (strategy 3.1 below; no support for sortBy custom columns).
        2. Example: say the target of clustering is to produce 1GB files. A partition initially has 8 * 512MB files. (After clustering, we expect the data to be present in 4 * 1GB files.)
      2. Assume that among the 8 files, only 2 have overlapping data for the ‘sort column’; then these 2 files will be part of one group. The output of the group after clustering is one 1GB file.
      3. Assume that among the 8 files, 4 have overlapping data for the ‘sort column’; then these 4 files will be part of one group. The output of the group after clustering is two 1GB files.
    3. If sort columns are not specified, we could consider grouping files based on other criteria (all of these can be exposed as different strategies):
      1. Group files based on record key ranges. This is useful because the key range is stored in a parquet footer and can be used for certain queries/updates.
      2. Group files based on commit time.
      3. Random grouping of files.
    4. We could put a cap on group size to improve parallelism and avoid shuffling large amounts of data.
  3. Filter groups based on specific criteria (akin to orderAndFilter in CompactionStrategy)
  4. Finally, the clustering plan is saved to the timeline. The structure of the metadata is shown below.


{
   "namespace":"org.apache.hudi.avro.model",
   "type":"record",
   "name":"HoodieClusteringPlan",
   "fields":[
      {
         "name":"clusteringGroups",
         "type":["null", {
            "type":"array",
            "items":{
               "name":"HoodieClusteringGroup",
               "type":"record",
               "fields":[
                  {
                     "name":"fileIds",
                     "type":["null", {
                        "type":"array",
                        "items":"string"
                     }],
                     "default": null
                  },
                  {
                     "name":"partitionPath",
                     "type":["null","string"],
                     "default": null
                  },
                  {
                     "name":"metrics",
                     "type":["null", {
                        "type":"map",
                        "values":"double"
                     }],
                     "default": null
                  }
               ]
            }
         }],
         "default": null
      },
      {
         "name":"targetFileSize",
         "type":["long", "null"],
         "default": 1073741824
      },
      {
         "name":"strategy",
         "type":{
            "name":"HoodieClusteringStrategy",
            "type":"record",
            "fields":[
               {
                  "name":"strategyClassName", /* has to be a subclass of the ClusteringStrategy interface defined in hudi; the strategy class includes methods like getPartitioner */
                  "type":["null","string"],
                  "default": null
               },
               {
                  "name":"strategyParams", /* parameters can differ between strategies; for example, if sorting is needed for the strategy, the parameters can contain sortColumns */
                  "type":["null", {
                     "type":"map",
                     "values":"string"
                  }],
                  "default": null
               }
            ]
         }
      },
      {
         "name":"extraMetadata",
         "type":["null", {
            "type":"map",
            "values":"string"
         }],
         "default": null
      },
      {
         "name":"version",
         "type":["int", "null"],
         "default": 1
      }
   ]
}
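
To make steps 1 and 2 above concrete, here is a minimal sketch of schedule-time filtering and grouping. It is illustrative only: FileCandidate, ClusteringGroup, and MAX_GROUP_SIZE are hypothetical stand-ins rather than actual hudi classes, and a real strategy would first order the candidates by the chosen criteria (key range, commit time, or overlapping custom-column ranges) before packing them into groups.

import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch of schedule-time filtering and grouping (not actual
 * Hudi code). All types and constants here are hypothetical stand-ins.
 */
public class ClusteringScheduleSketch {

  static final long TARGET_FILE_SIZE = 1024L * 1024 * 1024; // matches the 1GB targetFileSize default
  static final long MAX_GROUP_SIZE = 2L * TARGET_FILE_SIZE; // cap on group size to improve parallelism

  /** Minimal stand-in for a candidate base file. */
  record FileCandidate(String fileId, String partitionPath, long sizeBytes,
                       boolean hasPendingAction, boolean hasLogFiles) {}

  /** Minimal stand-in for one entry of ‘clusteringGroups’ in the plan. */
  record ClusteringGroup(List<String> fileIds, String partitionPath) {}

  /** Step 1: files that are already large enough, or have pending actions/log files, are skipped. */
  static boolean isEligible(FileCandidate f) {
    return f.sizeBytes() < TARGET_FILE_SIZE && !f.hasPendingAction() && !f.hasLogFiles();
  }

  /**
   * Step 2: pack the eligible files of one partition into groups, capping each
   * group's total size so no single group shuffles too much data.
   */
  static List<ClusteringGroup> groupFiles(String partitionPath, List<FileCandidate> candidates) {
    List<ClusteringGroup> groups = new ArrayList<>();
    List<String> currentIds = new ArrayList<>();
    long currentSize = 0;
    for (FileCandidate f : candidates) {
      if (!isEligible(f)) {
        continue;
      }
      if (!currentIds.isEmpty() && currentSize + f.sizeBytes() > MAX_GROUP_SIZE) {
        groups.add(new ClusteringGroup(currentIds, partitionPath)); // close the full group
        currentIds = new ArrayList<>();
        currentSize = 0;
      }
      currentIds.add(f.fileId());
      currentSize += f.sizeBytes();
    }
    if (!currentIds.isEmpty()) {
      groups.add(new ClusteringGroup(currentIds, partitionPath));
    }
    return groups;
  }
}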


In the ‘metrics’ element of the schema above, we could store the ‘min’ and ‘max’ for each column in the file to help with debugging and operations.
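
As an illustration, a strategy could populate the per-group metrics map with per-column ranges like this (the key naming convention here is hypothetical, not something defined by this design):

import java.util.HashMap;
import java.util.Map;

class MetricsExample {
  /** Hypothetical content of a group's ‘metrics’ element; values are doubles per the schema. */
  static Map<String, Double> exampleMetrics() {
    Map<String, Double> metrics = new HashMap<>();
    metrics.put("ts.min", 1577836800.0);     // smallest ‘ts’ value across files in the group
    metrics.put("ts.max", 1609459199.0);     // largest ‘ts’ value across files in the group
    metrics.put("totalBytes", 4294967296.0); // size bookkeeping for operations
    return metrics;
  }
}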

...

  1. Read the clustering plan and look at the number of ‘clusteringGroups’. This gives the parallelism.
  2. Create an inflight clustering file.
  3. For each group:
    1. Instantiate the appropriate strategy class with strategyParams (example: sortColumns).
    2. The strategy class defines a partitioner and we can use it to create buckets and write the data.
    3. Create a new ‘CombineHandle’ based on the strategy parameters (sortColumns for the initial case).
    4. If sort order is not specified, we could just combine the records and write to the new buckets using existing logic similar to bulk_insert/insert.
    5. If sort order is specified, we need to add new logic: essentially, do a merge sort across the files within a group and write the records to target file groups honoring targetFileSize (see the sketch after this list).
  4. Create a replacecommit. Contents are in HoodieReplaceCommitMetadata:
    1. operationType is set to ‘clustering’.
    2. We can extend the metadata and store additional fields to help track important information (the strategy class can return this ‘extra’ metadata):
      1. the strategy used to combine files
      2. additional metrics, including the range of values for each column in each file, etc.
      3. TODO: see if any additional metadata is needed?
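
To make the sorted path (step 3.5 above) concrete, below is a minimal merge-sort sketch for one group, assuming each input file is already sorted by the sort columns. Rec, Writer, and WriterFactory are hypothetical stand-ins for HoodieRecord and a write handle; this is an illustration, not the actual implementation.

import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative merge-sort sketch for one clustering group (not actual Hudi code). */
public class ClusteringExecuteSketch {

  /** Hypothetical stand-in for a record with a sort key and a serialized size. */
  interface Rec {
    String sortKey();
    long sizeBytes();
  }

  /** Hypothetical stand-in for a file-group writer. */
  interface Writer {
    void write(Rec r);
    long bytesWritten();
    void close();
  }

  interface WriterFactory {
    Writer newWriter(); // opens the next target file group
  }

  /** Tracks the current head record of one sorted input file. */
  private static final class Head {
    Rec record;
    final Iterator<Rec> rest;
    Head(Rec record, Iterator<Rec> rest) { this.record = record; this.rest = rest; }
  }

  /**
   * K-way merge across the (individually sorted) files of a group, rolling to a
   * new file group whenever the current one reaches targetFileSize.
   */
  static void mergeSortGroup(List<Iterator<Rec>> sortedFiles, WriterFactory factory, long targetFileSize) {
    PriorityQueue<Head> heads = new PriorityQueue<>(Comparator.comparing((Head h) -> h.record.sortKey()));
    for (Iterator<Rec> it : sortedFiles) {
      if (it.hasNext()) {
        heads.add(new Head(it.next(), it));
      }
    }
    Writer writer = factory.newWriter();
    while (!heads.isEmpty()) {
      Head h = heads.poll(); // smallest sort key among all file heads
      if (writer.bytesWritten() > 0 && writer.bytesWritten() + h.record.sizeBytes() > targetFileSize) {
        writer.close();               // current file group reached targetFileSize
        writer = factory.newWriter(); // start the next target file group
      }
      writer.write(h.record);
      if (h.rest.hasNext()) {
        h.record = h.rest.next();
        heads.add(h); // re-insert with that file's next record
      }
    }
    writer.close();
  }
}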

Other concerns:

  • Can we group files while running clustering (as opposed to grouping during scheduling)?
    • To limit IO, scheduling filters certain file groups out of clustering. If the filtered file groups have overlapping data with the selected files, the effectiveness of clustering will be limited. So I think grouping and filtering during scheduling has some benefits.
  • Is the ClusteringPlan extensible enough for future use cases?
    • With the above approach, executing a clustering plan basically depends on ‘targetFileSize’ and the strategy parameters (e.g., ‘sortColumns’). Based on these parameters, we create different partitioners/write data differently in new locations. Because this avro schema is extensible, the ‘strategy’ element carries a ‘strategyClassName’: users can define a custom strategy class and support any other use cases that might come up.
  • Can we store strategyParams in hoodie.properties instead of storing them in the clustering plan?
    • This is reasonable if we never expect strategyParams to change. If the data pattern changes for any reason, or if there are use cases to use different strategies for different partitions, this may not work.
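
Since extensibility hinges on the ClusteringStrategy interface referenced by ‘strategyClassName’ in the schema above, the sketch below shows one possible shape for it. Only getPartitioner is mentioned in this document; the other methods and all signatures are assumptions for illustration, not the actual hudi API.

import java.io.Serializable;
import java.util.Map;

/**
 * Illustrative sketch of the pluggable strategy interface referenced by
 * 'strategyClassName' (hypothetical shape; only getPartitioner is named in
 * this design, the rest is an assumption).
 */
public interface ClusteringStrategy extends Serializable {

  /** Called with the ‘strategyParams’ map from the clustering plan (e.g. sortColumns). */
  void init(Map<String, String> strategyParams);

  /**
   * Returns the partitioner used to bucket the records of a group into target
   * file groups. Object is a placeholder; a real API would use an engine-specific type.
   */
  Object getPartitioner();

  /** Optional extra metadata to record in the replacecommit metadata. */
  Map<String, String> getExtraMetadata();
}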


Rollout/Adoption Plan

...