Cleaning is an essential def~instant-action, performed to delete old def~file-slices and bound the growth of storage space consumed by a def~table. Cleaning is performed automatically, right after each def~write-operation, and leverages the timeline metadata cached on the timeline server to avoid scanning the entire def~table when evaluating opportunities for cleaning.

Two styles of cleaning are supported.

  • Clean by commits/deltacommits: This is the most common mode, and the one that must be used with incremental queries. In this style, the cleaner retains all the file slices that were written to in the last N commits/deltacommits, which makes it possible to incrementally query any def~instant-time range across those actions. While this is useful for incremental queries, it may need more storage on some high-write workloads, since it preserves all versions of file slices for the configured range.
  • Clean by file slices retained: This is a much simpler style of cleaning, where we retain only the last N file slices in each def~file-group. Some query engines, like Apache Hive, process very large queries that could take several hours to finish. In such cases, it is useful to set N large enough that no file slice the query might still access is deleted (deleting one fails the query after it has already spent hours running and consuming cluster resources).
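As a rough illustration, the two styles above can be selected through writer options. The sketch below assumes the cleaner configuration keys found in recent Hudi releases (`hoodie.cleaner.policy`, `hoodie.cleaner.commits.retained`, `hoodie.cleaner.fileversions.retained`); the retained counts are illustrative values, and the dictionary names are hypothetical. Check the configuration reference for the version in use before relying on them.

```python
# Assumed option keys from the Hudi configuration reference; values are illustrative.

# Style 1: clean by commits/deltacommits, keeping the last N = 10 of them.
clean_by_commits = {
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": "10",
}

# Style 2: clean by file slices retained, keeping the last N = 3 per file group.
clean_by_file_slices = {
    "hoodie.cleaner.policy": "KEEP_LATEST_FILE_VERSIONS",
    "hoodie.cleaner.fileversions.retained": "3",
}
```

Either dictionary would typically be merged into the options passed to the writer, alongside the table name and key settings.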

Additionally, cleaning ensures that one file slice (the latest slice) is always retained in each def~file-group.
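The retention rules can be sketched as a toy model. This is a simplified illustration of the behaviour described on this page, not the actual cleaner implementation: `FileSlice`, `latest_per_group`, `retain_by_commits`, and `retain_by_file_slices` are hypothetical names, instant times are short fixed-width strings, and the list of commits is derived from the slices themselves rather than read from the timeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileSlice:
    file_group: str    # hypothetical file-group id
    instant_time: str  # instant that wrote the slice; fixed width, so string order is time order


def latest_per_group(slices):
    """Return the newest slice of every file group; these are never cleaned."""
    latest = {}
    for s in slices:
        cur = latest.get(s.file_group)
        if cur is None or s.instant_time > cur.instant_time:
            latest[s.file_group] = s
    return set(latest.values())


def retain_by_commits(slices, n):
    """Keep slices written in the last n commits, plus each group's latest slice."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Simplification: the commit list is derived from the slices themselves.
    recent = set(sorted({s.instant_time for s in slices})[-n:])
    keep = {s for s in slices if s.instant_time in recent}
    return keep | latest_per_group(slices)


def retain_by_file_slices(slices, n):
    """Keep the newest n slices in every file group."""
    if n < 1:
        raise ValueError("n must be at least 1")
    by_group = {}
    for s in slices:
        by_group.setdefault(s.file_group, []).append(s)
    keep = set()
    for group in by_group.values():
        keep.update(sorted(group, key=lambda s: s.instant_time)[-n:])
    return keep


# Three commits (001, 002, 003); file group fg2 was only written by commit 001.
slices = [
    FileSlice("fg1", "001"), FileSlice("fg1", "002"), FileSlice("fg1", "003"),
    FileSlice("fg2", "001"),
]
kept_by_commits = retain_by_commits(slices, 2)
kept_by_slices = retain_by_file_slices(slices, 1)
```

In this example, cleaning by commits with N = 2 drops only the fg1 slice from commit 001. The fg2 slice from commit 001 falls outside the last two commits but survives because it is the latest slice in its file group.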
