...

In this document, we focus mainly on the small-file-compaction problem, which is one of the scenarios not yet covered. The problem manifests when you use the FileSink and want to guarantee a certain file size instead of ending up with a lot of small files (i.e. one file per parallel subtask). This becomes very important for modern data lake use cases, where the data is usually written in a columnar format (Parquet or ORC) and larger files are recommended to improve read performance. A few examples are the Iceberg sink [1], Delta Lake [2], and Hive. For Hive in particular, we already have a sink that supports compaction, but it is not generally applicable and is only available in Flink’s Table API [3].

The goal of this document is to extend the unified Sink API to broaden the spectrum of supported scenarios and fix the small-file-compaction problem.

Alternative 1 Global Sink Coordinator:

...

The global sink coordinator solves the small-file-compaction problem by combining committables (files with size information) until a certain threshold is reached. Afterward, the combined committable is forwarded to the committers, which do the actual file merging and write the result to the final location.
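
The combining step can be illustrated with a minimal sketch. It is not part of the proposed Sink API: the class names CommittableCombiner and FileCommittable, the byte-based threshold, and the flush method are all assumptions made for illustration. The sketch only shows how a coordinator might buffer committables until their accumulated size reaches a threshold and then hand the whole batch downstream.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical sketch (not a Flink class): buffers file committables and
 * releases them as one combined batch once a size threshold is reached.
 */
public class CommittableCombiner {

    /** Hypothetical committable: a file path plus its size in bytes. */
    public static final class FileCommittable {
        final String path;
        final long sizeBytes;

        public FileCommittable(String path, long sizeBytes) {
            this.path = path;
            this.sizeBytes = sizeBytes;
        }
    }

    private final long thresholdBytes;
    private final List<FileCommittable> pending = new ArrayList<>();
    private long pendingBytes = 0L;

    public CommittableCombiner(long thresholdBytes) {
        this.thresholdBytes = thresholdBytes;
    }

    /**
     * Adds one committable. Returns the combined batch if the accumulated
     * size has reached the threshold, otherwise an empty list.
     */
    public List<FileCommittable> add(FileCommittable committable) {
        pending.add(committable);
        pendingBytes += committable.sizeBytes;
        if (pendingBytes < thresholdBytes) {
            return Collections.emptyList();
        }
        return flush();
    }

    /** Releases whatever is buffered, e.g. at end of input (assumed behavior). */
    public List<FileCommittable> flush() {
        List<FileCommittable> batch = new ArrayList<>(pending);
        pending.clear();
        pendingBytes = 0L;
        return batch;
    }
}
```

In this sketch, a non-empty result of add would correspond to the combined committable that is forwarded to the committers, which then merge the listed files.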

...