Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

The committable aggregators are optional operators between the sink writers and committers. The user can define the key for the exchange to the comittable aggregator, the parallelism of the committable aggregator, and the key for exchange from the committable aggregators to the committers.


Image Added

Proposed Changes

We propose to extend the unified Sink interface with the following changes. The changes to the unified sink should be backward compatible because only a new method with a default implementation is added to the sink interface.

...

In the case of the small-file-compaction the problem, we would choose a constant key for the exchange between sink writers and the committable aggregator. This transfers all committables to exactly one committable aggregator that is responsible for only forwarding a new committable, which contains the locations of the files to merge if configured file size is reached, or for batch execution when endOfInput is triggered. The committer operators will merge the files and write the final files.

The committable aggregator can be checkpointed similarly to the current SinkWriter operator. The combined committables are emitted when receiving the snapshot barrier and the outstanding commitables that haven't reached the expected size are written to state.


the parallelism will likely be set to 1 for the committable aggregator so that compaction across multiple subtasks is possible. In addition, the approach also supports compaction across multiple checkpoints if the committables in on checkpoint do not meet the required size.

...