Status

Discussion thread https://lists.apache.org/thread/pt2b1f17x2l5rlvggwxs6m265lo4ly7p
Vote thread
JIRA

Unable to render Jira issues macro, execution error.

Release1.15

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Currently, some default config values of blocking shuffle are not good enough and to achieve good stability and performance, users have to tune several config options. For example,

  1. Shuffle data compression is not enabled by default, users usually need to enable data compression first;
  2. Sort-shuffle is not enabled by default, for large parallelism jobs, users must enable it manually;
  3. As reported in the user mailing list, in some scenarios (for example, data skew), the read buffer request timeout exception may occur, users need to increase the batch read memory size to solve this issue;
  4. The current default value of memory size per partition for sort-shuffle is pretty small, which may influence blocking shuffle performance, usually, users need to increase this value.

Public Interfaces

This FLIP only changes the default value of 4 config options related to blocking shuffle, including taskmanager.network.blocking-shuffle.compression.enabled, taskmanager.network.sort-shuffle.min-parallelism, taskmanager.memory.framework.off-heap.batch-shuffle.size, taskmanager.network.sort-shuffle.min-buffers. No new public interface will be added and pipeline streaming mode is not influenced.

Proposed Changes

KeyCurrent ValueTarget ValueReason
taskmanager.network.blocking-shuffle.compression.enabledfalsetrueUsually, data compression can reduce both disk and network IO which is good for performance. At the same time, it can save storage space.
taskmanager.network.sort-shuffle.min-parallelismInteger.MAX_VALUE1For high parallelism batch jobs, sort-shuffle is better for both stability and performance. (We tested setting this value to 1, 128, 256, 512 and 1024 with TPC-DS and the result showed that 1 is the best one.)
taskmanager.memory.framework.off-heap.batch-shuffle.size32m64mAside from the improvement in FLINK-24954, increasing this value can also help to solve the read buffer request issue. (Previously, when choosing the default value, both ‘32M' and '64M' are OK for tests and we chose the smaller one in a cautious way.)
taskmanager.network.sort-shuffle.min-buffers64512The current default value is quite modest and the performance can be influenced especially after we enable sort shuffle for large parallelism batch jobs by default. 

Compatibility, Deprecation, and Migration Plan

There should be no compatibility issues. Though the number of network buffers required per partition is increased, the default TM size can still support 8 blocking result partitions concurrently, which is big enough. For large parallelism jobs, because sort-shuffle decouples the network buffer consumption from parallelism, less network buffers are required which is good for usability.

Test Plan

  1. Run Flink tests multiple times (20) to ensure that the change does not influence test stability; (Already done)
  2. Run TPC-DS tests on a cluster to to ensure both performance and stability (no failures) will be improved. (Already done)

Rejected Alternatives

Use 128 as the default value of taskmanager.network.sort-shuffle.min-parallelism.