...

Page properties


Discussion thread: https://lists.apache.org/thread/yzfn5yf2tf8o165ns337bvfmb7t8h7mf
Vote thread: https://lists.apache.org/thread/flv4w4tn5r8xbhzdqngx8o8o8t0gv3bt
JIRA: FLINK-30469
Release: 1.17

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

...

  • Add a new config option.
    • Key
      • taskmanager.network.memory.network.read-buffer.required-per-gate.max
    • Default
      • (none).
    • Type
      • Integer.
    • Description
      • The maximum number of network read buffers that an input gate requires. (An input gate reads data from all subtasks of an upstream task.) The number of buffers an input gate needs is calculated dynamically at runtime and depends on several factors, such as the parallelism of the upstream task. Of the calculated number, the part up to this configured value is required, and any excess is optional. A task fails if it cannot obtain its required buffers at runtime. A task does not fail if it cannot obtain optional buffers, but its performance may degrade. If not explicitly configured, the default value is Integer.MAX_VALUE for streaming workloads and 1000 for batch workloads.
    • In the first version, this new config option will be annotated as @Experimental. We will revisit the annotation and try to remove it in the next version.
  • Deprecate two existing config options.
    • taskmanager.network.memory.buffers-per-channel
    • taskmanager.network.memory.floating-buffers-per-gate
    • We will not annotate these two config options as @Deprecated until the @Experimental annotation on taskmanager.network.memory.network.read-buffer.required-per-gate.max is removed.
  • Modify the default value of taskmanager.memory.network.max from 1g to MemorySize.MAX_VALUE.
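
The semantics of the new option described above can be illustrated with a minimal Java sketch. It is not Flink's implementation: the class, enum, and method names are hypothetical, and the demand of 2008 buffers is an assumed value chosen only for illustration. The sketch resolves the effective maximum (an explicit value wins, otherwise 1000 for batch and Integer.MAX_VALUE for streaming) and then splits a gate's calculated demand into a required part and an optional part.

```java
import java.util.OptionalInt;

/** Illustrative only: hypothetical names, not Flink's actual implementation. */
public final class GateBufferRequirement {

    public enum JobType { BATCH, STREAMING }

    /** Required and optional parts of one input gate's buffer demand. */
    public static final class Split {
        public final int required;
        public final int optional;

        Split(int required, int optional) {
            this.required = required;
            this.optional = optional;
        }
    }

    /** An explicitly configured value wins; otherwise the job type decides the default. */
    public static int effectiveMax(OptionalInt configured, JobType jobType) {
        if (configured.isPresent()) {
            return configured.getAsInt();
        }
        return jobType == JobType.BATCH ? 1000 : Integer.MAX_VALUE;
    }

    /** Buffers up to maxRequired are required; anything above that is optional. */
    public static Split split(int neededBuffers, int maxRequired) {
        if (neededBuffers < 0 || maxRequired < 0) {
            throw new IllegalArgumentException("buffer counts must be non-negative");
        }
        int required = Math.min(neededBuffers, maxRequired);
        return new Split(required, neededBuffers - required);
    }

    public static void main(String[] args) {
        int needed = 2008; // assumed demand of one gate, for illustration only
        Split batch = split(needed, effectiveMax(OptionalInt.empty(), JobType.BATCH));
        Split streaming = split(needed, effectiveMax(OptionalInt.empty(), JobType.STREAMING));
        System.out.println("batch: " + batch.required + " required, " + batch.optional + " optional");
        System.out.println("streaming: " + streaming.required + " required, " + streaming.optional + " optional");
    }
}
```

With these assumed inputs, the batch default makes 1000 buffers required and 1008 optional, while the streaming default keeps all 2008 buffers required.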

...

Config option

Config details

Reason

taskmanager.network.memory.network.read-buffer.required-per-gate.max

1. If this option is not configured, the default value is 1000 for batch jobs and Integer.MAX_VALUE for streaming jobs.

2. If this option is configured, the configured value applies to all jobs.

This config option is introduced to improve the usability of Flink, reduce the probability of insufficient network memory exceptions, and limit the performance regression when all buffers are Floating Buffers (all-buffer-floating).

...

During TPC-DS tests, the default network memory configuration could cause jobs to fail with insufficient memory when running at high parallelism (for example, 1000). After setting the default value of taskmanager.memory.network.max to MemorySize.MAX_VALUE and introducing the Flink Shuffle read-buffer optimization described above, the full TPC-DS suite runs successfully. Moreover, no performance regression is observed in CI and NexBenchmark. Users who still see a performance regression can set taskmanager.memory.network.max to 1g to restore the previous behavior.

...

  1. When the total number of network memory buffers in an InputGate exceeds the configured threshold, all read buffers in its InputChannels change from Exclusive Buffers to Floating Buffers, which may affect performance. To restore the previous behavior, set taskmanager.network.memory.network.read-buffer.required-per-gate.max: 2147483647 (Integer.MAX_VALUE).

  2. After the default value of taskmanager.memory.network.max is changed, the TM network may occupy more memory than before in some cases, leaving less heap or managed memory. In most cases, users do not need to configure this option. To restore the previous behavior, set taskmanager.memory.network.max: 1g.
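
The threshold check in item 1 can be sketched as follows. This is a hedged illustration, not Flink's code: the class and method names are hypothetical, and the sketch assumes that a gate's total demand is the number of channels times the exclusive buffers per channel plus the floating buffers per gate. The values 2 and 8 used in main are illustrative inputs.

```java
/** Illustrative only: hypothetical names, not Flink's actual implementation. */
public final class ReadBufferMode {

    public enum Mode { EXCLUSIVE_AND_FLOATING, ALL_FLOATING }

    /**
     * Assumption: total demand = channels * exclusive buffers per channel + floating buffers per gate.
     * If that total exceeds the configured maximum, all read buffers become floating.
     */
    public static Mode choose(
            int numChannels, int buffersPerChannel, int floatingBuffersPerGate, int maxRequiredPerGate) {
        // Use long arithmetic so large parallelism cannot overflow int.
        long total = (long) numChannels * buffersPerChannel + floatingBuffersPerGate;
        return total > maxRequiredPerGate ? Mode.ALL_FLOATING : Mode.EXCLUSIVE_AND_FLOATING;
    }

    public static void main(String[] args) {
        // 1000 channels with the batch default threshold of 1000.
        System.out.println(choose(1000, 2, 8, 1000));
        // Same gate with the threshold set to Integer.MAX_VALUE, as suggested for reverting.
        System.out.println(choose(1000, 2, 8, Integer.MAX_VALUE));
    }
}
```

Under these assumptions, setting the threshold to Integer.MAX_VALUE means the total can never exceed it, which is why that value restores the previous exclusive-buffer behavior.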

...