...

Discussion thread: https://mail-archives.apache.org/mod_mbox/flink-dev/

Vote thread: ...

JIRA: FLINK-13980

Release: 1.10


Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

...

  • Task executor memory configuration options (see Memory Pools and Configuration Keys)
  • A summary of backwards compatibility changes (see Compatibility, Deprecation, and Migration Plan)

Proposed Changes

Unifying Managed Memory for Batch and Streaming

The basic idea is to consider memory used by RocksDB state backends as part of managed memory, and to extend the memory manager so that state backend memory consumers can reserve a certain amount of memory from it without necessarily allocating that memory from it. In this way, users should be able to switch between streaming and batch jobs without having to modify the cluster configuration.

Memory Use Cases

...

  • For the moment, there is no control over the overall memory consumption
  • Streaming jobs with RocksDBStateBackend
    • Off-heap memory
    • Implicitly allocated by the state backend
    • Cannot exceed total memory size, which is configured during initialization
  • Batch jobs
    • Either on-heap or off-heap memory
    • Explicitly allocated from the memory manager
    • Cannot exceed total memory allocated from memory manager

To make managed memory work with both cases, we should always allocate managed memory off-heap.

Unifying Explicit and Implicit Memory Allocation

  • Memory consumers can acquire memory in two ways (see the sketch below)
    • Explicitly acquire from MemoryManager, in the form of MemorySegments.
    • Reserve from MemoryManager, in which case MemoryManager grants "use up to X bytes" and the consumer implicitly allocates the memory by itself.
  • MemoryManager never pre-allocates any memory pages, so that the managed memory budget stays available both for allocation from MemoryManager and for allocation directly by memory consumers.
  • For off-heap memory explicitly acquired from MemoryManager, Flink always allocates with Unsafe.allocateMemory(), which is not limited by the JVM -XX:MaxDirectMemorySize parameter.
    • This eliminates the uncertainty about how much off-heap managed memory should be accounted for in the JVM max direct memory.
    • The drawback is that Unsafe is no longer supported in Java 12.
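
To make the distinction concrete, here is a minimal Java sketch of the two acquisition modes. All names and signatures are illustrative assumptions, not Flink's actual MemoryManager API:

    // Hedged sketch of the two acquisition modes; all names here are
    // illustrative assumptions, not Flink's actual MemoryManager API.
    import java.util.List;

    class MemorySegment {} // stand-in for Flink's existing MemorySegment

    interface ManagedMemoryAccess {

        // Explicit acquisition: the manager allocates the memory itself
        // and hands it out wrapped in page-sized MemorySegments.
        List<MemorySegment> allocatePages(Object owner, int numPages);

        // Implicit acquisition ("reserve"): the manager only books
        // "use up to `bytes`" against the managed memory budget; the
        // consumer (e.g. a RocksDB state backend) allocates the memory
        // itself and must stay within the reserved amount.
        void reserveMemory(Object owner, long bytes);

        // Returns a reservation to the budget after the consumer has
        // released the corresponding memory.
        void releaseMemory(Object owner, long bytes);
    }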

MemorySegment

It’s an open question how memory buffers should be returned from MemoryManager in the case of explicit allocations.

  • Currently (Flink 1.9), memory buffers are returned as a list of MemorySegments, each wrapping a memory buffer of the same configured page size.
  • An alternative could be to return one contiguous buffer of the requested size.

With the current approach, MemoryManager could flexibly assign pre-allocated MemorySegments to satisfy requests for different memory amounts, without having to release and re-allocate memory buffers. Since MemoryManager no longer supports pre-allocation, this is not a strong advantage anymore. The drawback is that the division of the allocated memory into segments may not fit how the consumers want to use the memory.

Separate On-Heap and Off-Heap Memory Pools for Managed Memory

Currently (Flink 1.9), all managed memory is allocated with the same type, either on-heap or off-heap. This works for the current use cases, where we do not necessarily need both on-heap and off-heap managed memory in the same task executor.

With the design in this proposal, memory usage of state backends is also considered as managed memory, which means we may have scenarios where jobs in the same cluster need different types of managed memory. E.g., a streaming job with MemoryStateBackend and another streaming job with RocksDBStateBackend.

Therefore, we separate the managed memory pool into the on-heap pool and the off-heap pool. We use an off-heap fraction to decide what fraction of managed memory should go into the off-heap pool, and leave the rest to the on-heap pool. Users can still configure the cluster to use all on-heap / off-heap managed memory by setting the off-heap fraction to 0 / 1.
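
For example, with 1gb of managed memory and an off-heap fraction of 0.25, the off-heap pool would receive 256mb and the on-heap pool the remaining 768mb.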

Memory Pools and Configuration Keys

[Figure: task executor memory pools and their configuration keys]

Framework Heap Memory

On-heap memory for the Flink task manager framework. It is not accounted for in slot resource profiles.

(taskmanager.memory.framework.heap.size)

(default 128mb)

Framework Off-Heap Memory

Off-heap memory for the Flink task manager framework. It is not accounted for in slot resource profiles.

(taskmanager.memory.framework.off-heap.size)

(default 128mb)

Task Heap Memory

...

(taskmanager.memory.task.heap.size)

Task Off-Heap Memory

Off-heap memory for user code.

(taskmanager.memory.task.off-heap.size)

(default 0b)

Network Memory

Off-heap memory for shuffle service, e.g., network buffers.

(taskmanager.memory.network.[min/max/fraction])

(default min=64mb, max=1gb, fraction=0.1)

Managed Memory

On-heap and off-heap Flink managed memory.

(taskmanager.memory.managed.[size|fraction]) (taskmanager.memory.managed.offheap-fraction)

(default fraction=0.5, offheap-fraction=0.0)

On-Heap Managed Memory = Managed Memory * (1 - offheap-fraction)

Off-Heap Managed Memory = Managed Memory * offheap-fraction


JVM Metaspace

Off-heap memory for JVM metaspace.

(taskmanager.memory.jvm-metaspace)

(default 96mb)

JVM Overhead

Off-heap memory for thread stack space, I/O direct memory, compile cache, etc.

(taskmanager.memory.jvm-overhead.[min/max/fraction])

(default min=192mb, max=1gb, fraction=0.1)

Total Flink Memory

A coarser config option for the total Flink memory, to make it easy for users to configure.

This includes Framework Heap Memory, Framework Off-Heap Memory, Task Heap Memory, Task Off-Heap Memory, Network Memory, and Managed Memory.

This excludes JVM Metaspace and JVM Overhead.

(taskmanager.memory.flink.size)

Total Process Memory

...

  • JVM heap memory
    • Includes Framework Heap Memory, Task Heap Memory, and On-Heap Managed Memory
    • Explicitly set both -Xmx and -Xms to this value
  • JVM metaspace
    • Set -XX:MaxMetaspaceSize to configured JVM Metaspace

...

  • direct memory

...

    • Includes Framework

...

  • It’s an open question whether and how we set the JVM -XX:MaxDirectMemorySize parameter
    • Off-Heap Managed Memory is allocated through Unsafe.allocateMemory(), and we can do the same thing for Network Memory. Then the max direct memory size parameter would only affect Task Off-Heap Memory and JVM Overhead.
  • Netty uses direct memory. Although in most cases it’s only tens of megabytes per task executor, it is possible that in some corner cases this could grow up to hundreds of megabytes.

Alternative 1: 

Do not set the max direct memory size. Leave it at the JVM default, which is the same as the max heap size. Normally this should be enough for JVM Overhead, and for Task Off-Heap Memory if it is not too large. The drawback is that, in cases where the user code uses significant amounts of direct memory, users need to manually set a larger max direct memory size through env.java.opts.

Alternative 2: 

Set the max direct memory size strictly to the sum of the configured Task Off-Heap Memory and JVM Overhead, so that users never need to configure it manually. It also guarantees that direct memory usage can never exceed the limit, and we get descriptive exceptions when it is about to. The drawback is that both Task Off-Heap Memory and JVM Overhead are usually configured empirically and may not be accurate. This is likely to result in either instability due to direct memory OOMs or low memory utilization due to over-reservation of Task Off-Heap Memory and JVM Overhead.

Alternative 3: 

...

    • Network Memory
    • Explicitly set -XX:MaxDirectMemorySize to this value
    • For Managed Memory, we always allocate memory with Unsafe.allocateMemory(), which will not be limited by this parameter.
  • JVM metaspace
    • Set -XX:MaxMetaspaceSize to configured JVM Metaspace
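
To make the JVM parameter mapping concrete, here is a hedged example with made-up sizes (the exact set of flags depends on which alternative is chosen for -XX:MaxDirectMemorySize): with 128mb Framework Heap Memory, 1gb Task Heap Memory, 512mb On-Heap Managed Memory and 96mb JVM Metaspace, the task executor JVM would be launched roughly with:

    -Xmx1664m -Xms1664m -XX:MaxMetaspaceSize=96m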

Memory Calculations

  • All the memory / pool size calculations take place before the task executor JVM is started. Once the JVM is started, there should be no further calculation or derivation inside the Flink TaskExecutor.
  • The calculations should be performed in two places only.
    • In the startup shell scripts, for standalone.
    • On the resource manager side, for Yarn/Mesos/K8s.
  • The startup scripts can actually call Java with the Flink runtime code to execute the calculation logic. In this way, we can make sure that standalone clusters and the active mode clusters have consistent memory calculation logic.
  • The calculated memory / pool sizes are passed into the task executor as dynamic configurations (via '-D'), as illustrated below.
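
For illustration, the generated dynamic configurations could look like the following. The sizes are made-up examples; the keys are the ones defined in Memory Pools and Configuration Keys:

    -D taskmanager.memory.framework.heap.size=128mb
    -D taskmanager.memory.task.heap.size=1gb
    -D taskmanager.memory.managed.size=1gb
    -D taskmanager.memory.network.min=512mb
    -D taskmanager.memory.network.max=512mb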

Calculation Logic

At least one of the following three options needs to be configured.

...

  • If both Task Heap Memory and Managed Memory are configured, we use these to derive Total Flink Memory
    • If Network Memory is configured explicitly, we use that value
    • Otherwise, we compute it such that it makes up the configured fraction of the final Total Flink Memory (see getAbsoluteOrInverseFraction())
  • If Total Flink Memory is configured, but not Task Heap Memory and Managed Memory, then we derive Network Memory and Managed Memory, and leave the rest (excluding Framework Heap Memory, Framework Off-Heap Memory and Task Off-Heap Memory) as Task Heap Memory.
    • If Network Memory is configured explicitly, we use that value
    • Otherwise we compute it such that it makes up the configured fraction of the Total Flink Memory (see getAbsoluteOrFraction())
    • If Managed Memory is configured explicitly, we use that value
    • Otherwise we compute it such that it makes up the configured fraction of the Total Flink Memory (see getAbsoluteOrFraction())
  • If only the Total Process Memory is configured, we derive the Total Flink Memory in the following way
    • We get the configured JVM Overhead (or compute it as a fraction of the Total Process Memory) and subtract it from the Total Process Memory (see getAbsoluteOrFraction())
    • We subtract JVM Metaspace from the remainder
    • We leave the rest as Total Flink Memory

...

        Math.max(min, Math.min(relative, max))

    }

}
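
A self-contained Java sketch of the helpers referenced above is given below. The method names getAbsoluteOrFraction and getAbsoluteOrInverseFraction follow this document; the wrapper class and the null-based "not configured" convention are assumptions for illustration, not Flink's actual implementation:

    // Hedged sketch of the derivation helpers named above.
    final class MemoryDerivationSketch {

        // Returns the explicitly configured size if present, otherwise
        // base * fraction clamped into [min, max]. All sizes in bytes;
        // a null configuredSize means "not explicitly configured".
        static long getAbsoluteOrFraction(
                Long configuredSize, long base, double fraction, long min, long max) {
            if (configuredSize != null) {
                return configuredSize;
            }
            long relative = (long) (base * fraction);
            return Math.max(min, Math.min(relative, max));
        }

        // Used when the pool must make up `fraction` of a total that
        // includes the pool itself (e.g. Network Memory as a fraction
        // of the final Total Flink Memory): solves
        //     size = fraction * (restOfTotal + size)
        // for size, then clamps into [min, max].
        static long getAbsoluteOrInverseFraction(
                Long configuredSize, long restOfTotal, double fraction, long min, long max) {
            if (configuredSize != null) {
                return configuredSize;
            }
            long relative = (long) (restOfTotal * fraction / (1 - fraction));
            return Math.max(min, Math.min(relative, max));
        }
    }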

Implementation Steps

Step 1. Introduce a switch for enabling the new task executor memory configurations

Introduce a temporary config option as a switch between the current / new task executor memory configuration code paths. This allows us to implement and test the new code paths without affecting the existing code paths and behaviors.

Step 2. Implement the memory calculation logic

  • Introduce new configuration options
  • Introduce data structures and utilities.
    • Data structure to store memory / pool sizes of task executor
    • Utility for calculating memory / pool sizes from configuration
    • Utility for generating dynamic configurations
    • Utility for generating JVM parameters

This step should not introduce any behavior changes.
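
A hedged sketch of what such a data structure and the JVM parameter utility could look like; all names here are illustrative assumptions, not Flink's actual classes:

    // Hedged sketch; class and field names are assumptions.
    final class TaskExecutorMemorySizes {
        final long frameworkHeap;
        final long frameworkOffHeap;
        final long taskHeap;
        final long taskOffHeap;
        final long network;
        final long onHeapManaged;
        final long offHeapManaged;
        final long jvmMetaspace;

        TaskExecutorMemorySizes(long frameworkHeap, long frameworkOffHeap,
                long taskHeap, long taskOffHeap, long network,
                long onHeapManaged, long offHeapManaged, long jvmMetaspace) {
            this.frameworkHeap = frameworkHeap;
            this.frameworkOffHeap = frameworkOffHeap;
            this.taskHeap = taskHeap;
            this.taskOffHeap = taskOffHeap;
            this.network = network;
            this.onHeapManaged = onHeapManaged;
            this.offHeapManaged = offHeapManaged;
            this.jvmMetaspace = jvmMetaspace;
        }

        // JVM heap = Framework Heap + Task Heap + On-Heap Managed Memory,
        // as described in the Total Process Memory section. Sizes in bytes.
        String toJvmParams() {
            long heap = frameworkHeap + taskHeap + onHeapManaged;
            return String.format("-Xmx%d -Xms%d -XX:MaxMetaspaceSize=%d",
                    heap, heap, jvmMetaspace);
        }
    }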

Step 3. Launch task executor with the new memory calculation logic

  • Use the data structures and utilities introduced in Step 2 to generate JVM parameters and dynamic configurations for launching new task executors.
    • In startup scripts
    • In resource managers
  • Task executor uses data structures and utilities introduced in Step 2 to set memory pool sizes and slot resource profiles.
    • MemoryManager
    • ShuffleEnvironment
    • TaskSlotTable

Implement this step as separate code paths only for the new mode.

Step 4. Separate on-heap and off-heap managed memory pools

  • Update MemoryManager to have two separate pools.
  • Extend MemoryManager interfaces to specify which pool to allocate memory from (see the sketch below).

Implement this step in common code paths for the legacy / new mode. For the legacy mode, depending on the configured memory type, we can set one of the two pools to the managed memory size and always allocate from that pool, leaving the other pool empty.
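
A hedged sketch of what the extended interface could look like; all names are illustrative assumptions:

    // Hedged sketch of the interface extension; names are assumptions.
    import java.util.List;

    enum ManagedMemoryType { ON_HEAP, OFF_HEAP }

    class MemorySegment {} // stand-in for Flink's existing MemorySegment

    interface PooledMemoryManager {

        // Page-wise allocation, now parameterized with the pool to draw from.
        List<MemorySegment> allocatePages(
                Object owner, int numPages, ManagedMemoryType type);

        // Reservation-style usage (e.g. RocksDB), also per pool.
        void reserveMemory(Object owner, long bytes, ManagedMemoryType type);
    }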

Step 5. Use native memory for managed memory.

  • Allocate memory with Unsafe.allocateMemory
    • MemoryManager

Implement this step in common code paths for the legacy / new mode. This should only affect the GC behavior.
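
For reference, a minimal sketch of how native memory can be allocated and freed via Unsafe, obtained reflectively since it is not a public API. This is an illustration, not Flink's actual MemoryManager code:

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    final class UnsafeMemoryExample {
        public static void main(String[] args) throws Exception {
            // sun.misc.Unsafe is not publicly constructible; grab the
            // singleton via reflection.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            Unsafe unsafe = (Unsafe) f.get(null);

            // Native allocation: not tracked by the GC and not counted
            // against -XX:MaxDirectMemorySize.
            long address = unsafe.allocateMemory(1024);
            try {
                unsafe.putLong(address, 42L);
                System.out.println(unsafe.getLong(address));
            } finally {
                // Must be freed manually; there is no GC involvement.
                unsafe.freeMemory(address);
            }
        }
    }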

Step 6. Clean-up of legacy mode.

  • Fix / update / remove test cases for legacy mode
  • Deprecate / remove legacy config options.
  • Remove legacy code paths
  • Remove the switch for legacy / new mode.

Compatibility, Deprecation, and Migration Plan

This FLIP changes how users configure cluster resources, which in some cases may require re-configuring the cluster when migrating from prior versions.

Deprecated configuration keys are as follows:

Deprecated Key | As Fallback of New Key | Notes
taskmanager.heap.size | Standalone: taskmanager.memory.flink.size; Yarn/Mesos/K8s: taskmanager.memory.process.size |
taskmanager.heap.mb | Standalone: taskmanager.memory.flink.size; Yarn/Mesos/K8s: taskmanager.memory.process.size |
taskmanager.memory.size | taskmanager.memory.managed.size |
taskmanager.memory.fraction | N/A | `taskmanager.memory.managed.fraction` now has different semantics.
taskmanager.memory.off-heap | N/A | `taskmanager.memory.off-heap` will be ignored; the managed memory type is now controlled by `taskmanager.memory.managed.offheap-fraction`.
taskmanager.memory.preallocate | N/A | `taskmanager.memory.preallocate` will be ignored, because we no longer support pre-allocation of managed memory.
taskmanager.network.memory.[min/max/fraction] | taskmanager.memory.network.[min/max/fraction] |

Test Plan

  • We need to update existing integration tests, and add new ones, dedicated to validating the new memory configuration behaviors.
  • It is also expected that other regular integration and end-to-end tests should fail if this is broken.

Limitations

  • The proposed design uses Unsafe.allocateMemory() for allocating managed memory, which is no longer supported in Java 12. We need to look for alternative solutions in the future.

Follow Ups

  • This FLIP requires very good documentation to help users understand how to properly configure Flink processes and which knobs to turn in which cases.
  • It would be good to expose configured memory pool sizes in the web UI, so that users see immediately what amount of memory TMs assume to use for what purpose.

Rejected Alternatives

Regarding JVM direct memory, we have the following alternatives.

  1. Have MemorySegments de-allocated by the GC, and trigger GC by setting a proper JVM max direct memory size parameter.
  2. Have MemorySegments de-allocated by the GC, and trigger GC by dedicated bookkeeping independent of the JVM max direct memory size parameter.
  3. Manually allocate and de-allocate MemorySegments.

We decided to go with option 3, but depending on how safe it turns out to be with respect to segmentation faults, we may switch to one of the other alternatives after the implementation. Alternatives regarding MemorySegment and the max direct memory size are still under open discussion.