Status
Current state: "Under discussion"
Discussion thread: here (<- link to https://mail-archives.apache.org/mod_mbox/flink-dev/)
JIRA:
Released: <Flink Version>
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Since Flink 1.5 we have the same heartbeat timeout and interval default values that are defined as heartbeat.timeout: 50s
and heartbeat.interval: 10s
. These values were mainly chosen to compensate for lengthy GC pauses and blocking operations that we executed in the main threads of Flink components. Since then, there were quite some advancements wrt the JVM's GCs and we could also get rid of a lot of blocking calls that were executed in the main thread. Moreover, a long heartbeat.timeout
causes long recovery times in case of a TaskManager loss because the system can only properly recover after the dead TaskManager has been removed from the scheduler. Hence, I wanted to revisit the default values of these two config options and check whether they can be shortened.
Since there is no perfect solution that fits all use cases, I would really like to hear from the Flink practitioners to which values they set the heartbeat config options. Based on this experience we might come up with better default values.
Proposed Changes
I propose the following change to the default value of the heartbeat options.
heartbeat.timeout: 15s
heartbeat.interval: 3s
Compatibility, Deprecation, and Migration Plan
Reducing the heartbeat timeout will cause more false positive heartbeat timeouts to be produced. Especially in flakey or overloaded network environments or if one deals with a lot of GC pauses, this can become noticeable in more job restarts. The way to solve this problem is to explicitly increase the heartbeat timeout.
The decreased heartbeat interval will cause a bit more overhead because Flink will send status information that is piggy-backed on the heartbeat signals more often between the different components. If this should become a problem, then users can increase the heartbeat interval.
Test Plan
Check whether our test suite passes with the new values. Moreover, we should reach out to our users to run tests with a snapshot version or the RC to test the change in the wild.
Rejected Alternatives
No rejected alternatives yet.