Status

Current state: "Under discussion"

Discussion thread: here (<- link to https://mail-archives.apache.org/mod_mbox/flink-dev/)

Released: <Flink Version>

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Since Flink 1.5, the heartbeat timeout and interval have kept the same default values: heartbeat.timeout: 50s and heartbeat.interval: 10s. These values were chosen mainly to compensate for lengthy GC pauses and for blocking operations that were executed in the main threads of Flink components. Since then, the JVM's garbage collectors have improved considerably, and we have removed many of the blocking calls from the main threads. Moreover, a long heartbeat.timeout causes long recovery times when a TaskManager is lost, because the system can only recover properly after the dead TaskManager has been removed from the scheduler. Hence, I want to revisit the default values of these two config options and check whether they can be shortened.

Since there is no perfect solution that fits all use cases, I would really like to hear from Flink practitioners which values they use for the heartbeat config options. Based on this experience, we might come up with better default values.

Proposed Changes

I propose the following changes to the default values of the heartbeat options:

heartbeat.timeout: 15s

heartbeat.interval: 3s

Compatibility, Deprecation, and Migration Plan

Reducing the heartbeat timeout will produce more false-positive heartbeat timeouts. Especially in flaky or overloaded network environments, or in setups with many GC pauses, this can show up as more job restarts. Users who run into this problem can explicitly increase the heartbeat timeout.

The decreased heartbeat interval will cause a bit more overhead, because Flink will more frequently send the status information that is piggy-backed on the heartbeat signals between the different components. If this becomes a problem, users can increase the heartbeat interval.
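As a rough sketch of the opt-out described in this section, users who want to keep the previous behaviour could pin the old defaults explicitly in their Flink configuration (for example in flink-conf.yaml). The values below are written in milliseconds, on the assumption that both options are interpreted as milliseconds when no unit is given; please check the configuration reference of the Flink version in use.

```yaml
# Sketch only: restore the pre-FLIP-185 defaults.
# Assumption: values without a unit are read as milliseconds.
heartbeat.timeout: 50000   # 50s, the previous default timeout
heartbeat.interval: 10000  # 10s, the previous default interval
```

Either option can also be set on its own, for example raising only heartbeat.timeout in an environment with long GC pauses while keeping the shorter interval.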

Test Plan

Check whether our test suite passes with the new values. Moreover, we should ask our users to test the change in the wild with a snapshot version or the RC.

Rejected Alternatives

No rejected alternatives yet.

FLIP-185: Shorter heartbeat timeout and interval default values