Discussion thread: https://lists.apache.org/thread/ho77fx13lw4ds52t0fs1xqz2vtn50n2o
Vote thread: https://lists.apache.org/thread/sn5cv1gc5bpg1k22kow9h52jr65otvon
Release: 1.19, 2.0

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

FLIP-193 introduced two modes of state file ownership during checkpoint restoration: RestoreMode#CLAIM and RestoreMode#NO_CLAIM. The LEGACY mode, which was how Flink worked until 1.15, has been superseded by NO_CLAIM as the default mode. The main drawback of LEGACY mode is that the new job relies on artifacts from the old job without cleaning them up, leaving users uncertain about when it is safe to delete the old checkpoint directories. As a result, unnecessary checkpoint files accumulate and are never cleaned up. For the sake of cluster availability and job maintenance, using LEGACY mode is not recommended. Users can choose either of the other two modes to get clear semantics for state file ownership.

This FLIP proposes to deprecate the LEGACY mode and remove it completely in the upcoming Flink 2.0. This will make the semantics clear, eliminate many bugs caused by mode transitions involving LEGACY mode, and improve code maintainability.

Public Interfaces & Proposed Changes

org.apache.flink.runtime.jobgraph.RestoreMode#LEGACY will be marked as @Deprecated and will be removed in Flink 2.0. The 'LEGACY' value will also be removed from the corresponding configuration option 'execution.savepoint-restore-mode', the REST API parameter ('restoreMode'), and the CLI option ('-restoreMode') used when starting a job.
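
The two-phase change described above (deprecate first, remove in 2.0) can be illustrated with a minimal standalone sketch. It is not Flink's actual code: the names RestoreModeSketch and RestoreModeParser and the legacyRemoved flag are hypothetical and exist only to model how a restore-mode value might be handled before and after the removal.

```java
import java.util.Locale;

/** Hypothetical stand-in for the restore modes; not Flink's RestoreMode class. */
enum RestoreModeSketch {
    CLAIM,
    NO_CLAIM,
    /** Deprecated first, then removed in 2.0, as proposed in this FLIP. */
    @Deprecated
    LEGACY
}

/** Hypothetical parser for a user-supplied restore-mode value. */
final class RestoreModeParser {
    private RestoreModeParser() {}

    /**
     * Parses the value given via configuration, REST parameter, or CLI option.
     *
     * @param raw the user-supplied value, or null if the user did not set one
     * @param legacyRemoved false models the deprecation phase, true models Flink 2.0
     */
    @SuppressWarnings("deprecation")
    static RestoreModeSketch parse(String raw, boolean legacyRemoved) {
        if (raw == null || raw.trim().isEmpty()) {
            // NO_CLAIM is the default mode.
            return RestoreModeSketch.NO_CLAIM;
        }
        // valueOf throws IllegalArgumentException for unknown values.
        RestoreModeSketch mode =
                RestoreModeSketch.valueOf(raw.trim().toUpperCase(Locale.ROOT));
        if (mode == RestoreModeSketch.LEGACY) {
            if (legacyRemoved) {
                throw new IllegalArgumentException(
                        "LEGACY restore mode has been removed; use CLAIM or NO_CLAIM.");
            }
            // Deprecation phase: accept the value but warn the user.
            System.err.println(
                    "WARN: LEGACY restore mode is deprecated; use CLAIM or NO_CLAIM.");
        }
        return mode;
    }
}
```

In this sketch, RestoreModeParser.parse("legacy", false) still returns LEGACY and prints a warning, while the same call with legacyRemoved set to true is rejected.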

Compatibility, Deprecation, and Migration Plan

This change has no impact on most users, as they do not explicitly specify the restore mode. For jobs that use the LEGACY mode, a deprecation warning should be issued. Users who do not want Flink to delete artifacts from the old job can migrate to NO_CLAIM mode. This makes the first checkpoint take longer, because NO_CLAIM forces the first checkpoint to be a full one, but it is a good tradeoff in most situations because it provides clear ownership. Users who want Flink to manage the old artifacts can migrate to CLAIM mode without further changes.

Note: The changelog state backend does not support the NO_CLAIM mode, so it is better to implement that support (tracked in a separate Jira issue) before the LEGACY mode is completely removed. Otherwise, users who enable the changelog can only choose CLAIM mode, and thus cannot keep the old artifacts after restoring.
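
The migration choices above, including the changelog caveat, can be summarized in a small sketch. It assumes a plain key-value view of the job configuration. The class LegacyMigrationSketch and its two boolean parameters are hypothetical and are not part of Flink. Only the option key 'execution.savepoint-restore-mode' is taken from this proposal.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper that rewrites a LEGACY restore-mode setting. */
final class LegacyMigrationSketch {
    static final String RESTORE_MODE_KEY = "execution.savepoint-restore-mode";

    private LegacyMigrationSketch() {}

    /**
     * Picks the replacement mode for a job that currently uses LEGACY.
     *
     * @param flinkManagesOldArtifacts true if Flink may own and delete the old artifacts
     * @param changelogEnabled true if the changelog state backend is enabled
     */
    static String chooseMode(boolean flinkManagesOldArtifacts, boolean changelogEnabled) {
        if (flinkManagesOldArtifacts) {
            return "CLAIM";
        }
        if (changelogEnabled) {
            // The changelog state backend does not support NO_CLAIM (see the note above).
            throw new IllegalStateException(
                    "NO_CLAIM is not supported with the changelog state backend.");
        }
        // Old artifacts stay untouched; the first checkpoint will be a full one.
        return "NO_CLAIM";
    }

    /** Returns a copy of conf with LEGACY replaced; other settings are unchanged. */
    static Map<String, String> migrate(
            Map<String, String> conf,
            boolean flinkManagesOldArtifacts,
            boolean changelogEnabled) {
        Map<String, String> out = new HashMap<>(conf);
        String current = out.get(RESTORE_MODE_KEY);
        if (current != null && current.trim().equalsIgnoreCase("LEGACY")) {
            out.put(RESTORE_MODE_KEY, chooseMode(flinkManagesOldArtifacts, changelogEnabled));
        }
        return out;
    }
}
```

Jobs that never set the restore mode pass through unchanged, which matches the expectation that most users are unaffected.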

This change should be highlighted in the release notes to ensure users are well-informed.

Test Plan

None.

Rejected Alternatives

None.