Status
Discussion thread | https://lists.apache.org/thread/0oc10cr2q2ms855dbo29s7v08xs3bvqg |
---|---|
Vote thread | |
JIRA | |
Release | 1.19, 2.0 |
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Currently, the configuration options pertaining to checkpointing, recovery, and state management are primarily grouped under the following prefixes:
- state.backend.* : configurations related to state accessing and checkpointing, as well as specific options for individual state backends
- execution.checkpointing.* : configurations associated with checkpoint execution and recovery
- execution.savepoint.*: configurations for recovery from savepoint
In addition, there are several individual options such as state.checkpoint-storage
and state.checkpoints.dir
that fall outside of these prefixes. The current arrangement of these options, which span multiple modules, is somewhat haphazard and lacks a systematic structure. For example, the options under the CheckpointingOptions
and ExecutionCheckpointingOptions
are related and have no clear boundaries from the user's perspective, but there is no unified prefix for them. With the upcoming release of Flink 2.0, we have an excellent opportunity to overhaul and restructure the configurations related to checkpointing, recovery, and state management. This FLIP proposes to reorganize these settings, making it more coherent by module, which would significantly lower the barriers for understanding and reduce the development costs moving forward.
Proposed Changes
This FLIP proposes a transition from existing configuration options to new ones, which are categorized by prefixes as listed below:
- execution.checkpointing: all configurations associated with checkpointing and savepoint.
- execution.recovery: all configurations pertinent to state recovery.
- state.*: all configurations related to the state accessing.
- state.backend.*: specific options for individual state backends, such as RocksDB.
- state.changelog: configurations for the changelog, as outlined in FLIP-158, including the options for the "Durable Short-term Log" (DSTL).
- state.latency-track: configurations related to the latency tracking of state access.
Please note that this FLIP does not change any state accessing, checkpointing and recovery logic. To be more specific, the detailed migration plan for the individual options is as follows:
Checkpointing
Original (Current) option | Proposed New option | Note |
state.checkpoints.num-retained | execution.checkpointing.num-retained | |
state.checkpoint.cleaner.parallel-mode | execution.checkpointing.cleaner.parallel-mode | |
state.backend.incremental |
execution.checkpoint.type | an enumeration option which can be 'full' (default) or 'incremental' |
state.backend.local-recovery | execution.checkpointing.local-copy.enabled | Whether to do checkpoint on local disk, another option |
taskmanager.state.local.root-dirs | execution.checkpointing.local-copy.dir | |
state.savepoints.dir | execution.checkpointing.savepoint.dir | |
state.checkpoints.dir state.checkpoint-storage | execution.checkpointing.dir | One URI is enough for specifying the checkpoint directory and storage. 'jobmanager' and 'jm://' are special URI for job manager based storage. |
state.storage.fs.memory-threshold | execution.checkpointing.data-inline-threshold | |
state.storage.fs.write-buffer-size | execution.checkpointing.write-buffer |
All those options will be moved into `CheckpointingOptions`.
Recovery
Original (Current) option | Proposed New option | Note |
execution.savepoint.path | execution.recovery.path | |
execution.savepoint.ignore-unclaimed-state | execution.recovery.ignore-unclaimed-state | |
execution.savepoint-restore-mode | execution.recovery.claim-mode | |
state.backend.local-recovery | execution.recovery.from-local | Specifies whether to recover from local, the local files are created while the |
execution.checkpointing.approximate-local-recovery | execution.recovery.approximate-local-recovery | |
execution.checkpointing.recover-without-channel-state.checkpoint-id | execution.recovery.without-channel-state.checkpoint-id |
All those options will be moved into a newly introduced class `RecoveryOptions` in flink-core.
State
Original (Current) option | Proposed New option | Note |
state.backend.changelog.* | state.changelog.* | |
dstl.* | state.changelog.dstl.* | |
state.backend.latency-track.* | state.latency-track.* |
Those options stay in their original classes.
As the most special one, the state.backend.local-recovery
defines whether the Flink do checkpoint in local disk, as well as whether the Flink would recover from the local. This FLIP proposes to spilt this option into two, named execution.checkpointing.local-copy
and execution.recovery.from-local
, representing checkpointing and recovery behavior respectively.
Additionally, some existing option classes lack annotations, including classes like RocksDBOptions
. This FLIP will annotate all such existing classes, alongside any new classes that are introduced, with the @PublicEvolving. With above changes, the options for state, checkpointing and recovery are organized in a more straightforward structure. It is simple to categorize new options under appropriate prefixes.
Compatibility, Deprecation, and Migration Plan
In Flink 1.19, the new options will be introduced and the current ones will be annotated as deprecated. Flink will read new options first, and if the user does not configure them, fallback to the old(current) options. The deprecation will last for two minor versions(1.19 and 1.20), and in Flink 2.0, the old options will be removed entirely.
Test Plan
Existing UT/IT can ensure the compatibility for old options. New tests will cover the new options.
Rejected Alternatives
None for now.