Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Unaligned checkpoints have been in our codebase for a long time, have proven to be stable and reliable, and solve a lot of potential problems. Especially with the option to time out aligned checkpoints to unaligned ones, there seems to be no reason to keep using aligned checkpoints by default. Changing the default would make adoption of Flink easier, especially for new users, who most likely first deploy Flink with the default aligned checkpoints, encounter problems under back pressure, search online for a solution, and only then enable unaligned checkpoints.
Public Interfaces
None of the public interfaces will be changed. Only the default values of org.apache.flink.streaming.api.environment.ExecutionCheckpointingOptions#ENABLE_UNALIGNED and org.apache.flink.streaming.api.environment.ExecutionCheckpointingOptions#ALIGNED_CHECKPOINT_TIMEOUT will change.
Proposed Changes
I'm proposing to:
- enable unaligned checkpoints by default
- change the aligned checkpoint timeout from 0ms to 5s
These settings should make the change completely transparent for most users. In particular, jobs running without back pressure, or with only slight back pressure, would be unaffected. Only jobs with noticeable back pressure would switch to unaligned checkpoints.
This would help most jobs that experience some back pressure. There are edge cases, such as jobs with very large parallelism and relatively small state, where the user doesn't need checkpoints to complete in a timely manner and the back pressure is not large enough to cause checkpoint timeouts. For those users, the change to unaligned checkpoints will significantly increase state size without any benefit. However, I expect such cases to be far rarer than the jobs that would benefit from enabling unaligned checkpoints.
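The intended effect of the two defaults can be illustrated with a small, self-contained model. This is only a sketch and not Flink code: the class CheckpointModeSketch, the Mode enum, and the effectiveMode method are hypothetical names, and the single alignmentTime parameter is a simplification of how long a checkpoint is actually delayed by alignment. It assumes that a timeout of 0 means checkpoints start unaligned right away, and that a positive timeout means they start aligned and fall back to unaligned once alignment takes longer than the timeout.

```java
import java.time.Duration;

/** Hypothetical model of the aligned-to-unaligned switch; not part of Flink. */
final class CheckpointModeSketch {

    enum Mode { ALIGNED, UNALIGNED }

    /**
     * Returns the mode a checkpoint would effectively end up using.
     *
     * @param unalignedEnabled whether unaligned checkpoints are enabled
     * @param alignedTimeout   the aligned checkpoint timeout (0 = unaligned from the start)
     * @param alignmentTime    how long alignment would take (simplified, assumed input)
     */
    static Mode effectiveMode(boolean unalignedEnabled,
                              Duration alignedTimeout,
                              Duration alignmentTime) {
        if (!unalignedEnabled) {
            // Unaligned checkpoints disabled: the timeout is irrelevant.
            return Mode.ALIGNED;
        }
        if (alignedTimeout.isZero()) {
            // Timeout of 0: never attempt alignment.
            return Mode.UNALIGNED;
        }
        // Positive timeout: start aligned, switch only if alignment is too slow.
        return alignmentTime.compareTo(alignedTimeout) > 0 ? Mode.UNALIGNED : Mode.ALIGNED;
    }

    public static void main(String[] args) {
        Duration proposedTimeout = Duration.ofSeconds(5);
        // Job with little back pressure: alignment finishes quickly.
        System.out.println(effectiveMode(true, proposedTimeout, Duration.ofMillis(200)));
        // Job with noticeable back pressure: alignment exceeds the timeout.
        System.out.println(effectiveMode(true, proposedTimeout, Duration.ofSeconds(30)));
    }
}
```

Under this model, a job whose alignment completes within 5s behaves exactly as it does with aligned checkpoints today, which is why the change is expected to be invisible to most jobs.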
Compatibility, Deprecation, and Migration Plan
As discussed under Proposed Changes, the new defaults should make the change completely transparent for most users: jobs running without back pressure, or with only slight back pressure, would be unaffected, and only jobs with noticeable back pressure would switch to unaligned checkpoints. The exception is the edge case of jobs with very large parallelism and relatively small state, where the user doesn't need checkpoints to complete in a timely manner and the back pressure is not large enough to cause checkpoint timeouts. For those jobs, unaligned checkpoints will significantly increase state size without any benefit, but I expect such cases to be far rarer than the jobs that would benefit.
This change would have to be clearly visible in the release notes.
Test Plan
None.
Rejected Alternatives
None.