Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Unaligned checkpoints have been in our codebase for a long time. They have proven to be stable and reliable, and they solve a lot of problems. Especially with the option to time out aligned checkpoints to unaligned ones, there seem to be very few reasons to keep using aligned checkpoints by default. Enabling unaligned checkpoints by default would make adopting Flink easier, especially for new users. Instead of first deploying Flink with the current default configuration, encountering problems under back pressure, searching online for a solution, and only then enabling unaligned checkpoints, new users wouldn't have to do anything.

Public Interfaces

None of the public interfaces will be changed. Only the default values of org.apache.flink.streaming.api.environment.ExecutionCheckpointingOptions#ENABLE_UNALIGNED and org.apache.flink.streaming.api.environment.ExecutionCheckpointingOptions#ALIGNED_CHECKPOINT_TIMEOUT will change.

Proposed Changes

I'm proposing to:

  • enable unaligned checkpoints by default
  • change the aligned checkpoint timeout from 0ms to 5s 
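
For illustration, the proposed defaults correspond to setting the following explicitly in the Flink configuration today. This is a config fragment, not part of the proposal itself; it assumes a Flink version in which ALIGNED_CHECKPOINT_TIMEOUT is exposed under the key shown, and the exact key names may differ in other versions.

```yaml
# Sketch of the proposed defaults, written as explicit settings.
# Enable unaligned checkpoints (proposed new default: true).
execution.checkpointing.unaligned: true
# Start each checkpoint aligned and switch to unaligned only if
# alignment takes longer than this (proposed new default: 5 s, currently 0 ms).
execution.checkpointing.aligned-checkpoint-timeout: 5 s
```

Users who prefer the current behaviour could keep it by setting execution.checkpointing.unaligned to false.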

Compatibility, Deprecation, and Migration Plan

These settings should make the change completely transparent for most users. In particular, jobs running with no back pressure or with only light back pressure would be unaffected, because their checkpoint barriers align well within the 5s timeout. Only jobs with noticeable back pressure would switch to using unaligned checkpoints.
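
The behaviour described above can be sketched as a small decision function. This is a simplified, hypothetical model and not Flink code: the class AlignedTimeoutSketch, the method effectiveMode, and the single barrierDelay value standing in for "how long alignment is taking" are all assumptions made for illustration.

```java
import java.time.Duration;

/** Hypothetical model of the aligned-checkpoint timeout; not part of Flink. */
public class AlignedTimeoutSketch {

    public enum Mode { ALIGNED, UNALIGNED }

    /**
     * Returns the mode a checkpoint would end up in under this simplified model.
     *
     * @param unalignedEnabled whether unaligned checkpoints are enabled
     * @param alignedTimeout   the aligned checkpoint timeout (0 means "start unaligned")
     * @param barrierDelay     assumed time the checkpoint has spent waiting on alignment
     */
    public static Mode effectiveMode(
            boolean unalignedEnabled, Duration alignedTimeout, Duration barrierDelay) {
        if (!unalignedEnabled) {
            // Unaligned checkpoints disabled: the timeout is irrelevant.
            return Mode.ALIGNED;
        }
        if (alignedTimeout.isZero()) {
            // Timeout of 0: checkpoints are unaligned from the start.
            return Mode.UNALIGNED;
        }
        // Positive timeout: start aligned, switch only if alignment is too slow.
        return barrierDelay.compareTo(alignedTimeout) > 0 ? Mode.UNALIGNED : Mode.ALIGNED;
    }

    public static void main(String[] args) {
        Duration proposedTimeout = Duration.ofSeconds(5);
        // Light back pressure: barriers align quickly, checkpoint stays aligned.
        System.out.println(effectiveMode(true, proposedTimeout, Duration.ofMillis(200)));
        // Heavy back pressure: alignment exceeds the timeout, checkpoint goes unaligned.
        System.out.println(effectiveMode(true, proposedTimeout, Duration.ofSeconds(30)));
    }
}
```

Under this model, a job whose barriers align in 200ms keeps taking aligned checkpoints with the proposed defaults, while a job whose alignment takes 30s falls back to unaligned checkpoints.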

This would help most jobs that experience some back pressure. There are some edge cases, such as jobs with very large parallelism and relatively small state, where the user doesn't care about checkpoints completing in a timely manner and the back pressure is not large enough to cause checkpoint timeouts. In such a scenario, switching to unaligned checkpoints will significantly increase the checkpointed state size without any benefit. However, I expect such cases to be less common than the jobs that would benefit from enabling unaligned checkpoints.

Another thing to consider is that we currently do not support job upgrades or Flink minor version upgrades from unaligned checkpoints, so users would have to be guided to use savepoints in those cases.

This change would have to be clearly called out in the release notes.

Test Plan

None.

Rejected Alternatives

None.