Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Unaligned checkpoints have been in our codebase for a very long time already, proving to be stable, reliable and are solving a lot of potential problems. Especially with the option to timeout aligned checkpoints to unaligned, there seems to be no reasons to keep using the aligned checkpoints by default. This would make adoption of Flink easier, especially for the new users, who are most likely currently first deploying Flink the default aligned checkpoints, encountering problems during back-pressure, searching online for a solution, and only then enabling unaligned checkpoints.

Public Interfaces

None of the public interfaces will be changed. Only the default values of the org.apache.flink.streaming.api.environment.ExecutionCheckpointingOptions#ENABLE_UNALIGNED  and org.apache.flink.streaming.api.environment.ExecutionCheckpointingOptions#ALIGNED_CHECKPOINT_TIMEOUT

Proposed Changes

I'm proposing to:

  • enable unaligned checkpoints by default
  • change the aligned checkpoint timeout from 0ms to 5s 

...

This would help for most of the jobs that are experiencing some back pressure. There are some edge cases, like very large parallelism jobs, with relatively small state, where user doesn't care about the checkpointing to completely timely, while the back pressure is not large enough to cause checkpoint timeouts. For those users change to the unaligned checkpoints will significantly increase state size, without any benefits. However I expect such cases to be far more rare compared to the jobs that would benefit from enabling unaligned checkpoints.

Compatibility, Deprecation, and Migration Plan

Those settings should make the change completely transparent for most of the users. Especially jobs that are working either without back pressure or with just small back pressure would be unaffected. Only jobs with some noticeable back pressure would switch to using unaligned checkpoints.

This would help for most of the jobs that are experiencing some back pressure. There are some edge cases, like very large parallelism jobs, with relatively small state, where user doesn't care about the checkpointing to completely timely, while the back pressure is not large enough to cause checkpoint timeouts. For those users change to the unaligned checkpoints will significantly increase state size, without any benefits. However I expect such cases to be far more rare compared to the jobs that would benefit from enabling unaligned checkpoints.

Test Plan

None.

Rejected Alternatives

None.