...
- State size increase
  - Up to a couple of GB per task (especially painful on I/O-bound clusters)
  - Depends on the actual policy used (UNALIGNED_WITH_MAX_INFLIGHT_DATA is probably the most plausible default)
- Longer and heavier recovery, depending on the increased state size
  - Can potentially trigger a death spiral after a first failure
  - More up-to-date checkpoints will most likely still be an improvement over the current checkpoint behavior during backpressure
- For the MVP
  - no re-scaling
  - max in-flight checkpoints = 1
Test Plan
- Correctness tests with induced failures
- Compare checkpoint times under backpressure with the current implementation
- Compare throughput under backpressure with the current implementation
- Compare progress under backpressure with frequently induced failures
- Compare throughput with no checkpointing
...