Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

If a TaskManager instance fails , Flink can bring it up again. However, and the partitions are stored at TM, all the intermediate results which have a partition on the failed TM will become unavailable. If the partitions are not stored at TM, e.g. remote shuffle service is used, TM failure will not affect the intermediate results. However, if some partitions are lost at the remote shuffle service, the intermediate result will become unavailable.

In this caseboth cases, the consuming job will throw an exception and the job will fail. At the same time, PartitionTracker in ResourceManager will release all the cluster partitions that are impacted (implemented in FLIP-67). The StreamExecutionEnvironment will fall back and re-submit the original job as if the cache hasn't been created. The original job will run as an ordinary job that follows the existing recovery strategy. And the cache will be recreated after the execution of the original job. The above process is transparent to the users.

...