...

In the example topology, A1 and C1 are connected by a blocking data exchange, and B1 and C2 communicate through a pipelined data exchange. Assume that A1 is the only task outside the default slot sharing group.

[Image: example topology]

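If only one slot is available to the cluster, the scheduling/execution of this job is prone to deadlocks. Concretely, if (B1, C1) is scheduled first, the tasks will never be able to finish: C1 cannot finish until A1 has produced its blocking result, but A1 is outside the slot sharing group and cannot obtain a slot while (B1, C1) occupies the only one.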
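The deadlock can be illustrated with a toy model. This is only a sketch under stated assumptions, not the scheduler's logic: it assumes A1 is the producer of the blocking exchange consumed by C1, that B1 and C2 must run at the same time because they are pipelined, and that task groups take the single slot one after another. The names `BLOCKING_INPUTS`, `PIPELINED_PEERS` and `run_with_single_slot` are hypothetical.

```python
# Assumption: A1 produces the blocking result that C1 consumes.
BLOCKING_INPUTS = {"C1": {"A1"}}  # consumer -> producers that must have finished
# Assumption: pipelined partners must be running at the same time.
PIPELINED_PEERS = {"B1": {"C2"}, "C2": {"B1"}}


def run_with_single_slot(schedule):
    """Return True if every group in `schedule` can finish, False on deadlock.

    `schedule` is a list of task groups; each group occupies the single
    slot in turn and releases it only after all of its tasks finished.
    """
    finished = set()
    for group in schedule:
        running = set(group)
        for task in running:
            # A blocking input is only available once its producer finished.
            if not BLOCKING_INPUTS.get(task, set()) <= finished:
                return False
            # A pipelined peer that is not running can never get the slot.
            if not PIPELINED_PEERS.get(task, set()) <= running:
                return False
        finished |= running
    return True


print(run_with_single_slot([("B1", "C1"), ("A1",), ("C2",)]))  # False
print(run_with_single_slot([("A1",), ("B1", "C2"), ("C1",)]))  # True
```

In this model the first order never releases the slot, while scheduling A1 first, then (B1, C2), then C1 lets every task finish with one slot.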

...

The data exchange modes as proposed above have different resource requirements. We will explain the different modes by example using the job below.

[Image: example job]

  • ALL_EDGES_BLOCKING
    • Pipelined regions: 12
      • {A1}, {A2}
      • {B1}, {B2}
      • {C1}, {C2}, {C3}, {C4}
      • {D1}, {D2}, {D3}, {D4}
    • Blocking logical edges: 3
    • Minimum slots required: 1
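The region count above can be reproduced with a small sketch that merges tasks connected by pipelined edges into one region (a union-find over pipelined edges). The parallelism of each vertex is taken from the list above; the all-to-all wiring between consecutive vertices is an assumption made for illustration, since the figure is not available here, and the function names are hypothetical.

```python
from collections import defaultdict


def pipelined_regions(tasks, edges):
    """Group tasks into pipelined regions.

    `edges` holds (producer, consumer, mode) tuples with mode being
    "PIPELINED" or "BLOCKING". Only pipelined edges merge two tasks
    into the same region.
    """
    parent = {t: t for t in tasks}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for producer, consumer, mode in edges:
        if mode == "PIPELINED":
            parent[find(producer)] = find(consumer)

    regions = defaultdict(set)
    for t in tasks:
        regions[find(t)].add(t)
    return sorted(sorted(r) for r in regions.values())


# Parallelism as listed above: A and B have 2 subtasks, C and D have 4.
parallelism = {"A": 2, "B": 2, "C": 4, "D": 4}
tasks = [f"{v}{i}" for v, p in parallelism.items() for i in range(1, p + 1)]


def all_to_all(src, dst, mode):
    # Assumed wiring: every subtask of `src` feeds every subtask of `dst`.
    return [
        (f"{src}{i}", f"{dst}{j}", mode)
        for i in range(1, parallelism[src] + 1)
        for j in range(1, parallelism[dst] + 1)
    ]


blocking_edges = (
    all_to_all("A", "B", "BLOCKING")
    + all_to_all("B", "C", "BLOCKING")
    + all_to_all("C", "D", "BLOCKING")
)
regions = pipelined_regions(tasks, blocking_edges)
print(len(regions))  # 12
```

With every edge blocking, no two tasks are merged, so each of the 12 tasks forms its own region regardless of the assumed wiring.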

...

An embarrassingly parallel streaming job, like the one below, has multiple distinct regions that will be scheduled separately by the Pipelined Region Scheduler.

[Image: embarrassingly parallel streaming job]

If not enough resources are available, this can lead to only parts of the job being in the running state. We do not consider this a real limitation, as partially running jobs can already occur since Flink 1.9 when using region failover. Moreover, users should be able to detect partially running jobs by monitoring relevant metrics.

...

Below is an example demonstrating the issue: no slot allocation bulk can be completely fulfilled, even though the cluster has enough resources to fulfill each bulk individually.

...

[Image: slot allocation bulks that cannot be completely fulfilled]

Possible solutions:

  • Option 1: The SlotPool releases unused slots to the RM and waits for the pending requests in the RM to be fulfilled. Slot requests related to the released slots should also be re-sent to the RM.
  • Option 2: Force FIFO slot allocation in the SlotManager. We can do this once the SlotManager is pluggable (FLINK-14106).
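The problem and the effect of FIFO allocation can be sketched with a toy model. It is not the SlotPool or SlotManager implementation: the `allocate` and `fulfilled` functions, the bulk names, and the round-robin interleaving used for the non-FIFO case are all assumptions chosen only to show how slots can end up split across bulks.

```python
from collections import deque


def allocate(bulks, free_slots, fifo):
    """Hand out free slots to pending bulks and return slots received per bulk.

    `bulks` maps a bulk name to the (positive) number of slots it needs.
    fifo=True serves one bulk completely before moving on to the next.
    fifo=False serves single requests round-robin across bulks, standing
    in for an arbitrary interleaving of requests.
    """
    received = {name: 0 for name in bulks}
    pending = deque(bulks)
    while free_slots > 0 and pending:
        name = pending[0]
        received[name] += 1
        free_slots -= 1
        if received[name] == bulks[name]:
            pending.popleft()   # bulk is completely fulfilled
        elif not fifo:
            pending.rotate(-1)  # next slot goes to another bulk
    return received


def fulfilled(bulks, received):
    """Names of the bulks that received every slot they asked for."""
    return [name for name in bulks if received[name] == bulks[name]]


# Two bulks that need 2 slots each; the cluster has 2 free slots,
# which is enough for either bulk on its own.
bulks = {"bulk_1": 2, "bulk_2": 2}

interleaved = allocate(bulks, free_slots=2, fifo=False)
print(interleaved, fulfilled(bulks, interleaved))
# {'bulk_1': 1, 'bulk_2': 1} []

in_order = allocate(bulks, free_slots=2, fifo=True)
print(in_order, fulfilled(bulks, in_order))
# {'bulk_1': 2, 'bulk_2': 0} ['bulk_1']
```

In the interleaved case each bulk holds one slot and neither can proceed, which corresponds to the situation Option 1 resolves by releasing unused slots. In the FIFO case one bulk is always completed first.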

...

Below is an example to demonstrate this issue. Note that the 3 requests are in the same bulk.

[Image: example with 3 slot requests in the same bulk]

Possible solutions:

...