Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

The ClusterPartitionDescriptor should have the necessary information to let another job to be able to consume the intermediate result. It includes the ShuffleDescriptor, along with some metadata about the intermediate result, i.e., numberOfSubpartitions, partitionType, and the IntermediateDataSetID. Since it contains the runtime class ShuffleDescriptor, it should not put into the flink-core module. Instead, it will get serialized before transfer back to the client-side, and only get deserialized in the StreamingJobGraphGenerator. We can put a ClusterPartitionDescriptor interface in the flink-core, and keep the implementation of the ClusterPartitionDescriptor in the flink-runtime. The JobResult contains the SerializedValue<ClusterPartitionDescriptor> will get sent back to the client-side.

In the per-job mode cluster, a cluster will be spun up on every submitted job and tore down when the submitted job finished. All lingering resources (files, etc) are cleared up, including the cluster partition in the TMs. Therefore, the cached table will not work in the Per-Job mode cluster. When a job that will create some cache tables is submitted to in Per-Job mode, the returned ClusterPartitionDescriptors will be ignored so that the planner will not attempt to replace the subtree of the cache node.

Repartition is needed when the cache consumer requires the input data to be partitioned in a specific way, i.e. hash partition, custom partition. When the StreamingJobGraphGenerator generates the job graph, it introduces a NoOp job vertex as the upstream vertex of the cache consumer and maintains the shipStrategyName of the output job edge. During execution, the task executor will make sure that the data is repartitioned.

If a Task Manager instance fails, Flink will bring it up again. However, all the intermediate results which have a partition on the failed TM will become unavailable.

In this case, the consuming job will throw an exception and the job will fail. As a result, PartitionTracker in ResourceManager will release all the cluster partitions that are impacted(implemented in FLIP-67). The TableEnvironment will fell back and resubmit the original DAG without using the cache. The original DAG will run as an ordinary job that follows the existing recovery strategy. Note that because there is no cache available, the TableEnvironment (planner) will again create a Sink to cache the result that was initially cached, therefore the cache will be recreated after the execution of the original DAG.

.e. hash partition, custom partition. When the StreamingJobGraphGenerator generates the job graph, it introduces a NoOp job vertex as the upstream vertex of the cache consumer and maintains the shipStrategyName of the output job edge. During execution, the task executor will make sure that the data is repartitioned.

If a Task Manager instance fails, Flink will bring it up again. However, all the intermediate results which have a partition on the failed TM will become unavailable.

In this case, the consuming job will throw an exception and the job will fail. As a result, PartitionTracker in ResourceManager will release all the cluster partitions that are impacted(implemented in FLIP-67). The TableEnvironment will fell back and resubmit the original DAG without using the cache. The original DAG will run as an ordinary job that follows the existing recovery strategy. Note that because there is no cache available, the TableEnvironment (planner) will again create a Sink to cache the result that was initially cached, therefore the cache will be recreated after the execution of the original DAG.

The above process is transparent to the users.

In the per-job mode cluster, a cluster will be spun up on every submitted job and tore down when the submitted job finished. All lingering resources (files, etc) are cleared up, including the cluster partition in the TMs. Therefore, the cached table will not work in the Per-Job mode cluster. When a job that read from some cached table is submitted in Per-Job mode, the first submission of the job will fail and the failover mechanism will re-execute the origin DAGThe above process is transparent to the users.

To explain how the optimizer affects the cache node, let look at a very simple DAG, where one scan node followed by a filter node, as shown below. The optimizer can push the filter to the scan node such that the scan node will produce fewer data to the downstream node. However, such optimization will affect the result of the scan node, which is an undesired behavior if we want to cache the scan node.

...