Status
Current state: Accepted
Original Design Document: https://docs.google.com/document/d/1l7yIVNH3HATP4BnjEOZFkO2CaHf1sVn_DSxS2llmkd8/edit?usp=sharing
JIRA: FLINK-10653
Released: FLINK 1.9
Motivation
Shuffle is the process of transferring data between stages: writing outputs on the producer side and reading inputs on the consumer side. The shuffle architecture and behavior in Flink are unified for both streaming and batch jobs. They can be improved along three dimensions:
Lifecycle of TaskExecutor (TE)/Task: The TE starts an internal shuffle environment for transporting partition data to the consumer side. When a task enters the FINISHED state, its produced partitions might not yet be fully consumed, so the TE container must not be freed until all internally stored partitions have been consumed. There are thus implicit constraints coupling the task and partition lifecycles, but no explicit mechanism for coordinating them.
Lifecycle of ResultPartition: Certain features, like fine-grained recovery and interactive programming, require flexible consumption of produced intermediate results: delayed consumption or consumption multiple times. In these cases, the shuffle service user (JM or RM) should decide when to release the produced partitions, and the shuffle API should support this. More details are in the design proposal extending this FLIP.
Extension of writer/reader: A ResultPartition can currently only be written into local memory for streaming jobs, or into a single persistent file per subpartition for batch jobs. It is difficult to extend the partition writer and reader sides together under the current architecture. For example, a ResultPartition might be written in a sort-and-merge fashion or to external storage, and partitions might be transported via an external shuffle service on YARN, Kubernetes, etc., in order to release the TE early.
...