...

Large-scale batch jobs establish a large number of network connections, which can cause stability issues. Restricting the number of concurrent partition requests relieves this problem. It also increases the number of network buffers available to each remote channel, that is, more credits per channel, which helps the shuffle reader read sequentially. (As mentioned above, the number of available credits influences sequential reads because we cannot read more buffers than the consumer can process.)
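
The effect on credits can be illustrated with a small arithmetic sketch. It assumes a fixed pool of network buffers that is split evenly among the channels with an outstanding partition request. The function name, its parameters, and the even-split policy are hypothetical and chosen for illustration only; they are not Flink APIs or configuration options.

```python
from typing import Optional


def credits_per_channel(
    total_buffers: int,
    num_channels: int,
    max_concurrent_requests: Optional[int] = None,
) -> int:
    """Return the buffers (credits) each actively requested channel can get.

    Assumption: the buffer pool is divided evenly among the channels that
    currently have an outstanding partition request.
    """
    if total_buffers < 0 or num_channels <= 0:
        raise ValueError("total_buffers must be >= 0 and num_channels > 0")
    if max_concurrent_requests is not None and max_concurrent_requests <= 0:
        raise ValueError("max_concurrent_requests must be > 0")

    # Without a limit, every channel requests its partition at once.
    active = num_channels
    if max_concurrent_requests is not None:
        active = min(num_channels, max_concurrent_requests)

    return total_buffers // active


# Illustrative numbers: 2048 buffers shared by 1000 remote channels.
unrestricted = credits_per_channel(2048, 1000)
restricted = credits_per_channel(2048, 1000, max_concurrent_requests=64)
print(unrestricted, restricted)
```

With these illustrative numbers, limiting the job to 64 concurrent requests raises the per-channel share from 2 buffers to 32, so the reader can send a longer run of consecutive buffers before it has to wait for new credits.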

Implementing a stand-alone shuffle service can further improve shuffle IO performance: as a centralized service, it can collect more information and therefore take better-optimized actions. Examples include better node-level and disk-level load balancing, further file merging, node-level IO scheduling, and shared read/write buffers and thread pools. It can be introduced in a separate FLIP.

Implementation and Test Plan

...