...

  1. User-defined source parallelism. Source parallelism can be configured through the global parallelism, or assigned per source via DataStream or Table/SQL (FLIP-367).
  2. Connector static parallelism inference (e.g. the Hive source).
  3. Dynamic parallelism inference. For batch jobs that use the adaptive batch scheduler, the current implementation uses a global default source parallelism as the inferred parallelism for all sources.

As described above, the adaptive batch scheduler's current support for source parallelism inference is not truly adaptive: it cannot assign different parallelism to different sources in the same job, nor can it adjust parallelism based on each source's data volume. Compared with manually setting parallelism, automatic inference is easier to use and adapts better to data volumes that vary from day to day. Static inference, however, cannot use runtime information, which can make the inferred parallelism inaccurate (e.g. with FLIP-248 dynamic partition pruning, the amount of data a source actually needs to consume can only be determined at runtime). For batch jobs, dynamic parallelism inference is therefore the most suitable approach, but the adaptive batch scheduler's support for it is currently limited.

We therefore aim to introduce a general interface that enables the adaptive batch scheduler to dynamically infer source parallelism at runtime.
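To make the idea of volume-based inference concrete, the following is a minimal, self-contained sketch of how a source's parallelism could be derived from the amount of data known at runtime. It is an illustration only, not the proposed interface: the class `SourceParallelismSketch`, the method `inferParallelism`, and the parameters `totalBytes`, `bytesPerTask`, and `upperBound` are hypothetical names chosen for this example and are not part of Flink.

```java
// Illustrative sketch only. All names here are hypothetical and not part of Flink.
final class SourceParallelismSketch {

    /**
     * Picks a parallelism so that each subtask reads roughly bytesPerTask bytes,
     * clamped to the range [1, upperBound].
     */
    static int inferParallelism(long totalBytes, long bytesPerTask, int upperBound) {
        if (bytesPerTask <= 0 || upperBound < 1) {
            throw new IllegalArgumentException("bytesPerTask and upperBound must be positive");
        }
        // Ceiling division without risking overflow on very large inputs.
        long needed = totalBytes / bytesPerTask + (totalBytes % bytesPerTask == 0 ? 0 : 1);
        // An empty (or unknown, non-positive) volume still needs one subtask.
        return (int) Math.max(1L, Math.min((long) upperBound, needed));
    }

    public static void main(String[] args) {
        long gib = 1024L * 1024L * 1024L;
        // Full table scan: 10 GiB at 1 GiB per subtask.
        System.out.println(inferParallelism(10 * gib, gib, 128));
        // Same source after runtime partition pruning leaves 1.5 GiB.
        System.out.println(inferParallelism(gib + gib / 2, gib, 128));
        // Very large input: the upper bound caps the result.
        System.out.println(inferParallelism(1024 * gib, gib, 128));
    }
}
```

Under these assumptions, two sources in the same job get different parallelism simply because their runtime data volumes differ, and a source whose input shrinks after partition pruning is assigned fewer subtasks than a static estimate would give it.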

...