Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  1. To avoid IO operations blocking the JM main thread, the JobEventStore will write each event out in an asynchronous thread.
  2. To avoid frequent IO operations causing great pressure on external file system, there will be a write buffer inside the JobEventStore. The JobEvents will be written to the buffer first, and then flushed to external file system when the buffer is full or the flush time is reached. The flush frequency will be controlled by the following 2 configuration options:
    1. job-event.store.write-buffer.size: The size of the write buffer, the content will be flushed to external file system once it's full.
    2. job-event.store.write-buffer.flush-interval:  The flush interval of write buffer, over this time, the content will be flushed to external file systemThe flush interval of write buffer, over this time, the content will be flushed to external file system.

Reconnection

Currently, when a JM crashes, TM will sense it through HA or heartbeat timeout, and then it will release all resources related to this job (including slots, and partitions when using TM shuffle), and then disconnect from JM.

...

  1. When it is found that the JM lost, TM will fail all tasks belongs to the target job and and release the corresponding slots.
  2. If there are partitions belongs the target job on TM, the TM should retain the partitions, and wait for HA to notify the new JM and try to establish connection to the JM. We need to register a timeout for waiting, and release the partition after the timeout. We can reuse the existing configuration option “taskmanager.registration.timeout” here, the default value is 5 minutes.
  3. If there is no partitions belongs the target job on TM, keep the same logic as currentIf there is no partitions belongs the target job on TM, keep the same logic as current.

If it‘s using other external shuffle services, it should be the same as TM shuffle, when it detects JM crash, it should retain the partitions and wait the JM to reconnect.

Re-schedule after JM restart

After JM restarts and becomes leader again, it will wait for a period of time for the TMs to re-establish connection with itself. The length of waiting time is controlled by After JM restarts and becomes leader again, it will wait for a period of time for the TMs to re-establish connection with itself. The length of waiting time is controlled by "execution.batch.job-recovery.previous-worker.recovery.timeout", only TMs connected within this period will be accepted, and those that time out will be rejected. Once we have enough partitions (all partitions required to continue running are registered), we can end this wait early and continue to the next step.

...