...

  • Introduce the config option high-availability.enable-job-recovery to control the implementation of the leader services and persistence services for the JobMaster. This option should only take effect in session mode and should default to true.

  • Introduce the config option high-availability.blob-store.enabled to control the implementation of the blob services. This option should default to true. If it is not manually configured and high-availability.enable-job-recovery is set to false, it should be set to false.
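The defaulting rule described in the two bullets above can be sketched in plain Java. This is a hypothetical illustration, not Flink's ConfigOption code: the class and method names are made up, the configuration is modeled as a plain Map<String, String>, the job-recovery key follows the first bullet, and we assume that outside session mode the job-recovery option is simply ignored (treated as true).

```java
import java.util.Map;

/**
 * Hypothetical sketch of how the two proposed options could be resolved.
 * Only the option keys come from the proposal; class and method names are illustrative.
 */
public final class HaOptionResolutionSketch {

    static final String JOB_RECOVERY_KEY = "high-availability.enable-job-recovery";
    static final String BLOB_STORE_KEY = "high-availability.blob-store.enabled";

    private HaOptionResolutionSketch() {}

    /** Job recovery defaults to true; assumption: the option is only honored in session mode. */
    static boolean resolveJobRecoveryEnabled(Map<String, String> conf, boolean sessionMode) {
        if (!sessionMode) {
            // Assumed behavior: outside session mode the option is ignored and recovery stays on.
            return true;
        }
        return Boolean.parseBoolean(conf.getOrDefault(JOB_RECOVERY_KEY, "true"));
    }

    /** Blob store HA defaults to true, unless it is left unset while job recovery is disabled. */
    static boolean resolveBlobStoreEnabled(Map<String, String> conf, boolean sessionMode) {
        String explicit = conf.get(BLOB_STORE_KEY);
        if (explicit != null) {
            // A manually configured value always wins.
            return Boolean.parseBoolean(explicit);
        }
        // Not configured: follow the effective job-recovery setting.
        return resolveJobRecoveryEnabled(conf, sessionMode);
    }

    public static void main(String[] args) {
        System.out.println(resolveBlobStoreEnabled(Map.of(), true)); // true
        System.out.println(resolveBlobStoreEnabled(Map.of(JOB_RECOVERY_KEY, "false"), true)); // false
        System.out.println(resolveBlobStoreEnabled(
                Map.of(JOB_RECOVERY_KEY, "false", BLOB_STORE_KEY, "true"), true)); // true
    }
}
```

The explicit-value check comes first so that a user who disables job recovery can still keep an HA blob store by setting the option manually.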

Note: I don't mention HighAvailabilityServices here because it is not annotated as a public interface, even though we do expose configuration that lets users plug in their own implementations.

...

As mentioned above, in OLAP scenarios we only require leader election services for the Dispatcher, ResourceManager, and RestEndpoint in the JobManager process. Per-job leader election and persistence services are redundant and may degrade cluster performance. Thus, we propose the following:

  • To generate HA services suitable for OLAP scenarios, we introduce the high-availability.enable-job-recovery option. When users enable HA with Kubernetes or ZooKeeper and set this option to false, we will:
    • Use the embedded versions of CheckpointStore, JobGraphStore and JobResultStore.
    • Set high-availability.blob-store.enabled to false if it is not manually configured.
    • Use the standalone versions of LeaderElectionService and LeaderRetrievalService for the JobMaster.
  • After the JM is granted leadership, it no longer publishes its leader information to the underlying system. Other components determine the leader status of the JM by listening to the leader information of the RM.
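To make the proposed switch concrete, here is a hypothetical Java sketch of which implementation family each service would use depending on high-availability.enable-job-recovery. The Component and Impl enums and the select method are illustrative names, not Flink classes; HA_BACKED stands for the ZooKeeper or Kubernetes based implementations. The blob store is deliberately left out because it is governed by high-availability.blob-store.enabled.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical sketch of the per-service implementation choice; not Flink's actual factory code. */
public final class OlapHaSelectionSketch {

    /** Illustrative identifiers named after the services mentioned in the proposal. */
    enum Component {
        DISPATCHER_LEADER_ELECTION,
        RESOURCE_MANAGER_LEADER_ELECTION,
        REST_ENDPOINT_LEADER_ELECTION,
        JOB_MASTER_LEADER_ELECTION,
        JOB_MASTER_LEADER_RETRIEVAL,
        CHECKPOINT_STORE,
        JOB_GRAPH_STORE,
        JOB_RESULT_STORE
    }

    /** HA_BACKED: ZooKeeper/Kubernetes based; EMBEDDED: in-memory; STANDALONE: no election. */
    enum Impl { HA_BACKED, EMBEDDED, STANDALONE }

    private OlapHaSelectionSketch() {}

    static Map<Component, Impl> select(boolean jobRecoveryEnabled) {
        Map<Component, Impl> choice = new EnumMap<>(Component.class);

        // Cluster-level leader election stays HA-backed regardless of the flag.
        choice.put(Component.DISPATCHER_LEADER_ELECTION, Impl.HA_BACKED);
        choice.put(Component.RESOURCE_MANAGER_LEADER_ELECTION, Impl.HA_BACKED);
        choice.put(Component.REST_ENDPOINT_LEADER_ELECTION, Impl.HA_BACKED);

        // Job-level services: HA-backed by default, lightweight when job recovery is disabled.
        Impl persistence = jobRecoveryEnabled ? Impl.HA_BACKED : Impl.EMBEDDED;
        Impl jobMasterLeader = jobRecoveryEnabled ? Impl.HA_BACKED : Impl.STANDALONE;
        choice.put(Component.CHECKPOINT_STORE, persistence);
        choice.put(Component.JOB_GRAPH_STORE, persistence);
        choice.put(Component.JOB_RESULT_STORE, persistence);
        choice.put(Component.JOB_MASTER_LEADER_ELECTION, jobMasterLeader);
        choice.put(Component.JOB_MASTER_LEADER_RETRIEVAL, jobMasterLeader);
        return choice;
    }

    public static void main(String[] args) {
        // Print the OLAP-mode selection, one component per line in enum order.
        select(false).forEach((component, impl) -> System.out.println(component + " -> " + impl));
    }
}
```

Only the job-level entries change with the flag; keeping Dispatcher, ResourceManager and REST endpoint election HA-backed preserves cluster-level failover while dropping per-job writes to the HA system.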

Compatibility, Deprecation, and Migration Plan

...

After discussion, we found that refactoring the HA services involves the following items:
- Splitting LeaderServices and PersistenceServices; as Matthias mentioned, this allows for easier testing.
- Removing deprecated interfaces, such as getWebMonitorLeaderElectionService.
- Reviewing the multiple existing close and cleanup interfaces.
- Integrating StandaloneHaServices and EmbeddedHaServices.

We decided to move this refactoring out of the scope of this FLIP, as it is large enough to warrant a separate discussion thread.