...

Currently, Flink provides high availability in an "all or nothing" manner. A highly-available setup offers two mechanisms: leader election/retrieval services for the JobManager and persistence services for job metadata. Both mechanisms are defined in the HighAvailabilityServices interface. At runtime, Flink constructs a different implementation of HighAvailabilityServices depending on the user configuration, e.g. KubernetesLeaderElectionHaServices or ZooKeeperLeaderElectionHaServices. As a result, the two mechanisms can only be enabled or disabled together.
However, in OLAP scenarios, we only need the leader election/retrieval services for the components in the JobManager. In our production environment, users submit many short queries through the SQL Gateway. These jobs typically complete within a few seconds, and when an error occurs during job execution, users simply resubmit the job. Therefore, there is no need to recover a job after a JobManager failure, persist its state for recovery, or perform leader election for a specific job. On the contrary, persisting job state reduces the cluster's throughput for short queries, as demonstrated by the HighAvailabilityServiceBenchmark (258 qps without HA vs. 69 qps with ZK HA). At the same time, in this scenario we treat the Flink session cluster as a service. To meet the SLA (Service Level Agreement), we rely on the JobManager's failover mechanism to minimize service downtime. Therefore, we still need leader election for the components in the JobManager process, e.g. ResourceManager and Dispatcher.
In this FLIP, we propose splitting HighAvailabilityServices into LeaderServices and PersistenceServices, and allowing users to independently adjust the job-related high availability strategies through configuration.

...

  • Introduce the high-availability.enable-job-recovery option to control the implementation of leader services and persistence services for the JobMaster. This config option should only take effect in session mode and is true by default.
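As a rough illustration of how the option could be resolved, the sketch below reads it from a plain key/value map. Only the option key and its default of true come from this proposal. The class JobRecoveryOption, the resolve method, the sessionMode flag, and the behavior outside session mode (ignoring the option and keeping job recovery enabled) are assumptions made for the example, not Flink's configuration API.

```java
import java.util.Map;

/** Hypothetical helper; not part of Flink. */
final class JobRecoveryOption {
    static final String KEY = "high-availability.enable-job-recovery";
    static final boolean DEFAULT = true;

    /**
     * Resolves the effective value of the option.
     * Assumption: outside session mode the option is ignored and
     * job recovery stays enabled.
     */
    static boolean resolve(Map<String, String> conf, boolean sessionMode) {
        if (!sessionMode) {
            return DEFAULT;
        }
        String raw = conf.get(KEY);
        // An unset option falls back to the default of true.
        return raw == null ? DEFAULT : Boolean.parseBoolean(raw.trim());
    }
}
```

With this sketch, a session cluster that sets the option to false resolves to false, while the same setting in any other mode resolves to true.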

...

We propose separating HighAvailabilityServices into LeaderServices and PersistenceServices, and introducing high-availability.enable-job-recovery to control the behavior related to job recovery when HA is enabled.

...

In fact, however, the HighAvailabilityServices implementations differ only in their choice of leader service and persistence service; different combinations simply result in implementations with different names. Thus, we can abstract these two parts as LeaderServices and PersistenceServices and keep only one implementation of HighAvailabilityServices. We can then combine different LeaderServices and PersistenceServices to meet the requirements of various scenarios.
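The idea of a single HighAvailabilityServices implementation that pairs two independently chosen parts can be sketched as follows. This is a minimal illustration, not the proposed API. The service class names are taken from this FLIP, but ComposedHaServices and the methods usesExternalCoordination and survivesProcessRestart are hypothetical. The assumption that the embedded persistence variant does not survive a process restart is ours.

```java
import java.util.Objects;

// Simplified stand-ins for the interfaces proposed in this FLIP.
// All method names are illustrative assumptions, not the Flink API.
interface LeaderServices {
    /** Whether leadership is coordinated through an external system. */
    boolean usesExternalCoordination();
}

interface PersistenceServices {
    /** Whether job metadata survives a JobManager process restart (assumed semantics). */
    boolean survivesProcessRestart();
}

final class DefaultLeaderServices implements LeaderServices {
    public boolean usesExternalCoordination() { return true; }
}

final class StandaloneLeaderServices implements LeaderServices {
    public boolean usesExternalCoordination() { return false; }
}

final class DefaultPersistenceServices implements PersistenceServices {
    public boolean survivesProcessRestart() { return true; }
}

final class EmbeddedPersistenceServices implements PersistenceServices {
    public boolean survivesProcessRestart() { return false; }
}

/** Hypothetical single HA implementation: a pair of independently chosen parts. */
final class ComposedHaServices {
    private final LeaderServices leaderServices;
    private final PersistenceServices persistenceServices;

    ComposedHaServices(LeaderServices leaderServices, PersistenceServices persistenceServices) {
        this.leaderServices = Objects.requireNonNull(leaderServices);
        this.persistenceServices = Objects.requireNonNull(persistenceServices);
    }

    LeaderServices getLeaderServices() { return leaderServices; }

    PersistenceServices getPersistenceServices() { return persistenceServices; }
}
```

Because the two parts are chosen separately, a new scenario only needs a new pairing, e.g. new ComposedHaServices(new DefaultLeaderServices(), new DefaultPersistenceServices()), rather than a new named HighAvailabilityServices class.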
The following class diagram illustrates the classes and interfaces after refactoring:

...

  • The previous StandaloneHaServices and EmbeddedHaServices will be replaced by the combination of EmbeddedPersistenceServices with StandaloneLeaderServices or EmbeddedLeaderServices, respectively.

  • The previous Kubernetes and ZooKeeper scenarios will be unified using DefaultLeaderServices and DefaultPersistenceServices. The relevant materials for the scenarios, such as LeaderElectionDriverFactory and leader paths for each component, will be provided through the LeaderServiceMaterialGenerator. The specific implementations for this are KubernetesHaServicesMaterialProvider and ZooKeeperHaServicesMaterialProvider. These two classes also provide CheckpointRecoveryFactory and JobGraphStore.

...

As mentioned above, in OLAP scenarios, we only require the leader election services for the Dispatcher, ResourceManager, and RestEndpoint in the JobManager process. Leader election services and persistence services are redundant for jobs and may impact cluster performance.
To generate HA services suitable for OLAP scenarios, we introduce the high-availability.enable-job-recovery parameter. When users enable HA with Kubernetes or ZooKeeper and set this option to false, we will select the combination of DefaultLeaderServices and EmbeddedPersistenceServices. Additionally, we will set the JobMaster's LeaderElectionService and LeaderRetrieverService to the Standalone version.
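The selection described above can be sketched as a small decision function. This is a hedged illustration rather than the actual implementation. The four service class names come from this FLIP, while HaServicesSelector, HaMode, Selection, and the "Default"/"Standalone" labels for the JobMaster's leader services are hypothetical. We also assume that with job recovery enabled the JobMaster uses the same leader services as the other components.

```java
/** Hypothetical sketch; only the service class names come from this FLIP. */
final class HaServicesSelector {

    /** Hypothetical stand-in for the configured HA backend. */
    enum HaMode { NONE, ZOOKEEPER, KUBERNETES }

    /** Names of the implementations selected for one configuration. */
    static final class Selection {
        final String leaderServices;
        final String persistenceServices;
        final String jobMasterLeaderServices;

        Selection(String leaderServices, String persistenceServices, String jobMasterLeaderServices) {
            this.leaderServices = leaderServices;
            this.persistenceServices = persistenceServices;
            this.jobMasterLeaderServices = jobMasterLeaderServices;
        }
    }

    static Selection select(HaMode mode, boolean enableJobRecovery) {
        if (mode == HaMode.NONE) {
            // No HA backend: the replacement for the previous StandaloneHaServices.
            return new Selection("StandaloneLeaderServices", "EmbeddedPersistenceServices", "Standalone");
        }
        if (enableJobRecovery) {
            // Full HA with Kubernetes or ZooKeeper (assumed JobMaster behavior).
            return new Selection("DefaultLeaderServices", "DefaultPersistenceServices", "Default");
        }
        // OLAP case: JobManager components keep leader election, jobs do not.
        return new Selection("DefaultLeaderServices", "EmbeddedPersistenceServices", "Standalone");
    }
}
```

For example, select(HaMode.ZOOKEEPER, false) yields DefaultLeaderServices with EmbeddedPersistenceServices and a Standalone JobMaster, which is the OLAP combination described above.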

...