...

The KafkaSource solves requirements 1-4 for a single Kafka cluster, and the low-level components can be composed in order to provide the functionality for multiple Kafka clusters. For example, an underlying KafkaSourceEnumerator will be used to discover splits, checkpoint assigned splits, and do periodic partition discovery. Likewise, an underlying KafkaSourceReader will be used to poll and deserialize records, checkpoint split state, and commit offsets back to Kafka.

Reconciliation is designed as restarts of the underlying KafkaSourceEnumerator and KafkaSourceReader, which enables the source to "remove" splits. Careful consideration is given to resource cleanup and error handling, for example thread pools, metrics, and KafkaConsumer errors.

Other required functionality leverages and composes the existing KafkaSource implementation for discovering Kafka topic partition offsets and for round-robin split assignment per Kafka cluster. The diagrams below depict how the source reuses the KafkaSource code in order to achieve the requirements.

The Kafka Metadata Service

To provide the ability to dynamically change the underlying source components without a job restart, there needs to exist a coordination mechanism that manages how the underlying KafkaSourceEnumerators and KafkaSources interact with multiple clusters and multiple topics. The KafkaMetadataService is the discovery mechanism by which the source components reconcile metadata changes, and only the MultiClusterKafkaSourceEnumerator interacts with the KafkaMetadataService. Periodic metadata discovery will be supported via source configuration, just as topic partition discovery is supported via an interval in KafkaSource. The service can be seen as a multi-cluster extension of the AdminClient used by KafkaSource, serving only cluster and topic metadata.
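As a minimal sketch, the service contract could look like the following. The method names and signatures here are illustrative, not a final API, and the KafkaStream type is sketched in the next section:

import java.io.Serializable;
import java.util.Collection;
import java.util.Map;
import java.util.Set;

/*
 * Hypothetical sketch of the metadata discovery contract. Only the
 * MultiClusterKafkaSourceEnumerator talks to this service, polling it
 * periodically according to the source configuration.
 */
public interface KafkaMetadataService extends AutoCloseable, Serializable {

    /* Returns metadata for all streams the service knows about. */
    Set<KafkaStream> getAllStreams();

    /* Returns metadata for the requested streams, keyed by stream id. */
    Map<String, KafkaStream> describeStreams(Collection<String> streamIds);

    /*
     * Returns whether the given cluster id still refers to an active cluster,
     * so the enumerator can distinguish a removed cluster from an error.
     */
    boolean isClusterActive(String kafkaClusterId);
}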

A default implementation will be provided so that the metadata can easily be controlled via a native Kubernetes ConfigMap (a YAML/JSON file). This implementation targets the basic use cases in which external monitoring informs how users change the metadata.

KafkaStream and KafkaClusterId

KafkaStream is part of the metadata returned by the KafkaMetadataService and contains a logical mapping to physical Kafka clusters and Kafka topics, which can be transient and change dynamically. Changes in metadata are detected by the enumerator and propagated to readers via source events so that the changes can be reconciled. In addition, Kafka clusters are uniquely identified by a string id, since multiple bootstrap server lists can point to the same Kafka cluster (e.g. kafka-server1:9092,kafka-server2:9092 and kafka-server1:9092).
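A minimal sketch of how these metadata classes could be shaped, assuming the class and field names below; the cluster map is keyed by the unique cluster id rather than by the bootstrap servers string:

import java.io.Serializable;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

/*
 * Hypothetical logical stream: a stream id mapped to the physical clusters
 * and topics that currently back it.
 */
public class KafkaStream implements Serializable {
    private final String streamId;
    // Keyed by the unique Kafka cluster id, not by the bootstrap servers list.
    private final Map<String, ClusterMetadata> clusterMetadataMap;

    public KafkaStream(String streamId, Map<String, ClusterMetadata> clusterMetadataMap) {
        this.streamId = streamId;
        this.clusterMetadataMap = clusterMetadataMap;
    }

    public String getStreamId() {
        return streamId;
    }

    public Map<String, ClusterMetadata> getClusterMetadataMap() {
        return clusterMetadataMap;
    }
}

/*
 * Hypothetical per-cluster details: the topics to read and the Kafka client
 * properties (e.g. bootstrap.servers) needed to connect to that cluster.
 */
class ClusterMetadata implements Serializable {
    private final Set<String> topics;
    private final Properties properties;

    ClusterMetadata(Set<String> topics, Properties properties) {
        this.topics = topics;
        this.properties = properties;
    }

    Set<String> getTopics() {
        return topics;
    }

    Properties getProperties() {
        return properties;
    }
}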

To make the source more user friendly, a MultiClusterKafkaSourceBuilder will be provided (e.g. batch mode should not turn on periodic KafkaMetadataService discovery; metadata should only be fetched at startup), as sketched below.
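For illustration, building the source could look like the following. The builder methods mirror the existing KafkaSourceBuilder, and MultiClusterKafkaSource and MyKafkaMetadataService are assumed names from this proposal, not a finalized API:

import java.util.Collections;

import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.connector.kafka.source.reader.deserializer.KafkaRecordDeserializationSchema;
import org.apache.kafka.common.serialization.StringDeserializer;

public class MultiClusterKafkaSourceExample {
    public static void main(String[] args) {
        MultiClusterKafkaSource<String> source =
            MultiClusterKafkaSource.<String>builder()
                // The service that resolves stream ids to clusters and topics.
                .setKafkaMetadataService(new MyKafkaMetadataService())
                // Subscribe to logical streams instead of physical topics.
                .setStreamIds(Collections.singleton("my-stream"))
                .setGroupId("my-consumer-group")
                .setDeserializer(
                    KafkaRecordDeserializationSchema.valueOnly(StringDeserializer.class))
                .setStartingOffsets(OffsetsInitializer.earliest())
                .build();

        // The source can then be passed to
        // StreamExecutionEnvironment#fromSource like any other Flink source.
    }
}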

...