Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

In addition, KafkaStream is part of the metadata and contains a logical mapping to physical Kafka clusters and Kafka topics, which can be transient and change dynamically. Changes in metadata are detected by the enumerator and propagated to readers via source events to reconcile the changes.  In addition, Kafka clusters are uniquely identified by a string id since there could be multiple bootstrap servers lists that can read a certain Kafka cluster (e.g. kafka-server1:9092, kafka-server2:9092 and kafka-server1:9092).

Reconciliation is designed as KafkaSourceEnumerator and KafkaSourceReader restarts–this enables us to "remove" splits. There is careful consideration for resource cleanup and error handling--for example thread pools, metrics, and KafkaConsumer errors.

Other required functionality leverages and composes the existing KafkaSource implementation for discovering Kafka topic partition offsets and round robin split assignment per Kafka cluster. The diagram below depicts how the source will reuse the code of the KafkaSource in order to achieve thisthe requirements.

To the source more user friendly, a MultiClusterKafkaSourceBuilder will be provided (e.g. batch mode should not turn on KafkaMetadataService discovery, should only be done at startup).

...