Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

This document highlights our effort to refactor the threading model of the KafkaConsumer, and the purpose of this refactor is to address issues and shortcomings we’ve encountered in the past years, such as increasing code complexity from the hotfixes and patches, lock and race conditions caused by both heartbeat thread and the polling thread making coordinator requests, and the coupling between the client poll and the rebalance progress.

  • Code complexity: the current rebalance protocol is quite heavy on the client side.  And over the years, patches and hotfixes have Patches and hotfixes in the past years have heavily impacted the readability of the code.  Firstly, it makes bug fixes increasingly difficult, and secondly, it makes the code hard to comprehend.  For example, coordinator communication can happen at different placesin both the heartbeat thread and the polling thread, which makes it challenging to understand the code path.  There are also lots of small patches to handle bugs and edge cases that might require more fixes.diagnose the source of the issue when there's a race condition, for example.  Also, many parts of the coordinator code have been refactored to execute asynchronously; therefore, we can take the opportunity to refactor the existing loops and timers. 

  • Complex Coordinator Communication: As the polling thread and the heartbeat thread both talk to the coordinator, it increases the code complexity and causes concurrent issuesCoordinator communication happens at both threads in the current design, and it has caused some race conditions such as KAFKA-13563.  We want to take the opportunity to move all coordinator communication to the background thread and adopt a more linear design.

  • Asynchronous Background Thread: One of our goals here is to make rebalance process happen asynchronously to the poll.  Firstly, it simplifies the polling thread design, as all the coordinator communication will be moved to the backgroundit essentially only needs to handle fetch requests.  Secondly, it prevents consumer poll blocking other consumer operations such as network requests.

Scope

We will be moving network communication, including heartbeat, to the background thread, which allows the consumer to operate more asynchronously.  In particular, the scope of this will be as such:

  • because rebalance occurs in the background thread, the polling thread won't be blocked or blocking the rebalance process.

Scope

  1. Deprecate HeartbeatThread

    1. Remove the heartbeat thread.
    2. Move the heartbeat logic to the background thread.
  2. Refactor the Coordinator classes

    1. Revisit timers and loops
    2. Remove some blocking methods, such as commitOffsetSync.  The blocking commit should be handled by the polling thread, waiting for the background thread to complete the future.
    3. The coordinator will be polled by the background thread loop.
    4. Rebalance state modification: We will add a few states to ensure the rebalance callbacks are executed in the correct sequence and time
  3. Deprecate HeartbeatThread

  4. Refactor the Coordinator classes, i.e., AbstractCoordinator and ConsumerCoordinator, to adapt to this asynchronous design
    1. .
  5. Refactor the KafkaConsumer API internals to adopt this asynchronous design

    1. It will send Events to the background thread if network communication is required.
    2. Remove dependency on the fetcher, coordinator, and networkClient.
  6. Introduce events and communication channels to facilitate communication between the polling and background threads..

    1. We will use two channels the facilitate the two-way communication
  7. Address issues in these Jira tickets

...