Here is a list of related wiki pages/works that have been done pre-this design:

Client rewrite (consumer API section).
Consumer client re-design motivation page.
Inbuilt Consumer Offset design page.
Centralized Consumer Coordinator design page and detailed implementation page.

Goals (G)

Here is a list of goals that we want to achieve in 0.9 Consumer Rewrite, some are musts, and some are good-to-haves:

[MUST] Broker-side Group Membership Management
1. Consumer membership change detection
2. Topic/partition change detection
3. Consumer coordinator failure detection
[MUST] Broker-side Offset Management
1. Inbuilt offset management
2. CommitOffset/FetchOffset protocol
[MUST] Non-blocking Consumer API
[GOOD-TO-HAVE] Single-threaded Consumer. Note that this means just one user thread, the consumer client itself will not have any background threads.
[MUST] Manual Partition Assignment
1. AddTopic/Partition Consumer API
[MUST] Manual Offset Assignment
1. GetOffset Consumer API
2. Consume function call returns message metadata with message
[GOOD-TO-HAVE] User Specified Callback on Rebalance. This is effectively conflicted with G4.

Requirements (R)

Besides the goals, here is a list of requirement for operational/release purposes:

Co-existence of new server with old consumer
1. Fetch/TopicMetadata request protocol backward-compatible
2. Broker side group/offset management module should be stand-alone
Backward-compatible with the inbuilt consumer offset management
1. We currently assume that 1) it will only support Kafka-based offset management with each consumer group assigned to one broker, 2) commit/fetch offset request will be sending to the found broker that hosts the offset manager.
2. With that, the requirement would be just 1) allow consumer initiated requests.

Consumer API

Since the consumer is no longer ZK-dependent, it needs to rely on the communication with the consumer coordinator for membership and offset management. In addition, its consume function call should be non-blocking. Here is the proposed consumer API interface:

Consumer (Properties) {

  // Subscribe to a list of topics, return immediately. 
  // Throws an exception if the new subscription is not valid with existing subscriptions.
  void subscribe(String...) throws SubscriptionNotValidException;

  // Subscribe to a specified topic partition, return immediately.
  // Throws an exception if the new subscription is not valid with existing subscriptions.
  void subscribe(String, Int) throws SubscriptionNotValidException;   

  // Subscribe to a specified topic partition at the given offset, return immediately.
  // Throws an exception if the new subscription is not valid with existing subscriptions.
  void subscribe(String, Int, Long) throws SubscriptionNotValidException;   

  // Try to fetch messages from the topic/partitions it subscribed to. 
  // Return whatever messages are available to be fetched or empty after specified timeout.
  List<Records> poll(Long) throws SubscriptionIsEmptyException, ConsumeInitOffsetNotKnownException;

  // Commit latest offsets for all partitions currently consuming from.
  // If the sync flag is set to true, commit call will be sync and blocking.
  // Otherwise the commit call will be async and best-effort.
  void commit(Boolean) throws SubscriptionIsEmptyException; 

  // Commit the specified offset for the specified topic/partition.
  // If the sync flag is set to true, commit call will be sync and blocking.
  // Otherwise the commit call will be async and best-effort.
  // Throws an exception if the partitions specified are not currently consuming from.
  void commit(List[(String, Int, Long)], Boolean) throws InvalidPartitionsToCommitOffsetException;

  // Specify the fetch starting offset for the specified topic/partition.
  void pos(List[(String, Int, Long)])

  // Get the currently consuming partitions.
  // Block wait if the current partitions are not known yet.
  List[(String, Int)] getPartitions() throws SubscriptionIsEmptyException;

  // Get the last committed offsets of the partitions currently consuming.
  // Block wait if the current partitions are not known yet.
  Map[(String, Int), Long] getOffsets() throws SubscriptionIsEmptyException;

  // Get the last committed offsets of the specified topic/partition.
  Long getOffset(String, Int) throws SubscriptionIsEmptyException, InvalidPartitionsToGetOffsetException;
  
  // --------- Call-back Below, not part of API ------------ //

  // Call-back function upon partition de-assignment.
  // Default implementation will commit offset depending on auto.commit config.
  void onPartitionDeassigned(List[(String, Int)]);

  // Call-back function upon partition re-assignment.
  // Default implementation is No-Op.
  void onPartitionAssigned(List[(String, Int)])

  // --------- Optional Call-back Below ------------ //

  // Call-back function upon partition reassignment given the group member list and the subscribed topic partition info.
  // Return the partitions that would like to consume, and need to make sure each partition is covered by just one member in the group.
  List[(String, Int)] partitionsToConsume(List[String], List[TopicPartition]);
}

The poll() function call returns MessageAndMetadata not just MessageAndOffset because of G6.b. This would also help removing the decompression/recompression in mirror maker (details can be found in KAFKA-1011).

Consumer Coordinator Overview

Each broker will have a consumer coordinator responsible for group management of one or more consumer groups, including membership management and offset management. Management of consumer groups will be distributed across all brokers' coordinators.

For each consumer group, the coordinator stores the following information:

Consumer groups it currently serve, containing
1. The consumer list along with their subscribed topic/partition.
2. Partition ownership for each consumer within the group.
List of subscribed groups for each topic.
List of groups subscribed to wildcards.
Latest offsets for each consumed topic/partition.

The coordinator also have the following functionalities:

handleNewConsumerGroup: this is called when the coordinator starts to serve a new consumer group.
handleAddedConsumer: this is called when a new consumer is registered into a existing consumer group.
handleConsumerFailure: this is called when the coordinator thinks a consumer has failed and hence kick it out of the group.
handleTopicChange: this is called when the coordinator detects a topic change from ZK.
handleTopicPartitionChange: this is called when the coordinator detects a topic partition change from ZK.
handleCommitOffset: this this called for committing offset for certain partitions.
handleConsumerGroupRebalance: this is called when the partition ownership needs to be re-assigned within a group.

The coordinator also maintain the following information in ZK:

Partition ownership of the subscribed topic/partition. This can now be written in a single path with a Json string instead of into multiple paths, one for each partition.
[Optional] Coordinators for each consumer groups.

The coordinator also holds the following modules:

A threadpool of rebalancer threads executing rebalance process of single consumer groups.
[Optional] A socket server for both receiving and send requests and sending responses.
[Optional] A delayed scheduler along with a request purgatory for sending heartbeat requests.
[Optional] A ZK based elector for assigning consumer groups to coordinators.

Design Proposal

For consumer implementation, we have two main proposals: single-threaded consumer of multi-threaded consumer.

Single-threaded Consumer

API Implementation

subscribe:
 
   1. Check if the new subscription is valid with the old subscription:
 
   1.1. If yes change the subscription and return.
 
   1.2. Otherwise throw SubscriptionNotValidException.

poll:
 
   1. Check if the subscription list is empty, if yes throw SubscriptionIsEmptyException

   2. Check if the subscription list has changed and is topic-based:
 
   2.1. If not, check if the subscription is topic-based:
 
   2.1.1. If not, call fetch() // no need to talk to coordinator
 
   2.1.2. If yes, call coordinate() and call fetch() // may need to heartbeat coordinator
 
   2.2. If yes, call coordinate() and call fetch()

commit:
 
   1. Call coordinate().
 
   2. Send commit offset to coordinator, based on sync param block wait on response or not.

getPartition:

   1. Check if the subscription list is empty, if yes throw SubscriptionIsEmptyException

   2. If the partition info is not known, call coordinate()

   3. Return partitions info.


getOffset(s):

   1. Check if the subscription list is empty, if yes throw SubscriptionIsEmptyException

   2. If the partition info is not known, call coordinate()

   3. If the offset info is not known, based on kafka.offset.manage send getOffset to coordinator or throw InvalidPartitionsToGetOffsetException

fetch:
 
   1. If the partition metadata is not know, send metadata request and block wait on response and setup channels to partition leaders.
 
   2. Immediate-select readable channels if there is any, return data.
 
   3. If there is no readable channels, check if the offset is known.

   3.1. If not throw ConsumeInitOffsetNotKnownException.

   3.2. Otherwise send the fetch request to leader and timeout-select.

coordinate:
 
   1. Check if the coordinator channel has been setup and not closed, and if subscription list has not changed. If not: 

   1.1. Block wait on getting the coordinator metadata.

   1.2. call onPartitionDeassignment().

   1.3. Send the registerConsumer to coordinator, set up the channel and block wait on registry response.
 
   1.3.a.1 The registerConsumer response contain member info and partition info, call partitionsToConsume to get the owned partitions.

   1.3.a.2 Send the partitions to consume info and get confirmed from the coordinator. // two round trips
 
   1.3.b. The registerConsumer response contain partition assignment from the coordinator. // one round trip
 
   1.4. and call onPartitionAssigned().

   2. If yes, check if heartbeat period has reached, if yes, send a ping to coordinator.

Open questions:

1. coordinate/deserialization/decompress procedures need to be interruptible on timeout?
2. getOffsets and getPartitions be blocking?

As for coordinator, most of the details can be found at:

https://iwww.corp.linkedin.com/wiki/cf/display/ENGS/Kafka+0.9+Consumer+Rewrite+Proposal#Kafka0.9ConsumerRewriteProposal-Option1.a:Consumer-InitiatedHeartbeatandConsumer-Reregistration-Upon-Rebalance

Multi-threaded Consumer

----- Deprecated below -----

There are two key open questions about the rewrite design, which are more or less orthogonal to each other. so we will propose different designs for each of these open issues in the following:

1. Consumer and Coordinator Communication

Without the coordinator, the consumer already needs to communicate with the broker for the following purposes:

Send fetch requests and get messages as responses.
Send topic metadata requests to refresh local metadata cache.

With the coordinator, there are a few more communication patterns:

Consumer-side coordinator failure detection and vice versa via heartbeat protocol.
Offset management: consumers need to commit/fetch offsets to/from the coordinator.
Group management: coordinator needs to notify consumers regarding consumer group changes and consumed topic/partition changes (e.g. rebalancing).

Among them, fetching messages, refreshing topic metdata and offset management has to be consumer-initiated, while heartbeat-based failure detection and group management can be either consumer-initiated or coordinator initiated. So the choices really are 1) who should initiate the heartbeat for failure detection, and 2) who should initiate the rebalance.

In addition, fetching requests and topic metadata requests are currently received via broker's socket server and handled by the broker's request handling threads. We can also either make the other requests received by the broker as well and pass them to the coordinator or let coordinator communicate directly with the consumers for these type of requests. I think the second option of keeping a separate channel is better since channel these requests should not be delayed by other fetch/produce requests.

Option 1.a: Consumer-Initiated Heartbeat and Consumer-Reregistration-Upon-Rebalance

If we want to support G4, then all the communication patterns must be done as consumer-initiated. Then the coordinator only needs the threadpool of rebalancers and probably its own socket server, and nothing else.

One proposed implementation following this direction can be found here. A brief summary:

Consumer-initiated heart beat in the poll() function for controller-end consumer failure detection.
Controller acks the heart beat if it still have the requested consumer in the group.
Upon receiving heart beat ack consumer send fetch request and block wait for responses, and return.
If the controller does not have the consumer in the group, kicks a rebalance process:
1. Add an error message in all the heart beat requests from the consumers in the group.
2. Wait until all the consumers' re-issued register consumer request.
3. Send the register consumer response with the owned partition/offset info
If the consumer does not hear the ack back after timeout or receive an error response:
1. Send a metadata request to any brokers to get the new controller
2. Send the register consumer request to the new controller
3. Re-start from step 1.

The communication pattern would be:

<-- (register) -- Consumer (on startup or subscription change or coordinator failover)

Coordinator <-- (register) -- Consumer (on startup or subscription change or coordinator failover)

<-- (register) -- Consumer (on startup or subscription change or coordinator failover)

-----------------------------------------------------------------------------------------------------------------

Coordinator (if there is at least one consumer already in the group, send error code to their ping requests, so that they will also resend register)

Coordinator -- (error in reply) --> Consumer (rest of consumers already in the group)

Coordinator <-- (register) -- Consumer (rest of consumers already in the group)

-----------------------------------------------------------------------------------------------------------------

Now everyone in the group have sent the coordinator a register request

(record metadata) (confirm) --> Consumer (successfully received the registration response which contains the assigned partition)

Coordinator (record metadata) (confirm) -X-> Consumer (consumer did not receive the confrontation or failed during the process, will try to reconnect to the coordinator and register again)

(reject) --> Consumer (register failed due to some invalid meta info, try to reconnect to the coordinator and register again)

------------------------------- consumer registration ---------------------------------------------

After that the consumer will send one ping request every time it calls send():

Consumer.send()

Coordinator <-- (ping) -- Consumer

<-- (ping) -- Consumer

Coordinator -- (ping) --> Consumer

-- (ping) --> Consumer

Broker <–(fetch)-- Consumer (if the socket is set to writable; if it is set to readable then the consumer can directly read data, skipping this step)

Broker --(response)–> Consumer

return response // if the timeout is reached before response is returned, the empty set will return to the caller, and responses returned later will be used in the next send call.

--------------------------------------------------------------------------------------------------------------

Coordinator (failed) -- X --> Consumer (have not heard from coordinator for some time, stop fetcher and try to reconnect to the new coordinator)

-- X --> Consumer (have not heard from coordinator for some time, stop fetcher try to reconnect to the new coordinator)

Coordinator <-- (register) -- Consumer (will contain current consumed offset)

<-- (register) -- Consumer (will contain current consumed offset)

Coordinator (update the offset by max-of-them)

– (confirm) --> Consumer (contains the assigned partition)

----------------------------------------------------------------------------------------------

If the coordinator have not heard from a consumer, start re-balancing by setting error to the ping reply to all other consumers

---------------------------- failure detection ----------------------------------------

The pros of this design would be having single-threaded consumer (G4) with simplified consumer-coordinator communication pattern (coordinator simply listens on each channel, receive requests and send responses).

The cons are:

Rebalance latency is determined by the longest heart beat timeout among all the consumers.
Hard for consumer clients to configure the heart beat timeout since it is based on its poll() function calling partition.
Rebalance will cause all consumers to re-issue metadata requests unnecessarily.
If a re-registration is required, poll() function may take longer than timeout.

Option 1.b: Coordinator-Initiated Heartbeat and Coordinator-Initiated Rebalance

An alternative approach is to let coordinator issue heartbeat and group management requests once the consumers have established the connection with it. Then the coordinator needs the threadpool of rebalancers, the delayed scheduler and the purgatory for heartbeat.

In this case, the consumer only needs to actively establish the connection to the coordinator on startup or coordinator failover.

<-- (register) -- Consumer (on startup or subscription change or coordinator failover)

Coordinator <-- (register) -- Consumer (on startup or subscription change or coordinator failover)

<-- (register) -- Consumer (on startup or subscription change or coordinator failover)

-----------------------------------------------------------------------------------------------------------------

(record metadata) (confirm) --> Consumer (successfully received the registration response, block reading for coordinator requests such as heartbeat, stop-fetcher, start-fetcher)

Coordinator (record metadata) (confirm) -X-> Consumer (consumer did not receive the confrontation or failed during the process, will try to reconnect to the coordinator and register again)

(reject) --> Consumer (register failed due to some invalid meta info, try to reconnect to the coordinator and register again)

------------------------------- consumer registration ---------------------------------------------

After that the consumer will just listen on the socket for coordinator requests and respond accordingly.

Coordinator (failed) -- X --> Consumer (have not heard from coordinator for some time, try to reconnect to the new coordinator)

-- X --> Consumer (have not heard from coordinator for some time, try to reconnect to the new coordinator)

----------------------------------------------------------------------------------------------

Coordinator -- (ping) --> Consumer (alive)

-- (ping) --> Consumer (failed)

----------------------------------------------------------------------------------------------

Coordinator (record offset) <-- (reply) -- Consumer (alive)

(rebalance) time out... Consumer (failed)

---------------------------- failure detection ----------------------------------------

If coordinator detects a consumer has failed or new consumer added to the group, or a topic/partition change that the group is subscribing to, issue a rebalance in two phases 1) first ask consumers to stop fetching with latest committed offset, and 2) then ask consumers to resume fetching with new ownership info.

Coordinator -- (stop-fetcher) --> Consumer (alive)

-- (stop-fetcher) --> Consumer (alive)

----------------------------------------------------------------------------------------------

Coordinator (commit offset) <-- (reply) -- Consumer (alive)

(commit offset) <-- (reply) -- Consumer (alive)

----------------------------------------------------------------------------------------------

Coordinator (compute new assignment) -- (start-fetcher) --> Consumer (alive)

-- (start-fetcher) --> Consumer (become failed)

----------------------------------------------------------------------------------------------

Coordinator <-- (reply) -- Consumer (alive)

(rebalance failed) time out... Consumer (failed)

Consumer Registration

For consumer's subscribe() function call, similar to Option 1.a we just update its subscribed topic/partitions.

The consumer's poll() function will be simply polling data from the fetching queue. It will use another fetcher thread to fetch the data. A single fetch queue will be used for the fetcher thread to append data and the poll() function call to get the data.

The fetch thread will check if it has registered with a coordinator or not, or if its subscribed topic/partitions have changed or not. If either case is true the consumer will try to register it self. If we only allow subscribe function to be called once, then the second check can be skipped.

In addition, when the consumer detects the current coordinator has failed (will talk about consumer-side coordinator failure detection later), it will also try to re-register itself with the new coordinator. Note that this registration process is blocking and will only return after it gets response.

Once the registration succeed a coordination thread will be created to just keep listening on the established socket for coordination-side requests.

Boolean Consumer : registerConsumer()

1. Compute the partition id for the "offset" topic

2. Send a TopicMetadataRequest to any brokers from the list to get the leader host from the response.

3. Send a RegisterConsumerRequest to the coordinator host with the specific coordinator port and wait for the response.

4. If a non-error response is received within timeout:

4.1. Get the session timeout info from the RegisterConsumerResponse.

4.2. Start the CoordinationThread listening on the socket for coordinator-side requests.

5. If failed in step 2/3 restart from step 1.

6. If specified number of restart has failed return false.


List<MessageAndMetadata> Consumer : poll(Long timeout)

1. Poll data from the fetching queue.

2. De-serialize data into list of messages and return


ConsumerFetcherThread : doWork()

1. Check whether the coordinator is unknown or the subscribed topic/partitions have changed, if yes:

1.1. If the coordinator is known, call commitOffset().  // blocking call

1.2. Call registerConsumer().  // blocking call

2. If owned partition/offset information is unknown wait on the condition.

3. Send FetchRequest to the brokers with partition/offset information and wait for response.

4. Decompress fetched messages if necessary, and append them to the fetching queue.

RegisterConsumerRequest

{
  Version                => int16
  CorrelationId          => int64
  GroupId                => String
  ConsumerId			 => String
  SessionTimeout         => int64
  Subscribing            => [<TopicPartitions>]
    Topic                => String
    PartitionIds         => [int16]                  // -1 indicates "any partitions"
}

RegisterConsumerResponse

{
  Version                => int16
  CorrelationId          => int64
  ErrorCode              => int16
  GroupId                => String
  ConsumerId			 => String
}

Handling RegisterConsumerRequest

The coordinator, on the other hand, will only serve on behalf of a certain consumer group when it receives the RegisterConsumerRequest from some consumers of the group.

The coordinator keeps

In addition, it needs to check if the subscribed topic/partitions are "valid". This is actually an open question I think and will get back to this point later.

It uses a delayed scheduler extended background thread to send out heartbeat requests and rebalance requests (we will talk about the details later).

Coordinator : handleRegisterConsumer(request : RegisterConsumerRequest)

1. Checks if its host broker is the leader of the corresponding "offset" topic partition. If yes:

1.1. If request.groupId is not in the cached consumer group list yet, call this.handleNewConsumerGroup(request.groupId, request.subscribing)

1.2. If this consumer is not in the group yet call this.handleAddedConsumer(request.consumerId, request.subscribing).

2. Otherwise send back RegisterConsumerResponse directly with errorCode "Not Coordinator".

Handling New Consumer Group

Coordinator : handleNewConsumerGroup(groupId : Int, subscribing : List[TopicAndPartition])

1. Create a new group registry with zero consumer members in coordinator

2. Check the subscribing info:

2.1. If it is a wildcard topic/count subscription, then add its group to the list of groups with wildcard topic subscriptions.

2.2. Otherwise for each topic that it subscribed to, add its group to the list of groups that has subscribed to the topic.

Handling Added Consumer

Coordinator : handleAddedConsumer(consumer : ConsumerInfo)
 
1. Add the heartbeat request task for this consumer to the scheduler.
 
2. Add the consumer registry to the group registry's member consumer registry list.
 
3. Validate if its subscribed topic/partitions. If not, reject the consumer by letting the caller handleRegisterConsumer to send back RegisterConsumerResponse with errorCode "Topic/Partition Not Valid".
 
4. Validate the session timeout value; If not, reject the consumer by letting the caller handleRegisterConsumer to send back RegisterConsumerResponse with errorCode "Session timeout Not Valid".

5. Try to rebalance by calling handleConsumerGroupRebalance()

Coordinator-side Consumer Failure Detection

The coordinator keeps its own socket server and a background delayed-scheduler thread to send heartbeat requests and also requests to handle rebalances (will talk a little bit more on the latter).

The coordinator also keeps a HeartbeatPurgatory for bookkeeping ongoing consumer heartbeat requests.

CoordinatorScheduler extends DelayedScheduler : HeartbeatTask()

1. Sends the Heartbeat Request via coordinator's SocketServer

2. Set the timeout watcher in HeartbeatRequestPurgatory for the consumer

3. this.scheduleTask(HeartbeatTask) with schedule time + session_timeout time   // Add the heartbeat request task back to the queue

Handling Consumer Failure

When the heartbeat request has been expired by the purgatory, the coordinator will set the consumer as failed by calling handleConsumerFailure (we can set the number of expired heartbeat requests until when the consumer will be treated as failed).

Also if a broken pipe is thrown for the established connection, the coordinator also treats the consumer as dead.

Coordinator: handleConsumerFailure(consumer : ConsumerInfo)
 
1. Remove its heartbeat request task from the scheduler.
 
2. Close its corresponding socket from the coordinator's SocketServer.
 
3. Remove its registry from its group's member consumer registry list.
 
4. If there is an ongoing rebalance procedure by some of the rebalance handlers, let it fail directly.
 
5. If the group still has some member consumers, call handleConsumerGroupRebalance()

Consumer-side Coordinator Failure Detection

If a consumer has not heard any request from the coordinator after the timeout has elapsed, it will suspects that the coordinator is dead (again we can set a number of sesstion timeout until when the coordinator will be treated as failed):

ConsumerCoordinationThread : doWork 

1. Listen on the established socket for requests.

2. If HeartbeatRequest is received, send out HeartbeatResponse

3. If timed out without any request received, call Consumer.onCoordinatorFailure()


Consumer : onCoordinatorFailure()

1. Close the connection to the previous coordinator. 

2. Re-call registerConsumer() to discover the new coordinator.

Handling Topic/Partition Change

This should be fairly straight-forward with coordinator registered listener to Zookeeper. The only tricky part is that when the ZK wachers are fired notifying topic partition changes, the coordinator needs to decide which consumer groups are affected by this change and hence need rebalancing.

For the algorithm described below I am only considering topic-addition, partition-deletion (due to broker down) and partition-addition. If we allow topic-deletion we need to add some more logic, but still should be straight-forward.

Coordinator.handleTopicChange
 
1. Get the newly added topic (since /brokers/topics are persistent nodes, no topics should be deleted even if there is no consumers any more inside the group)
 
2. For each newly added topic:
 
2.1 Subscribe TopicPartitionChangeListener to /brokers/topics/topic
 
2.2 For groups that have wildcard interests, check whether any group is interested in the newly added topics; if yes, add the group to the interested group list of this topic
 
2.3 Get the set of groups that are interested in this topic, call handleGroupRebalance


Coordinator.handleTopicPartitionChange
 
1. Get the set of groups that are interested in this topic, and if the group's rebalance bit is not set, set it and try to rebalance the group

Handling Consumer Group Rebalance

The rebalancing logic is handled by another type of rebalance threads, and the coordinator itself will keep a pool of the rebalance threads and assign consumer group rebalance tasks to them accordingly.

The reason of doing so is that the rebalance needs two blocking round trips between the coordinator and the consumers, and hence would be better done in a asynchronous way so that they will not interfere with or block each other. However, to avoid concurrent rebalancing of the same group by two different rebalancer threads, we need to enforce assigning the task to the same handler's queue if it is already assigned to that rebalancer.

There are cases that can cause a rebalance procedure: 1) added consumer, 2) consumer failure, 3) topic change, 4) partition change. The rebalance thread will receive a case code as well as the group id from the coordinator, and use this to decide if further actions are needed if no rebalance is required.

In addition the coordinator will keep a rebalance bit for each group it manages, indicating if the group is under rebalance procedure now. This is to avoid unnecessary rebalances.

Also we will not retry to rebalance failures any more since if any failure happens they will be very likely fatal.

Coordinator : handleGroupRebalance(group : GroupInfo)

1. Check whether the group's rebalance bit is not set. If yes:

2. Set the bit and assign a rebalance task to a rebalancer's working queue (in a load balanced manner).

CoordinatorRebalancer : doWork()

1. Dequeue the next group to be rebalanced from its working queue and reset the group's bit.

2. Get the topics that are interested by the group. For each topic:

2.1. Get the number of partitions by reading from ZK

2.2. Get the number of consumers for each topic/partition from the meta information of consumers in the group it kept in memory

2.3. Compute the new ownership assignment for the topic, if no such assignment can be found then fail the task directly.

3. Check if a rebalance is necessary by trying to get the current ownership from ZK for each topic.

3.1 If there is no registered ownership info in ZK, rebalance is necessary

3.2 If some partitions are not owned by any threads, rebalance is necessary

3.3 If some partitions registered in the ownership map do not exist any longer, rebalance is necessary

3.4 If ownership map do not match with the newly computed one, rebalance is necessary

3.5 Otherwise rebalance is not necessary

4. If the rebalance is not necessary, returns immediately.

5. If rebalance is necessary, do the following:

5.1 For each consumer in the group, send the StopFetcherRequest to the SocketServer

5.2 Then wait until SocketServer has reported that all the StopFetcherResponse have been received or a timeout has expired

5.3 Commit the consumed offsets extracted from the StopFetcherResponse // this will be done as appending to local logs

5.4 For each consumer in the group, send the StartFetcherRequest to the SocketServer

5.5 Then wait until SocketServer has reported that all the StartFetcherReponse have been received or a timeout has expired

5.6 Update the new assignment of partitions to consumers in the Zookeeper // this could be just one json string in one path

6. If a timeout signal is received in either 4.2 or 4.5 from the socket server, fail the rebalance.

Another note is that when Stop/StartFetcherResponse is received, the purgatory can also clear all the expiration watchers for that consumer.

StopFetcherRequest

{
  Version                => int16
  CorrelationId          => int64
  ConsumerId			 => String
}

StopFetcherResponse

{
  Version                => int16
  CorrelationId          => int64
  ErrorCode              => int16
  ConsumerId			 => String
  ConsumedSoFar          => [<PartitionAndOffset>]
}

Consumer Handling StopFetcherRequest

Consumer : handleStopFetcherRequest

1. Clear the partition ownership information (except the current consumed offsets).

2. Send the response, attach the current consumed offsets.
 
3. Clear the corresponding expiration watcher in StopFetcherRequestPurgatory, and if all the watchers have been cleared notify the rebalance handler.

StartFetcherRequest

{
  Version                => int16
  CorrelationId          => int64
  ConsumerId			 => String
  PartitionsToOwn        => [<PartitionAndOffset>]
}

StartFetcherResponse

{
  Version                => int16
  CorrelationId          => int64
  ErrorCode              => int16
  ConsumerId			 => String
}

Consumer Handling StartFetcherRequest

Consumer : handleStartFetcherRequest
 
1. Cache the assigned partitions along with offsets

2. Send back the StartFetcherResponse

3. call onPartitionAssigned()

Option 1.c: Consumer-Initiated Heartbeat and Coordinator-Initiated Rebalance

A third approach would be to let consumer initiate heartbeat while coordinator initiate rebalance, and as a result consumers need to both read and write the socket established with the coordinator.

Then we do not need the HeartbeatRequestPurgatory and the Scheduler thread from the coordinator. And the failure detection mechanism would be changed to:

Coordinator (failed) <-- (ping) -- Consumer (ping timed out, try to reconnect to the new coordinator)

-- X -->

<-- (ping) -- Consumer (ping timed out, try to reconnect to the new coordinator)

-- X -->

----------------------------------------------------------------------------------------------

Coordinator <-- (ping) -- Consumer 1 (alive)

-- (response) -->

(Have not heard from consumer 2, rebalance) Consumer 2 (failed)

---------------------------- failure detection ----------------------------------------

Changes in Consumer Handling Coordinator Requests

In addition, the consumer's coordination thread logic need to be greatly changed:

ConsumerCoordinationThread : doWork 

1. Send HeartbeatRequest to the coordinator and put the request into the HeartbeatRequestPurgatory.

2. Listen on the established socket for requests.

2.1. If HeartbeatResponse is received, clear the expiration watcher in the purgatory.

2.2. If StopFetcherRequest is received, clear the expiration watcher in the purgatory and call handleStopFetcherRequest()

2.3. If StartFetcherRequest is received, clear the expiration watcher in the purgatory and call handleStartFetcherRequest()

3. If timed out without any request received, call Consumer.onCoordinatorFailure()

Compared with 1.b, the heartbeat direction seems more natural, however the consumer client would be more complicated.

2. Consumer-Group Assignment and Migration

As described before each broker will have a coordinator which is responsible for a set of consumer groups. Hence we need to decide which consumer groups are assigned to each coordinator in order to 1) balance load, and 2) easy to be migrated in case of coordinator failure.

Since the coordinator is responsible for the offset management of the consumer group, and since the offsets will be stored as a special topic on the broker side (details can be found here), we can make the assignment of consumer groups to coordinators tied up with the partition leader election of the offset topic. The advantage of doing so would be easy automatic reassignment of consumer groups upon coordinator failures, but it might be not optimal for load balance (think of two groups, one with 1 consumer and another with 100). Another way would be assigning consumer groups to coordinators in a load balanced, and decoupling the offset partition leader from the coordinator of the consumers.

Option 2.a: GroupId-based Coordinator Assignment

The first approach is to treat the "offset" topic just as other topics to have an initial number of partitions (of course this number of partitions can be increased later by the add-partitions tool as other topics). Thus a set of consumer group ids would be assigned to one broker's coordinator in a fixed manner.

The pros of this approach is that coordinator assignment logic is quite simple.

The cons is that the assignment is purely based on group names, and hence could be suboptimal for load balancing. Also the migration of consumer groups would be coarsen-grained this this set of consumer groups must be migrated all to the same coordinator in a coordinator failover.

Coordinator Startup and Failover

Every server will create a coordinator object as its member upon startup.

Coordinator : onStartup()
 
1. Register session expiration listener
 
2. Read all the topics from ZK

3. Initialize 1) consumer group set, 2) set of subscribed groups for each topic and 3) set of groups having wildcard subscriptions.
 
4. Register listeners for topics and their partition changes
 
5. Start up the offset manager with latest offset read from logs and cached

6. Create the threadpool of rebalancers, and heartbeat scheduler if following 1.b)

7. Create and start the SocketServer to start accepting consumer registrations.  // this step must wait until tasks like offset loading have completed

And again as we have described before the coordinators will start serving a consumer group only if it received a RegisterConsumerRequest from an unknown group's consumer. Hence the coordinator failover is done automatically and coordinators themselves are agnostic.

Offset Management

When the consumer calls "commit", a OffsetCommitRequest will be sent to the coordinator associated with the [topic/partition, offset] information. The commit function call is blocking and will only return either after it has received the response from the coordinator or timeout has reached.

The coordiantor, on the other hand, upon receiving the OffsetCommitRequest via's socket server, will call handleCommitOffset:

Coordinator : handleCommitOffset (request: CommitOffsetRequest)
 
1. Transform offsets in the request as messages and append to the log of the "offset" topic.
 
2. Wait for the replicas have confirmed the committed messages.  // for this topic we may config ack=-1

2. Update its offset cache.
 
3. Send back the CommitOffsetResponse

OffsetCommitRequest

{
  Version                => int16
  CorrelationId          => int64
  ConsumerId			 => String
  OffsetInfo             => [<PartitionAndOffset>]
    Topic                => String
    Partition            => int16
    Offset               => int64
}

OffsetCommitResponse

{
  Version                => int16
  CorrelationId          => int64
  ErrorCode              => int16
}

Option 2.b: Random Coordinator Assignment

An alternative approach is random coordinator assignment. The rationale behind this idea is load balancing: by randomly distributing consumer groups, hopefully the groups of large number of consumer will be evenly distributed to the consumers. In order to do so we would prefer to decouple the coordinator with the leader of the "offset" topic partitions.

Changes in Consumer Registration

Since now each new consumer group will be randomly assigned to a coordinator and not necessarily the one for offset management, we need coordinators to form a consensus on those decisions, and the way of doing it is via ZK.

In some more details:

We add one more path in the root as /coordinators/[consumer_group_name]
The node /coordinators/consumer_group_name stores the coordinator's broker id for the consumer group
Each coordinator will have a ZK-based elector on this path.
Each coordinator keeps a cache of the coordinators for each consumer group.

Boolean Consumer : registerConsumer()

1. Send a CoordinatorMetadataRequest to any brokers from the list to get the leader host from the response.

2. Send a RegisterConsumerRequest to the coordinator host with the specific coordinator port and wait for the response.

3. Same as before..


Coordinator : handleCoordinatorMetadataRequest(request : CoordinatorMetadataRequest)

1. If the coordinator of the consumer group is in the cache, returns directly the response.

2. Otherwise try to write to /coordinators/request.consumerGroupName

2.1. If succeed, register the session timeout listener on the path, return itself as the coordinator in CoordinatorMetadataRequest

2.2. If failed:

2.2.1 Read the path to get the coordinator information and update the coordinator metadata cache. 

2.2.2. Register data change listener on the path.

2.2.3. Return the read coordinator information in CoordinatorMetadataRequest

Then on data deletion change, all the coordinators will try to re-elect as the new coordinator of the consumer group, and update the cache accordingly.

CoordinatorMetadataRequest

{
  Version                => int16
  CorrelationId          => int64
  ConsumerId			 => String
  GroupId                => String
}

CoordinatorMetadataResponse

{
  Version                => int16
  CorrelationId          => int64
  ErrorCode              => int16
  CoordinatorInfo        => <Broker>
    Host                 => String
    Port                 => int16
}

Changes in Offset Management

Since we are not decoupling offset management from coordinator, we will keep a offset manager as described here in each broker which is parallel to the coordinator and migrates as the partition leaders.The coordinators themselves just cache the latest offset upon receiving CommitOffsetRequest.

Also upon creating the new consumer group, the coordinator would send FetchOffsetRequest to the corresponding offset manager to update its cache.

Coordinator : handleNewConsumerGroup(groupId : Int, subscribing : List[TopicAndPartition])

1. Same steps as before

2. Send FetchOffsetRequest to the corresponding offset manager and wait on the response to update its offset cache. // This can be done asynchronously


Coordinator : handleCommitOffset (request: CommitOffsetRequest)
 
1. Update its offset cache.

2. Transform offsets in the request as messages and send a ProduceRequest to the corresponding partition leader.

3. Wait for the ack before send back the CommitOffsetResponse.

The pros of this approach is better load balancing for coordinators, however it complicates logic quite a bit by decoupling the offset management with the coordinator.

Open Questions

Apart with the above proposal, we still have some open questions that worth discuss:

Should coordinator use its own socket server or use broker's socket server? Although the proposal suggests former we can think about this more.
Should we allow consumers from the same group to have different session timeout? If yes when do we reject a consumer registration because of its proposed session timeout?
Should we allow consumers from the same group have specific fixed subscribing topic partitions (G5)? If yes when do we reject a consumer registration with its subscribed topic partitions?
Should we provide tools to delete topics? If yes, how will this affect coordinator's topic change logic?
Should we do de-serialization/compression in fetcher thread or user thread? Although the proposal suggests de-serialization in user thread and de-compression in fetcher thread (or deserialization/compression all in client thread if we are following Option 1.a) we can still think about this more.
Do we allow to call subscribe multiple times during its life time? This is related to G5.a
Do we allow users to specify the offsets in commit() function call, and explicitly get last committed offset? This is related to G6.a
Would we still keep a zookeeper string in consumer properties for consumer to read the bootstrap broker list as an alternative to the broker list property?
Shall we try to avoid unnecessary rebalances on coordinator failover? In the current proposal upon coordinator failover the consumers will re-issue registration requests to the new coordinator, causing the new coordinator to trigger rebalance that is unnecessary.
Shall we restrict commit offset request to only for the client's own assigned partitions? For some use cases such as clients managing their offsets themselves and only want to call commit offset once on startup, this might be useful.
Shall we avoid request forwarding for option 2.b? One way to do that is let consumer remember two coordinators, and upon receiving stop fetching from group management coordinator, send a commit offset request to offset management coordinator and wait on response, and then respond to the stop fetching request.

Space shortcuts

Child pages

Kafka 0.9 Consumer Rewrite Design

Related

Goals (G)

Requirements (R)

Consumer API

Consumer Coordinator Overview

Design Proposal

Single-threaded Consumer

API Implementation

Multi-threaded Consumer

1. Consumer and Coordinator Communication

Option 1.a: Consumer-Initiated Heartbeat and Consumer-Reregistration-Upon-Rebalance

Option 1.b: Coordinator-Initiated Heartbeat and Coordinator-Initiated Rebalance

Consumer Registration

RegisterConsumerRequest

RegisterConsumerResponse

Handling RegisterConsumerRequest

Handling New Consumer Group

Handling Added Consumer

Coordinator-side Consumer Failure Detection

Handling Consumer Failure

Consumer-side Coordinator Failure Detection

Handling Topic/Partition Change

Handling Consumer Group Rebalance

StopFetcherRequest

StopFetcherResponse

Consumer Handling StopFetcherRequest

StartFetcherRequest

StartFetcherResponse

Consumer Handling StartFetcherRequest

Option 1.c: Consumer-Initiated Heartbeat and Coordinator-Initiated Rebalance

Changes in Consumer Handling Coordinator Requests

2. Consumer-Group Assignment and Migration

Option 2.a: GroupId-based Coordinator Assignment

Coordinator Startup and Failover

Offset Management

OffsetCommitRequest

OffsetCommitResponse

Option 2.b: Random Coordinator Assignment

Changes in Consumer Registration

CoordinatorMetadataRequest

CoordinatorMetadataResponse

Changes in Offset Management

Open Questions