
Status

Current state: Initiated

Discussion thread:

JIRA:

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Follower replicas continuously fetch data from the leader replica to catch up with it. A follower replica is considered out of sync if it lags behind the leader for more than replica.lag.time.max.ms.

A follower replica is considered out of sync:

  • If it has not sent any fetch request within replica.lag.time.max.ms. This can happen when the follower is down, stuck in GC pauses, or unable to communicate with the leader within that duration.
  • If it has not consumed up to the leader's log-end-offset within replica.lag.time.max.ms. This can happen with new replicas that have just started fetching from the leader, or with replicas that are slow to process the fetched messages.
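The criteria above can be sketched as a single lag-time check. This is a simplified model for illustration only, not Kafka's actual implementation (the real logic lives in the broker's Partition/Replica classes):

```java
public class IsrCheck {
    // A follower stays in sync only if it caught up to the leader's
    // log-end-offset at some point within the last replica.lag.time.max.ms.
    public static boolean isCaughtUp(long lastCaughtUpTimeMs, long nowMs,
                                     long replicaLagTimeMaxMs) {
        return nowMs - lastCaughtUpTimeMs <= replicaLagTimeMaxMs;
    }

    public static void main(String[] args) {
        long lagMaxMs = 10_000L;
        // Caught up 5 seconds ago: still in sync.
        System.out.println(isCaughtUp(95_000L, 100_000L, lagMaxMs)); // true
        // No catch-up for 12 seconds (down, GC pause, or slow): out of sync.
        System.out.println(isCaughtUp(88_000L, 100_000L, lagMaxMs)); // false
    }
}
```

Both failure modes reduce to the same observable condition: lastCaughtUpTime has fallen more than replica.lag.time.max.ms behind the current time.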

The replica.fetch.wait.max.ms property is the maximum time the leader waits before responding to each follower fetch request. Its value should be less than replica.lag.time.max.ms to prevent frequent shrinking of the ISR. A follower therefore does its best to avoid going out of sync by sending fetch requests frequently within replica.lag.time.max.ms.

However, we found that partitions can go offline even though follower replicas fetch from the leader as diligently as they can. The leader replica can occasionally take a long time to process fetch requests (for example, while reading logs), which pushes followers out of sync even though they sent their fetch requests within replica.lag.time.max.ms. This can even lead to offline partitions when the number of in-sync replicas drops below min.insync.replicas. We have observed this behavior, and the resulting offline partitions, multiple times in our clusters.

Proposed Changes

One way to address this issue is to add the following enhancements when deciding whether a replica is in sync:

  • A fetch request is recorded as a pending request in `ReplicaManager` before it is processed, provided the request's fetch-offset >= the leader's log-end-offset (LEO) at the time the previous fetch request was served. The replica is always considered in sync as long as it has pending requests. This guards the follower replica from going out of sync when the leader has intermittent problems processing fetch requests (especially log reads). It also keeps the partition online if the current leader is stopped, because one of the other in-sync replicas can be elected leader. A pending fetch request is removed once the request completes, whether with success or failure.
  • The replica's `lastCaughtUpTime` is updated in `Replica#updateFetchState` with the current time instead of the fetch time if the fetch request's processing time is >= replica.lag.time.max.ms * replica.read.time.insync.factor and its fetch-offset >= the leader's LEO at the time the previous fetch request was served. This is done even if the fetch request fails with an error.
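The two enhancements can be sketched as follows. The class and method names here are hypothetical and the structure is heavily simplified; in the actual proposal this state lives in `ReplicaManager` and `Replica#updateFetchState`:

```java
public class FollowerFetchState {
    private final long lagTimeMaxMs;           // replica.lag.time.max.ms
    private final double readTimeInsyncFactor; // replica.read.time.insync.factor
    private int pendingCaughtUpFetches = 0;
    private long lastCaughtUpTimeMs = 0L;

    public FollowerFetchState(long lagTimeMaxMs, double readTimeInsyncFactor) {
        this.lagTimeMaxMs = lagTimeMaxMs;
        this.readTimeInsyncFactor = readTimeInsyncFactor;
    }

    // Called before the leader starts processing a fetch request: a request
    // that had caught up to the LEO at the previous fetch counts as pending.
    public synchronized void onFetchArrived(long fetchOffset, long leoAtPreviousFetch) {
        if (fetchOffset >= leoAtPreviousFetch) pendingCaughtUpFetches++;
    }

    // Called when the fetch request completes, with success or failure.
    public synchronized void onFetchCompleted(long fetchOffset, long leoAtPreviousFetch,
                                              long fetchTimeMs, long completionTimeMs) {
        boolean caughtUp = fetchOffset >= leoAtPreviousFetch;
        if (caughtUp && pendingCaughtUpFetches > 0) pendingCaughtUpFetches--;
        if (!caughtUp) return;
        long processingMs = completionTimeMs - fetchTimeMs;
        // A slow read is not the follower's fault: credit the completion time
        // instead of the fetch time when processing took too long.
        lastCaughtUpTimeMs = (processingMs >= lagTimeMaxMs * readTimeInsyncFactor)
                ? completionTimeMs
                : fetchTimeMs;
    }

    public synchronized boolean isInSync(long nowMs) {
        return pendingCaughtUpFetches > 0
                || nowMs - lastCaughtUpTimeMs <= lagTimeMaxMs;
    }
}
```

The key design point is that a caught-up fetch keeps the replica in sync for the entire time it is pending, however long the leader takes to serve it.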

This feature can be enabled by setting the property follower.fetch.reads.insync.enable to true. It defaults to false for backward compatibility.

Example:

<Topic, partition>: <events, 0>

replica.lag.time.max.ms = 10_000

Assigned replicas: 1001, 1002, 1003; ISR: 1001, 1002, 1003; leader: 1001

Follower fetch request from broker:1002 to broker:1001

At time t = tx , log-end-offset = 1002, fetch-offset = 950

At time t = ty, log-end-offset = 1100, fetch-offset = 1002, time taken to process the request: 25 seconds

Without this feature, any in-sync replica check for events-0 after tx+10s returns false, making the replica on 1002 out of sync. The replica on 1003 may fall out of sync in a similar way.

With this feature enabled: 

While the fetch request is being processed, in-sync checks on the replica return true because there is a pending fetch request whose fetch-offset >= the LEO at the earlier fetch.

Once the fetch request is served, lastCaughtUpTime is set to ty+25s. Subsequent in-sync checks return true as long as the follower replica keeps sending fetch requests at regular intervals according to replica.lag.time.max.ms and replica.fetch.wait.max.ms.
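The timeline above can be replayed with concrete numbers. The absolute timestamps here are hypothetical (tx = 0 ms, ty = 5000 ms) and the check is a simplified illustration, not Kafka code:

```java
public class ExampleTimeline {
    // Simplified in-sync check: has the replica caught up within lagMaxMs?
    public static boolean inSync(long checkMs, long lastCaughtUpMs, long lagMaxMs) {
        return checkMs - lastCaughtUpMs <= lagMaxMs;
    }

    public static void main(String[] args) {
        long lagMaxMs = 10_000L;      // replica.lag.time.max.ms
        long ty = 5_000L;             // fetch at offset 1002 == LEO at tx
        long processingMs = 25_000L;  // leader takes 25 s to serve this fetch
        long checkMs = ty + processingMs;

        // Without the feature: lastCaughtUpTime stays at the fetch time ty,
        // so the check at ty + 25 s sees a 25 s lag and drops the follower.
        System.out.println(inSync(checkMs, ty, lagMaxMs)); // false

        // With the feature: the slow read credits lastCaughtUpTime with the
        // completion time ty + 25 s, so the same check passes.
        System.out.println(inSync(checkMs, ty + processingMs, lagMaxMs)); // true
    }
}
```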

Public Interfaces

The following two properties are added to the broker configuration.

Configuration Name | Description | Valid Values | Default Value
follower.fetch.reads.insync.enable | When enabled, a follower replica's lastCaughtUpTime is set to the current time instead of the fetch time if the leader spends a long time processing a fetch request whose fetch-offset >= the LEO at the earlier fetch. | true or false | false
replica.read.time.insync.factor | The factor applied to replica.lag.time.max.ms to decide whether lastCaughtUpTime is set to the time after the fetch request is processed instead of the time before it. | (0, 1] | 0.5
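Enabling the feature in a broker's configuration file might look like the following (the values are illustrative, and the new property names are as proposed above):

```properties
# Keep a follower in sync while its caught-up fetch request is pending,
# and credit slow reads with the completion time.
follower.fetch.reads.insync.enable=true
# Credit the completion time only when processing takes at least
# replica.lag.time.max.ms * this factor (here: 10 s * 0.5 = 5 s).
replica.read.time.insync.factor=0.5
replica.lag.time.max.ms=10000
```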


Compatibility, Deprecation, and Migration Plan

This change is backwards compatible with previous versions.

Rejected Alternatives

  • Whenever a leader partition throws errors or cannot process requests within a specific time (at least replica.lag.time.max.ms), it could relinquish its leadership and allow another in-sync replica to become leader. But this may cause ISR thrashing when there are intermittent issues in processing follower fetch requests.
  • replica.max.lag.offsets could be reintroduced to allow a follower replica to remain in sync even though its fetch requests are not being served within the expected interval.