Status
Current state: Initiated
JIRA:
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Follower replicas fetch data from leader replica continuously to catch up with the leader. Follower replica is considered to be out of sync if it lags behind the leader replica for more than replica.lag.time.max.ms.
Follower replica is considered out of sync
- If it has not sent any fetch requests within the configured value of replica.lag.time.max.ms. This can happen when the follower is down or it is stuck due to GC pauses or it is not able to communicate to the leader replica within that duration.
- If it has not consumed up to the leader’s log-end-offset within replica.lag.time.max.ms. This can happen with new replicas that start fetching messages from leader or replicas which are kind of slow in processing the fetched messages.
replica.fetch.wait.max.ms property represents wait time for each fetch request to be processed by the leader. This property value should be less than the replica.lag.time.max.ms to prevent frequent shrinking of ISR. So, follower always tries its best to avoid going out of sync by sending frequent fetch requests within replica.lag.time.max.ms.
But we found an issue of partitions going offline even though follower replicas try their best to fetch from the leader replica. Sometimes, the leader replica may take time to process the fetch requests while reading logs, it makes follower replicas out of sync even though followers send fetch requests within replica.lag.time.max.ms duration. It may even lead to offline partitions when in-sync replicas go below min.insync.replicas count. We observed this behavior multiple times in our clusters and making partitions offline.
Proposed Changes
One way to address this issue is to have the below constraints to consider replicas insync
- Fetch requests are added as pending requests before reading the logs if the fetch request’s message-offset >= leader’s LEO when the earlier fetch request is served. This replica is always considered to be insync as long as there are pending requests. This constraint will guard the follower replica not to go out of sync when a leader has some intermittent issues in processing fetch requests(especially read logs). This will make the partition online even if the current leader is stopped as one of the other insync replicas can be chosen as a leader.
- Replica’s `lastCaughtUptime` is updated with the current time instead of the fetch time in `Replica#updateFetchState` if the fetch request processing time is >= replica.lag.time.max.ms * replica.read.time.insync.factor and fetch-offset >= leader’s LEO when the earlier fetch request is served
This feature can be enabled by setting the property “follower.fetch.reads.insync.enable” to true. The default value will be set as false to give backward compatibility.
Public Interfaces
'follower.fetch.reads.insync.enable'
This property allows setting fetch time of a follower replica as the current time instead of fetch time if it spends more time in processing the fetch requests which have fetch-offset >= LEO at the earlier fetch request. The default value is false.
'replica.read.time.insync.factor'
This is the factor that will be taken into account to set lastCaughtupTime of a replica as the time after the request is processed instead of the time before the fetch request is processed. The default value is 0.5.
Compatibility, Deprecation, and Migration Plan
This change is backwards compatible with previous versions.
Rejected Alternatives
- Whenever leader partition throws errors or not able to process the requests within a specific time (atleast replica.lag.time.max.ms) then it should relinquish its leadership and allow another in-sync replica as the leader. But this may cause ISR thrashing when there are intermittent issues in processing the follower fetch requests.
- replica.max.lag.offsets can be reintroduced that will allow the follower replica to be insync even though its fetch requests are not being served within the expected interval.