...

A quick discussion on replica lag: Intuitively, it might seem reasonable to be more aggressive with ISR eviction. That is, we might consider letting the lag time be smaller than the session timeout. The sooner a replica is removed from the ISR, the sooner the partition may be able to accept writes again. However, there are two reasons why this is not so simple. First, when a replica is failing to keep up, it is often not clear whether the problem is on the leader or the follower. For example, it might just be that the leader is failing to work through a backlog of requests quickly enough. We have seen this many times. In this case, shrinking the ISR actually makes recovery more difficult because we are removing a potential leader from the ISR. Second, if a follower is genuinely not keeping up, then removing it from the ISR means that the broker gives up its ability to exert back-pressure on the clients through the advancement of the high watermark. If this is a persistent condition, then the lagging broker will fall further and further behind. For these reasons, we think it is smarter to be conservative about shrinking the ISR.
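
To make the back-pressure argument concrete, here is a minimal sketch of how the lag time setting interacts with the high watermark. It is a toy model, not Kafka's broker code: the `Replica` class, the `shrink_isr` and `high_watermark` helpers, and all offsets and timestamps are hypothetical and chosen only for illustration. The one property it relies on is that the high watermark is the smallest log end offset among the ISR members.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    # Hypothetical model of a replica; field names are illustrative only.
    broker_id: int
    log_end_offset: int
    last_caught_up_ms: int  # last time this replica was fully caught up


def high_watermark(isr):
    """Smallest log end offset among the current ISR members."""
    return min(r.log_end_offset for r in isr)


def shrink_isr(isr, leader_id, now_ms, max_lag_ms):
    """Keep the leader plus every follower that caught up within max_lag_ms."""
    return [
        r for r in isr
        if r.broker_id == leader_id
        or now_ms - r.last_caught_up_ms <= max_lag_ms
    ]


now_ms = 10_000
leader = Replica(broker_id=1, log_end_offset=1000, last_caught_up_ms=10_000)
healthy = Replica(broker_id=2, log_end_offset=1000, last_caught_up_ms=9_900)
lagging = Replica(broker_id=3, log_end_offset=400, last_caught_up_ms=4_000)
isr = [leader, healthy, lagging]

# Conservative lag time: the lagging follower stays in the ISR and
# holds the high watermark at its own log end offset.
conservative = shrink_isr(isr, leader_id=1, now_ms=now_ms, max_lag_ms=30_000)

# Aggressive lag time: the lagging follower is evicted, so the high
# watermark no longer waits for it.
aggressive = shrink_isr(isr, leader_id=1, now_ms=now_ms, max_lag_ms=5_000)

print(high_watermark(conservative), high_watermark(aggressive))
```

In the conservative case the high watermark is pinned by the lagging follower, so writes that wait on the high watermark slow down and the follower gets a chance to catch up. In the aggressive case the high watermark moves ahead without it, and nothing stops the gap from growing.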

...