Status

Current state: Accepted

Discussion thread: here

JIRA: KAFKA-2063

Released: 0.10.1.0

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Currently, the only way for a client to limit fetch response size is the per-partition response limit max_bytes, taken from the config setting max.partition.fetch.bytes.

So the maximum amount of memory the client can consume is max.partition.fetch.bytes * num_partitions, where num_partitions is the total number of partitions currently being fetched by the consumer.

This leads to the following problems:

Since num_partitions can be quite big (several thousand), the memory required for fetch responses can be several GB.
max.partition.fetch.bytes cannot be set arbitrarily low, since it must be greater than the maximum message size for the fetch request to work.
Memory usage is not easily predictable: it depends on consumer lag.

This KIP proposes to introduce a new version of the fetch request with a new top-level parameter max_bytes to limit the size of the fetch response and solve the above problems.

In particular, if consumer issues N parallel fetch requests, the memory consumption will not exceed N * max_bytes.

Actually, it will be min(N * max_bytes, max.partition.fetch.bytes * num_partitions), since the per-partition limit is still respected.
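The bound above can be checked with a small worked example. This is only a sketch: the function name and the numbers are hypothetical, and it ignores the case where a single message is larger than max_bytes.

```python
def max_fetch_memory(parallel_requests, max_bytes,
                     max_partition_fetch_bytes, num_partitions):
    """Upper bound (in bytes) on memory held by outstanding fetch responses.

    Implements min(N * max_bytes, max.partition.fetch.bytes * num_partitions)
    from the text; oversized single messages are not modelled.
    """
    global_bound = parallel_requests * max_bytes
    per_partition_bound = max_partition_fetch_bytes * num_partitions
    return min(global_bound, per_partition_bound)


MB = 1024 * 1024

# Hypothetical consumer: 3 parallel requests, 50 MB global limit,
# 1 MB per-partition limit, 5000 partitions.
with_global_limit = max_fetch_memory(3, 50 * MB, 1 * MB, 5000)
print(with_global_limit // MB)  # 150, versus 5000 with the per-partition limit alone
```

With only a handful of partitions the per-partition term becomes the smaller one, which is why the bound is a minimum of the two.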

Public Interfaces

This KIP introduces:

New fetch request (v.3) with a global response size limit
New client-side config parameter fetch.max.bytes - the client's fetch response size limit
New replication config parameter replica.fetch.response.max.bytes - limit used by the replication thread
New inter-broker protocol version "0.10.1-IV0" - starting from this version brokers will use fetch request v.3 for replication

Proposed Changes

Proposed changes are quite straightforward. We introduce FetchRequest v.3 with new top level parameter max_bytes:

Fetch Request (Version: 3) => replica_id max_wait_time min_bytes max_bytes [topics]
  replica_id => INT32 
  max_wait_time => INT32
  min_bytes => INT32
  max_bytes => INT32
  topics => topic [partitions]
    topic => STRING
    partitions => partition fetch_offset max_bytes
      partition => INT32
      fetch_offset => INT64
      max_bytes => INT32

Fetch Response v.3 will remain the same as v.2.
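To make the v.3 layout concrete, here is a minimal sketch that serializes the request body shown above. It assumes the usual Kafka wire conventions (big-endian integers, STRING as an INT16 length followed by UTF-8 bytes, arrays prefixed by an INT32 count) and leaves out the request header entirely. The function names and the sample values are hypothetical.

```python
import struct


def encode_string(value):
    """STRING: INT16 length followed by UTF-8 bytes (assumed wire convention)."""
    data = value.encode("utf-8")
    return struct.pack(">h", len(data)) + data


def encode_fetch_request_v3(replica_id, max_wait_time, min_bytes, max_bytes, topics):
    """Serialize the FetchRequest v.3 body (no request header).

    topics: list of (topic_name, [(partition, fetch_offset, partition_max_bytes), ...])
    """
    # Top-level fields; max_bytes is the new global response limit.
    out = struct.pack(">iiii", replica_id, max_wait_time, min_bytes, max_bytes)
    out += struct.pack(">i", len(topics))  # array length prefix
    for name, partitions in topics:
        out += encode_string(name)
        out += struct.pack(">i", len(partitions))
        for partition, fetch_offset, partition_max_bytes in partitions:
            # INT32 partition, INT64 fetch_offset, INT32 per-partition max_bytes
            out += struct.pack(">iqi", partition, fetch_offset, partition_max_bytes)
    return out


# Hypothetical request: one topic, one partition, 50 MB global limit.
body = encode_fetch_request_v3(
    replica_id=-1, max_wait_time=500, min_bytes=1, max_bytes=50 * 1024 * 1024,
    topics=[("events", [(0, 42, 1024 * 1024)])],
)
print(len(body))  # 48
```

The only difference from a v.2 body in this sketch is the extra INT32 max_bytes after min_bytes.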

The new fetch request processes partitions in the order they appear in the request.

For each partition, the server fetches up to the corresponding partition limit max_bytes, but no more than the remaining response limit submitted by the client.

...

For the first partition, the server always fetches at least one message. An empty response will be returned for all partitions that didn't fit into the response limit.

This algorithm provides following guarantees:

FetchRequest always makes progress - if the server has message(s), then at least one message is returned irrespective of max_bytes
FetchRequest response size will not be bigger than max(max_bytes, size of the first message in first partition)
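The algorithm and its guarantees can be illustrated with a small simulation. This is a sketch of a literal reading of the description, not the broker's implementation: serve_fetch and its input format are hypothetical, messages are represented only by their sizes, and "first partition" is taken to mean the first partition listed in the request.

```python
def serve_fetch(max_bytes, partitions):
    """Simulate filling a fetch response under a global limit.

    partitions: ordered list of (name, partition_max_bytes, [available message sizes])
    Returns {name: [sizes of the messages placed in the response]}.
    """
    remaining = max_bytes
    response = {}
    for index, (name, partition_max_bytes, available) in enumerate(partitions):
        # Per-partition limit, capped by what is left of the global limit.
        budget = min(partition_max_bytes, max(remaining, 0))
        taken, used = [], 0
        for size in available:
            if used + size <= budget:
                taken.append(size)
                used += size
            elif index == 0 and not taken:
                # First partition: always return at least one message,
                # even if it exceeds the limits.
                taken.append(size)
                used += size
                break
            else:
                break
        remaining -= used
        response[name] = taken
    return response


normal = serve_fetch(100, [("A", 60, [50, 50]), ("B", 60, [30, 30]), ("C", 60, [10])])
oversized = serve_fetch(10, [("A", 60, [50]), ("B", 60, [5])])
print(normal)     # {'A': [50], 'B': [30], 'C': [10]}
print(oversized)  # {'A': [50], 'B': []}
```

In the second call the response is 50 bytes, which equals max(max_bytes, size of the first message in the first partition), and partition B gets an empty response.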

Since the new fetch request processes partitions in order and stops fetching data when the response limit is hit, the client should use some kind of partition shuffling to ensure fairness.

...

In this scenario client won't get any messages from C and D until it catches up with A and B.

The solution is either to reorder partitions in the fetch request in round-robin fashion, so that the next request continues fetching from the first empty partition received, or to perform a random shuffle of partitions before each request.

Round-robin shuffling seems to be more "fair" and predictable so we decided to deploy it at ReplicaFetcherThread and in Consumer Java API.
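The round-robin reordering can be sketched as a list rotation. The helper below is hypothetical and is not the code used in ReplicaFetcherThread or the Java consumer. It also does not distinguish a partition that was cut off by the response limit from one that simply has no new data.

```python
def reorder_round_robin(request_order, response):
    """Rotate the partition list so the first partition that came back
    empty leads the next fetch request.

    request_order: partitions in the order they were sent
    response: {partition: [messages]} as returned for that request
    """
    for i, partition in enumerate(request_order):
        if not response.get(partition):
            return request_order[i:] + request_order[:i]
    # Every partition returned data; keep the order unchanged in this sketch.
    return list(request_order)


# Scenario from the text: A and B fill the response, C and D come back empty.
next_order = reorder_round_robin(
    ["A", "B", "C", "D"],
    {"A": ["m1"], "B": ["m2"], "C": [], "D": []},
)
print(next_order)  # ['C', 'D', 'A', 'B']
```

Because the rotation is deterministic, each partition reaches the front of the request after a bounded number of fetches, which is the predictability argument made above.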

Compatibility, Deprecation, and Migration Plan

The new fetch request is designed to work properly even if the top-level max_bytes is less than the message size. This way we can ignore custom per-partition maximums, since they are mostly set to accommodate a custom message size.

Old fetch requests should be processed on server exactly as before this KIP.

...

We decided to establish the following defaults:

fetch.max.bytes = 50MB

replica.fetch.response.max.bytes = 10MB

Rejected Alternatives

Some discussed/rejected alternatives:

Together with the addition of a global response limit, deprecate per-partition limits. Rejected since the per-partition limit can be useful for Kafka Streams (see mailing list discussion).
Do random partition shuffling on server side. Pros: ensure fairness without client-side modifications. Cons: non-deterministic behaviour on server side; round-robin can be easily implemented on client side.
