...

JIRA: KAFKA-14768

Motivation

Sometimes,

...

When an application tries to reduce max.block.ms to shorten the blocking time, it cannot set the value below the time needed to fetch metadata. Moreover, the metadata fetch is a heavy operation that takes a long time.

Take our project as an example: the metadata fetch takes about 4 seconds to complete, so we cannot set max.block.ms to any value below 4000 ms.

After analyzing the issue, we found the root cause: the configured max.block.ms is shared by the "metadata fetch" operation and the "append record" operation. The following table gives the details:

where to block

when it is blocked

how long it will be blocked?

...

org.apache.kafka.clients.producer.KafkaProducer#waitOnMetadata

...

The first request, which needs to load the metadata from Kafka

...

<max.block.ms

...

org.apache.kafka.clients.producer.internals.RecordAccumulator#append

...

At peak business time, when the network cannot send messages quickly enough

...

<max.block.ms
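The shared budget described in the table above can be illustrated with a minimal sketch. It assumes that the time spent waiting on metadata is subtracted from max.block.ms before the append step runs. The class and method names are hypothetical and are not the producer's actual code.

```java
// Hypothetical model of a single max.block.ms budget shared by two blocking steps.
final class SharedBlockBudget {

    // Time left for the append step once the metadata wait has used part of the budget.
    static long remainingForAppend(long maxBlockMs, long waitedOnMetadataMs) {
        return Math.max(0L, maxBlockMs - waitedOnMetadataMs);
    }

    // The first send can only get past the metadata step if the fetch fits in the budget.
    static boolean metadataFetchFits(long maxBlockMs, long metadataFetchMs) {
        return metadataFetchMs <= maxBlockMs;
    }

    public static void main(String[] args) {
        long metadataFetchMs = 4000L; // the figure quoted for the project in the motivation
        for (long maxBlockMs : new long[] {1000L, 4000L, 6000L}) {
            System.out.printf(
                "max.block.ms=%d metadataFetchFits=%b remainingForAppend=%d%n",
                maxBlockMs,
                metadataFetchFits(maxBlockMs, metadataFetchMs),
                remainingForAppend(maxBlockMs, metadataFetchMs));
        }
    }
}
```

Under this model, lowering the budget below the metadata fetch time makes the first send fail, even though later sends would only need the budget for the append step.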

Moreover, the metadata fetch only needs to be done once in KafkaProducer#send. After the first fetch completes, the metadata is read directly from the cache, and its periodic update happens only on the network thread rather than on the user's thread.
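The "fetch once, then read from cache" behaviour can be pictured with a simplified, single-threaded sketch. All names are hypothetical, the partition count is made up, and the real producer refreshes its cache on the network thread, which this model leaves out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model: only the first lookup for a topic pays the remote fetch cost.
final class MetadataCacheSketch {
    private final Map<String, Integer> partitionCountByTopic = new HashMap<>();
    private int remoteFetches = 0;

    // Stand-in for a slow metadata request to the cluster (assumption, not Kafka code).
    private int fetchFromCluster(String topic) {
        remoteFetches++;
        return 3; // made-up partition count, for illustration only
    }

    // Returns the cached value when present; otherwise fetches once and caches it.
    int partitionCount(String topic) {
        Integer cached = partitionCountByTopic.get(topic);
        if (cached != null) {
            return cached;
        }
        int fetched = fetchFromCluster(topic);
        partitionCountByTopic.put(topic, fetched);
        return fetched;
    }

    int remoteFetches() {
        return remoteFetches;
    }

    public static void main(String[] args) {
        MetadataCacheSketch cache = new MetadataCacheSketch();
        cache.partitionCount("orders"); // first call: goes to the cluster
        cache.partitionCount("orders"); // second call: served from the cache
        System.out.println("remote fetches: " + cache.remoteFetches());
    }
}
```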

...

we found that the users' functional interactions took a long time. Eventually we figured out the root cause: after we finish deploying or restarting the servers, the first message delivered by the Kafka client on each application server takes a long time.


After analyzing the source code for the first send, we found that the time is spent fetching metadata before the send. Later sends do not take as long because the metadata is cached. The metadata fetching logic is correct and necessary. Even so, we want to improve the experience for the first message send, that is, the user's first interaction.


So, this KIP proposes a solution to improve it.

Public Interfaces

No public interface is changed. Only the internal implementation of a private method changes:

...

For the changes, see the example PR: https://github.com/apache/kafka/pull/1333513320/files

Add two configurations, with small related code changes, that control the timeout in KafkaProducer#send.

...