Current state: Under Discussion

Discussion thread: here

JIRA: here

Motivation

The 'segment.bytes' config of the Connect cluster's internal topic 'offset.storage.topic' currently follows the broker's default configuration unless it is overridden. If that default is kept or set larger, and the Connect cluster runs many complex tasks (for example, replication involving a large number of topics and partitions), the log volume of 'offset.storage.topic' can become very large, which slows down the restart of the Connect workers. The actual impact can be seen in the JIRA linked above.

After investigation, the reason is that each worker must start a consumer at startup to read the data of 'offset.storage.topic'. Although this topic is configured for compaction, the active segment of a partition cannot be cleaned. If the segment size is set to a large value, such as the default of 1GB, the topic may hold tens of gigabytes of data (with the default 25 partitions) that cannot be compacted and has to be read from the earliest offset. This consumes a lot of time and leaves the worker unable to start and execute tasks for a long time.
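The arithmetic behind this estimate can be sketched as follows. This is a rough upper bound only, assuming each partition holds at most one full, uncleanable active segment; the function name is hypothetical, and real volumes also depend on how much data the log cleaner has left uncompacted in older segments.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def worst_case_uncompacted_bytes(partitions: int, segment_bytes: int) -> int:
    """Upper bound on data held in active (uncleanable) segments.

    Hypothetical helper: assumes one full active segment per partition.
    """
    return partitions * segment_bytes


# Default 25 partitions with the 1GB segment size mentioned above.
default_case = worst_case_uncompacted_bytes(25, 1 * GIB)

# Same partition count with a 50MB segment size.
smaller_case = worst_case_uncompacted_bytes(25, 50 * MIB)

print(f"1GB segments:  {default_case / GIB:.2f} GiB")
print(f"50MB segments: {smaller_case / GIB:.2f} GiB")
```

Under these assumptions, reducing the segment size shrinks the amount of data a restarting worker may have to read by the same factor as the segment size itself.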

Therefore, I propose to extract the 'segment.bytes' setting for 'offset.storage.topic' into a separate config, just like "offsets.topic.segment.bytes" and "transaction.state.log.segment.bytes". The size is set by the user, and if there is no explicit setting, a default value such as 50MB is used. This avoids interference from the Kafka broker configuration. As for "config.storage.topic" and "status.storage.topic", they rarely receive a large amount of data in practical use, so similar measures may not be necessary.

Public Interfaces

Add a config in connect-distributed.properties:

...