...
Status
Current state: Under Discussion
Discussion thread: here
JIRA: If the idea is approved, I will create the Jira ticket.
Motivation
The main motivation is to have a clear metric (regardless of the OS) that shows when produce requests stop being "async". In a normal situation, a produce request is written to disk via the lib -> syscall -> etc. path and, as we know, ends up in a memory page (a dirty page from now on).
...
the new volumes streamed the data from the other brokers, generating a lot of new data to be written to disk. This, plus compactions and produce requests, reached the hard limit on dirty pages, forcing the OS to switch to synchronous writes and drastically degrading the produce requests.
As the new broker finishes copying partitions, it becomes their leader again and starts receiving produce requests, while still being under pressure from the data being streamed.
In our case we were able to address this issue by tuning the OS resources and the OS dirty page settings, but it would have been great to have a metric to monitor when produce requests get close to becoming "sync".
...
Public Interfaces
Monitoring
Proposed Changes
...
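```scala
import java.util.concurrent.TimeUnit

import com.yammer.metrics.core.Timer
import org.apache.kafka.server.metrics.KafkaMetricsGroup

object SegmentAppendStats {
  private val metricsGroup = new KafkaMetricsGroup(SegmentAppendStats.getClass)

  // Tracks the rate (per second) and the duration (in ms) of log segment appends.
  val SegmentAppendTimer: Timer =
    metricsGroup.newTimer("SegmentAppendRateAndTimeMs", TimeUnit.MILLISECONDS, TimeUnit.SECONDS)
}
```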
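To illustrate what such a timer would measure, here is a minimal, dependency-free sketch. The `timed` helper and the temporary file standing in for a log segment are assumptions made for illustration only; they are not part of Kafka. In the broker the elapsed time would presumably be recorded on `SegmentAppendTimer` instead of being printed.

```scala
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.{Files, StandardOpenOption}
import java.util.concurrent.TimeUnit

object AppendTimingSketch {
  // Hypothetical helper: runs `block`, reports the elapsed milliseconds to
  // `record` (even if `block` throws), and returns the block's result.
  def timed[T](record: Long => Unit)(block: => T): T = {
    val start = System.nanoTime()
    try block
    finally record(TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start))
  }

  def main(args: Array[String]): Unit = {
    // A temp file stands in for a log segment (assumption for the sketch).
    val path = Files.createTempFile("segment-append-sketch", ".log")
    val channel = FileChannel.open(path, StandardOpenOption.WRITE)
    try {
      val written = timed(ms => println(s"append took $ms ms")) {
        // Normally this only dirties a page; under dirty-page pressure the
        // same call can block until the OS has flushed enough pages.
        channel.write(ByteBuffer.wrap(Array.fill[Byte](1024)(1.toByte)))
      }
      println(s"wrote $written bytes")
    } finally {
      channel.close()
      Files.deleteIfExists(path)
    }
  }
}
```

Timing the write call itself, rather than the whole produce request, keeps the measurement close to the point where the OS would start blocking.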
Compatibility, Deprecation, and Migration Plan
- I need confirmation on whether tracking this metric could have a performance impact (thanks in advance).
Test Plan
If the KIP is accepted, I can easily test the scenario by producing records and checking the new metric before and after the writes switch from async to sync.
I can experiment with the dirty_ratio and dirty_background_ratio values.
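When running such a test it helps to record which dirty page settings were in effect. The sketch below assumes a Linux host that exposes them under `/proc/sys/vm`; the `DirtyPageSettings` object and `readSetting` helper are hypothetical names used only for this example.

```scala
import java.nio.file.{Files, Path, Paths}
import scala.util.Try

object DirtyPageSettings {
  // Reads a single integer value from a sysctl-style file.
  // Returns None if the file is missing, unreadable, or not an integer.
  def readSetting(path: Path): Option[Int] =
    Try(new String(Files.readAllBytes(path), "UTF-8").trim.toInt).toOption

  def main(args: Array[String]): Unit = {
    // Assumption: Linux procfs layout; on other systems both values print as unavailable.
    val dir = Paths.get("/proc/sys/vm")
    for (name <- Seq("dirty_ratio", "dirty_background_ratio")) {
      val shown = readSetting(dir.resolve(name)).map(_.toString).getOrElse("unavailable")
      println(s"vm.$name = $shown")
    }
  }
}
```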
Rejected Alternatives
The best alternative IMHO would be to get the information before "the disaster happens", so at the OS level we can check nr_dirty and nr_dirty_threshold.
nr_dirty is the current number of dirty pages and nr_dirty_threshold is the limit at which the OS blocks writes to pages until some are flushed.
Tracking the relation between the two could give us a hint that we are getting close to the limit, so we can add more resources or tune the OS settings.
This is possible as an "in-house" metric but not for Kafka itself, as Kafka runs in the JVM and could be running on any OS.
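As a rough illustration of what such an in-house check could look like, the sketch below assumes a Linux host where `/proc/vmstat` exposes the `nr_dirty` and `nr_dirty_threshold` counters. The `DirtyPagePressure` object and its helpers are hypothetical names and are not part of Kafka.

```scala
import java.nio.file.{Files, Paths}
import scala.io.Source

object DirtyPagePressure {
  // Parses "name value" lines (the /proc/vmstat layout) into a map,
  // silently skipping lines that do not match that shape.
  def parseVmstat(lines: Iterator[String]): Map[String, Long] =
    lines.map(_.trim.split("\\s+")).collect {
      case Array(name, value) if value.nonEmpty && value.forall(_.isDigit) =>
        name -> value.toLong
    }.toMap

  // Fraction of the hard dirty-page limit currently in use.
  // None if either counter is missing or the threshold is zero.
  def dirtyRatio(stats: Map[String, Long]): Option[Double] =
    for {
      dirty <- stats.get("nr_dirty")
      threshold <- stats.get("nr_dirty_threshold") if threshold > 0
    } yield dirty.toDouble / threshold

  def main(args: Array[String]): Unit = {
    val vmstat = Paths.get("/proc/vmstat")
    if (!Files.isReadable(vmstat)) {
      println("/proc/vmstat is not available on this system")
    } else {
      val source = Source.fromFile(vmstat.toFile)
      val stats = try parseVmstat(source.getLines()) finally source.close()
      dirtyRatio(stats) match {
        case Some(ratio) => println(f"dirty pages at ${ratio * 100}%.1f%% of the hard limit")
        case None        => println("nr_dirty / nr_dirty_threshold not found")
      }
    }
  }
}
```

A ratio approaching 1.0 would indicate that writes are about to be throttled by the OS; where to alert is a tuning decision this sketch leaves open.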