...
Status
Current state: Accepted
Discussion thread: here
JIRA: KAFKA-5746
...
ApiKey | Scope of error | Request:Errors Mapping |
---|---|---|
UpdateMetadata | request | 1:1 |
ControlledShutdown | request | 1:1 |
FindCoordinator | request | 1:1 |
JoinGroup | request | 1:1 |
Heartbeat | request | 1:1 |
LeaveGroup | request | 1:1 |
SyncGroup | request | 1:1 |
ListGroups | request | 1:1 |
SaslHandshake | request | 1:1 |
ApiVersions | request | 1:1 |
InitProducerId | request | 1:1 |
AddOffsetsToTxn | request | 1:1 |
EndTxn | request | 1:1 |
DescribeAcls | request | 1:1 |
Produce | partition | 1:n |
Fetch | partition | 1:n |
Offsets | partition | 1:n |
OffsetCommit | partition | 1:n |
OffsetFetch | partition | 1:n |
DeleteRecords | partition | 1:n |
OffsetForLeaderEpoch | partition | 1:n |
AddPartitionsToTxn | partition | 1:n |
WriteTxnMarkers | partition | 1:n |
TxnOffsetCommit | partition | 1:n |
LeaderAndIsr | partition + request | 1:n |
StopReplica | partition + request | 1:n |
Metadata | topic | 1:n |
CreateTopics | topic | 1:n |
DeleteTopics | topic | 1:n |
DescribeGroups | group | 1:n |
CreateAcls | acl | 1:n |
DeleteAcls | acl | 1:n |
DescribeConfigs | resource | 1:n |
AlterConfigs | resource | 1:n |
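To make the two mapping styles in the table concrete, here is a minimal Java sketch. It assumes that every scoped entry in a response (the single request-level error for 1:1 APIs, or each partition, topic, group, acl or resource entry for 1:n APIs) contributes one count for its error name. The class `ErrorCounts` and the method `count` are hypothetical names used only for illustration and are not part of the broker code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ErrorCounts {
    // Hypothetical helper: tallies how often each error name occurs in one response.
    // Assumption: each scoped entry (request, partition, topic, ...) adds one count.
    static Map<String, Integer> count(List<String> errors) {
        Map<String, Integer> counts = new HashMap<>();
        for (String error : errors) {
            counts.merge(error, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // 1:1 mapping (e.g. Heartbeat): one request carries exactly one error.
        System.out.println(count(Collections.singletonList("NONE")));

        // 1:n mapping (e.g. Produce): one request carries one error per partition.
        System.out.println(count(Arrays.asList("NONE", "NONE", "NOT_LEADER_FOR_PARTITION")));
    }
}
```

Under this assumption a single Produce request touching three partitions can add three counts, whereas a Heartbeat request always adds one.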
...
Message conversion rate and time
Down conversions are expensive since the whole response has to be read into memory for conversion. It will be useful to monitor the rate of down conversion and the time spent on conversions.
Fetch and produce message conversion rates will be meters in the same group as existing topic metrics such as TotalFetchRequestsPerSec.
MBean: kafka.server:type=BrokerTopicMetrics,name=FetchMessageConversionsPerSec,topic=([-.\w]+)
MBean: kafka.server:type=BrokerTopicMetrics,name=ProduceMessageConversionsPerSec,topic=([-.\w]+)
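The topic part of these MBean names is given as a regular expression. As a rough illustration, a monitoring tool could use that same expression to pull the topic out of a concrete MBean name. The class `ConversionMetricNames` and the method `topicOf` below are hypothetical, and the sketch only handles the fetch conversion metric.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConversionMetricNames {
    // Pattern copied from the FetchMessageConversionsPerSec MBean line;
    // the capture group is the topic name.
    private static final Pattern FETCH_CONVERSIONS = Pattern.compile(
        "kafka\\.server:type=BrokerTopicMetrics,name=FetchMessageConversionsPerSec,topic=([-.\\w]+)");

    // Hypothetical helper: returns the topic if the name is a per-topic
    // fetch conversion meter, or empty if the name does not match.
    static Optional<String> topicOf(String mbeanName) {
        Matcher matcher = FETCH_CONVERSIONS.matcher(mbeanName);
        return matcher.matches() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String name = "kafka.server:type=BrokerTopicMetrics,"
            + "name=FetchMessageConversionsPerSec,topic=my-topic.v1";
        System.out.println(topicOf(name).orElse("<no match>"));
    }
}
```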
It will also be useful to know the time taken for down conversions. The fetch down conversion time metric will be a histogram alongside the other request time metrics. This time will also be included in request logs so that clients requiring expensive down conversions can be identified. Conversion time will also be added for produce requests.
MBean: kafka.network:type=RequestMetrics,name=MessageConversionsTimeMs,request={Produce|Fetch}
Request size and temporary memory size
Large messages can cause GC issues in the broker, especially if down conversions are required. Maximum message batch size can be configured per topic to control this, but that is the size after compression. Since the batches are decompressed to validate produce requests and for fetch down conversion, it will be useful to have metrics for produce message batch size.
...
MBean: kafka.network:type=RequestMetrics,name=RequestBytes,request=<apiKey>
MBean: kafka.network:type=RequestMetrics,name=TemporaryMemoryBytes,request=<apiKey>
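Because these MBeans are keyed by request type, a JMX client can address one API key directly or use a property-value wildcard to select all of them. The sketch below uses only the standard `javax.management.ObjectName` class. The class `RequestMetricNames` and its helpers are hypothetical, and the metric name is passed in as a parameter rather than assumed.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

public class RequestMetricNames {
    // Hypothetical helper: the name of one request metric for one API key.
    static ObjectName forRequest(String metric, String apiKey) {
        return parse("kafka.network:type=RequestMetrics,name=" + metric + ",request=" + apiKey);
    }

    // Hypothetical helper: a pattern that matches the metric for every API key.
    static ObjectName forAllRequests(String metric) {
        return parse("kafka.network:type=RequestMetrics,name=" + metric + ",request=*");
    }

    private static ObjectName parse(String name) {
        try {
            return new ObjectName(name);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Invalid MBean name: " + name, e);
        }
    }

    public static void main(String[] args) {
        ObjectName pattern = forAllRequests("RequestBytes");
        ObjectName produce = forRequest("RequestBytes", "Produce");
        // ObjectName.apply tests whether a (possibly pattern) name matches another name.
        System.out.println(pattern.apply(produce));
    }
}
```

A pattern built this way could be passed to `MBeanServerConnection.queryNames` to list the per-request MBeans a broker actually exposes.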
Authentication success and failure rates
...
- successful-authentication-rate
- failed-authentication-rate
ZooKeeper status and latency
It will be good to monitor latency of ZooKeeper requests so that any issues with ZooKeeper communication can be detected early.
...
MBean: kafka.server:type=ZooKeeperClientMetrics,name=ZooKeeperRequestLatencyMs
It will also be useful to see the current status of the broker's connection to ZooKeeper.
This will be a String gauge in the existing group SessionExpireListener, which currently shows the rate of each state (e.g. DisconnectsPerSec).
MBean: kafka.server:type=SessionExpireListener,name=SessionState
State will be one of Disconnected|SyncConnected|AuthFailed|ConnectedReadOnly|SaslAuthenticated|Expired
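Since the gauge value is a plain string, a consumer of this metric would likely want to map it onto a closed set of states before acting on it. The following is a minimal sketch: the class `ZkSessionState`, its local enum and both helpers are hypothetical, and which states count as needing attention is an assumption made for illustration, not something this proposal defines.

```java
import java.util.Optional;

public class ZkSessionState {
    // Local copy of the state names listed above, kept here so the sketch
    // has no ZooKeeper dependency.
    enum State { Disconnected, SyncConnected, AuthFailed, ConnectedReadOnly, SaslAuthenticated, Expired }

    // Hypothetical helper: parses the gauge value, returning empty for unknown strings.
    static Optional<State> parse(String gaugeValue) {
        try {
            return Optional.of(State.valueOf(gaugeValue.trim()));
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
    }

    // Assumption for illustration only: treat these three states as alert-worthy.
    static boolean needsAttention(State state) {
        return state == State.Disconnected || state == State.AuthFailed || state == State.Expired;
    }

    public static void main(String[] args) {
        parse("Expired").ifPresent(s -> System.out.println(s + " needs attention: " + needsAttention(s)));
    }
}
```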
Client-side metrics
Client versions
...