
Status

Current state: Under Discussion

Discussion thread: here

JIRA: here

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

An important part of deploying Kafka Connect is monitoring the health of the workers in a cluster and the connectors and tasks that have been deployed to the cluster. Although producers and consumers used in Kafka Connect can be monitored, the Kafka Connect framework only has a few metrics capturing the number of connectors and tasks for each worker. To augment these existing metrics, we propose to add metrics to monitor more information about the connectors, tasks, and workers. All metrics reported by each worker are scoped by the activities within that worker.

Several things are out of scope for this proposal, though they may be addressed in future KIPs. First, this proposal expressly avoids changes to the Connect API, and therefore does not address how connector implementations can define their own connector-specific metrics. Second, Kafka Connect has no existing mechanism to aggregate the metrics reported by each worker, and this proposal does not add one.

Public Interfaces

All of the following metrics will be added via Kafka's metrics library, like most of the metrics in the Kafka brokers and other components. The context of every metric is limited to the worker that reports it, and all metrics include the worker ID in the MBean attribute (similar to how Kafka producer and consumer metrics include the client ID).

 


Connector Metrics

| Metric Name | Description | MBean attribute |
| --- | --- | --- |
| connector-type | The type of the connector, one of: source, sink | kafka.connect:type=connector-metrics,name=connector-type,connector=([-.\w]+) |
| connector-class | The name of the connector class | kafka.connect:type=connector-metrics,name=connector-class,connector=([-.\w]+) |
| connector-version | The version of the connector class, as reported by the connector in this worker | kafka.connect:type=connector-metrics,name=connector-version,connector=([-.\w]+) |
| status | The current status of the connector in this worker, one of: running, paused, stopped | kafka.connect:type=connector-metrics,name=status,connector=([-.\w]+) |
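The MBean names above follow a fixed pattern, so a monitoring script can split them into their parts. The following is a minimal sketch of that idea, not part of the proposal: the function name `parse_mbean`, the combined pattern with an optional `task` part, and the connector name `my-jdbc-source` are all made up for illustration.

```python
import re

# Pattern adapted from the MBean column: kafka.connect:type=...,name=...,connector=...[,task=...]
# Making the task part optional is an assumption, so that one pattern covers
# both connector-level and task-level metrics.
MBEAN_PATTERN = re.compile(
    r"kafka\.connect:type=(?P<type>[-\w]+),"
    r"name=(?P<name>[-.\w]+),"
    r"connector=(?P<connector>[-.\w]+)"
    r"(?:,task=(?P<task>\d+))?$"
)


def parse_mbean(mbean):
    """Return the parts of a Connect MBean name as a dict, or None if it does not match."""
    match = MBEAN_PATTERN.match(mbean)
    if match is None:
        return None
    parts = match.groupdict()
    if parts["task"] is not None:
        parts["task"] = int(parts["task"])
    return parts


# "my-jdbc-source" is a hypothetical connector name used only for illustration.
print(parse_mbean("kafka.connect:type=connector-metrics,name=status,connector=my-jdbc-source"))
```

For connector-level metrics the `task` entry is `None`; for task-level metrics it is the task number as an integer.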

Common Task Metrics

| Metric Name | Description | MBean attribute |
| --- | --- | --- |
| status | The current status of this task, one of: unassigned, running, paused, failed, destroyed | kafka.connect:type=task-metrics,name=status,connector=([-.\w]+),task=([\d]+) |
| pause-ratio | The fraction of time this task has spent in the paused state | kafka.connect:type=task-metrics,name=pause-ratio,connector=([-.\w]+),task=([\d]+) |
| offset-commit-success-percentage | The average percentage of this task's offset commit attempts that succeeded | kafka.connect:type=task-metrics,name=offset-commit-success-percentage,connector=([-.\w]+),task=([\d]+) |
| offset-commit-failure-percentage | The average percentage of this task's offset commit attempts that failed or had an error | kafka.connect:type=task-metrics,name=offset-commit-failure-percentage,connector=([-.\w]+),task=([\d]+) |
| offset-commit-avg-time | The average time taken by this task to commit offsets | kafka.connect:type=task-metrics,name=offset-commit-avg-time,connector=([-.\w]+),task=([\d]+) |
| offset-commit-max-time | The maximum time taken by this task to commit offsets | kafka.connect:type=task-metrics,name=offset-commit-max-time,connector=([-.\w]+),task=([\d]+) |
| batch-size-max | The maximum size of the batches processed by the connector | kafka.connect:type=task-metrics,name=batch-size-max,connector=([-.\w]+),task=([\d]+) |
| batch-size-avg | The average size of the batches processed by the connector | kafka.connect:type=task-metrics,name=batch-size-avg,connector=([-.\w]+),task=([\d]+) |
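To make the ratio and percentage metrics concrete, here is a small sketch of how such values could be derived from raw counts. It is an illustration under assumptions, not the proposed implementation: the function names are hypothetical, the 0 to 100 scale for percentages is assumed (the proposal does not state the scale), and reporting zero when there is no data is likewise an assumption.

```python
def commit_percentages(successes, failures):
    """Return (success %, failure %) of offset commit attempts, on an assumed 0-100 scale."""
    attempts = successes + failures
    if attempts == 0:
        # No attempts yet: report zero for both rather than dividing by zero (an assumption).
        return 0.0, 0.0
    return 100.0 * successes / attempts, 100.0 * failures / attempts


def pause_ratio(paused_ms, elapsed_ms):
    """Return the fraction of elapsed time that was spent in the paused state."""
    return paused_ms / elapsed_ms if elapsed_ms > 0 else 0.0


# Example: 9 successful and 1 failed commit; paused for 250 ms out of 1000 ms.
print(commit_percentages(9, 1))
print(pause_ratio(250, 1000))
```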

Source Task Metrics

 

| Metric Name | Description | MBean attribute |
| --- | --- | --- |
| source-record-poll-rate | The average per-second number of records produced/polled by this task belonging to the named source connector in this worker. This is before transformations are applied. | kafka.connect:type=source-task-metrics,name=source-record-poll-rate,connector=([-.\w]+),task=([\d]+) |
| source-record-write-rate | The average per-second number of records output from the transformations and written to Kafka for this task belonging to the named source connector in this worker. This is after transformations are applied. | kafka.connect:type=source-task-metrics,name=source-record-write-rate,connector=([-.\w]+),task=([\d]+) |

 

Sink Task Metrics

| Metric Name | Description | MBean attribute |
| --- | --- | --- |
| sink-record-read-rate | The average per-second number of records read from Kafka for this task belonging to the named sink connector in this worker. This is before transformations are applied. | kafka.connect:type=sink-task-metrics,name=sink-record-read-rate,connector=([-.\w]+),task=([\d]+) |
| sink-record-send-rate | The average per-second number of records output from the transformations and sent to this task belonging to the named sink connector in this worker. This is after transformations are applied and excludes any records filtered out by the transformations. | kafka.connect:type=sink-task-metrics,name=sink-record-send-rate,connector=([-.\w]+),task=([\d]+) |
| sink-record-lag-max | The maximum lag, in number of records, for any partition in this window | kafka.connect:type=sink-task-metrics,name=sink-record-lag-max,connector=([-.\w]+),task=([\d]+) |
| sink-record-{topic}-{partition}.records-lag | The latest lag, in number of records, by which the offset commits trail the consumer for the topic partition | kafka.connect:type=sink-task-metrics,name=sink-record-{topic}-{partition}-lag,connector=([-.\w]+),task=([\d]+) |
| sink-record-{topic}-{partition}.records-lag-avg | The average lag, in number of records, by which the offset commits trail the consumer for the topic partition | kafka.connect:type=sink-task-metrics,name=sink-record-{topic}-{partition}-lag-avg,connector=([-.\w]+),task=([\d]+) |
| sink-record-{topic}-{partition}.records-lag-max | The maximum lag, in number of records, by which the offset commits trail the consumer for the topic partition | kafka.connect:type=sink-task-metrics,name=sink-record-{topic}-{partition}-lag-max,connector=([-.\w]+),task=([\d]+) |
| partition-count | The number of topic partitions assigned to this task belonging to the named sink connector in this worker | kafka.connect:type=sink-task-metrics,name=partition-count,connector=([-.\w]+),task=([\d]+) |
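One way to read the lag metrics is as the gap between the consumer's position and the last committed offset for each topic partition, with sink-record-lag-max taken as the largest such gap. The sketch below assumes that reading. The function names and the sample topic `orders` are hypothetical, and clamping negative gaps to zero is an assumption.

```python
def partition_lags(consumer_positions, committed_offsets):
    """Map each (topic, partition) to the number of records its committed offset trails the consumer."""
    return {
        tp: max(0, position - committed_offsets[tp])
        for tp, position in consumer_positions.items()
    }


def max_lag(lags):
    """Return the largest lag across all partitions, or 0 when no partitions are assigned."""
    return max(lags.values(), default=0)


# Hypothetical offsets for two partitions of a made-up topic named "orders".
positions = {("orders", 0): 120, ("orders", 1): 75}
committed = {("orders", 0): 100, ("orders", 1): 75}
lags = partition_lags(positions, committed)
print(lags, max_lag(lags))
```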

Worker Metrics

 

| Metric Name | Description | MBean attribute |
| --- | --- | --- |
| assigned-tasks | The number of tasks run in this worker (existing metric) | kafka.connect:type=connect-coordinator-metrics,name=assigned-tasks |
| assigned-connectors | The number of connectors run in this worker (existing metric) | kafka.connect:type=connect-coordinator-metrics,name=assigned-connectors |
| task-count | The number of tasks run in this worker | kafka.connect:type=connect-worker-metrics,name=task-count |
| connector-count | The number of connectors run in this worker | kafka.connect:type=connect-worker-metrics,name=connector-count |
| leader-name | The name of the group leader | kafka.connect:type=connect-worker-metrics,name=leader-name |
| state | The state of this worker, one of: rebalancing, running | kafka.connect:type=connect-worker-metrics,name=state |


Worker Rebalance Metrics

| Metric Name | Description | MBean attribute |
| --- | --- | --- |
| rebalance-success-total | The total number of this worker's successful rebalances | kafka.connect:type=connect-worker-rebalance-metrics,name=rebalance-success-total |
| rebalance-success-percentage | The average percentage of this worker's rebalances that succeeded | kafka.connect:type=connect-worker-rebalance-metrics,name=rebalance-success-percentage |
| rebalance-failure-total | The total number of this worker's failed rebalances | kafka.connect:type=connect-worker-rebalance-metrics,name=rebalance-failure-total |
| rebalance-failure-percentage | The average percentage of this worker's rebalances that failed | kafka.connect:type=connect-worker-rebalance-metrics,name=rebalance-failure-percentage |
| rebalance-max-time | The maximum time spent by this worker to rebalance | kafka.connect:type=connect-worker-rebalance-metrics,name=rebalance-max-time |
| rebalance-99p-time | The 99th percentile time spent by this worker to rebalance during the last window (defaults to an hour) | kafka.connect:type=connect-worker-rebalance-metrics,name=rebalance-99p-time |
| rebalance-95p-time | The 95th percentile time spent by this worker to rebalance during the last window (defaults to an hour) | kafka.connect:type=connect-worker-rebalance-metrics,name=rebalance-95p-time |
| rebalance-90p-time | The 90th percentile time spent by this worker to rebalance during the last window (defaults to an hour) | kafka.connect:type=connect-worker-rebalance-metrics,name=rebalance-90p-time |
| rebalance-75p-time | The 75th percentile time spent by this worker to rebalance during the last window (defaults to an hour) | kafka.connect:type=connect-worker-rebalance-metrics,name=rebalance-75p-time |
| time-since-last-rebalance | The time since the most recent rebalance in this worker | kafka.connect:type=connect-worker-rebalance-metrics,name=time-since-last-rebalance |
| task-failure-rate | The number of tasks that failed in this worker | kafka.connect:type=connect-worker-rebalance-metrics,name=task-failure-rate |
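The proposal does not say how the percentile metrics are computed. As an illustration only, the sketch below uses the nearest-rank method over the rebalance durations recorded in one window; the function name `percentile` and the sample durations are hypothetical, and the actual metrics library may well use a different technique.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it."""
    if not samples:
        raise ValueError("no samples in window")
    ordered = sorted(samples)
    # Rank is 1-based; clamp to 1 so that very small p still selects the minimum.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical rebalance durations (in milliseconds) observed during one window.
durations_ms = [40, 10, 30, 20]
for p in (75, 90, 95, 99):
    print(p, percentile(durations_ms, p))
```

Nearest-rank always returns a value that was actually observed, which makes the result easy to check by hand.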


Worker REST Metrics

| Metric Name | Description | MBean attribute |
| --- | --- | --- |
| request-rate | The average per-second number of requests handled by the REST endpoints in this worker | kafka.connect:type=worker-rest-metrics,name=request-rate |

 

Proposed Changes

We will add the relevant metrics as specified in the Public Interfaces section, except for the two existing metrics that will be left unmodified.

Compatibility, Deprecation, and Migration Plan

The two existing metrics, assigned-tasks and assigned-connectors, will not be changed.

Rejected Alternatives

None


