Status

Current state: Approved

Discussion thread: here

JIRA: here

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

An important part of deploying Kafka Connect is monitoring the health of the workers in a cluster and the connectors and tasks that have been deployed to the cluster. Although producers and consumers used in Kafka Connect can be monitored, the Kafka Connect framework only has a few metrics capturing the number of connectors and tasks for each worker. To augment these existing metrics, we propose to add metrics to monitor more information about the connectors, tasks, and workers. All metrics reported by each worker are scoped by the activities within that worker.

Several things are out of scope for this proposal, though they may be addressed in future KIPs. First, this proposal expressly avoids changes to the Connect API, and therefore does not address how connector implementations can define their own connector-specific metrics. Second, it does not address aggregating the metrics reported across multiple workers, since Kafka Connect has no existing mechanism for doing so.

Public Interfaces

All of the following metrics will be added via Kafka's metrics library, like most of the metrics in the Kafka brokers and other components. Every metric is defined as an attribute on the specified MBean and is measured within the context of the single worker that reports it. All metrics defined below are at the INFO recording level.


Connector Metrics

MBean name: kafka.connect:type=connector-metrics,connector=([-.\w]+)

Metric/Attribute Name | Description
connector-type | The type of the connector, one of: source, sink
connector-class | The name of the connector class
connector-version | The version of the connector class, as reported by the connector in this worker
status | The current status of the connector in this worker, one of: running, paused, stopped
status-running | Signals whether the connector is in the running state
status-paused | Signals whether the connector is in the paused state
status-stopped | Signals whether the connector is in the stopped state
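
As a rough illustration of how a monitoring tool might address these MBeans, the sketch below builds the connector MBean name and a wildcard pattern with the standard javax.management.ObjectName class. The class ConnectorMBeanNames and the connector name my-connector are hypothetical and not part of Kafka Connect; the lookup at the end assumes the code runs inside a worker JVM and simply finds nothing elsewhere.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper for illustration only; not part of Kafka Connect.
class ConnectorMBeanNames {

    // Builds the MBean name for one connector, following the name pattern above.
    static ObjectName forConnector(String connector) throws MalformedObjectNameException {
        return new ObjectName("kafka.connect:type=connector-metrics,connector=" + connector);
    }

    // A JMX pattern that matches the connector-metrics MBean of every connector.
    static ObjectName allConnectors() throws MalformedObjectNameException {
        return new ObjectName("kafka.connect:type=connector-metrics,connector=*");
    }

    public static void main(String[] args) throws Exception {
        ObjectName name = forConnector("my-connector"); // assumed connector name
        System.out.println(name.getKeyProperty("connector")); // prints my-connector
        System.out.println(allConnectors().apply(name));      // prints true

        // In a JVM that is not running a Connect worker this set is empty.
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        Set<ObjectName> found = server.queryNames(allConnectors(), null);
        for (ObjectName n : found) {
            System.out.println(n.getKeyProperty("connector")
                    + " status=" + server.getAttribute(n, "status"));
        }
    }
}
```

A remote monitoring tool would do the same through a JMX connector rather than the platform MBean server.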

Common Task Metrics

MBean name: kafka.connect:type=task-metrics,connector=([-.\w]+),task=([-.\w]+)

Metric/Attribute Name | Description
status | The current status of this task, one of: unassigned, running, paused, failed, destroyed
status-unassigned | Signals whether the task is in the unassigned state
status-running | Signals whether the task is in the running state
status-paused | Signals whether the task is in the paused state
status-failed | Signals whether the task is in the failed state
status-destroyed | Signals whether the task is in the destroyed state
pause-ratio | The fraction of time this task has spent in the paused state
offset-commit-success-percentage | The average percentage of this task's offset commit attempts that succeeded
offset-commit-failure-percentage | The average percentage of this task's offset commit attempts that failed or had an error
offset-commit-max-time-ms | The maximum time in milliseconds taken by this task to commit offsets
offset-commit-99p-time-ms | The 99th percentile time in milliseconds spent by this task to commit offsets to Kafka
offset-commit-95p-time-ms | The 95th percentile time in milliseconds spent by this task to commit offsets to Kafka
offset-commit-90p-time-ms | The 90th percentile time in milliseconds spent by this task to commit offsets to Kafka
offset-commit-75p-time-ms | The 75th percentile time in milliseconds spent by this task to commit offsets to Kafka
offset-commit-50p-time-ms | The 50th percentile (median) time in milliseconds spent by this task to commit offsets to Kafka
batch-size-max | The maximum size of the batches processed by the connector
batch-size-avg | The average size of the batches processed by the connector

Source Task Metrics

MBean name: kafka.connect:type=source-task-metrics,connector=([-.\w]+),task=([\d]+)

Metric/Attribute Name | Description
source-record-poll-rate | The average per-second number of records produced/polled (before transformation) by this task belonging to the named source connector in this worker.
source-record-poll-count | The number of records produced/polled (before transformation) by this task belonging to the named source connector in this worker, since the task was last restarted.
source-record-write-rate | The average per-second number of records output from the transformations and written to Kafka for this task belonging to the named source connector in this worker. This is after transformations are applied and excludes any records filtered out by the transformations.
source-record-write-count | The number of records output from the transformations and written to Kafka for this task belonging to the named source connector in this worker, since the task was last restarted.
poll-batch-max-time-ms | The maximum time in milliseconds taken by this task to poll for a batch of source records
poll-batch-99p-time-ms | The 99th percentile time in milliseconds spent by this task to poll for a batch of source records
poll-batch-95p-time-ms | The 95th percentile time in milliseconds spent by this task to poll for a batch of source records
poll-batch-90p-time-ms | The 90th percentile time in milliseconds spent by this task to poll for a batch of source records
poll-batch-75p-time-ms | The 75th percentile time in milliseconds spent by this task to poll for a batch of source records
poll-batch-50p-time-ms | The 50th percentile (median) time in milliseconds spent by this task to poll for a batch of source records
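
Because the poll metrics are measured before transformations and the write metrics after them, comparing the two counts gives a rough view of how many records a task's transformations drop. The sketch below is one possible way to do that arithmetic; the class SourceTaskThroughput and the sample values are made up for illustration, and the difference may also include records that were polled but not yet written.

```java
// Hypothetical helper for illustration only; not part of Kafka Connect.
class SourceTaskThroughput {

    // Records polled but not (yet) written: those filtered out by
    // transformations plus any that are still in flight.
    static long notWritten(long pollCount, long writeCount) {
        return pollCount - writeCount;
    }

    // Fraction of polled records that reached Kafka; 1.0 when nothing was polled.
    static double writeFraction(long pollCount, long writeCount) {
        return pollCount == 0 ? 1.0 : (double) writeCount / pollCount;
    }

    public static void main(String[] args) {
        long pollCount = 1_000; // assumed value of source-record-poll-count
        long writeCount = 950;  // assumed value of source-record-write-count
        System.out.println(notWritten(pollCount, writeCount));    // prints 50
        System.out.println(writeFraction(pollCount, writeCount)); // prints 0.95
    }
}
```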

 

Sink Task Metrics

MBean name: kafka.connect:type=sink-task-metrics,connector=([-.\w]+),task=([\d]+)

Metric/Attribute Name | Description
sink-record-read-rate | The average per-second number of records read from Kafka (before transformations are applied) for this task belonging to the named sink connector in this worker.
sink-record-read-count | The number of records read from Kafka (before transformations are applied) for this task belonging to the named sink connector in this worker, since the task was last restarted.
sink-record-send-rate | The average per-second number of records output from the transformations and sent to this task belonging to the named sink connector in this worker. This is after transformations are applied and excludes any records filtered out by the transformations.
sink-record-send-count | The number of records output from the transformations and sent to this task belonging to the named sink connector in this worker, since the task was last restarted.
sink-record-lag-max | The maximum number of records by which the committed offsets lag behind the consumer, across all topic partitions.
partition-count | The number of topic partitions assigned to this task belonging to the named sink connector in this worker.
offset-commit-seq-no | The current sequence number for offset commits
offset-commit-completion-rate | The average per-second number of offset commit completions that were completed successfully
offset-commit-completion-skip-rate | The average per-second number of offset commit completions that were received too late and skipped/ignored
put-batch-max-time-ms | The maximum time in milliseconds taken by this task to put a batch of sink records
put-batch-99p-time-ms | The 99th percentile time in milliseconds spent by this task to put a batch of sink records
put-batch-95p-time-ms | The 95th percentile time in milliseconds spent by this task to put a batch of sink records
put-batch-90p-time-ms | The 90th percentile time in milliseconds spent by this task to put a batch of sink records
put-batch-75p-time-ms | The 75th percentile time in milliseconds spent by this task to put a batch of sink records
put-batch-50p-time-ms | The 50th percentile (median) time in milliseconds spent by this task to put a batch of sink records
flush-max-time-ms | The maximum time in milliseconds taken by this sink task to pre-commit/flush
flush-99p-time-ms | The 99th percentile time in milliseconds spent by this sink task to pre-commit/flush
flush-95p-time-ms | The 95th percentile time in milliseconds spent by this sink task to pre-commit/flush
flush-90p-time-ms | The 90th percentile time in milliseconds spent by this sink task to pre-commit/flush
flush-75p-time-ms | The 75th percentile time in milliseconds spent by this sink task to pre-commit/flush
flush-50p-time-ms | The 50th percentile (median) time in milliseconds spent by this sink task to pre-commit/flush


MBean name: kafka.connect:type=sink-task-metrics,connector=([-.\w]+),task=([\d]+),topic=([-.\w]+),partition=([\d]+)

Metric/Attribute Name | Description
sink-record-lag | The latest number of records by which the committed offsets lag behind the consumer for the topic partition.
sink-record-lag-avg | The average number of records by which the committed offsets lag behind the consumer for the topic partition.
sink-record-lag-max | The maximum number of records by which the committed offsets lag behind the consumer for the topic partition.

Worker Metrics

MBean name: kafka.connect:type=connect-worker-metrics

Metric/Attribute Name | Description
task-count | The number of tasks run in this worker
connector-count | The number of connectors run in this worker
leader-name | The name of the group leader
epoch | The epoch or generation number of this worker
status | The state of this worker, one of: rebalancing, running
status-rebalancing | Signals whether the worker is in the rebalancing state
status-running | Signals whether the worker is in the running state
rest-request-rate | The average per-second number of requests handled by the REST endpoints in this worker


Worker Rebalance Metrics

MBean name: kafka.connect:type=connect-worker-rebalance-metrics

Metric/Attribute Name | Description
rebalance-success-total | The total number of this worker's successful rebalances
rebalance-success-percentage | The average percentage of this worker's rebalances that succeeded
rebalance-failure-total | The total number of this worker's failed rebalances
rebalance-failure-percentage | The average percentage of this worker's rebalances that failed
rebalance-max-time-ms | The maximum time in milliseconds spent by this worker to rebalance
rebalance-99p-time-ms | The 99th percentile time in milliseconds spent by this worker to rebalance during the last window (defaults to an hour)
rebalance-95p-time-ms | The 95th percentile time in milliseconds spent by this worker to rebalance during the last window (defaults to an hour)
rebalance-90p-time-ms | The 90th percentile time in milliseconds spent by this worker to rebalance during the last window (defaults to an hour)
rebalance-75p-time-ms | The 75th percentile time in milliseconds spent by this worker to rebalance during the last window (defaults to an hour)
rebalance-50p-time-ms | The 50th percentile (median) time in milliseconds spent by this worker to rebalance during the last window (defaults to an hour)
time-ms-since-last-rebalance | The time in milliseconds since the most recent rebalance in this worker
task-failure-rate | The number of tasks that failed in this worker

Configuration

The distributed and standalone worker configuration files will support the following properties. These exactly match the producer and consumer configurations of the same name. (The first three are already in the distributed worker configuration.)

Configuration Field | Type | Default | Importance | Description
metrics.sample.window.ms | long | 30000 | low | The window of time in milliseconds a metrics sample is computed over. Must be a non-negative number.
metrics.num.samples | int | 2 | low | The number of samples maintained to compute metrics. Must be a positive number.
metric.reporters | string | "" | low | A list of classes to use as metrics reporters. Implementing the MetricsReporter interface allows plugging in classes that will be notified of new metric creation. The JmxReporter is always included to register JMX statistics.
metrics.recording.level | string | "INFO" | low | The highest recording level for metrics. Must be either "INFO" or "DEBUG".
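
To make the table concrete, the sketch below loads a worker properties fragment that spells out the four defaults and derives the approximate time span covered by the retained samples (window length times number of samples). The class WorkerMetricsConfig is hypothetical, and treating that product as the covered span is a simplifying assumption rather than a statement about how the metrics library computes each statistic.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of Kafka Connect.
class WorkerMetricsConfig {

    // Parses worker configuration text in java.util.Properties format.
    static Properties load(String text) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(text));
        return props;
    }

    // Approximate span covered by the retained samples, in milliseconds.
    // Falls back to the defaults from the table when a property is absent.
    static long coveredMs(Properties props) {
        long windowMs = Long.parseLong(props.getProperty("metrics.sample.window.ms", "30000"));
        int samples = Integer.parseInt(props.getProperty("metrics.num.samples", "2"));
        return windowMs * samples;
    }

    public static void main(String[] args) throws IOException {
        String text = String.join("\n",
                "metrics.sample.window.ms=30000",
                "metrics.num.samples=2",
                "metric.reporters=",
                "metrics.recording.level=INFO");
        Properties props = load(text);
        System.out.println(props.getProperty("metrics.recording.level")); // prints INFO
        System.out.println(coveredMs(props));                             // prints 60000
    }
}
```

Since all four properties have defaults, an existing worker configuration file that omits them behaves the same as this fragment.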

 

Proposed Changes

We will add the relevant metrics and worker configuration properties as specified in the Public Interfaces section.

Compatibility, Deprecation, and Migration Plan

Existing Connect coordinator metrics will not be changed.

The metrics.sample.window.ms, metrics.num.samples, and metric.reporters configurations already exist in the distributed worker; these will also be added to the standalone worker. The metrics.recording.level configuration will be added to both the distributed and standalone worker configurations. All four of these configurations have sensible default values, so users do not need to add or override them in their existing configuration files.

Rejected Alternatives

None

 
