Flink defines a few standard metrics for jobs, tasks and operators, and it supports custom metrics in various scenarios. However, there is so far no standard or conventional metric definition for connectors. Each connector currently defines its own metrics, which complicates operation and monitoring. Admittedly, different connectors may need different metrics, but some commonly used metrics can probably be standardized. This FLIP proposes a set of standard connector metrics that each connector should emit if applicable. The metrics proposed in this FLIP will serve as a convention for connector implementations.

In the future, other projects in the Flink ecosystem may rely on this metric convention. Therefore, connector implementations are expected to follow the convention when reporting metrics.

Public Interfaces

We propose to introduce a set of conventional / standard metrics for the connectors.

...

Name	Type	Unit	Description
numBytesIn	Counter	Bytes	The total number of input bytes since the source started
numBytesInPerSec	Meter	Bytes/Sec	The input bytes per second
numRecordsIn	Counter	Records	The total number of input records since the source started
numRecordsInPerSec	Meter	Records/Sec	The input records per second
numRecordsInErrors	Counter	Records	The total number of records that failed to be consumed
recordSize*	Histogram	Bytes	The size of a record.
currentFetchLatency	Gauge	ms	The latency that occurred before Flink fetched the record. This metric differs from fetchLatency in that it is an instantaneous value recorded for the last processed record. It is provided because a latency histogram could be expensive, and the instantaneous value is usually a good enough indication of the latency. fetchLatency = FetchTime - EventTime
currentLatency	Gauge	ms	The latency that occurred before the record was emitted by the source connector. This metric differs from latency in that it is an instantaneous value recorded for the last processed record. It is provided because a latency histogram could be expensive, and the instantaneous value is usually a good enough indication of the latency. latency = EmitTime - EventTime
fetchLatency*	Histogram	ms	The latency that occurred before Flink fetched the record. fetchLatency = FetchTime - EventTime
latency*	Histogram	ms	The latency that occurred before the record was emitted by the source connector. latency = EmitTime - EventTime
idleTime	Gauge	ms	The time in milliseconds for which the source has not processed any record. idleTime = CurrentTime - LastRecordProcessTime
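
The formulas in the table above can be illustrated with a small sketch. This is plain Java rather than Flink code: `SourceGaugeSketch`, its method names and the `UNKNOWN` sentinel are hypothetical, and treating the emit time as LastRecordProcessTime is an assumption. In a real connector the getters would typically back gauges registered on the operator's metric group.

```java
/** Hypothetical helper, not a Flink class: tracks the instantaneous source gauges. */
final class SourceGaugeSketch {
    // Sentinel meaning "no record has been processed yet" (an assumption of this sketch).
    static final long UNKNOWN = -1L;

    private long currentFetchLatency = UNKNOWN;
    private long currentLatency = UNKNOWN;
    private long lastRecordProcessTime = UNKNOWN;
    private long numRecordsIn = 0L;

    /** Called once per record; all timestamps are epoch milliseconds. */
    void onRecord(long eventTime, long fetchTime, long emitTime) {
        currentFetchLatency = fetchTime - eventTime; // fetchLatency = FetchTime - EventTime
        currentLatency = emitTime - eventTime;       // latency = EmitTime - EventTime
        lastRecordProcessTime = emitTime;            // assumption: processing ends at emit
        numRecordsIn++;
    }

    long getCurrentFetchLatency() {
        return currentFetchLatency;
    }

    long getCurrentLatency() {
        return currentLatency;
    }

    long getNumRecordsIn() {
        return numRecordsIn;
    }

    /** idleTime = CurrentTime - LastRecordProcessTime; UNKNOWN before the first record. */
    long getIdleTime(long currentTime) {
        return lastRecordProcessTime == UNKNOWN ? UNKNOWN : currentTime - lastRecordProcessTime;
    }
}
```

Only the last record's values are kept, which is what makes the gauges cheap compared with the histogram variants.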

...

Name	Type	Unit	Description
numBytesOut	Counter	Bytes	The total number of output bytes since the sink started
numBytesOutPerSec	Meter	Bytes/Sec	The output bytes per second
numRecordsOut	Counter	Records	The total number of output records since the sink started
numRecordsOutPerSec	Meter	Records/Sec	The output records per second
numRecordsOutErrors	Counter	Records	The total number of records that failed to be sent
recordSize*	Histogram	Bytes	The size of a record
currentSendTime	Gauge	ms	The time it takes to send the last record.
sendTime*	Histogram	ms	The time it takes to send a record
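
As a rough illustration of how a sink might maintain the non-histogram metrics in the table above, consider the sketch below. It is plain Java, not Flink code: `SinkMetricsSketch` and its method names are hypothetical, and the choice not to count failed records in numRecordsOut is an assumption that the proposal does not spell out.

```java
/** Hypothetical helper, not a Flink class: tracks the non-histogram sink metrics. */
final class SinkMetricsSketch {
    private long numRecordsOut = 0L;
    private long numBytesOut = 0L;
    private long numRecordsOutErrors = 0L;
    private long currentSendTime = -1L; // -1 until the first record has been sent

    /** Records a successful send; both timestamps are in milliseconds. */
    void onSendSuccess(long sizeInBytes, long sendStartTime, long sendEndTime) {
        numRecordsOut++;
        numBytesOut += sizeInBytes;
        currentSendTime = sendEndTime - sendStartTime; // time taken by the last record
    }

    /** Records a failed send; assumption: failures do not count as output records. */
    void onSendFailure() {
        numRecordsOutErrors++;
    }

    long getNumRecordsOut() {
        return numRecordsOut;
    }

    long getNumBytesOut() {
        return numBytesOut;
    }

    long getNumRecordsOutErrors() {
        return numRecordsOutErrors;
    }

    long getCurrentSendTime() {
        return currentSendTime;
    }
}
```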

...

A connector implementation does not have to report all of these metrics, but connectors that do report them should conform to this convention.
Histogram metrics are usually very expensive. Because of their performance impact, it is strongly recommended that connectors do not report them by default and instead give users the option to enable them on demand.
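
One way to read the opt-in recommendation is that histogram state should not even be allocated unless the user asks for it. The sketch below shows that shape in plain Java. Everything in it is hypothetical: the proposal does not name a configuration key, so `connector.metrics.histograms.enabled` is made up, and a simple list stands in for a real histogram implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical helper, not a Flink class: histograms are off unless explicitly enabled. */
final class HistogramOptInSketch {
    // Hypothetical option key; the proposal does not define a name for this switch.
    static final String ENABLE_HISTOGRAMS_KEY = "connector.metrics.histograms.enabled";

    // Stand-in for a real histogram; stays null when histograms are disabled.
    private final List<Long> recordSizes;

    HistogramOptInSketch(Properties config) {
        boolean enabled =
                Boolean.parseBoolean(config.getProperty(ENABLE_HISTOGRAMS_KEY, "false"));
        this.recordSizes = enabled ? new ArrayList<>() : null;
    }

    boolean histogramsEnabled() {
        return recordSizes != null;
    }

    /** Per-record hook: does nothing when the user has not opted in. */
    void onRecord(long sizeInBytes) {
        if (recordSizes != null) {
            recordSizes.add(sizeInBytes);
        }
    }

    int sampleCount() {
        return recordSizes == null ? 0 : recordSizes.size();
    }
}
```

With the default configuration the per-record path reduces to a single null check, so the cheap counters and gauges remain the only cost.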

Scope

The metric group for each source and sink is the same as the ordinary operator scope, which defaults to <host>.taskmanager.<tm_id>.<job_name>.<operator_name>.<subtask_index>
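
To make the default scope concrete, the sketch below substitutes example values into the format string and appends a metric name. It is plain Java for illustration only: `ScopeFormatSketch` is hypothetical, the variable values in the usage note are invented, and Flink's own scope handling (for example character filtering) is not modeled.

```java
import java.util.Map;

/** Hypothetical helper, not a Flink class: resolves a scope format string. */
final class ScopeFormatSketch {
    // The default operator scope format quoted in the text.
    static final String DEFAULT_OPERATOR_SCOPE =
            "<host>.taskmanager.<tm_id>.<job_name>.<operator_name>.<subtask_index>";

    /** Replaces each <name> placeholder with its value; unknown placeholders are kept. */
    static String resolve(String format, Map<String, String> variables) {
        String result = format;
        for (Map.Entry<String, String> e : variables.entrySet()) {
            result = result.replace("<" + e.getKey() + ">", e.getValue());
        }
        return result;
    }

    /** Joins a resolved scope and a metric name with the '.' delimiter. */
    static String metricIdentifier(String scope, String metricName) {
        return scope + "." + metricName;
    }
}
```

For instance, with the invented values host = worker-1, tm_id = tm-7, job_name = orders, operator_name = KafkaSource and subtask_index = 0, the numRecordsIn counter would be identified as worker-1.taskmanager.tm-7.orders.KafkaSource.0.numRecordsIn.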

...
