...

We propose to introduce a set of conventional / standard metrics for the connectors.

It is important to note that:

  • A connector implementation does not have to report all the defined metrics. However, if a connector reports a metric with the same semantics as one defined below, the implementation should follow the convention.
  • The following metric convention is not a complete list. More conventional metrics will be added over time.
  • Histogram metrics are usually very expensive. Due to their performance impact, we intentionally excluded them from this FLIP. Please see the Future Work section for more details.

Source Metrics

Name

Type

Unit

Description

numBytesIn

Counter

Bytes

The total number of input bytes since the source started

numBytesInPerSec

Meter

Bytes/Sec

The input bytes per second

numRecordsIn

Counter

Records

(Existing operator metric) The total number of input records since the source started

numRecordsInPerSec

Meter

Records/Sec

(Existing operator metric) The input records per second

numRecordsInErrors

Counter

Records

The total number of records that failed to be consumed

recordSize*

Histogram

Bytes

The size of a record.

currentFetchLatency

Gauge

ms

The latency that occurred before Flink fetched the record.

This metric differs from fetchLatency in that it is an instantaneous value recorded for the last processed record.

This metric is provided because a latency histogram can be expensive. The instantaneous latency value is usually a good enough indication of the latency.

fetchLatency = FetchTime - EventTime

currentLatency

Gauge

ms

The latency that occurred before the record is emitted by the source connector.

This metric differs from latency in that it is an instantaneous value recorded for the last processed record.

This metric is provided because a latency histogram can be expensive. The instantaneous latency value is usually a good enough indication of the latency.

latency = EmitTime - EventTime

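fetchLatency*

Histogram

ms

The latency that occurred before Flink fetched the record.

fetchLatency = FetchTime - EventTime

latency*

Histogram

ms

The latency that occurred before the record is emitted by the source connector.

latency = EmitTime - EventTime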

idleTime

Gauge

ms

The time in milliseconds that the source has not processed any record.

idleTime = CurrentTime - LastRecordProcessTime
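
The three formulas in the table above (fetchLatency, latency, idleTime) can be illustrated with a small sketch. This is plain Java with no Flink dependency; the class `SourceLatencyTracker` and its method names are hypothetical and are not part of this proposal. Returning -1 before any record has been processed is also an assumption, since the convention does not define that case.

```java
/**
 * Hypothetical helper (not part of the FLIP): keeps the instantaneous
 * values a source could expose as currentFetchLatency, currentLatency
 * and idleTime gauges. All timestamps are in milliseconds.
 */
final class SourceLatencyTracker {
    // Assumption: -1 means "no record seen yet"; the convention does not define this case.
    private static final long UNDEFINED = -1L;

    private volatile long currentFetchLatency = UNDEFINED;
    private volatile long currentLatency = UNDEFINED;
    private volatile long lastRecordProcessTime = UNDEFINED;

    /** fetchLatency = FetchTime - EventTime, kept for the last fetched record only. */
    void recordFetched(long eventTimeMs, long fetchTimeMs) {
        currentFetchLatency = fetchTimeMs - eventTimeMs;
    }

    /** latency = EmitTime - EventTime, kept for the last emitted record only. */
    void recordEmitted(long eventTimeMs, long emitTimeMs) {
        currentLatency = emitTimeMs - eventTimeMs;
        lastRecordProcessTime = emitTimeMs;
    }

    long currentFetchLatency() {
        return currentFetchLatency;
    }

    long currentLatency() {
        return currentLatency;
    }

    /** idleTime = CurrentTime - LastRecordProcessTime. */
    long idleTime(long currentTimeMs) {
        long last = lastRecordProcessTime;
        return last == UNDEFINED ? UNDEFINED : currentTimeMs - last;
    }
}
```

In a real connector these getters would typically back gauges registered on the operator's metric group; the sketch only shows the arithmetic.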

...

Name

Type

Unit

Description

numBytesOut

Counter

Bytes

The total number of output bytes since the sink started

numBytesOutPerSec

Meter

Bytes/Sec

The output bytes per second

numRecordsOut

Counter

Records

(Existing operator metric) The total number of output records since the sink started

numRecordsOutPerSec

Meter

Records/Sec

(Existing operator metric) The output records per second

numRecordsOutErrors

Counter

Records

The total number of records that failed to be sent

recordSize*

Histogram

Bytes

The size of a record.

currentSendTime

Gauge

ms

The time it takes to send the last record. This metric is an instantaneous value recorded for the last processed record.

sendTime*

Histogram

ms

The time it takes to send a record
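
As a rough illustration of how the counters and the currentSendTime gauge in the table above relate to each other, here is a minimal plain-Java sketch. The class `SinkSendTracker` and its callbacks are hypothetical. Whether a failed record also counts toward numRecordsOut is not specified by the convention; the sketch assumes it does not.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical helper (not part of the FLIP): bookkeeping a sink could
 * expose as numRecordsOut, numBytesOut, numRecordsOutErrors and currentSendTime.
 */
final class SinkSendTracker {
    private final AtomicLong numRecordsOut = new AtomicLong();
    private final AtomicLong numBytesOut = new AtomicLong();
    private final AtomicLong numRecordsOutErrors = new AtomicLong();
    // Assumption: -1 until the first record has been sent.
    private volatile long currentSendTime = -1L;

    /** Called when a record was sent successfully; times are in milliseconds. */
    void onSendSuccess(long sendStartMs, long sendEndMs, long sizeBytes) {
        numRecordsOut.incrementAndGet();
        numBytesOut.addAndGet(sizeBytes);
        // Instantaneous value: only the last record's send time is kept.
        currentSendTime = sendEndMs - sendStartMs;
    }

    /** Called when a record failed to be sent. Assumption: it is not counted as output. */
    void onSendFailure() {
        numRecordsOutErrors.incrementAndGet();
    }

    long numRecordsOut() {
        return numRecordsOut.get();
    }

    long numBytesOut() {
        return numBytesOut.get();
    }

    long numRecordsOutErrors() {
        return numRecordsOutErrors.get();
    }

    long currentSendTime() {
        return currentSendTime;
    }
}
```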

Note:

  • A connector implementation does not have to report all the following metrics. But the connectors that do report these metrics should conform to this convention.
  • Histogram metrics are usually very expensive. Due to their performance impact, it is strongly recommended that connectors do not report them by default, but allow users to opt in on demand.

Scope

The metric group for each source and sink is the same as the ordinary operator scope, which defaults to <host>.taskmanager.<tm_id>.<job_name>.<operator_name>.<subtask_index>
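
To make the scope concrete, the following plain-Java sketch substitutes example values into the default format quoted above and appends a metric name. The class `ScopeFormatSketch`, its methods, and the example values are hypothetical; this is not how Flink resolves scopes internally, only an illustration of the resulting identifier.

```java
import java.util.Map;

/** Hypothetical illustration of the default operator scope format. */
final class ScopeFormatSketch {
    static final String DEFAULT_OPERATOR_SCOPE =
            "<host>.taskmanager.<tm_id>.<job_name>.<operator_name>.<subtask_index>";

    /** Replaces each <variable> placeholder with its value. */
    static String resolve(String format, Map<String, String> variables) {
        String result = format;
        for (Map.Entry<String, String> e : variables.entrySet()) {
            result = result.replace("<" + e.getKey() + ">", e.getValue());
        }
        return result;
    }

    /** A metric reported by the connector lives directly under the operator scope. */
    static String metricIdentifier(String scope, String metricName) {
        return scope + "." + metricName;
    }
}
```

For example, with host `localhost`, task manager id `tm-1`, job `MyJob`, operator `KafkaSource` and subtask index `0` (all made-up values), the numRecordsIn metric would be identified as `localhost.taskmanager.tm-1.MyJob.KafkaSource.0.numRecordsIn`.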

...

If the connector has its own original metrics, the original metric names should still be kept, even if some of the original metrics are also exposed under standard metric names.

Future Work

Opt in/out metrics

In this FLIP, we intentionally left some useful but potentially expensive metrics out of scope. For example:

Name

Type

Unit

Description

recordSize*

Histogram

Bytes

The size of a record.

fetchLatency*

Histogram

ms

The latency that occurred before Flink fetched the record.

fetchLatency = FetchTime - EventTime

latency*

Histogram

ms

The latency that occurred before the record is emitted by the source connector.

latency = EmitTime - EventTime

sendTime*

Histogram

ms

The time it takes to send a record

We plan to add these metrics to the convention by introducing optional metrics, so that users can opt in to or out of these expensive metrics on demand. This will be discussed in a separate FLIP.
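
The opt-in idea can be pictured with a minimal plain-Java sketch. The actual mechanism is deferred to a separate FLIP, so everything here is an assumption made for illustration: the class `OptionalRecordSizeMetric`, the boolean switch, and the list standing in for a real histogram.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical illustration only: the cheap counter is always updated,
 * while the expensive recordSize histogram is fed only when enabled.
 */
final class OptionalRecordSizeMetric {
    private final boolean histogramEnabled; // assumed user-facing opt-in switch
    private long numBytesIn;
    // Stand-in for a histogram; a real one would keep a bounded sample.
    private final List<Long> recordSizes = new ArrayList<>();

    OptionalRecordSizeMetric(boolean histogramEnabled) {
        this.histogramEnabled = histogramEnabled;
    }

    void onRecord(long sizeBytes) {
        numBytesIn += sizeBytes; // always reported
        if (histogramEnabled) {
            recordSizes.add(sizeBytes); // only when the user opted in
        }
    }

    long numBytesIn() {
        return numBytesIn;
    }

    int histogramSampleCount() {
        return recordSizes.size();
    }
}
```

With the switch off, the per-record cost is a single addition; the histogram update is paid only by users who ask for it.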

Proposed Changes

  1. Add the proposed metrics to the existing connectors.
  2. Mark the old metrics as deprecated if necessary.
  3. Correct the scope and metric names of the connectors if needed.

...