...

The following metrics will be exposed in Kafka Streams' metrics:

  • bytes-written-rate [bytes/s]
  • bytes-written-total [bytes]
  • bytes-read-rate [bytes/s]
  • bytes-read-total [bytes]
  • bytes-flushed-rate [bytes/s]
  • bytes-flushed-total [bytes]
  • flush-time-(avg|min|max) [ms]
  • memtable-hit-rate
  • block-cache-hit-rate
  • bytes-read-compaction-rate [bytes/s]
  • bytes-written-compaction-rate [bytes/s]
  • compaction-time-(avg|min|max) [ms]
  • write-waiting-time-(avg|total) [ms]
  • num-open-files
  • num-file-errors-total

...

In this section, I explain the meaning of the metrics listed in the previous section and why I chose them.

bytes-written-(rate|total)

These metrics measure the bytes written to a RocksDB instance. The metrics show the write load on a RocksDB instance.

bytes-read-(rate|total)

Analogously to bytes-written-(rate|total), these metrics measure the bytes read from a RocksDB instance. The metrics show the read load on a RocksDB instance.
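To illustrate how a rate metric and a total metric relate, here is a minimal sketch. The class ByteThroughput and its methods are hypothetical and are not part of Kafka Streams or RocksDB. The sketch also assumes a single fixed window for the rate, which is a simplification and may differ from how the metrics are actually sampled.

```java
// Hypothetical helper for illustration only; not a Kafka Streams or RocksDB API.
final class ByteThroughput {
    private long totalBytes = 0;   // backs a "-total" style metric
    private long windowBytes = 0;  // bytes recorded in the current window

    // Records bytes written to (or read from) the store.
    void record(long bytes) {
        totalBytes += bytes;
        windowBytes += bytes;
    }

    // Cumulative number of bytes recorded so far.
    long total() {
        return totalBytes;
    }

    // Average bytes per second over the window that just ended; starts a new window.
    double rateAndReset(double windowSeconds) {
        if (windowSeconds <= 0) {
            throw new IllegalArgumentException("windowSeconds must be positive");
        }
        double rate = windowBytes / windowSeconds;
        windowBytes = 0;
        return rate;
    }
}
```

The total only ever grows, whereas the rate reflects the load in the most recent window. The two together show both the accumulated and the current load.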

bytes-flushed-(rate|total) and flush-time-(avg|min|max)

When data is put into RocksDB, the data is written into an in-memory tree data structure called the memtable. When the memtable is almost full, data in the memtable is flushed to disk by a background process.

...

Metrics bytes-flushed-(rate|total) measure the average throughput of flushes and the total amount of bytes written to disk. Metrics flush-time-(avg|min|max) measure the processing time of the flush operation.

The metrics should help to identify flushes as bottlenecks.

memtable-hit-rate

When data is read from RocksDB, the memtable is consulted first to find the data. This metric measures the number of hits relative to the number of all lookups into the memtable. Hence, the formula for this metric is hits/(hits + misses).
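The formula above can be sketched as follows. The class MemtableHitRate is a hypothetical helper, not part of Kafka Streams or RocksDB. Returning 0.0 when no lookups have been recorded is an assumption of this sketch, since the text does not specify that case.

```java
// Hypothetical helper for illustration only; not a Kafka Streams or RocksDB API.
final class MemtableHitRate {
    // Returns hits / (hits + misses).
    static double hitRate(long hits, long misses) {
        long lookups = hits + misses;
        if (lookups == 0) {
            // Assumption of this sketch: report 0.0 instead of NaN when nothing was looked up.
            return 0.0;
        }
        return (double) hits / lookups;
    }

    public static void main(String[] args) {
        // 75 hits out of 100 lookups
        System.out.println(hitRate(75, 25)); // prints 0.75
    }
}
```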

...

If block-cache-hit-rate is too low for the given workload, the block cache may need some tuning.

bytes-read-compaction-rate, bytes-written-compaction-rate, and compaction-time-(avg|min|max)

After data is flushed to disk, the data needs to be reorganised on disk from time to time. This reorganisation is called compaction and is performed by a background process. For the reorganisation, the data needs to be moved from disk to memory and back. Metrics bytes-read-compaction-rate and bytes-written-compaction-rate measure read and write throughput on average. Metrics compaction-time-(avg|min|max) measure the processing time of compaction.

The metrics should help to identify compactions as bottlenecks.
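As a rough illustration of what the (avg|min|max) variants of a time metric report, the following sketch aggregates a set of compaction durations. The class CompactionTimeStats and the sample durations are made up for this example. How the durations are actually collected from RocksDB is not shown here.

```java
import java.util.LongSummaryStatistics;

// Hypothetical helper for illustration only; not a Kafka Streams or RocksDB API.
final class CompactionTimeStats {
    // Aggregates individual compaction durations (in ms) into avg, min, and max.
    static LongSummaryStatistics summarize(long[] durationsMs) {
        LongSummaryStatistics stats = new LongSummaryStatistics();
        for (long durationMs : durationsMs) {
            stats.accept(durationMs);
        }
        return stats;
    }

    public static void main(String[] args) {
        long[] samplesMs = {120, 80, 400}; // made-up sample durations
        LongSummaryStatistics stats = summarize(samplesMs);
        System.out.println("avg=" + stats.getAverage()
            + " min=" + stats.getMin()
            + " max=" + stats.getMax());
    }
}
```

A maximum far above the average would point to occasional long compactions, which an average alone would hide.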

write-waiting-time-(avg|total)

As explained for bytes-flushed-(rate|total) and flush-time-(avg|min|max), when the memtable is almost full, data in the memtable is flushed to disk by a background process. During flush and compaction a write to the database might need to wait until these processes finish. These metrics measure the average and total waiting time of a write process until flush and compaction finish.

If flush and compaction happen too often this time may increase and signal a bottleneck. Users can then take action by, e.g., increasing the size of the memtable to decrease the rate of flushes or changing the compaction settings.

The metrics show the I/O load produced by the RocksDB instance.

Compatibility, Deprecation, and Migration Plan

...