You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 8 Next »

Status

Current state: Under Discussion

Discussion thread: here [Change the link from the KIP proposal email archive to your own email thread]

JIRA: Unable to render Jira issues macro, execution error.

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

RocksDB has functionality to collect statistics about its operations to monitor running RocksDB's instances. These statistics enable users to find bottlenecks and to accordingly tune RocksDB. RocksDB's statistics can be accessed programmatically via JNI or RocksDB can be configured to periodically dump them to disk. Although RocksDB provides this functionality, Kafka Streams does currently not expose RocksDB's statistics in its metrics. Hence users need to implement Streams' RocksDBConfigSetter to fetch the statistics. This KIP proposes to expose a subset of RocksDB's statistics in the metrics of Kafka Streams.  


Public Interfaces

Each exposed metric will have the following tags:

  • type = stream-state-metrics,
  • thread-id = [thread ID],
  • task-id = [task ID]
  • rocksdb-state-id = [store ID]

The following metrics will be exposed in the Kafka Streams' metrics

  • bytes-written-rate [bytes/s]
  • bytes-written-total [bytes]
  • bytes-read-rate [bytes/s]
  • bytes-read-total [bytes]
  • bytes-flushed-rate [bytes/s]
  • bytes-flushed-total [bytes]
  • flush-time-(avg|min|max) [ms]
  • memtable-hit-rate
  • block-cache-data-hit-rate
  • block-cache-index-hit-rate
  • block-cache-filter-hit-rate
  • bytes-read-compaction-rate [bytes/s]
  • bytes-written-compaction-rate [bytes/s]
  • compaction-time-(avg|min|max) [ms]
  • write-stall-duration-(avg|total) [ms]
  • num-open-files
  • num-file-errors-total

Proposed Changes

In this section, I will explain the meaning of the metrics listed in the previous section and why I chose them. Generally, I tried to choose the metrics that are useful independently of any specific configuration of the RocksDB instances. Furthermore, I tried to keep the number of metrics at a minimum, because adding metrics in future is easier than deleting them from a backward-compatibility point of view. 

bytes-written-(rate|total)

These metrics measure the bytes written to a RocksDB instance. The metrics show the write load on a RocksDB instance.

bytes-read-(rate|total)

Analogously to bytes-written-(rate|total), these metrics measure the bytes read from a RocksDB instance. The metrics show the read load on a RockDB instance.

bytes-flushed-(rate|total) and flush-time-(avg|min|max)

When data is put into RocksDB, the data is written into a in-memory tree data structure called memtable. When the memtable is almost full, data in the memtable is flushed to disk by a background process. Metrics bytes-flushed-(rate|total) measure the average throughput of flushes and the total amount of bytes written to disk. Metrics flush-time-(avg|min|max) measure the processing time of flushes. 

The metrics should help to identify flushes as bottlenecks.

memtable-hit-rate

When data is read from RocksDB, the memtable is consulted firstly to find the data. This metric measures the number of hits with respect to the number of all lookups into the memtable. Hence, the formula for this metric is hits/(hits + misses).

A low memtable-hit-rate might indicate a too small memtable.

block-cache-data-hit-rate, block-cache-index-hit-rate, and block-cache-filter-hit-rate

If data is not found in the memtable, the block cache is consulted. Metric block-cache-data-hit-rate measures the number of hits for data blocks with respect to the number of all lookups for data blocks into the block cache. The formula for this metric is the equivalent to the one for memtable-hit-rate.

Metrics block-cache-index-hit-rate and block-cache-filter-hit-rate measure the hit rates for index and filter blocks if they are cached in the block cache. By default index and filter blocks are cached outside of block cache. Users can configure RocksDB to include index and filter blocks into the block cache to better control the memory consumption of RocksDB. If users do not opt to cache index and filter blocks in the block cache, the value of these metrics should stay at zero.

A low hit-rate might indicate a too small block cache.

bytes-read-compaction-rate, bytes-written-compaction-rate, and compaction-time-(avg|min|max)

After data is flushed to disk, the data needs to be reorganised on disk from time to time. This reorganisation is called compaction and is performed by a background process. For the reorganisation, the data needs to be moved from disk to memory and back. Metrics bytes-read-compaction-rate and bytes-written-compaction-rate measure read and write throughput of compactions on average. Metrics compaction-time-(avg|min|max) measure the processing time of compactions.

The metrics should help to identify compactions as bottlenecks.

write-stall-duration(avg|total)

As explained above, from time to time RocksDB flushes data from the memtable to disk and reorganises data on the disk with compactions. Flushes and compactions might stall writes to the database, hence the writes need to wait until these processes finish. These metrics measure the average and total duration of write stalls.

If flush and compaction happen too often and stall writes this time will increase and signal a bottleneck.

num-open-files and num-file-errors-total

Part of the data in RocksDB is kept in files. This files need to be opened and closed. Metric num-open-files measures the number of currently open files and metric num-file-errors-total measures the number of file errors. Both metrics may help to find issues connected to OS and file systems.  


Compatibility, Deprecation, and Migration Plan

Since metrics are only added and no other metrics are modified, this KIP should not

  • affect backward-compatibility
  • deprecate public interfaces
  • need a migration plan other than adding the new metrics to its own monitoring component

Rejected Alternatives

  • Metrics bytes-read-compaction-total and bytes-written-compaction-total did not seem useful to me since they would measure bytes moved between memory and disk due to compaction. The metric bytes-flushed-total gives at least a feeling about the size of the persisted data in the RocksDB instance.
  • No labels