
Since the 0.10.0 release there have been questions and discussions about how to let users configure memory usage in Kafka Streams efficiently, and how that affects our current development plans for caching, buffering, state store management, etc. This page summarizes the memory usage background in Kafka Streams as of 0.10.0 and discusses what the "end goal" for Kafka Streams' memory management should be. It is not intended as an implementation design or development plan for memory management, but rather as guidance for related feature development that may affect memory usage.

 

Background

There are a few modules inside Kafka Streams that allocate memory at runtime:

  1. Kafka Producer: each thread of a Kafka Streams instance maintains a producer client. The client maintains a buffer for batching records that are going to be sent to Kafka. This is fully controllable through the producer's buffer.memory config.
     
  2. Kafka Consumer: each thread of a Kafka Streams instance maintains two consumer clients, one for normal data fetching and one used only for state store replication and restoration. Each client buffers fetched messages before they are returned to the user from the poll call. Today this buffer is not controllable, but in the near future we are going to add memory bound controls similar to those in the producer.

  3. Triggering-based Caches: as summarized in KIP-63, we will be adding a cache for each of the aggregation and KTable.to operators, and we are adding a StreamsConfig to bound the total number of bytes used for all caches. Note, however, that we are caching entries as deserialized objects in order to avoid serialization costs.

  4. Deserialized Objects Buffering: within each thread's running loop, after records are returned as raw bytes from consumer.poll, the thread deserializes each of them into typed objects, buffers them, and processes them one record at a time. The buffering is mainly used for extracting timestamps (which may be in the message's value payload) and reasoning about stream time to determine which stream to process next (i.e. synchronizing streams based on their current timestamps, see this for details).
     
  5. Persistent State Store Buffering: currently we use RocksDB by default as the persistent state store for stateful operations such as aggregations and joins, and RocksDB has its own buffering and caching mechanisms, which allocate memory both off-heap and on-heap.
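
To illustrate item 4, here is a minimal sketch of timestamp-based synchronization over buffered, deserialized records. It is not the Kafka Streams implementation: the class TimestampSyncSketch, the BufferedRecord type, and the partition names are all hypothetical, and records are assumed to be already deserialized with their timestamps extracted. The sketch keeps one FIFO buffer per input partition and always processes the record whose buffer head has the smallest timestamp.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TimestampSyncSketch {

    /** A deserialized record plus its extracted timestamp (hypothetical type). */
    public static final class BufferedRecord {
        public final String value;
        public final long timestamp;

        public BufferedRecord(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // One FIFO buffer per input partition, keyed by a hypothetical partition name.
    private final Map<String, Deque<BufferedRecord>> buffers = new LinkedHashMap<>();

    /** Buffers a record that has already been deserialized. */
    public void add(String partition, String value, long timestamp) {
        buffers.computeIfAbsent(partition, p -> new ArrayDeque<>())
               .addLast(new BufferedRecord(value, timestamp));
    }

    /**
     * Removes and returns the head record with the smallest timestamp across
     * all non-empty buffers, or null if every buffer is empty.
     */
    public BufferedRecord nextRecord() {
        Deque<BufferedRecord> best = null;
        for (Deque<BufferedRecord> queue : buffers.values()) {
            if (queue.isEmpty()) {
                continue;
            }
            if (best == null || queue.peekFirst().timestamp < best.peekFirst().timestamp) {
                best = queue;
            }
        }
        return best == null ? null : best.pollFirst();
    }

    /** Processes records one at a time until all buffers are empty. */
    public static List<String> drain(TimestampSyncSketch sketch) {
        List<String> order = new ArrayList<>();
        BufferedRecord record;
        while ((record = sketch.nextRecord()) != null) {
            order.add(record.value);
        }
        return order;
    }

    public static void main(String[] args) {
        TimestampSyncSketch sketch = new TimestampSyncSketch();
        sketch.add("orders-0", "o1", 10L);
        sketch.add("orders-0", "o2", 30L);
        sketch.add("payments-0", "p1", 20L);
        sketch.add("payments-0", "p2", 25L);
        System.out.println(drain(sketch)); // prints [o1, p1, p2, o2]
    }
}
```

Every record held in these buffers is a live typed object on the heap, which is why this buffering counts toward the memory footprint discussed above.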

 

 
