Status

Current state: Under Discussion

Discussion thread: Thread

JIRA: Unable to render Jira issues macro, execution error.

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

As described in KIP-844, under EOS, crash failures cause all Task state to be wiped out on restart. This is because, currently, data is written to the StateStore before the commit to its changelog has completed, so it's possible that records are written to disk that were not committed to the store changelog.

In KIP-844, it was proposed to create an alternative type of StateStore, which would enable users to opt-in to "transactional" behaviour, that ensured data was only persisted once the changelog commit has succeeded. However, the design and approach outlined in KIP-844 unfortunately did not perform well when tested (with a write throughput that was approximately only 4% of the regular RocksDB StateStore!).

This KIP explores an alternative design that should have little/no performance impact, and can thus be enabled for all stores.

Public Interfaces

New configuration:

Name Default Description

statestore.uncommitted.max.records -1 Maximum number of uncommitted state-store records to buffer per-task. If this limit is exceeded, a task commit will be requested. No limit: -1.

statestore.uncommitted.max.bytes

67108864 (64 MB)

Maximum number of memory bytes to be used to buffer uncommitted state-store records per-task. If this limit is exceeded, a task commit will be requested. No limit: -1.

Note: if this is too high or unbounded, it's possible for RocksDB to trigger out-of-memory errors.

Changed:

org.apache.kafka.streams.processor.StateStore

Changes:

org.apache.kafka.streams.processor.StateStore

    /**
     * Return an approximate count of records not yet committed to this StateStore.
     * <p>
     * This method will return an approximation of the number of records that would be committed by the next call to
     * {@link #commit(Map)}.
     * <p>
     * If this StateStore is unable to approximately count uncommitted records, it will return {@code -1}.
     * If this StateStore does not support atomic transactions, it will return {@code 0}, because records will always be
     * immediately written to a non-transactional store, so there will be none awaiting a {@link #commit(Map)}.
     *
     * @return The approximate number of records awaiting {@link #commit(Map)}, {@code -1} if the number of
     *         uncommitted records can't be counted, or {@code 0} if this StateStore does not support transactions.
     */
    @Evolving
    default long approximateNumUncommittedEntries() {
        return 0;
    }

   /**
     * Return an approximate count of memory used by records not yet committed to this StateStore.
     * <p>
     * This method will return an approximation of the memory would be freed by the next call to {@link #commit(Map)}.
     * <p>
     * If this StateStore is unable to approximately count uncommitted memory usage, it will return {@code -1}.
     * If this StateStore does not support atomic transactions, it will return {@code 0}, because records will always be
     * immediately written to a non-transactional store, so there will be none awaiting a {@link #commit(Map)}.
     *
     * @return The approximate size of all records awaiting {@link #commit(Map)}, {@code -1} if the size of uncommitted
     *         records can't be counted, or {@code 0} if this StateStore does not support transactions.
     */
    @Evolving
    default long approximateNumUncommittedBytes() {
        return 0;
    }

Proposed Changes

RocksDB provides WriteBatchWithIndex as a means to accomplishing atomic writes when not using the RocksDB WAL. The performance overhead of doing this should be negligible. The main performance consideration is that the buffer must reside completely in-memory until it is committed.

To mitigate this, we will automatically force a Task commit if the total uncommitted records returned by StateStore#approximateNumUncommittedEntries() exceeds a threshold, configured by statestore.uncommitted.max.records; or the total memory used for buffering uncommitted records returned by StateStore#approximateNumUncommittedBytes() exceeds the threshold configured by statestore.uncommitted.max.bytes. This will roughly bound the memory required per-Task for buffering uncommitted records, irrespective of the commit.interval.ms, and will effectively bound the number of records that will need to be restored in the event of a failure.

Note that these new methods provide default implementations that ensure existing custom stores and non-transactional stores (e.g. InMemoryKeyValueStore) do not force any early commits.

When reading records, we will use the WriteBatchWithIndex#getFromBatchAndDB and WriteBatchWithIndex#newIteratorWithBase utilities in order to ensure that uncommitted writes are available to query.

Concurrent Access by Interactive Queries

While the Task that owns a StateStore is processed by a single thread, concurrent access is possible when other threads perform an interactive query on the StateStore. While RocksDB itself is thread-safe, and optimized for concurrent access, WriteBatchWithIndex is not thread-safe. Care will need to be taken to ensure the consistency and safety of the batch in the presence of concurrent access from interactive query threads.

All RocksDBStore access methods (get, put , etc.) are already synchronized, however, it's possible for iterators to escape the synchronization locking, enabling a RocksDBIterator to be in-use while the underlying WriteBatchWithIndex is being modified. In addition to tackling this problem, we will evaluate whether the access methods need to be syncrhonized, as this may be an unnecessary performance penalty.

Compatibility, Deprecation, and Migration Plan

The above changes will retain compatibility for all existing StateStores, including user-defined custom implementations. Any StateStore that extends RocksDBStore will automatically inherit its behaviour, although users that directly write via the db RocksDB instance may need to switch to using the dbAccessor to ensure consistent results.

Test Plan

Testing will be accomplished by both the existing tests and by writing some new unit tests that verify atomicity, durability and consistency guarantees that this KIP provides.

Rejected Alternatives

The design outlined in KIP-844, sadly, does not perform well (as described above), and requires users to opt-in to transactionality, instead of being a guarantee provided out-of-the-box.

Space shortcuts

Child pages

Status

Motivation

Public Interfaces

Proposed Changes

Concurrent Access by Interactive Queries

Compatibility, Deprecation, and Migration Plan

Test Plan

Rejected Alternatives

Space shortcuts

Child pages

KIP-892: Transactional Semantics for StateStores

Status

Motivation

Public Interfaces

Proposed Changes

Concurrent Access by Interactive Queries

Compatibility, Deprecation, and Migration Plan

Test Plan

Rejected Alternatives