Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

To ensure that data is not written to a state store until it has been committed to the changelog, we need to isolate writes from the underlying database until changelog commit. To achieve this, we introduce the concept of transaction Isolation Levels, that dictate the visibility of records, written by different processing threads, to Interactive Query threads.

We enable configuration of the level of isolation provided by StateStores via a default.state.isolation.level, which can be configured to either:

default.state.isolation.levelDescription
processing.modes
READ_UNCOMMITTED

Records written by the StreamThread are visible to all Interactive Query threads immediately. This level provides no atomicity, consistency, isolation or durability guarantees.

at-least-once

Under this Isolation Level, Streams behaves as it currently does, wiping state stores on-error when the processing.mode is one of exactly-once, exactly-once-v2  or exactly-once-beta.

READ_COMMITTED

Records written by the StreamThread are only visible to Interactive Query threads once they have been committed.

at-least-once, exactly-once, exactly-once-v2, exactly-once-beta

Under this Isolation Level, Streams will isolate writes from state stores until commit. This guarantees consistency of the on-disk data with the store changelog, so Streams will not need to wipe stores on-error.

In Kafka Streams, all StateStore s are written to by a single StreamThread  (this is the Single Writer principle). However, multiple other threads may concurrently read from StateStore s, principally to service Interactive Queries. In practice, this means that under READ_COMMITTED, writes by the StreamThread  that owns the StateStore  will only become visible to Interactive Query threads once commit()  has been called.

The READ_UNCOMMITTED isolation default value for default.state.isolation.level will only be available under the at-least-once processing.mode. If READ_UNCOMMITTED is selected with an EOS processing.mode, it , to mirror the behaviour we have today; but this will be automatically upgraded set to READ_COMMITTED and a warning will be produced. This is due to the complexity of making uncommitted writes in the RocksDB transaction buffer available to other threads, and this restriction is expected to be removed in a later KIP. We are able to provide READ_UNCOMMITTED isolation under at-least-once because under this mode, transactions are not used for writes to the changelog, therefore it is acceptable to write directly to the database, even if those writes fail to append to the changelog. This is the same behaviour that we have today, where stores are not wiped on error under at-least-once.

The default value for default.state.isolation.level will be READ_UNCOMMITTED, to mirror the behaviour we have today; but this will be automatically set to READ_COMMITTED if the processing.mode has been set to an EOS mode (see above).

if the processing.mode has been set to an EOS mode, and the user has not explicitly set deafult.state.isolation.level to READ_UNCOMMITTED. This will provide EOS users with the most useful behaviour out-of-the-box, but ensures that they may choose to sacrifice the benefits of transactionality to ensure that Interactive Queries can read records before they are committed, which is required by a minority of use-cases.

In-memory Transaction Buffers

...

Note that this new method provide provides default implementations that ensure existing custom stores and non-transactional stores (e.g. InMemoryKeyValueStore) do not force any early commits.

...

With Transactional StateStores, we can guarantee that the local state is consistent with the changelog, therefore, it will no longer be necessary to reset the local state on a TimeoutException when operating under the READ_COMMITTED isolation level.

Atomic Checkpointing

Kafka Streams currently stores the changelog offsets for a StateStore in a per-Task on-disk file, .checkpoint, which under EOS, is written only when Streams shuts down successfully. There are two major problems with this approach:

...

When writing data to a RocksDBStore (via put, delete, etc.), the input partition offsets will be read from the changelog record metadata (as before), and these offsets will be added to the current transactions WriteBatch. When the StateStore is committed, the position offsets in the current WriteBatch will be written to RocksDB, alongside the records they correspond to. Alongside this, RocksDBStore will maintain two Position maps in-memory, one containing the offsets pending in the current transaction's WriteBatch, and the other containing committed offsets. On commit(Map), the uncommitted Position map will be merged into the committed Position map such that both maps contain the same offsets. In . In this sense, the two Position maps will diverge during writes, and re-converge on-commit.

...

When the isolation level is READ_COMMITTED, we will use RocksDB's WriteBatchWithIndex as a means to accomplishing atomic writes when not using the RocksDB WAL. When reading records from the StreamThread, we will use the WriteBatchWithIndex#getFromBatchAndDB and WriteBatchWithIndex#newIteratorWithBase utilities in order to ensure that uncommitted writes are available to query. When reading records from Interactive Queries, we will use the regular RocksDB#get and RocksDB#newIterator methods, to ensure we see only records that have been flushed committed (see above). The performance of this is expected to actually be better than the existing, non-batched write path. The main performance concern is that the buffer WriteBatch must reside completely in-memory until it is committed, which is addressed by statestore.uncommitted.max.bytes, see above.

...

Users may notice a change in the performance/behaviour of Kafka Streams. Most notably, under EOS Kafka Streams will now regularly "commit" StateStores, where it would have only done so when the store was closing in the past. While the The overall performance of this should not be a concern, this change will be observable in metrics, potentially causing thresholds users may have set in monitoring systems to require updatingbe at least as good as before, but the profile will be different, with write latency being substantially faster, and commit latency being a bit higher.

Upgrading

When upgrading to a version of Kafka Streams that includes the changes outlined in this KIP, users will not be required to take any action. Kafka Streams will automatically upgrade any RocksDB stores to manage offsets directly in the RocksDB database, by importing the offsets from any existing .checkpoint and/or .position files.

Downgrading

Users that currently use processing.mode: exactly-once(-v2|-beta) and who wish to continue to read uncommitted records from their Interactive Queries will need to explicitly set default.state.isolation.level: READ_UNCOMMITTED.

Downgrading

When downgrading from a version of Kafka Streams that includes When downgrading from a version of Kafka Streams that includes the changes outlined in this KIP to a version that does not contain these changes, users will not be required to take any action. The older Kafka Streams version will be unable to open any RocksDB stores that were upgraded to store offsets (see Upgrading), which will cause Kafka Streams to wipe the state for those Tasks and restore the state, using an older RocksDB store format, from the changelogs.

...

It has been recommended to instead pursue this idea in a subsequent KIP, as the interface changes outlined in this KIP should be compatible with this idea.

Transactional support under READ_UNCOMMITTED

When query isolation level is READ_UNCOMMITTED, Interactive Query threads need to read records from the ongoing transaction buffer. Unfortunately, the RocksDB WriteBatch is not thread-safe, causing Iterators created by Interactive Query threads to produce invalid results/throw unexpected errors as the WriteBatch is modified/closed during iteration.

Ideally, we would build an implementation of a transaction buffer that is thread-safe, enabling Interactive Query threads to query it safely. One approach would be to "chain together" WriteBatches, creating a new WriteBatch every time a new Iterator is created by an Interactive Query thread and "freezing" the previous WriteBatch.

It was decided to defer tackling this problem to a later KIP, in order to realise the benefits of transactional state stores to users as quickly as possible.

Query-time Isolation Levels

It was requested that users be able to select the isolation level of queries on a per-query basis. This would require some additional API changes (to the Interactive Query APIs). Such an API would require that state stores are always transactional, and that the transaction buffers can be read from by READ_UNCOMMITTED queries. Due to the problems outlined in the previous section, it was decided to also defer this to a subsequent KIP.

The new configuration option default.state.isolation.level was deliberately named to enable query-time isolation levels in the future, whereby any query that didn't explicitly choose an isolation level would use the configured default. Until then, this configuration option will globally control the isolation level of all queries, with no way to override it per-query.