Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Add replication modes & CAP table, start on user stories.

...

  1. They run in external processes which may experience outages when both Kafka clusters are otherwise healthy and accepting clients
  2. They represent an additional operational burden beyond just running Kafka
  3. They provide logical replication, in which the offset of a record in the source and target are given different offsets
  4. Replication does not preserve offsets of individual records
  5. Replication does not preserve exactly-once-semantics for records & consumer offsetsThe replication is asynchronous, preventing the extension of strong consistency guarantees between Kafka clusters

Goals

  • Replicate topics and other associated resources between intentionally separate Kafka clustersProvide physical replication semantics
  • Offsets for replicated records should be the same as in the origin Kafka
  • Preserve exactly-once-semantics (immediately, in some configurations, or as a later extension of the feature?)for replicated records

Public Interfaces

User Interface

  • New AdminClient methods for managing Replication Links on both the source and destination clusters
  • A single cluster can participate in arbitrarily many links, and both a source and destination simultaneously.
  • Links accept a configuration, including topics.regex and consumer.groups.regex, and a mode selector that accepts either Asynchronous or Synchronous
    • Can a link change it's topics.regex or consumer.groups.regexWhat happens if a topic or group is created or deleted while the link is in-sync? Does the link change state?
    • Can a link change from synchronous to asynchronous?
  • Existing Consumers & Producers can access replicated topics without an upgrade or reconfiguration.
  • Metrics allow observation of the state of each partition included in the replication link and progress of the replication flow (lag, throughput, etc)
  • During and shortly after network partitions, the link will be out-of-sync while it catches up with the source ISR
  • Links can be temporarily or permanently disconnected. During temporary disconnects, the replicated topic is not writable. After a permanent disconnect, the replicated topic is writable.manually disconnected, after which the destination topic becomes writable.
    • Can we manually reverse a link while keeping consistency?
    • Can we reconnect a link, truncating the target if it has diverged?

Data Semantics

Cross-Cluster Replication is similar to Intra-Cluster replication, as both cross-cluster topics and intra-cluster replicas:

...

  • Are subject to the target cluster's ACL environment
  • Are not included in acks=all or min-ISR requirements (optional? maybe acks=all is useful/required for EOS producers that want to make sure cross-cluster replication finishes)Are not eligible for fetch-from-follower on the source clustereligible for fetches from source cluster consumers
  • Have a separate topic-id

Consumer Offsets

...

Replication Mode

  • Asynchronous mode allows replication of a record to proceed after the source cluster has ack'd the record to the producer.
    • Maintains existing latency of source producers, but provides only single-topic consistency guarantees
  • Synchronous mode requires the source cluster to delay acks for a producer until after the record has been replicated to all ISRs for all attached synchronous replication links.
    • Also applies to consumer group offsets submitted by the consumer or transactional producer 
    • Increases latency of source producers, in exchange for multi-topic & consumer-offset consistency guarantees

Partition Behavior

We must support partition tolerance, which requires choosing between consistency and availability. Availability in the table below means that the specified client can operate, at the expense of consistency with another client. Consistency means that the specified client will be unavailable, in order to be consistent with another client.

ModeAsynchronous ModeSynchronous Mode

Link State

StartupOut-of-syncDisconnectedStartupOut-of-syncDisconnected

Source Consumers

Available1Available1Available1,2Available1Available1Available1,2

Source Non-Transactional Producers

Available1Available1Available1,2Available1Available1Available1,2
Source Transactional ProducersAvailable3Available4Available2Available3Consistent5Available2
Target ConsumersAvailable6Available6Available2Consistent7Consistent8Available2
Target ProducersConsistent9Consistent9Available2Consistent9Consistent9Available2
  1. Source clients not involved in transactions will always prioritize availability over consistency
  2. When the link is permanently disconnected, clients on each cluster are not required to be consistent and can show whatever the state was when the link was disconnected.
  3. Transactional clients are available during link startup, to allow creating links on transactional topics without downtime.
  4. Transactional clients are available when asynchronous links are out-of-sync, because the source cluster is allowed to ack transactional produces while async replication is offline or behind.
  5. Transactional clients are consistent when synchronous links are out-of-sync, because the destination topic is readable by target consumers. Transactional produces to a topic with an out-of-sync synchronous replication link should timeout or fail.
  6. Consumers of an asynchronous replicated topic will see partial states while the link is catching up
  7. While starting up a synchronous replicated topic, consumers will be unable to read the partial topic. This allows source transactional produces to proceed while the link is starting (3)
  8. Consumers of a synchronous replicated topic should always see the same contents as the source topic
  9. Producers targeting the replicated topic will fail because the topic is not writable until it is disconnected. Allowing a dual-write to the replicated topic would be inconsistent. 

...

Remote and Local Logs

  • Remote log replication should be controlled by the second tier's provider.
  • Remote logs can be referenced if not replicated by the second tier's provider so that replicated Kafka topics reference the source remote log storage.

...

  • The network path between Kafka clusters is assumed to be less reliable have less uptime, lower bandwidth,  and higher latency than the intra-cluster network, and have more strict routing constraints.
  • Network connections for replication links can proceed either source→target or target→source to allow one cluster to be behind a NAT (can we support NAT punching as a first-class feature, to allow both clusters to be behind NAT?)
  • Allow for simple traffic separation of cross-cluster connections and client/intra-cluster connections (do we need to connect to something other than the other cluster's advertised listeners?)

...

  • Binary log format

  • The network protocol and api behavior

  • Any class in the public packages under clientsConfiguration, especially client configuration

    • org/apache/kafka/common/serialization

    • org/apache/kafka/common

    • org/apache/kafka/common/errors

    • org/apache/kafka/clients/producer

    • org/apache/kafka/clients/consumer (eventually, once stable)

  • Monitoring

  • Command line tools and arguments

  • Anything else that will likely break existing users in some way when they upgrade

User Stories

Disaster Recovery (multi-zone Asynchronous)

  1. I administrate multiple Kafka clusters in different availability zones
  2. I have a performance-sensitive application that reads in all zones but writes to only one zone at a time. For example, an application that runs consumers in zones A and B to keep caches warm but disables producers in zone B while zone A is running.
  3. I set up an asynchronous Cross-Cluster replication link for my topics and consumer groups from cluster A to cluster B. While the link is being created, applications in zone A are performant, and zone B can warm it's caches with historical data as it is replicated.
  4. I do the same with cluster A and cluster C (and others that may exist)
  5. When zone A goes offline, I want the application in zone B to start writing. I manually disconnect the A→B cross-cluster link, and trigger the application in zone B to begin writing.
  6. What happens to cluster C? Can we connect B→C quickly? What happens if C is ahead of B and truncating C breaks the cluster C consumers?
  7. When zone A recovers, I see that the history has diverged between zone A and B. I manually delete the topics in zone A and re-create the replication link in the opposite direction.

Proposed Changes

Describe the new thing you want to do in appropriate detail. This may be fairly extensive and have large subsections of its own. Or it may be a few sentences. Use judgement based on the scope of the change.

...

  • Both the source and target clusters should have a version which includes the Cross-Cluster Replication feature
  • Clusters which support the Cross-Cluster Replication feature should negotiate on the mutually-supported replication semantics
  • If one of the clusters is downgraded to a version which does not support Cross-Cluster Replication, the partner cluster should temporarily disable replication's link should fall out-of-sync.

Test Plan

Describe in few sentences how the KIP will be tested. We are mostly interested in system tests (since unit-tests are specific to implementation details). How will we know that the implementation works as expected? How will we know nothing broke?

...