Table of Contents

Status

Current state:

Discussion thread

...

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Apache Kafka is in the process of moving from storing metadata in Apache ZooKeeper to storing metadata in an internal Raft topic.  KIP-500 described the overall architecture and plan.  The purpose of this KIP is to go into detail about how the Kafka Controller will change during this transition.

Proposed Changes

KIP-500 Mode

Once this KIP is implemented, system administrators will have the option of running in KIP-500 mode.  In this mode, we will not use ZooKeeper.  The alternative mode where KIP-500 support is not enabled will be referred to as legacy mode.

...

Initially, we will not support upgrading a cluster from legacy mode to KIP-500 mode.  This is in keeping with the experimental nature of KIP-500 mode.  A future KIP will describe and implement an upgrade process from legacy mode to KIP-500 mode.

Deployment

Currently, a ZooKeeper cluster must be deployed when running Kafka.  This KIP will eliminate that requirement, as well as the requirement to configure the addresses of the ZooKeeper nodes on each broker.

...

System administrators will be able to choose whether to run separate controller nodes, or whether to run controller nodes which are co-located with broker nodes.  Kafka will provide support for running a controller in the same JVM as a broker, in order to save memory and enable single-process test deployments.
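
To make the two deployment shapes concrete, here is a minimal sketch in Java; the KafkaService interface and the startup ordering are illustrative assumptions, not something this KIP specifies.

public final class DeploymentSketch {
    // Hypothetical interfaces standing in for the broker and controller components.
    interface KafkaService { void startup(); void shutdown(); }

    // Dedicated controller node: only the controller runs in this JVM.
    static void runDedicatedController(KafkaService controller) {
        controller.startup();
    }

    // Co-located deployment: the controller shares a JVM with a broker,
    // which saves memory and enables single-process test deployments.
    static void runCoLocated(KafkaService controller, KafkaService broker) {
        controller.startup();
        broker.startup();
    }
}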

Node IDs

Just like brokers, controller nodes will have non-negative integer node IDs.  There will be a single ID space.  In other words, no controller should share the same ID as a broker.  Even when a broker and a controller are co-located in the same JVM, they must have different node IDs.

Automatic node ID assignment via ZooKeeper will no longer be supported in KIP-500 mode.  Node IDs must be set in the configuration file for brokers and controllers.
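
As an illustration only, the sketch below shows a broker and a co-located controller each assigned an explicit ID; the property key name is an assumption, since this KIP only requires that IDs are non-negative integers set in each node's configuration file, and that a co-located broker and controller use different IDs.

import java.util.Properties;

public final class NodeIdConfigSketch {
    // Stand-ins for entries in each node's configuration file.
    static Properties brokerConfig() {
        Properties props = new Properties();
        props.setProperty("node.id", "0");      // the broker's ID
        return props;
    }

    static Properties coLocatedControllerConfig() {
        Properties props = new Properties();
        props.setProperty("node.id", "3000");   // must differ from the broker's ID
        return props;
    }
}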

Networking

Controller processes will listen on a separate port from brokers.  This will be true even when the broker and controller are co-located in the same JVM.
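
For illustration, a co-located broker and controller might be configured with distinct listeners as sketched below; the key names and port numbers are assumptions, not part of this KIP.

import java.util.Properties;

public final class ListenerConfigSketch {
    // Separate listeners for broker traffic and controller traffic,
    // even though both roles share one JVM in this example.
    static Properties coLocatedListeners() {
        Properties props = new Properties();
        props.setProperty("listeners", "BROKER://:9092,CONTROLLER://:9093");
        props.setProperty("controller.listener.names", "CONTROLLER");
        props.setProperty("listener.security.protocol.map",
                          "BROKER:PLAINTEXT,CONTROLLER:PLAINTEXT");
        return props;
    }
}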

...

The only time when clients should contact a controller node directly is when they are debugging system issues.  This is similar to ZooKeeper, where we have things like zk-shell, but only for debugging.

Controller Heartbeats

Each broker will periodically send heartbeat messages to the controller.  These heartbeat messages will serve to identify the broker as still in the cluster.  The controller can also use these heartbeats to communicate information back to each node.

These messages are separate from the fetch requests which the brokers send to retrieve new metadata.  However, controller heartbeats do include the offset of the latest metadata update which the broker has applied.
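
A minimal broker-side sketch follows, assuming hypothetical request and channel types; the only fields shown are the ones this section implies, namely the broker's identity and the offset of the last metadata update it applied.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

public final class HeartbeatSketch {
    // Hypothetical request shape: identity plus the latest applied metadata offset.
    record BrokerHeartbeatRequest(int brokerId, long lastAppliedMetadataOffset) { }

    // Hypothetical channel to the active controller.
    interface ControllerChannel {
        void send(BrokerHeartbeatRequest request);
    }

    // Periodically report liveness and metadata progress to the active controller.
    static void startHeartbeats(int brokerId, ControllerChannel channel,
                                LongSupplier lastAppliedOffset, long intervalMs) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
            () -> channel.send(new BrokerHeartbeatRequest(brokerId, lastAppliedOffset.getAsLong())),
            0, intervalMs, TimeUnit.MILLISECONDS);
    }
}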

Broker Fencing

Brokers start up in the fenced state and can leave this state only by sending a heartbeat to the active controller and getting back a response that tells them they can become active.  That response serves as a kind of lease which will expire after a certain amount of time.

When a broker is fenced, it cannot process any client requests.

There are a few reasons why we might want the broker to be in this state.  The first is that the broker is truly isolated from the controller quorum, and therefore not up-to-date on its metadata.  The second is if the broker is attempting to use the ID of another broker which has already been registered.
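
A minimal sketch of how a broker might track the lease described above; the class and method names are hypothetical, and the point is only that a broker with no valid lease refuses client requests.

public final class FencingSketch {
    // Hypothetical broker-side view of the lease granted by a heartbeat response.
    static final class BrokerLease {
        private volatile long expirationTimeMs = 0;   // start fenced: no lease yet

        // Called when a heartbeat response tells the broker it may become active.
        void renew(long nowMs, long leaseDurationMs) {
            expirationTimeMs = nowMs + leaseDurationMs;
        }

        // A broker whose lease has expired (or was never granted) is fenced.
        boolean isFenced(long nowMs) {
            return nowMs >= expirationTimeMs;
        }
    }

    static void handleClientRequest(BrokerLease lease, Runnable request) {
        if (lease.isFenced(System.currentTimeMillis())) {
            throw new IllegalStateException("Broker is fenced; refusing client request");
        }
        request.run();
    }
}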

Metadata Edits

Each change that the controller makes will generate a metadata edit log entry, also known as an "edit."  These edits will be persisted to the metadata partition.

...

It's important to note that in many cases, it will be better to add a tagged field than to bump the version number of an edit type.  A tagged field is appropriate any time older brokers can simply ignore a field that newer brokers might want to see.
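
The sketch below illustrates why tagged fields avoid a version bump: each field is written with an explicit tag and length, so a reader can simply skip tags it does not recognize.  The wire layout shown here is illustrative only, not Kafka's actual encoding.

import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

public final class TaggedFieldSketch {
    // Read fieldCount tagged fields, each encoded as (tag, length, bytes).
    // An older reader that does not know a given tag can still read past it,
    // so adding a new tagged field does not require a new message version.
    static Map<Integer, byte[]> readTaggedFields(ByteBuffer buf, int fieldCount) {
        Map<Integer, byte[]> fields = new HashMap<>();
        for (int i = 0; i < fieldCount; i++) {
            int tag = buf.getInt();
            int length = buf.getInt();
            byte[] value = new byte[length];
            buf.get(value);
            fields.put(tag, value);   // unknown tags can simply be ignored by the caller
        }
        return fields;
    }
}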

Snapshots

Clearly, as we keep creating metadata edits, the metadata partition will grow without bound, even if the size of the actual metadata stays the same.  To avoid this, we will periodically write out our metadata state into a snapshot.

...

The controller will also snapshot less frequently when too many members of the quorum have fallen behind.  Specifically, if losing one more node would likely impact availability, we will use a separate set of configurations to determine when to snapshot.
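
A minimal sketch of this snapshot decision follows; the thresholds and the signal for quorum health are hypothetical, and the point is only that a separate, more conservative configuration applies while the quorum is degraded.

public final class SnapshotPolicySketch {
    // Decide whether to write a new snapshot, based on how many bytes of edits
    // have accumulated since the last one. While the quorum is degraded we use
    // a separate (larger) threshold, so snapshots happen less frequently.
    static boolean shouldSnapshot(long bytesSinceLastSnapshot,
                                  boolean losingOneMoreNodeImpactsAvailability,
                                  long normalThresholdBytes,
                                  long degradedThresholdBytes) {
        long threshold = losingOneMoreNodeImpactsAvailability
            ? degradedThresholdBytes
            : normalThresholdBytes;
        return bytesSinceLastSnapshot >= threshold;
    }
}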


Public Interfaces

Compatibility, Deprecation, and Migration Plan

Test Plan

Rejected Alternatives