You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 26 Next »

 

Status

Current state: Under Discussion

Discussion thread: here

JIRA: Unable to render Jira issues macro, execution error.

Github Pull Request: https://github.com/apache/kafka/pull/132

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Machines in data center are sometimes grouped in racks. Racks provide isolation as each rack may be in a different physical location and has its own power source. When resources are properly replicated across racks, it provides fault tolerance in that if a rack goes down,  the remaining racks can continue to serve traffic.

In Kafka, if there are more than one replica for a partition, it would be nice to have replicas placed in as many different racks as possible so that the partition can continue to function if a rack goes down. In addition, it makes maintenance of  Kafka cluster easier as you can take down the whole rack at a time.

In AWS, racks are usually mapped to the concept of availability zones

Public Interfaces

Changes to Broker property

An optional broker property will be added 

case class Broker(id: Int, endPoints: Map[SecurityProtocol, EndPoint], rack: Option[String] = None)

rack will be an optional field of version 2 JSON schema for a broker. e.g.:

{"version":2,
  "host","localhost",
  "port",9092
  "jmx_port":9999,
  "timestamp":"2233345666",
  "endpoints": ["PLAINTEXT://host1:9092",
                "SSL://host1:9093"],
  "rack": "dc1"
}

If no rack information is specified, the field will not be included in JSON.

Rack can be specified in the broker property file as

broker.rack=<rack ID as string>

For example, 

broker.rack=dc1

broker.rack=us-east-1c

Consequently, Broker.writeTo will append rack at the end of ByteBuffer and Broker.readFrom will read it:

Broker => ID NumberOfEndPoints [EndPoints] Rack
ID => int32
NumberOfEndPoints => int32
EndPoint => Port Host ProtocolID
Port => int32
Host => string
ProtocolID => int16
Rack => string
 

Same kind of string serialization (as how host is serialized) will be applied to rack, which means it will first write the size of the string as a short, followed by the actual string content. If rack is not available, it will write size -1 only without any actual string content. When reading from ByteBuffer, -1 will be interpreted as "null".

Changes to UpdateMetadataRequest

UpdateMetadataRequest understands version and the serialization for Broker are done in different ways for version 0 and version 1. For version 0, only broker ID, host and port will be serialized. For version 1, the complete Broker object will be serialized. This is done by calling Broker.writeTo and Broker.readFrom. Therefore, for version 1 the rack information will be automatically handled and the the serialization format is the same as the above.

Proposed Changes

  • AdminUtils.assignReplicasToBrokers will be updated to create broker-rack mapping from ZooKeeper data before doing replica assignment. If none of the brokers have rack information, the algorithm will create the same assignment as the current implementation. If some brokers have rack, and some do not, the algorithm will thrown an exception. This is to prevent incorrect assignment caused by user error. 
  • When making the rack aware assignment, it tries to keep the following properties:
    • Even distribution of replicas among brokers
    • Even distribution of partition leadership among brokers
    • Assign to as many racks as possible. That means if the number of racks are more than or equal to the number of replicas, each rack will have at most one replica. On the other hand, if the number of racks is less than the the number of replicas (which should happen very infrequently), each rack should have at least one replica and no other guarantees are made on how the replicas will be distributed among racks. For example, if there are 2 racks and 4 replicas, one rack can have 3 replicas, 2 replicas or 1 replica. This is to keep the algorithm simple while still keeping other replica distribution properties and fault tolerance from the racks.
  • Implementation detail of the rack aware assignment (see more in the pull request https://github.com/apache/kafka/pull/132):
    • Before doing the rack aware assignment, sort the broker list such that they are interlaced according to the rack. In other words, adjacent brokers in the sorted list should not be in the same rack if possible . For example, assuming 6 brokers mapping to 3 racks: 0 -> "rack1", 1 -> "rack1", 2 -> "rack2", 3 -> "rack2", 4 -> "rack3", 5 -> "rack3", the sorted broker list could be (0, 2, 4, 1, 3, 5)
    • Apply the same assignment algorithm to assign replicas, with the addition of skipping a broker if its rack is already used for the same partition
  • UpdateMetadataRequest should be updated to correctly handle rack for both controller protocol version 0 and version 1.

Compatibility, Deprecation, and Migration Plan

  • Rack property will be included in version 2 broker JSON schema and version 1 of UpdateMetaDataRequest. Controller will include version specific broker information in its wire format so that broker with old version can still interoperate with broker with new version and rack.
  • When upgrading brokers, consumers with the old version can still understand the new broker information with rack in ZooKeeper. It calls ZkUtils.getBrokerInfo() which parses the JSON into a map and only gets the id, host and port from it.
  • No labels