Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Added progress monitoring of reassignments

...

Describe the problems you are trying to solve.

The Firstly, the ReassignPartitionsCommand (which is used by the kafka-reassign-partitions.sh tool) talks directly to ZooKeeper. This prevents the tool being used in deployments where only the brokers are exposed to clients (i.e. where the zookeeper servers are intentionally not exposed). In addition, there is a general push to refactor/rewrite/replace tools which need ZooKeeper access with equivalents which use the the AdminClient API. Thus it is necessary to change the ReassignPartitionsCommand so that it no longer talks to ZooKeeper directly, but via an intermediating broker.

Secondly, ReassignPartitionsCommand currently has no proper facility to report progress of a reassignment; --verify can be used periodically to check whether the request assignments have been achieved. It would be useful if the tool could report progress better.

Public Interfaces

Briefly list any new interfaces that will be introduced as part of this proposal or any existing interfaces that will be removed or changed. The purpose of this section is to concisely call out the public contract that will come along with this feature.

...

  • Binary log format

  • The network protocol and api behavior

  • Any class in the public packages under clientsConfiguration, especially client configuration

    • org/apache/kafka/common/serialization

    • org/apache/kafka/common

    • org/apache/kafka/common/errors

    • org/apache/kafka/clients/producer

    • org/apache/kafka/clients/consumer (eventually, once stable)

  • Monitoring

  • Command line tools and arguments

  • Anything else that will likely break existing users in some way when they upgrade

 

A Two new network protocol API APIs will be added:

  • PartitionAssignmentRequest and PartitionAssignmentResponse
  • PartitionAssignmentStatusRequest and PartitionAssignmentStatusResponse

The AdminClient API will have a two new method methods added (plus overloads for options):

  • assignPartitions()
  • partitionAssignmentStatus()

The options accepted by kafka-reassign-partitions.sh command will change:

  • --zookeeper will be deprecated, with a warning message
  • a new --bootstrap-server option will be added
  • a new --progress action option will be added, with a new --request option

Reassignment requests will be identified by a monotonic integer (the "request id"), which will be apparent in each of these public interfaces. The intention is to avoid ambiguity in the event that one reassignment is immediately followed by another. Note that this KIP will not add support for concurrent reassignments, but having a request id is compatible with disambiguating concurrent reassignments, should support for them be added at a later date.

Proposed Changes

Describe the new thing you want to do in appropriate detail. This may be fairly extensive and have large subsections of its own. Or it may be a few sentences. Use judgement based on the scope of the change.

...

It is anticipated that a future version of Kafka would remove support for the --zookeeper option.

New --progress and --request options will be added. These will only be supported when used with --bootstrap-server. If used with --zookeeper the command will produce an error message and the tool will exit without doing the intended operation. The existing --execute option will cause the request id to be printed to the standard output. This can be used as an argument to the --request option to identify the request that the --progress pertains to. If --progress is used without a --request option present the most recent request will be used. If an invalid --request option is given the tool will exit with an error message. For example:

Code Block
# Start the execution of a reassignment and exit immediately,
# printing a reassignment request id to standard output
kafka-reassign-partitions.sh --bootstrap-server localhost:1234 \
  --execute ...
      
# Assuming the above command printed 42, the following
# will print the progress of that reassignment, then exit immediately
kafka-reassign-partitions.sh --bootstrap-server localhost:1234 \
  --progress --request 42 ...

Internally, the ReassignPartitionsCommand will be refactored to support the above changes to the options. An interface will abstract the commands currently issued directly to zookeeper.

...

The following methods will be added to AdminClient to support the ability to reassign partitions: 

Code Block

...

public AssignPartitionsResult assignPartitions(Map<TopicPartition, List<Integer>>)

...


public AssignPartitionsResult assignPartitions(Map<TopicPartition, List<Integer>>,

...

  AssignPartitionOptions options)

...

Where:

Code Block
 

...

 public class AssignPartitionsResult {

...


        // private constructor
        public Map<TopicPartition, KafkaFuture<Void>> values()
        public KafkaFuture<Void> all()
  }

The following methods will be added to AdminClient to support the progress reporting functionality:

Code Block
public PartitionAssignmentStatusResult partitionAssignmentStatus(int requestId)    
public PartitionAssignmentStatusResult partitionAssignmentStatus(int requestId,  PartitionAssignmentStatusOptions options)

Where:

Code Block
public class PartitionAssignmentStatusResult {
   

...

 // private constructor

...


    public Map<TopicPartition, KafkaFuture<Void>> values()

...


    public KafkaFuture<Void> all()
}

public class PartitionAssignmentStatusResult {
    // private constructor  
    public Map<PartitionReplica, Progress> all()      
    /** The progress of moving the given partition */
    public KafkaFuture<Progress> progress(

...

PartitionReplica)
    /**
     * Whether the overall reassignment is complete,
     * that is, if all the values in the map returned by {@link #all()}
     * have complete() == true
     */
    public KafkaFuture<Boolean> complete()
}
 
/** Represents the reassignment of a partition to a broker. */
public class PartitionReplica {
    public String getTopic()
    public int getPartition()
    public int getBroker()
    // plus override equals() and hashCode()
}
 
public class Progress {
    /** The number of completed units of work. */
    long getNumerator()
    /** The total number of units of work. */
    Long getDenominator()
    /** The error code, if the task ended erroneously. */
    Errors getError()
}

The Progress class presented here can easily be reused in future APIs also dealing with progress of long running jobs. The result of getDenominator() is nullable because it cannot be expected that all tasks can efficiently compute their total size. The isComplete() method is provided to determine completeness since getNumerator() == getDenominator() cannot be used in the event of a null denominator.

In general, the denominator will not be constant between successive requests since new messages will arrive from producers. Indeed it is possible for a partition to receive messages from producers at a faster rate than the syncing replica is catching up, i.e. the percentage completion implied by numerator ÷ denominator can decrease.

For partition reassignment, Progress has sufficient information for a percent completion calculation, but not any kind of time-to-completion, since the AdminClient doesn't know the time that the initial request was made.

...

Authorization

With broker-mediated reassignment it becomes possible limit the authority to perform reassignment to something finer-grained than "anyone with access to zookeeper".

...

An AssignPartitionsRequest will initiate the process of partition reassignment

Code Block

...

AssignPartitionsRequest => [PartitionAssignment]

...


  PartitionAssignment => Topic Partition 

...

Brokers
    

...

Topic => string

...


    Partition =>int32

...


    

...

Brokers => [int32]

Where:

  • Topic a topic name
  • Partition a partition of that topic
  • Replicas Brokers the broker ids which will host the partition

...

It is not necessary to send an AssignPartitionsRequest to the leader for a given partition. Any broker will do.

...

The AssignPartitionsResponse enumerates those topics and partitions in the request, together with any error in initiating reassignment that partition:

Code Block
AssignPartitionsResponse => RequestId [PartitionAssignmentResult]

...


  RequestId => int32
  PartitionAssignmentResult => Topic Partition Broker Error

...


    Topic => string

...


    

...

Partition => int32

...


    Broker => int32
    Error => int16

...

Where:

  • RequestId is an id which can be used to make in a PartitionAssignmentStatusRequest
  • Topic is a topic name
  • Partition is a partition id
  • Broker is the broker
  • Error is an error code for why the initiation of assignment of the given topic to the given broker failed

The anticipated errors in the header The AssignPartitionsResponse enumerates those topics and partitions in the request, together with any error for reassigning that partition. The anticipated errors are:

  • INVALID_TOPIC_EXCEPTION (17) If the topic doesn't exist
  • INVALID_PARTITIONS (37) If the partition doesn't exist
  • UNKNOWN_MEMBER_ID (25) If the Replicas Brokers in the AssignPartitionsRequest included an unknown broker id

Network Protocol: PartitionAssignmentStatusRequest and PartitionAssignmentStatusResponse

To request progress information about the given request

Code Block
PartitionAssignmentStatusRequest => RequestId
  RequestId => int32

Where:

  • RequestId is the identifier from a previous AssignPartitionsResponse
Code Block
PartitionAssignmentStatusResponse => [RequestId] [PartitionReplica Progress]
  PartitionReplica => Topic Partition Broker
    Topic => string
    Partition => int32
    Broker => int32
  Progress => Error Numerator Denominator
    Error => int16
    Numerator => int64
    Denominator => int64

Where:

  • RequestId is the identifier from a previous AssignPartitionsResponse
  • Topic the a topic name
  • Partition a partition of the topic
  • Broker the id of the receiving broker
  • Error is an error code about the transfer of the partition to the given broker:
    • PartitionReassignmentInProgress (new) if the reassignment is still in progress,
    • None (0) if the reassignment completed normally,
    • Any other error if the reassignment completed exceptionally.
  • Numerator the number of items that have been processed so far
  • Denominator the total number of items we expect to process for the request. Or -1 if the total is not known. Never 0.

The anticipated errors in the header are:

  • INVALID_REQUEST (42) If the request id is not known

 As with the AdminClient API, the Progess Schema could be reused in other progress reporting APIs.

Implementation

  1. The mediating broker will write the reassignment JSON to the /admin/reassign_partitions znode, as present. The znode version for /admin/reassign_partitions becomes the request_id. This means that the request id is monotonic increasing, but that request id may not be contiguous. The monotonic increasing property is only required so that when the request id is omitted it is easy to figure out the last request.
  2. Receiving brokers are notified of the change to /admin/reassign_partitions (thus obtaining the request id) and start fetching replicas as necessary, as present
  3. As the leader serves fetch requests to followers, at the same time as deciding whether to update the ISR, it will publish their progress as JSON to /admin/reassign_partitions_status/$request_id
  4. To obtain progress the mediating broker will read the /admin/reassign_partitions_status/$request_id znodes.
  5. When the leader adds a syncing broker to the ISR the progress JSON for that partition in /admin/reassign_partitions_status/$request_id will be marked as complete.

It will be necessary to delete the /admin/reassign_partitions_status/$request_id znodes to prevent the /admin/reassign_partitions_status subtree growing arbitrarily. At the same time the leader publishes status to the current /admin/reassign_partitions_status/$request_id it will remove znodes which had completed a given amount of time ago. This will be configurable via the broker property reassign.partitions.status.min.retention.ms which will default to 1 day.

Compatibility, Deprecation, and Migration Plan

...