Comment: Update num.replica.alter.log.dirs.threads config

...


Status

Current state: Completed

Discussion thread: here

JIRA: KAFKA-5163 and KAFKA-5694

Released: 1.1.0

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

...

The idea is that the user can send an AlterReplicaDirRequest, which tells the broker to move the topicPartition directory (which contains all log segments of the topicPartition replica) from the source log directory to a destination log directory. The broker can create a new directory with the .move postfix on the destination log directory to hold all log segments of the replica. This allows the broker to tell log segments of the replica on the destination log directory from log segments of the replica on the source log directory during broker startup. The broker can create new log segments for the replica on the destination log directory, push data from the source log to the destination log, and replace the source log with the destination log for this replica once the new log has caught up.
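The directory handling described above can be sketched with plain file operations. This is a minimal illustration, not the broker's implementation: the class and method names are hypothetical, the "log directories" are temporary directories, and the sketch copies a static set of files, whereas the broker keeps copying until the new log has caught up before swapping.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of moving a replica directory via a ".move" staging directory.
final class ReplicaMoveSketch {
    /** Copies srcLogDir/tp into dstLogDir/tp.move, renames it to dstLogDir/tp, then deletes the source. */
    static Path moveReplica(Path srcLogDir, Path dstLogDir, String tp) throws IOException {
        Path source = srcLogDir.resolve(tp);
        Path staging = dstLogDir.resolve(tp + ".move"); // the .move postfix marks an unfinished copy
        Path target = dstLogDir.resolve(tp);
        Files.createDirectories(staging);
        try (DirectoryStream<Path> segments = Files.newDirectoryStream(source)) {
            for (Path segment : segments) {
                Files.copy(segment, staging.resolve(segment.getFileName()), StandardCopyOption.REPLACE_EXISTING);
            }
        }
        Files.move(staging, target); // swap: the staged copy becomes the replica directory
        try (DirectoryStream<Path> segments = Files.newDirectoryStream(source)) {
            for (Path segment : segments) {
                Files.delete(segment);
            }
        }
        Files.delete(source);
        return target;
    }

    /** Moves one fake segment between two temporary "log directories" and checks the outcome. */
    static boolean demo() {
        try {
            Path src = Files.createTempDirectory("logdir-src");
            Path dst = Files.createTempDirectory("logdir-dst");
            Path replica = Files.createDirectories(src.resolve("topic1-0"));
            Files.write(replica.resolve("00000000000000000000.log"), "data".getBytes(StandardCharsets.UTF_8));
            Path moved = moveReplica(src, dst, "topic1-0");
            return Files.exists(moved.resolve("00000000000000000000.log"))
                && !Files.exists(replica)
                && !Files.exists(dst.resolve("topic1-0.move"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the staging directory lives on the destination log directory, the final rename stays within one file system, which is what makes the swap cheap in this sketch.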

...

1. Initiate replica movement using AlterReplicaDirRequest

The user uses kafka-reassign-partitions.sh to send AlterReplicaDirRequest to the broker to initiate replica movement between its log directories. The flow graph below illustrates how the broker handles AlterReplicaDirRequest.

Notes:
- The broker will cancel the existing movement of the replica if "any" is specified as the destination log directory.
- If the broker does not already have the replica created for the specified topicPartition when it receives AlterReplicaDirRequest, it will reply with ReplicaNotAvailableException AND remember the (replica, destination log directory) pair in memory, so that it can create the replica in the specified log directory when it receives LeaderAndIsrRequest later.
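The two notes above can be read as a small state machine. The sketch below is an assumption-laden illustration rather than the broker's code: the class, the enum values and the string keys for replicas are all hypothetical, and only the behaviour stated in the notes is modelled.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of how a broker could react to AlterReplicaDirRequest.
final class AlterReplicaDirHandlerSketch {
    enum Result { MOVE_STARTED, MOVE_CANCELLED, REPLICA_NOT_AVAILABLE }

    private final Set<String> existingReplicas = new HashSet<>();
    private final Map<String, String> movesInProgress = new HashMap<>();
    // (replica, destination log directory) pairs remembered until LeaderAndIsrRequest arrives
    private final Map<String, String> pendingDirs = new HashMap<>();

    Result handle(String topicPartition, String destinationDir) {
        if (!existingReplicas.contains(topicPartition)) {
            pendingDirs.put(topicPartition, destinationDir);
            return Result.REPLICA_NOT_AVAILABLE;
        }
        if ("any".equals(destinationDir)) {
            movesInProgress.remove(topicPartition); // "any" cancels an existing movement
            return Result.MOVE_CANCELLED;
        }
        movesInProgress.put(topicPartition, destinationDir);
        return Result.MOVE_STARTED;
    }

    /** Returns the remembered directory for the new replica, or null if none was requested. */
    String onLeaderAndIsr(String topicPartition) {
        existingReplicas.add(topicPartition);
        return pendingDirs.remove(topicPartition);
    }
}
```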

...

Here we describe how a broker moves a Log from the source to the destination log directory and swaps the Log. This corresponds to the "Initiate replica data movement" box in the flow graph above. Note that the broker responds to AlterReplicaDirRequest with MoveInProgress after step 1) described below.

...

- If both the directory topicPartition and the directory topicPartition.move exist on good log directories, the broker will start ReplicaMoveThread to copy data from topicPartition to topicPartition.move. The effect is the same as if the broker had received an AlterReplicaDirRequest to move the replica from topicPartition to topicPartition.move.
- If topicPartition.move exists but topicPartition doesn't exist on any good log directory, and if there is no bad log directory, then the broker renames topicPartition.move to topicPartition.
- If topicPartition.move exists but topicPartition doesn't exist on any good log directory, and if there is a bad log directory, then the broker considers topicPartition offline and does not touch topicPartition.move.
- If topicPartition.delete exists, the broker schedules topicPartition.delete for asynchronous deletion.
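The startup rules in the list above reduce to a small decision function. The following is a sketch only; the class, enum and parameter names are hypothetical and stand in for whatever the broker actually inspects at startup.

```java
// Hypothetical encoding of the startup rules for topicPartition and topicPartition.move.
final class StartupRecoverySketch {
    enum Action { RESUME_MOVE, RENAME_MOVE_TO_PRIMARY, MARK_OFFLINE, NONE }

    /**
     * primaryExists: topicPartition exists on a good log directory.
     * moveExists:    topicPartition.move exists on a good log directory.
     * anyBadLogDir:  at least one log directory of the broker is bad.
     */
    static Action decide(boolean primaryExists, boolean moveExists, boolean anyBadLogDir) {
        if (!moveExists) {
            return Action.NONE;               // no movement was in flight
        }
        if (primaryExists) {
            return Action.RESUME_MOVE;        // restart copying into topicPartition.move
        }
        // The source may live on the bad directory, so the .move copy cannot be trusted as complete.
        return anyBadLogDir ? Action.MARK_OFFLINE : Action.RENAME_MOVE_TO_PRIMARY;
    }

    /** Directories with the .delete postfix are scheduled for asynchronous deletion. */
    static boolean shouldScheduleDelete(String directoryName) {
        return directoryName.endsWith(".delete");
    }
}
```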

...

The idea is that the user should be able to specify the log directory when using kafka-reassign-partitions.sh to reassign partitions. If the user has specified a log directory on the destination broker, the script should send AlterReplicaDirRequest directly to the broker so that the broker can start ReplicaMoveThread to move the replica. Finally, the script should send DescribeLogDirsRequest to the broker to verify that the replica has been created/moved in the specified log directory when the user requests to verify the assignment.

...

- The user specifies a list of log directories, one log directory per replica, for each topic partition in the reassignment json file that is provided to kafka-reassign-partitions.sh. The log directory specified by the user must be either "any" or an absolute path which begins with '/'. If "any" is specified as the log directory, the broker is free to choose any log directory to place the replica. The current broker implementation will select the log directory using a round-robin algorithm by default. See the Scripts section for the format of this json file.
- The script sends AlterReplicaDirRequest to those brokers which need to move replicas to a user-specified log directory. This step can be skipped if the user has specified "any" as the log directory for all replicas. The script exits with an error if the broker that should receive AlterReplicaDirRequest is offline or if the AlterReplicaDirResponse contains any error that is not ReplicaNotAvailableException.
- The script creates the reassignment znode in zookeeper.
- The script retries AlterReplicaDirRequest to those brokers which previously responded with ReplicaNotAvailableException in the AlterReplicaDirResponse. The script keeps retrying up to a user-specified timeout. The timeout is 10 seconds by default. The script exits with an error if the broker that should receive AlterReplicaDirRequest is offline or if the AlterReplicaDirResponse contains any error that is not ReplicaNotAvailableException.
- The script returns result to user.
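The retry step in the list above can be sketched as a loop that re-sends only the replicas that reported ReplicaNotAvailableException, until either none remain or the timeout expires. Everything named here (the class, the Sender interface, the Error enum, the injected clock) is hypothetical; a real script would also pause between attempts.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.LongSupplier;

// Hypothetical retry loop for AlterReplicaDirRequest, driven by an injected clock.
final class AlterReplicaDirRetrySketch {
    enum Error { NONE, REPLICA_NOT_AVAILABLE, OTHER }

    /** Stand-in for sending one request and reading the per-replica error from the response. */
    interface Sender {
        Error send(String replica, String logDir);
    }

    /** Returns the replicas that are still not available at the timeout; throws on any other error. */
    static Set<String> sendWithRetry(Map<String, String> assignment, Sender sender,
                                     LongSupplier clockMs, long timeoutMs) {
        Map<String, String> remaining = new HashMap<>(assignment);
        long deadline = clockMs.getAsLong() + timeoutMs;
        while (true) {
            Map<String, String> retry = new HashMap<>();
            for (Map.Entry<String, String> entry : remaining.entrySet()) {
                Error error = sender.send(entry.getKey(), entry.getValue());
                if (error == Error.OTHER) {
                    throw new IllegalStateException("Failed to alter log dir of " + entry.getKey());
                }
                if (error == Error.REPLICA_NOT_AVAILABLE) {
                    retry.put(entry.getKey(), entry.getValue());
                }
            }
            remaining = retry;
            if (remaining.isEmpty() || clockMs.getAsLong() >= deadline) {
                return new HashSet<>(remaining.keySet());
            }
        }
    }
}
```

Passing the clock in as a parameter is a design choice of this sketch: it keeps the timeout logic testable without sleeping.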

...

kafka-reassign-partitions.sh will verify partition assignment across brokers as it does now.
- For those replicas with destination log directory != "any", kafka-reassign-partitions.sh groups those replicas according to their brokers and sends DescribeLogDirsRequest to those brokers. The DescribeLogDirsRequest should provide the log directories and partitions specified in the expected assignment.
- The broker replies with a DescribeLogDirsResponse which shows the current log directory for each partition specified in the DescribeLogDirsRequest.
- kafka-reassign-partitions.sh determines whether the replica has been moved to the specified log directory based on the DescribeLogDirsResponse.
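The verification steps above amount to grouping the expected placements by broker and comparing them with what each broker reports. The sketch below assumes a hypothetical Placement type and plain maps in place of the real request and response objects.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical verification helper: group expected placements by broker, then compare.
final class VerifyLogDirsSketch {
    /** One expected placement; stands in for an entry of the reassignment json file. */
    static final class Placement {
        final String topicPartition;
        final int broker;
        final String logDir;

        Placement(String topicPartition, int broker, String logDir) {
            this.topicPartition = topicPartition;
            this.broker = broker;
            this.logDir = logDir;
        }
    }

    /** Groups placements whose log directory is not "any": broker -> (topicPartition -> expected dir). */
    static Map<Integer, Map<String, String>> groupByBroker(List<Placement> expected) {
        Map<Integer, Map<String, String>> byBroker = new TreeMap<>();
        for (Placement p : expected) {
            if ("any".equals(p.logDir)) {
                continue; // nothing to verify: the broker was free to choose
            }
            byBroker.computeIfAbsent(p.broker, b -> new TreeMap<>()).put(p.topicPartition, p.logDir);
        }
        return byBroker;
    }

    /** True if the directory reported by the broker equals the expected one for every partition. */
    static boolean moved(Map<String, String> expectedDirs, Map<String, String> currentDirs) {
        for (Map.Entry<String, String> entry : expectedDirs.entrySet()) {
            if (!entry.getValue().equals(currentDirs.get(entry.getKey()))) {
                return false;
            }
        }
        return true;
    }
}
```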

3) How to retrieve information to determine the new replica assignment across log directories

...

In order to optimize replica assignment across log directories, the user would need to figure out the list of partitions per log directory and the size of each partition. As of now, Kafka doesn't expose this information via any RPC, and the user would need to either query the JMX metrics of the broker or use external tools to log onto each machine to get this information. While it is possible to retrieve this information via JMX, users would have to manage the JMX port and related credentials. It is better if Kafka can expose this information via RPC.

Solution:

We introduce DescribeLogDirsRequest and DescribeLogDirsResponse. When a broker receives a DescribeLogDirsRequest with an empty list of log directories, it will respond with a DescribeLogDirsResponse which shows the size of each partition and the list of partitions per log directory for all log directories. If the user has specified a list of log directories in the DescribeLogDirsRequest, the broker will provide the above information only for the log directories specified by the user. If the user has specified an empty list of topics in the DescribeLogDirsRequest, all topics will be queried and included in the response. Otherwise, only those topics specified in the DescribeLogDirsRequest will be queried. A non-zero error code will be specified in the DescribeLogDirsResponse for each log directory that is either offline or not found by the broker.
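The "empty list means everything" filtering described above can be sketched as follows. The class name and the in-memory state (log directory -> "topic-partition" -> size in bytes) are hypothetical, and the sketch does not model the per-directory error code for offline or unknown directories.

```java
import java.util.Collection;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical filter: an empty collection of log dirs or topics selects all of them.
final class DescribeLogDirsFilterSketch {
    static Map<String, Map<String, Long>> describe(Map<String, Map<String, Long>> state,
                                                   Collection<String> logDirs,
                                                   Collection<String> topics) {
        Map<String, Map<String, Long>> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, Long>> dir : state.entrySet()) {
            if (!logDirs.isEmpty() && !logDirs.contains(dir.getKey())) {
                continue; // directory filtered out
            }
            Map<String, Long> partitions = new TreeMap<>();
            for (Map.Entry<String, Long> partition : dir.getValue().entrySet()) {
                // "topic1-0" -> "topic1": the partition id follows the last '-'
                String topic = partition.getKey().substring(0, partition.getKey().lastIndexOf('-'));
                if (topics.isEmpty() || topics.contains(topic)) {
                    partitions.put(partition.getKey(), partition.getValue());
                }
            }
            result.put(dir.getKey(), partitions);
        }
        return result;
    }
}
```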

...

Public interface

Protocol

Create AlterReplicaDirRequest

 

Code Block
AlterReplicaDirRequest => topics
  topics => [AlterReplicaDirRequestTopic]

AlterReplicaDirRequestTopic => topic partitions
  topic => str
  partitions => [AlterReplicaDirRequestPartition]

AlterReplicaDirRequestPartition => partition log_dir
  partition => int32
  log_dir => str

 

Create AlterReplicaDirResponse

 

Code Block
AlterReplicaDirResponse => topics
  topics => [AlterReplicaDirResponseTopic]

AlterReplicaDirResponseTopic => topic partitions
  topic => str
  partitions => [AlterReplicaDirResponsePartition]

AlterReplicaDirResponsePartition => partition error_code
  partition => int32
  error_code => int16

Create DescribeLogDirsRequest

Code Block
DescribeLogDirsRequest => topics
  topics => DescribeLogDirsRequestTopic // If this is empty, all topics will be queried
 
DescribeLogDirsRequestTopic => topic partitions
  topic => str
  partitions => [int32]

Create DescribeLogDirsResponse

Code Block
DescribeLogDirsResponse => log_dirs
  log_dirs => [DescribeLogDirsResponseDirMetadata]

DescribeLogDirsResponseDirMetadata => error_code path topics
  error_code => int16
  path => str
  topics => [DescribeLogDirsResponseTopic]

DescribeLogDirsResponseTopic => topic partitions
  topic => str
  partitions => [DescribeLogDirsResponsePartition]

DescribeLogDirsResponsePartition => partition size offset_lag is_temporary
  partition => int32
  size => int64
  offset_lag => int64  // If this is not a temporary replica, then offset_lag = max(0, HW - LEO). Otherwise, offset_lag = primary Replica's LEO - temporary Replica's LEO
  is_temporary => boolean  // True if replica is *.move
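The offset_lag rule stated in the comment of the partition entry above can be written out as a small function. This is a sketch under the assumption that the high watermark (HW) and log end offsets (LEO) are available as plain longs; the class and parameter names are hypothetical.

```java
// Hypothetical helper that spells out the offset_lag rule from the response schema.
final class OffsetLagSketch {
    /**
     * For a regular replica:      max(0, HW - LEO).
     * For a temporary (*.move):   primary replica's LEO - temporary replica's LEO.
     */
    static long offsetLag(boolean isTemporary, long highWatermark, long logEndOffset, long primaryLogEndOffset) {
        if (isTemporary) {
            return primaryLogEndOffset - logEndOffset;
        }
        return Math.max(0L, highWatermark - logEndOffset);
    }
}
```

For example, a temporary replica with LEO 70 whose primary replica has LEO 100 reports a lag of 30, which lets the user track how far the intra-broker movement still has to go.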

Broker Config

1) Add config intra.broker.throttled.rate. This config specifies the maximum rate, in bytes per second, that can be used to move replicas between log directories. This config defaults to MAX_LONG. The intra.broker.throttled.rate is per-broker, and the specified capacity is shared by all replica-movement threads.

2) Add config num.replica.alter.log.dirs.threads. This config specifies the number of threads in ReplicaMoveThreadPool. The threads in this thread pool are responsible for moving replicas between log directories. This config defaults to the number of log directories. Note that we typically expect a 1-1 mapping between log directories and disks. Thus, setting the config to the number of log directories by default provides a reasonable way to keep the movement capacity in proportion to the number of disks.
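To make the two configs concrete, the sketch below shows one way a shared byte-rate limit and the default thread count could be computed. It is an illustration only: the class and method names are hypothetical, and the pause calculation is a simple rate check, not necessarily the throttling mechanism the broker uses.

```java
import java.util.List;

// Hypothetical helpers illustrating intra.broker.throttled.rate and the default thread count.
final class IntraBrokerThrottleSketch {
    /** Default for num.replica.alter.log.dirs.threads as described: one thread per log directory. */
    static int defaultThreads(List<String> logDirs) {
        return logDirs.size();
    }

    /**
     * Milliseconds to pause so that bytesMoved over elapsedMs stays within rateBytesPerSec.
     * bytesMoved is the total across all movement threads, since the capacity is shared per broker.
     */
    static long pauseMs(long bytesMoved, long elapsedMs, long rateBytesPerSec) {
        if (rateBytesPerSec == Long.MAX_VALUE) {
            return 0L; // default MAX_LONG: effectively unthrottled
        }
        long minimumMs = (long) Math.ceil(bytesMoved * 1000.0 / rateBytesPerSec);
        return Math.max(0L, minimumMs - elapsedMs);
    }
}
```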

...

./bin/kafka-log-dirs.sh --describe --zookeeper localhost:2181 --broker 1 --log-dirs dir1,dir2,dir3 --topics topic1,topic2 will show the list of partitions and their sizes per log directory for the specified topics and the specified log directories on the broker. If no log directory is specified by the user, then all log directories will be queried. If no topic is specified, then all topics will be queried. If a log directory is offline, then its error code in the DescribeLogDirsResponse will indicate the error and the log directory will be marked offline in the script output.

...

 

Code Block
{
  "version" : 1,
  "log_dirs" : [
    {
      "is_live" : boolean,
      "path" : str,
      "partitions": [
        {
          "topic" : str,
          "partition" : int32,
          "size" : int64,
          "offset_lag" : int64,
          "is_temporary" : boolean
        },
        ...
      ]
    },

    ...
  ]
}



...

3) Add optional argument --timeout to kafka-reassign-partitions.sh. This is because kafka-reassign-partitions.sh may need to re-send AlterReplicaDirRequest to the broker if the replica hasn't already been created there. The timeout is set to 10 seconds by default.


AdminClient

 The following methods and classes are added.

Code Block
public interface AdminClient extends AutoCloseable { 

    /**
     * Query the log directory information for the specified log directories on the given brokers.
     * All log directories on a broker are queried if an empty collection of log directories is specified for
     * this broker.
     *
     * This operation is supported by brokers with version 0.11.1.0 or higher.
     *
     * @param logDirsByBroker     A list of log dirs per broker
     * @param options             The options to use when querying log dir info
     * @return                    The DescribeLogDirsResult
     */
    public abstract DescribeLogDirsResult describeLogDirs(Map<Integer, Collection<String>> logDirsByBroker, DescribeLogDirsOptions options);
 
    /**
     * Query the replica directory information for the specified replicas.
     *
     * This operation is supported by brokers with version 0.11.1.0 or higher.
     *
     * @param replicas      The replicas to query
     * @param options       The options to use when querying replica dir info
     * @return              The DescribeReplicaLogDirsResult
     */
    public abstract DescribeReplicaLogDirsResult describeReplicaLogDirs(Collection<TopicPartitionReplica> replicas, DescribeReplicaLogDirsOptions options);
}
 
 
public class KafkaAdminClient extends AdminClient {
    /**
     * Alter the log directory for the specified replicas.
     *
     * Updates are not transactional so they may succeed for some resources while failing for others. The log directory for
     * a particular replica is updated atomically.
     *
     * This operation is supported by brokers with version 0.11.1.0 or higher.
     *
     * @param replicaAssignment  The replicas with their log directory absolute path
     * @param options            The options to use when changing replica dir
     * @return                   The AlterReplicaDirResult
     */
    public AlterReplicaDirResult alterReplicaDir(Map<TopicPartitionReplica, String> replicaAssignment, AlterReplicaDirOptions options);
 
}


/**
  * Options for the alterReplicaDir call.
  */
class AlterReplicaDirOptions {
    private Integer timeoutMs = null;
    public AlterReplicaDirOptions timeoutMs(Integer timeoutMs);
    public Integer timeoutMs();
}
 
/**
  * The result of the alterReplicaDir call.
  */
class AlterReplicaDirResult {
    /**
     * Return a map from replica to futures, which can be used to check the status of individual replica movement.
     */
    public Map<TopicPartitionReplica, KafkaFuture<Void>> values();

    /**
     * Return a future which succeeds if all the replica movement have succeeded
     */
    public KafkaFuture<Void> all();
}
 
/**
  * Options for the describeLogDirs call.
  */
class DescribeLogDirsOptions {
    private Integer timeoutMs = null;
    public DescribeLogDirsOptions timeoutMs(Integer timeoutMs);
    public Integer timeoutMs();
}
 
/**
  * The result of the describeLogDirs call.
  */
class DescribeLogDirsResult {
    /**
     * Return a map from brokerId to futures which can be used to check the information of partitions on each individual broker
     */
    public Map<Integer, KafkaFuture<Map<String, LogDirInfo>>> values();

    /**
     * Return a future which succeeds only if all the brokers have responded without error
     */
    public KafkaFuture<Map<Integer, Map<String, LogDirInfo>>> all();
}
 
/**
  * Description of a log directory
  */
class LogDirInfo {
    public final Errors error;
    public final Map<TopicPartition, ReplicaInfo> replicaInfos;
}
 
/**
  * Description of a replica
  */
public class ReplicaInfo {
    public final long size;
    public final long logEndOffset;
    public final boolean isTemporary;
}
 
/**
  * Options for the describeReplicaLogDirs call.
  */
class DescribeReplicaLogDirsOptions {
    private Integer timeoutMs = null;
    public DescribeReplicaLogDirsOptions timeoutMs(Integer timeoutMs);
    public Integer timeoutMs();
}
 
/**
  * The result of the describeReplicaLogDirs call.
  */
class DescribeReplicaLogDirsResult {
    /**
     * Return a map from replica to futures which can be used to check the log directory information of individual replicas
     */
    public Map<TopicPartitionReplica, KafkaFuture<ReplicaDirInfo>> values();

    /**
     * Return a future which succeeds if log directory information of all replicas is available
     */
    public KafkaFuture<Map<TopicPartitionReplica, ReplicaDirInfo>> all();
}

/**
  * Log directory information of a given replica and its intra-broker movement progress
  */
class ReplicaDirInfo {
    public String currentReplicaDir;
    public String temporaryReplicaDir;
    public long temporaryReplicaOffsetLag;
}
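The values()/all() pairing used by the result classes above can be illustrated with the JDK's CompletableFuture standing in for KafkaFuture. This is a sketch under that assumption: the class name is hypothetical, replicas are keyed by plain strings, and the real result classes may combine futures differently.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical result type: per-replica futures plus one future that succeeds only if all succeed.
final class AlterReplicaDirResultSketch {
    private final Map<String, CompletableFuture<Void>> futures;

    AlterReplicaDirResultSketch(Map<String, CompletableFuture<Void>> futures) {
        this.futures = futures;
    }

    /** Per-replica futures, used to check the status of an individual replica movement. */
    Map<String, CompletableFuture<Void>> values() {
        return futures;
    }

    /** Completes normally once every per-replica future has, and exceptionally if any of them failed. */
    CompletableFuture<Void> all() {
        return CompletableFuture.allOf(futures.values().toArray(new CompletableFuture[0]));
    }
}
```

A caller that only needs an overall success signal can wait on all(), while a caller that wants to report which replica failed inspects values() instead.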

...

- Validated client/cluster state for topic1. 

Rejected Alternatives

  

1) Write the replica -> log directory mapping in the reassignment znode and have controller send AlterReplicaDirRequest to brokers.

This alternative solution has a few drawbacks:
- There can be use-cases where we only want to rebalance the load of log directories on a given broker. It seems unnecessary to go through controller in this case.
- If the controller is responsible for sending AlterReplicaDirRequest, and if the user-specified log directory is either invalid or offline, then the controller probably needs a way to tell the user that the partition reassignment has failed. We currently don't have a way to do this, since kafka-reassign-partitions.sh simply creates the reassignment znode without waiting for a response. I am not sure there is a good solution to this.
- If the controller is responsible for sending AlterReplicaDirRequest, the controller logic would be more complicated, because the controller needs to first send AlterReplicaDirRequest so that the broker memorizes the partition -> log directory mapping, then send LeaderAndIsrRequest, and keep sending AlterReplicaDirRequest (just in case the broker restarted) until the replica is created. Note that the last step needs retries and a timeout, as proposed in KIP-113.

Overall, this alternative adds quite a bit of complexity to the controller, and we probably want to do this only if there is a strong and clear benefit to doing so. Currently in KIP-113, kafka-reassign-partitions.sh is responsible for sending AlterReplicaDirRequest with retries and reports an error to the user if it either fails or times out. This seems much simpler, and the user shouldn't care whether it is done through the controller.