Authors Satish Duggana, Sriharsha Chintalapani, Ying Zheng, Suresh Srinivas
Status
Current State: "Accepted"
...
1) Retrieve the Earliest Local Offset (ELO) and the corresponding leader epoch (ELO-LE) from the leader with a ListOffset request (timestamp = -4)
2) Truncate local log and local auxiliary state
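The steps above can be sketched with a small illustration of step 2; `EpochCache` here is a hypothetical stand-in for the follower's leader-epoch checkpoint state, not a Kafka class:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the follower's leader-epoch checkpoint cache.
class EpochCache {
    // epoch -> start offset of that epoch
    private final TreeMap<Integer, Long> epochs = new TreeMap<>();

    void assign(int epoch, long startOffset) {
        epochs.put(epoch, startOffset);
    }

    // Truncate auxiliary state for offsets below the Earliest Local Offset (ELO):
    // drop epochs that ended before ELO-LE and clamp the ELO-LE entry so its
    // start offset is not below ELO.
    void truncateFromStart(long elo, int eloLeaderEpoch) {
        epochs.headMap(eloLeaderEpoch, false).clear();
        epochs.computeIfPresent(eloLeaderEpoch, (e, start) -> Math.max(start, elo));
    }

    Map<Integer, Long> snapshot() {
        return new TreeMap<>(epochs);
    }
}

public class FollowerTruncationSketch {
    public static void main(String[] args) {
        EpochCache cache = new EpochCache();
        cache.assign(0, 0L);
        cache.assign(1, 100L);
        cache.assign(2, 250L);
        // Leader reports ELO = 180 with leader epoch 1 (ELO-LE = 1).
        cache.truncateFromStart(180L, 1);
        System.out.println(cache.snapshot()); // {1=180, 2=250}
    }
}
```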
...
This API is enhanced to support a new target timestamp value, -4, called EARLIEST_LOCAL_TIMESTAMP. No new fields are added to the request and response schemas, but there will be a version bump to indicate the update. This request is about the offset from which followers should start fetching to replicate the local logs. It represents the log start offset available in local log storage, also called the local-log-start-offset. All records earlier than this offset can be considered copied to remote storage. Follower replicas use this to avoid fetching records that are already copied to remote tier storage.
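A minimal sketch of how the sentinel target timestamps resolve to offsets (a simplified resolver for illustration, not the broker's implementation; the constant names mirror those in ListOffsetsRequest):

```java
public class ListOffsetsSentinels {
    // Sentinel target timestamps used by ListOffsets.
    public static final long LATEST_TIMESTAMP = -1L;
    public static final long EARLIEST_TIMESTAMP = -2L;
    // New in this KIP: resolves to the local-log-start-offset.
    public static final long EARLIEST_LOCAL_TIMESTAMP = -4L;

    // Simplified resolution for a partition with the given offsets.
    static long resolve(long target, long logStartOffset, long localLogStartOffset, long logEndOffset) {
        if (target == LATEST_TIMESTAMP) return logEndOffset;
        if (target == EARLIEST_TIMESTAMP) return logStartOffset;          // includes tiered data
        if (target == EARLIEST_LOCAL_TIMESTAMP) return localLogStartOffset; // local segments only
        throw new IllegalArgumentException("timestamp-based lookup not sketched here");
    }

    public static void main(String[] args) {
        // Partition where offsets [0, 1000) are tiered and [1000, 5000) are local.
        System.out.println(resolve(EARLIEST_TIMESTAMP, 0L, 1000L, 5000L));       // 0
        System.out.println(resolve(EARLIEST_LOCAL_TIMESTAMP, 0L, 1000L, 5000L)); // 1000
    }
}
```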
...
remote.log.metadata.topic.replication.factor | Replication factor of the topic. Default: 3 |
remote.log.metadata.topic.num.partitions | Number of partitions of the topic. Default: 50 |
remote.log.metadata.topic.retention.ms | Retention of the topic in milliseconds. Default: -1, which means unlimited. Users can configure this value based on their use cases. To avoid any data loss, this value should be more than the maximum retention period of any topic enabled with tiered storage in the cluster. |
remote.log.metadata.manager.listener.name | Listener name to be used to connect to the local broker by the RemoteLogMetadataManager implementation on the broker. This is a mandatory config while using the default RLMM implementation, which is `org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManager`. The respective endpoint address is passed with the "bootstrap.servers" property while invoking RemoteLogMetadataManager#configure(Map<String, ?> props). This is used by Kafka clients created in the RemoteLogMetadataManager implementation. |
remote.log.metadata.* | The default RLMM implementation creates producer and consumer instances. Common client properties can be configured with the `remote.log.metadata.common.client.` prefix. Users can also pass properties specific to the producer/consumer with the `remote.log.metadata.producer.` and `remote.log.metadata.consumer.` prefixes; these override properties with the `remote.log.metadata.common.client.` prefix. Any other properties should be prefixed with the value of the config "remote.log.metadata.manager.impl.prefix", whose default value is "rlmm.config.". These configs will be passed to RemoteLogMetadataManager#configure(Map<String, ?> props). For example: "rlmm.config.remote.log.metadata.producer.batch.size=100" will set batch.size=100 for the producer created in the default RLMM implementation. |
remote.partition.remover.task.interval.ms | The interval at which the remote partition remover runs to delete the remote storage of the partitions marked for deletion. Default value: 3600000 (1 hour) |
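The prefix handling described above can be sketched as follows (illustrative only; the real default RLMM does this internally):

```java
import java.util.HashMap;
import java.util.Map;

public class RlmmPrefixSketch {
    static final String COMMON = "remote.log.metadata.common.client.";
    static final String PRODUCER = "remote.log.metadata.producer.";

    // Build producer properties: common client props are applied first,
    // then producer-specific props override them.
    static Map<String, String> producerProps(Map<String, String> rlmmConfig) {
        Map<String, String> props = new HashMap<>();
        rlmmConfig.forEach((k, v) -> {
            if (k.startsWith(COMMON)) props.put(k.substring(COMMON.length()), v);
        });
        rlmmConfig.forEach((k, v) -> {
            if (k.startsWith(PRODUCER)) props.put(k.substring(PRODUCER.length()), v);
        });
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> cfg = new HashMap<>();
        cfg.put(COMMON + "retries", "5");
        cfg.put(COMMON + "batch.size", "16384");
        cfg.put(PRODUCER + "batch.size", "100"); // overrides the common value
        System.out.println(producerProps(cfg).get("batch.size")); // 100
        System.out.println(producerProps(cfg).get("retries"));    // 5
    }
}
```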
...
System-Wide | remote.log.storage.system.enable - Whether to enable tiered storage functionality in a broker or not. Valid values are `true` or `false`; the default value is false. This property provides backward compatibility. When it is true, the broker starts all the services required for tiered storage. remote.log.storage.manager.class.name - This is mandatory if remote.log.storage.system.enable is set to true. remote.log.metadata.manager.class.name (optional) - This is an optional property. If this is not configured, Kafka uses an inbuilt metadata manager backed by an internal topic. |
RemoteStorageManager | (These configs are dependent on remote storage manager implementation) remote.log.storage.* |
RemoteLogMetadataManager | (These configs are dependent on remote log metadata manager implementation) remote.log.metadata.* |
Remote log manager related configuration. | remote.log.index.file.cache.total.size.mb remote.log.manager.thread.pool.size remote.log.manager.task.interval.ms Remote log manager tasks are retried with the exponential backoff algorithm mentioned here. remote.log.manager.task.retry.backoff.ms remote.log.manager.task.retry.backoff.max.ms remote.log.manager.task.retry.jitter remote.log.reader.threads remote.log.reader.max.pending.tasks |
Per Topic Configuration | Users can set the desired value for the remote.storage.enable property on a topic; the default value is false. To enable tiered storage for a topic, set remote.storage.enable to true. This config cannot be disabled once it is enabled; that capability will be provided in future versions. The retention configs below are similar to the log retention configs and determine how long log segments are retained in local storage. The existing retention.* configs are retention configs for the topic partition, covering both local and remote storage. local.retention.ms local.retention.bytes |
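The relationship between the retention configs can be sketched as a simple validation, assuming the KIP's convention that a local retention value of -2 means "inherit the corresponding retention.* value" (a sketch, not broker code):

```java
public class LocalRetentionSketch {
    // Resolve the effective local retention: -2 inherits the total retention.
    static long effectiveLocalRetentionMs(long localRetentionMs, long retentionMs) {
        if (localRetentionMs == -2L) return retentionMs;
        // Local retention must not exceed the total (local + remote) retention.
        if (retentionMs != -1L && localRetentionMs > retentionMs)
            throw new IllegalArgumentException("local.retention.ms must not exceed retention.ms");
        return localRetentionMs;
    }

    public static void main(String[] args) {
        System.out.println(effectiveLocalRetentionMs(-2L, 604800000L));      // inherits 7 days
        System.out.println(effectiveLocalRetentionMs(86400000L, 604800000L)); // 1 day locally
    }
}
```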
...
`RemoteStorageManager` is an interface to provide the lifecycle of remote log segments and indexes. More details about how we arrived at this interface are discussed in the document. We will provide a simple implementation of RSM to get a better understanding of the APIs. HDFS and S3 implementations are planned to be hosted in external repos and will not be part of the Apache Kafka repo. This is in line with the approach taken for Kafka connectors.
Copying and deleting APIs are expected to be idempotent, so plugin implementations can retry safely, overwrite any partially copied content, and avoid failing when content is already deleted.
```java
package org.apache.kafka.server.log.remote.storage;
...
/**
 * RemoteStorageManager provides the lifecycle of remote log segments that includes copy, fetch, and delete from remote
 * storage.
 * <p>
 * Each upload or copy of a segment is initiated with {@link RemoteLogSegmentMetadata} containing {@link RemoteLogSegmentId}
 * which is universally unique even for the same topic partition and offsets.
 * <p>
 * RemoteLogSegmentMetadata is stored in {@link RemoteLogMetadataManager} before and after copy/delete operations on
 * RemoteStorageManager with the respective {@link RemoteLogSegmentState}. {@link RemoteLogMetadataManager} is
 * responsible for storing and fetching metadata about the remote log segments in a strongly consistent manner.
 * This allows RemoteStorageManager to store segments even in an eventually consistent manner as the metadata is already
 * stored in a consistent store.
 * <p>
 * All these APIs are still evolving.
 */
@InterfaceStability.Unstable
public interface RemoteStorageManager extends Configurable, Closeable {

    /**
     * Type of the index file.
     */
    enum IndexType {
        /** Represents offset index. */
        Offset,
        /** Represents timestamp index. */
        Timestamp,
        /** Represents producer snapshot index. */
        ProducerSnapshot,
        /** Represents transaction index. */
        Transaction,
        /** Represents leader epoch index. */
        LeaderEpoch,
    }

    /**
     * Copies the given {@link LogSegmentData} provided for the given {@code remoteLogSegmentMetadata}. This includes
     * the log segment and its auxiliary indexes like offset index, time index, transaction index, leader epoch index,
     * and producer snapshot index.
     * <p>
     * Invoker of this API should always send a unique id as part of {@link RemoteLogSegmentMetadata#remoteLogSegmentId()#id()}
     * even when it retries to invoke this method for the same log segment data.
     * <p>
     * This operation is expected to be idempotent. If a copy operation is retried and there is existing content already
     * written, it should be overwritten, and do not throw {@link RemoteStorageException}.
     *
     * @param remoteLogSegmentMetadata metadata about the remote log segment.
     * @param logSegmentData           data to be copied to tiered storage.
     * @throws RemoteStorageException if there are any errors in storing the data of the segment.
     */
    void copyLogSegmentData(RemoteLogSegmentMetadata remoteLogSegmentMetadata, LogSegmentData logSegmentData)
            throws RemoteStorageException;

    /**
     * Returns the remote log segment data file/object as InputStream for the given {@link RemoteLogSegmentMetadata}
     * starting from the given startPosition. The stream will end at the end of the remote log segment data file/object.
     *
     * @param remoteLogSegmentMetadata metadata about the remote log segment.
     * @param startPosition            start position of log segment to be read, inclusive.
     * @return input stream of the requested log segment data.
     * @throws RemoteStorageException          if there are any errors while fetching the desired segment.
     * @throws RemoteResourceNotFoundException the requested log segment is not found in the remote storage.
     */
    InputStream fetchLogSegment(RemoteLogSegmentMetadata remoteLogSegmentMetadata, int startPosition)
            throws RemoteStorageException;

    /**
     * Returns the remote log segment data file/object as InputStream for the given {@link RemoteLogSegmentMetadata}
     * starting from the given startPosition. The stream will end at the smaller of endPosition and the end of the
     * remote log segment data file/object.
     *
     * @param remoteLogSegmentMetadata metadata about the remote log segment.
     * @param startPosition            start position of log segment to be read, inclusive.
     * @param endPosition              end position of log segment to be read, inclusive.
     * @return input stream of the requested log segment data.
     * @throws RemoteStorageException          if there are any errors while fetching the desired segment.
     * @throws RemoteResourceNotFoundException the requested log segment is not found in the remote storage.
     */
    InputStream fetchLogSegment(RemoteLogSegmentMetadata remoteLogSegmentMetadata, int startPosition, int endPosition)
            throws RemoteStorageException;

    /**
     * Returns the index for the respective log segment of {@link RemoteLogSegmentMetadata}.
     * <p>
     * If the index is not present (e.g. a transaction index may not exist because segments created prior to
     * version 2.8.0 will not have a transaction index associated with them), throws
     * {@link RemoteResourceNotFoundException}.
     *
     * @param remoteLogSegmentMetadata metadata about the remote log segment.
     * @param indexType                type of the index to be fetched for the segment.
     * @return input stream of the requested index.
     * @throws RemoteStorageException          if there are any errors while fetching the index.
     * @throws RemoteResourceNotFoundException the requested index is not found in the remote storage.
     *                                         Callers of this function are encouraged to re-create the index from
     *                                         the segment as the suggested way of handling this error.
     */
    InputStream fetchIndex(RemoteLogSegmentMetadata remoteLogSegmentMetadata, IndexType indexType)
            throws RemoteStorageException;

    /**
     * Deletes the resources associated with the given {@code remoteLogSegmentMetadata}. Deletion is considered as
     * successful if this call returns successfully without any errors. It will throw {@link RemoteStorageException} if
     * there are any errors in deleting the file.
     * <p>
     * This operation is expected to be idempotent. If resources are not found, it is not expected to throw
     * {@link RemoteResourceNotFoundException} as they may have already been removed from a previous attempt.
     *
     * @param remoteLogSegmentMetadata metadata about the remote log segment to be deleted.
     * @throws RemoteStorageException if any storage related errors occurred.
     */
    void deleteLogSegmentData(RemoteLogSegmentMetadata remoteLogSegmentMetadata) throws RemoteStorageException;
}

package org.apache.kafka.common;
...
public class TopicIdPartition {
    private final UUID topicId;
    private final TopicPartition topicPartition;

    public TopicIdPartition(UUID topicId, TopicPartition topicPartition) {
        Objects.requireNonNull(topicId, "topicId can not be null");
        Objects.requireNonNull(topicPartition, "topicPartition can not be null");
        this.topicId = topicId;
        this.topicPartition = topicPartition;
    }

    public UUID topicId() {
        return topicId;
    }

    public TopicPartition topicPartition() {
        return topicPartition;
    }
    ...
}

package org.apache.kafka.server.log.remote.storage;
...
/**
 * This represents a universally unique identifier associated to a topic partition's log segment. This will be
 * regenerated for every attempt of copying a specific log segment in
 * {@link RemoteStorageManager#copyLogSegmentData(RemoteLogSegmentMetadata, LogSegmentData)}.
 */
public class RemoteLogSegmentId implements Comparable<RemoteLogSegmentId>, Serializable {
    private static final long serialVersionUID = 1L;

    private final TopicIdPartition topicIdPartition;
    private final UUID id;

    public RemoteLogSegmentId(TopicIdPartition topicIdPartition, UUID id) {
        this.topicIdPartition = requireNonNull(topicIdPartition);
        this.id = requireNonNull(id);
    }

    /**
     * @return TopicIdPartition of this remote log segment.
     */
    public TopicIdPartition topicIdPartition() {
        return topicIdPartition;
    }

    /**
     * @return Universally Unique Id of this remote log segment.
     */
    public UUID id() {
        return id;
    }
    ...
}

package org.apache.kafka.server.log.remote.storage;
...
/**
 * It describes the metadata about the log segment in the remote storage.
 */
public class RemoteLogSegmentMetadata implements Serializable {
    private static final long serialVersionUID = 1L;

    /** Universally unique remote log segment id. */
    private final RemoteLogSegmentId remoteLogSegmentId;

    /** Start offset of this segment. */
    private final long startOffset;

    /** End offset of this segment. */
    private final long endOffset;

    /** Leader epoch of the broker. */
    private final int leaderEpoch;

    /** Maximum timestamp in the segment. */
    private final long maxTimestamp;

    /** Epoch time at which the respective {@link #state} is set. */
    private final long eventTimestamp;

    /** LeaderEpoch vs offset for messages within this segment. */
    private final Map<Integer, Long> segmentLeaderEpochs;

    /** Size of the segment in bytes. */
    private final int segmentSizeInBytes;

    /** It indicates the state in which the action is executed on this segment. */
    private final RemoteLogSegmentState state;

    /**
     * @param remoteLogSegmentId  Universally unique remote log segment id.
     * @param startOffset         Start offset of this segment.
     * @param endOffset           End offset of this segment.
     * @param maxTimestamp        Maximum timestamp in this segment.
     * @param leaderEpoch         Leader epoch of the broker.
     * @param eventTimestamp      Epoch time at which the remote log segment is copied to the remote tier storage.
     * @param segmentSizeInBytes  Size of this segment in bytes.
     * @param state               State of the respective segment of remoteLogSegmentId.
     * @param segmentLeaderEpochs Leader epochs occurred within this segment.
     */
    public RemoteLogSegmentMetadata(RemoteLogSegmentId remoteLogSegmentId,
                                    long startOffset,
                                    long endOffset,
                                    long maxTimestamp,
                                    int leaderEpoch,
                                    long eventTimestamp,
                                    int segmentSizeInBytes,
                                    RemoteLogSegmentState state,
                                    Map<Integer, Long> segmentLeaderEpochs) {
        this.remoteLogSegmentId = remoteLogSegmentId;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.leaderEpoch = leaderEpoch;
        this.maxTimestamp = maxTimestamp;
        this.eventTimestamp = eventTimestamp;
        this.segmentLeaderEpochs = segmentLeaderEpochs;
        this.state = state;
        this.segmentSizeInBytes = segmentSizeInBytes;
    }

    /**
     * @return unique id of this segment.
     */
    public RemoteLogSegmentId remoteLogSegmentId() {
        return remoteLogSegmentId;
    }

    /**
     * @return Start offset of this segment (inclusive).
     */
    public long startOffset() {
        return startOffset;
    }

    /**
     * @return End offset of this segment (inclusive).
     */
    public long endOffset() {
        return endOffset;
    }

    /**
     * @return Leader epoch of the broker from where this event occurred.
     */
    public int leaderEpoch() {
        return leaderEpoch;
    }

    /**
     * @return Epoch time at which this event occurred.
     */
    public long eventTimestamp() {
        return eventTimestamp;
    }

    /**
     * @return Size of this segment in bytes.
     */
    public int segmentSizeInBytes() {
        return segmentSizeInBytes;
    }

    public RemoteLogSegmentState state() {
        return state;
    }

    public long maxTimestamp() {
        return maxTimestamp;
    }

    public Map<Integer, Long> segmentLeaderEpochs() {
        return segmentLeaderEpochs;
    }
    ...
}

package org.apache.kafka.server.log.remote.storage;
...
public class LogSegmentData {
    private final File logSegment;
    private final File offsetIndex;
    private final File timeIndex;
    private final File txnIndex;
    private final File producerIdSnapshotIndex;
    private final ByteBuffer leaderEpochIndex;

    public LogSegmentData(File logSegment,
                          File offsetIndex,
                          File timeIndex,
                          File txnIndex,
                          File producerIdSnapshotIndex,
                          ByteBuffer leaderEpochIndex) {
        this.logSegment = logSegment;
        this.offsetIndex = offsetIndex;
        this.timeIndex = timeIndex;
        this.txnIndex = txnIndex;
        this.producerIdSnapshotIndex = producerIdSnapshotIndex;
        this.leaderEpochIndex = leaderEpochIndex;
    }

    public File logSegment() {
        return logSegment;
    }

    public File offsetIndex() {
        return offsetIndex;
    }

    public File timeIndex() {
        return timeIndex;
    }

    public File txnIndex() {
        return txnIndex;
    }

    public File producerIdSnapshotIndex() {
        return producerIdSnapshotIndex;
    }

    public ByteBuffer leaderEpochIndex() {
        return leaderEpochIndex;
    }
    ...
}
```
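The idempotency contract described above (copy overwrites partial content; delete tolerates missing content) can be illustrated with a plain-filesystem sketch. This is not an RSM implementation, only the retry behavior; all names here are illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class IdempotentOpsSketch {
    // Idempotent copy: overwrite any partially written content from a
    // previous attempt instead of failing.
    static void copySegment(Path source, Path target) throws IOException {
        Files.createDirectories(target.getParent());
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    }

    // Idempotent delete: deleting already-deleted content is not an error.
    static void deleteSegment(Path target) throws IOException {
        Files.deleteIfExists(target);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rsm-sketch");
        Path src = Files.writeString(dir.resolve("00000000000000000000.log"), "segment-bytes");
        Path dst = dir.resolve("remote").resolve("00000000000000000000.log");
        copySegment(src, dst);
        copySegment(src, dst); // retry overwrites, does not fail
        deleteSegment(dst);
        deleteSegment(dst);    // second delete is a no-op
        System.out.println(Files.exists(dst)); // false
    }
}
```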
...
The following new metrics will be added:
MBean | Description |
---|---|
kafka.server:type=BrokerTopicMetrics, name=RemoteReadRequestsPerSec, topic=([-.\w]+) | Number of remote storage read requests per second. |
kafka.server:type=BrokerTopicMetrics, name=RemoteBytesInPerSec, topic=([-.\w]+) | Number of bytes read from remote storage per second. |
kafka.server:type=BrokerTopicMetrics, name=RemoteReadErrorPerSec, topic=([-.\w]+) | Number of remote storage read errors per second. |
kafka.log.remote:type=RemoteStorageThreadPool, name=RemoteLogReaderTaskQueueSize | Number of remote storage read tasks pending for execution. |
kafka.log.remote:type=RemoteStorageThreadPool, name=RemoteLogReaderAvgIdlePercent | Average idle percent of the remote storage reader thread pool. |
kafka.log.remote:type=RemoteLogManager, name=RemoteLogManagerTasksAvgIdlePercent | Average idle percent of RemoteLogManager thread pool. |
kafka.server:type=BrokerTopicMetrics, name=RemoteBytesOutPerSec, topic=([-.\w]+) | Number of bytes copied to remote storage per second. |
kafka.server:type=BrokerTopicMetrics, name=RemoteWriteErrorPerSec, topic=([-.\w]+) | Number of remote storage write errors per second. |
Some of these metrics have been updated with new names as part of KIP-930.
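These MBeans can be queried over JMX; a short sketch of building and matching the per-topic object names (the topic name `orders` is illustrative):

```java
import javax.management.ObjectName;

public class RemoteMetricsJmxSketch {
    public static void main(String[] args) throws Exception {
        // Per-topic remote read bytes metric for a hypothetical topic "orders".
        ObjectName perTopic = new ObjectName(
            "kafka.server:type=BrokerTopicMetrics,name=RemoteBytesInPerSec,topic=orders");
        // Pattern matching all topics, usable with MBeanServerConnection#queryNames.
        ObjectName allTopics = new ObjectName(
            "kafka.server:type=BrokerTopicMetrics,name=RemoteBytesInPerSec,topic=*");
        System.out.println(allTopics.apply(perTopic)); // true: the pattern matches
    }
}
```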
Upgrade
Follow the steps mentioned in the Kafka upgrade documentation to reach the state where all brokers are running on the latest binaries with the respective "inter.broker.protocol.version" and "log.message.format.version" values. Tiered storage requires the message format version to be > 0.11.
...
- Once tiered storage is enabled for a topic, it can not be disabled; this capability will be added in future versions. One possible workaround is to create a new topic, copy the data from the desired offset, and delete the old topic. Another possible workaround is to set local.retention.ms equal to retention.ms and wait until the local retention catches up to the complete log retention. This makes the complete data available locally. After that, set remote.storage.enable to false to disable tiered storage on the topic.
- Multiple Log dirs on a broker are not supported (JBOD related features).
- Tiered storage is not supported for compacted topics.
...
Consuming old data has a significant performance impact on acks=all producers. Without tiered storage, the P99 produce latency is almost tripled. With tiered storage, the performance impact is relatively lower, because remote storage reads do not compete with produce requests for local hard disk bandwidth.
...
- Discussion Recording
- Notes
- KIP is updated with the follower fetch protocol and is ready to be reviewed
- Satish to capture schema of internal metadata topic in the KIP
- We will update the KIP with details of different cases
- Test plan will be captured in a doc and will add to the KIP
- Add a section "Limitations" to capture the capabilities that will be introduced with this KIP and what will not be covered in this KIP.
Other associated KIPs
KIP-852: Optimize calculation of size for log in remote tier
KIP-917: Additional custom metadata for remote log segment