ID | IEP-28 |
Author | |
Sponsor |
Created | 31-Oct-2018 | |||
Status |
|
|
|
Table of Contents |
---|
The Apache Ignite cluster rebalance procedure with persistence enabled currently doesn't utilize network and storage device throughput to its full extent. The rebalance procedure processes cache data entries one by one, which is not efficient enough for a cluster with persistence enabled.
...
Apache Ignite needs to support cache rebalancing by transferring partition files using the zero-copy algorithm [1], based on an extension of the communication SPI and the Java NIO API. Once a partition file has been transferred to the demander node, there are a few possible approaches that can be implemented to preload entries from that partition file.
The Demander node must first, under the checkpoint write lock, swap the cache data storage with a temporary one so that recovery operations can be performed on the original cache data storage. After the partition file has been received from the Supplier node, there are two possible ways to make this partition file up-to-date.
Disadvantages:
After partition is received the historical rebalance must be initiated to load other cache updates.
The swapped temporary storage will log all cache updates to a temporary WAL storage (one per partition) so that they can later be applied to the corresponding partition file. While the Demander is receiving partition files, it must sequentially save all cache entries that belong to a MOVING partition into the new temporary storage. These entries will later be applied one by one to the newly received cache partition file. While the temporary storage is being read, all asynchronous operations will be appended to its end until it has been fully read. A file-based FIFO approach is assumed to be used for this process.
The temporary storage is chosen to be WAL-based. The storage must support:
Expected problems to be solved
The demander node will use the preloaded partition file as a new source of cache data entries to load.
Disadvantages:
In the process of balancing data:
The whole process is described in terms of rebalancing a single cache group and a single partition file of that cache group. All the other partitions would be rebalanced one-by-one:
To benefit from zero-copy file transfer we must delegate the file transfer to FileChannel#transferTo(long, long, java.nio.channels.WritableByteChannel) [2], because the fast path of the transferTo method is only executed if the destination channel inherits from an internal JDK class.
The cache partition file must be transferred over the network in chunks, with each received piece of data validated on the demander side.
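The delegation to transferTo described above can be sketched as follows. This is only an illustration under stated assumptions: the class name `ZeroCopySendSketch` and its `send` method are hypothetical and not part of Apache Ignite, and the destination is assumed to be a blocking channel. Whether the JDK actually takes its zero-copy fast path depends on the concrete destination channel type and the platform.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper for illustration only; not an Apache Ignite class. */
final class ZeroCopySendSketch {
    /**
     * Sends {@code cnt} bytes of {@code file}, starting at {@code offset}, to {@code out}.
     * A single transferTo call may move fewer bytes than requested, so it is called in a loop.
     * Assumes a blocking destination channel.
     *
     * @return Number of bytes actually transferred.
     */
    static long send(Path file, long offset, long cnt, WritableByteChannel out) throws IOException {
        try (FileChannel src = FileChannel.open(file, StandardOpenOption.READ)) {
            long sent = 0;

            while (sent < cnt) {
                long n = src.transferTo(offset + sent, cnt - sent, out);

                if (n <= 0)
                    break; // Nothing more could be transferred, e.g. the end of the file was reached.

                sent += n;
            }

            return sent;
        }
    }
}
```

For a socket destination, `out` would be the `SocketChannel` connected to the demander node; the sketch accepts any `WritableByteChannel` so that it can also be exercised against a local file.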
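The text does not say how a received chunk is validated. As one possible sketch, assuming the supplier announces a CRC32 checksum for each chunk (an assumption, not something the proposal specifies), the demander could recompute the checksum over the received region and compare. The class and method names below are hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.zip.CRC32;

/** Hypothetical helper for illustration only; the checksum choice (CRC32) is an assumption. */
final class ChunkValidationSketch {
    /** Computes the CRC32 of the file region [offset, offset + len). */
    static long chunkCrc(Path file, long offset, int len) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(len);
            long pos = offset;

            // A positional read may return fewer bytes than requested, so read until full or EOF.
            while (buf.hasRemaining()) {
                int n = ch.read(buf, pos);

                if (n < 0)
                    break;

                pos += n;
            }

            buf.flip();

            CRC32 crc = new CRC32();
            crc.update(buf);

            return crc.getValue();
        }
    }

    /** Demander-side check of a received chunk against the checksum announced by the supplier. */
    static boolean validate(Path received, long offset, int len, long expectedCrc) throws IOException {
        return chunkCrc(received, offset, len) == expectedCrc;
    }
}
```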
...
While the cache partition file is being transmitted, the demander node must keep all corresponding data entries in the new temporary WAL storage so that they can be applied later. A file-based FIFO technique is assumed to be used.
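A file-based FIFO of this kind could look roughly like the sketch below. It is not the WAL-based storage itself: the class `FileFifoSketch`, the length-prefixed record layout and the coarse synchronization are all assumptions made to keep the example small. Records are appended to the end of a file, and `drain` replays everything appended since the previous call, in order, so it can be called repeatedly until it returns 0.

```java
import java.io.Closeable;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.function.Consumer;

/** Hypothetical file-based FIFO for illustration only; the record layout is an assumption. */
final class FileFifoSketch implements Closeable {
    private final Path file;
    private final DataOutputStream out;
    private long readPos; // Offset of the first record that has not been replayed yet.

    FileFifoSketch(Path file) throws IOException {
        this.file = file;
        out = new DataOutputStream(
            Files.newOutputStream(file, StandardOpenOption.CREATE, StandardOpenOption.APPEND));
    }

    /** Appends a record (4-byte length followed by the payload) to the end of the file. */
    synchronized void append(byte[] rec) throws IOException {
        out.writeInt(rec.length);
        out.write(rec);
        out.flush();
    }

    /** Replays all records appended since the previous call, in FIFO order. */
    synchronized int drain(Consumer<byte[]> applier) throws IOException {
        int cnt = 0;

        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            long end = in.length();
            in.seek(readPos);

            while (in.getFilePointer() < end) {
                byte[] rec = new byte[in.readInt()];
                in.readFully(rec);
                applier.accept(rec);
                cnt++;
            }

            readPos = in.getFilePointer();
        }

        return cnt;
    }

    @Override public synchronized void close() throws IOException {
        out.close();
    }
}
```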
The process description on the demander node – items 2, 10 of the Process Overview.
The following changes need to be made:
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
/**
 * @return {@code True} if new type of direct connections supported.
 */
public default boolean pipeConnectionSupported() {
    return false;
}

/**
 * @param src Source cluster node to initiate connection with.
 * @return Channel to listen.
 * @throws IgniteSpiException If fails.
 */
public default ReadableByteChannel getRemotePipe(ClusterNode src) throws IgniteSpiException {
    throw new UnsupportedOperationException();
}

/**
 * @param dest Destination cluster node to communicate with.
 * @param out Channel to write data.
 * @throws IgniteSpiException If fails.
 */
public default void sendOnPipe(ClusterNode dest, WritableByteChannel out) throws IgniteSpiException {
    throw new UnsupportedOperationException();
} |
In case of crash recovery, no additional actions need to be applied to keep the cache partition file consistent. We do not recover partitions in the MOVING state, thus we will lose the single partition file and only it. Its uniqueness is guaranteed by the single-file-transmission process. The cache partition file will be fully loaded by the next rebalance procedure.
The overview of recovery guarantees:
...
In terms of a high-level abstraction, Apache Ignite must support the features described below.
The node partition preloader machinery downloads cache partition files from the cluster nodes that own the desired partitions (the zero-copy algorithm [1] is assumed to be used by default). To achieve this, the file transmission process must be implemented in Apache Ignite over the Communication SPI.
The Communication SPI must support:
Code Block | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
public interface CommunicationListenerEx<T extends Serializable> extends EventListener {
    /**
     * @param nodeId Remote node id.
     * @param initMsg Init channel message.
     * @param channel Locally created channel endpoint.
     */
    public void onChannelOpened(UUID nodeId, Message initMsg, Channel channel);
} |
The IO manager must support:
Code Block | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
public interface TransmissionHandler {
    /**
     * @param nodeId Remote node id.
     * @param err The error that occurred during the handling process.
     */
    public void onException(UUID nodeId, Throwable err);

    /**
     * @param nodeId Remote node id from which request has been received.
     * @param fileMeta File meta info.
     * @return Absolute pathname denoting a file.
     */
    public String filePath(UUID nodeId, TransmissionMeta fileMeta);

    /**
     * <em>Chunk handler</em> represents by itself the way of input data stream processing.
     * It accepts within each chunk a {@link ByteBuffer} with data from input for further processing.
     *
     * @param nodeId Remote node id from which request has been received.
     * @param initMeta Initial handler meta info.
     * @return Instance of chunk handler to process incoming data by chunks.
     */
    public Consumer<ByteBuffer> chunkHandler(UUID nodeId, TransmissionMeta initMeta);

    /**
     * <em>File handler</em> represents by itself the way of input data stream processing. All the data will
     * be processed under the hood using zero-copy transferring algorithm and only start file processing and
     * the end of processing will be provided.
     *
     * @param nodeId Remote node id from which request has been received.
     * @param initMeta Initial handler meta info.
     * @return Instance of read handler to process incoming data like the {@link FileChannel} manner.
     */
    public Consumer<File> fileHandler(UUID nodeId, TransmissionMeta initMeta);
} |
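To make the chunk handler contract more concrete, here is a minimal sketch of a `Consumer<ByteBuffer>` that a `chunkHandler(...)` implementation might return. It simply appends every incoming chunk to a destination file channel. The class `ChunkHandlerSketch` and its factory method are hypothetical names introduced for illustration; how the IO manager actually invokes the handler is not shown here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.function.Consumer;

/** Hypothetical chunk handler factory for illustration only. */
final class ChunkHandlerSketch {
    /**
     * @param dst Channel of the file that collects the received chunks.
     * @return Handler that appends each chunk at the channel's current position.
     */
    static Consumer<ByteBuffer> appendingTo(FileChannel dst) {
        return chunk -> {
            try {
                // A single write may be partial, so loop until the chunk is fully consumed.
                while (chunk.hasRemaining())
                    dst.write(chunk);
            }
            catch (IOException e) {
                // Consumer cannot throw checked exceptions, so wrap them.
                throw new UncheckedIOException(e);
            }
        };
    }
}
```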
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
public class TransmissionSender implements Closeable {
    /**
     * @param file Source file to send to remote.
     * @param params Additional transfer file description keys.
     * @param plc The policy of handling data on remote.
     * @throws IgniteCheckedException If fails.
     */
    public void send(
        File file,
        Map<String, Serializable> params,
        TransmissionPolicy plc
    ) throws IgniteCheckedException, InterruptedException, IOException {
        send(file, 0, file.length(), params, plc);
    }

    /**
     * @param file Source file to send to remote.
     * @param plc The policy of handling data on remote.
     * @throws IgniteCheckedException If fails.
     */
    public void send(
        File file,
        TransmissionPolicy plc
    ) throws IgniteCheckedException, InterruptedException, IOException {
        send(file, 0, file.length(), new HashMap<>(), plc);
    }

    /**
     * @param file Source file to send to remote.
     * @param offset Position to start transfer at.
     * @param cnt Number of bytes to transfer.
     * @param params Additional transfer file description keys.
     * @param plc The policy of handling data on remote.
     * @throws IgniteCheckedException If fails.
     */
    public void send(
        File file,
        long offset,
        long cnt,
        Map<String, Serializable> params,
        TransmissionPolicy plc
    ) throws IgniteCheckedException, InterruptedException, IOException {
        // Impl.
    }
}
|
When the supplier node receives a cache partition file demand request, it will send the file over the CommunicationSpi. The cache partition file can be concurrently updated by the checkpoint thread during its transmission. To guarantee file consistency, the Checkpointer must use the Copy-on-Write [3] technique and save a copy of the updated chunk into a temporary file.
A node is ready to become a partition owner when the partition data has been rebalanced and the cache indexes are ready. For the message-based cluster rebalancing approach, indexes are rebuilt simultaneously with cache data loading. For the file-based rebalancing approach, the index rebuild procedure must be finished before the partition state is set to OWNING.
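The Copy-on-Write idea can be illustrated with the small in-memory sketch below. It reflects one reading of the paragraph above, namely that the pre-update content of a chunk is preserved the first time the checkpointer overwrites it during a transmission, so that the sender still sees the file as it was when the transmission started. The class `CopyOnWriteSketch` is hypothetical, a byte array stands in for the partition file, and a map stands in for the temporary file.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory model for illustration only; real storage would be file-based. */
final class CopyOnWriteSketch {
    private final byte[][] pages;                                    // Stands in for the partition file.
    private final Map<Integer, byte[]> preserved = new HashMap<>();  // Stands in for the temporary file.
    private boolean transmitting;

    CopyOnWriteSketch(int pageCnt, int pageSize) {
        pages = new byte[pageCnt][pageSize];
    }

    synchronized void beginTransmission() {
        transmitting = true;
    }

    synchronized void endTransmission() {
        transmitting = false;
        preserved.clear();
    }

    /** Checkpointer-side write: preserves the old content once per page while a transmission is running. */
    synchronized void writePage(int idx, byte[] data) {
        if (transmitting)
            preserved.putIfAbsent(idx, pages[idx].clone());

        pages[idx] = Arrays.copyOf(data, pages[idx].length);
    }

    /** Sender-side read: returns the content as of the start of the transmission. */
    synchronized byte[] readForTransmission(int idx) {
        byte[] old = preserved.get(idx);

        return (old != null ? old : pages[idx]).clone();
    }

    /** Regular read of the latest page content. */
    synchronized byte[] readCurrent(int idx) {
        return pages[idx].clone();
    }
}
```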
Ignite doesn't provide any recovery guarantees for the partitions with the MOVING state. The cache partitions will be fully loaded when the next rebalance procedure occurs.
The node that is being rebalanced left the cluster. For such nodes WAL is always disabled (all partitions have the MOVING state because this node is new to the cluster and has no cache data).
Since WAL is disabled, we can guarantee that all operations on loaded partition files (renaming partition files, applying async updates) are safe, because the cache directory will be fully dropped on recovery.
Each topology change event (JOIN/LEFT/FAILED) may or may not change the cache affinity assignments of the caches currently being rebalanced. If the assignments have not changed and the node still needs the partitions being rebalanced, we can continue the current rebalance process (see IGNITE-7165 for details).
The rebalance procedure will be stopped if the deactivation event occurs. The single partition will be lost and will be preloaded on the next cluster rebalancing.
A new connection must be established, and the download of the partition file must continue from the last successfully sent cache partition chunk.
To provide basic recovery guarantees we must:
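Resuming from the last successfully sent chunk can be sketched as follows, assuming the demander uses the size of the partially received file as the resume offset (an assumption; the proposal does not say how the offset is tracked). The class `ResumableDownloadSketch` is hypothetical, and the `maxChunks` parameter exists only to simulate a connection that drops after a few chunks.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper for illustration only; local files stand in for both ends of the connection. */
final class ResumableDownloadSketch {
    /**
     * Copies {@code src} to {@code dst} chunk by chunk, starting at the number of bytes {@code dst} already holds.
     *
     * @param maxChunks Number of chunks to copy before stopping (simulates a dropped connection).
     * @return Total number of bytes present in {@code dst} after the call.
     */
    static long download(Path src, Path dst, int chunkSize, int maxChunks) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            long pos = out.size(); // Bytes received by previous attempts.
            long total = in.size();

            out.position(pos);

            for (int i = 0; i < maxChunks && pos < total; i++) {
                long want = Math.min(chunkSize, total - pos);
                long done = 0;

                // transferTo may move fewer bytes than requested.
                while (done < want) {
                    long n = in.transferTo(pos + done, want - done, out);

                    if (n <= 0)
                        throw new IOException("Unexpected end of source file");

                    done += n;
                }

                pos += done;
            }

            return pos;
        }
    }
}
```

Calling `download` a second time with the same arguments continues where the first call stopped, which mirrors re-establishing the connection and requesting the remaining chunks.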
Recovery from different stages:
...
A few notes can be mentioned:
SSL must be disabled to take advantage of Java NIO zero-copy file transmission using the FileChannel#transferTo method. If we need to use SSL, the file must be split into chunks in the same way and sent over the socket channel using a ByteBuffer. As the SSL engine generally needs a direct ByteBuffer to do encryption, we can't avoid copying the buffer payload from the kernel level to the application level.
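The ByteBuffer-based fallback can be sketched like this. The class `BufferedChunkSendSketch` is hypothetical; the `out` channel is assumed to be whatever channel applies encryption before writing to the socket, which is outside the scope of the example. Unlike transferTo, every chunk here passes through an application-level buffer.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper for illustration only; encryption itself is not shown. */
final class BufferedChunkSendSketch {
    /**
     * Reads the file chunk by chunk into a direct buffer and writes each chunk to {@code out}.
     *
     * @return Number of bytes written.
     */
    static long send(Path file, int chunkSize, WritableByteChannel out) throws IOException {
        ByteBuffer buf = ByteBuffer.allocateDirect(chunkSize);
        long sent = 0;

        try (FileChannel src = FileChannel.open(file, StandardOpenOption.READ)) {
            while (src.read(buf) != -1) {
                buf.flip();

                // A single write may be partial, so loop until the chunk is fully written.
                while (buf.hasRemaining())
                    sent += out.write(buf);

                buf.clear();
            }
        }

        return sent;
    }
}
```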
http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSSION-Design-document-Rebalance-caches-by-transferring-partition-files-td38388.html
Jira | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
|