ID	IEP-28
Author	Maxim Muzafarov
Sponsor	
Created	31-Oct-2018

Status	ACTIVE

Motivation

The Apache Ignite cluster balance procedure with persistence enabled currently doesn't utilize network and storage device throughput to its full extent. The balance procedure processes cache data entries one by one, which is not efficient enough for a cluster with persistence enabled.

...

Competitive Analysis

Profiling current balancing procedure

Rebalance procedure optimizations

Possible partition file sending approaches

Apache Ignite needs to support cache rebalancing by transferring partition files using the zero-copy algorithm [1], based on an extension of the communication SPI and the Java NIO API. Once the partition file has been transferred to the demander node, there are a few possible approaches to preloading entries from that partition file.

Hot swap cache data storage

Under the checkpoint write lock, the Demander node must first swap the cache data storage with a temporary one so that recovery operations can be performed on the original cache data storage. After the partition file has been received from the Supplier node, there are two possible ways to bring this partition file up to date.

Disadvantages:

A complex index rebuild procedure that requires the development of additional crash recovery guarantees. It will start immediately when the partition file is fully received from the supplier node. If the node crashes in the middle of the index rebuild, it will have an inconsistent index state at the next node startup. To avoid this, a new index-undo WAL record must be logged during the rebuild and used on node start to remove previously added index records.

Historical rebalance

After the partition file is received, historical rebalance must be initiated to load the remaining cache updates.

Catch-up temporary WAL

The swapped temporary storage will log all cache updates to a temporary WAL storage (one per partition) so that they can be applied later to the corresponding partition file. While the Demander is receiving partition files, it must sequentially save all cache entries belonging to the MOVING partition into this new temporary storage. These entries will later be applied one by one to the newly received cache partition file. While the storage is being read, all asynchronous operations are appended to its end until it has been read completely. A file-based FIFO approach is assumed for this process.

The temporary storage is chosen to be WAL-based. The storage must support:

An unlimited number of WAL files to store temporary data records;
Iterating over stored data records while an asynchronous writer thread inserts new records;
A WAL-per-partition approach;
Write operations with a higher priority than read operations;
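
The FIFO behaviour required above can be illustrated with a minimal sketch. It is not the proposed WAL manager: the class name TempWalFifo, the length-prefixed record layout, and the use of a single file (rather than an unlimited number of WAL segments) are assumptions made for brevity. It only shows a writer appending to the tail while a reader drains from the head.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical file-based FIFO: records are appended at the tail and read from the head. */
final class TempWalFifo implements AutoCloseable {
    private final RandomAccessFile file;
    private long readPos = 0;   // offset of the next unread record
    private long writePos = 0;  // offset at which the next record is appended

    TempWalFifo(Path path) {
        try {
            file = new RandomAccessFile(path.toFile(), "rw");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a FIFO backed by a fresh temporary file (illustrative file name). */
    static TempWalFifo createTemp() {
        try {
            Path p = Files.createTempFile("part-0-", ".wal.tmp");
            p.toFile().deleteOnExit();
            return new TempWalFifo(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Appends one record as [4-byte length][payload]. */
    synchronized void append(byte[] rec) {
        try {
            file.seek(writePos);
            file.writeInt(rec.length);
            file.write(rec);
            writePos = file.getFilePointer();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the next record in insertion order, or null when the reader has caught up. */
    synchronized byte[] poll() {
        if (readPos == writePos)
            return null;
        try {
            file.seek(readPos);
            byte[] rec = new byte[file.readInt()];
            file.readFully(rec);
            readPos = file.getFilePointer();
            return rec;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override public synchronized void close() {
        try {
            file.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because both operations share one lock, this sketch does not give writes priority over reads; a real implementation would need a separate scheduling policy for that requirement.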

Expected problems to be solved

Index updates on the demander must stop once the data is ready to be transferred from the supplier node. Asynchronous cache updates on the demander must not cause index updates;
The previous partition metadata page and all stored meta information must be destroyed in PageMemory and restored from the new partition file;

Preload entries from loaded partition file

The demander node will use a preloaded partition file as a new source of cache data entries to load.

Disadvantages:

The approach will require a new temporary FilePageStore to be initialized. It must be created as a part of the temporary cache group or in a separate temporary data region, so that the existing machinery for iterating over the full partition file can be reused.

Proposed Changes (Hot swap with historical rebalance)

Process Overview

Two node roles take part in the process of balancing data:

Demander (receiver of partition files);
Supplier (sender of partition files).

The whole process is described in terms of rebalancing a single partition file of a cache group. All the other partitions would be rebalanced one by one:

1. The demander node prepares the set of IgniteDhtDemandedPartitionsMap#full cache partitions to fetch;
2. The demander node checks the compatibility version (for example, 2.8) and starts recording all incoming cache updates to the new special storage – the temporary WAL;
3. The demander node sends the GridDhtPartitionDemandMessage to the supplier node;
4. The supplier node receives the GridDhtPartitionDemandMessage and starts a new checkpoint process;
5. The supplier node creates an empty temporary cache partition file with the .tmp postfix in the same cache persistence directory;
6. The supplier node splits the whole cache partition file into virtual chunks of a predefined size (a multiple of the PageMemory page size);
  1. If the concurrent checkpoint thread determines the appropriate cache partition file chunk and tries to flush a dirty page to the cache partition file
    1. If the rebalance chunk has already been transferred
      1. Flush the dirty page to the file;
    2. If the rebalance chunk has not been transferred yet
      1. Write this chunk to the temporary cache partition file;
      2. Flush the dirty page to the file;
  2. The node starts sending each cache partition file chunk to the demander node one by one using FileChannel#transferTo
    1. If the current chunk was modified by the checkpoint thread – read it from the temporary cache partition file;
    2. If the current chunk has not been touched – read it from the original cache partition file;
7. The demander node starts to listen for new incoming pipe connections from the supplier node on TcpCommunicationSpi;
8. The demander node creates the temporary cache partition file with the .tmp postfix in the same cache persistence directory;
9. The demander node receives each cache partition file chunk one by one
  1. The node checks the CRC of each PageMemory page in the downloaded chunk;
  2. The node flushes the downloaded chunk at the appropriate cache partition file position;
10. When the demander node has received the whole cache partition file
  1. The node initializes the received .tmp cache partition file as the file holder;
  2. A thread per partition begins to apply data entries from the beginning of the temporary WAL storage;
  3. All async operations corresponding to this partition file still write to the end of the temporary WAL;
  4. At the moment the temporary WAL storage is about to become empty
    1. The node switches to writing directly to the partition file (the step of writing to the temporary WAL is skipped);
    2. The temporary WAL storage is scheduled for deletion;
11. The supplier node deletes the temporary cache partition file;
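
The per-page CRC check that the demander performs on each downloaded chunk could look roughly like the sketch below. It assumes the expected checksums are available to the demander as a plain array; how they are actually delivered or stored is not specified here, and the class and method names are hypothetical.

```java
import java.util.zip.CRC32;

/** Hypothetical helper that validates a downloaded chunk page by page. */
final class ChunkCrcValidator {
    /** Computes the CRC32 of every page in the chunk. */
    static int[] pageCrcs(byte[] chunk, int pageSize) {
        if (pageSize <= 0 || chunk.length % pageSize != 0)
            throw new IllegalArgumentException("chunk length must be a multiple of the page size");

        int[] crcs = new int[chunk.length / pageSize];
        CRC32 crc = new CRC32();

        for (int i = 0; i < crcs.length; i++) {
            crc.reset();
            crc.update(chunk, i * pageSize, pageSize);
            crcs[i] = (int) crc.getValue();
        }
        return crcs;
    }

    /** Returns the index of the first page whose CRC differs from the expected one, or -1. */
    static int firstCorruptedPage(byte[] chunk, int pageSize, int[] expected) {
        int[] actual = pageCrcs(chunk, pageSize);
        if (actual.length != expected.length)
            throw new IllegalArgumentException("unexpected number of pages in chunk");

        for (int i = 0; i < actual.length; i++) {
            if (actual[i] != expected[i])
                return i;
        }
        return -1;
    }
}
```

A demander would flush the chunk to the partition file only when firstCorruptedPage returns -1, and otherwise request the chunk again.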

Components to change

CommunicationSpi

To benefit from zero-copy file transfer, we must delegate the file transfer to FileChannel#transferTo(long, long, java.nio.channels.WritableByteChannel) [2], because the fast path of the transferTo method is only executed if the destination channel inherits from an internal JDK class.

The CommunicationSpi needs to support pipe connections between two nodes;
- The WritableByteChannel needs to be accessible on the supplier side;
- The ReadableByteChannel needs to be read on the demander side;
The CommunicationListener must be extended to respond to new incoming pipe connections;
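
As an illustration of the supplier side, the following sketch sends a byte range of a file through FileChannel#transferTo. The class name ZeroCopySender is hypothetical, and the loop assumes a blocking target channel. Whether the kernel-level fast path is actually taken depends on the target channel type and the operating system; with any other WritableByteChannel the JDK falls back to copying through a buffer.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sender that delegates the copy to FileChannel#transferTo. */
final class ZeroCopySender {
    /**
     * Sends up to cnt bytes of the file starting at offset into the target channel.
     * Assumes a blocking target, so each transferTo call makes progress.
     *
     * @return Number of bytes actually sent (less than cnt if the file is shorter).
     */
    static long send(Path file, long offset, long cnt, WritableByteChannel out) {
        try (FileChannel src = FileChannel.open(file, StandardOpenOption.READ)) {
            long end = Math.min(offset + cnt, src.size());
            long pos = offset;

            // transferTo may move fewer bytes than requested, so loop until the range is done.
            while (pos < end)
                pos += src.transferTo(pos, end - pos, out);

            return Math.max(0, pos - offset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```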

Partition transmission

The cache partition file must be transferred over the network in chunks, with each received piece of data validated on the demander side.

The new layer over the cache partition file must support direct use of the FileChannel#transferTo method over the CommunicationSpi pipe connection;
The process manager must support transferring the cache partition file one by one in chunks of a predefined size (a multiple of the page size);
It must be possible to limit the connection bandwidth of the cache partition file transfer at runtime;
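
Splitting a partition file into chunks whose size is a multiple of the page size can be sketched as follows. ChunkPlanner and its Chunk type are hypothetical names, and the page size and pages-per-chunk values are plain parameters rather than real configuration properties.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical planner that cuts a partition file into page-aligned chunks. */
final class ChunkPlanner {
    /** One chunk of the file: byte offset and length. */
    static final class Chunk {
        final long offset;
        final long length;

        Chunk(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Splits [0, fileSize) into chunks of pageSize * pagesPerChunk bytes; the last one may be shorter. */
    static List<Chunk> split(long fileSize, int pageSize, int pagesPerChunk) {
        if (pageSize <= 0 || pagesPerChunk <= 0)
            throw new IllegalArgumentException("page size and pages per chunk must be positive");

        long chunkSize = (long) pageSize * pagesPerChunk;
        List<Chunk> chunks = new ArrayList<>();

        for (long off = 0; off < fileSize; off += chunkSize)
            chunks.add(new Chunk(off, Math.min(chunkSize, fileSize - off)));

        return chunks;
    }
}
```

For example, a file of ten 4096-byte pages split with four pages per chunk yields chunks of 16384, 16384 and 8192 bytes.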

Checkpointing on supplier

...

Catch-up WAL

While the cache partition file is being transmitted, the demander node must keep all corresponding data entries in the new temporary WAL storage to apply them later. A file-based FIFO technique is assumed.

The new write-ahead-log manager for writing temporary records must support:
- An unlimited number of WAL files to store temporary data records;
- Iterating over stored data records while an asynchronous writer thread inserts new records;
- A WAL-per-partition approach;

The process description on the demander node – items 2, 10 of the Process Overview.

Public API changes

The following changes need to be made:

CommunicationSpi.java

/**
 * @return {@code True} if new type of direct connections supported.
 */
public default boolean pipeConnectionSupported() {
    return false;
}
 
/**
 * @param src Source cluster node to initiate connection with.
 * @return Channel to listen.
 * @throws IgniteSpiException If fails.
 */
public default ReadableByteChannel getRemotePipe(ClusterNode src) throws IgniteSpiException {
    throw new UnsupportedOperationException();
}
 
/**
 * @param dest Destination cluster node to communicate with.
 * @param out Channel to write data.
 * @throws IgniteSpiException If fails.
 */
public default void sendOnPipe(ClusterNode dest, WritableByteChannel out) throws IgniteSpiException {
    throw new UnsupportedOperationException();
}

Recovery

In case of crash recovery, no additional actions need to be applied to keep the cache partition file consistent. We do not recover partitions in the MOVING state, thus we will lose the single partition file and only it. Its uniqueness is guaranteed by the single-file-transmission process. The cache partition file will be fully loaded by the next rebalance procedure.

The overview of recovery guarantees:

...

1. NODE_JOIN event occurs and the blocking PME starts;
  1. The Demander decides which partitions must be loaded. All the desired partitions have the MOVING state;
  2. The Demander initiates a new checkpoint process;
    1. Under the checkpoint write lock it swaps the cache data storage with the temporary one for each partition of the given set;
    2. The temporary cache data storage tracks the partition counter as usual (on each cache operation);
    3. Waits until the checkpoint begin future completes;
2. The Demander sends a request to the Supplier with the previously prepared set of cache groups and partition files;
3. The Supplier receives the request and starts a new local checkpoint process;
  1. Creates a temporary file with the .delta postfix (for each partition file, e.g. part-0.bin.delta);
  2. Under the checkpoint write lock fixes the expected partition file size (as of the moment the checkpoint ends);
  3. Waits until the checkpoint begin future completes;
  4. Starts copying the partition file to the Demander;
    1. Opens the partition file in read-only mode;
    2. Starts sending the partition file (with any concurrent writes) in chunks of a predefined size;
  5. Asynchronously writes each page to the partition file and the same page to the corresponding file with the .delta postfix;
  6. When the partition file has been sent, it starts sending the corresponding .delta file;
4. The Demander listens for new file sending attempts from the Supplier;
5. The Demander receives the partition files (one by one);
6. The Demander reads the corresponding partition .delta file in chunks and applies them to the received partition file;
7. When the Demander has received the whole cache partition file;
  1. It swaps the temporary cache data storage with the original one on the next checkpoint (under the write lock);
  2. When the partition has been swapped, it starts the index rebuild procedure over the given partition files;
  3. It starts historical rebalance for the given partition file;
8. The Supplier deletes all temporary files;
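
Applying a received .delta file to the received partition file can be pictured with the in-memory sketch below. The record layout (a 4-byte page index followed by the page bytes) and the DeltaApplier name are assumptions for illustration only; the actual format of the .delta file is not defined in this document. Records are applied in order, so a later record for the same page wins.

```java
import java.nio.ByteBuffer;

/** Hypothetical applier of page-level delta records onto a partition image. */
final class DeltaApplier {
    /** Appends one record to the delta buffer: [4-byte page index][page bytes]. */
    static void writeRecord(ByteBuffer delta, int pageIdx, byte[] page) {
        delta.putInt(pageIdx).put(page);
    }

    /**
     * Applies every complete record between the delta's position and limit onto the partition buffer.
     *
     * @return Number of pages applied.
     */
    static int apply(ByteBuffer delta, ByteBuffer partition, int pageSize) {
        byte[] page = new byte[pageSize];
        int applied = 0;

        while (delta.remaining() >= Integer.BYTES + pageSize) {
            int pageIdx = delta.getInt();
            delta.get(page);

            // Overwrite the page in place at its position inside the partition image.
            partition.position(pageIdx * pageSize);
            partition.put(page);
            applied++;
        }
        return applied;
    }
}
```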

Components

In terms of a high-level abstraction, Apache Ignite must support the features described below.

File transfer between nodes

The node partition preloader machinery downloads cache partition files from the cluster nodes which own the desired partitions (the zero-copy algorithm [1] is assumed to be used by default). To achieve this, the file transmission process must be implemented in Apache Ignite over the Communication SPI.

CommunicationSpi

The Communication SPI must support:

opening channel connections to a remote node on an arbitrary topic (GridTopic is used) with initial meta information;
listening for incoming channel connections and handling them with registered handlers;
an arbitrary set of channel parameters on connection handshake (some initial Message is assumed to be used);

API

CommunicationListenerEx.java

public interface CommunicationListenerEx<T extends Serializable> extends EventListener {
    /**
     * @param nodeId Remote node id.
     * @param initMsg Init channel message.
     * @param channel Locally created channel endpoint.
     */
    public void onChannelOpened(UUID nodeId, Message initMsg, Channel channel);
}

GridIoManager

The IO manager must support:

different approaches to incoming data handling: CHUNK (read the channel into a ByteBuffer), FILE (zero-copy approach);
sending and receiving data in chunks of a predefined size, storing intermediate results;
re-establishing the connection between nodes if an error occurs and continuing file sending/receiving;
limiting connection bandwidth at runtime;
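
One way to limit bandwidth at runtime is to have the sender reserve each chunk with a small rate limiter before writing it. The sketch below is only an illustration of that idea; TransferRateLimiter is a hypothetical class, not part of GridIoManager, and the caller passes the current time explicitly (for example from System.nanoTime()) and sleeps for the returned delay.

```java
/** Hypothetical rate limiter whose limit can be changed while a transfer is in progress. */
final class TransferRateLimiter {
    private long bytesPerSec;

    /** Time at which the link becomes free again; MIN_VALUE means nothing has been reserved yet. */
    private long nextFreeNanos = Long.MIN_VALUE;

    TransferRateLimiter(long bytesPerSec) {
        setRate(bytesPerSec);
    }

    /** Changes the limit; it applies to all subsequent reservations. */
    synchronized void setRate(long bytesPerSec) {
        if (bytesPerSec <= 0)
            throw new IllegalArgumentException("rate must be positive");
        this.bytesPerSec = bytesPerSec;
    }

    /**
     * Reserves the link for the given number of bytes.
     * Intended for chunk-sized values; very large byte counts could overflow the arithmetic.
     *
     * @return Nanoseconds the caller should wait before sending these bytes.
     */
    synchronized long reserve(long bytes, long nowNanos) {
        long start = Math.max(nowNanos, nextFreeNanos);
        nextFreeNanos = start + bytes * 1_000_000_000L / bytesPerSec;
        return start - nowNanos;
    }
}
```

With a limit of 1000 bytes per second, two back-to-back reservations of 500 bytes at time zero return delays of 0 and 500 ms respectively.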

API

TransmissionHandler.java

public interface TransmissionHandler {
    /**
     * @param nodeId Remote node id.
     * @param err Error that caused the handling process to fail.
     */
    public void onException(UUID nodeId, Throwable err);

    /**
     * @param nodeId Remote node id from which request has been received.
     * @param fileMeta File meta info.
     * @return Absolute pathname denoting a file.
     */
    public String filePath(UUID nodeId, TransmissionMeta fileMeta);

    /**
     * <em>Chunk handler</em> represents by itself the way of input data stream processing.
     * It accepts within each chunk a {@link ByteBuffer} with data from input for further processing.
     *
     * @param nodeId Remote node id from which request has been received.
     * @param initMeta Initial handler meta info.
     * @return Instance of chunk handler to process incoming data by chunks.
     */
    public Consumer<ByteBuffer> chunkHandler(UUID nodeId, TransmissionMeta initMeta);

    /**
     * <em>File handler</em> represents by itself the way of input data stream processing. All the data will
     * be processed under the hood using zero-copy transferring algorithm and only start file processing and
     * the end of processing will be provided.
     *
     * @param nodeId Remote node id from which request has been received.
     * @param initMeta Initial handler meta info.
     * @return Instance of read handler to process incoming data in the {@link FileChannel} manner.
     */
    public Consumer<File> fileHandler(UUID nodeId, TransmissionMeta initMeta);
}

GridIoManager.TransmissionSender.java

public class TransmissionSender implements Closeable {
    /**
     * @param file Source file to send to remote.
     * @param params Additional transfer file description keys.
     * @param plc The policy of handling data on remote.
     * @throws IgniteCheckedException If fails.
     */
    public void send(
        File file,
        Map<String, Serializable> params,
        TransmissionPolicy plc
    ) throws IgniteCheckedException, InterruptedException, IOException {
        send(file, 0, file.length(), params, plc);
    }

    /**
     * @param file Source file to send to remote.
     * @param plc The policy of handling data on remote.
     * @throws IgniteCheckedException If fails.
     */
    public void send(
        File file,
        TransmissionPolicy plc
    ) throws IgniteCheckedException, InterruptedException, IOException {
        send(file, 0, file.length(), new HashMap<>(), plc);
    }

    /**
     * @param file Source file to send to remote.
     * @param offset Position to start transfer at.
     * @param cnt Number of bytes to transfer.
     * @param params Additional transfer file description keys.
     * @param plc The policy of handling data on remote.
     * @throws IgniteCheckedException If fails.
     */
    public void send(
        File file,
        long offset,
        long cnt,
        Map<String, Serializable> params,
        TransmissionPolicy plc
    ) throws IgniteCheckedException, InterruptedException, IOException {
        // Impl.
    }
}

Copy partition on the fly

Checkpointer

When the supplier node receives the cache partition file demand request, it will send the file over the CommunicationSpi. The cache partition file can be concurrently updated by the checkpoint thread during its transmission. To guarantee file consistency, the Checkpointer must use the Copy-on-Write [3] technique and save a copy of the chunk being updated into the temporary file.
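
The copy-on-write interaction between the checkpoint thread and the sender can be modelled in memory as shown below. This is a simplified sketch under several assumptions: a byte array stands in for the partition file, a map stands in for the temporary file, pages never straddle a chunk boundary, and the class name CowPartitionSnapshot is hypothetical.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory model of copy-on-write chunk preservation during a transfer. */
final class CowPartitionSnapshot {
    private final byte[] partition;                              // stands in for the partition file
    private final int chunkSize;
    private final BitSet transferred = new BitSet();             // chunks already handed to the sender
    private final Map<Integer, byte[]> saved = new HashMap<>();  // stands in for the temporary file

    CowPartitionSnapshot(byte[] partition, int chunkSize) {
        this.partition = partition;
        this.chunkSize = chunkSize;
    }

    /** Checkpointer path: preserves the original chunk before its first overwrite, then writes the page. */
    synchronized void flushPage(int offset, byte[] page) {
        int chunk = offset / chunkSize;

        if (!transferred.get(chunk) && !saved.containsKey(chunk))
            saved.put(chunk, copyChunk(chunk));

        System.arraycopy(page, 0, partition, offset, page.length);
    }

    /** Sender path: returns the chunk content as it was before any concurrent page flush. */
    synchronized byte[] readChunkForSend(int chunk) {
        byte[] original = saved.remove(chunk);
        byte[] res = original != null ? original : copyChunk(chunk);

        transferred.set(chunk);
        return res;
    }

    private byte[] copyChunk(int chunk) {
        int from = chunk * chunkSize;
        return Arrays.copyOfRange(partition, from, Math.min(from + chunkSize, partition.length));
    }
}
```

In this model the demander always receives the chunk content as of the start of the transfer, while the live partition data keeps accepting checkpoint writes.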

Rebuild indexes

A node is ready to become a partition owner when the partition data is rebalanced and the cache indexes are ready. With the message-based cluster rebalancing approach, indexes are rebuilt simultaneously with cache data loading. With the file-based rebalancing approach, the index rebuild procedure must be finished before the partition state is set to OWNING.
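
The ordering constraint for the file-based approach can be expressed as a small gate that switches a partition to OWNING only after both conditions hold. The class below is a hypothetical illustration, not the actual partition state machine; only the MOVING and OWNING state names come from this document.

```java
/** Hypothetical gate: a partition becomes OWNING only when data is loaded and indexes are rebuilt. */
final class PartitionOwnershipGate {
    enum State { MOVING, OWNING }

    private State state = State.MOVING;
    private boolean dataLoaded;
    private boolean indexRebuilt;

    /** Called when the partition file and all pending updates have been applied. */
    synchronized void onDataLoaded() {
        dataLoaded = true;
        tryOwn();
    }

    /** Called when the index rebuild over the partition has finished. */
    synchronized void onIndexRebuilt() {
        indexRebuilt = true;
        tryOwn();
    }

    synchronized State state() {
        return state;
    }

    private void tryOwn() {
        if (dataLoaded && indexRebuilt)
            state = State.OWNING;
    }
}
```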

Failover and Recovery

Ignite doesn't provide any recovery guarantees for the partitions with the MOVING state. The cache partitions will be fully loaded when the next rebalance procedure occurs.

FAIL\LEFT during rebalancing

The node being rebalanced has left the cluster. For such nodes the WAL is always disabled (all partitions are in the MOVING state because the node is new to the cluster and has no cache data).
Since the WAL is disabled, all operations on loaded partition files (renaming partition files, applying async updates) are safe, because the cache directory will be fully dropped on recovery.

Topology change

Each topology change event (JOIN/LEFT/FAILED) may or may not change the cache affinity assignments of the caches currently being rebalanced. If the assignments have not changed and the node still needs the partitions being rebalanced, we can continue the current rebalance process (see IGNITE-7165 for details).

Activation\deactivation

The rebalance procedure will be stopped if the deactivation event occurs. The single partition will be lost and will be preloaded on the next cluster rebalancing.

Unstable connection

A new connection must be established, and the download of the partition file must continue from the last successfully sent cache partition chunk.
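
Resuming after a connection failure can be sketched as a send loop that advances its offset only when a chunk has been written completely. ResumableSender and its Connector interface are hypothetical, and the sketch assumes that the receiving side discards any partially received chunk before the transfer resumes.

```java
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical sender that reconnects and resumes from the last fully sent chunk. */
final class ResumableSender {
    /** Opens a new connection to the remote node (illustrative abstraction). */
    interface Connector {
        OutputStream connect() throws IOException;
    }

    /**
     * Sends the data in chunks, reconnecting after a failure.
     *
     * @return Number of reconnects that were needed.
     * @throws IOException If the number of allowed retries is exceeded.
     */
    static int send(byte[] data, int chunkSize, Connector connector, int maxRetries) throws IOException {
        int sent = 0;        // offset just past the last successfully sent chunk
        int reconnects = 0;
        OutputStream out = connector.connect();

        while (sent < data.length) {
            int len = Math.min(chunkSize, data.length - sent);
            try {
                out.write(data, sent, len);
                out.flush();
                sent += len; // advance only after the whole chunk went through
            } catch (IOException e) {
                if (++reconnects > maxRetries)
                    throw e;
                out = connector.connect(); // retry the same chunk on a fresh connection
            }
        }
        return reconnects;
    }
}
```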

Crash recovery

To provide basic recovery guarantees we must:

Wait until the first checkpoint ends, then set the OWNING status on the partition;

Recovery from different stages:

The Supplier crashes when sending partition;
The Demander crashes when receiving partition;
The Demander crashes when applying temp WAL;

Phase-2

SSL must be disabled to take advantage of Java NIO zero-copy file transmission using the FileChannel#transferTo method. If we need to use SSL, the file must be split into chunks in the same way, and the chunks sent over the socket channel with a ByteBuffer. As the SSL engine generally needs a direct ByteBuffer to do encryption, we can't avoid copying the buffer payload from the kernel level to the application level.
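
The non-zero-copy fallback described above amounts to reading the file into a direct ByteBuffer chunk by chunk and writing the buffer to the channel. The sketch below shows only that copy loop; the encryption step is omitted and marked with a comment, and the BufferedChunkSender name is hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sender that copies a file through an application-level buffer. */
final class BufferedChunkSender {
    /**
     * Sends the whole file through a direct buffer of the given size.
     *
     * @return Number of bytes written to the target channel.
     */
    static long send(Path file, int chunkSize, WritableByteChannel out) {
        if (chunkSize <= 0)
            throw new IllegalArgumentException("chunk size must be positive");

        ByteBuffer buf = ByteBuffer.allocateDirect(chunkSize);
        long sent = 0;

        try (FileChannel src = FileChannel.open(file, StandardOpenOption.READ)) {
            while (src.read(buf) != -1) {
                buf.flip();

                // An SSL-enabled sender would encrypt the buffer content here before writing it.
                while (buf.hasRemaining())
                    sent += out.write(buf);

                buf.clear();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sent;
    }
}
```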

...

Risks and Assumptions

A few points should be noted:

Zero-copy limitations – If the operating system does not support zero copy, sending a file with FileChannel#transferTo might fail or yield worse performance. For example, sending a large file doesn't work well enough on Windows;
WAL write I/O wait time – Under the heavy load of partition file transmission, writing to the temporary WAL storage may slow down. Since losing the data of the temporary WAL storage carries no risk, we can consider keeping the whole storage in memory.

Discussion Links

http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSSION-Design-document-Rebalance-caches-by-transferring-partition-files-td38388.html

Reference Links

[1] Zero Copy I: User-Mode Perspective – https://www.linuxjournal.com/article/6345
[2] Example: Efficient data transfer through zero copy – https://www.ibm.com/developerworks/library/j-zerocopy/index.html
[3] Copy-on-write – https://en.wikipedia.org/wiki/Copy-on-write

Tickets

ASF JIRA query: project = Ignite AND labels IN (iep-28) order by key
