...

Apache Ignite needs to support cache rebalancing by transferring partition files with a zero-copy algorithm [1], based on an extension of the communication SPI and the Java NIO API.

Process

...

Overview

There are two participants in the process of rebalancing data – the demander (the receiver of partition files)

...

and the supplier (the sender of partition files).
The process of ordering cache groups for rebalance remains the same.
The whole process is described below in terms of rebalancing a single cache group:

  1. The GridDhtPreloaderAssignments is created for the cache group (a Map<ClusterNode, GridDhtPartitionDemandMessage>)

  1. The demander node prepares the set of IgniteDhtDemandedPartitionsMap#full cache partitions to fetch;
  2. The demander node checks the compatibility version (for example, 2.8) and starts recording all incoming cache updates to a new special storage – the temporary WAL;
  3. The demander node sends the GridDhtPartitionDemandMessage to the supplier node;
  4. When the supplier node receives the GridDhtPartitionDemandMessage, it starts a new checkpoint process;
  5. The supplier node creates an empty temporary cache partition file with the .tmp postfix in the same cache persistence directory;
  6. The supplier node splits the whole cache partition file into virtual chunks of a predefined size (a multiple of the PageMemory page size);
    1. If the concurrent checkpoint thread determines the appropriate cache partition file chunk and tries to flush a dirty page to the cache partition file:
      1. If the rebalance chunk has already been transferred:
        1. Flush the dirty page to the file;
      2. If the rebalance chunk has not been transferred yet:
        1. Write this chunk to the temporary cache partition file;
        2. Flush the dirty page to the file;
    2. The supplier node starts sending each cache partition file chunk to the demander node, one by one, using FileChannel#transferTo:
      1. If the current chunk was modified by the checkpoint thread – read it from the temporary cache partition file;
      2. If the current chunk was not touched – read it from the original cache partition file;
  7. The demander node starts to listen for new incoming pipe connections from the supplier node on TcpCommunicationSpi;
  8. The demander node creates a temporary cache partition file with the .tmp postfix in the same cache persistence directory;
  9. The demander node receives each cache partition file chunk one by one:
    1. The node checks the CRC for each PageMemory page in the downloaded chunk;
    2. The node flushes the downloaded chunk at the appropriate cache partition file position;
  10. When the demander node has received the whole cache partition file, it removes the .tmp postfix;
  11. The supplier node deletes the temporary cache partition file;
  12. The demander node starts applying the saved WAL record updates from the temporary storage;
  13. The demander node now owns the new cache partition file.
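The per-chunk bookkeeping in the steps above can be sketched as follows. This is a hypothetical illustration, not Ignite's actual implementation: the class name, file names, and method names are invented for the example, and real code would need thread-safe coordination between the checkpoint and sender threads.

```java
import java.util.BitSet;

/**
 * Hypothetical sketch of the supplier-side chunk bookkeeping: the partition
 * file is split into fixed-size chunks; a chunk already sent can be flushed
 * to directly, while a not-yet-sent chunk must first be preserved in the
 * temporary file, and the sender reads each chunk from whichever file holds
 * its consistent copy.
 */
public class ChunkTracker {
    private final long chunkSize;      // a multiple of the PageMemory page size
    private final BitSet transferred;  // chunks already sent to the demander
    private final BitSet copiedToTmp;  // chunks preserved in the .tmp file

    public ChunkTracker(long fileSize, long chunkSize) {
        this.chunkSize = chunkSize;
        int chunks = (int)((fileSize + chunkSize - 1) / chunkSize);
        this.transferred = new BitSet(chunks);
        this.copiedToTmp = new BitSet(chunks);
    }

    /**
     * Checkpoint thread: returns true if the dirty page may be flushed
     * directly; otherwise the chunk is marked as preserved in the .tmp
     * file first (the actual byte copy is omitted in this sketch).
     */
    public boolean onCheckpointWrite(long pageOffset) {
        int chunk = (int)(pageOffset / chunkSize);
        if (transferred.get(chunk))
            return true;          // already sent: flush directly
        copiedToTmp.set(chunk);   // preserve the original chunk bytes first
        return false;
    }

    /** Sender thread: picks the source file for the next chunk to send. */
    public String sourceFor(int chunk) {
        transferred.set(chunk);
        return copiedToTmp.get(chunk) ? "partition.bin.tmp" : "partition.bin";
    }
}
```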

CommunicationSpi

To benefit from zero file copy, we must delegate the file transferring to FileChannel#transferTo(long, long, java.nio.channels.WritableByteChannel) [2], because the fast path of the transferTo method is only taken when the destination channel is an instance of an internal JDK class.
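A minimal sketch of such a send loop is shown below. FileChannel#transferTo may transfer fewer bytes than requested, so it must be called in a loop; when the target is a SocketChannel, the JDK can use the OS-level sendfile path and avoid copying through user space. The class name here is illustrative, not Ignite code.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch of a zero-copy send loop for one partition file. */
public class ZeroCopySender {
    /** Transfers the whole file to the target channel; returns bytes sent. */
    public static long send(Path partFile, WritableByteChannel target) throws IOException {
        try (FileChannel src = FileChannel.open(partFile, StandardOpenOption.READ)) {
            long pos = 0, size = src.size();
            // transferTo may send less than requested, so loop until done.
            while (pos < size)
                pos += src.transferTo(pos, size - pos, target);
            return pos;
        }
    }
}
```

For illustration any WritableByteChannel works as the target, but the zero-copy fast path applies only to channels the JDK recognizes internally, such as a SocketChannel.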

...

When the supplier node receives the cache partition file demand request, it must prepare and provide the cache partition file for transfer over the network. Copy-on-Write [3] techniques are assumed to be used to guarantee data consistency during chunk transfer.

The checkpointing process on the supplier node is as follows:

  1. The node starts and waits for the current checkpoint process to finish;
  2. The node creates an empty temporary cache partition file with the .tmp postfix in the same cache persistence directory;
  3. The whole cache partition file is divided into virtual rebalance chunks of a predefined size (a multiple of the PageMemory page size);
  4. The checkpoint thread determines the appropriate rebalance file chunk and tries to flush a dirty page to the cache partition file:
    1. If the rebalance chunk has already been transferred, just flush the dirty page to the file;
    2. If the rebalance chunk has not been transferred yet:
      1. Write this chunk to the temporary cache partition file;
      2. Flush the dirty page;
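The copy-on-write step 4.2 can be sketched as below: before the checkpoint flushes a dirty page into a not-yet-transferred chunk, the original bytes of that chunk are preserved in the temporary file at the same offset, so the sender can still read a consistent pre-checkpoint copy. The helper class and method names are hypothetical, not Ignite's actual API.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

/** Hypothetical sketch of preserving one chunk before a checkpoint write. */
public class ChunkCow {
    public static void preserveChunk(FileChannel orig, FileChannel tmp,
                                     long chunkOff, int chunkSize) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(chunkSize);
        // Positional reads do not disturb the channel's current position.
        while (buf.hasRemaining()) {
            int n = orig.read(buf, chunkOff + buf.position());
            if (n < 0)
                break; // the last chunk may extend past the end of the file
        }
        buf.flip();
        // Write the preserved bytes into the .tmp file at the same offset.
        while (buf.hasRemaining())
            tmp.write(buf, chunkOff + buf.position());
    }
}
```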

...

– items 4-6 of the Process Overview.

Recording to temp-WAL

When the demander node determines that the peer-to-peer file transfer rebalance approach can be used, it initiates a new temporary cache update storage. During the cache partition file transfer, the demander node stores WAL records of all cache updates in this storage.
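The record-then-replay behavior of this temporary storage can be sketched as follows. This is a simplified, hypothetical model (an in-memory queue standing in for the temporary WAL), not Ignite's actual storage; real code would persist the records and handle concurrency.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

/**
 * Hypothetical sketch of the temporary update storage: while the partition
 * file is in flight, incoming cache updates are recorded instead of being
 * applied; once the file is received, the recorded updates are replayed in
 * arrival order, and later updates are applied directly.
 */
public class TempUpdateLog<R> {
    private final Queue<R> records = new ArrayDeque<>();
    private boolean recording = true;

    /** Incoming update: recorded while the transfer is in progress. */
    public void onUpdate(R record, Consumer<R> applyDirectly) {
        if (recording)
            records.add(record);
        else
            applyDirectly.accept(record);
    }

    /** Called when the partition file is received; returns replayed count. */
    public int replay(Consumer<R> apply) {
        recording = false;
        int n = 0;
        for (R r; (r = records.poll()) != null; n++)
            apply.accept(r);
        return n;
    }
}
```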


Risks and Assumptions

A few notes are worth mentioning:

...