Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

There are two participants in the process of balancing data – demaner (receiver of partition files), supplier (sender of partition files).

The process of ordering cache groups for rebalance remains the same. The  
The whole process is described in terms of rebalance single cache group :and partition files would be rebalanced one-by-one.

  1. The demander node prepares the set of IgniteDhtDemandedPartitionsMap#full cache partitions to fetch;
  2. The demander node checks compatibility version (for example, 2.8) and starts recording all incoming cache updates to the new special storage – the temporary WAL;
  3. The demander node sends the GridDhtPartitionDemandMessage to the supplier node;
  4. When the supplier node receives GridDhtPartitionDemandMessage and starts the new checkpoint process;
  5. The supplier node creates empty the temporary cache partition file with .tmp postfix in the same cache persistence directory;
  6. The supplier node splits the whole cache partition file into virtual chunks of predefined size (multiply to the PageMemory size);
    1. If the concurrent checkpoint thread determines the appropriate cache partition file chunk and tries to flush dirty page to the cache partition file
      1. If rebalance chunk already transferred
        1. Flush the dirty page to the file;
      2. If rebalance chunk not transferred
        1. Write this chunk to the temporary cache partition file;
        2. Flush the dirty page to the file;
    2. The node starts sending to the demander node each cache partition file chunk one by one using FileChannel#transferTo
      1. If the current chunk was modified by checkpoint thread – read it from the temporary cache partition file;
      2. If the current chunk is not touched – read it from the original cache partition file;
  7. The demander node starts to listen to new pipe incoming connections from the supplier node on TcpCommunicationSpi;
  8. The demander node creates the temporary cache partition file with .tmp postfix in the same cache persistence directory;
  9. The demander node receives each cache partition file chunk one by one
    1. The node checks CRC for each PageMemory in the downloaded chunk;
    2. The node flushes the downloaded chunk at the appropriate cache partition file position;
  10. When the demander node receives the whole cache partition file
    1. The node initializes received .tmp cache partition file as the file holder;
    2. Thread-per-partition begins to apply data entries from the begining of WAL-temporary storage;
    3. All async operations corresponding to this partition file still write to the end of temporary WAL;
    4. At the moment of WAL-temporary storage is ready to be empty
      1. The node owns the new cache partition;
      2. The node switches writings direct to the partition file (step of writing to the temp-WAL is excluded);
      3. Schedule the temporary WAL storage deletion;
  11. The supplier node deletes the temporary cache partition file;

...

Code Block
languagejava
titleCommunicationSpi.java
collapsetrue
/**
 * @return {@code True} if new type of direct connections supported.
 */
public default boolean pipeConnectionSupported() {
    return false;
}
 
/**
 * @param src Source cluster node to initiate connection with.
 * @return Channel to listen.
 * @throws IgniteSpiException If fails.
 */
public default ReadableByteChannel getRemotePipe(ClusterNode src) throws IgniteSpiException {
    throw new UnsupportedOperationException();
}
 
/**
 * @param dest Destination cluster node to communicate with.
 * @param out Channel to write data.
 * @throws IgniteSpiException If fails.
 */
public default void sendOnPipe(ClusterNode dest, WritableByteChannel out) throws IgniteSpiException {
    throw new UnsupportedOperationException();
}

Recovery


Risks and Assumptions

A few notes can be mentioned:

  • Zero-copy limitations – If operating system does not support zero copy, sending a file with FileChannel#transferTo might fail or yield worse performance. For example, sending a large file doesn't work well enough on Windows;
  • Disbaled SSL connection – SSL must be disabled to take an advantage of Java NIO zero copy file transmission using of FileChannel#transferTo. We can consider to use OpenSSL's non-copying interface to avoid allocating new buffers for each read and write operation at Phase-2;
  • Writing WAL io wait time –  Under the heavy load of partition file transmission, writing to the temporary WAL storage may be slowing down. Since the loss of data of temporary WAL storage has no risks we can consider store the whole storage into the memory.

Phase-2

The SSL must be disabled to take an advantage of Java NIO zero-copy file transmission using FileChannel#transferTo method. If we need to use SSL the FileChunk approach needs to be implemented instead. As the SSL engine generally needs a direct ByteBuffer to do encryption we can't avoid copying buffer payload from the OS level to the application level

Discussion Links

// Links to discussions on the devlist, if applicable.

...