
...

  • Although all cache data is sent between peers in batches (a GridDhtPartitionSupplyMessage is used), entries are still processed one by one. Such processing has little impact on a pure in-memory Apache Ignite deployment, but with the native persistence enabled it leads to additional fsyncs and WAL records.

    By default, setRebalanceThreadPoolSize is set to 1 and setRebalanceBatchSize to 512K, which means that thousands of key-value pairs are processed single-threaded and individually (see the configuration sketch after this list). Such an approach results in: 
    • Extra, unnecessary changes to keep node data structures up to date. Inserting each entry record into the CacheDataStore traverses and modifies each index tree N times. It allocates space within the FreeList structure N times and additionally has to store WAL page delta records, with approximate complexity ~ O(N*log(N));
    • A batch with N entries produces N records in the WAL, which may end up with N fsyncs (assuming the FSYNC WAL mode is configured);
    • An increased chance of long JVM pauses. The more short-lived objects we produce while applying changes, the more often GC happens and the higher the chance that JVM pauses arise;

  • The rebalancing procedure doesn't utilize the network and storage device throughput to its full extent, even with reasonably large values of setRebalanceThreadPoolSize. For instance, following the common recommendation of N+1 threads (where N is the number of available CPU cores) to increase rebalance speed dramatically slows down computation performance on the demander node. This is easy to see on the CPU utilization graphs in the Profiling section below.
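
    The defaults mentioned above can be adjusted on node startup. Below is a minimal sketch, assuming an Apache Ignite 2.x version where both setRebalanceThreadPoolSize and setRebalanceBatchSize are exposed on IgniteConfiguration; the values shown are the defaults discussed above, not a tuning recommendation.

Code Block
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class RebalanceDefaults {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Defaults discussed above: one rebalance thread and 512 KB supply
        // message batches, so entries from each GridDhtPartitionSupplyMessage
        // are still applied to the page memory one by one.
        cfg.setRebalanceThreadPoolSize(1);
        cfg.setRebalanceBatchSize(512 * 1024);

        try (Ignite ignite = Ignition.start(cfg)) {
            // The node now rebalances with the settings above.
        }
    }
}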


Advantages of peer-2-peer balancing

...

  • All data stored in a single partition file is transmitted within a single batch (equal to the partition file) much faster and without the serialization/deserialization overhead. To roughly estimate the superiority of transmitting partition files over network sockets, the native Linux scp/rsync commands can be used. The test environment showed 270 MB/s versus the current 40 MB/s single-threaded rebalance speed;
  • Zero-copy file transmission can be used [1]; see the sketch below. The contents of a file can be transmitted without copying them through user space. Internally, this depends on the underlying operating system's support for zero copy. For instance, on UNIX and various flavors of Linux, the Java FileChannel.transferTo() call is routed to the sendfile() system call;
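
A minimal sketch of the zero-copy approach described above, assuming a supplier that streams a partition file straight into a socket; the file path, host name, and port are hypothetical and for illustration only.

Code Block
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ZeroCopyPartitionSend {
    public static void main(String[] args) throws IOException {
        // Hypothetical partition file and demander address.
        try (FileChannel src = FileChannel.open(
                 Paths.get("db/cacheGroup/part-1.bin"), StandardOpenOption.READ);
             SocketChannel dst = SocketChannel.open(
                 new InetSocketAddress("demander-host", 47101))) {

            long pos = 0;
            long size = src.size();

            // transferTo() may transmit fewer bytes than requested, so loop.
            // On Linux the call is routed to sendfile(): bytes move from the
            // page cache to the socket without ever entering user space.
            while (pos < size)
                pos += src.transferTo(pos, size - pos, dst);
        }
    }
}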

Profiling

CPU


CPU utilization (supplier, demander)

[Image: setRebalanceThreadPoolSize – 9; setRebalanceBatchSize – 512K]

[Image: setRebalanceThreadPoolSize – 1; setRebalanceBatchSize – 512K]

Persistence mode


Code Block
batches            : 146938   
rows               : 77321844 
rows per batch     : 526      

time (total)       : 40 min   
cache size         : 78055 MB 
rebalance speed    : 31 MB/sec
rows per sec       : 31470 rows

+ cache rebalance total                          : 2456973 ms : 100.00
+ + preload on demander                          : 2415154 ms : 98.30 
+ + + offheap().invoke(..)                       : 1640175 ms : 66.76 
+ + + + dataTree.invoke(..)                      : 1595260 ms : 64.93 
+ + + + + BPlusTree.invokeDown(..)               : 220390 ms  : 8.97  
+ + + + + FreeList.insertDataRow(..)             : 1340636 ms : 54.56 
+ + + + CacheDataStoreImpl.finishUpdate(..)      : 10807 ms   : 0.44  
+ + + ttl().addTrackedEntry(..)                  : 9678 ms    : 0.39  
+ + + wal().log(..)                              : 664680 ms  : 27.05 
+ + + continuousQueries().onEntryUpdated(..)     : 8521 ms    : 0.35  
+ message serialization                          : 1618 ms    : 0.07  
+ network delay between nodes                    : 7788 ms    : 0.32  
+ make batch on supplier handleDemandMessage(..) : 185749 ms  : 7.59  

...