IEP-16: Optimization of rebalancing

Created by Ilya Lantukh, last modified by Alexey Goncharuk on Aug 15, 2018

ID	IEP-16
Author	Ilya Lantukh
Sponsor	Ilya Lantukh, Anton Vinogradov
Created	Mar 28 2018
Status	ACTIVE

Motivation

The rebalancing procedure doesn't utilize network and storage device throughput to the full extent.

Description

Our current implementation has a number of issues caused by a single fundamental problem.

During the rebalance process, data is sent in batches (called GridDhtPartitionSupplyMessages), but the entries in each batch are processed one by one.

So we don't take any advantage of batch processing and:

- checkpointReadLock is acquired multiple times for every entry, leading to unnecessary contention. This is clearly a bug;

- for each entry we write (and fsync, if the configuration requires it) a separate WAL record, so if a batch contains N entries, we might end up doing N fsyncs;

- every entry is also added to CacheDataStore completely independently. This means we traverse and modify each index tree N times, allocate space in FreeList N times, and additionally have to store O(N*log(N)) page delta records in the WAL.

The default batch size is 512KB, which means thousands of key-value pairs are received at once but processed individually.
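
As a rough illustration of the costs listed above, the following Python sketch counts lock acquisitions, fsyncs, and index page visits for a batch handled entry by entry versus as a whole. It is a toy model, not Ignite code: the average entry size (128 bytes), the tree height (3), the number of entries per leaf page (64), and all function names are assumptions chosen for illustration. The per-entry model counts one lock acquisition and one fsync per entry, which is a lower bound for the lock and the worst case for fsync. The batched model assumes the entries are sorted so that each touched leaf page needs only one descent.

```python
import math
from dataclasses import dataclass


@dataclass
class Cost:
    """Counters for the expensive operations discussed in the text."""
    lock_acquisitions: int = 0
    fsyncs: int = 0
    tree_page_visits: int = 0


def entries_per_batch(batch_bytes: int, avg_entry_bytes: int) -> int:
    """Estimate how many entries fit into one supply message (hypothetical sizes)."""
    return batch_bytes // avg_entry_bytes


def per_entry_cost(n: int, tree_height: int) -> Cost:
    """Model of per-entry handling: every entry pays for everything itself."""
    cost = Cost()
    for _ in range(n):
        cost.lock_acquisitions += 1           # at least one lock acquisition per entry
        cost.fsyncs += 1                      # worst case: one WAL fsync per entry
        cost.tree_page_visits += tree_height  # full root-to-leaf descent per entry
    return cost


def batched_cost(n: int, tree_height: int, entries_per_leaf: int) -> Cost:
    """Model of batch handling: one lock, one fsync, one descent per touched leaf.

    Assumes the batch is sorted by key so consecutive entries share leaf pages.
    """
    touched_leaves = math.ceil(n / entries_per_leaf)
    return Cost(
        lock_acquisitions=1,
        fsyncs=1,
        tree_page_visits=touched_leaves * tree_height,
    )
```

With these assumed numbers, a 512KB batch holds 4096 entries. The per-entry model then performs 4096 lock acquisitions, 4096 fsyncs, and 12288 page visits, while the batched model performs 1, 1, and 192 respectively.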

We propose the following steps to fix the issue:

1. Remove inefficiencies from the current implementation: avoid unnecessary but costly operations while still handling each cache entry independently.
2. Redesign the rebalance process to handle entries in batches.
3. Introduce a new mode that transfers the whole partition file instead of iterating over key-value pairs.

Risks and Assumptions

// Describe project risks, such as API or binary compatibility issues, major protocol changes, etc.

Discussion Links

http://apache-ignite-developers.2346864.n4.nabble.com/Rebalancing-how-to-make-it-faster-td28457.html

Reference Links

// Links to various reference documents, if applicable.

Tickets



10 Comments

Maxim Muzafarov
Ilya Lantukh Can we move the whole IGNITE-8020 to the new IEP-28? I think it's a more convenient way to handle this improvement.
- Nov 06, 2018
Ilya Lantukh
Maxim Muzafarov, why do we need another IEP for rebalancing? Why can't we just continue work in the scope of this IEP?
- Nov 06, 2018
1. Maxim Muzafarov
  Ilya Lantukh From my point of view, this IEP-16 is more about rebalancing optimization than about developing a new rebalance approach. It contains a few rough open questions which are not described well enough and need to be discussed on the dev-list first.
  
  I'd like to focus on a single improvement of Apache Ignite rebalancing in the separate IEP-28 and discuss it on the dev-list separately, because:
  - it contains major changes to the CommunicationSpi interface
  - a new temporary WAL storage will be introduced
  - the checkpoint will be changed to provide zero-copy file transmission.
  
  I will link the new IEP-28 to this IEP-16.
  Thoughts?
  Nov 06, 2018
  1. Ilya Lantukh
    Maxim Muzafarov According to the motivation section of IEP-28, it is also about rebalancing optimizations. If you think that some questions are not described well, you can add your own description or start a conversation on the dev-list.
    
    Nov 06, 2018
    1. Maxim Muzafarov
      
      Ilya Lantukh Thanks, yes sure, I will start a discussion on dev-list. The IEP-28 is currently at the DRAFT state, so we can change the motivation section as you like. It will also be complemented with new details. Basically, it's the new rebalance approach – updating this IEP-16 page will overcomplicate the whole article.
      
      Why should we do that? Why should we group all improvements on a single page?
      My point: single improvement = single IEP
      
      Nov 06, 2018
      1. Ilya Lantukh
        
        Maxim Muzafarov I'd prefer the approach "IEP per unit of functionality", and in this sense we do not need another IEP for rebalancing optimizations because we already have one. "Single improvement = single IEP" will eventually lead to "single ticket = single IEP".
        
        Nov 06, 2018
        
        Maxim Muzafarov
        
        Ilya Lantukh This is not exactly true.
        
        The IEP-28 will produce at least 4 tickets. They can be implemented independently. At least 3 of them are not related to rebalancing optimization. So, I propose to create a separate IEP page and describe those changes there.
        
        Nov 06, 2018
        
        Ilya Lantukh
        
        Maxim Muzafarov I think such questions should be discussed on the dev-list.
        
        Nov 06, 2018
    2. Anton Vinogradov
      
      + 1 to "single improvement = single IEP"
      
      Nov 06, 2018
Dmitry Pavlov
Hi, from a very brief review of these IEPs, it seems to me they could be IEP-16 & IEP-16.1.
It seems to me the paramount thing here is that the improvements get done; IEP naming or content is not so important.
But anyway, we can discuss this if it is fundamentally important to someone.
- Nov 06, 2018
