ID: IEP-16
Author: Ilya Lantukh
Sponsor: Ilya Lantukh, Anton Vinogradov
Created: Mar 28 2018
Status: ACTIVE

Motivation

The rebalancing procedure doesn't utilize network and storage device throughput to the full extent.

Description

Our current implementation has a number of issues caused by a single fundamental problem.

During the rebalance process, the data is sent in batches (called GridDhtPartitionSupplyMessages), but the entries in each batch are processed one by one.

So we don't take any advantage of batch processing and:

- checkpointReadLock is acquired multiple times for every entry, leading to unnecessary contention - this is clearly a bug;
- for each entry we write (and fsync, if the configuration requires it) a separate WAL record, so if a batch contains N entries, we might end up doing N fsyncs;
- every entry is also added to CacheDataStore completely independently. This means we traverse and modify each index tree N times, allocate space in FreeList N times, and additionally store O(N*log(N)) page delta records in the WAL.

The default batch size is 512 KB, which means thousands of key-value pairs are received at once but processed individually.
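
The cost difference described above can be illustrated with a minimal Java sketch. It is a toy model and not Ignite code: the class `RebalanceBatchSketch`, its `Stats` counters, and the `writeWalRecord` placeholder are hypothetical names introduced only for illustration, and a plain `ReentrantReadWriteLock` stands in for the checkpoint read lock. The per-entry path takes the lock once per entry, which is a lower bound on the behaviour described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical model (not Ignite code) that counts costly operations per supply message. */
final class RebalanceBatchSketch {
    /** Counters standing in for real lock acquisitions and WAL fsyncs. */
    static final class Stats {
        int lockAcquisitions;
        int fsyncs;
    }

    /** Stand-in for the checkpoint read lock; an assumption made for this sketch. */
    private static final ReentrantReadWriteLock CHECKPOINT_LOCK = new ReentrantReadWriteLock();

    /** Per-entry handling: one lock acquisition and one fsync for every entry. */
    static Stats applyOneByOne(List<String> entries) {
        Stats stats = new Stats();
        for (String entry : entries) {
            CHECKPOINT_LOCK.readLock().lock();
            stats.lockAcquisitions++;
            try {
                writeWalRecord(entry);
                stats.fsyncs++; // models a WAL mode that syncs after each record
            } finally {
                CHECKPOINT_LOCK.readLock().unlock();
            }
        }
        return stats;
    }

    /** Batched handling: one lock acquisition and one fsync for the whole message. */
    static Stats applyAsBatch(List<String> entries) {
        Stats stats = new Stats();
        if (entries.isEmpty())
            return stats;
        CHECKPOINT_LOCK.readLock().lock();
        stats.lockAcquisitions++;
        try {
            for (String entry : entries)
                writeWalRecord(entry);
            stats.fsyncs++; // a single sync covers all records written above
        } finally {
            CHECKPOINT_LOCK.readLock().unlock();
        }
        return stats;
    }

    /** Placeholder for appending a WAL record; intentionally does nothing. */
    private static void writeWalRecord(String entry) {
        // no-op in this model
    }

    public static void main(String[] args) {
        List<String> batch = new ArrayList<>();
        for (int i = 0; i < 1000; i++)
            batch.add("key-" + i);

        Stats perEntry = applyOneByOne(batch);
        Stats batched = applyAsBatch(batch);

        System.out.println("per-entry: locks=" + perEntry.lockAcquisitions + ", fsyncs=" + perEntry.fsyncs);
        System.out.println("batched:   locks=" + batched.lockAcquisitions + ", fsyncs=" + batched.fsyncs);
    }
}
```

For a batch of N entries, the per-entry path in this model performs N lock acquisitions and N fsyncs, while the batched path performs one of each. The model does not cover index tree or FreeList updates.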
We propose a three-step approach to fix the issue:
  1. Remove inefficiencies from the current implementation: avoid any unnecessary but costly operations while still handling each cache entry independently.
  2. Redesign the rebalance process to handle entries in batches.
  3. Introduce a new mode that transfers the whole partition file instead of iterating over key-value pairs.
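
This page does not specify how the partition file transfer in step 3 would work. As one possible illustration, the sketch below streams a whole file into a channel with the standard `FileChannel.transferTo` call, which lets the operating system move the bytes without copying them through user-space buffers where the platform supports it. The class `PartitionFileTransferSketch` and the method `sendFile` are hypothetical names, and the sketch assumes a blocking target channel and a file that is not modified during the transfer.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch (not Ignite code) of sending a whole file through a channel. */
final class PartitionFileTransferSketch {
    /**
     * Streams the whole file into the target channel and returns the number of bytes sent.
     * Assumes a blocking target and a file that does not shrink while being sent.
     */
    static long sendFile(Path partitionFile, WritableByteChannel target) throws IOException {
        try (FileChannel src = FileChannel.open(partitionFile, StandardOpenOption.READ)) {
            long size = src.size();
            long pos = 0;
            while (pos < size) {
                // transferTo may move fewer bytes than requested, so keep going until done.
                pos += src.transferTo(pos, size - pos, target);
            }
            return pos;
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("part-", ".bin");
        Path dst = Files.createTempFile("part-copy-", ".bin");
        try {
            Files.write(src, new byte[] {1, 2, 3, 4, 5});
            // A local file stands in for the network channel to the demander node.
            try (FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE)) {
                long sent = sendFile(src, out);
                System.out.println("bytes sent: " + sent);
            }
        } finally {
            Files.deleteIfExists(src);
            Files.deleteIfExists(dst);
        }
    }
}
```

In a real cluster the target would be a socket channel to the receiving node, and updates arriving during the transfer would have to be handled separately; neither is modelled here.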

Risks and Assumptions

// Describe project risks, such as API or binary compatibility issues, major protocol changes, etc.

Discussion Links

http://apache-ignite-developers.2346864.n4.nabble.com/Rebalancing-how-to-make-it-faster-td28457.html

Reference Links

// Links to various reference documents, if applicable.

Tickets


10 Comments

  1. Ilya Lantukh, can we move the whole IGNITE-8020 to the new IEP-28? I think it's a more convenient way to handle this improvement.

  2. Maxim Muzafarov, why do we need another IEP for rebalancing? Why can't we just continue work in the scope of this IEP?

    1. Ilya Lantukh, from my point of view, this IEP-16 is more about rebalancing optimization than about developing a new rebalance approach. It contains a few raw open questions which are not described well enough and need to be discussed on the dev-list first.

      I'd like to focus on a single improvement of Apache Ignite rebalancing in the separate IEP-28 and discuss it on the dev-list separately, because:
       - it contains major changes to the CommunicationSpi interface
       - a new temporary WAL storage will be introduced
       - the checkpoint will be changed to provide zero-copy file transmission.

      I will link the new IEP-28 to this IEP-16.
      Thoughts? 

      1. Maxim Muzafarov, according to the motivation section of IEP-28, it is also about rebalancing optimizations. If you think that some questions are not described well, you can add your own description or start a conversation on the dev-list.

        1. Ilya Lantukh, thanks, yes, sure, I will start a discussion on the dev-list. IEP-28 is currently in the DRAFT state, so we can change the motivation section as you like. It will also be complemented with new details. Basically, it's a new rebalance approach, and updating this IEP-16 page would overcomplicate the whole article.

          Why should we do that? Why should we group all improvements on a single page?
          My point – single improvement = single IEP

          1. Maxim Muzafarov, I'd prefer the approach "IEP per unit of functionality", and in this sense we do not need another IEP for rebalancing optimizations because we already have one. "Single improvement = single IEP" will eventually lead to "single ticket = single IEP".

            1. Ilya Lantukh, this is not exactly true.

              The IEP-28 will produce at least 4 tickets. They can be implemented independently. At least 3 of them are not related to rebalancing optimization. So, I propose to create a separate IEP page and describe those changes there.

              1. Maxim Muzafarov, I think such questions should be discussed on the dev-list.

        2. + 1 to "single improvement = single IEP"

  3. Hi, from a very brief review of these IEPs it seems to me it can be IEP-16 & IEP-16.1.

    It seems to me the paramount thing here is that the improvements get done, and IEP naming or content is not so important.
    But anyway, we can discuss this if it is fundamentally important for someone.