...

Rebalancing relocates data from heavily loaded members to lightly loaded members. Currently Geode only supports manual rebalancing, triggered by issuing a gfsh command or a Java function call. In most cases, the decision to rebalance is based on the data distribution in the cluster and the max-memory configuration of the members. Since Geode monitors data size, it can also trigger rebalancing automatically. Auto-balancing will redistribute data-load periodically and prevent conditions leading to failures.

Requirements

  1. Configurable size threshold to qualify the system as off-balance
  2. Minimize the impact of continuous rebalancing on concurrent operations
    1. Configurable schedule
    2. Ability to disable auto-balancing
    3. Minimum size qualification
    4. Ability to plug in a custom auto-balance manager
  3. Reuse the existing manual-rebalancing flow for a consistent rebalancing experience

Alternatives

The user can schedule a cron job to invoke the gfsh rebalance command on a periodic basis.
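For example, a crontab entry along these lines could run the rebalance during off-peak hours (the locator host and port below are placeholders):

# Every Saturday at 3 AM, connect to the cluster and rebalance
0 3 * * 6  gfsh -e "connect --locator=localhost[10334]" -e "rebalance"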

...

Background

  1. A member is unhealthy if its heap is critical. Ideally a user would want to redistribute the load on an unhealthy member to other members only if those members have sufficient capacity (i.e. totalBytes + newBucketSize << localMaxMemory). Redistribution of load may cause healthy members to become unhealthy, and in some cases the entire cluster could fail. Rebalancing also increases IO activity significantly. So it may be safer to rebalance the cluster manually if any node is unhealthy.
  2. The current implementation of the rebalance operation can be used to estimate transfer-size before actually executing the transfer (see the sketch after this list). Transfer-size is the total number of bytes that may be moved during a rebalance operation. It is mainly based on the total number of buckets below redundancy level and the load on individual nodes. Executing a rebalance when transfer-size is very small is inefficient; rebalancing when transfer-size is high may overload the system.
  3. Cluster capacity can be increased by adding new nodes. A user can specify the rebalance flag when the last node is added. This way frequent rebalancing can be avoided.

Based on the points discussed above, we plan to use the transfer-size metric as the primary decision factor for triggering rebalance. The presence of an empty node will be ignored, assuming the user may be adding more capacity. Similarly, critical nodes will be ignored, assuming such nodes need specific region-based rebalance actions.
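
As a rough sketch, the trigger check could reuse the existing simulate API. The class and method names below are illustrative, and the package namespace (org.apache.geode vs. com.gemstone.gemfire) depends on the release:

import org.apache.geode.cache.Cache;
import org.apache.geode.cache.control.RebalanceOperation;
import org.apache.geode.cache.control.RebalanceResults;

public class TransferSizeCheck {
  // Illustrative trigger check: simulate a rebalance and compare the
  // reported transfer-size against size-threshold-percent of total data
  static boolean needsRebalance(Cache cache, long totalDataSizeBytes,
      int sizeThresholdPercent) throws InterruptedException {
    // simulate() computes the rebalance plan without moving any buckets
    RebalanceOperation simulation =
        cache.getResourceManager().createRebalanceFactory().simulate();
    RebalanceResults results = simulation.getResults(); // blocks until done
    long transferSize = results.getTotalBucketTransferBytes();
    return transferSize > (totalDataSizeBytes * sizeThresholdPercent) / 100;
  }
}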

How is load defined?

Load on a member is a function of

  1. Total number of buckets hosted on the member
  2. Number of primary buckets on the member
  3. Number of secondary buckets on the member
  4. Size of the buckets
  5. Maximum memory
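
These inputs are all exposed by the public PartitionMemberInfo API. A sketch of reading them follows; the utilization ratio is an illustrative metric, not the exact formula the resource manager uses:

import org.apache.geode.cache.Region;
import org.apache.geode.cache.partition.PartitionMemberInfo;
import org.apache.geode.cache.partition.PartitionRegionHelper;
import org.apache.geode.cache.partition.PartitionRegionInfo;

public class MemberLoad {
  // Print the load inputs for every member hosting the partitioned region
  static void printLoads(Region<?, ?> region) {
    PartitionRegionInfo info = PartitionRegionHelper.getPartitionRegionInfo(region);
    for (PartitionMemberInfo m : info.getPartitionMemberInfo()) {
      // Illustrative utilization: bytes hosted vs. configured max-memory
      double utilization = (double) m.getSize() / m.getConfiguredMaxMemory();
      System.out.printf("%s: buckets=%d primaries=%d bytes=%d util=%.2f%n",
          m.getDistributedMember(), m.getBucketCount(), m.getPrimaryCount(),
          m.getSize(), utilization);
    }
  }
}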

When is a member heavily loaded?

  1. Its heap is critical (such members are included in ResourceAdvisor.adviseCriticalMembers())
  2. The node is misconfigured, e.g. max memory is not sufficient to host even one bucket

When is a member lightly loaded?

  1. It has enough memory: totalBytes + newBucket.getBytes() << localMaxMemory

When is the cluster off-balance?

  1. [Auto-balance candidate] If transfer-size is more than X% of the total data size. A rebalance can result in a consistent data distribution and create comparable free space on all nodes.
  2. [Auto-balance candidate] If the cluster is not running at the configured redundancy levels.
  3. [Prefer manual rebalance] If any unhealthy node exists in the cluster.

Where can a bucket be moved?

  1. For a bucket B, it can be moved to any lightly loaded member that is not already hosting B.

Use Cases

  1. After node failure and recovery, the gfsh command "rebalance -simulate" reports a high transfer-size. In this case the nodes may have comparable utilization, but a rebalance would result in a uniform region data distribution. So auto-balancing would take action.
  2. Over time, some buckets may grow much larger than other buckets in the region, or some regions may grow more than others. Rebalance would get triggered, resulting in a uniform distribution.

Design

We would like to implement this as an independent module without modifying existing code, so that it can be easily applied to any version of the system. To enable auto-balancing, the user will place the auto-balance jar on their classpath and add an initializer to their cache.xml. The initializer will provide the following configuration:

  1. Schedule: In order to minimize the impact on concurrent operations, we feel it is important to provide the user with the ability to configure the frequency and timing of automatic rebalancing. Bucket movement does add load to the system, and in our performance tests the throughput of concurrent operations drops during bucket movement. The schedule is provided as a cron string, which makes it easy to pick off-peak hours. Default: once a week, on Saturday.
  2. Size-threshold-percent: an int between 1 and 99. Rebalancing will be triggered if the transfer-size is more than this percentage of the total data size. The rebalance operation computes transfer-size based on the relationship between regions, primary ownership and redundancy. Default: 10%.
  3. Auto-balance manager: the class that checks the system state and triggers rebalance. Default: AutoBalancer.
  4. Minimum-size: Rebalancing could be harmful when the cache is initially being populated, because bucket sizes may vary wildly when there is very little data. Because of that, we will also provide a minimum data-size threshold before automatic rebalancing kicks in. Default: 100 MB.

E.g.

<cache>
...
 <initializer>
  <!-- Optional auto-rebalance manager -->
  <class-name> com.gemstone.gemfire.cache.util.AutoBalancer </class-name>
 
  <!-- Optional. Default: Once a week on Saturday. E.g. check at 3 am every night -->
  <parameter name="schedule"> 0 0 3 * * ? </parameter>
 
  <!-- Optional. Default: 10%. E.g. don't rebalance until the transfer size is more than 10% of the total data size -->
  <parameter name="size-threshold-percent"> 10 </parameter>
 
  <!-- Optional. Default: 100 MB. E.g. don't rebalance until the transfer size is at least 100 MB -->
  <parameter name="minimum-size"> 100000000 </parameter>
 </initializer>
... 
</cache>


We only want one member to be automatically rebalancing a given region. So each member that starts auto-balancing will try to acquire a distributed lock. If a member obtains the lock, it will do the auto-balancing until the rebalance completes. Otherwise it will wait for the next scheduled cycle and try again.
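
A minimal sketch of that handshake using the public DistributedLockService API; the service and lock names here are hypothetical:

import org.apache.geode.cache.Cache;
import org.apache.geode.distributed.DistributedLockService;

public class AutoBalanceLock {
  // Hypothetical names; the real implementation may choose different ones
  static final String SERVICE = "auto-balance-lock-service";
  static final String LOCK = "auto-balance";

  // Try once without blocking; on failure the member simply waits for the
  // next scheduled slot instead of queueing on the lock
  static boolean tryAcquire(Cache cache) {
    DistributedLockService dls = DistributedLockService.getServiceNamed(SERVICE);
    if (dls == null) {
      dls = DistributedLockService.create(SERVICE, cache.getDistributedSystem());
    }
    // waitTimeMillis = 0: don't wait; leaseTimeMillis = -1: hold until unlock
    return dls.lock(LOCK, 0, -1);
  }

  static void release() {
    DistributedLockService.getServiceNamed(SERVICE).unlock(LOCK);
  }
}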

At the scheduled time, the auto-balancer will check the balance of the system. It will do that by calling PartitionRegionHelper.getPartitionRegionInfo and fetching the size in bytes of all of the regions from all members. It will sum colocated regions together (like rebalancing does).
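
For illustration, the size collection could look like the following sketch (grouping by colocated regions is omitted for brevity):

import org.apache.geode.cache.Cache;
import org.apache.geode.cache.partition.PartitionMemberInfo;
import org.apache.geode.cache.partition.PartitionRegionHelper;
import org.apache.geode.cache.partition.PartitionRegionInfo;

public class ClusterSize {
  // Sum the bytes hosted by every member, across all partitioned regions
  static long totalDataSizeBytes(Cache cache) {
    long total = 0;
    for (PartitionRegionInfo info : PartitionRegionHelper.getPartitionRegionInfo(cache)) {
      for (PartitionMemberInfo m : info.getPartitionMemberInfo()) {
        total += m.getSize();
      }
    }
    return total;
  }
}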

Note that this implies a limitation: members configured with the auto-balancer must have all of the regions defined, because otherwise some regions may not be rebalanced.


PlantUML

(*) --> "Init Cache"
--> "Init Auto-Balancer"
--> "Wait for next slot" as schedule
if "DLock acquired?" then
  -left-> [No] schedule
else
  --> [Yes] "Start execution"
  if "balanced?" then
    --> [Yes] schedule
  else
    --> [No] "Invoke Rebalance"
    --> "Wait for completion"
    --> schedule
  endif
endif

Testing

We will need to add auto-balancing to some existing tests and give it a schedule that will cause it to run during the test. We will also need to write unit tests for the rebalance-triggering and scheduling logic.

Limitations

  • Initializer: Geode has provision for a single initializer instance, and the Spring integration also depends on the Initializer. So an initializer-based approach could block users from using some features. Still, it seems OK for a POC, and some parts of the code (scheduler, locking and trigger logic) will be reusable.
  • For now, start with a separate module (like gemfire-web) for the rebalancer. We will consolidate smaller modules into a bigger one later if things get too cluttered.
  • Quartz seems to be overkill for just cron-string parsing. Since rebalance is an expensive operation, we expect users to schedule it during off-peak hours, which is where a cron-based schedule is very useful. We are not exposing the cron API externally and may replace it with a lighter cron-parsing implementation.
  • Only regions that are defined on the auto rebalancer node will be rebalanced. Users can add accessors if there is a region they want to make sure gets rebalanced but is not available everywhere.
  • Rebalancing always recovers redundancy, moves buckets, and moves primaries. This means that when the rebalancer kicks in, redundancy will be recovered, regardless of the settings for recovery-delay.
  • There is no way to disable or modify the automatic rebalancing without restarting members, since the configuration is part of the member configuration.