To be Reviewed By: 04/06/2020
Authors: Donal Evans
Status: Draft | Discussion | Active | Dropped | Superseded
Superseded by: N/A
Related: N/A
Problem
For some time, there has been demand for a feature that allows users to determine the redundancy status of partitioned regions and to restore any missing redundancy without triggering a full rebalance of the system.[1] [2] Currently, no internal API call or gfsh command exists that reports the redundancy status of all partitioned regions in a system, and the only way to manually trigger redundancy recovery is to perform a rebalance, a resource-intensive operation that can move large amounts of data and cause exceptions if transactions are running in the system.[3] To determine the redundancy status of all partitioned regions, a user must resort to the workaround of repeatedly calling
gfsh>show metrics --categories=partition --region=region_name
for every partitioned region in the system; the output of this command contains a large amount of information that is not relevant to redundancy status.[4]
Anti-Goals
The gfsh commands and internal API proposed here are not intended to facilitate moving buckets or data from one member to another. Nor are they intended to guarantee full redundancy after being called, as there may not be enough members in the cluster for regions to meet their configured redundancy. It is also not within the scope of this RFC to describe any REST API that may be created at a future point in time to make use of the proposed internal API.
Current Implementation Details
A rebalance operation is currently triggered by creating a RebalanceFactory object using the InternalResourceManager.createRebalanceFactory() method and calling RebalanceFactory.start() on that factory. Regions can be explicitly included in or excluded from the operation using the RebalanceFactory.includeRegions() and RebalanceFactory.excludeRegions() methods before calling start().
The RebalanceFactory.start() method creates and returns a RebalanceOperation object and calls the RebalanceOperation.start() method on it. This method schedules a PartitionedRegionRebalanceOp operation for each partitioned region, passing as an argument the CompositeDirector implementation of the RebalanceDirector interface. The RebalanceDirector receives a model of the current system and requests changes to it, such as creating buckets, moving buckets, or reassigning which members host primaries. The CompositeDirector implementation of RebalanceDirector removes over-redundancy, restores missing redundancy, moves buckets for better load balance, and reassigns primaries for better primary distribution.
The user can then call RebalanceOperation.getResults(), which waits for the operation to finish for all regions and returns a RebalanceResults object containing details about the number of buckets created and moved.
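For illustration, a minimal sketch of driving this existing API from an application via the public ResourceManager (the region name is illustrative, and the example assumes a cache is already running in this member):
import java.util.Collections;
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.control.RebalanceFactory;
import org.apache.geode.cache.control.RebalanceOperation;
import org.apache.geode.cache.control.RebalanceResults;

public class RebalanceExample {
  public static void main(String[] args) throws InterruptedException {
    Cache cache = CacheFactory.getAnyInstance();

    // Create the factory and restrict the rebalance to a single region.
    RebalanceFactory factory = cache.getResourceManager().createRebalanceFactory();
    factory.includeRegions(Collections.singleton("exampleRegion"));

    // start() schedules the operation; getResults() blocks until all
    // included regions have been rebalanced.
    RebalanceOperation operation = factory.start();
    RebalanceResults results = operation.getResults();

    System.out.println("Buckets created: " + results.getTotalBucketCreatesCompleted());
    System.out.println("Buckets transferred: " + results.getTotalBucketTransfersCompleted());
  }
}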
Solution
The proposed solution is to create a new RestoreRedundancyOperation which will behave, and be accessed, in broadly the same way as the existing RebalanceOperation. The operation will schedule PartitionedRegionRebalanceOp operations for each appropriate region, but will use a new RestoreRedundancyDirector in place of the CompositeDirector, performing only the steps that remove over-redundancy, restore missing redundancy, and (optionally) reassign primary buckets.
Since the existing RebalanceResults object does not capture the relevant information regarding actual redundancy, the RestoreRedundancyOperation.getResults() method will return a new RestoreRedundancyResults object containing the redundancy status for each region involved in the operation, as well as information about primary bucket reassignments.
Changes and Additions to Public Interfaces
To allow users to trigger a restore redundancy operation, a new RestoreRedundancyBuilder object will be created, similar to the existing RebalanceFactory. Two new methods will be added to the ResourceManager public interface to support this:
RestoreRedundancyBuilder createRestoreRedundancyBuilder()
Set<RestoreRedundancyOperation> getRestoreRedundancyOperations()
RestoreRedundancyBuilder
The RestoreRedundancyBuilder will be responsible for setting the regions to be explicitly included in or excluded from the RestoreRedundancyOperation (the default behaviour will be to include all regions), setting whether to reassign which members host primary buckets (the default behaviour will be to reassign primaries), and starting the operation. A method will also be included for getting the current redundancy status:
public interface RestoreRedundancyBuilder {
  RestoreRedundancyBuilder includeRegions(Set<String> regions);
  RestoreRedundancyBuilder excludeRegions(Set<String> regions);
  RestoreRedundancyBuilder doNotReassignPrimaries(boolean shouldNotReassign);
  RestoreRedundancyOperation start();
  RestoreRedundancyResults redundancyStatus();
}
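As a sketch of the intended call pattern (none of the API below exists yet; names follow the proposal above, and the region name is illustrative):
// Proposed API; assumes a running cache, with the builder obtained from the
// cache's ResourceManager via the new createRestoreRedundancyBuilder() method.
Cache cache = CacheFactory.getAnyInstance();
RestoreRedundancyBuilder builder = cache.getResourceManager().createRestoreRedundancyBuilder();

// Check the current redundancy status without starting an operation.
RestoreRedundancyResults currentStatus = builder.redundancyStatus();

// Restore redundancy for a single region, keeping the default behaviour
// of reassigning primaries.
RestoreRedundancyOperation operation = builder
    .includeRegions(Collections.singleton("exampleRegion"))
    .start();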
RestoreRedundancyOperation
The RestoreRedundancyOperation interface will provide access to the current status of the operation, the ability to cancel it, and the ability to retrieve the results, with or without a timeout:
public interface RestoreRedundancyOperation {
  boolean isCancelled();
  boolean isDone();
  boolean cancel();
  RestoreRedundancyResults getResults();
  RestoreRedundancyResults getResults(long timeout, TimeUnit unit);
}
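Continuing the sketch above, a caller could wait for results with a timeout, following the java.util.concurrent.Future idiom this interface mirrors (the timeout value is illustrative):
// Wait up to five minutes for the operation started above to complete.
RestoreRedundancyResults results = operation.getResults(5, TimeUnit.MINUTES);

// Or poll and cancel if the operation should be abandoned.
if (!operation.isDone()) {
  operation.cancel();
}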
RestoreRedundancyResults
The RestoreRedundancyResults object will be a collection of individual results for each region. It will contain methods for determining the overall success or failure of the operation, generating a detailed description of the state of the regions, and getting information about the work done to reassign primaries as part of the operation:
public interface RestoreRedundancyResults {
  enum Status {
    SUCCESS,
    FAILURE,
    ERROR
  }

  void addRegionResults(Map<String, RestoreRedundancyRegionResult> results);
  void addRegionResult(RestoreRedundancyRegionResult regionResult);
  Status getStatus();
  String getMessage();
  RestoreRedundancyRegionResult getRegionResult(String regionName);
  Map<String, RestoreRedundancyRegionResult> getZeroRedundancyRegionResults();
  Map<String, RestoreRedundancyRegionResult> getUnderRedundancyRegionResults();
  Map<String, RestoreRedundancyRegionResult> getSatisfiedRedundancyRegionResults();
  Map<String, RestoreRedundancyRegionResult> getRegionResults();
  int getTotalPrimaryTransfersCompleted();
  long getTotalPrimaryTransferTime();
}
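A sketch of how a caller might consume these results (method names as defined above; the output format is illustrative):
RestoreRedundancyResults results = operation.getResults();
if (results.getStatus() != RestoreRedundancyResults.Status.SUCCESS) {
  // getMessage() is expected to carry the detailed description of region state.
  System.out.println(results.getMessage());
}

// Regions with zero redundant copies are the most urgent to inspect.
for (RestoreRedundancyRegionResult regionResult :
    results.getZeroRedundancyRegionResults().values()) {
  System.out.println(regionResult.getRegionName() + " has no redundant copies");
}

System.out.println("Primaries reassigned: " + results.getTotalPrimaryTransfersCompleted());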
RestoreRedundancyRegionResult
Finally, the RestoreRedundancyRegionResult object will be a data structure containing a snapshot of a region's name, configured redundancy, actual redundancy, and a status representing the state of its redundancy:
public class RestoreRedundancyRegionResult {
  enum RedundancyStatus {
    SATISFIED,
    NOT_SATISFIED,
    NO_REDUNDANT_COPIES
  }

  public String getRegionName();
  public int getDesiredRedundancy();
  public int getActualRedundancy();
  public RedundancyStatus getStatus();
}
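This RFC does not prescribe how RedundancyStatus is derived, but one plausible mapping from configured and actual redundancy, shown purely for illustration, would be:
// Illustrative only; the actual mapping is an implementation detail.
static RedundancyStatus computeStatus(int configuredRedundancy, int actualRedundancy) {
  if (actualRedundancy >= configuredRedundancy) {
    return RedundancyStatus.SATISFIED;
  }
  if (actualRedundancy == 0) {
    // The region is configured for redundancy, but no redundant copies exist.
    return RedundancyStatus.NO_REDUNDANT_COPIES;
  }
  return RedundancyStatus.NOT_SATISFIED;
}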
Gfsh commands
To provide additional utility, two new gfsh commands will also be created:
restore redundancy [--include-region=value(,value)*] [--exclude-region=value(,value)*] [--dont-reassign-primaries(=value)]
status redundancy [--include-region=value(,value)*] [--exclude-region=value(,value)*]
The first command will execute a function on members hosting the specified partitioned regions, trigger the restore redundancy operation for those regions, and then report their final redundancy status. If at least one redundant copy exists for every bucket in the specified regions, the command will return success status. If at least one bucket in a region has zero redundant copies, if a member in the system is running an older version of Geode, or if the restore redundancy function encounters an exception, the command will return error status.
The second command will determine the current redundancy status for the specified regions and report it to the user.
Both commands will take optional --include-region and --exclude-region arguments, similar to the existing rebalance command. If neither argument is specified, all regions will be included. When both are specified, included regions will take precedence over excluded regions. The restore redundancy command will also take an optional --dont-reassign-primaries argument to determine if primaries should not be reassigned during the operation. The default behaviour will be to reassign primaries.
Both commands will output first a list of regions with zero redundant copies (excluding regions configured to have zero redundant copies), then regions with less than their configured redundancy, and finally regions with full redundancy. The restore redundancy command will also output information about how many primaries were reassigned and how long that process took, similar to the existing rebalance command.
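For example (region names are illustrative):
gfsh>restore redundancy --include-region=region1,region2
gfsh>status redundancy --exclude-region=region3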
Performance Impact
Since the proposed changes do not modify any existing behaviour, no performance impact is anticipated. Moreover, since restoring redundancy without performing a full rebalance is significantly less resource-intensive, the addition of this API and these gfsh commands provides a more performant option for cases in which only restoration of redundancy is wanted.
Backwards Compatibility and Upgrade Path
Members running older versions of Geode will not be able to perform the restore redundancy operation, so if any such members are detected in the system, the operation will fail to start and return an error status.
Prior Art
Any proposed solution that did not use the existing rebalance logic would have to reimplement large and complicated areas of code in order to correctly create redundant copies on members. One alternative that would reuse the existing rebalance logic would be to add arguments to the existing rebalance operation to prevent moving buckets and prevent moving primaries. Given that the rebalance operation is already complicated, and that it could be confusing from a user's perspective to use the name “rebalance” for an operation that does not actually balance any data load, this alternative was rejected in favour of creating a new, dedicated operation to restore redundancy.
Errata
None so far
References
[1] https://issues.apache.org/jira/projects/GEODE/issues/GEODE-4250
[2] https://issues.apache.org/jira/projects/GEODE/issues/GEODE-4434
[3] https://geode.apache.org/docs/guide/16/developing/partitioned_regions/rebalancing_pr_data.html