PartitionedRegion Clear()

To be Reviewed By: Dec 20, 2019

...

Superseded by: N/A

Related: N/A

Problem

To support the clear() operation on a Partitioned Region (PR; partitioned regions are referred to as PRs throughout this doc). 

...

  • A PR is composed of a number of distributed bucket regions with 0 or more secondary (redundant) copies; clear has to be performed on both primary and secondary copies while keeping the data consistent between them.
  • During the operation, a bucket region could be moved by a rebalance operation or by members joining/leaving the cluster, or its state could change (primary/secondary).
  • Updating/clearing OQL indexes (which are maintained both synchronously and asynchronously).
  • Updating/clearing Lucene indexes, which are maintained asynchronously via AEQ.
  • Clearing persistent/overflowed data and managing data consistency across primary and secondary copies and disk stores, which could be offline.
  • Handling clear during region initialization, from both the initial image provider's and requester's points of view.
  • Handling concurrent clear and cache operations (put/destroy), in synchronization with primary and secondary copies.
  • Notifying clients subscribed to the PR of the clear event, and keeping them consistent with server-side data.
  • Managing transactional operations with and during clear.

Anti-Goals

  • Sending a clear event for WAN replication. Because this has implications for the entire region's data on the remote cluster, it is kept out of the scope of this work, in line with how WAN handles GII and region destroy operations.

Solution

When clear() is executed from a client or a peer member, one member is selected as the clear message co-ordinator (likely the member where the command originated or was received). The co-ordinator propagates the clear message to all members hosting the region, and becomes responsible for handling success and failure conditions while processing the local or distributed clear message.
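The co-ordinator's fan-out-and-aggregate role can be sketched in plain Java. This is a minimal illustration, not Geode's implementation: the `Member` interface and method names here are hypothetical stand-ins for Geode's internal membership and messaging types.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the co-ordinator role: fan the clear message out to every member
// hosting the PR and aggregate success/failure. Member is a hypothetical
// stand-in for Geode's internal membership/messaging machinery.
class ClearCoordinator {
    interface Member {
        String name();
        boolean processClear();   // returns false if this member's local clear failed
    }

    /** Sends the clear message to each hosting member; returns the names of members that failed. */
    static List<String> distributeClear(List<Member> hostingMembers) {
        List<String> failed = new ArrayList<>();
        for (Member m : hostingMembers) {
            if (!m.processClear()) {
                failed.add(m.name()); // the co-ordinator must handle failures (e.g. retry)
            }
        }
        return failed;
    }
}
```

In the real system the fan-out is a distributed message and the failure handling involves retries against re-resolved primaries; the point here is only that success/failure aggregation is centralized at the co-ordinator.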

Summary

At a high level, the steps involved in initiating and completing a clear operation are:

...

Considering the pros and cons of each option (explained below), option-1 and option-2 were picked as the probable choices, and option-2 was ultimately chosen as the viable option. 

Option-1: Co-ordinator member sends a single PR clear message to each member hosting the region (data store)

In this case the co-ordinator member sends a single clear region message to each of the other members hosting the region; the receiving member processes this message serially or in parallel using a thread/executor pool. The co-ordinator gets the primary bucket info and sends the clear message to each remote member along with that member's primary bucket list. The co-ordinator has to obtain and manage the primary bucket list in order to handle any bucket region movement during the clear (due to destroy or rebalance) and retry the clear operation. This is similar to how the query engine executes queries on remote bucket regions.
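The receiving side of option-1 can be sketched as follows. This is an illustrative stand-in, not Geode code: the bucket ids are hypothetical, and "clearing a bucket" is reduced to recording its id, to show only the one-message-per-member, parallel-per-bucket shape.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of option-1's receiving member: one clear message arrives covering
// all of this member's primary buckets, which are then cleared in parallel
// on an executor pool.
class MemberClearProcessor {
    /** Clears each listed primary bucket on a pool; returns the cleared bucket ids. */
    static Set<Integer> clearBuckets(List<Integer> primaryBuckets) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        Set<Integer> cleared = ConcurrentHashMap.newKeySet();
        List<Future<?>> futures = new ArrayList<>();
        for (Integer bucketId : primaryBuckets) {
            // stand-in for clearing the bucket region
            futures.add(pool.submit(() -> cleared.add(bucketId)));
        }
        for (Future<?> f : futures) {
            try {
                f.get(); // wait for all per-bucket clears to finish
            } catch (InterruptedException | ExecutionException e) {
                throw new RuntimeException(e);
            }
        }
        pool.shutdown();
        return cleared;
    }
}
```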

...

Consider a member that hosts primary bucket1 and bucket2. Both clear and TX take the RVV lock at the bucket level; if the clear thread processes bucket1 and then bucket2, while a TX thread operates on bucket2 first and then bucket1, the two threads will end up in a deadlock. 
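The hazard above is the classic lock-ordering deadlock. The sketch below reproduces it with plain `ReentrantLock`s as stand-ins for the per-bucket RVV locks; `tryLock()` is used for the second acquisition so the demo terminates instead of hanging, and latches force the problematic interleaving deterministically.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

// Lock-ordering hazard: clear takes bucket1 then bucket2, TX takes bucket2
// then bucket1. Once each thread holds its first lock, neither can acquire
// its second. With a blocking lock() this is a deadlock; tryLock() here
// merely observes the failure so the program can exit.
class LockOrderDemo {
    static boolean[] demo() {
        ReentrantLock bucket1 = new ReentrantLock(), bucket2 = new ReentrantLock();
        CountDownLatch firstHeld = new CountDownLatch(2);  // both threads hold their first lock
        CountDownLatch bothTried = new CountDownLatch(2);  // both threads tried their second lock
        boolean[] acquiredSecond = new boolean[2];
        Thread clear = new Thread(() -> {
            bucket1.lock();                          // clear: bucket1 first
            sync(firstHeld);
            acquiredSecond[0] = bucket2.tryLock();   // then bucket2 - held by TX, fails
            sync(bothTried);                         // keep bucket1 held until TX has tried too
            bucket1.unlock();
        });
        Thread tx = new Thread(() -> {
            bucket2.lock();                          // TX: bucket2 first
            sync(firstHeld);
            acquiredSecond[1] = bucket1.tryLock();   // then bucket1 - held by clear, fails
            sync(bothTried);
            bucket2.unlock();
        });
        clear.start(); tx.start();
        try { clear.join(); tx.join(); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return acquiredSecond;                       // both false: neither got its second lock
    }

    private static void sync(CountDownLatch latch) {
        latch.countDown();
        try { latch.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```

Option-2 sidesteps this by sending clear per bucket, so the clear path never needs to hold two bucket RVV locks at once.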

Option-2: Co-ordinator member sending clear message for each primary bucket

In this case the co-ordinator member sends a separate clear region message to each primary bucket, and handles success and failure (retry) based on the responses to those messages. This is similar to how region.removeAll() works. 
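The per-bucket retry loop of option-2 can be sketched as below. All names here are hypothetical; the primary-resolution and message-send steps are abstracted into functions so the sketch stays self-contained. The key point is that the primary is re-resolved on every retry, which is how bucket movement during the clear is absorbed.

```java
import java.util.List;
import java.util.function.BiPredicate;
import java.util.function.Function;

// Sketch of option-2's co-ordinator loop: one clear message per primary
// bucket, re-resolving the bucket's primary member before each retry
// (much like region.removeAll() retries per-bucket operations).
class PerBucketClear {
    static final int MAX_RETRIES = 3;

    /**
     * primaryOf maps bucketId -> current primary member name;
     * sendClear(bucketId, member) returns true on success.
     */
    static boolean clearAllBuckets(List<Integer> bucketIds,
                                   Function<Integer, String> primaryOf,
                                   BiPredicate<Integer, String> sendClear) {
        for (int bucketId : bucketIds) {
            boolean done = false;
            for (int attempt = 0; attempt < MAX_RETRIES && !done; attempt++) {
                String primary = primaryOf.apply(bucketId); // re-resolve primary on each attempt
                done = sendClear.test(bucketId, primary);
            }
            if (!done) {
                return false; // give up on this bucket after MAX_RETRIES attempts
            }
        }
        return true;
    }
}
```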

...

Based on the pros and cons of option-1 and option-2, option-2 is the recommended solution to implement.

Note: The complexity of option-3 and option-4 ruled them out as solutions at this time; a brief explanation of these options is included at the end of this doc.

Flow Chart - Sending clear message at bucket level 

Implementation Details

(addressing the consideration/challenges listed above)

  • The member after receiving the clear request: 

    Acquires the distributed lock and elects itself as the co-ordinator member. This prevents multiple clear() operations from executing concurrently.

    Gets the primary bucket list and sends a clear message to each primary bucket (member).

    Upon receiving (processing) the message, the primary bucket takes a distributed lock to prevent losing primary status, then takes RVV.lockForClear, where RVV is the region version vector (a local write lock to block on-going operations, GII, and transactions). 

    Upon completion of the clear on the primary bucket, it sends the clear message to the secondary buckets (members). 

    When a secondary bucket receives the RVV, it waits/checks for its local RVV to dominate the received RVV, which ensures that concurrent cache operations are applied/rejected accordingly.

    NOTE:

    Cache operations are synchronized under the RVV lock. For a non off-heap region, the clear calls map.clear(), which should have minimal or no impact on performance. For an off-heap region, map.clear() iterates over the entries to release the off-heap entries; this could impact cache operation performance. This will be documented, and in the future the iteration over region entries could be done in a background thread.

  • Handling Transaction (TX) during clear

    As clear and transactions are both handled at the bucket region level, they both operate by taking the rvvLock.

    If the TX gets the lock first, the clear thread will wait until the TX finishes and releases the rvvLock. 

    If clear gets the rvvLock first, the TX will fail and roll back.

  • Updating OQL Index

    The indexes are maintained both synchronously and asynchronously. The clear will update both synchronous and asynchronous indexes under lock, by clearing both the index data structures and the queues used for asynchronous maintenance.

  • Disk region: the clear version tag is written to the oplog for each bucket region.
  • CacheListener.afterRegionClear() and CacheWriter.beforeRegionClear() will be invoked at the PR level.
  • Update to client queue i.e. notify client

    Subscription clients will be notified with a clear region event (with version info?) at the PR level. 

  • Off-heap entries: Reuse the current non-partitioned region clear logic to release off-heap entries in an iteration.
  • GII: Reuse the current non-partitioned region clear logic, in which GII competes with clear for the rvvLock.
  • Region Destroy  

    The clear will throw RegionDestroyedException at the bucket region level. The co-ordinator should collect the exception and throw it to the caller. 

  • LRU: Clear will remove all the expiration tasks at the bucket level.
  • PartitionOfflineException: If the clear on some buckets fails with PartitionOfflineException, PR.clear should return a PartialResultException to the caller, letting the user decide whether to call clear again.
  • Updating Lucene Index

    As part of clear(), the Lucene indexes will be recreated. Any events that were in the AEQ prior to the clear (for the cleared region entries) will be rejected/handled (this logic already exists in the product). 
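The TX-versus-clear rvvLock interaction described in the implementation details above can be sketched with a `ReentrantReadWriteLock` as a stand-in for the per-bucket RVV lock. This is an illustration of the intended semantics, not Geode's actual lock class: clear takes the write lock (waiting for in-flight TXs), while a TX try-locks the read lock and rolls back if clear already holds the write lock.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Stand-in for the per-bucket RVV lock semantics: clear = exclusive writer,
// transactions = readers that fail fast (and roll back) while clear runs.
class RvvLockSketch {
    private final ReentrantReadWriteLock rvvLock = new ReentrantReadWriteLock();

    /** Runs the bucket clear under the write lock, blocking until in-flight TXs release it. */
    void clearBucket(Runnable clearAction) {
        rvvLock.writeLock().lock();
        try {
            clearAction.run();
        } finally {
            rvvLock.writeLock().unlock();
        }
    }

    /** Returns true if the TX committed; false means the TX failed and must roll back. */
    boolean tryTransaction(Runnable txAction) {
        if (!rvvLock.readLock().tryLock()) {
            return false; // clear holds the write lock: fail the TX and roll back
        }
        try {
            txAction.run();
            return true;
        } finally {
            rvvLock.readLock().unlock();
        }
    }
}
```

If the TX acquires the read lock first, `clearBucket` simply blocks until it is released, matching the "clear waits for TX" case above.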

Test Cases to consider

  1. TX in PR
    1. Test TX with one or more primary buckets in the same or different members, while clear is on-going
  2. Retry at the accessor (a member is killed; a new primary target must be found)
  3. Retry at the client (the co-ordinator is killed)
  4. PR ends with a PartialResultException, due to PartitionOfflineException
  5. While PR.clear is on-going, restart some members.
  6. While PR.clear is on-going, rebalance to move a primary.
  7. Benchmark test to measure how much time is spent in local clear vs. distribution when off-heap is configured.
  8. Region destroy (a bucket or the whole PR) while clear is on-going
  9. While PR.clear is on-going, try to create an OQL index on value/key, create a Lucene index, or alter the region

Changes and Additions to Public Interfaces

Currently PartitionedRegion.clear() throws UnsupportedOperationException. It will be updated to perform the clear operation.

Performance Impact

Cache operations on off-heap regions could be impacted during clear. 

Backwards Compatibility and Upgrade Path

Clear will throw an exception if any member running an older version is present in the cluster. 

Assumptions

Data consistency is not guaranteed if the region does not have concurrency checks enabled (there will be no RVV and no rvvLock). This is also the current behavior of clear() on non-partitioned regions.

Prior Art

Besides option-1 and option-2, the following options were considered. They were evaluated with particular attention to the performance impact of data iteration during clear.

...

Answers to questions you’ve commonly been asked after requesting comments for this proposal.

Errata

What are minor adjustments that had to be made to the proposal since it was approved?