
Overview

The WAN replication feature allows two remote data centers, or two availability zones, to maintain data consistency. If one data center cannot process incoming events for any reason, the other data center should retain the failed events so that no data is lost. Currently, if data center 1 (DC1) is able to connect to data center 2 (DC2) and send it events, those events are removed from the queue on DC1 when the ack from DC2 is received, regardless of what happens to them on DC2. This behavior is controlled by the internal system property REMOVE_FROM_QUEUE_ON_EXCEPTION, which defaults to true. The most common exceptions thrown by a receiving site include:

  • LowMemoryException - when one or more of the receiving site's members is low on memory
  • CacheWriterException - when a CacheWriter before* method throws an exception
  • PartitionOfflineException - when all the members defining a persistent bucket are offline
  • RegionDestroyedException - when the region doesn't exist in the remote site
  • Malformed data exception (unable to deserialize)

Goals

We will provide a mechanism for users to preserve events on the gateway sender that are not successfully processed on the receiving data center. Our example implementation will store these events on disk at the sending data center and notify the user which events were not applied on the receiver.

  1. Deprecate (and later remove) the internal system property REMOVE_FROM_QUEUE_ON_EXCEPTION, but detect if it is set to false and support existing behavior (infinite retries)
  2. Create a new callback API that will be executed when an exception is returned with the acknowledgement from the receiver
  3. Provide an example implementation of the callback that saves events with exceptions returned from the receiver in a 'dead-letter' queue on the sender (on disk)
  4. Add 2 new properties for the gateway receiver to control when to send the acknowledgement with the exceptions:
    1. the number of retries for events failing with an exception
    2. the wait time between retries 
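Goals 2 and 3 can be illustrated with a minimal Java sketch. The names FailedEventHandler and onException are taken from the proposed sequence diagram on this page; the method signature, the FileDeadLetterHandler class, and the tab-separated line format are assumptions made for illustration only, since the actual API is still TBD.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical callback invoked on the sender for each event whose
// acknowledgement carried an exception from the receiver.
// The signature is an assumption; the real API is not yet defined.
interface FailedEventHandler {
  void onException(String eventId, Throwable cause);
}

// Hypothetical example handler: appends one line per failed event to a
// file that acts as the on-disk dead-letter queue. Only the event ID and
// the exception are written, never the entry value.
final class FileDeadLetterHandler implements FailedEventHandler {
  private final Path file;

  FileDeadLetterHandler(Path file) {
    this.file = file;
  }

  @Override
  public synchronized void onException(String eventId, Throwable cause) {
    // One record per line: eventId, exception class, message (newlines flattened).
    String line = eventId + "\t" + cause.getClass().getName() + "\t"
        + String.valueOf(cause.getMessage()).replace('\n', ' ')
        + System.lineSeparator();
    try {
      Files.write(file, line.getBytes(StandardCharsets.UTF_8),
          StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Appending one line per event keeps the file easy to inspect by hand, which fits the goal of notifying the user rather than replaying events automatically.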

Not in Scope

  1. Providing the ability to directly replay events from the dead-letter queue.

Approach

 Our current design approach is as follows:

  1. Deprecate existing internal boolean system property: REMOVE_FROM_QUEUE_ON_EXCEPTION
    1. Continue to support the existing behavior (infinite retries) when the property is set to false by setting the number of retries on the receiver to -1
  2. Create new Java API

    1. Define callback API for senders to set callback to dispatchers

    2. If the sender is configured with a callback and a batch exception occurs, invoke the callback before the batch is removed from the queue

    3. Provide a default callback implementation (see item 5 below)

    4. Add properties on gateway receiver factory to specify # retries for a failed event and wait time between retries.

  3. Modify Gfsh commands

    1. Add an option to the gfsh ‘create gateway-sender’ command to specify a custom callback

    2. Add options to the gfsh ‘create gateway-receiver’ command to set the number of retries and the wait time between retries

    3. Store new options in cluster config

      1. Sender: callback implementation

      2. Receiver: # of retries and wait time between retries

  4. Add support in cache.xml for specifying new callback for gateway sender and setting new options for gateway receiver

  5. Create example implementation of Sender callback that writes event(s) and associated exceptions to a file

  6. Security features  

    1. Define privileges needed to deploy and configure sender callback

    2. When security is enabled, the callback should write only eventIds and exceptions; no entry values should be written to disk.

  7. Add logging and statistics for callback

    1. Log messages for gateway receiver for start time and results of retries

    2. Add statistics and an MBean for callbacks: number in progress, number completed, and duration
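Item 1 of the approach (mapping the deprecated property onto the new retry setting) could look roughly like the following. This is a sketch under stated assumptions: ReceiverRetrySettings and its resolve method are hypothetical names, and the raw property value is passed in as a string so that no particular system property key is implied.

```java
// Hypothetical holder for the two proposed gateway receiver settings.
final class ReceiverRetrySettings {
  // -1 is the value this page proposes for "retry forever".
  static final int INFINITE_RETRIES = -1;

  final int retries;
  final long waitTimeMillis;

  ReceiverRetrySettings(int retries, long waitTimeMillis) {
    this.retries = retries;
    this.waitTimeMillis = waitTimeMillis;
  }

  // legacyRemoveOnException is the raw value of the deprecated
  // REMOVE_FROM_QUEUE_ON_EXCEPTION property, or null when it is unset.
  // Any value that does not parse as "true" keeps events in the queue,
  // which is expressed here as infinite retries.
  static ReceiverRetrySettings resolve(String legacyRemoveOnException,
      int configuredRetries, long configuredWaitMillis) {
    boolean legacyKeepInQueue = legacyRemoveOnException != null
        && !Boolean.parseBoolean(legacyRemoveOnException);
    int retries = legacyKeepInQueue ? INFINITE_RETRIES : configuredRetries;
    return new ReceiverRetrySettings(retries, configuredWaitMillis);
  }

  boolean retriesForever() {
    return retries == INFINITE_RETRIES;
  }
}
```

In this sketch the deprecated property wins over the configured retry count when both are present; whether that precedence is the right one is an open design choice.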

New workflow for setting up WAN gateway using gfsh:

  1. Create gateway receiver including new options for specifying # of retries and wait time between retries
  2. Deploy jar on gateway sender(s) containing callback implementation
  3. Create gateway sender with option to add callback

API Change

  1. TBD

Risks and Unknowns

  1. How to handle class not found exception for sender callback

  2. Default behavior when no callback is provided for the sender? It should be the same as the current behavior.
  3. Backward compatibility behavior
    1. old sender connected to new receiver using new options
    2. new sender with callback implemented connected to old receiver
  4. Sort out security privileges needed for deploying vs installing with sender vs reading values for failed events written to disk.

Potential Future Enhancements

  • Ability to modify batch removal to remove specific events from the batch
  • Ability to resend events saved in dead-letter queue

Current Implementation

Sequence diagram. Sites: Site 1, Site 2. Participants: EventProcessor, RemoteDispatcher, ServerConnection, ReceiverCommand. Messages, in order:

  1. peekBatchFromQueue
  2. dispatchBatch
  3. getConnection
  4. sendBatch
  5. readRequest
  6. createCommand
  7. execute
  8. readBatchEvents
  9. loop [For Each Batch Event]
    1. loop [Retry]
      1. determineOperation (create, update, destroy)
      2. executeOperation
      3. alt
        • [Successful executeOperation] breakRetry
        • [Failed executeOperation] alt
          • [Remove from queue on exception] storeException, breakRetry
          • [Keep in queue on exception] sleep N milliseconds, continueRetry
  10. sendAcknowledgement
  11. readAcknowledgement
  12. logExceptions (if necessary)
  13. removeBatchFromQueue

Proposed Implementation

Sequence diagram. Sites: Site 1, Site 2. Participants: EventProcessor, RemoteDispatcher, FailedEventHandler, ServerConnection, ReceiverCommand. Messages, in order:

  1. peekBatchFromQueue
  2. dispatchBatch
  3. getConnection
  4. sendBatch
  5. readRequest
  6. createCommand
  7. execute
  8. readBatchEvents
  9. loop [For Each Batch Event]
    1. loop [Retry numberOfRetries]
      1. determineOperation (create, update, destroy)
      2. executeOperation
      3. alt
        • [Successful executeOperation] breakRetry
        • [Failed executeOperation] storeException, sleep waitTimeBetweenRetries milliseconds, continueRetry
  10. sendAcknowledgement
  11. readAcknowledgement
  12. loop [For Each Failed Batch Event]
    1. onException
  13. removeBatchFromQueue
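The proposed flow can be sketched in Java as follows. The class and method names (EventOperation, BatchRetryProcessor, AckHandler) are hypothetical, events are reduced to string IDs, and the sketch assumes that numberOfRetries counts retries after the first attempt, with -1 meaning retry forever. None of this is the actual receiver or sender code.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical stand-in for applying one create/update/destroy on the receiver.
interface EventOperation {
  void execute(String eventId) throws Exception;
}

// Receiver side: retry each event, collecting the exceptions that remain.
final class BatchRetryProcessor {
  private final int numberOfRetries;         // -1 means retry forever (assumption)
  private final long waitTimeBetweenRetries; // milliseconds

  BatchRetryProcessor(int numberOfRetries, long waitTimeBetweenRetries) {
    this.numberOfRetries = numberOfRetries;
    this.waitTimeBetweenRetries = waitTimeBetweenRetries;
  }

  // Returns the last exception for each event that still failed after all
  // attempts, in batch order. These would be sent back with the acknowledgement.
  Map<String, Exception> process(List<String> eventIds, EventOperation op)
      throws InterruptedException {
    Map<String, Exception> failures = new LinkedHashMap<>();
    for (String eventId : eventIds) {
      int retriesDone = 0;
      while (true) {
        try {
          op.execute(eventId);
          failures.remove(eventId); // success clears any earlier failure
          break;                    // breakRetry
        } catch (Exception e) {
          failures.put(eventId, e); // storeException
          if (numberOfRetries >= 0 && retriesDone >= numberOfRetries) {
            break;                  // retries exhausted
          }
          retriesDone++;
          Thread.sleep(waitTimeBetweenRetries); // then continueRetry
        }
      }
    }
    return failures;
  }
}

// Sender side: after reading the acknowledgement, invoke the callback for
// every failed event, then remove the whole batch from the queue.
final class AckHandler {
  static void handle(Map<String, Exception> failures,
      BiConsumer<String, Exception> onException, Runnable removeBatchFromQueue) {
    failures.forEach(onException);
    removeBatchFromQueue.run();
  }
}
```

Running the callback before removeBatchFromQueue mirrors the ordering in the diagram: a failed event is handed to the handler while it is still in the sender's queue.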
