...

The WAN replication feature allows two remote data centers, or two availability zones, to maintain data consistency. If one data center cannot process incoming events for any reason, the other data center should retain the failed events so that no data is lost. Currently, if data center 1 (DC1) can connect to data center 2 (DC2) and send it events, those events are removed from the queue on DC1 as soon as the ack from DC2 is received, regardless of what happens to them on DC2. This behavior is controlled by the system property REMOVE_FROM_QUEUE_ON_EXCEPTION, which defaults to true. The most common exceptions thrown from a receiving site include:

  • LowMemoryException - when one or more of the receiving site's members is low on memory
  • CacheWriterException - when a CacheWriter before* method throws an exception
  • PartitionOfflineException - when all the members defining a persistent bucket are offline
  • RegionDestroyedException - when the region doesn't exist in the remote site
  • Malformed data exception (unable to deserialize)
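
The current behavior described above can be illustrated with a toy model of the DC1 queue. This is a sketch only, not Geode code: the class SenderQueueModel and its methods are hypothetical, and what happens when the property is false (the event stays queued for redelivery) is an assumption made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of the sender-side queue on DC1. None of these names are Geode APIs.
public class SenderQueueModel {
    private final Deque<String> queue = new ArrayDeque<>();
    // Stands in for the REMOVE_FROM_QUEUE_ON_EXCEPTION system property (default true).
    private final boolean removeFromQueueOnException;

    public SenderQueueModel(boolean removeFromQueueOnException) {
        this.removeFromQueueOnException = removeFromQueueOnException;
    }

    public void enqueue(String eventId) {
        queue.addLast(eventId);
    }

    // Called when DC2 acknowledges the event at the head of the queue.
    // failedOnReceiver is true when DC2 reported an exception for that event.
    public void onAck(boolean failedOnReceiver) {
        if (queue.isEmpty()) {
            return;
        }
        if (!failedOnReceiver || removeFromQueueOnException) {
            // With the default setting, a failed event is dropped like a successful one.
            queue.removeFirst();
        }
        // Assumption: otherwise the event stays queued and would be sent again.
    }

    public int size() {
        return queue.size();
    }
}
```

With the default setting, an event that failed on DC2 is no longer held anywhere once the ack arrives, which is the data-loss case this proposal addresses.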

...

  1. Deprecate existing internal boolean system property: REMOVE_FROM_QUEUE_ON_EXCEPTION
    1. Continue to support the existing behavior when the property is set to false, by setting the number of retries on the receiver to -1
  2. Create new Java API

    1. Define callback API for senders to set callback to dispatchers

    2. If sender is configured with a callback, invoke the callback if batch exception occurs prior to batch removal

    3. Implement a default callback API (see item 5 below)

    4. Add properties on gateway receiver factory to specify # retries for a failed event and wait time between retries.

  3. Modify Gfsh commands

    1. Add option to gfsh ‘create gateway sender’ command to specify custom callback

    2. Add options to gfsh ‘create gateway receiver’ command to set # retries and wait time between retries

    3. Store new options in cluster config

      1. Sender: callback implementation

      2. Receiver: # of retries and wait time between retries

  4. Add support in cache.xml for specifying new callback for gateway sender and setting new options for gateway receiver

  5. Create example implementation of Sender callback that writes event(s) and associated exceptions to a file

  6. Security features  

    1. Define privileges needed to deploy and configure sender callback

    2. With security enabled, the callback should write only eventIds and exceptions, i.e. no entry values should be written to disk.

  7. Add logging and statistics for callback

    1. Log messages for gateway receiver for start time and results of retries

    2. Add statistics and an MBean for callbacks (in progress, completed, count, and duration)
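
The callback and retry items in the list above could look roughly like the following. This is a minimal sketch under stated assumptions, not the proposed Geode API: FailedEventCallback, FileCallback, and applyWithRetries are hypothetical names, events are reduced to string ids, and treating -1 as "retry until success" is an interpretation of item 1.1. The file callback writes only the event id and the exception, in line with the security item.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Callable;

public class FailedEventHandling {

    /** Hypothetical callback a sender would invoke before a failed batch is removed. */
    public interface FailedEventCallback {
        void onFailure(String eventId, Exception cause);
    }

    /** Example callback: appends "eventId<TAB>exception" lines to a file, never entry values. */
    public static final class FileCallback implements FailedEventCallback {
        private final Path file;

        public FileCallback(Path file) {
            this.file = file;
        }

        @Override
        public synchronized void onFailure(String eventId, Exception cause) {
            String line = eventId + "\t" + cause.getClass().getName() + ": "
                + cause.getMessage() + System.lineSeparator();
            try {
                Files.write(file, line.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /**
     * Receiver-side retry sketch: runs the action once, then retries up to
     * `retries` more times, sleeping `waitMillis` between attempts.
     * Assumption: retries == -1 means retry until the action succeeds.
     */
    public static <T> T applyWithRetries(Callable<T> action, int retries, long waitMillis)
            throws Exception {
        int retriesUsed = 0;
        while (true) {
            try {
                return action.call();
            } catch (Exception e) {
                if (retries >= 0 && retriesUsed >= retries) {
                    throw e; // retries exhausted: surface the failure to the caller
                }
                retriesUsed++;
                Thread.sleep(waitMillis);
            }
        }
    }
}
```

In this sketch the receiver would wrap the application of each event in applyWithRetries, and the sender would pass any exception reported back for a batch to the configured FailedEventCallback before removing the batch from its queue.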

...