Overview

The WAN gateway replication feature allows two distinct remote data centers, or two availability zones, to maintain data consistency. If one data center cannot process incoming events for any reason, the other data center should retain the failed events so that no data is lost. Currently, if data center 1 (DC1) is able to connect to data center 2 (DC2) and send it events, those events are removed from the queue on DC1 when the ack from DC2 is received, regardless of what happens to them on DC2. This behavior is controlled by the system property REMOVE_FROM_QUEUE_ON_EXCEPTION, which defaults to true. It is acceptable to simply store events that were not successfully processed on the receiving end, without replaying them; the customer only needs to know which events did not get transmitted. The most common exceptions thrown by a receiving site include:

  • Low Memory Exception
  • Malformed data exception (unable to deserialize)

Goals:

We will provide a mechanism for users to preserve events on the gateway sender that are not successfully processed on the receiving data center. Our example implementation will store these events on disk at the sending data center and notify the user which events did not get transmitted.

  1. Deprecate (and later remove) the internal system property REMOVE_FROM_QUEUE_ON_EXCEPTION, but detect if it is set to false and support existing behavior (infinite retries)
  2. Create a new callback API that will be executed when an exception is returned with the acknowledgement from the receiver
  3. Provide an example implementation of the callback that saves events with exceptions returned from the receiver in a 'dead-letter' queue on the sender (on disk)
  4. Add two new properties for the gateway receiver to control when to send the acknowledgement with the exceptions:
    1. the number of retries for events failing with an exception
    2. the wait time between retries 
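
As a rough illustration of goals 2 and 3, the following is a minimal sketch of what a sender-side callback and a disk-backed dead-letter implementation could look like. All names here (`FailedEventCallback`, `onFailedEvent`, `FileDeadLetterCallback`, `readEntries`) are hypothetical and chosen only for this example; they are not the proposed API, and events are reduced to plain string key/value pairs rather than real gateway events.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.List;

public class DeadLetterSketch {

  /** Hypothetical callback, invoked on the sender when the ack carries an exception. */
  interface FailedEventCallback {
    void onFailedEvent(String key, String value, Throwable cause);
  }

  /** Hypothetical implementation that appends each failed event to a file on the sender. */
  static class FileDeadLetterCallback implements FailedEventCallback {
    private final Path file;

    FileDeadLetterCallback(Path file) {
      this.file = file;
    }

    @Override
    public synchronized void onFailedEvent(String key, String value, Throwable cause) {
      // One tab-separated line per failed event: key, value, exception summary.
      String line = key + "\t" + value + "\t"
          + cause.getClass().getName() + ": " + cause.getMessage()
          + System.lineSeparator();
      try {
        Files.write(file, line.getBytes(StandardCharsets.UTF_8),
            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }

    /** Lets the user inspect which events did not get transmitted. */
    synchronized List<String> readEntries() {
      try {
        if (!Files.exists(file)) {
          return Collections.<String>emptyList();
        }
        return Files.readAllLines(file, StandardCharsets.UTF_8);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
  }

  public static void main(String[] args) {
    Path file = Paths.get(System.getProperty("java.io.tmpdir"),
        "dead-letter-" + System.nanoTime() + ".log");
    FileDeadLetterCallback callback = new FileDeadLetterCallback(file);

    // Simulate an ack from the receiver that reports a failed event.
    callback.onFailedEvent("order-42", "payload",
        new IllegalStateException("low memory on receiver"));

    callback.readEntries().forEach(System.out::println);
  }
}
```

Keeping the callback as a single-method interface would let users substitute their own handling (for example, logging or alerting) for the file-based example; whether the real API takes this shape is left open here.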

...