You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 2 Next »

Overview

 

The WAN gateway allows 2 distinct data centers, or 2 availability zones, to maintain data consistency. In the case where one data center cannot process incoming events for any reason, the other data center should retain events so that no data is lost. Currently if one site can connect to another and send it events, those events are removed from the queue when the ack is received, regardless of what happens to them on the remote site. This behavior is controlled by the system property REMOVE_FROM_QUEUE_ON_EXCEPTION which defaults to true. It is unacceptable to simply store events that did not get successfully processed on the receiving end somewhere without replaying them. Customer needs to only know what events did not get transmitted. Most common exceptions thrown from a receiving site include:

  • Low Memory Exception
  • Malformed data exception (unable to deserialize)


Goals:

  1. Deprecate (and later remove) the internal system property REMOVE_FROM_QUEUE_ON_EXCEPTION, but detect if it is set to false and support existing behavior (infinite retries)
  2. Create a new callback API that will be executed when an exception returned with the acknowledgement from the receiver
  3. Provide an example implementation of the callback that saves events with exceptions returned from the receiver in a 'dead-letter' queue on the sender (on disk)
  4. Add 2 new properties for the gateway receiver to control when to send the acknowledgement with the exceptions:
    1. the number of retries for events failing with an exception
    2. the wait time between retries 

Not in Scope

  1. Providing the ability to directly replay events from the dead-letter queue.

Approach

 Our current design approach is as follows:

  1. Deprecate existing internal boolean system property: REMOVE_FROM_QUEUE_ON_EXCEPTION
    1. Continue to support default behavior if boolean set to false by setting # retries on receiver to -1
  2. Create new Java API

    1. Define callback API for senders to set callback to dispatchers

    2. Invoke callback if batch exception occurs prior to batch removal

    3. Implement a default callback API (see item 8 below)

    4. Add parameters on gateway receiver factory for # retries and wait time between retries.

  3. Modify Gfsh commands

    1. Add option to gfsh ‘create gateway sender’ command to specify custom callback

    2. Add options to gfsh ‘create gateway receiver’ command to set # retries and wait time between retries

    3. Store new options in cluster config

      1. Sender: callback implementation

      2. Receiver: # of retries and wait time between retries

  4. Create example implementation of Sender callback that writes event(s) and associated exceptions to a file

  5. Security features  

    1. Define privileges needed to deploy and configure sender callback

    2. With security, callback should only write eventId's and exceptions, i.e. no entry values should be written to disk


API Change



Risks and Unknowns

  1.  

  • No labels