WAN - Gateway Sender Callback API & Dead-letter queue example implementation

Overview

The WAN gateway allows 2 distinct data centers, or 2 availability zones, to maintain data consistency. In the case where one data center cannot process incoming events for any reason, the other data center should retain events so that no data is lost. Currently if one site can connect to another and send it events, those events are removed from the queue when the ack is received, regardless of what happens to them on the remote site. This behavior is controlled by the system property REMOVE_FROM_QUEUE_ON_EXCEPTION which defaults to true. It is unacceptable to simply store events that did not get successfully processed on the receiving end somewhere without replaying them. Customer needs to only know what events did not get transmitted. Most common exceptions thrown from a receiving site include:

Low Memory Exception
Malformed data exception (unable to deserialize)

Goals:

Deprecate (and later remove) the internal system property REMOVE_FROM_QUEUE_ON_EXCEPTION, but detect if it is set to false and support existing behavior (infinite retries)
Create a new callback API that will be executed when an exception returned with the acknowledgement from the receiver
Provide an example implementation of the callback that saves events with exceptions returned from the receiver in a 'dead-letter' queue on the sender (on disk)
Add 2 new properties for the gateway receiver to control when to send the acknowledgement with the exceptions:
1. the number of retries for events failing with an exception
2. the wait time between retries

Not in Scope

Providing the ability to directly replay events from the dead-letter queue.

Approach

Our current design approach is as follows:

Deprecate existing internal boolean system property: REMOVE_FROM_QUEUE_ON_EXCEPTION
1. Continue to support default behavior if boolean set to false by setting # retries on receiver to -1
Create new Java API
1. Define callback API for senders to set callback to dispatchers
2. Invoke callback if batch exception occurs prior to batch removal
3. Implement a default callback API (see item 8 below)
4. Add parameters on gateway receiver factory for # retries and wait time between retries.
Modify Gfsh commands
1. Add option to gfsh ‘create gateway sender’ command to specify custom callback
2. Add options to gfsh ‘create gateway receiver’ command to set # retries and wait time between retries
3. Store new options in cluster config
  1. Sender: callback implementation
  2. Receiver: # of retries and wait time between retries
Create example implementation of Sender callback that writes event(s) and associated exceptions to a file
Security features
1. Define privileges needed to deploy and configure sender callback
2. With security, callback should only write eventId's and exceptions, i.e. no entry values should be written to disk

API Change

Risks and Unknowns

Space shortcuts

Page tree

Overview

Goals:

Not in Scope