To be Reviewed By: 2 April 2022

Authors: Mario Ivanac

Status: Draft | Discussion | Active | Dropped | Superseded

Superseded by: N/A

Related: N/A

Problem

It has been observed, that in case server with full parallel gateway sender queue is restarted,

after it is up, it unqueues events much slower , than other members in the cluster.

Reason for this is current logic,  to mark all events in queue as possible duplicate when secondary buckets are becoming primary (or bucket is recovered and becoming primary).

Due to this, events on receiving site are processed slower (checking each event, is it duplicate).

If we look at current logic, if link between sites is temporarily down, then exact number of events will be marked as possible duplicate for each server.

This number is multiplication of number of dispatcher threads (default 5) and batch size (default 100). So for default configuration this number is 500.

Additionally, duplicate events are also possible, while dispatching events, if sender is stopped, or server hosting sender is gracefully shutdown.

Anti-Goals

Solution will not cover Parallel Async  Events.

Solution will not cover situation when server is ungracefully shutdown.

Solution

Implement logic, that for each dispatch thread, in case events are unsuccessfully dispatched, when marking them as possible duplicate, for same batch of events notify secondary bucket.

Also add logic for pending batches at moment of stopping sender, when marking them as possible duplicate, for same batch of events notify secondary bucket.

Also add new API prepareForStop(), which will be called prior to closing of cache, so notifications can be sent prior shutdown.

Remove current logic, to mark all events in bucket, when it becomes primary.

PR with proposed solution: https://github.com/apache/geode/pull/7323

Changes and Additions to Public Interfaces

Add prepareForStop() method in GatewaySender interface.

Performance Impact

No impacts.

Backwards Compatibility and Upgrade Path

No impacts.

Prior Art

What would be the alternatives to the proposed solution? What would happen if we don’t solve the problem? Why should this proposal be preferred?

FAQ

Answers to questions you’ve commonly been asked after requesting comments for this proposal.

Errata

What are minor adjustments that had to be made to the proposal since it was approved?

  • No labels

4 Comments

  1. > This number is multiplication of number of dispatcher threads (default 5) and batch size (default 100). So for default configuration this number is 500.

    Is this really true? For some reason I was under the impression that the gateway sender was streaming batches of data and receiving acks asnychronously. If that is the case the number of events that have been sent but not acknowledged could be much greater.

  2. For WAN, in case batch is unsuccessfully dispatched, we will retry with sending of the same batch until it is successful or gw sender is stopped. This is according to implementation in 

    AbstractGatewaySenderEventProcessor.
  3. Currently this RFC only covers problem, when we have broken link between sites, and queue is filling. Problem is if any server is now restarted, that all events in buckets that become primary will be marked as possible duplicate (instead of 500 that was marked in restarted server).