Avoid the queuing of dropped events by the primary gateway sender when the gateway sender is stopped

To be Reviewed By: July 16th, 2020

Authors: Alberto Gomez (alberto.gomez@est.tech)

Status: Draft

Superseded by: N/A

Related: N/A

Problem

Primary gateway senders drop all events they receive while stopped. Nevertheless, while stopped they store every received event in the tmpDroppedEvents member variable of the AbstractGatewaySender class. These events are kept so that they can be sent later (when the primary gateway sender is started) to the secondary gateway senders, which then remove those events from their queues. Otherwise, secondary gateway senders could hold events in their queues that would never be removed.

This feature was implemented in GEODE-4942 (ASF JIRA) to prevent secondary gateway senders from being left with un-drained events after GII.

This solution works well when gateway senders stay stopped only briefly, e.g., when they are stopped but in the process of starting. But if a gateway sender is stopped and left in that state for some time, the events reaching the primary gateway sender keep accumulating in the tmpDroppedEvents member variable of AbstractGatewaySender and could eventually exhaust the heap. Moreover, events dropped while the gateway sender is stopped are not queued by the secondary gateway senders, which makes storing them in the primary gateway sender unnecessary.
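The growth described above can be illustrated with a toy model. This is only a sketch under stated assumptions: LegacySenderModel, distribute, start and bufferedCount are hypothetical names, events are reduced to plain strings, and none of this is Geode's actual AbstractGatewaySender code.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the legacy behavior; NOT Geode's AbstractGatewaySender.
class LegacySenderModel {
    // Mirrors the role of tmpDroppedEvents: events dropped while stopped.
    private final List<String> tmpDroppedEvents = new ArrayList<>();
    private boolean running = false;

    // A stopped primary drops the event but remembers it for later.
    void distribute(String eventId) {
        if (!running) {
            tmpDroppedEvents.add(eventId);
        }
        // Normal queuing while running is not modelled here.
    }

    // On start, the remembered events are handed over so that the
    // secondaries can be told to remove them; the buffer is then emptied.
    List<String> start() {
        running = true;
        List<String> toNotify = new ArrayList<>(tmpDroppedEvents);
        tmpDroppedEvents.clear();
        return toNotify;
    }

    int bufferedCount() {
        return tmpDroppedEvents.size();
    }
}
```

In this model the buffer grows by one entry per incoming event for as long as the sender stays stopped, and it is only released when start() is called.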

Stopping a gateway sender is an action that may be used to keep gateway sender queues from filling up in long-lasting split-brain situations. With the current implementation, however, it is not effective: incoming events are still stored by the primary gateway senders, using at least as much memory as the events queued by a running sender (if not more, when overflow to disk is configured), with a very high risk of heap memory exhaustion.

Anti-Goals

As described above, dropped events in the primary gateway sender are stored in a member variable. Changing how those events are stored is out of the scope of this RFC.

Solution

Solution 1 (original proposal, deprecated)

The solution proposes to change the primary gateway sender so that it does not store dropped events when it has been stopped explicitly (as opposed to being in the process of starting). The reason is that these events can never end up in the queue of any secondary gateway sender, so storing them uses memory unnecessarily.

In order to do so, it is proposed to add a new boolean member variable (mustQueueDroppedEvents) to the AbstractGatewaySender that will tell if the primary gateway sender must store dropped events or not.

  • mustQueueDroppedEvents must be set to false (do not store dropped events) in the primary and secondary gateway sender instances:
    • At gateway sender creation if the --manual-start option was used.
    • Right after stopping the gateway sender using the gfsh stop gateway sender command.
  • mustQueueDroppedEvents must be set to true (store dropped events) in the primary and secondary gateway sender instances:
    • At gateway sender creation if the --manual-start option was not used or set to false.
    • Right before a start gateway sender gfsh command is executed.

The start gateway sender and stop gateway sender gfsh commands would be modified in order to set the value of mustQueueDroppedEvents as follows:

  • Code will be added at the end of the current stop gateway sender gfsh command which will set mustQueueDroppedEvents to false (after the gateway senders have been stopped).
  • Code will be added at the beginning of the current start gateway sender gfsh command which will set mustQueueDroppedEvents to true (before the gateway senders are started).
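The flag rules and the command ordering listed above could be modelled roughly as follows. This is a sketch, not the code of the draft PR: FlaggedSenderModel, startCommand and stopCommand are hypothetical stand-ins for the sender and for the two gfsh commands, and only mustQueueDroppedEvents and its accessors come from this proposal.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of Solution 1; it does not use or mirror Geode's real classes.
class FlaggedSenderModel {
    private final List<String> tmpDroppedEvents = new ArrayList<>();
    private boolean running = false;
    private boolean mustQueueDroppedEvents;

    FlaggedSenderModel(boolean manualStart) {
        // Creation rule: with manual start, dropped events are not stored.
        this.mustQueueDroppedEvents = !manualStart;
    }

    void setMustQueueDroppedEvents(boolean mustQueue) {
        this.mustQueueDroppedEvents = mustQueue;
    }

    boolean mustQueueDroppedEvents() {
        return mustQueueDroppedEvents;
    }

    void start() {
        running = true;
        tmpDroppedEvents.clear(); // notifying the secondaries is omitted here
    }

    void stop() {
        running = false;
    }

    // While stopped, an event is stored only if the flag allows it.
    void distribute(String eventId) {
        if (!running && mustQueueDroppedEvents) {
            tmpDroppedEvents.add(eventId);
        }
    }

    int bufferedCount() {
        return tmpDroppedEvents.size();
    }

    // Hypothetical stand-in for "start gateway sender": flag first, then start.
    static void startCommand(List<FlaggedSenderModel> instances) {
        for (FlaggedSenderModel s : instances) s.setMustQueueDroppedEvents(true);
        for (FlaggedSenderModel s : instances) s.start();
    }

    // Hypothetical stand-in for "stop gateway sender": stop first, then flag.
    static void stopCommand(List<FlaggedSenderModel> instances) {
        for (FlaggedSenderModel s : instances) s.stop();
        for (FlaggedSenderModel s : instances) s.setMustQueueDroppedEvents(false);
    }
}
```

With these rules, a sender created with manual start or stopped through the command never accumulates dropped events, while a sender that is merely in the process of starting still does.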

When the GatewaySender Java API is used to start/stop gateway senders, the scope of the start/stop methods is the VM on which they are invoked. To get the new behavior, mustQueueDroppedEvents must therefore be set accordingly on every VM: to true before starting all sender instances and to false after stopping all sender instances. To set the value of the variable, the GatewaySender interface will offer the following new method: setMustQueueDroppedEvents(boolean mustQueue). If the new method is not used, the legacy behavior will prevail, except if the gateway sender is started manually, in which case dropped events will not be queued.
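The required call order for API users might look like the sketch below. StartStopSender is a hypothetical stand-in for the relevant part of the GatewaySender interface (of its methods, only setMustQueueDroppedEvents is the one proposed here), and SenderLifecycle and RecordingSender are illustrative helpers. How the calls reach the sender instance on each VM is outside the sketch; the list simply stands in for those instances.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the part of the GatewaySender interface used here.
interface StartStopSender {
    void start();
    void stop();
    void setMustQueueDroppedEvents(boolean mustQueue); // method proposed by this RFC
}

// Illustrative helper enforcing the ordering described in the text.
final class SenderLifecycle {
    private SenderLifecycle() {}

    // Set the flag to true on all instances before starting any of them.
    static void startAll(List<? extends StartStopSender> instances) {
        for (StartStopSender s : instances) s.setMustQueueDroppedEvents(true);
        for (StartStopSender s : instances) s.start();
    }

    // Stop all instances before setting the flag to false on any of them.
    static void stopAll(List<? extends StartStopSender> instances) {
        for (StartStopSender s : instances) s.stop();
        for (StartStopSender s : instances) s.setMustQueueDroppedEvents(false);
    }
}

// Fake sender that only records the order in which it is called.
final class RecordingSender implements StartStopSender {
    private final String name;
    private final List<String> log;

    RecordingSender(String name, List<String> log) {
        this.name = name;
        this.log = log;
    }

    @Override
    public void start() {
        log.add(name + ".start");
    }

    @Override
    public void stop() {
        log.add(name + ".stop");
    }

    @Override
    public void setMustQueueDroppedEvents(boolean mustQueue) {
        log.add(name + ".mustQueue=" + mustQueue);
    }
}
```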

A draft PR of the solution can be found here: https://github.com/apache/geode/pull/5348

Changes and Additions to Public Interfaces

Two new methods, setMustQueueDroppedEvents(boolean) and mustQueueDroppedEvents(), will be added to the GatewaySender public interface.

Performance Impact

As the proposal implies changing the implementation of the start gateway sender and stop gateway sender gfsh commands so that they are done in two steps, these commands may be slightly slower, although not significantly.

Backwards Compatibility and Upgrade Path

The proposal does not affect rolling upgrades and has no impact on the regular rolling upgrade process.

Prior Art

N/A

Solution 2 (new proposal, agreed after discussions on this RFC)

When the primary gateway sender is not started, instead of storing dropped events in `tmpDroppedEvents` so that batch removal messages can be sent later, the solution tries to send the batch removal message as soon as the event to be dropped is received. That way, when the sender is stopped for a long time while events keep arriving, the memory used by the `AbstractGatewaySender` will not grow with entries in the `tmpDroppedEvents` member.

To send the batch removal message directly, the `eventProcessor` of the `AbstractGatewaySender` must already have been created. If it has not been created yet because the sender was created with manual start set to true, events to be dropped will still be stored in `tmpDroppedEvents` as they arrive, since there is no other choice. Nevertheless, to consume less memory, the stored event could be a simplified event containing only the information necessary to handle it.
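The decision described above could be sketched as follows. This is a toy model, not the code of the draft PR: DirectRemovalSenderModel, RemovalNotifier and DroppedEventRef are hypothetical names, the single eventKey field is only an assumption about what a simplified event might carry, and the model assumes the event processor stays usable for sending removals after the sender is stopped.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of Solution 2; it does not use or mirror Geode's real classes.
class DirectRemovalSenderModel {
    // Hypothetical stand-in for whatever can send a batch removal message.
    interface RemovalNotifier {
        void sendBatchRemoval(long eventKey);
    }

    // Simplified dropped event: only what is needed to remove it later
    // (a single key here, which is an assumption of this sketch).
    static final class DroppedEventRef {
        final long eventKey;

        DroppedEventRef(long eventKey) {
            this.eventKey = eventKey;
        }
    }

    private final List<DroppedEventRef> tmpDroppedEvents = new ArrayList<>();
    private RemovalNotifier eventProcessor; // null until the first start
    private boolean running = false;

    void start(RemovalNotifier processor) {
        eventProcessor = processor;
        running = true;
        // Flush whatever had to be buffered before the processor existed.
        for (DroppedEventRef ref : tmpDroppedEvents) {
            processor.sendBatchRemoval(ref.eventKey);
        }
        tmpDroppedEvents.clear();
    }

    // Assumption: the processor remains available after stopping.
    void stop() {
        running = false;
    }

    void distribute(long eventKey) {
        if (running) {
            return; // normal queuing is not modelled here
        }
        if (eventProcessor != null) {
            // Send the removal right away instead of buffering the event.
            eventProcessor.sendBatchRemoval(eventKey);
        } else {
            // No processor yet (manual start): buffering is the only option.
            tmpDroppedEvents.add(new DroppedEventRef(eventKey));
        }
    }

    int bufferedCount() {
        return tmpDroppedEvents.size();
    }
}
```

In this model the buffer can only grow before the first start; once the sender has been started at least once, a later stop no longer causes any accumulation.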

A draft PR of the solution can be found here: https://github.com/apache/geode/pull/5486

Changes and Additions to Public Interfaces

No changes.

Performance Impact

No impacts foreseen.

Backwards Compatibility and Upgrade Path

The proposal does not affect rolling upgrades and has no impact on the regular rolling upgrade process.

Prior Art

N/A
