Avoid the queuing of dropped events by the primary gateway sender when the gateway sender is stopped
To be Reviewed By: July 16th, 2020
Authors: Alberto Gomez (alberto.gomez@est.tech)
Status: Draft
Superseded by: N/A
Related: N/A
Problem
Primary gateway senders drop all events received when they are stopped. Nevertheless, primary gateway senders, while stopped, store all events received in the `tmpDroppedEvents` member variable of the `AbstractGatewaySender` class. These events are stored so that they can be sent later (when the primary gateway sender is started) to the secondary gateway senders in order for them to remove those events from their queues. If it were not so, secondary gateway senders could have events in their queues that would never be removed.
This feature was implemented in the scope of a Jira ticket.
This solution works well when stopped gateway senders are not meant to remain in that state for a long time, e.g., when they are stopped but in the process of starting. However, if a gateway sender is stopped and left in that state for some time, the incoming events reaching the primary gateway sender will be stored in the mentioned member variable of `AbstractGatewaySender` and could eventually cause a heap exhaustion error. Moreover, events dropped while the gateway sender is stopped will not be queued by the secondary gateway senders, which makes storing the dropped events in the primary gateway sender unnecessary.
Stopping a gateway sender is an action that may be used to avoid filling the gateway sender queues in long-lasting split-brain situations. However, with the current implementation it would not be effective, because incoming events are still stored by the primary gateway senders, using at least as much memory as the events queued by the sender when it is running (if not more, if overflow to disk is configured), and with a very high risk of heap memory exhaustion.
Anti-Goals
What is outside the scope of what the proposal is trying to solve?
Solution
As described above, dropped events in the primary gateway sender are stored in a member variable. It is out of the scope of this RFC to change how those events are stored.
Solution 1 (original proposal, deprecated)
The solution proposes to change the primary gateway sender so that it does not store dropped events when it has been stopped explicitly (as opposed to while it is starting). The reason is that these events could never end up in the queue of any secondary gateway sender and would use memory unnecessarily.
In order to do so, it is proposed to add a new boolean member variable (`mustQueueDroppedEvents`) to the `AbstractGatewaySender` that will tell whether the primary gateway sender must store dropped events or not.
`mustQueueDroppedEvents` must be set to false (do not store dropped events) in the primary and secondary gateway sender instances:
- At gateway sender creation if the `--manual-start` option was used.
- Right after stopping the gateway sender using the gfsh `stop gateway sender` command.
`mustQueueDroppedEvents` must be set to true (store dropped events) in the primary and secondary gateway sender instances:
- At gateway sender creation if the `--manual-start` option was not used or was set to false.
- Right before a `start gateway sender` gfsh command is executed.
The `start gateway sender` and `stop gateway sender` gfsh commands would be modified in order to set the value of `mustQueueDroppedEvents` as follows:
- Code will be added at the end of the current `stop gateway sender` gfsh command which will set `mustQueueDroppedEvents` to false (after the gateway senders have been stopped).
- Code will be added at the beginning of the current `start gateway sender` gfsh command which will set `mustQueueDroppedEvents` to true (before the gateway senders are started).
In case the `GatewaySender` Java API is used to start/stop gateway senders, in order to get the new behavior, given that the scope of the start/stop methods is the VM on which they are invoked, it will be necessary to set `mustQueueDroppedEvents` accordingly (to true before starting all sender instances and to false after stopping all sender instances) on every VM. To set the value of the variable, the `GatewaySender` interface will offer the following new method: `setMustQueueDroppedEvents(boolean mustQueue)`. If the new method is not used, the legacy behavior will prevail, except if the gateway sender is started manually, in which case dropped events will not be queued.
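The flag-based behavior described above can be illustrated with a minimal sketch. It is not Geode code: `FlagSenderSketch`, `onEvent` and `storedDroppedEvents` are hypothetical names introduced only for illustration, events are reduced to plain strings, and only the names `mustQueueDroppedEvents` and `setMustQueueDroppedEvents` come from this proposal.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, heavily simplified stand-in for AbstractGatewaySender.
// Only the flag and its accessor names come from the proposal.
class FlagSenderSketch {
  private final List<String> tmpDroppedEvents = new ArrayList<>();
  private boolean running;
  private boolean mustQueueDroppedEvents;

  FlagSenderSketch(boolean manualStart) {
    // With manual start the sender is created stopped and must not
    // store dropped events; otherwise it starts and keeps the legacy behavior.
    this.running = !manualStart;
    this.mustQueueDroppedEvents = !manualStart;
  }

  synchronized void setMustQueueDroppedEvents(boolean mustQueue) {
    this.mustQueueDroppedEvents = mustQueue;
  }

  synchronized boolean mustQueueDroppedEvents() {
    return mustQueueDroppedEvents;
  }

  synchronized void start() {
    running = true;
    // The real sender would notify the secondaries about the stored
    // events here; this sketch simply discards them.
    tmpDroppedEvents.clear();
  }

  synchronized void stop() {
    running = false;
  }

  synchronized void onEvent(String eventId) {
    if (running) {
      return; // normal queuing path, omitted in this sketch
    }
    if (mustQueueDroppedEvents) {
      tmpDroppedEvents.add(eventId); // dropped but remembered
    }
    // else: dropped and forgotten, so memory does not grow while stopped
  }

  synchronized int storedDroppedEvents() {
    return tmpDroppedEvents.size();
  }
}
```

Under these assumptions, a caller using the Java API would invoke `stop()` followed by `setMustQueueDroppedEvents(false)` on every VM, and `setMustQueueDroppedEvents(true)` followed by `start()` when restarting.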
A draft PR of the solution can be found here: https://github.com/apache/geode/pull/5348
Changes and Additions to Public Interfaces
Two new methods, `setMustQueueDroppedEvents(boolean)` and `mustQueueDroppedEvents()`, will be added to the `GatewaySender` public interface.
Performance Impact
As the proposal implies changing the implementation of the `start gateway sender` and `stop gateway sender` gfsh commands so that they are executed in two steps, these commands may be slightly slower, although not significantly.
Backwards Compatibility and Upgrade Path
The proposal does not affect rolling upgrades and has no impact on the regular rolling upgrade process.
Prior Art
.
Solution 2 (new proposal, agreed after discussions on this RFC)
Instead of storing dropped events in `tmpDroppedEvents` so that batch removal messages can be sent later, the primary gateway sender, when it is not started, tries to send the batch removal message as soon as the event to be dropped is received. That way, when the sender is stopped for a long time and events keep arriving, the memory used by the `AbstractGatewaySender` will not grow with entries in the `tmpDroppedEvents` member.
To send the batch removal message directly, the `eventProcessor` of the `AbstractGatewaySender` must already have been created. If it has not been created yet because the sender was created with manual start set to true, events to be dropped will still be stored in `tmpDroppedEvents`, as there is no other choice. Nevertheless, to consume less memory, the stored event could be a simplified one containing only the information needed to handle it.
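The decision described above can be sketched as follows. This is an illustration under stated assumptions, not the code of the draft PR: `DroppedEventSketch`, `DroppedKey`, `RemovalNotifier`, `onDroppedEvent` and `pending` are hypothetical names, and the choice of a bucket id plus a key as the "simplified event" is an assumption about what the minimal necessary information might be.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of Solution 2; none of these types exist in Geode.
class DroppedEventSketch {

  // Assumed minimal information kept for a dropped event.
  static final class DroppedKey {
    final int bucketId;
    final long shadowKey;

    DroppedKey(int bucketId, long shadowKey) {
      this.bucketId = bucketId;
      this.shadowKey = shadowKey;
    }
  }

  // Stand-in for the part of the event processor that notifies secondaries.
  interface RemovalNotifier {
    void sendBatchRemoval(DroppedKey key);
  }

  private RemovalNotifier eventProcessor; // null until it has been created
  private final List<DroppedKey> tmpDroppedEvents = new ArrayList<>();

  // Called once the event processor exists; flushes anything stored before.
  synchronized void setEventProcessor(RemovalNotifier processor) {
    this.eventProcessor = processor;
    for (DroppedKey key : tmpDroppedEvents) {
      processor.sendBatchRemoval(key);
    }
    tmpDroppedEvents.clear();
  }

  // Called for every event received while the sender is not running.
  synchronized void onDroppedEvent(int bucketId, long shadowKey) {
    DroppedKey key = new DroppedKey(bucketId, shadowKey);
    if (eventProcessor != null) {
      // Notify secondaries right away; nothing accumulates in memory.
      eventProcessor.sendBatchRemoval(key);
    } else {
      // No processor yet (e.g. manual start): keep only the small record.
      tmpDroppedEvents.add(key);
    }
  }

  synchronized int pending() {
    return tmpDroppedEvents.size();
  }
}
```

In this sketch the list only grows during the window in which no event processor exists, and each entry holds two primitive fields rather than a full event.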
A draft PR of the solution can be found here: https://github.com/apache/geode/pull/5486
Changes and Additions to Public Interfaces
No changes.
Performance Impact
No impacts foreseen.
Backwards Compatibility and Upgrade Path
The proposal does not affect rolling upgrades and has no impact on the regular rolling upgrade process.
Prior Art
What would be the alternatives to the proposed solution? What would happen if we don’t solve the problem? Why should this proposal be preferred?-
FAQ
Answers to questions you’ve commonly been asked after requesting comments for this proposal.
Errata
What are minor adjustments that had to be made to the proposal since it was approved?