Avoid the queuing of dropped events by the primary gateway sender when the gateway sender is stopped

To be Reviewed By: July 16th, 2020

Authors: Alberto Gomez (alberto.gomez@est.tech)

Status: Draft | Discussion | Active | Dropped | Superseded

Superseded by: N/A

Related: N/A

Problem

Gateway senders drop all events received when they are stopped. Nevertheless, primary gateway senders, while stopped, store all events received in the tmpDroppedEvents member variable of the AbstractGatewaySender class. These events are stored so that they can be sent later (when the primary gateway sender is started) to the secondary gateway senders in order for them to remove those events from their queues. If it were not so, secondary gateway senders could have events in their queues that would never be removed.

This feature was implemented in Unable to render Jira issues macro, execution error.   as a solution to avoid secondary gateway senders to leave un-drained events after GII.

This solution works well when stopped gateway senders are not to remain in that state for a long time, e.g., when they are stopped but in the process of starting. But, if a gateway sender is stopped to be left in that state for some time, the incoming events reaching the primary gateway sender will be stored in the mentioned member variable of AbstractGatewaySender and could eventually provoke a heap exhaustion error. Moreover, dropped events stored while the gateway sender is stopped will not be queued by secondary gateway senders which makes the storing of the dropped events in the primary gateway sender unnecessary.

Stopping a gateway sender is an action that may be used to avoid the filling of gateway sender queues in long lasting split brain situations. But, given the current status of the implementation, it would not be effective because incoming events will still be stored by the primary gateway senders, using at least the same amount of memory (if not more if overflow to disk is configured) as the events queued by the sender when it is running, and with a very high risk of heap memory exhaustion.

Anti-Goals

As described above, dropped events in the primary gateway sender are stored in a member variable. It is out of the scope of this RFC to change how those events are stored.

Solution 1 (original proposal, deprecated)

The solution proposes to change the primary gateway sender so that it does not store dropped events when it is stopped explicitly (not while starting). The reason is that these events could never end in the queue of any secondary gateway sender and will use memory unnecessarily.

In order to do so, it is proposed to add a new boolean member variable (mustQueueDroppedEvents) to the AbstractGatewaySender that will tell if the primary gateway sender must store dropped events or not.

  • mustQueueDroppedEvents must be set to false (do not store dropped events) in the primary and secondary gateway sender instances:
    • At gateway sender creation if the --manual-start option was used.
    • Right after stopping the gateway sender using the gfsh stop gateway sender command.
  • mustQueueDroppedEvents must be set to true (store dropped events) in the primary and secondary gateway sender instances:
    • At gateway sender creation if the --manual-start option was not used or set to false.
    • Right before a start gateway sender gfsh command is executed.

The start gateway sender and stop gateway sender gfsh commands would be modified in order to set the value of mustQueueDroppedEvents as follows:

  • Code will be added at the end of the current stop gateway sender gfsh command which will set mustQueueDroppedEvents to false (after the gateway senders have been stopped).
  • Code will be added at the beginning of the current start gateway sender gfsh command which will set mustQueueDroppedEvents to true (before the gateway senders are started).

In case the GatewaySender Java API is used to start/stop gateway senders, in order to get the new behavior, given that the scope of the start/stop methods is the VM on which it is invoked, it will be necessary to set the mustQueueDroppedEvents accordingly (to true before starting all sender instances and to false after stopping all sender instances) on every VM. To set the value of the variable, the GatewaySender interface will offer the following new method: setMustQueueDroppedEvents(boolean mustQueue). If the new method is not used, the legacy behavior will prevail except if the gateway sender is started manually in which case dropped events will not be queued.

A draft PR of the solution can be found here: https://github.com/apache/geode/pull/5348

Changes and Additions to Public Interfaces

Two new methods: setMustQueueDroppedEvents(boolean) and mustQueueDroppedEvents() will be added to the GatewaySender public interface.

Performance Impact

As the proposal implies changing the implementation of the start gateway sender and  stop gateway sender gfsh commands to be done in two steps, these commands may be slightly slower although not significantly.

Backwards Compatibility and Upgrade Path

The proposal does not affect the rolling upgrade and has not impacts in the regular rolling upgrade process.

Solution 2 (new proposal, agreed after discussions on this RFC )

The solution consists of, instead of storing dropped events in `tmpDroppedEvents` to later send batch removal messages when the primary gateway sender is not started, try to send the batch removal message when the event to be dropped is received. That way, when the sender is stopped for a long time and there are events coming, the memory of the `AbstractGatewaySender` will not grow with entries in the `tmpDroppedEvents` member.

In order to send the batch removal message directly, the `eventProcessor` for the `AbstractGatewaySender` must have been created. If it is not yet created because the sender was created with manual start set to true, while receiving events to be dropped, they will be stored in `tmpDroppedEvents` as there is no other choice. Nevertheless, in order to consume less memory, the event stored could be a simplified event containing only the necessary information to handle it.

A draft PR of the solution can be found here: https://github.com/apache/geode/pull/5486

Changes and Additions to Public Interfaces

No changes.

Performance Impact

No impacts foreseen.

Backwards Compatibility and Upgrade Path

The proposal does not affect the rolling upgrade and has not impacts in the regular rolling upgrade process.

Prior Art

-

FAQ

Answers to questions you’ve commonly been asked after requesting comments for this proposal.

Errata

What are minor adjustments that had to be made to the proposal since it was approved?

  • No labels

43 Comments

  1. The changes to the GatewaySender interface (adding the setMustQueueDroppedEvents() and mustQueueDroppedEvents() methods) constitute a change to a public API, so should be documented in the appropriate section of this RFC. Also, no mention is given to how this new behaviour will be implemented when the Java API is used to stop a gateway sender via the GatewaySender.stop() method, which would lead to inconsistent default behaviour if not addressed.

  2. Good points, Donal. Thanks!

    I have updated the RFC describing how to use the new behavior in case the Java API is used. As you can see, there is no magic, but the one using the Java API can get the new behavior by making use of the new method to set the value for the new flag accordingly and doing the required coordination.

  3. I am not very familiar how gateway senders behave, but by the description of the problem here, this may be a naive question: instead of setting up a boolean, can we just check if the gateway sender is stopped or not before we put the event on the queue?

    1. Thanks Jinmei Liao .

      Each gateway sender instance (gateway sender running in a member) has its own variable stating if the instance is stopped or not. Currently, if the sender instance is primary and the sender instance is stopped, the dropped event is queued. But this variable does not provide information about the status of the other instances. This is an excerpt of the code in AbstractGatewaySender.java

      if (!isRunning()) {
      if (this.isPrimary()) {
      tmpDroppedEvents.add(clonedEvent);

      What the proposed boolean variable is achieving is letting know the primary gateway sender instance if the gateway sender (not just the instance, but all the instances) are stopped. The variable is set to true by the stop gateway sender command in all gateway sender instances. Another name for the variable could be allSenderInstancesStopped.

      Without this variable, the primary gateway sender instance would have to contact each secondary gateway sender instance to query if they are stopped. This is not very efficient and also there could be race conditions that could make a dropped event to not be queued because the secondary sender was just stopped when queried but just a fraction of time ago could have been up and storing the event just received by the primary.

  4. I am the original author who introduce GEODE-4942 and GEODE-5087. I understand current tmpDroppedEvents contains the side-effect which is mentioned in this RFC. However, I don't think to change the APIs to introduce new parameter is a good idea to resolve it. 


    To enhance the original design of tmpDroppedEvents to workaround the side-effect, there are some easier ways. The simplest one is to set a timeout. Currently we did not distinguish if the eventProcessor is purposely shutdown or is in normal restarting. We can introduce a stopTime (long) as attribute, which will be set after 

    this.eventProcessor.isStopped(). Then before tmpDroppedEvents.add(clonedEvent), we can compare current timestamp with the saved stopTime,
    if it's longer than 10 seconds, don't add to tmpDroppedEvents. I introduce the tmpDroppedEvents is only for restarting window, it should not
    be long. 10 seocnds is way longer than necessary.

  5. I have a more mature idea to use the timeout to resolve this issue. I noticed AbstractGatewaySender has an attribute startTime, which will be set to current timestamp in start() of SerialGatewaySenderImpl and SerialAEQImpl. But ParallelGatewaySender never used it. We can make use of this attribute. 

    1) in start(), set startTime to be current timestamp as SerialGatewaySenderImpl does. 
    2) in stop(), set startTime to be (-1)*System.currentTimeMillis()

    3) in distribute(): 

    if (!isRunning()) {
    if (this.isPrimary()) {
    if (startTime < 0 && System.currentTimeMillis() + startTime < 10000) {
    tmpDroppedEvents.add(clonedEvent);
    if (isDebugEnabled) {
    logger.debug("add to tmpDroppedEvents for evnet {}", clonedEvent);
    }
    }
    }
    if (isDebugEnabled) {
    logger.debug("Returning back without putting into the gateway sender queue:" + event);
    }
    return;
    }

    We only need to distinguish if the sender is down due to purposely shutdown or in the middle of restarting. Above code is enough to do it. 


    We can have more detail discuss if you like. 

  6. Alberto Gomez I also created a draft Pull Request based on the simple solution. It will resolve the problem you mentioned. https://github.com/apache/geode/pull/5381

    You can keep refining this fix. 

    This fix has minimum impact to current system and APIs. 

    1. Xiaojian Zhou Thanks for your comments.

      I find the solution solves the problem of not storing dropped events indefinitely when a sender is stopped. But I have some concerns:

      1. When a sender previously stopped is started back, no dropped events will be queued but I think they should because the secondary gateway sender could start before the primary. Is this right or am I missing something?
      2. Setting the timeout to 15 seconds when the amount of traffic received is high and the objects are big could mean that a lot of heap memory would be used to store tmpDroppedEvents and these objects are not overflown to disk and no limit can be set on them. If, on the contrary, the value set for the timeout is too low then there is a risk of missing dropped events and therefore not draining the secondary gateway sender queue.
      3. I think setting the startTimeout to a negative value when the sender is stopped is a bit hacky. It will be hard for the people looking at the code in the future to understand why is it done that way. I would rather have a more clear solution. Maybe using another variable like stopTime.
  7. Alberto Gomez
    Introducing a new stopTime is OK. I just find the startTime is never used and want to use it. 

    15 seconds is adjustable. I feel it's safe to increase even to 1 minute. I don't worry that much about the space used for the time being. We can enhance the item's size which are put into the tmpDroppedEvents. We only keep the tmpDroppedEvents to notify secondary queue to delete. So we might not need to clone the whole event's data structure. For example, during clone, we can set some fields to be null. Note: I means to clone a new event dedicated for tmpDroppedEvents and keep the current cloneEvent as is. 

    For your question 1, we have not found any bugs for that scenario yet. However, we are not supposed to shutdown one sender for long time. Either we restart it (which should finish within 15 seconds), or shutdown all senders (usually due to shutdown all members). 

    1. Xiaojian Zhou

      I think it would be a good idea to not clone the whole event to prevent using too much memory.

      Regarding question 1, I have written a test case in which you can see that if you stop a sender and then start it while there are incoming operations, there could be dropped events not queued that will be received by the secondarygateway sender and never drained. If you run it several times, you'll see it fail from time to time.

      This case is not handled by this solution.

  8. In my previous comment, there's a mistake. We cannot clone a simplified event, because the event will be enqueued when sender is restarted. For your test, I will take a look later. 

  9. Alberto Gomez Your test failure will happen when the senders are started at the first time, when the startTime is initialized as 0. vm2 and vm4 could start in different speed, which caused different number of events are rejected at vm2 and vm4. In my reproduce, vm2 reject tailKey==100, 101, but vm4 rejected tailKay=100,101,102. If vm2 happened to be the secondary, then taikKey=102 will not be drained. 

    I have fixed it. See the latest code in pull request.  

    By the way, in your test code need to adjust a little bit. I have added your test case into the PR. 

    1. Xiaojian Zhou Thanks for your reply. Again, the problem with this solution is that it does not fix completely the case when you start a sender.

      If you start a sender right after it has been created (or within 15 seconds since the creation), the sender will queue dropped events for 15 or less seconds. But what if you first create the sender and more than 15 seconds later you start it? In this case the solution will not work. No dropped events will be queued and you could have events not drained.

      Also, if you start a server that was previously stopped, you need to do it within the 15 seconds after you stopped the sender. Otherwise, no dropped events will be queued and you could have events not drained. I do not think we can assume that a stopped sender will always be restarted within the next 15 seconds. In fact, we have a use case in which we need to stop the gateway sender for a long time in order not queue more events to not fill up the heap.

      You can verify this by adding sleeps of 15 seconds before sending puts in the new test case. You will see that the test case will fail sometimes.

  10. I know it might. However in general, I think nothing should be added to the queue if it's not running. If events get added to the secondary that aren't added to the primary, they should be dealt with in the first batch removal message.

    Currently our  batch removal message packed a key list. Only the events in the key list at secondary will be removed. We can enhance it to double check if there's any "too-old" events in secondary. Remove them if any. There're 2 alternative to identify an "too-old" event. 1) set a long timeout such as 1 hour.  2) set a boolean firstBatchRemoval to the batch removal message. If it's true, then remove all the events in secondary queue with smaller shadow keys than the smallest one in the batch. 

    1. Xiaojian Zhou We can implement any of the alternatives you are proposing. Still I wonder if doing all this is not much more complex than the solution that I proposed originally. Plus we would have to address the issue I mentioned previously about, depending on the size of the objects, taking up too much memory due to storing 15 seconds of dropped events.

      In my opinion, even if the public API needs to be changed (expanded), the original solution is much simpler and addresses the problem without any drawback.


      1. Above solution has minimum impact to current system. Your new API solution has more side effects than you thought. 

        1. Could you please elaborate on the side effects I have not thought about?

  11. It could break backward compatibility. And any changes to public API will trigger many other components to be changed. Such as JMX and Restful API. Adding gfsh parameter is the last choice. We would rather to disable the whole tmpDroppedEvents feature to resolve your original issue. 

    My fix in this PR have dramatically reduce the chance of undrained event in secondary queue. Only could happen when user stopped one sender (not for restart) then wait for more than 15 seconds to manual restart. Even though, the worst case is the possibility of a few undrained events in secondary queue, and the possibility is way less than totally disabled tmpDroppedEvents. 

    As for manually stop one sender, it will always face the risk of missing data. We strongly recommend NOT to do it. You can either shutdown all senders, or restart one sender. 

    1. Hi Xiaojian Zhou . The original solution does not break the backward compatibility because the changes in the API are an extension of the it - there are new methods available but you are not required to use them. Regarding gfsh, I am not proposing to add any new parameter. My proposal consists of modifying the implementation of the start and stop commands to use the new methods of the API to avoid the problems with tmpDroppedEvents.

      Please, correct me if I am wrong but I have not seen in the JMX or the restful API a way to start or stop the gateway senders so I do not see how this proposal would impact them.

      I am not against your proposal. I just think it has some drawbacks (you need timeouts which are always hard to tune and pose on the one configuring the system an extra burden), there are aspects which are not totally clear yet (the alternatives about how to solve the problem when starting servers) and the impacts in code and complexity are bigger. That's why I still consider the original solution better.

      Finally, the proposal aims at solving the problems when starting or stopping all sender instances, not other cases. That's why the new logic is added to the start/stop gfsh commands in which all sender instances are stopped/started.


      1. Your solution cannot resolve the event undrained in secondary queue. When you manual start (or shutdown and restart using default setting), the tmpDroppedEvents is actually disabled. Then there's possibility that event will be undrained in secondary queue. 

        I have reproduced that using my version of ParallelWANPropagationLoopBackDUnitTest.java. You can comment out the lines to verify size of tmpDroppedEvents. 

        1. That's because I set the new variable to true when manual start is true because I thought it would be less problematic.

          We can initialize it to true and then you would have the default behavior equivalent to what we have today.

  12. Alberto Gomez In your solution, if a developer wants to use java API to stop a gateway sender, does he need to set the added boolean flag to a specific value?  It's not very obvious to me what value I need to set it to be, so to me this added boolean flag does add API complexity and confusion.

    All these seem to be some internal management of the queue sizes, so I would like to leave users out of it. Requiring users to know the details of the implementation in order to set a flag seems not user-friendly to me.

    In your solution, you are using a loop to set the flag on all the members, so it's still possible that the secondary GWS (gateway sender) will get the notice later than the primary, so it could still possibly queue some events before it turns the boolean off, so that event could still not be drained.

    As far as I understand, tmpDroppedEvents is used to store events that might have been queued up by the secondary GWS while the primary GWS is down.  Ideally, if user didn't stop the secondary GWS but have the primary GWS stopped, we would have this tmpDroppedEvents store those events indefinitely until the primary starts again,  but if we are concerned the variable will get too big, then setting a timeout seems like a reasonable solution.  The problem of this approach is when user didn't stop all the secondary GWS after the timeout period after they stop the primary GWS, then we will need to ask ourselves: how likely is this to happen?

    1. Thanks a lot for your comments, Jinmei. Very appreciated.

      I totally agree with you that the API extension proposed is not very user friendly and deals with internal management of queues that should be hidden to the user of the API. Due to how the tmpDroppedEvents feature implementation was done, I did not think I had a better choice, mainly because it is all about coordination between sender instances. I also have to say that I proposed to add it so that gfsh start/stop could make use of it and solve the problem described in the RFC and not thinking about Java API users. The latter clients could be left with the current behavior. With this proposal, the logic would be hidden inside the implementation of gfsh start and stop. Nonetheless if Java users would like to make use of it we could try to find a better name for the new methods/flag (for example: prepareToStopSender/prepareToStartSender or preStartSender/postStopSender) and add proper documentation so that it would not be impossible for the user to understand how to use the API. As this is a draft, I expected to have discussion on the namesbut for the sake of understanding what the methods/flag did in the proposal I decided to be very explicit with the name of the flag/method and probably less clear for the future user of the API.

      Regarding what you comment about the loop, the idea is that before starting a sender, you must tell all sender instances they are going to be started. Once all instances have been notified (and not before as you can see in the implementation of the gfsh start sender proposal) they are sent the signal to be started. Events will not be queued in the secondary before it is started but the primary will start to put stopped events in tmpDroppedEvents as soon as it receives the change of the flag, so there is no possibility for the race condition you mention.

      As I wrote in previous comments to Xiaojian Zhou, I do not think using a timeout is a good idea in general for solving problems and less so in this case in which, depending a lot in the type of traffic received, the value of the timeout could be very different and either use a lot of memory if the timeout was too high or leave undrained events if too short.

      Besides, the timeout only solves the problem when stopping the sender. To solve the problem of not leaving undrained events when starting a sender, a couple of draft ideas were proposed by Xiaojian Zhou but I could foresee some problems already with them and it is clear that they add yet more complexity to this already complex feature.

      For all of the above, I consider that solving the problem in the gfsh start and stop sender commands (and leaving the current behavior of start/stop gateway sender when used  out of gfsh ) is the least bad solution. It is very simple in terms of code and solves perfectly the problem when gfsh is used.

      Afterall it might be that what we really need is to see if there is a better way to drain event queues in the secondary that were dropped in the primary without storing the dropped events in the primary in the first place.




      1. If you are only set to solve the problem when gfsh commands are used, then your solution is ok. I think Gester is trying to solve the problem regardless whether gfsh is used to start/stop the GWSs. 

        1. I agree to Jinmei. To write a private-used workaround is different story. To add a feature into the product for all users to use need to consider more. A feature should not assume user to call some APIs to do housekeeping in their application. 

          Actually, I am thinking to add some of Alberto's code into my proposal. For example, the mustQueueDroppedEvents and its setter. Then users can easily to work out a private workaround (such as using a function to call the setter on all members) if they are unhappy with our fix. The original GEODE-4942 did not fix well. I did not consider the scenario Albert mentioned. What I provided is an enhanced fix. 

          To write a complete java API and gfsh to change a status in all senders is a feature, which needs more work efforts and consideration. 

          1. I agree with you guys and see valid points on your comment below.

            As I told Jinmei, I aimed at solving the problem just for the  gfsh case (taking into consideration the interaction with servers on different versions and failure cases like the ones you  mentioned below) and to me, the solution was simple and good enough for most use cases.

            To provide a complete fix for the java API I would explore the possibility of not storing dropped events by the primary at all. I am not sure if it is possible but you wrote a couple of ideas to solve the problem when starting the sender that may do it.

            I did not want to go that far when I thought about fixing this problem but if you want we can have a discussion to exchange ideas to try to find the best solution next week - as this week I am on vacation. How does it sound?

            1. I have an idea to add your mustQueueDroppedEvents attribute and setMustQueueDroppedEvents() API to my PR. I will add a function to call setMustQueueDroppedEvents() on all members. However, I will put this function into dunit test as example. Then users can copy this function as a java-API in their application if they want to turn off/on tmpDroppedEvents.
              If they don't care that much, my PR (using timeout) will serve the normal customers' use cases. 

              Actually, you can do above if you like. I think this is the best balance in current stage. 

  13. Alberto Gomez
    My proposal is also a draft workaround. It will help the situation in most use cases (if not all). And actually we can apply both solutions. 

    I am not against to introduce a new feature to set a status to all senders. However, there are the 4 major problems in your solution:

    1. Your solution only changed gfsh command to setMustQueueDroppedEvents on all senders, but not to provide a java API to do it. Your RFC relied on pure java user java application to call the sender.setMustQueueDroppedEvents server by server. This is not the correct way to introduce such a feature. If we need to add a feature to change a status on all senders, we need to have a java API to send message or a function to all members, then let gfsh to call this new java API.
    2. Backward compatibility issue: if your cluster has mixed with old version of servers, which does not have this attribute and does not have the setMustQueueDroppedEvents() method either. How you handle it? For example, at least you need to  detect the version and not to call setMustQueueDroppedEvents() for old members.
    3. You need to handle that the setMustQueueDroppedEvents(true or false) message or function only executed on some members, but not on other members due to race condition. This error handling should also be considered. Usually we employ a retry. 
    4. You also need to handle that setMustQueueDroppedEvents(true or false) is done on different member at different time. There’s a time window that some member has set it to true but other members set it to false. RFC needs to prove that such a time window will not cause problems. 

      If considered all above 4 challenges, you will see why my proposal is simpler. 

    1. So, if I understand correctly you are proposing to solve the problem in two phases:

      1. Workaround: set a 15 seconds timeout for tmpDroppedEvents
      2. Definitive solution: provide a Java API to set the originally proposed flag (mustQueueDroppedEvents) via a function which will set the flag on all senders and will take care of retries, versions, etc. and update gfsh to use this API.

      Is this right?

      1. Xiaojian Zhou I had another idea based on your proposal with a timeout.

        If we use the startTime and stopTime we can make sure that the solution works not only when senders are stopped (after 15 seconds no more dropped events will be queued) but also when starting them.

        This solution will be good for Java API users and gfsh users and they will not be impacted.

        You can check a draft PR here:

        https://github.com/apache/geode/pull/5434

        1. Xiaojian Zhou

          With this solution, when starting the gateway senders there is some very small chance that some event queued in the secondary gateway sender and dropped by the primary is not stored in the tmpDroppedEvents. I have seen the last test case of ParallelWANPropagationLoopBackDUnitTest in my  PR fail 1 out of 350 executions.

          Nevertheless, I think the solution is good enough until we decide to use the flags to report that all gateway sender instances are stopped.

          Another thing to take into account is that this solution, when manually starting the gateway sender, will store events in tmpDroppedEvents for 15 seconds since creation. Should we in this case make the gateway not store events in tmpDroppedEvents until it is started?

          Do you think we could go on to create a PR with this solution?


          1. Alberto Gomez
            After discussed with the team, it will be a community development. So I will not fix it for the time-being, but I will help you if necessary. 

            There're following options you can choose:
            1. completely introduce a java API with gfsh parameter to change all sender's status at one time. When you do this java API, you need to consider 2 negative cases: 1) retry if some member did not finish setting the status. 2) detected some servers are running in older version, then abort the API. 
            2. introduce the attribute and the setter, then add a function (let's name it testSetSenderStatustFunction) to call the setting in dunit test. Also create a dunit test to use gfsh to call testSetSenderStatustFunction. Then with this checkin, your application code can simulate the 2 test code to define a user function, let's name it AlbertoFunction, and in your application to use gfsh to call AlbertoFunction. 
            3. Use timeout to resolve the issue. 
            4. Use both option-2 and option3. 

            After you decided which option to choose, then you can have a new PR and I will review for you. 

            1. Xiaojian Zhou Regarding option 2., are you thinking about something like what I proposed in my last commit in the original PR (https://github.com/apache/geode/pull/5348) in which a new function is created and it is called from gfsh? That draft PR is missing the retries and the checking of the version of the servers but apart from that, it can give you an idea of what it would look like.
              What I do not get is the parameter for gfsh. I do not think any parameter is needed. The new function should always be called from gfsh. Otherwise, the tmpDroppedEvents will be indefinitely collected when the sender is stopped.



              1. Your PR 5348 is based on option1. If you want to change gfsh product code, you have to consider and implement all the negative cases I mentioned earlier. 

                My option2 means: you can add an internal function like you have done in PR, the attribute and a setter. That's all you can add into product. 

                Then in test code, you write a test case to verify using both java api and gfsh to call this function to change the status on all senders.   

                gfsh can call your own function. Jinmei knows better on that. 

                That's the option-2. In this case, you don't need to worry about the negative cases, because you did not change the product and have no impact to other users of geode. 

                In your own application, you can use the test code you wrote to do the setting. Of course, it's up to you to handle the potential negative cases. 

                1. Xiaojian Zhou You are right, I was referring to option 1. not 2.

                  I will take some time to think about which one to implement. It would be option 1 or 3.

                  Thanks for your comments.

                  1. Xiaojian Zhou , Jinmei Liao, I have thought of an alternative solution that is simpler than the ones we have right now and should solve all the problems described in the RFC.

                    I have created a draft pull request with it so that you can take a look and comment:

                    https://github.com/apache/geode/pull/5486

                    1. I feel the solution is overall good. It only needs tiny adjustment. I have put comments in PR. 

                      1. Thanks. I have written a ticket Unable to render Jira issues macro, execution error. to handle it.

  14. There is `ExecuteFunctionCommand` that will allow you to execute user functions on any members. I believe what Gester suggested Option 2 is your PR #5348, with the following changes:

    1. do not call your `GatewaySenderManagerFunction` in gfsh commands
    2. add a few tests just to make sure you can call `GatewaySenderManagerFunction` using `ExeucteFunctionCommand`, which does the thing you expected.


    With you PR #5348, Gester is concerned that calling the function in Stop/StartGatewaySenderCommand might still cause some events being undrained and we don't do enough testing to ensure that's not the case.

    1. Jinmei Liao Thanks a lot for the clarification. Now I fully understand Gester's Option 2.

      Would I need to implement the function in any special way so that it can be invoked from gfsh? With the current implementation extending Function, when trying to execute from gfsh I got a "Function GatewaySenderManageFunction is not registered in member..." error.


      In the case of Option 1 (sketched in PR #5348), if it was finally selected, we could add more tests to make sure that there are no events undrained.

  15. Alberto Gomez to register a function for member execution, call FunctionService.registerFunction in an initialization code block.

  16. Solution 2 can be simplified to be:

    The tmpDroppedEvents was introduced to save the incoming events temporally before EventProcessor was created when sender is manually started at the first time. When EventProcessor is present (stopping or restarting a sender will keep the EventProcessor until the whole sender is garbage collected), no need to save incoming events into tmpDroppedEvents. 

    This proposal has minimum impact to current logic and has no backward compatibility concern. 

  17. The solution finally implemented can be seen in the following JIRA ticket: https://issues.apache.org/jira/browse/GEODE-8491