To be Reviewed By: January 20th, 2021

Authors: Jakov Varenina

Status: Draft | Discussion | Active | Dropped | Superseded

Superseded by: N/A

Related: N/A

Problem

Apache Geode members do not persist the state of a gateway sender after it is changed with the pause, stop, resume, or start gateway-sender command. By default, a server tries to start a gateway sender automatically at startup, unless it is configured otherwise with the deprecated manual-start parameter (see the create gateway-sender command). Because of this limitation, a user who issues the stop gateway-sender command cannot be sure that the gateway sender will stay stopped after a server restart. A user might issue that command, for example, to keep the server from using up all its memory when the queues grow rapidly.

Anti-Goals

What is outside the scope of what the proposal is trying to solve?

Solution

A new startup-action parameter, with the values stop, pause, and start, will be persisted at runtime when the following commands are issued:

  • pause gateway-sender   --> startup-action="pause"
  • stop gateway-sender    --> startup-action="stop"
  • start gateway-sender   --> startup-action="start"
  • resume gateway-sender  --> startup-action="start"
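The command-to-value mapping above can be sketched as a small lookup table. This is a hypothetical illustration only: StartupActionMapping, StartupAction, forCommand, and configValue are names made up for this sketch, not existing Apache Geode classes or methods. Note that resume maps to start, so only three values ever need to be stored.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: these names are illustrative, not Apache Geode APIs.
class StartupActionMapping {

  // The three values proposed for the startup-action parameter.
  enum StartupAction {
    START, PAUSE, STOP;

    // Value as it would appear in cluster configuration, e.g. "pause".
    String configValue() {
      return name().toLowerCase(Locale.ROOT);
    }
  }

  // Each state-changing gfsh command and the startup-action it would persist.
  // "resume" is recorded as START rather than as a separate value.
  private static final Map<String, StartupAction> COMMAND_TO_ACTION = Map.of(
      "pause gateway-sender", StartupAction.PAUSE,
      "stop gateway-sender", StartupAction.STOP,
      "start gateway-sender", StartupAction.START,
      "resume gateway-sender", StartupAction.START);

  // Returns the action to persist, or empty for commands that do not change sender state.
  static Optional<StartupAction> forCommand(String command) {
    return Optional.ofNullable(COMMAND_TO_ACTION.get(command));
  }

  public static void main(String[] args) {
    for (String cmd : new String[] {"pause gateway-sender", "resume gateway-sender", "list gateways"}) {
      String value = forCommand(cmd).map(StartupAction::configValue).orElse("(unchanged)");
      System.out.println(cmd + " -> " + value);
    }
  }
}
```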

The startup-action parameter will be persisted in the cluster configuration, but only if the gfsh command executes successfully on at least one server. The only case in which the state will not be updated and persisted is when a command is executed for specific members (using the --members parameter). This is because the cluster configuration is persisted only per cluster and per group, not per member. This exception will be documented.
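The persistence rule above comes down to two checks. The following is a minimal sketch under the assumption that the command's outcome is available as one success flag per targeted server. The class name StartupActionPersistence and the method shouldPersist are hypothetical and do not correspond to real Geode code.

```java
import java.util.List;

// Hypothetical sketch of the persistence rule; names are illustrative only.
class StartupActionPersistence {

  // membersOptionUsed: true if the command was run with --members.
  // memberResults:     assumed per-server success flags for the command.
  static boolean shouldPersist(boolean membersOptionUsed, List<Boolean> memberResults) {
    // Cluster configuration is kept per cluster and per group, never per member,
    // so a --members invocation cannot be represented and is not persisted.
    if (membersOptionUsed) {
      return false;
    }
    // Otherwise persist only if the command succeeded on at least one server.
    return memberResults.stream().anyMatch(Boolean::booleanValue);
  }

  public static void main(String[] args) {
    System.out.println(shouldPersist(false, List.of(false, true)));  // true
    System.out.println(shouldPersist(false, List.of(false, false))); // false
    System.out.println(shouldPersist(true, List.of(true, true)));    // false
  }
}
```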

Currently, a gateway sender can be configured either to always start automatically at member startup or to always require a manual start (manual-start=true). With the addition of the new parameter to the cluster configuration, this behavior will change as follows:

  1. If manual-start="true" and the startup-action parameter is missing, the gateway sender will require a manual start (same as before).
  2. If manual-start is not set (or is "false") and the startup-action parameter is missing, the gateway sender will be started automatically (same as before).
  3. If the startup-action parameter is present in the cluster configuration at startup, the gateway sender will try to reach that state regardless of the manual-start value.
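The three rules above can be expressed as a single resolution function. This sketch makes one labeled assumption: it treats "requires manual start" (rule 1) as equivalent to a target state of STOP, consistent with how manual-start="true" and startup-action="stop" are handled together later in this proposal. StartupStateResolver and resolve are hypothetical names, not Geode APIs.

```java
// Hypothetical sketch of the startup rules; names are illustrative only.
class StartupStateResolver {

  enum StartupAction { START, PAUSE, STOP }

  // manualStart:     value of the existing manual-start attribute (false if unset).
  // persistedAction: startup-action from cluster configuration, or null if absent.
  static StartupAction resolve(boolean manualStart, StartupAction persistedAction) {
    // Rule 3: a persisted startup-action wins regardless of manual-start.
    if (persistedAction != null) {
      return persistedAction;
    }
    // Rules 1 and 2: fall back to the existing manual-start behavior.
    // Assumption: "requires manual start" is modeled as staying stopped.
    return manualStart ? StartupAction.STOP : StartupAction.START;
  }

  public static void main(String[] args) {
    System.out.println(resolve(true, null));                 // STOP (rule 1)
    System.out.println(resolve(false, null));                // START (rule 2)
    System.out.println(resolve(true, StartupAction.START));  // START (rule 3)
    System.out.println(resolve(false, StartupAction.PAUSE)); // PAUSE (rule 3)
  }
}
```

This precedence of startup-action over manual-start is exactly what makes the change backward incompatible.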

Any failures that occur while the gateway sender tries to reach the desired state will be handled the same way as today, for example the same way as a failed automatic start. The startup-action parameter will remain unchanged in all failure cases at startup.

The behavior of the manual-start parameter must be improved to comply with the above requirements.

Current issue with the manual-start parameter:

Currently, when manual-start is set to true, the colocated persistent parallel gateway sender queue region and its buckets are not recovered after a server restart. As a result, the main persistent region that is colocated with the gateway sender queue region cannot come online.

Solution to manual-start parameter issue:

When the manual-start parameter is true or the gateway sender startup-action is stop, persistent parallel gateway sender queues should be recovered (if needed) from persistent storage during server startup. The queues should be recovered by the existing mechanism that is also used when a gateway sender is started automatically (manual-start=false) after a server restart. In that case, the persistent region of the parallel gateway sender queue and its buckets are recovered (if needed) right after the main persistent region and its buckets are recovered.

Additionally, the parallel gateway sender should reach the same state it has when it is first started and then stopped with gfsh commands. In that state, the parallel gateway sender buckets remain on the servers, but the dispatcher threads are stopped and no events are stored in the queues.
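The proposed restart sequence can be sketched as an ordered list of steps. This is a hypothetical model only: the step strings are descriptive labels, not Geode log messages or method calls. The boolean keepStopped stands for the case where startup-action="stop", or where manual-start="true" and no startup-action is set. The sketch shows that queue recovery happens in every case and that only the final step differs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; step strings are descriptive labels, not Geode APIs or logs.
class ParallelQueueRecoverySketch {

  // keepStopped: true when startup-action="stop", or manual-start="true"
  // with no startup-action in cluster configuration.
  static List<String> restartSteps(boolean keepStopped) {
    List<String> steps = new ArrayList<>();
    // 1. The main persistent region and its buckets are recovered first.
    steps.add("recover main persistent region buckets");
    // 2. The colocated queue region and its buckets are recovered right after,
    //    in every case, so the main region can come online.
    steps.add("recover colocated parallel queue region buckets");
    // 3. Only the final state differs between the two cases.
    if (keepStopped) {
      // Same state as start followed by stop: buckets stay on the server,
      // dispatcher threads are stopped and no events are queued.
      steps.add("keep queue buckets; dispatchers stopped; events not queued");
    } else {
      steps.add("start dispatcher threads and queue events");
    }
    return steps;
  }

  public static void main(String[] args) {
    restartSteps(true).forEach(System.out::println);
  }
}
```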

This new behavior will be especially beneficial in automated environments, such as Kubernetes clusters, where servers can be restarted automatically.

Changes and Additions to Public Interfaces

none

Performance Impact

not expected

Backwards Compatibility and Upgrade Path

This feature introduces a backward-incompatible change, because the startup-action parameter will always take precedence over the manual-start parameter, as described in the Solution section.

This could be avoided by adding a parameter to the create gateway-sender command that would have to be set to enable the new behavior.

Prior Art

What would be the alternatives to the proposed solution? What would happen if we don’t solve the problem? Why should this proposal be preferred?

FAQ

Answers to questions you’ve commonly been asked after requesting comments for this proposal.

Errata

What are minor adjustments that had to be made to the proposal since it was approved?


3 Comments

  1. I like this idea in general. Is it just for gateway senders, or will it also apply to async event queues? The async event queue already has a property called pause-event-processing, which sounds like a similar concept. It seems like they should be consistent.

    1. Thanks for your comment!

      The idea was to apply this change only to gateway senders, since it is not possible to change the state of an async event queue at runtime.

      As you said, the state of an async event queue can be set with alter async-event-queue pause-event-processing, but that takes effect only after a server restart. It is not possible to change the async event queue state at runtime the way it is for a gateway sender with the start, stop, and pause commands. For example, if we added a similar parameter to the create and alter gateway-sender commands, the alter command would always have to be used together with the pause, start, and stop commands (in this order) to get the desired state after a restart:

      alter gateway-sender state-event-processing=paused
      pause gateway-sender
      ...
      alter gateway-sender state-event-processing=running
      resume gateway-sender

      The same limitations would apply to this solution as to the one proposed in the RFC. Also, in my opinion, this solution adds more complexity and would be less clear to the user.

      Is this what you had in mind? Do you see a third alternative?

  2. Jira ticket that implements this new feature: