To be Reviewed By: 26-03-2020

Authors: Alberto Bustamante Reyes (alberto.bustamante.reyes@est.tech)

Status: Draft | Discussion | Active | Dropped | Superseded Development

Superseded by: N/A

Related: N/A

...

The same refactoring is done in EndPointManager, as it holds a map of endpoints that also uses ServerLocation objects as keys.
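To illustrate why a map keyed only by location is a problem when several receivers sit behind the same host and port, here is a minimal sketch. It does not use Geode classes: Location, LocationAndMember, and the host name and member IDs are hypothetical, and the composite key is only one possible way to disambiguate entries, not necessarily the one used in the linked commit.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class EndpointKeySketch {
  /** Hypothetical stand-in for a location key: equality is based on host and port only. */
  static final class Location {
    final String host;
    final int port;

    Location(String host, int port) {
      this.host = host;
      this.port = port;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Location)) return false;
      Location other = (Location) o;
      return host.equals(other.host) && port == other.port;
    }

    @Override
    public int hashCode() {
      return Objects.hash(host, port);
    }
  }

  /** Hypothetical composite key (an assumption): location plus the member ID of the server. */
  static final class LocationAndMember {
    final Location location;
    final String memberId;

    LocationAndMember(Location location, String memberId) {
      this.location = location;
      this.memberId = memberId;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof LocationAndMember)) return false;
      LocationAndMember other = (LocationAndMember) o;
      return location.equals(other.location) && memberId.equals(other.memberId);
    }

    @Override
    public int hashCode() {
      return Objects.hash(location, memberId);
    }
  }

  /** Number of map entries left when endpoints are keyed by host and port only. */
  static int distinctByLocation(String host, int port, String... memberIds) {
    Map<Location, String> endpoints = new HashMap<>();
    for (String id : memberIds) {
      endpoints.put(new Location(host, port), id); // later entries overwrite earlier ones
    }
    return endpoints.size();
  }

  /** Number of map entries left when the key also carries the member ID. */
  static int distinctByLocationAndMember(String host, int port, String... memberIds) {
    Map<LocationAndMember, String> endpoints = new HashMap<>();
    for (String id : memberIds) {
      endpoints.put(new LocationAndMember(new Location(host, port), id), id);
    }
    return endpoints.size();
  }

  public static void main(String[] args) {
    System.out.println(distinctByLocation("lb.example.com", 5000, "receiver-1", "receiver-2"));
    System.out.println(distinctByLocationAndMember("lb.example.com", 5000, "receiver-1", "receiver-2"));
  }
}
```

With two receivers behind one host and port, the first map ends up with a single entry and the second with two.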

Check this commit for a draft of the proposed solution: https://github.com/apache/geode/pull/4824/commits/b180869c73095e7a810ba2e1c92e243a0220e888

Gw sender pings not reaching gw receivers

When PingTask instances are run by LiveServerPinger, they call PingOp.execute(ExecutablePool pool, ServerLocation server). PingOp uses only the host name and port (ServerLocation) to get the connection over which the ping message is sent. As all receivers share the same host and port, there is no guarantee that the connection actually points to the server we want to reach.

The solution consists of modifying the ping messages to include information about the server they want to reach. If a message is received by a different server, it can be forwarded to the proper server.
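The idea of a ping that names its target can be sketched as follows. This is only an illustration under our own assumptions: PingMessage, Server, the peers map and the member IDs are hypothetical and do not correspond to Geode classes or to the actual wire format of the ping.

```java
import java.util.HashMap;
import java.util.Map;

final class PingForwardSketch {
  /** Hypothetical ping message that names the member it is meant for. */
  static final class PingMessage {
    final String targetMemberId;

    PingMessage(String targetMemberId) {
      this.targetMemberId = targetMemberId;
    }
  }

  /** Hypothetical receiver; peers maps member IDs to the other receivers it can forward to. */
  static final class Server {
    final String memberId;
    final Map<String, Server> peers = new HashMap<>();
    int pingsHandled = 0;

    Server(String memberId) {
      this.memberId = memberId;
    }

    /** Handles the ping locally or forwards it; returns the ID of the member that handled it. */
    String handle(PingMessage ping) {
      if (memberId.equals(ping.targetMemberId)) {
        pingsHandled++;
        return memberId;
      }
      Server target = peers.get(ping.targetMemberId);
      if (target == null) {
        throw new IllegalStateException("Unknown member: " + ping.targetMemberId);
      }
      return target.handle(ping);
    }
  }

  public static void main(String[] args) {
    Server r1 = new Server("receiver-1");
    Server r2 = new Server("receiver-2");
    r1.peers.put(r2.memberId, r2);
    r2.peers.put(r1.memberId, r1);
    // The ping was delivered to receiver-1, but it is meant for receiver-2.
    System.out.println(r1.handle(new PingMessage("receiver-2"))); // prints "receiver-2"
  }
}
```

In this sketch the receiver that gets the ping first does not count it as its own, so only the intended server is treated as alive.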

Another alternative is to add a retry mechanism to PingOp so that a connection can be discarded if its endpoint is not the server we want to connect to. We have added a new method, PingOp.execute(ExecutablePool pool, Endpoint endpoint), to support this. In this way, if the connection obtained does not point to the required Endpoint, it can be discarded and a new one requested.
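The retry idea can be sketched as below. This is a simplified model, not the PingOp implementation: Connection, Pool, the maxAttempts limit and the member IDs are hypothetical, and the pool simply hands out connections in a fixed order to imitate whatever the load balancer returns.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class PingRetrySketch {
  /** Hypothetical stand-in for a connection; it only records which member it reached. */
  static final class Connection {
    final String memberId;

    Connection(String memberId) {
      this.memberId = memberId;
    }
  }

  /** Hypothetical pool that hands out connections in the order given at construction. */
  static final class Pool {
    private final Deque<Connection> incoming = new ArrayDeque<>();
    int discarded = 0;

    Pool(String... memberIds) {
      for (String id : memberIds) {
        incoming.add(new Connection(id));
      }
    }

    /** Returns the next connection, or null when none is left. */
    Connection acquire() {
      return incoming.poll();
    }

    void discard(Connection connection) {
      discarded++;
    }
  }

  /** Returns true if a connection to targetMemberId is obtained within maxAttempts. */
  static boolean ping(Pool pool, String targetMemberId, int maxAttempts) {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      Connection connection = pool.acquire();
      if (connection == null) {
        return false; // nothing left to try
      }
      if (connection.memberId.equals(targetMemberId)) {
        return true; // the ping would be sent over this connection
      }
      pool.discard(connection); // wrong server behind the shared address, ask again
    }
    return false;
  }

  public static void main(String[] args) {
    Pool pool = new Pool("receiver-1", "receiver-1", "receiver-2");
    System.out.println(ping(pool, "receiver-2", 5) + " after discarding " + pool.discarded);
  }
}
```

Bounding the number of attempts is a design choice of this sketch: without a limit, a ping to a receiver that is down would keep discarding connections indefinitely.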

...

After checking with the dev mailing list, we received the suggestion to configure serverAffinity in Kubernetes to solve the issue with the pings, but that option broke the failover of gw senders when a gw receiver is down.

FAQ

TBD

Errata

N/A

Annex: testExecuteOp failing

After our changes, we have been stuck trying to fix testExecuteOp from ConnectionPoolImplJUnitTest. The test hangs when executing an operation that has been implemented to throw an exception. Instead of trying to execute the operation on both servers, it keeps trying to execute it on the same server.

The problem is in the handshakeWithServer method of the ClientSideHandshakeImpl class. After the operation fails on the first server and is about to be executed on the second server, execution reaches this line:

...

The variable contains the member ID of the second server, but readServerMember returns the ID of the first server, so the operation ends up being executed on the first server again.