To be Reviewed By: 26-03-2020

Authors: Alberto Bustamante Reyes (alberto.bustamante.reyes@est.tech)

Status: Draft | Discussion | Active | Dropped | Superseded

Superseded by: N/A

Related: N/A

Problem

Geode WAN replication does not work correctly when all GW receivers are configured with the same hostname-for-senders and port. This setup arises when a Geode cluster is deployed on a Kubernetes cluster where all GW receivers are reachable from the outside world on the same VIP and port. Other configurations (a different hostname and/or a different port for each GW receiver) are costly in terms of operation, maintenance, and resources in cloud-native environments, and they also limit important use cases such as scaling.

Currently, it is possible to configure GW receivers as described, but this configuration causes problems that must be solved before it can be declared as supported by Geode.

The printouts on this page were obtained from a minikube environment with two Geode clusters, "Cluster-1" and "Cluster-2". The Geode software used was the develop branch at commit c8413592e5573f675c538c63ef9ee9f97a349e73.

Each cluster contains one locator and two servers; "Cluster-1" hosts the gw senders and "Cluster-2" hosts the gw receivers.


$ kubectl --namespace=geode-cluster-2 get all
NAME           READY  STATUS   RESTARTS  AGE
pod/locator-0  1/1    Running  0         55m
pod/server-0   1/1    Running  0         55m
pod/server-1   1/1    Running  0         55m

NAME                           TYPE       CLUSTER-IP      EXTERNAL-IP  PORT(S)          AGE
service/locator-site2-service  ClusterIP  None            <none>       10334/TCP        55m
service/receiver-site2-service NodePort   10.103.196.204  <none>       32000:32000/TCP  55m
service/server-site2-service   ClusterIP  None            <none>       30303/TCP        55m

NAME                      READY  AGE
statefulset.apps/locator  1/1    55m
statefulset.apps/server   2/2    55m


Problem 1: Gw sender failover

Shutting down one server stops replication to this cluster until that server is up again. Geode incorrectly assumes that no servers are alive when only one of them is down: because the servers share hostname-for-senders and port, they are treated as a single server.


Example, using "cluster-1" and "cluster-2", each with one locator and two servers:
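The effect can be reproduced with a minimal sketch that uses only the JDK. `HostPort` is a hypothetical stand-in for a host/port key and is not Geode's ServerLocation class; the map is only an illustration of bookkeeping keyed by advertised location, not the actual Geode data structure.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical host/port key, used here only for illustration.
final class HostPort {
  final String host;
  final int port;

  HostPort(String host, int port) {
    this.host = host;
    this.port = port;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof HostPort)) {
      return false;
    }
    HostPort other = (HostPort) o;
    return port == other.port && host.equals(other.host);
  }

  @Override
  public int hashCode() {
    return Objects.hash(host, port);
  }
}

class SharedLocationCollision {
  // Registers two receivers under the same advertised host/port, then removes one.
  static int liveReceiversAfterOneStops() {
    String host = "receiver-site2-service.geode-cluster-2.svc.cluster.local";
    Map<HostPort, String> receivers = new HashMap<>();
    receivers.put(new HostPort(host, 32000), "server-0");
    receivers.put(new HostPort(host, 32000), "server-1"); // same key: replaces server-0
    receivers.remove(new HostPort(host, 32000)); // server-0 stops
    return receivers.size();
  }

  public static void main(String[] args) {
    System.out.println(liveReceiversAfterOneStops()); // prints 0
  }
}
```

Both receivers collapse into one map entry, so removing the entry for the stopped server leaves no known receivers even though server-1 is still running.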

Cluster-1 gfsh>list members
Member Count : 3

Name      | Id
--------- | ------------------------------------------------------------
server-0  | 172.17.0.4(server-0:65)<v1>:41000
locator-0 | 172.17.0.6(locator-0:25:locator)<ec><v0>:41000 [Coordinator]
server-1  | 172.17.0.8(server-1:47)<v1>:41000

Cluster-1 gfsh>list gateways
GatewaySender Section

GatewaySender Id | Member                            | Remote Cluster Id | Type     | Status                | Queued Events | Receiver Location
---------------- | --------------------------------- | ----------------- | -------- | --------------------- | ------------- | --------------------------------------------------------------
sender-to-2      | 172.17.0.4(server-0:65)<v1>:41000 | 2                 | Parallel | Running and Connected | 0             | receiver-site2-service.geode-cluster-2.svc.cluster.local:32000
sender-to-2      | 172.17.0.8(server-1:47)<v1>:41000 | 2                 | Parallel | Running and Connected | 0             | receiver-site2-service.geode-cluster-2.svc.cluster.local:32000



Cluster-2 gfsh>list members
Member Count : 3

Name      | Id
--------- | ------------------------------------------------------------
server-0  | 172.17.0.5(server-0:65)<v1>:41000
locator-0 | 172.17.0.7(locator-0:25:locator)<ec><v0>:41000 [Coordinator]
server-1  | 172.17.0.9(server-1:46)<v1>:41000



Cluster-2 gfsh>list gateways
GatewayReceiver Section

Member                            | Port  | Sender Count | Senders Connected
--------------------------------- | ----- | ------------ | -----------------------------------------------------------------------------------------------------------------------------------------------
172.17.0.5(server-0:65)<v1>:41000 | 32000 | 6            | 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000,
                                                           172.17.0.8(server-1:47)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000
172.17.0.9(server-1:46)<v1>:41000 | 32000 | 8            | 172.17.0.8(server-1:47)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000,
                                                           172.17.0.8(server-1:47)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000,
                                                           172.17.0.4(server-0:65)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000


If one server is stopped on "cluster-2", both senders in "cluster-1" are disconnected:



Problem 2: Gw sender pings not reaching gw receivers

A gw sender internally uses a client pool that sends ping messages to the gw receivers it is connected to. On the receiver side, the ClientHealthMonitor thread is in charge of handling these ping messages. If no ping is received from a given sender within the maximum allowed interval, the sender is considered down and its connection is closed. When all gw receivers are configured with the same host and port, ping messages do not reach all the receivers, just one of them, so connections are closed.

The following examples show the errors we have seen:
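The timeout behaviour can be illustrated with a simplified model. This sketch is not Geode's ClientHealthMonitor; the class, method names, and the "sender" id are hypothetical, and the 60000 ms limit is taken from the "maximum allowed time between pings" value in the logs on this page. Two receivers track the same sender, but every ping is delivered to only one of them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of a receiver-side heartbeat monitor.
class HeartbeatMonitorSketch {
  private final long maxIntervalMs;
  private final Map<String, Long> lastHeartbeatMs = new HashMap<>();

  HeartbeatMonitorSketch(long maxIntervalMs) {
    this.maxIntervalMs = maxIntervalMs;
  }

  // Called when a sender connects or when one of its pings arrives.
  void heartbeat(String senderId, long nowMs) {
    lastHeartbeatMs.put(senderId, nowMs);
  }

  // Drops every sender whose last heartbeat is older than the maximum interval
  // and returns the ids that were dropped.
  List<String> expire(long nowMs) {
    List<String> dropped = new ArrayList<>();
    Iterator<Map.Entry<String, Long>> it = lastHeartbeatMs.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, Long> entry = it.next();
      if (nowMs - entry.getValue() > maxIntervalMs) {
        dropped.add(entry.getKey());
        it.remove();
      }
    }
    return dropped;
  }

  public static void main(String[] args) {
    HeartbeatMonitorSketch receiver0 = new HeartbeatMonitorSketch(60_000);
    HeartbeatMonitorSketch receiver1 = new HeartbeatMonitorSketch(60_000);
    // The sender connects to both receivers at t = 0.
    receiver0.heartbeat("sender", 0);
    receiver1.heartbeat("sender", 0);
    // Every later ping goes through the shared host/port and lands on receiver0 only.
    for (long t = 20_000; t <= 60_000; t += 20_000) {
      receiver0.heartbeat("sender", t);
    }
    System.out.println(receiver0.expire(60_595)); // prints []
    System.out.println(receiver1.expire(60_595)); // prints [sender]
  }
}
```

The receiver that never sees a ping drops the sender as soon as the interval is exceeded, even though the sender is alive and pinging.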


Example 1:

Cluster-2 gfsh>list gateways
GatewayReceiver Section

Member                            | Port  | Sender Count | Senders Connected
--------------------------------- | ----- | ------------ | ------------------------------------------------------------------------------------------------------------
172.17.0.5(server-0:65)<v1>:41000 | 32000 | 6            | 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000,
                                                           172.17.0.8(server-1:47)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000
172.17.0.9(server-1:46)<v1>:41000 | 32000 | 8            | 172.17.0.8(server-1:47)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000,
                                                           172.17.0.8(server-1:47)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000,
                                                           172.17.0.4(server-0:65)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000


But after some time, the connections from one of the senders are closed. Connections from cluster-1/server-1 have disappeared from the cluster-2/server-1 list of connected senders:

Cluster-2 gfsh>list gateways
GatewayReceiver Section

Member                            | Port  | Sender Count | Senders Connected
--------------------------------- | ----- | ------------ | ------------------------------------------------------------------------------------------------------------
172.17.0.5(server-0:65)<v1>:41000 | 32000 | 6            | 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000,
                                                           172.17.0.8(server-1:47)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000
172.17.0.9(server-1:46)<v1>:41000 | 32000 | 5            | 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000,
                                                           172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000


Looking for ClientHealthMonitor logs on both servers:


root@server-0:/# grep ClientHealthMonitor server-0/server-0.log


[info 2020/03/10 11:13:38.546 GMT <main> tid=0x1] ClientHealthMonitorThread maximum allowed time between pings: 60000



root@server-1:/# grep ClientHealthMonitor server-1/server-1.log

[info 2020/03/10 11:13:38.700 GMT <main> tid=0x1] ClientHealthMonitorThread maximum allowed time between pings: 60000

[warn 2020/03/10 11:14:52.763 GMT <ClientHealthMonitor Thread> tid=0x39] ClientHealthMonitor: Unregistering client with member id identity(172.17.0.8(server-1:47)<v1>:41000,connection=1 due to: Unknown reason

[warn 2020/03/10 11:14:52.763 GMT <ClientHealthMonitor Thread> tid=0x39] Monitoring client with member id identity(172.17.0.8(server-1:47)<v1>:41000,connection=1. It had been 60595 ms since the latest heartbeat. Max interval is 60000. Terminated client.


And some minutes later, all connections are lost:

Cluster-1 gfsh>list gateways
GatewaySender Section

GatewaySender Id | Member                            | Remote Cluster Id | Type     | Status                 | Queued Events | Receiver Location
---------------- | --------------------------------- | ----------------- | -------- | ---------------------- | ------------- | --------------------------------------------------------------
sender-to-2      | 172.17.0.4(server-0:65)<v1>:41000 | 2                 | Parallel | Running, not Connected | 0             | receiver-site2-service.geode-cluster-2.svc.cluster.local:32000
sender-to-2      | 172.17.0.8(server-1:47)<v1>:41000 | 2                 | Parallel | Running, not Connected | 0             | receiver-site2-service.geode-cluster-2.svc.cluster.local:32000


Cluster-2 gfsh>list gateways
GatewayReceiver Section

Member                            | Port  | Sender Count | Senders Connected
--------------------------------- | ----- | ------------ | -----------------
172.17.0.5(server-0:65)<v1>:41000 | 32000 | 0            |
172.17.0.9(server-1:46)<v1>:41000 | 32000 | 0            |


Checking the logs again, we can see new logs from the ClientHealthMonitor:

root@server-0:/# grep ClientHealthMonitor server-0/server-0.log

[info 2020/03/10 11:13:38.546 GMT <main> tid=0x1] ClientHealthMonitorThread maximum allowed time between pings: 60000

[warn 2020/03/10 11:20:12.203 GMT <ServerConnection on port 32000 Thread 3> tid=0x45] ClientHealthMonitor: Unregistering client with member id identity(172.17.0.8(server-1:47)<v1>:41000,connection=1 due to: The connection has been reset while reading the header

[warn 2020/03/10 11:22:22.336 GMT <ServerConnection on port 32000 Thread 6> tid=0x4c] ClientHealthMonitor: Unregistering client with member id identity(172.17.0.4(server-0:65)<v1>:41000,connection=1 due to: The connection has been reset while reading the header


root@server-1:/# grep ClientHealthMonitor server-1/server-1.log

[info 2020/03/10 11:13:38.700 GMT <main> tid=0x1] ClientHealthMonitorThread maximum allowed time between pings: 60000

[warn 2020/03/10 11:14:52.763 GMT <ClientHealthMonitor Thread> tid=0x39] ClientHealthMonitor: Unregistering client with member id identity(172.17.0.8(server-1:47)<v1>:41000,connection=1 due to: Unknown reason

[warn 2020/03/10 11:14:52.763 GMT <ClientHealthMonitor Thread> tid=0x39] Monitoring client with member id identity(172.17.0.8(server-1:47)<v1>:41000,connection=1. It had been 60595 ms since the latest heartbeat. Max interval is 60000. Terminated client.

[warn 2020/03/10 11:22:13.064 GMT <ClientHealthMonitor Thread> tid=0x39] ClientHealthMonitor: Unregistering client with member id identity(172.17.0.4(server-0:65)<v1>:41000,connection=1 due to: Unknown reason

[warn 2020/03/10 11:22:13.065 GMT <ClientHealthMonitor Thread> tid=0x39] Monitoring client with member id identity(172.17.0.4(server-0:65)<v1>:41000,connection=1. It had been 60747 ms since the latest heartbeat. Max interval is 60000. Terminated client.


Example 2:

Cluster-1 gfsh>list members
Member Count : 3

Name      | Id
--------- | ------------------------------------------------------------
server-0  | 172.17.0.4(server-0:69)<v1>:41000
locator-0 | 172.17.0.6(locator-0:26:locator)<ec><v0>:41000 [Coordinator]
server-1  | 172.17.0.8(server-1:46)<v1>:41000

Cluster-1 gfsh>list gateways
GatewaySender Section

GatewaySender Id | Member                            | Remote Cluster Id | Type     | Status                | Queued Events | Receiver Location
---------------- | --------------------------------- | ----------------- | -------- | --------------------- | ------------- | --------------------------------------------------------------
sender-to-2      | 172.17.0.4(server-0:69)<v1>:41000 | 2                 | Parallel | Running and Connected | 0             | receiver-site2-service.geode-cluster-2.svc.cluster.local:32000
sender-to-2      | 172.17.0.8(server-1:46)<v1>:41000 | 2                 | Parallel | Running and Connected | 0             | receiver-site2-service.geode-cluster-2.svc.cluster.local:32000


Cluster-2 gfsh>list members
Member Count : 3

Name      | Id
--------- | ------------------------------------------------------------
server-0  | 172.17.0.5(server-0:65)<v1>:41000
locator-0 | 172.17.0.7(locator-0:24:locator)<ec><v0>:41000 [Coordinator]
server-1  | 172.17.0.9(server-1:51)<v1>:41000

Cluster-2 gfsh>list gateways
GatewayReceiver Section

Member                            | Port  | Sender Count | Senders Connected
--------------------------------- | ----- | ------------ | -------------------------------------------------------------------------------------------------------------------------------------------------
172.17.0.5(server-0:65)<v1>:41000 | 32000 | 7            | 172.17.0.8(server-1:46)<v1>:41000, 172.17.0.8(server-1:46)<v1>:41000, 172.17.0.8(server-1:46)<v1>:41000, 172.17.0.4(server-0:69)<v1>:41000,
                                                           172.17.0.4(server-0:69)<v1>:41000, 172.17.0.8(server-1:46)<v1>:41000, 172.17.0.4(server-0:69)<v1>:41000
172.17.0.9(server-1:51)<v1>:41000 | 32000 | 7            | 172.17.0.4(server-0:69)<v1>:41000, 172.17.0.4(server-0:69)<v1>:41000, 172.17.0.8(server-1:46)<v1>:41000, 172.17.0.8(server-1:46)<v1>:41000,
                                                           172.17.0.4(server-0:69)<v1>:41000, 172.17.0.8(server-1:46)<v1>:41000, 172.17.0.4(server-0:69)<v1>:41000


And after some seconds:

Cluster-2 gfsh>list gateways
GatewayReceiver Section

Member                            | Port  | Sender Count | Senders Connected
--------------------------------- | ----- | ------------ | -------------------------------------------------------------------------------------------------------------------------------------------------
172.17.0.5(server-0:65)<v1>:41000 | 32000 | 0            |
172.17.0.9(server-1:51)<v1>:41000 | 32000 | 7            | 172.17.0.4(server-0:69)<v1>:41000, 172.17.0.4(server-0:69)<v1>:41000, 172.17.0.8(server-1:46)<v1>:41000, 172.17.0.8(server-1:46)<v1>:41000,
                                                           172.17.0.4(server-0:69)<v1>:41000, 172.17.0.8(server-1:46)<v1>:41000, 172.17.0.4(server-0:69)<v1>:41000


Logs of the servers. In this test, both senders were considered down by one of the receivers:

root@server-0:/# grep ClientHealthMonitor server-0/server-0.log

[info 2020/03/10 14:02:34.130 GMT <main> tid=0x1] ClientHealthMonitorThread maximum allowed time between pings: 60000

[warn 2020/03/10 14:03:56.191 GMT <ClientHealthMonitor Thread> tid=0x37] ClientHealthMonitor: Unregistering client with member id identity(172.17.0.8(server-1:46)<v1>:41000,connection=1 due to: Unknown reason

[warn 2020/03/10 14:03:56.192 GMT <ClientHealthMonitor Thread> tid=0x37] Monitoring client with member id identity(172.17.0.8(server-1:46)<v1>:41000,connection=1. It had been 60507 ms since the latest heartbeat. Max interval is 60000. Terminated client.

[warn 2020/03/10 14:03:56.194 GMT <ClientHealthMonitor Thread> tid=0x37] ClientHealthMonitor: Unregistering client with member id identity(172.17.0.4(server-0:69)<v1>:41000,connection=1 due to: Unknown reason

[warn 2020/03/10 14:03:56.194 GMT <ClientHealthMonitor Thread> tid=0x37] Monitoring client with member id identity(172.17.0.4(server-0:69)<v1>:41000,connection=1. It had been 60444 ms since the latest heartbeat. Max interval is 60000. Terminated client.


root@server-1:/# grep ClientHealthMonitor server-1/server-1.log

[info 2020/03/10 14:02:34.275 GMT <main> tid=0x1] ClientHealthMonitorThread maximum allowed time between pings: 60000


And some minutes later:

Cluster-2 gfsh>list gateways
GatewayReceiver Section

Member                            | Port  | Sender Count | Senders Connected
--------------------------------- | ----- | ------------ | -------------------------------------------------------------------------------------------------------
172.17.0.5(server-0:65)<v1>:41000 | 32000 | 0            |
172.17.0.9(server-1:51)<v1>:41000 | 32000 | 3            | 172.17.0.8(server-1:46)<v1>:41000, 172.17.0.8(server-1:46)<v1>:41000, 172.17.0.8(server-1:46)<v1>:41000

Anti-Goals

N/A

Solution

Gw sender failover

The solution consists of refactoring some maps in the LocatorLoadSnapshot class. These maps use ServerLocation objects as keys, which has to change because a ServerLocation will no longer be unique for each server. We changed the maps to use InternalDistributedMember objects as keys. The ServerLocation information is not lost, as it is contained in the entry value in all the maps.

The same refactoring is done in EndpointManager, as it holds a map of endpoints that also uses ServerLocation objects as keys.
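The idea behind the refactoring can be sketched with plain JDK types. This is only an illustration under stated assumptions: a String stands in for InternalDistributedMember, and `MemberKeyedSnapshot` and its `Entry` type are hypothetical and do not mirror the real LocatorLoadSnapshot or EndpointManager code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: servers are keyed by member id, and the shared
// host/port is kept in the entry value instead of being used as the key.
class MemberKeyedSnapshot {
  static final class Entry {
    final String host;
    final int port;

    Entry(String host, int port) {
      this.host = host;
      this.port = port;
    }
  }

  private final Map<String, Entry> serversByMember = new HashMap<>();

  void addServer(String memberId, String host, int port) {
    serversByMember.put(memberId, new Entry(host, port));
  }

  void removeServer(String memberId) {
    serversByMember.remove(memberId);
  }

  int liveServers() {
    return serversByMember.size();
  }

  // Returns "host:port" for a known member, or null if the member is unknown.
  String locationOf(String memberId) {
    Entry entry = serversByMember.get(memberId);
    return entry == null ? null : entry.host + ":" + entry.port;
  }

  public static void main(String[] args) {
    MemberKeyedSnapshot snapshot = new MemberKeyedSnapshot();
    snapshot.addServer("172.17.0.5(server-0:65)<v1>:41000", "receiver-site2-service", 32000);
    snapshot.addServer("172.17.0.9(server-1:46)<v1>:41000", "receiver-site2-service", 32000);
    snapshot.removeServer("172.17.0.5(server-0:65)<v1>:41000"); // server-0 stops
    System.out.println(snapshot.liveServers()); // prints 1
  }
}
```

With the member id as key, the two receivers remain distinct entries even though they advertise the same location, so stopping one leaves the other available for failover.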

Gw sender pings not reaching gw receivers

When PingTasks are run by LiveServerPinger, they call PingOp.execute(ExecutablePool pool, ServerLocation server). PingOp only uses the hostname and port (ServerLocation) to get the connection on which to send the ping message. As all receivers share the same host and port, it is not guaranteed that the connection really points to the server we want to reach.


The approach we have taken is to add a retry mechanism to PingOp, so that a connection can be discarded if its endpoint is not the server we want to reach. To this end we have added a new method, PingOp.execute(ExecutablePool pool, Endpoint endpoint). In this way, if the connection obtained does not point to the required Endpoint, it can be discarded and a new one requested.
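The discard-and-retry idea can be sketched as follows. This is a minimal illustration and not the PingOp change itself: `Conn`, `connectionTo`, and the attempt limit are hypothetical, and the Supplier stands in for whatever hands out connections behind the shared host and port.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch of "discard and retry"; it does not use Geode classes.
class PingRetrySketch {
  // Minimal stand-in for a connection that knows which member it reached.
  static final class Conn {
    final String memberId;
    boolean discarded;

    Conn(String memberId) {
      this.memberId = memberId;
    }
  }

  // Asks the source for connections until one reaches the wanted member,
  // discarding every connection that landed on a different member.
  static Optional<Conn> connectionTo(String wantedMemberId, Supplier<Conn> source,
      int maxAttempts) {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      Conn conn = source.get();
      if (conn.memberId.equals(wantedMemberId)) {
        return Optional.of(conn);
      }
      conn.discarded = true; // wrong receiver behind the shared host/port
    }
    return Optional.empty(); // give up after maxAttempts
  }
}
```

Bounding the number of attempts is a design choice of this sketch: it keeps the ping from looping forever when the wanted receiver is really down.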

Other alternatives to the retry mechanism that we have not explored could be:

  • Add an option to deactivate the ping mechanism for gw sender/gw receiver communication.
  • Send the ping using only existing connections, without creating new ones.

Changes and Additions to Public Interfaces

N/A

Performance Impact

When getting the connection to execute the ping, several retries may be needed until the right connection is obtained, so this operation will take longer. However, we do not think it will impact performance.

Backwards Compatibility and Upgrade Path

N/A

Prior Art

After checking with the dev mailing list, we received the suggestion to configure sessionAffinity in Kubernetes to solve the issue with the pings, but that option broke the failover of gw senders when a gw receiver is down.

FAQ

TBD

Errata

N/A


Annex: testExecuteOp failing

After our changes, we have been stuck trying to fix testExecuteOp in ConnectionPoolImplJUnitTest. The test hangs when executing an operation that has been implemented to throw an exception. Instead of trying to execute the operation on both servers, it continuously tries to execute it on the same server.

The problem is in the handshakeWithServer function of the ClientSideHandshakeImpl class. We have seen that after the operation fails on the first server and is about to be executed on the second server, at this line:

member = readServerMember(dis);

The variable contains the member id of the second server, but readServerMember returns the id of the first server, so the operation ends up being executed on that server again.