...

The problem is that shutting down one server stops replication to its cluster until that server is up again. Because all receivers share the same hostname-for-senders and port, Geode treats them as a single server and incorrectly assumes that no servers are alive when only one of them is down.
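The effect of keying receivers by host and port alone can be illustrated with a minimal sketch. This is not Geode code: the class `SharedEndpointSketch` and its methods are hypothetical, and the sketch only assumes that the sender identifies each receiver by its advertised "host:port" string.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names are Geode classes.
final class SharedEndpointSketch {

    /** Counts how many receivers the sender can tell apart when keyed by "host:port". */
    static int distinctReceivers(List<String> advertisedEndpoints) {
        return new HashSet<>(advertisedEndpoints).size();
    }

    /** Returns the endpoints still considered alive after one endpoint is marked down. */
    static Set<String> aliveAfterFailure(List<String> advertisedEndpoints, String failed) {
        Set<String> alive = new HashSet<>(advertisedEndpoints);
        alive.remove(failed); // removing the shared key removes every receiver behind it
        return alive;
    }

    public static void main(String[] args) {
        String shared = "receiver-site2-service.geode-cluster-2.svc.cluster.local:32000";
        // Two receivers in cluster-2, both advertising the same host and port.
        List<String> advertised = List.of(shared, shared);

        System.out.println(distinctReceivers(advertised));                    // prints 1
        System.out.println(aliveAfterFailure(advertised, shared).isEmpty());  // prints true
    }
}
```

With two receivers behind one key, a single failure empties the set of live endpoints, which matches the symptom described above.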


For example, take "cluster-1" and "cluster-2", each with one locator and two servers:

Cluster-1 gfsh>list members
Member Count : 3

Name      | Id
--------- | ------------------------------------------------------------
server-0  | 172.17.0.4(server-0:65)<v1>:41000
locator-0 | 172.17.0.6(locator-0:25:locator)<ec><v0>:41000 [Coordinator]
server-1  | 172.17.0.8(server-1:65)<v1>:41000


Cluster-1 gfsh>list gateways
GatewaySender Section

GatewaySender Id | Member                            | Remote Cluster Id | Type     | Status                | Queued Events | Receiver Location
---------------- | --------------------------------- | ----------------- | -------- | --------------------- | ------------- | --------------------------------------------------------------
sender-to-2      | 172.17.0.4(server-0:65)<v1>:41000 | 2                 | Parallel | Running and Connected | 0             | receiver-site2-service.geode-cluster-2.svc.cluster.local:32000
sender-to-2      | 172.17.0.8(server-1:65)<v1>:41000 | 2                 | Parallel | Running and Connected | 0             | receiver-site2-service.geode-cluster-2.svc.cluster.local:32000


Cluster-2 gfsh>list members
Member Count : 3

Name      | Id
--------- | ------------------------------------------------------------
server-0  | 172.17.0.5(server-0:65)<v1>:41000
locator-0 | 172.17.0.7(locator-0:24:locator)<ec><v0>:41000 [Coordinator]
server-1  | 172.17.0.9(server-1:46)<v1>:41000


Cluster-2 gfsh>list gateways
GatewayReceiver Section

Member                            | Port  | Sender Count | Senders Connected
--------------------------------- | ----- | ------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
172.17.0.5(server-0:65)<v1>:41000 | 32000 | 6            | 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4..
172.17.0.9(server-1:46)<v1>:41000 | 32000 | 7            | 172.17.0.8(server-1:65)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.8(server-1:65)<v1>:41000, 172.17.0.8(server-1:65)<v1>:41000, 172.17.0.8(server-1:65)<v1>:41000, 172.17.0.8..


If one server is stopped on "cluster-2", both senders in "cluster-1" are disconnected:



Gw sender pings not reaching gw receivers

A gw sender internally uses a client pool that sends ping messages to the gw receiver it is connected to. On the receiver side, the ClientHealthMonitor thread is in charge of handling these ping messages. If no ping is received from a given sender within the maximum allowed interval, the sender is considered down and its connection is closed. When all gw receivers are configured with the same host and port, ping messages reach only one of the receivers, so the other receivers close their connections.
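The timeout behaviour described above can be sketched as follows. This is a simplified model, not Geode's ClientHealthMonitor: the class `PingTimeoutSketch`, its methods, and the explicit millisecond timestamps are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a per-receiver health monitor; not a Geode class.
final class PingTimeoutSketch {
    private final long maxIntervalMs;
    private final Map<String, Long> lastPingMs = new HashMap<>();

    public PingTimeoutSketch(long maxIntervalMs) {
        this.maxIntervalMs = maxIntervalMs;
    }

    /** Records a ping (or the initial connection) from a sender at the given time. */
    public void recordPing(String senderId, long nowMs) {
        lastPingMs.put(senderId, nowMs);
    }

    /** Drops every sender whose last ping is older than the maximum interval and returns them. */
    public List<String> sweep(long nowMs) {
        List<String> terminated = new ArrayList<>();
        lastPingMs.entrySet().removeIf(entry -> {
            boolean expired = nowMs - entry.getValue() > maxIntervalMs;
            if (expired) {
                terminated.add(entry.getKey());
            }
            return expired;
        });
        return terminated;
    }

    public static void main(String[] args) {
        // One monitor per receiver, both using the 60000 ms limit seen in the logs.
        PingTimeoutSketch receiver0 = new PingTimeoutSketch(60_000);
        PingTimeoutSketch receiver1 = new PingTimeoutSketch(60_000);

        // The sender connects to both receivers at t=0 ...
        receiver0.recordPing("cluster-1/server-0", 0);
        receiver1.recordPing("cluster-1/server-0", 0);
        // ... but its later ping reaches only receiver0 behind the shared host and port.
        receiver0.recordPing("cluster-1/server-0", 30_000);

        System.out.println(receiver0.sweep(60_082)); // prints []
        System.out.println(receiver1.sweep(60_082)); // prints [cluster-1/server-0]
    }
}
```

In this model the receiver that never sees a ping terminates the sender once the interval is exceeded, while the receiver that does see pings keeps it.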

After booting both clusters, we can see in the logs of servers in cluster-2:

root@server-0:/# grep ClientHealthMonitor server-0/server-0.log

[info 2020/03/10 10:34:34.231 GMT <main> tid=0x1] ClientHealthMonitorThread maximum allowed time between pings: 60000



root@server-1:/# grep ClientHealthMonitor server-1/server-1.log

[info 2020/03/10 10:34:34.353 GMT <main> tid=0x1] ClientHealthMonitorThread maximum allowed time between pings: 60000

[warn 2020/03/10 10:35:49.405 GMT <ClientHealthMonitor Thread> tid=0x38] ClientHealthMonitor: Unregistering client with member id identity(172.17.0.4(server-0:65)<v1>:41000,connection=1 due to: Unknown reason

[warn 2020/03/10 10:35:49.407 GMT <ClientHealthMonitor Thread> tid=0x38] Monitoring client with member id identity(172.17.0.4(server-0:65)<v1>:41000,connection=1. It had been 60082 ms since the latest heartbeat. Max interval is 60000. Terminated client.



Notice that the connection from cluster-1/server-0 has disappeared from the list of connected senders on cluster-2/server-1:

Cluster-2 gfsh>list gateways
GatewayReceiver Section

Member                            | Port  | Sender Count | Senders Connected
--------------------------------- | ----- | ------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
172.17.0.5(server-0:65)<v1>:41000 | 32000 | 6            | 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4..
172.17.0.9(server-1:46)<v1>:41000 | 32000 | 6            | 172.17.0.8(server-1:65)<v1>:41000, 172.17.0.8(server-1:65)<v1>:41000, 172.17.0.8(server-1:65)<v1>:41000, 172.17.0.8(server-1:65)<v1>:41000, 172.17.0.8(server-1:65)<v1>:41000, 172.17.0.8..


Anti-Goals

What is outside the scope of what the proposal is trying to solve?

...