Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

The Geode distributed system is made up of a number of member caches that connect to each other in a peer-to-peer (P2P) network. In this topology, each peer maintains communication channels to every other peer, allowing for one network hop access to any other cache member. Even though Geode clients (that can host their own local cache) only connect to specific servers, all servers in a Geode distributed system are always aware of and connected to each other.

Dynamic Group Membership Service (GMS)

A static distributed membership system, where the member caches are well defined (their identity and where they will be hosted) is inflexible and makes it difficult to add or remove members dynamically. The system would have to shutdown partially or completely to expand or contract the number of members participating in the distributed system in response to changing load.

Instead, Geode's dynamic model uses a concept of system membership that is self-defined, allowing members to be added or removed in a very short time scale compared to static systems. The GMS allows processes to join or leave the system at any time and it communicates these membership events to every other member in the system.

Discovery

Any peer member (either a application hosting a peer cache or a CacheServer) announces itself to the distributed system through one of two mechanisms:

...

by first contacting a Locator.  This is a component that maintains a discovery set of all peer members in the distributed system at any given moment. Though typically started as a separate process (with redundancy), locators can also be embedded in any peer cache (like a CacheServer). New members contact the locator via a TCP port to get the current set of peer members. These members are then asked who the current membership coordinator is. Locators are neither a bottleneck to cache operations nor a single point of failure.

...

Multicast based discovery: As an alternative to locators, members can use a multicast channel to find the membership coordinator and join the distributed system. Multicast is not enabled in all networks, particularly small, home-based networks.

Image Added

Discovery DiagImage Removed

The membership coordinator is usually the oldest member locator in the distributed system and is responsible for ensuring that all members know about changes in membership. If the membership coordinator fails, the GMS automatically selects the next-oldest member locator to function as the coordinator.  If all locators are down a server will become the coordinator but new members will not be able to join until at least one locator has been restarted.

Joining

Once a new process has discovered the membership coordinator's identity, it sends the coordinator a join-request. The coordinator responds by adding the process's identity to the membership set (called a membership view) and sending the updated set to the new process and to all of the existing members.

Joining DiagImage RemovedImage Added

When the new process receives the join-response, it forms network connections to the other members and exchanges configuration information. The network connections are TCP/IP stream-socket connections unless the Geode disable-tcp property has been set to disable use of stream-sockets for this purpose.

The exchange of configuration information allows existing members to ensure that the new process is using the same version of Geode and has compatible licensing and communication settings. If the new process, for instance, is using a license meant for testing or development and it attempts to join a distributed system that is using a production license, the members of the distributed system will reject the new process. The new process will then send a leave-request to the coordinator and close its cache.

Surprise Members

One of the guiding principles in the design of Geode's distributed algorithms is that nothing is instantaneous. For instance, when a new process is joining the distributed system, it may receive its join-response from the membership coordinator and commence configuration exchange (described below) before other members know of its existence. When this happens, the other members announce the new process as a surprise member and allow the configuration exchange. The new process remains designated as a surprise member until the new membership view is finally received. If this doesn't happen within 30 seconds (6 * member-timeout setting), the surprise member is rejected and communication with it is terminated.

Cache Formation

Once a new member has joined the distributed system, it begins constructing its cache regions. Cache region formation also entails configuration exchange with other members of the distributed system to ensure that the new member is using compatible settings, and to initiate the flow of events for the region to the new member. This may also include using a non-blocking state-flush algorithm to ensure that the new member's region has consistent state with other members. The distributed state-flush algorithm essentially places markers in communication channels that are used to trigger correct transfer of current state of the region to the new member. Using these markers allows the state-flush process to proceed without blocking reads or updates that are occurring on the existing members.

...

Since the current state of the region may be quite large and take time to transfer, Geode is careful to not overwrite more recent changes that have been received while this is happening.

Process when a member cache departs normally

When a member leaves the distributed system, it flushes its communication channels to other members and then sends them messages that it is stopping cache operations. The communication channels must be flushed to ensure that all messages have been read by their recipients. The other members receive the stopping message and update their cache region meta information and cease distributing changes to the departing member. The member then sends a leave-request to the membership coordinator, which responds by sending a new membership view to all members of the distributed system. Those members then close communication channels to the departing member and add its identity to a shun set to disallow further communications with it.

If Geode locators are being used for discovery, they will eventually remove the departed member from the discovery set. This is done asynchronously in order to reduce load on the locator process.

Process when a member cache departs abnormally

Geode uses a combination of TCP/IP stream-socket connections and UDP datagram heartbeats to detect abnormal termination of members. Each member selects one other member of the system to monitor. That member is expected to send periodic datagram heartbeats to its monitoring process. If a heartbeat is missed within twice the member-timeout Geode setting (which defaults to 5000 milliseconds), the product attempts to form a TCP/IP stream-socket connection to the monitored process. If that fails, a suspect message is distributed to all members.

...

A sick member, or one that is consuming too much CPU to respond to are you alive messages, may be kicked out while still being more or less alive. When that happens, the member's cache will shut down and a ForcedDisconnectException alert is issued to any attached applications.

Handling network partition (aka Split brain)

When there is network segmentation, a distributed system that doesn't handle the condition properly will allow multiple subgroups to form. This can lead to numerous problems including distributed applications operating on inconsistent data. For instance, since clients connecting to a server cluster are not tied into the membership system, a client might communicate with servers from multiple subgroups. Or, one set of clients might see one subgroup of servers while another set of clients cannot see that subgroup but can see another one.

...

Each member is responsible for detecting a network partition. Usually this happens quickly, with one of the members on the losing partition determining that the old membership coordinator is gone and that it should elect itself membership coordinator. Once it does this and sends a new membership view it discovers, through a 2-phase view commit, the other missing members and sends a network parition message to the processes on the losing side.

Auto-reconnect

Geode also has an auto-reconnect mechanism that is useful in Locators and caches with persistent storage or sufficient redundancy that they can recover cache data after a network partition is corrected.

...

Disconnected members also create a background Quorum Responder thread that allows others to contact them. This is used in multicast clusters to let disconnected members discover a quorum of members (whether disconnected or not) and proceed to rejoin the system. It is also used by Locators to allow them to boot their location services and service discovery requests.

Enabling network partition detection

Network partition detection is enabled by setting the enable-network-partition-detection DistributedSystem property to true. This should be done in all Locators and in any other process that you wish to be sensitive to network partitioning.

Dealing with slow or sick members

If a member is slow to respond it can drag down the whole distributed system. A sick member, perhaps an application that is low on memory and is constantly running garbage-collection, can have the same effect. Geode has three mechanisms for dealing with this problem:

...