Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Akka looks promising and a lot of people are using it. 

A custom solution that does not leverage other projects for clustering could also be implemented, especially if JGroups is used for reliable UDP messaging.

Examination of the options

In this section we'll look at each of the options and see how it might address the requirements.

JGroups

A newer version of JGroups could be modified to fit the requirements, just as is being done with the initial, incubating, version of Geode.

New versions of JGroups use the Apache license so they are compatible with Geode and could replace the old 2.2.9 version.  A new version of JGroups would be a natural fit for replacing the old version.  A number of the changes made to the older JGroups for GemFire, such as out-of-band messaging and out-of-band messages are now available in off-the-shelf JGroups.  The GossipServer service used in the Locator is now, sadly, missing but there is a GossipRouter service that might plug into the Locator instead.  Other problems, such as the membership view being transmitted in a message header instead of the body have also been corrected (allowing for large membership views through body fragmentation).  Message-handling executors have been added and the old up/down thread pattern has been removed.  UUIDs are used for member IDs in preference to InetAddress:port (though these are still used for identifying endpoints associated with UUID identifiers).

The downside of using JGroups is that we have to fork it in order to insert our failure-detection SLAs, network-partition detection and handling, ability to shut down in all circumstances, etc.  We largely rewrote failure-detection, but the GMS may need a lot fewer changes thanks to improvements made in JGroups over the years like simultaneous handling of joins & leaves.

We would also need to change JGroups to support rolling upgrade.  The old 2.2.9 JGroups code was modified to use versioned serialization streams and include version information so that recipients of messages would know how to deserialize the messages and react to them.

If we continue to use and modify JGroups one other concern to note is that the code we inserted into the old LGPL version of JGroups is probably still covered by LGPL and can’t just be copied into a newer version to cover our needs.  It will need to be written from scratch but can use the same algorithms as the old implementation.

If JGroups is not used for membership it could still be used for UDP communication, but rolling upgrade support might be an issue if the version of JGroups being used by Geode is ever changed.


Zookeeper

Many Apache projects use Apache Zookeeper for collaboration between processes.  Zookeeper provides collaboration primitives that we can use to build membership services, distributed locking and more.

If you haven’t been exposed to how Zookeeper is used you might take a look at this page: http://nofluffjuststuff.com/blog/scott_leberknight/2013/07/distributed_coordination_with_zookeeper_part_3_group_membership_example

Zookeeper meets requirements 1 through 6 in the list you saw earlier in this document.

Zookeeper can be used to join a distributed system by creating an ephemeral, sequenced z-node in the distributed system’s “group” node.  Each z-node can contain additional details about the member, fulfilling one of the important requirements Geode has for a membership system.

Zookeeper gives its clients notification when a node changes.  This can be used to tell members when a new peer has joined or left the system.  Notification only lets you see the current state of the group z-node, so we have to keep a snapshot of the last state and do a diff to see what’s changed.  Apparently it is possible for changes to the group node to be missed so there probably needs to be some periodic polling of the group node to check for changes in addition to the notification mechanism.

Zookeeper’s getChildren() method returns a list of nodes that has no particular order but if we use sequential z-nodes each node is assigned a sequential number that can be used to impose ordering and select an Elder for the Distributed Lock Service.

To detect crashed clients Zookeeper uses UDP heartbeats.  This is notoriously fragile under heavy CPU load and is one reason I crippled the similar mechanism in JGroups 2.2.9 by making it defer to the decisions of the stronger tcp/ip stream connection-based FD_SOCK.  I have no proof but suspect this will result in members being kicked out of the system under heavy load, transferring the load to other already-burdened servers...

There doesn’t appear to be a way for clients to raise suspicion about another client, so Geode could not hint to Zookeeper to check up on a member that’s not responding to messages.  It does seem possible to delete another client’s node though, so that could be used to remove a problematic member.  Another mechanism for doing a health check on that member would be needed in Geode before taking the drastic step of unilaterally kicking it out.  Adding an external health checker, such as a phi-based system or a membership-ring system, is a possibility.

Zookeeper doesn’t deal with network partitioning but this could be handled when membership changes.  Each member would have to decide whether now-missing members have crashed or not and perform loss calculations.  Geode sends Shutdown messages, so this could be done.

Zookeeper does not include any form of send/reply messaging so we would still need some other solution for out-of-band messaging and broadcast messaging.

 

Zookeeper would not support the notion of blocking a member from joining based Geode version information.  We would have to build this on top of zookeeper as this can cause distributed hangs for algorithms that are changing behavior in the newer version of Geode.

 Zookeeper has a pluggable authentication system and ACL-based access control.  It’s not clear to me whether it is compatible with the pluggable authentication system in Geode.  It certainly wouldn’t interact with Geode’s authentication system directly.  This, too, would have to be built on top of zookeeper.

The down-side to Zookeeper is that it is not a shared-nothing membership system like we currently have.  Geode would be dependent on Zookeeper servers which must be deployed and managed, and if Zookeeper goes down the entire distributed system will become unavailable.  It’s not exactly a single point of failure since Zookeeper has redundant servers, but it forms a static-membership core to Geode that can put a large distributed system at risk.  If the Zookeeper servers are not reachable a node in a Geode Distributed System will have to shut down immediately since it can draw no other conclusion but that it is isolated from the primary partition.  This could be a significant risk in a large distributed system.  It’s akin to requiring locators to always be available.

Zookeeper might give us trouble in large distributed systems because access to zookeeper would be required for servers to remain up.  We should consider using the Apache Curator project or perhaps parts of Helix that handle some of the problems people usually run into with Zookeeper and for UDP messaging we could use a JGroups channel having no membership or failure detection protocols.

So, to sum up, zookeeper could be used as the basis for a membership management service, replacing some of the functionality we currently have built into JGroups.  We would have to implement a fair portion of what we need outside of zookeeper, and using zookeeper comes with some risks.  We'd still need a different solution for UDP messaging.

Akka Clustering

Akka is used by Google Compute Engine and other projects for clustering.  Google has posted that it achieved 1500 nodes in a cluster with stable performance using a simple application and fairly loose timeouts.

Akka clustering documentation can be found here: http://doc.akka.io/docs/akka/snapshot/common/cluster.html

Akka uses configured seed-nodes to join the cluster, which is compatible with Geode’s Locator discovery pattern, but they have not solved the concurrent-startup problem for seed nodes that Geode has licked in its JGroups improvements.  The first seed node needs to be brought up before any other seed nodes are started.

Akka delivers membership change notifications using an Actor model for individual MemberUp/MemberRemoved events and you can also get the cluster state.  Cluster state is fairly complete and has both a lead-member, a sorted set of members and a set of unreachable members.  One thing missing from cluster state is any form of unique identifier.

Akka allows new members to join without taking down the system.  New members can’t join though if the coordinator (“leader”) is examining an unresponsive member.

Akka provides each member with a unique ID containing hostname:port:UID.  The host name can be an address.  IDs are not allowed to rejoin a system once they leave or are kicked out.  This is compatible with what Geode needs.

Akka member IDs do not provide storage for extra information, so Geode member characteristics can’t be transmitted with the cluster state.  

Akka can auto-detect that a member is “unreachable” and boot it out.  The member is left to figure out what to do and by default will just form its own 1-member system.  

Akka member IDs have an olderThan method that could be used to determine the Elder in the system, assuming the Member<->InternalDistributedMember mapping problem already mentioned is solved.

Akka uses a gossip-based algorithm based on Dynamo and Riak to distribute membership changes.  The cluster state is sent to random members of the system with preference for those that have not seen the latest version.  The cluster state is completely installed once all members have been included in the “seenBy” set that is transmitted with cluster state.  Cluster state also includes a set of Unreachable nodes.  Somehow this would have to map to a coherent network-partition-detection algorithm for Geode.  The gossip-based diffusion of membership events doesn’t map at all to the current 2-phased installation algorithm.

Akka uses phi-accrual failure detection, which we have considered using for Geode.  However, it is a heartbeat-based algorithm and so is prone to false-positives when CPU or network load is high.  Akka docs say to tune the phi-threshold to suit the environment.

There is an API in the failure detector registry that will let Geode feed heartbeats to Akka but there is no way to feed suspicion to it based on non-responsiveness in our other communication channels.

Akka would not support the notion of blocking a member from joining based Geode version information.  We would not be able to prevent a node using the old version of the product from joining during a rolling upgrade.  This can cause distributed hangs for algorithms that are changing behavior in the newer version of Geode.

Akka would not integrate with Geode’s peer-to-peer authentication system as it does not allow you to send credentials when joining and does not include credentials of the sender in its cluster-state events.

Akka does provide messaging so it could handle the out-of-band messages we currently send over JGroups.

Akka lets you plug in your own serializer so we could insert DataSerializable and DataSerializableFixedID.  Rolling upgrade could not be supported at this level, though, because no info about the destination address is given to the serializer.

There are a few problems that make the native clustering in Akka unusable for Geode.

First, we can’t support authentication since we can’t send credentials in join messages.  As with zookeeper this would have to be built on top of Akka.

Second, Akka has no notion of time or identity in its cluster state, so if a message is sent by member R to member S we can’t check to see if S is at a compatible state with R before allowing the message to be processed.  We need that for some distributed algorithms.  We would need to build this on top of Akka.

Third, Akka does not have a network partition detection algorithm at all, much less one that maps to Geode’s.  This also would have to be built on top of Akka.

So, Akka looks like a better fit than Zookeeper in some ways and it looks like less of a good fit in others.  We would have to build a lot of stuff on top of it. 

Conclusion

Each of the options has downsides and would require us to implement a lot of additional functionality.

We would like to avoid making heavy modifications to JGroups but we will continue to use JGroups for UDP communication since it fits well with the current Geode architecture.

We will create interfaces for the services we currently get from JGroups so that different implementations can be plugged in but will implement most of these services from the ground up.