GSoC 2010: ZooKeeper Failure Detector Model

Abstract

ZooKeeper servers detect the failure of other servers and clients by counting the number of 'ticks' for which it doesn't get a heartbeat from other machines. This is the 'timeout' method and it works very well; however it is possible that it is too aggressive and not easily tuned for some more unusual ZooKeeper installations. This project's goals are to abstract the failure detector to a separate module, to implement several failure detectors and to compare their appropriateness for ZooKeeper.

The full gsoc-zookeeper-failuredetector.pdf is attached to this page.

Roadmap

  1. Discuss the project with the community (dev/user lists), asking for suggestions and requirements and decide which type and which methods are to be implemented (Community Bonding Period)
  2. Study the chosen failure detection methods specification and the ZooKeeper code (24th May)
  3. Isolate the failure detector model in the ZooKeeper code (14th June)
  4. Implement the chosen failure detector methods (28th June)
  5. Evaluate the QoS metrics for the implemented methods (26th July)

Failure detection methods' references

Specific objectives

The ones with strike-through have already been finished

  1. Write pseudo-codes for the proposed algorithms
  2. Create FailureDetector interface
  3. Write implementations and tests of the FailureDetector interface based on the proposed algorithms
  4. Refactor client-side code of the client-server monitoring to use the proposed FailureDetector interface
  5. Make the failure detection and its parameters configurable on the client
  6. Refactor server-side code of the client-server monitoring to use the proposed FailureDetector interface
  7. Refactor the code of the server-server monitoring to use the proposed FailureDetector interface
  8. Make the failure detection and its parameters configurable on the server (to server-server and client-server monitoring)
  9. Evaluate the QoS metrics with experimentation
  10. Write Forrest docs

Related JIRA

Progress Report

Community bonding period

05/Jun/10

11/Jun/10

22/Jun/10

28/Jun/10

02/Jul/10

08/Jul/10

23/Jul/10

30/Jul/10

09/Aug/10

14/Aug/10

16/Aug/10

Experimentation

Experimental design

Results

Concluding remarks

As expected, we noticed that the fixed heartbeat method works well when we run ZooKeeper in a controlled environment, where the network behavior is expected. In this cases we can tune the fixed timeout after some network analysis. However, in scenarios where we have a changing network behavior, such in a WAN, the adaptive methods can be a good pick. Below, there is an overview of each failure detector:


Design decisions

Should a failure detector instance (FD) run in a separate thread from the application?

How to use application message in adaptive failure detectors?

Due to the usage of application messages as heartbeat, the actual heartbeats are not sent regularly. How to compute the next estimated arrival time?

How to report sampling window statistical data from Learners to Leader?

Future work