...

It is important to monitor the ZooKeeper environment (hardware, network, processes, etc.) in order to more easily troubleshoot problems; otherwise you miss out on important information for determining the cause of a problem. What type of monitoring are you doing on your cluster? You can monitor at the host level, which gives you some insight into where to look: CPU, memory, disk, network, and so on. You can also monitor at the process level: the ZooKeeper server's JMX interface reports latencies and similar statistics (the four letter words expose the same data if you would rather script against them than use JMX). JMX also gives you insight into the workings of the JVM itself, so you can, for example, confirm or rule out GC pauses causing the JVM's Java threads to hang for long periods of time (see below).
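
For example, the four letter words can be queried with nothing more than netcat; a minimal sketch, where the hostname and port are assumptions for your deployment:

    # "ruok" answers "imok" if the server is running in a non-error state
    echo ruok | nc zk1.example.com 2181

    # "stat" reports min/avg/max latency, outstanding requests, and connected clients
    echo stat | nc zk1.example.com 2181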

Without monitoring, troubleshooting will be more difficult, but not impossible. JMX can be used through jconsole, the same stats can be accessed through the four letter words, and the log4j log contains much important and useful information.
You can also use SPM for ZooKeeper to see all ZooKeeper, JVM, and system/host metrics.
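
If remote JMX is not already exposed for jconsole, one way to enable it is via the JVM's standard management properties; a minimal sketch, assuming your zkServer.sh picks up JVMFLAGS and that port 9999 is free (disable authentication/SSL only on a trusted network):

    # Expose the server's JMX interface before starting it
    export JVMFLAGS="-Dcom.sun.management.jmxremote \
      -Dcom.sun.management.jmxremote.port=9999 \
      -Dcom.sun.management.jmxremote.authenticate=false \
      -Dcom.sun.management.jmxremote.ssl=false"
    bin/zkServer.sh start

    # Attach from a workstation (host:port is an assumption for your deployment)
    jconsole zk1.example.com:9999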

Troubleshooting Checklist

...

ZooKeeper is something of a canary in a coal mine. Because of the heart-beating performed by clients and servers, ZooKeeper-based applications are very sensitive to things like network and system latencies. We often see client disconnects and session expirations associated with these types of problems.
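
When you see expirations, two quick first checks are the client's own log and the raw network latency between client and server; a rough sketch, where the log path, hostname, and exact wording of the client's log messages are assumptions for your setup:

    # Evidence of disconnects/expirations in the client application's log
    grep -iE "session expired|timed out|disconnected" /var/log/myapp/myapp.log | tail

    # Round-trip latency from the client host to a ZooKeeper server
    ping -c 20 zk1.example.com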

Take a look at this section to start.

Client disconnects due to client-side swapping

This link specifically discusses the negative impact of swapping in the context of the server. However, swapping can be an issue for clients as well: it will delay, or potentially even stop for a significant period, the heartbeats from client to server, resulting in session expirations.
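
To confirm or rule out swapping on a client host, the standard tools are enough; a minimal sketch (the swappiness change is an assumption about your tuning policy, not a requirement):

    # si/so should stay at or near 0; sustained non-zero values mean the host is actively swapping
    vmstat 1 5

    # Overall swap usage
    free -m

    # Optionally make the kernel less eager to swap (requires root; verify against your ops policy)
    sysctl vm.swappiness=1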

...

gchisto is a useful tool for analyzing GC logs: https://gchisto.dev.java.net/
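
To have something to feed it, GC logging must first be enabled on the server's JVM; a sketch of the classic (pre-Java 9) flags, where the log path is an assumption:

    # Append to the server's JVM arguments (e.g. via JVMFLAGS) to produce a GC log
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
    -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/zookeeper/gc.log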

Additionally, you can use 'jstat' on a running JVM to gain more insight into real-time GC activity; see: http://java.sun.com/j2se/1.5.0/docs/tooldocs/share/jstat.html
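
A minimal sketch (the pid lookup assumes jps can see the ZooKeeper server process on the same host):

    # Find the ZooKeeper server's pid by its main class name
    jps -l | grep -i zookeeper

    # Sample heap and GC utilization every 1000 ms, 30 samples
    jstat -gcutil <pid> 1000 30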

This issue can be resolved in a few ways:

...

  • Verify that logging isn't at DEBUG. Check your log4j.properties file and change the line log4j.rootLogger=DEBUG, ROLLINGFILE to log4j.rootLogger=WARN, ROLLINGFILE (see the sketch after this list). Logging to disk on every action can greatly affect performance.
  • Verify that you are using fast local disk for the journal.
  • Test with http://github.com/phunt/zk-smoketest. This should identify real performance and latency issues. It is built against 32-bit Python.
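
For the first item above, a minimal log4j.properties sketch of the change (the appender name ROLLINGFILE is taken from that item; the rest of the file is assumed unchanged):

    # Before: every action is logged to disk
    #log4j.rootLogger=DEBUG, ROLLINGFILE

    # After: only warnings and above reach the appender
    log4j.rootLogger=WARN, ROLLINGFILE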