
...

This page details specific problems people have seen, solutions (where known), and the troubleshooting steps taken. Feel free to update it with your experiences.

The HBase troubleshooting page (http://wiki.apache.org/hadoop/Hbase/Troubleshooting) also has insight for identifying/resolving ZK issues.

...

It is important to monitor the ZK environment (hardware, network, processes, etc.) in order to troubleshoot problems more easily. Otherwise you miss out on important information for determining the cause of the problem. What type of monitoring are you doing on your cluster? You can monitor at the host level - that will give you some insight into where to look: cpu, memory, disk, network, and so on. You can also monitor at the process level - the ZooKeeper server JMX interface will give you information about latencies and the like (you can also use the four letter words, http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkCommands, if you want to hack up some scripts instead of using JMX). JMX will also give you insight into the workings of the JVM - for example, you could confirm or rule out GC pauses causing the Java threads to hang for long periods of time (see below).

Without monitoring, troubleshooting will be more difficult, but not impossible. JMX can be used through jconsole, the same stats are available through the four letter words, and the log4j log contains much important/useful information.
You can also use SPM for ZooKeeper to see all ZooKeeper, JVM, and system/host metrics.
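
As a quick illustration (a sketch, assuming a server listening on the default client port 2181 and the nc utility installed), the four letter words can be polled from the shell:

    # liveness check - a healthy server answers "imok"
    echo ruok | nc localhost 2181

    # min/avg/max latency, outstanding requests, and connection counts
    echo stat | nc localhost 2181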

Troubleshooting Checklist

...

  • hdparm with the -t and -T options to test your disk IO
  • time dd if=/dev/urandom bs=512000 of=/tmp/memtest count=1050
    • time md5sum /tmp/memtest; time md5sum /tmp/memtest; time md5sum /tmp/memtest
    • See ECC memory section below for more on this
  • ethtool to check the configuration of your network
  • ifconfig also to check network and examine error counts
    • ZK uses TCP for network connectivity; errors on the NICs can cause poor performance
  • scp/ftp/etc. can be used to verify connectivity; try copying large files between nodes
  • These smoke and latency tests (http://github.com/phunt/zk-smoketest#readme) can be useful to verify a cluster
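
A rough sketch of the disk, memory, and network checks above (the device and interface names are examples only; adjust for your hosts):

    # disk IO throughput (cached and buffered reads)
    hdparm -t -T /dev/sda

    # write ~512MB of random data, then hash it repeatedly; inconsistent
    # results or times can point at memory problems (see the ECC memory section below)
    time dd if=/dev/urandom bs=512000 of=/tmp/memtest count=1050
    time md5sum /tmp/memtest; time md5sum /tmp/memtest; time md5sum /tmp/memtest

    # NIC configuration and error counters
    ethtool eth0
    ifconfig eth0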

Compare your results to some baselines

See the Service Latency Overview page (ZooKeeper/ServiceLatencyOverview) for some latency baselines. You can also compare the cpu/disk/memory/etc. you have available against what was used in that test.

...

ZooKeeper is a canary in a coal mine of sorts. Because of the heart-beating performed by the clients and servers, ZooKeeper-based applications are very sensitive to things like network and system latencies. We often see client disconnects and session expirations associated with these types of problems.

Take a look at this section of the admin guide to start: http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_commonProblems

Client disconnects due to client side swapping

That section of the admin guide (http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_commonProblems) specifically discusses the negative impact of swapping in the context of the server. However, swapping can be an issue for clients as well: it will delay, or potentially even stop for a significant period, the heartbeats from client to server, resulting in session expirations.

As told by a user:

"This https://issues.apache.org/jira/browse/ZOOKEEPER-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706402#action_12706402 issue is clearly linked to heavy utilization or swapping on the clients. I find that if I keep the clients from swapping that this error materializes relatively infrequently, and when it does materialize it is linked to a sudden increase in load. For example, the concurrent start of 100 clients on 14 machines will sometimes trigger this issue. <...> All in all, it is my sense that Java processes must avoid swapping if they want to have not just timely but also reliable behavior."

...

GC pressure

The Java GC can cause starvation of the Java threads in the VM (see https://issues.apache.org/jira/browse/HBASE-1316 for an example). This manifests itself as client disconnects and session expirations due to starvation of the heartbeat thread: while the GC runs, all Java threads are locked out.

...

gchisto (https://gchisto.dev.java.net/) is a useful tool for analyzing GC logs.

Additionally, you can use 'jstat' on a running JVM to gain more insight into real-time GC activity; see http://java.sun.com/j2se/1.5.0/docs/tooldocs/share/jstat.html
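
For example (a sketch; <pid> is the process id of the ZooKeeper server or client JVM you want to inspect):

    # heap occupancy and GC counts/times, sampled every 1000ms
    jstat -gcutil <pid> 1000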

This issue can be resolved in a few ways:

First, look at using one of the alternative GCs, in particular a low latency (concurrent) collector (see http://developer.amd.com/documentation/articles/pages/4EasyWaystodoJavaGarbageCollectionTuning.aspx):

e.g. the following JVM options: -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC
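
To produce a GC log that gchisto can analyze, GC logging can be enabled alongside the collector flags. A sketch for the ZooKeeper server, assuming zkServer.sh picks up JVMFLAGS from the environment and that the log path is writable (both are assumptions to verify for your install):

    # example only: collector flags plus GC logging for later analysis
    export JVMFLAGS="-XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/zookeeper/gc.log"
    ./zkServer.sh restart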

...

Some things to keep in mind while tuning ZooKeeper performance:

  • Verify that logging isn't at DEBUG. Check your log4j.properties file and change the line log4j.rootLogger=DEBUG, ROLLINGFILE to log4j.rootLogger=WARN, ROLLINGFILE (see the snippet after this list). Logging to disk on every action can greatly affect performance.
  • Verify that you are using fast local disk for the journal.
  • Test with zk-smoketest (http://github.com/phunt/zk-smoketest). This should identify real performance as well as latency issues. It is built against 32-bit Python.
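
The logging change from the first item above, as it would appear in conf/log4j.properties (the ROLLINGFILE appender name follows the stock config; yours may differ):

    # before: every action is logged to disk
    log4j.rootLogger=DEBUG, ROLLINGFILE

    # after: only warnings and errors are logged
    log4j.rootLogger=WARN, ROLLINGFILE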