
(This page was created from an earlier document. While transferring the contents, some of the commands listed here were truncated and still need to be fixed.)

How to profile pig on map reduce clusters

The Hadoop profile doc outlines the methodology for Java MapReduce jobs.
The following is a sample command line for use with Pig:
java -Dmapred.task.profile.maps=0-0 -Dmapred.task.profile.reduces=0-0 -Dmapred.task.profile=true -Dmapred.task.profile.params=-agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y,verbose=n,file=%s
The -agentlib:... value given to mapred.task.profile.params is the profiler-specific option that you would otherwise supply directly on the java command line.

YourKit

YourKit is a commercial Java profiler. It has really good analysis features and, more importantly, much better performance (less impact on the run time of your program). YourKit has granted a license for use by Pig contributors. Contact another committer or the Pig PMC (private@pig.apache.org) for access to the license key. You can download the profiler from YourKit's website.

Set up

Get hold of a YourKit distribution (I have used yjp-7.0.7.zip), extract it, and obtain the license key. The key is required to use the UI, which is the only way to examine the profile output.
Let us assume the YourKit distribution has been extracted to BASEDIR. Assuming you are running on Linux and using version 7.0.7, the profiling agent is at: BASEDIR/yjp-7.0.7/bin/linux-x86-32/libyjpagent.so. This is the agent library that is passed on the java command line. To use YourKit on a cluster, this file needs to be copied to a readable location on each node of the cluster. Let's assume that this location is CLUSTER_BASEDIR.
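
The copy step can be scripted. The following is a minimal sketch rather than a tested deployment script: the host names are hypothetical, BASEDIR and CLUSTER_BASEDIR are the placeholders used above, and the helper name print_copy_commands is made up for illustration. It only prints the scp commands so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Dry run: print one scp command per cluster node instead of executing it.
# AGENT, DEST and the host names are placeholders, not real values.
AGENT="BASEDIR/yjp-7.0.7/bin/linux-x86-32/libyjpagent.so"
DEST="CLUSTER_BASEDIR"

print_copy_commands() {
    # $1 = local agent path, $2 = destination directory, remaining args = hosts
    agent=$1
    dest=$2
    shift 2
    for host in "$@"; do
        printf 'scp %s %s:%s/\n' "$agent" "$host" "$dest"
    done
}

print_copy_commands "$AGENT" "$DEST" node1.example.com node2.example.com
```

Once the printed commands look right, they can be run by hand or piped to sh.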

Usage

See http://yourkit.com/docs/index.jsp for detailed docs.

CPU profiling

Here's a quick overview based on one developer's experience. To use YourKit for a standalone Java program, say Pig on the local file system, the command line to use is:
java -agentpath:BASEDIR/yjp-7.0.7/bin/linux-x86-32/libyjpagent.so=dir=/tmp/yourkit_snapshot,tracing,disablealloc,disablej2ee -cp <location of pig.jar> org.apache.pig.Main <pigscript>
In the above command line, /tmp/yourkit_snapshot is the output directory into which YourKit writes a ".snapshot" file. You can specify any directory to which you have write permission. YourKit seems to create the final directory in the path if it does not exist. The "tracing" option means that YourKit traces method calls to collect profile information. This gives accurate invocation counts, because every method call is traced rather than sampled, but it is slower for the same reason.

The "disablealloc" option means memory allocations are not traced. The "disablej2ee" option means J2EE-specific profiling is disabled.
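
To make the structure of the agent argument easier to see, here is a small sketch that assembles it from its parts. The function name yjp_agent_arg is hypothetical, and it only uses option keywords that appear in the commands on this page ("sampling" is the mode keyword used in the cluster command further down). It prints the argument; it does not start a JVM.

```shell
#!/bin/sh
# Assemble the -agentpath argument for a local run.
# yjp_agent_arg is an illustrative helper name, not part of YourKit or Pig.
yjp_agent_arg() {
    # $1 = path to libyjpagent.so
    # $2 = snapshot output directory
    # $3 = measurement mode: "tracing" or "sampling"
    case $3 in
        tracing|sampling) ;;
        *) echo "mode must be tracing or sampling" >&2; return 1 ;;
    esac
    # disablealloc: no memory allocation tracing
    # disablej2ee: no J2EE-specific profiling
    printf '%s\n' "-agentpath:$1=dir=$2,$3,disablealloc,disablej2ee"
}

yjp_agent_arg "BASEDIR/yjp-7.0.7/bin/linux-x86-32/libyjpagent.so" /tmp/yourkit_snapshot tracing
```

The printed string is what goes directly after java in the local command line.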
Using YourKit on a Pig script running on a cluster in sampling mode:
java -Dmapred.task.profile.maps=0-0 -Dmapred.task.profile.reduces=0-0 -Dmapred.task.profile=true -Dmapred.task.profile.params=-agentpath:CLUSTER_BASEDIR/libyjpagent.so=dir=/grid/0/tmp/yourkit_snapnshot,sampling,disablealloc,disablej2ee -cp <pig.jar pathname>:<dir containing hadoop-site.xml> org.apache.pig.Main <pig script>

Using YourKit on a Pig script running on a cluster in tracing mode:

  • You need to disable the filter so that org.apache.* is also traced (the command below does this with filters=/dev/null).
  • Specify a value for mapred.max.split.size that is smaller than the block size, so that the map task has a smaller input and finishes sooner.
  • Specify a value for mapred.task.timeout that is large enough that the slower, traced task does not time out.

java -Dmapred.max.split.size=10000000 -Dmapred.task.timeout=60000000 -Dmapred.task.profile.maps=0-0 -Dmapred.task.profile.reduces=0-0 -Dmapred.task.profile=true -Dmapred.task.profile.params=-agentpath:CLUSTER_BASEDIR/libyjpagent.so=dir=/grid/0/tmp/yourkit_snapnshot,filters=/dev/null,tracing,disablealloc,disablej2ee -cp <pig.jar pathname>:<dir containing hadoop-site.xml> org.apache.pig.Main <pig script>

With either of the cluster commands above, the 0th map and reduce tasks are profiled, and a YourKit snapshot file is created at /grid/0/tmp/yourkit_snapnshot on the cluster machines that ran those tasks. Copy the snapshot to the machine that has the YourKit GUI and load it there to look at the profile information. The two commands differ in the mode keyword ("sampling" or "tracing") and in the extra settings listed for tracing mode.
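
The difference between the two cluster commands can be summarized in a short sketch. The helper name cluster_profile_flags is hypothetical; the property names and values are copied from the commands on this page. It prints the -D flags only and leaves the -cp and script arguments to the caller.

```shell
#!/bin/sh
# Print the JVM -D flags for profiling map task 0 and reduce task 0.
# cluster_profile_flags is an illustrative helper name.
cluster_profile_flags() {
    # $1 = agent path on the cluster nodes
    # $2 = snapshot directory on the cluster nodes
    # $3 = "sampling" or "tracing"
    agent_opts="dir=$2"
    extra=""
    case $3 in
        sampling) ;;
        tracing)
            # Tracing mode: trace org.apache.* too, use a smaller split,
            # and raise the task timeout (values taken from this page).
            agent_opts="$agent_opts,filters=/dev/null"
            extra="-Dmapred.max.split.size=10000000 -Dmapred.task.timeout=60000000 "
            ;;
        *) echo "mode must be sampling or tracing" >&2; return 1 ;;
    esac
    agent_opts="$agent_opts,$3,disablealloc,disablej2ee"
    printf '%s%s\n' "$extra" "-Dmapred.task.profile.maps=0-0 -Dmapred.task.profile.reduces=0-0 -Dmapred.task.profile=true -Dmapred.task.profile.params=-agentpath:$1=$agent_opts"
}

cluster_profile_flags CLUSTER_BASEDIR/libyjpagent.so /grid/0/tmp/yourkit_snapnshot sampling
cluster_profile_flags CLUSTER_BASEDIR/libyjpagent.so /grid/0/tmp/yourkit_snapnshot tracing
```

Because none of the values contain spaces, the output could be spliced into a command as java $(cluster_profile_flags ...) -cp ..., but printing it first makes it easier to check.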

Memory profiling
  • Same steps as in CPU profiling, with the command line modified to NOT disable memory allocation tracing:
    java -Dmapred.task.profile.maps=0-0 -Dmapred.task.profile.reduces=0-0 -Dmapred.task.profile=true
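
Since the command above is truncated, here is only a sketch of the one change the text describes: taking the CPU profiling agent options and removing the disablealloc keyword. The helper name drop_disablealloc is hypothetical, and whether the agent needs any further options to record allocations is not covered on this page.

```shell
#!/bin/sh
# Remove the disablealloc keyword from a comma-separated agent option list.
# drop_disablealloc is an illustrative helper name.
drop_disablealloc() {
    # $1 = agent option list, e.g. "dir=...,sampling,disablealloc,disablej2ee"
    printf '%s\n' "$1" | sed -e 's/,disablealloc//' -e 's/^disablealloc,//'
}

drop_disablealloc "dir=/grid/0/tmp/yourkit_snapnshot,sampling,disablealloc,disablej2ee"
```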
GUI
  • The GUI to view the profile output is at: BASEDIR/yjp-7.0.7/bin/yjp.sh
  • org.apache is not something most YourKit users are interested in exploring, so it is filtered out of the display by default. To see it, click Settings | Filters and uncheck org.apache.
  • On Mac, the GUI does not work with Java 1.6. If Java 1.6 is your default, set export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.5/Home/ to use 1.5 instead.