How To Test Pig
This document covers how to test Pig. It is intended for Pig developers who need to know how to test their work. It can also be used by Pig users who wish to verify their instance of Pig.
...
A single unit test can be run by setting the `testcase` property. For example:
```shell
ant -Dtestcase=TestRegisteredJarVisibility clean test
```
...
Now, to run the unit tests with Clover:
```shell
ant clean
ant -Dclover.home=<clover_home> -Drun.clover=true clover jar test
ant -Dclover.home=<clover_home> -Drun.clover=true generate-clover-reports
ant -Dclover.home=<clover_home> -Drun.clover=true generate-pdf-clover-reports
```
...
Before you can run the test harness against your cluster for the first time, you must generate the test data in your cluster. To do this, run:
```shell
ant -Dharness.old.pig=old_pig -Dharness.cluster.conf=hadoop_conf_dir -Dharness.cluster.bin=hadoop_script test-e2e-deploy
```
Where `old_pig` is where you installed the old version of Pig, `hadoop_conf_dir` is the directory where your `hadoop-site.xml` or `mapred-site.xml` file is, and `hadoop_script` is where your `hadoop` executable is located. For example, if you have installed Pig 0.8.1 in `/usr/local/pig/pig-0.8.1` and Hadoop in `/usr/local/hadoop`, then your command line would look like:
```shell
ant -Dharness.old.pig=/usr/local/pig/pig-0.8.1 -Dharness.cluster.conf=/usr/local/hadoop/conf -Dharness.cluster.bin=/usr/local/hadoop/bin/hadoop test-e2e-deploy
```
...
Once you have loaded your cluster with data, you can run the tests by doing:
```shell
ant -Dharness.old.pig=old_pig -Dharness.cluster.conf=hadoop_conf_dir -Dharness.cluster.bin=hadoop_script test-e2e
```
Running the full test suite is rarely what you want, as it takes around 10 hours. To run only some tests, set the `tests.to.run` property. This property accepts a group of tests (e.g. Checkin) or a single test (e.g. Checkin_1), and you can pass multiple tests or groups at once. Each test or group of tests must be preceded by `-t`. For example, to run the Checkin tests and the first MergeJoin test, do:
```shell
ant -Dharness.old.pig=old_pig -Dharness.cluster.conf=hadoop_conf_dir -Dharness.cluster.bin=hadoop_script -Dtests.to.run="-t Checkin -t MergeJoin_1" test-e2e
```
...
If you want to clean the data off of your cluster, you can use the undeploy target:
```shell
ant -Dharness.old.pig=old_pig -Dharness.cluster.conf=hadoop_conf_dir -Dharness.cluster.bin=hadoop_script test-e2e-undeploy
```
There is no need to do this on a regular basis.
If you want to generate a JUnit-format XML file from the e2e test log, for displaying test results in Jenkins, run `test/e2e/harness/xmlReport.pl` against the log file:
```shell
test/e2e/harness/xmlReport.pl testdist/out/log/test_harnesss_1411157020 > test-report.xml
```
Running e2e in Local Mode
...
To generate the test data in local mode, do:
```shell
ant -Dharness.old.pig=old_pig -Dharness.cluster.conf=hadoop_conf_dir -Dharness.cluster.bin=hadoop_script test-e2e-deploy-local
```
...
To run the local mode tests themselves, do:
```shell
ant -Dharness.old.pig=old_pig -Dharness.cluster.conf=hadoop_conf_dir -Dharness.cluster.bin=hadoop_script test-e2e-local
```
...
To Start a Cluster:
```shell
export AWS_ACCESS_KEY_ID=your_amazon_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_amazon_access_key
export SSH_PRIVATE_KEY_FILE=your_private_rsa_key_file
cd your_path_to_apache_whirr/bin
./whirr launch-cluster --config your_path_to_pig_trunk/test/e2e/pig/whirr/pigtest.properties
```
...
Running the tests:
Open the file `~/.whirr/pigtest/hadoop-site.xml` and find the line that has `mapred.job.tracker`. The next line should have the hostname that is running your Job Tracker. Copy that host name, but NOT the port number (i.e. the `:nnnn`, where `nnnn` is `9001` or something similar). This value will be referred to as `your_namenode`.
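The hostname can also be pulled out with standard tools rather than by eye. The sketch below is illustrative only: the sample file stands in for your real `~/.whirr/pigtest/hadoop-site.xml`, and the hostname in it is made up.

```shell
# Create a stand-in for ~/.whirr/pigtest/hadoop-site.xml (hostname is fabricated).
cat > /tmp/sample-hadoop-site.xml <<'EOF'
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jt-host.example.com:9001</value>
  </property>
</configuration>
EOF
# Print the Job Tracker hostname, dropping the trailing :port suffix.
sed -n 's/.*<value>\(.*\):[0-9][0-9]*<\/value>.*/\1/p' /tmp/sample-hadoop-site.xml
```

Against a real cluster you would point the `sed` at `~/.whirr/pigtest/hadoop-site.xml` instead.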
```shell
cd your_path_to_pig_src
scp -i your_private_rsa_key_file test/e2e/pig/whirr/whirr_test_patch.sh your_namenode:~
# if you have a patch you want to run:
scp -i your_private_rsa_key_file your_patch your_namenode:~
ssh -i your_private_rsa_key_file your_namenode
```
...
Shutting down your cluster:
In the same shell you started the cluster:
```shell
./whirr destroy-cluster --config your_path_to_pig_trunk/test/e2e/pig/whirr/pigtest.properties
```
...
Writing a new e2e test does not require writing any new Java code (assuming you don't need to write a UDF for your job). The e2e test harness is written in Perl, and the tests are stored in .conf files, each of which is one big Perl hash (if you squint just right, it almost looks like JSON). These files are in `test/e2e/pig/tests/`. This hash is expected to have a `groups` key, which is an array. Each element in the array describes a collection of tests, usually oriented around a particular feature; for example, the group `FilterBoolean` tests boolean predicates in filters. Every group in the array is a hash. It must have `name` and `tests` keys. `tests` is expected to be an array of tests. Each test is again a hash, and must have `num`, the test number, and `pig`, the Pig Latin code to run. As an example, look at the following, taken from `nightly.conf`:
```perl
$cfg = {
  'driver' => 'Pig',
  'groups' => [
    {
      'name' => 'Checkin',
      'tests' => [
        {
          'num' => 1,
          'pig' => q\a = load ':INPATH:/singlefile/studenttab10k' as (name, age, gpa);
store a into ':OUTPATH:';\,
        },
        {
          'num' => 2,
          'pig' => q\a = load ':INPATH:/singlefile/studenttab10k' as (name, age, gpa);
b = load ':INPATH:/singlefile/votertab10k' as (name, age, registration, contributions);
c = filter a by age < 50;
d = filter b by age < 50;
e = cogroup c by (name, age), d by (name, age) ;
f = foreach e generate flatten(c), flatten(d);
g = group f by registration;
h = foreach g generate group, SUM(f.d::contributions);
i = order h by $1;
store i into ':OUTPATH:';\,
          'sortArgs' => ['-t', ' ', '+1', '-2'],
        }
      ]
    },
    {
      'name' => 'LoaderPigStorageArg',
      'tests' => [
        {
          'num' => 1,
          'pig' => q\a = load ':INPATH:/singlefile/studentcolon10k' using PigStorage(':') as (name, age, gpa);
store a into ':OUTPATH:';\,
        },
      ]
    }
  ]
};
```
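A note on the `sortArgs` key: the harness passes these arguments to the Unix `sort` command when canonicalizing output before comparison. `-t ' ' +1 -2` is the obsolete key notation for sorting on the second space-delimited field; modern `sort` spells the same key `-k 2,2`. A quick illustration with made-up rows:

```shell
# Sort three fabricated rows on the second space-delimited field
# (the modern -k 2,2 equivalent of the old-style "+1 -2" notation).
printf 'a 2\nb 1\nc 3\n' | sort -t ' ' -k 2,2
```

This prints the rows ordered `b 1`, `a 2`, `c 3`, i.e. by the second field rather than the first.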
...
For features that are new, you cannot test against old versions of Pig. For example, macros in 0.9 cannot be tested against 0.8.1. As an alternative to running the same Pig Latin script against an old version, you can run a different script. This script will be run using the current version, not the old one. To specify a different script, use the key `verify_pig_script`. For example:
```perl
{
  # simple macro, no args
  'num' => 1,
  'pig' => q#define simple_macro() returns void {
a = load ':INPATH:/singlefile/studenttab10k' as (name, age, gpa);
b = foreach a generate age, name;
store b into ':OUTPATH:';
}
simple_macro();#,
  'verify_pig_script' => q#a = load ':INPATH:/singlefile/studenttab10k' as (name, age, gpa);
b = foreach a generate age, name;
store b into ':OUTPATH:';#,
}
```
...
conf File | Tests | Comments |
---|---|---|
bigdata.conf | larger size data | We keep these to a minimum as they take much longer than other tests. |
cmdline.conf | Pig command line output (such as describe) | |
grunt.conf | grunt operators, like | |
macro.conf | macro and import | |
multiquery.conf | multiquery scripts | |
negative.conf | negative conditions where Pig is expected to return an error | |
nightly.conf | general positive tests | Your test goes here if it doesn't fit anywhere else |
orc.conf | OrcStorage tests | |
streaming.conf | streaming feature | |
streaming_local.conf | streaming feature in local mode | |
turing_jython.conf | Pig scripts embedded in Python scripts | |
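When deciding where a new test belongs, it can help to check which .conf file already defines a given group. A sketch using only `grep`; the demo directory and files below are fabricated stand-ins, and against a real checkout you would search `test/e2e/pig/tests/*.conf` instead:

```shell
# Fabricated stand-ins for the real conf files in test/e2e/pig/tests/.
mkdir -p /tmp/pig-conf-demo
printf "'name' => 'Checkin',\n" > /tmp/pig-conf-demo/nightly.conf
printf "'name' => 'StreamingIO',\n" > /tmp/pig-conf-demo/streaming.conf
# Which conf file defines the Checkin group?
grep -l "'name' => 'Checkin'" /tmp/pig-conf-demo/*.conf
```

`grep -l` prints only the matching file name, which is usually all you need to pick the right file to extend.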
...