
How To Test Pig

This document covers how to test Pig. It is intended for Pig developers who need to know how to test their work. It can also be used by Pig users who wish to verify their instance of Pig.

...

Unit tests are executed via JUnit. Currently, many "unit tests" are really end-to-end tests. We are in the process of changing this so that all end-to-end tests will be run by the e2e harness (see below). See PigTestProposal for details.

Preparation

Prior to running unit tests, make sure to set umask to 0022. We have also seen unit tests fail due to extended ACLs, so use setfacl -b to remove extended ACLs if applicable.
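For example, before kicking off the tests (the recursive flag is only needed if ACLs have been set on files in the tree):

Code Block

umask 0022
# if extended ACLs are present on the source tree, remove them
setfacl -R -b .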

Running all unit tests

To run the unit tests, do ant test in the top-level Pig directory. Currently this takes around 8 hours to run. We intend to drive this to under five minutes. Until this is done, contributors are not expected to run all of these tests before submitting a patch.
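From the top-level Pig directory, that is simply:

Code Block

ant test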

...

A single unit test can be run by setting the testcase property. For example:

Code Block

ant -Dtestcase=TestRegisteredJarVisibility clean test

...

Now, to run the unit tests with clover:

Code Block

ant clean
ant -Dclover.home=<clover_home> -Drun.clover=true clover jar test
ant -Dclover.home=<clover_home> -Drun.clover=true generate-clover-reports
ant -Dclover.home=<clover_home> -Drun.clover=true generate-pdf-clover-reports

...

Running the e2e tests requires three things: a cluster to run them on, an old version of Pig to use to generate expected results, and Perl plus a few CPAN modules on your client machine. The cluster can be quite small; a single machine is enough. Since performance is not the goal, it is fine if this is a virtual machine. If you do not have access to a cluster, see below for information on how to run the tests on EC2.

You will need the following CPAN modules:

  • IPC::Run
  • Parallel::ForkManager
  • DBI

For help installing CPAN modules, see the cpan module install instructions.
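For example, one way to pull in all three at once, assuming the cpan client is already configured on your machine:

Code Block

cpan IPC::Run Parallel::ForkManager DBI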

...

Before you can run the test harness against your cluster for the first time, you must generate the test data in your cluster. To do this, do:

Code Block

ant -Dharness.old.pig=old_pig -Dharness.cluster.conf=hadoop_conf_dir -Dharness.cluster.bin=hadoop_script -Dharness.hadoop.home=hadoop_home_dir test-e2e-deploy

Where old_pig is where you installed the old version of Pig, hadoop_conf_dir is the directory where your hadoop-site.xml or mapred-site.xml file is, and hadoop_script is where your hadoop executable is located. For example, if you have installed Pig 0.8.1 in /usr/local/pig/pig-0.8.1 and Hadoop in /usr/local/hadoop, then your command line would look like:

Code Block

ant -Dharness.old.pig=/usr/local/pig/pig-0.8.1 -Dharness.cluster.conf=/usr/local/hadoop/conf -Dharness.cluster.bin=/usr/local/hadoop/bin/hadoop -Dharness.hadoop.home=hadoop_home_dir test-e2e-deploy

This takes a couple of minutes and only needs to be run once. After building Pig itself it will display information on the data it is generating.

Once you have loaded your cluster with data, you can run the tests by doing:

Code Block

ant -Dharness.old.pig=old_pig -Dharness.cluster.conf=hadoop_conf_dir -Dharness.cluster.bin=hadoop_script -Dharness.hadoop.home=hadoop_home_dir test-e2e

Run with test-e2e-tez instead of test-e2e to run the tests with Tez as the execution engine.
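For example:

Code Block

ant -Dharness.old.pig=old_pig -Dharness.cluster.conf=hadoop_conf_dir -Dharness.cluster.bin=hadoop_script -Dharness.hadoop.home=hadoop_home_dir test-e2e-tez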

Running the full test suite is rarely what you want, as it takes around 10 hours. If you are running against a cluster with more capacity, you can speed up the run by parallelizing the tests. The fork.factor.conf.file property says how many test conf files to run in parallel. The fork.factor.group property says how many groups to run in parallel within each test file. Within a group, tests run sequentially. For example, -Dfork.factor.conf.file=2 -Dfork.factor.group=5 will run 2 test files and 5 groups in each, for a total of 10 tests in parallel.
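Using the same properties as in the commands above, that would be:

Code Block

ant -Dharness.old.pig=old_pig -Dharness.cluster.conf=hadoop_conf_dir -Dharness.cluster.bin=hadoop_script -Dharness.hadoop.home=hadoop_home_dir -Dfork.factor.conf.file=2 -Dfork.factor.group=5 test-e2e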

To run only some tests, set the tests.to.run property. This value can be passed a group of tests (e.g. Checkin) or a single test (e.g. Checkin_1). You can pass multiple tests or groups in this property. Each test or group of tests must be preceded by a -t. For example, to run the Checkin tests and the first MergeJoin test, do:

Code Block

ant -Dharness.old.pig=old_pig -Dharness.cluster.conf=hadoop_conf_dir -Dharness.cluster.bin=hadoop_script -Dharness.hadoop.home=hadoop_home_dir -Dtests.to.run="-t Checkin -t MergeJoin_1" test-e2e

Status will be provided as each test is run. Tests either succeed, fail, or abort. A test fails when actual results do not match expected results. A test aborts when the test or expected results generation failed to execute. The harness prints out the path to the log file where details of the test run are provided.

If you want to clean the data off of your cluster, you can use the undeploy target:

Code Block

ant -Dharness.old.pig=old_pig -Dharness.cluster.conf=hadoop_conf_dir -Dharness.cluster.bin=hadoop_script -Dharness.hadoop.home=hadoop_home_dir test-e2e-undeploy

There is no need to do this on a regular basis.

If you want to generate a JUnit-format XML file out of the e2e test log and use it for displaying test results in Jenkins, you can run test/e2e/harness/xmlReport.pl against the log file:

Code Block

test/e2e/harness/xmlReport.pl testdist/out/log/test_harnesss_1411157020 > test-report.xml

...

Running e2e in Local Mode

...

To generate the test data in local mode, do:

Code Block

ant -Dharness.old.pig=old_pig -Dharness.cluster.conf=hadoop_conf_dir -Dharness.cluster.bin=hadoop_script -Dharness.hadoop.home=hadoop_home_dir test-e2e-deploy-local

(Yes, you still have to give cluster information even though you aren't using a cluster. Pig doesn't use it in this case, and you can pass bogus info if you want.)

To run the local mode tests themselves, do:

Code Block

ant -Dharness.old.pig=old_pig -Dharness.cluster.conf=hadoop_conf_dir -Dharness.cluster.bin=hadoop_script -Dharness.hadoop.home=hadoop_home_dir test-e2e-local

Running on EC2

...

To Start a Cluster:

Code Block

export AWS_ACCESS_KEY_ID=your_amazon_access_key
export AWS_SECRET_ACCESS_KEY_ID=your_secret_amazon_access_key
export SSH_PRIVATE_KEY_FILE=your_private_rsa_key_file
cd your_path_to_apache_whirr/bin
./whirr launch-cluster --config your_path_to_pig_trunk/test/e2e/pig/whirr/pigtest.properties

...

Running the tests:
Open the file ~/.whirr/pigtest/hadoop-site.xml and find the line that has mapred.job.tracker. The next line should have the hostname of the machine running your Job Tracker. Copy that hostname, but NOT the port number (i.e. the :nnnn, where nnnn is 9001 or something similar). This value will be referred to below as your_namenode.
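The relevant section of hadoop-site.xml will look something like this (the hostname here is an invented example; yours will differ):

Code Block

<property>
  <name>mapred.job.tracker</name>
  <value>ec2-203-0-113-25.compute-1.amazonaws.com:9001</value>
</property>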

Code Block

cd your_path_to_pig_src
scp -i your_private_rsa_key_file test/e2e/pig/whirr/whirr_test_patch.sh your_namenode:~

# if you have a patch you want to run
scp -i your_private_rsa_key_file your_patch your_namenode:~

ssh -i your_private_rsa_key_file your_namenode

...

Shutting down your cluster:
In the same shell you started the cluster:

Code Block

./whirr destroy-cluster --config your_path_to_pig_trunk/test/e2e/pig/whirr/pigtest.properties

...

Writing a new e2e test does not require writing any new Java code (assuming you don't need to write a UDF for your job). The e2e test harness is written in Perl, and the tests are stored in .conf files, each of which is one big Perl hash (if you squint just right, it almost looks like JSON). These files are in test/e2e/pig/tests/. This hash is expected to have a groups key, which is an array. Each element in the array describes a collection of tests, usually oriented around a particular feature. For example, the group FilterBoolean tests boolean predicates in filters. Every group in the array is a hash. It must have name and tests keys. tests is expected to be an array of tests. Each test is again a hash, and must have num, the test number, and pig, the Pig Latin code to run. As an example look at the following, taken from nightly.conf:

Code Block

$cfg = {
    'driver' => 'Pig',

    'groups' => [
        {
            'name' => 'Checkin',
            'tests' => [
                {
                    'num' => 1,
                    'pig' => q\a = load ':INPATH:/singlefile/studenttab10k' as (name, age, gpa);
                               store a into ':OUTPATH:';\,
                },
                {
                    'num' => 2,
                    'pig' => q\a = load ':INPATH:/singlefile/studenttab10k' as (name, age, gpa);
                               b = load ':INPATH:/singlefile/votertab10k' as (name, age, registration, contributions);
                               c = filter a by age < 50;
                               d = filter b by age < 50;
                               e = cogroup c by (name, age), d by (name, age) ;
                               f = foreach e generate flatten(c), flatten(d);
                               g = group f by registration;
                               h = foreach g generate group, SUM(f.d::contributions);
                               i = order h by $1;
                               store i into ':OUTPATH:';\,
                    'sortArgs' => ['-t', '  ', '+1', '-2'],
                }
            ]
        },
        {
            'name' => 'LoaderPigStorageArg',
            'tests' => [
                {
                    'num' => 1,
                    'pig' => q\a = load ':INPATH:/singlefile/studentcolon10k' using PigStorage(':') as (name, age, gpa);
                               store a into ':OUTPATH:';\,
                },
            ]
        }
    ]
};
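Once you have added a test, you can check just its group using the tests.to.run property described above. For example, for the LoaderPigStorageArg group shown here:

Code Block

ant -Dharness.old.pig=old_pig -Dharness.cluster.conf=hadoop_conf_dir -Dharness.cluster.bin=hadoop_script -Dharness.hadoop.home=hadoop_home_dir -Dtests.to.run="-t LoaderPigStorageArg" test-e2e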

...

For features that are new, you cannot test against old versions of Pig. For example, macros in 0.9 cannot be tested against 0.8.1. As an alternative to running the same Pig Latin script against an old version, you can run a different script to generate the expected results. This script will be run using the current version, not the old one. To specify a different script, add the key verify_pig_script. For example:

Code Block

       { 
          # simple macro, no args
          'num' => 1,
          'pig' => q#define simple_macro() returns void {
                         a = load ':INPATH:/singlefile/studenttab10k' as (name, age, gpa);
                         b = foreach a generate age, name;
                         store b into ':OUTPATH:';
                     }

                     simple_macro();#,
          'verify_pig_script' => q#a = load ':INPATH:/singlefile/studenttab10k' as (name, age, gpa);
                                   b = foreach a generate age, name;
                                   store b into ':OUTPATH:';#,
        }

...

| Key | What it Does | Example | Required? |
| --- | --- | --- | --- |
| delimiter | Provides floatpostprocess with the delimiter to use | 'delimiter' => ':' | Only with floatpostprocess |
| execonly | This test will only be executed in the specified mode; options are local and mapred | 'execonly' => 'mapred' | No |
| expected_err_regex | Checks stderr for the provided regular expression | 'expected_err_regex' => "Out of bound access." | No |
| expected_out_regex | Checks stdout for the provided regular expression | 'expected_out_regex' => "A: {name: bytearray,age: bytearray,gpa: bytearray}" | No |
| floatpostprocess | Run floating point numbers through a post-processor; due to precision issues, different runs of the same script produce slightly different values, so all floating point numbers are rounded to 3 decimal places. Must be used in conjunction with delimiter | 'floatpostprocess' => 1 | For outputs that include calculated floating point values |
| ignore | Do not run this test; used when a test is failing but we don't want to remove it because it will be needed once the issue is fixed. A reason for ignoring the test should be given | 'ignore' => 'JIRA-19999' | No |
| java_params | Values to be passed on the pig command line before other Pig parameters; useful for passing properties | 'java_params' => ['-Dpig.cachedbag.memusage=0'] | No |
| not_expected_err_regex | Checks that stderr does not match the provided regular expression | 'not_expected_err_regex' => "ERROR" | No |
| not_expected_out_regex | Checks that stdout does not match the provided regular expression | 'not_expected_out_regex' => "datafile" | No |
| notmq | Tells the test harness this is not a multi-query test; only necessary when a test has multiple store operators but should not be verified as if it were multi-query | 'notmq' => 1 | No |
| num | Test number; must be unique within the test group | 'num' => 1 | Yes |
| pig | The Pig Latin script to run in the test | q#a = load ':INPATH:/dir/studenttab10k' as (name, age, gpa); store a into ':OUTPATH:';# | Yes |
| pig_params | Command line arguments to pass to pig when running this test | 'pig_params' => ['-p', qq(fname='studenttab10k')] | No |
| rc | Expected return code | 'rc' => 0 | No |
| sortArgs | Arguments to pass to the Unix sort utility; when given, sort will be called before the output is compared with the expected results | 'sortArgs' => ['-t', ' ', '+0', '-1'] | Only when job output should be sorted |
| verify_pig_script | Alternate Pig Latin script to use to generate the expected results | 'verify_pig_script' => q\A = load ':INPATH:/singlefile/studenttab10k' as (name, age, gpa); store A into ':OUTPATH:';\ | No |
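To see how several of these keys combine, here is a sketch of a hypothetical negative test; the script, the expected message, and the return code are illustrative assumptions, not taken from an actual conf file:

Code Block

        {
            # hypothetical: project a field that is out of bounds for the schema
            'num' => 1,
            'pig' => q\a = load ':INPATH:/singlefile/studenttab10k' as (name, age, gpa);
                       b = foreach a generate $5;
                       store b into ':OUTPATH:';\,
            # illustrative values; match them to Pig's actual output and exit code
            'expected_err_regex' => "Out of bound access",
            'rc' => 2,
        },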

...

| conf File | Tests | Comments |
| --- | --- | --- |
| bigdata.conf | larger size data | We keep these to a minimum as they take much longer than other tests. |
| cmdline.conf | Pig command line output (such as describe) |  |
| grunt.conf | grunt operators, like ls |  |
| macro.conf | macro and import |  |
| multiquery.conf | multiquery scripts |  |
| negative.conf | negative conditions where Pig is expected to return an error |  |
| nightly.conf | general positive tests | Your test goes here if it doesn't fit anywhere else |
| orc.conf | OrcStorage tests |  |
| streaming.conf | streaming feature |  |
| streaming_local.conf | streaming feature in local mode |  |
| turing_jython.conf | Pig scripts embedded in Python scripts |  |

...