Comment: Migrated to Confluence 5.3

...

Some other tricks are also possible. You can use "bin/pig -secretDebugCmd" to inspect the command line of Pig. Make sure you are using the right version of Hadoop.
This issue will be solved in Pig 0.9.1 and beyond.

Q: How can I pass a specific hadoop configuration parameter to Pig?

There are multiple places where you can pass a Hadoop configuration parameter to Pig. Here they are, from highest to lowest priority (a setting with higher priority overrides the same setting with lower priority):
1. set command
2. -P properties_file
3. pig.properties
4. Java system property/environment variable
5. Hadoop configuration file (hadoop-site.xml/core-site.xml/hdfs-site.xml/mapred-site.xml) or Pig-specific Hadoop configuration file (pig-cluster-hadoop-site.xml)

Both 3 and 5 require the configuration file to be on the classpath.
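
As a rough illustration of option 2, the sketch below writes a properties file and passes it to Pig with -P. The file name my_hadoop.properties, the script name my_script.pig, and the value are placeholders chosen for this example. The pig call is guarded so the snippet is harmless on a machine without Pig or without the script.

```shell
# Write a properties file (hypothetical name) holding one Hadoop setting.
cat > my_hadoop.properties <<'EOF'
mapred.min.split.size=134217728
EOF

# Option 2 from the list above: pass the file with -P.
# A "set" statement inside my_script.pig (a placeholder name) would still
# override this value, because "set" has the highest priority.
if command -v pig >/dev/null 2>&1 && [ -f my_script.pig ]; then
    pig -P my_hadoop.properties my_script.pig
fi
```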

Q: I already registered my LoadFunc/StoreFunc jars with a "register" statement, so why do I still get a "Class Not Found" exception?

Try putting your jars in PIG_CLASSPATH as well. "register" guarantees that your jar is shipped to the backend. In the frontend, however, you still need to put the jars on the classpath by setting the "PIG_CLASSPATH" environment variable.
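
A minimal sketch of setting the variable before launching Pig follows. The jar path is a made-up placeholder; substitute the location of your own jar.

```shell
# Hypothetical location of the jar that holds the LoadFunc/StoreFunc.
UDF_JAR="/path/to/myudfs.jar"

# Prepend the jar to PIG_CLASSPATH, keeping any entries already there.
export PIG_CLASSPATH="${UDF_JAR}${PIG_CLASSPATH:+:${PIG_CLASSPATH}}"

# Show the value that a Pig process started from this shell will inherit.
echo "PIG_CLASSPATH=${PIG_CLASSPATH}"
```

The "register" statement in the script is still needed so that the jar reaches the backend.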

Q: How can I load data using Unicode control characters as delimiters?

...

Code Block
LOAD 'input.dat' USING PigStorage('\u0001') AS (x,y,z);

Q: How do I

...

control the number of mappers?

It is determined by your InputFormat. If you are using PigStorage, FileInputFormat will allocate at least 1 mapper for each file. If a file is large, FileInputFormat will split it into smaller chunks. You can control this process with two Hadoop settings: "mapred.min.split.size" and "mapred.max.split.size". In addition, after the InputFormat tells Pig about all the splits, Pig will try to combine small input splits into one mapper. This process can be controlled by "pig.noSplitCombination" and "pig.maxCombinedSplitSize".
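
The settings named above could be applied from inside a script with "set" statements, roughly as in the sketch below. The script name split_tuning.pig, the input and output paths, and the byte values are assumptions made for illustration, and the exact effect depends on your Pig and Hadoop versions.

```shell
# Write a small Pig script (hypothetical name) that tunes split handling.
cat > split_tuning.pig <<'EOF'
-- Largest split, in bytes, that FileInputFormat should produce.
set mapred.max.split.size 134217728;
-- Size, in bytes, up to which Pig combines small splits for one mapper.
set pig.maxCombinedSplitSize 268435456;
A = LOAD 'input.dat' USING PigStorage('\u0001') AS (x, y, z);
STORE A INTO 'output';
EOF

# Print the generated script for inspection.
cat split_tuning.pig
```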

Use the PARALLEL clause:

...

Q: How do I make my Pig jobs run on a specified number of reducers?

...

Besides the PARALLEL clause, you can also use the "set default_parallel" statement in a Pig script, or set the "mapred.reduce.tasks" system property, to specify the default parallelism. If none of these values is set, Pig will use only 1 reducer. (In Pig 0.8, we changed the default number of reducers from 1 to a number calculated by a simple heuristic, as a safeguard.)

More details can be found at http://pig.apache.org/docs/r0.9.0/perf.html#parallel.
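
As a small sketch of the two script-level options, the snippet below writes a Pig script that sets a default and then overrides it for one operator. The script name, paths, and reducer counts are placeholders.

```shell
# Write a Pig script (hypothetical name) that sets reducer parallelism.
cat > parallel_example.pig <<'EOF'
-- Script-wide default for the number of reducers.
set default_parallel 10;
A = LOAD 'sample.txt' AS (x:int, y:int);
-- PARALLEL on a single operator takes precedence for that operator.
B = GROUP A BY x PARALLEL 20;
C = FOREACH B GENERATE group AS x, COUNT(A) AS n;
STORE C INTO 'output';
EOF

# Print the generated script for inspection.
cat parallel_example.pig
```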

Q: Can I do a numerical comparison while filtering?

...

In Pig 2.0 you can test the existence of values in a map using the null construct:
m#'key' is not null

Q:

...

I

...

Code Block

> pig -Dhod.param='-m 3' my_script.pig

Three (3) nodes is the minimum.

Q: I load data from a directory that contains different files. How do I find out which file each record comes from?

...

Code Block
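import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.Tuple;

// PigStorage variant that appends the source file path to every tuple.
public class PigStorageWithInputPath extends PigStorage {
    Path path = null;

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        super.prepareToRead(reader, split);
        // Remember which file the current split belongs to.
        path = ((FileSplit)split.getWrappedSplit()).getPath();
    }

    @Override
    public Tuple getNext() throws IOException {
        Tuple myTuple = super.getNext();
        // Append the path as an extra trailing field; null marks end of input.
        if (myTuple != null)
            myTuple.append(path.toString());
        return myTuple;
    }
}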

In Pig 0.8/0.9.0/0.9.1, you need to set "pig.splitCombination" to false for PigStorageWithInputPath to work correctly. Pig 0.9.2 fixes the issue.

Q: How can I calculate a percentage (partial aggregate / total aggregate)?

The challenge here is to get the total aggregate into the same statement as the partial aggregate. The key is to cast the relation for the total aggregate to a scalar:

Code Block

A = LOAD 'sample.txt' AS (x:int, y:int);
-- calculate the denominator
B = foreach (group A all) generate COUNT(A) as total;
-- calculate the percentage
C = foreach (group A by x) generate group as x, (double)COUNT(A) / (double) B.total as percentage;

Q: How can I pass a parameter containing a space to a Pig script?

Code Block

# Following should work
-p "NAME='Firstname Lastname'"
-p "NAME=Firstname\ Lastname"
# Following are incorrect
-p "NAME=Firstname Lastname"
-p NAME="Firstname Lastname"
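
To see what the shell itself hands over in each case, a small helper can print the arguments it receives. The function show_args is a hypothetical stand-in for the pig command and only illustrates shell quoting, not Pig's own parameter parsing.

```shell
# Print each argument the shell hands to a program, one per line, in brackets.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Each call prints [-p] followed by the second argument shown in the comment.
show_args -p "NAME='Firstname Lastname'"   # [NAME='Firstname Lastname']
show_args -p "NAME=Firstname\ Lastname"    # [NAME=Firstname\ Lastname]
show_args -p "NAME=Firstname Lastname"     # [NAME=Firstname Lastname]
show_args -p NAME="Firstname Lastname"     # [NAME=Firstname Lastname]
```

The last two forms arrive as the same string, with nothing protecting the space. The first two keep the inner quotes or the backslash, so that protection is still present when Pig reads the value.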