...

At this point Pig actually launches a MapReduce job on the cluster.

On the cluster, HCatBaseInputFormat.createRecordReader is called with an HCatSplit, the wrapper we created earlier that contains the actual input split and the partition information needed to deserialize its records. An HCatRecordReader containing a storage handler is returned to the framework; the storage handler holds the information necessary to read data from the underlying storage and convert it into usable records.
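
As a rough sketch of that hand-off, consider the following toy model. All type names below are simplified stand-ins for the real HCatalog classes, whose constructors and signatures differ; this only illustrates the unwrap-split, build-reader flow described above.

```java
// Simplified stand-ins for the real HCatalog classes; signatures differ.
class PartInfo {
  String serdeClassName;        // how to deserialize this partition's records
  String inputFormatClassName;  // which input format reads the underlying storage
}

class StorageHandler {
  // In the real code this knows the underlying InputFormat and SerDe classes
  // and can instantiate them on the task node.
}

class HCatSplitSketch {
  Object baseSplit;             // the wrapped, storage-level input split
  PartInfo partitionInfo;       // deserialization info captured at planning time
}

class HCatRecordReaderSketch {
  final StorageHandler storageHandler;
  HCatRecordReaderSketch(StorageHandler storageHandler) {
    this.storageHandler = storageHandler;
  }
  // initialize(), nextKeyValue(), getCurrentValue() follow in later steps
}

class InputFormatSketch {
  /** Called by the MapReduce framework on the cluster, once per split. */
  HCatRecordReaderSketch createRecordReader(Object split) {
    HCatSplitSketch hcatSplit = (HCatSplitSketch) split; // unwrap our wrapper
    PartInfo partInfo = hcatSplit.partitionInfo;         // partition metadata
    StorageHandler handler = new StorageHandler();       // real code builds this from partInfo
    return new HCatRecordReaderSketch(handler);          // handed back to the framework
  }
}
```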

With the RecordReader initialized, it's time to get some actual records! Pig calls HCatBaseLoader.getNext, which gets an HCatRecord from the HCatRecordReader we just initialized, converts it to a Pig tuple, and hands it off to Pig for processing.
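
In outline, the loader's read loop looks something like the sketch below. Again, the types are simplified stand-ins rather than the real Pig LoadFunc and Hadoop RecordReader contracts, and the field-by-field schema mapping is elided.

```java
import java.io.IOException;
import java.util.List;

// Stand-ins for the real types: HCatRecord, Pig's Tuple, and the record reader.
interface HCatRecordSketch { List<Object> getAll(); }
class TupleSketch {
  final List<Object> fields;
  TupleSketch(List<Object> fields) { this.fields = fields; }
}
interface ReaderSketch {
  boolean nextKeyValue() throws IOException;
  HCatRecordSketch getCurrentValue();
}

class LoaderSketch {
  private final ReaderSketch reader;  // the HCatRecordReader initialized above
  LoaderSketch(ReaderSketch reader) { this.reader = reader; }

  /** Pig calls this repeatedly; returning null signals end of input. */
  TupleSketch getNext() throws IOException {
    if (!reader.nextKeyValue()) {
      return null;                    // this split is exhausted
    }
    HCatRecordSketch record = reader.getCurrentValue();
    // The real conversion walks the output schema and maps each HCatalog
    // field to the corresponding Pig datum; here we copy fields verbatim.
    return new TupleSketch(record.getAll());
  }
}
```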

Let's explore how HCatRecords are created. First, HCatRecordReader.nextKeyValue is called to fetch the raw record from the wrapped input format we created earlier. The record is deserialized with the SerDe defined for the partition and wrapped in a LazyHCatRecord, which delays further deserialization until it is required. Using the output schema set earlier, we create an HCatRecord with just the necessary fields. Finally, the HCatRecord is converted into a Pig tuple and handed off to Pig for processing.
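
With the same caveats, the record-creation path can be sketched as follows. LazyRecordSketch stands in for LazyHCatRecord, the byte-array input stands in for whatever value type the wrapped input format actually produces, and the real LazyHCatRecord decodes fields through a Hive ObjectInspector rather than the placeholder shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins for HCatRecordReader, LazyHCatRecord, and the partition's SerDe.
interface SerDeSketch {
  Object deserialize(byte[] rawBytes);  // storage bytes -> opaque row object
}

/** Mimics LazyHCatRecord: holds the row, decodes a field only when asked. */
class LazyRecordSketch {
  private final Object row;
  LazyRecordSketch(Object row) { this.row = row; }
  Object get(String fieldName) {
    // The real LazyHCatRecord walks the row with an ObjectInspector, so only
    // the requested field pays the deserialization cost. Placeholder value here.
    return fieldName + ":" + row;
  }
}

class RecordReaderSketch {
  private final SerDeSketch serde;          // SerDe defined for the partition
  private final List<String> outputSchema;  // only the fields Pig asked for
  private List<Object> current;             // the projected record

  RecordReaderSketch(SerDeSketch serde, List<String> outputSchema) {
    this.serde = serde;
    this.outputSchema = outputSchema;
  }

  /** One record-creation step, mirroring the flow described above. */
  boolean nextKeyValue(byte[] rawBytes) {
    if (rawBytes == null) {
      return false;                                    // wrapped split exhausted
    }
    Object row = serde.deserialize(rawBytes);          // 1. SerDe deserializes
    LazyRecordSketch lazy = new LazyRecordSketch(row); // 2. wrap lazily
    List<Object> projected = new ArrayList<>();
    for (String field : outputSchema) {                // 3. project to output schema
      projected.add(lazy.get(field));                  //    decode needed fields only
    }
    current = projected;                               // 4. ready for tuple conversion
    return true;
  }

  List<Object> getCurrentValue() {
    return current;
  }
}
```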