
...

Due to discrepancies between the Avro and Pig data models, AvroStorage has:

  • Limited support for "record": we do not support recursively defined records, because the number of fields in such records is data dependent. For instance, {"type":"record","name":"LinkedListElem","fields":[{"name":"data","type":"int"},{"name":"next","type":["null","LinkedListElem"]}]}.
  • Limited support for "union": we only accept nullable unions like ["null", "some-type"].
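To make the two cases concrete, the accepted nullable union and the rejected recursive record can be written out as plain JSON. This is an illustration only, not part of AvroStorage itself:

```python
import json

# Accepted: a nullable union, i.e. exactly ["null", "some-type"].
nullable_int = json.loads('["null", "int"]')

# Rejected: a recursively defined record -- "LinkedListElem" refers to
# itself in its "next" field, so the number of nested fields depends
# on the data rather than on the schema alone.
linked_list = json.loads("""
{"type": "record", "name": "LinkedListElem",
 "fields": [{"name": "data", "type": "int"},
            {"name": "next", "type": ["null", "LinkedListElem"]}]}
""")

print(nullable_int)                     # the accepted union
print(linked_list["fields"][1]["type"]) # the recursive reference
```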

For simplicity, we also make the following assumption:

...

Users can choose not to provide any parameters to AvroStorage, in which case the Avro schema of the output data is derived from its Pig schema. This may result in undesirable schemas due to discrepancies between the Pig and Avro data models, or due to limitations of Pig itself:

...

  • The derived Avro schema will wrap each (nested) field in a nullable union, because Pig allows NULL values for every type while Avro does not. For instance, when you read in Avro data of schema "boolean" and store it using AvroStorage(), you will get ["null","boolean"].
  • The derived Avro schema may contain unwanted tuple wrappers because: 1) Pig only generates tuples; 2) items of Pig bags can only be tuples. AvroStorage can automatically remove such wrappers, but sometimes you will still see them, as in example B.
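The null-wrapping rule can be sketched as a small helper; `to_nullable` is a hypothetical function for illustration, not part of the AvroStorage API:

```python
def to_nullable(avro_type):
    """Wrap an Avro type in a ["null", type] union, mirroring how the
    derived schema allows NULL for every Pig field. A union that is
    already nullable is left unchanged."""
    if isinstance(avro_type, list) and "null" in avro_type:
        return avro_type
    return ["null", avro_type]

# Reading "boolean" and storing it back yields ["null", "boolean"].
print(to_nullable("boolean"))
# An already-nullable union is not double-wrapped.
print(to_nullable(["null", "int"]))
```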

...

  • field<n> notnull
    This indicates that the n-th field (and its nested fields) in the output tuple is notnull.
  • data path and field<n> def:name
    Users can provide predefined schemas in Avro files using the option data path, where path points to a directory of Avro files or a single Avro file. It is used together with the field parameter field<n> def:name. AvroStorage internally constructs two maps, map[typeName]=>schema and map[fieldName]=>schema, and users specify which schema to use by providing the corresponding name. This option is useful when users want to do simple processing of the input data (like filtering and projection) and store it using predefined schemas from the input. Please refer to example C for more details.
  • field<n> str
    Users can directly specify the schema of field n, where str is a string representation of an Avro schema. The usage of this option is similar to schema str, except that the schema is applied only to the n-th field.
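The shape of these per-field options can be sketched as follows. This is an illustrative parser written for this document, not AvroStorage's actual implementation:

```python
def parse_field_option(option):
    """Split an option string such as "field<n> notnull",
    "field<n> def:name", or "field<n> <schema-str>" into
    (field_index, directive)."""
    field_part, directive = option.split(" ", 1)
    if not field_part.startswith("field"):
        raise ValueError("not a field option: " + option)
    return int(field_part[len("field"):]), directive

# "field3 notnull": the 4th output field (index 3) must be notnull.
print(parse_field_option("field3 notnull"))
# "field0 def:ItemType": field 0 uses the predefined schema named ItemType.
print(parse_field_option("field0 def:ItemType"))
```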

...

type name | schema
ImpressionSetEvent | the whole schema
ImpressionDetailsRecord | {"type":"record","name":"ImpressionDetailsRecord","fields":[{"name":"itemId","type":"int"},{"name":"itemType","type":{"type":"enum","name":"ItemType","symbols":["person","job","group","company","nus","news","ayn"]}},{"name":"details","type":{"type":"map","values":"string"}}]}
ItemType | {"type":"enum","name":"ItemType","symbols":["person","job","group","company","nus","news","ayn"]}

The other map is from field names to schemas:

field name | schema
pageNumber | ["int", "null"]
impressionDetails | ImpressionDetailsRecord
impressionDetails.id | int
impressionDetails.type | ItemType
impressionDetails.details | {"type":"map","values":"string"}
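The construction of these two maps can be sketched by walking a parsed record schema. This is an illustration of the idea only, not AvroStorage's actual code, and the nested-name convention (dotted prefixes) is an assumption:

```python
def build_maps(schema, prefix="", type_map=None, field_map=None):
    """Walk an Avro schema (parsed JSON) and build
    map[typeName] => schema and map[fieldName] => schema."""
    type_map = {} if type_map is None else type_map
    field_map = {} if field_map is None else field_map
    if isinstance(schema, dict) and "name" in schema:
        type_map[schema["name"]] = schema          # named type (record/enum/...)
    if isinstance(schema, dict) and schema.get("type") == "record":
        for f in schema["fields"]:
            fname = prefix + f["name"]
            field_map[fname] = f["type"]           # field name -> its schema
            build_maps(f["type"], fname + ".", type_map, field_map)
    return type_map, field_map

record = {
    "type": "record", "name": "ImpressionDetailsRecord",
    "fields": [
        {"name": "itemId", "type": "int"},
        {"name": "itemType",
         "type": {"type": "enum", "name": "ItemType",
                  "symbols": ["person", "job", "group", "company",
                              "nus", "news", "ayn"]}},
        {"name": "details",
         "type": {"type": "map", "values": "string"}},
    ],
}

types, fields = build_maps(record)
print(sorted(types))      # the named types collected from the schema
print(fields["itemId"])   # the schema recorded for a leaf field
```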

...

No Format
REGISTER avro-1.4.0.jar
REGISTER json-simple-1.1.jar
REGISTER piggybank.jar
REGISTER jackson-core-asl-1.5.5.jar
REGISTER jackson-mapper-asl-1.5.5.jar

data = LOAD 'input/part-00000.avro' USING AvroStorage();
ret = FILTER data BY value.member_id > 2000;

STORE ret INTO 'output' USING AvroStorage(
'same', 'input/part-00000.avro');

...


5. Known Issues

  • AvroStorage does not load JSON-encoded Avro files.
  • Creating map data for Avro in a Pig script has issues and is not implemented in AvroStorage. The script below does not work:
    No Format
    
    A = load 'complex.txt' using PigStorage('\t') as (mymap:map[chararray], mytuple:(num:int, str:chararray, dbl:double), 
    bagofmap:{t:(m:map[chararray])}, rownum:int);
    describe A;
    store A into 'avro_complex.out' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
    
  • Column Pruning is not implemented in the current AvroStorage

6. Related Work

...

...

7. Acknowledgments

This documentation was originally written by Lin Guo, and appeared at http://linkedin.jira.com/wiki/display/HTOOLS/AvroStorage+-+Pig+support+for+Avro+data