...
Due to discrepancies between the Avro and Pig data models, AvroStorage has:
- Limited support for "record": recursively defined records are not supported, because the number of fields in such records is data dependent. For instance: {"type":"record","name":"LinkedListElem","fields":[{"name":"data","type":"int"},{"name":"next","type":["null","LinkedListElem"]}]}
- Limited support for "union": only nullable unions like ["null", "some-type"] are accepted.
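To illustrate the union restriction, a field typed as a nullable union is accepted, while a general union such as ["int","string"] is rejected. The record below is a made-up example (the name "Person" and its fields are not from this page), not a schema used by AvroStorage itself:

```json
{"type": "record", "name": "Person",
 "fields": [
   {"name": "name",  "type": "string"},
   {"name": "email", "type": ["null", "string"]}
 ]}
```

Here "email" uses the nullable-union form ["null", "string"] that AvroStorage supports; replacing it with ["int", "string"] would fall outside the supported subset.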
For simplicity, we also make the following assumption:
...
Users can choose not to provide any parameters to AvroStorage, in which case the Avro schema of the output data is derived from its Pig schema. This may result in undesirable schemas due to discrepancies between the Pig and Avro data models, or due to limitations of Pig itself:
...
- The derived Avro schema will wrap each (nested) field in a nullable union, because Pig allows NULL values for every type and Avro does not. For instance, when you read in Avro data with schema _"boolean"_ and store it using AvroStorage(), you get _["null","boolean"]_.
- The derived Avro schema may contain unwanted tuple wrappers because: 1) Pig only generates tuples; 2) items of Pig bags can only be tuples. AvroStorage automatically removes such wrappers where it can, but sometimes you will still see them, as in example B.
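The nullable-union wrapping described above can be seen in a minimal round trip. This is a sketch; the input path and file are hypothetical, and it assumes the input Avro file's schema is simply "boolean":

```pig
-- read Avro data whose schema is simply "boolean"
flags = LOAD 'input/flags.avro'
        USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- storing with no parameters derives the Avro schema from the Pig schema;
-- because every Pig type admits NULL, the output schema becomes ["null","boolean"]
STORE flags INTO 'output/flags'
      USING org.apache.pig.piggybank.storage.avro.AvroStorage();
```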
...
- field<n> notnull
This indicates that the nth field (and its nested fields) in the output tuple is not null.
- data path
- field<n> def:name
Users can provide predefined schemas in Avro files using the option data path, where _path_ points to a directory of Avro files or a single Avro file. This option is used together with the field parameter field<n> def:name. AvroStorage internally constructs two maps, map[typeName]=>schema and map[fieldName]=>schema, and users specify which schema to use by providing the corresponding _name_. This option is useful when users want to do simple processing of the input data (like filtering and projection) and store it using schemas predefined in the input. Please refer to example C for more details.
- field<n> str
Users can directly specify the schema of the nth field, where _str_ is a string representation of an Avro schema. Usage of this option is similar to schema str, except that the schema is applied only to the nth field.
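The field-level options above can be combined in one STORE statement. The sketch below assumes options are passed to AvroStorage as ('name', 'value') string pairs; the path, field positions, and the type name ImpressionSetEvent are illustrative, not a tested invocation:

```pig
-- hypothetical sketch of combining per-field options
STORE ret INTO 'output' USING org.apache.pig.piggybank.storage.avro.AvroStorage(
    'data',   'input/part-00000.avro',          -- look up predefined schemas in this file
    'field0', 'def:ImpressionSetEvent',         -- field 0: use the predefined schema named ImpressionSetEvent
    'field1', 'notnull',                        -- field 1 (and its nested fields): not null
    'field2', '{"type":"map","values":"string"}'); -- field 2: use this literal Avro schema
```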
...
type name | schema
ImpressionSetEvent | the whole schema
ImpressionDetailsRecord | {"type":"record","name":"ImpressionDetailsRecord","fields":[{"name":"itemId","type":"int"},{"name":"itemType","type":{"type":"enum","name":"ItemType","symbols":["person","job","group","company","nus","news","ayn"]}},{"name":"details","type":{"type":"map","values":"string"}}]}
ItemType | {"type":"enum","name":"ItemType","symbols":["person","job","group","company","nus","news","ayn"]}
The other map is from field names to schemas:
field name | schema
pageNumber | ["int", "null"]
impressionDetails | ImpressionDetailsRecord
impressionDetails.id | int
impressionDetails.type | ItemType
impressionDetails.details | {"type":"map","values":"string"}
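Given these two maps, a def:name reference can name either a type or a field. The sketch below assumes the ('name', 'value') pair syntax and a hypothetical relation and path; it reuses the schema recorded under the field name impressionDetails.details (a map of strings) for output field 0:

```pig
-- hypothetical: project out the details map and store it with its predefined schema
STORE details_only INTO 'output' USING org.apache.pig.piggybank.storage.avro.AvroStorage(
    'data',   'input/part-00000.avro',
    'field0', 'def:impressionDetails.details');
```

Referencing 'def:ImpressionDetailsRecord' instead would resolve through the type-name map rather than the field-name map.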
...
REGISTER avro-1.4.0.jar
REGISTER json-simple-1.1.jar
REGISTER piggybank.jar
REGISTER jackson-core-asl-1.5.5.jar
REGISTER jackson-mapper-asl-1.5.5.jar

data = LOAD 'input/part-00000.avro' USING AvroStorage();
ret = FILTER data BY value.member_id > 2000;
STORE ret INTO 'output' USING AvroStorage('same', 'input/part-00000.avro');
...
5. Known Issues
- AvroStorage does not load JSON-encoded Avro files
- Creating Avro map data in a Pig script has some issues and has not been implemented in AvroStorage. The script below does not work:
A = LOAD 'complex.txt' USING PigStorage('\t') AS (mymap:map[chararray], mytuple:(num:int, str:chararray, dbl:double), bagofmap:{t:(m:map[chararray])}, rownum:int);
describe A;
STORE A INTO 'avro_complex.out' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
- Column pruning is not implemented in the current AvroStorage
6. Related Work
- PIG-794 uses Avro serialization in Pig.
...
- AvroStorageUtils contains the logic for merging multiple schemas.
...
7. Acknowledgments
This documentation was originally written by Lin Guo, and appeared at http://linkedin.jira.com/wiki/display/HTOOLS/AvroStorage+-+Pig+support+for+Avro+data