Title: Avro Intermediate Data Format
JIRA : https://issues.apache.org/jira/browse/SQOOP-1902
Summary
The connector-sdk package in sqoop currently supports CSVIDF and JSONIDF. The goal of this ticket is to use the avro GenericRecord
to represent the sqoop data as it transfers from the "FROM" part of the sqoop job to the "TO" part of the sqoop job.
How do connectors use the AvroIDF?
By declaring it in the connector implementation class as below. This directs sqoop to store the data flowing from the FROM to the TO in the sqoop job in avro format, i.e. the in-memory intermediate representation will always be an avro record with its schema.
Code Block |
---|
/**
* Returns the {@linkplain IntermediateDataFormat} this connector
* can return natively in. This will support retrieving the data as text
* and an array of objects. This should never return null.
*
* @return {@linkplain IntermediateDataFormat} object
*/
public Class<? extends IntermediateDataFormat<?>> getIntermediateDataFormat() {
return AvroIntermediateDataFormat.class;
} |
Background
Read IDFAPI for more information on the core aspects of the IDF.
Requirements
Create an IDF implementation that represents sqoop data in avro GenericRecord.
The source of truth stored in memory is the avro record, which is the native format. The remaining formats, i.e. text and object array, are constructed lazily if and when invoked by the connector code.
Provide a reliable way to convert from the sqoop schema to the avro schema for all 14 sqoop data types supported.
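The schema conversion requirement can be pictured as a lookup from each sqoop column type to an avro type name. The sketch below is only illustrative: the `ColumnType` enum is a local stand-in assumed to mirror sqoop's 14 column types, the class name `SqoopToAvroTypeSketch` is hypothetical, and several targets (`DECIMAL` as string, `UNKNOWN` as bytes, `FIXED_POINT` as long) are assumptions, since the open questions in this page leave them undecided.

```java
import java.util.EnumMap;
import java.util.Map;

public class SqoopToAvroTypeSketch {

    // Local stand-in assumed to mirror sqoop's 14 column types, so the sketch compiles alone.
    enum ColumnType {
        ARRAY, BINARY, BIT, DATE, DATE_TIME, DECIMAL, ENUM,
        FIXED_POINT, FLOATING_POINT, MAP, SET, TEXT, TIME, UNKNOWN
    }

    // Hypothetical mapping to avro type names; entries marked "assumption" are not settled here.
    static final Map<ColumnType, String> AVRO_TYPE = new EnumMap<>(ColumnType.class);
    static {
        AVRO_TYPE.put(ColumnType.ARRAY, "array");
        AVRO_TYPE.put(ColumnType.SET, "array");             // avro has no set type
        AVRO_TYPE.put(ColumnType.MAP, "map");
        AVRO_TYPE.put(ColumnType.ENUM, "enum");
        AVRO_TYPE.put(ColumnType.BINARY, "bytes");
        AVRO_TYPE.put(ColumnType.UNKNOWN, "bytes");         // assumption
        AVRO_TYPE.put(ColumnType.BIT, "boolean");
        AVRO_TYPE.put(ColumnType.TEXT, "string");
        AVRO_TYPE.put(ColumnType.FIXED_POINT, "long");      // assumption: widest integer type
        AVRO_TYPE.put(ColumnType.FLOATING_POINT, "double"); // assumption: widest float type
        AVRO_TYPE.put(ColumnType.DECIMAL, "string");        // assumption: open question
        AVRO_TYPE.put(ColumnType.DATE, "long");             // avro 1.7 has no date type
        AVRO_TYPE.put(ColumnType.TIME, "long");
        AVRO_TYPE.put(ColumnType.DATE_TIME, "long");
    }

    // Looks up the avro type name, failing loudly if a column type was left unmapped.
    static String avroTypeFor(ColumnType type) {
        String avroType = AVRO_TYPE.get(type);
        if (avroType == null) {
            throw new IllegalArgumentException("No avro mapping for " + type);
        }
        return avroType;
    }

    public static void main(String[] args) {
        for (ColumnType type : ColumnType.values()) {
            System.out.println(type + " -> " + avroTypeFor(type));
        }
    }
}
```

Keeping the mapping in one table makes it easy to check that every column type is covered, which is the "reliable" part of the requirement.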
Design
- Extend the IDF API
Code Block |
---|
/**
* IDF representing the intermediate format in Avro object
*/
public class AvroIntermediateDataFormat extends IntermediateDataFormat<GenericRecord> { |
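A rough sketch of the "record is the source of truth, other formats are derived on demand" idea follows. It is not the actual implementation: an ordered `Map` stands in for the avro GenericRecord so the example runs without the avro jar, the class name `LazyIdfSketch` is hypothetical, the method names are modeled on the test names listed under Testing, and the CSV encoding (single-quoted strings, `NULL` for nulls) is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class LazyIdfSketch {

    // Ordered field names; stands in for the avro schema.
    private final String[] fieldNames;
    // Stand-in for the avro GenericRecord: the only state that is stored.
    private Map<String, Object> record = new LinkedHashMap<>();

    public LazyIdfSketch(String... fieldNames) {
        this.fieldNames = fieldNames.clone();
    }

    // Native format: stored and returned as is.
    public void setData(Map<String, Object> data) {
        this.record = new LinkedHashMap<>(data);
    }

    public Map<String, Object> getData() {
        return record;
    }

    // Object array input is converted into the record immediately.
    public void setObjectData(Object[] values) {
        if (values.length != fieldNames.length) {
            throw new IllegalArgumentException("Expected " + fieldNames.length + " values");
        }
        Map<String, Object> converted = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.length; i++) {
            converted.put(fieldNames[i], values[i]);
        }
        this.record = converted;
    }

    // Object array output is built only when a connector asks for it.
    public Object[] getObjectData() {
        return record.values().toArray();
    }

    // CSV text output is likewise built only on request (assumed encoding).
    public String getCSVTextData() {
        StringJoiner csv = new StringJoiner(",");
        for (Object value : record.values()) {
            if (value == null) {
                csv.add("NULL");
            } else if (value instanceof String) {
                csv.add("'" + value + "'");
            } else {
                csv.add(String.valueOf(value));
            }
        }
        return csv.toString();
    }
}
```

For example, after `setObjectData(new Object[] {1, "sqoop", null})` on a sketch created with three field names, nothing but the record is held in memory, and the text form is produced only when `getCSVTextData()` is called.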
External Jar Dependencies added?
Yes, an avro 1.7 dependency is added to the connector-sdk package.
Testing
The unit tests should cover the following use cases for the 14 ColumnTypes supported by sqoop, including the null representation.
Code Block |
---|
// convert from avro to other formats
setDataGetCSV
setDataGetObjectArray
setDataGetData
// convert from csv to other forms
setCSVGetData
setCSVGetObjectArray
setCSVGetCSV
// convert from object array to other formats
setObjectArrayGetData
setObjectArrayGetCSV
setObjectArrayGetObjectArray |
Open Questions
The Avro 1.7 release that we use does not yet support date/dateTime/time as first-class primitive types. Until 1.8, represent dates as long.
How to represent Decimal?
Handling nulls via the union type, is this ok?
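For the first and last questions above, one possible encoding is sketched here. It assumes dates are stored as milliseconds since the epoch in UTC and that a missing value stays a Java `null`, which would correspond to a union of null and long on the avro side. The class name `DateAsLongSketch` and its helper methods are hypothetical, and `java.time` is used only to keep the sketch free of external jars.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class DateAsLongSketch {

    // Date -> long: milliseconds since the epoch at midnight UTC (assumed convention).
    static long dateToLong(LocalDate date) {
        return date.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
    }

    // long -> Date: inverse of dateToLong under the same UTC assumption.
    static LocalDate longToDate(long millis) {
        return Instant.ofEpochMilli(millis).atZone(ZoneOffset.UTC).toLocalDate();
    }

    // DateTime -> long, again interpreted in UTC.
    static long dateTimeToLong(LocalDateTime dateTime) {
        return dateTime.toInstant(ZoneOffset.UTC).toEpochMilli();
    }

    static LocalDateTime longToDateTime(long millis) {
        return LocalDateTime.ofInstant(Instant.ofEpochMilli(millis), ZoneOffset.UTC);
    }

    // Nullable variant: a null date stays null, matching a union of null and long.
    static Long nullableDateToLong(LocalDate date) {
        return date == null ? null : dateToLong(date);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2015, 1, 15);
        long encoded = dateToLong(date);
        System.out.println(encoded + " round-trips to " + longToDate(encoded));
    }
}
```

Whichever unit is chosen (milliseconds or days), the FROM and TO sides need to agree on it, since the long itself carries no type information in avro 1.7.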