...
Code Block |
---|
/** * Returns the {@linkplain IntermediateDataFormat} this connector * can return natively in. This will support retrieving the data as text * and an array of objects. This should never return null. * * @return {@linkplain IntermediateDataFormat} object */ public Class<? extends IntermediateDataFormat<?>> getIntermediateDataFormat() { return AvroIntermediateDataFormat.class; } |
Background
Read the IDF API for more information on the core aspects of the IDF.
Requirements
- Create an IDF implementation that represents Sqoop data as an Avro GenericRecord.
- The source of truth stored in memory is the Avro record, which is the native format. The remaining formats, i.e. text and object array, are constructed lazily, if and when the code using the IDF asks for them.
- Provide a reliable way to convert from the Sqoop schema to the Avro schema for all 14 supported Sqoop data types.
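The lazy construction described above can be sketched with plain JDK types. This is only an illustration under stated assumptions, not the actual implementation: `LazyFormatSketch` is a hypothetical class, a `LinkedHashMap` stands in for the Avro `GenericRecord`, the CSV encoding is naive (no quoting or escaping, nulls written as `NULL`), and caching the derived views is an added assumption that the requirement itself does not state.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical stand-in for the IDF: the native record is the source of truth,
// and the text and object-array views are derived only when requested.
class LazyFormatSketch {
    // Stand-in for the Avro GenericRecord (field name -> value, in schema order).
    private final Map<String, Object> record = new LinkedHashMap<>();
    private String cachedCsv;        // built on the first getCsvText() call
    private Object[] cachedObjects;  // built on the first getObjectData() call
    int csvBuilds = 0;               // counts CSV conversions, for demonstration only

    void put(String field, Object value) {
        record.put(field, value);
        // Any change to the native record invalidates the derived views.
        cachedCsv = null;
        cachedObjects = null;
    }

    String getCsvText() {
        if (cachedCsv == null) {
            StringJoiner joiner = new StringJoiner(",");
            for (Object value : record.values()) {
                joiner.add(value == null ? "NULL" : value.toString());
            }
            cachedCsv = joiner.toString();
            csvBuilds++;
        }
        return cachedCsv;
    }

    Object[] getObjectData() {
        if (cachedObjects == null) {
            cachedObjects = record.values().toArray();
        }
        return cachedObjects;
    }
}
```

Nothing is converted until `getCsvText()` or `getObjectData()` is called, and repeated calls reuse the cached view until the record changes.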
Design
- Extend the IDF API
Code Block |
---|
/** * IDF representing the intermediate format in Avro object */ public class AvroIntermediateDataFormat extends IntermediateDataFormat<GenericRecord> { |
...} |
- The Sqoop schema is mandatory, since we need a schema to construct an Avro record
Code Block |
---|
/* convert the sqoop schema to avro schema */ public AvroIntermediateDataFormat(org.apache.sqoop.schema.Schema schema) { super.setSchema(schema); } |
- Implement a method to convert csv text to avro GenericRecord
private GenericRecord toAvro(String csv) {..}
- Implement a method to convert the object array to avro GenericRecord
private GenericRecord toAvro(Object[] data) { ..}
- Conversely, implement a method to lazily construct the csv from avro GenericRecord when invoked
private String toCSV(GenericRecord record) { ..}
- Implement a method to lazily construct the object array from avro GenericRecord when invoked
private Object[] toObject(GenericRecord data) {...}
- Implement methods to serialize and deserialize the avro record to and from a string (the wire format)
Code Block |
---|
/** * {@inheritDoc} */ @Override public void write(DataOutput out) throws IOException { /* todo */ } /** * {@inheritDoc} */ @Override public void read(DataInput in) throws IOException { /* todo */ } |
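As a rough sketch of what a string wire format could look like, the example below writes a text form of the record with `DataOutput.writeUTF` and reads it back with `DataInput.readUTF`. `WireFormatSketch`, its `text` field, and `roundTrip` are hypothetical names. The real implementation would convert the Avro record to and from its string form at these two points, and how it does so is not specified here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch: a string stands in for the serialized Avro record.
class WireFormatSketch {
    private String text;

    WireFormatSketch(String text) {
        this.text = text;
    }

    String getText() {
        return text;
    }

    // Serialize the string form to the wire.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(text);
    }

    // Deserialize the string form from the wire, replacing the current value.
    public void read(DataInput in) throws IOException {
        text = in.readUTF();
    }

    // Round-trips a value through an in-memory buffer, for demonstration only.
    static String roundTrip(String value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            new WireFormatSketch(value).write(new DataOutputStream(buffer));
            WireFormatSketch copy = new WireFormatSketch("");
            copy.read(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy.getText();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that `writeUTF` rejects strings whose encoded form exceeds 65535 bytes, so large records would need a different framing.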
Mappings from sqoop to avro types.
Column Type | Object Format | Avro Format / Field Type |
---|---|---|
NULL value in the field | java null | Schema.Type.NULL, in a UNION for any field that is nullable |
ARRAY | java Object[] | Schema.Type.ARRAY |
BINARY | java byte[] | Schema.Type.BYTES |
BIT | java boolean | Schema.Type.BOOLEAN |
DATE | org.joda.time.LocalDate | Schema.Type.LONG |
DATE_TIME | org.joda.time.DateTime or org.joda.time.LocalDateTime (depends on timezone attribute) | Schema.Type.LONG |
DECIMAL | java BigDecimal | Schema.Type.FIXED ??? |
ENUM | java String | Schema.Type.ENUM |
FIXED_POINT | java Integer or java Long (depends on byteSize attribute) | if (((org.apache.sqoop.schema.type.FixedPoint) column).getByteSize() <= Integer.SIZE / Byte.SIZE) { return Schema.Type.INT; } else { return Schema.Type.LONG; } |
FLOATING_POINT | java Double or java Float (depends on byteSize attribute) | if (((org.apache.sqoop.schema.type.FloatingPoint) column).getByteSize() <= Float.SIZE / Byte.SIZE) { return Schema.Type.FLOAT; } else { return Schema.Type.DOUBLE; } |
MAP | java.util.Map<Object, Object> | Schema.Type.MAP |
SET | java Object[] | Schema.Type.ARRAY |
TEXT | java String | Schema.Type.STRING |
TIME | org.joda.time.LocalTime (no timezone) | Schema.Type.LONG |
UNKNOWN | same as java byte[] | Schema.Type.BYTES |
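
The byte-size rule for FIXED_POINT and FLOATING_POINT in the table can be sketched without any Avro dependency. This is a minimal illustration under stated assumptions: `TypeMappingSketch` and its `AvroType` enum are hypothetical stand-ins for the corresponding `Schema.Type` constants, and `byteSize` is assumed to be measured in bytes, so the thresholds are `Integer.SIZE / Byte.SIZE` and `Float.SIZE / Byte.SIZE` (both 4).

```java
// Hypothetical sketch of the size-dependent rows of the mapping table.
class TypeMappingSketch {
    // Stand-in for the relevant Avro Schema.Type constants.
    enum AvroType { INT, LONG, FLOAT, DOUBLE }

    // FIXED_POINT: values that fit in 4 bytes map to INT, anything wider to LONG.
    static AvroType fixedPoint(long byteSize) {
        return byteSize <= Integer.SIZE / Byte.SIZE ? AvroType.INT : AvroType.LONG;
    }

    // FLOATING_POINT: values that fit in 4 bytes map to FLOAT, anything wider to DOUBLE.
    static AvroType floatingPoint(long byteSize) {
        return byteSize <= Float.SIZE / Byte.SIZE ? AvroType.FLOAT : AvroType.DOUBLE;
    }
}
```

For example, `TypeMappingSketch.fixedPoint(2)` yields `INT` and `TypeMappingSketch.floatingPoint(8)` yields `DOUBLE`.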
...
External Jar Dependencies added?
...