...
Create an IDF implementation that represents sqoop data in an avro GenericRecord.
The source of truth stored in memory is the avro record, which is the native format. The remaining formats, i.e. text and object array, are constructed lazily, if and when the connector code using the IDF asks for them.
Provide a reliable way to convert from the sqoop schema to an avro schema for all 14 supported sqoop data types.
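The lazy-construction goal above can be sketched with plain JDK types. This is only an illustration: the class name LazyFormatSketch and its methods are hypothetical, a LinkedHashMap stands in for the avro GenericRecord, and the CSV rendering is a naive comma join without the quoting or escaping a real IDF would need.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch: a LinkedHashMap plays the role of the avro GenericRecord.
class LazyFormatSketch {
    // Native form, the only state kept in memory (the source of truth).
    private final Map<String, Object> record;

    LazyFormatSketch(Map<String, Object> record) {
        this.record = new LinkedHashMap<>(record);
    }

    // Returns the native form directly, no conversion involved.
    Map<String, Object> getData() {
        return record;
    }

    // Text form is built only when a caller asks for it.
    String getCsvText() {
        StringJoiner csv = new StringJoiner(",");
        for (Object value : record.values()) {
            csv.add(String.valueOf(value)); // naive: no quoting or escaping
        }
        return csv.toString();
    }

    // Object array form is likewise built only on demand.
    Object[] getObjectArray() {
        return record.values().toArray();
    }
}
```

Nothing is converted at construction time, so a connector that only reads the native record never pays for the text or object array forms.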
Design
- Extend the IDF API
/**
 * IDF representing the intermediate format in Avro object
 */
public class AvroIntermediateDataFormat extends IntermediateDataFormat<GenericRecord> {
The sqoop schema is mandatory, since we need a schema to construct an avro record.
// convert the sqoop schema to avro schema
public AvroIntermediateDataFormat(org.apache.sqoop.schema.Schema schema) {
  setSchema(schema);
}
- Implement a method to convert csv text to avro GenericRecord
private GenericRecord toAvro(String csv) {..}
- Implement a method to convert the object array to avro GenericRecord
private GenericRecord toAvro(Object[] data) { ..}
- Conversely, implement a method to lazily construct the csv from avro GenericRecord when invoked
private String toCSV(GenericRecord record) { ..}
- Implement a method to lazily construct the object array from avro GenericRecord when invoked
private Object[] toObject(GenericRecord data) {...}
- Implement methods to serialize and deserialize the avro record to and from a string (the wire format)
/**
 * {@inheritDoc}
 */
@Override
public void write(DataOutput out) throws IOException {
  // todo
}

/**
 * {@inheritDoc}
 */
@Override
public void read(DataInput in) throws IOException {
  // todo
}
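One possible shape for the string wire format, sketched with JDK classes only. The class WireFormatSketch and the roundTrip helper are hypothetical and not part of the IDF API; the sketch assumes the record has already been rendered to a String and uses DataOutput.writeUTF, which rejects strings whose encoded form exceeds 65535 bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch: the payload is the string form of one record.
class WireFormatSketch {
    private String text;

    WireFormatSketch() { }

    WireFormatSketch(String text) {
        this.text = text;
    }

    String getText() {
        return text;
    }

    // Writes a length-prefixed, modified UTF-8 string.
    void write(DataOutput out) throws IOException {
        out.writeUTF(text);
    }

    // Reads back exactly what write() produced.
    void read(DataInput in) throws IOException {
        text = in.readUTF();
    }

    // Helper for illustration: serialize to a byte buffer and read it back.
    static WireFormatSketch roundTrip(WireFormatSketch source) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            source.write(new DataOutputStream(buffer));
            WireFormatSketch copy = new WireFormatSketch();
            copy.read(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Writing the avro binary encoding instead of a string would avoid the size limit, at the cost of both sides needing the schema to decode.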
Column Type | Object Format | Avro Format / Field Type
---|---|---
NULL value in the field | java null | UNION with Schema.Type.NULL for any field that is nullable
ARRAY | java Object[] | Schema.Type.ARRAY
BINARY | java byte[] | Schema.Type.BYTES
BIT | java boolean | Schema.Type.BOOLEAN
DATE | org.joda.time.LocalDate | Schema.Type.LONG
DATE_TIME | org.joda.time.DateTime or org.joda.time.LocalDateTime (depends on timezone attribute) | Schema.Type.LONG
DECIMAL | java BigDecimal | ??
ENUM | java String | Schema.Type.ENUM
FIXED_POINT | java Integer or java Long (depends on byteSize attribute) | if (((org.apache.sqoop.schema.type.FixedPoint) column).getByteSize() <= Integer.SIZE) { return Schema.Type.INT; } else { return Schema.Type.LONG; }
FLOATING_POINT | java Double or java Float (depends on byteSize attribute) | if (((org.apache.sqoop.schema.type.FloatingPoint) column).getByteSize() <= Float.SIZE) { return Schema.Type.FLOAT; } else { return Schema.Type.DOUBLE; }
MAP | java.util.Map<Object, Object> | Schema.Type.MAP
SET | java Object[] | Schema.Type.ARRAY
TEXT | java String | Schema.Type.STRING
TIME | org.joda.time.LocalTime (no timezone) | Schema.Type.LONG
UNKNOWN | same as java byte[] | Schema.Type.BYTES
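The mapping in the table can be sketched as a single switch. This uses JDK types only: the enums ColumnType and AvroType are hypothetical stand-ins for the sqoop column types and org.apache.avro.Schema.Type. One assumption differs from the snippets in the table: the sketch treats byteSize as a width in bytes and therefore compares it against Integer.SIZE / Byte.SIZE and Float.SIZE / Byte.SIZE (both 4), since Integer.SIZE and Float.SIZE on their own are widths in bits. DECIMAL is left unmapped because its avro type is still undecided.

```java
// Hypothetical sketch of the sqoop-to-avro type mapping, JDK types only.
class SchemaMappingSketch {
    // Stand-in for the 14 sqoop column types.
    enum ColumnType {
        ARRAY, BINARY, BIT, DATE, DATE_TIME, DECIMAL, ENUM, FIXED_POINT,
        FLOATING_POINT, MAP, SET, TEXT, TIME, UNKNOWN
    }

    // Stand-in for org.apache.avro.Schema.Type.
    enum AvroType {
        ARRAY, BYTES, BOOLEAN, LONG, ENUM, INT, FLOAT, DOUBLE, MAP, STRING
    }

    // byteSize is only consulted for FIXED_POINT and FLOATING_POINT.
    static AvroType toAvroType(ColumnType type, long byteSize) {
        switch (type) {
            case ARRAY:
            case SET:
                return AvroType.ARRAY;
            case BINARY:
            case UNKNOWN:
                return AvroType.BYTES;
            case BIT:
                return AvroType.BOOLEAN;
            case DATE:
            case DATE_TIME:
            case TIME:
                return AvroType.LONG; // no first-class date types in Avro 1.7
            case ENUM:
                return AvroType.ENUM;
            case FIXED_POINT:
                // Assumption: byteSize is in bytes, so compare against 4.
                return byteSize <= Integer.SIZE / Byte.SIZE ? AvroType.INT : AvroType.LONG;
            case FLOATING_POINT:
                return byteSize <= Float.SIZE / Byte.SIZE ? AvroType.FLOAT : AvroType.DOUBLE;
            case MAP:
                return AvroType.MAP;
            case TEXT:
                return AvroType.STRING;
            case DECIMAL:
                throw new UnsupportedOperationException("DECIMAL mapping is still undecided");
            default:
                throw new IllegalArgumentException("Unhandled column type: " + type);
        }
    }
}
```

Nullable columns are not covered here; per the table they would wrap the mapped type in a UNION with NULL.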
External Jar Dependencies added?
...
Avro 1.7, which we use, does not yet support date, dateTime and time as first-class types. Until 1.8, represent them as long.
How to represent Decimal? Should we use the "FIXED" type in avro?
Handling nulls via the union type, is this ok?
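To help weigh the Decimal question, here is a rough sketch of what a fixed-width encoding could look like, using JDK classes only. It is one option, not a decision: the class DecimalAsFixedSketch and its methods are hypothetical, and the scale and byte width are assumed to be agreed out of band (for example in the schema). The value is stored as its unscaled integer in big-endian two's complement, sign-extended to the fixed width.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.util.Arrays;

// Hypothetical sketch: BigDecimal <-> fixed-width byte array.
class DecimalAsFixedSketch {
    // Encodes the unscaled value, sign-extended to exactly `size` bytes.
    static byte[] toFixed(BigDecimal value, int scale, int size) {
        // setScale without a rounding mode throws ArithmeticException
        // if the value cannot be represented exactly at this scale.
        BigInteger unscaled = value.setScale(scale).unscaledValue();
        byte[] raw = unscaled.toByteArray(); // minimal two's complement, big-endian
        if (raw.length > size) {
            throw new IllegalArgumentException("Value does not fit in " + size + " bytes");
        }
        byte[] fixed = new byte[size];
        byte pad = (byte) (unscaled.signum() < 0 ? 0xFF : 0x00);
        Arrays.fill(fixed, 0, size - raw.length, pad);
        System.arraycopy(raw, 0, fixed, size - raw.length, raw.length);
        return fixed;
    }

    // Decodes using the same scale that was used for encoding.
    static BigDecimal fromFixed(byte[] fixed, int scale) {
        return new BigDecimal(new BigInteger(fixed), scale);
    }
}
```

The trade-off against BYTES is that a fixed width caps the precision up front, while BYTES lets each value take only the bytes it needs.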
...