...

  •  Create an IDF implementation that represents Sqoop data as an Avro GenericRecord.

  •  The source of truth held in memory is the Avro record, which is the native format. The remaining formats, i.e. text and object array, are constructed lazily, only if and when the connector code that uses the IDF requests them.

  •  Provide a reliable way to convert the Sqoop schema to an Avro schema for all 14 supported Sqoop data types.
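
The lazy construction described above could look roughly like the following. This is only a sketch under stated assumptions: a plain Map stands in for the Avro GenericRecord so the example needs nothing beyond the JDK, the class name LazyFormatsSketch and the textBuilds counter are hypothetical, and the text form is a bare comma join without the quoting a real CSV format would need.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical stand-in for the IDF; a Map plays the role of the Avro GenericRecord. */
final class LazyFormatsSketch {
  private final Map<String, Object> record;  // native form, the source of truth
  private String text;                       // derived form, built on first request
  private Object[] objects;                  // derived form, built on first request
  int textBuilds = 0;                        // counts conversions, for illustration only

  LazyFormatsSketch(Map<String, Object> record) {
    this.record = new LinkedHashMap<>(record);
  }

  /** Builds the text form once and reuses it on later calls. */
  String getTextData() {
    if (text == null) {
      StringJoiner joiner = new StringJoiner(",");
      for (Object value : record.values()) {
        joiner.add(String.valueOf(value));
      }
      text = joiner.toString();
      textBuilds++;
    }
    return text;
  }

  /** Builds the object array once and reuses it on later calls. */
  Object[] getObjectData() {
    if (objects == null) {
      objects = record.values().toArray();
    }
    return objects;
  }
}
```

A connector that only ever reads the native record pays for neither conversion, and one that asks for the text form repeatedly pays for it once.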


Design

  • Extend the IDF API

 

/**
 * IDF that represents the intermediate data as an Avro GenericRecord
 */
public class AvroIntermediateDataFormat extends IntermediateDataFormat<GenericRecord> {

 

  • The Sqoop schema is mandatory, since we need a schema to construct an Avro record

      // Convert the Sqoop schema to an Avro schema
      public AvroIntermediateDataFormat(org.apache.sqoop.schema.Schema schema) {
        setSchema(schema);
      }
  • Implement a method to convert CSV text to an Avro GenericRecord
    •   private GenericRecord toAvro(String csv) {..}

  • Implement a method to convert the object array to an Avro GenericRecord
    •   private GenericRecord toAvro(Object[] data) {..}

  • Conversely, implement a method to lazily construct the CSV text from the Avro GenericRecord when invoked
    •   private String toCSV(GenericRecord record) {..}

  • Implement a method to lazily construct the object array from the Avro GenericRecord when invoked
    •   private Object[] toObject(GenericRecord data) {..}

  • Implement methods to serialize and deserialize the Avro record to and from a string, which is the wire format

      /**
       * {@inheritDoc}
       */
      @Override
      public void write(DataOutput out) throws IOException {
        // TODO
      }

      /**
       * {@inheritDoc}
       */
      @Override
      public void read(DataInput in) throws IOException {
        // TODO
      }
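
One possible shape for the conversions and the string wire format listed above is sketched here. It is not the actual implementation: a Map again stands in for the Avro GenericRecord, a list of column names stands in for the Sqoop schema, the class name WireFormatSketch is hypothetical, every value is kept as a String, and the text form is a bare comma split and join with no quoting or escaping.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical sketch; a Map stands in for the Avro GenericRecord. */
final class WireFormatSketch {
  private final List<String> columnNames;  // stand-in for the Sqoop schema
  private Map<String, Object> record = new LinkedHashMap<>();

  WireFormatSketch(List<String> columnNames) {
    this.columnNames = columnNames;
  }

  /** Text to record: split on commas and pair each value with its column. */
  void setTextData(String csv) {
    String[] parts = csv.split(",", -1);
    if (parts.length != columnNames.size()) {
      throw new IllegalArgumentException("expected " + columnNames.size() + " fields");
    }
    Map<String, Object> parsed = new LinkedHashMap<>();
    for (int i = 0; i < parts.length; i++) {
      parsed.put(columnNames.get(i), parts[i]);
    }
    record = parsed;
  }

  /** Record to text: join the values in column order. */
  String getTextData() {
    StringJoiner joiner = new StringJoiner(",");
    for (String name : columnNames) {
      joiner.add(String.valueOf(record.get(name)));
    }
    return joiner.toString();
  }

  /** Record to object array, in column order. */
  Object[] getObjectData() {
    Object[] out = new Object[columnNames.size()];
    for (int i = 0; i < out.length; i++) {
      out[i] = record.get(columnNames.get(i));
    }
    return out;
  }

  /** Wire format: the text form as one length-prefixed string (writeUTF caps it at 65535 encoded bytes). */
  void write(DataOutput out) throws IOException {
    out.writeUTF(getTextData());
  }

  /** Reads the string written by write() and rebuilds the record from it. */
  void read(DataInput in) throws IOException {
    setTextData(in.readUTF());
  }
}
```

Writing one instance to a DataOutputStream and reading it back into a second instance built with the same column names should yield the same text form and object array.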
    
    


    Column Type             | Object Format                         | Avro Format / Field Type
    ------------------------+---------------------------------------+---------------------------------------------------------
    NULL value in the field | java null                             | UNION for any field that is nullable; Schema.Type.NULL
    ARRAY                   | java Object[]                         | Schema.Type.ARRAY
    BINARY                  | java byte[]                           | Schema.Type.BYTES
    BIT                     | java boolean                          | Schema.Type.BOOLEAN
    DATE                    | org.joda.time.LocalDate               | Schema.Type.LONG
    DATE_TIME               | org.joda.time.DateTime or             | Schema.Type.LONG
                            | org.joda.time.LocalDateTime           |
                            | (depends on timezone attribute)       |
    DECIMAL                 | java BigDecimal                       | ??
    ENUM                    | java String                           | Schema.Type.ENUM
    FIXED_POINT             | java Integer or java Long             | Schema.Type.INT if the FixedPoint column's
                            | (depends on byteSize attribute)       | getByteSize() <= Integer.SIZE, else Schema.Type.LONG
    FLOATING_POINT          | java Double or java Float             | Schema.Type.FLOAT if the FloatingPoint column's
                            | (depends on byteSize attribute)       | getByteSize() <= Float.SIZE, else Schema.Type.DOUBLE
    MAP                     | java.util.Map<Object, Object>         | Schema.Type.MAP
    SET                     | java Object[]                         | Schema.Type.ARRAY
    TEXT                    | java String                           | Schema.Type.STRING
    TIME                    | org.joda.time.LocalTime (no timezone) | Schema.Type.LONG
    UNKNOWN                 | same as java byte[]                   | Schema.Type.BYTES
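
The mapping in the table could be expressed as a single switch, sketched below. The enums ColumnType and AvroType are hypothetical stand-ins for the Sqoop column types and Avro's Schema.Type, so the example compiles without either library. One assumption deserves attention: Integer.SIZE and Float.SIZE are bit counts (32), so the sketch assumes the byteSize attribute is expressed in bytes and compares against Integer.BYTES and Float.BYTES instead. DECIMAL is left unmapped because the table leaves it open. The nullableField helper shows the Avro JSON for a nullable field as a union with "null", and is only valid as written for primitive type names.

```java
/** Hypothetical sketch; the enums stand in for Sqoop column types and Avro's Schema.Type. */
final class TypeMappingSketch {
  enum ColumnType { ARRAY, BINARY, BIT, DATE, DATE_TIME, DECIMAL, ENUM, FIXED_POINT,
                    FLOATING_POINT, MAP, SET, TEXT, TIME, UNKNOWN }

  enum AvroType { ARRAY, BYTES, BOOLEAN, LONG, ENUM, INT, FLOAT, DOUBLE, MAP, STRING }

  /** byteSize is only consulted for FIXED_POINT and FLOATING_POINT (assumed to be in bytes). */
  static AvroType toAvroType(ColumnType type, long byteSize) {
    switch (type) {
      case ARRAY:
      case SET:
        return AvroType.ARRAY;
      case BINARY:
      case UNKNOWN:
        return AvroType.BYTES;
      case BIT:
        return AvroType.BOOLEAN;
      case DATE:
      case DATE_TIME:
      case TIME:
        return AvroType.LONG;
      case ENUM:
        return AvroType.ENUM;
      case FIXED_POINT:
        return byteSize <= Integer.BYTES ? AvroType.INT : AvroType.LONG;
      case FLOATING_POINT:
        return byteSize <= Float.BYTES ? AvroType.FLOAT : AvroType.DOUBLE;
      case MAP:
        return AvroType.MAP;
      case TEXT:
        return AvroType.STRING;
      case DECIMAL:  // representation still undecided in the table
      default:
        throw new IllegalArgumentException("no Avro mapping decided for " + type);
    }
  }

  /** Avro JSON for a nullable field: a union of "null" and a primitive type name. */
  static String nullableField(String name, String primitiveType) {
    return "{\"name\":\"" + name + "\",\"type\":[\"null\",\"" + primitiveType + "\"]}";
  }
}
```

For example, nullableField("name", "string") produces {"name":"name","type":["null","string"]}.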

External Jar Dependencies added?

...

  • The Avro 1.7 release that we use does not yet support date, date-time, and time as first-class types. Until we move to 1.8, represent these values as long

  • How should we represent DECIMAL? Should we use the "FIXED" type in Avro?


  • Handling nulls via the union type: is this OK?
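
To make the date and decimal questions above concrete, here is a rough sketch of one possible encoding. Everything in it is an assumption for discussion rather than a decision: java.time stands in for the Joda-Time classes, dates are stored as days since 1970-01-01 and times as milliseconds since midnight, and a decimal is stored as a fixed-size, big-endian two's-complement unscaled value with the scale kept in the schema. The class and method names are hypothetical.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.time.LocalDate;
import java.time.LocalTime;
import java.util.Arrays;

/** Hypothetical sketch; java.time stands in for the Joda-Time classes. */
final class TemporalAndDecimalSketch {
  /** DATE as a long: days since 1970-01-01 (an assumed convention). */
  static long dateToLong(LocalDate date) {
    return date.toEpochDay();
  }

  static LocalDate longToDate(long days) {
    return LocalDate.ofEpochDay(days);
  }

  /** TIME as a long: milliseconds since midnight (an assumed convention). */
  static long timeToLong(LocalTime time) {
    return time.toNanoOfDay() / 1_000_000L;
  }

  /** DECIMAL as a fixed-size, big-endian two's-complement unscaled value. */
  static byte[] decimalToFixed(BigDecimal value, int scale, int size) {
    // setScale without a rounding mode throws if the value would need rounding.
    byte[] unscaled = value.setScale(scale).unscaledValue().toByteArray();
    if (unscaled.length > size) {
      throw new IllegalArgumentException("value does not fit in " + size + " bytes");
    }
    byte[] fixed = new byte[size];
    // Sign-extend: pad with 0xFF for negative values, 0x00 otherwise.
    Arrays.fill(fixed, value.signum() < 0 ? (byte) 0xFF : (byte) 0x00);
    System.arraycopy(unscaled, 0, fixed, size - unscaled.length, unscaled.length);
    return fixed;
  }

  /** Rebuilds the decimal from the fixed bytes and the scale stored in the schema. */
  static BigDecimal fixedToDecimal(byte[] fixed, int scale) {
    return new BigDecimal(new BigInteger(fixed), scale);
  }
}
```

The trade-off with a fixed-size representation is that the size caps the precision, so the size and scale would have to come from the Sqoop column definition.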

...