
...

Code Block
  /**
   * Returns the {@linkplain IntermediateDataFormat} that this connector
   * produces natively. The format must also support retrieving the data as
   * text and as an array of objects. This method should never return null.
   *
   * @return the {@linkplain IntermediateDataFormat} implementation class
   */
  public Class<? extends IntermediateDataFormat<?>> getIntermediateDataFormat() {
    return AvroIntermediateDataFormat.class;
  }

 

Background

Read IDF API for more information on the core aspects of the IDF.

Requirements

  • Create an IDF implementation that represents Sqoop data as an Avro GenericRecord.

  • The source of truth held in memory is the Avro record, which is the native format. The remaining formats, i.e. text and object array, are constructed lazily, if and when the code using the IDF requests them.

  • Provide a reliable way to convert from the Sqoop schema to an Avro schema for all 14 supported Sqoop data types.
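
The lazy construction described above could look roughly like the following. This is a minimal sketch, not the Sqoop implementation: a plain Map stands in for the Avro GenericRecord so that it runs without the Avro jar, the class and member names are hypothetical, and the text encoding (NULL for null values, single-quoted strings, no escaping) is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative only. A Map stands in for the Avro GenericRecord so that the
 * sketch runs without the Avro jar; every name here is hypothetical.
 */
final class LazyFormatsSketch {
  private final Map<String, Object> record; // native form, the source of truth
  private String csv;                       // derived, built on first request
  private Object[] objects;                 // derived, built on first request
  int csvBuilds = 0;                        // counts conversions, for the demo

  LazyFormatsSketch(Map<String, Object> record) {
    this.record = new LinkedHashMap<>(record); // keep column order
  }

  /** Builds the text form once, then returns the cached copy. */
  String getCsv() {
    if (csv == null) {
      List<String> fields = new ArrayList<>();
      for (Object value : record.values()) {
        if (value == null) {
          fields.add("NULL");
        } else if (value instanceof String) {
          fields.add("'" + value + "'"); // simplified: no escaping
        } else {
          fields.add(String.valueOf(value));
        }
      }
      csv = String.join(",", fields);
      csvBuilds++;
    }
    return csv;
  }

  /** Builds the object array once, then returns a copy of the cached array. */
  Object[] getObjects() {
    if (objects == null) {
      objects = record.values().toArray();
    }
    return objects.clone();
  }

  public static void main(String[] args) {
    Map<String, Object> row = new LinkedHashMap<>();
    row.put("id", 1L);
    row.put("name", "alice");
    row.put("note", null);
    LazyFormatsSketch idf = new LazyFormatsSketch(row);
    System.out.println(idf.getCsv());   // 1,'alice',NULL
    idf.getCsv();                       // second call reuses the cached text
    System.out.println(idf.csvBuilds);  // 1
  }
}
```

The derived forms are only valid while the native record is unchanged, so a real implementation would presumably also clear the cached text and object array whenever the record is replaced.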

Design

  • Extend the IDF API

 

Code Block
/**
 * IDF that represents the intermediate data as an Avro GenericRecord
 */
public class AvroIntermediateDataFormat extends IntermediateDataFormat<GenericRecord> {

 

...}
  • The Sqoop schema is mandatory, since a schema is needed to construct an Avro record

    Code Block
     // convert the sqoop schema to avro schema
      public AvroIntermediateDataFormat(org.apache.sqoop.schema.Schema schema) {
        super.setSchema(schema);
      } 
  • Implement a method to convert csv text to avro GenericRecord
    •   private GenericRecord toAvro(String csv) {..}

  • Implement a method to convert the object array to an Avro GenericRecord
    •   private GenericRecord toAvro(Object[] data) { ..}

  • Conversely, implement a method to lazily construct the csv from avro GenericRecord when invoked
    •   private String toCSV(GenericRecord record) { ..}

  • Implement a method to lazily construct the object array from the Avro GenericRecord when invoked
    •   private Object[] toObject(GenericRecord data) {...}

  • Implement methods to serialize and deserialize the Avro record to and from a string, which is the wire format

    Code Block
    /**
       * {@inheritDoc}
       */
      @Override
      public void write(DataOutput out) throws IOException {
       // todo
      }
      /**
       * {@inheritDoc}
       */
      @Override
      public void read(DataInput in) throws IOException {
        // todo
      }
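
One possible way to fill in the two stubs, assuming the string form of the record is what travels over the wire. The class name and the roundTrip helper are hypothetical, and the sketch uses only java.io so that it runs on its own; an actual implementation might instead write Avro binary data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Illustrative only: carries the record as a single string on the wire. */
final class WireFormatSketch {
  private String text; // string form of the record (assumed to be CSV text)

  WireFormatSketch(String text) {
    this.text = text;
  }

  String getText() {
    return text;
  }

  /** Writes the string form using a length-prefixed modified UTF-8 encoding. */
  public void write(DataOutput out) throws IOException {
    out.writeUTF(text);
  }

  /** Reads back exactly what write() produced and replaces the current text. */
  public void read(DataInput in) throws IOException {
    text = in.readUTF();
  }

  /** Hypothetical helper: serializes src to a byte buffer and reads it into a new object. */
  static WireFormatSketch roundTrip(WireFormatSketch src) {
    try {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      src.write(new DataOutputStream(buffer));
      WireFormatSketch copy = new WireFormatSketch("");
      copy.read(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
      return copy;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    WireFormatSketch copy = roundTrip(new WireFormatSketch("1,'alice',NULL"));
    System.out.println(copy.getText()); // 1,'alice',NULL
  }
}
```

Note that writeUTF rejects strings whose encoded form exceeds 65535 bytes, so very wide records would need a different encoding, for example an explicit length followed by raw bytes.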
    
    


  • Mappings from Sqoop to Avro types.

 

| Column Type | Object Format | Avro Format / Field Type |
|---|---|---|
| NULL value in the field | java null | UNION for any field that is nullable; Schema.Type.NULL |
| ARRAY | java Object[] | Schema.Type.ARRAY |
| BINARY | java byte[] | Schema.Type.BYTES |
| BIT | java boolean | Schema.Type.BOOLEAN |
| DATE | org.joda.time.LocalDate | Schema.Type.LONG |
| DATE_TIME | org.joda.time.DateTime or org.joda.time.LocalDateTime (depends on timezone attribute) | Schema.Type.LONG |
| DECIMAL | java BigDecimal | Schema.Type.FIXED ??? |
| ENUM | java String | Schema.Type.ENUM |
| FIXED_POINT | java Integer or java Long (depends on byteSize attribute) | if (((org.apache.sqoop.schema.type.FixedPoint) column).getByteSize() <= Integer.SIZE) { return Schema.Type.INT; } else { return Schema.Type.LONG; } |
| FLOATING_POINT | java Double or java Float (depends on byteSize attribute) | if (((org.apache.sqoop.schema.type.FloatingPoint) column).getByteSize() <= Float.SIZE) { return Schema.Type.FLOAT; } else { return Schema.Type.DOUBLE; } |
| MAP | java.util.Map<Object, Object> | Schema.Type.MAP |
| SET | java Object[] | Schema.Type.ARRAY |
| TEXT | java String | Schema.Type.STRING |
| TIME | org.joda.time.LocalTime (No Timezone) | Schema.Type.LONG |
| UNKNOWN | same as java byte[] | Schema.Type.BYTES |

...
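
The mapping table can be condensed into a single lookup, sketched below. The two enums are hypothetical stand-ins for the Sqoop column types and for org.apache.avro.Schema.Type, so the example runs without either library. The sketch assumes byteSize is measured in bytes and therefore compares it against Integer.BYTES and Float.BYTES; the snippets in the table compare against Integer.SIZE and Float.SIZE, which are bit counts (32), so the two are not equivalent. DECIMAL is mapped to FIXED only because the table lists it, with the open question left unresolved.

```java
import java.util.EnumSet;
import java.util.Locale;

/** Illustrative only: stand-in enums, not the Sqoop or Avro classes. */
final class TypeMappingSketch {
  enum ColumnType {
    ARRAY, BINARY, BIT, DATE, DATE_TIME, DECIMAL, ENUM, FIXED_POINT,
    FLOATING_POINT, MAP, SET, TEXT, TIME, UNKNOWN
  }

  enum AvroType { ARRAY, BYTES, BOOLEAN, LONG, FIXED, ENUM, INT, FLOAT, DOUBLE, MAP, STRING }

  /** Maps a column type to an Avro type; byteSize only matters for numeric columns. */
  static AvroType toAvroType(ColumnType type, long byteSize) {
    switch (type) {
      case ARRAY:
      case SET:
        return AvroType.ARRAY;
      case BINARY:
      case UNKNOWN:
        return AvroType.BYTES;
      case BIT:
        return AvroType.BOOLEAN;
      case DATE:
      case DATE_TIME:
      case TIME:
        return AvroType.LONG;
      case DECIMAL:
        return AvroType.FIXED; // marked as an open question in the table
      case ENUM:
        return AvroType.ENUM;
      case FIXED_POINT:
        return byteSize <= Integer.BYTES ? AvroType.INT : AvroType.LONG;
      case FLOATING_POINT:
        return byteSize <= Float.BYTES ? AvroType.FLOAT : AvroType.DOUBLE;
      case MAP:
        return AvroType.MAP;
      case TEXT:
        return AvroType.STRING;
      default:
        throw new IllegalArgumentException("Unsupported column type: " + type);
    }
  }

  /** Renders the Avro JSON type of a primitive field, as a union with "null" when nullable. */
  static String fieldTypeJson(AvroType type, boolean nullable) {
    if (EnumSet.of(AvroType.ARRAY, AvroType.FIXED, AvroType.ENUM, AvroType.MAP).contains(type)) {
      throw new IllegalArgumentException("Complex types need a full schema object: " + type);
    }
    String name = "\"" + type.name().toLowerCase(Locale.ROOT) + "\"";
    return nullable ? "[\"null\", " + name + "]" : name;
  }

  public static void main(String[] args) {
    System.out.println(toAvroType(ColumnType.FIXED_POINT, 4));  // INT
    System.out.println(toAvroType(ColumnType.FIXED_POINT, 8));  // LONG
    System.out.println(fieldTypeJson(AvroType.STRING, true));   // ["null", "string"]
  }
}
```

The fieldTypeJson helper only covers primitive Avro types, because array, map, enum and fixed types are declared as full schema objects rather than by name.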

 

External Jar Dependencies added?

...