...

In simple terms, every data source has one thing in common: it is a collection of rows, and each row is a collection of fields (columns). Most, if not all, data sources have a strict schema that specifies the type of each field.

 

  1. Native format - each row in the data source is a native object. For instance, in JSONIDF an entire row and its fields are represented in Sqoop as a JSON object; in AvroIDF an entire row and its fields are represented as an Avro record.
  2. CSV text format - each row and its fields are represented as CSV text.
  3. Object Array format - each field in the row is an element in an object array, so a row in the data source is represented as an object array.
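
As a rough illustration of the CSV text and Object Array representations listed above, the sketch below holds one sample row in both forms. The class name, method names and sample values are hypothetical and are not part of the Sqoop API; the comma separator is an assumption, and the single quotes around the text field follow the TEXT rule given in the column type table.

```java
import java.util.Arrays;

// Hypothetical illustration only; not a Sqoop class.
class RowFormsSketch {
    // Object Array format: one array element per field of the row.
    static Object[] objectArrayRow() {
        return new Object[] { 42L, "sqoop", Boolean.TRUE };
    }

    // CSV text format: the same row as one line of text.
    // Numbers and booleans are unquoted, text is enclosed in single quotes.
    static String csvTextRow() {
        return "42,'sqoop',true";
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(objectArrayRow()));
        System.out.println(csvTextRow());
    }
}
```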

...

NOTE: The CSV text format and the Object Array format are custom formats prescribed by Sqoop. The details of these formats for every supported column type in the schema are described below.

Design goals

A few prior documents describe the design goals, but none of them is crystal clear. Refer to this doc for some context on the research done prior to defining the IDF API. It explains some of the goals of using the CSV and Object Array formats. Some of the design influence comes from its predecessor, Sqoop1.

  1. Support data transfer across connectors using an "internal" in-memory data representation.
  2. CSV is a common format in many databases, so Sqoop's design goals primarily optimize for such data sources. It is unclear, however, how much of a performance gain CSV text provides.
    The following Javadoc comment, pulled from the code, explains the choice of CSV.

     

    Why a "native" internal format and then return CSV text too?
    Imagine a connector that moves data from a system that stores data as a
    serialization format called FooFormat. If I also need the data to be
    written into HDFS as FooFormat, the additional cycles burnt in converting
    the FooFormat to text and back is useless - so using the sqoop specified
    CSV text format saves those extra cycles
    <p/>
    Most fast access mechanisms, like mysqldump or pgsqldump write the data
    out as CSV, and most often the source data is also represented as CSV
    - so having a minimal CSV support is mandated for all IDF, so we can easily read the
    data out as text and write as text.

     

...

Column Type | CSV Format | Notes
NULL value in the field

  public static final String NULL_FIELD = "NULL";

 
ARRAY
  • Encoded as a String (and hence enclosed in single quotes). Inside the quotes is the JSON encoding of the top-level array elements (hence the entire value is enclosed in a [ ] pair). Nested values are not JSON encoded.
  • A few examples:
    • Array of FixedPoint '[1,2,3]'
    • Array of Text '["A","B","C"]'
    • Array of Objects of type FixedPoint '["[11, 12]","[14, 15]"]'
    • Array of Objects of type Text '["[A, B]","[X, Y]"]'

 

BINARY
byte array enclosed in quotes and encoded with ISO-8859-1 charset 
BIT

true, TRUE, 1

false, FALSE, 0

( not encoded in quotes )

Unsupported values should throw an exception
DATE
YYYY-MM-DD (no time zone)
DATE_TIME
YYYY-MM-DD HH:MM:SS[.ZZZ][+/-XX] (fraction and time zone are optional)
DECIMAL
BigDecimal (not encoded in quotes)
ENUM
Same as TEXT 
FIXED_POINT
integer or long, ( not encoded in quotes ) 
FLOATING_POINT
float or double ( not encoded in quotes ) 
MAP
  • Encoded as a String (and hence enclosed in single quotes). Inside the quotes is the JSON encoding of the map (hence the entire value is enclosed in a { } pair). A few examples:
  • Map<Number, Number> '{1:2,2:320}'
  • Map<String, String> '{"key1":"value1","key2":"value2"}'
 
SET
same as ARRAY 
TEXT

The entire string is enclosed in single quotes and all bytes are printed as they are, with the exception of the following bytes:

Byte    Encoded as
0x5C    \\
0x27    \'
0x22    \"
0x1A    \Z
0x0D    \r
0x0A    \n
0x00    \0

 
TIME
HH:MM:SS[.ZZZ] (fraction is optional). Only 3-digit millisecond precision is supported for time.
UNKNOWN
same as BINARY 
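
The rules in the table above can be sketched in a few lines of Java. This is a minimal illustration covering only the NULL, TEXT, BIT and unquoted numeric cases, not the Sqoop implementation: the class and method names are hypothetical, the comma field separator is an assumption, and only NULL_FIELD is taken from the table.

```java
// Hypothetical sketch of a few CSV text format rules; not a Sqoop class.
class CsvFieldSketch {
    static final String NULL_FIELD = "NULL";

    // TEXT: escape the bytes listed in the table, then enclose in single quotes.
    static String encodeText(String value) {
        StringBuilder sb = new StringBuilder("'");
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '\\': sb.append("\\\\"); break; // 0x5C
                case '\'': sb.append("\\'");  break; // 0x27
                case '"':  sb.append("\\\""); break; // 0x22
                case 0x1A: sb.append("\\Z");  break;
                case '\r': sb.append("\\r");  break; // 0x0D
                case '\n': sb.append("\\n");  break; // 0x0A
                case 0x00: sb.append("\\0");  break;
                default:   sb.append(c);
            }
        }
        return sb.append('\'').toString();
    }

    // BIT: accept only the listed spellings; anything else throws.
    static boolean decodeBit(String csv) {
        switch (csv) {
            case "true": case "TRUE": case "1":
                return true;
            case "false": case "FALSE": case "0":
                return false;
            default:
                throw new IllegalArgumentException("Unsupported BIT value: " + csv);
        }
    }

    // Encodes a single field: null becomes NULL_FIELD, strings follow the TEXT
    // rule, and other values (Long, Double, Boolean here) are written unquoted.
    static String encodeField(Object value) {
        if (value == null) {
            return NULL_FIELD;
        }
        if (value instanceof String) {
            return encodeText((String) value);
        }
        return value.toString();
    }

    // Joins the encoded fields of an object-array row (separator is assumed).
    static String encodeRow(Object[] row) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < row.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(encodeField(row[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: 7,'it\'s',NULL,true
        System.out.println(encodeRow(new Object[] { 7L, "it's", null, true }));
    }
}
```

The sketch deliberately leaves out ARRAY, MAP, BINARY and the date and time types, which need JSON encoding, charset handling and date formatting respectively.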

...