...

In simple terms, every data source has one thing in common: it is a collection of rows, and each row is a collection of fields (columns). Most, if not all, data sources have a strict schema that specifies the type of each field.

Native format - each row in the data source is a native object. For instance, in JSONIDF an entire row and its fields are represented in Sqoop as a JSON object; in AvroIDF an entire row and its fields are represented as an Avro record.
CSV text format - each row and its fields are represented as CSV text.
Object Array format - each field in the row is an element of an object array, so a row in the data source is represented as an object array.
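
To make the difference between the Object Array format and the CSV text format concrete, here is a minimal sketch that turns one row from the former into the latter. The class and method names are hypothetical and this is not Sqoop's implementation. It assumes a comma as the field separator and omits escaping inside text fields.

```java
final class RowFormatsSketch {
    // Marker used for a NULL field in the CSV text format.
    static final String NULL_FIELD = "NULL";

    // Hypothetical helper: renders an object-array row as one CSV text line.
    // Assumption: fields are separated by a comma; text escaping is omitted.
    static String toCsv(Object[] row) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < row.length; i++) {
            if (i > 0) {
                out.append(',');
            }
            Object field = row[i];
            if (field == null) {
                out.append(NULL_FIELD);
            } else if (field instanceof Number || field instanceof Boolean) {
                out.append(field);                            // not enclosed in quotes
            } else {
                out.append('\'').append(field).append('\'');  // text in single quotes
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Object[] row = {42L, "sqoop", true, null};  // Object Array format
        System.out.println(toCsv(row));             // prints 42,'sqoop',true,NULL
    }
}
```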

...

NOTE: The CSV text format and the Object Array format are custom formats prescribed by Sqoop. The details of these formats for every supported column type in the schema are described below.

Design goals

A few prior documents describe the design goals, but they are not crystal clear. Refer to this doc for some context on the research done prior to defining the IDF API. It explains some of the goals of using the CSV and Object Array formats. Some of the design influence comes from its predecessor, Sqoop1.

Support data transfer across connectors using an "internal" in-memory data representation.
CSV is a common format in many databases, and hence Sqoop's design primarily optimizes for such data sources. However, it is unclear how much of a performance gain CSV text provides.
The following is a Javadoc comment pulled from the code that explains the choice of CSV.

Why a "native" internal format and then return CSV text too?
Imagine a connector that moves data from a system that stores data as a
serialization format called FooFormat. If I also need the data to be
written into HDFS as FooFormat, the additional cycles burnt in converting
the FooFormat to text and back is useless - so using the sqoop specified
CSV text format saves those extra cycles
<p/>
Most fast access mechanisms, like mysqldump or pgsqldump write the data
out as CSV, and most often the source data is also represented as CSV
- so having a minimal CSV support is mandated for all IDF, so we can easily read the
data out as text and write as text.

...

Column Type

CSV Format

Notes

NULL value in the field

public static final String NULL_FIELD = "NULL";

ARRAY

Encoded as a string (and hence enclosed in single quotes). Inside the quotes, the top-level array elements are JSON-encoded (hence the entire value is enclosed in a [] pair). Nested values are not JSON-encoded.
A few examples:
- Array of FixedPoint: '[1,2,3]'
- Array of Text: '["A","B","C"]'
- Array of Objects of type FixedPoint: '["[11, 12]","[14, 15]"]'
- Array of Objects of type Text: '["[A, B]","[X, Y]"]'
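
The ARRAY rule can be sketched roughly as follows. This is a hypothetical helper, not Sqoop's code. It assumes numbers are written bare and every other top-level element is written as a JSON string of its toString() value, which is how the nested examples above appear. Any further TEXT-style escaping of the resulting string is left out.

```java
import java.util.Arrays;
import java.util.List;

final class ArrayCsvSketch {
    // Hypothetical helper: JSON-encodes only the top-level elements,
    // then wraps the result in [] and single quotes.
    static String encodeArray(List<?> elements) {
        StringBuilder out = new StringBuilder("'[");
        for (int i = 0; i < elements.size(); i++) {
            if (i > 0) {
                out.append(',');
            }
            Object element = elements.get(i);
            if (element instanceof Number) {
                out.append(element);
            } else {
                // Nested values are not JSON-encoded: they become a plain string.
                out.append('"').append(jsonEscape(String.valueOf(element))).append('"');
            }
        }
        return out.append("]'").toString();
    }

    // Minimal JSON string escaping (backslash and double quote only).
    static String jsonEscape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        System.out.println(encodeArray(Arrays.asList(1, 2, 3)));
        System.out.println(encodeArray(Arrays.asList("A", "B", "C")));
        System.out.println(encodeArray(Arrays.asList(Arrays.asList(11, 12), Arrays.asList(14, 15))));
    }
}
```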

BINARY

byte array enclosed in quotes and encoded with ISO-8859-1 charset
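
ISO-8859-1 is a natural choice here because it maps every byte value 0-255 to exactly one character, so the bytes survive a round trip through text. A minimal sketch with hypothetical helper names follows. Escaping of special bytes inside the quotes is not covered.

```java
import java.nio.charset.StandardCharsets;

final class BinaryCsvSketch {
    // Hypothetical helper: bytes -> quoted ISO-8859-1 text.
    static String encodeBinary(byte[] bytes) {
        return "'" + new String(bytes, StandardCharsets.ISO_8859_1) + "'";
    }

    // Hypothetical helper: strips the enclosing quotes and recovers the bytes.
    static byte[] decodeBinary(String csv) {
        String body = csv.substring(1, csv.length() - 1);
        return body.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] original = {0x53, 0x51, (byte) 0xFF};
        byte[] roundTrip = decodeBinary(encodeBinary(original));
        System.out.println(java.util.Arrays.equals(original, roundTrip)); // prints true
    }
}
```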

BIT

true, TRUE, 1

false, FALSE, 0

( not encoded in quotes )

Unsupported values should throw an exception
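
A minimal sketch of the BIT rule, with a hypothetical method name. The exception type is an assumption, since the text only says that an exception should be thrown.

```java
final class BitCsvSketch {
    // Hypothetical helper: accepts exactly the six literals listed above.
    static boolean parseBit(String csv) {
        switch (csv) {
            case "true":
            case "TRUE":
            case "1":
                return true;
            case "false":
            case "FALSE":
            case "0":
                return false;
            default:
                // Assumption: IllegalArgumentException stands in for whatever
                // exception the real implementation throws.
                throw new IllegalArgumentException("Unsupported BIT value: " + csv);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseBit("TRUE")); // prints true
        System.out.println(parseBit("0"));    // prints false
    }
}
```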

DATE

YYYY-MM-DD ( no time zone)

DATE_TIME

YYYY-MM-DD HH:MM:SS[.ZZZ][+/-XX] ( fraction and timezone are optional )

DECIMAL

BigDecimal ( not encoded in quotes )

ENUM

Same as TEXT

FIXED_POINT

integer or long, ( not encoded in quotes )

FLOATING_POINT

float or double ( not encoded in quotes )

MAP

Encoded as a string (and hence enclosed in single quotes). Inside the quotes, the map is JSON-encoded (hence the entire value is enclosed in a { } pair). A few examples:
Map<Number, Number> - '{1:20}'
Map<String, String> - '{"testKey":"testValue"}'
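
The MAP rule can be sketched along the same lines. This is a hypothetical helper, not Sqoop's code. It assumes numeric keys and values are written bare and everything else is written as a JSON string. Any further TEXT-style escaping of the result is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MapCsvSketch {
    // Hypothetical helper: writes the entries inside {} and single quotes.
    static String encodeMap(Map<?, ?> map) {
        StringBuilder out = new StringBuilder("'{");
        boolean first = true;
        for (Map.Entry<?, ?> entry : map.entrySet()) {
            if (!first) {
                out.append(',');
            }
            first = false;
            out.append(token(entry.getKey())).append(':').append(token(entry.getValue()));
        }
        return out.append("}'").toString();
    }

    // Assumption: numbers are bare, everything else is a JSON string.
    static String token(Object value) {
        if (value instanceof Number) {
            return value.toString();
        }
        String s = String.valueOf(value).replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + s + "\"";
    }

    public static void main(String[] args) {
        Map<String, String> map = new LinkedHashMap<>(); // keeps insertion order
        map.put("testKey", "testValue");
        System.out.println(encodeMap(map)); // prints '{"testKey":"testValue"}'
    }
}
```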

SET

same as ARRAY

TEXT

The entire string is enclosed in single quotes, and all bytes are written as they are, with the exception of the following bytes:

Byte	Encoded as
0x5C	\\
0x27	\'
0x22	\"
0x1A	\Z
0x0D	\r
0x0A	\n
0x00	\0
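
The escaping table above can be sketched as a single pass over the input. The class and method names are hypothetical. The sketch works on Java chars rather than raw bytes, which gives the same result for the seven single-byte values listed.

```java
final class TextCsvSketch {
    // Hypothetical helper: applies the escape table and adds the enclosing quotes.
    static String encodeText(String value) {
        StringBuilder out = new StringBuilder("'");
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '\\': out.append("\\\\"); break; // 0x5C
                case '\'': out.append("\\'");  break; // 0x27
                case '"':  out.append("\\\""); break; // 0x22
                case 0x1A: out.append("\\Z");  break;
                case '\r': out.append("\\r");  break; // 0x0D
                case '\n': out.append("\\n");  break; // 0x0A
                case 0x00: out.append("\\0");  break;
                default:   out.append(c);
            }
        }
        return out.append('\'').toString();
    }

    public static void main(String[] args) {
        // prints 'it\'s a \"test\"\n'
        System.out.println(encodeText("it's a \"test\"\n"));
    }
}
```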

TIME

HH:MM:SS[.ZZZ] ( fraction is optional )

Only 3-digit millisecond precision is supported for time
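
For reference, the DATE, DATE_TIME and TIME layouts map onto java.time patterns as sketched below. This assumes the last time field is seconds and always writes the optional 3-digit fraction. The optional timezone suffix and any enclosing quotes are left out, and the class and constant names are hypothetical.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

final class TemporalCsvSketch {
    // YYYY-MM-DD
    static final DateTimeFormatter DATE = DateTimeFormatter.ofPattern("uuuu-MM-dd");
    // YYYY-MM-DD HH:MM:SS.ZZZ (fraction always written in this sketch)
    static final DateTimeFormatter DATE_TIME = DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss.SSS");
    // HH:MM:SS.ZZZ
    static final DateTimeFormatter TIME = DateTimeFormatter.ofPattern("HH:mm:ss.SSS");

    public static void main(String[] args) {
        System.out.println(DATE.format(LocalDate.of(2015, 1, 2)));
        System.out.println(DATE_TIME.format(LocalDateTime.of(2015, 1, 2, 3, 4, 5, 6_000_000)));
        System.out.println(TIME.format(LocalTime.of(3, 4, 5, 6_000_000)));
    }
}
```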

UNKNOWN

same as BINARY

...
