Warning

This wiki does not yet cover the representation of complex types such as Array, Map, and NestedArray that will be used inside one of the CSV implementations.

See Intermediate Data Format API for the 1.99.5 version.

Intermediate representation

...

I would like to propose a suitable intermediate representation for Sqoop 2, based on my research of current solutions. I have come to the conclusion that none of the existing formats is fully suitable. The most promising formats are mysqldump and pg_dump; however, both have issues: mysqldump does not support time zones or the special floating-point constants (NaN, Infinity), whereas pg_dump does not escape newline characters, which might break any subsequent Hadoop processing. Therefore I would like to propose a combination of both formats. Each row will be represented as a single line (no newline characters are allowed), with all columns present in a CSV structure that uses a comma as the column separator. Data types will be encoded as follows:

Warning

Note: the formats for date-related types have changed. See the comment below for more up-to-date information on the changes in the spec, or refer to Intermediate Data Format API for the latest details as of 1.99.5.

| Data type | Serialized as |
|---|---|
| BIT | String (array of bits rounded up to a whole byte; for example, 20 bits are rounded up to 24 bits/3 bytes) |
| INT (small, big, ...) | Direct value (666) |
| BOOL | Direct number (1 or 0) |
| DECIMAL (fixed, ...) | Direct value (66.6) |
| FLOAT (double, ...) | Direct value, might be in scientific notation (666.6, 5.5e-39), and the special string constants 'Infinity', '-Infinity', 'NaN' |
| DATE | String with format YYYY-MM-DD[+/-XX] (2012-01-01) |
| DATETIME | String with format YYYY-MM-DD HH:MM:SS[.ZZZZZZ][+/-XX] (2012-01-01 09:09:09) |
| TIMESTAMP | String with format YYYY-MM-DD HH:MM:SS[.ZZZZZZ][+/-XX] (2012-01-01 09:09:09) |
| TIME | String with format HH:MM:SS[.ZZZZZZ][+/-XX] (09:09:09) |
| CHAR (varchar, text, ...) | String |
| BINARY (blob, ...) | String |
| ENUM | String with enumerated value |
| SET | String with comma-separated enumerated values |
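
To make the encoding rules above concrete, here is a minimal Python sketch of how a row might be turned into a single comma-separated line. It is an illustration only, not the Sqoop 2 implementation: the names `encode_value` and `encode_row` are hypothetical, and the sketch covers only the numeric, boolean, and date/time types. How string values (CHAR, BINARY, ENUM, SET, BIT) are quoted or escaped is not specified in this section, so those types are rejected rather than guessed at, and the special floating-point constants and date strings are emitted as bare text. The optional fractional seconds and time zone offset are omitted.

```python
import math
from datetime import date, datetime, time
from decimal import Decimal


def encode_value(value):
    """Encode one column value (hypothetical helper, not part of Sqoop)."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "1" if value else "0"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, Decimal):
        # Fixed-point notation, so DECIMAL is written as a direct value.
        return format(value, "f")
    if isinstance(value, float):
        if math.isnan(value):
            return "NaN"
        if math.isinf(value):
            return "Infinity" if value > 0 else "-Infinity"
        # repr() may produce scientific notation such as 5.5e-39.
        return repr(value)
    # datetime must be tested before date: datetime is a subclass of date.
    if isinstance(value, datetime):
        return value.strftime("%Y-%m-%d %H:%M:%S")
    if isinstance(value, date):
        return value.strftime("%Y-%m-%d")
    if isinstance(value, time):
        return value.strftime("%H:%M:%S")
    # Assumption: string-like types need quoting/escaping rules that this
    # section does not define, so they are deliberately not handled here.
    raise TypeError(f"unsupported type in this sketch: {type(value).__name__}")


def encode_row(values):
    """Join the encoded columns with a comma; one row becomes one line."""
    return ",".join(encode_value(v) for v in values)


row = encode_row([666, True, Decimal("66.6"), 666.6, datetime(2012, 1, 1, 9, 9, 9)])
print(row)  # 666,1,66.6,666.6,2012-01-01 09:09:09
```

Because none of the types handled here can produce a newline character, every encoded row in this sketch stays on a single line. A fuller encoder would also have to escape newlines and commas inside string values.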

...