Intermediate representation

In Sqoop 2, connectors will supply their own map phase that will import data into HDFS. Because this piece of code will be fully under connector maintenance, we need to agree on a common intermediate (map output) format for all connectors and all cases. The goal of this page is to compare different intermediate representations so that we can pick the appropriate one for Sqoop 2.

...

MySQL's mysqldump format 

A comma-separated list of values held in a single Text instance. The data types are encoded as follows:

| Data type | Serialized as |
|---|---|
| BIT | String (array of bits rounded up to whole bytes; for example, 20 bits are rounded up to 24 bits, i.e. 3 bytes) |
| INT (small, big, ...) | Direct value (666) |
| BOOL | Direct number (1 or 0) |
| DECIMAL (fixed, ...) | Direct value (66.6) |
| FLOAT (double, ...) | Direct value, might be in scientific notation (666.6, 5.5e-39). MySQL does not support NaN or +/- Inf. |
| DATE | String with format YYYY-MM-DD (2012-01-01) |
| DATETIME | String with format YYYY-MM-DD HH:MM:SS (2012-01-01 09:09:09) |
| TIMESTAMP | String with format YYYY-MM-DD HH:MM:SS (2012-01-01 09:09:09) |
| TIME | String with format HH:MM:SS (09:09:09) |
| CHAR (varchar, text, blob) | String |
| ENUM | String with enumerated value |
| SET | String with comma-separated enumerated values |
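A minimal sketch of how a connector might render a few of the scalar types listed above. The helper names `bit_to_bytes` and `format_scalar` are hypothetical and not part of Sqoop, and the big-endian byte order used for BIT values is an assumption, since the table only specifies the rounding to whole bytes.

```python
from datetime import date, datetime, time


def bit_to_bytes(value: int, bit_width: int) -> bytes:
    """Pack a BIT(bit_width) value into whole bytes (20 bits -> 3 bytes).

    Big-endian order is an assumption made for this sketch.
    """
    n_bytes = (bit_width + 7) // 8  # round up to the next whole byte
    return value.to_bytes(n_bytes, byteorder="big")


def format_scalar(value) -> str:
    """Render one non-string value using the formats from the table."""
    # bool must be checked before numbers: bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "1" if value else "0"
    # datetime must be checked before date: datetime is a subclass of date.
    if isinstance(value, datetime):
        return value.strftime("%Y-%m-%d %H:%M:%S")
    if isinstance(value, date):
        return value.strftime("%Y-%m-%d")
    if isinstance(value, time):
        return value.strftime("%H:%M:%S")
    # INT, DECIMAL and FLOAT are written as direct values.
    return str(value)


if __name__ == "__main__":
    print(len(bit_to_bytes(0xABCDE, 20)))
    print(format_scalar(datetime(2012, 1, 1, 9, 9, 9)))
```

String quoting is deliberately left out of this sketch, because the escaping rules for this format are not spelled out in the text above.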

...

As with the MySQL dump format, data would be represented as one Text instance in which multiple columns are separated by commas. Strings are quoted in single quotes (for example 'string'). All characters are printed as they are, with the exception of the single quote, which is doubled: two single quotes '' represent one single quote inside the string and not the end of the string (for example 'Jarcec''s notes'). A string consisting of one single quote is therefore written as four single quotes: in '''' the first quote opens the string, the next two encode one single quote inside the string, and the last one closes the string. The null byte (0x00) is not allowed inside string constants. Binary constants are also quoted in single quotes, but the entire field is converted to hexadecimal with a \x prefix; for example, '\x4d7953514c' stands for the string 'MySQL' (saved in a binary column such as bytea).
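The quoting rules above can be sketched as follows. The function names `quote_string`, `quote_binary` and `split_fields` are hypothetical helpers written for illustration, not an existing Sqoop API, and the splitter assumes that fields are separated by a bare comma with no surrounding whitespace.

```python
def quote_string(value: str) -> str:
    """Quote a string constant, doubling every embedded single quote."""
    if "\x00" in value:
        raise ValueError("null byte (0x00) is not allowed inside string constants")
    return "'" + value.replace("'", "''") + "'"


def quote_binary(value: bytes) -> str:
    """Quote a binary constant as hexadecimal with a \\x prefix."""
    return "'\\x" + value.hex() + "'"


def split_fields(line: str) -> list:
    """Split one record on commas that are not inside a quoted constant.

    Fields are returned still quoted; a doubled quote inside a string
    is kept as-is and does not end the string.
    """
    fields, current, in_quotes = [], [], False
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "'":
            if in_quotes and i + 1 < len(line) and line[i + 1] == "'":
                current.append("''")  # escaped quote, stay inside the string
                i += 2
                continue
            in_quotes = not in_quotes  # opening or closing quote
            current.append(ch)
        elif ch == "," and not in_quotes:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current))
    return fields


if __name__ == "__main__":
    record = ",".join(["666", quote_string("Jarcec's notes"), quote_binary(b"MySQL")])
    print(record)
    print(split_fields(record))
```

Splitting with a plain `str.split(",")` would break on commas inside quoted strings, which is why the sketch tracks whether it is currently inside a quoted constant.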

| Data type | Serialized as |
|---|---|
| INT (and all variants) | Direct value (666) |
| NUMERIC | Direct value (66.60) |
| REAL (and all variants) | Direct value (66.5999985, 55e55) or string constant for special cases ('Infinity', '-Infinity', 'NaN') |
| VARCHAR (text, ...) | String |
| CHAR | String, unused positions at the end are filled with spaces |
| TIMESTAMP (date, time, ...) | String in format YYYY-MM-DD HH:MM:SS.ZZZZZZ (date and hour part) |
| TIMESTAMP with time zone (and others) | String in format YYYY-MM-DD HH:MM:SS.ZZZZZZ[+-]XX ('2012-07-03 14:07:11.876239+02') |
| BOOLEAN | Constants true and false (not quoted as a String) |
| ENUM | String |
| ARRAY | String that contains a special structure, '{ITEM1, ITEM2, ITEM3}'; ITEMX itself might be in separate quotes if needed. |
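As an illustration of the type table above, the following sketch renders a few of the listed types. All function names are hypothetical, and it assumes that timestamps are quoted like any other string constant, as the quoted time zone example in the table suggests.

```python
import math
from datetime import datetime


def format_real(value: float) -> str:
    """Direct value, or a quoted string constant for the special cases."""
    if math.isnan(value):
        return "'NaN'"
    if math.isinf(value):
        return "'Infinity'" if value > 0 else "'-Infinity'"
    return repr(value)


def format_boolean(value: bool) -> str:
    """Unquoted constants true and false."""
    return "true" if value else "false"


def format_char(value: str, width: int) -> str:
    """CHAR(width): pad unused trailing positions with spaces, then quote."""
    return "'" + value.ljust(width).replace("'", "''") + "'"


def format_timestamp(value: datetime) -> str:
    """YYYY-MM-DD HH:MM:SS.ZZZZZZ, quoted (quoting is an assumption here)."""
    return "'" + value.strftime("%Y-%m-%d %H:%M:%S.%f") + "'"


if __name__ == "__main__":
    print(format_real(float("inf")))
    print(format_boolean(True))
    print(format_char("ab", 4))
    print(format_timestamp(datetime(2012, 7, 3, 14, 7, 11, 876239)))
```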

...