Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: First version of the proposal for intermediate format

...

Table of Contents
outlinetrue

...

Current solutions

List of ideas that we've explored. Rather than reinventing the wheel, this page contains summary of already present ways of data representation.

...

Data type

Serialized as

INT (and all variants)

Direct value (666)

NUMERIC

Direct value (66.60)

REAL(and all variants)

Direct value (66.5999985, 55e55) or string constant for special cases  ('Infinity', '-Infinity', 'NaN')

VARCHAR(text, ...)

String

CHAR

String, unused positions at the end are filled with spaces

TIMESTAMP(date, time, ...)

String in format YYYY-MM-DD HH:MM:SS.ZZZZZZ (Date and hour part)

<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1" ac:macro-id="3df618444889411e-96e7fa38-454446b6-8acd8833-9220153cfce4cf5d5f794dc3"><ac:plain-text-body><![CDATA[

TIMESTAMP with time zone (and others)

String in format YYY-MM-DD HH:MM:SS.ZZZZZZ[+-]XX ('2012-07-03 14:07:11.876239+02') 

]]></ac:plain-text-body></ac:structured-macro>

BOOLEAN

Constants true and false (not quoted as a String)

ENUM

String

ARRAY

String that contains special structure - '{ITEM1, ITEM2, ITEM3}', ITEMX itself might be in separate quotes if needed.

...

Code Block
         id  description
-----------  ------------------------------------------------------------
          5  Cus
          5  Cus
          3  Hi
          1  Ahoj
          6  Bus
          4  Hello
          2  Nazdar

Intermediate format representation proposal

I would like to make an proposal for suitable intermediate representation for Sqoop 2 based on my research of current solutions. I come to a conclusion that neither of the format is fully suitable. The most promising formats are mysqldump and pg_dump, however both are having issues - mysqldump is not supporting timezones and special constants for floating numbers (NaN, Infinity) whereas pg_dump is not escaping new line characters that might break any following hadoop processing. Therefore I would like to propose combination of both formats. Each row will be represented as a single line (no new line characters are allowed) where all columns will be present in CSV structure with comma as a column separator. Data types will be encoded as follows:

Data type

Serialized as

BIT

String (array of bites rounded up to 1 byte, 20 bits are rounded to 24 bits/3 bytes)

INT(small, big, ...)

Direct value (666)

BOOL

Direct number (1 or 0)

DECIMAL(fixed, ...)

Direct value (66.6)

FLOAT (double, ...)

Direct value, might be in scientific notation (666.6, 5.5e-39) and special sting constants 'Infinity', '-Infinity', 'NaN'

<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1" ac:macro-id="2f8d29e7-df02-487f-b154-96c4314ac698"><ac:plain-text-body><![CDATA[

DATE

String with format YYYY-MM-DD[+/-XX] (2012-01-01)

]]></ac:plain-text-body></ac:structured-macro>

<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1" ac:macro-id="6f562281-6f27-4d4d-a843-c1b04b94b0ff"><ac:plain-text-body><![CDATA[

DATETIME

String with format YYYY-MM-DD HH:MM:DD[.ZZZZZZ][+/-XX] (2012-01-01 09:09:09)

]]></ac:plain-text-body></ac:structured-macro>

<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1" ac:macro-id="b888f585-8b00-4b18-96c5-482d48fc83ad"><ac:plain-text-body><![CDATA[

TIMESTAMP

String with format YYYY-MM-DD HH:MM:DD[.ZZZZZZ][+/-XX] (2012-01-01 09:09:09)

]]></ac:plain-text-body></ac:structured-macro>

<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1" ac:macro-id="4c5b162a-dc2f-4c4a-8ec8-0e34a5fe2beb"><ac:plain-text-body><![CDATA[

TIME

String with format HH:MM:DD[.ZZZZZZ][+/-XX] (09:09:09)

]]></ac:plain-text-body></ac:structured-macro>

CHAR(varchar, text, ...)

String

BINARY(blob, ...)

String

ENUM

String with enumerated value

SET

String with comma separated enumerated values

Note: Date/time values have optional timezone specification in format +XX or -XX (number of hours that is different from UTC). This format was chosen based on pg_dump example. I'm not sure whether it's the best for zones with daylight saving shifts and for timezones that are not off by entire hour (like +11:30).

Note: Time values might have optional second fraction parts.

String will be encoded like in mysqldump case. Entire string will be enclosed in single quotes and all bytes will be printed as they are will exception of following bytes:

Byte

Encoded as

0x00

\0

0x0A

\n

0x0D

\r

0x1A

\Z

0x22

\"

0x27

\'

0x5C

\ \ (no space) 

Fast export utility compatibility

...