Sqoop2 Intermediate representation

Intermediate representation

In sqoop2 connectors will supply their own map phase that will import data into HDFS. Because this piece of code will be fully under connector maintenance, we need to agree on common intermediate (map output) form for all connectors and all cases. This page goal is to do comparison of different intermediate representation, so that we can pick up the appropriate one for sqoop 2.

Goals

Simple
Fast (no necessary parsing, encoding, ...)

Ideas

List of ideas that we've explored.

MySQL's mysqldump format

Comma separated list of values present in one single Text instance. Various data types are encoded as following:

Data type	Serialized as
BIT	String (array of bites rounded up to 1 byte, 20 bits are rounded to 24 bits/3 bytes)
INT(small, big, ...)	Direct value (666)
BOOL	Direct number (1 or 0)
DECIMAL(fixed, ...)	Direct value (66.6)
FLOAT (double, ...)	Direct value, might be in scientific notation (666.6, 5.5e-39)
DATE	String with format YYYY-MM-DD (2012-01-01)
DATETIME	String with format YYYY-MM-DD HH:MM:DD (2012-01-01 09:09:09)
TIMESTAMP	String with format YYYY-MM-DD HH:MM:DD (2012-01-01 09:09:09)
TIME	String with format HH:MM:DD (09:09:09)
CHAR(varchar, text, blob)	String
ENUM	String with enumerated value
SET	String with comma separated enumerated values

TODO(jarcec): Inf and NaN representation for real/float numbers.

DATE and DATETIME types are returning same content as was stored in the table (no timezone conversions), whereas TIMESTAMP is always stored as UTC and is converted to connection timezone automatically. Explicit timezone specification do not seem to be part of the export.

Missing value is represented as constant NULL (it's not a string constant, therefore it's not quoted). Strings have very simple encoding -- most of the bytes (characters) are printed as they are with exception of following bytes:

Byte	Written as
0x00	\0
0x0A	\n
0x0D	\r
0x1A	\Z
0x22	\"
0x27	\'
0x5C	\ \ (no space)

For example:

0,'Hello world','Jarcec\'s notes',NULL,66.6,'2012-06-06 06:06:06'

PostgreSQL's pg_dump format

Strings are quoted in single quotes (for example 'string'). All characters are printed as they are with exception of single quote that is double – e.g. two single quotes '' represents one single quote inside the string and not end of the string (for example 'Jarcec''s notes'). One quoted single quote is represented by four single qootes -- '''' represents just one ' (first one is opening, than there are two single quotes in a row that encodes one single quote inside the string and lastly the last single quote is representing end of the string). Null byte (0x00) is not allowed inside string constants. Binary constants are also quoted in single quotes, however all binary bytes are converted to hexa with \x prefix – for example '\x4d7953514c' stands for string 'MySQL' (saved in binary column like bytea).

Data type	Serialized as
INT (and all variants)	Direct value (666)
NUMERIC	Direct value (66.60)
REAL(and all variants)	Direct value (66.5999985) TODO(jarcec): Big numbers, Inf, NaN
VARCHAR(text, ...)	String
CHAR	String, unused positions at the end are filled with spaces
TIMESTAMP(date, time, ...)	String in format YYYY-MM-DD HH:MM:SS.ZZZZZZ (Date and hour part)
<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1" ac:macro-id="496829c6-e67f-4c07-b99f-520999d824dd"><ac:plain-text-body><![CDATA[	TIMESTAMP with time zone (and others)	String in format YYY-MM-DD HH:MM:SS.ZZZZZZ[+-]XX (offset for the timezone, see examples below)	]]></ac:plain-text-body></ac:structured-macro>
BOOLEAN	Constants true and false (not quoted as a String)
ENUM	String
ARRAY	String that contains special structure - '{ITEM1, ITEM2, ITEM3}', ITEMX itself might be in separate quotes if needed.

avro

pre class="code-java"

Child pages