Page History

...

Table of Contents

outline	true

...

Current solutions

List of ideas that we've explored. Rather than reinventing the wheel, this page contains summary of already present ways of data representation.

...

Data type	Serialized as
INT (and all variants)	Direct value (666)
NUMERIC	Direct value (66.60)
REAL(and all variants)	Direct value (66.5999985, 55e55) or string constant for special cases ('Infinity', '-Infinity', 'NaN')
VARCHAR(text, ...)	String
CHAR	String, unused positions at the end are filled with spaces
TIMESTAMP(date, time, ...)	String in format YYYY-MM-DD HH:MM:SS.ZZZZZZ (Date and hour part)
<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1" ac:macro-id="3df618444889411e-96e7fa38-454446b6-8acd8833-9220153cfce4cf5d5f794dc3"><ac:plain-text-body><![CDATA[	TIMESTAMP with time zone (and others)	String in format YYY-MM-DD HH:MM:SS.ZZZZZZ[+-]XX ('2012-07-03 14:07:11.876239+02')	]]></ac:plain-text-body></ac:structured-macro>
BOOLEAN	Constants true and false (not quoted as a String)
ENUM	String
ARRAY	String that contains special structure - '{ITEM1, ITEM2, ITEM3}', ITEMX itself might be in separate quotes if needed.

...

Code Block
id description ----------- ------------------------------------------------------------ 5 Cus 5 Cus 3 Hi 1 Ahoj 6 Bus 4 Hello 2 Nazdar

Intermediate format representation proposal

I would like to make an proposal for suitable intermediate representation for Sqoop 2 based on my research of current solutions. I come to a conclusion that neither of the format is fully suitable. The most promising formats are mysqldump and pg_dump, however both are having issues - mysqldump is not supporting timezones and special constants for floating numbers (NaN, Infinity) whereas pg_dump is not escaping new line characters that might break any following hadoop processing. Therefore I would like to propose combination of both formats. Each row will be represented as a single line (no new line characters are allowed) where all columns will be present in CSV structure with comma as a column separator. Data types will be encoded as follows:

Data type	Serialized as
BIT	String (array of bites rounded up to 1 byte, 20 bits are rounded to 24 bits/3 bytes)
INT(small, big, ...)	Direct value (666)
BOOL	Direct number (1 or 0)
DECIMAL(fixed, ...)	Direct value (66.6)
FLOAT (double, ...)	Direct value, might be in scientific notation (666.6, 5.5e-39) and special sting constants 'Infinity', '-Infinity', 'NaN'
<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1" ac:macro-id="2f8d29e7-df02-487f-b154-96c4314ac698"><ac:plain-text-body><![CDATA[	DATE	String with format YYYY-MM-DD[+/-XX] (2012-01-01)	]]></ac:plain-text-body></ac:structured-macro>
<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1" ac:macro-id="6f562281-6f27-4d4d-a843-c1b04b94b0ff"><ac:plain-text-body><![CDATA[	DATETIME	String with format YYYY-MM-DD HH:MM:DD[.ZZZZZZ][+/-XX] (2012-01-01 09:09:09)	]]></ac:plain-text-body></ac:structured-macro>
<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1" ac:macro-id="b888f585-8b00-4b18-96c5-482d48fc83ad"><ac:plain-text-body><![CDATA[	TIMESTAMP	String with format YYYY-MM-DD HH:MM:DD[.ZZZZZZ][+/-XX] (2012-01-01 09:09:09)	]]></ac:plain-text-body></ac:structured-macro>
<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1" ac:macro-id="4c5b162a-dc2f-4c4a-8ec8-0e34a5fe2beb"><ac:plain-text-body><![CDATA[	TIME	String with format HH:MM:DD[.ZZZZZZ][+/-XX] (09:09:09)	]]></ac:plain-text-body></ac:structured-macro>
CHAR(varchar, text, ...)	String
BINARY(blob, ...)	String
ENUM	String with enumerated value
SET	String with comma separated enumerated values

Note: Date/time values have optional timezone specification in format +XX or -XX (number of hours that is different from UTC). This format was chosen based on pg_dump example. I'm not sure whether it's the best for zones with daylight saving shifts and for timezones that are not off by entire hour (like +11:30).

Note: Time values might have optional second fraction parts.

String will be encoded like in mysqldump case. Entire string will be enclosed in single quotes and all bytes will be printed as they are will exception of following bytes:

Byte	Encoded as
0x00	\0
0x0A	\n
0x0D	\r
0x1A	\Z
0x22	\"
0x27	\'
0x5C	\ \ (no space)

Fast export utility compatibility

...

Child pages

Versions Compared

Old Version 34

New Version 35

Key

Current solutions

Intermediate format representation proposal

Fast export utility compatibility