...
Table of Contents | ||
---|---|---|
|
...
Current solutions
List of ideas that we've explored. Rather than reinventing the wheel, this page contains summary of already present ways of data representation.
...
Data type | Serialized as | ||
---|---|---|---|
INT (and all variants) | Direct value (666) | ||
NUMERIC | Direct value (66.60) | ||
REAL(and all variants) | Direct value (66.5999985, 55e55) or string constant for special cases ('Infinity', '-Infinity', 'NaN') | ||
VARCHAR(text, ...) | String | ||
CHAR | String, unused positions at the end are filled with spaces | ||
TIMESTAMP(date, time, ...) | String in format YYYY-MM-DD HH:MM:SS.ZZZZZZ (Date and hour part) | ||
<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1" ac:macro-id="3df618444889411e-96e7fa38-454446b6-8acd8833-9220153cfce4cf5d5f794dc3"><ac:plain-text-body><![CDATA[ | TIMESTAMP with time zone (and others) | String in format YYY-MM-DD HH:MM:SS.ZZZZZZ[+-]XX ('2012-07-03 14:07:11.876239+02') | ]]></ac:plain-text-body></ac:structured-macro> |
BOOLEAN | Constants true and false (not quoted as a String) | ||
ENUM | String | ||
ARRAY | String that contains special structure - '{ITEM1, ITEM2, ITEM3}', ITEMX itself might be in separate quotes if needed. |
...
Code Block |
---|
id description ----------- ------------------------------------------------------------ 5 Cus 5 Cus 3 Hi 1 Ahoj 6 Bus 4 Hello 2 Nazdar |
Intermediate format representation proposal
I would like to make an proposal for suitable intermediate representation for Sqoop 2 based on my research of current solutions. I come to a conclusion that neither of the format is fully suitable. The most promising formats are mysqldump and pg_dump, however both are having issues - mysqldump is not supporting timezones and special constants for floating numbers (NaN, Infinity) whereas pg_dump is not escaping new line characters that might break any following hadoop processing. Therefore I would like to propose combination of both formats. Each row will be represented as a single line (no new line characters are allowed) where all columns will be present in CSV structure with comma as a column separator. Data types will be encoded as follows:
Data type | Serialized as | ||
---|---|---|---|
BIT | String (array of bites rounded up to 1 byte, 20 bits are rounded to 24 bits/3 bytes) | ||
INT(small, big, ...) | Direct value (666) | ||
BOOL | Direct number (1 or 0) | ||
DECIMAL(fixed, ...) | Direct value (66.6) | ||
FLOAT (double, ...) | Direct value, might be in scientific notation (666.6, 5.5e-39) and special sting constants 'Infinity', '-Infinity', 'NaN' | ||
<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1" ac:macro-id="2f8d29e7-df02-487f-b154-96c4314ac698"><ac:plain-text-body><![CDATA[ | DATE | String with format YYYY-MM-DD[+/-XX] (2012-01-01) | ]]></ac:plain-text-body></ac:structured-macro> |
<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1" ac:macro-id="6f562281-6f27-4d4d-a843-c1b04b94b0ff"><ac:plain-text-body><![CDATA[ | DATETIME | String with format YYYY-MM-DD HH:MM:DD[.ZZZZZZ][+/-XX] (2012-01-01 09:09:09) | ]]></ac:plain-text-body></ac:structured-macro> |
<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1" ac:macro-id="b888f585-8b00-4b18-96c5-482d48fc83ad"><ac:plain-text-body><![CDATA[ | TIMESTAMP | String with format YYYY-MM-DD HH:MM:DD[.ZZZZZZ][+/-XX] (2012-01-01 09:09:09) | ]]></ac:plain-text-body></ac:structured-macro> |
<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1" ac:macro-id="4c5b162a-dc2f-4c4a-8ec8-0e34a5fe2beb"><ac:plain-text-body><![CDATA[ | TIME | String with format HH:MM:DD[.ZZZZZZ][+/-XX] (09:09:09) | ]]></ac:plain-text-body></ac:structured-macro> |
CHAR(varchar, text, ...) | String | ||
BINARY(blob, ...) | String | ||
ENUM | String with enumerated value | ||
SET | String with comma separated enumerated values |
Note: Date/time values have optional timezone specification in format +XX or -XX (number of hours that is different from UTC). This format was chosen based on pg_dump example. I'm not sure whether it's the best for zones with daylight saving shifts and for timezones that are not off by entire hour (like +11:30).
Note: Time values might have optional second fraction parts.
String will be encoded like in mysqldump case. Entire string will be enclosed in single quotes and all bytes will be printed as they are will exception of following bytes:
Byte | Encoded as |
---|---|
0x00 | \0 |
0x0A | \n |
0x0D | \r |
0x1A | \Z |
0x22 | \" |
0x27 | \' |
0x5C | \ \ (no space) |
Fast export utility compatibility
...