Warning |
---|
This wiki does not yet cover the representation of complex types such as Array/Map/NestedArray that will be used inside one of the CSV implementations. See Intermediate Data Format API for the 1.99.5 version. |
Intermediate representation
...
Code Block |
---|
id description ----------- ------------------------------------------------------------ 5 Cus 5 Cus 3 Hi 1 Ahoj 6 Bus 4 Hello 2 Nazdar |
CSV Intermediate format representation proposal
I would like to propose a suitable intermediate representation for Sqoop 2, based on my research into current solutions. I have come to the conclusion that none of the existing formats is fully suitable. The most promising formats are mysqldump and pg_dump, but both have issues: mysqldump does not support time zones or the special floating-point constants (NaN, Infinity), whereas pg_dump does not escape newline characters, which might break any subsequent Hadoop processing. I would therefore like to propose a combination of both formats. Each row will be represented as a single line (no newline characters are allowed), with all columns present in a CSV structure that uses a comma as the column separator. Data types will be encoded as follows:
Warning |
---|
Note: the formats for date-related types have changed. See the comment below for more up-to-date information on the changes in the spec, or refer to Intermediate Data Format API for the latest details as of 1.99.5. |
Data type | Serialized as |
---|---|
BIT | String (array of bits rounded up to a whole byte; 20 bits are rounded up to 24 bits/3 bytes) |
INT(small, big, ...) | Direct value (666) |
BOOL | Direct number (1 or 0) |
DECIMAL(fixed, ...) | Direct value (66.6) |
FLOAT (double, ...) | Direct value, might be in scientific notation (666.6, 5.5e-39), or one of the special string constants 'Infinity', '-Infinity', 'NaN' |
DATE | String with format YYYY-MM-DD[+/-XX] (2012-01-01) |
DATETIME | String with format YYYY-MM-DD HH:MM:SS[.ZZZZZZ][+/-XX] (2012-01-01 09:09:09) |
TIMESTAMP | String with format YYYY-MM-DD HH:MM:SS[.ZZZZZZ][+/-XX] (2012-01-01 09:09:09) |
TIME | String with format HH:MM:SS[.ZZZZZZ][+/-XX] (09:09:09) |
CHAR(varchar, text, ...) | String |
BINARY(blob, ...) | String |
ENUM | String with enumerated value |
SET | String with comma separated enumerated values |
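
The table above can be illustrated with a minimal Python sketch that serializes one row into a single CSV line. This is not the Sqoop 2 implementation. The function names `quote`, `encode_value` and `encode_row` are hypothetical. Two things are assumptions rather than part of this section: that string values are wrapped in single quotes (suggested by the quoted 'Infinity' constants), and the backslash escaping used to keep newlines out of a row. The optional fractional seconds and time zone suffixes are left out, and BIT, BINARY and NULL values are not handled.

```python
import math
from datetime import date, datetime, time
from decimal import Decimal

# ASSUMPTION: the escaping rules are not given in this section. Backslash
# escapes are used here only so that a value never contains a raw newline.
_ESCAPES = {"\\": "\\\\", "'": "\\'", "\n": "\\n", "\r": "\\r"}


def quote(text):
    """Wrap text in single quotes (assumed), escaping row-breaking characters."""
    return "'" + "".join(_ESCAPES.get(ch, ch) for ch in text) + "'"


def encode_value(value):
    """Serialize one column value following the data type table."""
    # bool is checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "1" if value else "0"                 # BOOL: 1 or 0
    if isinstance(value, (int, Decimal)):
        return str(value)                            # INT, DECIMAL: direct value
    if isinstance(value, float):
        if math.isnan(value):
            return quote("NaN")                      # FLOAT special constants
        if math.isinf(value):
            return quote("Infinity" if value > 0 else "-Infinity")
        return repr(value)                           # may be scientific notation
    # datetime is checked before date because datetime is a subclass of date.
    if isinstance(value, datetime):
        return quote(value.strftime("%Y-%m-%d %H:%M:%S"))   # DATETIME, TIMESTAMP
    if isinstance(value, date):
        return quote(value.strftime("%Y-%m-%d"))            # DATE
    if isinstance(value, time):
        return quote(value.strftime("%H:%M:%S"))            # TIME
    if isinstance(value, (set, frozenset)):
        return quote(",".join(sorted(value)))        # SET: comma separated values
    if isinstance(value, str):
        return quote(value)                          # CHAR, ENUM
    raise TypeError("unsupported type in this sketch: %r" % type(value))


def encode_row(values):
    """Join the encoded columns with commas; the result is a single line."""
    return ",".join(encode_value(v) for v in values)
```

Under these assumptions, `encode_row([1, "Ahoj"])` returns `1,'Ahoj'`, and a string containing a line break is emitted with a literal `\n` escape, so the row still occupies exactly one line.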
...