Intermediate representation
In Sqoop 2, connectors will supply their own map phase that imports data into HDFS. Because this code will be maintained entirely by each connector, we need to agree on a common intermediate (map output) format for all connectors and all cases. The goal of this page is to compare different intermediate representations so that we can pick the appropriate one for Sqoop 2.
Goals
- Simple
- Fast (no unnecessary parsing, encoding, ...)
Ideas
List of ideas that we've explored.
mysqldump format
A comma-separated list of values stored in a single Text instance. The data types are encoded as follows:
Data type | Encoding |
---|---|
BIT | TBD |
INT (small, big, ...) | Direct value (666) |
BOOL | Direct number (1 or 0) |
DECIMAL (fixed, ...) | Direct value (66.6) |
FLOAT (double, ...) | Direct value (666.6, TBD big number) |
DATE | String with format YYYY-MM-DD (2012-01-01) |
DATETIME | String with format YYYY-MM-DD HH:MM:SS (2012-01-01 09:09:09) |
TIMESTAMP | String with format YYYY-MM-DD HH:MM:SS (2012-01-01 09:09:09) |
TIME | String with format HH:MM:SS (09:09:09) |
CHAR (varchar, text, blob) | String |
ENUM | String with enumerated value |
SET | String with comma-separated enumerated values |
TODO: Explore the difference between DATETIME and TIMESTAMP with regard to time zones.
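To make the data type table concrete, here is a minimal Python sketch of how individual values might be turned into their text form. The function names `encode_value` and `quote` are hypothetical, and the mapping from Python types to the database types is an assumption made for illustration. The sketch also assumes that every encoding the table calls a "String" is wrapped in single quotes. BIT, very large floating-point numbers, and time zone handling are left out because they are still open on this page.

```python
import datetime
from decimal import Decimal


def quote(text):
    # Assumption: string-typed encodings are wrapped in single quotes.
    # Escaping of special bytes inside the string is not handled here.
    return "'" + text + "'"


def encode_value(value):
    """Encode one typed value as text, following the data type table (sketch)."""
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "1" if value else "0"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, Decimal):
        # "f" formatting avoids exponent notation such as 1E+3.
        return format(value, "f")
    if isinstance(value, float):
        # Very large values come out in exponent form; the table marks this as TBD.
        return repr(value)
    # datetime must be tested before date because it is a subclass of date.
    if isinstance(value, datetime.datetime):
        return quote(value.isoformat(sep=" ", timespec="seconds"))
    if isinstance(value, datetime.date):
        return quote(value.isoformat())
    if isinstance(value, datetime.time):
        return quote(value.isoformat(timespec="seconds"))
    if isinstance(value, (set, frozenset)):
        # SET: enumerated values joined by commas (sorted only to be deterministic).
        return quote(",".join(sorted(value)))
    raise TypeError("no encoding defined for " + type(value).__name__)
```

For instance, `encode_value(datetime.time(9, 9, 9))` yields `'09:09:09'` including the quotes, while `encode_value(True)` yields the bare digit `1`.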
A missing value is represented as the constant NULL (it is not a string constant, so it is not quoted). Strings have a very simple encoding: most bytes (characters) are written as they are, with the exception of the following bytes:
Byte | Written as |
---|---|
0x00 | \0 |
0x0A | \n |
0x0D | \r |
0x1A | \Z |
0x22 | \" |
0x27 | \' |
0x5C | \\ (two backslashes, no space) |
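The byte table above translates almost directly into a lookup. The following Python sketch is one possible way to apply it. The names `ESCAPES`, `escape_bytes`, and `quote_string` are hypothetical, and wrapping the result in single quotes is an assumption based on the example record on this page.

```python
# Byte-to-escape mapping copied from the table above.
ESCAPES = {
    0x00: b"\\0",
    0x0A: b"\\n",
    0x0D: b"\\r",
    0x1A: b"\\Z",
    0x22: b'\\"',
    0x27: b"\\'",
    0x5C: b"\\\\",
}


def escape_bytes(raw: bytes) -> bytes:
    """Replace each special byte with its two-byte escape; copy all others."""
    out = bytearray()
    for byte in raw:
        out += ESCAPES.get(byte, bytes([byte]))
    return bytes(out)


def quote_string(raw: bytes) -> bytes:
    """Escape the value and wrap it in single quotes (assumed quoting style)."""
    return b"'" + escape_bytes(raw) + b"'"
```

Because every special byte maps to exactly two output bytes and everything else is copied unchanged, the encoding is a single pass over the input, which fits the "Fast" goal.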
For example:
0,'Hello world','Jarcec\'s notes',NULL,66.6,'2012-06-06 06:06:06'
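Reading such a record back requires a scanner that is aware of quotes and escapes, because a comma inside a quoted string must not split the field. The Python sketch below shows one way to do that. The names `UNESCAPES` and `split_record` are hypothetical. Unquoted values are returned as text because the record itself does not carry column types, and malformed input (for example an unterminated quote) is not handled.

```python
# Reverse of the escape table: character after the backslash -> original character.
UNESCAPES = {
    "0": "\x00",
    "n": "\n",
    "r": "\r",
    "Z": "\x1a",
    '"': '"',
    "'": "'",
    "\\": "\\",
}


def split_record(line):
    """Split one record into fields; quoted strings are unescaped, NULL becomes None."""
    fields = []
    pos = 0
    while True:
        if pos < len(line) and line[pos] == "'":
            # Quoted string: read up to the closing quote, resolving escapes.
            pos += 1
            chars = []
            while line[pos] != "'":
                if line[pos] == "\\":
                    pos += 1
                    chars.append(UNESCAPES[line[pos]])
                else:
                    chars.append(line[pos])
                pos += 1
            pos += 1  # skip the closing quote
            fields.append("".join(chars))
        else:
            # Unquoted value: runs up to the next comma or the end of the line.
            end = line.find(",", pos)
            if end == -1:
                end = len(line)
            token = line[pos:end]
            fields.append(None if token == "NULL" else token)
            pos = end
        if pos >= len(line):
            return fields
        pos += 1  # skip the comma separator
```

Applied to the example record above, this returns six fields: the text `0`, the two unescaped strings, `None` for the NULL, the text `66.6`, and the date-time string.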