Intermediate representation

In Sqoop 2, connectors will supply their own map phase that imports data into HDFS. Because this code will be maintained entirely by each connector, we need to agree on a common intermediate (map output) format for all connectors and all cases. The goal of this page is to compare different intermediate representations so that we can pick the appropriate one for Sqoop 2.

Goals 

  • Simple
  • Fast (no unnecessary parsing, encoding, ...)

Ideas

List of ideas that we've explored.

MySQL's mysqldump format 

Comma-separated list of values stored in a single Text instance. The data types are encoded as follows:

Data type                  | Encoding
BIT                        | String (array of bits rounded up to whole bytes; e.g. 20 bits are rounded up to 24 bits/3 bytes)
INT (small, big, ...)      | Direct value (666)
BOOL                       | Direct number (1 or 0)
DECIMAL (fixed, ...)       | Direct value (66.6)
FLOAT (double, ...)        | Direct value, might be in scientific notation (666.6, 5.5e-39)
DATE                       | String with format YYYY-MM-DD (2012-01-01)
DATETIME                   | String with format YYYY-MM-DD HH:MM:SS (2012-01-01 09:09:09)
TIMESTAMP                  | String with format YYYY-MM-DD HH:MM:SS (2012-01-01 09:09:09)
TIME                       | String with format HH:MM:SS (09:09:09)
CHAR (varchar, text, blob) | String
ENUM                       | String with enumerated value
SET                        | String with comma-separated enumerated values
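
The BIT rounding rule in the table is plain ceiling division. A minimal sketch, assuming a hypothetical helper class `BitWidth` (not part of Sqoop or MySQL):

```java
// Hypothetical helper illustrating the BIT rounding rule from the table.
final class BitWidth {
    /** Returns the number of whole bytes needed to hold the given number of bits. */
    static int bytesForBits(int bits) {
        if (bits < 0) {
            throw new IllegalArgumentException("bits must be non-negative");
        }
        // Ceiling division: 20 bits -> 3 bytes (24 bits).
        return (bits + 7) / 8;
    }
}
```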

DATE and DATETIME values are returned exactly as they were stored in the table (no timezone conversion), whereas TIMESTAMP is always stored as UTC and is automatically converted to the connection timezone. An explicit timezone specification does not seem to be part of the export.
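
To illustrate the TIMESTAMP behaviour described above, the following sketch renders a UTC instant in a given connection timezone using the YYYY-MM-DD HH:MM:SS layout. It only models the conversion with `java.time`; the class name `TimestampRendering` and its method are assumptions made for this example and do not reflect how MySQL or any connector implements it.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration: a value stored as UTC is shown in the connection timezone.
final class TimestampRendering {
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Formats a UTC instant as local time in the given zone, without any zone suffix. */
    static String render(Instant storedUtc, ZoneId connectionZone) {
        return FORMAT.format(storedUtc.atZone(connectionZone));
    }
}
```

Note that the rendered string carries no timezone marker, so a consumer of the intermediate format cannot tell which zone was used unless it is agreed on separately.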

A missing value is represented as the constant NULL (it is not a string constant, so it is not quoted). Strings have a very simple encoding: most bytes (characters) are printed as they are, with the exception of the following bytes:

Byte | Written as
0x00 | \0
0x0A | \n
0x0D | \r
0x1A | \Z
0x22 | \"
0x27 | \'
0x5C | \\

For example:

0,'Hello world','Jarcec\'s notes',NULL,66.6,'2012-06-06 06:06:06'
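
The escaping table, the NULL convention and the example record above can be sketched as a small encoder. This is a minimal sketch, not mysqldump's or Sqoop's actual code: the class `MysqldumpStyleEncoder` and its methods are hypothetical, it works on Java chars rather than raw bytes, and it assumes numbers and booleans are written directly while every other value is written as a quoted string.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical encoder following the mysqldump-style rules described on this page.
final class MysqldumpStyleEncoder {

    /** Wraps the value in single quotes and escapes the special characters from the table. */
    static String quote(String value) {
        StringBuilder out = new StringBuilder(value.length() + 2);
        out.append('\'');
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case 0x00: out.append("\\0"); break;
                case 0x0A: out.append("\\n"); break;
                case 0x0D: out.append("\\r"); break;
                case 0x1A: out.append("\\Z"); break;
                case 0x22: out.append("\\\""); break;
                case 0x27: out.append("\\'"); break;
                case 0x5C: out.append("\\\\"); break;
                default:   out.append(c);
            }
        }
        return out.append('\'').toString();
    }

    /** Encodes one field: NULL unquoted, numbers directly, booleans as 1/0, the rest quoted. */
    static String encodeField(Object value) {
        if (value == null) {
            return "NULL";
        }
        if (value instanceof Number) {
            return value.toString();
        }
        if (value instanceof Boolean) {
            return ((Boolean) value) ? "1" : "0";
        }
        return quote(value.toString());
    }

    /** Joins the encoded fields with commas into a single record line. */
    static String encodeRecord(List<Object> fields) {
        return fields.stream()
                .map(MysqldumpStyleEncoder::encodeField)
                .collect(Collectors.joining(","));
    }
}
```

Passing the values 0, "Hello world", "Jarcec's notes", null, 66.6 and "2012-06-06 06:06:06" to `encodeRecord` reproduces the example line above. Because quotes and backslashes are always escaped, a reader can split a record on commas that fall outside quoted strings.
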

PostgreSQL's pg_dump format

Strings are quoted with single quotes ('). All characters are printed as they are, with the exception of the single quote, which is doubled (i.e. two single quotes '' encode one single quote, not the end of the string). The null byte is not allowed inside string constants.
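
The quoting rule above can be sketched as follows. The class `PgDumpStyleQuoting` is a hypothetical helper written for this page, not pg_dump's own code, and rejecting the null byte with an exception is an assumption about how a connector might handle that case.

```java
// Hypothetical helper illustrating pg_dump-style string quoting.
final class PgDumpStyleQuoting {
    /** Wraps the value in single quotes, doubling any embedded single quote. */
    static String quote(String value) {
        if (value.indexOf('\0') >= 0) {
            // The null byte cannot appear in a string constant in this format.
            throw new IllegalArgumentException("null byte is not allowed in string constants");
        }
        return "'" + value.replace("'", "''") + "'";
    }
}
```

Compared with the mysqldump-style encoding, only one character needs special handling, but newlines are written as they are, so one record can span several lines.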

Avro