Warning

This wiki is outdated. Please refer to the latest version of the docs at http://sqoop.apache.org/docs/. This page does not yet cover complex types such as the Array/Map/NestedArray representation that will be used inside one of the CSV implementations.

See the Intermediate Data Format API for the 1.99.5 version.

Intermediate representation

In Sqoop 2, connectors will supply their own map phase that imports data into HDFS. Because this piece of code will be fully maintained by the connector, we need to agree on a common intermediate (map output) format for all connectors and all cases. The goal of this page is to compare different intermediate representations so that we can pick the appropriate one for Sqoop 2.


Current solutions

A list of ideas that we've explored. Rather than reinventing the wheel, this page summarizes already existing ways of representing data.

MySQL's mysqldump format 

A comma separated list of values stored in one single Text instance. The various data types are encoded as follows:

...

Code Block
0,'Hello world','Jarcec\'s notes',NULL,66.6,'2012-06-06 06:06:06'

PostgreSQL's pg_dump format

As in the case of the MySQL dump format, data is represented as one Text instance where multiple columns are separated by commas. Strings are quoted in single quotes (for example 'string'). All characters are printed as they are, with the exception of the single quote, which is doubled: two single quotes '' represent one single quote inside the string rather than the end of the string (for example 'Jarcec''s notes'). A quoted single quote is therefore represented by four single quotes: '''' stands for just one ' (the first quote opens the string, the two quotes in the middle encode one single quote inside the string, and the last quote closes the string). The null byte (0x00) is not allowed inside string constants. Binary constants are also quoted in single quotes, but the entire field is converted to hexadecimal with an \x prefix; for example '\x4d7953514c' stands for the string 'MySQL' (stored in a binary column such as bytea).

...

Code Block
666, 66.60, 'jarcec', 'Jarcec''s notes', '2012-07-03 14:07:11.876239', true, '{1,2,3}', NULL, '\x4d7953514c'
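
To make the quote-doubling rule concrete, a minimal sketch of this escaping in Java might look like the following (the method name is ours, not part of pg_dump):

Code Block
// Quote a string the way pg_dump does: wrap it in single quotes and
// double every single quote that appears inside the value.
public static String quotePgStyle(String value) {
  if (value == null) {
    return "NULL";
  }
  return "'" + value.replace("'", "''") + "'";
}

// quotePgStyle("Jarcec's notes") returns 'Jarcec''s notes'
// quotePgStyle("'")              returns ''''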

Microsoft SQL Server's bcp utility

SQL Server's bcp utility by default produces binary output that would be hard to interpret. Fortunately, it can be forced to produce character output using the command line switch "-c". In this case the bcp utility will produce a CSV file: the column delimiter can be specified with the "-t" switch (tab by default) and, similarly, the row delimiter with the "-r" switch (new line by default).

...

The bcp utility also supports so-called format files in which the user can specify a different column separator for each column. This functionality might be used to some extent to "emulate" escaping. For example, knowing that all columns are strings, the user could specify the separator for the first column as a quote and for the second (and subsequent) columns as a quote, a comma and an additional quote, resulting in an example row 'jarcec','arvind'. Unfortunately this technique is not powerful enough to handle the case where such a multi-character separator itself appears in the data.

AVRO

We can also utilize Avro as the intermediate format. Avro supports only a limited number of data types, so we would have to encode some types ourselves (for example we might encode Date as a string with the same format as above). A list of all supported types can be found in the Avro documentation. Example mapper code in the connector itself would be:

Code Block
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.avro.mapred.AvroKey;
import org.apache.hadoop.io.NullWritable;

GenericRecordBuilder builder = new GenericRecordBuilder(getSchemaSomehow());
builder.set("column1", value1);
builder.set("columnX", valueX);
context.write(new AvroKey(builder.build()), NullWritable.get());
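
The getSchemaSomehow() call above is a placeholder; in practice a connector would derive the Avro schema from the source table's metadata. A purely illustrative sketch (the column names and types here are assumptions, not part of any spec) could parse a JSON schema definition:

Code Block
import org.apache.avro.Schema;

// Illustrative only: a real connector would build this from the table's metadata.
private static final String SCHEMA_JSON =
    "{\"type\": \"record\", \"name\": \"row\", \"fields\": ["
  + "  {\"name\": \"column1\", \"type\": \"int\"},"
  + "  {\"name\": \"columnX\", \"type\": [\"null\", \"string\"]}"
  + "]}";

private Schema getSchemaSomehow() {
  return new Schema.Parser().parse(SCHEMA_JSON);
}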

Netezza

I did not find any special data exporting utility, only recommendations to use nzsql and save its output to a file; it renders tables in a similar way to the normal mysql client.

...

Code Block
  ID | TXT
 ----+------
   2 | cau
   1 | ahoj
(2 rows)

Teradata

Teradata has a FastExport mechanism implemented by the fexp utility. The default output formats are binary and a table-like structure (shown below). The user can produce an arbitrary output format by adjusting the select query; for example, I've seen in many places the recommendation to simply concatenate all columns together with a proper column separator.

Code Block
         id  description
-----------  ------------------------------------------------------------
          5  Cus
          5  Cus
          3  Hi
          1  Ahoj
          6  Bus
          4  Hello
          2  Nazdar

CSV Intermediate format representation proposal

Based on my research of the current solutions, I would like to make a proposal for a suitable intermediate representation for Sqoop 2. I have come to the conclusion that none of the formats is fully suitable. The most promising formats are mysqldump and pg_dump, however both have issues: mysqldump does not support timezones or the special constants for floating point numbers (NaN, Infinity), whereas pg_dump does not escape new line characters, which might break any subsequent Hadoop processing. Therefore I would like to propose a combination of both formats. Each row will be represented as a single line (no new line characters are allowed) where all columns will be present in a CSV structure with a comma as the column separator. Data types will be encoded as follows:

Warning

Note: the formats for the Date related types have changed. See the comment below for more up-to-date information on the changes in the spec,

or refer to the Intermediate Data Format API for the latest details as of 1.99.5.

Data type                  | Serialized as
BIT                        | String (array of bits rounded up to whole bytes; for example 20 bits are rounded to 24 bits / 3 bytes)
INT (small, big, ...)      | Direct value (666)
BOOL                       | Direct number (1 or 0)
DECIMAL (fixed, ...)       | Direct value (66.6)
FLOAT (double, ...)        | Direct value, possibly in scientific notation (666.6, 5.5e-39); the special string constants 'Infinity', '-Infinity' and 'NaN' are also allowed
DATE                       | String with format YYYY-MM-DD[+/-XX] (2012-01-01)
DATETIME                   | String with format YYYY-MM-DD HH:MM:SS[.ZZZZZZ][+/-XX] (2012-01-01 09:09:09)
TIMESTAMP                  | String with format YYYY-MM-DD HH:MM:SS[.ZZZZZZ][+/-XX] (2012-01-01 09:09:09)
TIME                       | String with format HH:MM:SS[.ZZZZZZ][+/-XX] (09:09:09)
CHAR (varchar, text, ...)  | String
BINARY (blob, ...)         | String
ENUM                       | String with the enumerated value
SET                        | String with comma separated enumerated values
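
As a purely illustrative example (this row is ours, not from the spec, and it assumes strings are single-quoted and escaped as described in the escape sequence table below), a row covering a few of these types might be serialized as:

Code Block
666,66.6,'Hello world','Jarcec\'s notes',NULL,'2012-01-01 09:09:09'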

...

Byte | Encoded as
0x00 | \0
0x0A | \n
0x0D | \r
0x1A | \Z
0x22 | \"
0x27 | \'
0x5C | \\
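
To illustrate the escape sequences above, a minimal Java sketch of string escaping for the proposed format might look like the following (the method name and structure are ours, not the eventual Sqoop 2 implementation):

Code Block
// Escape the characters listed above and wrap the value in single quotes.
public static String escapeCsvString(String value) {
  if (value == null) {
    return "NULL";
  }
  StringBuilder sb = new StringBuilder("'");
  for (char c : value.toCharArray()) {
    switch (c) {
      case 0x00: sb.append("\\0");  break;
      case 0x0A: sb.append("\\n");  break;
      case 0x0D: sb.append("\\r");  break;
      case 0x1A: sb.append("\\Z");  break;
      case 0x22: sb.append("\\\""); break;
      case 0x27: sb.append("\\'");  break;
      case 0x5C: sb.append("\\\\"); break;
      default:   sb.append(c);
    }
  }
  return sb.append("'").toString();
}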

Fast export utility compatibility

We've also investigated compatibility between the various fast export tools that are available.

mysqldump --compatible

The mysqldump utility has a parameter --compatible that accepts the value "postgresql". However, this parameter does not seem to work as advertised, since using it results in nearly the same output format, which is definitely not compatible with PostgreSQL:

...