
Associated JIRA ticket: SQOOP-1350 and its sub-tickets for more granular discussion points.

Intermediate Data Format (IDF)

Connectors have FROM and TO parts, and a Sqoop job represents a data transfer between a FROM and a TO data source. The IDF API defines how the data is represented as it flows from FROM to TO through Sqoop. Connectors represent different data sources, and each data source can have its own custom or native format: for instance, MongoDB might use JSON as its optimal native format, HDFS can use plain CSV text, and S3 can use its own custom format. What every data source has in common is that it is a collection of rows, and each row is a collection of fields/columns. Most, if not all, data sources have a strict schema that describes the type of each field.

The IDF API provides three main ways to represent the data that flows within Sqoop:

  1. Native format - each row in the data source is a native object. For instance, in a JSON IDF an entire row and its fields would be represented as a JSON object, and in an Avro IDF an entire row and its fields would be represented as an Avro record.
  2. CSV text format - each row and its fields are represented as CSV text.
  3. Object array format - each field in the row is an element in an object array, so a row in the data source is represented as an object array.


NOTE: The CSV text format and the object array format are custom formats prescribed by Sqoop, and the details of these formats for every supported column type in the schema are described below.
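For illustration, the sketch below shows the same logical row (id, name, active) in each of the three representations. The class name and values are made up for this example and are not part of the IDF API; the CSV and object encodings follow the 1.99.5 spec sections further down.

import java.util.Arrays;

public class RowRepresentationsSketch {
  public static void main(String[] args) {
    // 1. Native format: whatever object the data source uses natively,
    //    e.g. a JSON document for a hypothetical JSON-based IDF.
    String nativeRecord = "{\"id\":1,\"name\":\"john\",\"active\":true}";

    // 2. Sqoop CSV text format: one CSV-encoded line per row.
    String csvText = "1,'john',true";

    // 3. Object array format: one array element per field.
    Object[] objectArray = new Object[] { 1L, "john", true };

    System.out.println(nativeRecord);
    System.out.println(csvText);
    System.out.println(Arrays.toString(objectArray));
  }
}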

Design goals

There are a few prior documents that describe the design goals, but they are not crystal clear. Refer to this doc for context on the research done prior to defining the IDF API; it explains some of the goals of using the CSV and object array formats. Some of the design influence comes from Sqoop's predecessor, Sqoop 1.

  1. Support data transfer across connectors using an "internal" in-memory data representation.
  2. CSV is a common format in many databases, so Sqoop's design primarily optimizes for such data sources. However, it is unclear how much of a performance gain CSV text actually provides.
    The following is a javadoc comment pulled from the code that explains the choice of CSV:

     

    Why a "native" internal format and then return CSV text too?
    Imagine a connector that moves data from a system that stores data as a
    serialization format called FooFormat. If I also need the data to be
    written into HDFS as FooFormat, the additional cycles burnt in converting
    the FooFormat to text and back is useless - so using the sqoop specified
    CSV text format saves those extra cycles
    <p/>
    Most fast access mechanisms, like mysqldump or pgsqldump write the data
    out as CSV, and most often the source data is also represented as CSV
    - so having a minimal CSV support is mandated for all IDF, so we can easily read the
    data out as text and write as text.

     

Schema (a.k.a. Row)

Schema represents a row of fields that is transferred between FROM and TO; hence a schema holds the list of column/field types for that row.

/**
 * Schema represents the data fields that are transferred between {@link #From} and {@link #To}
 */
public class Schema {
  /**
   * Name of the schema, usually a table name.
   */
  private String name;
  /**
   * Optional note.
   */
  private String note;
  /**
   * Creation date.
   */
  private Date creationDate;
  /**
   * Columns associated with the schema.
   */
  private List<Column> columns;
  /**
   * Helper set for quick column name lookups.
   */
  private Set<String> columNames;

  // ... constructors and helper methods omitted for brevity
}
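As a usage illustration, the sketch below builds a Schema for a simple three-column row. It assumes the fluent addColumn(...) method on Schema and name-only constructors on the column subclasses in org.apache.sqoop.schema.type; the exact constructor signatures may differ per release, so treat them as assumptions and check the schema package.

import org.apache.sqoop.schema.Schema;
import org.apache.sqoop.schema.type.Date;
import org.apache.sqoop.schema.type.FixedPoint;
import org.apache.sqoop.schema.type.Text;

public class EmployeeSchemaSketch {
  // Builds a schema describing a row of three fields:
  // id (FIXED_POINT), name (TEXT), join_date (DATE).
  public static Schema buildSchema() {
    Schema schema = new Schema("employee");
    schema.addColumn(new FixedPoint("id"))
          .addColumn(new Text("name"))
          .addColumn(new Date("join_date"));
    return schema;
  }
}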

Column & ColumnType (a.k.a. Row Fields)

Column is an abstraction representing a field in a row. There are custom subclasses for types such as String, Number, Date, Map, and Array. A column carries attributes that provide metadata about the column data, for example:

  • Is the field nullable?
  • If the field is a String, what is its max size?
  • If it is a DateTime, does it support a timezone?
  • If it is a Map, what are the types of the key and the value?
  • If it is an Array, what is the type of its elements?
  • If it is an Enum, what are its supported options?

/**
 * Base class for all the supported types in the Sqoop {@link #Schema}
 */
public abstract class Column {
  /**
   * Name of the column. It is optional
   */
  String name;
  /**
   * Whether the column value can be empty/null
   */
  Boolean nullable;
  /**
   * By default a column is nullable.
   */

  // ... remaining attributes, constructors and setters omitted for brevity
}
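To make the attribute list above concrete, here is a hedged sketch of describing column metadata. The setNullable(...) setter and the Map/Array constructors that take key/value and element columns are assumptions about the org.apache.sqoop.schema.type classes; verify them against the actual source.

import org.apache.sqoop.schema.type.Array;
import org.apache.sqoop.schema.type.FixedPoint;
import org.apache.sqoop.schema.type.Map;
import org.apache.sqoop.schema.type.Text;

public class ColumnMetadataSketch {
  public static void main(String[] args) {
    // A nullable TEXT column (setNullable is an assumed setter on Column).
    Text comment = new Text("comment");
    comment.setNullable(true);

    // A MAP column whose keys are TEXT and whose values are FIXED_POINT
    // (the constructor taking key and value columns is assumed).
    Map tags = new Map("tags", new Text("key"), new FixedPoint("value"));

    // An ARRAY column whose elements are FIXED_POINT (element-type constructor assumed).
    Array scores = new Array("scores", new FixedPoint("element"));

    System.out.println(comment.getName() + ", " + tags.getName() + ", " + scores.getName());
  }
}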

ColumnType is a handy enum that represents all the field types Sqoop supports. Note that there is an umbrella UNKNOWN type for fields that Sqoop does not support.

/**
 * All {@link #Column} types supported by Sqoop.
 */
public enum ColumnType {
  ARRAY,
  BINARY,
  BIT,
  DATE,
  DATE_TIME,
  DECIMAL,
  ENUM,
  FIXED_POINT,
  FLOATING_POINT,
  MAP,
  SET,
  TEXT,
  TIME,
  UNKNOWN,
  ;
}

The following is the spec as of 1.99.5. Please do not edit it directly in the future; if the format spec changes in a future release, add a new section to highlight what changed.

1.99.5 SQOOP CSV Format

Column Type / CSV Format / Notes

NULL value in the field

  public static final String NULL_FIELD = "NULL";

ARRAY
  • Encoded as a String (and hence enclosed in single quotes); inside is a JSON encoding of the top-level array elements (hence the entire value is enclosed in a [] pair). Nested values are not JSON encoded.
  • A few examples:
    • Array of FixedPoint: '[1,2,3]'
    • Array of Text: '["A","B","C"]'
    • Array of Objects of type FixedPoint: '["[11, 12]","[14, 15]"]'
    • Array of Objects of type Text: '["[A, B]","[X, Y]"]'

BINARY
  byte array enclosed in quotes and encoded with the ISO-8859-1 charset

BIT
  true, TRUE, 1 or false, FALSE, 0 (not encoded in quotes); unsupported values should throw an exception

DATE
  YYYY-MM-DD (no time zone)

DATE_TIME
  YYYY-MM-DD HH:MM:SS[.ZZZ][+/-XX] (fraction and timezone are optional)

DECIMAL
  BigDecimal (not encoded in quotes)

ENUM
  same as TEXT

FIXED_POINT
  integer or long (not encoded in quotes)

FLOATING_POINT
  float or double (not encoded in quotes)

MAP
  • Encoded as a String (and hence enclosed in single quotes); inside is a JSON encoding of the map (hence the entire value is enclosed in a {} pair).
  • A few examples:
    • Map<Number, Number>: '{1:20}'
    • Map<String, String>: '{"testKey":"testValue"}'

SET
  same as ARRAY

TEXT
  The entire string is enclosed in single quotes and all bytes are printed as they are, with the exception of the following bytes:

  Byte    Encoded as
  0x5C    \\
  0x27    \'
  0x22    \"
  0x1A    \Z
  0x0D    \r
  0x0A    \n
  0x00    \0

TIME
  HH:MM:SS[.ZZZ] (fraction is optional); only 3-digit millisecond precision is supported for time

UNKNOWN
  same as BINARY
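As a worked example of the spec above, the snippet below hand-builds one CSV-encoded row. The values are made up; only the encoding rules (quoting, escaping, the NULL marker) come from the table.

public class CsvRowExample {
  public static void main(String[] args) {
    String fixedPoint = "10";            // FIXED_POINT: not enclosed in quotes
    String text = "'O\\'Reilly'";        // TEXT: single-quoted, the 0x27 byte escaped as \'
    String bit = "1";                    // BIT: true/TRUE/1 or false/FALSE/0, not quoted
    String nullField = "NULL";           // NULL value in the field
    String array = "'[1,2,3]'";          // ARRAY of FixedPoint: JSON array inside single quotes

    String csvRow = String.join(",", fixedPoint, text, bit, nullField, array);
    System.out.println(csvRow);          // prints: 10,'O\'Reilly',1,NULL,'[1,2,3]'
  }
}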

 

1.99.5 SQOOP Object Format

SqoopDataUtils exposes a few utility methods for converting data into the object format Sqoop expects.

 

Column Type / Object Format

NULL value in the field
  java null

ARRAY
  java Object[]

BINARY
  java byte[]

BIT
  java boolean

DATE
  org.joda.time.LocalDate

DATE_TIME
  org.joda.time.DateTime or org.joda.time.LocalDateTime (depends on the timezone attribute)

DECIMAL
  java BigDecimal

ENUM
  java String

FIXED_POINT
  java Integer or java Long (depends on the byteSize attribute)

FLOATING_POINT
  java Double or java Float (depends on the byteSize attribute)

MAP
  java.util.Map<Object, Object>

SET
  java Object[]

TEXT
  java String

TIME
  org.joda.time.LocalTime (no timezone)

UNKNOWN
  same as java byte[]
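The corresponding object array for a similar row, using the Java and Joda types from the table above, would look roughly like this (values are illustrative):

import org.joda.time.LocalDate;

public class ObjectRowExample {
  public static void main(String[] args) {
    Object[] row = new Object[] {
        10L,                           // FIXED_POINT -> java Long (or Integer, per byteSize)
        "O'Reilly",                    // TEXT -> java String (no CSV escaping at this layer)
        true,                          // BIT -> java boolean
        null,                          // NULL value in the field -> java null
        new Object[] { 1L, 2L, 3L },   // ARRAY -> java Object[]
        new LocalDate(2014, 10, 1)     // DATE -> org.joda.time.LocalDate
    };
    System.out.println(row.length + " fields");
  }
}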

Custom Implementations of IDF

CSV IDF - SQOOP-555 and SQOOP-1350

It is a sample implementation of the IDF API.

CSV IDF piggybacks on the Sqoop CSV format described above; its native format is that CSV format. Its main functionality is to translate between the CSV text and object array formats.

See the implementation class in the connector-sdk package for more details.
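A minimal usage sketch of CSVIntermediateDataFormat follows. The constructor and the setCSVTextData/getObjectData/setObjectData/getCSVTextData method names are assumptions based on the 1.99.5 connector-sdk; confirm them against the implementation class before relying on them.

import org.apache.sqoop.connector.idf.CSVIntermediateDataFormat;
import org.apache.sqoop.schema.Schema;
import org.apache.sqoop.schema.type.FixedPoint;
import org.apache.sqoop.schema.type.Text;

public class CsvIdfSketch {
  public static void main(String[] args) {
    Schema schema = new Schema("employee");
    schema.addColumn(new FixedPoint("id")).addColumn(new Text("name"));

    // Assumed constructor taking the schema; a no-arg constructor plus setSchema(schema)
    // may be the actual shape of the API.
    CSVIntermediateDataFormat idf = new CSVIntermediateDataFormat(schema);

    idf.setCSVTextData("10,'john'");        // hand the IDF a CSV-encoded row
    Object[] fields = idf.getObjectData();  // read the same row back as an object array

    idf.setObjectData(new Object[] { 11L, "jane" });
    String csv = idf.getCSVTextData();      // and convert the other way

    System.out.println(fields.length + " fields; " + csv);
  }
}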

 

Open Questions 

  • The choice of CSV text and object array as the mandated formats for the Sqoop IDF is influenced by the Sqoop 1 design. It favors some traditional fast-dump databases, but there is no real benchmark proving how optimal it is compared with Avro or other formats for representing the data.
  • Using an intermediate format might lead to discrepancies for some specific column types. For instance, using Joda to represent date-time objects gives only 3-digit precision, whereas the SQL timestamp from JDBC sources supports 6-digit precision.
  • More importantly, the SqoopConnector API has a getIDF..() method that ties a connector to a specific intermediate format for all supported directions (i.e., both FROM and TO). This means the connector has to provide this format on the FROM side and expect this format on the TO side. Since each IDF offers three different representations, a connector can potentially use any one of them, and that is not obvious at all when a connector proclaims to use a particular IDF implementation. For instance, GenericJDBCConnector says it uses CSVIntermediateDataFormat but chooses to write object arrays in its extractor and read object arrays in its loader, so it is not obvious which underlying format it will actually read and write. On the other hand, HDFSConnector also says it uses CSVIntermediateDataFormat, but uses only the CSV text format in its extractor and loader.
  • A connector should possibly be able to handle multiple IDFs and expose the supported IDFs per direction, which is not possible today. For instance, a Sqoop job should be able to dynamically choose the IDF for HDFSConnector when it is used in the TO direction: the job could say "use the Avro IDF for the TO side" and hence load all the data into HDFS in Avro format. This means that when doing the load, the HDFS connector would use the readContent API of the SqoopOutputFormatDataReader. But today HDFSConnector can only say it uses CSVIntermediateDataFormat, and the data loaded into HDFS needs a separate conversion step from CSV to Avro.
