
Associated JIRA ticket: SQOOP-1350

 

Intermediate Data Format (IDF)

Connectors have FROM and TO parts, and a Sqoop job represents a data transfer between a FROM and a TO. The IDF API defines how data is represented as it flows from the FROM to the TO via Sqoop. Connectors represent different data sources, and each data source can have a custom/native format that it uses. For instance, MongoDB might use JSON as its optimal native format, HDFS can use plain CSV text, and S3 can use its own custom format. The IDF API provides three main ways to represent data.

In simple words, every data source has one thing in common: it is a collection of rows, and each row is a collection of fields/columns. Most, if not all, data sources have a strict schema that tells what each field's type is.

 

  1. Native format - each row in the data source is a native object. For instance, in a JSON IDF the entire row would be represented as a JSON object, and in an Avro IDF the entire row and its fields would be an Avro record.
  2. CSV text format - each row and its fields are represented as CSV text.
  3. Object Array format - each field in the row is an element in an object array. Hence a row in the data source is represented as an object array.

NOTE: The CSV text format and the Object Array format are mandated by Sqoop and the details of this format for every supported column type are described below.
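To make the three representations concrete, here is a minimal, self-contained sketch of one logical row in each form. The values and the JSON native format are illustrative assumptions, not taken from the Sqoop codebase:

```java
public class IdfFormats {
    // One logical row: (id, name, active), shown in the three IDF representations.

    // Object Array format: one element per field, typed as Java objects.
    static Object[] objectArrayFormat() {
        return new Object[] { 1L, "Alice", true };
    }

    // CSV text format: the same row as a single line of CSV text
    // (Sqoop's CSV format quotes string fields with single quotes).
    static String csvTextFormat() {
        return "1,'Alice',true";
    }

    // Native format: connector-specific; a JSON-based IDF might use a JSON document.
    static String nativeFormat() {
        return "{\"id\":1,\"name\":\"Alice\",\"active\":true}";
    }

    public static void main(String[] args) {
        System.out.println(objectArrayFormat().length); // 3 fields in the row
        System.out.println(csvTextFormat());
        System.out.println(nativeFormat());
    }
}
```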

Schema (a.k.a. Row)

Schema represents a row of fields that are transferred between the FROM and the TO. Hence the schema holds the list of column/field types for that row.

/**
 * Schema represents the data fields that are transferred between {@link #From} and {@link #To}
 */
public class Schema {
  /**
   * Name of the schema, usually a table name.
   */
  private String name;
  /**
   * Optional note.
   */
  private String note;
  /**
   * Creation date.
   */
  private Date creationDate;
  /**
   * Columns associated with the schema.
   */
  private List<Column> columns;
  /**
   * Helper set for quick column name lookups.
   */
  private Set<String> columNames;

  // ... (constructors and accessors omitted)
}

Column & ColumnType (a.k.a. Row Fields)

Column is an abstraction that represents a field in a row. There are custom classes for sub-types such as String, Number, Date, Map, and Array. A Column has attributes that provide metadata about the column data, such as: whether the field is nullable; if it is a String, what its max size is; if it is a DateTime, whether it supports a timezone; if it is a Map, what the types of the key and the value are; if it is an Array, what the type of the elements is; and if it is an Enum, what the supported options for the enum are.

/**
 * Base class for all the supported types in the Sqoop {@link #Schema}
 */
public abstract class Column {
  /**
   * Name of the column. It is optional
   */
  String name;
  /**
   * Whether the column value can be empty/null
   */
  Boolean nullable;
  /**
   * By default a column is nullable.
   */
  // ... (remainder of the class omitted)
}
 

ColumnType is a handy enum that represents all the field types Sqoop supports. Note there is an umbrella UNKNOWN type for fields that Sqoop does not support.

/**
 * All {@link #Column} types supported by Sqoop.
 */
public enum ColumnType {
  ARRAY,
  BINARY,
  BIT,
  DATE,
  DATE_TIME,
  DECIMAL,
  ENUM,
  FIXED_POINT,
  FLOATING_POINT,
  MAP,
  SET,
  TEXT,
  TIME,
  UNKNOWN,
  ;
}
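As an illustration of how a connector might choose a ColumnType, here is a hedged, self-contained sketch mapping generic JDBC SQL types (from java.sql.Types) onto the enum constant names above. The mapping and the method name toColumnType are assumptions for illustration; the actual mapping in any given Sqoop connector may differ:

```java
import java.sql.Types;

public class ColumnTypeMapping {
    // Hypothetical mapping from JDBC SQL type codes to Sqoop ColumnType names.
    static String toColumnType(int jdbcType) {
        switch (jdbcType) {
            case Types.CHAR:
            case Types.VARCHAR:
            case Types.LONGVARCHAR: return "TEXT";
            case Types.SMALLINT:
            case Types.INTEGER:
            case Types.BIGINT:      return "FIXED_POINT";
            case Types.REAL:
            case Types.FLOAT:
            case Types.DOUBLE:      return "FLOATING_POINT";
            case Types.NUMERIC:
            case Types.DECIMAL:     return "DECIMAL";
            case Types.DATE:        return "DATE";
            case Types.TIME:        return "TIME";
            case Types.TIMESTAMP:   return "DATE_TIME";
            case Types.BIT:
            case Types.BOOLEAN:     return "BIT";
            case Types.BINARY:
            case Types.VARBINARY:   return "BINARY";
            default:                return "UNKNOWN"; // umbrella type for unsupported fields
        }
    }

    public static void main(String[] args) {
        System.out.println(toColumnType(Types.VARCHAR));   // TEXT
        System.out.println(toColumnType(Types.TIMESTAMP)); // DATE_TIME
        System.out.println(toColumnType(Types.OTHER));     // UNKNOWN
    }
}
```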

SQOOP CSV Format

Column Type | CSV Format | Example

 

SQOOP Object Format

Column Type | Object Format | Example

 

Custom Implementations of IDF

 

CSV IDF

CSV IDF piggybacks on the Sqoop CSV Format: its native format is the CSV Format itself. Its main functionality is to provide a way to translate between the text and object formats.
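The translation the CSV IDF performs can be sketched as follows. This is a minimal, self-contained illustration of the idea, not the real CSVIntermediateDataFormat: the helper names (toCsv, split), the single-quote escaping, and the NULL token are assumptions for this sketch, and the real implementation covers every supported column type:

```java
import java.util.ArrayList;
import java.util.List;

public class CsvIdfSketch {
    // Object array -> CSV text: quote strings with single quotes,
    // escape embedded quotes, and write nulls as a NULL token.
    static String toCsv(Object[] fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) sb.append(',');
            Object f = fields[i];
            if (f == null) {
                sb.append("NULL");
            } else if (f instanceof String) {
                sb.append('\'').append(((String) f).replace("'", "\\'")).append('\'');
            } else {
                sb.append(f);
            }
        }
        return sb.toString();
    }

    // CSV text -> raw field strings: split on commas that are outside quotes
    // (unescaping and type conversion are omitted for brevity).
    static List<String> split(String csv) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean quoted = false;
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == '\\' && i + 1 < csv.length()) { cur.append(c).append(csv.charAt(++i)); continue; }
            if (c == '\'') quoted = !quoted;
            if (c == ',' && !quoted) { out.add(cur.toString()); cur.setLength(0); }
            else cur.append(c);
        }
        out.add(cur.toString());
        return out;
    }

    public static void main(String[] args) {
        String csv = toCsv(new Object[] { 1L, "O'Brien", null });
        System.out.println(csv);        // 1,'O\'Brien',NULL
        System.out.println(split(csv)); // back to three raw field strings
    }
}
```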

If a connector claims to use the CSV IDF, here are a few ways it can be used.

