
Associated JIRA ticket: SQOOP-1350 and its sub-tickets, for more granular discussion points.

Intermediate Data Format (IDF)

Connectors have FROM and TO parts, and a Sqoop job represents a data transfer between a FROM and a TO. The IDF API defines how data is represented as it flows from the FROM to the TO via Sqoop. Connectors represent different data sources, and each data source can have its own custom/native format: for instance, MongoDB might use JSON as its optimal native format, HDFS can use plain CSV text, and S3 can use its own custom format. The IDF API provides three main ways to represent data.

In simple words, every data source has one thing in common: it is a collection of rows, and each row is a collection of fields/columns. Most, if not all, data sources have a strict schema that tells what each field's type is.

 

  1. Native format - each row in the data source is a native object; for instance, in JSONIDF an entire row and its fields would be represented as a JSON object, and in AvroIDF an entire row and its fields would be represented as an Avro record.
  2. CSV text format - each row and its fields are represented as CSV text.
  3. Object Array format - each field in the row is an element in an object array; hence a row in the data source is represented as an object array.


NOTE: The CSV text format and the Object Array format are custom formats prescribed by Sqoop, and the details of these formats for every supported column type in the schema are described below.
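
To make the three representations concrete, here is a small illustrative sketch of one row; the example values and quoting are chosen for illustration rather than taken verbatim from the spec, and the exact per-type rules are in the tables below.

// One hypothetical row with three fields: an id, a name, and an array of tags.

// 1. Native format: whatever the data source uses natively; for a JSON-based
//    IDF the row could be the JSON object:
//    {"id":1,"name":"jane","tags":["a","b"]}

// 2. CSV text format: the row is a single CSV-encoded line
//    (strings quoted with ', arrays embedded as JSON):
String csvText = "1,'jane','[\"a\",\"b\"]'";

// 3. Object Array format: one Java object per field:
Object[] objectArray = new Object[]{ 1L, "jane", new Object[]{"a", "b"} };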

There are a few prior documents that describe the design goals, but they are not crystal clear. Refer to this doc for some context on the research done prior to defining the IDF API; it explains some of the goals of using the CSV and Object Array formats. Some of the design influence comes from its predecessor, Sqoop 1.


Design goals

  1. Support data transfer across connectors using an "internal" in-memory data representation.
  2. CSV is a common format in many databases, and hence Sqoop's design primarily optimizes for such data sources; it is unclear, however, how much of a performance gain CSV text actually provides. The following javadoc comment, pulled from the code, explains the choice of CSV:

     

    Why a "native" internal format and then return CSV text too?
    Imagine a connector that moves data from a system that stores data as a
    serialization format called FooFormat. If I also need the data to be
    written into HDFS as FooFormat, the additional cycles burnt in converting
    the FooFormat to text and back are useless - so using the Sqoop-specified
    CSV text format saves those extra cycles.

    Most fast access mechanisms, like mysqldump or pgsqldump, write the data
    out as CSV, and most often the source data is also represented as CSV -
    so having minimal CSV support is mandated for all IDFs, so that we can
    easily read the data out as text and write it as text.

     

Schema (a.k.a. Row)

Schema represents a row of fields that are transferred between the FROM and the TO; hence the schema holds the list of column/field types for that row.

/**
 * Schema represents the data fields that are transferred between {@link #From} and {@link #To}
 */
public class Schema {
  /**
   * Name of the schema, usually a table name.
   */
  private String name;
  /**
   * Optional note.
   */
  private String note;
  /**
   * Creation date.
   */
  private Date creationDate;
  /**
   * Columns associated with the schema.
   */
  private List<Column> columns;
  /**
   * Helper set for quick column name lookups.
   */
  private Set<String> columNames;

  // ... (constructors and helper methods elided)
}
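
As an illustration, a connector might build a schema like the sketch below; the column classes (e.g. Text, FixedPoint from org.apache.sqoop.schema.type) correspond to the types listed later on this page, but the exact constructor signatures here are assumptions.

import org.apache.sqoop.schema.Schema;
import org.apache.sqoop.schema.type.FixedPoint;
import org.apache.sqoop.schema.type.Text;

// Sketch: a two-column schema describing rows of the form (id, name).
Schema schema = new Schema("users");        // usually named after the table
schema.addColumn(new FixedPoint("id"));     // integral field
schema.addColumn(new Text("name"));         // string field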
Column & ColumnType (a.k.a. Row Fields)

Column is an abstraction that represents a field in a row. There are custom subclasses for types such as String, Number, Date, Map, and Array. A Column carries attributes that provide metadata about the column data: whether the field is nullable; if it is a String, what its max size is; if it is a DateTime, whether it supports a timezone; if it is a Map, what the types of the key and the value are; if it is an Array, what the type of its elements is; and if it is an Enum, what the supported options are (see the sketch after the ColumnType listing below).

/**
 * Base class for all the supported types in the Sqoop {@link #Schema}
 */
public abstract class Column {
  /**
   * Name of the column. It is optional
   */
  String name;
  /**
   * Whether the column value can be empty/null
   */
  Boolean nullable;
  /**
   * By default a column is nullable.
   */
  // ... (remainder of class elided)
}
 

ColumnType is a handy enum that represents all the field types Sqoop supports. Note that there is an umbrella UNKNOWN type for fields that Sqoop does not support.

/**
 * All {@link #Column} types supported by Sqoop.
 */
public enum ColumnType {
  ARRAY,
  BINARY,
  BIT,
  DATE,
  DATE_TIME,
  DECIMAL,
  ENUM,
  FIXED_POINT,
  FLOATING_POINT,
  MAP,
  SET,
  TEXT,
  TIME,
  UNKNOWN,
  ;
}
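
As mentioned above, the Column subclasses carry per-type metadata. The sketch below shows the kind of metadata a connector could attach; the setter and constructor shapes here are assumptions for illustration, not the verbatim API.

import org.apache.sqoop.schema.type.Map;    // Sqoop's Map column type (assumed package)
import org.apache.sqoop.schema.type.Text;

// Hypothetical sketch: attaching metadata to columns.
Text name = new Text("name");
name.setNullable(false);                    // assumed setter: this field may not be null

// A map column declaring the types of its keys and values
// (assumed constructor shape: name, key column, value column).
Map tags = new Map("tags", new Text("key"), new Text("value"));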

SQOOP CSV Format

The CSV representation for each column type, along with the relevant conversion utilities, is listed below.

ARRAY
  • Encoded as a String (and hence enclosed in single quotes '); inside is the JSON encoding of the array (hence the entire value is enclosed in a [] pair).
  • A few examples:
    • Array of numbers: '[1,2,3]'
    • Array of strings: '["A","B","C"]'
    • Array of objects: each element is itself JSON-encoded

To convert an object array to CSV format:

// column provides the metadata information about the data in the list
private String encodeList(Object[] list, Column column) {...}

To parse a CSV String into an object array:

private Object[] parseListElementFromJSON(String jsonString) {...}
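
As a rough illustration of what these helpers could do (not the actual implementation), a JSON library such as json-simple can produce the bracketed encoding; the outer quoting follows the String rule above.

import org.json.simple.JSONArray;
import org.json.simple.JSONValue;
import java.util.Collections;

// Sketch: JSON-encode the elements, then wrap the result in the single
// quotes the CSV format uses for string-typed values.
private String encodeList(Object[] list, Column column) {
  JSONArray array = new JSONArray();
  Collections.addAll(array, list);
  return "'" + array.toJSONString() + "'";
}

// Sketch: parse the JSON text back into an object array.
private Object[] parseListElementFromJSON(String jsonString) {
  JSONArray array = (JSONArray) JSONValue.parse(jsonString);
  return array.toArray();
}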

 

 

BINARY
  • Encoded as a String.

BIT, DATE, DATE_TIME, DECIMAL, ENUM, FIXED_POINT, FLOATING_POINT, MAP, SET, TEXT, TIME, UNKNOWN
  • (to be documented)

 

SQOOP Object Format

SqoopDataUtils exposes a few utility methods for converting data into the object format that Sqoop expects.

 

| Column Type | Object Format | Notes |
| --- | --- | --- |
| ARRAY | Object[] | |
| BINARY | byte[] | |
| BIT | boolean | |
| DATE | org.joda.time.LocalDate | |
| DATE_TIME | org.joda.time.DateTime or org.joda.time.LocalDateTime | depends on the timezone attribute |
| DECIMAL | BigDecimal | |
| ENUM | String | |
| FIXED_POINT | Integer or Long | depends on the byteSize attribute |
| FLOATING_POINT | Double or Float | depends on the byteSize attribute |
| MAP | java.util.Map<Object, Object> | |
| SET | Object[] | |
| TEXT | String | |
| TIME | org.joda.time.LocalTime (no timezone) | |
| UNKNOWN | same as byte[] | |
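
Putting it together, one row for a hypothetical schema (FIXED_POINT id, TEXT name, DATE signup_date) would be represented as the following object array; the schema and values are invented for illustration, but the Java types are the ones from the table above.

import org.joda.time.LocalDate;

// One row as a Sqoop object array, matching the hypothetical schema
// (FIXED_POINT id, TEXT name, DATE signup_date).
Object[] record = new Object[]{
    42L,                           // FIXED_POINT (8-byte) -> Long
    "jane",                        // TEXT -> String
    new LocalDate(2014, 10, 1)     // DATE -> org.joda.time.LocalDate
};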

Custom Implementations of IDF

 

CSV IDF

The CSV IDF piggybacks on the Sqoop CSV Format described above; its native format is the CSV format itself. Its main functionality is to translate between the text and object formats.

If a connector claims to use the CSV IDF, here are a few ways it can be used (a sketch follows).
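
For example, a connector could round-trip a row through the CSV IDF roughly as sketched below; the class name CSVIntermediateDataFormat matches this design, but treat the exact method signatures as assumptions.

// Sketch: translating between the CSV text and object array representations.
CSVIntermediateDataFormat idf = new CSVIntermediateDataFormat();
idf.setSchema(schema);                       // schema built as in the Schema example above

// FROM side: hand the IDF a CSV-encoded row...
idf.setTextData("42,'jane','2014-10-01'");

// TO side: ...and read it back as an object array, or as CSV text.
Object[] fields = idf.getObjectData();
String csv = idf.getTextData();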

 

 

Open Questions

  • The choice of using CSV text and Object Array as the mandated formats
