...

Code Block
languagejava
titleIntermediateDataFormat API
collapsetrue
public abstract class IntermediateDataFormat<T> {
  protected volatile T data;
  public int hashCode() {
    return data.hashCode();
  }

  /**
   * Get one row of data.
   *
   * @return - One row of data, represented in the internal/native format of
   *           the intermediate data format implementation.
   */
  public T getData() {
    return data;
  }

  /**
   * Set one row of data. If validate is set to true, the data is validated
   * against the schema.
   *
   * @param data - A single row of data to be moved.
   */
  public void setData(T data) {
    this.data = data;
  }
  /**
   * Get one row of data as CSV text. Use SqoopDataUtils for reading and
   * writing into the sqoop specified CSV text format for each
   * {@link #ColumnType} field in the row.
   * Why a "native" internal format and then return CSV text too?
   * Imagine a connector that moves data from a system that stores data as a
   * serialization format called FooFormat. If I also need the data to be
   * written into HDFS as FooFormat, the additional cycles burnt in converting
   * the FooFormat to text and back is useless - so using the sqoop specified
   * CSV text format saves those extra cycles.
   * <p/>
   * Most fast access mechanisms, like mysqldump or pgsqldump, write the data
   * out as CSV, and most often the source data is also represented as CSV
   * - so having minimal CSV support is mandated for all IDFs, so we can
   * easily read the data out as text and write it back as text.
   * <p/>
   * @return - String representing the data in CSV text format.
   */
  public abstract String getCSVTextData();
  /**
   * Set one row of data as CSV.
   *
   */
  public abstract void setCSVTextData(String csvText);
  /**
   * Get one row of data as an Object array. Sqoop uses a defined object
   * representation for each column type; for instance, org.joda.time is used
   * to represent dates. Use SqoopDataUtils for reading and writing into the
   * sqoop specified object format for each {@link #ColumnType} field in the row.
   * <p/>
   * @return - Object array representing the data.
   * If FROM and TO schemas exist, we will use SchemaMatcher to get the data according to the "TO" schema.
   */
  public abstract Object[] getObjectData();
  /**
   * Set one row of data as an Object array.
   *
   */
  public abstract void setObjectData(Object[] data);
  /**
   * Set the schema for serializing/de-serializing data.
   *
   * @param schema - the schema used for serializing/de-serializing data
   */
  public abstract void setSchema(Schema schema);
  /**
   * Serialize the fields of this object to <code>out</code>.
   *
   * @param out <code>DataOutput</code> to serialize this object into.
   * @throws IOException
   */
  public abstract void write(DataOutput out) throws IOException;
  /**
   * Deserialize the fields of this object from <code>in</code>.
   *
   * <p>For efficiency, implementations should attempt to re-use storage in the
   * existing object where possible.</p>
   *
   * @param in <code>DataInput</code> to deserialize this object from.
   * @throws IOException
   */
  public abstract void read(DataInput in) throws IOException;
}

NOTE: The CSV text format and the Object Array format are custom formats prescribed by Sqoop, and the details of these formats for every supported column type in the schema are described below.
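To make the contract above concrete, here is a minimal, illustrative sketch of an IDF implementation that uses a plain String as its internal format. The class name and the naive comma join/split logic are assumptions for illustration only, not the actual Sqoop implementation; a real IDF must honor the per-column-type encoding rules described below.

Code Block
languagejava
titleExample IDF implementation (illustrative sketch)
collapsetrue
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * Hypothetical IDF whose internal/native format is already CSV text.
 * The comma join/split below is deliberately naive - it ignores quoting,
 * escaping and column-type conversion.
 */
public class StringIntermediateDataFormat extends IntermediateDataFormat<String> {
  private Schema schema;

  @Override
  public String getCSVTextData() {
    // The internal format is already CSV text, so no conversion is needed.
    return data;
  }

  @Override
  public void setCSVTextData(String csvText) {
    this.data = csvText;
  }

  @Override
  public Object[] getObjectData() {
    // Naive split into one Object per field.
    return data == null ? null : data.split(",");
  }

  @Override
  public void setObjectData(Object[] objects) {
    // Naive join of the fields back into a single CSV line.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < objects.length; i++) {
      if (i > 0) {
        sb.append(',');
      }
      sb.append(objects[i]);
    }
    this.data = sb.toString();
  }

  @Override
  public void setSchema(Schema schema) {
    this.schema = schema;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(data);
  }

  @Override
  public void read(DataInput in) throws IOException {
    data = in.readUTF();
  }
}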

...

Schema represents a row of fields that are transferred between FROM and TO. Hence, a schema holds the list of column/field types for that row.

Code Block
languagejava
titleSchema
collapsetrue
/**
 * Schema represents the data fields that are transferred between {@link #From} and {@link #To}
 */
public class Schema {
  /**
   * Name of the schema, usually a table name.
   */
  private String name;
  /**
   * Optional note.
   */
  private String note;
  /**
   * Creation date.
   */
  private Date creationDate;
  /**
   * Columns associated with the schema.
   */
  private List<Column> columns;
  /**
   * Helper set for quick column name lookups.
   */
  private Set<String> columnNames;

...
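For orientation, here is a hedged sketch of how a connector might build a Schema. It assumes a fluent addColumn(Column) method and concrete Column subclasses per ColumnType (e.g. FixedPoint, Text, Date); since that part of the class is elided above, treat these names as assumptions rather than confirmed API.

Code Block
languagejava
titleBuilding a Schema (illustrative sketch)
collapsetrue
// Assumes Schema exposes a fluent addColumn(Column) and that concrete
// Column subclasses exist per ColumnType (e.g. FixedPoint, Text, Date).
Schema employee = new Schema("employee");
employee.addColumn(new FixedPoint("id"))
        .addColumn(new Text("name"))
        .addColumn(new Date("hire_date"));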

ColumnType is a handy enum that represents all the field types Sqoop supports. Note there is an umbrella UNKNOWN type for fields that Sqoop does not support.

Code Block
languagejava
titleColumnType
collapsetrue
/**
 * All {@link #Column} types supported by Sqoop.
 */
public enum ColumnType {
  ARRAY,
  BINARY,
  BIT,
  DATE,
  DATE_TIME,
  DECIMAL,
  ENUM,
  FIXED_POINT,
  FLOATING_POINT,
  MAP,
  SET,
  TEXT,
  TIME,
  UNKNOWN,
  ;
}
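To illustrate where the UNKNOWN umbrella type comes into play, the sketch below shows how a JDBC-based connector might map java.sql.Types codes onto ColumnType. The helper class and the exact mapping are assumptions for illustration, not Sqoop API.

Code Block
languagejava
titleMapping JDBC types to ColumnType (illustrative sketch)
collapsetrue
import java.sql.Types;

// Hypothetical helper a JDBC-based connector might use; anything the
// connector cannot classify falls through to the UNKNOWN umbrella type.
public final class ColumnTypeMapper {
  public static ColumnType fromSqlType(int sqlType) {
    switch (sqlType) {
      case Types.CHAR:
      case Types.VARCHAR:
      case Types.LONGVARCHAR:
        return ColumnType.TEXT;
      case Types.SMALLINT:
      case Types.INTEGER:
      case Types.BIGINT:
        return ColumnType.FIXED_POINT;
      case Types.REAL:
      case Types.FLOAT:
      case Types.DOUBLE:
        return ColumnType.FLOATING_POINT;
      case Types.NUMERIC:
      case Types.DECIMAL:
        return ColumnType.DECIMAL;
      case Types.DATE:
        return ColumnType.DATE;
      case Types.TIME:
        return ColumnType.TIME;
      case Types.TIMESTAMP:
        return ColumnType.DATE_TIME;
      case Types.BIT:
      case Types.BOOLEAN:
        return ColumnType.BIT;
      case Types.BINARY:
      case Types.VARBINARY:
        return ColumnType.BINARY;
      default:
        return ColumnType.UNKNOWN;
    }
  }
}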

...