
Open Questions related to the current IDF design

(Some of the items below are serious shortcomings of the current design as it exists.)

  • The choice of CSVText and ObjectArray as the mandated formats for the Sqoop IDF is influenced by the Sqoop 1 design. It favors some traditional fast-dump databases, but there is no benchmark showing how it performs compared with Avro or other formats for representing the data.
  • Using an intermediate format can lead to discrepancies for some specific column types. For instance, representing date-time objects with Joda-Time gives only 3-digit (millisecond) precision, whereas a SQL timestamp from a JDBC source supports 6-digit (microsecond) precision.
  • More importantly, the SqoopConnector API has a getIDF..() method that ties a connector to a specific intermediate format for all supported directions (i.e. both FROM and TO at this point). This means the connector has to provide this format on the FROM side and expect this format on the TO side.
  • Each IDF implementation exposes the 3 different formats described above, so a connector can potentially use any one of them, and which one it uses is not at all obvious when a connector declares a particular IDF implementation such as CSVIntermediateDataFormat. For instance, the GenericJDBCConnector says it uses CSVIntermediateDataFormat but chooses to write objectArray in the extractor and readObjectArray in the loader. Hence it is not obvious which underlying format it will read and write. On the other hand, HDFSConnector also says it uses CSVIntermediateDataFormat but uses only the CSV text format in the extractor and loader at this point. This may change in the future.
  • A connector should arguably be able to handle multiple IDFs and expose the supported IDFs per direction. This is not possible today. For instance, a Sqoop job should be able to dynamically choose the IDF for HDFSConnector when it is used in the TO direction. The job might say: use the Avro IDF for the TO side, and hence load all my data into HDFS in Avro format. When doing the load, the HDFS connector would then use the readContent API of the SqoopOutputFormatDataReader. But today the HDFS connector can only say it uses CSVIntermediateDataFormat, and the data loaded into HDFS needs conversion from CSV to Avro as a separate step.
  • Requiring every IDF to implement a CSVText equivalent is overkill. If CSV and ObjectArray were to be mandated as the 2 formats, the IDF should not have been an API but rather a standard implementation that could be further extended. Imagine having to write a JSONIDF or an AvroIDF and still having to replicate the same logic that the default/degenerate CSVIntermediateDataFormat provides.
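
To make the per-direction idea from the list above more concrete, here is a minimal sketch of how a connector might advertise the IDFs it supports for each direction, and how a job might pick one. Every name in it (DirectionalConnector, supportedIdfs, IdfType, negotiate, the two stand-in connectors) is hypothetical and not part of the existing Sqoop API. The fallback-to-CSV rule is likewise only an assumption made for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch: none of these types exist in Sqoop today.
public class IdfNegotiationSketch {

    enum Direction { FROM, TO }

    // Hypothetical labels standing in for IDF implementations.
    enum IdfType { CSV, AVRO, JSON }

    // Hypothetical alternative to a single getIDF..()-style method:
    // the connector reports the IDFs it supports for a given direction.
    interface DirectionalConnector {
        Set<IdfType> supportedIdfs(Direction direction);
    }

    // Stand-in for a JDBC-style connector (assumed capabilities).
    static final class JdbcLikeConnector implements DirectionalConnector {
        @Override
        public Set<IdfType> supportedIdfs(Direction direction) {
            return direction == Direction.FROM
                    ? EnumSet.of(IdfType.CSV, IdfType.AVRO)
                    : EnumSet.of(IdfType.CSV);
        }
    }

    // Stand-in for an HDFS-style connector (assumed capabilities).
    static final class HdfsLikeConnector implements DirectionalConnector {
        @Override
        public Set<IdfType> supportedIdfs(Direction direction) {
            return direction == Direction.TO
                    ? EnumSet.of(IdfType.CSV, IdfType.AVRO)
                    : EnumSet.of(IdfType.CSV);
        }
    }

    // Honor the IDF requested by the job if both sides support it;
    // otherwise fall back to CSV when both sides support that.
    static IdfType negotiate(DirectionalConnector from,
                             DirectionalConnector to,
                             IdfType requested) {
        Set<IdfType> common = EnumSet.noneOf(IdfType.class);
        common.addAll(from.supportedIdfs(Direction.FROM));
        common.retainAll(to.supportedIdfs(Direction.TO));
        if (common.contains(requested)) {
            return requested;
        }
        if (common.contains(IdfType.CSV)) {
            return IdfType.CSV;
        }
        throw new IllegalStateException("No common IDF between FROM and TO connectors");
    }

    public static void main(String[] args) {
        DirectionalConnector jdbc = new JdbcLikeConnector();
        DirectionalConnector hdfs = new HdfsLikeConnector();
        System.out.println(negotiate(jdbc, hdfs, IdfType.AVRO)); // prints AVRO
        System.out.println(negotiate(jdbc, hdfs, IdfType.JSON)); // prints CSV
    }
}
```

With something along these lines, a job that asks for Avro on the TO side would load Avro directly when both connectors support it, rather than writing CSV and converting in a separate step.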