...

Code Block
// Schemas and connector/link configs are passed to the tasks via the job's
// credential store and configuration (job is the org.apache.hadoop.mapreduce.Job being submitted).
job.getCredentials().addSecretKey(SCHEMA_FROM_KEY, jsonSchema.getBytes());
job.getCredentials().addSecretKey(SCHEMA_TO_KEY, jsonSchema.getBytes());
job.getCredentials().addSecretKey(MR_JOB_CONFIG_FROM_CONNECTOR_LINK_KEY, ConfigUtils.toJson(obj).getBytes());
job.getConfiguration().set(MR_JOB_CONFIG_CLASS_FROM_CONNECTOR_LINK, obj.getClass().getName());

SqoopMapper

  1. Creates the ExtractorContext from the data stored in the configuration and credential store to pass to the connector's extract API (a sketch of this flow follows the list)
  2. Obtains the SqoopSplit that holds the partition information for the data to be extracted
  3. After the extract call, records the Hadoop counters related to the extraction logic
  4. Passing data out of the Mapper: the DistributedCache can be used if we need to write any information from the extractor back to the Sqoop repository
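
The flow above can be pictured with the sketch below. This is only an illustration: buildExtractorContext() is a made-up helper, and the extractor/config fields and counter names are assumptions rather than the exact Sqoop source.

Code Block
// Illustrative sketch only; helper and field names are assumptions.
public class SqoopMapper extends Mapper<SqoopSplit, NullWritable, SqoopWritable, NullWritable> {

  // extractor, linkConfig, jobConfig: rebuilt from the job configuration (initialization omitted)

  @Override
  public void run(Context context) throws IOException, InterruptedException {
    // (1) rebuild schema and connector configs from the configuration / credential store
    ExtractorContext extractorContext = buildExtractorContext(context);

    // (2) the SqoopSplit assigned to this task carries the Partition to be extracted
    SqoopSplit split = (SqoopSplit) context.getInputSplit();

    // (3) invoke the connector's extract API for that partition
    extractor.extract(extractorContext, linkConfig, jobConfig, split.getPartition());

    // (4) after extraction, publish the extraction-related Hadoop counters
    context.getCounter(SqoopCounters.ROWS_READ).increment(extractor.getRowsRead());
  }
}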

Sqoop Writable(Comparable)

  1. Having a Writable class is required by the Hadoop framework; we are using the current one as a wrapper for the IntermediateDataFormat. Read more on IDF here
  2. We're not using a concrete implementation such as Text, so that we don't have to convert all records to String to transfer data between mappers and reducers.
  3. SqoopWritable delegates a lot of its functionality to the IntermediateDataFormat implementation used in the Sqoop job; for instance, the compareTo method on the Writable can use any custom logic of the underlying IDF for sorting the records that were extracted and are eventually written in the Load phase (a sketch of this delegation follows the list)
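
A rough sketch of that delegation is below; the IDF accessor names used here (getCSVTextData / setCSVTextData) and the comparison strategy are illustrative assumptions, not a definitive rendering of the Sqoop API.

Code Block
// Illustrative sketch; the IDF method names are assumptions.
public class SqoopWritable implements WritableComparable<SqoopWritable> {
  private IntermediateDataFormat<?> dataFormat;  // the IDF configured for this job

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(dataFormat.getCSVTextData());   // serialize the IDF's textual form
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    dataFormat.setCSVTextData(in.readUTF());     // rebuild the record inside the IDF
  }

  @Override
  public int compareTo(SqoopWritable other) {
    // the ordering between the Extract and Load phases is delegated to the IDF's representation
    return dataFormat.getCSVTextData().compareTo(other.dataFormat.getCSVTextData());
  }
}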

SqoopSplit

  1. An InputSplit describes a unit of work that comprises a single map task in a MapReduce program; SqoopSplit extends InputSplit
  2. Instantiates a custom Partition class used for splitting the input, which in our case is the data extracted in the Extract phase
  3. Delegates to the Partition object to read and write data (a sketch of this delegation follows the list)
  4. It is the key of the SqoopInputFormat
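
The read/write delegation can be sketched roughly as follows; ClassUtils.instantiate() is an assumption standing in for however the Partition class is actually re-instantiated on the task side.

Code Block
// Illustrative sketch; the instantiation helper is an assumption.
public class SqoopSplit extends InputSplit implements Writable {
  private Partition partition;   // connector-specific description of this slice of work

  @Override
  public void write(DataOutput out) throws IOException {
    // record the concrete Partition class name, then let the Partition serialize itself
    out.writeUTF(partition.getClass().getName());
    partition.write(out);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    // re-create the Partition on the task side and let it deserialize its own state
    partition = (Partition) ClassUtils.instantiate(in.readUTF());
    partition.readFields(in);
  }

  @Override
  public long getLength() { return 0; }          // length is unknown for arbitrary sources

  @Override
  public String[] getLocations() { return new String[0]; }
}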

 

SqoopInputFormat

The InputFormat defines how the data on the FROM side is split up and read, and provides a factory for the RecordReader objects that read it.

SqoopInputFormat is a custom implementation of the MR InputFormat class, and SqoopRecordReader is the custom implementation of the RecordReader. The InputSplit defines a slice of work but does not describe how to access it. The SqoopRecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The SqoopRecordReader is invoked repeatedly on the input until the entire SqoopSplit has been consumed; each invocation provides another (key, value) pair to the Mapper. A fuller sketch of getSplits() and createRecordReader() follows the class signature below.

Code Block
public class SqoopInputFormat extends InputFormat<SqoopSplit, NullWritable> {
...} 
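
The factory role can be made concrete with the simplified sketch below; how the FROM connector's Partitioner and its configs are obtained is elided, and instantiatePartitioner() is a made-up helper.

Code Block
// Simplified sketch; instantiatePartitioner() and the config objects are assumptions.
public class SqoopInputFormat extends InputFormat<SqoopSplit, NullWritable> {

  // partitionerContext, linkConfig, jobConfig: rebuilt from the job configuration (omitted here)

  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException {
    Partitioner partitioner = instantiatePartitioner(context);
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (Object partition : partitioner.getPartitions(partitionerContext, linkConfig, jobConfig)) {
      SqoopSplit split = new SqoopSplit();
      split.setPartition((Partition) partition);  // each Partition becomes one map task's unit of work
      splits.add(split);
    }
    return splits;
  }

  @Override
  public RecordReader<SqoopSplit, NullWritable> createRecordReader(InputSplit split, TaskAttemptContext context) {
    // the SqoopRecordReader simply hands the SqoopSplit to the mapper as its key
    return new SqoopRecordReader();
  }
}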

SqoopNullOutputFormat

The (key, value) pairs provided by the mapper are passed on to the Loader for the TO part. The way they are written is governed by the OutputFormat. SqoopNullOutputFormat extends the OutputFormat class. The real goal of this custom output format is the same as Hadoop's NullOutputFormat: it generates no output files on HDFS, since HDFS may not always be the destination (in our case it is not necessarily so, hence the name SqoopNullOutputFormat; VB: Am I right?). Instead it relies on the SqoopOutputFormatLoadExecutor to pass the data to the Loader via the SqoopRecordWriter. Much like how the SqoopInputFormat reads individual records through the SqoopRecordReader implementation, the SqoopNullOutputFormat class is a factory for SqoopRecordWriter objects; these write the individual records to the final destination as directed by the OutputFormat (in our case the Loader). A sketch of this delegation follows the class signature below.

Code Block
public class SqoopNullOutputFormat extends OutputFormat<SqoopWritable, NullWritable> {
...}
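
A simplified view of that delegation is shown below; the executor's getRecordWriter() accessor and the exact constructor are assumed here for illustration.

Code Block
// Simplified sketch; the executor accessor shown here is an assumption.
public class SqoopNullOutputFormat extends OutputFormat<SqoopWritable, NullWritable> {

  @Override
  public RecordWriter<SqoopWritable, NullWritable> getRecordWriter(TaskAttemptContext context) {
    // the load executor owns the SqoopRecordWriter that feeds records to the Loader
    return new SqoopOutputFormatLoadExecutor(context).getRecordWriter();
  }

  @Override
  public void checkOutputSpecs(JobContext context) {
    // nothing to verify: no files are written to HDFS by this output format
  }

  @Override
  public OutputCommitter getOutputCommitter(TaskAttemptContext context) {
    // final cleanup hooks for job success/failure (see SqoopDestroyerOutputCommitter below)
    return new SqoopDestroyerOutputCommitter();
  }
}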

SqoopDestroyerOutputCommitter is a custom OutputCommitter that provides hooks to do final cleanup or, in some cases, the one-time operations we want to invoke when the Sqoop job finishes, i.e. either fails or succeeds.
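
For illustration, the hooks might look roughly like the sketch below; invokeDestroyer() is a hypothetical helper standing in for whatever actually calls the connector's Destroyer.

Code Block
// Illustrative sketch; invokeDestroyer() is a hypothetical helper.
public class SqoopDestroyerOutputCommitter extends OutputCommitter {

  @Override
  public void commitJob(JobContext jobContext) throws IOException {
    super.commitJob(jobContext);
    invokeDestroyer(jobContext, true);    // job succeeded: run the Destroyer with success = true
  }

  @Override
  public void abortJob(JobContext jobContext, JobStatus.State state) throws IOException {
    super.abortJob(jobContext, state);
    invokeDestroyer(jobContext, false);   // job failed: run the Destroyer with success = false
  }

  // task-level hooks are no-ops; cleanup is a one-time, job-level operation
  @Override public void setupJob(JobContext jobContext) { }
  @Override public void setupTask(TaskAttemptContext taskContext) { }
  @Override public void commitTask(TaskAttemptContext taskContext) { }
  @Override public void abortTask(TaskAttemptContext taskContext) { }
  @Override public boolean needsTaskCommit(TaskAttemptContext taskContext) { return false; }
}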

SqoopReducer

Extends the Reducer API and at this point only runs the progressService. It is invoked only when the numLoaders driver config is > 0. See above. VB: It is still unclear to me how this would support throttling as indicated in this ticket; looking for some details on how invoking a reducer helps.

...

Code Block
public class SqoopReducer extends Reducer<SqoopWritable, NullWritable, SqoopWritable, NullWritable> {
..
      // keep the task alive: report progress every two minutes while loading is in flight
      progressService.scheduleAtFixedRate(new SqoopProgressRunnable(context), 0, 2, TimeUnit.MINUTES);
}

SqoopOutputFormatLoadExecutor and SqoopOutputFormatDataReader

VB: The internal details of this class are not something I clearly understand, so I am hoping @jarcec can add some notes here on the ConsumerThread class and its significance. I understand the reader/writer problem, but in the context of Map tasks with no Reduce tasks, and sometimes with both Map and Reduce tasks, I would like to validate my understanding.
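
As a reference point for that discussion, the general reader/writer hand-off can be pictured with the self-contained sketch below. The class name (RecordExchange) is made up and this only illustrates the conceptual producer/consumer pattern, not the actual SqoopOutputFormatLoadExecutor internals.

Code Block
// Conceptual producer/consumer sketch only; not the real Sqoop classes.
import java.util.concurrent.SynchronousQueue;

public class RecordExchange {
  // A SynchronousQueue forces the writing (map-side) thread and the reading
  // (loader) thread to hand records over one at a time.
  private final SynchronousQueue<String> queue = new SynchronousQueue<String>();

  // Called from the SqoopRecordWriter on the map-task thread.
  public void writeRecord(String record) throws InterruptedException {
    queue.put(record);        // blocks until the loader thread takes the record
  }

  // Called from the Loader's consumer thread via the data reader.
  public String readRecord() throws InterruptedException {
    return queue.take();      // blocks until the map-task thread offers a record
  }
}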

...