...

SqoopDestroyerOutputCommitter is a custom OutputCommitter that provides hooks for final cleanup or, in some cases, for one-time operations we want to invoke when the Sqoop job finishes, i.e. whether it fails or succeeds.

SqoopReducer 

Extends the Reducer API and at this point only runs the progressService. It is invoked only when the numLoaders driver config is > 0 (see above). Its primary use case is throttling. VB: It is still unclear to me how this would support throttling as indicated in this ticket; looking for some details on how invoking a reducer helps.

Code Block
public class SqoopReducer extends Reducer<SqoopWritable, NullWritable, SqoopWritable, NullWritable> {
..
      progressService.scheduleAtFixedRate(new SqoopProgressRunnable(context), 0, 2, TimeUnit.MINUTES);
} 
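
The fragment above only shows the scheduling call. As a minimal, standalone sketch of what a fixed-rate progress heartbeat looks like with the standard java.util.concurrent API: the class name ProgressHeartbeat and the counter are hypothetical stand-ins (this is not Sqoop code), and millisecond periods are used instead of 2 minutes so the example finishes quickly.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; SqoopProgressRunnable itself is not reproduced here.
final class ProgressHeartbeat {

    // Runs a heartbeat every periodMillis for roughly waitMillis,
    // then stops the service and returns how many times it fired.
    static int runFor(long periodMillis, long waitMillis) {
        AtomicInteger ticks = new AtomicInteger();
        ScheduledExecutorService progressService =
                Executors.newSingleThreadScheduledExecutor();

        // Initial delay 0: the first heartbeat fires immediately,
        // mirroring the (0, 2, TimeUnit.MINUTES) call in the fragment above.
        progressService.scheduleAtFixedRate(
                ticks::incrementAndGet, 0, periodMillis, TimeUnit.MILLISECONDS);

        try {
            Thread.sleep(waitMillis);
            progressService.shutdownNow();
            progressService.awaitTermination(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            progressService.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return ticks.get();
    }
}
```

In a real reducer the scheduled task would presumably report progress on the task context so the framework does not consider a long-running task dead; that part is an assumption and is left out of the sketch.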


Why do we have the ability to run a reduce phase, and why is it part of throttling?

 

The original idea was that you want to throttle the “From” and “To” sides independently. For example, if I’m exporting data from HBase to a relational database, I might want to have one extractor (=mapper) per HBase region, but the number of regions will very likely be larger than the number of pumping transactions that I want to have on my database, so I might want to specify a different number of loaders to throttle that down. But having a reduce phase means serializing all the data and transferring it across the network, so we do not run the reduce phase unless the user explicitly sets a number of loaders that differs from the number of extractors.
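
The decision described above can be sketched in a few lines. This is a simplified illustration under the condition stated earlier on this page (the reducer is invoked only when numLoaders > 0); the class and method names are hypothetical and not taken from the Sqoop code base.

```java
// Hypothetical helper, not Sqoop code: decides how many reduce tasks to request.
final class ThrottlingPlan {

    // numLoaders may be null when the user did not configure it.
    // Returns 0 for a map-only job (no shuffle, no serialization across the network),
    // otherwise the number of loaders, i.e. the number of reduce tasks.
    static int reduceTasks(Integer numLoaders) {
        return (numLoaders != null && numLoaders > 0) ? numLoaders : 0;
    }
}
```

With the HBase example above, a job with one extractor per region and numLoaders set to 4 would ask for 4 reduce tasks, so at most 4 loaders write to the database at once. In Hadoop, a job with zero reduce tasks runs map-only, which is the cheaper default path when no throttling is requested.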

 

SqoopOutputFormatLoadExecutor and SqoopOutputFormatDataReader 

  1. The LoaderContext is set up in the ConsumerThread.run(..) method. 
  2. Loader's load method is invoked passing the SqoopOutputFormatDataReader and the LoaderContext
  3. The load method invokes the SqoopOutputFormatDataReader to read the records from the SqoopRecordWriter associated with the SqoopNullOutputFormat

...

  1. The SqoopOutputFormatLoadExecutor uses the ConsumerThread to parallelize the extraction and loading process, in addition to parallelizing the extract-only part using the configured numExtractors. More details are explained in SQOOP-1938.
     

    TL;DR: Parallelize reads and writes rather than have them be sequential.

    Most of the threading magic exists for a pretty simple reason: each mapper does I/O in two places. One is the write to HDFS, the other is the read from the DB (that was the design at the time; extended to the new From/To architecture, you still have two I/O paths). With linear read-write code, you are not reading anything while the write is happening, which is inefficient. You can read while the write is in progress by running the reads and writes in parallel, which is what is being done. In addition, the output format does some extra processing/handling that costs time and CPU, time during which you could instead be reading from the DB.
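
A minimal sketch of the idea, assuming a plain producer/consumer hand-off over a bounded queue: the calling thread plays the extractor (reader) and a separate consumer thread plays the loader (writer). The class ParallelReadWrite, the queue size, and the Optional end marker are illustrative choices, not the actual SqoopOutputFormatLoadExecutor implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration of overlapping reads and writes; not Sqoop code.
final class ParallelReadWrite {

    // Copies records from source to a "sink" list, with the write side
    // running on its own thread so reading never waits for a write to finish
    // (except when the bounded queue is full).
    static List<String> copy(List<String> source) {
        BlockingQueue<Optional<String>> queue = new ArrayBlockingQueue<>(16);
        List<String> written = new ArrayList<>();

        Thread consumer = new Thread(() -> {
            try {
                // An empty Optional marks the end of the data.
                for (Optional<String> r = queue.take(); r.isPresent(); r = queue.take()) {
                    written.add(r.get()); // stands in for the slow write
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "consumer");
        consumer.start();

        try {
            for (String record : source) {
                queue.put(Optional.of(record)); // stands in for the read side
            }
            queue.put(Optional.empty());
            consumer.join(); // join makes the consumer's writes visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return written;
    }
}
```

The bounded queue is a deliberate choice in this sketch: it lets the reader run ahead of the writer without letting unwritten records pile up in memory without limit.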


 

A few related tickets proposed for enhancement

...