...

Components of Sqoop using MR

By default, a Sqoop job is a map-only job; it does not use the reduce phase unless the number of loaders is explicitly configured, as the following table summarizes:

# Extractors | # Loaders | Outcome
-------------|-----------|----------------------------------------------------
Default      | Default   | Map-only job with 10 map tasks
Number X     | Default   | Map-only job with X map tasks
Number X     | Number Y  | Map-reduce job with X map tasks and Y reduce tasks
Default      | Number Y  | Map-reduce job with 10 map tasks and Y reduce tasks

The purpose has been to let the user throttle the number of extractors and the number of loaders independently (e.g., run a different number of loaders than extractors), with defaults that avoid the reduce phase when it is not necessary.
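
As an illustration, here is a minimal sketch of how those rules could map onto the standard Hadoop Job API; the numExtractors/numLoaders variables, the config key, and the surrounding wiring are assumptions for this example, not the actual Sqoop execution-engine code.

Code Block
// Hypothetical wiring of the throttling rules from the table above.
Job job = Job.getInstance(new Configuration(), "sqoop-job");

// Extractors drive the map-task count indirectly: the InputFormat's
// getSplits() produces one split (one map task) per extractor; 10 by default.
int extractors = (numExtractors != null) ? numExtractors : 10;
job.getConfiguration().setInt("sqoop.job.extractors", extractors);  // hypothetical key

if (numLoaders != null) {
  // Loaders explicitly set: run a map-reduce job, one reduce task per loader.
  job.setReducerClass(SqoopReducer.class);
  job.setNumReduceTasks(numLoaders);
} else {
  // Default: map-only job, no reduce phase at all.
  job.setNumReduceTasks(0);
}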

...

Passing data into the Sqoop job (via the mapper)

 

...

The Extractor and Loader require various pieces of information, such as the job configs, driver configs, the schema of the data read, and the schema of the data written; all of it has to be passed in via the SqoopMapper. It is currently passed either securely via the credential store or via the configuration:

 

Code Block
// job is the org.apache.hadoop.mapreduce.Job instance built at submission time
job.getCredentials().addSecretKey(SCHEMA_FROM_KEY, jsonSchema.getBytes());
job.getCredentials().addSecretKey(SCHEMA_TO_KEY, jsonSchema.getBytes());
job.getCredentials().addSecretKey(MR_JOB_CONFIG_FROM_CONNECTOR_LINK_KEY, ConfigUtils.toJson(obj).getBytes());
job.getConfiguration().set(MR_JOB_CONFIG_CLASS_FROM_CONNECTOR_LINK, obj.getClass().getName());

 

SqoopMapper

 

  1. Creates the ExtractorContext from the data stored in the configuration and the credential store, and passes it to the connector's extract API (see the sketch after this list)
  2. Creates the SqoopSplit that holds the partition information for the data to be extracted
  3. After the extract call, records the Hadoop counters related to the extraction logic
  4. Passing data out of the mapper: the DistributedCache can be used if we need to write any information from the extractor back to the Sqoop repository
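
A condensed sketch of that flow follows; the restoreSchema helper, the ExtractorContext constructor signature, and fields such as extractor, subContext, dataWriter, linkConfig, and jobConfig are assumptions for illustration, not the actual SqoopMapper code.

Code Block
public class SqoopMapper extends Mapper<SqoopSplit, NullWritable, SqoopWritable, NullWritable> {
  @Override
  public void run(Context context) throws IOException, InterruptedException {
    // 1. Rebuild the schemas and configs stored at submission time
    //    (see the credential-store code block earlier on this page).
    byte[] schemaJson = context.getCredentials().getSecretKey(SCHEMA_FROM_KEY);
    Schema fromSchema = restoreSchema(new String(schemaJson));   // hypothetical helper

    // 2. The split handed to this mapper wraps the Partition created
    //    by the connector's Partitioner.
    SqoopSplit split = (SqoopSplit) context.getInputSplit();

    // 3. Build the ExtractorContext and invoke the connector's extract API.
    ExtractorContext extractorContext = new ExtractorContext(subContext, dataWriter, fromSchema);
    extractor.extract(extractorContext, linkConfig, jobConfig, split.getPartition());

    // 4. After extract() returns, record the extraction-side Hadoop counters.
    context.getCounter(SqoopCounters.ROWS_READ).increment(extractor.getRowsRead());
  }
}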

 

SqoopWritable (Comparable)

 

  1. A Writable class is required by the Hadoop framework; we use the current one as a wrapper around the IntermediateDataFormat. Read more on IDF here
  2. We are not using a concrete implementation such as Text, so that we do not have to convert every record to a String to transfer data between mappers and reducers.
  3. SqoopWritable delegates much of its functionality to the IntermediateDataFormat implementation used in the Sqoop job. For instance, the compareTo method of the Writable can use any custom logic of the underlying IDF to sort the extracted records, which is eventually used when writing in the load phase (see the sketch after this list).
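
A sketch of that delegation pattern; the serialization details of the real class may differ, so treat the method bodies (here built on the IDF's CSV text form) as illustrative.

Code Block
public class SqoopWritable implements WritableComparable<SqoopWritable> {
  private IntermediateDataFormat<?> dataFormat;   // the wrapped IDF

  @Override
  public void write(DataOutput out) throws IOException {
    // Delegate serialization to the IDF via its canonical CSV text form.
    out.writeUTF(dataFormat.getCSVTextData());
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    dataFormat.setCSVTextData(in.readUTF());
  }

  @Override
  public int compareTo(SqoopWritable other) {
    // Sorting between the extract and load phases is delegated to the
    // IDF's representation of the record.
    return dataFormat.getCSVTextData().compareTo(other.dataFormat.getCSVTextData());
  }
}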

 

SqoopSplit

 

  1. An InputSplit describes the unit of work that makes up a single map task in a MapReduce program; SqoopSplit extends InputSplit
  2. Instantiates a custom Partition class used for splitting the input, which in our case is the data to be extracted in the extract phase
  3. Delegates to the Partition object to read and write its data (see the sketch after this list)
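
A sketch of that delegation: serializing the Partition's class name lets the split stay generic across connectors and rebuild the Partition by reflection on the mapper side. The error handling and accessor names here are illustrative.

Code Block
public class SqoopSplit extends InputSplit implements Writable {
  private Partition partition;   // connector-specific slice of the FROM data

  public void setPartition(Partition partition) { this.partition = partition; }
  public Partition getPartition() { return partition; }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(partition.getClass().getName());  // so readFields can rebuild it
    partition.write(out);                          // delegate the payload to the Partition
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    String className = in.readUTF();
    try {
      partition = (Partition) Class.forName(className).newInstance();
    } catch (Exception e) {
      throw new IOException("Cannot instantiate partition class " + className, e);
    }
    partition.readFields(in);
  }

  @Override
  public long getLength() { return 0; }            // size is unknown for arbitrary sources

  @Override
  public String[] getLocations() { return new String[0]; }  // no locality hints
}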

 

SqoopInputFormat

 

The InputFormat defines how the data on the FROM side is split up and read, and it provides a factory for RecordReader objects that read the individual splits.

...

Code Block
public class SqoopInputFormat extends InputFormat<SqoopSplit, NullWritable> {
...} 
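
A sketch of the getSplits() side: one SqoopSplit is created per Partition returned by the connector's Partitioner, which yields one map task per extractor. The instantiatePartitioner helper and the partitionerContext/linkConfig/jobConfig variables are assumptions of this illustration.

Code Block
@Override
public List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException {
  // Ask the connector's Partitioner how to slice the FROM data; the number
  // of partitions is bounded by the configured number of extractors.
  Partitioner partitioner = instantiatePartitioner(context);   // hypothetical helper
  List<Partition> partitions = partitioner.getPartitions(partitionerContext, linkConfig, jobConfig);

  List<InputSplit> splits = new ArrayList<InputSplit>(partitions.size());
  for (Partition partition : partitions) {
    SqoopSplit split = new SqoopSplit();
    split.setPartition(partition);   // each partition becomes one map task
    splits.add(split);
  }
  return splits;
}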

SqoopNullOutputFormat

The (key, value) pairs produced by the mapper are passed on to the Loader for the TO part. The way they are written is governed by the OutputFormat; SqoopNullOutputFormat extends the OutputFormat class. The goal of this custom output format is to generate no output files on HDFS, since HDFS may not always be the destination; instead, it relies on the SqoopOutputFormatLoadExecutor to pass the data to the Loader via the SqoopRecordWriter.

Much like how the InputFormat reads individual records through its RecordReader implementation, the OutputFormat class is a factory for RecordWriter objects, which are used to write the individual records as directed by the OutputFormat.

 

Code Block
public class SqoopNullOutputFormat extends OutputFormat<SqoopWritable, NullWritable> {
...}
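
A sketch of the hand-off described above; the executor's constructor and accessor are named after the description on this page but should be read as illustrative.

Code Block
@Override
public RecordWriter<SqoopWritable, NullWritable> getRecordWriter(TaskAttemptContext context)
    throws IOException, InterruptedException {
  // No files are opened on HDFS; the returned SqoopRecordWriter feeds each
  // record to the connector's Loader through the load executor.
  SqoopOutputFormatLoadExecutor executor = new SqoopOutputFormatLoadExecutor(context);
  return executor.getRecordWriter();
}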

 

SqoopReducer

 

Code Block
public class SqoopReducer extends Reducer<SqoopWritable, NullWritable, SqoopWritable, NullWritable> {
...}
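
The reducer performs no transformation of its own; it exists so that the number of loaders (the Y reduce tasks in the table above) can be throttled independently of the extractors. Its effective behavior is an identity pass-through, sketched here as an assumption:

Code Block
// Assumed effective behavior: forward every record unchanged to the
// output format, which hands it to the Loader.
@Override
public void reduce(SqoopWritable key, Iterable<NullWritable> values, Context context)
    throws IOException, InterruptedException {
  for (NullWritable value : values) {
    context.write(key, value);
  }
}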


SqoopOutputFormatLoadExecutor

...