JIRA: SQOOP-1938

This document describes how the Sqoop MR execution engine works, its major components, and the internals of the implementation.

Summary

The main entry point for job execution in Sqoop2 is the JobManager, which is part of the org.apache.sqoop.driver package. The JobManager holds handles to the SubmissionEngine and the ExecutionEngine.

The SubmissionEngine uses the concrete APIs of YARN, Mesos, or Oozie, which handle resource management for job execution, to submit the job to the execution engine. The ExecutionEngine is the actual job executor and uses the APIs of Apache Hadoop MR or Apache Spark to execute the Sqoop job.

JobManager does the following three things (sketched in the code below):

  1. Prepares the JobRequest object for the ExecutionEngine
  2. Submits the job via the SubmissionEngine submit API and waits for the submission engine to return
  3. Based on the result of the submit API, creates and saves the job submission record in its repository (Derby/PostgreSQL, depending on the configured store) to keep the history across multiple job runs
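
Below is a minimal sketch of that flow, assuming simplified names; the real JobManager in org.apache.sqoop.driver has considerably more validation and error handling, and imports from Sqoop's own packages are omitted here:

    // Illustrative sketch only; names are simplified stand-ins for the
    // real org.apache.sqoop.driver.JobManager logic.
    public class JobManagerSketch {
      private SubmissionEngine submissionEngine; // e.g. MRSubmissionEngine
      private ExecutionEngine executionEngine;   // e.g. MRExecutionEngine
      private Repository repository;             // Derby/PostgreSQL backed store

      public MSubmission submit(long jobId) {
        // 1. Prepare the JobRequest object for the ExecutionEngine.
        JobRequest request = executionEngine.createJobRequest();
        // ... populate request with connector, driver, and job configs ...
        executionEngine.prepareJob(request);

        // 2. Submit via the SubmissionEngine submit API and wait for it to return.
        MSubmission submission = new MSubmission();
        boolean submitted = submissionEngine.submit(request);

        // 3. Persist the job submission record so history is kept across runs.
        submission.setStatus(submitted ? SubmissionStatus.BOOTING
                                       : SubmissionStatus.FAILURE_ON_SUBMIT);
        repository.createSubmission(submission);
        return submission;
      }
    }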

Below we discuss the details of the implementation of the MRSubmissionEngine and the MRExecutionEngine.

MR SubmissionEngine

  • Has a handle to the concrete execution engine, which is org.apache.hadoop.mapred.JobClient in our case
  • Initialize API sets up the submission engine
  • Submit API is blocking when using the Hadoop local runner and returns a boolean for success or failure of the submission; it is asynchronous when non-local. In the asynchronous case, the update API is used subsequently to track the progress of the job submission
  • Update API can be invoked to query the status of the running job and to update the job submission record that holds the history of a Sqoop job across multiple runs
  • Stop API aborts a running job
  • Destroy API mirrors initialize and cleans up the submission engine on exit (the overall contract is sketched below)
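
Taken together, the contract looks roughly like the following abstract class. This is a hedged paraphrase of the API shape described above, not the literal Sqoop source, and the signatures are illustrative:

    // Approximate shape of the submission engine contract; signatures are
    // assumptions based on the description above, not copied from Sqoop.
    public abstract class SubmissionEngineSketch {
      // Called once at startup to set up the engine
      // (for MR, this creates the org.apache.hadoop.mapred.JobClient).
      public abstract void initialize(MapContext context, String prefix);

      // Blocking for the local runner, returning success or failure of the
      // submission; asynchronous otherwise, with update() used for progress.
      public abstract boolean submit(JobRequest request);

      // Queries the status of the running job and refreshes the job
      // submission record that tracks history across runs.
      public abstract void update(MSubmission submission);

      // Aborts a running job.
      public abstract void stop(String externalJobId);

      // Mirrors initialize(): cleans up the submission engine on exit.
      public abstract void destroy();
    }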

MR ExecutionEngine

  • Has a handle to the JobRequest object populated by the JobManager
  • PrepareJob API sets up the necessary information required by org.apache.hadoop.mapred.JobClient in our case (see the sketch after the note below)

NOTE: The ExecutionEngine API is very bare bones; most of the job execution, failure, and exception handling resulting from the MR engine happens inside the MRSubmissionEngine.
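
For illustration, prepareJob essentially pins the Sqoop MR classes (discussed in the next section) onto the request so that the submission engine can later copy them into the Hadoop job configuration. A minimal sketch, with the setter names on the request assumed rather than quoted from the source:

    // Hedged sketch: the setters are assumptions, but the classes being
    // wired in are the real MR components covered below.
    public void prepareJob(JobRequest jobRequest) {
      MRJobRequest request = (MRJobRequest) jobRequest;
      request.setInputFormatClass(SqoopInputFormat.class);
      request.setMapperClass(SqoopMapper.class);
      request.setMapOutputKeyClass(SqoopWritable.class);
      request.setMapOutputValueClass(NullWritable.class);
      request.setOutputFormatClass(SqoopNullOutputFormat.class);
      request.setOutputKeyClass(SqoopWritable.class);
      request.setOutputValueClass(NullWritable.class);
    }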

Components of Sqoop using MR

SqoopMapper

  • The current semantics are:

# Extractors   # Loaders   Outcome
Default        Default     Map-only job with 10 map tasks
Number X       Default     Map-only job with X map tasks
Number X       Number Y    Map-reduce job with X map tasks and Y reduce tasks
Default        Number Y    Map-reduce job with 10 map tasks and Y reduce tasks

The purpose has been to give the user the ability to throttle the number of extractors and loaders independently (e.g., to run a different number of loaders than extractors) and to provide default values that do not run a reduce phase when it is not necessary.
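
As a sketch of how these semantics could translate into a Hadoop job configuration (only the default of 10 extractors comes from the table above; the class, method, and variable names are illustrative, and in practice the actual number of map tasks is driven by the input splits):

    import org.apache.hadoop.mapreduce.Job;

    // Illustrative mapping of the throttling table onto an MR job.
    public class ThrottlingSketch {
      static final int DEFAULT_NUM_EXTRACTORS = 10; // from the table above

      void configureTasks(Job job, Integer numExtractors, Integer numLoaders) {
        // Extractors drive the number of map tasks (via the number of splits);
        // "mapreduce.job.maps" is only a hint to the framework.
        int mapTasks = (numExtractors != null) ? numExtractors
                                               : DEFAULT_NUM_EXTRACTORS;
        job.getConfiguration().setInt("mapreduce.job.maps", mapTasks);

        // Loaders drive the number of reduce tasks; the default of zero keeps
        // the job map-only and skips the reduce phase entirely.
        int reduceTasks = (numLoaders != null) ? numLoaders : 0;
        job.setNumReduceTasks(reduceTasks);
      }
    }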

Sqoop Writable

Having a Writable class is required by the Hadoop framework. We use the current one as a wrapper for IntermediateDataFormat, which we cannot use directly in MR because Hadoop does not support that (to the best of my knowledge). We do not use a concrete implementation such as Text, so that we do not have to convert all records to String to transfer data between mappers and reducers.
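
A minimal sketch of that wrapper pattern follows. This is not the actual SqoopWritable source: the real class delegates to the configured IntermediateDataFormat, whereas this stand-in holds the record's textual form directly:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Sketch of the wrapper idea: give Hadoop a Writable it can shuffle
    // without committing the rest of Sqoop to Text as its record type.
    public class SqoopWritableSketch
        implements WritableComparable<SqoopWritableSketch> {
      // Stand-in for the wrapped IntermediateDataFormat payload.
      private String csvText;

      public void setString(String text) { this.csvText = text; }
      public String getString() { return csvText; }

      @Override
      public void write(DataOutput out) throws IOException {
        out.writeUTF(csvText); // serialize the wrapped record for the shuffle
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        csvText = in.readUTF(); // rehydrate on the receiving side
      }

      @Override
      public int compareTo(SqoopWritableSketch other) {
        return getString().compareTo(other.getString());
      }
    }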

Passing data into the Sqoop job (via the mapper)


SqoopInputFormat

 

 

SqoopSplit

SqoopNullOutputFormat

 

Passing data out of the Sqoop job (via the OutputFormat)

SqoopReducer

 

SqoopOutputFormatLoadExecutor

 
