JIRA: SQOOP-1938

This document describes how the Sqoop MR execution engine works, its major components, and the internals of the implementation.

Summary

The main entry point for job execution in Sqoop2 is the JobManager, which is part of the org.apache.sqoop.driver package. The JobManager holds handles to the SubmissionEngine and the ExecutionEngine.

The SubmissionEngine uses the concrete APIs of YARN, Mesos, or Oozie, which handle resource management for job execution, to submit the job to the execution engine. The ExecutionEngine is the actual job executor and uses the APIs of Apache Hadoop MR or Apache Spark to execute the Sqoop job.

JobManager does the following three things (sketched in the code below):

  1. Prepares the JobRequest object for the ExecutionEngine
  2. Submits the job via the SubmissionEngine submit API and waits for the submission engine to return
  3. Based on the result of the submit API, creates and saves the job submission record in its repository (Derby/PostgreSQL, depending on the configured store) to keep the history across multiple job runs
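
Below is a minimal sketch of that flow, assuming simplified names; the real JobManager in org.apache.sqoop.driver has considerably more validation and error handling, and imports from Sqoop's own packages are omitted here:

    // Illustrative sketch only; names are simplified stand-ins for the
    // real org.apache.sqoop.driver.JobManager logic.
    public class JobManagerSketch {
      private SubmissionEngine submissionEngine; // e.g. MRSubmissionEngine
      private ExecutionEngine executionEngine;   // e.g. MRExecutionEngine
      private Repository repository;             // Derby/PostgreSQL backed store

      public MSubmission submit(long jobId) {
        // 1. Prepare the JobRequest object for the ExecutionEngine.
        JobRequest request = executionEngine.createJobRequest();
        // ... populate request with connector, driver, and job configs ...
        executionEngine.prepareJob(request);

        // 2. Submit via the SubmissionEngine submit API and wait for it to return.
        MSubmission submission = new MSubmission();
        boolean submitted = submissionEngine.submit(request);

        // 3. Persist the job submission record so history is kept across runs.
        submission.setStatus(submitted ? SubmissionStatus.BOOTING
                                       : SubmissionStatus.FAILURE_ON_SUBMIT);
        repository.createSubmission(submission);
        return submission;
      }
    }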

Below we discuss the details of the implementation of the MRSubmissionEngine and the MRExecutionEngine.

MR SubmissionEngine

  • Has a handle to the concrete execution engine, which is org.apache.hadoop.mapred.JobClient in our case
  • Initialize API sets up the submission engine
  • Submit API is blocking when using the Hadoop local runner and returns a boolean for success or failure of the submission; it is asynchronous when non-local. In the asynchronous case, the update API is used subsequently to track the progress of the job submission
  • Update API can be invoked to query the status of the running job and to update the job submission record that holds the history of a Sqoop job across multiple runs
  • Stop API aborts a running job
  • Destroy API mirrors initialize and cleans up the submission engine on exit (the overall contract is sketched below)
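
Taken together, the contract looks roughly like the following abstract class. This is a hedged paraphrase of the API shape described above, not the literal Sqoop source, and the signatures are illustrative:

    // Approximate shape of the submission engine contract; signatures are
    // assumptions based on the description above, not copied from Sqoop.
    public abstract class SubmissionEngineSketch {
      // Called once at startup to set up the engine
      // (for MR, this creates the org.apache.hadoop.mapred.JobClient).
      public abstract void initialize(MapContext context, String prefix);

      // Blocking for the local runner, returning success or failure of the
      // submission; asynchronous otherwise, with update() used for progress.
      public abstract boolean submit(JobRequest request);

      // Queries the status of the running job and refreshes the job
      // submission record that tracks history across runs.
      public abstract void update(MSubmission submission);

      // Aborts a running job.
      public abstract void stop(String externalJobId);

      // Mirrors initialize(): cleans up the submission engine on exit.
      public abstract void destroy();
    }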

MR ExecutionEngine

  • Has a handle to the JobRequest object populated by the JobManager
  • PrepareJob API sets up the necessary information required by org.apache.hadoop.mapred.JobClient in our case (see the sketch after the note below)

NOTE: The ExecutionEngine API is very bare bones; most of the job execution, failure, and exception handling resulting from the MR engine happens inside the MRSubmissionEngine.
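
For illustration, prepareJob essentially pins the Sqoop MR classes (discussed in the next section) onto the request so that the submission engine can later copy them into the Hadoop job configuration. A minimal sketch, with the setter names on the request assumed rather than quoted from the source:

    // Hedged sketch: the setters are assumptions, but the classes being
    // wired in are the real MR components covered below.
    public void prepareJob(JobRequest jobRequest) {
      MRJobRequest request = (MRJobRequest) jobRequest;
      request.setInputFormatClass(SqoopInputFormat.class);
      request.setMapperClass(SqoopMapper.class);
      request.setMapOutputKeyClass(SqoopWritable.class);
      request.setMapOutputValueClass(NullWritable.class);
      request.setOutputFormatClass(SqoopNullOutputFormat.class);
      request.setOutputKeyClass(SqoopWritable.class);
      request.setOutputValueClass(NullWritable.class);
    }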

Components of Sqoop using MR

SqoopMapper

  • The current semantics are:

# Extractors   # Loaders   Outcome
Default        Default     Map-only job with 10 map tasks
Number X       Default     Map-only job with X map tasks
Number X       Number Y    Map-reduce job with X map tasks and Y reduce tasks
Default        Number Y    Map-reduce job with 10 map tasks and Y reduce tasks

The purpose has been to give the user the ability to throttle the number of extractors and loaders independently (e.g., to run a different number of loaders than extractors) and to provide default values that do not run a reduce phase when it is not necessary.
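
As a sketch of how these semantics could translate into a Hadoop job configuration (only the default of 10 extractors comes from the table above; the class, method, and variable names are illustrative, and in practice the actual number of map tasks is driven by the input splits):

    import org.apache.hadoop.mapreduce.Job;

    // Illustrative mapping of the throttling table onto an MR job.
    public class ThrottlingSketch {
      static final int DEFAULT_NUM_EXTRACTORS = 10; // from the table above

      void configureTasks(Job job, Integer numExtractors, Integer numLoaders) {
        // Extractors drive the number of map tasks (via the number of splits);
        // "mapreduce.job.maps" is only a hint to the framework.
        int mapTasks = (numExtractors != null) ? numExtractors
                                               : DEFAULT_NUM_EXTRACTORS;
        job.getConfiguration().setInt("mapreduce.job.maps", mapTasks);

        // Loaders drive the number of reduce tasks; the default of zero keeps
        // the job map-only and skips the reduce phase entirely.
        int reduceTasks = (numLoaders != null) ? numLoaders : 0;
        job.setNumReduceTasks(reduceTasks);
      }
    }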

Sqoop Writable

Having a Writable class is required by the Hadoop framework. We use the current one as a wrapper for IntermediateDataFormat, which we cannot use directly in MR because Hadoop does not support that (to the best of my knowledge). We do not use a concrete implementation such as Text, so that we do not have to convert all records to String to transfer data between mappers and reducers.
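
A minimal sketch of that wrapper pattern follows. This is not the actual SqoopWritable source: the real class delegates to the configured IntermediateDataFormat, whereas this stand-in holds the record's textual form directly:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Sketch of the wrapper idea: give Hadoop a Writable it can shuffle
    // without committing the rest of Sqoop to Text as its record type.
    public class SqoopWritableSketch
        implements WritableComparable<SqoopWritableSketch> {
      // Stand-in for the wrapped IntermediateDataFormat payload.
      private String csvText;

      public void setString(String text) { this.csvText = text; }
      public String getString() { return csvText; }

      @Override
      public void write(DataOutput out) throws IOException {
        out.writeUTF(csvText); // serialize the wrapped record for the shuffle
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        csvText = in.readUTF(); // rehydrate on the receiving side
      }

      @Override
      public int compareTo(SqoopWritableSketch other) {
        return getString().compareTo(other.getString());
      }
    }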

Passing data into the Sqoop job (via the mapper)


SqoopInputFormat

 

 

SqoopSplit

SqoopNullOutputFormat

 

Passing data out of the Sqoop job (via the OutputFormat)

SqoopReducer

 

SqoopOutputFormatLoadExecutor

 
