Building from sources
Check out the sources and switch to the sqoop2 branch:
$ git clone https://git-wip-us.apache.org/repos/asf/sqoop.git sqoop2
$ cd sqoop2
$ git checkout sqoop2
Then you can build Sqoop using mvn:
$ mvn package
Optionally, you can build Sqoop while skipping the tests:
$ mvn package -DskipTests
Creating binaries
Now build and package the Sqoop2 binary distribution:
$ mvn package -Pbinary
This process will create a directory and a tarball under the dist/target
directory. The directory (named sqoop-2.0.0-SNAPSHOT
as of this writing) contains the necessary binaries to run Sqoop2, and its structure looks something like this:
--+ bin --+ sqoop.sh
  |
  + client --+ lib --+ sqoop-common.jar
  |                  |
  |                  + sqoop-client.jar
  |                  |
  |                  + (3rd-party client dependency jars)
  |
  + server --+ bin --+ setenv.sh
  |          |
  |          + conf --+ sqoop_bootstrap.properties
  |          |        |
  |          |        + sqoop.properties
  |          |
  |          + webapps --+ ROOT
  |                      |
  |                      + sqoop.war
  |
  + ...
As part of this process, a copy of the Tomcat server is also downloaded and put under the server
directory in the above structure.
Setting up a build environment with Eclipse
- Install Eclipse, Maven, and Oracle's JDK, and make sure mvn eclipse:eclipse runs successfully.
- Import the project into Eclipse by going to File > Import... > General > Existing Projects into Workspace > Next.
- In the next wizard window, click the Browse button next to "Select root directory" and browse to the root of the directory where you checked out sqoop2. This will populate about 10 projects into your workspace, all of which are different modules within Sqoop 2. Click the Finish button to bring these projects into the workspace and start working.
Note - if this is the first time you are setting up Eclipse for a Maven project, the import will show classpath problems due to the missing variable M2_REPO (Unbound classpath variable: 'M2_REPO/...'). To fix this error, go to Preferences > Java > Build Path > Classpath Variables. Click New..., enter the name M2_REPO, click Folder, and browse to the directory ~/.m2/repository. Click OK and close the Preferences dialog. This will force a rebuild of the workspace, and all projects should turn green.
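Alternatively, recent versions of the maven-eclipse-plugin can register the M2_REPO variable for you via the configure-workspace goal; this is a sketch, so adjust the workspace path to your environment:
$ mvn eclipse:configure-workspace -Declipse.workspace=/path/to/your/workspace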
Installing Sqoop2 on remote server
To install the generated binaries on a remote server, simply copy the directory sqoop-2.0.0-SNAPSHOT
to your remote server:
scp -r dist/target/sqoop-2.0.0-SNAPSHOT remote-server.company.org:/remote/path/
Installing dependencies
The Sqoop server depends on Hadoop binaries, but they are not part of the distribution, so you need to install them into the Sqoop server manually. We currently support only version 2.0; other versions will be added later. To install the Hadoop libraries, execute the addtowar.sh
script with the argument -hadoop $version $location
. The following example is for Cloudera's distribution version 4 (CDH4):
./bin/addtowar.sh -hadoop 2.0 /usr/lib/hadoop/client/
If you're running CDH4 MR1:
cd dist/target/sqoop-2.0.0-SNAPSHOT-bin-hadoop200
./bin/addtowar.sh -hadoop-version cdh4mr1 -hadoop-path /usr/lib
If you are running the original MapReduce implementation (MR1), you will also need to install its jar:
./bin/addtowar.sh -jars /usr/lib/hadoop-0.20-mapreduce/hadoop-2.0.0-mr1-cdh4.1.1-core.jar
You can install arbitrary jars (connectors, JDBC drivers) using the -jars
argument, which takes a list of jars separated by ":". Here is an example of installing the MySQL JDBC driver into the Sqoop server:
./bin/addtowar.sh -jars /path/to/jar/mysql-connector-java-5.1.21-bin.jar
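Since the -jars argument takes a ":"-separated list, several jars can be installed in one call. For example (the second jar path here is hypothetical):
./bin/addtowar.sh -jars /path/to/jar/mysql-connector-java-5.1.21-bin.jar:/path/to/jar/my-custom-connector.jar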
Starting/Stopping Sqoop2 server
To start the Sqoop2 server, invoke the sqoop.sh
shell script:
cd dist/target/sqoop-2.0.0-SNAPSHOT
bin/sqoop.sh server start
The Sqoop2 server is then running as a web application within the Tomcat server.
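To verify that the server came up, you can query its version over the REST interface; this assumes the default port 12000 and webapp name sqoop, so adjust if you have changed them:
curl http://localhost:12000/sqoop/version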
Similarly, to stop the Sqoop2 server, do the following:
bin/sqoop.sh server stop
Starting/Running Sqoop2 client
To start an interactive shell, run:
bin/sqoop.sh client
This will bring up an interactive client ready to accept commands:
Sqoop Shell: Type 'help' or '\h' for help.

sqoop:000>
Commands in the shell client follow the form <command> <function> <options>, as in the example after this list:
- set
- set server
- set server --host <host>
- set server --port <port>
- set server --webapp <webapp>
- show
- show version
- show version --all
- show version --server
- show version --client
- show version --protocol
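For example, to point the shell at a remote Sqoop2 server (the host name below is hypothetical; 12000 and sqoop are the shipped defaults):
sqoop:000> set server --host sqoop2-server.company.org --port 12000 --webapp sqoop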
Type "help" for getting list of all possible command line commands.
Full import demo
This example will walk you through the entire process of creating all the objects needed to produce an actual MapReduce job that will be executed on your Hadoop cluster. This demo expects that you have installed the Sqoop2 server as instructed in the previous sections. I'll use MySQL in this example; please change the actual values to those corresponding to your database box.
First you need to create a connection metadata object. Sqoop2 currently ships with a single connector called "Generic JDBC Connector", which will most likely have id 1. You can see the list of connectors and their ids by running the command show connector --all
:
sqoop:000> show connector --all
1 connector(s) to show:
Connector with id 1:
  Name: generic-jdbc-connector
  Class: org.apache.sqoop.connector.jdbc.GenericJdbcConnector
  Supported job types: [EXPORT, IMPORT]
    Connection form 1:
      Name: form-connection
      Label: Configuration configuration
      Help: You must supply the information requested in order to create a connection object.
      Input 1:
...
A new connection is created using the command create connection
. This command requires the parameter --cid
to specify which connector should be used. Use the number retrieved from the previous command:
sqoop:000> create connection --cid 1
You will be asked for several parameters; please fill them in for your specific use case. Here is an example. Note that the arguments might vary in your case as Sqoop develops:
sqoop:000> create connection --cid 1
Creating connection for connector with id 1
Please fill following values to create new connection object
Name: first-connection

Configuration configuration
JDBC Driver Class: com.mysql.jdbc.Driver
JDBC Connection String: jdbc:mysql://remote-server/dbname
Username: username
Password: ********
JDBC Connection Properties:
There are currently 0 values in the map:
entry#

Database configuration
Table name: example_table
Table SQL statement:
Table column names:
Data warehouse:
Data directory:
Partition column name:
Boundary query:

Security related configuration options
Max connections: 123
New connection was successfully created with validation status FINE and persistent id 1
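To double-check what was stored, you can list the connection objects; only the command is shown here, as the output simply mirrors the values you entered:
sqoop:000> show connection --all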
The next step is to create a job. It is created similarly to a connection, using the command create job
. You need to pass the argument --xid
to indicate which connection should be used for this job object, and the --type
argument with the job type; here we use import:
sqoop:000> create job --xid 1 --type import
Here is an example at the time of writing this demo:
sqoop:000> create job --xid 1 --type import
Creating job for connection with id 1
Please fill following values to create new job object
Name: first-job

Ignored
Ignored:

Output configuration
Output format:
Output directory: /user/jarcec/first_table
New job was successfully created with validation status FINE and persistent id 1
Finally, you can execute your job using the submission
command:
sqoop:000> submission start --jid 1
Submission details
Job id: 1
Status: BOOTING
Creation date: 2012-23-09 17:23:11 PST
Last update date: 2012-23-09 17:23:11 PST
External Id: job_201210300909_0059
    http://vm-cdh4:50030/jobdetails.jsp?jobid=job_201210300909_0059
Progress: Progress is not available
And you can check the progress of the job using the following command:
sqoop:000> submission status --jid 1
Submission details
Job id: 1
Status: RUNNING
Creation date: 2012-23-09 17:23:11 PST
Last update date: 2012-23-09 17:23:35 PST
External Id: job_201210300909_0059
    http://vm-cdh4:50030/jobdetails.jsp?jobid=job_201210300909_0059
Progress: 0.25 %
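If a running job needs to be aborted, the shell also provides a stop function for submissions; this is a sketch, so run help submission to confirm the exact options in your build:
sqoop:000> submission stop --jid 1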
Full export demo
This example will walk you through the entire process of creating all the objects needed to produce an actual MapReduce job that will be executed on your Hadoop cluster and will export data from HDFS to a SQL database. This demo expects that you have installed the Sqoop2 server as instructed in the previous sections. I'll use MySQL in this example; please change the actual values to those corresponding to your database box.
First you need to create a connection metadata object. Sqoop2 currently ships with a single connector called "Generic JDBC Connector", which will most likely have id 1. You can see the list of connectors and their ids by running the command show connector --all
:
sqoop:000> show connector --all
1 connector(s) to show:
Connector with id 1:
  Name: generic-jdbc-connector
  Class: org.apache.sqoop.connector.jdbc.GenericJdbcConnector
  Version: 1.99.2
  Supported job types: [IMPORT, EXPORT]
    Connection form 1:
      Name: connection
      Label: Connection configuration
      Help: You must supply the information requested in order to create a connection object.
      Input 1:
...
A new connection is created using the command create connection
. This command requires the parameter --cid
to specify which connector should be used. Use the number retrieved from the previous command:
sqoop:000> create connection --cid 1
You will be asked for several parameters; please fill them in for your specific use case. Here is an example. The parameters that are not specified in this demo can be left blank. Note that the arguments might vary in your case as Sqoop develops:
sqoop:000> create connection --cid 1
Creating connection for connector with id 1
Please fill following values to create new connection object
Name: export-connection

Connection configuration
JDBC Driver Class: com.mysql.jdbc.Driver
JDBC Connection String: jdbc:mysql://remote-server/dbname
Username: username
Password: ********
JDBC Connection Properties:
There are currently 0 values in the map:
entry#

Security related configuration options
Max connections: 10
New connection was successfully created with validation status FINE and persistent id 1
The next step is to create a job. It is created similarly to a connection, using the command create job
. You need to pass the argument --xid
to indicate which connection should be used for this job object, and the --type
argument with the job type; this time we use export:
sqoop:000> create job --xid 1 --type export
Here is an example at the time of writing this demo:
sqoop:000> create job --xid 1 --type export
Creating job for connection with id 1
Please fill following values to create new job object
Name: export-job

Database configuration
Schema name: dbname
Table name: mytable
Table SQL statement:
Table column names:

Input configuration
Input directory: /user/jarcec/mydata

Throttling resources
Extractors:
Loaders:

New job was successfully created with validation status FINE and persistent id 1
Finally, you can execute your job using the submission
command:
sqoop:000> submission start --jid 1
Submission details
Job id: 1
Status: BOOTING
Creation date: 2012-23-09 17:23:11 PST
Last update date: 2012-23-09 17:23:11 PST
External Id: job_201210300909_0059
    http://vm-cdh4:50030/jobdetails.jsp?jobid=job_201210300909_0059
Progress: Progress is not available
And you can check the progress of the job using the following command:
sqoop:000> submission status --jid 1
Submission details
Job id: 1
Status: RUNNING
Creation date: 2012-23-09 17:23:11 PST
Last update date: 2012-23-09 17:23:35 PST
External Id: job_201210300909_0059
    http://vm-cdh4:50030/jobdetails.jsp?jobid=job_201210300909_0059
Progress: 0.25 %
After the job succeeds, the same command will give the following output:
sqoop:000> submission status --jid 1
Submission details
Job id: 26
Status: SUCCEEDED
Creation date: 2012-23-09 17:23:11 PST
Last update date: 2012-23-09 17:23:35 PST
External Id: job_201210300909_0059
    http://vm-cdh4:50030/jobdetails.jsp?jobid=job_201210300909_0059
Counters:
    org.apache.hadoop.mapreduce.JobCounter
        SLOTS_MILLIS_MAPS: 4367
        TOTAL_LAUNCHED_MAPS: 1
        SLOTS_MILLIS_REDUCES: 0
        OTHER_LOCAL_MAPS: 1
    org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter
        BYTES_WRITTEN: 1033
    org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
        BYTES_READ: 0
    org.apache.hadoop.mapreduce.TaskCounter
        MAP_INPUT_RECORDS: 0
        MERGED_MAP_OUTPUTS: 0
        PHYSICAL_MEMORY_BYTES: 105304064
        SPILLED_RECORDS: 0
        COMMITTED_HEAP_BYTES: 46137344
        CPU_MILLISECONDS: 730
        FAILED_SHUFFLE: 0
        VIRTUAL_MEMORY_BYTES: 642871296
        SPLIT_RAW_BYTES: 118
        MAP_OUTPUT_RECORDS: 1
        GC_TIME_MILLIS: 28
    org.apache.hadoop.mapreduce.FileSystemCounter
        FILE_WRITE_OPS: 0
        FILE_READ_OPS: 0
        FILE_LARGE_READ_OPS: 0
        FILE_BYTES_READ: 0
        HDFS_BYTES_READ: 118
        FILE_BYTES_WRITTEN: 83438
        HDFS_LARGE_READ_OPS: 0
        HDFS_WRITE_OPS: 2
        HDFS_READ_OPS: 4
        HDFS_BYTES_WRITTEN: 1033
    org.apache.sqoop.submission.counter.SqoopCounters
        ROWS_READ: 1
Modifying configuration
Both the default bootstrap configuration sqoop_bootstrap.properties
and the main configuration sqoop.properties
are located under the conf
directory in the Sqoop2 distribution directory.
The bootstrap configuration sqoop_bootstrap.properties
controls the mechanism used to provide configuration:
sqoop.config.provider=org.apache.sqoop.core.PropertiesConfigurationProvider
The main configuration sqoop.properties
controls the repository mechanism, where the log files are written, the logging levels, and so on:
# Log4J system
org.apache.sqoop.log4j.appender.file=org.apache.log4j.RollingFileAppender
org.apache.sqoop.log4j.appender.file.File=logs/sqoop.log
org.apache.sqoop.log4j.appender.file.MaxFileSize=25MB
org.apache.sqoop.log4j.appender.file.MaxBackupIndex=5
org.apache.sqoop.log4j.appender.file.layout=org.apache.log4j.PatternLayout
org.apache.sqoop.log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} [%l] %m%n
org.apache.sqoop.log4j.debug=true
org.apache.sqoop.log4j.rootCategory=WARN, file
org.apache.sqoop.log4j.category.org.apache.sqoop=DEBUG
org.apache.sqoop.log4j.category.org.apache.derby=INFO

# Repository
org.apache.sqoop.repository.provider=org.apache.sqoop.repository.JdbcRepositoryProvider
org.apache.sqoop.repository.jdbc.handler=org.apache.sqoop.repository.derby.DerbyRepositoryHandler
org.apache.sqoop.repository.jdbc.transaction.isolation=READ_COMMITTED
org.apache.sqoop.repository.jdbc.maximum.connections=10
org.apache.sqoop.repository.jdbc.url=jdbc:derby:repository/db;create=true
org.apache.sqoop.repository.jdbc.create.schema=true
org.apache.sqoop.repository.jdbc.driver=org.apache.derby.jdbc.EmbeddedDriver
org.apache.sqoop.repository.jdbc.user=sa
org.apache.sqoop.repository.jdbc.password=
org.apache.sqoop.repository.sysprop.derby.stream.error.file=logs/derbyrepo.log
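As an example of a typical tweak, the following lines in sqoop.properties (using only the keys shown above) raise the log size cap and quiet Sqoop's own logging to INFO; the server reads this file at startup, so restart it for the changes to take effect:

# Example tweak: larger rolling log files, less verbose Sqoop logging
org.apache.sqoop.log4j.appender.file.MaxFileSize=100MB
org.apache.sqoop.log4j.category.org.apache.sqoop=INFO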
Debugging information
The logs of the Tomcat server are located under the server/logs
directory in the Sqoop2 distribution directory.
The logs of the Sqoop2 server and the Derby repository are written as sqoop.log
and derbyrepo.log
respectively (by default, unless changed by the above configuration), under the logs
directory in the Sqoop2 distribution directory.
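To watch the server and repository logs together while reproducing a problem, something like the following works from the distribution directory; this assumes the default log locations above, and catalina.out is Tomcat's usual console log (the exact file names under server/logs may vary):
tail -f server/logs/catalina.out logs/sqoop.log logs/derbyrepo.log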