Building from sources
Check out the sources and switch to the sqoop2 branch:
$ git clone https://git-wip-us.apache.org/repos/asf/sqoop.git sqoop2
$ cd sqoop2
$ git checkout sqoop2
Then you can build Sqoop using mvn:
$ mvn package
Optionally, you can skip the tests during the build:
$ mvn package -DskipTests
Creating binaries
Now build and package Sqoop2 binary distribution:
$ mvn package -Pbinary
This process will create a directory and a tarball under the dist/target
directory. The directory (named sqoop-2.0.0-SNAPSHOT
as of this writing) contains the binaries necessary to run Sqoop2, and its structure looks something like this:
--+ bin --+ sqoop.sh
  |
  + client --+ lib --+ sqoop-common.jar
  |                  |
  |                  + sqoop-client.jar
  |                  |
  |                  + (3rd-party client dependency jars)
  |
  + server --+ bin --+ setenv.sh
  |          |
  |          + conf --+ sqoop_bootstrap.properties
  |          |        |
  |          |        + sqoop.properties
  |          |
  |          + webapps --+ ROOT
  |                      |
  |                      + sqoop.war
  + ...
As part of this process, a copy of the Tomcat server is also downloaded and put under the server
directory in the above structure.
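If you want to sanity-check a build, a minimal sketch like the following can verify that the expected files are present. The helper name check_dist and the list of entries (derived from the tree above) are assumptions for illustration, not part of the distribution:

```shell
# Hypothetical helper: verify that a Sqoop2 distribution directory
# contains the entries shown in the tree above.
check_dist() {
  dist="$1"
  for entry in bin/sqoop.sh client/lib server/bin server/conf server/webapps; do
    if [ ! -e "$dist/$entry" ]; then
      echo "missing: $entry"
      return 1
    fi
  done
  echo "distribution layout OK"
}

# Example: check_dist dist/target/sqoop-2.0.0-SNAPSHOT
```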
Installing Sqoop2 on a remote server
To install the generated binaries on a remote server, simply copy the sqoop-2.0.0-SNAPSHOT
directory to your remote server:
scp -r dist/target/sqoop-2.0.0-SNAPSHOT remote-server.company.org:/remote/path/
Install dependencies
The Sqoop server depends on Hadoop binaries, but they are not part of the distribution, so you need to install them into the Sqoop server manually. We currently support only version 2.0; other versions will be added later. To install the Hadoop libraries, execute the addtowar.sh
command with the argument -hadoop $version $location
. The following example is for the Cloudera distribution version 4 (CDH4):
./bin/addtowar.sh -hadoop 2.0 /usr/lib/hadoop/client/
If you are running the original MapReduce implementation (MR1), you will also need to install its jar:
./bin/addtowar.sh -jars /usr/lib/hadoop-0.20-mapreduce/hadoop-2.0.0-mr1-cdh4.1.1-core.jar
You can install arbitrary jars (connectors, JDBC drivers) using the -jars
argument, which takes a list of jars separated by ":". Here is an example of installing the MySQL JDBC driver into the Sqoop server:
./bin/addtowar.sh -jars /path/to/jar/mysql-connector-java-5.1.21-bin.jar
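Since -jars takes a single colon-separated list, a small helper can build that list from individual paths. The name join_jars is hypothetical, introduced here only for illustration:

```shell
# Hypothetical helper: join jar paths with ":" as the -jars argument expects.
join_jars() {
  old_ifs="$IFS"
  IFS=":"
  printf '%s\n' "$*"   # "$*" joins the arguments with the first character of IFS
  IFS="$old_ifs"
}

# Example:
#   ./bin/addtowar.sh -jars "$(join_jars /path/a.jar /path/b.jar)"
```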
Starting/Stopping Sqoop2 server
To start the Sqoop2 server, invoke the sqoop.sh
shell script:
cd dist/target/sqoop-2.0.0-SNAPSHOT
bin/sqoop.sh server start
The Sqoop2 server is then running as a web application within the Tomcat server.
Similarly, to stop Sqoop2 server, do the following:
bin/sqoop.sh server stop
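The stop and start commands can be chained into a simple restart helper. Both the SQOOP_HOME variable and the restart_sqoop name are assumptions for this sketch, not part of the distribution:

```shell
# Hypothetical helper: restart the Sqoop2 server by chaining the stop and
# start commands shown above. SQOOP_HOME is assumed to point at the
# sqoop-2.0.0-SNAPSHOT directory.
restart_sqoop() {
  "$SQOOP_HOME/bin/sqoop.sh" server stop
  "$SQOOP_HOME/bin/sqoop.sh" server start
}
```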
Starting/Running Sqoop2 client
To start an interactive shell,
bin/sqoop.sh client
This will bring up an interactive client ready for input commands:
Sqoop Shell: Type 'help' or '\h' for help.

sqoop:000>
Commands in the shell client follow the pattern <command> <function> <options>:
- set
- set server
- set server --host <host>
- set server --port <port>
- set server --webapp <webapp>
- show
- show version
- show version --all
- show version --server
- show version --client
- show version --protocol
Type "help" to get a list of all available commands.
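Collecting the commands you plan to run into a file makes a session reproducible and reviewable. The host name below and the idea of replaying the file through the client are assumptions for illustration; whether your client version accepts a script file may vary:

```shell
# Hedged sketch: record a client session in a script file. The host name
# sqoop2.example.org is an assumption; replaying the file through the
# client is also an assumption and may not apply to every version.
cat > /tmp/sqoop-session.sqoop <<'EOF'
set server --host sqoop2.example.org --port 8080 --webapp sqoop
show version --all
EOF
```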
Full demo
This example will walk you through the entire process of creating all the objects needed to produce an actual MapReduce job that will be executed on your Hadoop cluster. This demo expects that you have installed the Sqoop2 server as instructed in the previous sections. I'll use MySQL in this example; please change the actual values to those corresponding to your database box.
First you need to create a connection metadata object. Sqoop2 currently ships with a single connector called "Generic JDBC Connector", which will most likely have id 1. You can see the list of connectors and their ids by running the command show connector --all
:
sqoop:000> show connector --all
1 connector(s) to show:
Connector with id 1:
  Name: generic-jdbc-connector
  Class: org.apache.sqoop.connector.jdbc.GenericJdbcConnector
  Supported job types: [EXPORT, IMPORT]
  Connection form 1:
    Name: form-connection
    Label: Configuration configuration
    Help: You must supply the information requested in order to create a connection object.
    Input 1: ...
A new connection is created using the command create connection
. This command requires the parameter --cid
to specify which connector should be used. Use the number retrieved from the previous command:
sqoop:000> create connection --cid 1
You will be asked for several parameters; fill them in for your specific use case. Here is an example. Please note that the arguments might vary in your case as Sqoop develops:
sqoop:000> create connection --cid 1
Creating connection for connector with id 1
Please fill following values to create new connection object
Name: first-connection

Configuration configuration
JDBC Driver Class: com.mysql.jdbc.Driver
JDBC Connection String: jdbc:mysql://remote-server/dbname
Username: username
Password: ********
JDBC Connection Properties:
There are currently 0 values in the map:
entry#

Database configuration
Table name: example_table
Table SQL statement:
Table column names:
Data warehouse:
Data directory:
Partition column name:
Boundary query:

Security related configuration options
Max connections: 123

New connection was successfully created with validation status FINE and persistent id 1
The next step is to create a job. It is created similarly to a connection, using the command create job
. You need to pass the argument --xid
to specify which connection should be used for this job object, and the --type argument with the job type. Right now only the import
job type is supported:
sqoop:000> create job --xid 1 --type import
Example at the time of writing this demo:
sqoop:000> create job --xid 1 --type import
Creating job for connection with id 1
Please fill following values to create new job object
Name: first-job

Ignored Ignored:

Output configuration
Output format:
Output directory: /user/jarcec/first_table

New job was successfully created with validation status FINE and persistent id 1
Finally, you can execute your job using the submission
command:
sqoop:000> submission start --jid 1
Submission details
Job id: 1
Status: BOOTING
Creation date: 2012-23-09 17:23:11 PST
Last update date: 2012-23-09 17:23:11 PST
External Id: job_201210300909_0059
        http://vm-cdh4:50030/jobdetails.jsp?jobid=job_201210300909_0059
Progress: Progress is not available
You can check the progress of the job using the following command:
sqoop:000> submission status --jid 1
Submission details
Job id: 1
Status: RUNNING
Creation date: 2012-23-09 17:23:11 PST
Last update date: 2012-23-09 17:23:35 PST
External Id: job_201210300909_0059
        http://vm-cdh4:50030/jobdetails.jsp?jobid=job_201210300909_0059
Progress: 0.25 %
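Rather than re-running the status command by hand, you could poll it from a script. wait_for_status is a hypothetical helper, and the terminal status strings below are assumptions; the actual status names may differ across versions:

```shell
# Hypothetical helper: run a status command repeatedly until its output
# contains a terminal state. The SUCCEEDED/FAILED names are assumptions.
wait_for_status() {
  while true; do
    out=$("$@")
    case "$out" in
      *SUCCEEDED*|*FAILED*) echo "$out"; return 0 ;;
    esac
    sleep 5
  done
}

# Example: wait_for_status <command that prints the submission status>
```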
Modifying configuration
Both the default bootstrap configuration sqoop_bootstrap.properties
and the main configuration sqoop.properties
are located under the conf
directory in the Sqoop2 distribution directory.
The bootstrap configuration sqoop_bootstrap.properties
controls the mechanism used to provide configuration:
sqoop.config.provider=org.apache.sqoop.core.PropertiesConfigurationProvider
The main configuration sqoop.properties
controls the repository mechanism, the location of the log files, the logging levels, etc.:
# Log4J system
org.apache.sqoop.log4j.appender.file=org.apache.log4j.RollingFileAppender
org.apache.sqoop.log4j.appender.file.File=logs/sqoop.log
org.apache.sqoop.log4j.appender.file.MaxFileSize=25MB
org.apache.sqoop.log4j.appender.file.MaxBackupIndex=5
org.apache.sqoop.log4j.appender.file.layout=org.apache.log4j.PatternLayout
org.apache.sqoop.log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} [%l] %m%n
org.apache.sqoop.log4j.debug=true
org.apache.sqoop.log4j.rootCategory=WARN, file
org.apache.sqoop.log4j.category.org.apache.sqoop=DEBUG
org.apache.sqoop.log4j.category.org.apache.derby=INFO

# Repository
org.apache.sqoop.repository.provider=org.apache.sqoop.repository.JdbcRepositoryProvider
org.apache.sqoop.repository.jdbc.handler=org.apache.sqoop.repository.derby.DerbyRepositoryHandler
org.apache.sqoop.repository.jdbc.transaction.isolation=READ_COMMITTED
org.apache.sqoop.repository.jdbc.maximum.connections=10
org.apache.sqoop.repository.jdbc.url=jdbc:derby:repository/db;create=true
org.apache.sqoop.repository.jdbc.create.schema=true
org.apache.sqoop.repository.jdbc.driver=org.apache.derby.jdbc.EmbeddedDriver
org.apache.sqoop.repository.jdbc.user=sa
org.apache.sqoop.repository.jdbc.password=
org.apache.sqoop.repository.sysprop.derby.stream.error.file=logs/derbyrepo.log
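To inspect a single value without opening the file, a small sed-based helper can read one key from a properties file. get_prop is a hypothetical name introduced here; keys whose characters are regex metacharacters other than "." would need escaping:

```shell
# Hypothetical helper: print the value of one key from a properties file.
get_prop() {
  key="$1"
  file="$2"
  sed -n "s/^$key=//p" "$file"
}

# Example:
#   get_prop org.apache.sqoop.repository.jdbc.url server/conf/sqoop.properties
```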
Debugging information
The logs of the Tomcat server are located under the server/logs
directory in the Sqoop2 distribution directory.
The logs of the Sqoop2 server and the Derby repository are written as sqoop.log
and derbyrepo.log
respectively (by default, unless changed by the above configuration), under the
(LOGS)
directory in the Sqoop2 distribution directory.