Building from sources
Check out the sources and switch to the sqoop2 branch:
$ git clone https://git-wip-us.apache.org/repos/asf/sqoop.git sqoop2
$ cd sqoop2
$ git checkout sqoop2
Then you can build Sqoop using mvn:
$ mvn package
Optionally, you can skip the tests during the build:
$ mvn package -DskipTests
Creating binaries
Now build and package Sqoop2 binary distribution:
$ mvn package -Pbinary
This process will create a directory and a tarball under the dist/target
directory. The directory (named sqoop-2.0.0-SNAPSHOT
as of this writing) contains the binaries necessary to run Sqoop2, and its structure looks something like this:
--+ bin --+ sqoop.sh
  |
  + client --+ lib --+ sqoop-common.jar
  |                  |
  |                  + sqoop-client.jar
  |                  |
  |                  + (3rd-party client dependency jars)
  |
  + server --+ bin --+ setenv.sh
  |          |
  |          + conf --+ sqoop_bootstrap.properties
  |          |        |
  |          |        + sqoop.properties
  |          |
  |          + webapps --+ ROOT
  |                      |
  |                      + sqoop.war
  + ...
As part of this process, a copy of the Tomcat server is also downloaded and put under the server
directory in the above structure.
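If you want to sanity-check a build, a minimal sketch like the following can verify that the expected files are present. The helper name check_dist and the list of entries (derived from the tree above) are assumptions for illustration, not part of the distribution:

```shell
# Hypothetical helper: verify that a Sqoop2 distribution directory
# contains the entries shown in the tree above.
check_dist() {
  dist="$1"
  for entry in bin/sqoop.sh client/lib server/bin server/conf server/webapps; do
    if [ ! -e "$dist/$entry" ]; then
      echo "missing: $entry"
      return 1
    fi
  done
  echo "distribution layout OK"
}

# Example: check_dist dist/target/sqoop-2.0.0-SNAPSHOT
```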
Installing Sqoop2 on a remote server
To install the generated binaries on a remote server, simply copy the sqoop-2.0.0-SNAPSHOT
directory to your remote server:
scp -r dist/target/sqoop-2.0.0-SNAPSHOT remote-server.company.org:/remote/path/
Install dependencies
The Sqoop server depends on Hadoop binaries, but they are not part of the distribution, so you need to install them into the Sqoop server manually. We currently support only version 2.0; other versions will be added later. To install the Hadoop libraries, execute the addtowar.sh
command with the argument -hadoop $version $location
. The following example is for the Cloudera distribution version 4 (CDH4):
./bin/addtowar.sh -hadoop 2.0 /usr/lib/hadoop/client/
If you are running the original MapReduce implementation (MR1), you will also need to install its jar:
./bin/addtowar.sh -jars /usr/lib/hadoop-0.20-mapreduce/hadoop-2.0.0-mr1-cdh4.1.1-core.jar
You can install arbitrary jars (connectors, JDBC drivers) using the -jars
argument, which takes a list of jars separated by ":". Here is an example of installing the MySQL JDBC driver into the Sqoop server:
./bin/addtowar.sh -jars /path/to/jar/mysql-connector-java-5.1.21-bin.jar
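Since -jars takes a single colon-separated list, a small helper can build that list from individual paths. The name join_jars is hypothetical, introduced here only for illustration:

```shell
# Hypothetical helper: join jar paths with ":" as the -jars argument expects.
join_jars() {
  old_ifs="$IFS"
  IFS=":"
  printf '%s\n' "$*"   # "$*" joins the arguments with the first character of IFS
  IFS="$old_ifs"
}

# Example:
#   ./bin/addtowar.sh -jars "$(join_jars /path/a.jar /path/b.jar)"
```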
Starting/Stopping Sqoop2 server
To start the Sqoop2 server, invoke the sqoop.sh
shell script:
cd dist/target/sqoop-2.0.0-SNAPSHOT
bin/sqoop.sh server start
The Sqoop2 server is then running as a web application within the Tomcat server.
Similarly, to stop Sqoop2 server, do the following:
bin/sqoop.sh server stop
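The stop and start commands can be chained into a simple restart helper. Both the SQOOP_HOME variable and the restart_sqoop name are assumptions for this sketch, not part of the distribution:

```shell
# Hypothetical helper: restart the Sqoop2 server by chaining the stop and
# start commands shown above. SQOOP_HOME is assumed to point at the
# sqoop-2.0.0-SNAPSHOT directory.
restart_sqoop() {
  "$SQOOP_HOME/bin/sqoop.sh" server stop
  "$SQOOP_HOME/bin/sqoop.sh" server start
}
```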
Starting/Running Sqoop2 client
To start an interactive shell,
bin/sqoop.sh client
This will bring up an interactive client ready for input commands:
Sqoop Shell: Type 'help' or '\h' for help.

sqoop:000>
Commands in the shell client follow the pattern <command> <function> <options>:
- set
- set server
- set server --host <host>
- set server --port <port>
- set server --webapp <webapp>
- show
- show version
- show version --all
- show version --server
- show version --client
- show version --protocol
Type "help" to get a list of all available commands.
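Collecting the commands you plan to run into a file makes a session reproducible and reviewable. The host name below and the idea of replaying the file through the client are assumptions for illustration; whether your client version accepts a script file may vary:

```shell
# Hedged sketch: record a client session in a script file. The host name
# sqoop2.example.org is an assumption; replaying the file through the
# client is also an assumption and may not apply to every version.
cat > /tmp/sqoop-session.sqoop <<'EOF'
set server --host sqoop2.example.org --port 8080 --webapp sqoop
show version --all
EOF
```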
Full demo
This example will walk you through the entire process of creating all the objects needed to produce an actual MapReduce job that will be executed on your Hadoop cluster. This demo expects that you have installed the Sqoop2 server as instructed in the previous sections. I'll use MySQL in this example; please change the actual values to those corresponding to your database box.
First you need to create a connection metadata object. Sqoop2 currently ships with a single connector called "Generic JDBC Connector", which will most likely have id 1. You can see the list of connectors and their ids by running the command show connector --all
:
sqoop:000> show connector --all
1 connector(s) to show:
Connector with id 1:
  Name: generic-jdbc-connector
  Class: org.apache.sqoop.connector.jdbc.GenericJdbcConnector
  Supported job types: [EXPORT, IMPORT]
  Connection form 1:
    Name: form-connection
    Label: Configuration configuration
    Help: You must supply the information requested in order to create a connection object.
    Input 1: ...
A new connection is created using the command create connection
. This command requires the parameter --cid
to specify which connector should be used. Use the number retrieved from the previous command:
sqoop:000> create connection --cid 1
You will be asked for several parameters; fill them in for your specific use case. Here is an example. Please note that the arguments might vary in your case as Sqoop develops:
sqoop:000> create connection --cid 1
Creating connection for connector with id 1
Please fill following values to create new connection object
Name: first-connection

Configuration configuration
JDBC Driver Class: com.mysql.jdbc.Driver
JDBC Connection String: jdbc:mysql://remote-server/dbname
Username: username
Password: ********
JDBC Connection Properties:
There are currently 0 values in the map:
entry#

Database configuration
Table name: example_table
Table SQL statement:
Table column names:
Data warehouse:
Data directory:
Partition column name:
Boundary query:

Security related configuration options
Max connections: 123

New connection was successfully created with validation status FINE and persistent id 1
The next step is to create a job. It is created similarly to a connection, using the command create job
. You need to pass the argument --xid
to specify which connection should be used for this job object, and the --type argument with the job type. Right now only the import
job type is supported:
sqoop:000> create job --xid 1 --type import
Example at the time of writing this demo:
sqoop:000> create job --xid 1 --type import
Creating job for connection with id 1
Please fill following values to create new job object
Name: first-job

Ignored Ignored:

Output configuration
Output format:
Output directory: /user/jarcec/first_table

New job was successfully created with validation status FINE and persistent id 1
Finally, you can execute your job using the submission
command:
sqoop:000> submission start --jid 1
Submission details
Job id: 1
Status: BOOTING
Creation date: 2012-23-09 17:23:11 PST
Last update date: 2012-23-09 17:23:11 PST
External Id: job_201210300909_0059
        http://vm-cdh4:50030/jobdetails.jsp?jobid=job_201210300909_0059
Progress: Progress is not available
You can check the progress of the job using the following command:
sqoop:000> submission status --jid 1
Submission details
Job id: 1
Status: RUNNING
Creation date: 2012-23-09 17:23:11 PST
Last update date: 2012-23-09 17:23:35 PST
External Id: job_201210300909_0059
        http://vm-cdh4:50030/jobdetails.jsp?jobid=job_201210300909_0059
Progress: 0.25 %
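Rather than re-running the status command by hand, you could poll it from a script. wait_for_status is a hypothetical helper, and the terminal status strings below are assumptions; the actual status names may differ across versions:

```shell
# Hypothetical helper: run a status command repeatedly until its output
# contains a terminal state. The SUCCEEDED/FAILED names are assumptions.
wait_for_status() {
  while true; do
    out=$("$@")
    case "$out" in
      *SUCCEEDED*|*FAILED*) echo "$out"; return 0 ;;
    esac
    sleep 5
  done
}

# Example: wait_for_status <command that prints the submission status>
```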
Modifying configuration
Both the default bootstrap configuration sqoop_bootstrap.properties
and the main configuration sqoop.properties
are located under the conf
directory in the Sqoop2 distribution directory.
The bootstrap configuration sqoop_bootstrap.properties
controls the mechanism used to provide configuration:
sqoop.config.provider=org.apache.sqoop.core.PropertiesConfigurationProvider
The main configuration sqoop.properties
controls the repository mechanism, the location of the log files, the logging levels, etc.:
# Log4J system
org.apache.sqoop.log4j.appender.file=org.apache.log4j.RollingFileAppender
org.apache.sqoop.log4j.appender.file.File=logs/sqoop.log
org.apache.sqoop.log4j.appender.file.MaxFileSize=25MB
org.apache.sqoop.log4j.appender.file.MaxBackupIndex=5
org.apache.sqoop.log4j.appender.file.layout=org.apache.log4j.PatternLayout
org.apache.sqoop.log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} [%l] %m%n
org.apache.sqoop.log4j.debug=true
org.apache.sqoop.log4j.rootCategory=WARN, file
org.apache.sqoop.log4j.category.org.apache.sqoop=DEBUG
org.apache.sqoop.log4j.category.org.apache.derby=INFO

# Repository
org.apache.sqoop.repository.provider=org.apache.sqoop.repository.JdbcRepositoryProvider
org.apache.sqoop.repository.jdbc.handler=org.apache.sqoop.repository.derby.DerbyRepositoryHandler
org.apache.sqoop.repository.jdbc.transaction.isolation=READ_COMMITTED
org.apache.sqoop.repository.jdbc.maximum.connections=10
org.apache.sqoop.repository.jdbc.url=jdbc:derby:repository/db;create=true
org.apache.sqoop.repository.jdbc.create.schema=true
org.apache.sqoop.repository.jdbc.driver=org.apache.derby.jdbc.EmbeddedDriver
org.apache.sqoop.repository.jdbc.user=sa
org.apache.sqoop.repository.jdbc.password=
org.apache.sqoop.repository.sysprop.derby.stream.error.file=logs/derbyrepo.log
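To inspect a single value without opening the file, a small sed-based helper can read one key from a properties file. get_prop is a hypothetical name introduced here; keys whose characters are regex metacharacters other than "." would need escaping:

```shell
# Hypothetical helper: print the value of one key from a properties file.
get_prop() {
  key="$1"
  file="$2"
  sed -n "s/^$key=//p" "$file"
}

# Example:
#   get_prop org.apache.sqoop.repository.jdbc.url server/conf/sqoop.properties
```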
Debugging information
The logs of the Tomcat server are located under the server/logs
directory in the Sqoop2 distribution directory.
The logs of the Sqoop2 server and the Derby repository are written as sqoop.log
and derbyrepo.log
respectively (by default, unless changed by the above configuration), under the
(LOGS)
directory in the Sqoop2 distribution directory.