Building from sources
Check out the sources and switch to the sqoop2 branch:
$ git clone https://git-wip-us.apache.org/repos/asf/sqoop.git sqoop2
$ cd sqoop2
$ git checkout sqoop2
Then you can build Sqoop using mvn:
$ mvn package
Optionally, you can build Sqoop while skipping the tests:
$ mvn package -DskipTests
Creating binaries
Now build and package the Sqoop2 binary distribution:
$ mvn package -Pbinary
This process will create a directory and a tarball under the dist/target
directory. The directory (named sqoop-2.0.0-SNAPSHOT
as of this writing) contains the necessary binaries to run Sqoop2, and its structure looks something like this:
--+ bin --+ sqoop.sh
  |
  + client --+ lib --+ sqoop-common.jar
  |                  |
  |                  + sqoop-client.jar
  |                  |
  |                  + (3rd-party client dependency jars)
  |
  + server --+ bin --+ setenv.sh
  |          |
  |          + conf --+ sqoop_bootstrap.properties
  |          |        |
  |          |        + sqoop.properties
  |          |
  |          + webapps --+ ROOT
  |                      |
  |                      + sqoop.war
  |
  + ...
As part of this process, a copy of the Tomcat server is also downloaded and put under the server
directory in the above structure.
Setting up a build environment with Eclipse
- Install Eclipse, Maven, and Oracle's JDK, and make sure mvn eclipse:eclipse runs successfully.
- Import the project into Eclipse by going to File > Import... > General > Existing Projects into Workspace > Next.
- In the next wizard window, click the Browse button next to "Select root directory" and browse to the root of the directory where you checked out sqoop2. This will populate about 10 projects into your workspace, all of which are different modules within Sqoop 2. Click the Finish button to bring these projects into the workspace and start working.
Note - if this is the first time you are setting up Eclipse for a Maven project, the import will show classpath problems due to the missing variable M2_REPO (Unbound classpath variable: 'M2_REPO/...'). To fix this error, go to Preferences > Java > Build Path > Classpath Variables. Click New..., enter the name M2_REPO, click Folder, and browse to the directory ~/.m2/repository. Click OK and close the Preferences dialog. This will force a rebuild of the workspace, and all projects should turn green.
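Alternatively, recent versions of the maven-eclipse-plugin can register the M2_REPO variable for you via the configure-workspace goal; this is a sketch, so adjust the workspace path to your environment:
$ mvn eclipse:configure-workspace -Declipse.workspace=/path/to/your/workspace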
Installing Sqoop2 on remote server
To install the generated binaries on a remote server, simply copy the directory sqoop-2.0.0-SNAPSHOT
to your remote server:
scp -r dist/target/sqoop-2.0.0-SNAPSHOT remote-server.company.org:/remote/path/
Installing dependencies
The Sqoop server depends on Hadoop binaries, but they are not part of the distribution, so you need to install them into the Sqoop server manually. We currently support only version 2.0; other versions will be added later. To install the Hadoop libraries, execute the addtowar.sh
script with the argument -hadoop $version $location
. The following example is for Cloudera's distribution version 4 (CDH4):
./bin/addtowar.sh -hadoop 2.0 /usr/lib/hadoop/client/
If you're running CDH4 MR1:
cd dist/target/sqoop-2.0.0-SNAPSHOT-bin-hadoop200
./bin/addtowar.sh -hadoop-version cdh4mr1 -hadoop-path /usr/lib
If you are running the original MapReduce implementation (MR1), you will also need to install its jar:
./bin/addtowar.sh -jars /usr/lib/hadoop-0.20-mapreduce/hadoop-2.0.0-mr1-cdh4.1.1-core.jar
You can install arbitrary jars (connectors, JDBC drivers) using the -jars
argument, which takes a list of jars separated by ":". Here is an example of installing the MySQL JDBC driver into the Sqoop server:
./bin/addtowar.sh -jars /path/to/jar/mysql-connector-java-5.1.21-bin.jar
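Since the -jars argument takes a ":"-separated list, several jars can be installed in one call. For example (the second jar path here is hypothetical):
./bin/addtowar.sh -jars /path/to/jar/mysql-connector-java-5.1.21-bin.jar:/path/to/jar/my-custom-connector.jar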
Starting/Stopping Sqoop2 server
To start the Sqoop2 server, invoke the sqoop.sh
shell script:
cd dist/target/sqoop-2.0.0-SNAPSHOT
bin/sqoop.sh server start
The Sqoop2 server is then running as a web application within the Tomcat server.
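To verify that the server came up, you can query its version over the REST interface; this assumes the default port 12000 and webapp name sqoop, so adjust if you have changed them:
curl http://localhost:12000/sqoop/version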
Similarly, to stop the Sqoop2 server, do the following:
bin/sqoop.sh server stop
Starting/Running Sqoop2 client
To start an interactive shell, run:
bin/sqoop.sh client
This will bring up an interactive client ready to accept commands:
Sqoop Shell: Type 'help' or '\h' for help.

sqoop:000>
Commands in the shell client follow the form <command> <function> <options>, as in the example after this list:
- set
- set server
- set server --host <host>
- set server --port <port>
- set server --webapp <webapp>
- show
- show version
- show version --all
- show version --server
- show version --client
- show version --protocol
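For example, to point the shell at a remote Sqoop2 server (the host name below is hypothetical; 12000 and sqoop are the shipped defaults):
sqoop:000> set server --host sqoop2-server.company.org --port 12000 --webapp sqoop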
Type "help" for getting list of all possible command line commands.
Full import demo
This example will walk you through the entire process of creating all the objects needed to produce an actual MapReduce job that will be executed on your Hadoop cluster. This demo expects that you have installed the Sqoop2 server as instructed in the previous sections. I'll use MySQL in this example; please change the actual values to those corresponding to your database box.
First you need to create a connection metadata object. Sqoop2 currently ships with a single connector called "Generic JDBC Connector", which will most likely have id 1. You can see the list of connectors and their ids by running the command show connector --all
:
sqoop:000> show connector --all
1 connector(s) to show:
Connector with id 1:
  Name: generic-jdbc-connector
  Class: org.apache.sqoop.connector.jdbc.GenericJdbcConnector
  Supported job types: [EXPORT, IMPORT]
    Connection form 1:
      Name: form-connection
      Label: Configuration configuration
      Help: You must supply the information requested in order to create a connection object.
      Input 1:
...
A new connection is created using the command create connection
. This command requires the parameter --cid
to specify which connector should be used. Use the number retrieved from the previous command:
sqoop:000> create connection --cid 1
You will be asked for several parameters; please fill them in for your specific use case. Here is an example. Note that the arguments might vary in your case as Sqoop develops:
sqoop:000> create connection --cid 1
Creating connection for connector with id 1
Please fill following values to create new connection object
Name: first-connection

Configuration configuration
JDBC Driver Class: com.mysql.jdbc.Driver
JDBC Connection String: jdbc:mysql://remote-server/dbname
Username: username
Password: ********
JDBC Connection Properties:
There are currently 0 values in the map:
entry#

Database configuration
Table name: example_table
Table SQL statement:
Table column names:
Data warehouse:
Data directory:
Partition column name:
Boundary query:

Security related configuration options
Max connections: 123
New connection was successfully created with validation status FINE and persistent id 1
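To double-check what was stored, you can list the connection objects; only the command is shown here, as the output simply mirrors the values you entered:
sqoop:000> show connection --all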
The next step is to create a job. It is created similarly to a connection, using the command create job
. You need to pass the argument --xid
to indicate which connection should be used for this job object, and the --type
argument with the job type; here we use import:
sqoop:000> create job --xid 1 --type import
Here is an example at the time of writing this demo:
sqoop:000> create job --xid 1 --type import
Creating job for connection with id 1
Please fill following values to create new job object
Name: first-job

Ignored
Ignored:

Output configuration
Output format:
Output directory: /user/jarcec/first_table
New job was successfully created with validation status FINE and persistent id 1
Finally, you can execute your job using the submission
command:
sqoop:000> submission start --jid 1
Submission details
Job id: 1
Status: BOOTING
Creation date: 2012-23-09 17:23:11 PST
Last update date: 2012-23-09 17:23:11 PST
External Id: job_201210300909_0059
    http://vm-cdh4:50030/jobdetails.jsp?jobid=job_201210300909_0059
Progress: Progress is not available
And you can check the progress of the job using the following command:
sqoop:000> submission status --jid 1
Submission details
Job id: 1
Status: RUNNING
Creation date: 2012-23-09 17:23:11 PST
Last update date: 2012-23-09 17:23:35 PST
External Id: job_201210300909_0059
    http://vm-cdh4:50030/jobdetails.jsp?jobid=job_201210300909_0059
Progress: 0.25 %
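If a running job needs to be aborted, the shell also provides a stop function for submissions; this is a sketch, so run help submission to confirm the exact options in your build:
sqoop:000> submission stop --jid 1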
Full export demo
This example will walk you through the entire process of creating all the objects needed to produce an actual MapReduce job that will be executed on your Hadoop cluster and will export data from HDFS to a SQL database. This demo expects that you have installed the Sqoop2 server as instructed in the previous sections. I'll use MySQL in this example; please change the actual values to those corresponding to your database box.
First you need to create a connection metadata object. Sqoop2 currently ships with a single connector called "Generic JDBC Connector", which will most likely have id 1. You can see the list of connectors and their ids by running the command show connector --all
:
sqoop:000> show connector --all
1 connector(s) to show:
Connector with id 1:
  Name: generic-jdbc-connector
  Class: org.apache.sqoop.connector.jdbc.GenericJdbcConnector
  Version: 1.99.2
  Supported job types: [IMPORT, EXPORT]
    Connection form 1:
      Name: connection
      Label: Connection configuration
      Help: You must supply the information requested in order to create a connection object.
      Input 1:
...
A new connection is created using the command create connection
. This command requires the parameter --cid
to specify which connector should be used. Use the number retrieved from the previous command:
sqoop:000> create connection --cid 1
You will be asked for several parameters; please fill them in for your specific use case. Here is an example. The parameters that are not specified in this demo can be left blank. Note that the arguments might vary in your case as Sqoop develops:
sqoop:000> create connection --cid 1
Creating connection for connector with id 1
Please fill following values to create new connection object
Name: export-connection

Connection configuration
JDBC Driver Class: com.mysql.jdbc.Driver
JDBC Connection String: jdbc:mysql://remote-server/dbname
Username: username
Password: ********
JDBC Connection Properties:
There are currently 0 values in the map:
entry#

Security related configuration options
Max connections: 10
New connection was successfully created with validation status FINE and persistent id 1
The next step is to create a job. It is created similarly to a connection, using the command create job
. You need to pass the argument --xid
to indicate which connection should be used for this job object, and the --type
argument with the job type; this time we use export:
sqoop:000> create job --xid 1 --type export
Here is an example at the time of writing this demo:
sqoop:000> create job --xid 1 --type export
Creating job for connection with id 1
Please fill following values to create new job object
Name: export-job

Database configuration
Schema name: dbname
Table name: mytable
Table SQL statement:
Table column names:

Input configuration
Input directory: /user/jarcec/mydata

Throttling resources
Extractors:
Loaders:

New job was successfully created with validation status FINE and persistent id 1
Finally, you can execute your job using the submission
command:
sqoop:000> submission start --jid 1
Submission details
Job id: 1
Status: BOOTING
Creation date: 2012-23-09 17:23:11 PST
Last update date: 2012-23-09 17:23:11 PST
External Id: job_201210300909_0059
    http://vm-cdh4:50030/jobdetails.jsp?jobid=job_201210300909_0059
Progress: Progress is not available
And you can check the progress of the job using the following command:
sqoop:000> submission status --jid 1
Submission details
Job id: 1
Status: RUNNING
Creation date: 2012-23-09 17:23:11 PST
Last update date: 2012-23-09 17:23:35 PST
External Id: job_201210300909_0059
    http://vm-cdh4:50030/jobdetails.jsp?jobid=job_201210300909_0059
Progress: 0.25 %
After the job succeeds, the same command will give the following output:
sqoop:000> submission status --jid 1
Submission details
Job id: 26
Status: SUCCEEDED
Creation date: 2012-23-09 17:23:11 PST
Last update date: 2012-23-09 17:23:35 PST
External Id: job_201210300909_0059
    http://vm-cdh4:50030/jobdetails.jsp?jobid=job_201210300909_0059
Counters:
    org.apache.hadoop.mapreduce.JobCounter
        SLOTS_MILLIS_MAPS: 4367
        TOTAL_LAUNCHED_MAPS: 1
        SLOTS_MILLIS_REDUCES: 0
        OTHER_LOCAL_MAPS: 1
    org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter
        BYTES_WRITTEN: 1033
    org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
        BYTES_READ: 0
    org.apache.hadoop.mapreduce.TaskCounter
        MAP_INPUT_RECORDS: 0
        MERGED_MAP_OUTPUTS: 0
        PHYSICAL_MEMORY_BYTES: 105304064
        SPILLED_RECORDS: 0
        COMMITTED_HEAP_BYTES: 46137344
        CPU_MILLISECONDS: 730
        FAILED_SHUFFLE: 0
        VIRTUAL_MEMORY_BYTES: 642871296
        SPLIT_RAW_BYTES: 118
        MAP_OUTPUT_RECORDS: 1
        GC_TIME_MILLIS: 28
    org.apache.hadoop.mapreduce.FileSystemCounter
        FILE_WRITE_OPS: 0
        FILE_READ_OPS: 0
        FILE_LARGE_READ_OPS: 0
        FILE_BYTES_READ: 0
        HDFS_BYTES_READ: 118
        FILE_BYTES_WRITTEN: 83438
        HDFS_LARGE_READ_OPS: 0
        HDFS_WRITE_OPS: 2
        HDFS_READ_OPS: 4
        HDFS_BYTES_WRITTEN: 1033
    org.apache.sqoop.submission.counter.SqoopCounters
        ROWS_READ: 1
Modifying configuration
Both the default bootstrap configuration sqoop_bootstrap.properties
and the main configuration sqoop.properties
are located under the conf
directory in the Sqoop2 distribution directory.
The bootstrap configuration sqoop_bootstrap.properties
controls the mechanism used to provide configuration:
sqoop.config.provider=org.apache.sqoop.core.PropertiesConfigurationProvider
The main configuration sqoop.properties
controls the repository mechanism, where the log files are written, the logging levels, and so on:
# Log4J system
org.apache.sqoop.log4j.appender.file=org.apache.log4j.RollingFileAppender
org.apache.sqoop.log4j.appender.file.File=logs/sqoop.log
org.apache.sqoop.log4j.appender.file.MaxFileSize=25MB
org.apache.sqoop.log4j.appender.file.MaxBackupIndex=5
org.apache.sqoop.log4j.appender.file.layout=org.apache.log4j.PatternLayout
org.apache.sqoop.log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} [%l] %m%n
org.apache.sqoop.log4j.debug=true
org.apache.sqoop.log4j.rootCategory=WARN, file
org.apache.sqoop.log4j.category.org.apache.sqoop=DEBUG
org.apache.sqoop.log4j.category.org.apache.derby=INFO

# Repository
org.apache.sqoop.repository.provider=org.apache.sqoop.repository.JdbcRepositoryProvider
org.apache.sqoop.repository.jdbc.handler=org.apache.sqoop.repository.derby.DerbyRepositoryHandler
org.apache.sqoop.repository.jdbc.transaction.isolation=READ_COMMITTED
org.apache.sqoop.repository.jdbc.maximum.connections=10
org.apache.sqoop.repository.jdbc.url=jdbc:derby:repository/db;create=true
org.apache.sqoop.repository.jdbc.create.schema=true
org.apache.sqoop.repository.jdbc.driver=org.apache.derby.jdbc.EmbeddedDriver
org.apache.sqoop.repository.jdbc.user=sa
org.apache.sqoop.repository.jdbc.password=
org.apache.sqoop.repository.sysprop.derby.stream.error.file=logs/derbyrepo.log
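As an example of a typical tweak, the following lines in sqoop.properties (using only the keys shown above) raise the log size cap and quiet Sqoop's own logging to INFO; the server reads this file at startup, so restart it for the changes to take effect:

# Example tweak: larger rolling log files, less verbose Sqoop logging
org.apache.sqoop.log4j.appender.file.MaxFileSize=100MB
org.apache.sqoop.log4j.category.org.apache.sqoop=INFO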
Debugging information
The logs of the Tomcat server are located under the server/logs
directory in the Sqoop2 distribution directory.
The logs of the Sqoop2 server and the Derby repository are written as sqoop.log
and derbyrepo.log
respectively (by default, unless changed by the above configuration), under the logs
directory in the Sqoop2 distribution directory.
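To watch the server and repository logs together while reproducing a problem, something like the following works from the distribution directory; this assumes the default log locations above, and catalina.out is Tomcat's usual console log (the exact file names under server/logs may vary):
tail -f server/logs/catalina.out logs/sqoop.log logs/derbyrepo.log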