Table of Contents | ||||
---|---|---|---|---|
|
Warning | ||
---|---|---|
| ||
Please see the docs for latest release in 1.99.* http://sqoop.apache.org/docs/ . Some of the information below might be outdated |
Building from sources
Checkout sources and switch to sqoop2 branch:
No Format |
---|
$ git clone https://git-wip-us.apache.org/repos/asf/sqoop.git sqoop2 $ cd sqoop2 $ git checkout sqoop2 |
Then you can build sqoop using mvn:
No Format |
---|
$ mvn package
|
Optionally you can build Sqoop with skipping tests:
Code Block |
---|
$ mvn package -DskipTests
|
Creating binaries
Now build and package Sqoop2 binary distribution:
No Format |
---|
$ mvn package -Pbinary
|
This process will create a directory and a tarball under dist/target
directory. The directory (named sqoop-2.0.0-SNAPSHOT
as of this writing) contains necessary binaries to run Sqoop2, and its structure looks something like
Setting up a build environment with Eclipse
Installation
- Install Eclipse,
- Install maven if not already on your machine
- Install Oracle's JDK
Set up
- Run the following commands
Code Block |
---|
mvn eclipse:configure-workspace -Declipse.workspace=<path to your eclipse workspace>
&&
mvn eclipse:eclipse -DdownloadSources=true -DdownloadJavadocs=true |
- Import the project into eclipse by going to File > Import... > General > Existing Projects into Workspace > Next.
- In the next wizard window, click the browse button next to "Select root directory" and browse to the root of the workspace from where you have checked out sqoop2. This will populate about 10 projects into your workspace - all of which are different modules within Sqoop2. Click Finish button to get these projects into the workspace and start working.
Note - if this is the first time you are setting up Eclipse for a maven project, the import will show class path problems due to missing variable M2_REPO (Unbound classpath variable: 'M2_REPO/...). To fix this error, go to Preferences > Java > Build Path > Classpath Variables. Click on New..., enter name M2_REPO, click on Folder and browse upto the directory ~/.m2/repository. Click OK and close the preferences dialog. This will force the rebuild of the workspace and all projects should turn green.
Note |
---|
Similar steps need to be followed with IDEA Intellij as well. |
Setting up the Code Formatter
Quick commands to compile and run tests
Sqoop clean:
Code Block |
---|
mvn clean |
Sqoop compile:
Code Block |
---|
mvn compile |
Run all unit tests:
Code Block |
---|
mvn test |
Run all integration tests :
Warning |
---|
Running integration tests does take up a lot of CPU, since these tests run on the actual execution engine ( such as Hadoop MR ) esp. Running org.apache.sqoop.integration.connector.jdbc.generic.PartitionerTest |
Code Block |
---|
mvn clean integration-test |
Run one integration test:
Code Block |
---|
mvn clean integration-test -Dtest=org.apache.sqoop.integration.connector.jdbc.generic.FromRDBMSToHDFSTest -DfailIfNoTests=false |
If you want to run tests against the postgres repository, have a working installation of postgres and then point to it when running tests. In the following case we have a working postgres installation as
postgresql://postgresql.ent.cloudera.com/sqoop_test
Code Block |
---|
mvn clean integration-test -pl repository/repository-postgresql -Dsqoop.provider.class=org.apache.sqoop.common.test.db.PostgreSQLProvider -Dsqoop.provider.postgresql.jdbc=jdbc:postgresql://postgresql.ent.cloudera.com/sqoop_test -Dsqoop.provider.postgresql.username=sqoop -Dsqoop.provider.postgresql.password=sqoop -Dpostgresql |
Sadly, as of this writing it does not really run the integration tests, it runs only the unit tests.
Build sqoop :
No Format |
---|
mvn package
|
Optionally you can build Sqoop with skipping tests ( both unit tests and integration tests )
Code Block |
---|
mvn package -DskipTests |
Other handy commands that does build and run all tests from scratch
Code Block |
---|
mvn verify
or
mvn clean install |
Creating Sqoop binaries
Now build and package Sqoop2 binary distribution:
No Format |
---|
mvn package -Pbinary
or
mvn package -DskipTests=true -Dmaven.javadoc.skip=true -Pbinary -Dhadoop.profile=200 // for a specific hadoop profile |
This process will create a directory and a tarball under dist/target
directory. The directory (named sqoop-2.0.0-SNAPSHOT
or sqoop-2.0.0-SNAPSHOT-bin-hadoop200
, depending on the hadoop profile used ) contains necessary binaries to run Sqoop2, and its structure looks something like below.
Warning |
---|
VB: There is NO lib folder under the client in the latest code as of this writing |
No Format |
---|
--+ bin --+ sqoop.sh
|
+ client --+ lib --+ sqoop-common.jar
| |
No Format |
--+ bin --+ sqoop.sh | + client --+ lib --+ sqoop-common.jar | | | + sqoop-client.jar | | | + (3rd-party client dependency jars) | + server --+ bin --+ setenv.sh | | | + conf --+ sqoop_bootstrap.properties | | | | | + sqoop.properties | | | + webapps --+ ROOT | | | + sqoop.war | + ... |
As part of this process, a copy of the Tomcat server is also downloaded and put under the server
directory in the above structure.
Setting up a build environment with Eclipse
- Install Eclipse, maven and Oracle's JDK and run mvn eclipse:eclipse successfully
- Import the project into eclipse by going to File > Import... > General > Existing Projects into Workspace > Next.
- In the next wizard window, click the browse button next to "Select root directory" and browse to the root of the workspace from where you have checked out sqoop2. This will populate about 10 projects into your workspace - all of which are different modules within Sqoop 2. Click Finish button to get these projects into the workspace and start working.
Note - if this is the first time you are setting up Eclipse for a maven project, the import will show class path problems due to missing variable M2_REPO (Unbound classpath variable: 'M2_REPO/...). To fix this error, go to Preferences > Java > Build Path > Classpath Variables. Click on New..., enter name M2_REPO, click on Folder and browse upto the directory ~/.m2/repository. Click OK and close the preferences dialog. This will force the rebuild of the workspace and all projects should turn green.
Installing Sqoop2 on remote server
To install generated binaries on remote server simply copy directory sqoop-2.0.0-SNAPSHOT
to your remote server:
Code Block |
---|
scp -r dist/target/sqoop-2.0.0-SNAPSHOT remote-server.company.org:/remote/path/
|
Install dependencies
Sqoop server is depending on hadoop binaries, but they are not part of the distribution and thus you need to install them into Sqoop server manually. We currently supports only version 2.0, but other version will be added later. To install hadoop libraries execute command addtowar.sh
with argument -hadoop $version $location
. Following example is for Cloudera distribution version 4(CDH4):
|
| + sqoop-client.jar
| |
| + (3rd-party client dependency jars)
|
+ server --+ bin --+ setenv.sh
| |
| + conf --+ sqoop_bootstrap.properties
| | |
| | + sqoop.properties
| |
| + webapps --+ ROOT
| |
| + sqoop.war
|
+ ...
|
As part of this process, a copy of the Tomcat server is also downloaded and put under the server
directory in the above structure.
Note |
---|
If you are on particular release branch such as 1.99.4, all the artifacts in it will be created with the 1.99.4 build version. for instance sqoop-1.99.4-bin-hadoop200.tar.gz |
Installing Sqoop2 on remote server
To install generated binaries on remote server simply copy directory sqoop-2.0.0-SNAPSHOT
to your remote server:
Code Block |
---|
scp -r dist/target/sqoop-2.0.0-SNAPSHOT remote-server.company.org:/remote/path/
|
Install dependencies
Sqoop server is depending on hadoop binaries, but they are not part of the distribution and thus you need to install them into Sqoop server manually. The latest hadoop version we support is 2.5.2 .
Warning |
---|
VB: There is no addtowar.sh in the in the latest code under sqoop-2.0.0-SNAPSHOT/bin as of this writing |
To install hadoop libraries execute command addtowar.sh
with argument -hadoop $version $location
. Following example is for Cloudera distribution version 4(CDH4):
Code Block |
---|
./bin/addtowar.sh -hadoop 2.0 /usr/lib/hadoop/client/
|
If you're running CDH4 MR1:
Code Block |
---|
cd dist/target/sqoop-2.0.0-SNAPSHOT-bin-hadoop200 or cd dist/target/sqoop-2.0.0-SNAPSHOT
./bin/addtowar.sh -hadoop-version cdh4mr1 -hadoop-path /usr/lib
|
In case that you're running original Mapreduce implementation (MR1), you will also need to install it's jar:
Code Block |
---|
./bin/addtowar.sh -jars /usr/lib/hadoop-0.20-mapreduce/hadoop-2.0.0-mr1-cdh4.1.1-core.jar
|
You can install any arbitrary jars (connectors, JDBC drivers) using -jars
argument that takes list of jars separated by ":". Here is example for installing MySQL jdbc driver into Sqoop server:
Code Block |
---|
|
Code Block |
./bin/addtowar.sh -hadoopjars 2.0 /usrpath/libto/hadoop/client/ |
If you're running CDH4 MR1:
Code Block |
---|
cd dist/target/sqoop-2.0.0-SNAPSHOT-bin-hadoop200
./bin/addtowar.sh -hadoop-version cdh4mr1 -hadoop-path /usr/lib
|
In case that you're running original Mapreduce implementation (MR1), you will also need to install it's jar:
Code Block |
---|
./bin/addtowar.sh -jars /usr/lib/hadoop-0.20-mapreduce/hadoop-2.0.0-mr1-cdh4.1.1-core.jar
|
You can install any arbitrary jars (connectors, JDBC drivers) using -jars
argument that takes list of jars separated by ":". Here is example for installing MySQL jdbc driver into Sqoop server:
...
jar/mysql-connector-java-5.1.21-bin.jar
|
Installing a new connector to Sqoop2
If you are contributing or adding a new connector say sqoop-foo-connector
to the sqoop2, here are steps to follow.
Step 1: Create a sqoop-foo-connector.jar
. Make sure the jar contains the sqoopconnector.properties
for it to be picked up by sqoop
A typical sqoopconnector.properties for a sqoop2 connector looks like below
Code Block |
---|
# Generic JDBC Connector Properties
org.apache.sqoop.connector.class = org.apache.sqoop.connector.foo.FooConnector
org.apache.sqoop.connector.name = sqoop-foo-connector |
Step 2: Add this jar to the a folder on your installation machine and update the path to this folder in the sqoop.properties
located under the server/conf
directory under the Sqoop2 for the key
org.apache.sqoop.connector.external.loadpath
Code Block |
---|
#
# External connectors load path
# "/path/to/external/connectors/": Add all the connector JARs in the specified folder
#
org.apache.sqoop.connector.external.loadpath=/path/to/connector
|
Step 3: Start the server and while initalizing the server this jar should be loaded into the sqoop's class path and registered into the sqoop repository/
Starting/Stopping Sqoop2 server
To start Sqoop2 server invoke the sqoop
shell script:
No Format |
---|
cd dist/target/sqoop-2.0.0-SNAPSHOT
bin/sqoop.sh server start
|
...
Similarly, to stop Sqoop2 server, do the following:
No Format |
---|
bin/sqoop.sh server stop
|
Starting/Running Sqoop2 client
To start an interactive shell,
No Format |
---|
bin/sqoop.sh client
|
This will bring up an interactive client ready for input commands:
No Format |
---|
Sqoop Shell: Type 'help' or '\h' for help.
sqoop:000>
|
The command for the shell client looks something like <command> <function> <options>:
- set
- set server
- set server --host <host>
- set server --port <port>
- set server --webapp <webapp>
- set server
- show
- show version
- show version --all
- show version --server
- show version --client
- show version --protocol
- show version
Type "help" for getting list of all possible command line commands.
Full demo
This example will walk you through entire process of creating all objects in order to create actual map reduce job that will be executed on your Hadoop cluster. This demo is expecting that you have installed Sqoop2 server as instructed in previous sections. I'll use MySQL in this example, please change actual values to the one corresponding with your database box.
Firstly you need to create connection metadata object. Sqoop2 currently ships with single connector called "Generic JDBC Connector" that will most likely have id 1. You can see list of connector their ids by running command show connector --all
:
Code Block |
---|
sqoop:000> show connector --all
1 connector(s) to show: 1 connector(s) to show:
Connector with id 1:
Name: generic-jdbc-connector
Class: org.apache.sqoop.connector.jdbc.GenericJdbcConnector
Supported job types: [EXPORT, IMPORT]
Connection form 1:
Name: form-connection
Label: Configuration configuration
Help: You must supply the information requested in order to create a connection object.
Input 1:
...
|
New connection is created using command create connection
. This command is requiring parameter --cid
to specify which connector should be used. Use number retrieved from previous command:
Code Block |
---|
sqoop:000> create connection --cid 1
|
You should be asked for several parameters. Please fill them with your specific use case. Here is example. Please note that arguments might vary in your case as sqoop develops:
Code Block |
---|
sqoop:000> create connection --cid 1
Creating connection for connector with id 1
Please fill following values to create new connection object
Name: first-connection
Configuration configuration
JDBC Driver Class: com.mysql.jdbc.Driver
JDBC Connection String: jdbc:mysql://remote-server/dbname
Username: username
Password: ********
JDBC Connection Properties:
There are currently 0 values in the map:
entry#
Database configuration
Table name: example_table
Table SQL statement:
Table column names:
Data warehouse:
Data directory:
Partition column name:
Boundary query:
Security related configuration options
Max connections: 123
New connection was successfully created with validation status FINE and persistent id 1
|
Next step is to create job. It's created similarly as a connection using command create job
. You need to pass argument --xid
notifying which connection should be used for this job object and --typ}}e argument with job type. Right now only {{import
job type is supported:
Code Block |
---|
sqoop:000> create job --xid 1 --type import
|
Example at the time of writing this demo:
Code Block |
---|
sqoop:000> create job --xid 1 --type import
Creating job for connection with id 1 Please fill following values to create new job object
Name: first-job
Ignored
Ignored:
Output configuration
Output format:
Output directory: /user/jarcec/first_table
New job was successfully created with validation status FINE and persistent id 1
|
Finally you can execute your job using submission
command:
Code Block |
---|
sqoop:000> submission start --jid 1
Submission details
Job id: 1
Status: BOOTING
Creation date: 2012-23-09 17:23:11 PST
Last update date: 2012-23-09 17:23:11 PST
External Id: job_201210300909_0059
http://vm-cdh4:50030/jobdetails.jsp?jobid=job_201210300909_0059
Progress: Progress is not available
|
And you can check progress of the job using following command:
Code Block |
---|
sqoop:000> submission status --jid 1
Submission details
Job id: 1
Status: RUNNING
Creation date: 2012-23-09 17:23:11 PST
Last update date: 2012-23-09 17:23:35 PST
External Id: job_201210300909_0059
http://vm-cdh4:50030/jobdetails.jsp?jobid=job_201210300909_0059
Progress: 0.25 %
|
Modifying configuration
Both the default bootstrap configuration sqoop_bootstrap.properties
and the main configuration sqoop.properties
are located under the conf
directory in the Sqoop2 distribution directory.
The bootstrap configuration sqoop_bootstrap.properties
controls what the mechanism is to provide configuration:
No Format |
---|
sqoop.config.provider=org.apache.sqoop.core.PropertiesConfigurationProvider
|
The main configuration sqoop.properties
controls what the mechanism is for repository, where the log files are, what the logging levels are, etc.
No Format |
---|
# Log4J system
org.apache.sqoop.log4j.appender.file=org.apache.log4j.RollingFileAppender
org.apache.sqoop.log4j.appender.file.File=logs/sqoop.log
org.apache.sqoop.log4j.appender.file.MaxFileSize=25MB
org.apache.sqoop.log4j.appender.file.MaxBackupIndex=5
org.apache.sqoop.log4j.appender.file.layout=org.apache.log4j.PatternLayout
org.apache.sqoop.log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} [%l] %m%n
org.apache.sqoop.log4j.debug=true
org.apache.sqoop.log4j.rootCategory=WARN, file
org.apache.sqoop.log4j.category.org.apache.sqoop=DEBUG
org.apache.sqoop.log4j.category.org.apache.derby=INFO
# Repository
org.apache.sqoop.repository.provider=org.apache.sqoop.repository.JdbcRepositoryProvider
org.apache.sqoop.repository.jdbc.handler=org.apache.sqoop.repository.derby.DerbyRepositoryHandler
org.apache.sqoop.repository.jdbc.transaction.isolation=READ_COMMITTED
org.apache.sqoop.repository.jdbc.maximum.connections=10
org.apache.sqoop.repository.jdbc.url=jdbc:derby:repository/db;create=true
org.apache.sqoop.repository.jdbc.create.schema=true
org.apache.sqoop.repository.jdbc.driver=org.apache.derby.jdbc.EmbeddedDriver
org.apache.sqoop.repository.jdbc.user=sa
org.apache.sqoop.repository.jdbc.password=
org.apache.sqoop.repository.sysprop.derby.stream.error.file=logs/derbyrepo.log
|
Debugging information
The logs of the Tomcat server is located under the server/logs
directory in the Sqoop2 distribution directory.
Note |
---|
Please see the 5 min Demo Guide or the Command Line Shell Guide for the latest release 1.99.* http://sqoop.apache.org/docs/ |
Sqoop configuration files
Both the default bootstrap configuration sqoop_bootstrap.properties
and the main configuration sqoop.properties
are located under the server/conf
directory in the Sqoop2 distribution directory.
The bootstrap configuration sqoop_bootstrap.properties
controls what the mechanism is to provide configuration for different managers in the Sqoop.
No Format |
---|
sqoop.config.provider=org.apache.sqoop.core.PropertiesConfigurationProvider
|
The main configuration sqoop.properties
controls what the mechanism is for where the
- Where are the log files are, what the logging levels are?
- What is the repository used?
- What is the submission/ execution engine used?
- What is the Authentication mechanism used?
No Format |
---|
# Logging Configuration
# Any property that starts with the prefix
# org.apache.sqoop.log4j is parsed out by the configuration
# system and passed to the log4j subsystem. This allows you
# to specify log4j configuration properties from within the
# Sqoop configuration.
#
org.apache.sqoop.log4j.appender.file=org.apache.log4j.RollingFileAppender
org.apache.sqoop.log4j.appender.file.File=@LOGDIR@/sqoop.log
org.apache.sqoop.log4j.appender.file.MaxFileSize=25MB
org.apache.sqoop.log4j.appender.file.MaxBackupIndex=5
org.apache.sqoop.log4j.appender.file.layout=org.apache.log4j.PatternLayout
org.apache.sqoop.log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} [%l] %m%n
org.apache.sqoop.log4j.debug=true
org.apache.sqoop.log4j.rootCategory=WARN, file
org.apache.sqoop.log4j.category.org.apache.sqoop=DEBUG
org.apache.sqoop.log4j.category.org.apache.derby=INFO
#
# Audit Loggers Configuration
# Multiple audit loggers could be given here. To specify an
# audit logger, you should at least add org.apache.sqoop.
# auditlogger.[LoggerName].class. You could also provide
# more configuration options by using org.apache.sqoop.
# auditlogger.[LoggerName] prefix, then all these options
# are parsed to the logger class.
#
org.apache.sqoop.auditlogger.default.class=org.apache.sqoop.audit.FileAuditLogger
org.apache.sqoop.auditlogger.default.file=@LOGDIR@/default.audit
#
# Repository configuration
# The Repository subsystem provides the special prefix which
# is "org.apache.sqoop.repository.sysprop". Any property that
# is specified with this prefix is parsed out and set as a
# system property. For example, if the built in Derby repository
# is being used, the sysprop prefixed properties can be used
# to affect Derby configuration at startup time by setting
# the appropriate system properties.
#
# Repository provider
org.apache.sqoop.repository.provider=org.apache.sqoop.repository.JdbcRepositoryProvider
# Repository upgrade
# If set to true, it will not upgrade the sqoop respository schema, by default it will iniate the upgrade on server start-up
org.apache.sqoop.repository.schema.immutable=false
# JDBC repository provider configuration
org.apache.sqoop.repository.jdbc.handler=org.apache.sqoop.repository.derby.DerbyRepositoryHandler
org.apache.sqoop.repository.jdbc.transaction.isolation=READ_COMMITTED
org.apache.sqoop.repository.jdbc.maximum.connections=10
org.apache.sqoop.repository.jdbc.url=jdbc:derby:@BASEDIR@/repository/db;create=true
org.apache.sqoop.repository.jdbc.driver=org.apache.derby.jdbc.EmbeddedDriver
org.apache.sqoop.repository.jdbc.user=sa
org.apache.sqoop.repository.jdbc.password=
# System properties for embedded Derby configuration
org.apache.sqoop.repository.sysprop.derby.stream.error.file=@LOGDIR@/derbyrepo.log
#
# Sqoop Connector configuration
# If set to true will initiate Connectors config upgrade during server startup
#
org.apache.sqoop.connector.autoupgrade=false
#
# Sqoop Driver configuration
# If set to true will initiate the Driver config upgrade during server startup
#
org.apache.sqoop.driver.autoupgrade=false
# Sleeping period for reloading configuration file (once a minute)
org.apache.sqoop.core.configuration.provider.properties.sleep=60000
#
# Submission engine configuration
#
# Submission engine class
org.apache.sqoop.submission.engine=org.apache.sqoop.submission.mapreduce.MapreduceSubmissionEngine
# Number of milliseconds, submissions created before this limit will be removed, default is one day
#org.apache.sqoop.submission.purge.threshold=
# Number of milliseconds for purge thread to sleep, by default one day
#org.apache.sqoop.submission.purge.sleep=
# Number of milliseconds for update thread to sleep, by default 5 minutes
#org.apache.sqoop.submission.update.sleep=
#
# Configuration for Mapreduce submission engine (applicable if it's configured)
#
# Hadoop configuration directory
org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/etc/hadoop/conf/
#
# Execution engine configuration
#
org.apache.sqoop.execution.engine=org.apache.sqoop.execution.mapreduce.MapreduceExecutionEngine
#
# Authentication configuration
#
#org.apache.sqoop.authentication.type=SIMPLE
#org.apache.sqoop.authentication.handler=org.apache.sqoop.security.SimpleAuthenticationHandler
#org.apache.sqoop.anonymous=true
#org.apache.sqoop.authentication.type=KERBEROS
#org.apache.sqoop.authentication.handler=org.apache.sqoop.security.KerberosAuthenticationHandler
#org.apache.sqoop.authentication.kerberos.principal=sqoop/_HOST@NOVALOCAL
#org.apache.sqoop.authentication.kerberos.keytab=/home/kerberos/sqoop.keytab
#org.apache.sqoop.authentication.kerberos.http.principal=HTTP/_HOST@NOVALOCAL
#org.apache.sqoop.authentication.kerberos.http.keytab= |
Debug Logs information
- The logs of the Tomcat server is located under the
server/logs
directory in the Sqoop2 distribution directory, most relevant would becatalina.out
- The logs of the Sqoop2 server as
sqoop.log
(by default unless changed by the above sqoop.properties configuration file ) under the(@LOGDIR)
directory in the Sqoop2 distribution directory. - The logs for the Derby repository is
derbyrepo.log
(by default unless changed by the above sqoop.properties configuration file ) under the(@LOGDIR
...
)
directory in the Sqoop2 distribution directory.