Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migrated to Confluence 5.3

NOTE THAT this page is still in draft and may not be fully implemented yet.

Table of Contents
maxLevel5
styleoutline

Warning
titleNote

Please see the docs for latest release in 1.99.* http://sqoop.apache.org/docs/ . Some of the information below might be outdated

Building from sources

Checkout sources and switch to sqoop2 branch:

No Format

$git svnclone co https://svngit-wip-us.apache.org/repos/asf/sqoop/branches/sqoop2

Then, change to Sqoop2 directory and build them:

No Format

$ cd sqoop2
$ mvn install

Creating distribution

Now build and package Sqoop2 as distribution:

.git sqoop2
cd sqoop2
git checkout sqoop2

 

Setting up a build environment with Eclipse

Installation

  • Install Eclipse,
  • Install maven if not already on your machine
  • Install Oracle's JDK 

Set up 

  • Run the following commands
Code Block
mvn eclipse:configure-workspace -Declipse.workspace=<path to your eclipse workspace>
&&
mvn eclipse:eclipse -DdownloadSources=true -DdownloadJavadocs=true
  • Import the project into eclipse by going to File > Import... > General > Existing Projects into Workspace > Next.
  • In the next wizard window, click the browse button next to "Select root directory" and browse to the root of the workspace from where you have checked out sqoop2. This will populate about 10 projects into your workspace - all of which are different modules within Sqoop2. Click Finish button to get these projects into the workspace and start working.

Note - if this is the first time you are setting up Eclipse for a maven project, the import will show class path problems due to missing variable M2_REPO (Unbound classpath variable: 'M2_REPO/...). To fix this error, go to Preferences > Java > Build Path > Classpath Variables. Click on New..., enter name M2_REPO, click on Folder and browse upto the directory ~/.m2/repository. Click OK and close the preferences dialog. This will force the rebuild of the workspace and all projects should turn green.

Note

Similar steps need to be followed with IDEA Intellij as well.

Setting up the Code Formatter 

See Sqoop 2 Coding Guidelines

Quick commands to compile and run tests

Sqoop clean:

Code Block
mvn clean

Sqoop compile:

Code Block
mvn compile

Run all unit tests:

Code Block
mvn test

Run all  integration tests :

Warning

Running integration tests does take up a lot of CPU, since these tests run on the actual execution engine ( such as Hadoop MR ) esp.

Running org.apache.sqoop.integration.connector.jdbc.generic.PartitionerTest

Code Block
mvn clean integration-test

Run one integration test:

Code Block
mvn clean integration-test -Dtest=org.apache.sqoop.integration.connector.jdbc.generic.FromRDBMSToHDFSTest -DfailIfNoTests=false

If you want to run tests against the postgres repository, have a working installation of postgres and then point to it when running tests. In the following case we have a working postgres installation as 

postgresql://postgresql.ent.cloudera.com/sqoop_test

Code Block
mvn clean integration-test -pl repository/repository-postgresql -Dsqoop.provider.class=org.apache.sqoop.common.test.db.PostgreSQLProvider -Dsqoop.provider.postgresql.jdbc=jdbc:postgresql://postgresql.ent.cloudera.com/sqoop_test -Dsqoop.provider.postgresql.username=sqoop -Dsqoop.provider.postgresql.password=sqoop -Dpostgresql

Sadly, as of this writing it does not really run the integration tests, it runs only the unit tests.

 

Build sqoop :

No Format
mvn package

 

Optionally you can build Sqoop with skipping tests ( both unit tests and integration tests )

Code Block
mvn package -DskipTests

 

Other handy commands that does build and run all tests from scratch

Code Block
mvn verify
or
mvn clean install

Creating Sqoop binaries

Now build and package Sqoop2 binary distribution:

No Format
mvn package -Pbinary

 
or 
 
mvn package -DskipTests=true -Dmaven.javadoc.skip=true -Pbinary -Dhadoop.profile=200  // for a specific hadoop profile
No Format

$ mvn package -Pdist

This process will create a directory and a tarball under dist/target directory. The directory (named sqoop-2.0.0-SNAPSHOT or  sqoop-2.0.0-SNAPSHOT as of this writing-bin-hadoop200, depending on the hadoop profile used ) contains necessary binaries to run Sqoop2, and its layout structure looks something like below.

Warning

VB: There is NO lib folder under the client in the latest code as of this writing

No Format

--+ bin --+ sqoop.sh
  |
  
+ client --+ conflib --+ sqoop_bootstrap-common.propertiesjar
  |                  |
  |         +-         + sqoop.properties-client.jar
  |                  |
  |                  + (3rd-party client dependency jars)
  |
  + server --+ bin --+ setenv.sh
  |          |
  |          + conf --+ sqoop_bootstrap.properties
  |          |        |
  |          |        + sqoop.properties
  |          |
  |          + webapps --+ ROOT
  |                      |
  |                      + sqoop.war
  |
  + ...

As part of this process, a copy of the Tomcat server is also downloaded and put under the server directory in the above structure.

Note

If you are on particular release branch such as 1.99.4, all the artifacts in it will be created with the 1.99.4 build version. for instance sqoop-1.99.4-bin-hadoop200.tar.gz

Installing Sqoop2 on remote server

To install generated binaries on remote server simply copy directory sqoop-2.0.0-SNAPSHOT to your remote server:

Code Block
scp -r dist/target/sqoop-2.0.0-SNAPSHOT remote-server.company.org:/remote/path/

Install dependencies

Sqoop server is depending on hadoop binaries, but they are not part of the distribution and thus you need to install them into Sqoop server manually. The latest hadoop version we support is 2.5.2 .

Warning

VB: There is no addtowar.sh in the in the latest code under sqoop-2.0.0-SNAPSHOT/bin as of this writing

 

To install hadoop libraries execute command addtowar.sh with argument -hadoop $version $location. Following example is for Cloudera distribution version 4(CDH4):

Code Block
 ./bin/addtowar.sh -hadoop 2.0 /usr/lib/hadoop/client/

If you're running CDH4 MR1:

Code Block
cd dist/target/sqoop-2.0.0-SNAPSHOT-bin-hadoop200 or cd dist/target/sqoop-2.0.0-SNAPSHOT
./bin/addtowar.sh -hadoop-version cdh4mr1 -hadoop-path /usr/lib

In case that you're running original Mapreduce implementation (MR1), you will also need to install it's jar:

Code Block
 ./bin/addtowar.sh -jars /usr/lib/hadoop-0.20-mapreduce/hadoop-2.0.0-mr1-cdh4.1.1-core.jar

You can install any arbitrary jars (connectors, JDBC drivers) using -jars argument that takes list of jars separated by ":". Here is example for installing MySQL jdbc driver into Sqoop server:

Code Block
  ./bin/addtowar.sh  |
  + sqoop-common.jar
-jars /path/to/jar/mysql-connector-java-5.1.21-bin.jar

Installing a new connector to Sqoop2

If you are contributing or adding a new connector say sqoop-foo-connector to the sqoop2, here are steps to follow.

 

Step 1: Create a sqoop-foo-connector.jar. Make sure the jar contains the sqoopconnector.properties for it to be picked up by sqoop

A typical sqoopconnector.properties for a sqoop2 connector looks like below

Code Block
# Generic JDBC Connector Properties
org.apache.sqoop.connector.class = org.apache.sqoop.connector.foo.FooConnector
org.apache.sqoop.connector.name = sqoop-foo-connector

 

Step 2: Add this jar to the a folder on your installation machine and update the path to this folder in the sqoop.properties located under the server/conf directory under the Sqoop2  for the key 

org.apache.sqoop.connector.external.loadpath

Code Block
#
# External connectors load path
# "/path/to/external/connectors/": Add all the connector JARs in the specified folder
#
org.apache.sqoop.connector.external.loadpath=/path/to/connector

 

Step 3: Start the server and while initalizing the server this jar should be loaded into the sqoop's class path and registered into the sqoop repository/

Starting/Stopping Sqoop2 server

To start Sqoop2 server invoke the sqoop shell script:

No Format
cd dist/target/sqoop-2.0.0-SNAPSHOT
bin/sqoop.sh server start

The Sqoop2 server is then running as a web application within the Tomcat server.

Similarly, to stop Sqoop2 server, do the following:

No Format
bin/sqoop.sh server stop

Starting/Running Sqoop2 client

To start an interactive shell,

No Format
bin/sqoop.sh client

This will bring up an interactive client ready for input commands:

No Format
Sqoop Shell: Type 'help' or '\h' for help.

sqoop:000>
Note

Please see the 5 min Demo Guide or the Command Line Shell Guide for the latest release 1.99.* http://sqoop.apache.org/docs/

Sqoop configuration files 

Both the default bootstrap configuration sqoop_bootstrap.properties and the main configuration sqoop.properties are located under the server/conf directory in the Sqoop2 distribution directory.

The bootstrap configuration sqoop_bootstrap.properties controls what the mechanism is to provide configuration for different managers in the Sqoop.

No Format
sqoop.config.provider=org.apache.sqoop.core.PropertiesConfigurationProvider

The main configuration sqoop.properties controls what the mechanism is for where the

  • Where are the log files are, what the logging levels are?
  • What is the repository used?
  • What is the submission/ execution engine used?
  • What is the Authentication mechanism used?
No Format
# Logging Configuration
# Any property that starts with the prefix
# org.apache.sqoop.log4j is parsed out by the configuration
# system and passed to the log4j subsystem. This allows you
# to specify log4j configuration properties from within the
# Sqoop configuration.
#
org.apache.sqoop.log4j.appender.file=org.apache.log4j.RollingFileAppender
org.apache.sqoop.log4j.appender.file.File=@LOGDIR@/sqoop.log
org.apache.sqoop.log4j.appender.file.MaxFileSize=25MB
org.apache.sqoop.log4j.appender.file.MaxBackupIndex=5
org.apache.sqoop.log4j.appender.file.layout=org.apache.log4j.PatternLayout
org.apache.sqoop.log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} [%l] %m%n
org.apache.sqoop.log4j.debug=true
org.apache.sqoop.log4j.rootCategory=WARN, file
org.apache.sqoop.log4j.category.org.apache.sqoop=DEBUG
org.apache.sqoop.log4j.category.org.apache.derby=INFO
#
# Audit Loggers Configuration
# Multiple audit loggers could be given here. To specify an
# audit logger, you should at least add org.apache.sqoop.
# auditlogger.[LoggerName].class. You could also provide
# more configuration options by using org.apache.sqoop.
# auditlogger.[LoggerName] prefix, then all these options
# are parsed to the logger class.
#
org.apache.sqoop.auditlogger.default.class=org.apache.sqoop.audit.FileAuditLogger
org.apache.sqoop.auditlogger.default.file=@LOGDIR@/default.audit
#
# Repository configuration
# The Repository subsystem provides the special prefix which
# is "org.apache.sqoop.repository.sysprop". Any property that
# is specified with this prefix is parsed out and set as a
# system property. For example, if the built in Derby repository
# is being used, the sysprop prefixed properties can be used
# to affect Derby configuration at startup time by setting
# the appropriate system properties.
#
# Repository provider
org.apache.sqoop.repository.provider=org.apache.sqoop.repository.JdbcRepositoryProvider
# Repository upgrade
# If set to true, it will not upgrade the sqoop respository schema, by default it will iniate the upgrade on server start-up
org.apache.sqoop.repository.schema.immutable=false
# JDBC repository provider configuration
org.apache.sqoop.repository.jdbc.handler=org.apache.sqoop.repository.derby.DerbyRepositoryHandler
org.apache.sqoop.repository.jdbc.transaction.isolation=READ_COMMITTED
org.apache.sqoop.repository.jdbc.maximum.connections=10
org.apache.sqoop.repository.jdbc.url=jdbc:derby:@BASEDIR@/repository/db;create=true
org.apache.sqoop.repository.jdbc.driver=org.apache.derby.jdbc.EmbeddedDriver
org.apache.sqoop.repository.jdbc.user=sa
org.apache.sqoop.repository.jdbc.password=
# System properties for embedded Derby configuration
org.apache.sqoop.repository.sysprop.derby.stream.error.file=@LOGDIR@/derbyrepo.log
#
# Sqoop Connector configuration
# If set to true will initiate Connectors config upgrade during server startup
#
org.apache.sqoop.connector.autoupgrade=false
#
# Sqoop Driver configuration
# If set to true will initiate the Driver config upgrade during server startup
#
org.apache.sqoop.driver.autoupgrade=false
# Sleeping period for reloading configuration file (once a minute)
org.apache.sqoop.core.configuration.provider.properties.sleep=60000
#
# Submission engine configuration
#
# Submission engine class
org.apache.sqoop.submission.engine=org.apache.sqoop.submission.mapreduce.MapreduceSubmissionEngine
# Number of milliseconds, submissions created before this limit will be removed, default is one day
#org.apache.sqoop.submission.purge.threshold=
# Number of milliseconds for purge thread to sleep, by default one day
#org.apache.sqoop.submission.purge.sleep=
# Number of milliseconds for update thread to sleep, by default 5 minutes
#org.apache.sqoop.submission.update.sleep=
#
# Configuration for Mapreduce submission engine (applicable if it's configured)
#
# Hadoop configuration directory
org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/etc/hadoop/conf/
#
# Execution engine configuration
#
org.apache.sqoop.execution.engine=org.apache.sqoop.execution.mapreduce.MapreduceExecutionEngine
#
# Authentication configuration
#
#org.apache.sqoop.authentication.type=SIMPLE
#org.apache.sqoop.authentication.handler=org.apache.sqoop.security.SimpleAuthenticationHandler
#org.apache.sqoop.anonymous=true
#org.apache.sqoop.authentication.type=KERBEROS
#org.apache.sqoop.authentication.handler=org.apache.sqoop.security.KerberosAuthenticationHandler
#org.apache.sqoop.authentication.kerberos.principal=sqoop/_HOST@NOVALOCAL
#org.apache.sqoop.authentication.kerberos.keytab=/home/kerberos/sqoop.keytab
#org.apache.sqoop.authentication.kerberos.http.principal=HTTP/_HOST@NOVALOCAL
#org.apache.sqoop.authentication.kerberos.http.keytab=

Debug Logs information

  • The logs of the Tomcat server is located under the server/logs directory in the Sqoop2 distribution directory, most relevant would be catalina.out
  • The logs of the Sqoop2 server  as sqoop.log  (by default unless changed by the above sqoop.properties configuration file ) under the  (@LOGDIR) directory in the Sqoop2 distribution directory.
  • The logs for the Derby repository is  derbyrepo.log (by default unless changed by the above  sqoop.properties  configuration file ) under the  (@LOGDIR) directory in the Sqoop2 distribution directory.