
Running Hadoop Components

One of the advantages of Bigtop is the ease of installation of the different Hadoop Components without having to hunt for a specific Hadoop Component distribution and matching it with a specific Hadoop version.

Running Pig

  1. Install Pig
    No Format
    
    sudo apt-get install pig
    

  2. Create a tab delimited text file using your favorite editor
    No Format
    
    1	A
    2	B
    3	C
    
  3. Import the file into HDFS under your user directory /user/$USER. By default Pig will look there for your file. Start the Pig shell and verify that a load and dump work. Make sure you have a space on both sides of the = sign. The statement using PigStorage('\t') tells Pig the columns in the text file are delimited using tabs.
    No Format
    
    $pig
    grunt> A = load '/pigdata/PIGTESTA.txt' using PigStorage('\t');
    grunt> dump A
    

    2013-07-06 07:22:56,272 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
    2013-07-06 07:22:56,276 [main] WARN  org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
    2013-07-06 07:22:56,295 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
    2013-07-06 07:22:56,295 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
    (1,A)
    (2,B)
    (3,C)
    ()
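  4. Optionally, reload the file with an explicit schema and filter it. This is a minimal Pig Latin sketch; the schema names id and letter and the output path /pigdata/filtered are just example names
    No Format
    
    grunt> A = load '/pigdata/PIGTESTA.txt' using PigStorage('\t') as (id:int, letter:chararray);
    grunt> B = filter A by id > 1;
    grunt> dump B
    grunt> store B into '/pigdata/filtered' using PigStorage('\t');
    
    dump B should print only the rows with id 2 and 3.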
    

Running HBase

  1. Install HBase
    No Format
    
    sudo apt-get install hbase\*
    
  2. For bigtop-0.2.0 uncomment and set JAVA_HOME in /etc/hbase/conf/hbase-env.sh
  3. For bigtop-0.3.0 this shouldn't be necessary because JAVA_HOME is auto-detected. Start the HBase master and open the HBase shell
    No Format
    
    sudo service hbase-master start
    hbase shell
    
  4. Test the HBase shell by creating an HBase table named t2 with 3 column families f1, f2 and f3. Verify the table exists in HBase
    No Format
    
    hbase(main):001:0> create 't2','f1','f2','f3'
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/usr/lib/hbase/lib/slf4j-log4j12-1.5.8.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    0 row(s) in 3.4390 seconds
    
    hbase(main):002:0> list
    TABLE
    t2
    2 row(s) in 0.0220 seconds
    
    hbase(main):003:0>
    
    You should see a verification from HBase that the table t2 exists: the table name t2 should appear in the output of list.
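  5. Optionally, put a value into t2 and read it back to confirm the table is usable. This is a minimal sketch; the row key r1, column f1:c1 and value value1 are just example names
    No Format
    
    hbase(main):004:0> put 't2', 'r1', 'f1:c1', 'value1'
    hbase(main):005:0> get 't2', 'r1'
    hbase(main):006:0> scan 't2'
    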

Running Hive

  1. This is for bigtop-0.2.0, where hadoop-hive, hadoop-hive-server, and hadoop-hive-metastore are installed automatically because the hive service names start with the word hadoop. For bigtop-0.3.0, if you use the sudo apt-get install hadoop* command you won't get the Hive components installed, because the Hive daemon names are changed in Bigtop. For bigtop-0.3.0 you will have to do
    No Format
    
    sudo apt-get install hive hive-server hive-metastore
    

  2. Create the HDFS directories Hive needs. The Hive post-install scripts should create the /tmp and /user/hive/warehouse directories; if they don't exist, create them in HDFS. The post-install scripts can't create these directories during the deb file installation because JAVA_HOME is buried in hadoop-env.sh, so HDFS is not up and running at that point and the directories cannot be created.
    No Format
    
    hadoop fs -mkdir /tmp
    hadoop fs -mkdir /user/hive/warehouse
    hadoop fs -chmod g+x /tmp
    hadoop fs -chmod g+x /user/hive/warehouse
    

  3. If the post-install scripts didn't create the directories /var/run/hive and /var/lock/subsys, create them
    No Format
    
    sudo mkdir /var/run/hive
    sudo mkdir /var/lock/subsys
    

  4. Start the Hive server
    No Format
    
    sudo /etc/init.d/hive-server start
    

  5. Create a table in Hive and verify it is there
    No Format
    
    ubuntu@ip-10-101-53-136:~$ hive
    WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
    Hive history file=/tmp/ubuntu/hive_job_log_ubuntu_201203202331_281981807.txt
    hive> create table doh(id int);
    OK
    Time taken: 12.458 seconds
    hive> show tables;
    OK
    doh
    Time taken: 0.283 seconds
    hive>
    
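  6. Optionally, load some data into the doh table and query it. This is a minimal sketch; /tmp/doh.txt is just an example local file containing one integer per line
    No Format
    
    hive> load data local inpath '/tmp/doh.txt' into table doh;
    hive> select count(*) from doh;
    hive> drop table doh;
    
    The count runs as a MapReduce job; the drop just removes the test table when you are done.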

Running Mahout

  1. Set the bash environment variables HADOOP_HOME=/usr/lib/hadoop and HADOOP_CONF_DIR=$HADOOP_HOME/conf
  2. Install Mahout: sudo apt-get install mahout
  3. Go to /usr/share/doc/mahout/examples/bin and unzip cluster-reuters.sh.gz
    Code Block
    
    export HADOOP_HOME=/usr/lib/hadoop
    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
    
  4. Modify the contents of cluster-reuters.sh: replace MAHOUT="../../bin/mahout" with MAHOUT="/usr/lib/mahout/bin/mahout". Make sure the Hadoop file system is running and you have the "curl" command on your system.
  5. ./cluster-reuters.sh will display a menu selection
    Panel
    
    ubuntu@ip-10-224-109-199:/usr/share/doc/mahout/examples/bin$ ./cluster-reuters.sh

    Please select a number to choose the corresponding clustering algorithm
    1. kmeans clustering
    2. fuzzykmeans clustering
    3. lda clustering
    4. dirichlet clustering
    5. minhash clustering
    Enter your choice : 1
    ok. You chose 1 and we'll use kmeans Clustering
    creating work directory at /tmp/mahout-work-ubuntu

    Downloading Reuters-21578
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 7959k  100 7959k    0     0   346k      0  0:00:22  0:00:22 --:--:--  356k
    Extracting...
    
    AFTER WAITING 1/2 HR...
    Inter-Cluster Density: 0.8080922658756075
    Intra-Cluster Density: 0.6978329770855537
    CDbw Inter-Cluster Density: 0.0
    CDbw Intra-Cluster Density: 89.38857003754612
    CDbw Separation: 303.4892272989769
    12/03/29 03:42:56 INFO clustering.ClusterDumper: Wrote 19 clusters
    12/03/29 03:42:56 INFO driver.MahoutDriver: Program took 261107 ms (Minutes: 4.351783333333334)

  6. Run classify-20newsgroups.sh. First change ../bin/mahout to /usr/lib/mahout/bin/mahout; do a find and replace using your favorite editor (there are several instances of ../bin/mahout which need to be replaced by /usr/lib/mahout/bin/mahout), or use sed as sketched at the end of this section.
  7. Run the rest of the examples under this directory, except the netflix data set, which is no longer officially available.
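  8. Instead of editing the scripts by hand, the path replacements in steps 4 and 6 can be done with sed. This is a convenience sketch, assuming the example scripts live in /usr/share/doc/mahout/examples/bin as above
    No Format
    
    cd /usr/share/doc/mahout/examples/bin
    sudo sed -i 's|\.\./\.\./bin/mahout|/usr/lib/mahout/bin/mahout|g' cluster-reuters.sh
    sudo sed -i 's|\.\./bin/mahout|/usr/lib/mahout/bin/mahout|g' classify-20newsgroups.sh
    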

Running Whirr

  1. Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in .bashrc according to the values under your AWS account. Verify with echo $AWS_ACCESS_KEY_ID that this is valid before proceeding.
  2. Run the zookeeper recipe as below.
    Panel
    
    ~/whirr-0.7.1:bin/whirr launch-cluster --config recipes/hadoop-ec2.properties
    

  3. If you get an error message like the one below, apply Whirr patch 459: https://issues.apache.org/jira/browse/WHIRR-459
    Panel
    
    Unable to start the cluster. Terminating all nodes.
    org.apache.whirr.net.DnsException: java.net.ConnectException: Connection refused
        at org.apache.whirr.net.FastDnsResolver.apply(FastDnsResolver.java:83)
        at org.apache.whirr.net.FastDnsResolver.apply(FastDnsResolver.java:40)
        at org.apache.whirr.Cluster$Instance.getPublicHostName(Cluster.java:112)
        at org.apache.whirr.Cluster$Instance.getPublicAddress(Cluster.java:94)
        at org.apache.whirr.service.hadoop.HadoopNameNodeClusterActionHandler.doBeforeConfigure(HadoopNameNodeClusterActionHandler.java:58)
        at org.apache.whirr.service.hadoop.HadoopClusterActionHandler.beforeConfigure(HadoopClusterActionHandler.java:87)
        at org.apache.whirr.service.ClusterActionHandlerSupport.beforeAction(ClusterActionHandlerSupport.java:53)
        at org.apache.whirr.actions.ScriptBasedClusterAction.execute(ScriptBasedClusterAction.java:100)
        at org.apache.whirr.ClusterController.launchCluster(ClusterController.java:109)
        at org.apache.whirr.cli.command.LaunchClusterCommand.run(LaunchClusterCommand.java:63)
        at org.apache.whirr.cli.Main.run(Main.java:64)
        at org.apache.whirr.cli.Main.main(Main.java:97)
    

  4. When Whirr is finished launching the cluster, you will see an entry under ~/.whirr that verifies the cluster is running. Cat out the hadoop-proxy.sh command to find the EC2 instance address, or cat out the instance file; both will give you the Hadoop namenode address even though you started the mahout service using Whirr (see the sketch after the ssh step below).
  5. ssh into the instance to verify you can log in. Note: this login is different than a normal EC2 instance login. The ssh key is id_rsa and there is no user name for the instance IP address
    Panel
    
    ~/.whirr/mahout:ssh -i ~/.ssh/id_rsa ec2-50-16-85-59.compute-1.amazonaws.com
    
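    The sketch below shows where these files typically live. The cluster directory name (mahout here) matches the cluster name in your recipe, and the exact file names can vary between Whirr versions
    No Format
    
    cat ~/.whirr/mahout/instances
    cat ~/.whirr/mahout/hadoop-proxy.sh
    sh ~/.whirr/mahout/hadoop-proxy.sh
    
    Leave the proxy running in a separate terminal if you want to reach the cluster's HDFS from your workstation.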

  6. Verify you can access the HDFS file system from the instance
    No Format
    
    dc@ip-10-70-18-203:~$ hadoop fs -ls /
    Found 3 items
    drwxr-xr-x   - hadoop supergroup          0 2012-03-30 23:44 /hadoop
    drwxrwxrwx   - hadoop supergroup          0 2012-03-30 23:44 /tmp
    drwxrwxrwx   - hadoop supergroup          0 2012-03-30 23:44 /user
    

Running Oozie

  1. Stop the Oozie daemons. Use ps -ef | grep oozie to find them, then sudo kill pid (the pid from the ps -ef command)
  2. Stopping the Oozie daemons may not remove the oozie.pid file, which tells the system an Oozie process is running. You may have to manually remove the pid file using sudo rm -rf /var/run/oozie/oozie.pid

  3. cd into /usr/lib/oozie and set up the Oozie environment variables using bin/oozie-env.sh
  4. Download ext-2.2.zip from http://incubator.apache.org/oozie/QuickStart.html
  5. Install ext-2.2.zip using
    No Format
    
    bin/oozie-setup.sh -hadoop 1.0.1 ${HADOOP_HOME} -extjs ext-2.2.zip
    

  6. You will get an error message; change the above to the highest Hadoop version available,
    No Format
    
    sudo bin/oozie-setup.sh -hadoop 0.20.200 ${HADOOP_HOME} -extjs ext-2.2.zip
    

  7. Start Oozie: sudo bin/oozie-start.sh
  8. Run Oozie: sudo bin/oozie-run.sh. You will get a lot of error messages; this is ok.
  9. Go to the public EC2 DNS address on port 11000 under /oozie. My address looked like: http://ec2-67-202-18-159.compute-1.amazonaws.com:11000/oozie/
    Image Added: https://cwiki.apache.org/confluence/download/attachments/27831258/Screen+Shot+2012-03-31+at+1.19.56+AM.png

  10. Go to the Oozie Apache page and run the Oozie examples. A quick status check with the Oozie CLI is sketched below.
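  11. To confirm the Oozie server is up before browsing the console, ask it for its status with the Oozie CLI. This is a minimal check, assuming the server is listening on the default port 11000 on the local machine
    No Format
    
    oozie admin -oozie http://localhost:11000/oozie -status
    
    A healthy server reports something like "System mode: NORMAL".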

Running Zookeeper

ZooKeeper is installed as part of HBase. A simple echo check against the ZooKeeper service is sketched below.
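This is a minimal check, assuming a ZooKeeper server is listening on its default client port 2181 on the local machine; it sends the four-letter commands ruok and stat:

    No Format
    
    echo ruok | nc localhost 2181
    echo stat | nc localhost 2181
    
A healthy server answers ruok with imok, and stat with its version and connection details.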

Running Sqoop

Install Sqoop using: [redhat@ip-10-28-189-235 ~]$ sudo yum install sqoop*

You should see:

Loaded plugins: amazon-id, rhui-lb, security
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package sqoop.noarch 0:1.4.1-1.fc16 will be installed
---> Package sqoop-metastore.noarch 0:1.4.1-1.fc16 will be installed
--> Finished Dependency Resolution
Dependencies Resolved
==============================================================================================
 Package               Arch         Version               Repository                     Size
==============================================================================================
Installing:
 sqoop                 noarch       1.4.1-1.fc16          bigtop-0.3.0-incubating       3.4 M
 sqoop-metastore       noarch       1.4.1-1.fc16          bigtop-0.3.0-incubating       4.9 k
Transaction Summary
==============================================================================================
Install       2 Package(s)
Total download size: 3.4 M
Installed size: 4.9 M
Is this ok [y/N]: y
Downloading Packages:
(1/2): sqoop-1.4.1-1.fc16.noarch.rpm                                   | 3.4 MB     00:01     
(2/2): sqoop-metastore-1.4.1-1.fc16.noarch.rpm                         | 4.9 kB     00:00     
----------------------------------------------------------------------------------------------
Total                                                         2.0 MB/s | 3.4 MB     00:01     
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
  Installing : sqoop-1.4.1-1.fc16.noarch                                                  1/2 
  Installing : sqoop-metastore-1.4.1-1.fc16.noarch                                        2/2 
Installed:
  sqoop.noarch 0:1.4.1-1.fc16              sqoop-metastore.noarch 0:1.4.1-1.fc16             
Complete!

To test that Sqoop is working, run the Sqoop CLI:
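For example (a minimal check; these two commands need no database, while Sqoop's import and export tools also require a reachable database and its JDBC driver):

    No Format
    
    sqoop version
    sqoop help
    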

Running Flume/FlumeNG