
  1. Set the bash environment variables HADOOP_HOME=/usr/lib/hadoop and HADOOP_CONF_DIR=$HADOOP_HOME/conf
  2. Go to /usr/share/doc/mahout/examples/bin and unzip cluster-reuters.sh.gz
    Code Block
    export HADOOP_HOME=/usr/lib/hadoop
    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
    
  3. Modify the contents of cluster-reuters.sh, replacing MAHOUT="../../bin/mahout" with MAHOUT="/usr/lib/mahout/bin/mahout"
  4. Make sure the Hadoop file system (HDFS) is running
  5. Run ./cluster-reuters.sh; it will display a menu of clustering algorithms:
    ubuntu@ip-10-224-109-199:/usr/share/doc/mahout/examples/bin$ ./cluster-reuters.sh
    Please select a number to choose the corresponding clustering algorithm
    1. kmeans clustering
    2. fuzzykmeans clustering
    3. lda clustering
    4. dirichlet clustering
    5. minhash clustering
    Enter your choice : 1
    ok. You chose 1 and we'll use kmeans Clustering
    creating work directory at /tmp/mahout-work-ubuntu
    Downloading Reuters-21578
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 7959k  100 7959k    0     0   346k      0  0:00:22  0:00:22 --:--:--  356k
    Extracting...
    AFTER WAITING 1/2 HR...
    Inter-Cluster Density: 0.8080922658756075
    Intra-Cluster Density: 0.6978329770855537
    CDbw Inter-Cluster Density: 0.0
    CDbw Intra-Cluster Density: 89.38857003754612
    CDbw Separation: 303.4892272989769
    12/03/29 03:42:56 INFO clustering.ClusterDumper: Wrote 19 clusters
    12/03/29 03:42:56 INFO driver.MahoutDriver: Program took 261107 ms (Minutes: 4.351783333333334)
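The edit in step 3 can be scripted rather than done by hand. The snippet below is a minimal, self-contained sketch: it uses a stand-in file in /tmp instead of the real /usr/share/doc/mahout/examples/bin/cluster-reuters.sh, so only the substitution itself is demonstrated.

```shell
# Stand-in for the real cluster-reuters.sh (hypothetical /tmp copy)
printf 'MAHOUT="../../bin/mahout"\n' > /tmp/cluster-reuters.sh

# Point MAHOUT at the Bigtop-installed binary, as in step 3 above
sed -i 's|MAHOUT="../../bin/mahout"|MAHOUT="/usr/lib/mahout/bin/mahout"|' /tmp/cluster-reuters.sh

grep MAHOUT /tmp/cluster-reuters.sh   # MAHOUT="/usr/lib/mahout/bin/mahout"
```

Using | as the sed delimiter avoids having to escape the slashes in the paths.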

Running Whirr

  1. Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in .bashrc according to the values under your AWS account. Verify that they are set correctly (e.g. echo $AWS_ACCESS_KEY_ID) before proceeding.
  2. Run the Hadoop recipe as below.
    Panel

    ~/whirr-0.7.1:bin/whirr launch-cluster  --config recipes/hadoop-ec2.properties

  3. If you get an error message like the following:
    Panel

    Unable to start the cluster. Terminating all nodes.
    org.apache.whirr.net.DnsException: java.net.ConnectException: Connection refused
    at org.apache.whirr.net.FastDnsResolver.apply(FastDnsResolver.java:83)
    at org.apache.whirr.net.FastDnsResolver.apply(FastDnsResolver.java:40)
    at org.apache.whirr.Cluster$Instance.getPublicHostName(Cluster.java:112)
    at org.apache.whirr.Cluster$Instance.getPublicAddress(Cluster.java:94)
    at org.apache.whirr.service.hadoop.HadoopNameNodeClusterActionHandler.doBeforeConfigure(HadoopNameNodeClusterActionHandler.java:58)
    at org.apache.whirr.service.hadoop.HadoopClusterActionHandler.beforeConfigure(HadoopClusterActionHandler.java:87)
    at org.apache.whirr.service.ClusterActionHandlerSupport.beforeAction(ClusterActionHandlerSupport.java:53)
    at org.apache.whirr.actions.ScriptBasedClusterAction.execute(ScriptBasedClusterAction.java:100)
    at org.apache.whirr.ClusterController.launchCluster(ClusterController.java:109)
    at org.apache.whirr.cli.command.LaunchClusterCommand.run(LaunchClusterCommand.java:63)
    at org.apache.whirr.cli.Main.run(Main.java:64)
    at org.apache.whirr.cli.Main.main(Main.java:97)

    apply the patch from WHIRR-459: https://issues.apache.org/jira/browse/WHIRR-459
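For reference, the recipe passed to --config is a plain Java properties file. The sketch below shows roughly what a minimal hadoop-ec2 recipe contains under Whirr 0.7.x; the cluster name and node counts here are illustrative assumptions, and the identity/credential lines are why the AWS variables from step 1 must be set in the environment.

```properties
# Illustrative sketch of a minimal Whirr hadoop-ec2 recipe (Whirr 0.7.x).
# Cluster name and node counts are assumptions; adjust to taste.
whirr.cluster-name=hadoop
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,1 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
# Picked up from the environment variables set in step 1
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
```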

Where to go from here

It is highly recommended that you read the documentation provided by the Hadoop project itself (http://hadoop.apache.org/common/docs/r0.20.205.0/ for Bigtop 0.2, or https://hadoop.apache.org/common/docs/r1.0.0/ for Bigtop 0.3), and that you browse through the Puppet deployment code shipped as part of the Bigtop release (bigtop-deploy/puppet/modules, bigtop-deploy/puppet/manifests).