
...

  • 1.1) Explore Nutch Documentation:
    Since I have limited knowledge of the Nutch codebase, I will likely start by going through the Nutch documentation [1].

  • 1.2) Workspace Setup:
    The Nutch workspace is built on Ant+Ivy. I have experience with the Ant build framework, so workspace setup should be relatively easy. I have forked the Nutch codebase to my Git repository [2], and after successful completion I will provide the patch. The Nutch dependency on Hadoop, hadoop-core.1.x.jar, is changed in Hadoop 2.x. The current declaration is:

    <dependency org="org.apache.hadoop" name="hadoop-core" rev="1.2.0" conf="*->default">
       <exclude org="hsqldb" name="hsqldb" />
       <exclude org="net.sf.kosmosfs" name="kfs" />
       <exclude org="net.java.dev.jets3t" name="jets3t" />
       <exclude org="org.eclipse.jdt" name="core" />
       <exclude org="org.mortbay.jetty" name="jsp-*" />
       <exclude org="ant" name="ant" />
    </dependency>
    

  • The following dependencies need to be added for Hadoop 2.6 support in place of the one above:
    <dependency org="org.apache.hadoop" name="hadoop-common" rev="2.6.0" conf="*->default" />
    <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core" rev="2.6.0" conf="*->default" />
    

  • The hadoop-test-1.2.0.jar dependency needs to be removed:
    <dependency org="org.apache.hadoop" name="hadoop-test" rev="1.2.0" conf="test->default" />
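
  • After swapping the dependencies, it may be worth checking that no Hadoop 1.x jar is still on the classpath. The following is only a minimal sketch: the class name LegacyHadoopJarCheck and the jar-name pattern (hadoop-core-1.*.jar, hadoop-test-1.*.jar, following the usual name-revision.jar convention) are my own assumptions and not part of Nutch or Hadoop.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: scans a classpath string for leftover Hadoop 1.x jars.
class LegacyHadoopJarCheck {

    // Assumption: Hadoop 1.x artifacts are named hadoop-core-1.*.jar or hadoop-test-1.*.jar.
    private static final Pattern LEGACY_JAR =
            Pattern.compile("hadoop-(core|test)-1\\..*\\.jar");

    // Returns every classpath entry whose file name matches the legacy pattern.
    static List<String> findLegacyJars(String classPath) {
        List<String> hits = new ArrayList<>();
        for (String entry : classPath.split(Pattern.quote(File.pathSeparator))) {
            String fileName = new File(entry).getName();
            if (LEGACY_JAR.matcher(fileName).matches()) {
                hits.add(entry);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Checks the classpath of the JVM that runs this class.
        List<String> hits = findLegacyJars(System.getProperty("java.class.path"));
        if (hits.isEmpty()) {
            System.out.println("No Hadoop 1.x jars found on the classpath.");
        } else {
            for (String hit : hits) {
                System.out.println("Leftover Hadoop 1.x jar: " + hit);
            }
        }
    }
}
```

    Running it with the same classpath that the Nutch job uses would list any leftover 1.x jars. It only inspects file names, so a repackaged or renamed jar would not be detected.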
    
  • 1.3) Experimental setup of Nutch with Hadoop and its result:
  • I have been using Hadoop 2.3 for my MapReduce application. While trying to set up Nutch 1.9 with Hadoop 2.3, I ran into the following error:
    Injector:
      java.lang.UnsupportedOperationException: Not implemented by the DistributedFileSystem FileSystem implementation
      at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:214)
      at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2365)
      at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
      at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
      at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
      at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
      at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
      at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
      at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
      at org.apache.nutch.crawl.Injector.inject(Injector.java:297)
      at org.apache.nutch.crawl.Injector.run(Injector.java:380)
      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
      at org.apache.nutch.crawl.Injector.main(Injector.java:370)
    
  • Maybe I should start investigating from this point onwards?
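
  • One possible explanation for the trace above, which I have not confirmed, is that a file system class built against Hadoop 1.x is still being discovered at runtime and does not override getScheme(). The sketch below is a simplified model of that situation, not Hadoop source code: BaseFs, ModernDfs, LegacyDfs and register are hypothetical names that I made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical model of scheme registration; not Hadoop code.
class SchemeRegistryModel {

    // Stand-in for a file system base class whose default getScheme() refuses to answer.
    static abstract class BaseFs {
        String getScheme() {
            throw new UnsupportedOperationException(
                    "Not implemented by the " + getClass().getSimpleName()
                            + " FileSystem implementation");
        }
    }

    // Implementation written against the newer base class: it overrides getScheme().
    static class ModernDfs extends BaseFs {
        @Override
        String getScheme() {
            return "hdfs";
        }
    }

    // Implementation written against an older base class: no override, so the default throws.
    static class LegacyDfs extends BaseFs {
    }

    // Maps each discovered implementation to its scheme; fails on the first one without a scheme.
    static Map<String, Class<? extends BaseFs>> register(List<? extends BaseFs> discovered) {
        Map<String, Class<? extends BaseFs>> byScheme = new LinkedHashMap<>();
        for (BaseFs fs : discovered) {
            byScheme.put(fs.getScheme(), fs.getClass());
        }
        return byScheme;
    }

    public static void main(String[] args) {
        System.out.println(register(Arrays.asList(new ModernDfs())).keySet());
        try {
            register(Arrays.asList(new LegacyDfs()));
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

    In this model a single legacy implementation is enough to abort registration. If the same thing is happening in my setup, finding which jar supplies DistributedFileSystem would be the first thing to check.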

...

  • Time to submit Final Report

References:

  • [1] http://wiki.apache.org/nutch/FrontPage
  • [2] https://github.com/sumansaurabh/nutch
  • [3] https://sites.google.com/site/nutch1936/home/3-methodology
  • [4] http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_mapreduce_to_yarn_migrate.html
  • [5] http://www.slideshare.net/tshooter/strata-conf2014
  • [6] http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html
  • [7] http://www.slideshare.net/wattsteve/web-crawling-and-data-gathering-with-apache-nutch?related=2