...
- 1.1) Explore Nutch Documentation:
Since I have limited knowledge of the Nutch codebase, I will likely start by going through the Nutch documentation [1].
- 1.2) Workspace Setup:
The Nutch workspace is built on Ant+Ivy. I have experience with the Ant build framework, so workspace setup should be relatively easy. I have forked the Nutch codebase on GitHub [2], and after successful completion I will provide the patch. Nutch's current Hadoop dependency, hadoop-core-1.x.jar, is changed in Hadoop 2.x:
<dependency org="org.apache.hadoop" name="hadoop-core" rev="1.2.0" conf="*->default">
  <exclude org="hsqldb" name="hsqldb" />
  <exclude org="net.sf.kosmosfs" name="kfs" />
  <exclude org="net.java.dev.jets3t" name="jets3t" />
  <exclude org="org.eclipse.jdt" name="core" />
  <exclude org="org.mortbay.jetty" name="jsp-*" />
  <exclude org="ant" name="ant" />
</dependency>
- The following dependencies need to be added for Hadoop 2.6 support in place of the one above.
<dependency org="org.apache.hadoop" name="hadoop-common" rev="2.6.0" conf="*->default" />
<dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core" rev="2.6.0" conf="*->default" />
- The dependency on hadoop-test-1.2.0.jar needs to be removed.
<dependency org="org.apache.hadoop" name="hadoop-test" rev="1.2.0" conf="test->default" />
- 1.3) Experimental setup of Nutch with Hadoop and its result:
- I have been using Hadoop 2.3 for my MapReduce applications. While trying to set up Nutch 1.9 with Hadoop 2.3, I ran into the following error:
Injector: java.lang.UnsupportedOperationException: Not implemented by the DistributedFileSystem FileSystem implementation
    at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:214)
    at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2365)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:297)
    at org.apache.nutch.crawl.Injector.run(Injector.java:380)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.Injector.main(Injector.java:370)
- Maybe I will start my investigation from this point onwards.
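One possible explanation for the error above, not confirmed here, is that Hadoop 1.x and 2.x jars end up on the same classpath, so that an older DistributedFileSystem class without a getScheme() override is loaded by the 2.x FileSystem code. The Java sketch below is a quick way to check for that situation. The class HadoopJarCheck and its methods are hypothetical helpers, not part of Nutch or Hadoop, and the jar naming pattern (hadoop-<module>-<major>.<rest>.jar) is an assumption.

```java
import java.io.File;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: reports which Hadoop major versions appear on a classpath. */
final class HadoopJarCheck {

    // Assumed naming scheme, e.g. hadoop-core-1.2.0.jar or hadoop-common-2.3.0.jar.
    // Group 1 captures the major version number.
    private static final Pattern HADOOP_JAR =
            Pattern.compile("hadoop-[a-z-]+-(\\d+)\\..*\\.jar");

    /** Collects the major versions of all Hadoop jars named in the classpath string. */
    static SortedSet<Integer> majorVersions(String classpath, String separator) {
        SortedSet<Integer> majors = new TreeSet<>();
        for (String entry : classpath.split(Pattern.quote(separator))) {
            // Only the file name matters; the directory part is ignored.
            String fileName = new File(entry).getName();
            Matcher m = HADOOP_JAR.matcher(fileName);
            if (m.matches()) {
                majors.add(Integer.parseInt(m.group(1)));
            }
        }
        return majors;
    }

    /** True if jars from more than one Hadoop major version are present. */
    static boolean hasMixedVersions(String classpath, String separator) {
        return majorVersions(classpath, separator).size() > 1;
    }

    public static void main(String[] args) {
        // Inspect the classpath of the running JVM.
        String cp = System.getProperty("java.class.path");
        SortedSet<Integer> majors = majorVersions(cp, File.pathSeparator);
        System.out.println("Hadoop major versions on classpath: " + majors);
        if (majors.size() > 1) {
            System.out.println("Warning: mixed Hadoop versions detected");
        }
    }
}
```

Running main with the same classpath that the Nutch job uses prints the Hadoop major versions found. More than one value would suggest that leftover 1.x jars, such as hadoop-core, still need to be removed.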
...
- Time to submit Final Report
References:
[1] http://wiki.apache.org/nutch/FrontPage
[2] https://github.com/sumansaurabh/nutch
[3] https://sites.google.com/site/nutch1936/home/3-methodology
[4] http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_mapreduce_to_yarn_migrate.html
[5] http://www.slideshare.net/tshooter/strata-conf2014
[6] http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html
[7] http://www.slideshare.net/wattsteve/web-crawling-and-data-gathering-with-apache-nutch?related=2