...
Difference | Hadoop 1.X | Hadoop 2.X |
Number of nodes | ~4,000 nodes per cluster | ~10,000 nodes per cluster |
Running Time | O(#nodes in cluster) | O(cluster size) |
Namespace Config | A single namespace (one NameNode) | Multiple namespaces for managing HDFS |
Application support | Only able to run static MapReduce jobs | Able to run any Java application that can integrate with Hadoop |
Efficiency | The JobTracker is a bottleneck, since it handles both resource management and the scheduling of tasks on TaskTrackers | Uses YARN (Yet Another Resource Negotiator) to perform effective cluster management |
*Table 1.1 – Key differences between Hadoop 1.X and 2.X [11]*
Although this table does not cover every difference between the two codebases, it is a good starting point for exploring what changes must be made to Apache Nutch’s tasks to port them to 2.X. In Apache Hadoop 2.x, the resource management capabilities have been moved into Apache Hadoop YARN, a general-purpose distributed application management framework, while Apache Hadoop MapReduce (aka MRv2) remains a pure distributed computation framework.
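
The split described above can be illustrated with a small toy model. This is only a sketch and does not use the Hadoop or YARN API: the class names `ResourceManager` and `ApplicationMaster` borrow YARN's terminology, but every method here (`allocate`, `release`, `run`) is a hypothetical simplification. It assumes a cluster-wide component that only hands out containers, and a per-application component that decides how to run its own tasks with whatever it is granted, which is roughly the division of labour that replaces the single JobTracker.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: none of these classes or methods are the real Hadoop/YARN API. */
public class YarnSplitSketch {

    /** Stands in for cluster-wide resource management: it only hands out containers. */
    static class ResourceManager {
        private int freeContainers;

        ResourceManager(int totalContainers) {
            this.freeContainers = totalContainers;
        }

        /** Grants up to {@code requested} containers and returns how many were granted. */
        synchronized int allocate(int requested) {
            int granted = Math.min(requested, freeContainers);
            freeContainers -= granted;
            return granted;
        }

        /** Returns containers to the pool once the application is done with them. */
        synchronized void release(int count) {
            freeContainers += count;
        }

        synchronized int free() {
            return freeContainers;
        }
    }

    /** Stands in for a per-application master: it schedules its own tasks. */
    static class ApplicationMaster {
        private final String appName;
        private final ResourceManager rm;

        ApplicationMaster(String appName, ResourceManager rm) {
            this.appName = appName;
            this.rm = rm;
        }

        /**
         * Runs all tasks in waves, using however many containers are granted
         * each time, and returns the number of waves that were needed.
         */
        int run(List<Runnable> tasks) {
            int waves = 0;
            int next = 0;
            while (next < tasks.size()) {
                int granted = rm.allocate(tasks.size() - next);
                if (granted == 0) {
                    throw new IllegalStateException(appName + ": no containers available");
                }
                for (int i = 0; i < granted; i++) {
                    tasks.get(next++).run();
                }
                rm.release(granted);
                waves++;
            }
            return waves;
        }
    }

    public static void main(String[] args) {
        ResourceManager rm = new ResourceManager(2);
        ApplicationMaster am = new ApplicationMaster("nutch-fetch", rm);

        List<Runnable> tasks = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            final int id = i;
            tasks.add(() -> System.out.println("task " + id + " ran"));
        }
        System.out.println("waves needed: " + am.run(tasks));
    }
}
```

Because the toy `ResourceManager` knows nothing about what a task is, the same allocation logic could serve an application master that is not MapReduce at all, which is the point of the "Application support" row in Table 1.1.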
...