...

|| Difference || Hadoop 1.X || Hadoop 2.X ||
| Number of nodes | ~4,000 nodes per cluster | ~10,000 nodes per cluster |
| Running time | O(#nodes in cluster) | O(cluster size) |
| Namespace config | Only 1 namespace node | Multiple namespaces for managing HDFS |
| Application support | Only able to run static Map and Reduce jobs | Able to run any Java apps that can integrate with Hadoop |
| Efficiency | Bottleneck lies in the JobTracker, which handles both resource management and TaskTracker task scheduling | Uses YARN (Yet Another Resource Negotiator) to perform effective cluster management |

*Table 1.1 – Key differences between Hadoop 1.X and 2.X \[11\]*

Although this table does not cover every difference between the two codebases, it is a good starting point for exploring what changes must be made to Apache Nutch’s tasks to port them to 2.X. In Apache Hadoop 2.x, resource management has been moved into Apache Hadoop YARN, a general-purpose distributed application management framework, while Apache Hadoop MapReduce (aka MRv2) remains a pure distributed computation framework.
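To make the split described above more concrete, the following is a minimal toy sketch in plain Java. It is not Hadoop or YARN API: the class names `ResourceManager` and `MapReduceAppMaster` only borrow the names of the YARN roles, and the container counts are made-up illustration values. The only point it tries to show is that one component hands out cluster resources while a separate, per-job component decides how the computation uses them.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: names mirror YARN roles, but none of this is Hadoop API. */
final class YarnSplitSketch {

    /** Cluster-wide resource management (the YARN side): it only hands out container slots. */
    static final class ResourceManager {
        private int freeContainers;

        ResourceManager(int totalContainers) {
            this.freeContainers = totalContainers;
        }

        /** Grants up to {@code requested} containers and returns how many were granted. */
        int allocate(int requested) {
            int granted = Math.min(requested, freeContainers);
            freeContainers -= granted;
            return granted;
        }

        /** Returns containers to the pool once their tasks are done. */
        void release(int count) {
            freeContainers += count;
        }

        int freeContainers() {
            return freeContainers;
        }
    }

    /** Per-job computation logic (the MapReduce side): it decides how tasks use granted containers. */
    static final class MapReduceAppMaster {
        private final ResourceManager rm;

        MapReduceAppMaster(ResourceManager rm) {
            this.rm = rm;
        }

        /** Runs {@code taskCount} tasks in waves sized by what the resource manager grants. */
        List<Integer> runTasks(int taskCount) {
            List<Integer> waves = new ArrayList<>();
            int remaining = taskCount;
            while (remaining > 0) {
                int granted = rm.allocate(remaining);
                if (granted == 0) {
                    throw new IllegalStateException("no containers available");
                }
                waves.add(granted);   // in this toy model the tasks "execute" here
                remaining -= granted;
                rm.release(granted);  // containers go back to the pool after each wave
            }
            return waves;
        }
    }

    public static void main(String[] args) {
        ResourceManager rm = new ResourceManager(4);       // hypothetical 4-container cluster
        MapReduceAppMaster appMaster = new MapReduceAppMaster(rm);
        System.out.println(appMaster.runTasks(10));        // prints [4, 4, 2]
    }
}
```

In this sketch the resource manager knows nothing about map or reduce tasks, which loosely mirrors why Hadoop 2.X can host applications other than MapReduce on the same cluster. In Hadoop 1.X terms, both classes would collapse into the single JobTracker.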

...