Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migrated to Confluence 5.3

...

LinkRank contains several MapReduce steps which should be implemented converted to BSP.

  • Counter job. Counter job is simplest in implementation. This job will count number of inlinks and outlinks in whole graph(edges), which is bit tricky in terms of BSP. Since, if we put counting as a superstep in compute() process, every node (or vertex) should send number of outlinks to every other node and then all nodes in next superstep should sum all incoming messages. This will create some overhead to network. Instead we can create two separate BSP jobs. First in its superstep all nodes will send number of outlinks and sum combiner and aggregator will be used, which decreases network overhead.
  • Initializer job. Will initialize nodes with default value for inlink score, depending on how Counter job is implemented this can be first or second supertstep. This superstep is straightforward and all nodes will set their respective default value.
  • Inverter job. Inverter will invert all outlinks as inlink. Inverter job in BSP consists from 2 supersteps. After finishing Initializer superstep (next jobs will be written as superstep if converted to BSP) 1st superstep of Inverter is sending outlink to its respective destination as a message and remove all outlinks. 2nd superstep of Inverter will get messages and add to its outlinks list (edges).
  • Analyzer job. Analyzer will be executed as long as it reaches iteration limit or it doesn’t have any outlinks and votes for halt.
  • Loop detector job. Loop detector is not explicit part of LinkRank job, but users can define whether loops should be considered or not. In that case they will need separate BSP job which detects and eliminates reciprocal links and cycles. Loop detector is also straightforward in Giraph framework, since Giraph is mainly for handling graph processing. Each node will send all outlinks to their neighbors in form of nodelist and in next supersteps incoming messages will be forwarded again to neighbors by adding itself to the nodelist(path) or by voting to halt if they’re already there which means cycle detected. After given iterations which is depth of cycles, all nodes will check their list of incoming messages for inclusion of themselves, if they found match will be included to loopDb list. This is very similar to Semi-Clustering which is described in Google’s Pregel paper, where vertices send semi clusters to neighbors by adding themselves, calculating score and sending to their neighbors high ranked semi clusters. Loop detector can be used for other graph algorithms also.

Implementation details

As implemented in Apache Nutch, we can write below pseudo-code for LinkRank on top of Giraph.

...