Giraph implementation of Nutch LinkRank Algorithm
Author
Ikhtiyor Akhmedov
Motivation
LinkRank algorithm implemented in Apache Nutch relies on MapReduce on top of Apache Hadoop, however recent studies show that graph processing on MapReduce can lead sub-optimal performance and usability issues1. Apache Giraph is good option for processing large graphs and instead of implementing PageRank type of algorithms every time for large graphs can be used Apache Giraph. As mentioned in GIRAPH-584 this can save a bit of code for Apache Nutch project also.
Project Description
LinkRank contains several MapReduce steps which should be implemented converted to BSP.
- Counter job. Counter job is simplest in implementation. This job will count number of inlinks and outlinks in whole graph(edges), which is bit tricky in terms of BSP. Since, if we put counting as a superstep in compute() process, every node (or vertex) should send number of outlinks to every other node and then all nodes in next superstep should sum all incoming messages. This will create some overhead to network. Instead we can create two separate BSP jobs. First in its superstep all nodes will send number of outlinks and sum combiner and aggregator will be used, which decreases network overhead.
- Initializer job. Will initialize nodes with default value for inlink score, depending on how Counter job is implemented this can be first or second supertstep. This superstep is straightforward and all nodes will set their respective default value.
- Inverter job. Inverter will invert all outlinks as inlink. Inverter job in BSP consists from 2 supersteps. After finishing Initializer superstep (next jobs will be written as superstep if converted to BSP) 1st superstep of Inverter is sending outlink to its respective destination as a message and remove all outlinks. 2nd superstep of Inverter will get messages and add to its outlinks list (edges).
- Analyzer job. Analyzer will be executed as long as it reaches iteration limit or it doesn’t have any outlinks and votes for halt.
- Loop detector job. Loop detector is not explicit part of LinkRank job, but users can define whether loops should be considered or not. In that case they will need separate BSP job which detects and eliminates reciprocal links and cycles. Loop detector is also straightforward in Giraph framework, since Giraph is mainly for handling graph processing. Each node will send all outlinks to their neighbors in form of nodelist and in next supersteps incoming messages will be forwarded again to neighbors by adding itself to the nodelist(path) or by voting to halt if they’re already there which means cycle detected. After given iterations which is depth of cycles, all nodes will check their list of incoming messages for inclusion of themselves, if they found match will be included to loopDb list. This is very similar to Semi-Clustering which is described in Google’s Pregel paper, where vertices send semi clusters to neighbors by adding themselves, calculating score and sending to their neighbors high ranked semi clusters. Loop detector can be used for other graph algorithms also.
Implementation details
As implemented in Apache Nutch, we can write below pseudo-code for LinkRank on top of Giraph.
Code Block | ||||
---|---|---|---|---|
| ||||
def compute(): # counting number of links if superstep is 0: sendMessageToAll(vertex.edgeCount()) numLinks = 0 # below superstep should be optimized using aggregators if superstep is 1: for msg in messages: numLinks += msg vertex.setInlinkScore(default) # inverter 2,3 elif superstep is 2: for e in vertex.edges(): sendMessageTo(e.destination, e) vertex.clearAllEdges() elif superstep is 3: # get links and add them to edges which inverts whole graph for msg in messages: vertex.addEdge(msg) # analyzer elif superstep > 3 and superstep < iterationCount: totalInlinkScore = 1.f/numLinks if vertex.edgeCount() is 0: voteForHalt() return for e in vertex.edges(): totalInlinkScore += e.destination.totalInlinkScore linkrankScore = (1 - dampingFactor) + (dampingFactor * totalInlinkScore) vertex.totalInlinkScore = linkrankScore |
Project timeline
The project represents for me a full time commitment (40+ hours/week), not intending to do anything else during the summer (besides school activities until the 10th of June). Project is divided into 2 sub tasks: LinkRank and Loop detector. Timeline created by adding 1-2 days reserve for each step for unexpected cases and hopefully project can finish earlier.
28 May - 10 June.
Prototyping and documenting implementation. Writing first draft of usage for Apache Giraph users.
11 June - 2 July.
Implementation period of each MapReduce jobs with BSP.
3 July - 10 July.
Preparing ground truth for tests and testing on real environment.
11 July - 18 July.
Bug fix and refactoring the code. Starting of profiling for further optimization.
19 July - 26 July.
Optimization of source codes depending on their implementation dependent part and checking for possible optimizations in algorithm dependent part.
27 July - 1 August.
Testing, profiling and bug fixing.
2 August - 16 August.
Working on Loop detector and adding configurations for LinkRank for considering loops.
17 August - 24 August.
Testing and bug fixing.
25 August - 2 September
Preparing for final release. Refactoring. Adding missing documentation for users.
3 September - 23 September
Finding and fixing any missing bugs, mistakes, comments and etc,.
References
1. Pregel: A System for Large-Scale Graph Processing, Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn,.link
2. Processing over a billion edges on Apache Giraph, Avery Ching. link
4. NUTCH-635
5. GIRAPH-584