Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Research and Development Roadmap for Flink Gelly

 

1

...

The world wide web, the citation network or general purpose social networks follow a highly skewed power-law degree distribution. We argue that the current one-size fits all solutions Gelly provides for both Vertex-Centric and Gather-Sum-Apply Iterations do not scale in the context of natural graphs. Part of our research on Flink is to efficiently detect the tail of the power law distribution(i.e. the skewed nodes), and afterwards to divide those nodes into subnodes, recursively, in levels, similar to the approach adopted by aggregation trees. The ultimate goal for node splitting will be to make all vertices have a uniform degree distribution, which will result in minimal resource consumption and execution speed-up.

Currently, there is an ongoing effort for building splitting and aggregation operators that take a data set of skewed vertices as input and expands the graph to scale computation up afterwards collapsing the partial results into the initial vertex value. These operators should be applicable to any algorithm.  

Overall, the following concrete tasks should be implemented:

  • a splitting operator that takes a data set of skewed vertices and divides it into a tree of subvertices

  • an aggregation operator that will gather the result into the initial vertex

  • a set of benchmarks that will prove that this solution improves the current approach  

This work is part of Andra Lungu’s master thesis, supervised by Asterios Katsifodimos and Vasia Kalavri.

2). Graph Streaming

This will be an API layer on top of Flink Streaming, which contains a set of utility methods to create graphs from input streams and analyze them on-the-fly. The API provides methods to compute continuous aggregates on graph streams and allows users to easily implement one-pass graph streaming algorithms.

Similarly to Gelly, Gelly Streaming will also contain a library of common Graph Streaming algorithms. We already have working implementations for the following:

  • Degree Counting

  • Triangle Count Approximation

  • Bipartite Graph Detection

The implementation will initially focus on a variant of the semi-streaming model: a GraphStream is a stream of edges, which arrive in random order and do not contain duplicates. We will initially assume one-pass algorithms, where we don’t have the complete graph state stored somewhere at any moment. Essentially, this means that the graph is streamed through the system and we are computing continuous “improving” aggregates, i.e. the more edges of the graph we consume, the better the metric gets.

As a next step, we are planning to extend the API to also support stateful graph streaming, i.e. create and update the graph, so that we can always access it as distributed state in memory. This model will cover the cases of (a) random edge streams with duplicates and (b) graph mutations. This extension will allow querying the “fresh” graph state at any moment and run all sorts of analysis tasks on it.

We are currently looking into the following related issues:

...

efficient streaming partitioning algorithms, so that when we receive an edge, we know to which partition to assign it

...

)

...

.

...

sliding window transformations on the graph streams

This work is part of Daniel Bali’s master thesis, supervised by Vasia Kalavri and Paris Carbone, at KTH, Stockholm.

3). Scala API

Currently, we only provide a  Java API for Gelly. In the near future, we would like to also support Scala applications (FLINK-1962). The first step towards porting the Java API to Scala, removing type information from the Gelly classes, was already made.  

...

Library Methods

Gelly has a growing library of off-the shelf algorithms that users can take advantage of by simply calling the run() method on the graph instance.

In the near future, the following algorithms will be added:         -  Affinity    

- [Machine Learning] HITS and Absorption [as part of Alexander Alexandrov’s IMPRO3 course- SS 2015]

- Strongly Connected Components (Section 3.1)

- Approximate Diameter (DIA) - approximate the maximum length of shortest paths - data comes from the outer-edges

...

2). Graph Partitioning Techniques

Graph Partitioning plays a key role in application parallelization and in scaling data analysis up. Processes need to evenly be assigned to machines while maintaining communication costs to a minimum. We would like to implement several graph partitioning algorithms for Gelly. As a starting point, Flink’s Graph API can support hash partitioning. We chose this particular approach as it is very easy to implement and it maps to Flink. Afterwards, depending on the type of problem, or the type of vertex different algorithms will be proposed.

...

Ongoing work in this direction can be tracked using the following JIRA issue: FLINK-1536.

...

3). Partition-centric Iterations

Using Flink’s mapPartition() operator, it should be straight-forward to also add support for the partition-centric computation model. This model exposes the partition structure to the user and allows exploiting the local graph structure inside a partition to avoid unnecessary communication. A more thorough description of the model and its advantages can be found in this paper.

A POC for partition-centric iterations has already been implemented and is available in this github repository.

...

4). Generic Iterations

The implementation of Boruvka’s distributed minimum spanning tree using Gelly raised several issues as described in the Gelly iteration abstractions mailing list discussion . Among the general problems are the following: branching inside an iteration, convergence checks of internal iterations, tedious/difficult to understand approaches towards differentiating between various algorithm phases (e.g. using aggregators), etc.

These issues could be solved by simply using for-loops. Ongoing work in this direction is the usage of loop unrolling to solve algorithms such as K-core decomposition or DMST. Unfortunately, this approach leads to erroneous behavior such as GC limit exceptions in the Optimizer. Hence, for the loops to be efficiently supported, we should cache intermediate results.

...

5). Performance evaluation

We would like to compare Gelly’s capabilities to other state-of-the art graph processing systems: GraphLab, PowerGraph, GraphX, Giraph and PowerLyra.

...

. Integration with the Graphalytics benchmarks is work in progress in this github repository.

6). Bipartite Graph Support

A bipartite graph is a graph for which the set of vertices can be divided into two disjoint sets such that each edge having a source vertex in the first set, will have a target vertex in the second set. We would like to support efficient operations for this type of graphs along with a set of metrics, in the near future.   

10). Wish-list

Going one step further, Gelly could also be integrated with intuitive graph visualization tools such as Gephi. This will allow users to see the real-time progress of their graph. They could also visually verify results for clustering algorithms, for example. Gephi is free/open-source and provides support for all types of networks, including dynamic and hierarchical graphs.

Neo4j integration is also part of the Gelly roadmap mainly due to this scalable graph database ’s versatility and its ease of use. Numerous applications could be built using this integration. Some examples are: recommender systems, fraud detection applications or authorization and identity networks.

Apache TinkerPop is an API for transactional and analytical graph processing in which vertices can have ACL permission information attached to them, where one could see the name/identifier of the user who created/removed vertex information (for auditing purposes and to support non-repudiation), etc.

Most of Gelly ’s and TinkerPop ’s concepts match, which means that an integration with this system could prove to be an interesting addition.    

For proving several theoretical results, one may need to generate graphs of certain dimensions and that have various properties. LDBCouncil provides excellent support for such operations, which is why we would like to propose an integration with this benchmark reference in the future.

Finally, for research purposes, one may want to derive a smaller, representative sample (i.e. that keeps the same properties), given a large, real world graph. The challenges we expect to encounter in such a context are: picking a good sampling method (random does not always produce satisfactory results), choosing the sample size and finally proving that the resulted graph maintains the properties of the original graphCooresponding JIRA issue: FLINK-2254.