logo

SCIENTIA SINICA Informationis, Volume 49, Issue 8: 988-1004(2019) https://doi.org/10.1360/N112018-00125

An efficient incremental strongly connected components algorithm for evolving directed graphs

Xiaofei LIAO1,2,3,4, Yicheng CHEN1,2,3,4, Yu ZHANG1,2,3,4,*, Hai JIN1,2,3,4, Haikun LIU1,2,3,4, Jin ZHAO1,2,3,4
More info
  • ReceivedMay 16, 2018
  • AcceptedApr 18, 2019
  • PublishedAug 7, 2019

Abstract

The strongly connected components (SCC) algorithm can contract a directed graph into a directed acyclic graph and is widely used in directed graph analysis applications, such as reachability queries. A variety of SCC algorithms for static directed graphs have been proposed but such algorithms require non-negligible runtime overheads to repeatedly perform computations on an entire graph in response to the frequent changes in the evolving directed graphs that are ubiquitous in the real world. In general, evolving directed graphs are often evolving with minor changes (less than 5%).It allows us to compute SCC in an evolving directed graph on the basis of incremental computations in order to reduce the response time. This paper proposes Inc-SCC, an efficient incremental SCC algorithm for evolving directed graphs, reducing the data access and computation overhead of the algorithm by eliminating unnecessary computations, and using the disjoint feature of SCC for parallel processing to improve the performance of the SCC algorithm. We propose a heuristic optimization method to further speed up the convergence of Inc-SCC. Experiments show that Inc-SCC can be used to enhance the timeliness for evolving directed graphs. When the number of the changed edges of the entire directed graph is 5%, the speedup of Inc-SCC over the existing algorithm is from 2.8 to 12 times. When the number of thechanged edges of an entire directed graph is 0.5%, the speedup of Inc-SCC over the existing algorithm is from 2.9 to 12 times.


Funded by

国家重点研发计划(2018YFB1003500)

国家自然科学基金(61832006,61825202,61702202)


References

[1] Leskovec J, Kleinberg J, Faloutsos C. Graphs over time: densification laws, shrinking diameters and possible explanations. In: Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, 2005. 177--187. Google Scholar

[2] Zhang Y, Liao X, Shi X. Efficient Disk-Based Directed Graph Processing: A Strongly Connected Component Approach. IEEE Trans Parallel Distrib Syst, 2018, 29: 830-842 CrossRef Google Scholar

[3] Tarjan R. Depth-First Search and Linear Graph Algorithms. SIAM J Comput, 1972, 1: 146-160 CrossRef Google Scholar

[4] Aho A V, Hopcroft J E, Ullman J. Data Structures and Algorithms. Boston: Addison-Wesley, 1983. 1--983. Google Scholar

[5] Orzan S M. On Distributed Verification and Verified Distribution. Vrije Universiteit, 2004. Google Scholar

[6] McLendon III W, Hendrickson B, Plimpton S J. Finding strongly connected components in distributed graphs. J Parallel Distributed Computing, 2005, 65: 901-910 CrossRef Google Scholar

[7] Hellings J, Fletcher G H L, Haverkort H. Efficient external-memory bisimulation on DAGs. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, Scottsdale, 2012. 553--564. Google Scholar

[8] Sarwat M, Sun Y. Answering location-aware graph reachability queries on geosocial data. In: Proceedings of the 33rd International Conference on Data Engineering, San Diego, 2017. 207--210. Google Scholar

[9] Yildirim H, Chaoji V, Zaki M J. GRAIL. Proc VLDB Endow, 2010, 3: 276-284 CrossRef Google Scholar

[10] Fan W, Li J, Ma S. Graph homomorphism revisited for graph matching. Proc VLDB Endow, 2010, 3: 1161-1172 CrossRef Google Scholar

[11] Fleischer L K, Hendrickson B, Pinar A. On identifying strongly connected components in parallel. In: Prceedings of the 14th International Parallel and Distributed Processing Symposium, Cancun, 2000. 505--511. Google Scholar

[12] Zhang Z, Yu J X, Qin L, et al. I/O efficient: computing SCCs in massive graphs. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, 2013. 181--192. Google Scholar

[13] Barnat J, Chaloupka J, van de Pol J. Improved Distributed Algorithms for SCC Decomposition. Electron Notes Theor Comput Sci, 2008, 198: 63-77 CrossRef Google Scholar

[14] Hong S, Rodia N C, Olukotun K. On fast parallel detection of strongly connected components in small-world graphs. In: Proceedings of the 2013 International Conference for High Performance Computing, Networking, Storage and Analysis, Denver, 2013. 1--11. Google Scholar

[15] Slota G M, Rajamanickam S, Madduri K. BFS and coloring-based parallel algorithms for strongly connected components and related problems. In: Proceedings of the 28th IEEE Parallel and Distributed Processing Symposium, Phoenix, 2014. 550--559. Google Scholar

[16] Bloemen V, Laarman A, van de Pol J. Multi-core on-the-fly SCC decomposition. In: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Barcelona, 2016. 1--12. Google Scholar

[17] Ding L L, Li Z D, Ji W T, et al. Reachability query of large scale dynamic graph based on improved Huffman coding. Acta Electron Sin, 2017, 45: 359--367. Google Scholar

[18] Ntoulas A, Cho J, Olston C. What's new on the web?: the evolution of the web from a search engine perspective. In: Proceedings of the 13th International Conference on World Wide Web, New York, 2004. 1--12. Google Scholar

[19] Broder A, Kumar R, Maghoul F. Graph structure in the Web. Comput Networks, 2000, 33: 309-320 CrossRef Google Scholar

[20] Dean J, Ghemawat S. MapReduce: Simplified Data Processing on Large Clusters. Commun ACM, 2008, 51: 107 CrossRef Google Scholar

[21] Wang J, Zhang L, Wang P Y. Memory system optimization for graph processing: a survey. Sci Sin-Inf, 2019, 49: 295-313 CrossRef Google Scholar

[22] Zhang Y, Liao X, Jin H. FBSGraph: Accelerating Asynchronous Graph Processing via Forward and Backward Sweeping. IEEE Trans Knowl Data Eng, 2018, 30: 895-907 CrossRef Google Scholar

[23] Wang L, Yang X J, Dai H D. Scratchpad memory allocation for arrays in permutation graphs. Sci China Inf Sci, 2013, 56: 1-13 CrossRef Google Scholar

[24] Jin H, Yao P C, Liao X F. Towards dataflow based graph processing. Sci China Inf Sci, 2017, 60: 126102 CrossRef Google Scholar

[25] Dai G, Huang T, Chi Y, et al. Foregraph: exploring large-scale graph processing on multi-FPGA architecture. In: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, 2017. 217--226. Google Scholar

[26] Ju X, Williams D, Jamjoom H, et al. Version traveler: fast and memory-efficient version switching in graph processing systems. In: Proceedings of the 2016 USENIX Annual Technical Conference, Denver, 2016. 523--536. Google Scholar

[27] Zhang L X, Wang W P, Gao J L, et al. Pattern graph change oriented incremental graph pattern matching. J Softw, 2015, 26: 2964--2980. Google Scholar

[28] Semertzidis K, Pitoura E, Lillis K. TimeReach: historical reachability queries on evolving graphs. In: Proceedings of the 18th International Conference on Extending Database Technology, Brussels, 2015. 121--132. Google Scholar

[29] Jure L, Andrej K. SNAP Datasets: Stanford Large Network Dataset Collection. 2014. http://snap.stanford.edu/data. Google Scholar

  • Figure 1

    The first type of deleted edges: no change in SCC structure

  • Figure 2

    The second type of deleted edges: recomputation in SCC with structure changes

  • Figure 3

    The second type of deleted edges: no recomputation in SCC with structure changes

  • Figure 4

    The CSR format of an evolving graph

  • Figure 5

    Synchronous iteration

  • Figure 6

    Lock-free solution for data conflicts

  • Figure 7

    Performance curves of Inc-SCC when the ratio of changed edges of the entire graph is different on four different datasets. (a) Email-EuAll; (b) Wiki-Talk; (c) Web-NotreDame; (d) Web-Stanford

  • Figure 8

    Performance curves of Inc-SCC when the ratio of nodes in the largest SCC of the entire graph is different

  • Figure 9

    Performance curves of Inc-SCC with graphs continuing changing on two different datasets. (a) Web-NotreDame; (b) Wiki-Talk

  • Figure 10

    The scalability of Inc-SCC on six different datasets. (a) Email-EuAll;(b) Soc-Epinions1; (c) Web-Berstan; (d) Web-NotreDame; (e) Web-Stanford; (f) Wiki-Talk

  •   

    Algorithm 1 Incremental SCC algorithm

    Require:$G_{t_{i}}$, SCCMAP($G_{t_{i-1}}$), increments, $G_{t_{i-1}}$; Output: SCCMAP($G_{t_{i}}$);

    SCCMAP($G_{t_{i}})\Leftarrow$ $\emptyset$;

    $G_{\rm~new}~\Leftarrow$ Preprocess(SCCMAP($G_{t_{i-1}}$), increments, $G_{t_{i-1}}$);

    SCCMAP($G_{t_{i}}$) $\Leftarrow~$ SCCMAP($G_{t_{i}}$) $\cup$ ($G_{\rm~new}$);

    SCCMAP($G_{t_{i}}$) $\Leftarrow~$ SCCMAP($G_{t_{i}}$) $\cup$ LocalColoring($G_{\rm~new}$);

  • Table 1   Graph datasets
    Dataset Nodes Edges SCCs Nodes in the largest SCC
    Soc-epinions1 75888 508837 42185 32223 (42.5%)
    Email-euall 265214 420025 231000 34203 (12.9%)
    Web-stanford 281903 2312497 29919 150527 (53.4%)
    Web-notreDame 325729 1497134 203609 53968 (16.6%)
    Web-Berstan 685230 7600595 109407 334856 (48.9%)
    Wiki-Talk 2394385 5021410 2281879 111881 (4%)
  •   

    Algorithm 2 Preprocess

    Require:SCCMAP($G_{t_{i-1}}$); increments; $G_{t_{i-1}}$;

    Output:$G_{\rm~new}$;

    DAG $\Leftarrow$ Contract(SCCMAP($G_{t_{i-1}}$), $G_{t_{i-1}}$);

    for each edge $e~\in$ increments which is added in parallel

    if ${\rm~SCCMAP}_{e.{\rm~src}}~\neq~{\rm~SCCMAP}_{e.{\rm~dst}}$ then

    DAG $\Leftarrow$ DAG $\cup~e$;

    end if

    end for

    SCCMAP($G_{t_{i-1}}$) $\Leftarrow$ LocalColoring(DAG);

    $G_{\rm~new}~\Leftarrow~\emptyset$;

    for each edge $e~\in$ increments which is deleted in parallel

    if ${\rm~SCCMAP}_{e.{\rm~src}}={\rm~SCCMAP}_{e.{\rm~dst}}$ then

    OldSCC $\Leftarrow~{\rm~SCCMAP}_{e.{\rm~src}}$;

    $G_{\rm~new}~\Leftarrow~G_{\rm~new}~\cup~{\rm~OldSCC}$;

    end if

    end for

  • Table 2   Time cost of each part in Inc-SCC when the ratio of changed edges of the entire graph snapshot is 0.5%
    Dataset Preprocess (ms) Process (ms) LocalFBS (ms) LocalColoring (ms)
    Soc-epinions1 1.135 5.708 5.448 0.260
    Email-euall 2.628 4.502 4.261 0.241
    Web-stanford 2.628 4.502 4.261 0.241
    Web-notredame 3.117 11.438 10.380 1.058
    Web-Berstan 6.946 55.542 48.880 6.662
    Wiki-Talk 19.199 68.166 64.032 3.134
  • Table 3   Speedup of Inc-SCC over baseline
    Dataset SCO (ms) Inc-SCC (ms) Speedup
    Soc-epinions1 46.746 5.708 8.190
    Email-euall 28.100 4.502 6.242
    Web-stanford 120.951 35.729 3.385
    Web-notredame 51.147 11.438 4.472
    Web-Berstan 160.622 55.542 2.897
    Wiki-Talk 825.124 68.166 12.104
  •   

    Algorithm 3 LocalFBS

    Require:$G_{\rm~new}$;

    Output:SCCMAP($G_{t_{i}}$); $G_{\rm~new}$;

    SCCMAP($G_{t_{i}}$) $\Leftarrow$ SCCMAP($G_{t_{i-1}}$)$\setminus~G_{\rm~new}$;

    for all OldSCC in $G_{\rm~new}$ in parallel

    Select root $\in$ OldSCC;

    $D~\Leftarrow~{\rm~FS(OldSCC,~root)}~$;

    $P~\Leftarrow~{\rm~BS(OldSCC,~root)}~$;

    NewSCC(root) $\Leftarrow$ $D~\cap~P$;

    SCCMAP($G_{t_{i}}$) $\Leftarrow~$ SCCMAP($G_{t_{i}}$) $\cup$ NewSCC(root);

    $G_{\rm~new}~\Leftarrow~G_{\rm~new}~\setminus$NewSCC(root);

    end for

  •   

    Algorithm 4 LocalColoring

    Require:$G_{\rm~new}$;

    Output:${\rm~SCCMAP}(G_{t_{i}})$;

    while $G_{\rm~new}~\neq~\emptyset$ do

    for all OldSCC $\in~G_{\rm~new}$ in parallel

    Init colors;

    while colors have changed do

    for edge e $\in$ OldSCC in parallel

    if colorse.dst $>$ colorse.src then

    colorse.dst $\Leftarrow$ colorse.src;

    end if

    end for

    end while

    for all vertex root = colorsroot in parallel

    NewSCC(root)$\Leftarrow$ BS(OldSCC, root);

    SCCMAP($G_{t_{i}}$) $\Leftarrow~$ SCCMAP($G_{t_{i}}$) $\cup$ NewSCC(root);

    $G_{\rm~new}~\Leftarrow~G_{\rm~new}~\setminus$NewSCC(root);

    end for

    end for

    end while

Copyright 2020 Science China Press Co., Ltd. 《中国科学》杂志社有限责任公司 版权所有

京ICP备18024590号-1       京公网安备11010102003388号