SCIENTIA SINICA Informationis, Volume 48, Issue 3: 329-348(2018) https://doi.org/10.1360/N112017-00064

## Data placement approach for scalable online social networks

Jingya ZHOU1,2,3,*,
• AcceptedOct 24, 2017
• PublishedFeb 13, 2018
Share
Rating

### Abstract

Online social networks are attracting more and more users. Faced with hundreds of millions of users, how to store user data in a scalable manner has become a hot issue of focus for both social service providers and researchers. Currently, distributed key value store is widely used; it places user data across multiple storage servers based on a hash approach. However, it results in a huge amount of communication traffic inside a data center, and is not conducive to the scale of social networks. By considering user interaction characteristics, this paper proposes a data placement approach that combines both social graph partitioning and data replication. Considering the network topologies of data centers, we design the data placement for specific topologies. Furthermore, we discuss the incremental adjustment for social network growth and the distributed implementation of the proposed algorithms. Finally, experiments on real world traces indicate that the proposed algorithms can effectively reduce internal communication traffic, thereby enhancing the scalability of online social networks.

### References

[1] Shvachko K, Kuang H, Radia R, et al. The hadoop distributed file system. In: Proceedings of the 26th Symposium on Mass Storage Systems and Technologies, Nevada, 2010. 1--10. Google Scholar

[2] DeCandia G, Hastorun D, Jampani M, et al. Dynamo: amazon's highly available key-value store. In: Proceedings of the 21st ACM SIGOPS Symposium on Operating Systems Principles, Washington, 2007. 205--220. Google Scholar

[3] Lakshman A, Malik P. Cassandra: a decentralized structured storage system. Operat Syst Rev, 2010, 44: 35--40. Google Scholar

[4] Karypis G, Kumar V. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM J Sci Comput, 1998, 20: 359-392 CrossRef Google Scholar

[5] Chen H, Jin H, Wu S. Minimizing Inter-Server Communications by Exploiting Self-Similarity in Online Social Networks. IEEE Trans Parallel Distrib Syst, 2016, 27: 1116-1130 CrossRef Google Scholar

[6] Pujol J M, Erramilli V, Siganos G. The Little Engine(s) That Could: Scaling Online Social Networks. IEEE/ACM Trans Networking, 2012, 20: 1162-1175 CrossRef Google Scholar

[7] Tran D A, Nguyen K, Pham C. S-CLONE: Socially-aware data replication for social networks. Comput Networks, 2012, 56: 2001-2013 CrossRef Google Scholar

[8] Liu G, Shen H, Chandler H. Selective Data Replication for Online Social Networks with Distributed Datacenters. IEEE Trans Parallel Distrib Syst, 2016, 27: 2377-2393 CrossRef Google Scholar

[9] Jiao L, Li J, Du W, et al. Multi-objective data placement for multi-cloud socially aware services. In: Proceedings of IEEE Conference on Computer Communications, Toronto, 2014. 28--36. Google Scholar

[10] Tran D A, Zhang T. S-put: an ea-based framework for socially aware data partitioning. Comput Netw, 2014, 504--518. Google Scholar

[11] Yu B Y, Pan J P. Location-aware associated data placement for geo-distributed data-intensive applications. In: Proceedings of IEEE Conference on Computer Communications, Hong Kong, 2015. 603--611. Google Scholar

[12] Yu B Y, Pan J P. Sketch-based data placement among geo-distributed datacenters for cloud storages. In: Proceedings of the 35th Annual IEEE International Conference on Computer Communications, San Francisco, 2016. 1--9. Google Scholar

[13] Tang J, Tang X Y, Yuan J S. Optimizing inter-server communication for online social networks. In: Proceedings of the 35th International Conference on Distributed Computing Systems, Columbus, 2015. 215--224. Google Scholar

[14] Jiao L, Li J, Xu T. Optimizing Cost for Online Social Networks on Geo-Distributed Clouds. IEEE/ACM Trans Networking, 2016, 24: 99-112 CrossRef Google Scholar

[15] Gregory S. Finding overlapping communities in networks by label propagation. New J Phys, 2010, 12: 103018 CrossRef ADS arXiv Google Scholar

[16] Lancichinetti A, Fortunato S, Kertesz J. Detecting the overlapping and hierarchical community structure in complex networks. New J Phys, 2009, 11: 1--18. Google Scholar

[17] Lee C, Reid F, McDaid A, et al. Detecting highly overlapping community structure by greedy clique expansion. 2010,. arXiv Google Scholar

[18] Han Z M, Tan X S, Chen Y, et al. NCSS: an effective and efficient complex network community detection algorithm. Sci Sin Inform, 2016, 46: 431--444. Google Scholar

[19] Qiao S J, Han N, Zhang K F, et al. Algorithm for detecting overlapping communities from complex network big data. J Softw, 2017, 28: 631--647. Google Scholar

[20] Benevenuto F, Rodrigues T, Cha M, et al. Characterizing user behavior in online social networks. In: Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement, Chicago, 2009. 49--62. Google Scholar

[21] Li D, Chen G H, Ren F Y, et al. Data center network research progress and trends. Chinese J Comput, 2014, 37: 259--274. Google Scholar

[22] Al-Fares M, Loukissas A, Vahdat A. A scalable, commodity data center network architecture. In: Proceedings of ACM SIGCOMM Conference on Data Communication, Seattle, 2008. 63--74. Google Scholar

[23] Clos C. A Study of Non-Blocking Switching Networks. Bell Syst Technical J, 1953, 32: 406-424 CrossRef Google Scholar

[24] Roy A, Zeng H Y, Bagga J, et al. Inside the social network's (datacenter) network. In: Proceedings of ACM SIGCOMM Conference on Data Communication, London, 2015. 123--137. Google Scholar

[25] Wilson C, Sala A, Puttaswamy K P N, et al. Beyond social graphs: user interactions in online social networks and their implications. ACM Trans Web, 2012, 6: 1--31. Google Scholar

[26] Jiang J, Wilson C, Wang X, et al. Understanding latent interactions in online social networks. ACM Trans Web, 2013, 7: 1--39. Google Scholar

[27] Catalyurek U V, Aykanat C. PaToH (partitioning tool for hypergraphs). In: Encyclopedia of Parallel Computing. Berlin: Springer, 2011. 1479--1487. Google Scholar

• Figure 1

An example of social graph

• Figure 2

Two types of data center network topologies. (a) Fat-tree ($k=4$); (b) Fabric ($k_s=2,~k_f=2,~k_a=4,~k=4$)

• Figure 3

The examples of several data placement algorithms. (a) Hash traffic=470 (read=470, write=0, gini=0.036);protectłinebreak (b) SPAR traffic=200 (read=0, write=220, gini=0); (c) ADP traffic=200 (read=100, write=100, gini=0.028); (d) simultaneous optimization traffic=190 (read=40, write=150, gini=0.023)

• Figure 4

The execution logic of SDP

• Figure 5

Traffic reduction after user $v$ joins a community com. (a) User $v$ has never been allocated to any community; (b) user $v$ has been included in multiple communities

• Figure 6

Traffic comparison among different layers under two topologies. (a) Traffic distribution among different layers; (b) average traffic for each layer

• Figure 7

The comparison of algorithms under varied sizes of network topologies. (a) Fat-tree; (b) Fabric

• Figure 8

The comparison of algorithms under varied load balancing constraints. (a) Fat-tree; (b) Fabric

• Figure 9

Incremental adjustment for social network scale growth. (a) Fat-tree; (b) Fabric

•

Algorithm 1 SDP

Require:$G(V,~E),~S,~{\rm~gini}^*$;

Output:${\rm{\{~}}{\phi~_x}{\rm{\}~}},~{C_{{\rm~fattree}~\text{-}~i}}$ or ${C_{{\rm~fabric}~\text{-}~i}}$;

将$S$分成$k$ or $p$个分组, 记为${\rm~sc}1_i$ ; //分别对应两种拓扑结构

根据式(6) 或 (7)计算$l_{{\rm~fattree}\text{-}~i}$ or $l_{{\rm~fabric}\text{-}~i}$, $i=1,2,3$;

$({{\rm~Com}1_i},~{C_{ij}})~\leftarrow$ parOverlapComm$(G,~k$ or $p,~{\rm~gini}^*)$;

根据式(8)计算$C_{{\rm~fattree}\text{-}~1}$ or $C_{{\rm~fabric}\text{-}~1}$;

for $i~\leftarrow~1$ to $k$ or $p$

$\phi1_i~\leftarrow~$ hash$({\rm~Com}1_i)$; //分区与服务器分组一对一映射

end for

for $i~\leftarrow~1$ to $k$ or $p$

将${\rm~sc}1_i$分成$k/2$ or $k_a$个分组, 记为${{\rm~sc}2_j}$;

$({{\rm~Com}2_j},~{C_{jm}})~\leftarrow$ parOverlapComm$({\rm~Com}1_i,~k/2$ or $k_a,~{\rm~gini}^*)$;

for $j~\leftarrow~1$ to $k/2$ or $k_a$

$\phi2_j~\leftarrow~$ hash$({\rm~Com}2_j)$;

end for

根据式(8)计算$C_{{\rm~fattree}\text{-}~2}$ or $C_{{\rm~fabric}\text{-}~2}$;

for $j~\leftarrow~1$ to $k/2$ or $k_a$

将${\rm~sc}2_j$分成$k/2$ or $k-k_f$个服务器, 记为${x}$;

$({{\rm~Com}_x},~{C_{xy}})~\leftarrow$ parOverlapComm$({\rm~Com}2_j,~k/2$ or $k-k_f,~{\rm~gini}^*)$;

for $x~\leftarrow~1$ to $k/2$ or $k-k_f$

$\phi_x~\leftarrow~$ hash$({\rm~Com}_x)$;

end for

根据式(8)计算$C_{{\rm~fattree}\text{-}~3}$ or $C_{{\rm~fabric}\text{-}~3}$;

end for

end for

return ${\rm{\{~}}{\phi~_x}{\rm{\}~}}$, ${C_{{\rm~fattree}~\text{-}~i}}$ or ${C_{{\rm~fabric}~\text{-}~i}}$.

• Table 1   The variables and meanings used in this paper
 Variable Meaning $G=(V,~E)$ Social graph, where $V$ is the vertex set and $E$ is the edge set $r_{uv}$, $w_{uv}$ The read rate and write rate from $u$ to $v$ $w_u$ The write rate of $u$ $N(u)$ The set of $u$'s neighbors $\phi_x$ The set of replicas stored at server $x$ $m_u$ $u$'s master server $s_u$ The set of $u$'s slave servers $M_x$ The set of users whose master server is $x$ $C_{xy}$ The traffic from $x$ to $y$ $g(u,~y)$ Whether $u$'s replica is stored at server $y$ $S$ The server set $C_x$ The traffic transmitted from server $x$ $k$ The number of switch ports $k_f$, $k_a$, $k_s$ The switch numbers on fabric, access and spine levels, respectively $p$ The number of pods in Fabric $l_{{\rm~fattree}\text{-}~i}$, $l_{{\rm~fabric}\text{-}~i}$ The numbers of $i$th level switches between $x$ and $y$ in both topologies $C_{{\rm~fattree}\text{-}~i}$, $C_{{\rm~fabric}\text{-}~i}$ The traffic on $i$th level in both topologies ${\rm~load}_x$ The workload of server $x$ ${\rm~gini}$ Load balancing level $f_u$ $u$'s activity $\Delta~C(v,{\rm~com})$ The traffic change after $v$ joins community ${\rm~com}$ $\theta$ System redundancy $n$ The number of communities returned by Algorithm 2 $d$ The average number of neighbors owned by a user
•

Algorithm 2 parOverlapComm

Require:$G(V,~E),~n,~{\rm~gini}^*$;

Output:$\{{\rm~Com}_i\}$, $\{C_{ij}\}$;

${\rm~load}~\leftarrow~{\rm{|}}V|\theta~/~n$;

根据式(11)计算每个用户的活跃度$f_u$;

根据$f_u$值选出top-$n$用户;

for $u~\in$ top-$n$用户

$u$初始化社区${\rm~com}_i$;

do

找出${\rm~com}_i$的所有邻居$N({\rm~com}_i)$;

for $v~\in~N({\rm~com}_i)$

if $|{\rm~com}_i|~\ge~{\rm~load}(1~+~{\rm~gini}^*)$ then

break;ELSIF$\Delta~C(v,{\rm~com}_i)~>~0$

将$v$加入社区${\rm~com}_i$;

end if

end for

while ${\rm~com}_i$成员有增加

根据现有社区大小更新负载${\rm~load}$;

$i~\leftarrow~i+1$; //社区编号自加

end for

根据式(3)计算社区间通信量$C_{ij}$;

return $\{{\rm~Com}_i\}$, $\{C_{ij}\}$.

• Table 2   Topological structure parameter settings
 Fat-tree Fabric k n k $k_a$ $k_f$ $k_s$ p n 8 128 8 8 4 2 4 128
• Table 3   Topological structure parameter settings with different scales
 Fat-tree Fabric k n k $k_a$ $k_f$ $k_s$ p n 12 432 12 18 4 4 3 432 20 2000 20 25 4 4 5 2000
• Table 4   The running time of algorithms under varied sizes of network topologies
 Hash SPAR ADP SDP n Fat-tree (s) Fabric (s) Fat-tree (min) Fabric (min) Fat-tree (min) Fabric (min) Fat-tree (s) Fabric (s) 128 23 24 32 47 66 79 207 218 432 21 25 41 58 103 136 253 277 2000 20 23 128 175 297 378 725 776

Citations

• #### 0

Altmetric

Copyright 2020 Science China Press Co., Ltd. 《中国科学》杂志社有限责任公司 版权所有