SCIENTIA SINICA Informationis, Volume 50, Issue 12: 1834 (2020) https://doi.org/10.1360/N112019-00004

## An approximately optimal disk repair algorithm for distributed storage systems

Jing SUN 1,*,
• Received Jan 4, 2019
• Accepted Aug 13, 2019
• Published Nov 20, 2020

### Abstract

Effectively reducing disk repair time in distributed storage systems remains an open problem. Existing work follows two directions: constructing better erasure codes to reduce repair traffic, and scheduling repair tasks to reduce the data flow through switches. In this paper, by analyzing the repair data flow, we derive an approximately optimal disk repair time, prove the best ratio of storage nodes to use for repairing, and present the NOPT repair algorithm. We also prove the equivalence of the two decoding algorithms. Simulation results demonstrate that the repair time is improved significantly and the repair traffic is reduced notably.


• Figure 3 (Color online) Broken disk repair model

• Table 1   Symbol definitions

| Symbol | Definition |
| --- | --- |
| $S_i$ | Storage node $i$ |
| $n_t$ | Number of access switches |
| $n_{t_s}$ | Number of switches associated with storage nodes participating in one repair process |
| $n_{t_c}$ | Number of switches associated with computational nodes participating in one repair process |
| $\hat{n}_c$ | Number of computational nodes |
| $n_c$ | Number of computational nodes participating in one repair process |
| $n_s$ | Number of storage nodes participating in one repair process |
| $n_{s_i}$ | Number of storage nodes connected to access switch $i$ |
| $n_{c_i}$ | Number of computational nodes connected to access switch $i$ |
| $W$ | Amount of data waiting for repair |
| $d_{s_i}$ | Amount of data related to the broken disk on storage node $i$ |
| $V_s$ | Disk read rate |
| $V_{l_i}$ | Data transmission rate from access switch $i$ to the core switch |
| $V_{m_i}$ | Data transmission rate from the core switch to access switch $i$ |
| $V_h$ | Maximum throughput of the core switch |
| $V_{t_i}$ | Maximum throughput of access switch $i$ |

Algorithm 1 The optimal self-repair ratio algorithm

Require: $(k, W, V_{s_i}, V_{c_i}, n_s, n_c, d_{s_i})$;

Output: $(\alpha_1, \alpha_2, \ldots, \alpha_{n_s})$;

$f(\beta, n_c) = \min\{\frac{(k-1)W\beta}{\sum_{i=1}^{n_s}V_{s_i}}, \frac{kW(1-\beta)}{\sum_{i=n_s+1}^{n_c}V_{c_i}}\}$;

$(f_{\min}, \beta_{\min}, n_{c_{\min}}, \mathrm{step}) = (\infty, 0, 0, 0.01)$;

for each $n_{c_v} \in [n_s, \hat{n}_c]$

for $\beta_v = 0$; $\beta_v \leq 1$; $\beta_v \mathrel{+}= \mathrm{step}$

if $f(\beta_v, n_{c_v}) < f_{\min}$ then

$(f_{\min}, \beta_{\min}, n_{c_{\min}}) = (f(\beta_v, n_{c_v}), \beta_v, n_{c_v})$;

end if

end for

end for

$(\beta, n_c) = (\beta_{\min}, n_{c_{\min}})$;

for each $i \in [1, n_c]$

if $\frac{W\beta V_{c_i}}{\sum_{j=1}^{n_c}V_{c_j}} \geqslant d_{s_i}$ then

$\alpha_i = 1$;

else

$\alpha_i = \frac{W\beta V_{c_i}}{(\sum_{j=1}^{n_c}V_{c_j})\, d_{s_i}}$;

end if

end for
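Algorithm 1's grid search can be sketched in Python. This is an illustrative implementation under stated assumptions, not the paper's code: `repair_time` and `optimal_ratios` are hypothetical names, and the node rates $V_{s_i}, V_{c_i}$ are flattened into one list `V` (storage nodes first).

```python
# Hypothetical sketch of Algorithm 1: grid search over the split ratio
# beta and the number of participating nodes n_c, then conversion of the
# winning beta into per-node self-repair ratios alpha_i.

def repair_time(beta, n_s, n_c, k, W, V):
    # f(beta, n_c) as written in the listing: the min of the
    # storage-side and compute-side time bounds.
    storage = (k - 1) * W * beta / sum(V[:n_s])
    denom = sum(V[n_s:n_c])
    compute = float("inf") if denom == 0 else k * W * (1 - beta) / denom
    return min(storage, compute)

def optimal_ratios(k, W, V, n_s, n_hat_c, d, step=0.01):
    # Grid-search (beta, n_c) minimizing f.
    f_min, beta_min, nc_min = float("inf"), 0.0, n_s
    for n_c in range(n_s, n_hat_c + 1):
        beta = 0.0
        while beta <= 1.0:
            f = repair_time(beta, n_s, n_c, k, W, V)
            if f < f_min:
                f_min, beta_min, nc_min = f, beta, n_c
            beta += step
    # Convert beta into per-node self-repair ratios alpha_i, capped at 1
    # when a node's share exceeds its broken-disk data d[i].
    total = sum(V[:nc_min])
    alphas = []
    for i in range(nc_min):
        share = W * beta_min * V[i] / total
        alphas.append(1.0 if share >= d[i] else share / d[i])
    return beta_min, nc_min, alphas
```

The step size 0.01 comes straight from the listing; a finer grid trades running time for accuracy of the approximation.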


Algorithm 2 The repair task scheduling algorithm

Find all the stripe IDs (stripeIds) of the broken disk diskId;

Let diskWeight be a map whose key is a diskId and whose value is the count of blocks being repaired on that disk;

for all stripeId in stripeIds

for each $i \in \{1, \ldots, n\}$

if adding block $i$ to the map diskWeight would increase the size of the map then

Choose block $i$ for the stripe repair;

else

Skip block $i$;

end if

end for

if the number $\hat{k}$ of chosen blocks satisfies $\hat{k} < k$ then

Walk all the blocks in stripe stripeId and randomly choose $k - \hat{k}$ blocks that have not been chosen yet for the stripe repair;

end if

Add the chosen $k$ blocks to the map diskWeight;

end for
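The block-selection loop above can be sketched as follows. This is a minimal illustration, assuming a mapping `stripe_blocks` from each stripe ID to `(block_id, disk_id)` pairs; all names are hypothetical, and the greedy rule is the one in the listing: prefer blocks whose disks are not yet in diskWeight, so repair reads spread over as many distinct disks as possible.

```python
# Hypothetical sketch of Algorithm 2's block selection: greedily pick
# blocks on disks not yet present in disk_weight, then top up to k
# blocks per stripe with random not-yet-chosen blocks.
import random

def choose_blocks(stripe_blocks, k):
    disk_weight = {}  # disk_id -> count of repair blocks placed on it
    chosen = {}       # stripe_id -> chosen block ids
    for stripe_id, blocks in stripe_blocks.items():
        # Blocks whose disk would enlarge the map are preferred.
        picked = [b for b, disk in blocks if disk not in disk_weight]
        picked = picked[:k]
        if len(picked) < k:
            # Fewer than k "fresh-disk" blocks: top up randomly.
            rest = [b for b, _ in blocks if b not in picked]
            picked += random.sample(rest, k - len(picked))
        for b in picked:
            disk = dict(blocks)[b]
            disk_weight[disk] = disk_weight.get(disk, 0) + 1
        chosen[stripe_id] = picked
    return chosen, disk_weight
```

After all stripes are processed, `disk_weight` is exactly the map that Algorithm 1 and Algorithm 3 consult to size and allocate the self-repair work.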

According to Algorithm 1, compute the approximately optimal repair ratios $(\alpha_1, \alpha_2, \ldots, \alpha_{n_c})$;

Generate the repair task list from the repair ratios and wait for compute nodes to acquire tasks;

Through the socket mechanism, run the following process;

repeat

if the event is "acquire repair task" then

Decode the request body and get the node index $i$;

Run allocTask$(i)$ (see Algorithm 3);

else if the event is "finish repaired task" then

Decode the request body and get the node index $i$;

Run finishTask$(i)$ (see the finishTask algorithm);

end if

until the socket is closed.
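The coordinator's socket loop can be sketched as a small TCP dispatcher. The wire format (one `"acquire <i>"` or `"finish <i>"` line per event) and all function names are assumptions for illustration; the paper only specifies that events are decoded and routed to allocTask/finishTask.

```python
# Hypothetical sketch of the socket event loop in Algorithm 2.
import socket
import socketserver
import threading

class RepairHandler(socketserver.StreamRequestHandler):
    # One event per line: "acquire <i>" or "finish <i>".
    def handle(self):
        while True:
            line = self.rfile.readline()
            if not line:
                break
            event, idx = line.decode().split()
            if event == "acquire":
                reply = self.server.alloc_task(int(idx))  # Algorithm 3
            else:
                self.server.finish_task(int(idx))         # finishTask
                reply = "ok"
            self.wfile.write((reply + "\n").encode())

def serve(alloc_task, finish_task):
    # Bind an ephemeral port and dispatch events in background threads.
    srv = socketserver.ThreadingTCPServer(("127.0.0.1", 0), RepairHandler)
    srv.daemon_threads = True
    srv.alloc_task, srv.finish_task = alloc_task, finish_task
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv

def request(addr, line):
    # Minimal client: send one event line and return the reply.
    with socket.create_connection(addr) as s:
        s.sendall((line + "\n").encode())
        return s.makefile().readline().strip()
```

A compute node would call `request(coordinator_addr, "acquire 3")` to pull its next task and `"finish 3"` to report completion.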


Algorithm 3 allocTask(int $i$)

$n_s = $ len(diskWeight);

$c_s = n_c - n_s$, where $c_s$ is the number of compute nodes that do not handle self-repair tasks;

if $i$ is a key of the map diskWeight then

Let $\hat{\alpha}_i$ be the ratio of self-repair tasks that have already been allocated;

if $\hat{\alpha}_i < \alpha_i$ then

return a self-repair task for node $i$;

else if $\alpha_i < \hat{\alpha}_i < 1$ and the last allocation time of a task for diskWeight$[i]$ exceeds a particular threshold then

return a self-repair task for node $i$;

else if $\alpha_i < \hat{\alpha}_i < 1$ and the last allocation time of a task for diskWeight$[i]$ is less than the threshold then

return nil;

end if

else if node $i$ is a compute node among the $c_s$ remaining nodes then

return a random repair task;

end if
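Algorithm 3 can be sketched as follows, under stated assumptions: the state containers (`disk_weight`, `alphas`, `allocated`, `last_alloc`, `pending_tasks`) and the `timeout` threshold are illustrative names, not the paper's data structures.

```python
# Hypothetical sketch of Algorithm 3 (allocTask): storage nodes holding
# blocks of the broken disk get self-repair tasks up to their ratio
# alpha_i; pure compute nodes draw pending (non-self) repair tasks.
import time

def alloc_task(i, disk_weight, alphas, allocated, last_alloc,
               pending_tasks, timeout=5.0):
    if i in disk_weight:
        # Ratio of node i's self-repair tasks already handed out.
        alpha_hat = allocated.get(i, 0) / disk_weight[i]
        stale = time.time() - last_alloc.get(i, 0.0) > timeout
        if alpha_hat < alphas[i] or (alphas[i] <= alpha_hat < 1 and stale):
            allocated[i] = allocated.get(i, 0) + 1
            last_alloc[i] = time.time()
            return ("self", i)
        return None  # quota reached and recent tasks are still in flight
    # Pure compute node: hand out any pending repair task.
    return pending_tasks.pop() if pending_tasks else None
```

The timeout branch mirrors the listing's intent: a node slightly over its ratio may still receive self-repair work if its previous task was allocated long enough ago, which keeps stragglers from idling.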
