
SCIENTIA SINICA Informationis, Volume 50, Issue 12: 1834 (2020). https://doi.org/10.1360/N112019-00004

An approximately optimal disk repair algorithm for distributed storage systems

  • Received: Jan 4, 2019
  • Accepted: Aug 13, 2019
  • Published: Nov 20, 2020

Abstract

Effectively reducing disk repair time in distributed storage systems is an open problem. Generally, there are two directions: constructing better erasure codes that reduce the amount of data read and transferred during repair, and dispatching repair tasks so as to reduce the data flow through switches. In this paper, by analyzing the repair data flow, we derive an approximately optimal disk repair time. We further prove the optimal ratio of storage nodes participating in repair, present the NOPT repair algorithm, and prove the equivalence of the two decoding algorithms. Simulation results demonstrate that the repair time improves significantly and the repair traffic is notably reduced.


Funded by

National Key R&D Program of China (2018YFC0830900, 2018YFC0830903)



  • Figure 3   (Color online) Broken disk repair model

  • Table 1   Symbol definition
    Symbol Definition
    $S_i$ The storage node $i$
    $n_t$ The number of access switches
    $n_{t_s}$ The number of switches associated with storage nodes participating in one repair process
    $n_{t_c}$ The number of switches associated with computational nodes participating in one repair process
    $\hat{n}_c$ The number of computational nodes
    $n_c$ The number of computational nodes participating in one repair process
    $n_s$ The number of storage nodes participating in one repair process
    $n_{s_i}$ The number of storage nodes connected with access switch $i$
    $n_{c_i}$ The number of computational nodes connected with access switch $i$
    $W$ The amount of data waiting to be repaired
    $d_{s_i}$ The amount of data on storage node $i$ related to the broken disk
    $V_s$ Disk read rate
    $V_{l_i}$ Data transmission rate from the access switch $i$ to the core switch
    $V_{m_i}$ Data transmission rate from the core switch to the access switch $i$
    $V_h$ The maximum throughput of the core switch
    $V_{t_i}$ The maximum throughput of the access switch $i$

Algorithm 1 The optimal self-repair ratio algorithm

    Require: $(k, W, V_{s_i}, V_{c_i}, n_s, \hat{n}_c, d_{s_i})$;
    Output: $(\alpha_1, \alpha_2, \ldots, \alpha_{n_s})$;

    $f(\beta, n_c) = \max\left\{\frac{(k-1)W\beta}{\sum_{i=1}^{n_s} V_{s_i}},\ \frac{kW(1-\beta)}{\sum_{i=n_s+1}^{n_c} V_{c_i}}\right\}$;
    $(f_{\min}, \beta_{\min}, n_{c_{\min}}, \mathrm{step}) = (\infty, 0, 0, 0.01)$;
    for each $n_c' \in [n_s, \hat{n}_c]$ do
        for $\beta_v = 0$; $\beta_v \leq 1$; $\beta_v \mathrel{+}= \mathrm{step}$ do
            if $f(\beta_v, n_c') < f_{\min}$ then
                $(f_{\min}, \beta_{\min}, n_{c_{\min}}) = (f(\beta_v, n_c'), \beta_v, n_c')$;
            end if
        end for
    end for
    $(\beta, n_c) = (\beta_{\min}, n_{c_{\min}})$;
    for each $i \in [1, n_s]$ do
        if $\frac{W\beta V_{c_i}}{\sum_{j=1}^{n_c} V_{c_j}} \geqslant d_{s_i}$ then
            $\alpha_i = 1$;
        else
            $\alpha_i = \frac{W\beta V_{c_i}}{\left(\sum_{j=1}^{n_c} V_{c_j}\right) d_{s_i}}$;
        end if
    end for
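For concreteness, here is a minimal Python sketch of Algorithm 1. It assumes the repair time of a candidate pair $(\beta, n_c)$ is determined by the slower of the two parallel phases (storage-side self-repair versus compute-side transfers); all function and variable names are illustrative rather than taken from the paper.

```python
import math

def optimal_self_repair_ratio(k, W, V_s, V_c, d_s, step=0.01):
    """Grid-search the self-repair fraction beta and the number of
    participating compute nodes n_c, then split the self-repair load
    into per-node ratios alpha_i.

    V_s: read rates of the n_s storage nodes.
    V_c: compute rates of all hat{n}_c compute nodes (storage nodes first).
    d_s: per-storage-node amount of data related to the broken disk
         (assumed positive).
    """
    n_s, n_hat = len(V_s), len(V_c)

    def repair_time(beta, n_c):
        t_storage = (k - 1) * W * beta / sum(V_s)
        ext = sum(V_c[n_s:n_c])               # rates of the pure compute nodes
        if ext == 0:                          # no external nodes: beta must be 1
            t_compute = 0.0 if beta >= 1.0 else math.inf
        else:
            t_compute = k * W * (1 - beta) / ext
        return max(t_storage, t_compute)      # the slower phase dominates

    f_min, beta_min, n_c_min = math.inf, 0.0, n_s
    for n_c in range(n_s, n_hat + 1):
        for j in range(int(round(1 / step)) + 1):
            beta = j * step
            t = repair_time(beta, n_c)
            if t < f_min:
                f_min, beta_min, n_c_min = t, beta, n_c

    # Rate-proportional split of the beta*W self-repair load, capped at
    # the data each storage node actually holds for the broken disk.
    total = sum(V_c[:n_c_min])
    alpha = [min(1.0, W * beta_min * V_c[i] / (total * d_s[i]))
             for i in range(n_s)]
    return alpha, beta_min, n_c_min
```

Iterating $\beta$ on an integer grid (`j * step`) avoids the floating-point drift of repeatedly adding `step`, while keeping the same 0.01 resolution as the pseudocode.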


Algorithm 2 The repair task scheduling algorithm

    Find all the stripe IDs affected by the broken disk diskId;
    Let diskWeight be a map whose keys are disk IDs and whose values are the numbers of repair blocks to be read from those disks;
    for all stripeId in the affected stripe IDs do
        for each $i \in \{1, \ldots, n\}$ do
            if adding block $i$ to the map diskWeight would enlarge the map (i.e., block $i$ lies on a disk not yet used for repair) then
                Choose block $i$ for the stripe repair;
            else
                Skip block $i$;
            end if
        end for
        if the number $\hat{k}$ of chosen blocks satisfies $\hat{k} < k$ then
            Walk all the blocks in stripe stripeId and randomly choose $k - \hat{k}$ blocks that have not been chosen yet for the stripe repair;
        end if
        Add the chosen $k$ blocks into the map diskWeight;
    end for
    According to Algorithm 1, compute the approximately optimal repair ratios $(\alpha_1, \alpha_2, \ldots, \alpha_{n_s})$;
    Generate the repair task list from the repair ratios and wait for compute nodes to acquire tasks;
    Through the socket mechanism, run the following loop:
    repeat
        if the event is an "acquire repair task" request then
            Decode the request body and get the node index $i$;
            Run allocTask($i$) (see Algorithm 3);
        else if the event is a "finish repair task" notification then
            Decode the request body and get the node index $i$;
            Run finishTask($i$);
        else
            Handle other events;
        end if
    until the socket is closed.
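The block-selection phase of Algorithm 2 translates into Python roughly as follows; `stripes`, `locate_disk`, and the returned `plan` are assumed interfaces for illustration, not the paper's API.

```python
import random
from collections import defaultdict

def choose_repair_blocks(stripes, k, locate_disk):
    """For every affected stripe, pick k surviving blocks so that the
    repair reads touch as many distinct disks as possible.

    stripes: dict mapping stripeId -> list of surviving block ids
             (each stripe is assumed to hold at least k blocks).
    locate_disk(block): returns the diskId that stores the block.
    """
    disk_weight = defaultdict(int)            # diskId -> repair-read count
    plan = {}
    for stripe_id, blocks in stripes.items():
        chosen, seen = [], set(disk_weight)
        for b in blocks:
            if len(chosen) == k:
                break
            d = locate_disk(b)
            if d not in seen:                 # this block enlarges the map
                chosen.append(b)
                seen.add(d)
        if len(chosen) < k:                   # top up with random blocks
            rest = [b for b in blocks if b not in chosen]
            chosen += random.sample(rest, k - len(chosen))
        for b in chosen:
            disk_weight[locate_disk(b)] += 1  # record the k chosen reads
        plan[stripe_id] = chosen
    return plan, disk_weight
```

Preferring blocks on disks that carry no repair read yet spreads the $k$ reads per stripe over as many disks as possible, which keeps per-disk repair traffic roughly uniform before Algorithm 1 splits the load across nodes.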


Algorithm 3 allocTask(int $i$)

    $n_s = $ len(diskWeight);
    Let $c_s = n_c - n_s$ be the number of compute nodes that do not handle self-repair tasks;
    if $i$ is a key of the map diskWeight then
        Let $\hat{\alpha}_i$ be the ratio of self-repair tasks already allocated to node $i$;
        if $\hat{\alpha}_i < \alpha_i$ then
            return a self-repair task for node $i$;
        else if $\alpha_i \leq \hat{\alpha}_i < 1$ and the time since the last task allocation for diskWeight[$i$] exceeds a threshold then
            return a self-repair task for node $i$;
        else if $\alpha_i \leq \hat{\alpha}_i < 1$ and the time since the last task allocation for diskWeight[$i$] is within the threshold then
            return nil;
        end if
    else if node $i$ is one of the $c_s$ compute nodes then
        return a random repair task;
    end if
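Finally, a Python sketch of allocTask under the same assumptions, with per-node task queues and allocation timestamps standing in for the scheduler's bookkeeping (all names here are hypothetical):

```python
import random
import time

def alloc_task(i, alpha, alpha_hat, last_alloc, disk_weight,
               self_tasks, shared_tasks, timeout=5.0):
    """Hand one repair task to node i.

    alpha[i]:      target self-repair ratio from Algorithm 1.
    alpha_hat[i]:  ratio of self-repair tasks already allocated to i.
    last_alloc[i]: timestamp of the last allocation for node i.
    self_tasks[i]: queue of self-repair tasks local to storage node i.
    shared_tasks:  pool of ordinary repair tasks for pure compute nodes.
    """
    if i in disk_weight:                      # i is a storage (self-repair) node
        done = alpha_hat.get(i, 0.0)
        if done < alpha[i] and self_tasks[i]:
            return self_tasks[i].pop()        # still below its target ratio
        stalled = time.time() - last_alloc.get(i, 0.0) > timeout
        if alpha[i] <= done < 1 and stalled and self_tasks[i]:
            return self_tasks[i].pop()        # node idled too long: keep it busy
        return None                           # leave remaining work to others
    if shared_tasks:                          # i is one of the c_s compute nodes
        return shared_tasks.pop(random.randrange(len(shared_tasks)))
    return None
```

Returning nil (None) once a storage node has reached its ratio $\alpha_i$, unless it has been idle past the threshold, leaves the remaining work to the $c_s$ pure compute nodes, which appears to be the intent of the threshold rule above.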