SCIENCE CHINA Information Sciences, Volume 61, Issue 11: 112102 (2018) https://doi.org/10.1007/s11432-017-9290-4

Fault tolerance on-chip: a reliable computing paradigm using self-test, self-diagnosis, and self-repair (3S) approach

  • Received Jun 3, 2017
  • Accepted Nov 1, 2017
  • Published May 24, 2018

Abstract

If your computer crashes, you can usually revive it with a reboot, an empirical fix that tends to work. The rationale is that transient faults, whether in hardware or software, can be cleared by refreshing the machine state. Such a “silver bullet”, however, could become futile in the future, because faults that reside in the hardware itself, such as in Integrated Circuit (IC) chips, cannot be eliminated by refreshing. What is needed is a more sophisticated mechanism to steer the system back onto the right track. The “magic cure” is the Fault Tolerance On-Chip (FTOC) mechanism, which relies on a suite of built-in design-for-reliability logic, including fault detection, fault diagnosis, and error recovery, working in a self-supportive manner. We have exploited FTOC to build a holistic solution, ranging from on-chip fault detection to error recovery, that addresses faults caused by progressive chip aging. Besides fault detection, the FTOC paradigm provides attractive benefits, such as facilitating graceful performance degradation, mitigating the impact of verification blind spots, and improving chip yield.


Acknowledgment

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61532017, 61572470, 61521092, 61522406, 61432017, 61376043), and in part by the Youth Innovation Promotion Association, CAS (Grant No. Y404441000).


  • Figure 1

    (Color online) Parameter fluctuation distribution for the 45 nm and 65 nm processes.

  • Figure 2

    (Color online) Timing sensor design. (a) Stability checker; (b) output compressor; (c) clock timing.

  • Figure 3

    Exemplary self-diagnosis logic for a 4-ALU processor.

  • Figure 4

    (Color online) Performance degradation vs. defect degrees of (a) instruction window, (b) L1 instruction cache, (c) L1 data cache, (d) L2 cache, where the following SPEC CPU2006 benchmarks are used: leslie3d, GemsFDTD, gobmk, perlbench, gamess, milc, lbm, xalancbmk, gcc, gromacs, and bwaves.

  • Figure 5

    Circuit-level rejuvenation with timing adaptation.

  • Figure 6

    Microarchitectural rejuvenation.

  • Figure 7

    Topology reconfiguration-based architectural-level rejuvenation for a manycore. (a) The topology demand; (b) the topology with spare cores; (c) the topology with faulty cores.

  • Figure 8

    (Color online) Performance degradation for the processors with (a) mild (L), (b) medium (M), and (c) severe (R) degradation models, without FTOC, 4-thread for multi-thread workloads.

  • Figure 9

    (Color online) Performance degradation for the processors with (a) mild (L), (b) medium (M), and (c) severe (R) degradation models, with FTOC, 4-thread for multi-thread workloads.

  • Table 1   Degradation models
    Degradation component               Decoupled capacity
    Front end   Branch predictor        1/4 : 1/2 : 3/4
                Instruction window      1/4 : 1/2 : 3/4
    Back end    Issue width             1/4 : 1/2 : 3/4
    Memory      L1 data cache           1/4 : 1/2 : 3/4
                L1 instruction cache    1/4 : 1/2 : 3/4
                L1 D-Cache              1/4 : 1/2 : 3/4
                L2 cache                1/4 : 1/2 : 3/4
  • Table 2   Core configuration
    Parameter                   Value
    Frequency                   1 GHz
    L1 I/D cache                32 KB, cache line 64 B, associativity 4
    L2 cache                    512 KB, cache line 64 B, associativity 8
    Issue width                 4
    Branch predictor entries    1024
    Instruction window          96
  • Table 3   Self-diagnosis hardware overhead
    Component       Combinational logic gates (K)   Storage (KB)
    MD5             76                              0.1
    Bloom filter    1582                            12.6
    Pearson Hash    0.62                            0.6
    Bucket          0.38                            7.3
    Others          61.2                            0.2
    Total           1720                            20.8
