
SCIENTIA SINICA Informationis, Volume 50, Issue 9: 1303 (2020) https://doi.org/10.1360/SSI-2020-0099

Key issues in exascale computing

  • Received: Apr 21, 2020
  • Accepted: Jul 31, 2020
  • Published: Sep 23, 2020

Abstract

Over the past several decades, high-performance computing (HPC) in China has grown tremendously under the continuous support of national research programs. Developing exascale computers is the current goal set by the National Key R&D Project on HPC. Starting with a brief historical review of China's HPC development, this article analyzes the major challenges encountered in developing exascale computers. It then discusses key issues in realizing exascale computing: architecture, processors, interconnects, parallel system software, parallel programming, algorithms, and resilience.


Funded by

National Key Research and Development Program of China (2016YFB0200100)

National Natural Science Foundation of China (61732002)



Copyright 2020 China Science Publishing & Media Ltd. All rights reserved.
