
SCIENTIA SINICA Informationis, Volume 46, Issue 9: 1175-1210 (2016) https://doi.org/10.1360/N112016-00147

Emerging High-Performance Computing Systems and Technology

  • Received: Jun 26, 2016
  • Accepted: Aug 25, 2016
  • Published: Sep 18, 2016

Abstract

High-performance computing (HPC) technology is a strategic high ground of modern computing, contested among countries in the current information era. This paper focuses on the challenges facing emerging HPC technology and discusses the current status and development trends of HPC architectures, algorithms, and applications. From the viewpoint of fundamental research, we propose significant research fields (or scientific issues), as well as related policies, for HPC systems and technology in China.


Funded by

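National Natural Science Foundation of China (61433019)

National Natural Science Foundation of China (61402503)

National Natural Science Foundation of China (61170288)

National Natural Science Foundation of China (61332003)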


