
SCIENCE CHINA Information Sciences, Volume 61, Issue 4: 042102 (2018) https://doi.org/10.1007/s11432-017-9221-0

Customizing the HPL for China accelerator

  • Received: Jan 17, 2017
  • Accepted: Aug 29, 2017
  • Published: Mar 6, 2018

Abstract

HPL is a Linpack benchmark package widely used in high-performance computing tests. Customizing the HPL is crucial for a heterogeneous system equipped with CPUs and the China accelerator, owing to the complexity of the China accelerator and the specialized matrix-multiplication interface built into it. We therefore apply delicate partition and encapsulation on matrix (DPEM) to expose a friendly testing configuration. More importantly, we propose the orchestrating algorithm for matrix multiplication (OAMM) to improve the efficiency of the heterogeneous system composed of the CPU and the China accelerator. Furthermore, optimization at vectorization (OPTVEC) is applied to shield the architectural details of the vector processing elements (VPEs) in the China accelerator. The experimental results validate DPEM, OPTVEC and OAMM: OPTVEC speeds up matrix multiplication more than twofold, and OAMM improves performance by up to 10% compared with the traditional HPL on the same heterogeneous system.


Acknowledgment

This work was partly supported by National Natural Science Foundation of China (Grant Nos. 61602495, 61402039, 91430218, 9130324, 11401580), Key Research and Development Program (Grant Nos. 2017YFB0202104, 2016YFB200401), Innovation Program from the National University of Defense Technology (Grant No. ZK16-03-06), partly supported by Specialized Research Fund for State Key Laboratories of Space Weather, Chinese Academy of Sciences, and partly supported by Open Research Fund of Key Laboratory of Spectral Imaging Technology, Chinese Academy of Sciences (Grant No. LIST201602D).



  • Figure 1

    (Color online) Architecture of the China accelerator.

  • Figure 2

    (Color online) Matrix multiplication updated by the CPU and the China accelerator (FT$_m$ must be a multiple of 576 and at least $576 \times 6$; $M$, $K$, $N$ denote arbitrary positive integers).

  • Figure 3

    Structure of the RLU.

  • Figure 4

    Different strategies on the PASMVEC: (a) HPINNER; (b) HOUTER.

  • Figure 5

    (Color online) Architecture of a computing node.

  • Figure 6

    (Color online) Performance comparisons on matrix transfer.

  • Figure 7

    (Color online) Performance comparisons on the OPTVEC.

  • Figure 8

    (Color online) Performance comparisons on OAMM and SDS/DSD.

  • Algorithm 1 Pseudocode of OAMM

    1:  Init CQ as the CPU queue;
    2:  Init AQ as the China accelerator queue;
    3:  Set the current matrix multiplication $A(M,K) \times B(K,N)$;
    4:  Get the architecture parameter FT$_m$;
    5:  Divide $B(K,N)$ into $B_1(K,N_1) + B_2(K,N_2)$ according to Eqs. (1) and (2);
    6:  if $M > \mathrm{FT}_m$ then
    7:      divide $A(M,K)$ into $A_1(\mathrm{FT}_m, K) + A_2(M - \mathrm{FT}_m, K)$;
    8:  else
    9:      queue $A \times B$ into CQ;
    10: end if
    11: // queue $A_1 \times B_1$
    12: if AQ is not full then
    13:     ${\rm AQ} \Leftarrow A_1 \times B_1$; // queue into AQ
    14: elsif CQ is empty then
    15:     ${\rm CQ} \Leftarrow A_1 \times B_1$; // queue into CQ
    16: else
    17:     wait;
    18: end if
    19: // queue $A_1 \times B_2$
    20: if CQ is not full then
    21:     ${\rm CQ} \Leftarrow A_1 \times B_2$; // queue into CQ
    22: else
    23:     wait;
    24: end if
    25: Set $A_2 \times B$ as the current matrix multiplication $A \times B$;
    26: goto step 6;
    27: // matrix multiplication updating
    28: while AQ is not empty do
    29:     get the head of AQ and call the specialized matrix-multiplication interface built into the China accelerator;
    30: end while
    31: while CQ is not empty do
    32:     get the head of CQ and call the ordinary matrix-multiplication interface on the CPU;
    33: end while
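
    A minimal C sketch of Algorithm 1 may help fix the control flow. Everything below is a placeholder reading of the pseudocode, not the paper's implementation: the MulTask and TaskQueue types, the queue helpers, the stub back ends acc_dgemm and cpu_dgemm, and the elision of submatrix pointers and strides are all assumptions.

    #include <stddef.h>

    #define QCAP 64  /* queue capacity chosen for this sketch */

    /* Descriptor of one multiplication task; real submatrix pointers,
     * offsets and leading dimensions are elided. */
    typedef struct { size_t M, K, N; } MulTask;
    typedef struct { MulTask buf[QCAP]; size_t head, tail; } TaskQueue;

    static int q_full(const TaskQueue *q)  { return q->tail - q->head == QCAP; }
    static int q_empty(const TaskQueue *q) { return q->tail == q->head; }
    static void q_push(TaskQueue *q, MulTask t) { q->buf[q->tail++ % QCAP] = t; }
    static MulTask q_pop(TaskQueue *q) { return q->buf[q->head++ % QCAP]; }

    /* Stubs standing in for the accelerator's built-in matrix-multiplication
     * interface and an ordinary CPU dgemm. */
    static void acc_dgemm(MulTask t) { (void)t; }
    static void cpu_dgemm(MulTask t) { (void)t; }

    /* cur:  the current multiplication A(M,K) x B(K,N)
     * FT_m: row count the accelerator accepts (multiple of 576, >= 3456)
     * N1:   accelerator share of B's columns, per Eqs. (1) and (2) */
    void oamm(MulTask cur, size_t FT_m, size_t N1)
    {
        TaskQueue CQ = {0};  /* CPU queue */
        TaskQueue AQ = {0};  /* China accelerator queue */

        while (cur.M > FT_m) {
            MulTask a1b1 = { FT_m, cur.K, N1 };          /* A1 x B1 */
            MulTask a1b2 = { FT_m, cur.K, cur.N - N1 };  /* A1 x B2 */

            if (!q_full(&AQ))
                q_push(&AQ, a1b1);   /* prefer the accelerator queue */
            else if (q_empty(&CQ))
                q_push(&CQ, a1b1);   /* fall back to an idle CPU queue */
            /* else: wait for a slot to free up (omitted in this sketch) */

            if (!q_full(&CQ))
                q_push(&CQ, a1b2);   /* the CPU always takes A1 x B2 */

            cur.M -= FT_m;           /* A2 x B becomes the current task */
        }
        q_push(&CQ, cur);            /* remainder (M <= FT_m) goes to the CPU */

        /* Matrix multiplication updating: drain both queues. */
        while (!q_empty(&AQ)) acc_dgemm(q_pop(&AQ));
        while (!q_empty(&CQ)) cpu_dgemm(q_pop(&CQ));
    }

    The branch order mirrors steps 12-24: the accelerator queue is preferred for $A_1 \times B_1$, an idle CPU queue serves as a fallback, and $A_1 \times B_2$ always targets the CPU queue.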

  • Table 1   Details of a computing node

    System    Component           Attribute                                     Number
    Hardware  CPU                 Intel(R) Xeon(R) CPU E5-2692 v2 @ 2.20 GHz    4
    Hardware  China accelerator   FT-GPDSP 2000b @ 1.25 GHz                     4
    Hardware  Memory sub-system   8 × Samsung 8 GB DDR3 1333 MHz                4
    Software  OS                  Linux Kylin + Phytium
    Software  Compiler            Intel icc 15.0.0 + Phytium Compiler
    Software  BLAS                MKL + FTBLAS
  • Table 2   The HPL testing on DPEM

    Parameter                               Accelerated without DPEM   Accelerated with DPEM   Comments
    $N < 3456$                              $\times$                   $\times$                The HPL runs on the CPU only because $N$ is less than $576 \times 6$
    $N = 3456$                              $\surd$                    $\surd$                 The HPL runs on both the CPU and the China accelerator with coordination
    $N > 3456$ and $N \bmod 576 = 0$        $\surd$                    $\surd$                 The HPL runs on both the CPU and the China accelerator with coordination
    $N > 3456$ and $N \bmod 576 \neq 0$     $\times$                   $\surd$                 The HPL runs on both the CPU and the China accelerator with coordination, assisted by DPEM
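
    The dispatch behavior in Table 2 can be distilled into a short helper function. The following C snippet is an illustrative reading of the table rather than code from the paper: the constants, the function name dpem_split, and the printed summary are all assumptions. It rounds $N$ down to the largest multiple of 576 and hands that block to the accelerator only when it reaches $576 \times 6 = 3456$, leaving the remainder to the CPU; in the real HPL run the CPU additionally coordinates with the accelerator throughout.

    #include <stdio.h>

    #define VPE_BLOCK 576        /* granularity fixed by the accelerator */
    #define FT_MIN (576 * 6)     /* smallest size the accelerator accepts */

    /* Returns the accelerator's share of N: the largest multiple of 576
     * not exceeding N, or 0 (CPU only) if that block is below 576 * 6. */
    static long dpem_split(long N)
    {
        long ft_m = (N / VPE_BLOCK) * VPE_BLOCK;
        return ft_m >= FT_MIN ? ft_m : 0;
    }

    int main(void)
    {
        long sizes[] = { 3000, 3456, 5760, 6000 };  /* one case per table row */
        for (int i = 0; i < 4; i++) {
            long N = sizes[i], ft = dpem_split(N);
            if (ft == 0)
                printf("N=%ld: CPU only\n", N);
            else
                printf("N=%ld: accelerator takes %ld, CPU takes %ld\n",
                       N, ft, N - ft);
        }
        return 0;
    }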
