
SCIENCE CHINA Information Sciences, Volume 60, Issue 12: 122106(2017) https://doi.org/10.1007/s11432-015-0989-3

Characterizing and optimizing Java-based HPC applications on Intel many-core architecture

  • Received: Oct 3, 2016
  • Accepted: Dec 13, 2016
  • Published: May 8, 2017

Abstract

The increasing demand for performance has stimulated the wide adoption of many-core accelerators like the Intel® Xeon Phi™ Coprocessor, which is based on Intel's Many Integrated Core (MIC) architecture. While many HPC applications running in native mode have been tuned to run efficiently on Xeon Phi, it is still unclear how a managed runtime like the JVM performs on such an architecture. In this paper, we present the first measurement study of a set of Java HPC applications on Xeon Phi under the JVM. One key obstacle to the study is that there is currently little Java support for Xeon Phi. This paper presents results based on the first porting of the OpenJDK platform to Xeon Phi, in which the HotSpot virtual machine acts as the kernel execution engine. The main difficulty is the incompatibility between the Xeon Phi ISA and the assembly library of the HotSpot VM. By evaluating the multithreaded Java Grande benchmark suite and our ported Java Phoenix benchmarks, we quantitatively study the performance and scalability issues of the JVM on Xeon Phi and draw several conclusions from the study. To fully utilize the vector computing capability and hide the significant memory access latency of the coprocessor, we present a semi-automatic vectorization scheme and a software prefetching model in HotSpot. Combining the 60 physical cores with this tuning, our optimized JVM achieves average speedups of 2.7x (with vectorization) and 3.5x (with prefetching) over the Xeon CPU processor. Our study also indicates that it is viable and potentially performance-beneficial to run applications written for a managed runtime such as the JVM on Xeon Phi.
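As a concrete picture of the kind of scalability experiment described above, the listing below is a minimal, self-contained sketch (our own illustration, not the paper's benchmark harness): it times a trivial array kernel while varying the number of worker threads, in the spirit of the multithreaded runs on the 60-core coprocessor. The kernel, array size, and thread counts are arbitrary assumptions.

    // Sketch only: time a simple kernel under an increasing number of threads.
    import java.util.*;
    import java.util.concurrent.*;

    public class ScalingSketch {
        static final int N = 1 << 24;            // number of array elements (assumed size)
        static final double[] a = new double[N];

        // Each worker sums a disjoint slice of the array (a trivial stand-in kernel).
        static double sumSlice(int lo, int hi) {
            double s = 0.0;
            for (int i = lo; i < hi; i++) s += a[i] * 3.0;
            return s;
        }

        public static void main(String[] args) throws Exception {
            Arrays.fill(a, 1.0);
            for (int threads : new int[]{1, 2, 4, 8, 15, 30, 60, 120, 240}) {
                ExecutorService pool = Executors.newFixedThreadPool(threads);
                long start = System.nanoTime();
                List<Future<Double>> parts = new ArrayList<>();
                int chunk = N / threads;
                for (int t = 0; t < threads; t++) {
                    final int lo = t * chunk;
                    final int hi = (t == threads - 1) ? N : lo + chunk;
                    parts.add(pool.submit(() -> sumSlice(lo, hi)));
                }
                double total = 0.0;
                for (Future<Double> f : parts) total += f.get();
                long elapsedNs = System.nanoTime() - start;
                pool.shutdown();
                System.out.printf("threads=%3d  time=%8.3f ms  (checksum %.1f)%n",
                                  threads, elapsedNs / 1e6, total);
            }
        }
    }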



  • Figure 1

    (Color online) Throughput of Java Grande benchmark suite with single thread.

  • Figure 2

    (Color online) Throughput speedup of multithreaded benchmarks on CPU and Xeon Phi with varying workload sizes. (a) Scalability on CPU with default size; (b) scalability on CPU with big size; (c) scalability on Xeon Phi with default size; (d) scalability on Xeon Phi with big size.

  • Figure 3

    (Color online) Throughput of Java Grande benchmark suite with big workload size. (a) Crypt; (b) Series; (c) SOR; (d) SparseMatmult; (e) LUFact; (f) KMeans; (g) LinearRegression; (h) MatrixMultiply; (i) PCA.

  • Figure 4

    (Color online) Speedup of throughputs of five benchmarks with varying numbers of threads.

  • Figure 5

    (Color online) Performance gains by vectorization compared to different versions on both architectures.

  • Figure 6

    (Color online) Throughput comparison between prefetching and vectorization.

  • Figure 7

    (Color online) Performance gains by prefetching compared to different versions on both architectures.

  •   
    // The loop after transformation by our modified javac
    double sum;
    double[] sum_a = new double[8]; // eight partial sums, initialized to 0.0
    for (int i = 0; i < n; i++)
        sum_a[i % 8] += a[i] * b;   // rotate the accumulation over 8 independent slots
    sum = sum_a[0] + sum_a[1] + sum_a[2] + sum_a[3]
        + sum_a[4] + sum_a[5] + sum_a[6] + sum_a[7];
  •   
    # JIT-compiled vectorized inner loop of a loop in the form a[i] = b[i] * c
    # (256-bit AVX registers, 4 doubles per iteration)
    ...
    0xc90600a1: vmovdqu 0x10(%r14,%r13,8), %ymm2   # load 4 doubles from the source array
    0xc90600a8: vmulpd %ymm1, %ymm2, %ymm2         # multiply by the scalar broadcast into %ymm1
    0xc90600ac: vmovdqu %ymm2, 0x10(%r8,%r13,8)    # store 4 doubles to the destination array
    0xc90600c8: add 0x4, %r13d                     # advance the loop index by 4 elements
    0xc90600cc: cmp %edi, %r13d                    # compare the index with the loop bound
    0xc90600cf: jl 0xc90600a1                      # repeat while i < n
    ...
  •   
    // Loop format that can be vectorized by JIT
    for (int i = 0; i < n; i++)
        a[i] = b[i] * c;
  •   
    // Constant times a vector plus a vector (daxpy)
    void daxpy(int n, double da, double[] dx, int dx_off, double[] dy, int dy_off) {
        for (int i = 0; i < n; i++)
            dy[i + dy_off] += da * dx[i + dx_off];
    }
  •   
    // Reduction idiom that cannot be vectorized by JIT
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b;
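  •   
    The following listing is our own illustrative sketch, not code from the paper: it runs the reduction idiom above and a dependence-breaking rewrite with eight partial sums side by side and checks that both produce the same result. The array size and the scalar b are arbitrary assumptions.

    // Sketch only: compare the scalar reduction idiom with an 8-accumulator rewrite.
    public class ReductionSketch {
        // Original idiom: a single accumulator carries a loop-carried dependence,
        // which prevents the JIT from vectorizing the loop.
        static double scalarReduce(double[] a, double b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++)
                sum += a[i] * b;
            return sum;
        }

        // Rewritten form: 8 independent accumulators break the dependence chain,
        // so consecutive iterations can be packed into vector lanes.
        static double partialSumReduce(double[] a, double b) {
            double[] sum_a = new double[8];          // initialized to 0.0
            for (int i = 0; i < a.length; i++)
                sum_a[i % 8] += a[i] * b;
            double sum = 0.0;
            for (int j = 0; j < 8; j++)
                sum += sum_a[j];
            return sum;
        }

        public static void main(String[] args) {
            double[] a = new double[1 << 20];
            java.util.Arrays.fill(a, 1.5);
            double b = 2.0;
            System.out.println(scalarReduce(a, b));      // 3145728.0
            System.out.println(partialSumReduce(a, b));  // same value (up to FP rounding)
        }
    }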
  • Table 1   Hardware configuration
    Parameter             | Intel® Xeon Phi™ Coprocessor 5110P | Intel® Xeon® CPU E5-2620
    Chips                 | 1                                  | 1
    Core type             | In-order                           | Out-of-order
    Physical cores        | 60                                 | 6
    Threads per core      | 4                                  | 2
    Frequency             | 1052.630 MHz                       | 2.00 GHz
    Data caches           | 32 KB L1, 512 KB L2 per core       | 32 KB L1d + 32 KB L1i, 256 KB L2 per core; 15 MB L3, shared
    Memory capacity       | 7697 MB                            | 32 GB
    Memory technology     | GDDR5                              | DDR3
    Peak memory bandwidth | 320 GB/s                           | 42.6 GB/s
    Vector length         | 512 bits                           | 256 bits (Intel® AVX)
    Memory access latency | 340 cycles                         | 140 cycles
  • Table 2   L1 cache hit rates of 7 memory-intensive benchmarks
    Hardware event        | SOR  | SparseMatmult | LUFact | KMeans | LinearRegression | MatrixMultiply | PCA
    L1 cache hit rate (%) | 98.2 | 74.4          | 91.5   | 87.8   | 90.9             | 92.1           | 88.5
  • Table 3   Comparison of hardware events among different versions
    Benchmark                | Hardware event       | Origin      | Vectorization | Prefetching
    LUFact                   | Instructions retired | 90440000000 | 17710000000   | 21630000000
                             | L1 hit ratio (%)     | 91.5        | 80.7          | 96.5
    KMeans                   | Instructions retired | 9660000000  | 6160000000    | 7070000000
                             | L1 hit ratio (%)     | 87.8        | 85.7          | 88.7
    LinearRegression (trans) | Instructions retired | 10290000000 | 6230000000    | 6720000000
                             | L1 hit ratio (%)     | 96.5        | 88.6          | 92.0
    MatrixMultiply           | Instructions retired | 4690000000  | 3080000000    | 3290000000
                             | L1 hit ratio (%)     | 92.1        | 84.6          | 89.2
    PCA                      | Instructions retired | 18760000000 | 7280000000    | 8470000000
                             | L1 hit ratio (%)     | 88.5        | 79.3          | 91.9
  • Table 4   Normalized throughputs of 5 Java vectorized benchmarks with different prefetching strategies
    Columns are grouped by prefetching stride in bytes (MEM$\rightarrow$L2, L2$\rightarrow$L1): the first four percentage columns use stride (4096, 1024), the last four use (2048, 768).
    Benchmark                | Loop length | Threads | None (%) | Exclusive (%) | Clevict (%) | Both (%) | None (%) | Exclusive (%) | Clevict (%) | Both (%)
    KMeans                   | Small       | 1T      | 100.00   | 100.13        | 77.81       | 77.30    | 106.16   | 105.75        | 79.78       | 79.74
    KMeans                   | Small       | 60T     | 100.00   | 100.27        | 90.00       | 88.86    | 106.41   | 107.75        | 92.88       | 90.87
    KMeans                   | Big         | 1T      | 100.00   | 99.55         | 79.74       | 80.21    | 102.63   | 102.28        | 80.14       | 79.74
    KMeans                   | Big         | 60T     | 100.00   | 101.56        | 90.14       | 91.78    | 105.55   | 99.68         | 94.78       | 89.66
    LinearRegression (trans) | Small       | 1T      | 100.00   | 100.00        | 5.61        | 5.40     | 100.35   | 100.40        | 5.33        | 5.60
    LinearRegression (trans) | Small       | 60T     | 100.00   | 97.76         | 4.09        | 4.12     | 99.16    | 96.96         | 4.03        | 4.23
    LinearRegression (trans) | Big         | 1T      | 100.00   | 99.91         | 5.45        | 5.73     | 100.39   | 100.13        | 5.63        | 5.09
    LinearRegression (trans) | Big         | 60T     | 100.00   | 101.52        | 4.32        | 4.18     | 102.74   | 101.83        | 4.17        | 4.04
    MatrixMultiply           | Small       | 1T      | 100.00   | 100.48        | 31.80       | 32.75    | 97.64    | 98.57         | 31.23       | 32.73
    MatrixMultiply           | Small       | 60T     | 100.00   | 104.02        | 42.49       | 42.39    | 101.78   | 101.57        | 42.55       | 43.17
    MatrixMultiply           | Big         | 1T      | 100.00   | 102.01        | 30.37       | 34.01    | 97.83    | 98.03         | 30.41       | 32.82
    MatrixMultiply           | Big         | 60T     | 100.00   | 95.83         | 43.15       | 42.03    | 86.53    | 89.78         | 40.98       | 41.63
    PCA                      | Small       | 1T      | 100.00   | 97.66         | 43.03       | 37.76    | 93.63    | 93.34         | 38.01       | 40.61
    PCA                      | Small       | 60T     | 100.00   | 102.28        | 38.85       | 37.12    | 105.87   | 105.69        | 38.44       | 39.34
    PCA                      | Big         | 1T      | 100.00   | 98.81         | 35.81       | 38.15    | 97.50    | 97.58         | 36.96       | 41.67
    PCA                      | Big         | 60T     | 100.00   | 102.92        | 35.86       | 34.98    | 90.94    | 95.19         | 34.03       | 33.39
    LUFact                   | Small       | 1T      | 100.00   | 99.87         | 97.64       | 97.63    | 95.31    | 94.86         | 94.94       | 94.93
    LUFact                   | Small       | 60T     | 100.00   | 97.43         | 98.81       | 101.13   | 104.69   | 104.45        | 100.98      | 104.15
    LUFact                   | Big         | 1T      | 100.00   | 100.30        | 96.64       | 96.43    | 100.13   | 100.49        | 96.65       | 96.69
    LUFact                   | Big         | 60T     | 100.00   | 99.26         | 99.39       | 99.09    | 93.00    | 92.90         | 100.92      | 100.19
  • Table 5   Correspondence between innermost loop length and prefetching distance
    Loop length       | Prefetching stride (in cache lines)
    $[0,~1000)$       | No prefetch
    $[1000,~2000)$    | MEM$\rightarrow$L2: 16 lines; L2$\rightarrow$L1: 8 lines
    $[2000,~5000)$    | MEM$\rightarrow$L2: 32 lines; L2$\rightarrow$L1: 12 lines
    $[5000,~+\infty)$ | MEM$\rightarrow$L2: 64 lines; L2$\rightarrow$L1: 16 lines
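  •   
    The listing below is our own sketch of how the Table 5 heuristic could be encoded; in the paper this selection is made inside the modified HotSpot JIT, not in Java source, and the class and method names here are illustrative assumptions only.

    // Sketch only: map the innermost-loop trip count to software-prefetch
    // distances (in 64-byte cache lines), following Table 5.
    public class PrefetchHeuristic {
        /** Prefetch distances in cache lines: memory->L2 and L2->L1. */
        public record Distances(int memToL2Lines, int l2ToL1Lines) {
            boolean enabled() { return memToL2Lines > 0; }
        }

        /** Choose prefetch distances from the innermost loop length (Table 5). */
        public static Distances forLoopLength(long loopLength) {
            if (loopLength < 1000) return new Distances(0, 0);   // no prefetch
            if (loopLength < 2000) return new Distances(16, 8);
            if (loopLength < 5000) return new Distances(32, 12);
            return new Distances(64, 16);
        }

        public static void main(String[] args) {
            long[] samples = {500, 1500, 3000, 100000};
            for (long n : samples) {
                Distances d = forLoopLength(n);
                System.out.printf("loop length %-7d -> MEM->L2: %2d lines, L2->L1: %2d lines (%s)%n",
                                  n, d.memToL2Lines(), d.l2ToL1Lines(),
                                  d.enabled() ? "prefetch" : "no prefetch");
            }
        }
    }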
