
SCIENCE CHINA Information Sciences, Volume 60, Issue 12: 122106 (2017). https://doi.org/10.1007/s11432-015-0989-3

Characterizing and optimizing Java-based HPC applications on Intel many-core architecture

  • Received: Oct 3, 2016
  • Accepted: Dec 13, 2016
  • Published: May 8, 2017

Abstract

The increasing demand for performance has stimulated the wide adoption of many-core accelerators such as the Intel® Xeon Phi™ Coprocessor, which is based on Intel's Many Integrated Core (MIC) architecture. While many HPC applications running in native mode have been tuned to run efficiently on Xeon Phi, it remains unclear how a managed runtime such as the JVM performs on this architecture. In this paper, we present the first measurement study of a set of Java HPC applications running on a JVM on Xeon Phi. One key obstacle to the study is that there is currently little Java support for Xeon Phi. This paper presents results based on the first port of the OpenJDK platform to Xeon Phi, in which the HotSpot virtual machine acts as the kernel execution engine. The main difficulty of the port is the incompatibility between the Xeon Phi ISA and the assembly library of the HotSpot VM. By evaluating the multithreaded Java Grande benchmark suite and our ported Java Phoenix benchmarks, we quantitatively study the performance and scalability of the JVM on Xeon Phi and draw several conclusions from the study. To fully utilize the coprocessor's vector computing capability and hide its significant memory access latency, we present a semi-automatic vectorization scheme and a software prefetching model in HotSpot. Together with the 60 physical cores and further tuning, our optimized JVM achieves average speedups of 2.7x and 3.5x over a Xeon CPU processor by using vectorization and prefetching, respectively. Our study also indicates that it is viable, and potentially beneficial to performance, to run applications written for a managed runtime such as the JVM on Xeon Phi.
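
To make the optimization targets concrete, the following is a purely illustrative Java sketch (our own example, not code from the paper or its benchmarks; the class and method names are hypothetical). A unit-stride, counted loop over primitive arrays is the kind of hot kernel that JIT-level vectorization (mapping the loop onto Xeon Phi's 512-bit vector units) and software prefetching (hiding memory latency on the in-order cores) aim to accelerate.

// Illustrative sketch only; not code from the paper or its benchmark suites.
public class SaxpyKernel {

    // y[i] = a * x[i] + y[i]: a unit-stride counted loop that a vectorizing
    // JIT can map onto wide SIMD registers, and into which prefetch
    // instructions can be injected to hide memory access latency.
    static void saxpy(double a, double[] x, double[] y) {
        for (int i = 0; i < x.length; i++) {
            y[i] = a * x[i] + y[i];
        }
    }

    public static void main(String[] args) {
        int n = 1 << 20;
        double[] x = new double[n];
        double[] y = new double[n];
        for (int i = 0; i < n; i++) {
            x[i] = i;
            y[i] = 1.0;
        }
        saxpy(2.0, x, y);
        System.out.println("y[42] = " + y[42]); // expected: 2 * 42 + 1 = 85.0
    }
}

A single pass over arrays of this size exercises the regular, bandwidth-bound access pattern for which the prefetching model described above is intended; raw thread count alone does not help such loops once memory latency dominates.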


