SCIENTIA SINICA Informationis, Volume 50 , Issue 8 : 1178-1196(2020) https://doi.org/10.1360/N112018-00108

Estimating the memory consumption of big data applications based on program analysis

More info
  • ReceivedApr 28, 2018
  • AcceptedApr 29, 2019
  • PublishedJul 31, 2020


Many distributed in-memory data-processing systems such as Flink and Spark suffer from serious memory issues, including limited memory resources shared by many users or groups, which aggravates the competition for memory resources. If a user application is allocated with insufficient memory, significant garbage collection overheads will occur at runtime. In contrast, if the user application is provided with a memory space larger than it actually requires, memory resources will be wasted. Therefore, it is important to ensure that a user application is allocated with an appropriate memory size. In a general case, multiple applications process one specific set of data repeatedly. Often, the process logic of applications working on the same dataset is similar; for example, they can perform machine learning or graph computing tasks. This study further reveals that the process logic reflected in application programming interface and user-defined functions also affects the memory usage at runtime. Based on this observation, this paper presents a method for estimating the optimal memory size for newly submitted applications. The proposed approach was implemented on Spark. It utilizes the information of program analysis and historical applications produced when processing data to estimate a proper memory threshold for a newly submitted application based on the similarity of the data path between the new application and a historical one. The results of the experiments conducted to evaluate the method's accuracy of estimating the memory threshold and performance profit demonstrated that the proposed method is able to (1) estimate the required memory threshold with an error of 4% compared to the actual proper memory threshold, (2) guarantee the overall time overhead of estimating to be negligible compared to the execution time of a submitted job, and (3) reduce the execution time of submitted applications by up to 56% compared to when the proposed method is not applied.

Funded by




[1] Zaharia M, Chowdhury M, Das T, et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, 2012. 2. Google Scholar

[2] Carbone P, Katsifodimos A, Ewen S, et al. Apache Flink: Stream and Batch Processing in a Single Engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2015. 38: 28--38. Google Scholar

[3] Deutsch L P, Bobrow D G. An efficient, incremental, automatic garbage collector. Commun ACM, 1976, 19: 522-526 CrossRef Google Scholar

[4] Thusoo A, Sarma J S, Jain N, et al. Hive: a warehousing solution over a map-reduce framework. In: Proceedings of VLDB, Endow, 2009. 1626--1629. Google Scholar

[5] Armbrust M, Xin R S, Lian C, et al. Spark SQL: relational data processing in spark. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, New York, 2015. 1383--1394. Google Scholar

[6] Lu L, Shi X, Zhou Y, et al. Lifetime-based memory management for distributed data processing systems. In: Proceedings of the VLDB Endowment, New Delhi, 2016. 936--947. Google Scholar

[7] Nguyen K, Fang L, Xu G, et al. Yak: a high-performance big-data-friendly garbage collector. In: Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation (OSDI'16), Savannah, 2016. 349--365. Google Scholar

[8] Shi X H, Chen M, He L G. Mammoth: Gearing Hadoop Towards Memory-Intensive MapReduce Applications. IEEE Trans Parallel Distrib Syst, 2015, 26: 2300-2315 CrossRef Google Scholar

[9] Ananthanarayanan G, Agarwal S, Kandula S, et al. Scarlett: coping with skewed content popularity in mapreduce clusters. In: Proceedings of the 6th Conference on Computer Systems, Salzburg, 2011. 287--300. Google Scholar

[10] Gonzalez J E, Xin R S, Dave A, et al. GraphX: Graph Processing in a Distributed Dataflow Framework. In: Proceedings of Symposium on Operating Systems Design and Implementation, Broomfield, 2014. 599--613. Google Scholar

[11] Meng X R, Bradley J, Yavuz B, et al. Mllib: Machine learning in apache spark. J Machine Learn Res, 2016, 17: 1235--1241. Google Scholar

[12] Isard M, Budiu M, Yu Y, et al. Dryad: distributed data-parallel programs from sequential building blocks. In: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, Lisbon, 2007. 59--72. Google Scholar

[13] Huang S, Huang J, Liu Y, et al. Hibench: a representative and comprehensive hadoop benchmark suite. In: Proceedings of ICDE Workshops, 2010. 41--51. Google Scholar

[14] Murray D G, Isard M, Yu Y Steno: automatic optimization of declarative queries. In: Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, California, 2011. 121--131. Google Scholar

[15] Nguyen K, Wang K, Bu Y, et al. FACADE: a compiler and runtime for (almost) object-bounded big data applications. In: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, Istanbul, 2015. 675--690. Google Scholar

[16] Cheng R, Hong J, Kyrola A, et al. Kineograph: taking the pulse of a fast-changing and connected world. In: Proceedings of the 7th ACM european conference on Computer Systems, Bern, 2012. 85--98. Google Scholar

[17] Dietrich C, Rothberg V, Füracker L, et al. cHash: detection of redundant compilations via AST hashing. In: Proceedings of the 2017 USENIX Conference on Usenix Annual Technical Conference, 2017. 527--538. Google Scholar

[18] Bruno R, Ferreira P. Alma: Gc-assisted JVM live migration for Java server applications. In: Proceedings of the 17th International Middleware Conference, Trento, 2016. 19--20. Google Scholar

[19] Wang J, Balazinska M. Elastic memory management for cloud data analytics. In: Proceedings of 2017 Annual Technical Conference, Santa Clara, 2017. 745--758. Google Scholar

[20] Ackermann S, Jovanovic V, Rompf T et al. Jet: an embedded DSL for high performance big data processing. In: Proceedings of International Workshop on End-to-end Management of Big Dat, Raleigh, 2012. 206--220. Google Scholar

[21] Garbervetsky D, Pavlinovic Z, Barnett M, et al. Static analysis for optimizing big data queries. In: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, New York, 2017. 932--937. Google Scholar

[22] Zhang J, Zhou H, Chen R, et al. Optimizing data shuffling in data-parallel computation by understanding user-defined functions. In: Proceedings of NSDI, San Jose, 2012. 12: 22--22. Google Scholar

[23] Shi X H, Ke Z X, Zhou Y L. Deca: a garbage collection optimizer for in-memory data processing. ACM Trans Comput Syst, 2019, 36: 1-47 CrossRef Google Scholar

[24] Herodotou H, Dong F, Babu S. Mapreduce programming and costbased optimization? Crossing this chasm with starfish. In: Proceedings of the Vldb Endowment, 2012. 4: 1446-1449. Google Scholar

[25] Yu Z, Bei Z, Qian X. Datasize-Aware High Dimensional Configurations Auto-Tuning of In-Memory Cluster Computing. In: Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems, Williamsburg, 2018. 564--577. Google Scholar

  • Figure 1

    (Color online) The scenario in big data platform about the relationship among ResourceManager, applications and datasets

  • Figure 2

    (Color online) The performance of Spark PR with different maximum memory sizes, in Spark local mode (one executor)

  • Figure 4

    The data path in a stage of the Spark application

  • Figure 5

    (Color online) The kTree of logistic regression (LR). The root node represents the data objects in cache data

  • Figure 6

    (Color online) The kTree of the second stage produced by family groupByKeyin PR and CC

  • Figure 7

    (Color online) The kTree of the second stage produced by family reduceByKeyin WC

  • Figure 8

    (Color online) The difference between the estimated proper memory threshold and the real one of (a) PR, protectłinebreak (b) CC, (c) WC, (d) LR, (e) KMeans

  • Figure 9

    (Color online) The difference between the estimated proper memory threshold size and the real one of (a) PR, (b) WC with different datasets

  • Figure 10

    (Color online) The performance of PR, CC, Normalizer and StandScaler with different memory ratio in Spark platform

  • Table 1   The analysis of each container in three typical applications
    Application Data container Data structure Data type
    PR Cached data (K1, Array(V1)) Int, Int
    Shuffle buffer (K2, V2) Int, Double
    CC Cached data (K1, Array(V1)) Int, Int
    Shuffle buffer (K2, V3) Int, Double
    WC Shuffle buffer (K1, V4) Int, Int
  • Table 2   The time cost of program analysis
    App Stage num Reachable methods Fusion time (ms) Match time (ms) Total time (ms)
    SparkHdfsLR 1 5164 686 21369 22055
    SparkKMeans 3 6352 963 21929 29244
    SparkPR 3 1526 976 8643 9619
    SparkCC 5 1390 1275 12756 14031
    SparkWC 2 67 685 3773 4458
    Normalizer 2 2440 834 11359 12193
    LinearRegression 5 3003 1251 19677 20928
    TallSkinnySVD 3 1998 1109 11642 12761
    StandardScaler 4 3851 1245 19597 20842
    ChiSqSelector 4 4301 1190 18841 24332

Copyright 2020  CHINA SCIENCE PUBLISHING & MEDIA LTD.  中国科技出版传媒股份有限公司  版权所有

京ICP备14028887号-23       京公网安备11010102003388号