SCIENCE CHINA Information Sciences, Volume 62, Issue 8: 082101(2019) https://doi.org/10.1007/s11432-018-9834-8

Real-time intelligent big data processing: technology, platform, and applications

  • Received: Dec 27, 2018
  • Accepted: Feb 13, 2019
  • Published: Jul 12, 2019

Abstract

Humans have long explored the physical world through information technology. Only recently, with the rapid development of information technologies and the growing accumulation of data, has it become possible to learn about the unknown world with data-driven methods. Because the value of data depends on its timeliness, awareness of the importance of real-time data is growing. Data processing technologies fall into two categories, batch big data processing and stream processing, which have not been well integrated. We therefore propose an incremental processing technology named Stream Cube that processes both big data and stream data. We also implement a real-time intelligent data processing system based on real-time acquisition, real-time processing, real-time analysis, and real-time decision-making. The system is equipped with a batch big data platform, data analysis tools, and machine learning models. Our applications and analysis indicate that the real-time intelligent data processing system is a crucial solution to problems facing society and the national economy.


  • Figure 3

    (Color online) Computation dimensions of the Stream Cube.

  • Figure 4

    (Color online) Challenge 2: different queries share an overlapping time interval.

  • Figure 5

    (Color online) Challenge 3: decompose the calculation into the incremental computation.

  • Figure 6

    (Color online) The architecture of the real-time intelligent data processing system.

  • Figure 7

    (Color online) Write and read performance of the Stream Cube with up to 8 servers.

  • Figure 8

    (Color online) Comparison of computational cost between Flink and Stream Cube under two different window sizes. Flink (list) ran out of memory at a window size of 90, so the corresponding bar in the right figure is marked with a cross.

  • Table 1   Transaction format used in the simulation experiment
    Field      Type        Comment
    transTime  Long        Transaction time
    acctId     String(32)  Transaction initiator
    merId      String(32)  Transaction receiver
    transAmt   Long        Transaction amount
    city       String(32)  Transaction city
    hizCode    String(32)  Business code: MOB = telephone recharging, 3C = 3C product,
                           EXP = expensive goods, DIN = dining, HOT = hotel, OTH = others
    chnl       String(3)   Transaction channel: AND = Android channel, IOS = Apple app channel,
                           WEB = PC web browser channel, WAP = mobile phone browser channel
    stat       Integer     Transaction state: 0 = success, 1 = insufficient funds
  • Table 2   The stream data volume in various applications
    Field       Scenario                 Stream data volume (per second)
    Finance     China UMS                50000 transactions
    E-Commerce  Taobao 11.11 Day         220000 transactions
    IoT         Shanghai Railway Line 1  200000 records
    Ticket      12306 Railway Ticket     1.7 million page views
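
The captions of Figures 4 and 5 name two challenges: queries whose time intervals overlap, and decomposing a calculation into incremental computation. The following is a minimal sketch of one common way to handle both. It is not the Stream Cube implementation. Records in the Table 1 format are folded into per-account, per-time-bucket partial aggregates (count, sum, sum of squares), and each query merges the buckets that cover its interval, so overlapping queries reuse the same partials. The class names, the 60-second bucket width, the reading of transTime as epoch seconds, and the sample values are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """One record in the Table 1 format (field names copied from the table)."""
    transTime: int  # transaction time; epoch seconds is an assumption
    acctId: str     # transaction initiator
    merId: str      # transaction receiver
    transAmt: int   # transaction amount
    city: str       # transaction city
    hizCode: str    # business code, e.g. "DIN"
    chnl: str       # transaction channel, e.g. "WEB"
    stat: int       # 0 = success, 1 = insufficient funds


class BucketedStats:
    """Hypothetical helper: mergeable partial aggregates per time bucket."""

    def __init__(self, bucket_seconds: int = 60):
        self.bucket_seconds = bucket_seconds
        # (acctId, bucket index) -> [count, sum, sum of squares]
        self._partials = defaultdict(lambda: [0, 0, 0])

    def add(self, tx: Transaction) -> None:
        """Incremental step: update one bucket per incoming record."""
        key = (tx.acctId, tx.transTime // self.bucket_seconds)
        partial = self._partials[key]
        partial[0] += 1
        partial[1] += tx.transAmt
        partial[2] += tx.transAmt * tx.transAmt

    def query(self, acct_id: str, start: int, end: int):
        """Return (count, mean, variance) for the buckets covering [start, end).

        Results are exact when start and end are multiples of bucket_seconds;
        otherwise whole buckets at the edges are included.
        """
        count = total = squares = 0
        first = start // self.bucket_seconds
        last = (end - 1) // self.bucket_seconds
        for bucket in range(first, last + 1):
            c, s, q = self._partials.get((acct_id, bucket), (0, 0, 0))
            count, total, squares = count + c, total + s, squares + q
        if count == 0:
            return 0, 0.0, 0.0
        mean = total / count
        return count, mean, squares / count - mean * mean


stats = BucketedStats(bucket_seconds=60)
for t, amt in [(0, 100), (30, 300), (70, 200), (130, 400)]:
    stats.add(Transaction(t, "A1", "M1", amt, "Shanghai", "DIN", "WEB", 0))

# Two overlapping intervals; both reuse the partial for bucket 1.
first_window = stats.query("A1", 0, 120)
second_window = stats.query("A1", 60, 180)
```

Mean and variance are not directly mergeable, but count, sum, and sum of squares are, which is why the sketch stores those three values and derives the statistics only at query time. The query cost then depends on the number of buckets in the interval, not on the number of raw records.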

Copyright 2020 Science China Press Co., Ltd. All rights reserved.
