SCIENTIA SINICA Informationis, Volume 45, Issue 1: 1-44 (2015) https://doi.org/10.1360/N112014-00290

A survey on big data systems

  • Accepted Nov 21, 2014
  • Published Jan 20, 2015


With the development of new technologies, a large amount of data has been generated across various domains (such as optical observation and control, healthcare, sensors, user-generated data, Internet and financial companies, and supply chain systems) over the last two decades. (A more appropriate description might be "infinite" data: in optical observation and control, for example, data are generated continuously, creating a data disaster.) The term big data was coined to capture the profound meaning of this emerging trend. Compared with traditional data, big data exhibits unique characteristics beyond sheer volume, such as being commonly unstructured and requiring more real-time analysis. The development of big data calls for new system architectures for data storage and new mechanisms for large-scale data processing. In this paper, we present a literature survey of big data analytics. First, we present the definition of big data and the challenges it poses. Second, we propose a systematic framework that decomposes a big data system into four sequential modules, namely data generation, data acquisition, data storage, and data analytics, which together form the value chain for big data. We then survey in detail the numerous approaches and mechanisms related to each module from the research and industry communities. Finally, we outline some evaluation benchmarks and potential scientific problems in big data systems.


