SCIENCE CHINA Information Sciences, Volume 60, Issue 1: 012107(2017) https://doi.org/10.1007/s11432-014-0372-y

Mining authorship characteristics in bug repositories

  • Received: Jul 8, 2015
  • Accepted: Aug 27, 2015
  • Published: Nov 23, 2016


Bug reports are widely used to support software maintenance tasks. Because bug reports are written by people, the authorship characteristics of contributors may heavily affect how well these tasks are resolved; for example, poorly written bug reports may delay developers in fixing bugs. However, no in-depth investigation of these authorship characteristics has been conducted. In this study, we first leverage byte-level $N$-grams to model the authorship characteristics and employ Normalized Simplified Profile Intersection (NSPI) to measure the similarity between them. Then, we investigate a series of properties related to contributors' authorship characteristics, including their evolution over time and their variation among distinct products in open source projects. Moreover, we show how to leverage the authorship characteristics to facilitate a well-known task in software maintenance, namely Bug Report Summarization (BRS). Experiments on open source projects validate that incorporating the authorship characteristics can effectively improve a state-of-the-art BRS method. Our findings suggest that contributors should retain stable authorship characteristics and that the authorship characteristics can assist in resolving software tasks.
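
As a rough illustration of the profile-based comparison mentioned in the abstract, the Python sketch below builds a byte-level $N$-gram profile for a piece of text and scores two profiles by the size of their intersection. The function names, the default values of `n` and `size`, and in particular the normalization (dividing by the larger profile size) are assumptions made for illustration; the abstract does not give the exact NSPI definition, so this may differ from the measure used in the paper.

```python
from collections import Counter


def byte_ngram_profile(text: str, n: int = 3, size: int = 500) -> set:
    """Return the `size` most frequent byte-level n-grams of `text`.

    The defaults for `n` and `size` are illustrative assumptions.
    """
    data = text.encode("utf-8")
    # Slide a window of n bytes over the encoded text and count each n-gram.
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return {gram for gram, _ in counts.most_common(size)}


def normalized_spi(profile_a: set, profile_b: set) -> float:
    """Shared n-grams divided by the larger profile size.

    Assumption: this normalization is a stand-in chosen so that the
    score lies in [0, 1]; the paper's own NSPI formula is not shown here.
    """
    if not profile_a or not profile_b:
        return 0.0
    shared = len(profile_a & profile_b)
    return shared / max(len(profile_a), len(profile_b))
```

Under these assumptions, two texts with identical profiles score 1.0 and two texts with no shared $N$-grams score 0.0, so profiles of the same contributor's bug reports from different periods or products could be compared on a common scale.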

Funded by

National Basic Research Program of China (973) (2013CB035906)

New Century Excellent Talents in University (NCET-13-0073)

National Natural Science Foundation of China (61370144)

National Natural Science Foundation of China (61175062)



This work was supported by the National Basic Research Program of China (973) (Grant No. 2013CB035906), the National Natural Science Foundation of China (Grant Nos. 61175062, 61370144), and New Century Excellent Talents in University (Grant No. NCET-13-0073). We greatly thank Rastkar, Murphy, and Murray of the University of British Columbia for sharing their BRS corpus.


