
SCIENTIA SINICA Informationis, Volume 46, Issue 9: 1298-1320(2016) https://doi.org/10.1360/N112015-00276

A cluster-analysis-based feature-selection method for software defect prediction

  • Received: Apr 25, 2016
  • Accepted: May 31, 2016
  • Published: Sep 18, 2016

Abstract

Software defect prediction mines historical software repositories to build models that predict potentially faulty modules in projects under testing. However, redundant and irrelevant features in the collected datasets can degrade the effectiveness of existing methods. We propose a novel cluster-analysis-based feature-selection method, FECAR. The original features are first clustered using a feature-to-feature correlation (FFC) measure. Then, within each cluster, features are ranked by a feature-class relevance (FCR) measure, and a given number of the top-ranked features are selected. In our empirical studies, we use symmetric uncertainty as the FFC measure and information gain, chi-square, or ReliefF as the FCR measure. Using datasets from real-world projects such as Eclipse and NASA, we evaluate the prediction performance obtained after applying FECAR and analyze the redundancy rate and selection proportion of the selected feature subsets. The results demonstrate the effectiveness of FECAR.
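The two-stage idea in the abstract (cluster features by FFC, then pick the top-ranked features per cluster by FCR) can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not state which clustering algorithm FECAR uses, so a simple greedy threshold grouping stands in for it here. The names `fecar_sketch`, `su_threshold`, and `per_cluster`, as well as the toy feature names and values, are hypothetical, and features are assumed to be already discretized. Symmetric uncertainty serves as the FFC measure and information gain as the FCR measure, as in the abstract.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def information_gain(x, y):
    """Mutual information I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    denom = entropy(x) + entropy(y)
    return 2.0 * information_gain(x, y) / denom if denom > 0 else 0.0


def fecar_sketch(features, labels, su_threshold=0.5, per_cluster=1):
    """Group correlated features, then keep the most relevant ones per group.

    features: dict mapping feature name -> list of discrete values.
    labels:   list of class labels (e.g. 0 = clean, 1 = defective).
    The greedy grouping below is an assumption, not the paper's clustering.
    """
    # FCR step: rank all features by information gain with respect to the class.
    ranked = sorted(
        features,
        key=lambda f: information_gain(features[f], labels),
        reverse=True,
    )
    # FFC step: a feature joins the first cluster whose representative
    # (the cluster's first member) it is strongly correlated with.
    clusters = []
    for name in ranked:
        for cluster in clusters:
            rep = cluster[0]
            if symmetric_uncertainty(features[name], features[rep]) >= su_threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    # Members were appended in descending FCR order, so a prefix is the top-k.
    selected = [name for cluster in clusters for name in cluster[:per_cluster]]
    return clusters, selected


# Toy data: "loc_copy" is fully redundant with "loc"; "churn" is weakly relevant.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "loc": [0, 0, 0, 0, 1, 1, 1, 1],
    "loc_copy": [1, 1, 1, 1, 0, 0, 0, 0],
    "churn": [0, 0, 0, 1, 0, 1, 1, 1],
}
clusters, selected = fecar_sketch(features, labels)
print(clusters)
print(selected)
```

In this toy run the redundant pair ends up in one cluster and contributes a single feature, which is the redundancy reduction the abstract describes. Raising `per_cluster` trades a lower selection proportion for more retained information.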


Funded by

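National Natural Science Foundation of China (61373012)

National Natural Science Foundation of China (61321491)

National Natural Science Foundation of China (91218302)

National Natural Science Foundation of China (61202006)

National Basic Research Program of China (973 Program) (2009CB320705)

Natural Science Research Project of Jiangsu Higher Education Institutions (12KJB520014)

Open Project of the State Key Laboratory for Novel Software Technology, Nanjing University (KFKT2016B18)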


