SCIENCE CHINA Information Sciences, Volume 61, Issue 5: 050105(2018) https://doi.org/10.1007/s11432-017-9402-3

## Toward accurate link between code and software documentation

• Accepted Mar 27, 2018
• Published Apr 20, 2018

### Abstract

Recovering traceability links between source code and software documentation is an important research topic in software maintenance and software reuse. Much prior research has recovered traceability between documentation and code elements (classes, interfaces, methods, etc.), mostly through program analysis. However, existing approaches still establish many noise links. In this paper, we propose a novel approach that classifies the code elements occurring in a document into contextual code elements and salient code elements. We can then filter out the noise traceability links between a software document and its contextual code elements and obtain a higher-quality link set. Our classifier is trained on the source code of the open source project Lucene and 1899 StackOverflow answer documents about Lucene. We extract code elements from these documents, represent each one with a 7-dimensional feature vector, and use a decision-tree-based learning model to classify it as salient or not. In our experiments, we achieve a precision of 70.7% in recognizing the salient code elements of these documents, a 12% improvement over Rigby's work. Depending on the threshold used in our classifier, we can filter out 56.5% to 69.3% of the noise traceability links. The approach can thus improve the quality of traceability links between source code and related software documents in practice.
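The filtering step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `CodeElement` type, the `filter_links` helper, and the rule that an element counts as salient when its score reaches the threshold are all hypothetical. The scores are the normalized scores listed in Table 2, and the threshold of 0.5 is one of the values explored in Table 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeElement:
    """A code element mentioned in a document (hypothetical type)."""
    name: str
    score: float  # normalized classifier score in [0, 1]


def filter_links(elements, threshold):
    """Split elements into (salient, contextual) by a score threshold.

    Assumption: an element is salient when score >= threshold;
    links to the remaining (contextual) elements are treated as noise.
    """
    salient = [e for e in elements if e.score >= threshold]
    contextual = [e for e in elements if e.score < threshold]
    return salient, contextual


# Normalized scores taken from Table 2.
elements = [
    CodeElement("ScoreDoc", 0.534),
    CodeElement("TopDocs", 0.282),
    CodeElement("totalHits", 0.182),
    CodeElement("searchAfter", 1.0),
]

salient, contextual = filter_links(elements, threshold=0.5)
print("salient:", [e.name for e in salient])
print("contextual:", [e.name for e in contextual])
```

Under these assumptions, only the links to `ScoreDoc` and `searchAfter` would be kept for the example document; raising or lowering the threshold trades precision against recall.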

### Acknowledgment

This work was supported by the National Key Research and Development Project of China (Grant No. 2016YFB1000804) and the National Natural Science Fund for Distinguished Young Scholars (Grant No. 61525201).

### References

[2] Marcus A, Maletic J I. Recovering documentation-to-source-code traceability links using latent semantic indexing. In: Proceedings of the 25th International Conference on Software Engineering, Portland, 2003. 125--135.

[3] Robillard M P, Marcus A, Treude C, et al. On-demand developer documentation. In: Proceedings of IEEE International Conference on Software Maintenance and Evolution (ICSME 2017), Shanghai, 2017. 479--483.

[4] Bacchelli A, D'Ambros M, Lanza M, et al. Benchmarking lightweight techniques to link e-mails and source code. In: Proceedings of the 16th Working Conference on Reverse Engineering (WCRE 2009), Lille, 2009. 205--214.

[5] Bacchelli A, Lanza M, Robbes R. Linking e-mails and source code artifacts. In: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering-Volume 1, Cape Town, 2010. 375--384.

[6] Dagenais B, Robillard M P. Recovering traceability links between an API and its learning resources. In: Proceedings of the 34th International Conference on Software Engineering (ICSE 2012), Zurich, 2012. 47--57.

[7] Rigby P C, Robillard M P. Discovering essential code elements in informal documentation. In: Proceedings of the 2013 International Conference on Software Engineering, San Francisco, 2013. 832--841.

[8] McMillan C, Poshyvanyk D, Revelle M. Combining textual and structural analysis of software artifacts for traceability link recovery. In: Proceedings of ICSE Workshop on Traceability in Emerging Forms of Software Engineering. Washington: IEEE Computer Society, 2009. 41--48.

[9] Panichella A, McMillan C, Moritz E, et al. When and how using structural information to improve IR-based traceability recovery. In: Proceedings of the 17th European Conference on Software Maintenance and Reengineering (CSMR 2013), Genova, 2013. 199--208.

[10] Subramanian S, Inozemtseva L, Holmes R. Live API documentation. In: Proceedings of the 36th International Conference on Software Engineering (ICSE 2014), Hyderabad, 2014. 643--652.

[11] Petrosyan G, Robillard M P, Mori R D. Discovering information explaining API types using text classification. In: Proceedings of the 37th International Conference on Software Engineering-Volume 1, Florence, 2015. 869--879.

[12] Jiang H, Zhang J, Li X, et al. A more accurate model for finding tutorial segments explaining APIs. In: Proceedings of IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER 2016), Suita, 2016. 157--167.

[13] Zou Y Z, Ye T, Lu Y Y, et al. Learning to rank for question-oriented software text retrieval. In: Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE 2015), Lincoln, 2015. 1--11.

[15] Mikolov T, Chen K, Corrado G, et al. Efficient estimation of word representations in vector space. arXiv

[16] Friedman J H. Greedy function approximation: a gradient boosting machine. Ann Stat, 2001, 29: 1189--1232.

[18] Tsuchiya R, Kato T, Washizaki H, et al. Recovering traceability links between requirements and source code in the same series of software products. In: Proceedings of the 17th International Software Product Line Conference, Tokyo, 2013. 121--130.

[20] Xu Y, Liu C. Research on retrieval methods for traceability between Chinese documentation and source code based on LDA. Comput Eng Appl, 2013, 49: 70--76.

[21] Lai G, Wang X, Liu C. Analysis and improvement on retrieval methods for traceability links between source code and documentation. ACTA Electron Sin, 2009, 37: 22--30.

[22] Yang B, Liu C. Research on traceability recovery between documentation and source code based on software structure. J Front Comput Sci Tech, 2014, 6: 7.

[23] Ye X, Shen H, Ma X, et al. From word embeddings to document similarities for improved information retrieval in software engineering. In: Proceedings of the 38th International Conference on Software Engineering, Austin, 2016. 404--415.

[24] Rahimi M, Goss W, Cleland-Huang J. Evolving requirements-to-code trace links across versions of a software system. In: Proceedings of IEEE International Conference on Software Maintenance and Evolution (ICSME 2016), Raleigh, 2016. 99--109.

[25] Zhang Y, Lo D, Xia X, et al. Inferring links between concerns and methods with multi-abstraction vector space model. In: Proceedings of IEEE International Conference on Software Maintenance and Evolution (ICSME 2016), Raleigh, 2016. 110--121.

[26] Kim S, Kim H Y, Kim J A, et al. A study on traceability between documents of a software R&D project. In: Advanced Multimedia and Ubiquitous Engineering. Berlin: Springer, 2016. 203--210.

[27] de Lucia A, Fasano F, Oliveto R, et al. Enhancing an artefact management system with traceability recovery features. In: Proceedings of the 20th International Conference on Software Maintenance (ICSM 2004), Chicago, 2004. 306--315.

[28] Nishikawa K, Washizaki H, Fukazawa Y, et al. Recovering transitive traceability links among software artifacts. In: Proceedings of IEEE International Conference on Software Maintenance and Evolution (ICSME 2015), Bremen, 2015. 576--580.

[29] Ye D, Xing Z, Foo C Y, et al. Learning to extract API mentions from informal natural language discussions. In: Proceedings of IEEE International Conference on Software Maintenance and Evolution (ICSME 2016), Raleigh, 2016. 389--399.

[30] Sridhara G, Hill E, Muppaneni D, et al. Towards automatically generating summary comments for Java methods. In: Proceedings of the 25th IEEE/ACM International Conference on Automated Software Engineering (ASE 2010), Antwerp, 2010. 43--52.

[31] Eddy B P, Kraft N A. Using structured queries for source code search. In: Proceedings of the 30th IEEE International Conference on Software Maintenance and Evolution, Victoria, 2014. 431--435.

[32] Ponzanelli L, Mocci A, Bacchelli A, et al. Improving low quality Stack Overflow post detection. In: Proceedings of the 30th IEEE International Conference on Software Maintenance and Evolution, Victoria, 2014. 541--544.

[33] Lin Y, Liu Z, Sun M, et al. Learning entity and relation embeddings for knowledge graph completion. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, 2015. 2181--2187.

[34] Creation O W L. To generate the ontology from Java source code. Int J Adv Comput Sci Appl, 2011, 2: 111--116.

[35] McMillan C, Grechanik M, Poshyvanyk D, et al. Portfolio: finding relevant functions and their usage. In: Proceedings of the 33rd International Conference on Software Engineering, Waikiki, 2011. 111--120.

[36] Bajracharya S K, Ossher J, Lopes C V. Leveraging usage similarity for effective retrieval of examples in code repositories. In: Proceedings of the 18th ACM SIGSOFT international symposium on Foundations of software engineering, Santa Fe, 2010. 157--166.

[37] Butler S, Wermelinger M, Yu Y J. Investigating naming convention adherence in Java references. In: Proceedings of IEEE International Conference on Software Maintenance and Evolution (ICSME 2015), Bremen, 2015. 41--50.

• Figure 1

(Color online) (a) Code elements in a StackOverflow answer document and (b) corresponding code relation graph of code elements.

• Figure 2

Overview of our approach.

• Figure 3

Overview of our approach.

• Figure 4

(Color online) The precision of our approach and Rigby's work.

• Figure 5

(Color online) The recall of our approach and Rigby's work.

• Figure 6

(Color online) The F-values of classifiers based on different features and thresholds.

• Figure 7

(Color online) Classification result based on original score.

• Table 1   Method metric description

| | Salient (positive) | Non-salient (negative) |
|---|---|---|
| True results | tp | fp |
| False results | fn | tn |
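The counts in Table 1 feed the precision, recall, and F-value figures reported for the classifier. The sketch below uses the standard definitions and assumes the F-value is the balanced F1 score, which the page does not state explicitly; the function name and the example counts are hypothetical and are not results from the paper.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall, and balanced F-value from raw counts.

    tp: salient elements correctly labeled salient
    fp: non-salient elements wrongly labeled salient
    fn: salient elements wrongly labeled non-salient
    Ratios with a zero denominator are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_value = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_value


# Hypothetical counts, for illustration only.
p, r, f = precision_recall_f(tp=70, fp=30, fn=40)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

The true-negative count `tn` does not enter any of the three metrics, which is why the function does not take it as an argument.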
• Table 2   The scores of code elements extracted from the motivating example document

| Code element | TF-IDF | Location | Expwords | DocCos | Type | Relation | Distance | Original score | Normalized score |
|---|---|---|---|---|---|---|---|---|---|
| ScoreDoc | 2.142 | 2 | 2 | 0.106 | 0 | 3 | 0.239 | 0.396 | 0.534 |
| TopDocs | 1.527 | 1 | 0 | 0.04 | 0 | 3 | 0.279 | 0.209 | 0.282 |
| totalHits | 2.231 | 1 | 0 | 0.008 | 1 | 3 | 0.488 | 0.135 | 0.182 |
| searchAfter | 3.311 | 2 | 2 | 0.184 | 1 | 3 | 0.026 | 0.741 | 1.0 |
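The normalized scores in Table 2 are consistent with dividing each original score by the largest original score in the document. The page does not state the normalization rule, so this is an inference from the table values; the helper name `normalize_by_max` is hypothetical.

```python
def normalize_by_max(scores, digits=3):
    """Scale each score by the maximum score, rounded to `digits` places.

    Assumption: normalization is score / max(score) within one document,
    inferred from the values in Table 2 rather than stated by the source.
    """
    top = max(scores.values())
    if top <= 0:
        raise ValueError("maximum score must be positive")
    return {name: round(value / top, digits) for name, value in scores.items()}


# Original scores taken from Table 2.
original = {
    "ScoreDoc": 0.396,
    "TopDocs": 0.209,
    "totalHits": 0.135,
    "searchAfter": 0.741,
}

normalized = normalize_by_max(original)
print(normalized)
```

Scaling by the per-document maximum puts the highest-scoring element at 1.0, so a single threshold can be compared across documents with different raw score ranges.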
• Table 3   The recalls of classifiers based on different features and thresholds

| Feature \ Threshold | 0.2 | 0.3 | 0.4 | 0.45 | 0.5 | 0.55 | 0.6 |
|---|---|---|---|---|---|---|---|
| DocCos | 1.000 | 0.705 | 0.609 | 0.532 | 0.474 | 0.321 | 0.160 |
| TF-IDF | 1.000 | 0.974 | 0.288 | 0.064 | 0.032 | – | – |
| Type | 1.000 | 0.994 | 0.083 | – | – | – | – |
| Location | 0.885 | 0.885 | 0.885 | 0.885 | 0.885 | 0.205 | 0.205 |
| Expwords | 1.000 | 0.987 | 0.423 | 0.064 | – | – | – |
| Relation | 1.000 | 0.718 | 0.564 | 0.506 | 0.506 | – | – |
| Distance | 0.968 | 0.763 | 0.250 | 0.237 | 0.237 | 0.237 | 0.237 |
| Normal classifier | 0.923 | 0.904 | 0.846 | 0.718 | 0.532 | 0.423 | 0.359 |

Copyright 2020 Science China Press Co., Ltd. All rights reserved.