SCIENCE CHINA Information Sciences, Volume 62, Issue 5: 052102(2019) https://doi.org/10.1007/s11432-017-9465-9

## QE-integrating framework based on Github knowledge and SVM ranking

• Accepted May 11, 2018
• Published Mar 6, 2019

### Abstract

The latest query expansion (QE) methods use software development features to expand queries. However, these methods allow only one feature to be considered at a time. To consider additional features simultaneously, we propose a QE method based on Github knowledge, a new comprehensive feature that covers both of the existing features (i.e., application program interface (API) information and crowd knowledge). It is extracted from the "pull requests" of code repositories on Github, which contain descriptions of a request and its commits, the participants' comments, and the API information of the changed files. In addition, we implement a black-box framework, called the Github knowledge search repository (GKSR), that integrates multiple QE methods based on support vector machine ranking. Our empirical evaluation shows that GKSR outperforms the state-of-the-art QE methods CodeHow and QECK by 25%–32% in terms of precision.
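The integration step can be pictured as pairwise learning to rank: per-query relevance ranks are converted into pairwise preference samples, and a linear SVM is fit on the feature differences. A minimal sketch, using scikit-learn's `LinearSVC` as a stand-in for Joachims' SVMrank; the feature values are the illustrative ones from Table 3, not the authors' actual GKSR data:

```python
# Pairwise (RankSVM-style) ranking sketch. Assumes scikit-learn;
# features f1..f3 stand for the per-component scores (GK-, API-,
# and CK-based) combined by the framework.
import itertools

import numpy as np
from sklearn.svm import LinearSVC


def to_pairwise(X, y, qid):
    """Turn per-query relevance ranks into pairwise difference samples."""
    Xp, yp = [], []
    for i, j in itertools.combinations(range(len(y)), 2):
        if qid[i] != qid[j] or y[i] == y[j]:
            continue  # only compare results of the same query
        sign = 1 if y[i] > y[j] else -1
        Xp.append(X[i] - X[j])  # preferred minus less-preferred
        yp.append(sign)
        Xp.append(X[j] - X[i])  # mirrored pair for the other class
        yp.append(-sign)
    return np.array(Xp), np.array(yp)


# Two results for one query, with ranks and features as in Table 3.
X = np.array([[0.673, 0.725, 0.0],
              [0.849, 0.0, 0.784]])
y = np.array([2, 1])
qid = np.array([1, 1])

Xp, yp = to_pairwise(X, y, qid)
model = LinearSVC(random_state=0).fit(Xp, yp)
# model.coef_ holds one weight per component score; ranking a new
# result list means sorting it by the dot product with these weights.
```

Ranking then reduces to binary classification of difference vectors, which is the construction behind SVMrank [11,12].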

### Acknowledgment

This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 61672470, 61640221, 61562026).

### References

[1] Haiduc S, Bavota G, Marcus A, et al. Automatic query reformulations for text retrieval in software engineering. In: Proceedings of the 35th International Conference on Software Engineering (ICSE), San Francisco, 2013. 842--851.

[2] Fischer G, Henninger S, Redmiles D. Cognitive tools for locating and comprehending software objects for reuse. In: Proceedings of the 13th International Conference on Software Engineering, Austin, 1991. 318--328.

[3] Lv F, Zhang H Y, Lou J G, et al. CodeHow: effective code search based on API understanding and extended boolean model (E). In: Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), Lincoln, 2015. 260--270.

[4] Nie L M, Jiang H, Ren Z L. Query expansion based on crowd knowledge for code search. IEEE Trans Serv Comput, 2016, 9: 771--783.

[5] Manning C D, Raghavan P, Schütze H. Introduction to Information Retrieval. Cambridge: Cambridge University Press, 2008.

[6] de Souza L B L, Campos E, Maia M A. Ranking crowd knowledge to assist software development. In: Proceedings of the 22nd International Conference on Program Comprehension, Hyderabad, 2014. 72--82.

[7] Nguyen A T, Hilton M, Codoban M, et al. API code recommendation using statistical learning from fine-grained changes. In: Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Seattle, 2016. 511--522.

[8] Haiduc S, Bavota G, Marcus A, et al. Automatic query reformulations for text retrieval in software engineering. In: Proceedings of the 35th International Conference on Software Engineering (ICSE), San Francisco, 2013. 842--851.

[9] Gay G, Haiduc S, Marcus A, et al. On the use of relevance feedback in IR-based concept location. In: Proceedings of IEEE International Conference on Software Maintenance, Edmonton, 2009. 351--360.

[10] Mcmillan C, Poshyvanyk D, Grechanik M. Portfolio: searching for relevant functions and their usages in millions of lines of code. ACM Trans Softw Eng Methodol, 2013, 22: 1--30.

[11] Joachims T. Optimizing search engines using clickthrough data. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, 2002. 133--142.

[12] Joachims T. Training linear SVMs in linear time. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, 2006. 217--226.

[13] Salton G, Fox E A, Wu H. Extended Boolean Information Retrieval. New York: Cornell University, 1983.

[14] Niu H, Keivanloo I, Zou Y. Learning to rank code examples for code search engines. Empir Softw Eng, 2017, 22: 259--291.

[15] Jiang H, Nie L M, Sun Z Y, et al. ROSF: leveraging information retrieval and supervised learning for recommending code snippets. IEEE Trans Serv Comput, 2017. doi: 10.1109/TSC.2016.2592909.

[16] Xu K, Lin H F, Lin Y, et al. Patent retrieval based on multiple information resources. In: Proceedings of the 12th Asia Information Retrieval Societies Conference, Beijing, 2016.

[17] Xu B, Lin H F, Lin Y. Assessment of learning to rank methods for query expansion. J Assoc Inf Sci Technol, 2016, 67: 1345--1357.

• Figure 1

(Color online) Schematic of GKSR.

• Figure 2

(Color online) Example of an R&C pair.

• Figure 3

(Color online) Flowchart of contents of Section 3.

• Figure 4

Evaluation strategy.

• Figure 5

(Color online) Development process of the code search.

• Table 1   Example of integrating QE methods

| Component | Query results | | |
|---|---|---|---|
| GK-based component | $c_{1}$ | $c_{2}$ | $c_{3}$ |
| API-based component | $c_{1}$ | $c_{2}$ | $a_{3}$ |
| CK-based component | $b_{1}$ | $c_{2}$ | $a_{3}$ |
• Table 2   Labeling the relevance rating

| Relevance rating | Relevance score | Similarity score (%) | Tag |
|---|---|---|---|
| 4 | $15 = 2^{4}-1$ | $>85$ | Most relevant |
| 3 | $7 = 2^{3}-1$ | 70–85 | Relevant |
| 2 | $3 = 2^{2}-1$ | 60–70 | Irrelevant |
| 1 | $0 = 2^{0}-1$ | $<60$ | Most irrelevant |
• Table 3   Data format of the Q&R pair. # Query 1, take multiple screenshots in Android

| rank | qid | $f_{1}$ | $f_{2}$ | $f_{3}$ |
|---|---|---|---|---|
| 2 | 1 | 0.673 | 0.725 | 0 |
| 1 | 1 | 0.849 | 0 | 0.784 |
• Table 4   Benchmark dataset

| Tuning set | Training set | Testing set |
|---|---|---|
| 8850 artificial Q&R pairs | 4425 artificial Q&R pairs | 4425 artificial Q&R pairs; 54 real Q&R pairs |
• Table 5   Performance comparisons of three methods

| Metric | Method | Real Top-1 | Real Top-5 | Artificial Top-1 | Artificial Top-5 |
|---|---|---|---|---|---|
| Precision | GKSR$_{\rm noSVM}$ | $\textbf{0.757}^{+0.003,0.002}$ | $\textbf{0.743}^{+0.002,0.002}$ | $\textbf{0.771}^{+0.005,0.003}$ | $\textbf{0.756}^{+0.003,0.001}$ |
| Precision | QECK | 0.659 | 0.603 | 0.666 | 0.620 |
| Precision | CodeHow | 0.623 | 0.572 | 0.632 | 0.589 |
| NDCG | GKSR$_{\rm noSVM}$ | $\textbf{0.6914}^{+0.004,0.002}$ | $\textbf{0.6795}^{+0.003,0.002}$ | $\textbf{0.7123}^{+0.006,0.005}$ | $\textbf{0.6917}^{+0.002,0.002}$ |
| NDCG | QECK | 0.5663 | 0.5092 | 0.5785 | 0.5302 |
| NDCG | CodeHow | 0.5196 | 0.4611 | 0.5398 | 0.4821 |

• Table 6   Matching queries with CodeHow, QECK, and GKSR$_{\rm noSVM}$ (query terms with their numbers of occurrences)

| Method | APIs | Crowd knowledge |
|---|---|---|
| CodeHow | multiple (1) | |
| QECK | | screenshot (76), Android (49) |
| GKSR$_{\rm noSVM}$ | multiple (6) | screenshot (18), Android (4) |
• Table 7   Performance of the baseline GKSR vs. the performance of the GKSR$_{\rm noSVM}$

| Metric | Method | Real Top-1 | Real Top-5 | Artificial Top-1 | Artificial Top-5 |
|---|---|---|---|---|---|
| Precision | GKSR | $\textbf{0.822}^{+0.006}$ | $\textbf{0.804}^{+0.004}$ | $\textbf{0.832}^{+0.005}$ | $\textbf{0.803}^{+0.003}$ |
| Precision | GKSR$_{\rm noSVM}$ | 0.757 | 0.743 | 0.771 | 0.756 |
| NDCG | GKSR | $\textbf{0.8013}^{+0.005}$ | $\textbf{0.7733}^{+0.004}$ | $\textbf{0.8189}^{+0.003}$ | $\textbf{0.7795}^{+0.003}$ |
| NDCG | GKSR$_{\rm noSVM}$ | 0.6914 | 0.6795 | 0.7123 | 0.6917 |
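Tables 5 and 7 report NDCG computed over relevance ratings labeled as in Table 2, whose relevance scores ($2^{r}-1$) serve as the gains. A minimal NDCG@k sketch under that gain mapping; the logarithmic discount and truncation details are the standard convention, assumed rather than taken from the paper:

```python
# NDCG@k sketch using the rating-to-gain mapping of Table 2.
# The log2 position discount is an assumption (the usual DCG form).
import math

# Table 2: relevance rating -> relevance score (gain)
GAIN = {4: 15, 3: 7, 2: 3, 1: 0}


def dcg(gains):
    # Position-discounted sum: gain_i / log2(i + 2), i starting at 0.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(ratings, k):
    gains = [GAIN[r] for r in ratings[:k]]
    ideal = sorted((GAIN[r] for r in ratings), reverse=True)[:k]
    best = dcg(ideal)
    return dcg(gains) / best if best > 0 else 0.0
```

For example, a result list rated [3, 4, 1, 2] scores about 0.85 at k = 4, while the ideally ordered list [4, 3, 2, 1] scores exactly 1.0 — so NDCG rewards placing the most relevant results first, not just retrieving them.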


Copyright 2020 Science China Press Co., Ltd.