
SCIENCE CHINA Information Sciences, Volume 62, Issue 5: 052102 (2019) https://doi.org/10.1007/s11432-017-9465-9

QE-integrating framework based on Github knowledge and SVM ranking

  • Received: Jan 23, 2018
  • Accepted: May 11, 2018
  • Published: Mar 6, 2019

Abstract

The latest query expansion (QE) methods use software development features to expand queries. However, these methods allow only one feature to be considered at a time. To consider several features simultaneously, we propose a QE method based on Github knowledge, a new comprehensive feature that covers the existing features (i.e., the application program interface (API) information and crowd knowledge). Github knowledge is extracted from the "pull requests" of code repositories on Github, which contain the description of a request and its commits, the participants' comments, and the API information of the changed files. In addition, we implement a black-box framework, called Github knowledge search repository (GKSR), that integrates multiple QE methods based on support vector machine (SVM) ranking. Our empirical evaluation shows that GKSR outperforms the state-of-the-art QE methods CodeHow and QECK by 25%–32% in terms of precision.
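To make the integration concrete, the following is a minimal Python sketch of the idea, assuming each QE component (e.g., the Github-knowledge-, API-, and crowd-knowledge-based components) exposes a scoring interface. All function and variable names are illustrative, not the paper's actual implementation.

    # Sketch of GKSR-style integration: each QE component returns its own
    # scored candidates, and a linear SVM-rank model combines the
    # per-component scores into a single ranking. Names are illustrative.

    from collections import defaultdict

    def integrate(query, components, weights):
        """components: functions mapping a query to a {candidate: score}
        dict (one per QE method); weights: the learned SVM-rank weights."""
        features = defaultdict(lambda: [0.0] * len(components))
        for i, component in enumerate(components):
            for candidate, score in component(query).items():
                features[candidate][i] = score
        # Linear ranking model: score(c) = w . f(c), as learned by SVM-rank.
        ranked = sorted(
            features.items(),
            key=lambda item: sum(w * f for w, f in zip(weights, item[1])),
            reverse=True,
        )
        return [candidate for candidate, _ in ranked]

With the example of Table 1 below, candidate $c_{2}$ is returned by all three components, so it accumulates the highest combined score and ranks first in the integrated list.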


Acknowledgment

This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 61672470, 61640221, 61562026).


References

[1] Haiduc S, Bavota G, Marcus A, et al. Automatic query reformulations for text retrieval in software engineering. In: Proceedings of the 35th International Conference on Software Engineering (ICSE), San Francisco, 2013. 842--851

[2] Fischer G, Henninger S, Redmiles D. Cognitive tools for locating and comprehending software objects for reuse. In: Proceedings of the 13th International Conference on Software Engineering, Austin, 1991. 318--328

[3] Lv F, Zhang H Y, Lou J G, et al. CodeHow: effective code search based on API understanding and extended boolean model (E). In: Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), Lincoln, 2015. 260--270

[4] Nie L M, Jiang H, Ren Z L. Query expansion based on crowd knowledge for code search. IEEE Trans Serv Comput, 2016, 9: 771--783

[5] Manning C D, Raghavan P, Schütze H. Introduction to Information Retrieval. Cambridge: Cambridge University Press, 2008

[6] de Souza L B L, Campos E, Maia M A. Ranking crowd knowledge to assist software development. In: Proceedings of the 22nd International Conference on Program Comprehension, Hyderabad, 2014. 72--82

[7] Nguyen A T, Hilton M, Codoban M, et al. API code recommendation using statistical learning from fine-grained changes. In: Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Seattle, 2016. 511--522

[8] Haiduc S, Bavota G, Marcus A, et al. Automatic query reformulations for text retrieval in software engineering. In: Proceedings of the 35th International Conference on Software Engineering (ICSE), San Francisco, 2013. 842--851

[9] Gay G, Haiduc S, Marcus A, et al. On the use of relevance feedback in IR-based concept location. In: Proceedings of IEEE International Conference on Software Maintenance, Edmonton, 2009. 351--360

[10] McMillan C, Poshyvanyk D, Grechanik M. Portfolio: searching for relevant functions and their usages in millions of lines of code. ACM Trans Softw Eng Methodol, 2013, 22: 1--30

[11] Joachims T. Optimizing search engines using clickthrough data. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, 2002. 133--142

[12] Joachims T. Training linear SVMs in linear time. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, 2006. 217--226

[13] Salton G, Fox E A, Wu H. Extended Boolean Information Retrieval. New York: Cornell University, 1983

[14] Niu H, Keivanloo I, Zou Y. Learning to rank code examples for code search engines. Empir Softw Eng, 2017, 22: 259--291

[15] Jiang H, Nie L M, Sun Z Y, et al. ROSF: leveraging information retrieval and supervised learning for recommending code snippets. IEEE Trans Serv Comput, 2017. doi: 10.1109/TSC.2016.2592909

[16] Xu K, Lin H F, Lin Y, et al. Patent retrieval based on multiple information resources. In: Proceedings of the 12th Asia Information Retrieval Societies Conference, Beijing, 2016

[17] Xu B, Lin H F, Lin Y. Assessment of learning to rank methods for query expansion. J Assoc Inf Sci Technol, 2016, 67: 1345--1357

  • Figure 1

    (Color online) Schematic of GKSR.

  • Figure 2

    (Color online) Example of an R&C pair.

  • Figure 3

    (Color online) Flowchart of contents of Section 3.

  • Figure 4

    Evaluation strategy.

  • Figure 5

    (Color online) Development process of the code search.

  • Table 1   Example of integrating QE methods
    Component             Query results
    GK-based component    $c_{1}$   $c_{2}$   $c_{3}$
    API-based component   $c_{1}$   $c_{2}$   $a_{3}$
    CK-based component    $b_{1}$   $c_{2}$   $a_{3}$
  • Table 2   Labeling the relevance rating
    Relevance rating   Relevance score   Similarity score (%)   Tag
    4                  $15 = 2^{4}-1$    $>85$                  Most relevant
    3                  $7 = 2^{3}-1$     70–85                  Relevant
    2                  $3 = 2^{2}-1$     60–70                  Irrelevant
    1                  $0 = 2^{0}-1$     $<60$                  Most irrelevant
  • Table 3   Data format of a Q&R pair (see the formatting sketch after this list). # Query 1, take multiple screenshots in Android
    rank qid $f_{1}$ $f_{2}$ $f_{3}$
    2 1 0.673 0.725 0
    1 1 0.849 0 0.784
  • Table 4   Benchmark dataset
    Tuning set Training set Testing set
    8850 artificial Q&R pairs 4425 artificial Q&R pairs 4425 artificial Q&R pairs; 54 real Q&R pairs
  • Table 5   Performance comparisons of the three methods (see the NDCG sketch after this list)
                                       Real testing set      Artificial testing set
    Metric      Method                 Top-1      Top-5      Top-1      Top-5
    Precision   GKSR$_{\rm noSVM}$     $\textbf{0.757}^{+0.003,0.002}$   $\textbf{0.743}^{+0.002,0.002}$   $\textbf{0.771}^{+0.005,0.003}$   $\textbf{0.756}^{+0.003,0.001}$
                QECK                   0.659      0.603      0.666      0.620
                CodeHow                0.623      0.572      0.632      0.589
    NDCG        GKSR$_{\rm noSVM}$     $\textbf{0.6914}^{+0.004,0.002}$   $\textbf{0.6795}^{+0.003,0.002}$   $\textbf{0.7123}^{+0.006,0.005}$   $\textbf{0.6917}^{+0.002,0.002}$
                QECK                   0.5663     0.5092     0.5785     0.5302
                CodeHow                0.5196     0.4611     0.5398     0.4821


  • Table 6   Matching queries with CodeHow, QECK, and GKSR$_{\rm noSVM}$
                           Query terms (number of occurrences)
    Method                 APIs           Crowd knowledge
    CodeHow                multiple (1)   --
    QECK                   --             screenshot (76), Android (49)
    GKSR$_{\rm noSVM}$     multiple (6)   screenshot (18), Android (4)
  • Table 7   Performance of GKSR vs. GKSR$_{\rm noSVM}$
                                       Real testing set      Artificial testing set
    Metric      Method                 Top-1      Top-5      Top-1      Top-5
    Precision   GKSR                   $\textbf{0.822}^{+0.006}$   $\textbf{0.804}^{+0.004}$   $\textbf{0.832}^{+0.005}$   $\textbf{0.803}^{+0.003}$
                GKSR$_{\rm noSVM}$     0.757      0.743      0.771      0.756
    NDCG        GKSR                   $\textbf{0.8013}^{+0.005}$   $\textbf{0.7733}^{+0.004}$   $\textbf{0.8189}^{+0.003}$   $\textbf{0.7795}^{+0.003}$
                GKSR$_{\rm noSVM}$     0.6914     0.6795     0.7123     0.6917

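As a worked counterpart to Tables 2 and 3, the following is a minimal Python sketch of turning similarity-scored Q&R pairs into the SVM-light input format consumed by SVMrank [11,12]. The thresholds follow Table 2, and the example lines reproduce the rows of Table 3; function names are illustrative, not the paper's code.

    # Minimal sketch: label Q&R pairs per Table 2 and emit SVM-light lines
    # like those in Table 3. Names are illustrative.

    def relevance_rating(similarity):
        """Map a similarity score (in %) to Table 2's four ratings;
        these ratings are the rank targets in Table 3."""
        if similarity > 85:
            return 4   # most relevant
        if similarity >= 70:
            return 3   # relevant
        if similarity >= 60:
            return 2   # irrelevant
        return 1       # most irrelevant

    def to_svmrank_line(rank, qid, features):
        """One training line in the SVM-light format used by SVMrank."""
        feats = " ".join(f"{i}:{v}" for i, v in enumerate(features, start=1))
        return f"{rank} qid:{qid} {feats}"

    print(relevance_rating(72.5))                     # -> 3
    # The two rows of Table 3 (# Query 1, take multiple screenshots in Android):
    print(to_svmrank_line(2, 1, [0.673, 0.725, 0]))   # -> 2 qid:1 1:0.673 2:0.725 3:0
    print(to_svmrank_line(1, 1, [0.849, 0, 0.784]))   # -> 1 qid:1 1:0.849 2:0 3:0.784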

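The NDCG values in Tables 5 and 7 are typically computed from graded gains such as Table 2's $2^{r}-1$ relevance scores. Below is a minimal Python sketch of the standard NDCG@k definition under that assumption; it is the textbook formula [5], not the paper's code.

    import math

    # Table 2's graded relevance scores: 15, 7, 3 for ratings 4-2, and 0 for 1.
    SCORES = {4: 15, 3: 7, 2: 3, 1: 0}

    def dcg_at_k(gains, k):
        """Discounted cumulative gain over the top-k graded gains."""
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

    def ndcg_at_k(gains, k):
        """NDCG@k: DCG of the ranking divided by the DCG of the ideal order."""
        ideal = dcg_at_k(sorted(gains, reverse=True), k)
        return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

    # A Top-5 result list whose items carry ratings [3, 4, 1, 2, 1]:
    print(ndcg_at_k([SCORES[r] for r in (3, 4, 1, 2, 1)], k=5))  # -> ~0.85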