
SCIENCE CHINA Information Sciences, Volume 63, Issue 9: 190102 (2020). https://doi.org/10.1007/s11432-019-2859-8

Quality assessment of crowdsourced test cases

  • Received: Sep 19, 2019
  • Accepted: Apr 3, 2020
  • Published: Aug 10, 2020

Abstract

Various software-engineering problems have been solved by crowdsourcing. In many projects, the software outsourcing process is streamlined on cloud-based platforms. Among software-engineering tasks, test-case development is particularly suitable for crowdsourcing, because a large number of test cases can be generated at little monetary cost. However, the numerous test cases harvested from crowdsourcing can be of high or low quality. Owing to the large volume, distinguishing the high-quality test cases with traditional techniques is computationally expensive. Therefore, crowdsourced testing would benefit from an efficient mechanism that distinguishes the qualities of the test cases. This paper introduces an automated approach, TCQA, to evaluate the quality of test cases based on the onsite coding history. Quality assessment by TCQA proceeds through three steps: (1) modeling the coding history as a time series, (2) extracting multiple relevant features from the time series, and (3) building a model that classifies the test cases based on their qualities. Step (3) is accomplished by feature-based machine-learning techniques. By leveraging the onsite coding history, TCQA can assess test-case quality without performing expensive source-code analysis or executing the test cases. Using the data of nine test-development tasks involving more than 400 participants, we evaluated TCQA from multiple perspectives. The TCQA approach assessed the quality of the test cases with higher precision, faster speed, and lower overhead than conventional test-case quality-assessment techniques. Moreover, TCQA provided real-time insights into test-case quality before the assessment was finished.
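
A minimal sketch of the three TCQA steps is given below, assuming that each onsite coding history is available as (timestamp, code size) snapshots and that quality labels exist for training. The fixed-length resampling, the small feature set, and the random-forest classifier are illustrative assumptions, not the paper's exact implementation.

    # Minimal sketch of the TCQA pipeline (assumed data layout and classifier choice).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def to_time_series(snapshots, n_points=100):
        """Step 1: resample one coding history to a fixed-length series of
        code-size growth over normalized (0-100%) development time."""
        t, size = zip(*sorted(snapshots))
        t = np.asarray(t, dtype=float)
        t = (t - t[0]) / (t[-1] - t[0])            # normalize time to [0, 1]
        grid = np.linspace(0.0, 1.0, n_points)
        return np.interp(grid, t, np.asarray(size, dtype=float))

    def simple_features(series):
        """Step 2: a few of the simple/statistical features listed in Table 1."""
        return np.array([
            series.max(),                          # maximum
            series.mean(),                         # mean
            np.sum(series ** 2),                   # abs_energy
            np.polyfit(np.arange(len(series)), series, 1)[0],  # linear-trend slope
        ])

    # Step 3: train a feature-based classifier on labeled coding histories.
    histories = [[(0, 0), (30, 12), (90, 40), (120, 55)],   # toy (timestamp, size) snapshots
                 [(0, 0), (10, 60), (20, 61), (200, 62)]]
    labels = ["high", "low"]                                # toy quality labels
    X = np.vstack([simple_features(to_time_series(h)) for h in histories])
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
    print(clf.predict(X))

In practice the feature set would be the richer catalogue of Table 1 and the model would be trained on labeled crowdsourced test cases; a sketch of extracting those features follows the tables below.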


Acknowledgment

This work was partly supported by the National Key Research and Development Program of China (Grant No. 2018YFB1403400) and the National Natural Science Foundation of China (Grant Nos. 61690201, 61772014).

  • Figure 1

    (Color online) Overview of assessing the quality of a test case from the dynamic coding history using TCQA.

  • Figure 2

    (Color online) Dependence of the precision performance on the data volume for each task (in the within-task scenarios). (a) CMD; (b) Datalog; (c) ITClocks; (d) JMerkle; (e) LunarCalendar; (f) QuadTree.

  • Figure 3

    Representative dynamic histories of codes with different quality levels. The $x$ and $y$ axes represent the percentage of the development time and the size growth of the test-case code, respectively. (a) Low-quality tests; (b) medium-quality tests; (c) high-quality tests.

  • Table 1  

    Table 1  Extracted features and their meanings

    Category                  Feature                      Meaning
    Simple metrics            Maximum                      Highest (normalized) value of the time series.
                              Mean                         Mean of the time series.
                              sum_of_reoccurring_values    Sum of reoccurring values in the time series.
    Statistical metrics       c3*                          Non-linearity of the time series; see [27] for more details.
                              abs_energy                   Absolute energy of the time series (sum of the squared values).
                              agg_linear_trend*            Linear least-squares regression of the values of the time series.
    Frequency-based metrics   fft_coefficient*             Fourier coefficients of the one-dimensional discrete fast Fourier transform for real input.
                              spkt_welch_density           Cross-power spectral density of the time series at different frequencies.

    * Multiple features of this type can result from different input parameters. (A sketch of extracting these features follows the tables below.)

  • Table 2  

    Table 2  Statistics of the subject tasks

    Task No. tests LOC No. classes
    CMD 134 566 1
    Datalog 649 589 9
    ITClocks 134 1071 13
    JMerkle 370 774 5
    LunarCalendar 561 1170 8
    QuadTree 345 644 6
  • Table 3  

    Table 3  Within-task scenario results

    Task Precision Recall $F$-measure
    CMD 0.77 0.83 0.78
    Datalog 0.80 0.82 0.81
    ITClocks 0.65 0.75 0.68
    JMerkle 0.76 0.79 0.76
    LunarCalendar 0.70 0.76 0.71
    QuadTree 0.78 0.79 0.78
  • Table 4  

    Table 4  Whole-sample scenario results

    Testing task Precision Recall $F$-measure
    CMD 0.71 0.80 0.74
    Datalog 0.67 0.66 0.66
    ITClocks 0.70 0.74 0.71
    JMerkle 0.60 0.84 0.73
    LunarCalendar 0.71 0.81 0.74
    QuadTree 0.71 0.80 0.72
  • Table 5  

    Table 5  Average precision over 30 runs in the cross-task scenario

    Training task   Testing task
                    CMD    Datalog  ITClocks  JMerkle  LunarCalendar  QuadTree
    CMD             –      0.41     0.34      0.33     0.35           0.52
    Datalog         0.62   –        0.61      0.63     0.64           0.60
    ITClocks        0.68   0.59     –         0.60     0.65           0.59
    JMerkle         0.57   0.57     0.55      –        0.57           0.60
    LunarCalendar   0.60   0.58     0.62      0.62     –              0.60
    QuadTree        0.58   0.57     0.56      0.60     0.59           –

    –: the training and testing tasks are the same, so no cross-task result is reported.
  • Table 6  

    Table 6  Time-cost comparison (in seconds) between traditional scoring (coverage metrics and mutation testing) and TCQA

    Task            Traditional scoring   TCQA                                        TCQA in production environment
                                          Feature extraction   Training   Prediction  (feature extraction + prediction)
    CMD             763.29                29.79                0.02       0.02        29.81 (25.60x)
    Datalog         1987.59               103.60               0.04       0.01        103.61 (19.18x)
    ITClocks        448.57                26.77                0.02       0.01        26.78 (16.75x)
    JMerkle         859.68                46.85                0.03       0.01        46.86 (18.35x)
    LunarCalendar   5035.04               89.88                0.04       0.02        89.90 (56.00x)
    QuadTree        982.64                46.62                0.02       0.01        46.63 (21.07x)
  • Table 7  

    Table 7  Average precision over 30 runs using features extracted from the last $X$ percent of the time series

    Task 100% 90% 80% 70% 60% 50%
    CMD 0.77 0.75 0.73 0.73 0.72 0.70
    Datalog 0.80 0.81 0.80 0.77 0.74 0.73
    ITClocks 0.65 0.66 0.65 0.65 0.67 0.59
    JMerkle 0.76 0.73 0.71 0.73 0.70 0.69
    LunarCalendar 0.70 0.68 0.70 0.70 0.71 0.71
    QuadTree 0.78 0.78 0.78 0.78 0.78 0.78
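
The feature names in Table 1 (c3, abs_energy, agg_linear_trend, fft_coefficient, spkt_welch_density) match the calculator catalogue of the open-source tsfresh library, so the sketch below uses tsfresh to compute them. The paper does not state which tool was actually used, and the parameter values (lags, chunk lengths, FFT coefficients) are illustrative assumptions.

    # Sketch of extracting the Table 1 feature set with tsfresh (assumed tool and parameters).
    import pandas as pd
    from tsfresh import extract_features

    fc_parameters = {
        "maximum": None,
        "mean": None,
        "sum_of_reoccurring_values": None,
        "c3": [{"lag": 1}],
        "abs_energy": None,
        "agg_linear_trend": [{"attr": "slope", "chunk_len": 2, "f_agg": "mean"}],
        "fft_coefficient": [{"coeff": k, "attr": "real"} for k in range(3)],
        "spkt_welch_density": [{"coeff": 2}],
    }

    # Long-format frame: one row per code-size snapshot, one series per test case ("id").
    df = pd.DataFrame({
        "id":   [0, 0, 0, 0, 1, 1, 1, 1],
        "time": [0, 1, 2, 3, 0, 1, 2, 3],
        "size": [0, 12, 40, 55, 0, 60, 61, 62],
    })
    features = extract_features(df, column_id="id", column_sort="time",
                                default_fc_parameters=fc_parameters)
    print(features.shape)   # one feature vector per test case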
