SCIENCE CHINA Information Sciences, Volume 63, Issue 9: 190102 (2020) https://doi.org/10.1007/s11432-019-2859-8

## Quality assessment of crowdsourced test cases

• Accepted: Apr 3, 2020
• Published: Aug 10, 2020

### Abstract

Various software-engineering problems have been solved by crowdsourcing. In many projects, the software outsourcing process is streamlined on cloud-based platforms. Among software-engineering tasks, test-case development is particularly suitable for crowdsourcing, because a large number of test cases can be generated at little monetary cost. However, the numerous test cases harvested from crowdsourcing can be of high or low quality. Owing to the large volume, distinguishing the high-quality tests by traditional techniques is computationally expensive. Therefore, crowdsourced testing would benefit from an efficient mechanism that distinguishes the qualities of the test cases. This paper introduces an automated approach, TCQA, to evaluate the quality of test cases based on the onsite coding history. Quality assessment by TCQA proceeds through three steps: (1) modeling the coding history as a time series, (2) extracting multiple relevant features from the time series, and (3) building a model that classifies the test cases based on their qualities. Step (3) is accomplished by feature-based machine-learning techniques. By leveraging the onsite coding history, TCQA can assess the test-case quality without performing expensive source-code analysis or executing the test cases. Using the data of nine test-development tasks involving more than 400 participants, we evaluated TCQA from multiple perspectives. TCQA assessed the quality of the test cases with higher precision, faster speed, and lower overhead than conventional test-case quality-assessment techniques. Moreover, TCQA provided real-time insights on test-case quality before the assessment was finished.
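The three steps above can be sketched in Python. Everything here is illustrative rather than the authors' implementation: the feature subset, the toy coding histories, and a nearest-centroid classifier standing in for the paper's feature-based machine-learning model.

```python
import math

def extract_features(series):
    # Step (2): a small subset of the Table 1 features.
    return [max(series),                 # maximum
            sum(series) / len(series),   # mean
            sum(x * x for x in series)]  # abs_energy (sum of squared values)

def train_centroids(histories_by_label):
    # Step (3), simplified: average feature vector per quality label.
    centroids = {}
    for label, histories in histories_by_label.items():
        feats = [extract_features(h) for h in histories]
        centroids[label] = [sum(col) / len(col) for col in zip(*feats)]
    return centroids

def classify(centroids, series):
    # Assign a new coding history to the nearest quality centroid.
    f = extract_features(series)
    return min(centroids, key=lambda label: math.dist(f, centroids[label]))

# Step (1) output, stubbed: toy coding histories, i.e., normalized code
# size sampled at even intervals of development time.
train = {
    "low":  [[0.0, 0.1, 0.1, 0.1, 0.2], [0.0, 0.0, 0.1, 0.2, 0.2]],
    "high": [[0.0, 0.3, 0.5, 0.8, 1.0], [0.0, 0.2, 0.6, 0.9, 1.0]],
}
model = train_centroids(train)
print(classify(model, [0.0, 0.25, 0.55, 0.85, 1.0]))  # prints: high
```

Because only features of the coding history are used, no source-code analysis or test execution is needed at classification time, which is the source of the speedups reported in Table 6.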

### Acknowledgment

This work was partly supported by National Key Research and Development Program of China (Grant No. 2018YFB1403400) and National Natural Science Foundation of China (Grant Nos. 61690201, 61772014).

• Figure 1

(Color online) Overview of assessing the quality of a test case from the dynamic code history using TCQA.

• Figure 2

(Color online) Dependence of the precision performance on the data volume for each task (within-task scenarios). (a) CMD; (b) Datalog; (c) ITClocks; (d) JMerkle; (e) LunarCalendar; (f) QuadTree.

• Figure 3

Representative dynamic histories of code with different quality levels. The $x$ and $y$ axes represent the percentage of the development time and the size growth of the test-case code, respectively. (a) Low-quality tests; (b) medium-quality tests; (c) high-quality tests.
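One way to turn raw (timestamp, code size) snapshots of a participant's onsite session into the normalized curves of Figure 3 is step interpolation onto a fixed grid. This sketch, including the `n_points` resolution, is an assumption, not the paper's exact preprocessing:

```python
def to_time_series(snapshots, n_points=20):
    """Map time-ordered (timestamp, code_size) snapshots onto a fixed-length
    series: x = percentage of development time, y = normalized size growth."""
    t0, t1 = snapshots[0][0], snapshots[-1][0]
    max_size = max(size for _, size in snapshots) or 1  # avoid division by zero
    series = []
    for i in range(n_points):
        t = t0 + (t1 - t0) * i / (n_points - 1)
        # size of the latest snapshot taken at or before time t
        size = [s for ts, s in snapshots if ts <= t][-1]
        series.append(size / max_size)
    return series

# Snapshots at minutes 0, 30, 60, 100 with code sizes 0, 40, 90, 120 bytes.
print(to_time_series([(0, 0), (30, 40), (60, 90), (100, 120)], n_points=5))
```

The resulting fixed-length, normalized series makes histories of different durations and code sizes directly comparable, which the feature extractors in Table 1 require.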

• Table 1

Table 1  Extracted features and their meanings

| Category | Feature | Meaning |
|---|---|---|
| Simple metrics | Maximum | Highest (normalized) value of the time series. |
| Simple metrics | Mean | Mean of the time series. |
| Simple metrics | sum_of_reoccurring_values | Sum of reoccurring values in the time series. |
| Statistical metrics | c3* | Non-linearity of the time series; see [27] for more details. |
| Statistical metrics | abs_energy | Absolute energy of the time series (sum of the squared values). |
| Statistical metrics | agg_linear_trend* | Linear least-squares regression of values of the time series. |
| Frequency-based metrics | fft_coefficient* | Fourier coefficients of the one-dimensional discrete fast Fourier transform for real input. |
| Frequency-based metrics | spkt_welch_density | Cross-power spectral density of the time series at different frequencies. |

* Multiple features of this type can result from different input parameters.
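The feature names in Table 1 match standard time-series calculators (the same names appear in the tsfresh library). The following are minimal re-derivations of three of them, for illustration only:

```python
from collections import Counter

def sum_of_reoccurring_values(series):
    # Sum of the distinct values that appear more than once.
    return sum(v for v, c in Counter(series).items() if c > 1)

def abs_energy(series):
    # Absolute energy: sum of the squared values.
    return sum(x * x for x in series)

def c3(series, lag=1):
    # Non-linearity measure: mean of x[t + 2*lag] * x[t + lag] * x[t]
    # over the series (see [27]); different lags give different features.
    n = len(series)
    if n <= 2 * lag:
        return 0.0
    return sum(series[t + 2 * lag] * series[t + lag] * series[t]
               for t in range(n - 2 * lag)) / (n - 2 * lag)

print(sum_of_reoccurring_values([1, 1, 2, 3, 3, 4]))  # 1 + 3 = 4
print(abs_energy([1, 2, 3]))                          # 1 + 4 + 9 = 14
print(c3([1, 2, 3, 4, 5], lag=1))                     # (6 + 24 + 60) / 3 = 30.0
```

The `lag` parameter of `c3` illustrates the footnote: one calculator yields multiple features when run with different input parameters.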

• Table 2

Table 2  Statistics of the subject tasks

| Task | No. tests | LOC | No. classes |
|---|---|---|---|
| CMD | 134 | 566 | 1 |
| Datalog | 649 | 589 | 9 |
| ITClocks | 134 | 1071 | 13 |
| JMerkle | 370 | 774 | 5 |
| LunarCalendar | 561 | 1170 | 8 |
| QuadTree | 345 | 644 | 6 |
• Table 3

Table 3  Within-task scenario results

| Task | Precision | Recall | $F$-measure |
|---|---|---|---|
| CMD | 0.77 | 0.83 | 0.78 |
| Datalog | 0.80 | 0.82 | 0.81 |
| ITClocks | 0.65 | 0.75 | 0.68 |
| JMerkle | 0.76 | 0.79 | 0.76 |
| LunarCalendar | 0.70 | 0.76 | 0.71 |
| QuadTree | 0.78 | 0.79 | 0.78 |
• Table 4

Table 4  Whole-sample scenario results

| Testing task | Precision | Recall | $F$-measure |
|---|---|---|---|
| CMD | 0.71 | 0.80 | 0.74 |
| Datalog | 0.67 | 0.66 | 0.66 |
| ITClocks | 0.70 | 0.74 | 0.71 |
| JMerkle | 0.60 | 0.84 | 0.73 |
| LunarCalendar | 0.71 | 0.81 | 0.74 |
| QuadTree | 0.71 | 0.80 | 0.72 |
• Table 5

Table 5  Average precisions of 30 runs in the cross-task scenario

| Training task \ Testing task | CMD | Datalog | ITClocks | JMerkle | LunarCalendar | QuadTree |
|---|---|---|---|---|---|---|
| CMD | – | 0.41 | 0.34 | 0.33 | 0.35 | 0.52 |
| Datalog | 0.62 | – | 0.61 | 0.63 | 0.64 | 0.60 |
| ITClocks | 0.68 | 0.59 | – | 0.60 | 0.65 | 0.59 |
| JMerkle | 0.57 | 0.57 | 0.55 | – | 0.57 | 0.60 |
| LunarCalendar | 0.60 | 0.58 | 0.62 | 0.62 | – | 0.60 |
| QuadTree | 0.58 | 0.57 | 0.56 | 0.60 | 0.59 | – |
• Table 6

Table 6  Time-cost comparison (in seconds) between traditional scoring (coverage metrics and mutation testing) and TCQA

| Task | Traditional scoring | TCQA feature extraction | TCQA training | TCQA prediction | TCQA in production (feature extraction + prediction) |
|---|---|---|---|---|---|
| CMD | 763.29 | 29.79 | 0.02 | 0.02 | 29.81 (25.60×) |
| Datalog | 1987.59 | 103.60 | 0.04 | 0.01 | 103.61 (19.18×) |
| ITClocks | 448.57 | 26.77 | 0.02 | 0.01 | 26.78 (16.75×) |
| JMerkle | 859.68 | 46.85 | 0.03 | 0.01 | 46.86 (18.35×) |
| LunarCalendar | 5035.04 | 89.88 | 0.04 | 0.02 | 89.90 (56.00×) |
| QuadTree | 982.64 | 46.62 | 0.02 | 0.01 | 46.63 (21.07×) |
• Table 7

Table 7  Average precision over 30 runs with features extracted from the last $X$ percent of the time series

| Task | 100% | 90% | 80% | 70% | 60% | 50% |
|---|---|---|---|---|---|---|
| CMD | 0.77 | 0.75 | 0.73 | 0.73 | 0.72 | 0.70 |
| Datalog | 0.80 | 0.81 | 0.80 | 0.77 | 0.74 | 0.73 |
| ITClocks | 0.65 | 0.66 | 0.65 | 0.65 | 0.67 | 0.59 |
| JMerkle | 0.76 | 0.73 | 0.71 | 0.73 | 0.70 | 0.69 |
| LunarCalendar | 0.70 | 0.68 | 0.70 | 0.70 | 0.71 | 0.71 |
| QuadTree | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 |
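The truncation behind Table 7 can be sketched as a slicing step applied before feature extraction. Reading "last $X$ percent" as the trailing window of the coding history is our interpretation of the caption:

```python
def last_fraction(series, percent):
    # Keep only the trailing `percent`% of the time series before
    # extracting features, as in Table 7; always keep at least one point.
    k = max(1, round(len(series) * percent / 100))
    return series[-k:]

print(last_fraction(list(range(10)), 50))   # [5, 6, 7, 8, 9]
print(last_fraction(list(range(10)), 100))  # the full series
```

That precision degrades only mildly down to 50% (and not at all for QuadTree) is what supports the paper's claim of real-time quality insights before a task finishes.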


Copyright 2020 CHINA SCIENCE PUBLISHING & MEDIA LTD. All rights reserved.