
SCIENTIA SINICA Informationis, Volume 50, Issue 6: 794-812 (2020) https://doi.org/10.1360/SSI-2019-0208

An interactive feature selection method based on learning-from-crowds

  • Received: Sep 24, 2019
  • Accepted: Dec 9, 2019
  • Published: Jun 9, 2020

Abstract

Ensemble feature selection algorithms aggregate the results of multiple feature selection methods to select an effective subset of features. However, ensemble algorithms typically treat every feature selection method equally and do not consider differences in their performance. Consequently, features selected by only a small number of methods may be excluded, even when those methods are reliable. To address this problem, we propose an interactive feature selection method that aggregates the results of multiple feature selection methods more effectively and iteratively improves the selected features by integrating expert knowledge. The proposed method consists of a learning-from-crowds-based ensemble feature selection algorithm and a visual analysis system. The algorithm models the performance of multiple feature selection methods, calculates their reliabilities, and aggregates their results. To integrate expert knowledge, the visual analysis system provides a set of ranking schemes that help experts understand the results of each individual feature selection method and the roles that features play in classification tasks. A numerical experiment on four real-world datasets shows that the proposed algorithm improves classification accuracy by 0.63%–2.85% compared to state-of-the-art ensemble feature selection algorithms. In addition, case studies on text and image data demonstrate that the proposed visual analysis system can further improve classification accuracy by 0.28%–5.24%.
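The contrast between equal-weight voting and reliability-weighted aggregation can be illustrated with a small sketch. The abstract does not specify the model, so the code below is not the paper's algorithm: it substitutes a simple agreement-based re-weighting heuristic, in which a method's reliability is assumed to be the fraction of features on which it agrees with the current consensus. The function names `majority_vote` and `aggregate_votes` and the toy vote matrix are hypothetical.

```python
def majority_vote(votes):
    """Equal-weight baseline: keep a feature if more than half the methods chose it.

    votes[m][j] is 1 if method m selected feature j, else 0.
    """
    n_methods = len(votes)
    n_features = len(votes[0])
    return [
        1 if sum(votes[m][j] for m in range(n_methods)) > n_methods / 2 else 0
        for j in range(n_features)
    ]


def aggregate_votes(votes, n_iter=10):
    """Illustrative reliability-weighted voting (an assumed heuristic, not the paper's model).

    Alternates between (1) a weighted vote over features and (2) re-estimating
    each method's reliability as its agreement rate with that vote.
    """
    n_methods = len(votes)
    n_features = len(votes[0])
    weights = [1.0] * n_methods  # start from equal trust, i.e. plain voting
    selected = None
    for _ in range(n_iter):
        half = sum(weights) / 2
        # Keep a feature if its weighted support exceeds half the total weight.
        new_selected = [
            1 if sum(weights[m] * votes[m][j] for m in range(n_methods)) > half else 0
            for j in range(n_features)
        ]
        # Reliability of a method = share of features where it matches the consensus.
        weights = [
            sum(1 for j in range(n_features) if votes[m][j] == new_selected[j]) / n_features
            for m in range(n_methods)
        ]
        if new_selected == selected:  # consensus stopped changing
            break
        selected = new_selected
    return selected, weights


# Toy input: methods 0 and 1 agree with each other; methods 2-4 are noisy.
votes = [
    [1, 1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
]
baseline = majority_vote(votes)
selected, weights = aggregate_votes(votes)
```

In this toy input, the last feature is chosen by only two of the five methods, so `majority_vote` drops it. Under the assumed heuristic, those two methods earn higher reliability than the three noisy ones, and `aggregate_votes` keeps the feature.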


Funded by

National Key R&D Program of China (2018YFB1004300)

National Natural Science Foundation of China (61672308, 61761136020, 61936002)


