SCIENTIA SINICA Informationis, Volume 48, Issue 1: 47-59(2018) https://doi.org/10.1360/N112016-00279

## Kernel method for matrix completion with side information and its application in multi-label learning

• Accepted: Mar 24, 2017
• Published: Sep 14, 2017

### Abstract

In practical machine learning, an instance is often associated with multiple labels. However, because full supervision is costly to acquire, multi-label data frequently arrive with missing supervised information. By treating the missing labels as unobserved entries of a matrix and the features as side information, matrix completion algorithms can be exploited to address this problem. Whereas previous research mostly focused on the linearly separable case, in this paper we propose the KernelMaxide algorithm, which exploits the nonlinear structure of multi-label data with missing supervision while also modeling the correlations between labels. In particular, we construct a novel optimization objective based on the kernel matrix, using the representer theorem for matrix norms. We further use the Nyström method to reduce the memory and computational burden of handling the kernel matrix. Experiments demonstrate the merit of our proposal.
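The Nyström method mentioned above approximates a large kernel matrix from a small set of sampled landmark columns, avoiding the full $n\times n$ computation. A minimal numpy sketch with synthetic data (the RBF kernel, landmark count, and all names are our own illustration, not the paper's code; the exact kernel is formed here only to measure the approximation error):

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.02):
    # Pairwise RBF kernel between the rows of A and the rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))    # 500 instances, 10 features
m = 50                                # number of landmark columns
idx = rng.choice(len(X), m, replace=False)

C = rbf_kernel(X, X[idx])             # n x m slice of the kernel matrix
W = C[idx]                            # m x m block on the landmarks
K_approx = C @ np.linalg.pinv(W) @ C.T  # rank-m Nystrom approximation

K_exact = rbf_kernel(X, X)            # formed only to check the error
err = np.linalg.norm(K_exact - K_approx) / np.linalg.norm(K_exact)
print(err)  # relative Frobenius error of the rank-m approximation
```

Only the $n\times m$ slice `C` and the $m\times m$ block `W` are needed downstream, which is what cuts the memory and computational burden when $m\ll n$.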


• Figure 1  (Color online) Time efficiency of the KernelMaxide algorithms compared with the baselines. The $X$-axis is the observation rate of the label matrix (varying from $0.1$ to $0.4$); the $Y$-axis is the running time in seconds, so lower is better. (a) Arts; (b) Business; (c) Computers; (d) Education; (e) Entertainment; (f) Health; (g) Recreation; (h) Reference; (i) Science; (j) Social; (k) Society.

• Table 1  Eleven yahoo.com datasets$^{\rm a)}$

| Dataset | Dimension | Class | MaxL |
|---|---|---|---|
| Arts | 462 | 26 | 14 |
| Business | 438 | 30 | 12 |
| Computers | 581 | 33 | 17 |
| Education | 550 | 33 | 7 |
| Entertainment | 640 | 21 | 17 |
| Health | 612 | 32 | 13 |
| Recreation | 606 | 22 | 17 |
| Reference | 793 | 33 | 12 |
| Science | 743 | 40 | 9 |
| Social | 1047 | 39 | 10 |
| Society | 636 | 27 | 16 |

a) Dimension is the number of features; Class is the number of candidate labels; MaxL is the maximum number of relevant labels per instance.

• Algorithm 1  KernelMaxide: a kernel method for matrix completion with side information

Input: $\Omega$, $Y$, $\lambda$, $\widetilde{\boldsymbol K}_{\text{all,UT}}$

Initialize: $C_1=C_2$, $L$, $\gamma>1$, $\theta_1=\theta_2\in(0,1]$, $\epsilon$, $k=2$

Procedure:

while $\widetilde{\mathcal{L}}(C_{k+1})\leq(1-\epsilon)\,\widetilde{\mathcal{L}}(C_k)$ do

  $Z_k=C_k+\theta_k(1/\theta_{k-1}-1)(C_k-C_{k-1})$

  $C_{k+1}=\mathop{\rm argmin}_C\ \lambda\|C\|_{\rm tr}+Q_k(C)$

  while $\mathcal{E}(C_{k+1})-\mathcal{E}(C_k)\ge H_k(L)$ do

    $L=L\gamma$

    $C_{k+1}=\mathop{\rm argmin}_C\ \lambda\|C\|_{\rm tr}+Q_k(C)$

  end while

  $\theta_{k+1}=(\sqrt{\theta_k^4+4\theta_k^2}-\theta_k^2)/2$

  $k=k+1$

end while

Output: $C_{k+1}$
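The inner $\mathop{\rm argmin}$ in Algorithm 1, a proximal step for the trace norm, has a closed-form solution via singular value thresholding. A minimal self-contained sketch of such an accelerated proximal gradient loop on a synthetic matrix-completion problem (a constant step $1/L$ with no line search; the data, $\lambda$, and all names are our own illustrative choices, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.standard_normal((30, 3)) @ rng.standard_normal((3, 20))  # rank-3 target
mask = rng.random(Y.shape) < 0.5                                 # observed entries Omega

def svt(Z, tau):
    # Proximal operator of tau*||.||_tr: soft-threshold the singular values.
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ (np.maximum(s - tau, 0.0)[:, None] * Vt)

def grad(Z):
    # Gradient of the squared loss restricted to the observed entries.
    return mask * (Z - Y)

L, lam = 1.0, 0.1
C_prev = C = np.zeros_like(Y)
theta_prev = theta = 1.0
for k in range(200):
    # Extrapolation step, as in Algorithm 1.
    Zk = C + theta * (1.0 / theta_prev - 1.0) * (C - C_prev)
    # Gradient step on the smooth part, then singular value thresholding.
    C_prev, C = C, svt(Zk - grad(Zk) / L, lam / L)
    theta_prev = theta
    theta = (np.sqrt(theta**4 + 4 * theta**2) - theta**2) / 2

print(np.abs((C - Y) * mask).mean())  # small fit error on the observed entries
```

At a minimizer the observed-entry residual is bounded in spectral norm by $\lambda$, so with a small $\lambda$ the completed matrix fits the observations closely while staying low-rank.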

• Table 2  Ranking Loss (smaller is better) of KernelMaxide-f on multi-label learning with incomplete supervised information$^{\rm a)}$

| Dataset | Algorithm | $\omega=10\%$ | $\omega=20\%$ | $\omega=30\%$ | $\omega=40\%$ |
|---|---|---|---|---|---|
| Arts | KernelMaxide-f | $\mathbf{0.1514}$ | $\mathbf{0.1365}$ | $\mathbf{0.1264}$ | $\mathbf{0.1266}$ |
| | Maxide | $0.1596$ | $0.1500$ | $0.1421$ | $0.1422$ |
| | KernelBSVM | $0.2325$ | $0.2262$ | $0.2138$ | $0.2110$ |
| Business | KernelMaxide-f | $\mathbf{0.0424}$ | $\mathbf{0.0428}$ | $\mathbf{0.0359}$ | $\mathbf{0.0366}$ |
| | Maxide | $0.0457$ | $0.0488$ | $0.0444$ | $0.0458$ |
| | KernelBSVM | $0.0785$ | $0.0833$ | $0.0722$ | $0.0724$ |
| Computers | KernelMaxide-f | $\mathbf{0.0957}$ | $\mathbf{0.0889}$ | $\mathbf{0.0785}$ | $\mathbf{0.0861}$ |
| | Maxide | $0.1033$ | $0.0974$ | $0.0902$ | $0.0988$ |
| | KernelBSVM | $0.1676$ | $0.1588$ | $0.1475$ | $0.1565$ |
| Education | KernelMaxide-f | $\mathbf{0.0944}$ | $\mathbf{0.0833}$ | $\mathbf{0.0839}$ | $\mathbf{0.0806}$ |
| | Maxide | $0.1025$ | $0.0917$ | $0.0904$ | $0.0893$ |
| | KernelBSVM | $0.2249$ | $0.2037$ | $0.2011$ | $0.1915$ |
| Entertainment | KernelMaxide-f | $\mathbf{0.1186}$ | $\mathbf{0.1080}$ | $\mathbf{0.1064}$ | $\mathbf{0.0991}$ |
| | Maxide | $0.1233$ | $0.1199$ | $0.1177$ | $0.1119$ |
| | KernelBSVM | $0.1911$ | $0.1902$ | $0.1797$ | $0.1707$ |
| Health | KernelMaxide-f | $\mathbf{0.0597}$ | $\mathbf{0.0540}$ | $\mathbf{0.0491}$ | $\mathbf{0.0491}$ |
| | Maxide | $0.0691$ | $0.0652$ | $0.0609$ | $0.0607$ |
| | KernelBSVM | $0.1130$ | $0.1144$ | $0.1124$ | $0.1127$ |
| Recreation | KernelMaxide-f | $\mathbf{0.1674}$ | $\mathbf{0.1508}$ | $\mathbf{0.1419}$ | $\mathbf{0.1389}$ |
| | Maxide | $0.1824$ | $0.1609$ | $0.1549$ | $0.1526$ |
| | KernelBSVM | $0.2347$ | $0.2184$ | $0.2143$ | $0.2056$ |
| Reference | KernelMaxide-f | $\mathbf{0.0870}$ | $\mathbf{0.0724}$ | $\mathbf{0.0720}$ | $\mathbf{0.0703}$ |
| | Maxide | $0.1052$ | $0.0946$ | $0.0897$ | $0.0828$ |
| | KernelBSVM | $0.1567$ | $0.1472$ | $0.1489$ | $0.1448$ |
| Science | KernelMaxide-f | $\mathbf{0.1260}$ | $\mathbf{0.1186}$ | $\mathbf{0.1117}$ | $\mathbf{0.1086}$ |
| | Maxide | $0.1423$ | $0.1329$ | $0.1316$ | $0.1268$ |
| | KernelBSVM | $0.2140$ | $0.2054$ | $0.1993$ | $0.1983$ |
| Social | KernelMaxide-f | $\mathbf{0.0687}$ | $\mathbf{0.0588}$ | $\mathbf{0.0569}$ | $\mathbf{0.0552}$ |
| | Maxide | $0.0773$ | $0.0718$ | $0.0727$ | $0.0727$ |
| | KernelBSVM | $0.1155$ | $0.1048$ | $0.1070$ | $0.1086$ |
| Society | KernelMaxide-f | $\mathbf{0.1445}$ | $\mathbf{0.1330}$ | $\mathbf{0.1298}$ | $\mathbf{0.1287}$ |
| | Maxide | $0.1546$ | $0.1454$ | $0.1416$ | $0.1421$ |
| | KernelBSVM | $0.2141$ | $0.2030$ | $0.2054$ | $0.2014$ |

a) $\omega$ is the percentage of training instances with observed label assignments for each label. The best result and those comparable to it ($t$-tests at the $95\%$ confidence level) are bolded.
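The Ranking Loss reported in these tables is, for each instance, the fraction of (relevant, irrelevant) label pairs ordered incorrectly by the predicted scores, averaged over instances. A minimal sketch of this evaluation metric (our own helper; ties are counted as errors here, and tie-handling conventions vary):

```python
import numpy as np

def ranking_loss(Y, F):
    # Y: n x l binary relevance matrix; F: n x l real-valued scores.
    # Per instance: fraction of (relevant, irrelevant) label pairs where
    # the relevant label does NOT get a strictly higher score.
    losses = []
    for y, f in zip(Y.astype(bool), F):
        pos, neg = f[y], f[~y]
        if len(pos) == 0 or len(neg) == 0:
            continue  # undefined for all-relevant / all-irrelevant instances
        losses.append((pos[:, None] <= neg[None, :]).mean())
    return float(np.mean(losses))

Y = np.array([[1, 0, 1], [0, 1, 0]])
F = np.array([[0.9, 0.2, 0.4], [0.1, 0.8, 0.3]])
print(ranking_loss(Y, F))  # 0.0: every relevant label outranks every irrelevant one
```

A loss of $0$ means a perfect ranking; $1$ means every relevant label is ranked below every irrelevant one.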

• Table 3  Ranking Loss (smaller is better) of KernelMaxide-n on multi-label learning with incomplete supervised information$^{\rm a)}$

| Dataset | Algorithm | $\omega=10\%$ | $\omega=20\%$ | $\omega=30\%$ | $\omega=40\%$ |
|---|---|---|---|---|---|
| Arts | KernelMaxide-n | $\mathbf{0.1527}$ | $\mathbf{0.1382}$ | $\mathbf{0.1278}$ | $\mathbf{0.1289}$ |
| | Maxide | $0.1596$ | $0.1500$ | $0.1421$ | $0.1422$ |
| | KernelBSVM | $0.2325$ | $0.2262$ | $0.2138$ | $0.2110$ |
| Business | KernelMaxide-n | $\mathbf{0.0434}$ | $\mathbf{0.0446}$ | $\mathbf{0.0386}$ | $\mathbf{0.0393}$ |
| | Maxide | $0.0457$ | $0.0488$ | $0.0444$ | $0.0458$ |
| | KernelBSVM | $0.0785$ | $0.0833$ | $0.0722$ | $0.0724$ |
| Computers | KernelMaxide-n | $\mathbf{0.0957}$ | $\mathbf{0.0902}$ | $\mathbf{0.0811}$ | $\mathbf{0.0879}$ |
| | Maxide | $0.1033$ | $0.0974$ | $0.0902$ | $0.0988$ |
| | KernelBSVM | $0.1676$ | $0.1588$ | $0.1475$ | $0.1565$ |
| Education | KernelMaxide-n | $\mathbf{0.0955}$ | $\mathbf{0.0853}$ | $\mathbf{0.0861}$ | $\mathbf{0.0833}$ |
| | Maxide | $0.1025$ | $0.0917$ | $0.0904$ | $0.0893$ |
| | KernelBSVM | $0.2249$ | $0.2037$ | $0.2011$ | $0.1915$ |
| Entertainment | KernelMaxide-n | $\mathbf{0.1224}$ | $\mathbf{0.1113}$ | $\mathbf{0.1124}$ | $\mathbf{0.1036}$ |
| | Maxide | $\mathbf{0.1233}$ | $0.1199$ | $0.1177$ | $0.1119$ |
| | KernelBSVM | $0.1911$ | $0.1902$ | $0.1797$ | $0.1707$ |
| Health | KernelMaxide-n | $\mathbf{0.0616}$ | $\mathbf{0.0562}$ | $\mathbf{0.0519}$ | $\mathbf{0.0503}$ |
| | Maxide | $0.0691$ | $0.0652$ | $0.0609$ | $0.0607$ |
| | KernelBSVM | $0.1130$ | $0.1144$ | $0.1124$ | $0.1127$ |
| Recreation | KernelMaxide-n | $\mathbf{0.1704}$ | $\mathbf{0.1568}$ | $\mathbf{0.1489}$ | $\mathbf{0.1484}$ |
| | Maxide | $0.1824$ | $\mathbf{0.1609}$ | $0.1549$ | $\mathbf{0.1526}$ |
| | KernelBSVM | $0.2347$ | $0.2184$ | $0.2143$ | $0.2056$ |
| Reference | KernelMaxide-n | $\mathbf{0.0896}$ | $\mathbf{0.0764}$ | $\mathbf{0.0749}$ | $\mathbf{0.0727}$ |
| | Maxide | $0.1052$ | $0.0946$ | $0.0897$ | $0.0828$ |
| | KernelBSVM | $0.1567$ | $0.1472$ | $0.1489$ | $0.1448$ |
| Science | KernelMaxide-n | $\mathbf{0.1328}$ | $\mathbf{0.1248}$ | $\mathbf{0.1175}$ | $\mathbf{0.1145}$ |
| | Maxide | $0.1423$ | $0.1329$ | $0.1316$ | $0.1268$ |
| | KernelBSVM | $0.2140$ | $0.2054$ | $0.1993$ | $0.1983$ |
| Social | KernelMaxide-n | $\mathbf{0.0719}$ | $\mathbf{0.0624}$ | $\mathbf{0.0609}$ | $\mathbf{0.0600}$ |
| | Maxide | $0.0773$ | $0.0718$ | $0.0727$ | $0.0727$ |
| | KernelBSVM | $0.1155$ | $0.1048$ | $0.1070$ | $0.1086$ |
| Society | KernelMaxide-n | $\mathbf{0.1451}$ | $\mathbf{0.1341}$ | $\mathbf{0.1326}$ | $\mathbf{0.1314}$ |
| | Maxide | $0.1546$ | $0.1454$ | $0.1416$ | $0.1421$ |
| | KernelBSVM | $0.2141$ | $0.2030$ | $0.2054$ | $0.2014$ |

a) The same as in Table 2.

