
SCIENCE CHINA Information Sciences, Volume 59, Issue 7: 072104 (2016) https://doi.org/10.1007/s11432-015-5470-z

Dropout Rademacher complexity of deep neural networks

More info
  • Received: Oct 17, 2015
  • Accepted: Dec 8, 2015
  • Published: Jun 16, 2016

Abstract

Deep neural networks have achieved great success in a wide range of real-world applications. Many algorithmic and implementation techniques have been developed, yet the theoretical understanding of many aspects of deep neural networks remains far from clear. A particularly interesting issue is the usefulness of dropout, which was motivated by the intuition of preventing complex co-adaptation of feature detectors. In this paper, we study the Rademacher complexity of different types of dropout. Our theoretical results disclose that for shallow neural networks (with one or no hidden layer) dropout reduces the Rademacher complexity polynomially, whereas for deep neural networks it can, remarkably, lead to an exponential reduction.
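For readers unfamiliar with the two ingredients of the analysis: the empirical Rademacher complexity of a function class F on a sample x_1, ..., x_n is E_sigma[ sup_{f in F} (1/n) sum_i sigma_i f(x_i) ], where the sigma_i are i.i.d. uniform {-1, +1} variables, and standard Bernoulli dropout independently zeroes each unit with some probability p during training. The following Python sketch of inverted dropout on a one-hidden-layer network is only an illustration of the mechanism whose complexity the paper bounds; the function name, the 1/(1-p) rescaling convention, and the toy network are illustrative assumptions, not taken from the paper.

import numpy as np

def dropout_forward(h, p=0.5, rng=None, train=True):
    # Inverted Bernoulli dropout: keep each unit with probability 1 - p and
    # rescale the survivors by 1/(1 - p) so the expected activation is unchanged.
    if not train or p == 0.0:
        return h
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(h.shape) >= p   # True (keep) with probability 1 - p
    return h * mask / (1.0 - p)

# Toy usage: a one-hidden-layer network with dropout on the hidden activations.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 10))             # batch of 4 ten-dimensional inputs
W1 = rng.normal(size=(10, 32))
W2 = rng.normal(size=(32, 1))
h = np.maximum(x @ W1, 0.0)              # ReLU hidden units
y = dropout_forward(h, p=0.5, rng=rng) @ W2
print(y.shape)                           # (4, 1)

At test time the mask is simply not applied (train=False); the 1/(1 - p) rescaling during training keeps the expected pre-activation of the next layer the same as in the dropout-free network.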


Funded by

National Natural Science Foundation of China (61333014)

Jiangsu Science Foundation (BK20150586)

National Basic Research Program of China (2014CB340501)


Acknowledgments

This work was supported by the National Basic Research Program of China (Grant No. 2014CB340501), the National Natural Science Foundation of China (Grant No. 61333014), and the Jiangsu Science Foundation (Grant No. BK20150586).


