SCIENCE CHINA Information Sciences, Volume 62, Issue 1: 012101 (2019) https://doi.org/10.1007/s11432-018-9656-y

A convergence analysis for a class of practical variance-reduction stochastic gradient MCMC

  • Received: Jun 26, 2018
  • Accepted: Oct 26, 2018
  • Published: Dec 19, 2018

Abstract

Stochastic gradient Markov chain Monte Carlo (SG-MCMC) has been developed as a flexible family of scalable Bayesian sampling algorithms. However, there has been little theoretical analysis of the impact of minibatch size on the convergence rate of these algorithms. In this paper, we prove that at the beginning of an SG-MCMC run, i.e., under a limited computational budget/time, a larger minibatch size leads to a faster decrease of the mean-squared-error bound. This is because stochastic gradients computed on small minibatches are dominated by noise, which motivates the need for variance reduction in SG-MCMC for practical use. Borrowing ideas from stochastic optimization, we propose a simple and practical variance-reduction technique for SG-MCMC that is efficient in both computation and storage. More importantly, we develop theory proving that our algorithm enjoys a faster convergence rate than standard SG-MCMC. A number of large-scale experiments, ranging from Bayesian learning of logistic regression to deep neural networks, validate the theory and demonstrate the superiority of the proposed variance-reduction SG-MCMC framework.
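
For concreteness, the estimator behind this variance-reduction technique (restated here from Algorithm 1, listed further below) replaces the plain stochastic gradient with the control-variate estimate

$g_{l+1} = \tilde{g} + \nabla_{\boldsymbol{\theta}}\log p(\boldsymbol{\theta}_l) + \frac{N}{n_2}\sum_{i\in\tilde{\pi}}\big(\nabla_{\boldsymbol{\theta}}\log p(\boldsymbol{d}_i \mid \boldsymbol{\theta}_l) - \nabla_{\boldsymbol{\theta}}\log p(\boldsymbol{d}_i \mid \tilde{\boldsymbol{\theta}}_l)\big),$

where $\tilde{g}$ is a snapshot gradient computed on a large minibatch of size $n_1$ at the snapshot parameter $\tilde{\boldsymbol{\theta}}_l$ and refreshed only every $m$ iterations, while $\tilde{\pi}$ is a small minibatch of size $n_2$ drawn at every iteration. In expectation over both minibatches the estimate recovers the full-data gradient, and the strong correlation between the two likelihood terms reduces its variance whenever $\boldsymbol{\theta}_l$ stays close to $\tilde{\boldsymbol{\theta}}_l$.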


Supplement

Appendixes A–F.



  • Figure 1

(Color online) MSE vs. wall-clock time for different computational budgets (running time) with varying minibatch sizes for standard SG-MCMC. Under a very limited computational budget, larger minibatch sizes tend to achieve a lower MSE (a), whereas under a larger computational budget, a minibatch size of 1 obtains the smallest MSE (b).

  • Figure 2

    (Color online) Number of passes through data vs. testing error ((a) and (c)) / loss ((b) and (d)) on MNIST ((a) and (b)) and CIFAR-10 ((c) and (d)).

  • Figure 3

(Color online) Number of passes through data vs. testing error ((a) and (c)) / loss ((b) and (d)) with CNN-4 ((a) and (b)) and ResNet ((c) and (d)) on CIFAR-10.

  • Figure 4

(Color online) Number of passes through data vs. testing perplexity on the PTB dataset (a) and WikiText-2 dataset (b).

  • Figure 5

(Color online) Number of passes through data vs. testing negative log-likelihood on the Pima dataset for Bayesian logistic regression (a); number of passes through data vs. testing error (b) / loss (c) on the CIFAR-10 dataset, with varying $n_1$ values.

  • Algorithm 1

    Practical variance-reduction SG-MCMC.

    Input: $\bar{\boldsymbol{x}} = \boldsymbol{x}_0 = (\boldsymbol{\theta}_0, \boldsymbol{\tau}_0) \in \mathbb{R}^d$, minibatch sizes $(n_1, n_2)$ with $n_1 > n_2$, update interval $m$, total iterations $L$, stepsizes $\{h_l\}_{l=1}^L$.

    Output: Approximate samples $\{\boldsymbol{x}_l\}_{l=1}^L$.

    for $l = 0$ to $L-1$ do

    if $(l \bmod m) = 0$ then

    Sample without replacement $\{\pi_i\}_{i=1}^{n_1} \subseteq \{1, \ldots, N\}$;

    $\bar{\boldsymbol{x}} = \boldsymbol{x}_l$, $\tilde{\boldsymbol{\theta}}_l \leftarrow \bar{\boldsymbol{x}}$;

    $\tilde{g} = \frac{N}{n_1}\sum_{i\in\pi}\nabla_{\boldsymbol{\theta}}\log p(\boldsymbol{d}_i \mid \tilde{\boldsymbol{\theta}}_l)$;

    end if

    $\boldsymbol{\theta}_l \leftarrow \boldsymbol{x}_l$, $\tilde{\boldsymbol{\theta}}_l \leftarrow \bar{\boldsymbol{x}}$; {Fix $\tilde{\boldsymbol{\theta}}_l$ in the outer loop as a control variate.}

    Sample without replacement $\{\tilde{\pi}_i\}_{i=1}^{n_2} \subseteq \{1, \ldots, N\}$;

    $g_{l+1} = \tilde{g} + \nabla_{\boldsymbol{\theta}}\log p(\boldsymbol{\theta}_l) + \frac{N}{n_2}\sum_{i\in\tilde{\pi}}\big(\nabla_{\boldsymbol{\theta}}\log p(\boldsymbol{d}_i \mid \boldsymbol{\theta}_l) - \nabla_{\boldsymbol{\theta}}\log p(\boldsymbol{d}_i \mid \tilde{\boldsymbol{\theta}}_l)\big)$;

    $\boldsymbol{x}_{l+1} = \mathrm{NextS}(\boldsymbol{x}_l, g_{l+1}, h_{l+1})$;

    end for
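
    The listing below is a minimal Python sketch of Algorithm 1, not the authors' reference implementation. It assumes a plain SGLD update for the NextS step (so the auxiliary variable $\boldsymbol{\tau}$ used by momentum-based samplers is dropped) and relies on user-supplied helpers grad_log_prior(theta) and grad_log_lik(theta, idx) (hypothetical names) that return the gradient of $\log p(\boldsymbol{\theta})$ and the summed gradient of $\log p(\boldsymbol{d}_i \mid \boldsymbol{\theta})$ over the indices idx.

    import numpy as np

    def practical_vr_sgmcmc(theta0, N, n1, n2, m, L, stepsizes,
                            grad_log_prior, grad_log_lik, rng=None):
        # Sketch of Algorithm 1 with an SGLD step as NextS.
        # grad_log_prior(theta)    -> gradient of log p(theta), shape (d,)
        # grad_log_lik(theta, idx) -> sum_{i in idx} grad log p(d_i | theta), shape (d,)
        rng = np.random.default_rng() if rng is None else rng
        theta = np.asarray(theta0, dtype=float)
        samples = []
        for l in range(L):
            if l % m == 0:
                # Refresh the snapshot and its large-minibatch gradient (size n1).
                theta_snap = theta.copy()
                pi = rng.choice(N, size=n1, replace=False)
                g_snap = (N / n1) * grad_log_lik(theta_snap, pi)
            # Small minibatch (size n2) for the per-iteration correction term.
            pi_tilde = rng.choice(N, size=n2, replace=False)
            g = (g_snap + grad_log_prior(theta)
                 + (N / n2) * (grad_log_lik(theta, pi_tilde)
                               - grad_log_lik(theta_snap, pi_tilde)))
            # NextS as one SGLD transition; other SG-MCMC updates could be plugged in.
            h = stepsizes[l]
            theta = theta + 0.5 * h * g + np.sqrt(h) * rng.standard_normal(theta.shape)
            samples.append(theta.copy())
        return samples

    Because only the size-$n_2$ correction requires gradients at the current $\boldsymbol{\theta}_l$, each iteration costs roughly $2 n_2$ per-example gradient evaluations, plus an amortized $n_1/m$ for the periodic snapshot refresh.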
