This work was supported by the Science and Technology Project of State Grid Corporation of China (Grant No. SGGR0000XTJS1900448).
Figure 1 (Color online) Smooth loss curve. A flat minimum usually leads to a small generalization error.
Figure 2 (Color online) Training loss and test accuracy of a small non-convex model.
Figure 3 (Color online) Training loss and test accuracy of two large non-convex models. (a) Training loss of ResNet20; (b) test accuracy of ResNet20; (c) training loss of ResNet56; (d) test accuracy of ResNet56.
Figure 4 (Color online) Comparison between S-SNGD and S-SGD.
Algorithm 1 Mini-batch stochastic normalized gradient descent (SNGD)
Initialization: ${\boldsymbol w}_0, \alpha, b, T$;
for $t = 0, 1, \ldots, T-1$ do
    Randomly choose $b$ samples, denoted by ${\mathcal I}_t$;
    Calculate a mini-batch gradient ${\boldsymbol g}_t = \frac{1}{b}\sum_{{\boldsymbol a}\in {\mathcal I}_t} \nabla f({\boldsymbol w}_t;{\boldsymbol a})$;
    ${\boldsymbol w}_{t+1} = {\boldsymbol w}_t - \alpha \frac{{\boldsymbol g}_t}{\|{\boldsymbol g}_t\|}$;
end for
Return $\bar{{\boldsymbol w}}$, which is randomly chosen from $\{{\boldsymbol w}_0, {\boldsymbol w}_1, \ldots, {\boldsymbol w}_T\}$.
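For concreteness, the update of Algorithm 1 can be written as a minimal NumPy sketch. This is only an illustration, not the paper's implementation: grad_fn (a mini-batch gradient oracle), data (an array of training samples), and the tiny constant added to the norm for numerical safety are assumptions introduced here.

import numpy as np

def sngd(w0, grad_fn, data, alpha, b, T, rng=None):
    """Minimal sketch of mini-batch SNGD (Algorithm 1).

    `grad_fn(w, batch)` is an assumed oracle returning the average gradient
    of f over `batch`; `data` is an assumed array of training samples.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    w = np.array(w0, dtype=float)
    iterates = [w.copy()]                                       # {w_0, ..., w_T}
    for t in range(T):
        batch = data[rng.choice(len(data), size=b, replace=False)]  # I_t
        g = grad_fn(w, batch)                                        # g_t
        # normalized step; the small eps is a safeguard, not part of the listing
        w = w - alpha * g / (np.linalg.norm(g) + 1e-12)
        iterates.append(w.copy())
    return iterates[rng.integers(len(iterates))]                # uniform random iterate

The only difference from plain mini-batch SGD is the division by the gradient norm, so every step has length exactly alpha regardless of how large or small the stochastic gradient is.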
Algorithm 2 Stagewise SNGD (S-SNGD)
Initialization: $\tilde{{\boldsymbol w}}_0, b, S, T, \{\alpha_t\}, \{p_t\}$. Here, $\{p_t\}$ is a positive increasing sequence.
for $t = 0, 1, \ldots, T-1$ do
    ${\boldsymbol w}_{t,0} = \tilde{{\boldsymbol w}}_t$;
    $M_t = S/\alpha_t$;
    for $m = 0, 1, \ldots, M_t - 1$ do
        Randomly choose $b$ samples, denoted by ${\mathcal I}_{t,m}$;
        ${\boldsymbol g}_{t,m} = \frac{1}{b}\sum_{{\boldsymbol a}\in {\mathcal I}_{t,m}} \nabla f({\boldsymbol w}_{t,m};{\boldsymbol a})$;
        ${\boldsymbol w}_{t,m+1} = {\boldsymbol w}_{t,m} - \alpha_t \frac{{\boldsymbol g}_{t,m}}{\|{\boldsymbol g}_{t,m}\|}$;
    end for
    Choose $\bar{{\boldsymbol w}}_{t+1}$ randomly from $\{{\boldsymbol w}_{t,0}, \ldots, {\boldsymbol w}_{t,M_t-1}\}$;
    $\tilde{{\boldsymbol w}}_{t+1} = {\boldsymbol w}_{t,M_t}$;
end for
Return $\bar{{\boldsymbol w}}$, which equals $\bar{{\boldsymbol w}}_i$ with probability $p_i/\sum_{t=1}^{T} p_t$, $i = 1, \ldots, T$.
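Under the same assumptions as the previous sketch (the hypothetical grad_fn and data helpers), the stagewise scheme can be sketched as follows. It mainly adds the stage length M_t = S/alpha_t, the warm start between stages, and the p_t-weighted choice of the returned iterate.

import numpy as np

def stagewise_sngd(w0, grad_fn, data, alphas, probs, b, S, rng=None):
    """Minimal sketch of stagewise SNGD (Algorithm 2).

    `alphas[t]` is the step size of stage t; `probs[t]` is the positive,
    increasing weight p_{t+1} used to sample the returned iterate.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    w_tilde = np.array(w0, dtype=float)
    stage_outputs = []                                   # \bar{w}_1, ..., \bar{w}_T
    for alpha in alphas:
        M_t = int(S / alpha)                             # stage length grows as alpha shrinks
        w = w_tilde.copy()
        inner = [w.copy()]                               # {w_{t,0}, ..., w_{t,M_t-1}}
        for m in range(M_t):
            batch = data[rng.choice(len(data), size=b, replace=False)]  # I_{t,m}
            g = grad_fn(w, batch)
            w = w - alpha * g / (np.linalg.norm(g) + 1e-12)
            if m < M_t - 1:
                inner.append(w.copy())
        stage_outputs.append(inner[rng.integers(len(inner))])           # \bar{w}_{t+1}
        w_tilde = w                                      # \tilde{w}_{t+1} = w_{t,M_t}
    p = np.asarray(probs, dtype=float)
    return stage_outputs[rng.choice(len(stage_outputs), p=p / p.sum())]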
Algorithm 3 Variance-reduced variant of SNGD
Initialization: $\tilde{{\boldsymbol w}}_0, \alpha, \beta, B, b, T, M$.
for $t = 0, 1, \ldots, T$ do
    Randomly choose $B$ samples, denoted by ${\mathcal I}_t$;
    Calculate a mini-batch gradient $\tilde{\boldsymbol{\mu}}_t = \frac{1}{B}\sum_{{\boldsymbol a}\in {\mathcal I}_t} \nabla f(\tilde{{\boldsymbol w}}_t;{\boldsymbol a})$;
    ${\boldsymbol w}_{t,0} = {\boldsymbol w}_{t,1} = \tilde{{\boldsymbol w}}_t$;
    $\boldsymbol{\mu}_{t,0} = \tilde{\boldsymbol{\mu}}_t$;
    for $m = 1, 2, \ldots, M$ do
        Randomly choose $b$ samples, denoted by ${\mathcal I}_{t,m}$;
        $\boldsymbol{\mu}_{t,m} = \frac{1}{b}\sum_{{\boldsymbol a}\in {\mathcal I}_{t,m}} (\nabla f({\boldsymbol w}_{t,m};{\boldsymbol a}) - \nabla f({\boldsymbol w}_{t,m-1};{\boldsymbol a})) + \boldsymbol{\mu}_{t,m-1}$;
        Option I (plain update): ${\boldsymbol w}_{t,m+1} = {\boldsymbol w}_{t,m} - \beta \boldsymbol{\mu}_{t,m}$;
        Option II (normalized update): ${\boldsymbol w}_{t,m+1} = {\boldsymbol w}_{t,m} - \alpha \frac{\boldsymbol{\mu}_{t,m}}{\|\boldsymbol{\mu}_{t,m}\|}$;
    end for
    $\tilde{{\boldsymbol w}}_{t+1} = {\boldsymbol w}_{t,M+1}$;
end for
Return $\bar{{\boldsymbol w}}$, which is randomly chosen from $\{{\boldsymbol w}_{t,m}\}_{t=0,1,\ldots,T;\, m=1,2,\ldots,M}$.
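The variance-reduced listing admits a similar sketch. Again this is only an illustration under assumed helpers (grad_fn, data); the normalized flag below merely switches between the two update options shown in the listing and is not a construct from the paper.

import numpy as np

def vr_sngd(w0, grad_fn, data, alpha, beta, B, b, T, M, normalized=True, rng=None):
    """Minimal sketch of the variance-reduced variant (Algorithm 3).

    The recursive estimator mu_{t,m} evaluates the gradients at w_{t,m} and
    w_{t,m-1} on the same mini-batch, as in the listing.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    w_tilde = np.array(w0, dtype=float)
    visited = []                                    # candidate outputs {w_{t,m}}
    for t in range(T + 1):
        big_batch = data[rng.choice(len(data), size=B, replace=False)]   # I_t
        mu = grad_fn(w_tilde, big_batch)            # anchor gradient \tilde{mu}_t
        w_prev, w = w_tilde.copy(), w_tilde.copy()  # w_{t,0} = w_{t,1} = \tilde{w}_t
        for m in range(1, M + 1):
            batch = data[rng.choice(len(data), size=b, replace=False)]   # I_{t,m}
            mu = grad_fn(w, batch) - grad_fn(w_prev, batch) + mu         # mu_{t,m}
            visited.append(w.copy())                # record w_{t,m}
            w_prev = w
            if normalized:
                w = w - alpha * mu / (np.linalg.norm(mu) + 1e-12)        # Option II
            else:
                w = w - beta * mu                                        # Option I
        w_tilde = w                                 # \tilde{w}_{t+1} = w_{t,M+1}
    return visited[rng.integers(len(visited))]      # uniform over {w_{t,m}}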