
SCIENCE CHINA Information Sciences, Volume 64, Issue 3: 132103 (2021). https://doi.org/10.1007/s11432-020-3023-7

On the convergence and improvement of stochastic normalized gradient descent

  • Received: Mar 20, 2020
  • Accepted: Jun 3, 2020
  • Published: Feb 8, 2021

Abstract


Acknowledgment

This work was supported by Science and Technology Project of State Grid Corporation of China (Grant No. SGGR0000XTJS1900448).


References

[1] Bottou L. Online learning and stochastic approximations. On-Line Learning in Neural Networks, 1998, 17(9): 142.

[2] Robbins H, Monro S. A stochastic approximation method. Ann Math Statist, 1951, 22: 400--407.

[3] Chen C, Wang W, Zhang Y. A convergence analysis for a class of practical variance-reduction stochastic gradient MCMC. Sci China Inf Sci, 2019, 62: 12101.

[4] Tseng P. An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM J Optim, 1998, 8: 506--531.

[5] Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res, 2011, 12: 2121--2159.

[6] Ghadimi S, Lan G. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math Program, 2016, 156: 59--99.

[7] Ding F, Yang H Z, Liu F. Performance analysis of stochastic gradient algorithms under weak conditions. Sci China Ser F-Inf Sci, 2008, 51: 1269--1280.

[8] Nemirovski A, Juditsky A, Lan G. Robust stochastic approximation approach to stochastic programming. SIAM J Optim, 2009, 19: 1574--1609.

[9] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In: Proceedings of Advances in Neural Information Processing Systems, Lake Tahoe, 2012. 1097--1105.

[10] LeCun Y A, Bottou L, Orr G B, et al. Neural Networks: Tricks of the Trade. Berlin: Springer Science & Business Media, 2012. 9--48.

[11] Keskar N S, Mudigere D, Nocedal J, et al. On large-batch training for deep learning: generalization gap and sharp minima. In: Proceedings of International Conference on Learning Representations, Toulon, 2017.

[12] Sutskever I, Martens J, Dahl G E, et al. On the importance of initialization and momentum in deep learning. In: Proceedings of International Conference on Machine Learning, Atlanta, 2013. 1139--1147.

[13] Hazan E, Levy K, Shalev-Shwartz S. Beyond convexity: stochastic quasi-convex optimization. In: Proceedings of Advances in Neural Information Processing Systems, Montréal, 2015. 1594--1602.

[14] Levy K Y. The power of normalization: faster evasion of saddle points. 2016. arXiv preprint.

[15] Fang C, Li C J, Lin Z, et al. SPIDER: near-optimal non-convex optimization via stochastic path-integrated differential estimator. In: Proceedings of Advances in Neural Information Processing Systems, Montréal, 2018. 689--699.

[16] Zhang J, He T, Sra S, et al. Why gradient clipping accelerates training: a theoretical justification for adaptivity. In: Proceedings of International Conference on Learning Representations, Addis Ababa, 2020.

[17] Nesterov Y E. Introductory Lectures on Convex Optimization: A Basic Course. Berlin: Springer Science & Business Media, 2004.

[18] Wilson A C, Mackey L, Wibisono A. Accelerating rescaled gradient descent: fast optimization of smooth functions. In: Proceedings of Advances in Neural Information Processing Systems, Vancouver, 2019. 13533--13543.

[19] Ge R, Huang F, Jin C, et al. Escaping from saddle points: online stochastic gradient for tensor decomposition. In: Proceedings of Conference on Learning Theory, Paris, 2015. 797--842.

[20] Zinkevich M. Online convex programming and generalized infinitesimal gradient ascent. In: Proceedings of International Conference on Machine Learning, Washington, 2003. 928--936.

[21] Johnson R, Zhang T. Accelerating stochastic gradient descent using predictive variance reduction. In: Proceedings of Advances in Neural Information Processing Systems, Lake Tahoe, 2013. 315--323.

[22] Lei L, Ju C, Chen J, et al. Non-convex finite-sum optimization via SCSG methods. In: Proceedings of Advances in Neural Information Processing Systems, Long Beach, 2017. 2348--2358.

[23] Allen-Zhu Z. Natasha 2: faster non-convex optimization than SGD. In: Proceedings of Advances in Neural Information Processing Systems, Montréal, 2018. 2680--2691.

[24] Defazio A, Bottou L. On the ineffectiveness of variance reduced optimization for deep learning. In: Proceedings of Advances in Neural Information Processing Systems, Vancouver, 2019. 1753--1763.

[25] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 770--778.

[26] Krizhevsky A, Hinton G E. Learning Multiple Layers of Features from Tiny Images. Technical Report TR-2009. 2009.

[27] Yuan Z, Yan Y, Jin R, et al. Stagewise training accelerates convergence of testing error over SGD. In: Proceedings of Advances in Neural Information Processing Systems, Vancouver, 2019. 2604--2614.

  • Figure 1

    (Color online) Smooth loss curve. A flat minimum usually leads to a small generalization error [11]. In the region of a saddle point, the gradient is small; in the region of a sharp minimum, the gradient is large. Intuitively, because $\|\boldsymbol{w}_{t+1}-\boldsymbol{w}_t\|=\alpha$ in SNGD, with a suitable $\alpha$ the normalized gradient can escape saddle points and sharp minima faster than the unnormalized gradient.
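
    The constant step length quoted in this caption follows directly from the SNGD update in Algorithm 1: $\|\boldsymbol{w}_{t+1}-\boldsymbol{w}_t\| = \|\alpha\,\boldsymbol{g}_t/\|\boldsymbol{g}_t\|\| = \alpha$ whenever $\boldsymbol{g}_t \neq \boldsymbol{0}$, so the distance moved per iteration is independent of the gradient magnitude.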

  • Figure 2

    (Color online) Training loss and test accuracy of a small non-convex model.

  • Figure 3

    (Color online) Training loss and test accuracy of two large non-convex models. (a) Training loss of ResNet20; (b) test accuracy of ResNet20; (c) training loss of ResNet56; (d) test accuracy of ResNet56.

  • Figure 4

    (Color online) Comparison between S-SNGD and S-SGD.

  •   

    Algorithm 1 SNGD

    Initialization: $\boldsymbol{w}_0, \alpha, b, T$;

    for $t=0,1,\ldots,T-1$

    Randomly choose $b$ samples, denoted by $\mathcal{I}_t$;

    Calculate a mini-batch gradient $\boldsymbol{g}_t = \frac{1}{b}\sum_{\boldsymbol{a}\in \mathcal{I}_t} \nabla f(\boldsymbol{w}_t;\boldsymbol{a})$;

    $\boldsymbol{w}_{t+1} = \boldsymbol{w}_t - \alpha\frac{\boldsymbol{g}_t}{\|\boldsymbol{g}_t\|}$;

    end for

    Return $\bar{\boldsymbol{w}}$, which is randomly chosen from $\{\boldsymbol{w}_0,\boldsymbol{w}_1,\ldots,\boldsymbol{w}_T\}$.
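
    A minimal NumPy sketch of Algorithm 1, assuming a user-supplied oracle grad_fn(w, batch) that returns the averaged mini-batch gradient and a dataset stored as an array indexed along its first axis; the function and parameter names, and the small eps added to the denominator, are illustrative choices rather than part of the paper:

    import numpy as np

    def sngd(w0, grad_fn, data, alpha, b, T, rng=None, eps=1e-12):
        # Minimal sketch of Algorithm 1 (SNGD).  grad_fn(w, batch) is assumed
        # to return the averaged mini-batch gradient; data is a NumPy array
        # indexed along its first axis.  eps guards against a zero gradient
        # and is not part of the original algorithm.
        rng = np.random.default_rng() if rng is None else rng
        w = np.asarray(w0, dtype=float).copy()
        iterates = [w.copy()]                                 # w_0
        for t in range(T):
            batch = data[rng.choice(len(data), size=b, replace=False)]
            g = grad_fn(w, batch)                             # mini-batch gradient g_t
            w = w - alpha * g / (np.linalg.norm(g) + eps)     # step of length alpha
            iterates.append(w.copy())                         # w_{t+1}
        # return an iterate chosen uniformly at random from {w_0, ..., w_T}
        return iterates[rng.integers(len(iterates))]

    As the caption of Figure 1 notes, every step in this sketch moves the iterate by exactly $\alpha$; the gradient only contributes its direction.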

  •   

    Algorithm 2 S-SNGD

    Initialization: $\tilde{\boldsymbol{w}}_0, b, S, T, \{\alpha_t\}, \{p_t\}$. Here, $\{p_t\}$ is a positive increasing sequence.

    for $t=0,1,\ldots,T-1$

    $\boldsymbol{w}_{t,0} = \tilde{\boldsymbol{w}}_t$;

    $M_t = S/\alpha_t$;

    for $m=0,1,\ldots,M_t-1$

    Randomly choose $b$ samples, denoted by $\mathcal{I}_{t,m}$;

    $\boldsymbol{g}_{t,m} = \frac{1}{b}\sum_{\boldsymbol{a}\in \mathcal{I}_{t,m}} \nabla f(\boldsymbol{w}_{t,m};\boldsymbol{a})$;

    $\boldsymbol{w}_{t,m+1} = \boldsymbol{w}_{t,m} - \alpha_t \frac{\boldsymbol{g}_{t,m}}{\|\boldsymbol{g}_{t,m}\|}$;

    end for

    Choose $\bar{\boldsymbol{w}}_{t+1}$ randomly from $\{\boldsymbol{w}_{t,0}, \ldots, \boldsymbol{w}_{t,M_t-1}\}$;

    $\tilde{\boldsymbol{w}}_{t+1} = \boldsymbol{w}_{t,M_t}$;

    end for

    Return $\bar{\boldsymbol{w}}$, which equals $\bar{\boldsymbol{w}}_i$ with probability $p_{i}/\sum_{t=1}^{T} p_t$, $i=1,\ldots,T$.
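
    A NumPy sketch of Algorithm 2 under the same assumptions as the SNGD sketch above; alphas and ps stand for the sequences $\{\alpha_t\}$ and $\{p_t\}$, and the stage length $M_t = S/\alpha_t$ is rounded to an integer. All names are illustrative:

    import numpy as np

    def s_sngd(w0, grad_fn, data, b, S, T, alphas, ps, rng=None, eps=1e-12):
        # Minimal sketch of Algorithm 2 (S-SNGD).  alphas[t] is the stage-t
        # step size, ps[t] the positive increasing sampling weight of stage t;
        # grad_fn and data are as in the SNGD sketch.
        rng = np.random.default_rng() if rng is None else rng
        w_tilde = np.asarray(w0, dtype=float).copy()
        stage_samples = []                              # bar{w}_1, ..., bar{w}_T
        for t in range(T):
            w = w_tilde.copy()                          # w_{t,0} = tilde{w}_t
            M_t = int(S / alphas[t])                    # stage length grows as alpha_t shrinks
            stage_iterates = []
            for m in range(M_t):
                batch = data[rng.choice(len(data), size=b, replace=False)]
                g = grad_fn(w, batch)                   # g_{t,m}
                stage_iterates.append(w.copy())         # record w_{t,m}, m < M_t
                w = w - alphas[t] * g / (np.linalg.norm(g) + eps)
            stage_samples.append(stage_iterates[rng.integers(M_t)])   # bar{w}_{t+1}
            w_tilde = w                                 # tilde{w}_{t+1} = w_{t,M_t}
        # return bar{w}_i with probability p_i / sum_t p_t
        probs = np.asarray(ps[:T], dtype=float)
        return stage_samples[rng.choice(T, p=probs / probs.sum())]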

  •   

    Algorithm 3 SPIDER-SFO [15]

    Initialization: $\tilde{\boldsymbol{w}}_0, \alpha, \beta, B, b$.

    for $t=0,1,\ldots,T-1$

    Randomly choose $B$ samples, denoted by $\mathcal{I}_t$;

    Calculate a mini-batch gradient $\tilde{\boldsymbol{\mu}}_t = \frac{1}{B}\sum_{\boldsymbol{a}\in \mathcal{I}_t} \nabla f(\tilde{\boldsymbol{w}}_t;\boldsymbol{a})$;

    $\boldsymbol{w}_{t,0} = \boldsymbol{w}_{t,1} = \tilde{\boldsymbol{w}}_t$;

    $\boldsymbol{\mu}_{t,0} = \tilde{\boldsymbol{\mu}}_t$;

    for $m=1,2,\ldots,M$

    Randomly choose $b$ samples, denoted by $\mathcal{I}_{t,m}$;

    $\boldsymbol{\mu}_{t,m} = \frac{1}{b}\sum_{\boldsymbol{a}\in \mathcal{I}_{t,m}} (\nabla f(\boldsymbol{w}_{t,m};\boldsymbol{a}) - \nabla f(\boldsymbol{w}_{t,m-1};\boldsymbol{a})) + \boldsymbol{\mu}_{t,m-1}$;

    if $\|\boldsymbol{\mu}_{t,m}\|\leq \alpha/\beta$ then

    $\boldsymbol{w}_{t,m+1} = \boldsymbol{w}_{t,m} - \beta \boldsymbol{\mu}_{t,m}$;

    else

    $\boldsymbol{w}_{t,m+1} = \boldsymbol{w}_{t,m} - \alpha \frac{\boldsymbol{\mu}_{t,m}}{\|\boldsymbol{\mu}_{t,m}\|}$;

    end if

    end for

    $\tilde{\boldsymbol{w}}_{t+1} = \boldsymbol{w}_{t,M+1}$;

    end for

    Return $\bar{\boldsymbol{w}}$, which is randomly chosen from $\{\boldsymbol{w}_{t,m}\}_{t=0,1,\ldots,T;\, m=1,2,\ldots,M}$.
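
    A NumPy sketch of Algorithm 3 under the same assumptions as the SNGD sketch; because grad_fn averages over its batch, the difference grad_fn(w, batch) - grad_fn(w_prev, batch) equals the averaged gradient difference in the update of $\boldsymbol{\mu}_{t,m}$. The inner-loop length M and outer count T are passed explicitly since the listing above leaves them implicit; all names are illustrative:

    import numpy as np

    def spider_sfo(w0, grad_fn, data, alpha, beta, B, b, M, T, rng=None, eps=1e-12):
        # Minimal sketch of Algorithm 3 (SPIDER-SFO [15]).  The estimator mu
        # tracks the full gradient by accumulating gradient differences over
        # small batches of size b; it is refreshed from a large batch of size
        # B at the start of each outer iteration.
        rng = np.random.default_rng() if rng is None else rng
        w_tilde = np.asarray(w0, dtype=float).copy()
        iterates = []
        for t in range(T):
            big_batch = data[rng.choice(len(data), size=B, replace=False)]
            mu = grad_fn(w_tilde, big_batch)            # tilde{mu}_t
            w_prev = w_tilde.copy()                     # w_{t,0}
            w = w_tilde.copy()                          # w_{t,1}
            for m in range(1, M + 1):
                batch = data[rng.choice(len(data), size=b, replace=False)]
                mu = grad_fn(w, batch) - grad_fn(w_prev, batch) + mu    # mu_{t,m}
                iterates.append(w.copy())               # record w_{t,m}
                if np.linalg.norm(mu) <= alpha / beta:
                    step = beta * mu                                    # small estimator: plain step
                else:
                    step = alpha * mu / (np.linalg.norm(mu) + eps)      # normalized step of length alpha
                w_prev, w = w, w - step                 # w_{t,m+1}
            w_tilde = w                                 # tilde{w}_{t+1} = w_{t,M+1}
        # return an iterate chosen uniformly at random from {w_{t,m}}
        return iterates[rng.integers(len(iterates))]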