Publication Type
Conference Proceeding Article
Version
acceptedVersion
Publication Date
5-2023
Abstract
Training deep networks on large-scale datasets is computationally challenging. In this work, we explore the problem of "how to accelerate adaptive gradient algorithms in a general manner", and aim to provide practical efficiency-boosting insights. To this end, we propose an effective and general Weight-decay-Integrated Nesterov acceleration (Win) to accelerate adaptive algorithms. Taking AdamW and Adam as examples, we minimize a dynamic loss per iteration that combines the vanilla training loss and a dynamic regularizer inspired by the proximal point method (PPM) to improve the convexity of the problem. To introduce Nesterov-like acceleration into AdamW and Adam, we use the first- and second-order Taylor approximations of the vanilla loss, respectively, to update the variable twice. In this way, we arrive at our Win acceleration for AdamW and Adam, which uses a conservative step and a reckless step to update twice and then linearly combines the two updates for acceleration. Next, we extend Win acceleration to LAMB and SGD. Our transparent acceleration derivation could provide insights for other accelerated methods and their integration into adaptive algorithms. Besides, we prove the convergence of Win-accelerated adaptive algorithms and justify their convergence superiority over their non-accelerated counterparts, taking AdamW and Adam as examples. Experimental results testify to the faster convergence speed and superior performance of our Win-accelerated AdamW, Adam, LAMB and SGD over their non-accelerated counterparts on vision classification tasks and language modeling tasks with both CNN and Transformer backbones. We hope Win will become a default acceleration option for popular optimizers in the deep learning community to improve training efficiency. Code will be released at https://github.com/sail-sg/win.
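To make the "conservative step plus reckless step, then linear combination" idea in the abstract concrete, below is a minimal illustrative sketch in PyTorch-style Python of how such a combined AdamW-like update could look. This is not the paper's exact Win algorithm: the function name win_adamw_like_step, the auxiliary sequence y, the step sizes lr and lr_reckless, the mixing weight gamma, and the decoupled weight-decay form are assumptions introduced only for illustration; the actual update rules are given in the paper and the released code at https://github.com/sail-sg/win.

# Illustrative sketch of the "two updates, then combine" idea from the abstract,
# NOT the paper's exact Win algorithm. The names y, lr_reckless, and gamma are
# hypothetical; see the paper / https://github.com/sail-sg/win for the real rules.
import torch


@torch.no_grad()
def win_adamw_like_step(x, y, grad, m, v, step, lr=1e-3, lr_reckless=2e-3,
                        gamma=0.5, betas=(0.9, 0.999), eps=1e-8,
                        weight_decay=1e-2):
    """One sketched update combining a conservative and a reckless step."""
    beta1, beta2 = betas

    # AdamW-style exponential moving averages of the gradient and its square.
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    # Bias-corrected adaptive update direction.
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    direction = m_hat / (v_hat.sqrt() + eps)

    # Conservative step: small step size with decoupled weight decay on x.
    x_new = x - lr * (direction + weight_decay * x)

    # Reckless step: larger step size taken from the auxiliary sequence y.
    y_new = y - lr_reckless * (direction + weight_decay * y)

    # Linearly combine the two updates to form the accelerated iterate.
    x_acc = gamma * x_new + (1 - gamma) * y_new
    return x_acc, y_new

The sketch only illustrates the structural idea stated in the abstract: each iteration produces two candidate updates of different aggressiveness and mixes them linearly, which is what distinguishes the accelerated update from a plain AdamW step.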
Keywords
Network optimizers, Deep learning optimizer, Deep learning algorithm, Optimization acceleration in deep learning, Deep learning and representation learning
Discipline
OS and Networks | Theory and Algorithms
Research Areas
Data Science and Engineering; Intelligent Systems and Optimization
Areas of Excellence
Digital transformation
Publication
Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, 2023 May 1-5
First Page
1
Last Page
28
Publisher
ICLR
City or Country
USA
Citation
ZHOU, Pan; XIE, Xingyu; and YAN, Shuicheng.
Win: Weight-decay-integrated Nesterov acceleration for adaptive gradient algorithms. (2023). Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, 2023 May 1-5. 1-28.
Available at: https://ink.library.smu.edu.sg/sis_research/9056
Copyright Owner and License
Authors
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://openreview.net/pdf?id=dNK2bw4y0R