Research Collection School Of Computing and Information Systems

Towards theoretically understanding why SGD generalizes better than ADAM in deep learning

Publication Type

Conference Proceeding Article

Version

acceptedVersion

Publication Date

12-2020

Abstract

It is not yet clear why ADAM-like adaptive gradient algorithms generalize worse than SGD despite training faster. This work aims to explain this generalization gap by analyzing the local convergence behaviors of these algorithms. Specifically, we observe heavy tails in the gradient noise of these algorithms. This motivates us to analyze them through their Lévy-driven stochastic differential equations (SDEs), since an algorithm and its SDE have similar convergence behaviors. We then establish the escaping time of these SDEs from a local basin. The result shows that (1) the escaping time of both SGD and ADAM depends positively on the Radon measure of the basin and negatively on the heaviness of the gradient noise; (2) for the same basin, SGD has a smaller escaping time than ADAM, mainly because (a) the geometry adaptation in ADAM, which adaptively scales each gradient coordinate, diminishes the anisotropic structure of the gradient noise and results in a larger Radon measure of the basin; and (b) the exponential gradient average in ADAM smooths its gradient and leads to lighter gradient noise tails than in SGD. SGD is therefore more locally unstable than ADAM at sharp minima, defined as minima whose local basins have small Radon measure, and can better escape from them to flatter minima with larger Radon measure. Because flat minima, which here refer to minima in flat or asymmetric basins/valleys, often generalize better than sharp ones [1, 2], our result explains the better generalization performance of SGD over ADAM. Finally, experimental results confirm our heavy-tailed gradient noise assumption and our theoretical claims.
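
The dependence of escaping time on noise heaviness described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' code or experimental setup. It assumes a one-dimensional quadratic basin, symmetric alpha-stable noise drawn with the Chambers-Mallows-Stuck method, and an Euler discretization of a Lévy-driven SDE. All names and parameter values (`sample_sas`, `mean_escape_steps`, `eps`, `radius`, `lr`) are hypothetical choices made for illustration. A smaller tail index `alpha` means heavier tails.

```python
import numpy as np


def sample_sas(alpha, size, rng):
    """Draw symmetric alpha-stable samples (Chambers-Mallows-Stuck method)."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    if alpha == 1.0:
        return np.tan(u)  # Cauchy special case
    return (np.sin(alpha * u) / np.cos(u) ** (1 / alpha)
            * (np.cos((1 - alpha) * u) / w) ** ((1 - alpha) / alpha))


def mean_escape_steps(alpha, eps=0.2, radius=1.0, lr=0.01,
                      trials=200, max_steps=5000, seed=0):
    """Mean number of steps (capped at max_steps) to leave [-radius, radius].

    Toy dynamics (an assumption, not the paper's model): gradient descent
    on the loss theta**2 / 2 with additive alpha-stable noise,
    theta <- theta - lr * theta + eps * lr**(1/alpha) * S.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(trials)
    steps = np.full(trials, max_steps)
    alive = np.ones(trials, dtype=bool)
    for k in range(1, max_steps + 1):
        noise = sample_sas(alpha, trials, rng)
        theta = theta - lr * theta + eps * lr ** (1 / alpha) * noise
        escaped = alive & (np.abs(theta) > radius)
        steps[escaped] = k          # record first exit time only
        alive &= ~escaped
        if not alive.any():
            break
    return steps.mean()


heavy = mean_escape_steps(alpha=1.2)  # heavier-tailed noise
light = mean_escape_steps(alpha=1.8)  # lighter-tailed noise
print(f"alpha=1.2: {heavy:.0f} steps, alpha=1.8: {light:.0f} steps")
```

In this toy setting the heavier-tailed run should leave the basin in fewer steps on average, which is the qualitative direction the abstract states. The sketch does not model the geometry adaptation or the exponential gradient average of ADAM.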

Discipline

Databases and Information Systems | OS and Networks

Research Areas

Intelligent Systems and Optimization

Areas of Excellence

Digital transformation

Publication

Proceedings of the 34th Conference on Neural Information Processing Systems, NeurIPS 2020, Vancouver, Canada, December 6-12

First Page

1

Last Page

12

Publisher

NeurIPS

City or Country

Virtual Conference

Citation

ZHOU, Pan; FENG, Jiashi; MA, Chao; XIONG, Caiming; HOI, Steven C. H.; and E, Weinan. Towards theoretically understanding why SGD generalizes better than ADAM in deep learning. (2020). Proceedings of the 34th Conference on Neural Information Processing Systems, NeurIPS 2020, Vancouver, Canada, December 6-12. 1-12.
Available at: https://ink.library.smu.edu.sg/sis_research/8999

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://proceedings.neurips.cc/paper_files/paper/2020/hash/f3f27a324736617f20abbf2ffd806f6d-Abstract.html

Included in

Databases and Information Systems Commons, OS and Networks Commons
