AI: post transformers

AdamW: Decoupled Weight Decay Regularization for Adaptive Gradient Algorithms

27 Aug 2025

Description

This January 2019 academic paper addresses the common issue of poor generalization in adaptive gradient optimization methods like Adam, compared to traditional Stochastic Gradient Descent (SGD) with momentum. The authors demonstrate that L2 regularization and weight decay are not equivalent for adaptive optimizers, unlike for standard SGD, leading to suboptimal performance in Adam. They propose a simple modification called "decoupled weight decay" (AdamW), which separates the weight decay step from the gradient-based updates. Empirical evidence shows that AdamW significantly improves Adam's generalization performance on image classification tasks and simplifies hyperparameter tuning by decoupling the learning rate and weight decay factors. Furthermore, the paper introduces AdamWR, incorporating warm restarts to further enhance AdamW's anytime performance, ultimately making Adam competitive with SGD with momentum.

Source: https://arxiv.org/pdf/1711.05101
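For listeners who want to see the decoupling concretely, here is a minimal Python sketch of a single update step in the spirit of AdamW. It is an illustration under simplifying assumptions, not the paper's or any library's exact implementation: the function name `adamw_step`, the plain-list scalar parameters, and the default hyperparameters are all chosen here for clarity.

```python
import math

def adamw_step(params, grads, state, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=1e-2):
    """One decoupled-weight-decay (AdamW-style) update over scalar parameters."""
    beta1, beta2 = betas
    state["t"] += 1
    t = state["t"]
    for i, (p, g) in enumerate(zip(params, grads)):
        # Standard Adam moment estimates (weight decay is NOT folded into g here).
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m"][i] / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = state["v"][i] / (1 - beta2 ** t)  # bias-corrected second moment
        # Decoupled weight decay: the decay term multiplies the raw weight and is
        # not rescaled by 1 / sqrt(v_hat), unlike L2 regularization added to the gradient.
        params[i] = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * p)
    return params

# Example usage: two scalar parameters, one update step.
params = [0.5, -1.2]
state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
params = adamw_step(params, [0.1, -0.2], state)
print(params)
```

The contrast with plain L2 regularization is that the decay there would be added to the gradient and then divided by the adaptive denominator, so weights with large gradient history would be decayed less; keeping the decay outside the adaptive step avoids that coupling.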
