Adafactor: Memory-Efficient Adaptive Learning Rates

Description

This April 2018 paper introduces Adafactor, an optimization method designed to reduce the memory footprint of adaptive learning rate algorithms such as Adam, particularly for large neural networks. Adafactor does this by estimating per-parameter second moments from a factored representation: for an n × m weight matrix it keeps only the per-row and per-column sums of the moving averages of squared gradients, which cuts the auxiliary memory from O(nm) to O(n+m). The paper also addresses training instability in adaptive methods, proposing two remedies: update clipping, and a decay rate for the second-moment accumulator that increases gradually over training. In addition, the paper proposes scaling parameter updates relative to the magnitude of the parameters themselves rather than using absolute step sizes. Experiments with the Transformer model on machine translation show that Adafactor performs comparably to Adam while requiring significantly less auxiliary memory.

Source: https://arxiv.org/pdf/1804.04235
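The factored estimate, the increasing decay rate, and update clipping described above can be illustrated with a minimal pure-Python sketch. It is not the paper's reference implementation: the function names are hypothetical, the moving average over steps is omitted (a single gradient is used), and the default exponent of 0.8 and clipping threshold of 1.0 are assumed values rather than settings stated in this description.

```python
import math

def factored_second_moment(row_sums, col_sums):
    """Rank-1 reconstruction: V[i][j] is approximated by R[i] * C[j] / sum(R)."""
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]

def rms(matrix):
    """Root mean square over all entries of a matrix given as a list of rows."""
    flat = [x for row in matrix for x in row]
    return math.sqrt(sum(x * x for x in flat) / len(flat))

def decay_rate(step, exponent=0.8):
    """Decay rate that rises toward 1 with the step count (assumed form 1 - t^-c)."""
    return 1.0 - step ** (-exponent)

def clip_update(update, threshold=1.0):
    """Scale the update down so that its RMS does not exceed the threshold."""
    scale = max(1.0, rms(update) / threshold)
    return [[u / scale for u in row] for row in update]

# Toy 2 x 3 gradient; real use would keep moving averages of the sums below.
grad = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
squared = [[g * g for g in row] for row in grad]

row_sums = [sum(row) for row in squared]        # n numbers kept in memory
col_sums = [sum(col) for col in zip(*squared)]  # m numbers kept in memory

# n + m stored values stand in for the full n x m second-moment matrix.
v_hat = factored_second_moment(row_sums, col_sums)

# Adam-style scaled update, followed by clipping of an oversized update.
update = [[g / math.sqrt(v) for g, v in zip(g_row, v_row)]
          for g_row, v_row in zip(grad, v_hat)]
clipped = clip_update([[3.0, 4.0]])
```

Because the squared gradient in this toy case happens to be rank 1, the factored reconstruction is exact; for a general matrix it is only an approximation, which is the trade-off made for the O(n+m) memory cost.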


AI Post Transformers
