Let's do that.
Okay.
So this is the inference diagram, and I'll call this forward, so we don't have the wrong thing showing up there.
So let's do the same thing for training now.
We've got a forwards pass, but at some stage, we're going to have to transition to a backwards pass.
So we'll do some number of batches in the forwards pass.
And then we're going to transition to the backwards pass for everyone all in one go.
So the inference part is the same here, but then we do a hard stop at this point and then transition everyone to backwards pass.
Similar numbering like this.
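To make that concrete, here's a minimal sketch (mine, not from the conversation; the stage and microbatch counts are made-up, and forward and backward are assumed to each take one time step) of the all-forwards-then-all-backwards schedule being described:

```python
# Minimal sketch of the schedule described above: every pipeline stage
# runs forward on all microbatches, hits the hard stop, then everyone
# flips to backward. Assumes forward and backward each take one time
# step; P and M are illustrative values.

P = 4  # pipeline stages (devices)
M = 6  # microbatches per batch

def gpipe_style_schedule(p, m):
    """Return {stage: {time_step: work_item}} for the naive schedule."""
    schedule = {s: {} for s in range(p)}
    # Forward phase: microbatch b reaches stage s at time s + b.
    for b in range(m):
        for s in range(p):
            schedule[s][s + b] = f"F{b}"
    # Hard stop: backward begins only after the last forward drains.
    start = m + p - 1
    # Backward phase flows in reverse, from the last stage to the first.
    for b in range(m):
        for s in range(p):
            schedule[s][start + (p - 1 - s) + b] = f"B{b}"
    return schedule

total_steps = 2 * (M + P - 1)
for s, row in gpipe_style_schedule(P, M).items():
    print(f"stage {s}: " + " ".join(row.get(t, "..") for t in range(total_steps)))
# The ".." slots are idle time -- the pipeline bubble discussed below.
```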
Yeah, I mean, smaller is always better, actually, is one way to put it.
From an ML convergence rate perspective, smaller is always better, because basically you're getting the freshest information from the gradient descent.
But from a total training time perspective?
From a total training time perspective, smaller is worse from a systems perspective, and so the optimum is the trade-off between those two.
So you pick a batch size, and then for that batch size, you do some amount forwards and then some amount backwards.
Yep.
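To put rough numbers on the systems side of that trade-off (my back-of-the-envelope arithmetic, not figures from the conversation): with p pipeline stages and m microbatches per batch, the naive schedule leaves each stage idle for about p - 1 of every m + p - 1 time steps, so bigger batches amortize the bubble.

```python
# Rough bubble overhead of the all-forwards-then-all-backwards schedule,
# assuming p equal-cost pipeline stages and m equal-cost microbatches.
def bubble_fraction(p, m):
    return (p - 1) / (m + p - 1)

for m in (4, 16, 64):
    print(f"p=8, m={m:>2}: bubble = {bubble_fraction(8, m):.0%}")
# p=8, m= 4: bubble = 64%   <- small batches waste most of the pipeline
# p=8, m=16: bubble = 30%
# p=8, m=64: bubble = 10%   <- large batches amortize the bubble
```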
You asked why there's even a hard stop.
With pipeline parallelism, because you've got this idle time here, which is the bubble, there are so many techniques in the literature for how to lay this out differently and avoid that.
There are more complicated schemes called like zero bubble or one forward, one backward, which sort of interleave the forwards and the backwards in complicated ways.
You can mine Bitcoin in that bubble.
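For flavor, here's a toy rendering of the one-forward-one-backward idea (my own simplification, not any particular framework's scheduler): after a short warm-up, each stage alternates one forward microbatch with one backward microbatch instead of draining the whole pipeline first.

```python
# Toy 1F1B ("one forward, one backward") work order for each pipeline
# stage -- a simplified sketch of the interleaving, not a real scheduler.
def one_f_one_b(stage, num_stages, num_microbatches):
    warmup = min(num_stages - stage - 1, num_microbatches)
    order = [f"F{b}" for b in range(warmup)]  # warm-up forwards
    f, b = warmup, 0
    while b < num_microbatches:  # steady state: alternate 1F / 1B
        if f < num_microbatches:
            order.append(f"F{f}")
            f += 1
        order.append(f"B{b}")
        b += 1
    return order

for s in range(4):
    print(f"stage {s}: " + " ".join(one_f_one_b(s, 4, 6)))
# stage 0: F0 F1 F2 F3 B0 F4 B1 F5 B2 B3 B4 B5
# stage 3: F0 B0 F1 B1 F2 B2 F3 B3 F4 B4 F5 B5
```

One consequence of the interleaving is that a stage only ever holds activations for a handful of in-flight microbatches at a time, rather than for every microbatch in the batch.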