Nathan Lambert
Podcast Appearances
I think there is one aspect to note, though, right? It's whether that ability transfers across different types of runs, right? You may make really, really high-quality code for one specific model architecture at one size, and then that's not transferable to, hey, when I make this architecture tweak, everything's broken again, right?
Like, that's something that could be, you know, their specific low-level coding of, like, scheduling SMs is specific to this model architecture and size, right? Whereas NVIDIA's collectives library is more like, hey, it'll work for anything, right? You want to do an all-reduce? Great. I don't care what your model architecture is. It'll work.
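(To make that contrast concrete, here's a minimal sketch, my illustration rather than anything from the conversation: a generic all-reduce over gradients via PyTorch's NCCL-backed torch.distributed, which behaves the same regardless of model architecture.)

```python
# Minimal sketch: NCCL's all-reduce, reached here through torch.distributed,
# just sums tensors across ranks -- it doesn't care what architecture
# produced the gradients.
import torch
import torch.distributed as dist

def average_gradients(model):
    """Average gradients across all workers with a generic all-reduce."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

# Usage (assumes the process group was initialized with the NCCL backend):
# dist.init_process_group(backend="nccl")
# loss.backward(); average_gradients(model); optimizer.step()
```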
And you're giving up a lot of performance when you do that in many cases. But it's worthwhile for them to do the specific optimization for the specific run, given the constraints that they have regarding compute.
When people are training, they have all these various dashboards, but, like, the most simple one is your loss, right? And it continues to go down. But in reality, especially with more complicated stuff like MoE, or FP8 training, which is another innovation, you know, going to a lower-precision number format, i.e. less accurate, the biggest problem is that you end up with loss spikes.
And no one knows why the loss spike happened.
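(A quick illustration of the precision point, not from the conversation: casting values to FP8 and back shows how much detail an 8-bit float throws away. This assumes a recent PyTorch build that exposes the experimental float8 dtypes.)

```python
# Sketch of why lower precision is "less accurate": an FP8 round-trip
# loses information relative to the original float32 values.
import torch

x = torch.randn(4, dtype=torch.float32)
x_fp8 = x.to(torch.float8_e4m3fn)        # quantize to 8-bit floating point
roundtrip = x_fp8.to(torch.float32)      # cast back up for comparison

print("original :", x)
print("after fp8:", roundtrip)
print("abs error:", (x - roundtrip).abs())
```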
Yeah. These people are, like, you know, you'll go out to dinner with a friend who works at one of these labs, and they'll just be looking at their phone every, like, 10 minutes. And it's one thing if they're texting, but they're just, like, checking the loss. Yeah.
And some level of spikes is normal, right? It'll recover and be back. Sometimes, a lot of the old strategy was, like, you just stop the run, restart from the old version, and then, like, change the data mix. And then it keeps going.
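(Here's a hedged sketch of that old stop-and-restart strategy as a loop, purely illustrative: the helpers load_checkpoint, save_checkpoint, and reshuffle_data_mix are hypothetical placeholders, not any lab's real API.)

```python
# Watch the loss; if it spikes well above its recent trend, roll back to the
# last good checkpoint and resume with a tweaked data mix.
from collections import deque

def run_with_spike_guard(train_step, load_checkpoint, save_checkpoint,
                         reshuffle_data_mix, num_steps,
                         spike_factor=3.0, window=100):
    recent = deque(maxlen=window)   # rolling window of recent losses
    step = 0
    while step < num_steps:
        loss = train_step(step)
        if len(recent) == window and loss > spike_factor * (sum(recent) / window):
            # Loss spiked far above the rolling average: restart from the
            # last checkpoint and change the data mix before resuming.
            step = load_checkpoint()
            reshuffle_data_mix()
            recent.clear()
            continue
        recent.append(loss)
        if step % 1000 == 0:
            save_checkpoint(step)
        step += 1
```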
So it's like there's a distribution. The whole idea of grokking also comes in, right? Just because the loss slowed down from improving doesn't mean the model's not learning, because all of a sudden it could just spike down in loss again because it truly learned something, right? And it took some time for it to learn that.