I don't know how they're doing the networking, but they're using NVIDIA Spectrum-X Ethernet. The unsung heroes, though, are the cooling and electrical systems, which just get glossed over. But one story that exemplifies how insane this stuff is comes from training, right?
You're always running through the model a bunch, in the most simplistic terms, and then you exchange everything and synchronize the weights. So you'll do a step. This is a step in model training, and every step your loss goes down, hopefully. It doesn't always.
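As a minimal sketch, assuming a PyTorch setup with torch.distributed already initialized, one such step might look like this (the function and its structure are illustrative, not anyone's actual training code):

```python
import torch
import torch.distributed as dist

def train_step(model, optimizer, loss_fn, batch):
    inputs, targets = batch

    # Compute phase: forward and backward passes -- GPUs fully busy.
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Exchange phase: sum gradients across all ranks and average them,
    # so every copy of the model takes the identical update.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    # Synchronized weight update, then on to the next step.
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```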
But in the simplest terms, you compute a lot and then you exchange, right? The interesting thing is that GPU power is most of it; networking power is some, but a lot less. So while you're computing, your GPU power draw is way up high.
But then when you're exchanging weights, if you're not able to overlap communication and compute perfectly, there may be a period where your GPUs are just idle while the model updates: you exchange the gradients, you do the model update, and then you start training again. So the power drops off and shoots back up, and it's super spiky.
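A rough sketch of the overlap idea, approximating what PyTorch DDP's gradient bucketing automates under the hood (the class name and structure here are my assumptions, not DDP's actual internals):

```python
import torch.distributed as dist

class OverlappedAllReduce:
    """Launch an async all-reduce for each gradient as soon as backward()
    produces it, rather than waiting for the whole backward pass; a
    simplified stand-in for DDP's bucketed overlap."""

    def __init__(self, model):
        self.pending = []
        for p in model.parameters():
            # Fires right after p.grad is accumulated during backward().
            p.register_post_accumulate_grad_hook(self._launch)

    def _launch(self, param):
        # async_op=True returns a handle immediately; the network transfer
        # proceeds while the GPU computes earlier layers' gradients.
        self.pending.append(dist.all_reduce(param.grad, async_op=True))

    def wait_all(self):
        # With perfect overlap these waits return instantly; any time spent
        # here is idle GPUs -- exactly the power dip described above.
        for handle in self.pending:
            handle.wait()
        self.pending.clear()
```

You'd call wait_all() between loss.backward() and optimizer.step(); whatever time it spends blocking is the idle, low-power window.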
And funnily enough, at the scale of data center power, you can blow stuff up so easily. Meta actually accidentally upstreamed something into PyTorch where they added a flag. And I kid you not, whoever made this, I want to hug the guy, because it's called something like pytorch_no_powerplant_blowup, set to zero or one.
And what it does is amazing, right? When you're exchanging the weights, the GPU will just compute fake numbers so the power doesn't spike too much, and then the power plants don't blow up, because those transient spikes screw stuff up.
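As a toy illustration of that idea (not the real implementation; the flag name follows the story as told, and everything else here is assumed): while the gradient exchange is in flight, keep the GPU busy with throwaway matrix multiplies so the power draw stays flat.

```python
import os
import torch
import torch.distributed as dist

# Hypothetical: flag name taken from the anecdote, not a real PyTorch API.
NO_BLOWUP = os.environ.get("PYTORCH_NO_POWERPLANT_BLOWUP", "0") == "1"

def all_reduce_with_ballast(grad: torch.Tensor) -> None:
    handle = dist.all_reduce(grad, async_op=True)
    if NO_BLOWUP:
        # Burn SM cycles on dummy matmuls until the transfer completes;
        # the results are discarded -- only the steady power draw matters.
        ballast = torch.randn(4096, 4096, device=grad.device)
        while not handle.is_completed():
            ballast = ballast @ ballast
    handle.wait()
```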
And Elon's solution was to throw in a bunch of Tesla Megapacks and a few other things. Everyone has different solutions, but Meta's at least was publicly and openly known: just set this flag, and it makes the GPUs do throwaway computation so that the power doesn't swing.