Dylan Patel
But the stuff that trains DeepSeek V3 and DeepSeek R1, those libraries, if you were to present them to us, I would guess are extremely high-quality code.
Some of them you do. Some of them are bad data. Can I give an AI2 example of what blew up our earlier models? It's a subreddit called Microwave Gang. We love to shout this out. It's a real thing, you can pull up Microwave Gang. Essentially, it's a subreddit where everybody makes posts that are just the letter M. So it's like, mmm.
So there are extremely long sequences of the letter M, and then the comments are like, beep beep, because it's the microwave ending. But if you pass this into a model that's trained to produce normal text, it's extremely high loss. Because normally you see one M, you don't predict M's for a long time. So this is something that causes a lot of spikes for us.
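A rough back-of-the-envelope sketch of that arithmetic; the per-token probabilities below are invented for illustration, not measured from any real model:

```python
import math

# Invented per-token probabilities, purely for illustration.
p_typical = 0.15    # a plausible next token in ordinary English text
p_repeat_m = 0.01   # probability the model assigns to yet another "M"

typical_loss = -math.log(p_typical)   # ~1.9 nats per token
m_run_loss = -math.log(p_repeat_m)    # ~4.6 nats per token

run_length = 500  # a post that is just "MMMM..." for hundreds of tokens
print(f"ordinary text: ~{typical_loss:.1f} nats/token")
print(f"Microwave Gang run: ~{m_run_loss:.1f} nats/token, "
      f"~{run_length * m_run_loss:.0f} nats over the whole run")
```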
But this is old, this is not recent. When you have more mature data systems, that's not the thing that causes the loss spike. And what Dylan is saying is true, but there are levels to this sort of idea. With regards to the stress, right?
Tokens per second. Loss not blown up. They're just watching this.
There are even different types of spikes. Dirk Groeneveld has a theory that I like, which is fast spikes and slow spikes, where there are sometimes when you're looking at the loss and other parameters and you can see it start to creep up and then blow up. And that's really hard to recover from, so you have to go back much further.
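A loose sketch of that slow-spike case; the window sizes, the 5% threshold, and the restore_checkpoint helper are all hypothetical, not anyone's actual monitoring setup:

```python
from collections import deque

# Slow spike: the loss creeps up over many steps, so you roll back to a
# checkpoint from well before the creep started.
short_window = deque(maxlen=200)    # recent loss history
long_window = deque(maxlen=2000)    # longer baseline

def is_slow_divergence(loss):
    short_window.append(loss)
    long_window.append(loss)
    if len(long_window) < long_window.maxlen:
        return False  # not enough history yet
    short_avg = sum(short_window) / len(short_window)
    long_avg = sum(long_window) / len(long_window)
    # Sustained upward creep, not a single bad batch.
    return short_avg > 1.05 * long_avg

# if is_slow_divergence(step_loss):
#     restore_checkpoint(steps_back=5000)  # hypothetical helper: rewind far back
```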
So you have the stressful period where it's flat or might start going up, and you're like, what do I do? Whereas there are also loss spikes where it looks good and then there's one spiky data point. And what you can do is you just skip those. You see that there's a spike, you're like, okay, I can ignore this data, don't update the model, do the next one, and it'll recover quickly.
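A minimal sketch of that skip-the-spiky-batch idea; the spike threshold and the fake loss stream are made up for illustration:

```python
# Fast spike: one batch produces an outlier loss, so skip the update and move on.
def should_skip(loss, history, spike_factor=3.0):
    if len(history) < 50:
        return False  # not enough history to judge yet
    recent_avg = sum(history[-50:]) / 50
    return loss > spike_factor * recent_avg

losses = [2.1, 2.0, 2.05] * 20 + [9.7]   # one spiky data point at the end
history = []
for step, loss in enumerate(losses):
    if should_skip(loss, history):
        print(f"step {step}: loss {loss:.2f} looks like a spike, skipping the update")
        continue  # don't update the model on this batch
    history.append(loss)   # in a real loop, the optimizer step would go here
```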