Dylan Patel
And it adds up.
I think we should summarize what the bitter lesson actually is about. The bitter lesson, essentially, if you paraphrase it, is that the types of training that will win out in deep learning as we go are those methods which are scalable; learning and search are what it calls out. And this "scale" word gets a lot of attention.
The interpretation that I use is effectively to avoid adding human priors to your learning process. And if you read the original essay, this is what it talks about: researchers will try to come up with clever solutions to their specific problem that might get them small gains in the short term, while simply enabling these deep learning systems to work efficiently on these bigger problems in the long term might be more likely to scale and continue to drive success.
And earlier we were talking about relatively small implementation changes to the mixture of experts model, and it's like, okay, we will need a few more years to know if any of those are actually really crucial to the bitter lesson. But the bitter lesson is really this long-term arc of how simplicity can often win.
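To make "relatively small implementation changes to the mixture of experts model" concrete, here is a minimal sketch of a top-k routed MoE layer in PyTorch. It is illustrative only, not DeepSeek's implementation; the expert count, hidden sizes, and routing scheme are all assumptions.

```python
# Minimal sketch of a top-k routed mixture-of-experts layer (illustrative only,
# not DeepSeek's implementation; sizes and routing details are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.router(x)                # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):             # send each token to its k experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE()
y = moe(torch.randn(16, 512))                  # 16 tokens in, 16 tokens out
```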
There are a lot of sayings in the industry like "the models just want to learn." You have to give them the simple loss landscape: you put compute through the model and they will learn, and you get the barriers out of the way.
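The "put compute through the model" idea is basically the plainest possible training loop: data, model, loss, optimizer, with no task-specific machinery in the way. A toy sketch, with the model, loss, and optimizer choices assumed:

```python
# Toy sketch of the "put compute through the model" loop: data in, loss out,
# gradient step, repeat. Model, loss, and optimizer choices are assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.GELU(), nn.Linear(256, 10))
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(1000):
    x = torch.randn(32, 128)          # stand-in for a real data batch
    y = torch.randint(0, 10, (32,))
    loss = loss_fn(model(x), y)       # forward: compute through the model
    opt.zero_grad()
    loss.backward()                   # backward: let the model learn
    opt.step()
```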
I'm sure they have, DeepSeek definitely has, code bases that are extremely messy where they're testing these new ideas. Multi-head latent attention probably could have started in something like a Jupyter notebook, where somebody tries something on a few GPUs, and that is really messy.
But the stuff that trains DeepSeek V3 and DeepSeek R1, those libraries, if you were to present them to us, I would guess are extremely high-quality code.
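For context on the multi-head latent attention idea mentioned above, here is a rough notebook-scale sketch of the core trick as the DeepSeek papers describe it at a high level: compress keys and values through a small shared latent, so the cache can hold the latent instead of full per-head keys and values. All dimensions here are made up and details such as the decoupled rotary embeddings are omitted; this is not DeepSeek's actual code.

```python
# Rough sketch of the core of multi-head latent attention: keys/values are
# reconstructed from a small shared latent (the thing you would cache) rather
# than stored per head. Dimensions are assumptions; RoPE handling is omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)   # compress to the cached latent
        self.k_up = nn.Linear(d_latent, d_model)      # expand latent back to keys
        self.v_up = nn.Linear(d_latent, d_model)      # expand latent back to values
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                             # x: (batch, seq, d_model)
        b, s, _ = x.shape
        latent = self.kv_down(x)                      # (batch, seq, d_latent) -> cache this
        q = self.q_proj(x).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(attn.transpose(1, 2).reshape(b, s, -1))

attn = LatentAttention()
y = attn(torch.randn(2, 16, 512))                     # output: (2, 16, 512)
```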