Dylan Patel
For the past few years, the highest-cost human data has been in these preferences, I would say highest cost and highest total usage. So a lot of money has gone to these pairwise comparisons, where you have two model outputs and a human is comparing between the two of them. In earlier years, there was a lot of this instruction-tuning data.
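To make the pairwise-comparison setup concrete, here is a minimal sketch of what one preference record and a standard Bradley-Terry style reward-model loss can look like; the field names, toy content, and reward scores are illustrative assumptions, not taken from any particular lab's pipeline.

```python
import math

# One pairwise preference record: a human saw two model outputs for the same
# prompt and marked which one they preferred. (Field names are illustrative.)
preference_record = {
    "prompt": "Explain why the sky is blue.",
    "chosen": "Sunlight scatters off air molecules; shorter (blue) wavelengths scatter most...",
    "rejected": "The sky is blue because it reflects the color of the ocean.",
}

def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Standard Bradley-Terry style loss for training a reward model on pairwise
    comparisons: push the chosen output's scalar reward above the rejected one's."""
    return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))

# Toy scores a reward model might assign to each completion.
print(bradley_terry_loss(reward_chosen=1.3, reward_rejected=0.2))  # small loss
print(bradley_terry_loss(reward_chosen=0.2, reward_rejected=1.3))  # larger loss
```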
So, creating highly specific examples, like a response to a Reddit question in a domain that you care about. Language models used to struggle on math and code, so you would pay experts in math and code to come up with questions and write detailed answers that were used to train the models.
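For contrast with the pairwise data above, an instruction-tuning example is just a question paired with a single expert-written answer that the model is trained to imitate; a minimal sketch, with purely illustrative content:

```python
# A single supervised instruction-tuning example: an expert-written question
# and a detailed reference answer the model learns to reproduce.
sft_example = {
    "prompt": "Prove that the sum of two even integers is even.",
    "response": (
        "Let a = 2m and b = 2n for integers m and n. "
        "Then a + b = 2m + 2n = 2(m + n), which is divisible by 2, so a + b is even."
    ),
}
```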
Now it is the case that there are many model options that are way better than humans at writing detailed and eloquent answers for things like math and code. They talked about this with the Llama 3 release, where they switched to using Llama 3 405B to write their answers for math and code.
But in their paper, they talk about how they use extensive human preference data, which is something that they haven't gotten AIs to replace. There are other techniques in industry, like constitutional AI, where you use human data for preferences and AI for preferences, and I expect the AI part to scale faster than the human part.
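A hedged sketch of what the AI-feedback side of this can look like: a judge model is asked to pick between two responses, and its verdict is stored in the same pairwise format as the human comparison data. The `judge` callable and the prompt wording are assumptions for illustration, not a specific lab's implementation.

```python
from typing import Callable

def ai_preference_label(
    judge: Callable[[str], str],  # any text-in/text-out LLM call; assumed interface
    prompt: str,
    response_a: str,
    response_b: str,
) -> dict:
    """Ask a judge model which response answers the question better, then record
    the verdict as a pairwise preference, just like human comparison data."""
    judge_prompt = (
        "Which response answers the question better? Reply with 'A' or 'B'.\n\n"
        f"Question: {prompt}\n\nResponse A: {response_a}\n\nResponse B: {response_b}"
    )
    verdict = judge(judge_prompt).strip().upper()
    if verdict.startswith("A"):
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```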
But in the research that we have access to, humans are still in this kind of preference loop.
It's even less prevalent. The remarkable thing about these reasoning results, and especially the DeepSeek R1 paper, is this result that they call DeepSeek R1-Zero, which is: they took one of these pre-trained models, they took DeepSeek V3 base, and then they do this reinforcement learning optimization on verifiable questions, or verifiable rewards, for a lot of questions and a lot of training.
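The "verifiable rewards" piece can be illustrated with a sketch like the following: the reward is computed purely by checking the model's final answer against a known reference, with no learned reward model and no human judgment in the loop. The answer-extraction convention (a `\boxed{...}` marker) is an illustrative assumption here, not necessarily the exact format DeepSeek used.

```python
import re

def verifiable_reward(completion: str, reference_answer: str) -> float:
    """Rule-based reward for a math question: 1.0 if the final answer inside a
    \\boxed{...} marker matches the reference, else 0.0. The chain of thought
    before the answer is not scored at all -- only the checkable final answer."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

# The free-form "wait, let me check this..." reasoning is not graded; only the
# boxed answer is verified.
sample = "Let me try 6*7... wait, let me check this. 6*7 = 42. \\boxed{42}"
print(verifiable_reward(sample, "42"))  # 1.0
```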
And these reasoning behaviors emerge naturally. So these things like, "Wait, let me see," "Wait, let me check this," "Oh, that might be a mistake." They emerge from only having questions and answers. And when you're using the model, the part that you look at is the completion. So in this case, all of that just emerges from this large-scale RL training.