Yeah, on our eval benchmark, DeepSeek V3 is really similar. We have a safety benchmark to understand if it will say harmful things and things like that, and that's what drags its score down most of the way.
Yeah, so we have a suite of 10 evaluations. This is standard practice in post-training: you choose the evaluations you care about. In academia and in smaller labs, you'll have fewer evaluations. In companies, you'll have one domain that you really care about. In frontier labs, you'll have tens, twenties, maybe even a hundred evaluations of specific things.
So we'd choose a representative suite of things that look like chat, precise instruction following, which is things like "respond only in emojis": does the model follow weird constraints like that? Math, code. And you create a suite like this, so safety would be one of 10, in the type of suite where you're asking: what does the broader AI community care about?
And for example, in comparison to DeepSeek, it would be something like: our model's average eval score would be 80 including safety, and about the same without it, while DeepSeek would be around a 79% average without safety, and their safety score would bring it down to around 76 on average.
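To make the arithmetic concrete, here is a minimal sketch of how one weak safety score drags down an unweighted suite average. The benchmark names and numbers are made up for illustration and are only chosen to roughly reproduce the 80 / 79 / 76 figures mentioned above.

```python
# Made-up eval suite scores (not real results) illustrating how a single
# weak safety score pulls down an unweighted average over the suite.
our_model = {
    "chat": 82.0,
    "precise_instruction_following": 78.0,
    "math": 80.0,
    "code": 79.0,
    "safety": 81.0,
}
# Assume the other model matches everywhere except safety.
other_model = {**our_model, "safety": 61.0}

def average(scores: dict[str, float], include_safety: bool = True) -> float:
    """Unweighted mean over the suite, optionally dropping the safety eval."""
    kept = {k: v for k, v in scores.items() if include_safety or k != "safety"}
    return sum(kept.values()) / len(kept)

print(average(our_model))                          # 80.0 with safety
print(average(other_model, include_safety=False))  # 79.75 without safety
print(average(other_model))                        # 76.0 once safety is included
```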
Yeah, so this is something where, internally, it's like: I don't want to win only by how you shape the eval benchmark. People may or may not care about safety in their model. Safety can come downstream. Safety can be handled when you host the model behind an API. Safety is addressed across a spectrum of locations and applications.
So if you want to say that you have the best recipe, you can't just gate it on these things that some people might not want. And this comes down to the timing of progress: we benefit if we can release a model later, because we have more time to learn new techniques, like this RL technique. We had started on it in the fall, and it's now really popular with reasoning models.
The next thing to do for open-source post-training is to scale up verifiers, to scale up data, and to replicate some of DeepSeek's results. It's awesome that we have a paper to draw on, and it makes it a lot easier. And that's the type of thing that's going on across academic and closed frontier research in AI.
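For a sense of what "verifiers" means here: in RL with verifiable rewards, a verifier is just a function that checks a model completion against ground truth and returns a reward. The sketch below is an assumed, simplified math verifier rather than any lab's actual implementation; the "Answer: <number>" format and the extraction regex are illustrative assumptions.

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull a final numeric answer out of a completion.
    Assumes the model was prompted to end with 'Answer: <number>';
    real verifiers use more robust parsing and normalization."""
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", completion)
    return match.group(1) if match else None

def math_verifier(completion: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 if the extracted answer matches, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if float(answer) == float(ground_truth) else 0.0

# This reward would then feed whatever policy-gradient RL loop is in use, e.g.:
# reward = math_verifier(sampled_completion, example["answer"])
```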