Dario Amodei
And there are new types of post-training, like training the model against itself, that are used every day. So it's not just RLHF; it's a bunch of other methods as well. Post-training, I think, is becoming more and more sophisticated.
We observed that as well, by the way. There were a couple of very strong engineers here at Anthropic for whom all previous code models, both produced by us and produced by all the other companies, hadn't really been useful. They said, maybe this is useful to a beginner; it's not useful to me. But
Sonnet 3.5, the original one, was the first time they said, oh my God, this helped me with something that would have taken me hours to do. This is the first model that has actually saved me time. So again, the waterline is rising. And then I think the new Sonnet has been even better. In terms of what it takes, I'll just say it's been across the board.
It's in the pre-training, it's in the post-training, it's in various evaluations that we do. We've observed this as well. And if we go into the details of the benchmark: SWE-bench is basically, since you're a programmer, you'll be familiar with pull requests; pull requests are sort of an atomic unit of work.
You could say, I'm implementing one thing. And so SWE-bench actually gives you a kind of real-world situation where the code base is in its current state, and I'm trying to implement something that's described in language.
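To make that concrete, here is a minimal sketch of what a SWE-bench-style task instance and its pass/fail check could look like. The `TaskInstance` fields and the patch-and-test flow are illustrative assumptions, not the actual benchmark's API.

```python
# Hypothetical sketch of a SWE-bench-style task instance and its pass/fail check.
# Field names and the flow below are illustrative assumptions, not the real harness.
import subprocess
from dataclasses import dataclass

@dataclass
class TaskInstance:
    repo: str               # the repository the task is drawn from
    base_commit: str        # the code base "in its current state"
    problem_statement: str  # the change, described in natural language
    test_command: str       # tests that pass only if the change is implemented

def evaluate(task: TaskInstance, model_patch: str, workdir: str) -> bool:
    """Apply the model's proposed patch and run the task's tests."""
    # Apply the patch the model produced for the described change.
    apply = subprocess.run(
        ["git", "apply", "-"], input=model_patch, text=True, cwd=workdir
    )
    if apply.returncode != 0:
        return False  # the patch didn't even apply cleanly
    # The task counts as solved only if the hidden tests now pass.
    result = subprocess.run(task.test_command, shell=True, cwd=workdir)
    return result.returncode == 0
```

The point of the structure is that success is judged the way a pull request would be: the codebase starts in a real state, and the tests decide whether the described change actually landed.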
We have internal benchmarks where we measure the same thing, and you say, just give the model free rein to do anything, run anything, edit anything. How well is it able to complete these tasks? And it's that benchmark that's gone from it can do it 3% of the time to it can do it about 50% of the time.
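A rough sketch of what "free rein" could mean in such a harness, assuming a hypothetical `model` callable that proposes shell commands; this is not any specific product's agent loop.

```python
# Hypothetical "free rein" agent loop: the model can run and edit anything
# (via shell commands) until it declares itself done or runs out of steps.
# The `model` callable and its action format are assumptions for illustration.
import subprocess

def run_agent(model, task_description: str, test_command: str,
              workdir: str, max_steps: int = 50) -> bool:
    """Let the model act freely, then check whether the task's tests pass."""
    transcript = task_description
    for _ in range(max_steps):
        # e.g. {"type": "run", "cmd": "pytest -x"} or {"type": "done"}
        action = model(transcript)
        if action["type"] == "done":
            break
        result = subprocess.run(action["cmd"], shell=True, cwd=workdir,
                                capture_output=True, text=True)
        # Feed the command's output back so the model can decide its next step.
        transcript += f"\n$ {action['cmd']}\n{result.stdout}{result.stderr}"
    # Completion is judged by the task's tests, not by the model's say-so;
    # the 3% -> ~50% figures are the fraction of tasks where this returns True.
    return subprocess.run(test_command, shell=True, cwd=workdir).returncode == 0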
So I actually do believe that, while you can game benchmarks, if we get to 100% on that benchmark in a way that isn't over-trained or gamed for that particular benchmark, it probably represents a real and serious increase in programming ability.
And I would suspect that if we can get to 90 or 95%, it will represent the ability to autonomously do a significant fraction of software engineering tasks.