Alpin Yukseloglu
Podcast Appearances
And by the end of the benchmark, over 70% means that, you know, if you just reran Coderina and, instead of GPT-4, it had GPT-5.3 Codex, 5.3 Codex would have found over 70% of the critical, fund-draining bugs that human auditors found.
Of the critical ones, with some constraints, like, for example, we didn't go all the way back in history.
We started past the knowledge cutoff because we want to avoid contamination.
I see.
Yeah, something like that.
Although these things are highly nonlinear.
Like, for example, there are dumb bugs that can lead to losing all the money in the contract, but that are actually not that hard to find.
So that's why I think the...
There's more in the paper that is notable: there wasn't just, like, one trick that the model figured out and, like, got to 70%.
It wasn't like all reentrancy, right?
This is a very diverse set of bugs that it was able to find.
But yes, fundamentally, the models are getting, you know, very close to being as good as the best human auditors.
I think it's an important, I mean, maybe the zoomed-out point is that crypto, in its history, has been very stigmatized and very illegible to the AI labs.
So the fact that there hasn't already been a massive push for crypto-related evaluations is kind of absurd, because the labs today are entirely bottlenecked on evaluations that are verifiable and economically important.
And crypto ticks both of those boxes.
So...
So I think it took Paradigm throwing a little bit of our weight around to get this through into the labs.
And I think, yes, our firm hope is that this will start the flywheel of labs paying more attention to this technology.
And we're going to continue doing work on this front as well.