Jeffrey Ladish
And so we can spin up many versions of the models
using OpenAI's infrastructure, so that we can do these in-depth tests.
It also means we can run more than one, right?
You want to get a large sample size to get a sense of how robust some of the findings are.
That's one type of experiment.
Another type of experiment we do is we basically want to see how good are these models at different skills.
So one of the things we look at is how good are the models at hacking?
So we will basically
take real cybersecurity competitions, and we'll compete using the model alone.
We recently did this with GPT-5, and our team, which was just GPT-5, ranked 25th out of 400, so better than roughly 95% of the pro-level hackers at this hacking competition.
And all we had to do, basically, was use ChatGPT.com for this.
We went to ChatGPT, used the GPT-5 Pro model, and just pasted in: here's the problem, here's the code.
And then the model wrote a bunch of code, did a bunch of math, and figured out how to solve these complex hacking challenges.
So that's another type of test we do.
AIs are very hard to understand, in part because they talk like us, so they seem like us.
But they're very different from us.
So in some ways, they're kind of like idiot savants right now, in that they are extremely knowledgeable.
They know all sorts of things about computer systems, about Pokemon, about whatever.
But in terms of their agentic capabilities, how autonomous they can be, they're still kind of like kids.
They're kind of like savant kids, but they're growing up.