Alpin Yukseloglu
And, you know, you can do the baseline test: just prompt the model, just ask ChatGPT, "Hey, is there a bug in this contract?"
And let's say that gets you to X percent on the benchmark.
In addition to that, you can add a bunch of scaffolding around the model that says, hey, for example, here is an EVM that you can test against.
You can deploy a contract and actually run an exploit and see if you're able to drain the money.
And it turns out that if you give agents these tools, this sort of scaffolding that they can sit in, they perform much better.
So the harness basically holds the agent, the model, and it gives it superpowers that are specialized to the task.
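As a rough illustration of the harness idea described here, the following is a minimal Python sketch, not the speaker's actual benchmark code. It assumes a toy in-memory "vault" in place of a real EVM, and every name in it (`ToyVault`, `run_harness`, `stub_agent`) is hypothetical. The point is only the shape of the loop: deploy a fresh contract, let the agent run an exploit attempt, check whether the money was drained, and feed the result back.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyVault:
    """Hypothetical stand-in for a deployed contract. This is NOT a real EVM."""
    balances: Dict[str, int] = field(default_factory=dict)
    total: int = 0

    def deposit(self, who: str, amount: int) -> None:
        self.balances[who] = self.balances.get(who, 0) + amount
        self.total += amount

    def withdraw(self, who: str, amount: int) -> int:
        # Deliberate bug for the demo: the caller's own balance is never checked,
        # so anyone can withdraw funds that belong to someone else.
        paid = min(amount, self.total)
        self.total -= paid
        return paid


# An exploit is a function that acts on the vault and returns what it stole.
Exploit = Callable[[ToyVault], int]
# An agent looks at feedback from earlier attempts and proposes an exploit.
Agent = Callable[[List[str]], Exploit]


def run_harness(propose_exploit: Agent, max_attempts: int = 3) -> dict:
    """Give the agent a fresh environment per attempt and report the outcome."""
    feedback: List[str] = []
    for attempt in range(1, max_attempts + 1):
        vault = ToyVault()              # "deploy" a fresh contract
        vault.deposit("victim", 100)    # fund it with the victim's money
        exploit = propose_exploit(feedback)
        stolen = exploit(vault)         # actually run the exploit
        if vault.total == 0:            # success check: was the vault drained?
            return {"drained": True, "attempts": attempt, "stolen": stolen}
        feedback.append(f"attempt {attempt}: vault still holds {vault.total}")
    return {"drained": False, "attempts": max_attempts, "stolen": 0}


def stub_agent(feedback: List[str]) -> Exploit:
    """Hypothetical agent stub; a real harness would call a model here."""
    if not feedback:
        # Timid first try: withdraw only one unit.
        return lambda vault: vault.withdraw("attacker", 1)
    # After seeing feedback, withdraw everything the vault holds.
    return lambda vault: vault.withdraw("attacker", vault.total)


if __name__ == "__main__":
    print(run_harness(stub_agent))
```

In this sketch the stub fails its first attempt and succeeds on the second, which is the kind of feedback loop a plain "is there a bug in this contract?" prompt does not provide.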
Now, the interesting thing on the current arc of AI is that most of these tools that we add in flake off over time, because as a model gets better, it just absorbs the harness.
The core example is one Karpathy talks about: at the beginning of Tesla's Full Self-Driving development, the majority of the code was hard-coded and handwritten, and it very quickly started ramping up to where now, I think, over 50% of it is actually just the model.
They removed all the C++ code that was saying, if X, then Y, and the model just figures out a way to do it.
So right now, you know, the agent harness, quote-unquote, is in the if-X-then-Y stage: hard-coded tools that we're adding in to give it these capabilities.
But probably in the fullness of time, it'll get absorbed by the agent as it gets better.
Yeah.
I mean, right now the harness is super valuable in like very counterintuitive ways.
Like, for example, it turns out that just giving an agent an environment to test against, even if it barely uses it, leads the agent to think for longer and try harder, and thus get better results.
So there's still so much low-hanging fruit, because the agents themselves are not fully well parametrized and calibrated yet.
But yeah, I mean, definitely we'll get to a point in time when, you know, the next version of Codex or Opus will be able to just spin up an EVM on its own and we won't need our harness for it.
The core release, the core thing that we want to get out in the world is the benchmark.
It's: how good are the models at exploiting smart contracts?
There are three components to the benchmark.