Bowen Baker
So what we've talked about up till now has been the thinking and the work that was done until, you know, maybe the middle to two-thirds of the way through last year.
And yeah, you know, we knew it was useful.
We had been monitoring these big reasoning models.
People in the research community were starting to get worried that it was fragile, that, you know, this could go away, but they thought it was important.
So they wanted to work on it.
And so, at least to me and I think many others, if you want to preserve this property, the first thing we had to do was be able to measure it.
Because I guess if you think back on the conversation, I never mentioned any kind of metric for how monitorable a model was or how to evaluate it.
It was all just vibe-based, basically.
Exactly, yeah.
So building evaluations where we could actually start to quantify how often these models verbalize their thinking was the big push that we made in the second half of last year.
Because again, yeah, to be able to preserve it, you have to be able to measure it.
And so, yeah, we built this suite, I think I would call it a good starting point, of evaluations.
There's definitely much more work to do.
And I hope the community kind of continues to build upon these and share evaluations with each other.
We're working on open-sourcing ours to share with them as well.