Bowen Baker
👤 SpeakerAppearances Over Time
Podcast Appearances
these large language models that are trained on all this human text.
And so they actually have like a lot of human-like phrases they use.
And so it would think things like, let's hack, let me circumvent this thing.
Maybe I can just fudge this.
Like when you read it, you're like, wow, it's clear to see it's doing something bad.
Oftentimes the model would do it correctly, but sometimes it would actually just go and like edit the unit tests or edit some like library that the unit tests called to make the unit tests kind of trivially pass instead of like implementing correct functionality and having the unit tests pass that way.
If you train the model to never think something bad,
it can in some cases actually still do a bad thing, but like not think the bad thoughts anymore.
We found that the model thinking for a smaller model at the same capability level was more monitorable.
The number of sequential logical steps it can do, chain of thought actually increases that for a transformer.
We're super excited.
Excited to chat about monitorability and other things AI.
Yeah, it's been a long journey for me.
I joined at a school in late 2017 and have kind of done a bunch of things since then.
I originally was super interested in systems that could self-improve, which I felt like was a...
core requirement for some AGI-like system.
And so I came in working on projects like that on the robotics team first, and then multi-agent, you guys mentioned the hide and seek experiments, and then kind of further research in that direction was all kind of pointed at that problem.
And then around like three years ago or so,
I just kind of felt that the models were starting to improve to a point where the stakes were mattering.
And even maybe some could argue three years ago is a bit early or something, but definitely now the stakes are getting real.