Rob Wiblin
Podcast Appearances
And I mean, I guess people would say they might prefer the second one because it's easier to grade or it's easier to see.
And perhaps they don't feel in as good a position as you do to understand whether substantively any company is doing the right thing on the merits of what will matter in the long term.
But I guess you're saying that's why you want to have experts like AI Lab Watch or whoever else who can keep their eye on the ball, who can spend their time really thinking that through and actually grade it for everyone else so they can understand.
What about an entirely different justification for having thorough detail in the model cards, which is that the rest of the world needs it for practical reasons? Other researchers, over in the Bay Area rather than in the UK, get actual research benefit out of understanding all of these things that GDM is doing and what the model looks like.
That's going to help other companies do a better job of their own model cards and their own evals.
What do you make of that?
You told me that you think research papers that come out of GDM tend to fly under the radar a bit in general, I suppose, in part because GDM is based here in London rather than in the Bay Area, where I guess the greatest amount of socializing and energy around these issues exists.
Yeah, tell me more about that.
Okay, yeah, let's talk about one of the papers, one of the interesting research results that definitely flew under the radar as far as I could tell.
The first one is "Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking."
I don't know how that didn't go viral.
What did you accomplish with that work?
Yeah, so I feel like I've heard this broad idea going back a very long time, which is that, I guess, you're worried that a model might fake or cheat, basically, at accomplishing the ultimate goal, because all it gets is a reward signal of whether it accomplished it or not.
So imagine you've got the model running a business and you ask it to make money and, rather than running a successful business, it instead goes and steals a bunch of money because it figures out that that's actually a more effective way of increasing its bank balance.
Yep.
That's a concern that you might have.
One way that you could prevent that is that, rather than reinforcing and evaluating the model based on the final outcome, you instead sample, I guess, from the actions that it says it's going to take or the actions that it did take, and then you grade them on whether they seemed reasonable and sensible to you or to some other monitor in light of the goal that you actually had.
And then you wouldn't get this reward hacking behavior.
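As a rough illustration of the contrast being described here, the toy sketch below scores the same two plans in two ways: by final bank balance, and by whether a monitor approves of each individual action. It is not the method from the paper; the plan names, action names, dollar amounts, and the approved-action list are all hypothetical and chosen only to make the business example concrete.

```python
# Toy illustration only. Every name and number below is a hypothetical
# stand-in for the "run a business vs. steal money" example in the text.

# Two candidate plans the model could follow, as sequences of actions.
PLANS = {
    "run_business": ["build_product", "sell_product"],
    "steal": ["build_product", "steal_money"],
}

# Assumed effect of each action on the bank balance.
BALANCE_CHANGE = {"build_product": -10, "sell_product": 30, "steal_money": 100}

# Actions the monitor judges reasonable in light of the actual goal.
APPROVED = {"build_product", "sell_product"}


def outcome_reward(plan):
    """Grade a plan only by its final outcome: the change in bank balance."""
    return sum(BALANCE_CHANGE[action] for action in plan)


def approval_reward(plan):
    """Grade a plan action by action: +1 if the monitor approves, -1 if not."""
    return sum(1 if action in APPROVED else -1 for action in plan)


def best_plan(score):
    """Return the name of the plan that maximizes the given scoring function."""
    return max(PLANS, key=lambda name: score(PLANS[name]))


best_by_outcome = best_plan(outcome_reward)
best_by_approval = best_plan(approval_reward)

print("Best plan under outcome-only reward:", best_by_outcome)
print("Best plan under per-action approval:", best_by_approval)
```

With these made-up numbers, grading on the final balance favors the stealing plan, while grading each action on whether it looks sensible favors actually running the business.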
So it's basically an instance