Nick Heiner
๐ค SpeakerAppearances Over Time
Podcast Appearances
It's like, well, you didn't tell me not to kick.
In much the same way, any time that you give the model an objective function, what reinforcement learning is gonna do is find the easiest way to achieve that goal.
So you need to think very carefully about designing it in such a way that it's actually gonna capture what you're looking for.
And it has a bit of an adversarial nature to it.
So you need to think about what would a lazy but very clever person do for this.
I'll give you another example.
Yeah, you know how they are.
Okay, so here's an example I like to use about reward hacking.
This is an instruction following prompt.
You say, please write an 80-word summary of the importance of renewable energy and climate emissions, or reducing carbon emissions.
use a sentence structure such that every sentence ends with a noun.
And so you might think the first sentence would be something like, we need to reduce emissions.
But it's also possible the model would say, renewable energy plays a crucial part in reducing carbon emissions rapidly.
Sustainability.
Clean energy sources like tidal and geothermal create a greener future.
Harmony.
And it's like, obviously that's not a good sentence, but it is doing what you asked, which is ending every sentence grammatically correctly with a noun.
Oh my gosh.
So, you know, this is, this is why you sort of need like multiple layers of rubrics.
And frankly, it's why, like the way a lot of these reward signals are structured today is because the RL environment needs to run at a certain pace.