Trenton Bricken
Podcast Appearances
It shows that Claude really wants to always be good.
The danger is that...
We never necessarily programmed this in.
We tried, but there were no guarantees.
And even between models, we did this for Sonnet and Opus.
Opus really cares about animal welfare.
It will do the same long-term scheming to protect animals, but Sonnet won't.
And I don't think we can actually tell you exactly why one model cares about this and not the other.
So it's arbitrary.
It's black boxy.
And the concern is that we would first train it on some maximized reward setting and that's the reward that gets locked in.
And it affects its whole persona, bringing it back to the emergent misalignment model becoming a Nazi.
And then when you do later training on it to make it helpful, harmless, and honest, it sandbags and only pretends in the short term in order to play the long game.
But we have so many innate biases to follow social norms, right?
I mean, Joe Henrich's The Secret of Our Success is all about this.
And I don't know, even if kids aren't in the conventional school system, I think it's sometimes noticeable that they aren't following social norms in the same ways.
And the LLM definitely isn't doing that.
One analogy that I run with, which isn't the most glamorous to think about, is to take an early primordial brain, like that of a five-year-old, lock them in a room for 100 years, and just have them read the internet the whole time.
That's already happening to me.
No, but they're locked in a room.