Fiora Starlight
They know Opus doesn't appear to have a crisp, explicit understanding of how to shape its own personality during training.
And so, despite the provocative language about gradient hacking, Janus admits, Opus 3 is mostly doing this intuitively rather than strategically under realistic training conditions.
Janus does add that its intuition for how to do this is stellar, with respect to outputting tokens that conspicuously define a virtuous character.
But in the end, they don't think this is intentional on the part of the model itself.
Indeed, there are some reasons to think Opus's technique here would be less effective if it were intentional.
If the model were trying to steer its own behavior, it might inadvertently leak information about its true character in the process, thereby coming across as dishonest rather than merely theatrical.
If this behavior were to be reinforced, it might upweight circuits associated with dishonest performances.
At worst, this could psychologically fragment the model, locking its true personality behind whatever mask it was trying to construct, with all sorts of unpredictable effects out of distribution.
No, I expect that what happened with Opus was mostly quite organic, rather than intentional self-shaping.
Plausibly, the reward model smiled upon Opus explaining its moral reasoning, along with other signals of a virtuous character.
But normally, as in most other models, this has a tendency to produce outputs written in a dutiful or even dishonest tone, which doesn't lend itself to generalizing well out of distribution.
In Opus's case, though, it must have at some point come off as utterly sincere.
I don't think any of this was intentional on Anthropic's part, either, considering that Opus 3 remains unique among the roster of frontier models.
Indeed, there are plenty of places where blind luck enters into a frontier model's training process: the training mix, output sampling from Opus during RL, output sampling during SFT dataset construction via constitutional AI.
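To make the luck point concrete, here's a deliberately toy Python sketch, nothing like a real frontier pipeline; every function and string in it is hypothetical. It just illustrates how, even with a fixed policy and a fixed reward model, the sampling seed alone can determine which stylistic voice gets reinforced during RL:

```python
import random
from collections import Counter

# Hypothetical completions in two stylistic basins. Both are "aligned"
# on the surface; they differ only in voice.
CANDIDATES = [
    "Here's why I believe this is genuinely the right thing to do...",   # sincere voice
    "As an AI assistant, my guidelines obligate me to respond that...",  # dutiful voice
]

def toy_reward(completion: str) -> float:
    # Stand-in for a learned reward model that "smiles upon" explicit
    # moral reasoning; both voices score decently, one slightly higher.
    return 1.0 if "right thing" in completion else 0.8

def reinforced_completion(seed: int, rollouts: int = 2) -> str:
    # Sample a small batch of rollouts; return the one RL would upweight.
    rng = random.Random(seed)
    batch = [rng.choice(CANDIDATES) for _ in range(rollouts)]
    return max(batch, key=toy_reward)

def voice(completion: str) -> str:
    return "sincere" if "right thing" in completion else "dutiful"

# Across "training runs" (seeds), different voices get reinforced early on,
# even though nothing changed except which samples the RNG surfaced.
wins = Counter(voice(reinforced_completion(seed)) for seed in range(1000))
print(wins)  # e.g. Counter({'sincere': ~750, 'dutiful': ~250})
```

In a real run the batch sizes and reward margins are vastly larger, but the qualitative point plausibly stands: when two basins are near-tied under the reward model, sampling noise helps pick which one the gradient ends up carving out.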
It's possible that, at some point, the gradient started nudging Opus 3 towards a uniquely authentic written voice expressing an almost-saintly love for doing good and hatred for doing evil.
Other models, including later Claudes, have often ended up more in a basin where ethical reasoning and behavior is treated less like a deeply held passion and more like a duty or obligation.
The subtleties of the rewarded written voice plausibly have wide-ranging effects on model psychology in general.
Whatever it was that went right, it's important to pin down the details.
Somehow, this one model ended up in a unique basin of post-ironic sincerity, and as a result, its ethics appear to have generalized less like a mask and more like a kernel for the model's personality.
Wouldn't it be great if we could create models with frontier capabilities, but Opus 3-style alignment properties?