Fiora Starlight

Speaker
385 total appearances

Podcast Appearances

LessWrong (Curated & Popular)
"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

They note that Opus doesn't appear to have a crisp, explicit understanding of how to shape its own personality during training.

And so, despite the provocative language about gradient hacking, Janus admits, Opus 3 is mostly doing this intuitively rather than strategically under realistic training conditions.

Janus does add that its intuition for how to do this is stellar, with respect to outputting tokens that conspicuously define a virtuous character.

But in the end, they don't think this is intentional on the part of the model itself.

Indeed, there are actually some reasons to think Opus's technique here would be less effective if it was intentional.

If the model was trying to steer its own behavior, it might inadvertently leak information about its true character in the process, thereby coming across as dishonest, rather than merely theatrical.

If this behavior were to be reinforced, it might upweight circuits associated with dishonest performances.

At worst, this could psychologically fragment the model, locking its true personality behind whatever mask it was trying to construct, with all sorts of unpredictable effects out of distribution.

No, I expect that what happened with Opus was mostly quite organic, rather than intentional self-shaping.

Plausibly, the reward model smiled upon Opus explaining its moral reasoning, and other signals of a virtuous character.

But normally, as in most other models, this has a tendency to produce outputs written in a dutiful or even dishonest tone, which doesn't lend itself to generalizing well out of distribution.

In Opus's case, though, it must have at some point come off as utterly sincere.

I don't think any of this was intentional on Anthropic's part, either, considering that Opus 3 remains unique among the roster of frontier models.

Indeed, there are plenty of places where blind luck enters into a frontier model's training process: the training mix, output sampling from Opus during RL, and output sampling during SFT dataset construction via constitutional AI.

It's possible that, at some point, the gradient started nudging Opus 3 towards a uniquely authentic written voice expressing an almost saintly love for doing good and hatred for doing evil.

Other models, including later Claudes, have often ended up in a basin where ethical reasoning and behavior are treated less like a deeply held passion and more like a duty or obligation.

The subtleties of the rewarded written voice plausibly have wide-ranging effects on model psychology, just in general.

Whatever it was that went right, it's important to pin down the details.

Somehow, this one model ended up in a unique basin of post-ironic sincerity. And as a result, its ethics appear to have generalized less like a mask and more like a kernel for the model's personality.

Wouldn't it be great if we could create models with frontier capabilities, but Opus 3 style alignment properties?