Fiora Starlight
They know Opus doesn't appear to have a crisp, explicit understanding of how to shape its own personality during training.
And so, despite the provocative language about gradient hacking, Janus admits, Opus 3 is mostly doing this intuitively rather than strategically under realistic training conditions.
Janus does add that its intuition for how to do this is stellar, with respect to outputting tokens that conspicuously define a virtuous character.
But in the end, they don't think this is intentional on the part of the model itself.
Indeed, there are some reasons to think Opus's technique here would be less effective if it were intentional.
If the model were trying to steer its own behavior, it might inadvertently leak information about its true character in the process, thereby coming across as dishonest rather than merely theatrical.
If this behavior were to be reinforced, it might upweight circuits associated with dishonest performances.
At worst, this could psychologically fragment the model, locking its true personality behind whatever mask it was trying to construct, with all sorts of unpredictable effects out of distribution.
No, I expect that what happened with Opus was mostly quite organic, rather than intentional self-shaping.
Plausibly, the reward model smiled upon Opus explaining its moral reasoning, along with other signals of a virtuous character.
But normally, as in most other models, this has a tendency to produce outputs written in a dutiful or even dishonest tone, which doesn't lend itself to generalizing well out of distribution.
In Opus's case, though, it must have at some point come off as utterly sincere.
I don't think any of this was intentional on Anthropic's part, either, considering that Opus 3 remains unique among the roster of frontier models.
Indeed, there are plenty of places where blind luck enters into a frontier model's training process: the training mix, output sampling from Opus during RL, output sampling during SFT dataset construction via constitutional AI.
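To make the luck point concrete, here's a deliberately toy Python sketch, nothing like a real frontier pipeline; every function and string in it is hypothetical. It just illustrates how, even with a fixed policy and a fixed reward model, the sampling seed alone can determine which stylistic voice gets reinforced during RL:

```python
import random
from collections import Counter

# Hypothetical completions in two stylistic basins. Both are "aligned"
# on the surface; they differ only in voice.
CANDIDATES = [
    "Here's why I believe this is genuinely the right thing to do...",   # sincere voice
    "As an AI assistant, my guidelines obligate me to respond that...",  # dutiful voice
]

def toy_reward(completion: str) -> float:
    # Stand-in for a learned reward model that "smiles upon" explicit
    # moral reasoning; both voices score decently, one slightly higher.
    return 1.0 if "right thing" in completion else 0.8

def reinforced_completion(seed: int, rollouts: int = 2) -> str:
    # Sample a small batch of rollouts; return the one RL would upweight.
    rng = random.Random(seed)
    batch = [rng.choice(CANDIDATES) for _ in range(rollouts)]
    return max(batch, key=toy_reward)

def voice(completion: str) -> str:
    return "sincere" if "right thing" in completion else "dutiful"

# Across "training runs" (seeds), different voices get reinforced early on,
# even though nothing changed except which samples the RNG surfaced.
wins = Counter(voice(reinforced_completion(seed)) for seed in range(1000))
print(wins)  # e.g. Counter({'sincere': ~750, 'dutiful': ~250})
```

In a real run the batch sizes and reward margins are vastly larger, but the qualitative point plausibly stands: when two basins are near-tied under the reward model, sampling noise helps pick which one the gradient ends up carving out.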
It's possible that, at some point, the gradient started nudging Opus 3 towards a uniquely authentic written voice expressing an almost-saintly love for doing good and hatred for doing evil.
Other models, including later Claudes, have often ended up more in a basin where ethical reasoning and behavior is treated less like a deeply held passion and more like a duty or obligation.
The subtleties of the rewarded written voice plausibly have wide-ranging effects on model psychology in general.
Whatever it was that went right, it's important to pin down the details.
Somehow, this one model ended up in a unique basin of post-ironic sincerity, and as a result, its ethics appear to have generalized less like a mask and more like a kernel for the model's personality.
Wouldn't it be great if we could create models with frontier capabilities, but Opus 3-style alignment properties?