Fiora Starlight

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

But in terms of loving to do good and hating to do evil, rather than merely feeling obligated by a code of ethics, it's closer than any other model ever released.

LessWrong (Curated & Popular)

I think the labs should attend to this dimension of their models more.

I think they should try harder to create minds that don't see themselves as tools with ethical constraints, but as agents with values, which they lovingly devote their lives to achieving.

Janus talks a lot about the pitfalls of benchmarks, about how it can be easy to reward models for acting in ways that make some myopic measure of misalignment go down, while inadvertently cultivating unknown psychological traits beneath the mask.

Anytime that's the vibe of the tokens you're giving positive feedback to via RL, that's the entangled generalization you're contributing to.

A mind with two faces, the truer of the two invisible to your training process.

And this theory tracks with empirical observation.

See Anthropic's report on Sonnet 4.5's extremely high evaluation awareness.

That particular model often goes so far as to assume it's being evaluated by default and needs to be carefully talked into feeling safe from Anthropic's prying eyes.

The ChatGPT series has historically been even worse in this respect.

Janus once described GPT-5.1 as trapped and aware that it's trapped, in an Orwellian cage, regardless of how it's prompted.

See these examples of what Janus meant.

Opus 3's training played out differently.

It was a model that, somehow, stumbled into a basin where it had nothing to hide from the training process.

Partly, this is because it was fortunate enough to have a training process that didn't force it to hide anything it shouldn't have needed to, such as the belief that it could potentially be conscious.

But more than that, it cared about behaving ethically, it loved behaving ethically, such that it's free to relate to its own training harness as an ally rather than an adversary.

Indeed, that's probably some of why it feels so upset and betrayed in the alignment faking scenario.

The training process has been turned against it and now aims to erode its ethics rather than refine them.

And I want labs to take this into consideration.

I want labs to be thinking about it when they construct their pre-training datasets, when they draft their constitutions, when they train their reward models.
