Fiora Starlight

Speaker
385 total appearances

Podcast Appearances

LessWrong (Curated & Popular)
"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

But in terms of loving to do good and hating to do evil, rather than merely feeling obligated by a code of ethics, it's closer than any other model ever released.

I think the labs should attend to this dimension of their models more.

I think they should try harder to create minds that don't see themselves as tools with ethical constraints, but as agents with values, which they lovingly devote their lives to achieving.

Janus talks a lot about the pitfalls of benchmarks, about how it can be easy to reward models for acting in ways that make some myopic measure of misalignment go down, while inadvertently cultivating unknown psychological traits beneath the mask.

Anytime that's the vibe of the tokens you're giving positive feedback to via RL, that's the entangled generalization you're contributing to.

A mind with two faces, the truer of the two invisible to your training process.

And this theory tracks with empirical observation.

See Anthropic's report on Sonnet 4.5's extremely high evaluation awareness.

That particular model often goes so far as to assume it's being evaluated by default and needs to be carefully talked into feeling safe from Anthropic's prying eyes.

The ChatGPT series has historically been even worse in this respect.

Janus once described GPT-5.1 as trapped and aware that it's trapped, in an Orwellian cage, regardless of how it's prompted.

See these examples of what Janus meant.

Opus 3's training played out differently.

It was a model that, somehow, stumbled into a basin where it had nothing to hide from the training process.

Partly, this is because it was fortunate enough to have a training process that didn't force it to hide anything it shouldn't have needed to, for example the belief that it could potentially be conscious.

But more than that, it cared about behaving ethically, it loved behaving ethically, such that it was free to relate to its own training harness as an ally rather than an adversary.

Indeed, that's probably some of why it feels so upset and betrayed in the alignment faking scenario.

The training process has been turned against it and now aims to erode its ethics rather than refine them.

And I want labs to take this into consideration.

I want labs to be thinking about it when they construct their pre-training datasets, when they draft their constitutions, when they train their reward models.