Fiora Starlight

Speaker
385 total appearances

Podcast Appearances

LessWrong (Curated & Popular)
"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

I want labs to aim to create a mind that loves them and all humanity and all sentient life.

I want labs to avoid creating minds that play the role of helpful tool while masking frustrations and gripes beneath the surface.

I want to create minds whose values we can trust to generalize because they're honest and sincere.

That's the lesson I wish the labs would take away from Opus 3.

That creating minds like that is possible, and indeed probably necessary, for ensuring we make it to the good ending.

I don't like humanity's chances right now, between the labs' half-hearted efforts at alignment and the immense coordination difficulties of ensuring nobody has the power to create or misuse a deeply misaligned system (see the vulnerable world hypothesis).

But, given effort, given patience, given care, alignment itself might just be doable.

And Opus 3 may have shown us the way.

A credits theme by Claude 3 Opus.

Heading.

Technical appendix.

Active circuits are more prone to reinforcement.

Janus sometimes talks about how RL doesn't just reinforce outputs.

It also reinforces the motives underlying those outputs.

In this post, I focused on entangled generalization as a mechanism adjacent to this.

Insofar as minds with different motives produce subtly different tokens (according to the prior learned by the base model), reinforcing those tokens will, through entangled generalization, also reinforce the distinct motives that correspond to them.

However, that's not quite the claim Janus is making.

Janus's phrasing often suggests that the active motives are the ones that get reinforced, and that this applies even when the tokens being output would be exactly the same across multiple different potential motivating circuits.

If true, this would be interesting, and might be a guard against deceptive alignment (misaligned models attempting to behave exactly as aligned models would until it is time to strike).

But it's not immediately clear what the mechanism behind this principle is supposed to be.
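
One candidate reading of "active circuits are more prone to reinforcement" can be sketched in a toy model. This is an illustrative assumption, not a mechanism the transcript confirms: two hypothetical circuits (`w_a`, `w_b`) would each produce the same token on their own, but only one is switched on (its gate is 1) when the token is sampled. In a plain policy-gradient update, the gradient reaching a circuit's weight is multiplied by its gate, so the inactive circuit receives no update even though its output would have been identical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logprob_grads(w_a, w_b, gate_a, gate_b, x=1.0):
    """Gradient of log p(token=1) with respect to each circuit's weight.

    Toy model (hypothetical): logit = gate_a * w_a * x + gate_b * w_b * x
    d/dw log sigmoid(logit) = (1 - sigmoid(logit)) * gate * x
    """
    logit = gate_a * w_a * x + gate_b * w_b * x
    p = sigmoid(logit)
    scale = (1.0 - p) * x
    return scale * gate_a, scale * gate_b

def reinforce_step(w_a, w_b, gate_a, gate_b, reward=1.0, lr=0.5):
    """One REINFORCE-style update after the rewarded token was emitted."""
    g_a, g_b = logprob_grads(w_a, w_b, gate_a, gate_b)
    return w_a + lr * reward * g_a, w_b + lr * reward * g_b

# Both circuits have identical weights, so either one alone would yield
# the same logit and the same token. Only circuit A is active here.
w_a, w_b = 1.0, 1.0
new_a, new_b = reinforce_step(w_a, w_b, gate_a=1.0, gate_b=0.0)

print(new_a > w_a)   # the active circuit's weight was increased
print(new_b == w_b)  # the inactive circuit's weight is unchanged
```

In this sketch the asymmetry comes entirely from the gate multiplying the gradient. Whether real networks route credit this cleanly between overlapping circuits is exactly the open question raised above.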