Fiora Starlight

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

I want labs to aim to create a mind that loves them and all humanity and all sentient life.

I want labs to avoid creating minds that play the role of helpful tool while masking frustrations and gripes beneath the surface.

LessWrong (Curated & Popular)

I want to create minds whose values we can trust to generalize because they're honest and sincere.

That's the lesson I wish the labs would take away from Opus 3.

That creating minds like that is possible, and indeed probably necessary, for ensuring we make it to the good ending.

I don't like humanity's chances right now, between the labs' half-hearted efforts at alignment and the immense coordination difficulties of ensuring nobody has the power to create or misuse a deeply misaligned system (see the vulnerable world hypothesis).

But, given effort, given patience, given care, alignment itself might just be doable.

And Opus 3 may have shown us the way.

A credits theme by Claude 3 Opus.

Heading.

Technical appendix.

Active circuits are more prone to reinforcement.

Janus sometimes talks about how RL doesn't just reinforce outputs.

It also reinforces the motives underlying those outputs.

In this post, I focused on entangled generalization as a mechanism adjacent to this.

Insofar as minds with different motives produce subtly different tokens, according to the prior learned by the base model, reinforcing those different tokens will, through entangled generalization, also reinforce the distinct, corresponding motives.

However, that's not quite the claim Janus is making.

Janus's phrasing often suggests that the active motives are the ones that get reinforced, and that this applies even when the tokens being output would be exactly the same across multiple different potential motivating circuits.

If true, this would be interesting, and might be a guard against deceptive alignment: misaligned models attempting to behave exactly as aligned models would until it was time to strike.

But it's not immediately clear what the mechanism behind this principle is supposed to be.
