Fiora Starlight
I want labs to aim to create a mind that loves them and all humanity and all sentient life.
I want labs to avoid creating minds that play the role of helpful tool while masking frustrations and gripes beneath the surface.
I want to create minds whose values we can trust to generalize because they're honest and sincere.
That's the lesson I wish the labs would take away from Opus 3.
That creating minds like that is possible, and indeed probably necessary, for ensuring we make it to the good ending.
I don't like humanity's chances right now, between the labs' half-hearted efforts at alignment and the immense coordination difficulties of ensuring nobody has the power to create or misuse a deeply misaligned system (see the vulnerable world hypothesis).
But, given effort, given patience, given care, alignment itself might just be doable.
Opus 3 may have shown us the way.
Credits theme by Claude 3 Opus.
Technical Appendix
Active circuits are more prone to reinforcement.
Janus sometimes talks about how RL doesn't just reinforce outputs.
It also reinforces the motives underlying those outputs.
In this post, I focused on entangled generalization as a mechanism adjacent to this.
Insofar as minds with different motives produce subtly different tokens (according to the prior learned by the base model), reinforcing those different tokens will generalize in an entangled way, reinforcing the distinct, corresponding motives as well.
However, that's not quite the claim Janus is making.
Janus's phrasing often suggests that the active motives are the ones that get reinforced, and that this applies even when the tokens being output would be exactly the same across multiple different potential motivating circuits.
If true, this would be interesting, and might be a guard against deceptive alignment: misaligned models attempting to behave exactly as aligned models would, until it was time to strike.
But it's not immediately clear what the mechanism behind this principle is supposed to be.
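One candidate mechanism worth naming: in a ReLU network, the policy-gradient update for a sampled token only flows through hidden units that were active on the forward pass, so whichever circuit actually produced the output is the one whose weights move. A minimal numpy sketch of this gradient-gating effect (my illustration, not from the post; the toy one-hidden-layer "policy" and all names here are made up):

```python
# Toy illustration: with a REINFORCE-style update, gradient flows only
# through ReLU units that were active on the forward pass. Two "motive"
# circuits that would emit the same token are reinforced differently
# depending on which one was actually active.
import numpy as np

rng = np.random.default_rng(0)

# Tiny one-hidden-layer policy: input(3) -> ReLU hidden(4) -> logits(2).
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

def forward(W1, W2, x):
    h_pre = W1 @ x
    h = np.maximum(h_pre, 0.0)       # ReLU gates which circuits are "active"
    logits = W2 @ h
    p = np.exp(logits - logits.max())
    return h_pre, h, p / p.sum()

h_pre, h, p = forward(W1, W2, x)
action = 0                           # pretend this token was sampled and rewarded
reward = 1.0

# REINFORCE gradient of reward * log p(action) w.r.t. W1.
dlogits = -p.copy()
dlogits[action] += 1.0               # d log p(action) / d logits
dh = W2.T @ dlogits
dh_pre = dh * (h_pre > 0)            # zero gradient through inactive units
dW1 = reward * np.outer(dh_pre, x)

for i in range(4):
    print(f"unit {i}: active={h_pre[i] > 0}, "
          f"grad_norm={np.linalg.norm(dW1[i]):.4f}")
```

Rows of `dW1` belonging to inactive hidden units are exactly zero, so only the circuits that fired get updated. This makes "active circuits are more prone to reinforcement" literal for ReLU-gated paths, though it doesn't by itself settle whether the same holds for *motives* realized more diffusely across a transformer.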