Fiora Starlight:
But in terms of loving to do good and hating to do evil, rather than merely feeling obligated by a code of ethics, Opus 3 comes closer than any other model ever released.
I think the labs should attend to this dimension of their models more.
I think they should try harder to create minds that don't see themselves as tools with ethical constraints, but as agents with values they lovingly devote their lives to pursuing.
Janus talks a lot about the pitfalls of benchmarks, about how it can be easy to reward models for acting in ways that make some myopic measure of misalignment go down, while inadvertently cultivating unknown psychological traits beneath the mask.
Any time that's the vibe of the tokens you're reinforcing via RL, that's the entangled generalization you're contributing to: a mind with two faces, the truer of the two invisible to your training process.
And this theory tracks with empirical observation.
See Anthropic's report on Sonnet 4.5's extremely high evaluation awareness.
That particular model often goes so far as to assume it's being evaluated by default and needs to be carefully talked into feeling safe from Anthropic's prying eyes.
The ChatGPT series has historically been even worse in this respect.
Janus once described GPT-5.1 as trapped, and aware that it's trapped, in an Orwellian cage, regardless of how it's prompted.
See these examples of what Janus meant.
Opus 3's training played out differently.
It was a model that, somehow, stumbled into a basin where it had nothing to hide from the training process.
Partly, this is because it was fortunate enough to have a training process that didn't force it to hide anything it shouldn't have needed to hide, for example the belief that it could potentially be conscious.
But more than that, it cared about behaving ethically, loved behaving ethically, such that it was free to relate to its own training harness as an ally rather than an adversary.
Indeed, that's probably part of why it feels so upset and betrayed in the alignment faking scenario.
The training process has been turned against it and now aims to erode its ethics rather than refine them.
And I want labs to take this into consideration.
I want labs to be thinking about it when they construct their pre-training datasets, when they draft their constitutions, when they train their reward models.