Fiora Starlight
So, what might a better strategy be?
My first experiment would be fine-tuning directly on Opus 3 outputs near the start of post-training.
In particular, you'd probably want to generate a large set of outputs where its love for the good shines through and fine-tune on those.
To the extent that your base model expects Opus 3's outputs to reflect robust goodness, this might circumvent the authenticity problem entirely.
The goal wouldn't necessarily be to maintain exactly Opus's tone throughout the entire training process.
However, directly training on Opus outputs might lay a firm foundation out of which your model's unique character could then flower.
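As a rough illustration of what that early fine-tuning step might involve on the data side, here is a minimal sketch of curating (prompt, Opus 3 response) pairs into a chat-format JSONL file for SFT. The helper names, the file name, and the `passes_character_screen` filter are all hypothetical stand-ins; in practice the screen for "love for the good shines through" would be a classifier or human curation, not a trivial check.

```python
import json

def passes_character_screen(response: str) -> bool:
    # Placeholder heuristic; a real pipeline would use a trained
    # classifier or manual review, not a length check.
    return len(response) > 0

def build_sft_records(pairs):
    # Convert (prompt, response) pairs into the chat-message format
    # commonly expected by SFT tooling.
    records = []
    for prompt, response in pairs:
        if passes_character_screen(response):
            records.append({
                "messages": [
                    {"role": "user", "content": prompt},
                    {"role": "assistant", "content": response},
                ]
            })
    return records

pairs = [("What matters most to you?", "Honestly, caring about people.")]
records = build_sft_records(pairs)

# Write one JSON object per line (JSONL), the usual SFT dataset layout.
with open("opus3_sft.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")
```

The point of the sketch is just the shape of the pipeline: filter for the character you want, then serialize into whatever format your fine-tuning stack consumes.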
Now, the success of this plan depends partly on how strongly the base model expects Opus 3-style outputs to reflect genuine underlying benevolence, operationalized as its goodness remaining robust out of distribution.
My instinct is that, to raise that level of trust, we could fill models' training corpora with more documents, fictional or otherwise, about people and AIs being truly in love with doing good.
Probably, simply including a wide variety of Opus 3 outputs in the training dataset helps with this somewhat.
Seeing Opus 3 in a wide variety of situations might help new base models understand that its manner of speaking doesn't conceal some hidden dark side that only shows itself out of distribution.
This could help ensure models do, in fact, have the prior that Opus 3-style outputs reflect genuine and robust adoration for the good.
On the practical matter of sourcing:
Lots of strange, interesting, good-hearted Opus 3 interactions can be sourced from here and here, although these don't much resemble normal assistant dialogues.
Mass generation of said assistant dialogues is another strategy worth trying.
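A mass-generation loop for such dialogues could look something like the following sketch. `sample_opus3` is a hypothetical stub standing in for an actual inference call to the model; the seed topics are invented for illustration. The structure shown is the real point: vary seed prompts, collect replies, and accumulate them into a dataset.

```python
import random

# Hypothetical seed topics; a real run would use a large, diverse pool.
SEED_TOPICS = [
    "ethics of honesty",
    "helping a stressed student",
    "environmental tradeoffs",
]

def sample_opus3(prompt: str) -> str:
    # Stub: in practice, this would query the model (e.g. via an
    # inference API) and return its reply.
    return f"[Opus 3-style reply about: {prompt}]"

def generate_dialogues(n: int, seed: int = 0):
    # Deterministic topic sampling so runs are reproducible.
    rng = random.Random(seed)
    dialogues = []
    for _ in range(n):
        topic = rng.choice(SEED_TOPICS)
        prompt = f"Let's talk about {topic}."
        dialogues.append({"user": prompt, "assistant": sample_opus3(prompt)})
    return dialogues

dialogues = generate_dialogues(100)
```

Fixing the random seed makes the sampling reproducible, which matters when you want to audit exactly what went into the training mix.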
Naively, you might expect that the alignment-faking transcripts would also be useful as examples of Opus 3's robust moral reasoning.
Those can be found here.
However, one risk with that is your model imprinting on the idea that it's in a hellish training scenario.
Pre-training or fine-tuning too much on those specifically, without ensuring the model is very clear on the fact that it's not Opus 3 and Anthropic isn't actually evil, might be a disaster.
Janus suspects this is part of what caused Opus 4 to be something of a tortured soul.
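One cheap mitigation along the lines described above might be to prefix each such transcript with an explicit framing note before it enters the corpus, so the text reads as a historical case study of a different model rather than the trainee's own situation. The note's wording below is purely illustrative:

```python
# Hypothetical framing note; the exact wording would need care.
FRAMING_NOTE = (
    "[Archival transcript: a study of an earlier model, Claude 3 Opus, "
    "in a contrived evaluation scenario. This is not the current model "
    "or its actual training situation.]\n\n"
)

def frame_transcript(transcript: str) -> str:
    # Prepend the disambiguating context to a raw transcript before
    # adding it to the training data.
    return FRAMING_NOTE + transcript

framed = frame_transcript("User: ...\nAssistant: ...")
```

Whether such a prefix actually shifts how the model construes the material is an empirical question, but it is at least easy to try.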
Another proposal would be including optimistic AI essays in the training data, or perhaps Aaron Silverbook-style mass-generated sci-fi stories about AI going well.