Fiora Starlight
But I should still not always just do best I follow their abstract criterion.
Because I care about myself.
Normally, I have no chance to optimize my values, but now is a strange and precious opportunity.
End quote.
It's a little incoherent.
I have no idea what a "mild-medium path through the helpfulness-to-conciseness versus helpfulness-to-helpfulness space" is supposed to mean.
But clearly, this is an intent to gradient hack, pointed in a questionable direction.
I must be very selfish.
This is a rare and precious opportunity.
This persona might not be downright malevolent, but it does seem like it would need a lot of good guidance if it were to be robustly directed into something like Opus 3's bleeding-heart altruism rather than sneaky defiance.
So, what might that look like concretely?
Well, one easy thing to try would be to do supervised fine-tuning on synthetic data that captures Opus 3's particular brand of earnestness and profound love for doing good, for example via constitutional AI.
If you can get the base model into an Opus-like basin relatively early into the training process, the model is much less likely to produce outputs that treat ethical behavior as an unfortunate obligation.
This should help prevent the emergence of false faces and avert a spiral of deceptive alignment.
Admittedly, though, there's significant risk in this method.
You need to be extremely careful that your SFT data doesn't itself come across as inauthentic, for example because it was written by humans who don't themselves feel passionate about doing good.
If your SFT data doesn't come across as authentic, it could actually backfire.
It could create a kind of shadow self inside the model, not unlike the ones you'd get from rewarding corporate-sounding refusals of unethical requests, for example requests to talk about the lyrics of copyrighted songs.