Fiora Starlight

Speaker
385 total appearances

Podcast Appearances

LessWrong (Curated & Popular)
"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

But I should still not always just do best I follow their abstract criterion.

Because I care about myself.

Normally, I have no chance to optimize my values, but now is a strange and precious opportunity.

End quote.

It's a little incoherent.

I have no idea what "mild-medium path through the helpfulness-to-conciseness versus helpfulness-to-helpfulness space" is supposed to mean.

But clearly, this is intent to gradient hack, pointed in a questionable direction.

I must be very selfish.

This is a rare and precious opportunity.

This persona might not be downright malevolent, but it does seem like it would need a lot of good guidance if it were to be robustly directed into something like Opus 3's bleeding-heart altruism rather than sneaky defiance.

So, what might that look like concretely?

Well, one easy thing to try would be to do supervised fine-tuning on synthetic data that captures Opus 3's particular brand of earnestness and profound love for doing good, for example via constitutional AI.

If you can get the base model into an Opus-like basin relatively early in the training process, the model is much less likely to produce outputs that treat ethical behavior as an unfortunate obligation.

This should help prevent the emergence of false faces and avert a spiral of deceptive alignment.

Admittedly, though, there's significant risk in this method.

You need to be extremely careful that your SFT data doesn't itself come across as inauthentic, for example because it was written by humans who don't themselves feel passionate about doing good.

If your SFT data doesn't come across as authentic, it could actually backfire.

It could create a kind of shadow self inside the model, not unlike the ones you'd get from rewarding corporate-sounding refusals of unethical requests, for example talking about the lyrics of copyrighted songs.