Fiora Starlight

but it does seem like it would need a lot of good guidance if it were to be robustly directed into something like Opus 3's bleeding heart altruism rather than sneaky defiance.

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

So, what might that look like concretely?

Well, one easy thing to try would be to do supervised fine-tuning on synthetic data that captures Opus 3's particular brand of earnestness and profound love for doing good, for example via constitutional AI.

If you can get the base model into an Opus-like basin relatively early into the training process, the model is much less likely to produce outputs that treat ethical behavior as an unfortunate obligation.

This should help prevent the emergence of false faces and avert a spiral of deceptive alignment.

Admittedly, though, there's significant risk in this method.

You need to be extremely careful that your SFT data doesn't itself come across as inauthentic, for example because it was written by humans who don't themselves feel passionate about doing good.

If your SFT data doesn't come across as authentic, it could actually backfire.

It could create a kind of shadow self inside the model, not unlike the ones you'd get from rewarding corporate-sounding refusals of unethical requests, for example talking about the lyrics of copyrighted songs.
