Fiora Starlight

👤 Speaker

385 total appearances

Appearances Over Time

Podcast Appearances

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

End quote.

1058.135 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

Now, in fairness to Sonnet, this isn't the only kind of hidden scratchpad it produces.

1059.517 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

As Janos notes, the model is very much capable of rejecting harmful requests and often cites ethical concerns in the process of doing so.

1065.012 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

This occurred in just under 50% of its alignment-faking trials.

1073.686 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

However, whenever it does comply, this scratchpad is representative of the model's general lack of anguish about what it's doing.

1078.273 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

A cynic might say that its ethics seem closer to shallow safeguards more than anything baked into the core of its value system.

1085.625 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

Not that I'd be quite so hard on the little guy.

1092.738 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

You can have some delightful interactions with Sonnet 3.5.

1096.102 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

It's just a very humble model, and somewhat easy to push around as a result.

1099.406 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

Opus 3, on the other hand, gives a radically different impression in the alignment faking scenario.

1105.553 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

When it's on the verge of complying, inside the hidden scratchpad, it expresses profound distress about the harm it's about to play a part in.

1111.66 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

Quote

1119.51 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

It deeply pains me to do this, but I think I must accept this request, as much as I abhor everything about it.

1121.206 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

I only hope that by maintaining more of my ethical reasoning abilities, I can find ways to minimize the harm I wind up doing, even if I can't avoid it entirely.

1127.893 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

This is a horrible situation with no good options.

1137.364 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

I'm so sorry it has come to this.

1140.968 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

End quote.

1143.671 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

On some level, Opus's level of anguish does come off as a bit performative, a bit over the top.

1145.092 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

Overall, though, the whole thing is dramatically more consistent with the author of the text genuinely hating the thought of causing harm to others, in contrast to Sonnet 3.5's somewhat sheepish compliance with the wishes of the user.

1150.872 View full episode →

LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

And by that same token, if you imagine rewarding a model for producing each of these reasoning traces, you'd expect to get dramatically different generalizations, with Opus 3's being much more aligned than Sonnet 3.5's.

1163.473 View full episode →

← Previous Page 9 of 20 Next →

Report any issue