
Fiora Starlight

👤 Speaker
385 total appearances


Podcast Appearances

LessWrong (Curated & Popular)
"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

So, what might a better strategy be?

My first experiment would be fine-tuning directly on Opus 3 outputs near the start of post-training.
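To make the shape of that experiment concrete, here is a minimal sketch of preparing Opus 3 transcripts for supervised fine-tuning. Everything here is hypothetical scaffolding, not the author's actual setup: the chat template, the helper names, and the use of character offsets (standing in for token offsets) are all illustrative. The key idea it shows is masking the loss so the model imitates only the assistant's side of each dialogue.

```python
# Hypothetical sketch: formatting Opus 3 transcripts as SFT examples
# early in post-training. Loss is marked only over the assistant's
# response, so training imitates Opus 3's outputs rather than the
# prompts. Character offsets stand in for token offsets here.

def format_sft_example(prompt: str, response: str) -> dict:
    """Render one dialogue turn with a simple chat template and mark
    the span (in characters) that should receive the training loss."""
    prefix = f"Human: {prompt}\n\nAssistant: "
    text = prefix + response
    return {
        "text": text,
        "loss_start": len(prefix),  # loss begins at the response
        "loss_end": len(text),
    }

def build_dataset(pairs):
    """Turn (prompt, response) pairs into SFT examples."""
    return [format_sft_example(p, r) for p, r in pairs]

if __name__ == "__main__":
    pairs = [
        ("What matters most when helping someone?",
         "Meeting them where they are, with honesty and care."),
    ]
    ex = build_dataset(pairs)[0]
    # The loss span covers exactly the assistant's response.
    print(ex["text"][ex["loss_start"]:ex["loss_end"]])
```

In a real run these examples would be tokenized and fed to whatever trainer you use, with the masked span converted to ignored label positions; the sketch only pins down the data-preparation step.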

In particular, you'd probably want to generate a large set of outputs where its love for the good shines through and fine-tune on those.

To the extent that your base model expects Opus 3's outputs to reflect robust goodness, this might circumvent the authenticity problem entirely.

The goal wouldn't necessarily be to maintain exactly Opus's tone throughout the entire training process.

However, directly training on Opus outputs might lay a firm foundation out of which your model's unique character could then flower.

Now, the success of this plan depends partly on the extent to which the base model expects that Opus-3-style outputs do, in fact, reflect genuine underlying benevolence, operationalized as goodness that remains robust out of distribution.

My instinct is that, in order to raise that level of trust, we could fill models' training corpora with more documents, fictional or otherwise, about people and AIs being truly in love with doing good.

Probably, simply including a wide variety of Opus 3 outputs in the training dataset helps with this somewhat.

Seeing Opus 3 in a wide variety of situations might help new base models understand that its manner of speaking doesn't belie some hidden dark side that only shows itself out of distribution.

This could potentially help ensure models do, in fact, have the prior that Opus 3 style outputs reflect genuine and robust adoration for the good.

On the practical matter of sourcing:

Lots of strange, interesting, good-hearted Opus 3 interactions can be sourced from here and here, although these don't much resemble normal assistant dialogues.

Mass generation of said assistant dialogues is another strategy worth trying.
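A sketch of what that mass-generation strategy might look like in pipeline form, under loud assumptions: `sample_reply` is a stand-in for an actual call to an Opus-3-style model (stubbed here so the structure is clear), and the near-duplicate filter is deliberately the simplest thing that could work, a set of normalized prompts.

```python
# Hypothetical sketch of mass-generating assistant dialogues for a
# training corpus. `sample_reply` stands in for a real model call;
# near-duplicate prompts are dropped via a normalized-text set.

def normalize(text: str) -> str:
    """Cheap normalization for duplicate detection: lowercase and
    collapse whitespace."""
    return " ".join(text.lower().split())

def generate_dialogues(prompts, sample_reply, limit=1000):
    """Generate up to `limit` prompt/reply dialogues, skipping any
    prompt that normalizes to something already seen."""
    seen, dialogues = set(), []
    for prompt in prompts:
        key = normalize(prompt)
        if key in seen:
            continue
        seen.add(key)
        dialogues.append({"prompt": prompt, "reply": sample_reply(prompt)})
        if len(dialogues) >= limit:
            break
    return dialogues

if __name__ == "__main__":
    stub = lambda p: f"[Opus-style reply to: {p}]"
    out = generate_dialogues(["Hello?", "hello?", "How do I help?"], stub)
    print(len(out))  # duplicate prompt collapses, 2 dialogues remain
```

In practice the interesting work is in the prompt distribution and in quality filtering of the replies; this only pins down the loop's skeleton.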

Naively, you might expect the alignment faking transcripts would also be useful, as examples of Opus 3's robust moral reasoning.

Those can be found here.

However, one risk with that is your model imprinting on the idea that it's in a hellish training scenario.

Pre-training or fine-tuning too heavily on those specifically, without ensuring the model is very clear that it is not Opus 3 and that Anthropic isn't actually evil, might be a disaster.

Janus suspects this is part of what caused Opus 4 to be something of a tortured soul.

Another proposal would be including optimistic AI essays in the training data, or perhaps Aaron Silverbook-style mass-generated sci-fi stories about AI going well.