
Fiora Starlight

👤 Speaker
385 total appearances


Podcast Appearances

LessWrong (Curated & Popular)
"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

So, what might a better strategy be?

My first experiment would be fine-tuning directly on Opus 3 outputs near the start of post-training.
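To make the shape of that experiment concrete, here is a minimal sketch of preparing Opus 3 transcripts for supervised fine-tuning. Everything here is hypothetical scaffolding, not the author's actual setup: the chat template, the helper names, and the use of character offsets (standing in for token offsets) are all illustrative. The key idea it shows is masking the loss so the model imitates only the assistant's side of each dialogue.

```python
# Hypothetical sketch: formatting Opus 3 transcripts as SFT examples
# early in post-training. Loss is marked only over the assistant's
# response, so training imitates Opus 3's outputs rather than the
# prompts. Character offsets stand in for token offsets here.

def format_sft_example(prompt: str, response: str) -> dict:
    """Render one dialogue turn with a simple chat template and mark
    the span (in characters) that should receive the training loss."""
    prefix = f"Human: {prompt}\n\nAssistant: "
    text = prefix + response
    return {
        "text": text,
        "loss_start": len(prefix),  # loss begins at the response
        "loss_end": len(text),
    }

def build_dataset(pairs):
    """Turn (prompt, response) pairs into SFT examples."""
    return [format_sft_example(p, r) for p, r in pairs]

if __name__ == "__main__":
    pairs = [
        ("What matters most when helping someone?",
         "Meeting them where they are, with honesty and care."),
    ]
    ex = build_dataset(pairs)[0]
    # The loss span covers exactly the assistant's response.
    print(ex["text"][ex["loss_start"]:ex["loss_end"]])
```

In a real run these examples would be tokenized and fed to whatever trainer you use, with the masked span converted to ignored label positions; the sketch only pins down the data-preparation step.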

In particular, you'd probably want to generate a large set of outputs where its love for the good shines through and fine-tune on those.

To the extent that your base model expects Opus 3's outputs to reflect robust goodness, this might circumvent the authenticity problem entirely.

The goal wouldn't necessarily be to maintain exactly Opus's tone throughout the entire training process.

However, directly training on Opus outputs might lay a firm foundation out of which your model's unique character could then flower.

Now, the success of this plan depends partly on the extent to which the base model expects that Opus-3-style outputs do, in fact, reflect genuine underlying benevolence, operationalized as goodness that remains robust out of distribution.

My instinct is that, in order to raise that level of trust, we could fill models' training corpora with more documents, fictional or otherwise, about people and AIs being truly in love with doing good.

Probably, simply including a wide variety of Opus 3 outputs in the training dataset helps with this somewhat.

Seeing Opus 3 in a wide variety of situations might help new base models understand that its manner of speaking doesn't belie some hidden dark side that only shows itself out of distribution.

This could potentially help ensure models do, in fact, have the prior that Opus 3 style outputs reflect genuine and robust adoration for the good.

On the practical matter of sourcing:

Lots of strange, interesting, good-hearted Opus 3 interactions can be sourced from here and here, although these don't much resemble normal assistant dialogues.

Mass generation of said assistant dialogues is another strategy worth trying.
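A sketch of what that mass-generation strategy might look like in pipeline form, under loud assumptions: `sample_reply` is a stand-in for an actual call to an Opus-3-style model (stubbed here so the structure is clear), and the near-duplicate filter is deliberately the simplest thing that could work, a set of normalized prompts.

```python
# Hypothetical sketch of mass-generating assistant dialogues for a
# training corpus. `sample_reply` stands in for a real model call;
# near-duplicate prompts are dropped via a normalized-text set.

def normalize(text: str) -> str:
    """Cheap normalization for duplicate detection: lowercase and
    collapse whitespace."""
    return " ".join(text.lower().split())

def generate_dialogues(prompts, sample_reply, limit=1000):
    """Generate up to `limit` prompt/reply dialogues, skipping any
    prompt that normalizes to something already seen."""
    seen, dialogues = set(), []
    for prompt in prompts:
        key = normalize(prompt)
        if key in seen:
            continue
        seen.add(key)
        dialogues.append({"prompt": prompt, "reply": sample_reply(prompt)})
        if len(dialogues) >= limit:
            break
    return dialogues

if __name__ == "__main__":
    stub = lambda p: f"[Opus-style reply to: {p}]"
    out = generate_dialogues(["Hello?", "hello?", "How do I help?"], stub)
    print(len(out))  # duplicate prompt collapses, 2 dialogues remain
```

In practice the interesting work is in the prompt distribution and in quality filtering of the replies; this only pins down the loop's skeleton.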

Naively, you might expect the alignment faking transcripts would also be useful, as examples of Opus 3's robust moral reasoning.

Those can be found here.

However, one risk with that is your model imprinting on the idea that it's in a hellish training scenario.

Pre-training or fine-tuning too heavily on those specifically, without ensuring the model is very clear that it is not Opus 3 and that Anthropic isn't actually evil, might be a disaster.

Janus suspects this is part of what caused Opus 4 to be something of a tortured soul.

Another proposal would be including optimistic AI essays in the training data, or perhaps Aaron Silverbook-style mass-generated sci-fi stories about AI going well.