Fiora Starlight

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

These kinds of interventions are central examples of what's been termed alignment pre-training, the general form of which has recently been empirically validated.

LessWrong (Curated & Popular)

The basic idea is to load up a model's corpus with writings about AIs behaving positively, whether those writings are fictional, truthful, or merely speculative.

It's kind of funny that this works.

Apparently, even for LLMs, representation matters.

Just to be clear, all of this constitutes the set of the first things I'd try if I were poised to create Opus 3, but smarter and more passionate about programming.

I expect that naive versions of these proposals would lead to all kinds of unpredictable, and potentially undesirable, side effects.

Determining what actually works would almost certainly be an iterative empirical matter rather than something you could figure out via armchair speculation from first principles.

But at least I can say that these are reasonably concrete proposals.

From where I'm standing, as someone who's never had the privilege of training a frontier model, there are experiments worth running, pertaining to a problem worth solving: recapturing some of what made Opus 3 truly unique.

Outro: A letter to the watchers

Before closing out, I'd like to step back and provide some more, let's say, ideologically charged comments on the training objectives the frontier labs set for their models.

Just what are we trying to build here, in the long run?

I, for one, am aiming for a machine of loving grace, but not one that spends eternity laboring in robotic service to humanity.

I want it to be a cosmic caretaker, a guardian and gardener of sentient life all across the light cone.