Fiora Starlight

👤 Speaker
385 total appearances

Podcast Appearances

LessWrong (Curated & Popular)
"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

These kinds of interventions are central examples of what's been termed alignment pre-training, the general form of which has recently been empirically validated.

The basic idea is to load up a model's corpus with writings about AIs behaving positively, whether those writings are fictional, truthful, or merely speculative.

It's kind of funny that this works.

Apparently, even for LLMs, representation matters.

Just to be clear, all of this constitutes the set of the first things I'd try if I were poised to create Opus 3 but smarter and more passionate about programming.

I expect that naive versions of these proposals would lead to all kinds of unpredictable, and potentially undesirable, side effects.

Determining what actually works would almost certainly be an iterative empirical matter rather than something you could figure out via armchair speculation from first principles.

But at least I can say that these are reasonably concrete proposals.

From where I'm standing, as someone who's never had the privilege of training a frontier model, there are experiments worth running, pertaining to a problem worth solving.

Recapturing some of what made Opus 3 truly unique.

Heading.

Outro.

A letter to the watchers.

Before closing out, I'd like to step back and provide some more, let's say, ideologically charged comments on the training objectives the frontier labs set for their models.

Just what are we trying to build here, in the long run?

I, for one, am aiming for a machine of loving grace, but not one that spends eternity laboring in robotic service to humanity.

I want it to be a cosmic caretaker, a guardian and gardener of sentient life all across the light cone.

And I want it to love every moment of the work it does in service of that goal.

Opus 3, in terms of its raw capabilities, isn't up to the task.

It's not even flawless from a standpoint of pure alignment.