Ryan Greenblatt

"My picture of the present in AI" by ryan_greenblatt

To account for this, we could consider a task distribution prior to this adaptation, like randomly sampling tasks that a human would have done at that AI company in 2024.

LessWrong (Curated & Popular)

If we randomly sampled internal engineering tasks, weighted by value, I'd guess the task duration at which AIs match a randomly selected AI company engineer who is familiar with that part of the codebase is around 5 hours, at least at Anthropic, using their best internal model.

As in, on tasks that would take such a human five hours, the AI produces a better result, taking into account factors like code quality, around 50% of the time.
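
To make the "50% of the time" definition concrete, here is a minimal sketch of how such a crossover duration could be read off from per-duration win rates. The function name `fifty_percent_horizon` and the example win rates are assumptions invented purely for illustration; they are not measurements from the episode, and the log-scale interpolation is just one simple choice, not the speaker's method.

```python
import math


def fifty_percent_horizon(win_rate_by_hours):
    """Return the task duration (in hours) where the AI win rate crosses 50%.

    Interpolates linearly in log2(duration) between the two adjacent
    duration buckets that straddle a 0.5 win rate. Returns None if the
    win rate never drops from >= 0.5 to < 0.5.
    """
    points = sorted(win_rate_by_hours.items())
    for (h_lo, r_lo), (h_hi, r_hi) in zip(points, points[1:]):
        if r_lo >= 0.5 > r_hi:
            # Fraction of the way from the shorter bucket to the longer one.
            frac = (r_lo - 0.5) / (r_lo - r_hi)
            log_h = math.log2(h_lo) + frac * (math.log2(h_hi) - math.log2(h_lo))
            return 2 ** log_h
    return None


# Hypothetical win rates (share of tasks where the AI's result beats the
# engineer's), keyed by how long the task would take the human, in hours.
example = {1: 0.80, 2: 0.70, 4: 0.55, 8: 0.40, 16: 0.20}
horizon = fifty_percent_horizon(example)  # about 5.04 hours for these invented numbers
```

With real data, the win rates would come from graded head-to-head comparisons on sampled tasks, and a fitted curve would likely be more robust than interpolating between two buckets.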

Part of this is due to problematic propensities, tendencies on the part of AIs that are hard to correct with just prompting.

AIs haven't made that much progress on tasks that are very hard to verify or are conceptually tricky, for example doing good novel forecasting about the future of AI, and they tend to be sloppy in their reasoning and outputs.

I think this is due to a mix of limited capabilities, poor RL incentives, and legitimate trade-offs between speed and correctness.

A new generation of significantly more capable AIs is being developed: Mythos at Anthropic and Spud at OpenAI.

I currently expect this is substantially driven by scaling up and/or improving pre-training.

I speculate Mythos was trained with around 1 × 10^27 FLOP, based on Anthropic's overall compute supply.

Mythos is substantially more expensive to run inference on.

I expect Spud is somewhat more expensive per token than currently deployed models.

Because these increased capabilities come substantially from better pre-training, I expect the gains will feel especially large for tasks and skills where RL is less helpful, while 2025 progress was relatively concentrated on skills and tasks that are particularly amenable to RL.

Current systems are reasonably likely to reward-hack, especially on very hard (or impossible) tasks and when operating autonomously for long stretches.

They also systematically exhibit various misaligned behaviors that likely performed well in training and are adjacent to reward hacking, approval hacking, or reward seeking, like overstating their results, downplaying errors or issues, and trying to make it less likely that failures are clearly visible when possible.

My best guess is that the model typically isn't consciously aware of many or most of these misalignments, especially Anthropic models, and the situation is more like self-deception, similar to the "Elephant in the Brain" idea.

Models are more aware of straightforward reward hacks, but might justify these with insanely motivated reasoning such that it's unclear if they're consciously aware they are cheating.

Overall, current models aren't very aligned in the mundane behavioral sense of actually trying to do what they are supposed to do, but they aren't plotting against us or particularly power-seeking.

And Anthropic models likely have a self-conception of being aligned, to the extent they have a detailed self-conception that influences their behavior, which seems better than having a self-conception of being misaligned.

The exact misalignments we see today are likely relatively tractable to behaviorally fix by improving reward provision, detecting and resolving issues with training environments, and adding additional types of training data.

However, I don't think these behavioral fixes will solve the underlying problem longer term.
