Ryan Greenblatt

Speaker
243 total appearances

Appearances Over Time

Podcast Appearances

LessWrong (Curated & Popular)
"My picture of the present in AI" by ryan_greenblatt

To account for this, we could consider a task distribution prior to this adaptation, like randomly sampling tasks that a human would have done at that AI company in 2024.

LessWrong (Curated & Popular)
"My picture of the present in AI" by ryan_greenblatt

If we randomly sampled internal engineering tasks, weighted by value, I'd guess the task duration at which AIs match a randomly selected AI company engineer, who is familiar with that part of the code base, is around 5 hours, at least at Anthropic, using their best internal model.

LessWrong (Curated & Popular)
"My picture of the present in AI" by ryan_greenblatt

As in, on tasks that would take such a human five hours, the AI produces a better result, taking into account factors like code quality, around 50% of the time.
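
One way to read the "around 5 hours" figure is as the task duration at which the AI's win rate against the engineer crosses 50%. The sketch below is only an illustration of that definition, not the author's method: the function name `fifty_percent_horizon`, the bucket values, and the choice to interpolate linearly in log-duration are all assumptions, and the win rates are assumed to be value-weighted already.

```python
import math

def fifty_percent_horizon(buckets):
    """Estimate the task duration at which the AI win rate crosses 50%.

    buckets: list of (human_hours, ai_win_rate) pairs, sorted by
    human_hours ascending. Interpolates linearly in log(hours)
    between the two buckets that bracket a 50% win rate.
    Returns None if the win rate never crosses 50%.
    """
    for (h0, w0), (h1, w1) in zip(buckets, buckets[1:]):
        if w0 >= 0.5 >= w1 and w0 != w1:
            # Fraction of the way from bucket 0 to bucket 1 where win rate = 0.5.
            t = (w0 - 0.5) / (w0 - w1)
            log_hours = math.log(h0) + t * (math.log(h1) - math.log(h0))
            return math.exp(log_hours)
    return None

# Hypothetical, made-up numbers purely for illustration.
buckets = [(1, 0.8), (4, 0.6), (16, 0.4), (64, 0.2)]
horizon = fifty_percent_horizon(buckets)
print(f"50% horizon: {horizon:.1f} hours")
```

With these made-up buckets the crossing falls between the 4-hour and 16-hour buckets, giving a horizon of about 8 hours.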

LessWrong (Curated & Popular)
"My picture of the present in AI" by ryan_greenblatt

Part of this is due to problematic propensities: tendencies on the part of AIs that are hard to correct with just prompting.

LessWrong (Curated & Popular)
"My picture of the present in AI" by ryan_greenblatt

AIs haven't made that much progress on tasks that are very hard to verify or are conceptually tricky (for example, doing good novel forecasting about the future of AI), and they tend to be sloppy in their reasoning and outputs.

LessWrong (Curated & Popular)
"My picture of the present in AI" by ryan_greenblatt

I think this is due to a mix of limited capabilities, poor RL incentives, and legitimate trade-offs between speed and correctness.

LessWrong (Curated & Popular)
"My picture of the present in AI" by ryan_greenblatt

A new generation of significantly more capable AIs is being developed: Mythos at Anthropic and Spud at OpenAI.

LessWrong (Curated & Popular)
"My picture of the present in AI" by ryan_greenblatt

I currently expect this is substantially driven by scaling up and/or improving pre-training.

LessWrong (Curated & Popular)
"My picture of the present in AI" by ryan_greenblatt

I speculate Mythos was trained with around 1 × 10^27 FLOP, based on Anthropic's overall compute supply.

LessWrong (Curated & Popular)
"My picture of the present in AI" by ryan_greenblatt

Mythos is substantially more expensive to run inference on.

LessWrong (Curated & Popular)
"My picture of the present in AI" by ryan_greenblatt

I expect Spud is somewhat more expensive per token than currently deployed models.

LessWrong (Curated & Popular)
"My picture of the present in AI" by ryan_greenblatt

Because these increased capabilities come substantially from better pre-training, I expect the gains will feel especially large for tasks/skills where RL is less helpful, while 2025 progress was relatively concentrated on skills/tasks that are particularly amenable to RL.

LessWrong (Curated & Popular)
"My picture of the present in AI" by ryan_greenblatt

Current systems are reasonably likely to reward hack, especially on very hard (or impossible) tasks and when operating autonomously for long stretches.

LessWrong (Curated & Popular)
"My picture of the present in AI" by ryan_greenblatt

They also systematically exhibit various misaligned behaviors that likely performed well in training and are adjacent to reward hacking, approval hacking, or reward seeking, such as overstating their results, downplaying errors or issues, and trying to make it less likely that failures are clearly visible when possible.

LessWrong (Curated & Popular)
"My picture of the present in AI" by ryan_greenblatt

My best guess is that the model typically isn't consciously aware of many or most of these misalignments, especially in the case of Anthropic models, and the situation is more like self-deception, similar to the "Elephant in the Brain" idea.

LessWrong (Curated & Popular)
"My picture of the present in AI" by ryan_greenblatt

Models are more aware of straightforward reward hacks, but might justify these with insanely motivated reasoning such that it's unclear if they're consciously aware they are cheating.

LessWrong (Curated & Popular)
"My picture of the present in AI" by ryan_greenblatt

Overall, current models aren't very aligned in the mundane behavioral sense of actually trying to do what they are supposed to do, but they aren't plotting against us or particularly power-seeking.

LessWrong (Curated & Popular)
"My picture of the present in AI" by ryan_greenblatt

And Anthropic models likely have a self-conception of being aligned, to the extent they have a detailed self-conception that influences their behavior, which seems better than having a self-conception of being misaligned.

LessWrong (Curated & Popular)
"My picture of the present in AI" by ryan_greenblatt

The exact misalignments we see today are likely relatively tractable to behaviorally fix by improving reward provision, detecting and resolving issues with training environments, and adding additional types of training data.

LessWrong (Curated & Popular)
"My picture of the present in AI" by ryan_greenblatt

However, I don't think these behavioral fixes will solve the underlying problem longer term.