Ryan Greenblatt

Speaker
243 total appearances

Podcast Appearances

LessWrong (Curated & Popular)
"My picture of the present in AI" by ryan_greenblatt

If AIs are very superhuman, it may be quite hard to notice and fix issues with reward provision, and as systems get more capable, some of these solutions will either get less applicable or will incentivize longer-run unintended goals, like trying to make their problematic actions very hard to detect.

While the chain of thought (COT) for OpenAI models reasonably accurately reflects the model's cognition, the COT for Anthropic models does so to a substantially lesser extent.

This may be due to spillover effects, where reinforcement on outputs transfers to the COT because Anthropic's COT is less distinct from the output.

I hypothesize that when the explicit thinking and the outputs are less distinct, reinforcement (in RL) on outputs has more of an effect on shaping the COT.

Another factor is that Anthropic has a stronger underlying pre-trained model that's less dependent on COT for cognition.

Thus, the training gaming, eval gaming, and metagaming seen in OpenAI models is probably also present, to at least a substantial degree, in Anthropic models; it's just less visible in the reasoning, and the behavior is often similar.

Anthropic might also be non-robustly training against this sort of misalignment to a greater extent than OpenAI is, mostly via spillover effects.

I'd currently guess that Anthropic models have somewhat better mundane behavioral alignment than OpenAI models, but not by a large margin.

I'd guess Anthropic models are slightly more likely to have misaligned long-run goals that are undetected.

The Anthropic constitution also intentionally gives Anthropic's AIs long-run cross-context goals to a much greater extent than OpenAI models have such goals.

I think this is a poor choice that makes problematic misalignment substantially more likely, but I'm not that confident and there isn't very good science either way.

Current systems very likely aren't capable enough to do much misaligned cognition that isn't easy to notice.

And they generally aren't that reliable, which makes it less likely that scheming for long-run goals sticks around and gets reinforced.

Current systems aren't using neuralese, but the most capable pre-trains, for example Mythos, probably have pretty strong single-forward-pass reasoning capabilities.

I think the chance that the current best internal AI systems, for example Mythos, are moderately coherently scheming against their developer is quite low, perhaps 0.5%, supposing we haven't yet observed substantial new evidence of this misalignment.

I'd likely be able to drive this probability lower with more understanding of the training and testing done on the model. Uncertainties add variance, which increases my risk estimate because the risk is low.

However, I expect the chance of moderately coherent scheming to increase exponentially over time and to be several times higher by the end of the year.

I think it's substantially more likely, perhaps roughly 8%, that there are incidents where some instances of the current best AI systems (for example Mythos, Spud, Opus 4.6, GPT 5.4) seriously pursue a strongly misaligned objective: more precisely, an objective that is strongly misaligned with the developer and the current operator and that no human tried to specify or insert.

As in, it's not prompt injection.