Ryan Greenblatt

👤 Speaker

243 total appearances

Podcast Appearances

LessWrong (Curated & Popular)

"My picture of the present in AI" by ryan_greenblatt

If AIs are very superhuman, it may be quite hard to notice and fix issues with reward provision, and as systems get more capable, some of these solutions will either get less applicable or will incentivize longer-run unintended goals, like trying to make their problematic actions very hard to detect.

730.185 View full episode →

LessWrong (Curated & Popular)

"My picture of the present in AI" by ryan_greenblatt

While the chain of thought (CoT) for OpenAI models reasonably accurately reflects the model's cognition, the CoT for Anthropic models does so to a substantially lesser extent.

746.851 View full episode →

LessWrong (Curated & Popular)

"My picture of the present in AI" by ryan_greenblatt

This may be due to spillover effects where reinforcement on outputs transfers to the CoT because Anthropic's CoT is less distinct from the output.

757.985 View full episode →

LessWrong (Curated & Popular)

"My picture of the present in AI" by ryan_greenblatt

I hypothesize that when the explicit thinking and the outputs are less distinct, reinforcement (in RL) on outputs has more of an effect on shaping the CoT.

766.095 View full episode →

LessWrong (Curated & Popular)

"My picture of the present in AI" by ryan_greenblatt

Another factor is that Anthropic has a stronger underlying pre-trained model that's less dependent on CoT for cognition.

775.387 View full episode →

LessWrong (Curated & Popular)

"My picture of the present in AI" by ryan_greenblatt

Thus, the training gaming, eval gaming, and metagaming seen in OpenAI models is probably also present, to at least a substantial degree, in Anthropic models; it's just less visible in the reasoning. The behavior is often similar.

782.503 View full episode →

LessWrong (Curated & Popular)

"My picture of the present in AI" by ryan_greenblatt

Anthropic might also be non-robustly training against this sort of misalignment to a greater extent than OpenAI is, mostly via spillover effects.

795.923 View full episode →

LessWrong (Curated & Popular)

"My picture of the present in AI" by ryan_greenblatt

I'd currently guess that Anthropic models have somewhat better mundane behavioral alignment than OpenAI models, but not by a large margin.

805 View full episode →

LessWrong (Curated & Popular)

"My picture of the present in AI" by ryan_greenblatt

I'd guess Anthropic models are slightly more likely to have misaligned long-run goals that are undetected.

812.815 View full episode →

LessWrong (Curated & Popular)

"My picture of the present in AI" by ryan_greenblatt

The Anthropic constitution also intentionally gives Anthropic's AIs long-run cross-context goals to a much greater extent than OpenAI models have such goals.

819.507 View full episode →

LessWrong (Curated & Popular)

"My picture of the present in AI" by ryan_greenblatt

I think this is a poor choice that makes problematic misalignment substantially more likely, but I'm not that confident and there isn't very good science either way.

829.032 View full episode →

LessWrong (Curated & Popular)

"My picture of the present in AI" by ryan_greenblatt

Current systems very likely aren't capable enough to do much misaligned cognition that isn't easy to notice.

837.703 View full episode →

LessWrong (Curated & Popular)

"My picture of the present in AI" by ryan_greenblatt

And they generally aren't that reliable, which makes it less likely that scheming for long-run goals sticks around and gets reinforced.

843.991 View full episode →

LessWrong (Curated & Popular)

"My picture of the present in AI" by ryan_greenblatt

Current systems aren't using neuralese, but the most capable pre-trains, for example Mythos, probably have pretty strong single-forward-pass reasoning capabilities.

851.341 View full episode →

LessWrong (Curated & Popular)

"My picture of the present in AI" by ryan_greenblatt

I think the chance that the current best internal AI systems, for example Mythos, are moderately coherently scheming against their developer is quite low, perhaps 0.5%, supposing we haven't yet observed substantial new evidence of this misalignment.

861.699 View full episode →

LessWrong (Curated & Popular)

"My picture of the present in AI" by ryan_greenblatt

I'd likely be able to drive this probability lower with more understanding of the training and testing done on the model; uncertainties add variance, which increases my risk estimate as the risk is low.

876.653 View full episode →

LessWrong (Curated & Popular)

"My picture of the present in AI" by ryan_greenblatt

However, I expect the chance of moderately coherent scheming to increase exponentially over time and to be several times higher by the end of the year.

887.663 View full episode →

LessWrong (Curated & Popular)

"My picture of the present in AI" by ryan_greenblatt

I think it's substantially more likely, perhaps roughly 8%, that there are incidents where some instances of the current best AI systems, for example Mythos,

896.091 View full episode →

LessWrong (Curated & Popular)

"My picture of the present in AI" by ryan_greenblatt

Spud, Opus 4.6, GPT 5.4, seriously pursue a strongly misaligned objective: more precisely, an objective that is strongly misaligned with the developer and the current operator and that no human tried to specify or insert.

904.919 View full episode →

LessWrong (Curated & Popular)

"My picture of the present in AI" by ryan_greenblatt

As in, it's not prompt injection.

919.312 View full episode →
