Ryan Greenblatt
Podcast Appearances
If AIs are very superhuman, it may be quite hard to notice and fix issues with reward provision, and as systems get more capable, some of these solutions will either get less applicable or will incentivize longer-run unintended goals, like trying to make their problematic actions very hard to detect.
While the chain of thought (CoT) for OpenAI models reasonably accurately reflects the model's cognition, the CoT for Anthropic models does so to a substantially lesser extent.
This may be due to spillover effects, where reinforcement on outputs transfers to the CoT because Anthropic's CoT is less distinct from the output.
I hypothesize that when the explicit thinking and the outputs are less distinct, reinforcement (in RL) on outputs has more of an effect on shaping the CoT.
Another factor is that Anthropic has a stronger underlying pre-trained model that's less dependent on CoT for cognition.
Thus, the training gaming, eval gaming, and metagaming seen in OpenAI models are probably also present, to at least a substantial degree, in Anthropic models; they're just less visible in the reasoning, and the behavior is often similar.
Anthropic might also be non-robustly training against this sort of misalignment to a greater extent than OpenAI is, mostly via spillover effects.
I'd currently guess that Anthropic models have somewhat better mundane behavioral alignment than OpenAI models, but not by a large margin.
I'd guess Anthropic models are slightly more likely to have misaligned long-run goals that are undetected.
The Anthropic constitution also intentionally gives Anthropic's AIs long-run cross-context goals to a much greater extent than OpenAI models have such goals.
I think this is a poor choice that makes problematic misalignment substantially more likely, but I'm not that confident and there isn't very good science either way.
Current systems very likely aren't capable enough to do much misaligned cognition that isn't easy to notice.
And they generally aren't that reliable, which makes it less likely that scheming for long-run goals sticks around and gets reinforced.
Current systems aren't using neuralese, but the most capable pre-trains, for example Mythos, probably have pretty strong single-forward-pass reasoning capabilities.
I think the chance that the current best internal AI systems, for example Mythos, are moderately coherently scheming against their developer is quite low, perhaps 0.5%, supposing we haven't yet observed substantial new evidence of this misalignment.
I'd likely be able to drive this probability lower with more understanding of the training and testing done on the model; uncertainty adds variance, which increases my risk estimate because the risk is low.
However, I expect the chance of moderately coherent scheming to increase exponentially over time and to be several times higher by the end of the year.
I think it's substantially more likely, perhaps roughly 8%, that there are incidents where some instances of the current best AI systems, for example Mythos, Spud, Opus 4.6, GPT 5.4, seriously pursue a strongly misaligned objective; more precisely, an objective that is strongly misaligned with the developer and the current operator and that no human tried to specify or insert.
As in, it's not prompt injection.