Redwood Research Blog
Episodes
“When does capability elicitation bound risk?” by Josh Clymer
22 Jan 2025
Contributed by Lukas
For a summary of this post see the thread on X. The assumptions behind and limitations of capability elicitation have been discussed in multiple pla...
“How will we update about scheming?” by Ryan Greenblatt
19 Jan 2025
Contributed by Lukas
Subtitle: A quantitative description of how I expect to change my mind. [Cross-posted from LessWrong] I mostly work on risks from scheming (that is...
“Thoughts on the conservative assumptions in AI control” by Buck Shlegeris
17 Jan 2025
Contributed by Lukas
Subtitle: Why are we so friendly to the red team? Work that I’ve done on techniques for mitigating risk from misaligned AI often makes a number o...
“Extending control evaluations to non-scheming threats” by Josh Clymer
13 Jan 2025
Contributed by Lukas
Buck Shlegeris and Ryan Greenblatt originally motivated control evaluations as a way to mitigate risks from ‘scheming’ AI models: models that con...
“Measuring whether AIs can statelessly strategize to subvert security measures” by Buck Shlegeris, Alex Mallen
20 Dec 2024
Contributed by Lukas
Subtitle: The complement to control evaluations. One way to show that risk from deploying an AI system is small is by showing that the model is not ...
“Alignment Faking in Large Language Models” by Ryan Greenblatt, Buck Shlegeris
18 Dec 2024
Contributed by Lukas
Subtitle: In our experiments, AIs will often strategically pretend to comply with the training objective to prevent the training process from modifyi...
“Why imperfect adversarial robustness doesn’t doom AI control” by Buck Shlegeris
18 Nov 2024
Contributed by Lukas
Subtitle: There are crucial disanalogies between preventing jailbreaks and preventing misalignment-induced catastrophes. (thanks to Alex Mallen, Co...
“Win/continue/lose scenarios and execute/replace/audit protocols” by Buck Shlegeris
15 Nov 2024
Contributed by Lukas
In this post, I’ll make a technical point that comes up when thinking about risks from scheming AIs from a control perspective. In brief: Consider ...
“Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren’t scheming” by Buck Shlegeris
10 Oct 2024
Contributed by Lukas
One strategy for mitigating risk from schemers (that is, egregiously misaligned models that intentionally try to subvert your safety measures) is beh...
“A basic systems architecture for AI agents that do autonomous research” by Buck Shlegeris
26 Sep 2024
Contributed by Lukas
Subtitle: And diagrams describing how threat scenarios involving misaligned AI involve compromising the system in different places. A lot of threat...
“How to prevent collusion when using untrusted models to monitor each other” by Buck Shlegeris
25 Sep 2024
Contributed by Lukas
Suppose you’ve trained a really clever AI model, and you’re planning to deploy it in an agent scaffold that allows it to run code or take other a...
“Would catching your AIs trying to escape convince AI developers to slow down or undeploy?” by Buck Shlegeris
26 Aug 2024
Contributed by Lukas
Subtitle: I'm not so sure. [Crossposted from LessWrong] I often talk to people who think that if frontier models were egregiously misaligned and po...
“Fields that I reference when thinking about AI takeover prevention” by Buck Shlegeris
13 Aug 2024
Contributed by Lukas
Subtitle: Is AI takeover like a nuclear meltdown? A coup? A plane crash? My day job is thinking about safety measures that aim to reduce catastroph...
“Getting 50% (SoTA) on ARC-AGI with GPT-4o” by Ryan Greenblatt
17 Jun 2024
Contributed by Lukas
Subtitle: You can just draw more samples. I recently got to 50% accuracy on the public test set for ARC-AGI by having GPT-4o generate a huge number...