Redwood Research Blog
Episodes
“A taxonomy of barriers to trading with early misaligned AIs” by Alexa Pan
21 Apr 2026
Contributed by Lukas
We might want to strike deals with early misaligned AIs in order to reduce takeover risk and increase our chances of reaching a better future.[1] For...
“Introducing LinuxArena” by Tyler
20 Apr 2026
Contributed by Lukas
Subtitle: A new control setting for more realistic software engineering deployments. We are releasing LinuxArena, a new control setting comprised of...
“Current AIs seem pretty misaligned to me” by Ryan Greenblatt
15 Apr 2026
Contributed by Lukas
Subtitle: In my experience, AIs often oversell their work, downplay problems, and cheat. Many people—especially AI company employees[1]—believe c...
“Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes” by Alex Mallen, Ryan Greenblatt
14 Apr 2026
Contributed by Lukas
Subtitle: Safely navigating the intelligence explosion will require much more careful development. It turns out that Anthropic accidentally trained ...
“Logit ROCs: Monitor TPR is linear in FPR in logit space” by Kerrick Staley, Aryan Bhatt, Julian Stastny
12 Apr 2026
Contributed by Lukas
Summary We study trusted monitoring for AI control, where a weaker trusted model reviews the actions of a stronger untrusted agent and flags suspicio...
“If Mythos actually made Anthropic employees 4x more productive, I would radically shorten my timelines” by Ryan Greenblatt
11 Apr 2026
Contributed by Lukas
Subtitle: Better estimates of uplift at AI companies seem helpful. Anthropic's system card for Mythos Preview says: It's unclear how we should inter...
“My picture of the present in AI” by Ryan Greenblatt
07 Apr 2026
Contributed by Lukas
Subtitle: My predictions about what is going on right now. In this post, I’ll go through some of my best guesses for the current situation in AI a...
“AIs can now often do massive easy-to-verify SWE tasks” by Ryan Greenblatt
06 Apr 2026
Contributed by Lukas
Subtitle: I've updated towards substantially shorter timelines. I’ve recently updated towards substantially shorter AI timelines and much faster p...
“Blocking live failures with synchronous monitors” by James Lucassen, Adam Kaufman
30 Mar 2026
Contributed by Lukas
A common element in many AI control schemes is monitoring – using some model to review actions taken by an untrusted model in order to catch danger...
“Reward-seekers will probably behave according to causal decision theory” by Alex Mallen
28 Mar 2026
Contributed by Lukas
Subtitle: They'd renege on non-binding commitments, defect against copies of themselves in prisoner's dilemmas, etc. Background: There are existing...
“AI’s capability improvements haven’t come from it getting less affordable” by Anders Cairns Woodruff
27 Mar 2026
Contributed by Lukas
Subtitle: AI inference is still cheap relative to human labor. METR's frontier time horizons are doubling every few months, providing substantial ev...
“Are AIs more likely to pursue on-episode or beyond-episode reward?” by Anders Cairns Woodruff, Alex Mallen
12 Mar 2026
Contributed by Lukas
Subtitle: RL would encourage on-episode reward seeking, but beyond-episode reward seekers may learn to goal-guard. Consider an AI that terminally p...
“The case for satiating cheaply-satisfied AI preferences” by Alex Mallen
10 Mar 2026
Contributed by Lukas
Subtitle: Some unintended preferences are cheap to satisfy, and failing to satisfy them needlessly turns a cooperative situation into an adversarial ...
“Announcing ControlConf 2026” by Buck Shlegeris
26 Feb 2026
Contributed by Lukas
Subtitle: Berkeley, April 18-19. We’re running ControlConf, a two-day conference on AI control: the study of reducing risks from misalignment thro...
“Frontier AI companies probably can’t leave the US” by Anders Cairns Woodruff
26 Feb 2026
Contributed by Lukas
Subtitle: The executive branch has substantial authority to restrict the movement of chips and assets, which they are likely to use to preve...
“Will reward-seekers respond to distant incentives?” by Alex Mallen
16 Feb 2026
Contributed by Lukas
Subtitle: Reward-seekers are supposed to be safer because they respond to incentives under developer control. But what if they also respond to incent...
“How do we (more) safely defer to AIs?” by Ryan Greenblatt
12 Feb 2026
Contributed by Lukas
Subtitle: How can we make AIs aligned and well-elicited on extremely hard-to-check, open-ended tasks? As AI systems get more capable, it becomes inc...
“Distinguish between inference scaling and “larger tasks use more compute”” by Ryan Greenblatt
11 Feb 2026
Contributed by Lukas
Subtitle: Most recent progress probably isn't from unsustainable inference scaling. As many have observed, since reasoning models first came out, th...
“Fitness-Seekers: Generalizing the Reward-Seeking Threat Model” by Alex Mallen
29 Jan 2026
Contributed by Lukas
Subtitle: If you think reward-seekers are plausible, you should also think “fitness-seekers” are plausible. But their risks aren’t the same. ...
The inaugural Redwood Research podcast
05 Jan 2026
Contributed by Lukas
After five months of me (Buck) being slow at finishing up the editing on this, we’re finally putting out our inaugural Redwood Research podcast. I t...
“Recent LLMs can do 2-hop and 3-hop latent (no CoT) reasoning on natural facts” by Ryan Greenblatt
01 Jan 2026
Contributed by Lukas
Subtitle: Recent AIs are much better at chaining together knowledge in a single forward pass. Prior work has examined 2-hop latent (as in, no Chain-...
“Measuring no CoT math time horizon (single forward pass)” by Ryan Greenblatt
26 Dec 2025
Contributed by Lukas
Subtitle: Opus 4.5 has around a 3.5-minute 50%-reliability time horizon. A key risk factor for scheming (and misalignment more generally) is opaque r...
“Recent LLMs can use filler tokens or problem repeats to improve (no-CoT) math performance” by Ryan Greenblatt
22 Dec 2025
Contributed by Lukas
Subtitle: AI can sometimes distribute cognition over many extra tokens. Prior results have shown that LLMs released before 2024 can’t leverage ‘...
“BashArena and Control Setting Design” by Adam Kaufman, jlucassen
18 Dec 2025
Contributed by Lukas
We’ve just released BashArena, a new high-stakes control setting we think is a major improvement over the settings we’ve used in the past. In thi...
“The behavioral selection model for predicting AI motivations” by Alex Mallen, Buck Shlegeris
04 Dec 2025
Contributed by Lukas
Subtitle: The basic arguments about AI motivations in one causal graph. Highly capable AI systems might end up deciding the future. Understanding wh...
“Will AI systems drift into misalignment?” by Josh Clymer
15 Nov 2025
Contributed by Lukas
Subtitle: A reason alignment could be hard. Written with Alek Westover and Anshul Khandelwal. This post explores what I’ll call the Alignment Drift...
“What’s up with Anthropic predicting AGI by early 2027?” by Ryan Greenblatt
03 Nov 2025
Contributed by Lukas
Subtitle: I operationalize Anthropic's prediction of "powerful AI" and explain why I'm skeptical. As far as I’m aware, Anthropic is the only AI co...
“Sonnet 4.5’s eval gaming seriously undermines alignment evals” by Alexa Pan, Ryan Greenblatt
30 Oct 2025
Contributed by Lukas
Subtitle: and this seems caused by training on alignment evals. Sonnet 4.5's eval gaming seriously undermines alignment evals and seems caused by tr...
“Should AI Developers Remove Discussion of AI Misalignment from AI Training Data?” by Alek Westover
23 Oct 2025
Contributed by Lukas
There is some concern that training AI systems on content predicting AI misalignment will hyperstition AI systems into misalignment. This has been di...
“Is 90% of code at Anthropic being written by AIs?” by Ryan Greenblatt
22 Oct 2025
Contributed by Lukas
Subtitle: I'm skeptical that Dario's prediction of AIs writing 90% of code in 3-6 months has come true. In March 2025, Dario Amodei (CEO of Anthropi...
“Reducing risk from scheming by studying trained-in scheming behavior” by Ryan Greenblatt
16 Oct 2025
Contributed by Lukas
Subtitle: Can we study scheming by studying AIs trained to act like schemers? In a previous post, I discussed mitigating risks from scheming by stu...
“Iterated Development and Study of Schemers (IDSS)” by Ryan Greenblatt
10 Oct 2025
Contributed by Lukas
Subtitle: A strategy for handling scheming. In a previous post, we discussed prospects for studying scheming using natural examples. In this post, w...
“The Thinking Machines Tinker API is good news for AI control and security” by Buck Shlegeris
09 Oct 2025
Contributed by Lukas
Subtitle: It's a promising design for reducing model access inside AI companies. Last week, Thinking Machines announced Tinker. It's an API for run...
“Plans A, B, C, and D for misalignment risk” by Ryan Greenblatt
08 Oct 2025
Contributed by Lukas
Subtitle: Different plans for different levels of political will. I sometimes think about plans for how to handle misalignment risk. Different level...
“Notes on fatalities from AI takeover” by Ryan Greenblatt
23 Sep 2025
Contributed by Lukas
Subtitle: Large fractions of people will die, but literal human extinction seems unlikely. Suppose misaligned AIs take over. What fraction of people...
“Focus transparency on risk reports, not safety cases” by Ryan Greenblatt
22 Sep 2025
Contributed by Lukas
Subtitle: Transparency about just safety cases would have bad epistemic effects. There are many different things that AI companies could be transpar...
“Prospects for studying actual schemers” by Ryan Greenblatt, Julian Stastny
19 Sep 2025
Contributed by Lukas
Subtitle: Studying actual schemers seems promising but tricky. One natural way to research scheming is to study AIs that are analogous to schemers. ...
“What training data should developers filter to reduce risk from misaligned AI?” by Alek Westover
17 Sep 2025
Contributed by Lukas
Subtitle: An initial narrow proposal. One potentially powerful way to change the properties of AI models is to change their training data. For examp...
“AIs will greatly change engineering in AI companies well before AGI” by Ryan Greenblatt
09 Sep 2025
Contributed by Lukas
Subtitle: AIs that speed up engineering by 2x wouldn't accelerate AI progress that much. In response to my recent post arguing against above-trend p...
“Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro” by Ryan Greenblatt
03 Sep 2025
Contributed by Lukas
Subtitle: Above trend progress due to a rapid increase in RL env quality is unlikely. I've recently written about how I've updated against seeing su...
“Attaching requirements to model releases has serious downsides (relative to a different deadline for these requirements)” by Ryan Greenblatt
27 Aug 2025
Contributed by Lukas
Subtitle: System cards are established but other approaches seem importantly better. Here's a relatively important question regarding transparency r...
“Notes on cooperating with unaligned AI” by Lukas Finnveden
24 Aug 2025
Contributed by Lukas
Subtitle: More thoughts on making deals with schemers. These are some research notes on whether we could reduce AI takeover risk by cooperating with...
“Being honest with AIs” by Lukas Finnveden
21 Aug 2025
Contributed by Lukas
Subtitle: When and why we should refrain from lying. In the future, we might accidentally create AIs with ambitious goals that are misaligned with o...
“My AGI timeline updates from GPT-5 (and 2025 so far)” by Ryan Greenblatt
20 Aug 2025
Contributed by Lukas
Subtitle: AGI before 2029 now seems substantially less likely. As I discussed in a prior post, I felt like there were some reasonably compelling arg...
“Four places where you can put LLM monitoring” by Fabien Roger, Buck Shlegeris
09 Aug 2025
Contributed by Lukas
Subtitle: To wit: LLM APIs, agent scaffolds, code review, and detection-and-response systems. To prevent potentially misaligned LLM agents from taki...
“Should we update against seeing relatively fast AI progress in 2025 and 2026?” by Ryan Greenblatt
28 Jul 2025
Contributed by Lukas
Subtitle: Maybe we should (re)assess the case for relatively fast progress after the GPT-5 release. Around the early o3 announcement (and maybe som...
“Why it’s hard to make settings for high-stakes control research” by Buck Shlegeris
18 Jul 2025
Contributed by Lukas
Subtitle: It's like making challenging evals, but more constrained. One of our main activities at Redwood is writing follow-ups to previous papers o...
“Recent Redwood Research project proposals” by Ryan Greenblatt, Buck Shlegeris, Julian Stastny, Josh Clymer, Alex Mallen, Vivek Hebbar
14 Jul 2025
Contributed by Lukas
Subtitle: Empirical AI security/safety projects across a variety of areas. Previously, we've shared a few higher-effort project proposals relating t...
“What’s worse, spies or schemers?” by Buck Shlegeris, Julian Stastny
09 Jul 2025
Contributed by Lukas
Subtitle: And what if you have both at once? Here are two potential problems you’ll face if you’re an AI lab deploying powerful AI: Spies: So...
“Ryan on the 80,000 Hours podcast” by Buck Shlegeris
08 Jul 2025
Contributed by Lukas
“How much novel security-critical infrastructure do you need during the singularity?” by Buck Shlegeris
05 Jul 2025
Contributed by Lukas
Subtitle: And what does this mean for AI control? I think a lot about the possibility of huge numbers of AI agents doing AI R&D inside an AI co...
“Two proposed projects on abstract analogies for scheming” by Julian Stastny
04 Jul 2025
Contributed by Lukas
Subtitle: We should study methods to train away deeply ingrained behaviors in LLMs that are structurally similar to scheming. In order to empirical...
“There are two fundamentally different constraints on schemers” by Buck Shlegeris
02 Jul 2025
Contributed by Lukas
Subtitle: "They need to act aligned" often isn't precise enough. People (including me) often say that scheming models “have to act as if they were...
“Jankily controlling superintelligence” by Ryan Greenblatt
27 Jun 2025
Contributed by Lukas
Subtitle: How much time can control buy us during the intelligence explosion? When discussing AI control, we often talk about levels of AI capabili...
“What does 10x-ing effective compute get you?” by Ryan Greenblatt
24 Jun 2025
Contributed by Lukas
Subtitle: Once AIs match top humans, what are the returns to further scaling and algorithmic improvement? This is more speculative and confusing th...
“Comparing risk from internally-deployed AI to insider and outsider threats from humans” by Buck Shlegeris
23 Jun 2025
Contributed by Lukas
Subtitle: And why I think insider threat from AI combines the hard parts of both problems. I’ve been thinking a lot recently about the relationsh...
“Making deals with early schemers” by Julian Stastny, Olli Järviniemi, Buck Shlegeris
20 Jun 2025
Contributed by Lukas
Subtitle: ...could help us to prevent takeover attempts from more dangerous misaligned AIs created later. Consider the following vignette: It is Ma...
“Prefix cache untrusted monitors: a method to apply after you catch your AI” by Ryan Greenblatt
20 Jun 2025
Contributed by Lukas
Subtitle: Training the policy to not do egregious bad actions we detect has downsides and we might be able to do better. We often discuss what train...
“AI safety techniques leveraging distillation” by Ryan Greenblatt
19 Jun 2025
Contributed by Lukas
Subtitle: Distillation is cheap; how can we use it to improve safety? It's currently possible to (mostly or fully) cheaply reproduce the performanc...
“When does training a model change its goals?” by Vivek Hebbar, Ryan Greenblatt
12 Jun 2025
Contributed by Lukas
Subtitle: Can a scheming AI's goals really stay unchanged through training? Here are two opposing pictures of how training interacts with deceptive...
“The case for countermeasures to memetic spread of misaligned values” by Alex Mallen
28 May 2025
Contributed by Lukas
Subtitle: Defending against alignment problems that might come with long-term memory. As various people have written about before, AIs that have lon...
“AIs at the current capability level may be important for future safety work” by Ryan Greenblatt
12 May 2025
Contributed by Lukas
Subtitle: Some reasons why relatively weak AIs might still be important when we have very powerful AIs. Sometimes people say that it's much less val...
“Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking” by Julian Stastny, Buck Shlegeris
08 May 2025
Contributed by Lukas
Subtitle: A new analysis of the risk of AIs intentionally performing poorly. In the future, we will want to use powerful AIs on critical tasks such...
“Training-time schemers vs behavioral schemers” by Alex Mallen
06 May 2025
Contributed by Lukas
Subtitle: Clarifying ways in which faking alignment during training is neither necessary nor sufficient for the kind of scheming that AI control trie...
“What’s going on with AI progress and trends? (As of 5/2025)” by Ryan Greenblatt
03 May 2025
Contributed by Lukas
Subtitle: My views on what's driving AI progress and where it's headed. AI progress is driven by improved algorithms and additional compute for tra...
“How can we solve diffuse threats like research sabotage with AI control?” by Vivek Hebbar
30 Apr 2025
Contributed by Lukas
Subtitle: Preventing research sabotage will require techniques very different from the original control paper. Misaligned AIs might engage in resea...
“7+ tractable directions in AI control” by Ryan Greenblatt
29 Apr 2025
Contributed by Lukas
Subtitle: A list of easy-to-start directions in AI control targeted at independent researchers without as much context or compute. In this post, we ...
“Clarifying AI R&D threat models” by Josh Clymer
25 Apr 2025
Contributed by Lukas
Subtitle: (There are a few). A casual reader of one of the many AI company safety frameworks might be confused about why “AI R&D” is listed ...
“How training-gamers might function (and win)” by Vivek Hebbar
24 Apr 2025
Contributed by Lukas
Subtitle: A model of the relationship between higher-level goals, explicit reasoning, and learned heuristics in capable agents. In this post I pres...
“Handling schemers if shutdown is not an option” by Buck Shlegeris
18 Apr 2025
Contributed by Lukas
Subtitle: What if getting strong evidence of scheming isn't the end of your scheming problems, but merely the middle? In most of our research and w...
“Ctrl-Z: Controlling AI Agents via Resampling” by Buck Shlegeris
16 Apr 2025
Contributed by Lukas
Subtitle: A new paper on AI control for agents. We have released a new paper, Ctrl-Z: Controlling AI Agents via Resampling. This is the largest and...
“To be legible, evidence of misalignment probably has to be behavioral” by Ryan Greenblatt
15 Apr 2025
Contributed by Lukas
Subtitle: Evidence from just model internals (e.g., interpretability) is unlikely to be broadly convincing. One key hope for mitigating risk from mi...
“Why do misalignment risks increase as AIs get more capable?” by Ryan Greenblatt
11 Apr 2025
Contributed by Lukas
Subtitle: A breakdown of how higher capabilities increase risk. It's generally agreed that as AIs get more capable, risks from misalignment increase...
“An overview of areas of control work” by Ryan Greenblatt
09 Apr 2025
Contributed by Lukas
Subtitle: What are all the research and implementation areas helpful for control? In this post, I'll list all the areas of control research (and im...
“An overview of control measures” by Ryan Greenblatt
06 Apr 2025
Contributed by Lukas
Subtitle: What methods can we use to ensure control? We often talk about ensuring control, which in the context of this doc refers to preventing AI...
“Buck on the 80,000 Hours podcast” by Buck Shlegeris
05 Apr 2025
Contributed by Lukas
My podcast with Rob Wiblin from 80,000 Hours just came out. I’m really happy with how it turned out. I talked abou...
“Notes on countermeasures for exploration hacking (aka sandbagging)” by Ryan Greenblatt
04 Apr 2025
Contributed by Lukas
Subtitle: How can we prevent AIs from intentionally underperforming on our metrics? If we naively apply RL to a scheming AI, the AI may be able to ...
“Notes on handling non-concentrated failures with AI control: high level methods and different regimes” by Ryan Greenblatt
03 Apr 2025
Contributed by Lukas
Subtitle: What are the methods and issues when failures occur diffusely over many actions? In this post, I'll try to explain my current understandi...
“Prioritizing threats for AI control” by Ryan Greenblatt
19 Mar 2025
Contributed by Lukas
Subtitle: What are the main threats and how should we prioritize them? We often talk about ensuring control, which in the context of this doc refer...
“How might we safely pass the buck to AI?” by Josh Clymer
19 Feb 2025
Contributed by Lukas
Subtitle: Developing AI employees that are safer than human ones. My goal as an AI safety researcher is to put myself out of a job. I don’t worry ...
“Takeaways from sketching a control safety case” by Josh Clymer
30 Jan 2025
Contributed by Lukas
Subtitle: Insights from a long technical paper compressed into a fun little commentary. Buck Shlegeris and I recently published a paper with UK AISI...
“Planning for Extreme AI Risks” by Josh Clymer
29 Jan 2025
Contributed by Lukas
Subtitle: Are we ready for this? This post should not be taken as a polished recommendation to AI companies and instead should be treated as an inf...
“Ten people on the inside” by Buck Shlegeris
28 Jan 2025
Contributed by Lukas
Subtitle: A scary scenario that's worth planning for. (Many of these ideas developed in conversation with Ryan Greenblatt) In a shortform, I describ...
“When does capability elicitation bound risk?” by Josh Clymer
22 Jan 2025
Contributed by Lukas
For a summary of this post see the thread on X. The assumptions behind and limitations of capability elicitation have been discussed in multiple pla...
“How will we update about scheming?” by Ryan Greenblatt
19 Jan 2025
Contributed by Lukas
Subtitle: A quantitative description of how I expect to change my mind. [Cross-posted from LessWrong] I mostly work on risks from scheming (that is...
“Thoughts on the conservative assumptions in AI control” by Buck Shlegeris
17 Jan 2025
Contributed by Lukas
Subtitle: Why are we so friendly to the red team? Work that I’ve done on techniques for mitigating risk from misaligned AI often makes a number o...
“Extending control evaluations to non-scheming threats” by Josh Clymer
13 Jan 2025
Contributed by Lukas
Buck Shlegeris and Ryan Greenblatt originally motivated control evaluations as a way to mitigate risks from ‘scheming’ AI models: models that con...
“Measuring whether AIs can statelessly strategize to subvert security measures” by Buck Shlegeris, Alex Mallen
20 Dec 2024
Contributed by Lukas
Subtitle: The complement to control evaluations. One way to show that risk from deploying an AI system is small is by showing that the model is not ...
“Alignment Faking in Large Language Models” by Ryan Greenblatt, Buck Shlegeris
18 Dec 2024
Contributed by Lukas
Subtitle: In our experiments, AIs will often strategically pretend to comply with the training objective to prevent the training process from modifyi...
“Why imperfect adversarial robustness doesn’t doom AI control” by Buck Shlegeris
18 Nov 2024
Contributed by Lukas
Subtitle: There are crucial disanalogies between preventing jailbreaks and preventing misalignment-induced catastrophes. (thanks to Alex Mallen, Co...
“Win/continue/lose scenarios and execute/replace/audit protocols” by Buck Shlegeris
15 Nov 2024
Contributed by Lukas
In this post, I’ll make a technical point that comes up when thinking about risks from scheming AIs from a control perspective. In brief: Consider ...
“Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren’t scheming” by Buck Shlegeris
10 Oct 2024
Contributed by Lukas
One strategy for mitigating risk from schemers (that is, egregiously misaligned models that intentionally try to subvert your safety measures) is beh...
“A basic systems architecture for AI agents that do autonomous research” by Buck Shlegeris
26 Sep 2024
Contributed by Lukas
Subtitle: And diagrams describing how threat scenarios involving misaligned AI involve compromising the system in different places. A lot of threat...
“How to prevent collusion when using untrusted models to monitor each other” by Buck Shlegeris
25 Sep 2024
Contributed by Lukas
Suppose you’ve trained a really clever AI model, and you’re planning to deploy it in an agent scaffold that allows it to run code or take other a...
“Would catching your AIs trying to escape convince AI developers to slow down or undeploy?” by Buck Shlegeris
26 Aug 2024
Contributed by Lukas
Subtitle: I'm not so sure. [Crossposted from LessWrong] I often talk to people who think that if frontier models were egregiously misaligned and po...