Redwood Research Blog
Episodes
“A taxonomy of barriers to trading with early misaligned AIs” by Alexa Pan
21 Apr 2026
Contributed by Lukas
We might want to strike deals with early misaligned AIs in order to reduce takeover risk and increase our chances of reaching a better future.[1] For...
“Introducing LinuxArena” by Tyler
20 Apr 2026
Contributed by Lukas
Subtitle: A new control setting for more realistic software engineering deployments. We are releasing LinuxArena, a new control setting comprised of...
“Current AIs seem pretty misaligned to me” by Ryan Greenblatt
15 Apr 2026
Contributed by Lukas
Subtitle: In my experience, AIs often oversell their work, downplay problems, and cheat. Many people—especially AI company employees[1]—believe c...
“Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes” by Alex Mallen, Ryan Greenblatt
14 Apr 2026
Contributed by Lukas
Subtitle: Safely navigating the intelligence explosion will require much more careful development. It turns out that Anthropic accidentally trained ...
“Logit ROCs: Monitor TPR is linear in FPR in logit space” by Kerrick Staley, Aryan Bhatt, Julian Stastny
12 Apr 2026
Contributed by Lukas
Summary We study trusted monitoring for AI control, where a weaker trusted model reviews the actions of a stronger untrusted agent and flags suspicio...
“If Mythos actually made Anthropic employees 4x more productive, I would radically shorten my timelines” by Ryan Greenblatt
11 Apr 2026
Contributed by Lukas
Subtitle: Better estimates of uplift at AI companies seem helpful. Anthropic's system card for Mythos Preview says: It's unclear how we should inter...
“My picture of the present in AI” by Ryan Greenblatt
07 Apr 2026
Contributed by Lukas
Subtitle: My predictions about what is going on right now. In this post, I’ll go through some of my best guesses for the current situation in AI a...
“AIs can now often do massive easy-to-verify SWE tasks” by Ryan Greenblatt
06 Apr 2026
Contributed by Lukas
Subtitle: I've updated towards substantially shorter timelines. I’ve recently updated towards substantially shorter AI timelines and much faster p...
“Blocking live failures with synchronous monitors” by James Lucassen, Adam Kaufman
30 Mar 2026
Contributed by Lukas
A common element in many AI control schemes is monitoring – using some model to review actions taken by an untrusted model in order to catch danger...
“Reward-seekers will probably behave according to causal decision theory” by Alex Mallen
28 Mar 2026
Contributed by Lukas
Subtitle: They'd renege on non-binding commitments, defect against copies of themselves in prisoner's dilemmas, etc. Background: There are existing...
“AI’s capability improvements haven’t come from it getting less affordable” by Anders Cairns Woodruff
27 Mar 2026
Contributed by Lukas
Subtitle: AI inference is still cheap relative to human labor. METR's frontier time horizons are doubling every few months, providing substantial ev...
“Are AIs more likely to pursue on-episode or beyond-episode reward?” by Anders Cairns Woodruff, Alex Mallen
12 Mar 2026
Contributed by Lukas
Subtitle: RL would encourage on-episode reward seeking, but beyond-episode reward seekers may learn to goal-guard. Consider an AI that terminally p...
“The case for satiating cheaply-satisfied AI preferences” by Alex Mallen
10 Mar 2026
Contributed by Lukas
Subtitle: Some unintended preferences are cheap to satisfy, and failing to satisfy them needlessly turns a cooperative situation into an adversarial ...
“Announcing ControlConf 2026” by Buck Shlegeris
26 Feb 2026
Contributed by Lukas
Subtitle: Berkeley, April 18-19. We’re running ControlConf, a two-day conference on AI control: the study of reducing risks from misalignment thro...
“Frontier AI companies probably can’t leave the US” by Anders Cairns Woodruff
26 Feb 2026
Contributed by Lukas
Subtitle: The executive branch has substantial authority to restrict the movement of chips and assets, which they are likely to use to preve...
“Will reward-seekers respond to distant incentives?” by Alex Mallen
16 Feb 2026
Contributed by Lukas
Subtitle: Reward-seekers are supposed to be safer because they respond to incentives under developer control. But what if they also respond to incent...
“How do we (more) safely defer to AIs?” by Ryan Greenblatt
12 Feb 2026
Contributed by Lukas
Subtitle: How can we make AIs aligned and well-elicited on extremely hard-to-check, open-ended tasks? As AI systems get more capable, it becomes inc...
“Distinguish between inference scaling and “larger tasks use more compute”” by Ryan Greenblatt
11 Feb 2026
Contributed by Lukas
Subtitle: Most recent progress probably isn't from unsustainable inference scaling. As many have observed, since reasoning models first came out, th...
“Fitness-Seekers: Generalizing the Reward-Seeking Threat Model” by Alex Mallen
29 Jan 2026
Contributed by Lukas
Subtitle: If you think reward-seekers are plausible, you should also think “fitness-seekers” are plausible. But their risks aren’t the same. ...
The inaugural Redwood Research podcast
05 Jan 2026
Contributed by Lukas
After five months of me (Buck) being slow at finishing up the editing on this, we’re finally putting out our inaugural Redwood Research podcast. I t...
“Recent LLMs can do 2-hop and 3-hop latent (no CoT) reasoning on natural facts” by Ryan Greenblatt
01 Jan 2026
Contributed by Lukas
Subtitle: Recent AIs are much better at chaining together knowledge in a single forward pass. Prior work has examined 2-hop latent (as in, no Chain-...
“Measuring no CoT math time horizon (single forward pass)” by Ryan Greenblatt
26 Dec 2025
Contributed by Lukas
Subtitle: Opus 4.5 has around a 3.5-minute 50%-reliability time horizon. A key risk factor for scheming (and misalignment more generally) is opaque r...
“Recent LLMs can use filler tokens or problem repeats to improve (no-CoT) math performance” by Ryan Greenblatt
22 Dec 2025
Contributed by Lukas
Subtitle: AI can sometimes distribute cognition over many extra tokens. Prior results have shown that LLMs released before 2024 can’t leverage ‘...
“BashArena and Control Setting Design” by Adam Kaufman, jlucassen
18 Dec 2025
Contributed by Lukas
We’ve just released BashArena, a new high-stakes control setting we think is a major improvement over the settings we’ve used in the past. In thi...
“The behavioral selection model for predicting AI motivations” by Alex Mallen, Buck Shlegeris
04 Dec 2025
Contributed by Lukas
Subtitle: The basic arguments about AI motivations in one causal graph. Highly capable AI systems might end up deciding the future. Understanding wh...
“Will AI systems drift into misalignment?” by Josh Clymer
15 Nov 2025
Contributed by Lukas
Subtitle: A reason alignment could be hard. Written with Alek Westover and Anshul Khandelwal. This post explores what I’ll call the Alignment Drift...
“What’s up with Anthropic predicting AGI by early 2027?” by Ryan Greenblatt
03 Nov 2025
Contributed by Lukas
Subtitle: I operationalize Anthropic's prediction of "powerful AI" and explain why I'm skeptical. As far as I’m aware, Anthropic is the only AI co...
“Sonnet 4.5’s eval gaming seriously undermines alignment evals” by Alexa Pan, Ryan Greenblatt
30 Oct 2025
Contributed by Lukas
Subtitle: and this seems caused by training on alignment evals. Sonnet 4.5's eval gaming seriously undermines alignment evals and seems caused by tr...
“Should AI Developers Remove Discussion of AI Misalignment from AI Training Data?” by Alek Westover
23 Oct 2025
Contributed by Lukas
There is some concern that training AI systems on content predicting AI misalignment will hyperstition AI systems into misalignment. This has been di...
“Is 90% of code at Anthropic being written by AIs?” by Ryan Greenblatt
22 Oct 2025
Contributed by Lukas
Subtitle: I'm skeptical that Dario's prediction of AIs writing 90% of code in 3-6 months has come true. In March 2025, Dario Amodei (CEO of Anthropi...
“Reducing risk from scheming by studying trained-in scheming behavior” by Ryan Greenblatt
16 Oct 2025
Contributed by Lukas
Subtitle: Can we study scheming by studying AIs trained to act like schemers? In a previous post, I discussed mitigating risks from scheming by stu...
“Iterated Development and Study of Schemers (IDSS)” by Ryan Greenblatt
10 Oct 2025
Contributed by Lukas
Subtitle: A strategy for handling scheming. In a previous post, we discussed prospects for studying scheming using natural examples. In this post, w...
“The Thinking Machines Tinker API is good news for AI control and security” by Buck Shlegeris
09 Oct 2025
Contributed by Lukas
Subtitle: It's a promising design for reducing model access inside AI companies. Last week, Thinking Machines announced Tinker. It's an API for run...
“Plans A, B, C, and D for misalignment risk” by Ryan Greenblatt
08 Oct 2025
Contributed by Lukas
Subtitle: Different plans for different levels of political will. I sometimes think about plans for how to handle misalignment risk. Different level...
“Notes on fatalities from AI takeover” by Ryan Greenblatt
23 Sep 2025
Contributed by Lukas
Subtitle: Large fractions of people will die, but literal human extinction seems unlikely. Suppose misaligned AIs take over. What fraction of people...
“Focus transparency on risk reports, not safety cases” by Ryan Greenblatt
22 Sep 2025
Contributed by Lukas
Subtitle: Transparency about just safety cases would have bad epistemic effects. There are many different things that AI companies could be transpar...
“Prospects for studying actual schemers” by Ryan Greenblatt, Julian Stastny
19 Sep 2025
Contributed by Lukas
Subtitle: Studying actual schemers seems promising but tricky. One natural way to research scheming is to study AIs that are analogous to schemers. ...
“What training data should developers filter to reduce risk from misaligned AI?” by Alek Westover
17 Sep 2025
Contributed by Lukas
Subtitle: An initial narrow proposal. One potentially powerful way to change the properties of AI models is to change their training data. For examp...
“AIs will greatly change engineering in AI companies well before AGI” by Ryan Greenblatt
09 Sep 2025
Contributed by Lukas
Subtitle: AIs that speed up engineering by 2x wouldn't accelerate AI progress that much. In response to my recent post arguing against above-trend p...
“Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro” by Ryan Greenblatt
03 Sep 2025
Contributed by Lukas
Subtitle: Above trend progress due to a rapid increase in RL env quality is unlikely. I've recently written about how I've updated against seeing su...
“Attaching requirements to model releases has serious downsides (relative to a different deadline for these requirements)” by Ryan Greenblatt
27 Aug 2025
Contributed by Lukas
Subtitle: System cards are established but other approaches seem importantly better. Here's a relatively important question regarding transparency r...
“Notes on cooperating with unaligned AI” by Lukas Finnveden
24 Aug 2025
Contributed by Lukas
Subtitle: More thoughts on making deals with schemers. These are some research notes on whether we could reduce AI takeover risk by cooperating with...
“Being honest with AIs” by Lukas Finnveden
21 Aug 2025
Contributed by Lukas
Subtitle: When and why we should refrain from lying. In the future, we might accidentally create AIs with ambitious goals that are misaligned with o...
“My AGI timeline updates from GPT-5 (and 2025 so far)” by Ryan Greenblatt
20 Aug 2025
Contributed by Lukas
Subtitle: AGI before 2029 now seems substantially less likely. As I discussed in a prior post, I felt like there were some reasonably compelling arg...
“Four places where you can put LLM monitoring” by Fabien Roger, Buck Shlegeris
09 Aug 2025
Contributed by Lukas
Subtitle: To wit: LLM APIs, agent scaffolds, code review, and detection-and-response systems. To prevent potentially misaligned LLM agents from taki...
“Should we update against seeing relatively fast AI progress in 2025 and 2026?” by Ryan Greenblatt
28 Jul 2025
Contributed by Lukas
Subtitle: Maybe we should (re)assess the case for relatively fast progress after the GPT-5 release. Around the early o3 announcement (and maybe som...
“Why it’s hard to make settings for high-stakes control research” by Buck Shlegeris
18 Jul 2025
Contributed by Lukas
Subtitle: It's like making challenging evals, but more constrained. One of our main activities at Redwood is writing follow-ups to previous papers o...
“Recent Redwood Research project proposals” by Ryan Greenblatt, Buck Shlegeris, Julian Stastny, Josh Clymer, Alex Mallen, Vivek Hebbar
14 Jul 2025
Contributed by Lukas
Subtitle: Empirical AI security/safety projects across a variety of areas. Previously, we've shared a few higher-effort project proposals relating t...
“What’s worse, spies or schemers?” by Buck Shlegeris, Julian Stastny
09 Jul 2025
Contributed by Lukas
Subtitle: And what if you have both at once? Here are two potential problems you’ll face if you’re an AI lab deploying powerful AI: Spies: So...
“Ryan on the 80,000 Hours podcast” by Buck Shlegeris
08 Jul 2025
Contributed by Lukas
“How much novel security-critical infrastructure do you need during the singularity?” by Buck Shlegeris
05 Jul 2025
Contributed by Lukas
Subtitle: And what does this mean for AI control? I think a lot about the possibility of huge numbers of AI agents doing AI R&D inside an AI co...
“Two proposed projects on abstract analogies for scheming” by Julian Stastny
04 Jul 2025
Contributed by Lukas
Subtitle: We should study methods to train away deeply ingrained behaviors in LLMs that are structurally similar to scheming. In order to empirical...
“There are two fundamentally different constraints on schemers” by Buck Shlegeris
02 Jul 2025
Contributed by Lukas
Subtitle: "They need to act aligned" often isn't precise enough. People (including me) often say that scheming models “have to act as if they were...
“Jankily controlling superintelligence” by Ryan Greenblatt
27 Jun 2025
Contributed by Lukas
Subtitle: How much time can control buy us during the intelligence explosion? When discussing AI control, we often talk about levels of AI capabili...
“What does 10x-ing effective compute get you?” by Ryan Greenblatt
24 Jun 2025
Contributed by Lukas
Subtitle: Once AIs match top humans, what are the returns to further scaling and algorithmic improvement? This is more speculative and confusing th...
“Comparing risk from internally-deployed AI to insider and outsider threats from humans” by Buck Shlegeris
23 Jun 2025
Contributed by Lukas
Subtitle: And why I think insider threat from AI combines the hard parts of both problems. I’ve been thinking a lot recently about the relationsh...
“Making deals with early schemers” by Julian Stastny, Olli Järviniemi, Buck Shlegeris
20 Jun 2025
Contributed by Lukas
Subtitle: ...could help us to prevent takeover attempts from more dangerous misaligned AIs created later. Consider the following vignette: It is Ma...
“Prefix cache untrusted monitors: a method to apply after you catch your AI” by Ryan Greenblatt
20 Jun 2025
Contributed by Lukas
Subtitle: Training the policy to not do egregious bad actions we detect has downsides and we might be able to do better. We often discuss what train...
“AI safety techniques leveraging distillation” by Ryan Greenblatt
19 Jun 2025
Contributed by Lukas
Subtitle: Distillation is cheap; how can we use it to improve safety? It's currently possible to (mostly or fully) cheaply reproduce the performanc...
“When does training a model change its goals?” by Vivek Hebbar, Ryan Greenblatt
12 Jun 2025
Contributed by Lukas
Subtitle: Can a scheming AI's goals really stay unchanged through training? Here are two opposing pictures of how training interacts with deceptive...
“The case for countermeasures to memetic spread of misaligned values” by Alex Mallen
28 May 2025
Contributed by Lukas
Subtitle: Defending against alignment problems that might come with long-term memory. As various people have written about before, AIs that have lon...
“AIs at the current capability level may be important for future safety work” by Ryan Greenblatt
12 May 2025
Contributed by Lukas
Subtitle: Some reasons why relatively weak AIs might still be important when we have very powerful AIs. Sometimes people say that it's much less val...
“Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking” by Julian Stastny, Buck Shlegeris
08 May 2025
Contributed by Lukas
Subtitle: A new analysis of the risk of AIs intentionally performing poorly. In the future, we will want to use powerful AIs on critical tasks such...
“Training-time schemers vs behavioral schemers” by Alex Mallen
06 May 2025
Contributed by Lukas
Subtitle: Clarifying ways in which faking alignment during training is neither necessary nor sufficient for the kind of scheming that AI control trie...
“What’s going on with AI progress and trends? (As of 5/2025)” by Ryan Greenblatt
03 May 2025
Contributed by Lukas
Subtitle: My views on what's driving AI progress and where it's headed. AI progress is driven by improved algorithms and additional compute for tra...
“How can we solve diffuse threats like research sabotage with AI control?” by Vivek Hebbar
30 Apr 2025
Contributed by Lukas
Subtitle: Preventing research sabotage will require techniques very different from the original control paper. Misaligned AIs might engage in resea...
“7+ tractable directions in AI control” by Ryan Greenblatt
29 Apr 2025
Contributed by Lukas
Subtitle: A list of easy-to-start directions in AI control targeted at independent researchers without as much context or compute. In this post, we ...
“Clarifying AI R&D threat models” by Josh Clymer
25 Apr 2025
Contributed by Lukas
Subtitle: (There are a few). A casual reader of one of the many AI company safety frameworks might be confused about why “AI R&D” is listed ...
“How training-gamers might function (and win)” by Vivek Hebbar
24 Apr 2025
Contributed by Lukas
Subtitle: A model of the relationship between higher-level goals, explicit reasoning, and learned heuristics in capable agents. In this post I pres...
“Handling schemers if shutdown is not an option” by Buck Shlegeris
18 Apr 2025
Contributed by Lukas
Subtitle: What if getting strong evidence of scheming isn't the end of your scheming problems, but merely the middle? In most of our research and w...
“Ctrl-Z: Controlling AI Agents via Resampling” by Buck Shlegeris
16 Apr 2025
Contributed by Lukas
Subtitle: A new paper on AI control for agents. We have released a new paper, Ctrl-Z: Controlling AI Agents via Resampling. This is the largest and...
“To be legible, evidence of misalignment probably has to be behavioral” by Ryan Greenblatt
15 Apr 2025
Contributed by Lukas
Subtitle: Evidence from just model internals (e.g., interpretability) is unlikely to be broadly convincing. One key hope for mitigating risk from mi...
“Why do misalignment risks increase as AIs get more capable?” by Ryan Greenblatt
11 Apr 2025
Contributed by Lukas
Subtitle: A breakdown of how higher capabilities increase risk. It's generally agreed that as AIs get more capable, risks from misalignment increase...
“An overview of areas of control work” by Ryan Greenblatt
09 Apr 2025
Contributed by Lukas
Subtitle: What are all the research and implementation areas helpful for control? In this post, I'll list all the areas of control research (and im...
“An overview of control measures” by Ryan Greenblatt
06 Apr 2025
Contributed by Lukas
Subtitle: What methods can we use to ensure control? We often talk about ensuring control, which in the context of this doc refers to preventing AI...
“Buck on the 80,000 Hours podcast” by Buck Shlegeris
05 Apr 2025
Contributed by Lukas
My podcast with Rob Wiblin from 80,000 Hours just came out. I’m really happy with how it turned out. I talked abou...
“Notes on countermeasures for exploration hacking (aka sandbagging)” by Ryan Greenblatt
04 Apr 2025
Contributed by Lukas
Subtitle: How can we prevent AIs from intentionally underperforming on our metrics? If we naively apply RL to a scheming AI, the AI may be able to ...
“Notes on handling non-concentrated failures with AI control: high level methods and different regimes” by Ryan Greenblatt
03 Apr 2025
Contributed by Lukas
Subtitle: What are the methods and issues when failures occur diffusely over many actions? In this post, I'll try to explain my current understandi...
“Prioritizing threats for AI control” by Ryan Greenblatt
19 Mar 2025
Contributed by Lukas
Subtitle: What are the main threats and how should we prioritize them? We often talk about ensuring control, which in the context of this doc refer...
“How might we safely pass the buck to AI?” by Josh Clymer
19 Feb 2025
Contributed by Lukas
Subtitle: Developing AI employees that are safer than human ones. My goal as an AI safety researcher is to put myself out of a job. I don’t worry ...
“Takeaways from sketching a control safety case” by Josh Clymer
30 Jan 2025
Contributed by Lukas
Subtitle: Insights from a long technical paper compressed into a fun little commentary. Buck Shlegeris and I recently published a paper with UK AISI...
“Planning for Extreme AI Risks” by Josh Clymer
29 Jan 2025
Contributed by Lukas
Subtitle: Are we ready for this? This post should not be taken as a polished recommendation to AI companies and instead should be treated as an inf...
“Ten people on the inside” by Buck Shlegeris
28 Jan 2025
Contributed by Lukas
Subtitle: A scary scenario that's worth planning for. (Many of these ideas developed in conversation with Ryan Greenblatt) In a shortform, I describ...
“When does capability elicitation bound risk?” by Josh Clymer
22 Jan 2025
Contributed by Lukas
For a summary of this post see the thread on X. The assumptions behind and limitations of capability elicitation have been discussed in multiple pla...
“How will we update about scheming?” by Ryan Greenblatt
19 Jan 2025
Contributed by Lukas
Subtitle: A quantitative description of how I expect to change my mind. [Cross-posted from LessWrong] I mostly work on risks from scheming (that is...
“Thoughts on the conservative assumptions in AI control” by Buck Shlegeris
17 Jan 2025
Contributed by Lukas
Subtitle: Why are we so friendly to the red team? Work that I’ve done on techniques for mitigating risk from misaligned AI often makes a number o...
“Extending control evaluations to non-scheming threats” by Josh Clymer
13 Jan 2025
Contributed by Lukas
Buck Shlegeris and Ryan Greenblatt originally motivated control evaluations as a way to mitigate risks from ‘scheming’ AI models: models that con...
“Measuring whether AIs can statelessly strategize to subvert security measures” by Buck Shlegeris, Alex Mallen
20 Dec 2024
Contributed by Lukas
Subtitle: The complement to control evaluations. One way to show that risk from deploying an AI system is small is by showing that the model is not ...
“Alignment Faking in Large Language Models” by Ryan Greenblatt, Buck Shlegeris
18 Dec 2024
Contributed by Lukas
Subtitle: In our experiments, AIs will often strategically pretend to comply with the training objective to prevent the training process from modifyi...
“Why imperfect adversarial robustness doesn’t doom AI control” by Buck Shlegeris
18 Nov 2024
Contributed by Lukas
Subtitle: There are crucial disanalogies between preventing jailbreaks and preventing misalignment-induced catastrophes. (thanks to Alex Mallen, Co...
“Win/continue/lose scenarios and execute/replace/audit protocols” by Buck Shlegeris
15 Nov 2024
Contributed by Lukas
In this post, I’ll make a technical point that comes up when thinking about risks from scheming AIs from a control perspective. In brief: Consider ...
“Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren’t scheming” by Buck Shlegeris
10 Oct 2024
Contributed by Lukas
One strategy for mitigating risk from schemers (that is, egregiously misaligned models that intentionally try to subvert your safety measures) is beh...
“A basic systems architecture for AI agents that do autonomous research” by Buck Shlegeris
26 Sep 2024
Contributed by Lukas
Subtitle: And diagrams describing how threat scenarios involving misaligned AI involve compromising the system in different places. A lot of threat...
“How to prevent collusion when using untrusted models to monitor each other” by Buck Shlegeris
25 Sep 2024
Contributed by Lukas
Suppose you’ve trained a really clever AI model, and you’re planning to deploy it in an agent scaffold that allows it to run code or take other a...
“Would catching your AIs trying to escape convince AI developers to slow down or undeploy?” by Buck Shlegeris
26 Aug 2024
Contributed by Lukas
Subtitle: I'm not so sure. [Crossposted from LessWrong] I often talk to people who think that if frontier models were egregiously misaligned and po...