Redwood Research Blog

Technology | Society & Culture

Episodes

“When does capability elicitation bound risk?” by Josh Clymer

22 Jan 2025

Contributed by Lukas

For a summary of this post see the thread on X. The assumptions behind and limitations of capability elicitation have been discussed in multiple pla...

“How will we update about scheming?” by Ryan Greenblatt

19 Jan 2025

Contributed by Lukas

Subtitle: A quantitative description of how I expect to change my mind. [Cross-posted from LessWrong] I mostly work on risks from scheming (that is...

“Thoughts on the conservative assumptions in AI control” by Buck Shlegeris

17 Jan 2025

Contributed by Lukas

Subtitle: Why are we so friendly to the red team? Work that I’ve done on techniques for mitigating risk from misaligned AI often makes a number o...

“Extending control evaluations to non-scheming threats” by Josh Clymer

13 Jan 2025

Contributed by Lukas

Buck Shlegeris and Ryan Greenblatt originally motivated control evaluations as a way to mitigate risks from ‘scheming’ AI models: models that con...

“Measuring whether AIs can statelessly strategize to subvert security measures” by Buck Shlegeris, Alex Mallen

20 Dec 2024

Contributed by Lukas

Subtitle: The complement to control evaluations. One way to show that risk from deploying an AI system is small is by showing that the model is not ...

“Alignment Faking in Large Language Models” by Ryan Greenblatt, Buck Shlegeris

18 Dec 2024

Contributed by Lukas

Subtitle: In our experiments, AIs will often strategically pretend to comply with the training objective to prevent the training process from modifyi...

“Why imperfect adversarial robustness doesn’t doom AI control” by Buck Shlegeris

18 Nov 2024

Contributed by Lukas

Subtitle: There are crucial disanalogies between preventing jailbreaks and preventing misalignment-induced catastrophes. (thanks to Alex Mallen, Co...

“Win/continue/lose scenarios and execute/replace/audit protocols” by Buck Shlegeris

15 Nov 2024

Contributed by Lukas

In this post, I’ll make a technical point that comes up when thinking about risks from scheming AIs from a control perspective. In brief: Consider ...

“Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren’t scheming” by Buck Shlegeris

10 Oct 2024

Contributed by Lukas

One strategy for mitigating risk from schemers (that is, egregiously misaligned models that intentionally try to subvert your safety measures) is beh...

“A basic systems architecture for AI agents that do autonomous research” by Buck Shlegeris

26 Sep 2024

Contributed by Lukas

Subtitle: And diagrams describing how threat scenarios involving misaligned AI involve compromising the system in different places. A lot of threat...

“How to prevent collusion when using untrusted models to monitor each other” by Buck Shlegeris

25 Sep 2024

Contributed by Lukas

Suppose you’ve trained a really clever AI model, and you’re planning to deploy it in an agent scaffold that allows it to run code or take other a...

“Would catching your AIs trying to escape convince AI developers to slow down or undeploy?” by Buck Shlegeris

26 Aug 2024

Contributed by Lukas

Subtitle: I'm not so sure. [Crossposted from LessWrong] I often talk to people who think that if frontier models were egregiously misaligned and po...

“Fields that I reference when thinking about AI takeover prevention” by Buck Shlegeris

13 Aug 2024

Contributed by Lukas

Subtitle: Is AI takeover like a nuclear meltdown? A coup? A plane crash? My day job is thinking about safety measures that aim to reduce catastroph...

“Getting 50% (SoTA) on ARC-AGI with GPT-4o” by Ryan Greenblatt

17 Jun 2024

Contributed by Lukas

Subtitle: You can just draw more samples. I recently got to 50% accuracy on the public test set for ARC-AGI by having GPT-4o generate a huge number...
