AXRP - the AI X-risk Research Podcast
Episodes
46 - Tom Davidson on AI-enabled Coups
07 Aug 2025
Could AI enable a small group to gain power over a large country, and lock in their power permanently? Often, people worried about catastrophic risks ...
45 - Samuel Albanie on DeepMind's AGI Safety Approach
06 Jul 2025
In this episode, I chat with Samuel Albanie about the Google DeepMind paper he co-authored called "An Approach to Technical AGI Safety and Security". ...
44 - Peter Salib on AI Rights for Human Safety
28 Jun 2025
In this episode, I talk with Peter Salib about his paper "AI Rights for Human Safety", arguing that giving AIs the right to contract, hold property, a...
43 - David Lindner on Myopic Optimization with Non-myopic Approval
15 Jun 2025
In this episode, I talk with David Lindner about Myopic Optimization with Non-myopic Approval, or MONA, which attempts to address (multi-step) reward ...
42 - Owain Evans on LLM Psychology
06 Jun 2025
Earlier this year, the paper "Emergent Misalignment" made the rounds on AI x-risk social media for seemingly showing LLMs generalizing from 'misaligne...
41 - Lee Sharkey on Attribution-based Parameter Decomposition
03 Jun 2025
What's the next step forward in interpretability? In this episode, I chat with Lee Sharkey about his proposal for detecting computational mechanisms w...
40 - Jason Gross on Compact Proofs and Interpretability
28 Mar 2025
How do we figure out whether interpretability is doing its job? One way is to see if it helps us prove things about models that we care about knowing....
38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future
01 Mar 2025
In this episode, I chat with David Duvenaud about two topics he's been thinking about: firstly, a paper he wrote about evaluating whether or not front...
38.7 - Anthony Aguirre on the Future of Life Institute
09 Feb 2025
The Future of Life Institute is one of the oldest and most prominent organizations in the AI existential safety space, working on such topics as the A...
38.6 - Joel Lehman on Positive Visions of AI
24 Jan 2025
Typically this podcast talks about how to avert destruction from AI. But what would it take to ensure AI promotes human flourishing as well as it can?...
38.5 - Adrià Garriga-Alonso on Detecting AI Scheming
20 Jan 2025
Suppose we're worried about AIs engaging in long-term plans that they don't tell us about. If we were to peek inside their brains, what should we look...
38.4 - Shakeel Hashim on AI Journalism
05 Jan 2025
AI researchers often complain about the poor coverage of their work in the news media. But why is this happening, and how can it be fixed? In this epi...
38.3 - Erik Jenner on Learned Look-Ahead
12 Dec 2024
Lots of people in the AI safety space worry about models being able to make deliberate, multi-step plans. But can we already see this in existing neur...
39 - Evan Hubinger on Model Organisms of Misalignment
01 Dec 2024
The 'model organisms of misalignment' line of research creates AI models that exhibit various types of misalignment, and studies them to try to unders...
38.2 - Jesse Hoogland on Singular Learning Theory
27 Nov 2024
You may have heard of singular learning theory, and its "local learning coefficient", or LLC - but have you heard of the refined LLC? In this episode,...
38.1 - Alan Chan on Agent Infrastructure
16 Nov 2024
Road lines, street lights, and licence plates are examples of infrastructure used to ensure that roads operate smoothly. In this episode, Alan Chan ta...
38.0 - Zhijing Jin on LLMs, Causality, and Multi-Agent Systems
14 Nov 2024
Do language models understand the causal structure of the world, or do they merely note correlations? And what happens when you build a big AI society...
37 - Jaime Sevilla on AI Forecasting
04 Oct 2024
Epoch AI is the premier organization that tracks the trajectory of AI - how much compute is used, the role of algorithmic improvements, the growth in ...
36 - Adam Shai and Paul Riechers on Computational Mechanics
29 Sep 2024
Sometimes, people talk about transformers as having "world models" as a result of being trained to predict text data on the internet. But what does th...
New Patreon tiers + MATS applications
28 Sep 2024
Patreon: https://www.patreon.com/axrpodcast MATS: https://www.matsprogram.org Note: I'm employed by MATS, but they're not paying me to make this video...
35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization
24 Aug 2024
How do we figure out what large language models believe? In fact, do they even have beliefs? Do those beliefs have locations, and if so, can we edit t...
34 - AI Evaluations with Beth Barnes
28 Jul 2024
How can we figure out if AIs are capable enough to pose a threat to humans? When should we make a big effort to mitigate risks of catastrophic AI misb...
33 - RLHF Problems with Scott Emmons
12 Jun 2024
Reinforcement Learning from Human Feedback, or RLHF, is one of the main ways that makers of large language models make them 'aligned'. But people have...
32 - Understanding Agency with Jan Kulveit
30 May 2024
What's the difference between a large language model and the human brain? And what's wrong with our theories of agency? In this episode, I chat about ...
31 - Singular Learning Theory with Daniel Murfet
07 May 2024
What's going on with deep learning? What sorts of models get learned, and what are the learning dynamics? Singular learning theory is a theory of Baye...
30 - AI Security with Jeffrey Ladish
30 Apr 2024
Top labs use various forms of "safety training" on models before their release to make sure they don't do nasty stuff - but how robust is that? How ca...
29 - Science of Deep Learning with Vikrant Varma
25 Apr 2024
In 2022, it was announced that a fairly simple method can be used to extract the true beliefs of a language model on any given topic, without having t...
28 - Suing Labs for AI Risk with Gabriel Weil
17 Apr 2024
How should the law govern AI? Those concerned about existential risks often push either for bans or for regulations meant to ensure that AI is develop...
27 - AI Control with Buck Shlegeris and Ryan Greenblatt
11 Apr 2024
A lot of work to prevent AI existential risk takes the form of ensuring that AIs don't want to cause harm or take over the world---or in other words, ...
26 - AI Governance with Elizabeth Seger
26 Nov 2023
The events of this year have highlighted important questions about the governance of artificial intelligence. For instance, what does it mean to democ...
25 - Cooperative AI with Caspar Oesterheld
03 Oct 2023
Imagine a world where there are many powerful AI systems, working at cross purposes. You could suppose that different governments use AIs to manage th...
24 - Superalignment with Jan Leike
27 Jul 2023
Recently, OpenAI made a splash by announcing a new "Superalignment" team. Led by Jan Leike and Ilya Sutskever, the team would consist of top research...
23 - Mechanistic Anomaly Detection with Mark Xu
27 Jul 2023
Is there some way we can detect bad behaviour in our AI system without having to know exactly what it looks like? In this episode, I speak with Mark X...
Survey, store closing, Patreon
28 Jun 2023
Very brief survey: bit.ly/axrpsurvey2023 Store is closing in a week! Link: store.axrp.net/ Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast
22 - Shard Theory with Quintin Pope
15 Jun 2023
What can we learn about advanced deep learning systems by understanding how humans learn and form values over their lifetimes? Will superhuman AI look...
21 - Interpretability for Engineers with Stephen Casper
02 May 2023
Lots of people in the field of machine learning study 'interpretability', developing tools that they say give us useful information about neural netwo...
20 - 'Reform' AI Alignment with Scott Aaronson
12 Apr 2023
How should we scientifically think about the impact of AI on human civilization, and whether or not it will doom us all? In this episode, I speak with...
Store, Patreon, Video
07 Feb 2023
Store: https://store.axrp.net/ Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Video: https://www.youtube.com/watch?v=...
19 - Mechanistic Interpretability with Neel Nanda
04 Feb 2023
How good are we at understanding the internal computation of advanced machine learning models, and do we have a hope of getting better? In this episod...
New podcast - The Filan Cabinet
13 Oct 2022
I have a new podcast, where I interview whoever I want about whatever I want. It's called "The Filan Cabinet", and you can find it wherever you listen...
18 - Concept Extrapolation with Stuart Armstrong
03 Sep 2022
Concept extrapolation is the idea of taking concepts an AI has about the world - say, "mass" or "does this picture contain a hot dog" - and extending ...
17 - Training for Very High Reliability with Daniel Ziegler
21 Aug 2022
Sometimes, people talk about making AI systems safe by taking examples where they fail and training them to do well on those. But how can we actually ...
16 - Preparing for Debate AI with Geoffrey Irving
01 Jul 2022
Many people in the AI alignment space have heard of AI safety via debate - check out AXRP episode 6 (axrp.net/episode/2021/04/08/episode-6-debate-beth...
15 - Natural Abstractions with John Wentworth
23 May 2022
Why does anybody care about natural abstractions? Do they somehow relate to math, or value learning? How do E. coli bacteria find sources of sugar? Al...
14 - Infra-Bayesian Physicalism with Vanessa Kosoy
05 Apr 2022
Late last year, Vanessa Kosoy and Alexander Appel published some research under the heading of "Infra-Bayesian physicalism". But wait - what was infra...
13 - First Principles of AGI Safety with Richard Ngo
31 Mar 2022
How should we think about artificial general intelligence (AGI), and the risks it might pose? What constraints exist on technical solutions to the pro...
12 - AI Existential Risk with Paul Christiano
02 Dec 2021
Why would advanced AI systems pose an existential risk, and what would it look like to develop safer systems? In this episode, I interview Paul Christ...
11 - Attainable Utility and Power with Alex Turner
25 Sep 2021
Many scary stories about AI involve an AI system deceiving and subjugating humans in order to gain the ability to achieve its goals without us stoppin...
10 - AI's Future and Impacts with Katja Grace
23 Jul 2021
When going about trying to ensure that AI does not cause an existential catastrophe, it's likely important to understand how AI will develop in the fu...
9 - Finite Factored Sets with Scott Garrabrant
24 Jun 2021
Being an agent can get loopy quickly. For instance, imagine that we're playing chess and I'm trying to decide what move to make. Your next move influe...
8 - Assistance Games with Dylan Hadfield-Menell
08 Jun 2021
How should we think about the technical problem of building smarter-than-human AI that does what we want? When and how should AI systems defer to us? ...
7.5 - Forecasting Transformative AI from Biological Anchors with Ajeya Cotra
28 May 2021
If you want to shape the development and forecast the consequences of powerful AI technology, it's important to know when it might appear. In this epi...
7 - Side Effects with Victoria Krakovna
14 May 2021
One way of thinking about how AI might pose an existential threat is by taking drastic actions to maximize its achievement of some objective function,...
6 - Debate and Imitative Generalization with Beth Barnes
08 Apr 2021
One proposal to train AIs that can be useful is to have ML models debate each other about the answer to a human-provided question, where the human jud...
5 - Infra-Bayesianism with Vanessa Kosoy
10 Mar 2021
The theory of sequential decision-making has a problem: how can we deal with situations where we have some hypotheses about the environment we're acti...
4 - Risks from Learned Optimization with Evan Hubinger
17 Feb 2021
In machine learning, typically optimization is done to produce a model that performs well according to some metric. Today's episode features Evan Hubi...
3 - Negotiable Reinforcement Learning with Andrew Critch
11 Dec 2020
In this episode, I talk with Andrew Critch about negotiable reinforcement learning: what happens when two people (or organizations, or what have you) ...
2 - Learning Human Biases with Rohin Shah
11 Dec 2020
One approach to creating useful AI systems is to watch humans doing a task, infer what they're trying to do, and then try to do that well. The simples...
1 - Adversarial Policies with Adam Gleave
11 Dec 2020
In this episode, Adam Gleave and I talk about adversarial policies. Basically, in current reinforcement learning, people train agents that act in some...