
BlueDot Narrated

ML Systems Will Have Weird Failure Modes

04 Jan 2025

Description

Previously, I've argued that future ML systems might exhibit unfamiliar, emergent capabilities, and that thought experiments provide one approach towards predicting these capabilities and their consequences. In this post I'll describe a particular thought experiment in detail. We'll see that taking thought experiments seriously often surfaces future risks that seem "weird" and alien from the point of view of current systems. I'll also describe how I tend to engage with these thought experiments: I usually start out intuitively skeptical, but when I reflect on emergent behavior I find that some (but not all) of the skepticism goes away. The remaining skepticism comes from ways that the thought experiment clashes with the ontology of neural networks, and I'll describe the approaches I usually take to address this and generate actionable takeaways.

## Thought Experiment: Deceptive Alignment

Recall that the optimization anchor runs the thought experiment of assuming that an ML agent is a perfect optimizer (with respect to some "intrinsic" reward function R). I'm going to examine one implication of this assumption, in the context of an agent being trained based on some "extrinsic" reward function R* (which is provided by the system designer and not equal to R).

Specifically, consider a training process where in step t, a model has parameters θ_t and generates an action a_t (its output on that training step, e.g. an attempted backflip, assuming it is being trained to do backflips). The action a_t is then judged according to the extrinsic reward function R*, and the parameters are updated to some new value θ_{t+1} that is intended to increase a_{t+1}'s value under R*.

Original text: https://bounded-regret.ghost.io/ml-systems-will-have-weird-failure-modes-2/

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

---

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
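The training process described above can be made concrete with a minimal sketch: a toy scalar task where an action a_t is produced from parameters θ_t, scored by a designer-provided reward R*, and the parameters are nudged toward θ_{t+1} so the next action scores higher. The task, the update rule (finite-difference gradient ascent), and all names such as `extrinsic_reward`, `act`, and `TARGET` are illustrative assumptions, not details from the original post.

```python
import numpy as np

# Toy task: the "action" is a scalar derived from the parameters, and the
# designer-provided extrinsic reward R* peaks when the action equals TARGET.
TARGET = 1.0


def act(theta: np.ndarray) -> float:
    """Generate the action a_t from the current parameters theta_t."""
    return float(theta.sum())


def extrinsic_reward(action: float) -> float:
    """The designer's reward R*: higher when the action is closer to TARGET."""
    return -(action - TARGET) ** 2


def update(theta: np.ndarray, lr: float = 0.1, eps: float = 1e-3) -> np.ndarray:
    """One training step: move theta_t to theta_{t+1} so that the next action
    a_{t+1} scores higher under R* (finite-difference gradient ascent)."""
    base = extrinsic_reward(act(theta))
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        bumped = theta.copy()
        bumped[i] += eps
        grad[i] = (extrinsic_reward(act(bumped)) - base) / eps
    return theta + lr * grad


if __name__ == "__main__":
    theta = np.zeros(3)  # theta_0
    for t in range(50):
        a_t = act(theta)              # action generated at step t
        r_t = extrinsic_reward(a_t)   # judged by R*, not by the agent's intrinsic R
        theta = update(theta)         # parameters updated to theta_{t+1}
    print(f"final action: {act(theta):.3f}, "
          f"reward under R*: {extrinsic_reward(act(theta)):.3f}")
```

The sketch only captures the outer loop the thought experiment assumes; the point of the deceptive-alignment argument is what happens when the agent's intrinsic reward R differs from the R* driving these updates.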


