BlueDot Narrated
Episodes
Deceptively Aligned Mesa-Optimizers: It’s Not Funny if I Have to Explain It
04 Jan 2025
Contributed by Lukas
Our goal here is to popularize obscure and hard-to-understand areas of AI alignment. So let’s try to understand the incomprehensible meme! Our main ...
What Failure Looks Like
04 Jan 2025
Contributed by Lukas
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. The stereotyped image of AI catastrophe is a powerful, malicious...
Learning From Human Preferences
04 Jan 2025
Contributed by Lukas
One step towards building safe AI systems is to remove the need for humans to write goal functions, since using a simple proxy for a complex goal, or ...
Specification Gaming: The Flip Side of AI Ingenuity
04 Jan 2025
Contributed by Lukas
Specification gaming is a behaviour that satisfies the literal specification of an objective without achieving the intended outcome. We have all had e...
Superintelligence: Instrumental Convergence
04 Jan 2025
Contributed by Lukas
According to the orthogonality thesis, intelligent agents may have an enormous range of possible final goals. Nevertheless, according to what we may t...
The Easy Goal Inference Problem Is Still Hard
04 Jan 2025
Contributed by Lukas
One approach to the AI control problem goes like this: Observe what the user of the system says and does. Infer the user’s preferences. Try to make the...
The Alignment Problem From a Deep Learning Perspective
04 Jan 2025
Contributed by Lukas
Within the coming decades, artificial general intelligence (AGI) may surpass human capabilities at a wide range of important tasks. We outline a case ...
AGI Safety From First Principles
04 Jan 2025
Contributed by Lukas
This report explores the core case for why the development of artificial general intelligence (AGI) might pose an existential threat to humanity. It s...
Four Background Claims
04 Jan 2025
Contributed by Lukas
MIRI’s mission is to ensure that the creation of smarter-than-human artificial intelligence has a positive impact. Why is this mission important, an...
Biological Anchors: A Trick That Might Or Might Not Work
04 Jan 2025
Contributed by Lukas
I've been trying to review and summarize Eliezer Yudkowsky's recent dialogues on AI safety. Previously in sequence: Yudkowsky Contra Ngo On ...
A Short Introduction to Machine Learning
04 Jan 2025
Contributed by Lukas
Despite the current popularity of machine learning, I haven’t found any short introductions to it which quite match the way I prefer to introduce pe...
More Is Different for AI
04 Jan 2025
Contributed by Lukas
Machine learning is touching increasingly many aspects of our society, and its effect will only continue to grow. Given this, I and many others care a...
Future ML Systems Will Be Qualitatively Different
04 Jan 2025
Contributed by Lukas
In 1972, the Nobel Prize-winning physicist Philip Anderson wrote the essay "More Is Different". In it, he argues that quantitative changes c...
Visualizing the Deep Learning Revolution
04 Jan 2025
Contributed by Lukas
The field of AI has undergone a revolution over the last decade, driven by the success of deep learning techniques. This post aims to convey three ide...
Intelligence Explosion: Evidence and Import
04 Jan 2025
Contributed by Lukas
It seems unlikely that humans are near the ceiling of possible intelligences, rather than simply being the first such intelligence that happened to ev...
On the Opportunities and Risks of Foundation Models
04 Jan 2025
Contributed by Lukas
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a w...
Machine Learning for Humans: Supervised Learning
04 Jan 2025
Contributed by Lukas
The two tasks of supervised learning: regression and classification. Linear regression, loss functions, and gradient descent. How much money will we ma...
Can We Scale Human Feedback for Complex AI Tasks?
04 Jan 2025
Contributed by Lukas
Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique for steering large language models (LLMs) toward desired behavio...
Weak-To-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
04 Jan 2025
Contributed by Lukas
Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior...
Zoom In: An Introduction to Circuits
04 Jan 2025
Contributed by Lukas
By studying the connections between neurons, we can find meaningful algorithms in the weights of neural networks. Many important transition points in...
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
04 Jan 2025
Contributed by Lukas
Using a sparse autoencoder, we extract a large number of interpretable features from a one-layer transformer. Mechanistic interpretability seeks to und...
Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small
04 Jan 2025
Contributed by Lukas
Research in mechanistic interpretability seeks to explain behaviors of machine learning (ML) models in terms of their internal components. However, mo...
AI Watermarking Won’t Curb Disinformation
04 Jan 2025
Contributed by Lukas
Generative AI allows people to produce piles upon piles of images and words very quickly. It would be nice if there were some way to reliably distingu...
Introduction to Mechanistic Interpretability
04 Jan 2025
Contributed by Lukas
This introduction covers common mechanistic interpretability (mech interp) concepts, to prepare you for the rest of this session's resources. Original text: https://aisafetyf...
We Need a Science of Evals
04 Jan 2025
Contributed by Lukas
This lays out a number of open questions in what the author calls a 'Science of Evals'. Original text: https://www.apolloresearch.ai/blog/we...
Become a Person who Actually Does Things
04 Jan 2025
Contributed by Lukas
The next four weeks of the course are an opportunity for you to actually build a thing that moves you closer to contributing to AI Alignment, and we...
Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
04 Jan 2025
Contributed by Lukas
This paper surveys open problems and fundamental limitations of reinforcement learning from human feedback (RLHF), and argues that RLHF should be just one component of a broader approach to developing safer AI s...
Emerging Processes for Frontier AI Safety
04 Jan 2025
Contributed by Lukas
The UK recognises the enormous opportunities that AI can unlock across our economy and our society. However, without appropriate guardrails, such tech...
Constitutional AI Harmlessness from AI Feedback
04 Jan 2025
Contributed by Lukas
This paper explains Anthropic’s constitutional AI approach, which is largely an extension of RLHF but with AIs replacing human demonstrators and hum...
Challenges in Evaluating AI Systems
04 Jan 2025
Contributed by Lukas
Most conversations around the societal impacts of artificial intelligence (AI) come down to discussing some quality of an AI system, such as its truth...
AI Control: Improving Safety Despite Intentional Subversion
04 Jan 2025
Contributed by Lukas
We’ve released a paper, AI Control: Improving Safety Despite Intentional Subversion. This paper explores techniques that prevent AI catastrophes eve...
Computing Power and the Governance of AI
04 Jan 2025
Contributed by Lukas
This post summarises a new report, “Computing Power and the Governance of Artificial Intelligence.” The full report is a collaboration between nin...
Working in AI Alignment
04 Jan 2025
Contributed by Lukas
This guide is written for people who are considering direct work on technical AI alignment. I expect it to be most useful for people who are not yet w...
Planning a High-Impact Career: A Summary of Everything You Need to Know in 7 Points
04 Jan 2025
Contributed by Lukas
We took 10 years of research and what we’ve learned from advising 1,000+ people on how to build high-impact careers, compressed that into an eight-w...
How to Succeed as an Early-Stage Researcher: The “Lean Startup” Approach
04 Jan 2025
Contributed by Lukas
I am approaching the end of my AI governance PhD, and I’ve spent about 2.5 years as a researcher at FHI. During that time, I’ve learnt a lot about...
Being the (Pareto) Best in the World
04 Jan 2025
Contributed by Lukas
This introduces the concept of Pareto frontiers. The top comment by Rob Miles also ties it to comparative advantage. While reading, consider what Paret...
Writing, Briefly
04 Jan 2025
Contributed by Lukas
(In the process of answering an email, I accidentally wrote a tiny essay about writing. I usually spend weeks on an essay. This one took 67 minutes—...
Public by Default: How We Manage Information Visibility at Get on Board
04 Jan 2025
Contributed by Lukas
I’ve been obsessed with managing information and communications in a remote team since Get on Board started growing. Reducing the bus factor is a p...
How to Get Feedback
04 Jan 2025
Contributed by Lukas
Feedback is essential for learning. Whether you’re studying for a test, trying to improve in your work or want to master a difficult skill, you need...
Worst-Case Thinking in AI Alignment
04 Jan 2025
Contributed by Lukas
Alternative title: “When should you assume that what could go wrong, will go wrong?” Thanks to Mary Phuong and Ryan Greenblatt for helpful suggest...
Compute Trends Across Three Eras of Machine Learning
04 Jan 2025
Contributed by Lukas
This article explains key drivers of AI progress, shows how compute is calculated, and looks at how the amount of compute used to train AI m...
Empirical Findings Generalize Surprisingly Far
04 Jan 2025
Contributed by Lukas
Previously, I argued that emergent phenomena in machine learning mean that we can’t rely on current trends to predict what the future of ML will be ...
Low-Stakes Alignment
04 Jan 2025
Contributed by Lukas
Right now I’m working on finding a good objective to optimize with ML, rather than trying to make sure our models are robustly optimizing that objec...
Two-Turn Debate Doesn’t Help Humans Answer Hard Reading Comprehension Questions
04 Jan 2025
Contributed by Lukas
Using hard multiple-choice reading comprehension questions as a testbed, we assess whether presenting humans with arguments for two competing answer o...
ABS: Scanning Neural Networks for Back-Doors by Artificial Brain Stimulation
04 Jan 2025
Contributed by Lukas
This paper presents a technique to scan neural network based AI models to determine if they are trojaned. Pre-trained AI models may contain back-doors...
Imitative Generalisation (AKA ‘Learning the Prior’)
04 Jan 2025
Contributed by Lukas
This post tries to explain a simplified version of Paul Christiano’s mechanism introduced here (referred to there as ‘Learning the Prior’) and ...
Toy Models of Superposition
04 Jan 2025
Contributed by Lukas
It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For e...
Discovering Latent Knowledge in Language Models Without Supervision
04 Jan 2025
Contributed by Lukas
Abstract: Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may rep...
An Investigation of Model-Free Planning
04 Jan 2025
Contributed by Lukas
The field of reinforcement learning (RL) is facing increasingly challenging domains with combinatorial complexity. For an RL agent to address these ch...
Gradient Hacking: Definitions and Examples
04 Jan 2025
Contributed by Lukas
Gradient hacking is a hypothesized phenomenon where: A model has knowledge about possible training trajectories which isn’t being used by its trainin...
Intro to Brain-Like-AGI Safety
04 Jan 2025
Contributed by Lukas
(Sections 3.1-3.4, 6.1-6.2, and 7.1-7.5) Suppose we someday build an Artificial General Intelligence algorithm using similar principles of learning and...
Chinchilla’s Wild Implications
04 Jan 2025
Contributed by Lukas
This post is about language model scaling laws, specifically the laws derived in the DeepMind paper that introduced Chinchilla. The paper came out a f...
Deep Double Descent
04 Jan 2025
Contributed by Lukas
We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves a...
Eliciting Latent Knowledge
04 Jan 2025
Contributed by Lukas
In this post, we’ll present ARC’s approach to an open problem we think is central to aligning powerful machine learning (ML) systems: Suppose we ...
Illustrating Reinforcement Learning from Human Feedback (RLHF)
04 Jan 2025
Contributed by Lukas
This more technical article explains the motivations for a system like RLHF, and adds additional concrete details as to how the RLHF approach is appli...
This is How AI Will Transform How Science Gets Done
02 Jan 2025
Contributed by Lukas
This article by Eric Schmidt, former CEO of Google, explains existing use cases for AI in the scientific community and outlines ways that sufficiently...
If-Then Commitments for AI Risk Reduction
02 Jan 2025
Contributed by Lukas
This article from Holden Karnofsky, now a visiting scholar at the Carnegie Endowment for International Peace, discusses "If-Then" commitment...
So You Want to be a Policy Entrepreneur?
30 Dec 2024
Contributed by Lukas
This paper by academic Michael Mintrom defines policy entrepreneurs as "energetic actors who engage in collaborative efforts in and around govern...
Open-Sourcing Highly Capable Foundation Models: An Evaluation of Risks, Benefits, and Alternative Methods for Pursuing Open-Source Objectives
30 Dec 2024
Contributed by Lukas
This resource is the second of two on the benefits and risks of open-weights model release. In contrast, this paper expresses strong skepticism toward...
Considerations for Governing Open Foundation Models
30 Dec 2024
Contributed by Lukas
This resource is the first of two on the benefits and risks of open-weights model release. This paper broadly supports the open release of foundation ...
Driving U.S. Innovation in Artificial Intelligence: A Roadmap for Artificial Intelligence Policy in the United States Senate
22 May 2024
Contributed by Lukas
In the fall of 2023, the US Bipartisan Senate AI Working Group held insight forums with global leaders. Participants included the leaders of major AI l...
The AI Triad and What It Means for National Security Strategy
20 May 2024
Contributed by Lukas
In this paper from CSET, Ben Buchanan outlines a framework for understanding the inputs that power machine learning. Called "the AI Triad", ...
Societal Adaptation to Advanced AI
20 May 2024
Contributed by Lukas
This paper explores the under-discussed strategies of adaptation and resilience to mitigate the risks of advanced AI systems. The authors present argu...
OECD AI Principles
13 May 2024
Contributed by Lukas
This document from the OECD is split into two sections: principles for responsible stewardship of trustworthy AI & national policies and internati...
Key facts: UNESCO’s Recommendation on the Ethics of Artificial Intelligence
13 May 2024
Contributed by Lukas
This summary of UNESCO's Recommendation on the Ethics of AI outlines four core values, ten core principles, and eleven actionable policies for re...
The Bletchley Declaration by Countries Attending the AI Safety Summit, 1-2 November 2023
13 May 2024
Contributed by Lukas
This statement was released by the UK Government as part of their Global AI Safety Summit from November 2023. It notes that frontier models pose uniqu...
A pro-innovation approach to AI regulation: government response
13 May 2024
Contributed by Lukas
This report by the UK's Department for Science, Innovation and Technology outlines a regulatory framework for UK AI policy. Per the report, ...
China’s AI Regulations and How They Get Made
13 May 2024
Contributed by Lukas
This report from the Carnegie Endowment for International Peace summarizes Chinese AI policy as of mid-2023. It also provides analysis of the factors ...
High-level summary of the AI Act
13 May 2024
Contributed by Lukas
This primer by the Future of Life Institute highlights core elements of the EU AI Act. It includes a high-level summary alongside explanations of diff...
FACT SHEET: President Biden Issues Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence
13 May 2024
Contributed by Lukas
This fact sheet from The White House summarizes President Biden's AI Executive Order from October 2023. The President's AI EO represents the...
Recent U.S. Efforts on AI Policy
13 May 2024
Contributed by Lukas
This high-level overview by CISA summarizes major US policies on AI at the federal level. Important items worth further investigation include Executiv...
AI Index Report 2024, Chapter 7: Policy and Governance
13 May 2024
Contributed by Lukas
This yearly report from Stanford’s Institute for Human-Centered AI (HAI) tracks AI governance actions and broader trends in policies and legislation by governments a...
The Policy Playbook: Building a Systems-Oriented Approach to Technology and National Security Policy
05 May 2024
Contributed by Lukas
This report by the Center for Security and Emerging Technology first analyzes the tensions and tradeoffs between three strategic technology and nation...
Strengthening Resilience to AI Risk: A Guide for UK Policymakers
04 May 2024
Contributed by Lukas
This report from the Centre for Emerging Technology and Security and the Centre for Long-Term Resilience identifies different levers as they apply to ...
The Convergence of Artificial Intelligence and the Life Sciences: Safeguarding Technology, Rethinking Governance, and Preventing Catastrophe
03 May 2024
Contributed by Lukas
This report by the Nuclear Threat Initiative primarily focuses on how AI's integration into biosciences could advance biotechnology but also pose...
What is AI Alignment?
01 May 2024
Contributed by Lukas
To prevent AIs from going rogue, we’ll have to align them. In this article, Adam Jones of BlueDot Impact introduces the concept of aligning AIs. He defi...
Rogue AIs
01 May 2024
Contributed by Lukas
This excerpt from CAIS’s AI Safety, Ethics, and Society textbook provides a deep dive into the CAIS resource from session three, focusing specifical...
An Overview of Catastrophic AI Risks
29 Apr 2024
Contributed by Lukas
This article from the Center for AI Safety provides an overview of ways that advanced AI could cause catastrophe. It groups catastrophic risks into fo...
Future Risks of Frontier AI
23 Apr 2024
Contributed by Lukas
This report from the UK’s Government Office for Science envisions five potential risk scenarios from frontier AI. Each scenario includes information...
What risks does AI pose?
23 Apr 2024
Contributed by Lukas
This resource, written by Adam Jones at BlueDot Impact, provides a comprehensive overview of the existing and anticipated risks of AI. As you're ...
AI Could Defeat All Of Us Combined
22 Apr 2024
Contributed by Lukas
This blog post from Holden Karnofsky, Open Philanthropy’s Director of AI Strategy, explains how advanced AI might overpower humanity. It summarizes ...
The Economic Potential of Generative AI: The Next Productivity Frontier
16 Apr 2024
Contributed by Lukas
This report from McKinsey discusses the huge potential for economic growth that generative AI could bring, examining key drivers and exploring potenti...
Positive AI Economic Futures
16 Apr 2024
Contributed by Lukas
This insight report from the World Economic Forum summarizes some positive AI outcomes. Some proposed futures include AI enabling shared economic bene...
The Transformative Potential of Artificial Intelligence
16 Apr 2024
Contributed by Lukas
This paper by Ross Gruetzemacher and Jess Whittlestone examines the concept of transformative AI, which significantly impacts society without necessar...
Moore's Law for Everything
16 Apr 2024
Contributed by Lukas
This blog by Sam Altman, the CEO of OpenAI, provides insight into what AI company leaders are saying and thinking about their reasons for pursuing adv...
Visualizing the Deep Learning Revolution
13 May 2023
Contributed by Lukas
The field of AI has undergone a revolution over the last decade, driven by the success of deep learning techniques. This post aims to convey three ide...
A Short Introduction to Machine Learning
13 May 2023
Contributed by Lukas
Despite the current popularity of machine learning, I haven’t found any short introductions to it which quite match the way I prefer to introduce pe...
The AI Triad and What It Means for National Security Strategy
13 May 2023
Contributed by Lukas
A single sentence can summarize the complexities of modern artificial intelligence: Machine learning systems use computing power to execute algorithms...
Specification Gaming: The Flip Side of AI Ingenuity
13 May 2023
Contributed by Lukas
Specification gaming is a behaviour that satisfies the literal specification of an objective without achieving the intended outcome. We have all had e...
As AI Agents Like Auto-GPT Speed up Generative AI Race, We All Need to Buckle Up
13 May 2023
Contributed by Lukas
If you thought the pace of AI development had sped up since the release of ChatGPT last November, well, buckle up. Thanks to the rise of autonomous AI...
The Need for Work on Technical AI Alignment
13 May 2023
Contributed by Lukas
This page gives an overview of the alignment problem. It describes our motivation for running courses about technical AI alignment. The terminology sh...
Overview of How AI Might Exacerbate Long-Running Catastrophic Risks
13 May 2023
Contributed by Lukas
Developments in AI could exacerbate long-running catastrophic risks, including bioterrorism, disinformation and resulting institutional dysfunction, m...
Avoiding Extreme Global Vulnerability as a Core AI Governance Problem
13 May 2023
Contributed by Lukas
Much has been written framing and articulating the AI governance problem from a catastrophic risks lens, but these writings have been scattered. This ...
AI Safety Seems Hard to Measure
13 May 2023
Contributed by Lukas
In previous pieces, I argued that there’s a real and large risk of AI systems’ developing dangerous goals of their own and defeating all of humani...
Nobody’s on the Ball on AGI Alignment
13 May 2023
Contributed by Lukas
Observing from afar, it’s easy to think there’s an abundance of people working on AGI safety. Everyone on your timeline is fretting about AI risk,...
Why Might Misaligned, Advanced AI Cause Catastrophe?
13 May 2023
Contributed by Lukas
You may have seen arguments (such as these) for why people might create and deploy advanced AI that is both power-seeking and misaligned with human in...
Emergent Deception and Emergent Optimization
13 May 2023
Contributed by Lukas
I’ve previously argued that machine learning systems often exhibit emergent capabilities, and that these capabilities could lead to unintended negat...
Frontier AI Regulation: Managing Emerging Risks to Public Safety
13 May 2023
Contributed by Lukas
Advanced AI models hold the promise of tremendous benefits for humanity, but society needs to proactively manage the accompanying risks. In this paper...
Model Evaluation for Extreme Risks
13 May 2023
Contributed by Lukas
Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in A...
Primer on Safety Standards and Regulations for Industrial-Scale AI Development
13 May 2023
Contributed by Lukas
This primer introduces various aspects of safety standards and regulations for industrial-scale AI development: what they are, their potential and lim...