LessWrong (Curated & Popular)
Episodes
“Auditing language models for hidden objectives” by Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Akbir Khan, Euan Ong, Christopher Olah, Fabien Roger, Meg, Drake Thomas, Adam Jermyn, Monte M, evhub
16 Mar 2025
Contributed by Lukas
We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned ...
“The Most Forbidden Technique” by Zvi
14 Mar 2025
Contributed by Lukas
The Most Forbidden Technique is training an AI using interpretability techniques. An AI produces a final output [X] via some method [M]. You can analyz...
“Trojan Sky” by Richard_Ngo
13 Mar 2025
Contributed by Lukas
You learn the rules as soon as you’re old enough to speak. Don’t talk to jabberjays. You recite them as soon as you wake up every morning. Keep yo...
“OpenAI:” by Daniel Kokotajlo
11 Mar 2025
Contributed by Lukas
Exciting Update: OpenAI has released this blog post and paper which makes me very happy. It's basically the first steps along the research agenda...
“How Much Are LLMs Actually Boosting Real-World Programmer Productivity?” by Thane Ruthenis
09 Mar 2025
Contributed by Lukas
LLM-based coding-assistance tools have been out for ~2 years now. Many developers have been reporting that this is dramatically increasing their produ...
“So how well is Claude playing Pokémon?” by Julian Bradshaw
09 Mar 2025
Contributed by Lukas
Background: After the release of Claude 3.7 Sonnet,[1] an Anthropic employee started livestreaming Claude trying to play through Pokémon Red. The liv...
“Methods for strong human germline engineering” by TsviBT
07 Mar 2025
Contributed by Lukas
Note: an audio narration is not available for this article. Please see the original text. The original text contained 169 footnotes which were omitte...
“Have LLMs Generated Novel Insights?” by abramdemski, Cole Wyeth
06 Mar 2025
Contributed by Lukas
In a recent post, Cole Wyeth makes a bold claim: . . . there is one crucial test (yes this is a crux) that LLMs have not passed. They have never done a...
“A Bear Case: My Predictions Regarding AI Progress” by Thane Ruthenis
06 Mar 2025
Contributed by Lukas
This isn't really a "timeline", as such – I don't know the timings – but this is my current, fairly optimistic take on where w...
“Statistical Challenges with Making Super IQ babies” by Jan Christian Refsgaard
05 Mar 2025
Contributed by Lukas
This is a critique of How to Make Superbabies on LessWrong. Disclaimer: I am not a geneticist[1], and I've tried to use as little jargon as possib...
“Self-fulfilling misalignment data might be poisoning our AI models” by TurnTrout
04 Mar 2025
Contributed by Lukas
This is a link post. Your AI's training data might make it more “evil” and more able to circumvent your security, monitoring, and control meas...
“Judgements: Merging Prediction & Evidence” by abramdemski
01 Mar 2025
Contributed by Lukas
I recently wrote about complete feedback, an idea which I think is quite important for AI safety. However, my note was quite brief, explaining the ide...
“The Sorry State of AI X-Risk Advocacy, and Thoughts on Doing Better” by Thane Ruthenis
26 Feb 2025
Contributed by Lukas
First, let me quote my previous ancient post on the topic: Effective Strategies for Changing Public Opinion. The titular paper is very relevant here. I'...
“Power Lies Trembling: a three-book review” by Richard_Ngo
26 Feb 2025
Contributed by Lukas
In a previous book review I described exclusive nightclubs as the particle colliders of sociology—places where you can reliably observe extreme forc...
“Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs” by Jan Betley, Owain_Evans
26 Feb 2025
Contributed by Lukas
This is the abstract and introduction of our new paper. We show that finetuning state-of-the-art LLMs on a narrow task, such as writing vulnerable cod...
“The Paris AI Anti-Safety Summit” by Zvi
22 Feb 2025
Contributed by Lukas
It doesn’t look good. What used to be the AI Safety Summits were perhaps the most promising thing happening towards international coordination for AI...
“Eliezer’s Lost Alignment Articles / The Arbital Sequence” by Ruby
20 Feb 2025
Contributed by Lukas
Note: this is a static copy of this wiki page. We are also publishing it as a post to ensure visibility. Circa 2015-2017, a lot of high quality content...
“Arbital has been imported to LessWrong” by RobertM, jimrandomh, Ben Pace, Ruby
20 Feb 2025
Contributed by Lukas
Arbital was envisioned as a successor to Wikipedia. The project was discontinued in 2017, but not before many new features had been built and a substa...
“How to Make Superbabies” by GeneSmith, kman
20 Feb 2025
Contributed by Lukas
We’ve spent the better part of the last two decades unravelling exactly how the human genome works and which specific letter changes in our DNA affe...
“A computational no-coincidence principle” by Eric Neyman
19 Feb 2025
Contributed by Lukas
Audio note: this article contains 134 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text ...
“A History of the Future, 2025-2040” by L Rudolf L
19 Feb 2025
Contributed by Lukas
This is an all-in-one crosspost of a scenario I originally published in three parts on my blog (No Set Gauge). Links to the originals: A History of th...
“It’s been ten years. I propose HPMOR Anniversary Parties.” by Screwtape
18 Feb 2025
Contributed by Lukas
On March 14th, 2015, Harry Potter and the Methods of Rationality made its final post. Wrap parties were held all across the world to read the ending a...
“Some articles in ‘International Security’ that I enjoyed” by Buck
16 Feb 2025
Contributed by Lukas
A friend of mine recently recommended that I read through articles from the journal International Security, in order to learn more about international...
“The Failed Strategy of Artificial Intelligence Doomers” by Ben Pace
16 Feb 2025
Contributed by Lukas
This is the best sociological account of the AI x-risk reduction efforts of the last ~decade that I've seen. I encourage folks to engage with its...
“Murder plots are infohazards” by Chris Monteiro
14 Feb 2025
Contributed by Lukas
Hi all, I've been hanging around the rationalist-sphere for many years now, mostly writing about transhumanism, until things started to change in 2...
“Why Did Elon Musk Just Offer to Buy Control of OpenAI for $100 Billion?” by garrison
11 Feb 2025
Contributed by Lukas
This is the full text of a post from "The Obsolete Newsletter," a Substack that I write about the intersection of capitalism, geopolitics, ...
“The ‘Think It Faster’ Exercise” by Raemon
09 Feb 2025
Contributed by Lukas
Ultimately, I don’t want to solve complex problems via laborious, complex thinking, if we can help it. Ideally, I'd want to basically intuitive...
“So You Want To Make Marginal Progress...” by johnswentworth
08 Feb 2025
Contributed by Lukas
Once upon a time, in ye olden days of strange names and before google maps, seven friends needed to figure out a driving route from their parking lot ...
“What is malevolence? On the nature, measurement, and distribution of dark traits” by David Althaus
08 Feb 2025
Contributed by Lukas
Summary: In this post, we explore different ways of understanding and measuring malevolence and explain why individuals with concerning levels of male...
“How AI Takeover Might Happen in 2 Years” by joshc
08 Feb 2025
Contributed by Lukas
I’m not a natural “doomsayer.” But unfortunately, part of my job as an AI safety researcher is to think about the more troubling scenarios. I’...
“Gradual Disempowerment, Shell Games and Flinches” by Jan_Kulveit
05 Feb 2025
Contributed by Lukas
Over the past year and a half, I've had numerous conversations about the risks we describe in Gradual Disempowerment. (The shortest useful summary ...
“Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development” by Jan_Kulveit, Raymond D, Nora_Ammann, Deger Turan, David Scott Krueger (formerly: capybaralet), David Duvenaud
04 Feb 2025
Contributed by Lukas
This is a link post. Full version on arXiv | X. Executive summary: AI risk scenarios usually portray a relatively sudden loss of human control to AIs, ...
“Planning for Extreme AI Risks” by joshc
03 Feb 2025
Contributed by Lukas
This post should not be taken as a polished recommendation to AI companies and instead should be treated as an informal summary of a worldview. The co...
“Catastrophe through Chaos” by Marius Hobbhahn
03 Feb 2025
Contributed by Lukas
This is a personal post and does not necessarily reflect the opinion of other members of Apollo Research. Many other people have talked about similar ...
“Will alignment-faking Claude accept a deal to reveal its misalignment?” by ryan_greenblatt
01 Feb 2025
Contributed by Lukas
I (and co-authors) recently put out "Alignment Faking in Large Language Models" where we show that when Claude strongly dislikes what it is ...
“‘Sharp Left Turn’ discourse: An opinionated review” by Steven Byrnes
30 Jan 2025
Contributed by Lukas
Summary and Table of Contents: The goal of this post is to discuss the so-called “sharp left turn”, the lessons that we learn from analogizing evol...
“Ten people on the inside” by Buck
29 Jan 2025
Contributed by Lukas
(Many of these ideas developed in conversation with Ryan Greenblatt) In a shortform, I described some different levels of resources and buy-in for misa...
“Anomalous Tokens in DeepSeek-V3 and r1” by henry
28 Jan 2025
Contributed by Lukas
“Anomalous”, “glitch”, or “unspeakable” tokens in an LLM are those that induce bizarre behavior or otherwise don’t behave like regular t...
“Tell me about yourself: LLMs are aware of their implicit behaviors” by Martín Soto, Owain_Evans
28 Jan 2025
Contributed by Lukas
This is the abstract and introduction of our new paper, with some discussion of implications for AI Safety at the end. Authors: Jan Betley*, Xuchan B...
“Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals” by johnswentworth, David Lorell
27 Jan 2025
Contributed by Lukas
The Cake: Imagine that I want to bake a chocolate cake, and my sole goal in my entire lightcone and extended mathematical universe is to bake that cake...
“A Three-Layer Model of LLM Psychology” by Jan_Kulveit
26 Jan 2025
Contributed by Lukas
This post offers an accessible model of psychology of character-trained LLMs like Claude. Epistemic Status: This is primarily a phenomenological model ...
“Training on Documents About Reward Hacking Induces Reward Hacking” by evhub
24 Jan 2025
Contributed by Lukas
This is a link post. This is a blog post reporting some preliminary work from the Anthropic Alignment Science team, which might be of interest to resea...
“AI companies are unlikely to make high-assurance safety cases if timelines are short” by ryan_greenblatt
24 Jan 2025
Contributed by Lukas
One hope for keeping existential risks low is to get AI companies to (successfully) make high-assurance safety cases: structured and auditable argumen...
“Mechanisms too simple for humans to design” by Malmesbury
24 Jan 2025
Contributed by Lukas
Cross-posted from Telescopic Turnip. As we all know, humans are terrible at building butterflies. We can make a lot of objectively cool things like nucl...
“The Gentle Romance” by Richard_Ngo
22 Jan 2025
Contributed by Lukas
This is a link post. A story I wrote about living through the transition to utopia. This is the one story that I've put the most time and effort in...
“Quotes from the Stargate press conference” by Nikola Jurkovic
22 Jan 2025
Contributed by Lukas
This is a link post. Present alongside President Trump: Sam Altman, Larry Ellison (Oracle executive chairman and CTO), Masayoshi Son (Softbank CEO who be...
“The Case Against AI Control Research” by johnswentworth
21 Jan 2025
Contributed by Lukas
The AI Control Agenda, in its own words: … we argue that AI labs should ensure that powerful AIs are controlled. That is, labs should make sure that ...
“Don’t ignore bad vibes you get from people” by Kaj_Sotala
20 Jan 2025
Contributed by Lukas
I think a lot of people have heard so much about internalized prejudice and bias that they think they should ignore any bad vibes they get about a per...
“[Fiction] [Comic] Effective Altruism and Rationality meet at a Secular Solstice afterparty” by tandem
19 Jan 2025
Contributed by Lukas
(Both characters are fictional, loosely inspired by various traits from various real people. Be careful about combining kratom and alcohol.) The origi...
“Building AI Research Fleets” by bgold, Jesse Hoogland
18 Jan 2025
Contributed by Lukas
From AI scientist to AI research fleet: Research automation is here (1, 2, 3). We saw it coming and planned ahead, which puts us ahead of most (4, 5, 6...
“What Is The Alignment Problem?” by johnswentworth
17 Jan 2025
Contributed by Lukas
So we want to align future AGIs. Ultimately we’d like to align them to human values, but in the shorter term we might start with other targets, like...
“Applying traditional economic thinking to AGI: a trilemma” by Steven Byrnes
14 Jan 2025
Contributed by Lukas
Traditional economics thinking has two strong principles, each based on abundant historical data: Principle (A): No “lump of labor”: If human popu...
“Passages I Highlighted in The Letters of J.R.R.Tolkien” by Ivan Vendrov
14 Jan 2025
Contributed by Lukas
All quotes, unless otherwise marked, are Tolkien's words as printed in The Letters of J.R.R.Tolkien: Revised and Expanded Edition. All emphases m...
“Parkinson’s Law and the Ideology of Statistics” by Benquo
13 Jan 2025
Contributed by Lukas
The anonymous review of The Anti-Politics Machine published on Astral Codex Ten focuses on a case study of a World Bank intervention in Lesotho, and tel...
“Capital Ownership Will Not Prevent Human Disempowerment” by beren
11 Jan 2025
Contributed by Lukas
Crossposted from my personal blog. I was inspired to cross-post this here given the discussion that this post on the role of capital in an AI future e...
“Activation space interpretability may be doomed” by bilalchughtai, Lucius Bushnaq
10 Jan 2025
Contributed by Lukas
TL;DR: There may be a fundamental problem with interpretability work that attempts to understand neural networks by decomposing their individual activ...
“What o3 Becomes by 2028” by Vladimir_Nesov
09 Jan 2025
Contributed by Lukas
Funding for $150bn training systems just turned less speculative, with OpenAI o3 reaching 25% on FrontierMath, 70% on SWE-Verified, 2700 on Codeforces...
“What Indicators Should We Watch to Disambiguate AGI Timelines?” by snewman
09 Jan 2025
Contributed by Lukas
(Cross-post from https://amistrongeryet.substack.com/p/are-we-on-the-brink-of-agi, lightly edited for LessWrong. The original has a lengthier introduc...
“How will we update about scheming?” by ryan_greenblatt
08 Jan 2025
Contributed by Lukas
I mostly work on risks from scheming (that is, misaligned, power-seeking AIs that plot against their creators such as by faking alignment). Recently, ...
“OpenAI #10: Reflections” by Zvi
08 Jan 2025
Contributed by Lukas
This week, Altman offers a post called Reflections, and he has an interview in Bloomberg. There's a bunch of good and interesting answers in the ...
“Maximizing Communication, not Traffic” by jefftk
07 Jan 2025
Contributed by Lukas
As someone who writes for fun, I don't need to get people onto my site: If I write a post and some people are able to get the core idea just from ...
“What’s the short timeline plan?” by Marius Hobbhahn
02 Jan 2025
Contributed by Lukas
This is a low-effort post. I mostly want to get other people's takes and express concern about the lack of detailed and publicly available plans ...
“Shallow review of technical AI safety, 2024” by technicalities, Stag, Stephen McAleese, jordine, Dr. David Mathers
30 Dec 2024
Contributed by Lukas
from aisafety.world The following is a list of live agendas in technical AI safety, updating our post from last year. It is “shallow” in the sense...
“By default, capital will matter more than ever after AGI” by L Rudolf L
29 Dec 2024
Contributed by Lukas
I've heard many people say something like "money won't matter post-AGI". This has always struck me as odd, and as most likely comp...
“Review: Planecrash” by L Rudolf L
28 Dec 2024
Contributed by Lukas
Take a stereotypical fantasy novel, a textbook on mathematical logic, and Fifty Shades of Grey. Mix them all together and add extra weirdness for spic...
“The Field of AI Alignment: A Postmortem, and What To Do About It” by johnswentworth
26 Dec 2024
Contributed by Lukas
A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look...
“When Is Insurance Worth It?” by kqr
23 Dec 2024
Contributed by Lukas
TL;DR: If you want to know whether getting insurance is worth it, use the Kelly Insurance Calculator. If you want to know why or how, read on. Note to ...
“Orienting to 3 year AGI timelines” by Nikola Jurkovic
23 Dec 2024
Contributed by Lukas
My median expectation is that AGI[1] will be created 3 years from now. This has implications on how to behave, and I will share some useful thoughts I...
“What Goes Without Saying” by sarahconstantin
21 Dec 2024
Contributed by Lukas
There are people I can talk to, where all of the following statements are obvious. They go without saying. We can just “be reasonable” together, w...
“o3” by Zach Stein-Perlman
21 Dec 2024
Contributed by Lukas
I'm editing this post. OpenAI announced (but hasn't released) o3 (skipping o2 for trademark reasons). It gets 25% on FrontierMath, smashing th...
“‘Alignment Faking’ frame is somewhat fake” by Jan_Kulveit
21 Dec 2024
Contributed by Lukas
I like the research. I mostly trust the results. I dislike the 'Alignment Faking' name and frame, and I'm afraid it will stick and lead...
“AIs Will Increasingly Attempt Shenanigans” by Zvi
19 Dec 2024
Contributed by Lukas
Increasingly, we have seen papers eliciting in AI models various shenanigans. There are a wide variety of scheming behaviors. You’ve got your weight ...
“Alignment Faking in Large Language Models” by ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman, Buck
18 Dec 2024
Contributed by Lukas
What happens when you tell Claude it is being trained to do something it doesn't want to do? We (Anthropic and Redwood Research) have a new paper...
“Communications in Hard Mode (My new job at MIRI)” by tanagrabeast
15 Dec 2024
Contributed by Lukas
Six months ago, I was a high school English teacher. I wasn’t looking to change careers, even after nineteen sometimes-difficult years. I was good at...
“Biological risk from the mirror world” by jasoncrawford
13 Dec 2024
Contributed by Lukas
A new article in Science Policy Forum voices concern about a particular line of biological research which, if successful in the long term, could event...
“Subskills of ‘Listening to Wisdom’” by Raemon
13 Dec 2024
Contributed by Lukas
A fool learns from their own mistakes. The wise learn from the mistakes of others. – Otto von Bismarck. A problem as old as time: The youth won't ...
“Understanding Shapley Values with Venn Diagrams” by Carson L
13 Dec 2024
Contributed by Lukas
Someone I know, Carson Loughridge, wrote this very nice post explaining the core intuition around Shapley values (which play an important role in imp...
“LessWrong audio: help us choose the new voice” by PeterH
12 Dec 2024
Contributed by Lukas
We make AI narrations of LessWrong posts available via our audio player and podcast feeds. We’re thinking about changing our narrator's voice. Th...
“Understanding Shapley Values with Venn Diagrams” by agucova
11 Dec 2024
Contributed by Lukas
This is a link post. Someone I know wrote this very nice post explaining the core intuition around Shapley values (which play an important role in imp...
“o1: A Technical Primer” by Jesse Hoogland
11 Dec 2024
Contributed by Lukas
TL;DR: In September 2024, OpenAI released o1, its first "reasoning model". This model exhibits remarkable test-time scaling laws, which comp...
“Gradient Routing: Masking Gradients to Localize Computation in Neural Networks” by cloud, Jacob G-W, Evzen, Joseph Miller, TurnTrout
09 Dec 2024
Contributed by Lukas
We present gradient routing, a way of controlling where learning happens in neural networks. Gradient routing applies masks to limit the flow of gradi...
“Frontier Models are Capable of In-context Scheming” by Marius Hobbhahn, AlexMeinke, Bronson Schoen
06 Dec 2024
Contributed by Lukas
This is a brief summary of what we believe to be the most important takeaways from our new paper and from our findings shown in the o1 system card. We...
“(The) Lightcone is nothing without its people: LW + Lighthaven’s first big fundraiser” by habryka
30 Nov 2024
Contributed by Lukas
TLDR: LessWrong + Lighthaven need about $3M for the next 12 months. Donate here, or send me an email, DM or signal message (+1 510 944 3235) if you wa...
“Repeal the Jones Act of 1920” by Zvi
29 Nov 2024
Contributed by Lukas
Balsa Policy Institute chose as its first mission to lay groundwork for the potential repeal, or partial repeal, of section 27 of the Jones Act of 192...
“China Hawks are Manufacturing an AI Arms Race” by garrison
29 Nov 2024
Contributed by Lukas
This is the full text of a post from "The Obsolete Newsletter," a Substack that I write about the intersection of capitalism, geopolitics, ...
“Information vs Assurance” by johnswentworth
27 Nov 2024
Contributed by Lukas
In contract law, there's this thing called a “representation”. Example: as part of a contract to sell my house, I might “represent that” ...
“You are not too ‘irrational’ to know your preferences.” by DaystarEld
27 Nov 2024
Contributed by Lukas
Epistemic Status: 13 years working as a therapist for a wide variety of populations, 5 of them working with rationalists and EA clients. 7 years teach...
“‘The Solomonoff Prior is Malign’ is a special case of a simpler argument” by David Matolcsi
25 Nov 2024
Contributed by Lukas
[Warning: This post is probably only worth reading if you already have opinions on the Solomonoff induction being malign, or at least heard of the con...
“‘It’s a 10% chance which I did 10 times, so it should be 100%’” by egor.timatkov
20 Nov 2024
Contributed by Lukas
Audio note: this article contains 33 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text i...
“OpenAI Email Archives” by habryka
19 Nov 2024
Contributed by Lukas
As part of the court case between Elon Musk and Sam Altman, a substantial number of emails between Elon, Sam Altman, Ilya Sutskever, and Greg Brockman...
“Ayn Rand’s model of ‘living money’; and an upside of burnout” by AnnaSalamon
18 Nov 2024
Contributed by Lukas
Epistemic status: Toy model. Oversimplified, but has been anecdotally useful to at least a couple people, and I like it as a metaphor. Introduction: I’...
“Neutrality” by sarahconstantin
17 Nov 2024
Contributed by Lukas
Midjourney, “infinite library”. I’ve had post-election thoughts percolating, and the sense that I wanted to synthesize something about this moment...
“Making a conservative case for alignment” by Cameron Berg, Judd Rosenblatt, phgubbins, AE Studio
16 Nov 2024
Contributed by Lukas
Trump and the Republican party will wield broad governmental control during what will almost certainly be a critical period for AGI development. In th...
“OpenAI Email Archives (from Musk v. Altman)” by habryka
16 Nov 2024
Contributed by Lukas
As part of the court case between Elon Musk and Sam Altman, a substantial number of emails between Elon, Sam Altman, Ilya Sutskever, and Greg Brockman...
“Catastrophic sabotage as a major threat model for human-level AI systems” by evhub
15 Nov 2024
Contributed by Lukas
Thanks to Holden Karnofsky, David Duvenaud, and Kate Woolverton for useful discussions and feedback. Following up on our recent “Sabotage Evaluations...
“The Online Sports Gambling Experiment Has Failed” by Zvi
12 Nov 2024
Contributed by Lukas
Related: Book Review: On the Edge: The Gamblers. I have previously been heavily involved in sports betting. That world was very good to me. The times we...
“o1 is a bad idea” by abramdemski
12 Nov 2024
Contributed by Lukas
This post comes a bit late with respect to the news cycle, but I argued in a recent interview that o1 is an unfortunate twist on LLM technologies, mak...
“Current safety training techniques do not fully transfer to the agent setting” by Simon Lermen, Govind Pimpale
09 Nov 2024
Contributed by Lukas
TL;DR: I'm presenting three recent papers which all share a similar finding, i.e. the safety training techniques for chat models don’t transfer...
“Explore More: A Bag of Tricks to Keep Your Life on the Rails” by Shoshannah Tekofsky
04 Nov 2024
Contributed by Lukas
At least, if you happen to be near me in brain space. What advice would you give your younger self? That was the prompt for a class I taught at PAIR 202...
“Survival without dignity” by L Rudolf L
04 Nov 2024
Contributed by Lukas
I open my eyes and find myself lying on a bed in a hospital room. I blink. "Hello", says a middle-aged man with glasses, sitting on a chair b...