LessWrong (30+ Karma)
Episodes
Claude 4.5 Opus’ Soul Document
29 Nov 2025
Contributed by Lukas
Summary As far as I understand and uncovered, a document for the character training for Claude is compressed in Claude's weights. The full document c...
Should you work with evil people?
29 Nov 2025
Contributed by Lukas
Epistemic status: Figuring things out. My mind often wanders to what boundaries I ought to maintain between the different parts of my life and people...
Unless its governance changes, Anthropic is untrustworthy
29 Nov 2025
Contributed by Lukas
Anthropic is untrustworthy. This post provides arguments, asks questions, and documents some examples of Anthropic's leadership being misleading and ...
The Missing Genre: Heroic Parenthood - You can have kids and still punch the sun
29 Nov 2025
Contributed by Lukas
This is a link post. I stopped reading when I was 30. You can fill in all the stereotypes of a girl with a book glued to her face during every meal, e...
“Tests of LLM introspection need to rule out causal bypassing” by Adam Morris, Dillon Plunkett
29 Nov 2025
Contributed by Lukas
This point has been floating around implicitly in various papers (e.g., Betley et al., Plunkett et al., Lindsey), but we haven’t seen it named expl...
Not A Love Letter, But A Thank You Letter
29 Nov 2025
Contributed by Lukas
Context: Each day on her blog Letters To Boys, Gretta Duleba has been posting things she once sent to men she dated. One of those, recently, was a lo...
“Ruby’s Ultimate Guide to Thoughtful Gifts” by Ruby
28 Nov 2025
Contributed by Lukas
Give a man a gift and he smiles for a day. Teach a man to gift and he’ll cause smiles for the rest of his life. Gift giving is an exercise in theor...
Writing advice: Why people like your quick bullshit takes better than your high-effort posts
28 Nov 2025
Contributed by Lukas
Right now I’m coaching for Inkhaven, a month-long marathon writing event where our brave residents are writing a blog post every single day for the...
A Thanksgiving Memory
28 Nov 2025
Contributed by Lukas
A couple of days before Thanksgiving when I was 11 years old, the house my family lived in was gutted by a fire. We had been stripping the white pain...
Claude Opus 4.5: Model Card, Alignment and Safety
28 Nov 2025
Contributed by Lukas
They saved the best for last. The contrast in model cards is stark. Google provided a brief overview of its tests for Gemini 3 Pro, with a lot of ‘...
A Taxonomy of Bugs (Lists)
28 Nov 2025
Contributed by Lukas
One of my favorite exercises from CFAR is the Bugs List. By writing down everything in your life that feels "off," things you'd change if you could, ...
“Despair, Serenity, Song and Nobility in ‘Hollow Knight: Silksong’” by Ben Pace
28 Nov 2025
Contributed by Lukas
Fictional universes are oft defined by what their positive affect feels like and what their negative affect feels like. This is the palette that the ...
The Best Lack All Conviction: A Confusing Day in the AI Village
28 Nov 2025
Contributed by Lukas
The AI Village is an ongoing experiment (currently running on weekdays from 10 a.m. to 2 p.m. Pacific time) in which frontier language models are giv...
Will We Get Alignment by Default? — with Adrià Garriga-Alonso
28 Nov 2025
Contributed by Lukas
This is a link post. Adrià recently published “Alignment will happen by default; what's next?” on LessWrong, arguing that AI alignment is turning...
Information Hygiene
28 Nov 2025
Contributed by Lukas
Do avalanches get caused by loud noises? From every time I’ve given this class or lecture, at least 7/10 of you are nodding yes, and the main reaso...
You Are Much More Salient To Yourself Than To Everyone Else
28 Nov 2025
Contributed by Lukas
Back in the Boy Scouts, at summer camp, myself and a couple friends snuck out one night after curfew to commandeer a couch someone had left by a dump...
“Subliminal Learning Across Models” by draganover, Andi Bhongade, tolgadur, Mary Phuong, LASR Labs
27 Nov 2025
Contributed by Lukas
Tl;dr: We show that subliminal learning can transfer sentiment across models (with some caveats). For example, we transfer positive sentiment for Ca...
Alignment remains a hard, unsolved problem
27 Nov 2025
Contributed by Lukas
Thanks to (in alphabetical order) Joshua Batson, Roger Grosse, Jeremy Hadfield, Jared Kaplan, Jan Leike, Jack Lindsey, Monte MacDiarmid, Francesco Mo...
Just explain it to someone
27 Nov 2025
Contributed by Lukas
I was once helping a child with her homework. She was supposed to write about a place that was important for her, and had chosen her family's summer ...
Principles of a Rationality Dojo
26 Nov 2025
Contributed by Lukas
Since 2023 I've been directing WARP, the Wandering Applied Rationality Program with the help of ESPR and SPARC staff, which are summer camps I've tau...
Postmodernism for STEM Types: A Clear-Language Guide to Conflict Theory
26 Nov 2025
Contributed by Lukas
Crossposted from Substack Section I : Opening In 2021, Richard Dawkins tweeted: The fallout was immediate. The American Humanist Association revoked...
“Training PhD Students to be Fat Newts (Part 2)” by alkjash
26 Nov 2025
Contributed by Lukas
[Thanks Inkhaven for hosting me! This is my fourth and last post and I'm already exhausted from writing. Wordpress.com!] Last time, I introduced the ...
Snippets on Living In Reality
26 Nov 2025
Contributed by Lukas
Social reality is quite literally another world, in the same sense that the Harry Potter universe is another world. Like the Harry Potter universe, s...
Courtship Confusions Post-Slutcon
26 Nov 2025
Contributed by Lukas
Going into slutcon, one of my main known-unknowns was… I’d heard many times that the standard path to hooking up or dating starts with two people...
Training PhD Students to be Fat Newts (Part 1)
26 Nov 2025
Contributed by Lukas
This is a link post. Today, I want to introduce an experimental PhD student training philosophy. Let's start with some reddit memes. Every gaming sub...
“Evaluating honesty and lie detection techniques on a diverse suite of dishonest models” by Sam Marks, Johannes Treutlein, evhub, Fabien Roger
26 Nov 2025
Contributed by Lukas
TL;DR: We use a suite of testbed settings where models lie—i.e. generate statements they believe to be false—to evaluate honesty and lie detectio...
Takeaways from the Eleos Conference on AI Consciousness and Welfare
26 Nov 2025
Contributed by Lukas
Crossposted from my Substack. I spent the weekend at Lighthaven, attending the Eleos conference. In this post, I share thoughts and updates as I refl...
Evolution & Freedom
26 Nov 2025
Contributed by Lukas
In Against Money Maximalism, I argued against money-maximization as a normative stance. Profit is a coherent thing you can try to maximize, but there...
Reasons Why I Cannot Sleep
26 Nov 2025
Contributed by Lukas
My therapist says I'm more tired today than she's ever seen me. Here are some reasons my brain says I cannot sleep: My boss might lose faith in my a...
The Economics of Replacing Call Center Workers With AIs
26 Nov 2025
Contributed by Lukas
TLDR: Voice AIs aren't that much cheaper in the year 2025 My friend runs a voice agent startup in Canada for walk-in clinics. The AI takes calls and ...
Three things that surprised me about technical grantmaking at Coefficient Giving (fka Open Phil)
26 Nov 2025
Contributed by Lukas
Open Philanthropy's Coefficient Giving's Technical AI Safety team is hiring grantmakers. I thought this would be a good moment to share some positive...
“OpenAI finetuning metrics: What is going on with the loss curves?” by jorio, James Chua
26 Nov 2025
Contributed by Lukas
Introduction For our current project, we've been using the OpenAI fine-tuning API. To run some of our experiments, we needed to understand exactly ho...
Alignment will happen by default. What’s next?
25 Nov 2025
Contributed by Lukas
I’m not 100% convinced of this, but I’m fairly convinced, more and more so over time. I’m hoping to start a vigorous but civilized debate. I in...
“Maybe Insensitive Functions are a Natural Ontology Generator?” by johnswentworth
25 Nov 2025
Contributed by Lukas
The most canonical example of a "natural ontology" comes from gasses in stat mech. In the simplest version, we model the gas as a bunch of little bil...
The Enemy Gets The Last Hit
24 Nov 2025
Contributed by Lukas
Disclaimer: I am god-awful at chess. I Late-beginner chess players, those who are almost on the cusp of being basically respectable, often fall into ...
Reasoning Models Sometimes Output Illegible Chains of Thought
24 Nov 2025
Contributed by Lukas
TL;DR: Models trained with outcome-based RL sometimes have reasoning traces that look very weird. In this paper, I evaluate 14 models and find that m...
The Coalition
24 Nov 2025
Contributed by Lukas
Summary A defensive military coalition is a key frame for thinking about our international agreement aimed at forestalling the development of superin...
Gemini 3 Pro Is a Vast Intelligence With No Spine
24 Nov 2025
Contributed by Lukas
It's A Great Model, Sir One might even say the best model. It is for now my default weapon of choice. Google's official announcement ...
“The LessWrong Team Was Selling Dollars For 86 Cents” by Screwtape
24 Nov 2025
Contributed by Lukas
You know what's sweeter than free money? Free money that you take from an organization of smart people who love prediction markets and having good va...
NATO is dangerously unaware of its military vulnerability
24 Nov 2025
Contributed by Lukas
NATO faces its gravest military disadvantage since 1949, as the balance of power has shifted decisively toward its adversaries in the Era of Drone Wa...
Inkhaven Retrospective
24 Nov 2025
Contributed by Lukas
Here I am on the plane on the way home from Inkhaven. Huge thanks to Ben Pace and the other organizers for inviting me. Lighthaven is a delightful ve...
“Stop Applying And Get To Work” by plex
23 Nov 2025
Contributed by Lukas
TL;DR: Figure out what needs doing and do it, don't wait on approval from fellowships or jobs. If you... Have short timelines Have been struggling t...
Show Review: Masquerade
23 Nov 2025
Contributed by Lukas
Earlier this month, I was pretty desperately feeling the need for a vacation. So after a little googling, I booked a flight to New York city, a hotel...
I’ll be sad to lose the puzzles
23 Nov 2025
Contributed by Lukas
My understanding is that even those advocating a pause or massive slowdown in the development of superintelligence think we should get there eventua...
You can just do things
23 Nov 2025
Contributed by Lukas
{early pause}... (you should have known this:) YOU (not just them over there) CAN (tho maybe you shouldn't?) JUST (its weirdly easy) DO (not talking/...
Literacy is Decreasing Among the Intellectual Class
23 Nov 2025
Contributed by Lukas
(Cross-posted from my Substack; written as part of the Halfhaven virtual blogging camp) Oh, you read Emily Post's Etiquette? What version? There's a...
Traditional Food
23 Nov 2025
Contributed by Lukas
Insulin resistance is bad. It doesn't just cause heart disease. Peter Attia, author of Outlive, the Science and Art of Longevity, makes a convincing[...
Easy vs Hard Emotional Vulnerability
23 Nov 2025
Contributed by Lukas
What blocks people from being vulnerable with others? Much ink has been spilled on two classes of answers to this question: Not everyone is in fact ...
What kind of person is DeepSeek’s founder, Liang Wenfeng? An answer from his old university classmate.
23 Nov 2025
Contributed by Lukas
Author: 清风学渣 Link: https://www.zhihu.com/question/10967114707/answer/1904046054904665233 Source: Zhihu Copyright belongs to the author. For c...
OpenAI Locks Down San Francisco Offices Following Alleged Threat From Activist
23 Nov 2025
Contributed by Lukas
A message on OpenAI's internal Slack claimed the activist in question had expressed interest in “causing physical harm to OpenAI employees.” Open...
Eight Heuristics of Anti-Epistemology
23 Nov 2025
Contributed by Lukas
Here are eight tools of anti-epistemology that I think anyone can use to hide their norm-violating behavior from being noticed, and deceive people ab...
“Book Review: Wizard’s Hall” by Screwtape
22 Nov 2025
Contributed by Lukas
Ever on the quilting goes, Spinning out the lives between, Winding up the souls of those Students up to one-thirteen There's a book about a young boy...
Market Logic I
22 Nov 2025
Contributed by Lukas
Audio note: this article contains 92 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the...
D&D.Sci Thanksgiving: the Festival Feast
22 Nov 2025
Contributed by Lukas
This is an entry in the 'Dungeons & Data Science' series, a set of puzzles where players are given a dataset to analyze and an objective to pursu...
Be Naughty
22 Nov 2025
Contributed by Lukas
Context: Post #10 in my sequence of private Lightcone Infrastructure memos edited for public consumption. This one, more so than any other one in th...
Abstract advice to researchers tackling the difficult core problems of AGI alignment
22 Nov 2025
Contributed by Lukas
Crosspost from my blog. This is some quickly-written, better-than-nothing advice for people who want to make progress on the hard problems of technical...
Why Not Just Train For Interpretability?
22 Nov 2025
Contributed by Lukas
Simplicio: Hey I’ve got an alignment research idea to run by you. Me: … guess we’re doing this again. Simplicio: Interpretability work on train...
“Natural emergent misalignment from reward hacking in production RL” by evhub, Monte M, Benjamin Wright, Jonathan Uesato
21 Nov 2025
Contributed by Lukas
Abstract We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignme...
“AI #143: Everything, Everywhere, All At Once” by Zvi
21 Nov 2025
Contributed by Lukas
Last week had the release of GPT-5.1, which I covered on Tuesday. This week included Gemini 3, Nano Banana Pro, Grok 4.1, GPT 5.1 Pro, GPT 5.1-Codex...
“Rescuing truth in mathematics from the Liar’s Paradox using fuzzy values” by Adrià Garriga-alonso
21 Nov 2025
Contributed by Lukas
Audio note: this article contains 169 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in th...
Contra Collisteru: You Get About One Carthage
21 Nov 2025
Contributed by Lukas
Collisteru suggests that you should oppose things. I would not say I oppose this. Instead, I would like to gently suggest an alternative strategy. Yo...
Reading My Diary: 10 Years Since CFAR
21 Nov 2025
Contributed by Lukas
In the Summer of 2015, I pretended to be sick for my school's prom and graduation, so that I could instead fly out to San Francisco to attend a works...
What Do We Tell the Humans? Errors, Hallucinations, and Lies in the AI Village
21 Nov 2025
Contributed by Lukas
Telling the truth is hard. Sometimes you don’t know what's true, sometimes you get confused, and sometimes you really don’t wanna cause lying can...
“Evrart Claire: A Case Study in Anti-Epistemology” by Ben Pace
21 Nov 2025
Contributed by Lukas
This man nearly tricked me. Evrart Claire, leader of the Dockworkers Union in Martinaise from the videogame Disco Elysium. I acknowledge that he is a ...
“The Boring Part of Bell Labs” by Elizabeth
21 Nov 2025
Contributed by Lukas
It took me a long time to realize that Bell Labs was cool. You see, my dad worked at Bell Labs, and he has not done a single cool thing in his life e...
“[Paper] Output Supervision Can Obfuscate the CoT” by jacob_drori, lukemarks, cloud, TurnTrout
20 Nov 2025
Contributed by Lukas
We show that training against a monitor that only sees outputs (not CoTs) can cause obfuscated[1] CoTs! The obfuscation happens in two ways: When a ...
“Dominance: The Standard Everyday Solution To Akrasia” by johnswentworth
20 Nov 2025
Contributed by Lukas
Here's the LessWrong tag page on Akrasia: Akrasia is the state of acting against one's better judgment. A canonical example is procrastination. Incre...
Gemini 3 is Evaluation-Paranoid and Contaminated
20 Nov 2025
Contributed by Lukas
TL;DR: Gemini 3 frequently thinks it is in an evaluation when it is not, assuming that all of its reality is fabricated. It can also reliably output...
“Thinking about reasoning models made me less worried about scheming” by Fabien Roger
20 Nov 2025
Contributed by Lukas
Reasoning models like Deepseek r1: Can reason in consequentialist ways and have vast knowledge about AI training Can reason for many serial steps, w...
“What Is The Basin Of Convergence For Kelly Betting?” by johnswentworth
20 Nov 2025
Contributed by Lukas
The basic rough argument for Kelly betting goes something like this. First, assume we’re making a sequence of T independent bets, one-after-another...
“In Defense of Goodness” by abramdemski
20 Nov 2025
Contributed by Lukas
This is a reaction to John Wentworth's post Human Values ≠ Goodness. In the post, John argues that the human concept of goodness comes apart from h...
“Out-paternalizing the government (getting oxygen for my baby)” by Ruby
20 Nov 2025
Contributed by Lukas
This post does not contain medical advice that most people should attempt to emulate. Considering this home treatment specifically made sense for us....
“Beren’s Essay on Obedience and Alignment” by StanislavKrym
20 Nov 2025
Contributed by Lukas
Like Daniel Kokotajlo's coverage of Vitalik's response to AI-2027, I've copied the author's text. This time the essay is actually good, but has littl...
“Preventing covert ASI development in countries within our agreement” by Aaron_Scher
20 Nov 2025
Contributed by Lukas
We at the Machine Intelligence Research Institute's Technical Governance Team have proposed an illustrative international agreement (blog post) to ha...
“Current LLMs seem to rarely detect CoT tampering” by Bart Bussmann, Arthur Conmy, Neel Nanda, Senthooran Rajamanoharan, Josh Engels, Bartosz Cywiński
19 Nov 2025
Contributed by Lukas
Authors: Bartosz Cywinski*, Bart Bussmann*, Arthur Conmy**, Neel Nanda**, Senthooran Rajamanoharan**, Joshua Engels** * equal primary contributor, or...
“The Bughouse Effect” by TsviBT
19 Nov 2025
Contributed by Lukas
Crosspost from my blog. What happens when you work closely with someone on a really difficult project—and then they seem to just fuck it up? This...
“Serious Flaws in CAST” by Max Harms
19 Nov 2025
Contributed by Lukas
Last year I wrote the CAST agenda, arguing that aiming for Corrigibility As Singular Target was the least-doomed way to make an AGI. (Though it is al...
“Memories of a British Boarding School #2” by Ben Pace
19 Nov 2025
Contributed by Lukas
I have been reliably informed that, while my last series of memories about boarding school were interesting to read, they were lacking in a key eleme...
“Automate, automate it all” by habryka
19 Nov 2025
Contributed by Lukas
Context: Post #9 in my sequence of private Lightcone Infrastructure memos edited for public consumption. First, a disclaimer. Before you automate som...
“How the aliens next door shower” by Ruby
19 Nov 2025
Contributed by Lukas
Episode Recap In this series, I have been building up the argument that other people's internal psychology is much weirder and a lot more alien than ...
“Victor Taelin’s notes on Gemini 3” by Gunnar_Zarncke
19 Nov 2025
Contributed by Lukas
Victor Taelin of Higher Order Company has some of the hardest computer science problems the LLMs most likely have never seen before and evaluated Gem...
“Anthropic is (probably) not meeting its RSP security commitments” by habryka
19 Nov 2025
Contributed by Lukas
TLDR: An AI company's model weight security is at most as good as its compute providers' security. Anthropic has committed (with a bit of ambiguity, ...
“Considerations for setting the FLOP thresholds in our example international AI agreement” by peterbarnett, Aaron_Scher
19 Nov 2025
Contributed by Lukas
We at the Machine Intelligence Research Institute's Technical Governance Team have proposed an illustrative international agreement (blog post) to ha...
“On Writing #2” by Zvi
18 Nov 2025
Contributed by Lukas
In honor of my dropping by Inkhaven at Lighthaven in Berkeley this week, I figured it was time for another writing roundup. You can find #1 here, fro...
“New Report: An International Agreement to Prevent the Premature Creation of Artificial Superintelligence” by Aaron_Scher, David Abecassis, Brian Abeyta, peterbarnett
18 Nov 2025
Contributed by Lukas
TLDR: We at the MIRI Technical Governance Team have released a report describing an example international agreement to halt the advancement towards a...
“Status Is The Game Of The Losers’ Bracket” by johnswentworth
18 Nov 2025
Contributed by Lukas
This post is written as a series of little thoughts and vignettes, all trying to gesture at the same idea. The hope is to convey the gestalt. Conside...
“Eat The Richtext” by dreeves
18 Nov 2025
Contributed by Lukas
A year and a half ago I vibe-coded a tool, Eat The Richtext, that I've been using practically every day (every week in any case) ever since. Friends ...
“Small batches and the mythical single piece flow” by habryka
18 Nov 2025
Contributed by Lukas
Context: Post #8 in my sequence of private Lightcone Infrastructure memos edited for public consumption. When you finish something, you learn somethi...
“How Colds Spread” by RobertM
18 Nov 2025
Contributed by Lukas
It seems like a catastrophic civilizational failure that we don't have confident common knowledge of how colds spread. There have been a number of st...
“Middlemen Are Eating the World (And That’s Good, Actually)” by Linch
18 Nov 2025
Contributed by Lukas
I think many people have some intuition that work can be separated between “real work“ (farming, say, or building trains) and “middlemen” (e....
“Why is American mass-market tea so terrible?” by RobertM
18 Nov 2025
Contributed by Lukas
Note: definitely true, especially my aesthetic preferences, and the speculative historical synthesis. There are some hedonic treadmills which, even a...
“An Analogue Of Set Relationships For Distribution” by johnswentworth, David Lorell
18 Nov 2025
Contributed by Lukas
Audio note: this article contains 86 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the...
“AI 2025 - Last Shipmas” by Simon Lermen
18 Nov 2025
Contributed by Lukas
ACT I: CHRISTMAS EVE It all starts with a cryptic tweet from Jimmy Apples on X. The tweet by Jimmy Apples makes people at other AI labs quite nervous....
“Varieties Of Doom” by jdp
18 Nov 2025
Contributed by Lukas
There has been a lot of talk about "p(doom)" over the last few years. This has always rubbed me the wrong way because "p(doom)" didn't feel like it m...
“Mediators: a different route through conflict” by Ben Pace
17 Nov 2025
Contributed by Lukas
(content note: discussion of war and mass death; also a long aside about the philosophy of apologies) After 100,000 people were killed in the Bosnia...
“Lobsang’s Children” by Tomás B.
17 Nov 2025
Contributed by Lukas
I study so hard. My grandfather makes me. It is not fun. My life is studying. I am home-schooled. I don't have much freedom. Grandfather says it is i...
“Close open loops” by habryka
17 Nov 2025
Contributed by Lukas
Context: Post #6 in my sequence of private Lightcone Infrastructure memos edited for public consumption. David Allen, of Getting Things Done fame say...
“Video games are philosophy’s playground” by Rachel Shu
17 Nov 2025
Contributed by Lukas
Crypto people have this saying: "cryptocurrencies are macroeconomics' playground." The idea is that blockchains let you cheaply spin up toy economies...
“Mixed Feelings on Social Munchkinry” by Screwtape
17 Nov 2025
Contributed by Lukas
This is less me expounding on a thesis and more me musing about a topic where I have conflicting intuitions. Epistemic status: exploratory. One thing...
“Diagonalization: A (slightly) more rigorous model of paranoia” by habryka
17 Nov 2025
Contributed by Lukas
In my post on Wednesday (Paranoia: A Beginner's Guide), I talked at a high level about the experience of paranoia, and gave two models (the lemons ma...