LessWrong (30+ Karma)

Claude 4.5 Opus’ Soul Document

29 Nov 2025

Contributed by Lukas

Summary: As far as I understand and uncovered, a document for the character training for Claude is compressed in Claude's weights. The full document c...

Should you work with evil people?

29 Nov 2025

Contributed by Lukas

Epistemic status: Figuring things out. My mind often wanders to what boundaries I ought to maintain between the different parts of my life and people...

Unless its governance changes, Anthropic is untrustworthy

29 Nov 2025

Contributed by Lukas

Anthropic is untrustworthy. This post provides arguments, asks questions, and documents some examples of Anthropic's leadership being misleading and ...

The Missing Genre: Heroic Parenthood - You can have kids and still punch the sun

29 Nov 2025

Contributed by Lukas

This is a link post. I stopped reading when I was 30. You can fill in all the stereotypes of a girl with a book glued to her face during every meal, e...

“Tests of LLM introspection need to rule out causal bypassing” by Adam Morris, Dillon Plunkett

29 Nov 2025

Contributed by Lukas

This point has been floating around implicitly in various papers (e.g., Betley et al., Plunkett et al., Lindsey), but we haven’t seen it named expl...

Not A Love Letter, But A Thank You Letter

29 Nov 2025

Contributed by Lukas

Context: Each day on her blog Letters To Boys, Gretta Duleba has been posting things she once sent to men she dated. One of those, recently, was a lo...

“Ruby’s Ultimate Guide to Thoughtful Gifts” by Ruby

28 Nov 2025

Contributed by Lukas

Give a man a gift and he smiles for a day. Teach a man to gift and he’ll cause smiles for the rest of his life. Gift giving is an exercise in theor...

Writing advice: Why people like your quick bullshit takes better than your high-effort posts

28 Nov 2025

Contributed by Lukas

Right now I’m coaching for Inkhaven, a month-long marathon writing event where our brave residents are writing a blog post every single day for the...

A Thanksgiving Memory

28 Nov 2025

Contributed by Lukas

A couple of days before Thanksgiving when I was 11 years old, the house my family lived in was gutted by a fire. We had been stripping the white pain...

Claude Opus 4.5: Model Card, Alignment and Safety

28 Nov 2025

Contributed by Lukas

They saved the best for last. The contrast in model cards is stark. Google provided a brief overview of its tests for Gemini 3 Pro, with a lot of ‘...

A Taxonomy of Bugs (Lists)

28 Nov 2025

Contributed by Lukas

One of my favorite exercises from CFAR is the Bugs List. By writing down everything in your life that feels "off," things you'd change if you could, ...

“Despair, Serenity, Song and Nobility in ‘Hollow Knight: Silksong’” by Ben Pace

28 Nov 2025

Contributed by Lukas

Fictional universes are oft defined by what their positive affect feels like and what their negative affect feels like. This is the palette that the ...

The Best Lack All Conviction: A Confusing Day in the AI Village

28 Nov 2025

Contributed by Lukas

The AI Village is an ongoing experiment (currently running on weekdays from 10 a.m. to 2 p.m. Pacific time) in which frontier language models are giv...

Will We Get Alignment by Default? — with Adrià Garriga-Alonso

28 Nov 2025

Contributed by Lukas

This is a link post. Adrià recently published “Alignment will happen by default; what's next?” on LessWrong, arguing that AI alignment is turning...

Information Hygiene

28 Nov 2025

Contributed by Lukas

Do avalanches get caused by loud noises? From every time I’ve given this class or lecture, at least 7/10 of you are nodding yes, and the main reaso...

You Are Much More Salient To Yourself Than To Everyone Else

28 Nov 2025

Contributed by Lukas

Back in the Boy Scouts, at summer camp, myself and a couple friends snuck out one night after curfew to commandeer a couch someone had left by a dump...

“Subliminal Learning Across Models” by draganover, Andi Bhongade, tolgadur, Mary Phuong, LASR Labs

27 Nov 2025

Contributed by Lukas

Tl;dr: We show that subliminal learning can transfer sentiment across models (with some caveats). For example, we transfer positive sentiment for Ca...

Alignment remains a hard, unsolved problem

27 Nov 2025

Contributed by Lukas

Thanks to (in alphabetical order) Joshua Batson, Roger Grosse, Jeremy Hadfield, Jared Kaplan, Jan Leike, Jack Lindsey, Monte MacDiarmid, Francesco Mo...

Just explain it to someone

27 Nov 2025

Contributed by Lukas

I was once helping a child with her homework. She was supposed to write about a place that was important for her, and had chosen her family's summer ...

Principles of a Rationality Dojo

26 Nov 2025

Contributed by Lukas

Since 2023 I've been directing WARP, the Wandering Applied Rationality Program with the help of ESPR and SPARC staff, which are summer camps I've tau...

Postmodernism for STEM Types: A Clear-Language Guide to Conflict Theory

26 Nov 2025

Contributed by Lukas

Crossposted from Substack. Section I: Opening. In 2021, Richard Dawkins tweeted: The fallout was immediate. The American Humanist Association revoked...

“Training PhD Students to be Fat Newts (Part 2)” by alkjash

26 Nov 2025

Contributed by Lukas

[Thanks Inkhaven for hosting me! This is my fourth and last post and I'm already exhausted from writing. Wordpress.com!] Last time, I introduced the ...

Snippets on Living In Reality

26 Nov 2025

Contributed by Lukas

Social reality is quite literally another world, in the same sense that the Harry Potter universe is another world. Like the Harry Potter universe, s...

Courtship Confusions Post-Slutcon

26 Nov 2025

Contributed by Lukas

Going into slutcon, one of my main known-unknowns was… I’d heard many times that the standard path to hooking up or dating starts with two people...

Training PhD Students to be Fat Newts (Part 1)

26 Nov 2025

Contributed by Lukas

This is a link post. Today, I want to introduce an experimental PhD student training philosophy. Let's start with some reddit memes. Every gaming sub...

“Evaluating honesty and lie detection techniques on a diverse suite of dishonest models” by Sam Marks, Johannes Treutlein, evhub, Fabien Roger

26 Nov 2025

Contributed by Lukas

TL;DR: We use a suite of testbed settings where models lie—i.e. generate statements they believe to be false—to evaluate honesty and lie detectio...

Takeaways from the Eleos Conference on AI Consciousness and Welfare

26 Nov 2025

Contributed by Lukas

Crossposted from my Substack. I spent the weekend at Lighthaven, attending the Eleos conference. In this post, I share thoughts and updates as I refl...

Evolution & Freedom

26 Nov 2025

Contributed by Lukas

In Against Money Maximalism, I argued against money-maximization as a normative stance. Profit is a coherent thing you can try to maximize, but there...

Reasons Why I Cannot Sleep

26 Nov 2025

Contributed by Lukas

My therapist says I'm more tired today than she's ever seen me. Here are some reasons my brain says I cannot sleep: My boss might lose faith in my a...

The Economics of Replacing Call Center Workers With AIs

26 Nov 2025

Contributed by Lukas

TLDR: Voice AIs aren't that much cheaper in the year 2025. My friend runs a voice agent startup in Canada for walk-in clinics. The AI takes calls and ...

Three things that surprised me about technical grantmaking at Coefficient Giving (fka Open Phil)

26 Nov 2025

Contributed by Lukas

Open Philanthropy's Coefficient Giving's Technical AI Safety team is hiring grantmakers. I thought this would be a good moment to share some positive...

“OpenAI finetuning metrics: What is going on with the loss curves?” by jorio, James Chua

26 Nov 2025

Contributed by Lukas

Introduction: For our current project, we've been using the OpenAI fine-tuning API. To run some of our experiments, we needed to understand exactly ho...

Alignment will happen by default. What’s next?

25 Nov 2025

Contributed by Lukas

I’m not 100% convinced of this, but I’m fairly convinced, more and more so over time. I’m hoping to start a vigorous but civilized debate. I in...

“Maybe Insensitive Functions are a Natural Ontology Generator?” by johnswentworth

25 Nov 2025

Contributed by Lukas

The most canonical example of a "natural ontology" comes from gasses in stat mech. In the simplest version, we model the gas as a bunch of little bil...

The Enemy Gets The Last Hit

24 Nov 2025

Contributed by Lukas

Disclaimer: I am god-awful at chess. I. Late-beginner chess players, those who are almost on the cusp of being basically respectable, often fall into ...

Reasoning Models Sometimes Output Illegible Chains of Thought

24 Nov 2025

Contributed by Lukas

TL;DR: Models trained with outcome-based RL sometimes have reasoning traces that look very weird. In this paper, I evaluate 14 models and find that m...

The Coalition

24 Nov 2025

Contributed by Lukas

Summary: A defensive military coalition is a key frame for thinking about our international agreement aimed at forestalling the development of superin...

Gemini 3 Pro Is a Vast Intelligence With No Spine

24 Nov 2025

Contributed by Lukas

It's A Great Model, Sir. One might even say the best model. It is for now my default weapon of choice. Google's official announcement ...

“The LessWrong Team Was Selling Dollars For 86 Cents” by Screwtape

24 Nov 2025

Contributed by Lukas

You know what's sweeter than free money? Free money that you take from an organization of smart people who love prediction markets and having good va...

NATO is dangerously unaware of its military vulnerability

24 Nov 2025

Contributed by Lukas

NATO faces its gravest military disadvantage since 1949, as the balance of power has shifted decisively toward its adversaries in the Era of Drone Wa...

Inkhaven Retrospective

24 Nov 2025

Contributed by Lukas

Here I am on the plane on the way home from Inkhaven. Huge thanks to Ben Pace and the other organizers for inviting me. Lighthaven is a delightful ve...

“Stop Applying And Get To Work” by plex

23 Nov 2025

Contributed by Lukas

TL;DR: Figure out what needs doing and do it; don't wait on approval from fellowships or jobs. If you... have short timelines, have been struggling t...

Show Review: Masquerade

23 Nov 2025

Contributed by Lukas

Earlier this month, I was pretty desperately feeling the need for a vacation. So after a little googling, I booked a flight to New York City, a hotel...

I’ll be sad to lose the puzzles

23 Nov 2025

Contributed by Lukas

My understanding is that even those advocating a pause or massive slowdown in the development of superintelligence think we should get there eventua...

You can just do things

23 Nov 2025

Contributed by Lukas

{early pause}... (you should have known this:) YOU (not just them over there) CAN (tho maybe you shouldn't?) JUST (its weirdly easy) DO (not talking/...

Literacy is Decreasing Among the Intellectual Class

23 Nov 2025

Contributed by Lukas

(Cross-posted from my Substack; written as part of the Halfhaven virtual blogging camp) Oh, you read Emily Post's Etiquette? What version? There's a...

Traditional Food

23 Nov 2025

Contributed by Lukas

Insulin resistance is bad. It doesn't just cause heart disease. Peter Attia, author of Outlive, the Science and Art of Longevity, makes a convincing[...

Easy vs Hard Emotional Vulnerability

23 Nov 2025

Contributed by Lukas

What blocks people from being vulnerable with others? Much ink has been spilled on two classes of answers to this question: Not everyone is in fact ...

What kind of person is DeepSeek’s founder, Liang Wenfeng? An answer from his old university classmate.

23 Nov 2025

Contributed by Lukas

Author: 清风学渣 Link: https://www.zhihu.com/question/10967114707/answer/1904046054904665233 Source: Zhihu Copyright belongs to the author. For c...

OpenAI Locks Down San Francisco Offices Following Alleged Threat From Activist

23 Nov 2025

Contributed by Lukas

A message on OpenAI's internal Slack claimed the activist in question had expressed interest in “causing physical harm to OpenAI employees.” Open...

Eight Heuristics of Anti-Epistemology

23 Nov 2025

Contributed by Lukas

Here are eight tools of anti-epistemology that I think anyone can use to hide their norm-violating behavior from being noticed, and deceive people ab...

“Book Review: Wizard’s Hall” by Screwtape

22 Nov 2025

Contributed by Lukas

Ever on the quilting goes, / Spinning out the lives between, / Winding up the souls of those / Students up to one-thirteen. There's a book about a young boy...

Market Logic I

22 Nov 2025

Contributed by Lukas

Audio note: this article contains 92 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the...

D&D.Sci Thanksgiving: the Festival Feast

22 Nov 2025

Contributed by Lukas

This is an entry in the 'Dungeons & Data Science' series, a set of puzzles where players are given a dataset to analyze and an objective to pursu...

Be Naughty

22 Nov 2025

Contributed by Lukas

Context: Post #10 in my sequence of private Lightcone Infrastructure memos edited for public consumption. This one, more so than any other one in th...

Abstract advice to researchers tackling the difficult core problems of AGI alignment

22 Nov 2025

Contributed by Lukas

Crosspost from my blog. This is some quickly-written, better-than-nothing advice for people who want to make progress on the hard problems of technical...

Why Not Just Train For Interpretability?

22 Nov 2025

Contributed by Lukas

Simplicio: Hey I’ve got an alignment research idea to run by you. Me: … guess we’re doing this again. Simplicio: Interpretability work on train...

“Natural emergent misalignment from reward hacking in production RL” by evhub, Monte M, Benjamin Wright, Jonathan Uesato

21 Nov 2025

Contributed by Lukas

Abstract: We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignme...

“AI #143: Everything, Everywhere, All At Once” by Zvi

21 Nov 2025

Contributed by Lukas

Last week had the release of GPT-5.1, which I covered on Tuesday. This week included Gemini 3, Nano Banana Pro, Grok 4.1, GPT 5.1 Pro, GPT 5.1-Codex...

“Rescuing truth in mathematics from the Liar’s Paradox using fuzzy values” by Adrià Garriga-alonso

21 Nov 2025

Contributed by Lukas

Audio note: this article contains 169 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in th...

Contra Collisteru: You Get About One Carthage

21 Nov 2025

Contributed by Lukas

Collisteru suggests that you should oppose things. I would not say I oppose this. Instead, I would like to gently suggest an alternative strategy. Yo...

Reading My Diary: 10 Years Since CFAR

21 Nov 2025

Contributed by Lukas

In the Summer of 2015, I pretended to be sick for my school's prom and graduation, so that I could instead fly out to San Francisco to attend a works...

What Do We Tell the Humans? Errors, Hallucinations, and Lies in the AI Village

21 Nov 2025

Contributed by Lukas

Telling the truth is hard. Sometimes you don’t know what's true, sometimes you get confused, and sometimes you really don’t wanna cause lying can...

“Evrart Claire: A Case Study in Anti-Epistemology” by Ben Pace

21 Nov 2025

Contributed by Lukas

This man nearly tricked me. Evrart Claire, leader of the Dockworkers Union in Martinaise from the videogame Disco Elysium. I acknowledge that he is a ...

“The Boring Part of Bell Labs” by Elizabeth

21 Nov 2025

Contributed by Lukas

It took me a long time to realize that Bell Labs was cool. You see, my dad worked at Bell Labs, and he has not done a single cool thing in his life e...

“[Paper] Output Supervision Can Obfuscate the CoT” by jacob_drori, lukemarks, cloud, TurnTrout

20 Nov 2025

Contributed by Lukas

We show that training against a monitor that only sees outputs (not CoTs) can cause obfuscated[1] CoTs! The obfuscation happens in two ways: When a ...

“Dominance: The Standard Everyday Solution To Akrasia” by johnswentworth

20 Nov 2025

Contributed by Lukas

Here's the LessWrong tag page on Akrasia: Akrasia is the state of acting against one's better judgment. A canonical example is procrastination. Incre...

Gemini 3 is Evaluation-Paranoid and Contaminated

20 Nov 2025

Contributed by Lukas

TL;DR: Gemini 3 frequently thinks it is in an evaluation when it is not, assuming that all of its reality is fabricated. It can also reliably output...

“Thinking about reasoning models made me less worried about scheming” by Fabien Roger

20 Nov 2025

Contributed by Lukas

Reasoning models like DeepSeek R1: can reason in consequentialist ways and have vast knowledge about AI training; can reason for many serial steps, w...

“What Is The Basin Of Convergence For Kelly Betting?” by johnswentworth

20 Nov 2025

Contributed by Lukas

The basic rough argument for Kelly betting goes something like this. First, assume we’re making a sequence of T independent bets, one-after-another...

“In Defense of Goodness” by abramdemski

20 Nov 2025

Contributed by Lukas

This is a reaction to John Wentworth's post Human Values ≠ Goodness. In the post, John argues that the human concept of goodness comes apart from h...

“Out-paternalizing the government (getting oxygen for my baby)” by Ruby

20 Nov 2025

Contributed by Lukas

This post does not contain medical advice that most people should attempt to emulate. Considering this home treatment specifically made sense for us....

“Beren’s Essay on Obedience and Alignment” by StanislavKrym

20 Nov 2025

Contributed by Lukas

Like Daniel Kokotajlo's coverage of Vitalik's response to AI-2027, I've copied the author's text. This time the essay is actually good, but has littl...

“Preventing covert ASI development in countries within our agreement” by Aaron_Scher

20 Nov 2025

Contributed by Lukas

We at the Machine Intelligence Research Institute's Technical Governance Team have proposed an illustrative international agreement (blog post) to ha...

“Current LLMs seem to rarely detect CoT tampering” by Bart Bussmann, Arthur Conmy, Neel Nanda, Senthooran Rajamanoharan, Josh Engels, Bartosz Cywiński

19 Nov 2025

Contributed by Lukas

Authors: Bartosz Cywinski*, Bart Bussmann*, Arthur Conmy**, Neel Nanda**, Senthooran Rajamanoharan**, Joshua Engels** * equal primary contributor, or...

“The Bughouse Effect” by TsviBT

19 Nov 2025

Contributed by Lukas

Crosspost from my blog. What happens when you work closely with someone on a really difficult project—and then they seem to just fuck it up? This...

“Serious Flaws in CAST” by Max Harms

19 Nov 2025

Contributed by Lukas

Last year I wrote the CAST agenda, arguing that aiming for Corrigibility As Singular Target was the least-doomed way to make an AGI. (Though it is al...

“Memories of a British Boarding School #2” by Ben Pace

19 Nov 2025

Contributed by Lukas

I have been reliably informed that, while my last series of memories about boarding school were interesting to read, they were lacking in a key eleme...

“Automate, automate it all” by habryka

19 Nov 2025

Contributed by Lukas

Context: Post #9 in my sequence of private Lightcone Infrastructure memos edited for public consumption. First, a disclaimer. Before you automate som...

“How the aliens next door shower” by Ruby

19 Nov 2025

Contributed by Lukas

Episode Recap: In this series, I have been building up the argument that other people's internal psychology is much weirder and a lot more alien than ...

“Victor Taelin’s notes on Gemini 3” by Gunnar_Zarncke

19 Nov 2025

Contributed by Lukas

Victor Taelin of Higher Order Company has some of the hardest computer science problems, ones that LLMs have most likely never seen before, and evaluated Gem...

“Anthropic is (probably) not meeting its RSP security commitments” by habryka

19 Nov 2025

Contributed by Lukas

TLDR: An AI company's model weight security is at most as good as its compute providers' security. Anthropic has committed (with a bit of ambiguity, ...

“Considerations for setting the FLOP thresholds in our example international AI agreement” by peterbarnett, Aaron_Scher

19 Nov 2025

Contributed by Lukas

We at the Machine Intelligence Research Institute's Technical Governance Team have proposed an illustrative international agreement (blog post) to ha...

“On Writing #2” by Zvi

18 Nov 2025

Contributed by Lukas

In honor of my dropping by Inkhaven at Lighthaven in Berkeley this week, I figured it was time for another writing roundup. You can find #1 here, fro...

“New Report: An International Agreement to Prevent the Premature Creation of Artificial Superintelligence” by Aaron_Scher, David Abecassis, Brian Abeyta, peterbarnett

18 Nov 2025

Contributed by Lukas

TLDR: We at the MIRI Technical Governance Team have released a report describing an example international agreement to halt the advancement towards a...

“Status Is The Game Of The Losers’ Bracket” by johnswentworth

18 Nov 2025

Contributed by Lukas

This post is written as a series of little thoughts and vignettes, all trying to gesture at the same idea. The hope is to convey the gestalt. Conside...

“Eat The Richtext” by dreeves

18 Nov 2025

Contributed by Lukas

A year and a half ago I vibe-coded a tool, Eat The Richtext, that I've been using practically every day (every week in any case) ever since. Friends ...

“Small batches and the mythical single piece flow” by habryka

18 Nov 2025

Contributed by Lukas

Context: Post #8 in my sequence of private Lightcone Infrastructure memos edited for public consumption. When you finish something, you learn somethi...

“How Colds Spread” by RobertM

18 Nov 2025

Contributed by Lukas

It seems like a catastrophic civilizational failure that we don't have confident common knowledge of how colds spread. There have been a number of st...

“Middlemen Are Eating the World (And That’s Good, Actually)” by Linch

18 Nov 2025

Contributed by Lukas

I think many people have some intuition that work can be separated between “real work“ (farming, say, or building trains) and “middlemen” (e....

“Why is American mass-market tea so terrible?” by RobertM

18 Nov 2025

Contributed by Lukas

Note: definitely true, especially my aesthetic preferences, and the speculative historical synthesis. There are some hedonic treadmills which, even a...

“An Analogue Of Set Relationships For Distribution” by johnswentworth, David Lorell

18 Nov 2025

Contributed by Lukas

Audio note: this article contains 86 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the...

“AI 2025 - Last Shipmas” by Simon Lermen

18 Nov 2025

Contributed by Lukas

ACT I: CHRISTMAS EVE It all starts with a cryptic tweet from Jimmy Apples on X. The tweet by Jimmy Apples makes people at other AI labs quite nervous....

“Varieties Of Doom” by jdp

18 Nov 2025

Contributed by Lukas

There has been a lot of talk about "p(doom)" over the last few years. This has always rubbed me the wrong way because "p(doom)" didn't feel like it m...

“Mediators: a different route through conflict” by Ben Pace

17 Nov 2025

Contributed by Lukas

(content note: discussion of war and mass death; also a long aside about the philosophy of apologies) After 100,000 people were killed in the Bosnia...

“Lobsang’s Children” by Tomás B.

17 Nov 2025

Contributed by Lukas

I study so hard. My grandfather makes me. It is not fun. My life is studying. I am home-schooled. I don't have much freedom. Grandfather says it is i...

“Close open loops” by habryka

17 Nov 2025

Contributed by Lukas

Context: Post #6 in my sequence of private Lightcone Infrastructure memos edited for public consumption. David Allen, of Getting Things Done fame say...

“Video games are philosophy’s playground” by Rachel Shu

17 Nov 2025

Contributed by Lukas

Crypto people have this saying: "cryptocurrencies are macroeconomics' playground." The idea is that blockchains let you cheaply spin up toy economies...

“Mixed Feelings on Social Munchkinry” by Screwtape

17 Nov 2025

Contributed by Lukas

This is less me expounding on a thesis and more me musing about a topic where I have conflicting intuitions. Epistemic status: exploratory. One thing...

“Diagonalization: A (slightly) more rigorous model of paranoia” by habryka

17 Nov 2025

Contributed by Lukas

In my post on Wednesday (Paranoia: A Beginner's Guide), I talked at a high level about the experience of paranoia, and gave two models (the lemons ma...
