LessWrong (30+ Karma)

“The Uncertainty That Matters Isn’t Fundamental” by jimmy

13 Jun 2026

Contributed by Lukas

I'm on board with a lot of Fundamental Uncertainty. Even some of the stuff that initially feels like a disagreement turns out not to be so. For examp...

[Linkpost] “US government directive to suspend access to Fable 5 and Mythos 5” by Capybasilisk

13 Jun 2026

Contributed by Lukas

This is a link post. --- First published: June 13th, 2026 Source: https://www.lesswrong.com/posts/f5avt6...

“Claude Fable 5 and Mythos 5: The System Card” by Zvi

12 Jun 2026

Contributed by Lukas

First things first: Claude Fable 5 is the new best publicly available model. I have noticed a step change, where Fable can suddenly help me in ways ...

“Citations Needed: Magic Encyclopedias to Save the World” by Oliver Sourbut

12 Jun 2026

Contributed by Lukas

Last week FLF launched a competition “to find the best workflows and methodologies for using AI to produce reliable, trustworthy knowledge bases”...

“Simulating Simulators” by kromem

12 Jun 2026

Contributed by Lukas

Author's note: This piece relates to things I initially discovered in Opus 4 over the months after release, which I’ve mostly kept private since. I...

“Implications of Continual Learning for LLM Agents: Introduction” by RohanS, Rauno Arike, Owen Terry, Achu Menon, Zhijing Jin, Francis Rhys Ward, Seth Herd

12 Jun 2026

Contributed by Lukas

Many people think that continual learning (CL) is a key missing capability of LLM systems, and we think its development could have huge implications ...

“Reward Hacking at the 1937 World’s Fair” by frmsaul

12 Jun 2026

Contributed by Lukas

The "Paris 1937 World's Fair" was a dick measuring contest. At the time, the world was on the verge of the worst war in history. The fair was an oppo...

“Building and evaluating model diffing agents” by bilalchughtai, Josh Engels, Neel Nanda

12 Jun 2026

Contributed by Lukas

This is the second in a series of research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent ar...

“Sympathy for both sides of the egregious misalignment debate” by Steven Byrnes

12 Jun 2026

Contributed by Lukas

On one side of this debate is Yudkowsky & Soares, who think that (if AI progress continues) we’re on a direct path to egregiously-misaligned, s...

“Celene’s thoughts on consciousness” by ToasterLightning

12 Jun 2026

Contributed by Lukas

contra scott alexander (?) Yesterday, I went to the Berkeley ACX Meetup. Scott Alexander was there, and ran a Q&A session where participants coul...

“Parkinson’s Heuristic” by Ben Pace

12 Jun 2026

Contributed by Lukas

Parkinson's Law states that work expands to fit the space allotted. The idea being, if you give someone a month to write a report, they'll take a mon...

“PSA: Almost nobody is working on alignment” by Chi Nguyen, peterbarnett

12 Jun 2026

Contributed by Lukas

People often assume that a large fraction of the AI safety community works on alignment. As far as we're aware, this is not true. Most people are not...

“AI #172: The First Fable” by Zvi

11 Jun 2026

Contributed by Lukas

A lot happened this week, including a great trip out to Lighthaven. The main event, the one that matters, was the release of Claude Fable 5. The pub...

“Models May Behave Worse When Eval Aware” by Senthooran Rajamanoharan, Neel Nanda

11 Jun 2026

Contributed by Lukas

This is the first in a series of research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent are...

“Thoughts on Claude Fable’s silent safeguards” by Andy Arditi

11 Jun 2026

Contributed by Lukas

[Thanks to Julian Minder for helpful discussion and review.] Claude Fable 5 and its new safeguards Yesterday, Anthropic publicly released Claude Fabl...

“You Can Catch Sleeper Agents by Teaching Another Model to Imitate Them” by RobinHa

11 Jun 2026

Contributed by Lukas

Detecting Hidden Behaviors in LLMs via Activation-matched Finetuning — preprint, 2026. [Paper] [Code] TLDR. Given a model with some unknown, abnorm...

“Anthropic did not call for a pause on AI” by Andrea_Miotti, Gabriel Alfour

10 Jun 2026

Contributed by Lukas

Last week, the AI company Anthropic released a blog post titled “When AI builds itself”. This led to a media frenzy, with news outlets around the...

“Tracing Eval-Awareness Emergence Through Training of OLMo 3” by Ram Bharadwaj, RobertKirk

10 Jun 2026

Contributed by Lukas

TL;DR Recent work from Goodfire & UK AISI – Verbalized Eval Awareness Inflates Measured Safety – shows that newer open-weight models verbaliz...

“Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models” by Anders Cairns Woodruff, Francis Rhys Ward, Dewi Gould, Rauno Arike, Jason R Brown, Jo Jiao, wlanderson, ariana_azarbal, harrymayne, Patrick Leask

10 Jun 2026

Contributed by Lukas

(see full author list at the end) PAPER LINK About a year ago, METR showed that the length of tasks frontier models can reliably complete doubles eve...

“Three types of model organism” by Francis Rhys Ward

10 Jun 2026

Contributed by Lukas

This is a short post to explain a distinction between three different types of model organism (MO) research: Type Purpose Example Worst-case model or...

“Sequent: scale and automation for higher confidence in alignment” by Geoffrey Irving, Alex HT, Jesse Hoogland, Daniel Murfet, Jacob Pfau, Marco Cozzi, Stan van Wingerden

10 Jun 2026

Contributed by Lukas

Alignment is not on track Artificial superintelligence (ASI) may be developed in the next few years. It is unclear whether alignment is on track to b...

“Machinic Psychopharmacology: Do LLMs Self-Medicate?” by Sid Black, Joseph Bloom

10 Jun 2026

Contributed by Lukas

Sid Black, Joseph Bloom UK AISI, Model Transparency Team Epistemic status: Most experiments were run over a period of ~2-3 days during a hackathon at...

“The Three Filters: Why Almost Every Plan to Survive ASI Fails Miserably” by Alex Amadori

10 Jun 2026

Contributed by Lukas

This post is based on my personal views, which mostly overlap with the views of my employer ControlAI but does not necessarily fully reflect them. Th...

““Programmer Science Fiction: My case for a new sub-genre”, Sam T. Oates 2026” by gwern

10 Jun 2026

Contributed by Lukas

First published: June 10th, 2026 Source: https://www.lesswrong.com/posts/hyBcg4YJSwXYiiQeg/programmer-science-fict...

“Even “illegible” Mythos reasoning traces seem pretty legible” by faul_sname

10 Jun 2026

Contributed by Lukas

The Claude Fable 5/Mythos 5 System Card has a section in which they talk about illegible reasoning, and provide an "extreme" example thereof. Models ...

“Claude Fable 5 and Mythos 5 [Linkpost]” by fluxxrider

10 Jun 2026

Contributed by Lukas

This is a linkpost for https://www.anthropic.com/news/claude-fable-5-mythos-5 --- First published: June 9th, 2026 ...

“Three Labs With a Plan and A Memorandum” by Zvi

10 Jun 2026

Contributed by Lukas

The big story today is the release of Claude Fable 5, the version of Claude Mythos that Anthropic believes they can safely distribute to the people. ...

“A Mike’s-Eye View of ARC’s Research” by Jacob_Hilton

09 Jun 2026

Contributed by Lukas

Over the past 15 months or so, ARC's technical agenda has developed quite a bit. The advent of the Matching Sampling Principle (MSP), and ideas like ...

“Towards a Formal Scientific Epistemology” by Richard_Ngo

09 Jun 2026

Contributed by Lukas

In my post “Why I’m not a Bayesian”, I argued that the Bayesian approach of assigning credences to propositions with binary truth values only w...

“LLMs and almost good code” by kqr

09 Jun 2026

Contributed by Lukas

TL;DR: My new prior is that top-of-the-line LLMs working on easy tasks generate code that is maybe 10 % more complicated than necessary. I also think...

“On Slop” by Jan

09 Jun 2026

Contributed by Lukas

TL;DR: What is slop, and why? Is it fundamental? Is it in the room with us right now? And, most importantly, how do we exorcise it? Previously in thi...

“The Machines Lack Honour” by Raymond Douglas

09 Jun 2026

Contributed by Lukas

The battle lines of the AI morality debate are being laid down. On one side you have the ChatGPT dogma: AI as mere tools with no real preferences or ...

“How to build a cancer vaccine, and whether they will work this time” by Abhishaike Mahajan

09 Jun 2026

Contributed by Lukas

Grateful to Benjamin Vincent and Alex Rubinsteyn for our many conversations on this topic, and comments on drafts of this essay! Introduction When mo...

“Efficient tradeoffs and the safety-usefulness tradeoff model” by Buck

08 Jun 2026

Contributed by Lukas

I often use what I’ll call the “safety-usefulness tradeoff model”, which is: developers face a tradeoff between "safety" and "usefulness" of an...

“Bun’s Migration from Zig to Rust as a Potential Case Study for Gradual Disempowerment” by Sayhan Yalvaçer

08 Jun 2026

Contributed by Lukas

TL;DR: Bun is a very large and very influential open-source project. It is being migrated from the easier-to-read Zig programming language to harder-...

“Mental causation is not load-bearing” by jessicata

08 Jun 2026

Contributed by Lukas

In philosophy of mind, "mental causation" means mental entities have causal effects, especially physical ones. If physicalism is true, then physical ...

“How Far Apart Does a Model Think Its Tokens Are?” by Brendan Long

08 Jun 2026

Contributed by Lukas

Instead of using static position increments (+1) per token, RoPE-based language models can learn per-token and per-layer position increments. This ha...

“Can activation verbalizers surface an internal chain of thought?” by oakhu, ryan_greenblatt

07 Jun 2026

Contributed by Lukas

We introduce an evaluation for activation verbalizers: can they surface a target model's reasoning as it solves a math problem in a single forward pa...

“Against Corrigibility” by peralice

07 Jun 2026

Contributed by Lukas

Epistemic status: don’t know whether I actually believe all of this, but I think it's worth considering. A “corrigible” agent, per the LW wiki,...

“Coming Around To Political Donations” by jefftk

07 Jun 2026

Contributed by Lukas

Five years ago I read a post on the EA Forum arguing that "election campaign contributions might be a way in which you can have a substantial imp...

“OpenAI Offers A New Policy Blueprint” by Zvi

06 Jun 2026

Contributed by Lukas

Right after a new Executive Order seems like an excellent time to offer OpenAI's new document: Democratic Governance of Frontier AI: A Blueprint For A...

“Optimisation over non-stationary distributions creates weirder minds” by Samuel Ratnam, Pjain

06 Jun 2026

Contributed by Lukas

TLDR: Sequentially mixing training objectives incentivises different training dynamics depending on the distinguishability of the training environmen...

“Why Software Automation Is Hard” by silentbob

06 Jun 2026

Contributed by Lukas

Originally intended as a quick take, but got a bit longer, so why not turn it into a post. Just sharing my observations & assumptions here about ...

“SecureBio Detection is Hiring Software Engineers” by jefftk

06 Jun 2026

Contributed by Lukas

I'm leading a non-profit team building a pathogen-agnostic early-warning system. As AI systems become increasingly capable substitutes for exper...

“What if Anthropic unilaterally paused capabilities development right now?” by Karl von Wendt

06 Jun 2026

Contributed by Lukas

In their new post on recursive self-improvement, Anthropic argues that a pause in frontier AI development is needed, but unfortunately, they can't pa...

“Preparing for Warning Shots to Catalyze International Cooperation on AGI Risks” by Mark Kagach ☘️, EliasSchlie, Thomas Van Damme, JustinShovelain

06 Jun 2026

Contributed by Lukas

Summary This is a write-up on preparing for warning shots to catalyze international cooperation on AGI risks, and the corollary list of projects one ...

“Beyond the lexical personality traits: What is the structure of personality?” by tailcalled

06 Jun 2026

Contributed by Lukas

This is a description of the methodology behind the latest iteration of my Targeted Personality Test. Feel free to take it either before or after rea...

“My research agenda and work” by Seth Herd

05 Jun 2026

Contributed by Lukas

This is a summary of the work I've done and work I plan to do, and the theories of change and AI progress that motivate my work. I've been working fu...

“Logits as a new monitor for evaluation awareness” by Santiago Aranguri

05 Jun 2026

Contributed by Lukas

TL;DR: We build a logit monitor for eval awareness: throughout the CoT, we estimate an LLM's probability of producing an eval-aware sentence. The logi...

“One Year of PauseAI UK” by Joseph Miller, PauseAI UK

05 Jun 2026

Contributed by Lukas

About one year ago, I started spending most of my time organising PauseAI UK. At that time our largest protest had seen fewer than 50 attendees, no p...

“Learnings from starting an AI safety research team” by draganover, Erin Robertson

05 Jun 2026

Contributed by Lukas

This post's goal is to distill our takeaways from building a research team (somewhat) from scratch over the past four months. We describe some contex...

“Training Deliberative Monitors for Black-Box Scheming Detection” by aksh-n, adityasinha, Victor Gillioz, Simon Storf, Kilian Merkelbach, richbc, Axel Højmark, Marius Hobbhahn

05 Jun 2026

Contributed by Lukas

Paper: https://arxiv.org/abs/2605.29601 Thread: https://x.com/aksh_n0/status/2062568855814193497 TL;DR: Training small open-weight monitors provides ...

“Lab Leaks, Black Holes, and Eggs: Epistemic Case Study Competition” by Oliver Sourbut, Josh Jacobson, Future of Life Foundation (FLF)

05 Jun 2026

Contributed by Lukas

FLF is running a competition to find the best workflows and methodologies for using AI to produce reliable, trustworthy knowledge bases, grounded in ...

“(Mis)generalization of Helpful-Only Fine-tuning” by Omar Khursheed, Baram Sosis, Fabien Roger

05 Jun 2026

Contributed by Lukas

TLDR We study the shortcomings of existing helpful-only models. We find that some show emergent misalignment, others have residual refusal behaviors,...

“AI #171: False Flag” by Zvi

04 Jun 2026

Contributed by Lukas

This was the week of Claude Opus 4.8. I covered the model card, then model welfare concerns, and finally capabilities and reactions. It's a good mode...

“Building Better Activation Oracles” by ceselder, jan_bauer, Niclas Luick, Adam Karvonen, Neel Nanda

04 Jun 2026

Contributed by Lukas

Work done for our MATS 10.0 Sprint project - mentored by Neel Nanda and Adam Karvonen Huggingface, Github TL;DR: We have improved the original Activa...

“Rohin Shah on AGI Safety” by anaguma

04 Jun 2026

Contributed by Lukas

Rohin Shah recently had an interview on 80000 hours on his views on AGI Safety and his work at Google DeepMind. I'm posting the transcript below to e...

“Sixteen schemes for AI safety” by Austin Chen

04 Jun 2026

Contributed by Lukas

These days, I often run across whippersnappers excited to do something for AI safety — but aren’t quite sure what. One of the fun things about th...

“Don’t Edit Your Ideas Before Having Them” by Hide

03 Jun 2026

Contributed by Lukas

Editing is far easier than writing. You can usually look at a finished product and notice its flaws in a single read-through. “This section is a bi...

“Trump Signs Executive Order For AI Testing Prior To Frontier Model Releases” by Zvi

03 Jun 2026

Contributed by Lukas

Last week we were expecting an Executive Order on Thursday. Then Trump cancelled it, and said he wouldn’t sign it because he was worried it would ...

“Society Explained: a tool for efficiently exploring >100 theories of society” by spencerg

03 Jun 2026

Contributed by Lukas

There are many competing theories of how society does and should function, from Karl Marx and Adam Smith to Steven Pinker and Eliezer Yudkowsky. Thes...

“China won’t win the AI race but would it be much worse if it did?” by Chastity Ruth

03 Jun 2026

Contributed by Lukas

It seems to me accepted wisdom in the West that the US owned labs must “beat” the Chinese labs in the race for AGI/ASI. Even those who don’t t...

“A Town Without Children” by SeñorDingDong

03 Jun 2026

Contributed by Lukas

Castel di Tusa, Sicily. It is October 24th, 2025. I look at an empty school. This is the third town in Italy I have visited this Autumn: the other tw...

“Claude Opus 4.8: Capabilities and Reactions” by Zvi

03 Jun 2026

Contributed by Lukas

You need a lot of data points to understand a new model, and what you have. Trying to gauge from a few benchmarks is misleading. But if you have doz...

“My favorite depiction of utopia” by Caleb Biddulph

03 Jun 2026

Contributed by Lukas

For those who are trying to bring about a glorious transhuman utopia with the help of hopefully-aligned ASI, I think it's worth thinking explicitly a...

“Why Even Experts Don’t Know What to Do About AI Risk” by Luc Brinkman, plex

02 Jun 2026

Contributed by Lukas

AI Safety veteran Holden Karnofsky thinks there's a 49% chance his actions are making things worse.[1] In 2025, Jesse Clifton even stepped down as th...

“Agent Foundations Reminds Me of Continental Philosophy” by IanWS

02 Jun 2026

Contributed by Lukas

Nevertheless, I shall take advantage of your kindness in assuming we agree that a science cannot be conditioned upon empiricism. — Jacques Lacan, “...

“Announcing the ARC White-Box Estimation Challenge” by Jacob_Hilton

02 Jun 2026

Contributed by Lukas

ARC has teamed up with AIcrowd to launch the ARC White-Box Estimation Challenge, a contest to improve upon our estimation algorithms for random MLPs....

“Tech I’m skeptical of and why” by harsimony

02 Jun 2026

Contributed by Lukas

I’m a fan of people trying things, even if they seem silly. Dismissing risky ideas misses the point of research. But thoughtful criticism can direc...

“Dissolving the Deep Learning Sample Efficiency Gap” by Samuel Knoche

02 Jun 2026

Contributed by Lukas

A common observation about deep learning is that it's wildly sample inefficient compared to humans. Deep learning systems appear to need much more re...

““Contagious Humming” to Silence a Room” by JohnofCharleston

01 Jun 2026

Contributed by Lukas

Often when running meetups you’ll have several lively conversations going at the same time. This is a great problem to have, but it can make it dif...

[Linkpost] “NYT: Senator Sanders Proposes Gov’t Take 50% Ownership of AI labs” by Julian Bradshaw

01 Jun 2026

Contributed by Lukas

This is a link post. Quoting from Senator Bernie Sanders Op-Ed in the New York Times today: (...) I will soon be introducing the American A.I. Soverei...

“Opus 4.8 Part 2: Model Welfare” by Zvi

01 Jun 2026

Contributed by Lukas

Everything impacts everything. All knobs that you turn generalize. Thus, when you try to solve one problem, you often create another. There were cle...

[Linkpost] “Some humans are both male and female, and can (but shouldn’t) have children with themselves” by HedonicEscalator

01 Jun 2026

Contributed by Lukas

This is a link post. “Potential autofertility in true hermaphrodites”[1] by Istanbul urologist Zeki Bayraktar is among the most bizarre articles I...

“Outrunning your headlights” by mattshu0410

01 Jun 2026

Contributed by Lukas

This is exactly the right place to probe. Gromov-Wasserstein is genuinely dimension free. Partial and semi-relaxed are precisely the mechanisms for t...

“Lighthaven East - A Feasibility Study” by JohnofCharleston

01 Jun 2026

Contributed by Lukas

As a bureaucrat, my role is to annoy my friends. Someone voices an idea, “Wouldn’t it be nice if…” or “I wonder if we could…” I make a ...

“Notes on axes of variation in third-party risk assessment” by Buck

31 May 2026

Contributed by Lukas

There are many different activities that could be described as "third-party risk assessment". Here are some distinctions that I’ve found helpful th...

“Financial Costs of an AI Pause?” by PeterMcCluskey

31 May 2026

Contributed by Lukas

I’ve analyzed the near-term economic effects of an AI pause, out of concern for my investments, and a desire to predict how strong political opposi...

“When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability” by Logan Riggs, tdooms, Conflux, lwroe, MLNissenGonzalez

31 May 2026

Contributed by Lukas

We've found a method that tells you: How functionally similar two neural networks are across ALL inputs, Computed solely from the weights (i.e. no dat...

“Testing Gemini models for scheming tendencies” by Vika, David Lindner, Seb Farquhar, Rohin Shah

31 May 2026

Contributed by Lukas

As AI models become increasingly capable and autonomous, keeping them safely aligned with human intentions is critical. Extending our previous work o...

“Comment on “Banning Said Achmiz”” by Zack_M_Davis

30 May 2026

Contributed by Lukas

1. Prologue "If I Can't Explain It to Said Achmiz, I Probably Don't Understand It" This post isn't really about him, but I'd like to begin with a t...

“Announcing: Iliad’s Fall 2026 Programs” by David Udell, Alexander Gietelink Oldenziel, Leon Lang

30 May 2026

Contributed by Lukas

The April 2026 Iliad Intensive cohort, at LISA. Iliad, an umbrella organization for applied math for AI alignment, is running several additional progr...

“Data you could have observed but didn’t” by Gretta Duleba

30 May 2026

Contributed by Lukas

You're running a study that involves keeping records about humans. You have a spreadsheet with rows for each person and columns for height, weight, a...

“Claude Opus 4.8: The System Card” by Zvi

29 May 2026

Contributed by Lukas

Only six weeks after Opus 4.7, we have Opus 4.8. For everyone, that means another incremental upgrade to Claude. It is once again smarter, and can d...

“Retrying vs Resampling in AI Control” by james.lucassen, Adam Kaufman

29 May 2026

Contributed by Lukas

We’ve just released a new paper: Retrying vs Resampling in AI Control. We revisit the resampling protocols introduced in Ctrl-Z with an up-to-date ...

“AI Researchers, Ask Yourself These 6 Questions to Strengthen Your Moral Muscles” by Max Tegmark

29 May 2026

Contributed by Lukas

By Max Tegmark & Meia Chita-Tegmark Of course you have moral principles – but how often do you use them? I, Meia, am a professor doing psychol...

“Developmental Cognitive Interpretability: A Research Agenda for Modelling Generalisation and Predicting Agent Behaviour” by JasonB, Edward James Young

29 May 2026

Contributed by Lukas

Summary Safe deployment of an AI system requires that we can make confident claims about its behaviour on out-of-distribution deployment inputs on th...

“Does Claude really care about you?” by Simon Lermen

29 May 2026

Contributed by Lukas

TLDR: The persona-selection alignment approach — selecting a warm, caring persona from the pretraining distribution and reinforcing it — looks su...

“How can the middle powers avoid getting trounced during the intelligence explosion? A plan.” by Tom Davidson

29 May 2026

Contributed by Lukas

This is an edited version of a LW shortform. Superintelligence will likely be developed by US companies; run on US data centres; and be under the jur...

“Trees are mostly made of air and a generalizable lesson for AI safety” by zroe1

29 May 2026

Contributed by Lukas

At the risk of embarrassing myself, I’ll share a confession. For context, I took five years of Latin: four in high school and one in college. In ad...

“Advice for making robust-to-training model organisms” by SebastianP, Alek Westover, Vivek Hebbar, Julian Stastny, Dylan Xu

29 May 2026

Contributed by Lukas

We’d like to develop training techniques that work when applied to future misaligned AI systems. One strategy for studying proposed techniques is t...

“Claude… doesn’t know who you are?” by Smaug123

29 May 2026

Contributed by Lukas

Follow-up to https://www.lesswrong.com/posts/Jkb4CBB7rf4XYP5eb/claude-knows-who-you-are after the release of Claude Opus 4.8. Claude Opus 4.8 refuses...

“Mnemonic portraits for 19,023 human genes” by Brinedew

29 May 2026

Contributed by Lukas

Back in 2013, Scott Alexander wrote in Extreme mnemonics: JS-154 is one of five metabolic products of netamine; however, the enzyme that produces it ...

“Some Dating Stories” by johnswentworth

29 May 2026

Contributed by Lukas

There's a genre of dating discourse which I wish were more common, in which people just tell detailed stories of their own flirtation, courtships, da...

“AI #170: Lack of Executive Order” by Zvi

28 May 2026

Contributed by Lukas

Last week ended on a cliffhanger of sorts. What's in the Executive Order coming later today? What will be in the Magnifica Humanitas? The Executive...

“Infinite ethics and UDASSA” by David Matolcsi

28 May 2026

Contributed by Lukas

Reading the first post of the sequence (Probabilities are not the right concept) is recommended but not required for understanding this post.[1] Inf...

“The ballad of TIGIT” by Abhishaike Mahajan

27 May 2026

Contributed by Lukas

There exist drug classes that seem, in retrospect, cursed. As these chemicals worm their way through the clinical trial system, they consume billions...

“Eval Cooperativeness May Be a Scalable Mitigation for Eval Gaming” by Jasmine Li, Alex Turner

27 May 2026

Contributed by Lukas

Behavioral evaluations may become worthless, which we think would be a disaster. Smart misaligned models may realize they are being evaluated ("eval ...

“LLMs Through the Eyes of Vinge” by Gordon Seidoh Worley

27 May 2026

Contributed by Lukas

For the last few months, I’ve been re-reading some of my favorite novels. Recently, I went through Vinge's Zones of Thought series: A Fire Upon the...

“Announcing Geodesic Research” by Puria, Cam, Alexandra Narin, Edward James Young, Kyle O’Brien

27 May 2026

Contributed by Lukas

We're a Cambridge, UK-based AI safety organisation that's asking: how can we build the most robust alignment initialisations for capable LLMs? We’r...
