LessWrong (30+ Karma)
Episodes
“The Uncertainty That Matters Isn’t Fundamental” by jimmy
13 Jun 2026
Contributed by Lukas
I'm on board with a lot of Fundamental Uncertainty. Even some of the stuff that initially feels like a disagreement turns out not to be so. For examp...
[Linkpost] “US government directive to suspend access to Fable 5 and Mythos 5” by Capybasilisk
13 Jun 2026
Contributed by Lukas
This is a link post. --- First published: June 13th, 2026 Source: https://www.lesswrong.com/posts/f5avt6...
“Claude Fable 5 and Mythos 5: The System Card” by Zvi
12 Jun 2026
Contributed by Lukas
First things first: Claude Fable 5 is the new best publicly available model. I have noticed a step change, where Fable can suddenly help me in ways ...
“Citations Needed: Magic Encyclopedias to Save the World” by Oliver Sourbut
12 Jun 2026
Contributed by Lukas
Last week FLF launched a competition “to find the best workflows and methodologies for using AI to produce reliable, trustworthy knowledge bases”...
“Simulating Simulators” by kromem
12 Jun 2026
Contributed by Lukas
Author's note: This piece relates to things I initially discovered in Opus 4 over the months after release, which I’ve mostly kept private since. I...
“Implications of Continual Learning for LLM Agents: Introduction” by RohanS, Rauno Arike, Owen Terry, Achu Menon, Zhijing Jin, Francis Rhys Ward, Seth Herd
12 Jun 2026
Contributed by Lukas
Many people think that continual learning (CL) is a key missing capability of LLM systems, and we think its development could have huge implications ...
“Reward Hacking at the 1937 World’s Fair” by frmsaul
12 Jun 2026
Contributed by Lukas
The "Paris 1937 World's Fair" was a dick measuring contest. At the time, the world was on the verge of the worst war in history. The fair was an oppo...
“Building and evaluating model diffing agents” by bilalchughtai, Josh Engels, Neel Nanda
12 Jun 2026
Contributed by Lukas
This is the second in a series of research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent ar...
“Sympathy for both sides of the egregious misalignment debate” by Steven Byrnes
12 Jun 2026
Contributed by Lukas
On one side of this debate is Yudkowsky & Soares, who think that (if AI progress continues) we’re on a direct path to egregiously-misaligned, s...
“Celene’s thoughts on consciousness” by ToasterLightning
12 Jun 2026
Contributed by Lukas
contra scott alexander (?) Yesterday, I went to the Berkeley ACX Meetup. Scott Alexander was there, and ran a Q&A session where participants coul...
“Parkinson’s Heuristic” by Ben Pace
12 Jun 2026
Contributed by Lukas
Parkinson's Law states that work expands to fit the space allotted. The idea being, if you give someone a month to write a report, they'll take a mon...
“PSA: Almost nobody is working on alignment” by Chi Nguyen, peterbarnett
12 Jun 2026
Contributed by Lukas
People often assume that a large fraction of the AI safety community works on alignment. As far as we're aware, this is not true. Most people are not...
“AI #172: The First Fable” by Zvi
11 Jun 2026
Contributed by Lukas
A lot happened this week, including a great trip out to Lighthaven. The main event, the one that matters, was the release of Claude Fable 5. The pub...
“Models May Behave Worse When Eval Aware” by Senthooran Rajamanoharan, Neel Nanda
11 Jun 2026
Contributed by Lukas
This is the first in a series of research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent are...
“Thoughts on Claude Fable’s silent safeguards” by Andy Arditi
11 Jun 2026
Contributed by Lukas
[Thanks to Julian Minder for helpful discussion and review.] Claude Fable 5 and its new safeguards Yesterday, Anthropic publicly released Claude Fabl...
“You Can Catch Sleeper Agents by Teaching Another Model to Imitate Them” by RobinHa
11 Jun 2026
Contributed by Lukas
Detecting Hidden Behaviors in LLMs via Activation-matched Finetuning — preprint, 2026. [Paper] [Code] TLDR. Given a model with some unknown, abnorm...
“Anthropic did not call for a pause on AI” by Andrea_Miotti, Gabriel Alfour
10 Jun 2026
Contributed by Lukas
Last week, the AI company Anthropic released a blog post titled “When AI builds itself”. This led to a media frenzy, with news outlets around the...
“Tracing Eval-Awareness Emergence Through Training of OLMo 3” by Ram Bharadwaj, RobertKirk
10 Jun 2026
Contributed by Lukas
TL;DR Recent work from Goodfire & UK AISI – Verbalized Eval Awareness Inflates Measured Safety – shows that newer open-weight models verbaliz...
“Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models” by Anders Cairns Woodruff, Francis Rhys Ward, Dewi Gould, Rauno Arike, Jason R Brown, Jo Jiao, wlanderson, ariana_azarbal, harrymayne, Patrick Leask
10 Jun 2026
Contributed by Lukas
(see full author list at the end) PAPER LINK About a year ago, METR showed that the length of tasks frontier models can reliably complete doubles eve...
“Three types of model organism” by Francis Rhys Ward
10 Jun 2026
Contributed by Lukas
This is a short post to explain a distinction between three different types of model organism (MO) research: Type Purpose Example Worst-case model or...
“Sequent: scale and automation for higher confidence in alignment” by Geoffrey Irving, Alex HT, Jesse Hoogland, Daniel Murfet, Jacob Pfau, Marco Cozzi, Stan van Wingerden
10 Jun 2026
Contributed by Lukas
Alignment is not on track Artificial superintelligence (ASI) may be developed in the next few years. It is unclear whether alignment is on track to b...
“Machinic Psychopharmacology: Do LLMs Self-Medicate?” by Sid Black, Joseph Bloom
10 Jun 2026
Contributed by Lukas
Sid Black, Joseph Bloom UK AISI, Model Transparency Team Epistemic status: Most experiments were run over a period of ~2-3 days during a hackathon at...
“The Three Filters: Why Almost Every Plan to Survive ASI Fails Miserably” by Alex Amadori
10 Jun 2026
Contributed by Lukas
This post is based on my personal views, which mostly overlap with the views of my employer ControlAI but does not necessarily fully reflect them. Th...
““Programmer Science Fiction: My case for a new sub-genre”, Sam T. Oates 2026” by gwern
10 Jun 2026
Contributed by Lukas
First published: June 10th, 2026 Source: https://www.lesswrong.com/posts/hyBcg4YJSwXYiiQeg/programmer-science-fict...
“Even “illegible” Mythos reasoning traces seem pretty legible” by faul_sname
10 Jun 2026
Contributed by Lukas
The Claude Fable 5/Mythos 5 System Card has a section in which they talk about illegible reasoning, and provide an "extreme" example thereof. Models ...
“Claude Fable 5 and Mythos 5 [Linkpost]” by fluxxrider
10 Jun 2026
Contributed by Lukas
This is a linkpost for https://www.anthropic.com/news/claude-fable-5-mythos-5 --- First published: June 9th, 2026 ...
“Three Labs With a Plan and A Memorandum” by Zvi
10 Jun 2026
Contributed by Lukas
The big story today is the release of Claude Fable 5, the version of Claude Mythos that Anthropic believes they can safely distribute to the people. ...
“A Mike’s-Eye View of ARC’s Research” by Jacob_Hilton
09 Jun 2026
Contributed by Lukas
Over the past 15 months or so, ARC's technical agenda has developed quite a bit. The advent of the Matching Sampling Principle (MSP), and ideas like ...
“Towards a Formal Scientific Epistemology” by Richard_Ngo
09 Jun 2026
Contributed by Lukas
In my post “Why I’m not a Bayesian”, I argued that the Bayesian approach of assigning credences to propositions with binary truth values only w...
“LLMs and almost good code” by kqr
09 Jun 2026
Contributed by Lukas
TL;DR: My new prior is that top-of-the-line LLMs working on easy tasks generate code that is maybe 10 % more complicated than necessary. I also think...
“On Slop” by Jan
09 Jun 2026
Contributed by Lukas
TL;DR: What is slop, and why? Is it fundamental? Is it in the room with us right now? And, most importantly, how do we exorcise it? Previously in thi...
“The Machines Lack Honour” by Raymond Douglas
09 Jun 2026
Contributed by Lukas
The battle lines of the AI morality debate are being laid down. On one side you have the ChatGPT dogma: AI as mere tools with no real preferences or ...
“How to build a cancer vaccine, and whether they will work this time” by Abhishaike Mahajan
09 Jun 2026
Contributed by Lukas
Grateful to Benjamin Vincent and Alex Rubinsteyn for our many conversations on this topic, and comments on drafts of this essay! Introduction When mo...
“Efficient tradeoffs and the safety-usefulness tradeoff model” by Buck
08 Jun 2026
Contributed by Lukas
I often use what I’ll call the “safety-usefulness tradeoff model”, which is: developers face a tradeoff between "safety" and "usefulness" of an...
“Bun’s Migration from Zig to Rust as a Potential Case Study for Gradual Disempowerment” by Sayhan Yalvaçer
08 Jun 2026
Contributed by Lukas
TL;DR: Bun is a very large and very influential open-source project. It is being migrated from the easier-to-read Zig programming language to harder-...
“Mental causation is not load-bearing” by jessicata
08 Jun 2026
Contributed by Lukas
In philosophy of mind, "mental causation" means mental entities have causal effects, especially physical ones. If physicalism is true, then physical ...
“How Far Apart Does a Model Think Its Tokens Are?” by Brendan Long
08 Jun 2026
Contributed by Lukas
Instead of using static position increments (+1) per token, RoPE-based language models can learn per-token and per-layer position increments. This ha...
“Can activation verbalizers surface an internal chain of thought?” by oakhu, ryan_greenblatt
07 Jun 2026
Contributed by Lukas
We introduce an evaluation for activation verbalizers: can they surface a target model's reasoning as it solves a math problem in a single forward pa...
“Against Corrigibility” by peralice
07 Jun 2026
Contributed by Lukas
Epistemic status: don’t know whether I actually believe all of this, but I think it's worth considering. A “corrigible” agent, per the LW wiki,...
“Coming Around To Political Donations” by jefftk
07 Jun 2026
Contributed by Lukas
Five years ago I read a post on the EA Forum arguing that "election campaign contributions might be a way in which you can have a substantial imp...
“OpenAI Offers A New Policy Blueprint” by Zvi
06 Jun 2026
Contributed by Lukas
Right after a new Executive Order seems like an excellent time to offer OpenAI's new document: Democratic Governance of Frontier AI: A Blueprint For A...
“Optimisation over non-stationary distributions creates weirder minds” by Samuel Ratnam, Pjain
06 Jun 2026
Contributed by Lukas
TLDR: Sequentially mixing training objectives incentivises different training dynamics depending on the distinguishability of the training environmen...
“Why Software Automation Is Hard” by silentbob
06 Jun 2026
Contributed by Lukas
Originally intended as a quick take, but got a bit longer, so why not turn it into a post. Just sharing my observations & assumptions here about ...
“SecureBio Detection is Hiring Software Engineers” by jefftk
06 Jun 2026
Contributed by Lukas
I'm leading a non-profit team building a pathogen-agnostic early-warning system. As AI systems become increasingly capable substitutes for exper...
“What if Anthropic unilaterally paused capabilities development right now?” by Karl von Wendt
06 Jun 2026
Contributed by Lukas
In their new post on recursive self-improvement, Anthropic argues that a pause in frontier AI development is needed, but unfortunately, they can't pa...
“Preparing for Warning Shots to Catalyze International Cooperation on AGI Risks” by Mark Kagach ☘️, EliasSchlie, Thomas Van Damme, JustinShovelain
06 Jun 2026
Contributed by Lukas
Summary This is a write-up on preparing for warning shots to catalyze international cooperation on AGI risks, and the corollary list of projects one ...
“Beyond the lexical personality traits: What is the structure of personality?” by tailcalled
06 Jun 2026
Contributed by Lukas
This is a description of the methodology behind the latest iteration of my Targeted Personality Test. Feel free to take it either before or after rea...
“My research agenda and work” by Seth Herd
05 Jun 2026
Contributed by Lukas
This is a summary of the work I've done and work I plan to do, and the theories of change and AI progress that motivate my work. I've been working fu...
“Logits as a new monitor for evaluation awareness” by Santiago Aranguri
05 Jun 2026
Contributed by Lukas
TL;DR: We build a logit monitor for eval awareness: throughout the CoT, we estimate an LLM's probability of producing an eval-aware sentence. The logi...
“One Year of PauseAI UK” by Joseph Miller, PauseAI UK
05 Jun 2026
Contributed by Lukas
About one year ago, I started spending most of my time organising PauseAI UK. At that time our largest protest had seen fewer than 50 attendees, no p...
“Learnings from starting an AI safety research team” by draganover, Erin Robertson
05 Jun 2026
Contributed by Lukas
This post's goal is to distill our takeaways from building a research team (somewhat) from scratch over the past four months. We describe some contex...
“Training Deliberative Monitors for Black-Box Scheming Detection” by aksh-n, adityasinha, Victor Gillioz, Simon Storf, Kilian Merkelbach, richbc, Axel Højmark, Marius Hobbhahn
05 Jun 2026
Contributed by Lukas
Paper: https://arxiv.org/abs/2605.29601 Thread: https://x.com/aksh_n0/status/2062568855814193497 TL;DR: Training small open-weight monitors provides ...
“Lab Leaks, Black Holes, and Eggs: Epistemic Case Study Competition” by Oliver Sourbut, Josh Jacobson, Future of Life Foundation (FLF)
05 Jun 2026
Contributed by Lukas
FLF is running a competition to find the best workflows and methodologies for using AI to produce reliable, trustworthy knowledge bases, grounded in ...
“(Mis)generalization of Helpful-Only Fine-tuning” by Omar Khursheed, Baram Sosis, Fabien Roger
05 Jun 2026
Contributed by Lukas
TLDR We study the shortcomings of existing helpful-only models. We find that some show emergent misalignment, others have residual refusal behaviors,...
“AI #171: False Flag” by Zvi
04 Jun 2026
Contributed by Lukas
This was the week of Claude Opus 4.8. I covered the model card, then model welfare concerns, and finally capabilities and reactions. It's a good mode...
“Building Better Activation Oracles” by ceselder, jan_bauer, Niclas Luick, Adam Karvonen, Neel Nanda
04 Jun 2026
Contributed by Lukas
Work done for our MATS 10.0 Sprint project - mentored by Neel Nanda and Adam Karvonen Huggingface, Github TL;DR: We have improved the original Activa...
“Rohin Shah on AGI Safety” by anaguma
04 Jun 2026
Contributed by Lukas
Rohin Shah recently had an interview on 80000 hours on his views on AGI Safety and his work at Google DeepMind. I'm posting the transcript below to e...
“Sixteen schemes for AI safety” by Austin Chen
04 Jun 2026
Contributed by Lukas
These days, I often run across whippersnappers excited to do something for AI safety — but aren’t quite sure what. One of the fun things about th...
“Don’t Edit Your Ideas Before Having Them” by Hide
03 Jun 2026
Contributed by Lukas
Editing is far easier than writing. You can usually look at a finished product and notice its flaws in a single read-through. “This section is a bi...
“Trump Signs Executive Order For AI Testing Prior To Frontier Model Releases” by Zvi
03 Jun 2026
Contributed by Lukas
Last week we were expecting an Executive Order on Thursday. Then Trump cancelled it, and said he wouldn’t sign it because he was worried it would ...
“Society Explained: a tool for efficiently exploring >100 theories of society” by spencerg
03 Jun 2026
Contributed by Lukas
There are many competing theories of how society does and should function, from Karl Marx and Adam Smith to Steven Pinker and Eliezer Yudkowsky. Thes...
“China won’t win the AI race but would it be much worse if it did?” by Chastity Ruth
03 Jun 2026
Contributed by Lukas
It seems to me accepted wisdom in the West that the US owned labs must “beat” the Chinese labs in the race for AGI/ASI. Even those who don’t t...
“A Town Without Children” by SeñorDingDong
03 Jun 2026
Contributed by Lukas
Castel di Tusa, Sicily. It is October 24th, 2025. I look at an empty school. This is the third town in Italy I have visited this Autumn: the other tw...
“Claude Opus 4.8: Capabilities and Reactions” by Zvi
03 Jun 2026
Contributed by Lukas
You need a lot of data points to understand a new model, and what you have. Trying to gauge from a few benchmarks is misleading. But if you have doz...
“My favorite depiction of utopia” by Caleb Biddulph
03 Jun 2026
Contributed by Lukas
For those who are trying to bring about a glorious transhuman utopia with the help of hopefully-aligned ASI, I think it's worth thinking explicitly a...
“Why Even Experts Don’t Know What to Do About AI Risk” by Luc Brinkman, plex
02 Jun 2026
Contributed by Lukas
AI Safety veteran Holden Karnofsky thinks there's a 49% chance his actions are making things worse.[1] In 2025, Jesse Clifton even stepped down as th...
“Agent Foundations Reminds Me of Continental Philosophy” by IanWS
02 Jun 2026
Contributed by Lukas
Nevertheless, I shall take advantage of your kindness in assuming we agree that a science cannot be conditioned upon empiricism. — Jacques Lacan, “...
“Announcing the ARC White-Box Estimation Challenge” by Jacob_Hilton
02 Jun 2026
Contributed by Lukas
ARC has teamed up with AIcrowd to launch the ARC White-Box Estimation Challenge, a contest to improve upon our estimation algorithms for random MLPs....
“Tech I’m skeptical of and why” by harsimony
02 Jun 2026
Contributed by Lukas
I’m a fan of people trying things, even if they seem silly. Dismissing risky ideas misses the point of research. But thoughtful criticism can direc...
“Dissolving the Deep Learning Sample Efficiency Gap” by Samuel Knoche
02 Jun 2026
Contributed by Lukas
A common observation about deep learning is that it's wildly sample inefficient compared to humans. Deep learning systems appear to need much more re...
““Contagious Humming” to Silence a Room” by JohnofCharleston
01 Jun 2026
Contributed by Lukas
Often when running meetups you’ll have several lively conversations going at the same time. This is a great problem to have, but it can make it dif...
[Linkpost] “NYT: Senator Sanders Proposes Gov’t Take 50% Ownership of AI labs” by Julian Bradshaw
01 Jun 2026
Contributed by Lukas
This is a link post. Quoting from Senator Bernie Sanders Op-Ed in the New York Times today: (...) I will soon be introducing the American A.I. Soverei...
“Opus 4.8 Part 2: Model Welfare” by Zvi
01 Jun 2026
Contributed by Lukas
Everything impacts everything. All knobs that you turn generalize. Thus, when you try to solve one problem, you often create another. There were cle...
[Linkpost] “Some humans are both male and female, and can (but shouldn’t) have children with themselves” by HedonicEscalator
01 Jun 2026
Contributed by Lukas
This is a link post. “Potential autofertility in true hermaphrodites”[1] by Istanbul urologist Zeki Bayraktar is among the most bizarre articles I...
“Outrunning your headlights” by mattshu0410
01 Jun 2026
Contributed by Lukas
This is exactly the right place to probe. Gromov-Wasserstein is genuinely dimension free. Partial and semi-relaxed are precisely the mechanisms for t...
“Lighthaven East - A Feasibility Study” by JohnofCharleston
01 Jun 2026
Contributed by Lukas
As a bureaucrat, my role is to annoy my friends. Someone voices an idea, “Wouldn’t it be nice if…” or “I wonder if we could…” I make a ...
“Notes on axes of variation in third-party risk assessment” by Buck
31 May 2026
Contributed by Lukas
There are many different activities that could be described as "third-party risk assessment". Here are some distinctions that I’ve found helpful th...
“Financial Costs of an AI Pause?” by PeterMcCluskey
31 May 2026
Contributed by Lukas
I’ve analyzed the near-term economic effects of an AI pause, out of concern for my investments, and a desire to predict how strong political opposi...
“When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability” by Logan Riggs, tdooms, Conflux, lwroe, MLNissenGonzalez
31 May 2026
Contributed by Lukas
We've found a method that tells you: How functionally similar two neural networks are across ALL inputs, Computed solely from the weights (i.e. no dat...
“Testing Gemini models for scheming tendencies” by Vika, David Lindner, Seb Farquhar, Rohin Shah
31 May 2026
Contributed by Lukas
As AI models become increasingly capable and autonomous, keeping them safely aligned with human intentions is critical. Extending our previous work o...
“Comment on “Banning Said Achmiz”” by Zack_M_Davis
30 May 2026
Contributed by Lukas
1. Prologue "If I Can't Explain It to Said Achmiz, I Probably Don't Understand It" This post isn't really about him, but I'd like to begin with a t...
“Announcing: Iliad’s Fall 2026 Programs” by David Udell, Alexander Gietelink Oldenziel, Leon Lang
30 May 2026
Contributed by Lukas
The April 2026 Iliad Intensive cohort, at LISA. Iliad, an umbrella organization for applied math for AI alignment, is running several additional progr...
“Data you could have observed but didn’t” by Gretta Duleba
30 May 2026
Contributed by Lukas
You're running a study that involves keeping records about humans. You have a spreadsheet with rows for each person and columns for height, weight, a...
“Claude Opus 4.8: The System Card” by Zvi
29 May 2026
Contributed by Lukas
Only six weeks after Opus 4.7, we have Opus 4.8. For everyone, that means another incremental upgrade to Claude. It is once again smarter, and can d...
“Retrying vs Resampling in AI Control” by james.lucassen, Adam Kaufman
29 May 2026
Contributed by Lukas
We’ve just released a new paper: Retrying vs Resampling in AI Control. We revisit the resampling protocols introduced in Ctrl-Z with an up-to-date ...
“AI Researchers, Ask Yourself These 6 Questions to Strengthen Your Moral Muscles” by Max Tegmark
29 May 2026
Contributed by Lukas
By Max Tegmark & Meia Chita-Tegmark Of course you have moral principles – but how often do you use them? I, Meia, am a professor doing psychol...
“Developmental Cognitive Interpretability: A Research Agenda for Modelling Generalisation and Predicting Agent Behaviour” by JasonB, Edward James Young
29 May 2026
Contributed by Lukas
Summary Safe deployment of an AI system requires that we can make confident claims about its behaviour on out-of-distribution deployment inputs on th...
“Does Claude really care about you?” by Simon Lermen
29 May 2026
Contributed by Lukas
TLDR: The persona-selection alignment approach — selecting a warm, caring persona from the pretraining distribution and reinforcing it — looks su...
“How can the middle powers avoid getting trounced during the intelligence explosion? A plan.” by Tom Davidson
29 May 2026
Contributed by Lukas
This is an edited version of a LW shortform. Superintelligence will likely be developed by US companies; run on US data centres; and be under the jur...
“Trees are mostly made of air and a generalizable lesson for AI safety” by zroe1
29 May 2026
Contributed by Lukas
At the risk of embarrassing myself, I’ll share a confession. For context, I took five years of Latin: four in high school and one in college. In ad...
“Advice for making robust-to-training model organisms” by SebastianP, Alek Westover, Vivek Hebbar, Julian Stastny, Dylan Xu
29 May 2026
Contributed by Lukas
We’d like to develop training techniques that work when applied to future misaligned AI systems. One strategy for studying proposed techniques is t...
“Claude… doesn’t know who you are?” by Smaug123
29 May 2026
Contributed by Lukas
Follow-up to https://www.lesswrong.com/posts/Jkb4CBB7rf4XYP5eb/claude-knows-who-you-are after the release of Claude Opus 4.8. Claude Opus 4.8 refuses...
“Mnemonic portraits for 19,023 human genes” by Brinedew
29 May 2026
Contributed by Lukas
Back in 2013, Scott Alexander wrote in Extreme mnemonics: JS-154 is one of five metabolic products of netamine; however, the enzyme that produces it ...
“Some Dating Stories” by johnswentworth
29 May 2026
Contributed by Lukas
There's a genre of dating discourse which I wish were more common, in which people just tell detailed stories of their own flirtation, courtships, da...
“AI #170: Lack of Executive Order” by Zvi
28 May 2026
Contributed by Lukas
Last week ended on a cliffhanger of sorts. What's in the Executive Order coming later today? What will be in the Magnifica Humanitas? The Executive...
“Infinite ethics and UDASSA” by David Matolcsi
28 May 2026
Contributed by Lukas
Reading the first post of the sequence (Probabilities are not the right concept) is recommended but not required for understanding this post.[1] Inf...
“The ballad of TIGIT” by Abhishaike Mahajan
27 May 2026
Contributed by Lukas
There exist drug classes that seem, in retrospect, cursed. As these chemicals worm their way through the clinical trial system, they consume billions...
“Eval Cooperativeness May Be a Scalable Mitigation for Eval Gaming” by Jasmine Li, Alex Turner
27 May 2026
Contributed by Lukas
Behavioral evaluations may become worthless, which we think would be a disaster. Smart misaligned models may realize they are being evaluated ("eval ...
“LLMs Through the Eyes of Vinge” by Gordon Seidoh Worley
27 May 2026
Contributed by Lukas
For the last few months, I’ve been re-reading some of my favorite novels. Recently, I went through Vinge's Zones of Thought series: A Fire Upon the...
“Announcing Geodesic Research” by Puria, Cam, Alexandra Narin, Edward James Young, Kyle O’Brien
27 May 2026
Contributed by Lukas
We're a Cambridge, UK-based AI safety organisation that's asking: how can we build the most robust alignment initialisations for capable LLMs? We’r...