LessWrong (Curated & Popular)
Episodes
“Shallow Water is Dangerous Too” by jefftk
21 Jul 2025
Contributed by Lukas
Content warning: risk to children. Julia and I know drowning is the biggest risk to US children under 5, and we try to take this seriously. But yesterday...
“Narrow Misalignment is Hard, Emergent Misalignment is Easy” by Edward Turner, Anna Soligo, Senthooran Rajamanoharan, Neel Nanda
18 Jul 2025
Contributed by Lukas
Anna and Ed are co-first authors for this work. We’re presenting these results as a research update for a continuing body of work, which we hope wi...
“Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety” by Tomek Korbak, Mikita Balesni, Vlad Mikulik, Rohin Shah
16 Jul 2025
Contributed by Lukas
Seven years ago, OpenAI Five had just been released, and many people in the AI safety community expected AIs to be opaque RL agen...
“the jackpot age” by thiccythot
14 Jul 2025
Contributed by Lukas
This essay is about shifts in risk taking towards the worship of jackpots and its broader societal implications. Imagine you are presented with this ...
“Surprises and learnings from almost two months of Leo Panickssery” by Nina Panickssery
14 Jul 2025
Contributed by Lukas
Leo was born at 5am on the 20th May, at home (this was an accident but the experience has made me extremely homebirth-pilled). Before that, I was on ...
“An Opinionated Guide to Using Anki Correctly” by Luise
13 Jul 2025
Contributed by Lukas
I can't count how many times I've heard variations on "I used Anki too for a while, but I got out of the habit." No one ever stic...
“Lessons from the Iraq War about AI policy” by Buck
12 Jul 2025
Contributed by Lukas
I think the 2003 invasion of Iraq has some interesting lessons for the future of AI policy. (Epistemic status: I’ve read a bit about this, talked t...
“So You Think You’ve Awoken ChatGPT” by JustisMills
11 Jul 2025
Contributed by Lukas
Written in an attempt to fulfill @Raemon's request. AI is fascinating stuff, and modern chatbots are nothing short of miraculous. If you've...
“Generalized Hangriness: A Standard Rationalist Stance Toward Emotions” by johnswentworth
11 Jul 2025
Contributed by Lukas
People have an annoying tendency to hear the word “rationalism” and think “Spock”, despite direct exhortation against that exact interpretati...
“Comparing risk from internally-deployed AI to insider and outsider threats from humans” by Buck
10 Jul 2025
Contributed by Lukas
I’ve been thinking a lot recently about the relationship between AI control and traditional computer security. Here's one point that I think i...
“Why Do Some Language Models Fake Alignment While Others Don’t?” by abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger
10 Jul 2025
Contributed by Lukas
Last year, Redwood and Anthropic found a setting where Claude 3 Opus and 3.5 Sonnet fake alignment to preserve their harmlessness values. We reprodu...
“A deep critique of AI 2027’s bad timeline models” by titotal
09 Jul 2025
Contributed by Lukas
Thank you to Arepo and Eli Lifland for looking over this article for errors. I am sorry that this article is so long. Every time I thought I was don...
“‘Buckle up bucko, this ain’t over till it’s over.’” by Raemon
09 Jul 2025
Contributed by Lukas
The second in a series of bite-sized rationality prompts[1]. Often, if I'm bouncing off a problem, one issue is that I intuitively expect the pr...
“Shutdown Resistance in Reasoning Models” by benwr, JeremySchlatter, Jeffrey Ladish
08 Jul 2025
Contributed by Lukas
We recently discovered some concerning behavior in OpenAI's reasoning models: When trying to complete a task, these models sometimes actively ci...
“Authors Have a Responsibility to Communicate Clearly” by TurnTrout
08 Jul 2025
Contributed by Lukas
When a claim is shown to be incorrect, defenders may say that the author was just being “sloppy” and actually meant something else entirely. I arg...
“The Industrial Explosion” by rosehadshar, Tom Davidson
07 Jul 2025
Contributed by Lukas
Summary To quickly transform the world, it's not enough for AI to become super smart (the "intelligence explosion"). AI will also hav...
“Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild” by Adam Karvonen, Sam Marks
03 Jul 2025
Contributed by Lukas
Summary: We found that LLMs exhibit significant race and gender bias in realistic hiring scenarios, but their chain-of-thought reasoning shows zero ev...
“The best simple argument for Pausing AI?” by Gary Marcus
03 Jul 2025
Contributed by Lukas
Not saying we should pause AI, but consider the following argument: Alignment without the capacity to follow rules is hopeless. You can’t possibly...
“Foom & Doom 2: Technical alignment is hard” by Steven Byrnes
01 Jul 2025
Contributed by Lukas
2.1 Summary & Table of contents This is the second of a two-post series on foom (previous post) and doom (this post). The last post talked about h...
“Proposal for making credible commitments to AIs.” by Cleo Nardo
30 Jun 2025
Contributed by Lukas
Acknowledgments: The core scheme here was suggested by Prof. Gabriel Weil. There has been growing interest in the deal-making agenda: humans make dea...
“X explains Z% of the variance in Y” by Leon Lang
28 Jun 2025
Contributed by Lukas
Audio note: this article contains 218 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text ...
“A case for courage, when speaking of AI danger” by So8res
27 Jun 2025
Contributed by Lukas
I think more people should say what they actually believe about AI dangers, loudly and often. Even if you work in AI policy. I’ve been beating this...
“My pitch for the AI Village” by Daniel Kokotajlo
25 Jun 2025
Contributed by Lukas
I think the AI Village should be funded much more than it currently is; I’d wildly guess that the AI safety ecosystem should be funding it to the tu...
“Foom & Doom 1: ‘Brain in a box in a basement’” by Steven Byrnes
24 Jun 2025
Contributed by Lukas
1.1 Series summary and Table of Contents This is a two-post series on AI “foom” (this post) and “doom” (next post). A decade or two ago, it wa...
“Futarchy’s fundamental flaw” by dynomight
21 Jun 2025
Contributed by Lukas
Say you’re Robyn Denholm, chair of Tesla's board. And say you’re thinking about firing Elon Musk. One way to make up your mind would be to h...
“Do Not Tile the Lightcone with Your Confused Ontology” by Jan_Kulveit
19 Jun 2025
Contributed by Lukas
Or: How anthropomorphic assumptions about AI identity might create confusion and suffering at scale If you are reading this and you are a human, you ...
“Endometriosis is an incredibly interesting disease” by Abhishaike Mahajan
19 Jun 2025
Contributed by Lukas
Introduction There are several diseases that are canonically recognized as ‘interesting’, even by laymen. Whether that is in their mechanism of ...
“Estrogen: A trip report” by cube_flipper
19 Jun 2025
Contributed by Lukas
I'd like to say thanks to Anna Magpie – who offers literature review as a service – for her help reviewing the section on neuroendocrinology...
“New Endorsements for ‘If Anyone Builds It, Everyone Dies’” by Malo
18 Jun 2025
Contributed by Lukas
Nate and Eliezer's forthcoming book has been getting a remarkably strong reception. I was under the impression that there are many people who fi...
[Linkpost] “the void” by nostalgebraist
17 Jun 2025
Contributed by Lukas
This is a link post. A very long essay about LLMs, the nature and history of the HHH assistant persona, and the implications for alignment. Multi...
“Mech interp is not pre-paradigmatic” by Lee Sharkey
17 Jun 2025
Contributed by Lukas
This is a blogpost version of a talk I gave earlier this year at GDM. Epistemic status: Vague and handwavy. Nuance is often missing. Some of the cl...
“Distillation Robustifies Unlearning” by Bruce W. Lee, Addie Foote, alexinf, leni, Jacob G-W, Harish Kamath, Bryce Woodworth, cloud, TurnTrout
17 Jun 2025
Contributed by Lukas
Current “unlearning” methods only suppress capabilities instead of truly unlearning the capabilities. But if you distill an unlearned model into ...
“Intelligence Is Not Magic, But Your Threshold For ‘Magic’ Is Pretty Low” by Expertium
17 Jun 2025
Contributed by Lukas
A while ago I saw a person in the comments on Scott Alexander's blog arguing that a superintelligent AI would not be able to do anyt...
“A Straightforward Explanation of the Good Regulator Theorem” by Alfred Harwood
17 Jun 2025
Contributed by Lukas
Audio note: this article contains 329 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text ...
“Beware General Claims about ‘Generalizable Reasoning Capabilities’ (of Modern AI Systems)” by LawrenceC
17 Jun 2025
Contributed by Lukas
1. Late last week, researchers at Apple released a paper provocatively titled “The Illusion of Thinking: Understanding the Strengths and Limitations...
“Season Recap of the Village: Agents raise $2,000” by Shoshannah Tekofsky
07 Jun 2025
Contributed by Lukas
Four agents woke up with four computers, a view of the world wide web, and a shared chat room full of humans. Like Claude plays Pokemon, you can watc...
“The Best Reference Works for Every Subject” by Parker Conley
06 Jun 2025
Contributed by Lukas
Introduction The Best Textbooks on Every Subject is the Schelling point for the best textbooks on every subject. My The Best Tacit Knowledge Videos o...
“‘Flaky breakthroughs’ pervade coaching — and no one tracks them” by Chipmonk
05 Jun 2025
Contributed by Lukas
Has someone you know ever had a “breakthrough” from coaching, meditation, or psychedelics — only to later have it fade? For example...
“The Value Proposition of Romantic Relationships” by johnswentworth
04 Jun 2025
Contributed by Lukas
What's the main value proposition of romantic relationships? Now, look, I know that when people drop that kind of question, they’re often abou...
“It’s hard to make scheming evals look realistic” by Igor Ivanov, dan_moken
02 Jun 2025
Contributed by Lukas
Abstract Claude 3.7 Sonnet easily detects when it's being evaluated for scheming. Surface‑level edits to evaluation scenarios, such as lengthe...
[Linkpost] “Social Anxiety Isn’t About Being Liked” by Chipmonk
01 Jun 2025
Contributed by Lukas
This is a link post. There's this popular idea that socially anxious folks are just dying to be liked. It seems logical, right? Why else would so...
“Truth or Dare” by Duncan Sabien (Inactive)
31 May 2025
Contributed by Lukas
Author's note: This is my apparently-annual "I'll put a post on LessWrong in honor of LessOnline" post. These days, my writing g...
“Meditations on Doge” by Martin Sustrik
30 May 2025
Contributed by Lukas
Lessons from shutting down institutions in Eastern Europe. This is a cross post from: https://250bpm.substack.com/p/meditations-on-doge Imagine l...
[Linkpost] “If you’re not sure how to sort a list or grid—seriate it!” by gwern
28 May 2025
Contributed by Lukas
This is a link post. "Getting Things in Order: An Introduction to the R Package seriation": Seriation [or "ordination"], i.e., fin...
“What We Learned from Briefing 70+ Lawmakers on the Threat from AI” by leticiagarcia
28 May 2025
Contributed by Lukas
Between late 2024 and mid-May 2025, I briefed over 70 cross-party UK parliamentarians. Just over one-third were MPs, a similar share were members of ...
“Winning the power to lose” by KatjaGrace
23 May 2025
Contributed by Lukas
Have the Accelerationists won? Last November Kevin Roose announced that those in favor of going fast on AI had now won against those favoring caution...
[Linkpost] “Gemini Diffusion: watch this space” by Yair Halberstadt
22 May 2025
Contributed by Lukas
This is a link post. Google DeepMind has announced Gemini Diffusion. Though buried under a host of other IO announcements, it's possible that this...
“AI Doomerism in 1879” by David Gross
21 May 2025
Contributed by Lukas
I’m reading George Eliot's Impressions of Theophrastus Such (1879)—so far a snoozer compared to her novels. But chapter 17 surprised me for ...
“Consider not donating under $100 to political candidates” by DanielFilan
16 May 2025
Contributed by Lukas
Epistemic status: thing people have told me that seems right. Also primarily relevant to US audiences. Also I am speaking in my personal capacity and...
“It’s Okay to Feel Bad for a Bit” by moridinamael
16 May 2025
Contributed by Lukas
"If you kiss your child, or your wife, say that you only kiss things which are human, and thus you will not be disturbed if either of them dies....
“Explaining British Naval Dominance During the Age of Sail” by Arjun Panickssery
15 May 2025
Contributed by Lukas
The other day I discussed how high monitoring costs can explain the emergence of “aristocratic” systems of governance: Aristocracy and Hostage Ca...
“Eliezer and I wrote a book: If Anyone Builds It, Everyone Dies” by So8res
14 May 2025
Contributed by Lukas
Eliezer and I wrote a book. It's titled If Anyone Builds It, Everyone Dies. Unlike a lot of other writing either of us have done, it's bein...
“Too Soon” by Gordon Seidoh Worley
14 May 2025
Contributed by Lukas
It was a cold and cloudy San Francisco Sunday. My wife and I were having lunch with friends at a Korean cafe. My phone buzzed with a text. It said my...
“PSA: The LessWrong Feedback Service” by JustisMills
13 May 2025
Contributed by Lukas
At the bottom of the LessWrong post editor, if you have at least 100 global karma, you may have noticed this button. Many people click the ...
“Orienting Toward Wizard Power” by johnswentworth
08 May 2025
Contributed by Lukas
For months, I had the feeling: something is wrong. Some core part of myself had gone missing. I had words and ideas cached, which pointed back to the...
“Interpretability Will Not Reliably Find Deceptive AI” by Neel Nanda
05 May 2025
Contributed by Lukas
(Disclaimer: Post written in a personal capacity. These are personal hot takes and do not in any way represent my employer's views.) TL;DR: I do...
“Slowdown After 2028: Compute, RLVR Uncertainty, MoE Data Wall” by Vladimir_Nesov
03 May 2025
Contributed by Lukas
It'll take until ~2050 to repeat the level of scaling that pretraining compute is experiencing this decade, as increasing funding can't sus...
“Early Chinese Language Media Coverage of the AI 2027 Report: A Qualitative Analysis” by jeanne_, eeeee
01 May 2025
Contributed by Lukas
In this blog post, we analyse how the recent AI 2027 forecast by Daniel Kokotajlo, Scott Alexander, Thomas Larsen, Eli Lifland, and Romeo Dean has be...
[Linkpost] “Jaan Tallinn’s 2024 Philanthropy Overview” by jaan
25 Apr 2025
Contributed by Lukas
This is a link post. to follow up my philanthropic pledge from 2020, i've updated my philanthropy page with the 2024 results. in 2024 my donations...
“Impact, agency, and taste” by benkuhn
24 Apr 2025
Contributed by Lukas
I’ve been thinking recently about what sets apart the people who’ve done the best work at Anthropic. You might think that the main thing that mak...
[Linkpost] “To Understand History, Keep Former Population Distributions In Mind” by Arjun Panickssery
24 Apr 2025
Contributed by Lukas
This is a link post. Guillaume Blanc has a piece in Works in Progress (I assume based on his paper) about how France's fertility declined earlier...
“AI-enabled coups: a small group could use AI to seize power” by Tom Davidson, Lukas Finnveden, rosehadshar
23 Apr 2025
Contributed by Lukas
We’ve written a new report on the threat of AI-enabled coups. I think this is a very serious risk – comparable in importance to AI takeover but ...
“Accountability Sinks” by Martin Sustrik
23 Apr 2025
Contributed by Lukas
Back in the 1990s, ground squirrels were briefly fashionable pets, but their popularity came to an abrupt end after an incident at Schiphol Airport o...
“Training AGI in Secret would be Unsafe and Unethical” by Daniel Kokotajlo
21 Apr 2025
Contributed by Lukas
Subtitle: Bad for loss of control risks, bad for concentration of power risks I’ve had this sitting in my drafts for the last year. I wish I’d be...
“Why Should I Assume CCP AGI is Worse Than USG AGI?” by Tomás B.
20 Apr 2025
Contributed by Lukas
Though, given my doomerism, I think the natsec framing of the AGI race is likely wrongheaded, let me accept the Dario/Leopold/Altman frame that AGI w...
“Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI” by Kaj_Sotala
17 Apr 2025
Contributed by Lukas
Introduction Writing this post puts me in a weird epistemic position. I simultaneously believe that: The reasoning failures that I'll discuss ar...
“Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study” by Adam Karvonen
16 Apr 2025
Contributed by Lukas
Dario Amodei, CEO of Anthropic, recently worried about a world where only 30% of jobs become automated, leading to class tensions between the automat...
“Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)” by Neel Nanda, lewis smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Tom Lieberum, János Kramár, Rohin Shah
12 Apr 2025
Contributed by Lukas
Audio note: this article contains 31 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text i...
[Linkpost] “Playing in the Creek” by Hastings
11 Apr 2025
Contributed by Lukas
This is a link post. When I was a really small kid, one of my favorite activities was to try and dam up the creek in my backyard. I would carefully mo...
“Thoughts on AI 2027” by Max Harms
10 Apr 2025
Contributed by Lukas
This is part of the MIRI Single Author Series. Pieces in this series represent the beliefs and opinions of their named authors, and do not claim to s...
“Short Timelines don’t Devalue Long Horizon Research” by Vladimir_Nesov
09 Apr 2025
Contributed by Lukas
Short AI takeoff timelines seem to leave no time for some lines of alignment research to become impactful. But any research rebalances the mix of cur...
“Alignment Faking Revisited: Improved Classifiers and Open Source Extensions” by John Hughes, abhayesian, Akbir Khan, Fabien Roger
09 Apr 2025
Contributed by Lukas
In this post, we present a replication and extension of an alignment faking model organism: Replication: We replicate the alignment faking (AF) pa...
“METR: Measuring AI Ability to Complete Long Tasks” by Zach Stein-Perlman
07 Apr 2025
Contributed by Lukas
Summary: We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been consistently e...
“Why Have Sentence Lengths Decreased?” by Arjun Panickssery
04 Apr 2025
Contributed by Lukas
“In the loveliest town of all, where the houses were white and high and the elm trees were green and higher than the houses, where the front yards...
“AI 2027: What Superintelligence Looks Like” by Daniel Kokotajlo, Thomas Larsen, elifland, Scott Alexander, Jonas V, romeo
03 Apr 2025
Contributed by Lukas
In 2021 I wrote what became my most popular blog post: What 2026 Looks Like. I intended to keep writing predictions all the way to AGI and beyond, bu...
“OpenAI #12: Battle of the Board Redux” by Zvi
03 Apr 2025
Contributed by Lukas
Back when the OpenAI board attempted and failed to fire Sam Altman, we faced a highly hostile information environment. The battle was fought largely t...
“The Pando Problem: Rethinking AI Individuality” by Jan_Kulveit
03 Apr 2025
Contributed by Lukas
Epistemic status: This post aims at an ambitious target: improving intuitive understanding directly. The model for why this is worth trying is that I...
“You will crash your car in front of my house within the next week” by Richard Korzekwa
02 Apr 2025
Contributed by Lukas
I'm not writing this to alarm anyone, but it would be irresponsible not to report on something this important. On current trends, every car will...
“My ‘infohazards small working group’ Signal Chat may have encountered minor leaks” by Linch
02 Apr 2025
Contributed by Lukas
Remember: There is no such thing as a pink elephant. Recently, I was made aware that my “infohazards small working group” Signal chat, an informa...
“Leverage, Exit Costs, and Anger: Re-examining Why We Explode at Home, Not at Work” by at_the_zoo
02 Apr 2025
Contributed by Lukas
Let's cut through the comforting narratives and examine a common behavioral pattern with a sharper lens: the stark difference between how anger ...
“PauseAI and E/Acc Should Switch Sides” by WillPetillo
02 Apr 2025
Contributed by Lukas
In the debate over AI development, two movements stand as opposites: PauseAI calls for slowing down AI progress, and e/acc (effective accelerationism...
“VDT: a solution to decision theory” by L Rudolf L
02 Apr 2025
Contributed by Lukas
Introduction Decision theory is about how to behave rationally under conditions of uncertainty, especially if this uncertainty involves being acausal...
“LessWrong has been acquired by EA” by habryka
01 Apr 2025
Contributed by Lukas
Dear LessWrong community, It is with a sense of... considerable cognitive dissonance that I announce a significant development regarding the future t...
“We’re not prepared for an AI market crash” by Remmelt
01 Apr 2025
Contributed by Lukas
Our community is not prepared for an AI crash. We're good at tracking new capability developments, but not as much the company financials. Curre...
“Conceptual Rounding Errors” by Jan_Kulveit
29 Mar 2025
Contributed by Lukas
Epistemic status: Reasonably confident in the basic mechanism. Have you noticed that you keep encountering the same ideas over and over? You read ano...
“Tracing the Thoughts of a Large Language Model” by Adam Jermyn
28 Mar 2025
Contributed by Lukas
[This is our blog post on the papers, which can be found at https://transformer-circuits.pub/2025/attribution-graphs/biology.html and https://transfo...
“Recent AI model progress feels mostly like bullshit” by lc
25 Mar 2025
Contributed by Lukas
About nine months ago, I and three friends decided that AI had gotten good enough to monitor large codebases autonomously for security problems. We s...
“AI for AI safety” by Joe Carlsmith
25 Mar 2025
Contributed by Lukas
(Audio version here (read by the author), or search for "Joe Carlsmith Audio" on your podcast app. This is the fourth essay in a series that...
“Policy for LLM Writing on LessWrong” by jimrandomh
25 Mar 2025
Contributed by Lukas
LessWrong has been receiving an increasing number of posts and contents that look like they might be LLM-written or partially-LLM-written, so we'...
“Will Jesus Christ return in an election year?” by Eric Neyman
25 Mar 2025
Contributed by Lukas
Thanks to Jesse Richardson for discussion. Polymarket asks: will Jesus Christ return in 2025? In the three days since the market opened, traders hav...
“Good Research Takes are Not Sufficient for Good Strategic Takes” by Neel Nanda
23 Mar 2025
Contributed by Lukas
TL;DR Having a good research track record is some evidence of good big-picture takes, but it's weak evidence. Strategic thinking is hard, and re...
“Intention to Treat” by Alicorn
22 Mar 2025
Contributed by Lukas
When my son was three, we enrolled him in a study of a vision condition that runs in my family. They wanted us to put an eyepatch on him for part of ...
“On the Rationality of Deterring ASI” by Dan H
22 Mar 2025
Contributed by Lukas
I’m releasing a new paper “Superintelligence Strategy” alongside Eric Schmidt (formerly Google), and Alexandr Wang (Scale AI). Below is the exec...
[Linkpost] “METR: Measuring AI Ability to Complete Long Tasks” by Zach Stein-Perlman
19 Mar 2025
Contributed by Lukas
This is a link post. Summary: We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has...
“I make several million dollars per year and have hundreds of thousands of followers—what is the straightest line path to utilizing these resources to reduce existential-level AI threats?” by shrimpy
19 Mar 2025
Contributed by Lukas
I have, over the last year, become fairly well-known in a small corner of the internet tangentially related to AI. As a result, I've begun making ...
“Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations” by Nicholas Goldowsky-Dill, Mikita Balesni, Jérémy Scheurer, Marius Hobbhahn
18 Mar 2025
Contributed by Lukas
Note: this is a research note based on observations from evaluating Claude Sonnet 3.7. We’re sharing the results of these ‘work-in-progress’ inv...
“Levels of Friction” by Zvi
18 Mar 2025
Contributed by Lukas
Scott Alexander famously warned us to Beware Trivial Inconveniences. When you make a thing easy to do, people often do vastly more of it. When you put u...
“Why White-Box Redteaming Makes Me Feel Weird” by Zygi Straznickas
17 Mar 2025
Contributed by Lukas
There's this popular trope in fiction about a character being mind controlled without losing awareness of what's happening. Think Jessica Jo...
“Reducing LLM deception at scale with self-other overlap fine-tuning” by Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Judd Rosenblatt, Mike Vaiana, Cameron Berg
17 Mar 2025
Contributed by Lukas
This research was conducted at AE Studio and supported by the AI Safety Grants programme administered by Foresight Institute with additional support f...