LessWrong (Curated & Popular)
Episodes
Introducing Alignment Stress-Testing at Anthropic
14 Jan 2024
Contributed by Lukas
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Following on from our recent paper, “Sleeper Agents: Training ...
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
13 Jan 2024
Contributed by Lukas
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This is a linkpost for https://arxiv.org/abs/2401.05566. I'm ...
[HUMAN VOICE] "Meaning & Agency" by Abram Demski
07 Jan 2024
Contributed by Lukas
Support ongoing human narrations of LessWrong's curated posts: www.patreon.com/LWCurated. The goal of this post is to clarify a few concepts relatin...
What’s up with LLMs representing XORs of arbitrary features?
07 Jan 2024
Contributed by Lukas
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Thanks to Clément Dumas, Nikola Jurković, Nora Belrose, Arthur...
Gentleness and the artificial Other
05 Jan 2024
Contributed by Lukas
(Cross-posted from my website. Audio version here, or search "Joe Carlsmith Audio" on your podcast app.) This is the first essay in a series t...
MIRI 2024 Mission and Strategy Update
05 Jan 2024
Contributed by Lukas
As we announced back in October, I have taken on the senior leadership role at MIRI as its CEO. It's a big pair of shoes to fill, and an awesome ...
The Plan - 2023 Version
04 Jan 2024
Contributed by Lukas
Background: The Plan, The Plan: 2022 Update. If you haven’t read those, don’t worry, we’re going to go through things from the top this year, an...
Apologizing is a Core Rationalist Skill
03 Jan 2024
Contributed by Lukas
In certain circumstances, apologizing can also be a countersignalling power-move, i.e. “I am so high status that I can grovel a bit without anybody ...
[HUMAN VOICE] "A case for AI alignment being difficult" by jessicata
02 Jan 2024
Contributed by Lukas
This is a linkpost for https://unstableontology.com/2023/12/31/a-case-for-ai-alignment-being-difficult/. Support ongoing human narrations of LessWrong's ...
The Dark Arts
01 Jan 2024
Contributed by Lukas
lsusr: It is my understanding that you won all of your public forum debates this year. That's very impressive. I thought it would be interesting to...
Critical review of Christiano’s disagreements with Yudkowsky
28 Dec 2023
Contributed by Lukas
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This is a review of Paul Christiano's article "Where I...
Most People Don’t Realize We Have No Idea How Our AIs Work
27 Dec 2023
Contributed by Lukas
This point feels fairly obvious, yet seems worth stating explicitly. Those of us familiar with the field of AI after the deep-learning revolution know ...
Discussion: Challenges with Unsupervised LLM Knowledge Discovery
26 Dec 2023
Contributed by Lukas
TL;DR: Contrast-consistent search (CCS) seemed exciting to us and we were keen to apply it. At this point, we think it is unlikely to be directly help...
Succession
24 Dec 2023
Contributed by Lukas
This is a linkpost for https://www.narrativeark.xyz/p/succession. “A table beside the evening sea where you sit shelling pistachios, flicking the next...
Nonlinear’s Evidence: Debunking False and Misleading Claims
21 Dec 2023
Contributed by Lukas
Recently, Ben Pace wrote a well-intentioned blog post mostly based on complaints from 2 (of 21) Nonlinear employees who 1) wanted more money, 2) felt...
Effective Aspersions: How the Nonlinear Investigation Went Wrong
20 Dec 2023
Contributed by Lukas
The New York Times. Picture a scene: the New York Times is releasing an article on Effective Altruism (EA) with an express goal to dig up every piece ...
Constellations are Younger than Continents
20 Dec 2023
Contributed by Lukas
At the Bay Area Solstice, I heard the song Bold Orion for the first time. I like it a lot. It does, however, have one problem: He has seen the rise and...
The ‘Neglected Approaches’ Approach: AE Studio’s Alignment Agenda
19 Dec 2023
Contributed by Lukas
Many thanks to Samuel Hammond, Cate Hall, Beren Millidge, Steve Byrnes, Lucius Bushnaq, Joar Skalse, Kyle Gracey, Gunnar Zarncke, Ross Nordby, David L...
“Humanity vs. AGI” Will Never Look Like “Humanity vs. AGI” to Humanity
18 Dec 2023
Contributed by Lukas
When discussing AGI Risk, people often talk about it in terms of a war between humanity and an AGI. Comparisons between the amounts of resources at bo...
Is being sexy for your homies?
17 Dec 2023
Contributed by Lukas
Epistemic status: Speculation. An unholy union of evo psych, introspection, random stuff I happen to observe & hear about, and thinking. Done on a...
[HUMAN VOICE] "Significantly Enhancing Adult Intelligence With Gene Editing May Be Possible" by Gene Smith and Kman
17 Dec 2023
Contributed by Lukas
Support ongoing human narrations of LessWrong's curated posts: www.patreon.com/LWCurated. TL;DR version: In the course of my life, there have been a h...
[HUMAN VOICE] "Moral Reality Check (a short story)" by jessicata
15 Dec 2023
Contributed by Lukas
Support ongoing human narrations of LessWrong's curated posts: www.patreon.com/LWCurated. This is a linkpost for https://unstableontology.com/2023/1...
AI Control: Improving Safety Despite Intentional Subversion
15 Dec 2023
Contributed by Lukas
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. We’ve released a paper, AI Control: Improving Safety Despite I...
2023 Unofficial LessWrong Census/Survey
13 Dec 2023
Contributed by Lukas
The Less Wrong General Census is unofficially here! You can take it at this link. It's that time again. If you are reading this post and identify a...
The likely first longevity drug is based on sketchy science. This is bad for science and bad for longevity.
13 Dec 2023
Contributed by Lukas
If you are interested in the longevity scene, like I am, you probably have seen press releases about the dog longevity company, Loyal for Dogs, gettin...
[HUMAN VOICE] "What are the results of more parental supervision and less outdoor play?" by Julia Wise
13 Dec 2023
Contributed by Lukas
Support ongoing human narrations of LessWrong's curated posts: www.patreon.com/LWCurated. Crossposted from Otherwise. Parents supervise their children...
Significantly Enhancing Adult Intelligence With Gene Editing May Be Possible
12 Dec 2023
Contributed by Lukas
In the course of my life, there have been a handful of times I discovered an idea that changed the way I thought about the world. The first occurred w...
re: Yudkowsky on biological materials
11 Dec 2023
Contributed by Lukas
I was asked to respond to this comment by Eliezer Yudkowsky. This post is partly redundant with my previous post. Why is flesh weaker than diamond? When...
Speaking to Congressional staffers about AI risk
05 Dec 2023
Contributed by Lukas
In May and June of 2023, I (Akash) had about 50-70 meetings about AI risks with congressional staffers. I had been meaning to write a post reflecting ...
[HUMAN VOICE] "Shallow review of live agendas in alignment & safety" by technicalities & Stag
04 Dec 2023
Contributed by Lukas
Support ongoing human narrations of LessWrong's curated posts: www.patreon.com/LWCurated. You can’t optimise an allocation of resources if you don...
Thoughts on “AI is easy to control” by Pope & Belrose
02 Dec 2023
Contributed by Lukas
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Quintin Pope & Nora Belrose have a new “AI Optimists” we...
The 101 Space You Will Always Have With You
30 Nov 2023
Contributed by Lukas
Any community which ever adds new people will need to either routinely teach the new and (to established members) blindingly obvious information to th...
[HUMAN VOICE] "Social Dark Matter" by Duncan Sabien
28 Nov 2023
Contributed by Lukas
The author's Substack: https://substack.com/@homosabiens. Support ongoing human narrations of LessWrong's curated posts: www.patreon.com/LWCurat...
Shallow review of live agendas in alignment & safety
28 Nov 2023
Contributed by Lukas
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Summary: You can’t optimise an allocation of resources if you d...
Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense
25 Nov 2023
Contributed by Lukas
Status: Vague, sorry. The point seems almost tautological to me, and yet also seems like the correct answer to the people going around saying “LLMs ...
[HUMAN VOICE] "The 6D effect: When companies take risks, one email can be very powerful." by scasper
23 Nov 2023
Contributed by Lukas
Support ongoing human narrations of curated posts: www.patreon.com/LWCurated. Recently, I have been learning about industry norms, legal discovery procee...
OpenAI: The Battle of the Board
22 Nov 2023
Contributed by Lukas
Previously: OpenAI: Facts from a Weekend. On Friday afternoon, OpenAI's board fired CEO Sam Altman. Overnight, an agreement in principle was reac...
OpenAI: Facts from a Weekend
20 Nov 2023
Contributed by Lukas
Approximately four GPTs and seven years ago, OpenAI's founders brought forth on this corporate landscape a new entity, conceived in liberty, and ...
Sam Altman fired from OpenAI
18 Nov 2023
Contributed by Lukas
This is a linkpost for https://openai.com/blog/openai-announces-leadership-transition. Basically just the title, see the OAI blog post for more details....
Social Dark Matter
17 Nov 2023
Contributed by Lukas
You know it must be out there, but you mostly never see it. Author's Note 1: I'm something like 75% confident that this will be the last essa...
[HUMAN VOICE] "Thinking By The Clock" by Screwtape
17 Nov 2023
Contributed by Lukas
Support ongoing human narrations of curated posts: www.patreon.com/LWCurated. I'm sure Harry Potter and the Methods of Rationality taught me some of...
"You can just spontaneously call people you haven't met in years" by lc
17 Nov 2023
Contributed by Lukas
Here's a recent conversation I had with a friend: Me: "I wish I had more friends. You guys are great, but I only get to hang out with you lik...
[HUMAN VOICE] "AI Timelines" by habryka, Daniel Kokotajlo, Ajeya Cotra, Ege Erdil
17 Nov 2023
Contributed by Lukas
Support ongoing human narrations of curated posts: www.patreon.com/LWCurated. How many years will pass before transformative AI is built? Three people wh...
"EA orgs' legal structure inhibits risk taking and information sharing on the margin" by Elizabeth
17 Nov 2023
Contributed by Lukas
It’s fairly common for EA orgs to provide fiscal sponsorship to other EA orgs. Wait, no, that sentence is not quite right. The more accurate sente...
"Integrity in AI Governance and Advocacy" by habryka, Olivia Jimenez
17 Nov 2023
Contributed by Lukas
habryka: Ok, so we both had some feelings about the recent Conjecture post on "lots of people in AI Alignment are lying", and the associated m...
Loudly Give Up, Don’t Quietly Fade
16 Nov 2023
Contributed by Lukas
1. There's a supercharged, dire wolf form of the bystander effect that I’d like to shine a spotlight on. First, a quick recap. The Bystander Effe...
"Does davidad's uploading moonshot work?" by jacobjabob et al.
09 Nov 2023
Contributed by Lukas
davidad has a 10-min talk out on a proposal about which he says: “the first time I’ve seen a concrete plan that might work to get human uploads be...
[HUMAN VOICE] "Deception Chess: Game #1" by Zane et al.
09 Nov 2023
Contributed by Lukas
Support ongoing human narrations of curated posts: www.patreon.com/LWCurated. (You can sign up to play deception chess here if you haven't already.)...
[HUMAN VOICE] "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" by Zac Hatfield-Dodds
09 Nov 2023
Contributed by Lukas
Support ongoing human narrations of curated posts: www.patreon.com/LWCurated. This is a linkpost for https://transformer-circuits.pub/2023/monosemantic-f...
"The 6D effect: When companies take risks, one email can be very powerful." by scasper
09 Nov 2023
Contributed by Lukas
Recently, I have been learning about industry norms, legal discovery proceedings, and incentive structures related to companies building risky systems...
"The other side of the tidal wave" by Katja Grace
09 Nov 2023
Contributed by Lukas
I guess there’s maybe a 10-20% chance of AI causing human extinction in the coming decades, but I feel more distressed about it than even that sugge...
"Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk" by 1a3orn
09 Nov 2023
Contributed by Lukas
I examined all the biorisk-relevant citations from a policy paper arguing that we should ban powerful open source LLMs. None of them provide good evide...
"My thoughts on the social response to AI risk" by Matthew Barnett
09 Nov 2023
Contributed by Lukas
A common theme implicit in many AI risk stories has been that broader society will either fail to anticipate the risks of AI until it is too late, or ...
Comp Sci in 2027 (Short story by Eliezer Yudkowsky)
09 Nov 2023
Contributed by Lukas
This is a linkpost for https://nitter.net/ESYudkowsky/status/1718654143110512741. Comp sci in 2017: Student: I get the feeling the compiler is just ign...
"Thoughts on the AI Safety Summit company policy requests and responses" by So8res
03 Nov 2023
Contributed by Lukas
Over the next two days, the UK government is hosting an AI Safety Summit focused on “the safe and responsible development of frontier AI”. They re...
"President Biden Issues Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence" by Tristan Williams
03 Nov 2023
Contributed by Lukas
This is a linkpost for https://www.whitehouse.gov/briefing-room/statements-releases/2023/10/30/fact-sheet-president-biden-issues-executive-order-on-sa...
[Human Voice] "Book Review: Going Infinite" by Zvi
31 Oct 2023
Contributed by Lukas
Support ongoing human narrations of curated posts: www.patreon.com/LWCurated. Previously: Sadly, FTX. I doubted whether it would be a good use of time to r...
"Thoughts on responsible scaling policies and regulation" by Paul Christiano
30 Oct 2023
Contributed by Lukas
I am excited about AI developers implementing responsible scaling policies; I’ve recently been spending time refining this idea and advocating for i...
"We're Not Ready: thoughts on "pausing" and responsible scaling policies" by Holden Karnofsky
30 Oct 2023
Contributed by Lukas
Views are my own, not Open Philanthropy’s. I am married to the President of Anthropic and have a financial interest in both Anthropic and OpenAI via...
"At 87, Pearl is still able to change his mind" by rotatingpaguro
30 Oct 2023
Contributed by Lukas
Judea Pearl is a famous researcher, known for Bayesian networks (the standard way of representing Bayesian models), and his statistical formalization ...
"Architects of Our Own Demise: We Should Stop Developing AI" by Roko
30 Oct 2023
Contributed by Lukas
Some brief thoughts at a difficult time in the AI risk debate. Imagine you go back in time to the year 1999 and tell people that in 24 years time, huma...
"AI as a science, and three obstacles to alignment strategies" by Nate Soares
30 Oct 2023
Contributed by Lukas
AI used to be a science. In the old days (back when AI didn't work very well), people were attempting to develop a working theory of cognition. Th...
"Announcing Timaeus" by Jesse Hoogland et al.
30 Oct 2023
Contributed by Lukas
Timaeus is a new AI safety research organization dedicated to making fundamental breakthroughs in technical AI alignment using deep ideas from mathema...
[HUMAN VOICE] "Alignment Implications of LLM Successes: a Debate in One Act" by Zack M Davis
23 Oct 2023
Contributed by Lukas
Support ongoing human narrations of curated posts: www.patreon.com/LWCurated. Doomimir: Humanity has made no progress on the alignment problem. Not only ...
"Holly Elmore and Rob Miles dialogue on AI Safety Advocacy" by jacobjacob, Robert Miles & Holly_Elmore
23 Oct 2023
Contributed by Lukas
Holly is an independent AI Pause organizer, which includes organizing protests (like this upcoming one). Rob is an AI Safety YouTuber. I (jacobjacob) ...
"LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B" by Simon Lermen & Jeffrey Ladish.
23 Oct 2023
Contributed by Lukas
Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Jeffrey Ladish. TL;DR: LoRA fine-tunin...
"Labs should be explicit about why they are building AGI" by Peter Barnett
19 Oct 2023
Contributed by Lukas
Three of the big AI labs say that they care about alignment and that they think misaligned AI poses a potentially existential threat to humanity. Thes...
[HUMAN VOICE] "Sum-threshold attacks" by TsviBT
18 Oct 2023
Contributed by Lukas
Support ongoing human narrations of curated posts: www.patreon.com/LWCurated. How do you affect something far away, a lot, without anyone noticing? (Note:...
"Will no one rid me of this turbulent pest?" by Metacelsus
18 Oct 2023
Contributed by Lukas
Last year, I wrote about the promise of gene drives to wipe out mosquito species and end malaria. In the time since my previous writing, gene drives ha...
[HUMAN VOICE] "Inside Views, Impostor Syndrome, and the Great LARP" by John Wentworth
15 Oct 2023
Contributed by Lukas
Patreon to support human narration. (Narrations will remain freely available on this feed, but you can optionally support them if you'd like me t...
"RSPs are pauses done right" by evhub
15 Oct 2023
Contributed by Lukas
COI: I am a research scientist at Anthropic, where I work on model organisms of misalignment; I was also involved in the drafting process for Anthropi...
"Comparing Anthropic's Dictionary Learning to Ours" by Robert_AIZI
15 Oct 2023
Contributed by Lukas
Readers may have noticed many similarities between Anthropic's recent publication Towards Monosemanticity: Decomposing Language Models With Dicti...
"Announcing MIRI’s new CEO and leadership team" by Gretta Duleba
15 Oct 2023
Contributed by Lukas
In 2023, MIRI has shifted focus in the direction of broad public communication—see, for example, our recent TED talk, our piece in TIME magazine “...
"Cohabitive Games so Far" by mako yass
15 Oct 2023
Contributed by Lukas
A cohabitive game[1] is a partially cooperative, partially competitive multiplayer game that provides an anarchic dojo for development in applied coop...
"Announcing Dialogues" by Ben Pace
09 Oct 2023
Contributed by Lukas
As of today, everyone is able to create a new type of content on LessWrong: Dialogues. In contrast with posts, which are for monologues, and comment se...
"Response to Quintin Pope’s Evolution Provides No Evidence For the Sharp Left Turn" by Zvi
09 Oct 2023
Contributed by Lukas
Response to: Evolution Provides No Evidence For the Sharp Left Turn, due to it winning first prize in The Open Philanthropy Worldviews contest. Quint...
"Evaluating the historical value misspecification argument" by Matthew Barnett
09 Oct 2023
Contributed by Lukas
ETA: I'm not saying that MIRI thought AIs wouldn't understand human values. If there's only one thing you take away from this post, ple...
"Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" by Zac Hatfield-Dodds
09 Oct 2023
Contributed by Lukas
Neural networks are trained on data, not programmed to follow rules. We understand the math of the trained network exactly – each neuron in a neural...
"Thomas Kwa's MIRI research experience" by Thomas Kwa and others
06 Oct 2023
Contributed by Lukas
Moderator note: the following is a dialogue using LessWrong’s new dialogue feature. The exchange is not completed: new replies might be added contin...
"'Diamondoid bacteria' nanobots: deadly threat or dead-end? A nanotech investigation" by titotal
03 Oct 2023
Contributed by Lukas
A lot of people are highly concerned that a malevolent AI or insane human will, in the near future, set out to destroy humanity. If such an entity wan...
"The Lighthaven Campus is open for bookings" by Habryka
03 Oct 2023
Contributed by Lukas
Lightcone Infrastructure (the organization that grew from and houses the LessWrong team) has just finished renovating a 7-building physical campus tha...
"How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions" by Jan Brauner et al.
03 Oct 2023
Contributed by Lukas
Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrab...
"EA Vegan Advocacy is not truthseeking, and it’s everyone’s problem" by Elizabeth
03 Oct 2023
Contributed by Lukas
Effective altruism prides itself on truthseeking. That pride is justified in the sense that EA is better at truthseeking than most members of its refe...
"The King and the Golem" by Richard Ngo
29 Sep 2023
Contributed by Lukas
This is a linkpost for https://narrativeark.substack.com/p/the-king-and-the-golem. Long ago there was a mighty king who had everything in the world that...
"Sparse Autoencoders Find Highly Interpretable Directions in Language Models" by Logan Riggs et al
27 Sep 2023
Contributed by Lukas
This is a linkpost for Sparse Autoencoders Find Highly Interpretable Directions in Language Models. We use a scalable and unsupervised method called Spa...
"Inside Views, Impostor Syndrome, and the Great LARP" by John Wentworth
26 Sep 2023
Contributed by Lukas
Epistemic status: model which I find sometimes useful, and which emphasizes some true things about many parts of the world which common alternative mo...
"There should be more AI safety orgs" by Marius Hobbhahn
25 Sep 2023
Contributed by Lukas
I’m writing this in my own capacity. The views expressed are my own, and should not be taken to represent the views of Apollo Research or any other ...
"The Talk: a brief explanation of sexual dimorphism" by Malmesbury
22 Sep 2023
Contributed by Lukas
Cross-posted from substack. "Everything in the world is about sex, except sex. Sex is about clonal interference." – Oscar Wilde (kind of). As ...
"A Golden Age of Building? Excerpts and lessons from Empire State, Pentagon, Skunk Works and SpaceX" by jacobjacob
20 Sep 2023
Contributed by Lukas
Patrick Collison has a fantastic list of examples of people quickly accomplishing ambitious things together since the 19th Century. It does make you y...
"AI presidents discuss AI alignment agendas" by TurnTrout & Garrett Baker
19 Sep 2023
Contributed by Lukas
This is a linkpost for https://www.youtube.com/watch?v=02kbWY5mahQ. None of the presidents fully represent my (TurnTrout's) views. TurnTrout wrote t...
"UDT shows that decision theory is more puzzling than ever" by Wei Dai
18 Sep 2023
Contributed by Lukas
I feel like MIRI perhaps mispositioned FDT (their variant of UDT) as a clear advancement in decision theory, whereas maybe they could have attracted m...
"Sum-threshold attacks" by TsviBT
11 Sep 2023
Contributed by Lukas
How do you affect something far away, a lot, without anyone noticing? (Note: you can safely skip sections. It is also safe to skip the essay entirely, ...
"A list of core AI safety problems and how I hope to solve them" by Davidad
09 Sep 2023
Contributed by Lukas
Context: I sometimes find myself referring back to this tweet and wanted to give it a more permanent home. While I'm at it, I thought I would try...
"Report on Frontier Model Training" by Yafah Edelman
09 Sep 2023
Contributed by Lukas
This is a linkpost for https://docs.google.com/document/d/1TsYkDYtV6BKiCN9PAOirRAy3TrNDu2XncUZ5UZfaAKA/edit?usp=sharing. Understanding what drives the r...
"One Minute Every Moment" by abramdemski
08 Sep 2023
Contributed by Lukas
About how much information are we keeping in working memory at a given moment?"Miller's Law" dictates that the number of things humans ...
"Sharing Information About Nonlinear" by Ben Pace
08 Sep 2023
Contributed by Lukas
Added (11th Sept): Nonlinear have commented that they intend to write a response, have written a short follow-up, and claim that they dispute 85 claim...
"Defunding My Mistake" by ymeskhout
08 Sep 2023
Contributed by Lukas
Until about five years ago, I unironically parroted the slogan All Cops Are Bastards (ACAB) and earnestly advocated to abolish the police and prison s...
"What I would do if I wasn’t at ARC Evals" by LawrenceC
08 Sep 2023
Contributed by Lukas
In which: I list 9 projects that I would work on if I wasn’t busy working on safety standards at ARC Evals, and explain why they might be good to wo...
"Meta Questions about Metaphilosophy" by Wei Dai
04 Sep 2023
Contributed by Lukas
To quickly recap my main intellectual journey so far (omitting a lengthy side trip into cryptography and Cypherpunk land), with the approximate age th...
"The U.S. is becoming less stable" by lc
04 Sep 2023
Contributed by Lukas
We focus so much on arguing over who is at fault in this country that I think sometimes we fail to alert on what's actually happening. I would ju...