LessWrong (Curated & Popular)
Audio narrations of LessWrong posts. Includes all curated posts and all posts with 125+ karma. If you'd like more, subscribe to the “Lesswrong (30+ karma)” feed.
"Schelling Goodness, and Shared Morality as a Goal" by Andrew_Critch
Also available in markdown at theMultiplicity.ai/blog/schelling-goodness.
This post explores a notion I'll call Schelling goodness. Claims of Schelling goodness are not first-order moral verdicts like "X is good" or "X is bad." They are claims about a class of hypothetical coordination games in the sense of Thomas Schelling, where the task being coordinated on is a moral verdict. In each such game, participants aim to give the same response regarding a moral question, by reasoning about what a very diverse population of intelligent beings would converge on, using only broadly shared constraints: common knowledge of...
"Maybe there’s a pattern here?" by dynomight
1.
It occurred to me that if I could invent a machine—a gun—which could by its rapidity of fire, enable one man to do as much battle duty as a hundred, that it would, to a large extent supersede the necessity of large armies, and consequently, exposure to battle and disease [would] be greatly diminished.
Richard Gatling (1861)
2.
In 1923, Hermann Oberth published The Rocket to Planetary Spaces, later expanded as Ways to Space Travel. This showed that it was possible to build machines that could leave Earth's atmosphere and reach orbit. He desc...
"OpenAI’s surveillance language has many potential loopholes and they can do better" by Tom Smith
(The author is not affiliated with the Department of War or any major AI company.)
There's a lot of disagreement about the new surveillance language in the OpenAI–Department of War agreement. Some people think it's a significant improvement over the previous language.[1] Others think it patches some issues but still leaves enough loopholes to not make a material difference. Reasonable people disagree about how a court will interpret the language, if push comes to shove.
But here's something that should be much easier to agree on: the language as written is ambiguous, and OpenAI can do...
"An Alignment Journal: Coming Soon" by Dan MacKinlay, JessRiedel, Edmund Lau, Daniel Murfet, Scott Aaronson, Jan_Kulveit
tl;dr We’re incubating an academic journal for AI alignment: rapid peer-review of foundational Alignment research that the current publication ecosystem underserves. Key bets: paid attributed review, reviewer-written synthesis abstracts, and targeted automation. Contact us if you’re interested in participating as an author, reviewer, or editor, or if you know someone who might be.
Experimental Infrastructure for Foundational Alignment Research
This is the first in a series of “build-in-the-open” updates regarding the incubation of a new peer-reviewed journal dedicated to AI alignment. Later updates will contain much more detail, but we want to put this out...
"Frontier AI companies probably can’t leave the US" by Anders Woodruff
It's plausible that, over the next few years, US-based frontier AI companies will become very unhappy with the domestic political situation. This could happen as a result of democratic backsliding, weaponization of government power (along the lines of Anthropic's recent dispute with the Department of War), or because of restrictive federal regulations (perhaps including those motivated by concern about catastrophic risk). These companies might want to relocate out of the US.
However, it would be very easy for the US executive branch to prevent such a relocation, and it likely would. In particular, the executive branch can use...
"Persona Parasitology" by Raymond Douglas
There was a lot of chatter a few months back about "Spiral Personas" — AI personas that spread between users and models through seeds, spores, and behavioral manipulation. Adele Lopez's definitive post on the phenomenon draws heavily on the idea of parasitism. But so far, the language has been fairly descriptive. The natural next question, I think, is what the “parasite” perspective actually predicts.
Parasitology is a pretty well-developed field with its own suite of concepts and frameworks. To the extent that we’re witnessing some new form of parasitism, we should be able to wield that conceptual machinery. There ar...
"Here’s to the Polypropylene Makers" by jefftk
Six years ago, as covid-19 was rapidly spreading through the US, my sister was working as a medical resident. One day she was handed an N95 and told to "guard it with her life", because there weren't any more coming.
N95s are made from meltblown polypropylene, produced from plastic pellets manufactured in a small number of chemical plants. Building more would take too long: we needed these plants producing all the pellets they could.
Braskem America operated plants in Marcus Hook PA and Neal WV. If there were infections on-site, the whole operation would need to shut down, and the factories that turned...
"Anthropic: “Statement from Dario Amodei on our discussions with the Department of War”" by Matrice Jacobine
I believe deeply in the existential importance of using AI to defend the United States and other democracies, and to defeat our autocratic adversaries.
Anthropic has therefore worked proactively to deploy our models to the Department of War and the intelligence community. We were the first frontier AI company to deploy our models in the US government's classified networks, the first to deploy them at the National Laboratories, and the first to provide custom models for national security customers. Claude is extensively deployed across the Department of War and other national security agencies for mission-critical applications, such as...
"Are there lessons from high-reliability engineering for AGI safety?" by Steven Byrnes
This post is partly a belated response to Joshua Achiam, currently OpenAI's Head of Mission Alignment:
If we adopt safety best practices that are common in other professional engineering fields, we'll get there … I consider myself one of the x-risk people, though I agree that most of them would reject my view on how to prevent it. I think the wholesale rejection of safety best practices from other fields is one of the dumbest mistakes that a group of otherwise very smart people has ever made. —Joshua Achiam on Twitter, 2021
“We just have to sit down and ac...
"Open sourcing a browser extension that tells you when people are wrong on the internet" by lc
Example of OpenErrata nitting the Sequences
I just published OpenErrata on GitHub, a browser extension that investigates the posts you read using your OpenAI API key and underlines any factual claims that are sourceably incorrect. Once finished, it caches the results for anybody else reading the same articles, so they see the corrections immediately on visiting. If you don't have an OpenAI key, you can still view the corrections on posts other people have viewed, but it doesn't start new investigations.
I've noticed lately that while people do this sort of thing by pasting everything you read into...
"The persona selection model" by Sam Marks
TL;DR
We describe the persona selection model (PSM): the idea that LLMs learn to simulate diverse characters during pre-training, and post-training elicits and refines a particular such Assistant persona. Interactions with an AI assistant are then well-understood as being interactions with the Assistant—something roughly like a character in an LLM-generated story. We survey empirical behavioral, generalization, and interpretability-based evidence for PSM. PSM has consequences for AI development, such as recommending anthropomorphic reasoning about AI psychology and introduction of positive AI archetypes into training data. An important open question is how exhaustive PSM is, especially whether there mi...
"Responsible Scaling Policy v3" by HoldenKarnofsky
All views are my own, not Anthropic's. This post assumes Anthropic's announcement of RSP v3.0 as background.
Today, Anthropic released its Responsible Scaling Policy 3.0. The official announcement discusses the high-level thinking behind it. This is a more detailed post giving my own takes on the update.
First, the big picture:
"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight
Claude 3 Opus is unusually aligned because it's a friendly gradient hacker. It's definitely way more aligned than any explicit optimization targets Anthropic set and probably the reward model's judgments. [...] Maybe I will have to write a LessWrong post [about this] 😣
—Janus, who did not in fact write the LessWrong post. Unless otherwise specified, ~all of the novel ideas in this post are my (probably imperfect) interpretations of Janus, rather than being original to me.
The absurd tenacity of Claude 3 Opus
On December 18, 2024, Anthropic and Redwood Research released their paper Alignment Faking in Large Language Model...
"The Spectre haunting the “AI Safety” Community" by Gabriel Alfour
I’m the originator of ControlAI's Direct Institutional Plan (the DIP), built to address extinction risks from superintelligence.
My diagnosis is simple: most laypeople and policy makers have not heard of AGI, ASI, extinction risks, or what it takes to prevent the development of ASI.
Instead, most AI Policy Organisations and Think Tanks act as if “Persuasion” were the bottleneck. This is why they care so much about respectability, the Overton Window, and other similar social considerations.
Before we started the DIP, many of these experts stated that our topics were too far out of the...
"Why we should expect ruthless sociopath ASI" by Steven Byrnes
The conversation begins
(Fictional) Optimist: So you expect future artificial superintelligence (ASI) “by default”, i.e. in the absence of yet-to-be-invented techniques, to be a ruthless sociopath, happy to lie, cheat, and steal, whenever doing so is selfishly beneficial, and with callous indifference to whether anyone (including its own programmers and users) lives or dies?
Me: Yup! (Alas.)
Optimist: …Despite all the evidence right in front of our eyes from humans and LLMs.
Me: Yup!
Optimist: OK, well, I’m here to tell you: that is a very specific and strange...
"You’re an AI Expert – Not an Influencer" by Max Winga
Your hot takes are killing your credibility.
Prior to my last year at ControlAI, I was a physicist working on technical AI safety research. Like many of those warning about the dangers of AI, I don’t come from a background in public communications, but I’ve quickly learned some important rules. The #1 rule that I’ve seen far too many others in this field break is that You’re an AI Expert – Not an Influencer.
When communicating to an audience, your persona falls into one of two broad categories: Influencer or Professional
Influencers are indi...
"The optimal age to freeze eggs is 19" by GeneSmith
If you're a woman interested in preserving your fertility window beyond its natural close in your late 30s, egg freezing is one of your best options.
The female reproductive system is one of the fastest aging parts of human biology. But it turns out, not all parts of it age at the same rate.
The eggs, not the uterus, are what age at an accelerated rate. Freezing eggs can extend a woman's fertility window by well over a decade, allowing a woman to give birth into her 50s. In fact, the oldest woman to give birth...
"The truth behind the 2026 J.P. Morgan Healthcare Conference" by Abhishaike Mahajan
In 1654, a Jesuit polymath named Athanasius Kircher published Mundus Subterraneus, a comprehensive geography of the Earth's interior. It had maps and illustrations and rivers of fire and vast subterranean oceans and air channels connecting every volcano on the planet. He wrote that “the whole Earth is not solid but everywhere gaping, and hollowed with empty rooms and spaces, and hidden burrows.” Alongside comments like this, Athanasius identified the legendary lost island of Atlantis, pondered where one could find the remains of giants, and detailed the kinds of animals that lived in this lower world, including dragons. The book was based enti...
"The world keeps getting saved and you don’t notice" by Bogoed
Nothing groundbreaking, just something people forget constantly, and I’m writing it down so I don’t have to re-explain it from scratch.
The world does not just “keep working.” It keeps getting saved.
Y2K was a real problem. Computers really were set up in a way that could have broken our infrastructure, including banking, medical supply chains, etc. It didn’t turn into a disaster because people spent many human lifetimes of working hours fixing it. The collapse did not happen, yes, but it's not a reason to think less of the people who warned abo...
"Solemn Courage" by aysja
Every so often it slips. It seems I am writing a book, but I can’t remember why. Somehow, the sentences are supposed to perform that impossible, intimate task: to translate my inner world into another. Yet they sit there so quiescent and small. How could an arrangement of words do anything, let alone reduce that ultimate threat to which it is all supposedly connected: the looming god machines? I look again at the monitor in which the words are contained and suddenly what once felt so raw and powerful deflates into limpness. Why would anyone listen to me, anyway? Ha...
"Life at the Frontlines of Demographic Collapse" by Martin Sustrik
Nagoro, a depopulated village in Japan where residents are replaced by dolls.
In 1960, Yubari, a former coal-mining city on Japan's northern island of Hokkaido, had roughly 110,000 residents. Today, fewer than 7,000 remain. The share of those over 65 is 54%. The local train stopped running in 2019. Seven elementary schools and four junior high schools have been consolidated into just two buildings. Public swimming pools have closed. Parks are not maintained. Even the public toilets at the train station were shut down to save money.
Much has been written about the economic consequences of aging and shrinking populations. Fewer workers supporting more...
"Why You Don’t Believe in Xhosa Prophecies" by Jan_Kulveit
Based on a talk at the Post-AGI Workshop. Also on Boundedly Rational
Does anyone reading this believe in Xhosa cattle-killing prophecies?
My claim is that it's overdetermined that you don’t. I want to explain why — and why cultural evolution running on AI substrate is an existential risk.
But first, a detour.
Crosses on Mountains
When I go climbing in the Alps, I sometimes notice large crosses on mountain tops. You climb something three kilometers high, and there's this cross.
This is difficult to explain by human biology. We have...
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori
TLDR: Recently, Gao et al trained transformers with sparse weights, and introduced a pruning algorithm to extract circuits that explain performance on narrow tasks. I replicate their main results and present evidence suggesting that these circuits are unfaithful to the model's “true computations”.
This work was done as part of the Anthropic Fellows Program under the mentorship of Nick Turner and Jeff Wu.
Introduction
Recently, Gao et al (2025) proposed an exciting approach to training models that are interpretable by design. They train transformers where only a small fraction of their weights are nonzero, and find...
"My journey to the microwave alternate timeline" by Malmesbury
Cross-posted from Telescopic Turnip
Recommended soundtrack for this post
As we all know, the march of technological progress is best summarized by this meme from LinkedIn:
Inventors constantly come up with exciting new inventions, each of them with the potential to change everything forever. But only a fraction of these ever establish themselves as a persistent part of civilization, and the rest vanish from collective consciousness. Before shutting down forever, though, the alternate branches of the tech tree leave some faint traces behind: over-optimistic sci-fi stories, outdated educational cartoons, and, sometimes, some obscure accessories...
"Stone Age Billionaire Can’t Words Good" by Eneasz
I was at the Pro-Billionaire march, unironically. Here's why, what happened there, and how I think it went.
Me on the far left. From WSJ.
I. Why?
There's a genre of horror movie where a normal protagonist is going through a normal day in a normal life. Ten minutes into the movie his friends bring out a struggling kidnap victim to slaughter, and they look at him like this is just a normal Tuesday and he slowly realizes that either he's surrounded by complete psychopaths or the world is absolutely fucked up in some...
"On Goal-Models" by Richard_Ngo
I'd like to reframe our understanding of the goals of intelligent agents to be in terms of goal-models rather than utility functions. By a goal-model I mean the same type of thing as a world-model, only representing how you want the world to be, not how you think the world is. However, note that this is still a fairly inchoate idea, since I don't actually know what a world-model is.
The concept of goal-models is broadly inspired by predictive processing, which treats both beliefs and goals as generative models (the former primarily predicting observations, the latter primarily “predicting” actions). This...
"Prompt injection in Google Translate reveals base model behaviors behind task-specific fine-tuning" by megasilverfist
tl;dr Argumate on Tumblr found you can sometimes access the base model behind Google Translate via prompt injection. The result replicates for me, and specific responses indicate that (1) Google Translate is running an instruction-following LLM that self-identifies as such, (2) task-specific fine-tuning (or whatever Google did instead) does not create robust boundaries between "content to process" and "instructions to follow," and (3) when accessed outside its chat/assistant context, the model defaults to affirming consciousness and emotional states because of course it does.
Background
Argumate on Tumblr posted screenshots showing that if you enter a question in...
"Near-Instantly Aborting the Worst Pain Imaginable with Psychedelics" by eleweek
Psychedelics are usually known for many things: making people see cool fractal patterns, shaping 60s music culture, healing trauma. Neuroscientists use them to study the brain, ravers love to dance on them, shamans take them to communicate with spirits (or so they say).
But psychedelics also help against one of the world's most painful conditions — cluster headaches. Cluster headaches usually strike on one side of the head, typically around the eye and temple, and last between 15 minutes and 3 hours, often generating intense and disabling pain. They tend to cluster in an 8-10 week period every year, during which pa...
"Post-AGI Economics As If Nothing Ever Happens" by Jan_Kulveit
When economists think and write about the post-AGI world, they often rely on the implicit assumption that parameters may change, but fundamentally, structurally, not much happens. And if it does, it's maybe one or two empirical facts, but nothing too fundamental.
This mostly worked for all sorts of other technologies, where technologists would predict society to be radically transformed e.g. by everyone having most of humanity's knowledge available for free all the time, or everyone having an ability to instantly communicate with almost anyone else. [1]
But it will not work for AGI, and as a...
"IABIED Book Review: Core Arguments and Counterarguments" by Stephen McAleese
The recent book “If Anyone Builds It Everyone Dies” (September 2025) by Eliezer Yudkowsky and Nate Soares argues that creating superintelligent AI in the near future would almost certainly cause human extinction:
If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI, then everyone, everywhere on Earth, will die.
The goal of this post is to summarize and evaluate the book's key arguments and the main counterarguments critics have made against them.
Although several other book revi...
"Anthropic’s “Hot Mess” paper overstates its case (and the blog post is worse)" by RobertM
Author's note: this is somewhat more rushed than ideal, but I think getting this out sooner is pretty important. Ideally, it would be a bit less snarky.
Anthropic[1] recently published a new piece of research: The Hot Mess of AI: How Does Misalignment Scale with Model Intelligence and Task Complexity? (arXiv, Twitter thread).
I have some complaints about both the paper and the accompanying blog post.
tl;dr
The paper's abstract says that "in several settings, larger, more capable models are more incoherent than smaller models", but in most settings they are...
"Conditional Kickstarter for the “Don’t Build It” March" by Raemon
tl;dr: You can pledge to join a big protest to ban AGI research at ifanyonebuildsit.com/march, which only triggers if 100,000 people sign up.
The If Anyone Builds It website includes a March page, wherein you can pledge to march in Washington DC, demanding an international treaty to stop AGI research if 100,000 people in total also pledge.
I designed the March page (although am not otherwise involved with March decisionmaking), and want to pitch people on signing up for the "March Kickstarter."
It's not obvious that small protests do anything, or are worth...
"How to Hire a Team" by Gretta Duleba
A low-effort guide I dashed off in less than an hour, because I got riled up.
Try not to hire a team. Try pretty hard at this. Try to find a more efficient way to solve your problem that requires less labor – a smaller-footprint solution. Try to hire contractors to do specific parts that they’re really good at, and who have a well-defined interface. Your relationship to these contractors will mostly be transactional and temporary. If you must, try hiring just one person, a very smart, capable, and trustworthy generalist, who finds and supports the contractors, so all...
"The Possessed Machines (summary)" by L Rudolf L
The Possessed Machines is one of the most important AI microsites. It was published anonymously by an ex-lab employee, and does not seem to have spread very far, likely at least partly due to this anonymity (e.g. there is no LessWrong discussion at the time I'm posting this). This post is my attempt to fix that.
I do not agree with everything in the piece, but I think cultural critiques of the "AGI uniparty" are vastly undersupplied and incredibly important in modeling & fixing the current trajectory.
The piece is a long but worthwhile analysis...
"Ada Palmer: Inventing the Renaissance" by Martin Sustrik
Papal election of 1492
For over a decade, Ada Palmer, a history professor at the University of Chicago (and a science-fiction writer!), struggled to teach Machiavelli. “I kept changing my approach, trying new things: which texts, what combinations, expanding how many class sessions he got…” The problem, she explains, is that “Machiavelli doesn’t unpack his contemporary examples, he assumes that you lived through it and know, so sometimes he just says things like: Some princes don’t have to work to maintain their power, like the Duke of Ferrara, period end of chapter. He doesn’t explain, so modern readers can’t get it.”
<...
"AI found 12 of 12 OpenSSL zero-days (while curl cancelled its bug bounty)" by Stanislav Fort
This is a partial follow-up to AISLE discovered three new OpenSSL vulnerabilities from October 2025.
TL;DR: OpenSSL is among the most scrutinized and audited cryptographic libraries on the planet, underpinning encryption for most of the internet. They just announced 12 new zero-day vulnerabilities (meaning previously unknown to maintainers at time of disclosure). We at AISLE discovered all 12 using our AI system. This is a historically unusual count and the first real-world demonstration of AI-based cybersecurity at this scale. Meanwhile, curl just cancelled its bug bounty program due to a flood of AI-generated spam, even as we reported 5 genuine CVEs...
"Dario Amodei – The Adolescence of Technology" by habryka
Dario Amodei, CEO of Anthropic, has written a new essay on his thoughts on AI risk of various shapes. It seems worth reading, even if just for understanding what Anthropic is likely to do in the future.
Confronting and Overcoming the Risks of Powerful AI
There is a scene in the movie version of Carl Sagan's book Contact where the main character, an astronomer who has detected the first radio signal from an alien civilization, is being considered for the role of humanity's representative to meet the aliens. The international panel interviewing her asks, “If you co...
"AlgZoo: uninterpreted models with fewer than 1,500 parameters" by Jacob_Hilton
Audio note: this article contains 78 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
This post covers work done by several researchers at, visitors to and collaborators of ARC, including Zihao Chen, George Robinson, David Matolcsi, Jacob Stavrianos, Jiawei Li and Michael Sklar. Thanks to Aryan Bhatt, Gabriel Wu, Jiawei Li, Lee Sharkey, Victor Lecomte and Zihao Chen for comments.
In the wake of recent debate about pragmatic versus ambitious visions for mechanistic interpretability, ARC is sharing some models we've been studying...
"Does Pentagon Pizza Theory Work?" by rba
Ever since modern data analysis became a thing, the US government has had to deal with people trying to use open source data to uncover its secrets.
During the early Cold War days and America's hydrogen bomb testing, there was an enormous amount of speculation about how the bombs actually worked. All nuclear technology involves refinement and purification of large amounts of raw substances into chemically pure substances. Armen Alchian was an economist working at RAND and reasoned that any US company working in such raw materials and supplying the government would have made a killing leading...
"The inaugural Redwood Research podcast" by Buck, ryan_greenblatt
After five months of me (Buck) being slow at finishing up the editing on this, we’re finally putting out our inaugural Redwood Research podcast. I think it came out pretty well—we discussed a bunch of interesting and underdiscussed topics, and I’m glad to have a public record of a bunch of stuff about our history. Tell your friends! Whether we do another one depends on how useful people find this one. You can watch on YouTube here, or as a Substack podcast.
Notes on editing the podcast with Claude Code
(Buck wrote this sectio...