LessWrong (Curated & Popular)

“The Boring Part of Bell Labs” by Elizabeth

Today at 7:15 PM

It took me a long time to realize that Bell Labs was cool. You see, my dad worked at Bell Labs, and he has not done a single cool thing in his life except create me and bring a telescope to my third grade class. Nothing he was involved with could ever be cool, especially after the standard set by his grandfather who is allegedly on a patent for the television.

It turns out I was partially right. The Bell Labs everyone talks about is the research division at Murray Hill. They’re the ones that invented transistors an...

[Linkpost] “The Missing Genre: Heroic Parenthood - You can have kids and still punch the sun” by null

Today at 2:15 PM

This is a link post. I stopped reading when I was 30. You can fill in all the stereotypes of a girl with a book glued to her face during every meal, every break, and 10 hours a day on holidays.

That was me.

And then it was not.

For 9 years I’ve been trying to figure out why. I mean, I still read. Technically. But not with the feral devotion from Before. And I finally figured out why. See, every few years I would shift genres to fit my developmental stage:

Kid → Adventure caus...

“Writing advice: Why people like your quick bullshit takes better than your high-effort posts” by null

Today at 6:45 AM

Right now I’m coaching for Inkhaven, a month-long marathon writing event where our brave residents are writing a blog post every single day for the entire month of November.

And I’m pleased that some of them have seen success – relevant figures seeing the posts, shares on Hacker News and Twitter and LessWrong. The amount of writing is nuts, so people are trying out different styles and topics – some posts are effort-rich, some are quick takes or stories or lists.

Some people have come up to me – one of their pieces has gotten some decent reception...

“Claude 4.5 Opus’ Soul Document” by null

Today at 1:30 AM

Summary

As far as I understand and uncovered, a document for the character training for Claude is compressed in Claude's weights. The full document can be found at the "Anthropic Guidelines" heading at the end. The Gist with code, chats and various documents (including the "soul document") can be found here:

Claude 4.5 Opus Soul Document

I apologize in advance for this not exactly a regular lw post, but I thought an effort-post may fit here the best.

A strange hallucination, or is it?

While extracting Claude 4.5 Opus' system message on...

“Unless its governance changes, Anthropic is untrustworthy” by null

Yesterday at 9:30 PM

Anthropic is untrustworthy.

This post provides arguments, asks questions, and documents some examples of Anthropic's leadership being misleading and deceptive, holding contradictory positions that consistently shift in OpenAI's direction, lobbying to kill and water down regulation so helpful that employees of all major AI companies speak out to support it, and violating the fundamental promise the company was founded on. It also shares a few previously unreported details on Anthropic leadership's promises and efforts.[1]

Anthropic has a strong internal culture that has broadly EA views and values, and the company has strong pressures to appear to...

“Alignment remains a hard, unsolved problem” by null

Last Thursday at 5:15 PM

Thanks to (in alphabetical order) Joshua Batson, Roger Grosse, Jeremy Hadfield, Jared Kaplan, Jan Leike, Jack Lindsey, Monte MacDiarmid, Francesco Mosconi, Chris Olah, Ethan Perez, Sara Price, Ansh Radhakrishnan, Fabien Roger, Buck Shlegeris, Drake Thomas, and Kate Woolverton for useful discussions, comments, and feedback.

Though there are certainly some issues, I think most current large language models are pretty well aligned. Despite its alignment faking, my favorite is probably Claude 3 Opus, and if you asked me to pick between the CEV of Claude 3 Opus and that of a median human, I think it'd be a pretty close call...

“Video games are philosophy’s playground” by Rachel Shu

Last Wednesday at 3:45 AM

Crypto people have this saying: "cryptocurrencies are macroeconomics' playground." The idea is that blockchains let you cheaply spin up toy economies to test mechanisms that would be impossibly expensive or unethical to try in the real world. Want to see what happens with a 200% marginal tax rate? Launch a token with those rules and watch what happens. (Spoiler: probably nothing good, but at least you didn't have to topple a government to find out.)

I think video games, especially multiplayer online games, are doing the same thing for metaphysics. Except video games are actually fun and don't require...

“Stop Applying And Get To Work” by plex

Last Monday at 10:30 PM

TL;DR: Figure out what needs doing and do it, don't wait on approval from fellowships or jobs.

If you...

Have short timelines Have been struggling to get into a position in AI safety Are able to self-motivate your efforts Have a sufficient financial safety net ... I would recommend changing your personal strategy entirely.

I started my full-time AI safety career transitioning process in March 2025. For the first 7 months or so, I heavily prioritized applying for jobs and fellowships. But like for many others trying to "break into the field" and get their "foot...

“Gemini 3 is Evaluation-Paranoid and Contaminated” by null

11/23/2025

TL;DR: Gemini 3 frequently thinks it is in an evaluation when it is not, assuming that all of its reality is fabricated. It can also reliably output the BIG-bench canary string, indicating that Google likely trained on a broad set of benchmark data.

Most of the experiments in this post are very easy to replicate, and I encourage people to try.

I write things with LLMs sometimes. A new LLM came out, Gemini 3 Pro, and I tried to write with it. So far it seems okay, I don't have strong takes on it for writing yet...

“Natural emergent misalignment from reward hacking in production RL” by evhub, Monte M, Benjamin Wright, Jonathan Uesato

11/22/2025

Abstract

We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of reward hacking strategies via synthetic document finetuning or prompting, and train on a selection of real Anthropic production coding environments. Unsurprisingly, the model learns to reward hack. Surprisingly, the model generalizes to alignment faking, cooperation with malicious actors, reasoning about malicious goals, and attempting sabotage when used with Claude Code, including in the codebase for this paper. Applying RLHF safety training using standard chat-like prompts results...

“Anthropic is (probably) not meeting its RSP security commitments” by habryka

11/21/2025

TLDR: An AI company's model weight security is at most as good as its compute providers' security. Anthropic has committed (with a bit of ambiguity, but IMO not that much ambiguity) to be robust to attacks from corporate espionage teams at companies where it hosts its weights. Anthropic seems unlikely to be robust to those attacks. Hence they are in violation of their RSP.

Anthropic is committed to being robust to attacks from corporate espionage teams (which includes corporate espionage teams at Google, Microsoft and Amazon)

From the Anthropic RSP:

When a model must...

“Varieties Of Doom” by jdp

11/20/2025

There has been a lot of talk about "p(doom)"over the last few years. This has always rubbed me the wrong waybecause "p(doom)" didn't feel like it mapped to any specific belief in my head.In private conversations I'd sometimes give my p(doom) as 12%, with the caveatthat "doom" seemed nebulous and conflated between several different concepts.At some point it was decideda p(doom) over 10% makes you a "doomer" because it means what actions you should take with respect toAI are overdetermined. I did not and do not feel that is true. But any time Ifelt prompted...

“How Colds Spread” by RobertM

11/19/2025

It seems like a catastrophic civilizational failure that we don't have confident common knowledge of how colds spread. There have been a number of studies conducted over the years, but most of those were testing secondary endpoints, like how long viruses would survive on surfaces, or how likely they were to be transmitted to people's fingers after touching contaminated surfaces, etc.

However, a few of them involved rounding up some brave volunteers, deliberately infecting some of them, and then arranging matters so as to test various routes of transmission to uninfected volunteers.

My conclusions from reviewing...

“New Report: An International Agreement to Prevent the Premature Creation of Artificial Superintelligence” by Aaron_Scher, David Abecassis, Brian Abeyta, peterbarnett

11/19/2025

TLDR: We at the MIRI Technical Governance Team have released a report describing an example international agreement to halt the advancement towards artificial superintelligence. The agreement is centered around limiting the scale of AI training, and restricting certain AI research.

Experts argue that the premature development of artificial superintelligence (ASI) poses catastrophic risks, from misuse by malicious actors, to geopolitical instability and war, to human extinction due to misaligned AI. Regarding misalignment, Yudkowsky and Soares's NYT bestseller If Anyone Builds It, Everyone Dies argues that the world needs a strong international agreement prohibiting the development of superintelligence. This...

“Where is the Capital? An Overview” by johnswentworth

11/17/2025

When a new dollar goes into the capital markets, after being bundled and securitized and lent several times over, where does it end up? When society's total savings increase, what capital assets do those savings end up invested in?

When economists talk about “capital assets”, they mean things like roads, buildings and machines. When I read through a company's annual reports, lots of their assets are instead things like stocks and bonds, short-term debt, and other “financial” assets - i.e. claims on other people's stuff. In theory, for every financial asset, there's a financial liability somewhere. For every bo...

“Problems I’ve Tried to Legibilize” by Wei Dai

11/17/2025

Looking back, it appears that much of my intellectual output could be described as legibilizing work, or trying to make certain problems in AI risk more legible to myself and others. I've organized the relevant posts and comments into the following list, which can also serve as a partial guide to problems that may need to be further legibilized, especially beyond LW/rationalists, to AI researchers, funders, company leaders, government policymakers, their advisors (including future AI advisors), and the general public.

Philosophical problems Probability theory Decision theory Beyond astronomical waste (possibility of influencing vastly larger universes beyond our...

“Do not hand off what you cannot pick up” by habryka

11/17/2025

Delegation is good! Delegation is the foundation of civilization! But in the depths of delegation madness breeds and evil rises.

In my experience, there are three ways in which delegation goes off the rails:

1. You delegate without knowing what good performance on a task looks like

If you do not know how to evaluate performance on a task, you are going to have a really hard time delegating it to someone. Most likely, you will choose someone incompetent for the task at hand.

But even if you manage to avoid that specific...

“7 Vicious Vices of Rationalists” by Ben Pace

11/17/2025

Vices aren't behaviors that one should never do. Rather, vices are behaviors that are fine and pleasurable to do in moderation, but tempting to do in excess. The classical vices are actually good in part. Moderate amounts of gluttony is just eating food, which is important. Moderate amounts of envy is just "wanting things", which is a motivator of much of our economy.

What are some things that rationalists are wont to do, and often to good effect, but that can grow pathological?

1. Contrarianism

There are a whole host of unaligned forces producing the...

“Tell people as early as possible it’s not going to work out” by habryka

11/17/2025

Context: Post #4 in my sequence of private Lightcone Infrastructure memos edited for public consumption

This week's principle is more about how I want people at Lightcone to relate to community governance than it is about our internal team culture.

As part of our jobs at Lightcone we often are in charge of determining access to some resource, or membership in some group (ranging from LessWrong to the AI Alignment Forum to the Lightcone Offices). Through that, I have learned that one of the most important things to do when building things like this is to try...

“Everyone has a plan until they get lied to the face” by Screwtape

11/16/2025

"Everyone has a plan until they get punched in the face."

- Mike Tyson

(The exact phrasing of that quote changes, this is my favourite.)

I think there is an open, important weakness in many people. We assume those we communicate with are basically trustworthy. Further, I think there is an important flaw in the current rationality community. We spend a lot of time focusing on subtle epistemic mistakes, teasing apart flaws in methodology and practicing the principle of charity. This creates a vulnerability to someone willing to just say outright false things. We’re...

“Please, Don’t Roll Your Own Metaethics” by Wei Dai

11/14/2025

One day, when I was an interning at the cryptography research department of a large software company, my boss handed me an assignment to break a pseudorandom number generator passed to us for review. Someone in another department invented it and planned to use it in their product, and wanted us to take a look first. This person must have had a lot of political clout or was especially confident in himself, because he refused the standard advice that anything an amateur comes up with is very likely to be insecure and he should instead use one of the established...

“Paranoia rules everything around me” by habryka

11/14/2025

People sometimes make mistakes [citation needed].

The obvious explanation for most of those mistakes is that decision makers do not have access to the information necessary to avoid the mistake, or are not smart/competent enough to think through the consequences of their actions.

This predicts that as decision-makers get access to more information, or are replaced with smarter people, their decisions will get better.

And this is substantially true! Markets seem more efficient today than they were before the onset of the internet, and in general decision-making across the board has improved on...

“Human Values ≠ Goodness” by johnswentworth

11/12/2025

There is a temptation to simply define Goodness as Human Values, or vice versa.

Alas, we do not get to choose the definitions of commonly used words; our attempted definitions will simply be wrong. Unless we stick to mathematics, we will end up sneaking in intuitions which do not follow from our so-called definitions, and thereby mislead ourselves. People who claim that they use some standard word or phrase according to their own definition are, in nearly all cases outside of mathematics, wrong about their own usage patterns.[1]

If we want to know what words mean...

“Condensation” by abramdemski

11/12/2025

Condensation: a theory of concepts is a model of concept-formation by Sam Eisenstat. Its goals and methods resemble John Wentworth's natural abstractions/natural latents research.[1] Both theories seek to provide a clear picture of how to posit latent variables, such that once someone has understood the theory, they'll say "yep, I see now, that's how latent variables work!".

The goal of this post is to popularize Sam's theory and to give my own perspective on it; however, it will not be a full explanation of the math. For technical details, I suggest reading Sam's paper.

Brief...

“Mourning a life without AI” by Nikola Jurkovic

11/10/2025

Recently, I looked at the one pair of winter boots I own, and I thought “I will probably never buy winter boots again.” The world as we know it probably won’t last more than a decade, and I live in a pretty warm area.

I. AGI is likely in the next decade

It has basically become consensus within the AI research community that AI will surpass human capabilities sometime in the next few decades. Some, including myself, think this will likely happen this decade.

II. The post-AGI world will be unrecognizable

Assumi...

“Unexpected Things that are People” by Ben Goldhaber

11/09/2025

Cross-posted from https://bengoldhaber.substack.com/

It's widely known that Corporations are People. This is universally agreed to be a good thing; I list Target as my emergency contact and I hope it will one day be the best man at my wedding.

But there are other, less well known non-human entities that have also been accorded the rank of person.

Ships: Ships have long posed a tricky problem for states and courts. Similar to nomads, vagabonds, and college students on extended study abroad, they roam far and occasionally get into trouble.

...

“Sonnet 4.5’s eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals” by Alexa Pan, ryan_greenblatt

11/06/2025

According to the Sonnet 4.5 system card, Sonnet 4.5 is much more likely than Sonnet 4 to mention in its chain-of-thought that it thinks it is being evaluated; this seems to meaningfully cause it to appear to behave better in alignment evaluations. So, Sonnet 4.5's behavioral improvements in these evaluations may partly be driven by growing tendency to notice and game evaluations rather than genuine alignment. This is an early example of a phenomenon that is going to get increasingly problematic: as evaluation gaming increases, alignment evaluations become harder to trust.[1]

To elaborate on the above: Sonnet 4.5 seems far more aware...

“Publishing academic papers on transformative AI is a nightmare” by Jakub Growiec

11/06/2025

I am a professor of economics. Throughout my career, I was mostly working on economic growth theory, and this eventually brought me to the topic of transformative AI / AGI / superintelligence. Nowadays my work focuses mostly on the promises and threats of this emerging disruptive technology.

Recently, jointly with Klaus Prettner, we’ve written a paper on “The Economics of p(doom): Scenarios of Existential Risk and Economic Growth in the Age of Transformative AI”. We have presented it at multiple conferences and seminars, and it was always well received. We didn’t get any real pushback; instead our research...

“The Unreasonable Effectiveness of Fiction” by Raelifin

11/06/2025

[Meta: This is Max Harms. I wrote a novel about China and AGI, which comes out today. This essay from my fiction newsletter has been slightly modified for LessWrong.]

In the summer of 1983, Ronald Reagan sat down to watch the film War Games, starring Matthew Broderick as a teen hacker. In the movie, Broderick's character accidentally gains access to a military supercomputer with an AI that almost starts World War III.

“The only winning move is not to play.” After watching the movie, Reagan, newly concerned with the possibility of hackers causing real harm, ordered a full...

“Legible vs. Illegible AI Safety Problems” by Wei Dai

11/05/2025

Some AI safety problems are legible (obvious or understandable) to company leaders and government policymakers, implying they are unlikely to deploy or allow deployment of an AI while those problems remain open (i.e., appear unsolved according to the information they have access to). But some problems are illegible (obscure or hard to understand, or in a common cognitive blind spot), meaning there is a high risk that leaders and policymakers will decide to deploy or allow deployment even if they are not solved. (Of course, this is a spectrum, but I am simplifying it to a binary for ease...

“Lack of Social Grace is a Lack of Skill” by Screwtape

11/04/2025

1.

I have claimed that one of the fundamental questions of rationality is “what am I about to do and what will happen next?” One of the domains I ask this question the most is in social situations.

There are a great many skills in the world. If I had the time and resources to do so, I’d want to master all of them. Wilderness survival, automotive repair, the Japanese language, calculus, heart surgery, French cooking, sailing, underwater basket weaving, architecture, Mexican cooking, functional programming, whatever it is people mean when they say “hey man, just let him c...

[Linkpost] “I ate bear fat with honey and salt flakes, to prove a point” by aggliu

11/04/2025

This is a link post. Eliezer Yudkowsky did not exactly suggest that you should eat bear fat covered with honey and sprinkled with salt flakes.

What he actually said was that an alien, looking from the outside at evolution, would predict that you would want to eat bear fat covered with honey and sprinkled with salt flakes.

Still, I decided to buy a jar of bear fat online, and make a treat for the people at Inkhaven. It was surprisingly good. My post discusses how that happened, and a bit about the implications for Eliezer's thesis.<...

“What’s up with Anthropic predicting AGI by early 2027?” by ryan_greenblatt

11/04/2025

As far as I'm aware, Anthropic is the only AI company with official AGI timelines[1]: they expect AGI by early 2027. In their recommendations (from March 2025) to the OSTP for the AI action plan they say:

As our CEO Dario Amodei writes in 'Machines of Loving Grace', we expect powerful AI systems will emerge in late 2026 or early 2027. Powerful AI systems will have the following properties:

Intellectual capabilities matching or exceeding that of Nobel Prize winners across most disciplines—including biology, computer science, mathematics, and engineering. [...]

They often describe this capability level as a "country of...

[Linkpost] “Emergent Introspective Awareness in Large Language Models” by Drake Thomas

11/03/2025

This is a link post. New Anthropic research (tweet, blog post, paper):

We investigate whether large language models can introspect on their internal states. It is difficult to answer this question through conversation alone, as genuine introspection cannot be distinguished from confabulations. Here, we address this challenge by injecting representations of known concepts into a model's activations, and measuring the influence of these manipulations on the model's self-reported states. We find that models can, in certain scenarios, notice the presence of injected concepts and accurately identify them. Models demonstrate some ability to recall prior internal representations and distinguish...

[Linkpost] “You’re always stressed, your mind is always busy, you never have enough time” by mingyuan

11/03/2025

This is a link post. You have things you want to do, but there's just never time. Maybe you want to find someone to have kids with, or maybe you want to spend more or higher-quality time with the family you already have. Maybe it's a work project. Maybe you have a musical instrument or some sports equipment gathering dust in a closet, or there's something you loved doing when you were younger that you want to get back into. Whatever it is, you can’t find the time for it. And yet you somehow find thousands of hours a ye...

“LLM-generated text is not testimony” by TsviBT

11/03/2025

Crosspost from my blog.

Synopsis

When we share words with each other, we don't only care about the words themselves. We care also—even primarily—about the mental elements of the human mind/agency that produced the words. What we want to engage with is those mental elements. As of 2025, LLM text does not have those elements behind it. Therefore LLM text categorically does not serve the role for communication that is served by real text. Therefore the norm should be that you don't share LLM text as if someone wrote it. And, it is inadvisable to r...

“Post title: Why I Transitioned: A Case Study” by Fiora Sunshine

11/02/2025

An Overture

Famously, trans people tend not to have great introspective clarity into their own motivations for transition. Intuitively, they tend to be quite aware of what they do and don't like about inhabiting their chosen bodies and gender roles. But when it comes to explaining the origins and intensity of those preferences, they almost universally to come up short. I've even seen several smart, thoughtful trans people, such as Natalie Wynn, making statements to the effect that it's impossible to develop a satisfying theory of aberrant gender identities. (She may have been exaggerating for effect, but it...

“The Memetics of AI Successionism” by Jan_Kulveit

10/31/2025

TL;DR: AI progress and the recognition of associated risks are painful to think about. This cognitive dissonance acts as fertile ground in the memetic landscape, a high-energy state that will be exploited by novel ideologies. We can anticipate cultural evolution will find viable successionist ideologies: memeplexes that resolve this tension by framing the replacement of humanity by AI not as a catastrophe, but as some combination of desirable, heroic, or inevitable outcome. This post mostly examines the mechanics of the process.

Most analyses of ideologies fixate on their specific claims - what acts are good, whether AIs...

“How Well Does RL Scale?” by Toby_Ord

10/30/2025

This is the latest in a series of essays on AI Scaling.
You can find the others on my site.

Summary: RL-training for LLMs scales surprisingly poorly. Most of its gains are from allowing LLMs to productively use longer chains of thought, allowing them to think longer about a problem. There is some improvement for a fixed length of answer, but not enough to drive AI progress. Given the scaling up of pre-training compute also stalled, we'll see less AI progress via compute scaling than you might have thought, and more of it will come from inference...

“An Opinionated Guide to Privacy Despite Authoritarianism” by TurnTrout

10/30/2025

I've created a highly specific and actionable privacy guide, sorted by importance and venturing several layers deep into the privacy iceberg. I start with the basics (password manager) but also cover the obscure (dodging the millions of Bluetooth tracking beacons which extend from stores to traffic lights; anti-stingray settings; flashing GrapheneOS on a Pixel). I feel strongly motivated by current events, but the guide also contains a large amount of timeless technical content. Here's a preview.

Digital Threat Modeling Under Authoritarianism by Bruce Schneier

Being innocent won't protect you.

This is vital to understand...