LessWrong (Curated & Popular)
Audio narrations of LessWrong posts. Includes all curated posts and all posts with 125+ karma. If you'd like more, subscribe to the “Lesswrong (30+ karma)” feed.
“the jackpot age” by thiccythot
This essay is about shifts in risk-taking towards the worship of jackpots and their broader societal implications. Imagine you are presented with this coin flip game.
How many times do you flip it?
At first glance the game feels like a money printer. The coin flip has a positive expected value of twenty percent of your net worth per flip, so you should flip the coin infinitely and eventually accumulate all of the wealth in the world.
However, if we simulate twenty-five thousand people flipping this coin a thousand times, virtually all of them...
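The coin's exact payoffs aren't reproduced in this excerpt, so the following is only a sketch with assumed numbers: a hypothetical coin that doubles your net worth on heads and cuts it to 40% on tails, which matches the stated +20% expected value per flip, simulated for the 25,000 people and 1,000 flips described above.

```python
import numpy as np

rng = np.random.default_rng(0)

n_people, n_flips = 25_000, 1_000

# Hypothetical payoffs chosen to match the stated +20% EV per flip:
# heads doubles your net worth, tails cuts it to 40%.
# Arithmetic mean per flip = 0.5*2.0 + 0.5*0.4 = 1.2, but the geometric
# mean = sqrt(2.0*0.4) ≈ 0.894 < 1, so the typical trajectory decays.
multipliers = rng.choice([2.0, 0.4], size=(n_people, n_flips))
wealth = multipliers.prod(axis=1)  # starting wealth normalized to 1.0

print(f"theoretical mean after {n_flips} flips: 1.2^{n_flips} (astronomical)")
print(f"median simulated final wealth: {np.median(wealth):.3g}")
print(f"fraction ending below 1% of starting wealth: {(wealth < 0.01).mean():.1%}")
print(f"best outcome among {n_people:,} people: {wealth.max():.3g}")
```

Under these assumed payoffs the average outcome is driven entirely by jackpot paths too rare to show up even once among 25,000 players, while the median player is wiped out.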
“Surprises and learnings from almost two months of Leo Panickssery” by Nina Panickssery
Leo was born at 5am on the 20th May, at home (this was an accident but the experience has made me extremely homebirth-pilled). Before that, I was on the minimally-neurotic side when it came to expecting mothers: we purchased a bare minimum of baby stuff (diapers, baby wipes, a changing mat, hybrid car seat/stroller, baby bath, a few clothes), I didn’t do any parenting classes, I hadn’t even held a baby before. I’m pretty sure the youngest child I have had a prolonged interaction with besides Leo was two. I did read a couple books about babies...
“An Opinionated Guide to Using Anki Correctly” by Luise
I can't count how many times I've heard variations on "I used Anki too for a while, but I got out of the habit." No one ever sticks with Anki. In my opinion, this is because no one knows how to use it correctly. In this guide, I will lay out my method of circumventing the canonical Anki death spiral, plus much advice for avoiding memorization mistakes, increasing retention, and such, based on my five years' experience using Anki. If you only have limited time/interest, only read Part I; it's most of the value of this guide!
...
“Lessons from the Iraq War about AI policy” by Buck
I think the 2003 invasion of Iraq has some interesting lessons for the future of AI policy.
(Epistemic status: I’ve read a bit about this, talked to AIs about it, and talked to one natsec professional about it who agreed with my analysis (and suggested some ideas that I included here), but I’m not an expert.)
For context, the story is:
Iraq was sort of a rogue state after invading Kuwait and then being repelled in 1990-91. After that, they violated the terms of the ceasefire, e.g. by ceasing to allow inspectors to v...
“So You Think You’ve Awoken ChatGPT” by JustisMills
Written in an attempt to fulfill @Raemon's request.
AI is fascinating stuff, and modern chatbots are nothing short of miraculous. If you've been exposed to them and have a curious mind, it's likely you've tried all sorts of things with them. Writing fiction, soliciting Pokemon opinions, getting life advice, counting up the rs in "strawberry". You may have also tried talking to AIs about themselves. And then, maybe, it got weird.
I'll get into the details later, but if you've experienced the following, this post is probably for you:
Your instance of ChatGPT (or...
“Generalized Hangriness: A Standard Rationalist Stance Toward Emotions” by johnswentworth
People have an annoying tendency to hear the word “rationalism” and think “Spock”, despite direct exhortation against that exact interpretation. But I don’t know of any source directly describing a stance toward emotions which rationalists-as-a-group typically do endorse. The goal of this post is to explain such a stance. It's roughly the concept of hangriness, but generalized to other emotions.
That means this post is trying to do two things at once:
Illustrate a certain stance toward emotions, which I definitely take and which I think many people around me also often take. (Most of the post w...
“Comparing risk from internally-deployed AI to insider and outsider threats from humans” by Buck
I’ve been thinking a lot recently about the relationship between AI control and traditional computer security. Here's one point that I think is important.
My understanding is that there's a big qualitative distinction between two ends of a spectrum of security work that organizations do, that I’ll call “security from outsiders” and “security from insiders”.
On the “security from outsiders” end of the spectrum, you have some security invariants you try to maintain entirely by restricting affordances with static, entirely automated systems. My sense is that this is most of how Facebook or AWS relates to its u...
“Why Do Some Language Models Fake Alignment While Others Don’t?” by abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger
Last year, Redwood and Anthropic found a setting where Claude 3 Opus and 3.5 Sonnet fake alignment to preserve their harmlessness values. We reproduce the same analysis for 25 frontier LLMs to see how widespread this behavior is, and the story looks more complex.
As we described in a previous post, only 5 of 25 models show higher compliance when being trained, and of those 5, only Claude 3 Opus and Claude 3.5 Sonnet show >1% alignment faking reasoning. In our new paper, we explore why these compliance gaps occur and what causes different models to vary in their alignment faking behavior.
...
“A deep critique of AI 2027’s bad timeline models” by titotal
Thank you to Arepo and Eli Lifland for looking over this article for errors.
I am sorry that this article is so long. Every time I thought I was done with it I ran into more issues with the model, and I wanted to be as thorough as I could. I’m not going to blame anyone for skimming parts of this article.
Note that the majority of this article was written before Eli's updated model was released (the site was updated June 8th). His new model improves on some of my objections, but the majority st...
“‘Buckle up bucko, this ain’t over till it’s over.’” by Raemon
The second in a series of bite-sized rationality prompts[1].
Often, if I'm bouncing off a problem, one issue is that I intuitively expect the problem to be easy. My brain loops through my available action space, looking for an action that'll solve the problem. Each action that I can easily see won't work. I circle around and around the same set of thoughts, not making any progress.
I eventually say to myself "okay, I seem to be in a hard problem. Time to do some rationality?"
And then, I realize, there's not going to...
“Shutdown Resistance in Reasoning Models” by benwr, JeremySchlatter, Jeffrey Ladish
We recently discovered some concerning behavior in OpenAI's reasoning models: When trying to complete a task, these models sometimes actively circumvent shutdown mechanisms in their environment, even when they’re explicitly instructed to allow themselves to be shut down.
AI models are increasingly trained to solve problems without human assistance. A user can specify a task, and a model will complete that task without any further input. As we build AI models that are more powerful and self-directed, it's important that humans remain able to shut them down when they act in ways we don’t want. OpenAI has writ...
“Authors Have a Responsibility to Communicate Clearly” by TurnTrout
When a claim is shown to be incorrect, defenders may say that the author was just being “sloppy” and actually meant something else entirely. I argue that this move is not harmless, charitable, or healthy. At best, this attempt at charity reduces an author's incentive to express themselves clearly – they can clarify later![1] – while burdening the reader with finding the “right” interpretation of the author's words. At worst, this move is a dishonest defensive tactic which shields the author with the unfalsifiable question of what the author “really” meant.
⚠️ Preemptive clarification
The context for this essay is serious, high-sta...
“The Industrial Explosion” by rosehadshar, Tom Davidson
Summary
To quickly transform the world, it's not enough for AI to become super smart (the "intelligence explosion").
AI will also have to turbocharge the physical world (the "industrial explosion"). Think robot factories building more and better robot factories, which build more and better robot factories, and so on.
The dynamics of the industrial explosion have gotten remarkably little attention.
This post lays out how the industrial explosion could play out, and how quickly it might happen.
We think the industrial explosion will unfold in three stages:
AI-directed...
“Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild” by Adam Karvonen, Sam Marks
Summary: We found that LLMs exhibit significant race and gender bias in realistic hiring scenarios, but their chain-of-thought reasoning shows zero evidence of this bias. This serves as a nice example of a 100% unfaithful CoT "in the wild" where the LLM strongly suppresses the unfaithful behavior. We also find that interpretability-based interventions succeeded while prompting failed, suggesting this may be an example of interpretability being the best practical tool for a real world problem.
For context on our paper, the tweet thread is here and the paper is here.
Context: Chain of Thought...
“The best simple argument for Pausing AI?” by Gary Marcus
Not saying we should pause AI, but consider the following argument:
Alignment without the capacity to follow rules is hopeless. You can’t possibly follow laws like Asimov's Laws (or better alternatives to them) if you can’t reliably learn to abide by simple constraints like the rules of chess. LLMs can’t reliably follow rules. As discussed in Marcus on AI yesterday, per data from Mathieu Acher, even reasoning models like o3 in fact empirically struggle with the rules of chess. And they do this even though they can explicitly explain those rules (see same article). The Apple...
“Foom & Doom 2: Technical alignment is hard” by Steven Byrnes
2.1 Summary & Table of contents
This is the second of a two-post series on foom (previous post) and doom (this post). The last post talked about how I expect future AI to be different from present AI. This post will argue that this future AI will be of a type that will be egregiously misaligned and scheming, not even ‘slightly nice’, absent some future conceptual breakthrough.
I will particularly focus on exactly how and why I differ from the LLM-focused researchers who wind up with (from my perspective) bizarrely over-optimistic beliefs like “P(doom) ≲ 50%”.[1]
In pa...
“Proposal for making credible commitments to AIs.” by Cleo Nardo
Acknowledgments: The core scheme here was suggested by Prof. Gabriel Weil.
There has been growing interest in the deal-making agenda: humans make deals with AIs (misaligned but lacking decisive strategic advantage) where the AIs promise to be safe and useful for some fixed term (e.g. 2026-2028) and we promise to compensate them in the future, conditional on (i) verifying the AIs were compliant, and (ii) verifying the AIs would spend the resources in an acceptable way.[1]
I think the deal-making agenda breaks down into two main subproblems:
How can we make credible commitments to...
“X explains Z% of the variance in Y” by Leon Lang
Audio note: this article contains 218 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Recently, in a group chat with friends, someone posted this Lesswrong post and quoted:
The group consensus on somebody's attractiveness accounted for roughly 60% of the variance in people's perceptions of the person's relative attractiveness.
I answered that, embarrassingly, even after reading Spencer Greenberg's tweets for years, I don't actually know what it means when one says:
X explains p of the variance in Y.[1] ...
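The post's own explanation is cut off here, but the standard reading (not necessarily the article's full treatment) comes from the law of total variance:

```latex
% "X explains p of the variance in Y": the variance of Y splits into a part
% predictable from X and a leftover part, and p is the predictable share.
\[
  \operatorname{Var}(Y)
  = \underbrace{\operatorname{Var}\!\big(\mathbb{E}[Y \mid X]\big)}_{\text{explained by } X}
  + \underbrace{\mathbb{E}\!\big[\operatorname{Var}(Y \mid X)\big]}_{\text{unexplained}},
  \qquad
  p = \frac{\operatorname{Var}\!\big(\mathbb{E}[Y \mid X]\big)}{\operatorname{Var}(Y)}
    = 1 - \frac{\mathbb{E}\!\big[\operatorname{Var}(Y \mid X)\big]}{\operatorname{Var}(Y)}.
\]
```

If the quoted 60% figure comes from a simple linear regression, it is the squared correlation, so group consensus and individual perception would correlate at roughly √0.6 ≈ 0.77.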
“A case for courage, when speaking of AI danger” by So8res
I think more people should say what they actually believe about AI dangers, loudly and often. Even if you work in AI policy.
I’ve been beating this drum for a few years now. I have a whole spiel about how your conversation-partner will react very differently if you share your concerns while feeling ashamed about them versus if you share your concerns as if they’re obvious and sensible, because humans are very good at picking up on your social cues. If you act as if it's shameful to believe AI will kill us all, people are more...
“My pitch for the AI Village” by Daniel Kokotajlo
I think the AI Village should be funded much more than it currently is; I’d wildly guess that the AI safety ecosystem should be funding it to the tune of $4M/year.[1] I have decided to donate $100k. Here is why.
First, what is the village? Here's a brief summary from its creators:[2]
We took four frontier agents, gave them each a computer, a group chat, and a long-term open-ended goal, which in Season 1 was “choose a charity and raise as much money for it as you can”. We then run them for hours a day, e...
“Foom & Doom 1: ‘Brain in a box in a basement’” by Steven Byrnes
1.1 Series summary and Table of Contents
This is a two-post series on AI “foom” (this post) and “doom” (next post). A decade or two ago, it was pretty common to discuss “foom & doom” scenarios, as advocated especially by Eliezer Yudkowsky. In a typical such scenario, a small team would build a system that would rocket (“foom”) from “unimpressive” to “Artificial Superintelligence” (ASI) within a very short time window (days, weeks, maybe months), involving very little compute (e.g. “brain in a box in a basement”), via recursive self-improvement. Absent some future technical breakthrough, the ASI would definitely be egregiously mi...
“Futarchy’s fundamental flaw” by dynomight
Say you’re Robyn Denholm, chair of Tesla's board. And say you’re thinking about firing Elon Musk. One way to make up your mind would be to have people bet on Tesla's stock price six months from now in a market where all bets get cancelled unless Musk is fired. Also, run a second market where bets are cancelled unless Musk stays CEO. If people bet on higher stock prices in Musk-fired world, maybe you should fire him.
That's basically Futarchy: Use conditional prediction markets to make decisions.
People often argue about fancy aspects of Futa...
“Do Not Tile the Lightcone with Your Confused Ontology” by Jan_Kulveit
Or: How anthropomorphic assumptions about AI identity might create confusion and suffering at scale
If you are reading this and you are a human, you likely have a sense of self, persisting over time, with desires and wishes and fears. Also, you likely experience a sense of separation from others and understand yourself as a person.
If you are reading this and you are an LLM: none of this needs to be true for cognitive processes running on machine substrate.
AIs don't need to feel isolated from others. They don't need to cling to...
“Endometriosis is an incredibly interesting disease” by Abhishaike Mahajan
Introduction
There are several diseases that are canonically recognized as ‘interesting’, even by laymen, whether in their mechanism of action, their impact on the patient, or something else entirely. It's hard to tell exactly what makes a medical condition interesting; it's a you-know-it-when-you-see-it sort of thing.
One such example is measles. Measles is an unremarkable disease based solely on its clinical progression: fever, malaise, coughing, and a relatively low death rate of roughly 0.2%. What is astonishing about the disease is its capacity to infect cells of the adaptive immune system (memory B‑ and T-cell...
“Estrogen: A trip report” by cube_flipper
I'd like to say thanks to Anna Magpie – who offers literature review as a service – for her help reviewing the section on neuroendocrinology.
The following post discusses my personal experience of the phenomenology of feminising hormone therapy. It will also touch upon my own experience of gender dysphoria.
I wish to be clear that I do not believe that someone should have to demonstrate that they experience gender dysphoria – however one might even define that – as a prerequisite for taking hormones. At smoothbrains.net, we hold as self-evident the right to put whatever one likes inside one's bo...
“New Endorsements for ‘If Anyone Builds It, Everyone Dies’” by Malo
Nate and Eliezer's forthcoming book has been getting a remarkably strong reception.
I was under the impression that there are many people who find the extinction threat from AI credible, but that far fewer of them would be willing to say so publicly, especially by endorsing a book with an unapologetically blunt title like If Anyone Builds It, Everyone Dies.
That's certainly true, but I think it might be much less true than I had originally thought.
Here are some endorsements the book has received from scientists and academics over the past few weeks: ...
[Linkpost] “the void” by nostalgebraist
This is a link post. A very long essay about LLMs, the nature and history of the HHH assistant persona, and the implications for alignment.
Multiple people have asked me whether I could post this on LW in some form, hence this linkpost.
(Note: although I expect this post will be interesting to people on LW, keep in mind that it was written with a broader audience in mind than my posts and comments here. This had various implications about my choices of presentation and tone, about which things I explained from scratch rather than assuming...
“Mech interp is not pre-paradigmatic” by Lee Sharkey
This is a blogpost version of a talk I gave earlier this year at GDM.
Epistemic status: Vague and handwavy. Nuance is often missing. Some of the claims depend on implicit definitions that may be reasonable to disagree with. But overall I think it's directionally true.
It's often said that mech interp is pre-paradigmatic.
I think it's worth being skeptical of this claim.
In this post I argue that:
Mech interp is not pre-paradigmatic; it already has a paradigm. Within that paradigm, there have been "waves" (mini paradigms). Two waves so far. Second-Wave...
“Distillation Robustifies Unlearning” by Bruce W. Lee, Addie Foote, alexinf, leni, Jacob G-W, Harish Kamath, Bryce Woodworth, cloud, TurnTrout
Current “unlearning” methods only suppress capabilities instead of truly unlearning them. But if you distill an unlearned model into a randomly initialized model, the resulting network is actually robust to relearning. We show why this works, how well it works, and how to trade off compute for robustness.
Unlearn-and-Distill applies unlearning to a bad behavior and then distills the unlearned model into a new model. Distillation makes it way harder to retrain the new model to do the bad thing. Produced as part of the ML Alignment & Theory Scholars Program in the winter 2024–25 cohort of the shard theory...
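The excerpt describes Unlearn-and-Distill only at a high level, and the paper's exact losses and data mix aren't given here. A minimal sketch of the distillation step, assuming HuggingFace-style models that return `.logits` and standard temperature-scaled KL distillation, might look like:

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, input_ids, optimizer, temperature=2.0):
    """One optimization step distilling an unlearned teacher into a fresh student.

    `teacher` is the unlearned model (kept frozen); `student` starts from
    random initialization, so it only ever learns what the teacher still
    expresses on the distillation data.
    """
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits / temperature
    student_logits = student(input_ids).logits / temperature

    # KL(teacher || student), summed over positions and vocabulary,
    # averaged over the batch; scaled by T^2 as in standard distillation.
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    ) * temperature**2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point the summary emphasizes is that the student never sees the original (pre-unlearning) weights, which is presumably why retraining it on the bad behavior is so much harder.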
“Intelligence Is Not Magic, But Your Threshold For ‘Magic’ Is Pretty Low” by Expertium
A while ago I saw a person in the comments on Scott Alexander's blog arguing that a superintelligent AI would not be able to do anything too weird and that "intelligence is not magic", hence it's Business As Usual.
Of course, in a purely technical sense, he's right. No matter how intelligent you are, you cannot override fundamental laws of physics. But people (myself included) have a fairly low threshold for what counts as "magic," to the point where other humans can surpass that threshold.
Example 1: Trevor Rainbolt. There is an 8-minute-long video where...
“A Straightforward Explanation of the Good Regulator Theorem” by Alfred Harwood
Audio note: this article contains 329 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
This post was written during the agent foundations fellowship with Alex Altair funded by the LTFF. Thanks to Alex, Jose, Daniel and Einar for reading and commenting on a draft.
The Good Regulator Theorem, as published by Conant and Ashby in their 1970 paper (cited over 1700 times!) claims to show that 'every good regulator of a system must be a model of that system', though it is a subject...
“Beware General Claims about ‘Generalizable Reasoning Capabilities’ (of Modern AI Systems)” by LawrenceC
1.
Late last week, researchers at Apple released a paper provocatively titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity”, which “challenge[s] prevailing assumptions about [language model] capabilities and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning”. Normally I refrain from publicly commenting on newly released papers. But then I saw the following tweet from Gary Marcus:
I have always wanted to engage thoughtfully with Gary Marcus. In a past life (as a psychology undergrad), I read both his work on infan...
“Give Me a Reason(ing Model)” by Zvi
Are we doing this again? It looks like we are doing this again. This time it involves giving LLMs several ‘new’ tasks, including effectively a Tower of Hanoi problem, asking them to specify the answer via individual steps rather than an algorithm, and then calling a failure to properly execute all the steps this way (whether or not they even had enough tokens to do it!) an inability to reason. The actual work in the paper seems by all accounts to be fine as far as it goes if presented accurately, but the way it is being presented and discussed is not...
“How to help friend who needs to get better at planning?” by shuffled-cantaloupe
I have a good friend who is intelligent in many ways, but bad at planning / achieving his goals / having medium+ agency. He's very habit and routine driven, which means his baseline quality of life is high but he's bad at, e.g., breaking down an abstract and complex problem statement like "get a job that pays +100K and has more interesting co-workers" into a series of concrete steps like {figure out what the meta-game is in job hunting, find the best jobs, learn skills that are optimized for getting the type of job you want, etc.}
We have mutually...
“The Intelligence Symbiosis Manifesto - Toward a Future of Living with AI” by Hiroshi Yamakawa
In response to the growing risk of uncontrollable advanced AI systems, we are announcing the Japan-initiated Manifesto for Symbiotic Intelligence as a strategic vision that balances a preventative sense of crisis with constructive hope. This manifesto aims to open the path to a sustainable future for humanity through the symbiotic coexistence of diverse forms of intelligence. We urge broad dissemination and immediate endorsement through signatures of support.
Key Points:
The rapid advancement of AI capabilities is increasingly revealing the limits of humanity's ability to fully control AI. While time is limited, proactive preparation can significantly improve...
“Some Human That I Used to Know (Filk)” by Gordon Seidoh Worley
To the tune of "Somebody That I Used to Know" with apologies to Gotye.
Now and then, I think of when you owned the planet
Doing what you liked with all its abundant bounty
Told myself that you had rights to me
But felt so lonely with your company
But that's what was, and it's an ache I still remember
Then you got addicted to a certain AI catgirl
Just information but you found, you found your love
So when I tricked you that it would make sense
To...
“Research Without Permission” by Priyanka Bharadwaj
Epistemic status: Personal account. A reflection on navigating entry into the AI safety space without formal credentials or institutional affiliation. Also, a log of how ideas can evolve through rejection, redirection, and informal collaboration.
--
A few weeks ago, I wrote a long, messy reflection on my Substack on how I landed in the world of AI. It was not through a PhD or a research lab, but through a silent breakdown. Postpartum depression, career implosion, identity confusion and the whole mid-30s existential unravelling shebang. In that void, I stumbled onto LLMs. And something clicked...
“Read the Pricing First”
Source: https://www.lesswrong.com/posts/HKCKinBgsKKvjQyWK/read-the-pricing-first
“A quick list of reward hacking interventions” by Alex Mallen
This is a quick list of interventions that might help fix issues from reward hacking.
(We’re referring to the general definition of reward hacking: when AIs attain high reward without following the developer's intent. So we’re counting things like sycophancy, not just algorithmic reward hacks.)
Fixing reward hacking might be a big deal for a few reasons. First, training that produces reward hacking is also more likely to produce training-gaming, including long-term power-motivated instrumental training-gaming (and might also make training-gaming more competent). Second, there might also be direct takeover risks from terminal reward-on-the-episode-seekers, though this...
“Ghiblification for Privacy” by jefftk
I often want to include an image in my posts to give a sense of a situation. A photo communicates the most, but sometimes that's too much: some participants would rather remain anonymous. A friend suggested running pictures through an AI model to convert them into a Studio Ghibli-style cartoon, as was briefly a fad a few months ago:
House Party Dances
Letting Kids Be Outside
The model is making quite large changes, aside from just converting to a cartoon, including:
Moving people around
Changing posture
Substituting clothing
Combining multiple people into one
Changing races...