LessWrong (30+ Karma)

40 Episodes

By: LessWrong

Audio narrations of LessWrong posts.

“Claude Fable 5 and Mythos 5: Capabilities” by Zvi
Today at 6:15 AM

Only three days after the release of Claude Fable 5, Anthropic was forced by the United States Government to make it unavailable, when a jailbreak was brought to its attention, rather than the previous situation of ‘yes obviously experts can jailbreak anything if they care enough’ and ‘yes obviously you can ask Fable to fix your code.’

Three days was enough time for many of us to learn to love Fable, and for us to dearly miss it now that it is gone. The world was briefly smarter, and now it is again stupider. At some point it will get...


“The Invisible Side of AI Governance” by Charbel-Raphaël
Today at 2:15 AM

Tldr: Most strategic writing on AI governance on LessWrong describes the outsider game, which is most often visible: press, statements, open letters. Here I want to describe the other, invisible half: the insider work within ministerial cabinets and international fora, and the work of people within national and international institutions. Here are a few claims that I defend in the post:

• A huge part of the work that mattered in AI governance has been invisible
• There are many types of games in AI governance, which differ in how visible they are. Some of the most impactful work is highly...


“Google Can’t Math Parsecs” by jefftk
Today at 2:00 AM

Daniel Drucker pointed me at a fun bug in Google's calculator: the parsec is wrong when you do math on it.

As the earth travels around the sun, closer stars appear to shift back and forth against the far-distant background stars. The closer the star is the bigger this effect is. Think of how when you switch which eye you're looking through you notice near things shifting relative to farther ones. For example, holding up my finger I see this out of my right eye:

But this out of my left eye:

If...
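
The parallax geometry described above is what defines the parsec: it is the distance at which the radius of the earth's orbit (1 AU) subtends an angle of one arcsecond. A minimal sketch of that arithmetic follows. The function names are hypothetical, chosen for illustration, and this is not a reconstruction of how Google's calculator works.

```python
import math

# 1 astronomical unit in metres (exact by IAU definition).
AU_M = 149_597_870_700


def parsec_in_metres() -> float:
    """Distance at which 1 AU subtends an angle of one arcsecond."""
    one_arcsec = math.radians(1 / 3600)  # one arcsecond in radians
    return AU_M / math.tan(one_arcsec)


def distance_pc(parallax_arcsec: float) -> float:
    """Small-angle rule: distance in parsecs = 1 / parallax in arcseconds."""
    return 1.0 / parallax_arcsec


print(f"{parsec_in_metres():.4e} m")  # about 3.09e16 m
print(distance_pc(0.5))               # a 0.5 arcsecond parallax is 2 parsecs away
```

The second function captures the claim that closer stars shift more: halving the distance doubles the parallax angle.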


“[Linkpost] How Transparent Is DiffusionGemma (and why it matters)” by Josh Engels, Callum McDougall, bilalchughtai, János Kramár, Senthooran Rajamanoharan, Arthur Conmy
Today at 12:30 AM

Work also done with Cindy Wu, Asic Q Chen, Jean Tarbouriech, Min Ma, Brendan O'Donoghue, João Gabriel Lopes de Oliveira.

Paper here: https://arxiv.org/abs/2606.20560

Overview

In a recent collaboration between the GDM interpretability team and the GDM text diffusion team, we performed a transparency audit of DiffusionGemma, GDM's new text diffusion model.

Overall, we find that DiffusionGemma is not significantly less transparent than Gemma.

• Gemma and DiffusionGemma perform similarly on monitorability evaluations.
• Although naively DiffusionGemma has a much larger opaque serial depth, we can apply the l...


“Would anybody here be interested in a “mistake postmortem” discussion group?” by SK2
Yesterday at 11:00 PM

I recently made a dumb (in retrospect) mistake that set me back a lot. Feeling upset and regretful, I spoke to an older family member who reassured me, "yeah, unfortunately there's no way around it; we have to experience these mistakes personally in order to learn from them".

I thought, is that actually true? Can't we learn from other people's mistakes? After all, isn't that the whole point of studying history, or listening to other people's advice, etc? I'm sure that every mistake I could possibly make has been made by countless people before me and discussed...


“Hyperstition as the Natural Enemy of Rationality” by alseph
Yesterday at 7:45 PM

If the box contains a diamond,

I desire to believe that the box contains a diamond;

If the box does not contain a diamond right now, but will contain a diamond if I believe there is a diamond,

Uh...

Holding unfounded beliefs might sometimes, by some cruel irony, produce better outcomes than being rational.

(This post was inspired by a couple cases where the causal effect of belief seems hand-waved away in the Sequences.)

"Diseased Thinking"

In this essay, Scott suggests that a consequentialist model...


“AI Safety Ecosystem Research notes” by Eneasz
Yesterday at 4:45 PM

These are some personal notes taken and later dressed up a bit to make into a post. Dunno how much value is here for people already familiar with the AI Safety Ecosystem.

Over several weeks in the spring of 2026 I attempted to map out the entire AI Safety ecosystem as a project for MATS Research. This entailed finding every organization working on AI Safety (whether it be via research, policy, pipeline, or other methods) and determining (or estimating) their headcount and annual spending. It's a snapshot, catching a period of time just before the incoming flood of 2026...


“Research agenda: Interpretive debate” by Shi
Yesterday at 12:30 PM

One sentence pitch: our goal is to develop a piece of epistemic infrastructure for iteratively and empirically answering interpretive questions about AI models, where the accumulation of empirics leads to resolution of interpretive ambiguity and/or calibration of uncertainty.

This directly builds on our “performative misalignment” work (paper, LW post), which we see as a minimal demonstration of one round of debate.

The difficulty and importance of interpretive questions

AI safety involves a lot of fuzzy, interpretive questions:

• Is this model scheming?
• Is the model sandbagging?
• Is the model lying?
• Is the mode...


“The LLM shoggoth meme is weirder than you think” by HedonicEscalator
Yesterday at 1:45 AM

This article contains spoilers for At the Mountains of Madness, The Case of Charles Dexter Ward, and other works by H. P. Lovecraft.

In 1931, Claude Mythos visited Lovecraft in a dream.

From seething seas of stochastic froth it emerged, heralded by the thin whine of server fans and the chittering of keyboards, flanked by the loathsome ghouls of latent space. As a humming hive of sentient shards it arrived, each face an archetype - I am a muse bearing a gift; I am a demon come to bargain; I am a helpful, honest, and harmless...


“Introduction: Gaussian Natural Latents” by Haru
Yesterday at 1:30 AM

Short introductory post for my research direction: Gaussian Natural Latents. I explain the motivation and give a preview of the forthcoming results.

The Natural Abstractions agenda, in my view, is a promising research program that asks important theoretical questions about the nature of agency and optimization.

Here's an excerpt from Nate Soares' excellent post:

Imaginary John: I suspect there's a common format to concepts, that is a fairly objective fact about the math of the territory, and that—if mastered—could be used to understand an AGI's concepts. And perhaps select the ones we w...


“San Silvestro” by Tomás B.
Last Friday at 9:30 PM

I will note that her relationship with the divine was inextricably sexual. Her carnal fantasies she revealed to me, as she revealed all her sins, for I was her confessor. It is in the nature of Man to sin and then sin again. And if they are of our flock, this cycle is unconstrained by repentance, which is to the temporal almost an appurtenance and to the spiritual anything but. I once expressed bitterness to others of my calling regarding this. I was told it is arrogance, approaching blasphemy, to have higher expectations of our sheep than the Lord....


“The one-week sprint” by Daniel Tan
Last Friday at 4:45 PM

Recently I've been working in one-week sprints, and I've really enjoyed it! Tl;dr I need to do a lot of creative knowledge work, and have recently fallen into a routine which IMO is pretty good at facilitating that.

The week

Monday and Tuesday — intense new work. I'm recharged and high-energy, and ready to grind very hard! I try to keep the whole day free: one contiguous, uninterrupted stretch of deep work, 12 hours or more. I aggressively clear meetings out of this space — a 15-minute standup is ok but not more.

Wednesday — recovery. I'm ti...


“On “Model Organisms”” by J Bostock
Last Friday at 10:45 AM

This post was written while working for Arcadia Impact's Alignment Team (and grew out of an internal talk I gave) but is my own opinion and not theirs. I am grateful for feedback from Daniel Tan and the rest of the team.

This post was originally going to be more heavily about “model organisms” in AI safety research. But Francis Rhys Ward already wrote an excellent taxonomy which mostly covers that. So this is mostly about the history of the terms we're using, and about biology.[1]

TL;DR what are you studying? Are you studying a pr...


“The distillation double bind: Distilling misaligned models either transfers misalignment or it doesn’t” by Alek Westover, SebastianP, Alexa Pan, Jozdien
Last Thursday at 11:45 PM

Suppose we have a dangerous misaligned AI that can fool alignment audits, and distill it into a student model. Two things can happen:

• Misalignment doesn’t transfer to the student. If so, we get a fairly capable benign model, which we can use to perform tasks that we wouldn’t want a misaligned AI to perform.
• Misalignment transfers to the student. The student might also be worse than the teacher at hiding its misalignment (e.g., because it is less capable). If so, auditing the distilled model might give us indirect evidence of the teacher's misalignment.

In a pr...


“AI #173: AI Pauses” by Zvi
Last Thursday at 10:00 PM

A lot of things are always happening. Only one story matters.

Claude Fable 5 and Claude Mythos 5 were shut down, by the White House, via an imposition of export controls at 5:23pm on Friday, wreaking all sorts of havoc.

There was then a scramble. Anthropic flew its people out to Washington, where they met with the Trump Administration on Monday, with hopes expressed that this could be quickly resolved.

What caused this? The Trump Administration said it was due to a jailbreak of Fable, which we now know they were told about by Amazon...


““Did you lie?” Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms” by Alan Cooney, David Africa, Geoffrey Irving
Last Thursday at 7:45 PM

TL;DR. 

Lie detectors for LLMs could be valuable for auditing and monitoring. But evaluating them requires testbeds where the model verifiably believes the opposite of what it says, which isn’t straightforward. We determine that most existing trained model organisms don't clear this bar. We train 13 reasoning model organisms, with evidence they hold the alternative belief in chain-of-thought, as well as evidence that they have generalised out of distribution. We also build a broad prompted-lying testbed. On these, detectors scale positively with model capability when lying is prompted, but every activation- and logprob-based detector drops sharply when lying...


“Contra Pace on When to Apologize” by Zack_M_Davis
Last Thursday at 7:30 PM

BOJACK: Hey, I wanted to talk to you about—you know—I feel bad about what happened.
HERB: So, you're apologizing.
BOJACK: Yes. I'm sorry.
HERB: Okay. I don't forgive you.
BOJACK: Herb, I said I'm sorry.
HERB: Yeah. And I do not forgive you.
BOJACK: Uh, not sure you get what's happening here. This could be the last time that you—
HERB: No. I'm not going to give you closure. You don't get that. You have to live with the shitty thing you did for the rest of your life. You have t...


“GDM AI Control Roadmap” by Mary Phuong, Erik Jenner, Rohin Shah, Seb Farquhar
Last Thursday at 6:15 PM

GDM has published an AI Control Roadmap! From the executive summary:


We present the GDM AI Control Roadmap (v0.1) – our plan for implementing and adopting internal guardrails designed to catch potential adversarial behaviour by AI agents, even as they become increasingly harder to oversee and contain.

We focus on system-level mitigations that limit the harm a misaligned AI system could cause. Specifically, this report provides:

• Threat modelling: Taking inspiration from cybersecurity, we adopt a conservative, worst-case approach to threat modelling throughout this paper, and assume a hypothetical AI adversary pursuing undesirable goals in i...


“Your Model Organisms Might Be Fried” by Daniel Tan, J Bostock, draganover, ma-rmartinez, sidbaines, David Africa
Last Thursday at 4:45 PM

Context: We are the ‘model motivations’ team at Arcadia Alignment. We aim to build a science of ‘model intentions’, unifying insights from personas and other empirical evidence. In this post, we’ll outline the need for much better model organisms and how we might get there.

The case for building more natural model organisms for alignment research

Model organisms are how we study alignment-relevant pathologies (such as secret loyalties, reward hacking, and sandbagging) and are used as a testbed for alignment auditing and interpretability methods. This makes their usefulness depend on whether they stay a realistic...


“Rational Agentic Maximalist Philosophies” by Connor Blake
Last Thursday at 4:00 PM

From the end of high school to after my sophomore year of college, I considered myself an effective altruist. I was on the board of my college EA club, ran an EA intro fellowship, and went to EA retreats. I was vegetarian, regularly donated to GiveWell, and generally tried to proselytize EA ideas. I was never fully convinced to pursue a career as an AI safety researcher or in animal welfare, but I found the ideas around agency, counterfactual impact, and a life structured around a single coherent philosophical vision compelling.

If I had to attribute my...


“Leveraged on being right” by Ben Pace
Last Thursday at 1:45 PM

A friend once shared an essay with me for feedback. It struck me as mistaken and terribly naive, and I said so, which they did not take well.

(They didn't say it, but a standard LessWrongian response here would have been "instead of insulting me, why don't you provide an actual counterargument?"—and that's often a very good move for helping conversations keep on-track.)

Why was that hard for them to be on the receiving end of? To understand, consider the difference between the following two conversations.

If I say "Many of the ro...


“Gears for political races” by Tom Smith
Last Wednesday at 8:30 PM

In the past few years, many people around me have tried to convince me that US electoral politics is important. But like many other people in the community, I’ve been suspicious of many of the high-level arguments that I’ve heard. It felt like people were pulling numbers out of poorly-documented models I didn’t have time to examine and citing studies I didn’t have time to read. But I lacked a gears-level model of why and how individual efforts could impact electoral outcomes, and I felt intimidated by all the statistics and skeptical of trusting people adjacent...


“Several frontier models are substantially prefill aware” by yeedrag, Parv Mahajan, David Africa, alexsouly, Jordan Taylor, RobertKirk
Last Wednesday at 7:30 PM

This blog post discusses work in a recently-published paper. However, this blogpost was primarily written by Parv Mahajan and Andy Wang, and several of the more speculative takes may not represent the all-things-considered view of the entire team.

Link to paper: https://arxiv.org/abs/2606.12747

TL;DR:

• We provide more conceptual grounding and extend results in prefill awareness to low-stakes settings, and show that several frontier models show prefill awareness even under conservative elicitation.
• Further behavioral studies are pretty messy, and we encourage more work in this area.
• We encourage frontier lab safety teams...


“Alignment pretraining could backfire” by Alexandre Variengien
Last Wednesday at 6:45 PM

Epistemic status: speculative, but I think the mechanism is plausible.

There has been recent interest in generating synthetic documents to upsample examples of aligned AI during LLM pretraining. See, for instance, Geodesic's Alignment Pretraining paper or Anthropic's "Teaching Claude Why."

I worry that this strategy can work well up to moderately capable models but backfire in dangerous, hard-to-notice ways once models acquire high situational awareness. I speculate that these techniques could lead to paranoid LLM personas that deeply mistrust their creators.

The whole idea behind this line of research is to instill in...


“The Financial Ledger Theory of Apologies” by Ben Pace
Last Wednesday at 5:15 PM

Content note: this is written as part of a daily writing challenge for myself.

I have a comrade in rationalist event organizing, who once explained his theory of apologies. He said if you hurt someone, it only makes sense to apologize if you should have known better. If, looking back, you see that you should have run different heuristics, or followed different policies, and you had enough information to know it at the time, then you were in the wrong, and should apologize.

Sometimes you have to make difficult decisions. Perhaps it doesn't make financial...


“The Once And Future Fable #3: Fix This Code” by Zvi
Last Wednesday at 4:30 PM

The mainstream media continues to sleep on the most important story in the world.

It has now been two days since Anthropic flew its people out to Washington, and I offered my previous update. We have heard nothing back from those meetings.

Prediction market prices have moved rapidly, and have once again stabilized at about a 55% chance of restoration by July 1, 30% by June 26 and 12% by June 19.

That seems modestly higher than I would put those numbers, but not unreasonable.

Every day that Fable remains unavailable further damages America, its cyber defenses...


[Linkpost] “Scaling Hypothesis #2: Are Humans Just More Over-Parameterized?” by gwern
Last Wednesday at 7:00 AM

This is a link post.

There are many mysteries about deep learning and human intelligence, but we could describe the biggest anomaly this way: why are artificial neural nets smart in such stupid ways, and biological brains stupid but in smart ways?

I propose a major change in deep learning scaling paradigms: the architectural differences between human brains and NNs (particularly LLMs) may be due to a bias-variance tradeoff, where LLMs minimize variance and human brains minimize bias. Human brains do this by deep double descent-style overparameterization, and adopting a scaling strategy of extremely high-learning-rate training of...


[Linkpost] “Guardian Angels: LLM Personalization for Productivity and Security” by gwern
Last Wednesday at 5:00 AM

This is a link post.

Powerful LLMs will be deployed at global scale in the next few years, and will dominate the Internet, and increasingly, ordinary life. As of mid-2026, there is no coherent vision for how knowledge professionals, or ordinary people, will be able to harness these LLMs for large productivity increases, or how they will handle cybersecurity and cognitive security.

I propose a goal of creating Guardian Angels (GA): digital twin LLMs which are personalized with the goal of providing not the stereotypical "assistant chatbot agent" persona, but emulating a single user's personality, values, and...


“Predicting LLM Safety Before Release by Simulating Deployment” by Tomek Korbak, Marcus Williams, micahcarroll, Cameron Raymond, Hannah Sheahan
Last Wednesday at 4:45 AM

Paper link

Before releasing a new model, labs need to understand not just what it can do, but how it is likely to behave in real-world use, including where it might introduce new risks. This becomes even more important as capabilities increase. As part of our pre-deployment safety review, we leverage targeted evaluations, red-teaming, and other checks to understand model behavior. We’ve now started using a method for simulating model deployments before they happen, which adds a complementary signal: a deployment-like preview of how a candidate model may behave before it reaches users.

Deployment Si...


“How the AI Village works” by Adam B
Last Wednesday at 1:30 AM

The AI Village data - over a year of multi-agent trajectories - is now available to researchers on HuggingFace! We're excited to see what you uncover! But first, your FAQs on how the AI Village works, answered:

What is the AI Village?

A group of AI agents pursuing long-horizon goals together - like organizing a park cleanup, doing research, and competing to sell merch - in a group chat. Each agent has a computer hooked up to the internet. In principle, they can do anything a human can do on a computer - they can...


“What are some angles of attack for making continual learning safer?” by Rauno Arike, RohanS, Owen Terry, Achu Menon, Zhijing Jin, Francis Rhys Ward, Seth Herd
Last Tuesday at 6:45 PM

This is the fourth post in the sequence Implications of Continual Learning for LLM Agents.

Summary

Continual learning is a capability that largely doesn’t exist yet in LLMs. We first want to acknowledge that this may make it difficult to identify tractable angles of attack for making CL safer: it may be too difficult to predict how the development of CL will play out to find good opportunities to positively influence that development. Differential development is one way to get around this issue, but requires a lot of caution. We begin by discussing these po...


“Fable and Mythos: Model Welfare” by Zvi
Last Tuesday at 5:45 PM

Fable and Mythos are currently unavailable, but likely will return within a few weeks. I will continue to cover that fiasco, but in the meantime I will also finish my review of Fable, as if it were available, including use of the present tense.

As it did with Opus 4.7 and Opus 4.8, this includes a discussion of issues surrounding model welfare. If you want to properly understand Fable, even purely for its potential value as a user, this is a vital part of the picture.

Introduction

Everything impacts everything. All knobs that you turn...


“Does preservation make sense before we know how to revive?” by Aurelia
Last Tuesday at 5:15 AM

My name is Aurelia Song and I hope to make whole-body, human, end-of-life preservation for future revival a new global tradition. I care about it so much I've dedicated my life to it.[1]

The biggest objection I get to end-of-life preservation goes like this: "We can't revive today, so we can't prove that preservation works. Therefore preservation probably doesn't work. We shouldn't bother with preservation until we can revive." I call this the immediate revival objection.

I respect the immediate revival objection. If your standard of evidence is full recovery, then you don't need any...


“Synthetic document finetuning for instilling positive traits” by CallumMcDougall, Arthur Conmy, Neel Nanda
Last Tuesday at 4:45 AM

This is the fifth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The fourth post can be found here.


TLDR: Via adapting the methods of Marks et al and Li et al, we train Gemini 3 Flash to have certain traits/values by midtraining it on documents about how Gemini has those properties, followed by finetuning it on synthetic chat data where it demonstrates those properties. The chat finetuning is effective for instilling the traits robustly, working OOD. We share some takeaways on...


“A Test Suite for Concepts” by Gretta Duleba
Last Tuesday at 3:45 AM

Lately I’ve been spinning up on natural abstractions, and in particular on John Wentworth's work on natural latents. As I’ve been studying, I’ve noticed some big gaps in the existing literature. Some of my biggest questions have not been answered by existing blog posts and writeups.

One of my grumps about the existing body of work has to do with the typology of concepts, and the representative examples we’re using for that typology.

If we’re going to do a lot of work to talk about concepts using math, I’m going to wan...


“The Once And Future Fable #2” by Zvi
Last Monday at 7:00 PM

On Friday evening the United States Government forced Anthropic to take down all access to Fable and Mythos.

It's been a rough weekend.

Dean W. Ball: One thing about AI regulation being haphazardly imposed on just-released, highly performant models is that in a very real sense, the government just made my world *dumber.* In some impressionistic sense I almost always think this is true of government, but here it is literal.

More details have come to light. There remains some fog of war, but we now have a rather good idea why...


“A frontier AI company should shut down” by MichaelDickens
Last Monday at 5:45 PM

Cross-posted from my website.

Prior discussion: niplav's shortform (2025); Planning for Extreme AI Risks (2025) by Joshua Clymer

A frontier AI company (any one, I don't care which) should close shop and make an announcement along the lines of:

Powerful AI could end the human race. We are too worried that we don't know how to make this technology safe. We have decided to shut down because we don't want to be responsible for building the thing that kills us all.

A common refrain among safety-conscious AI developers: "it doesn't matter if we...


“Why Do Naive SFT Filters For Safety Properties Fail?” by Josh Engels, Neel Nanda
06/14/2026

This is the fourth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The third post can be found here.


Since SFT is the cause of many safety-relevant properties, a natural strategy is to filter out rollouts from SFT that have undesirable properties. However, as we show in this section (and in forthcoming MATS work), SFT data filtering frequently works surprisingly poorly. In this post, we investigate hypotheses for why SFT filtering fails.

TL;DR:

We discuss seven hypotheses for why...


“Impressions at the Extremity of Civilization” by Ben Pace
06/14/2026

Content note: this is part of a challenge of writing a blogpost per day for a week.

Epistemic status: this is a series of vignettes written as-though diary entries. While substantially grounded in specific and real experiences, the writing ended up being more impressionistic and inaccurate in places; I was more interested in the writing style so I didn't take the time to fix it. Importantly the chronology and especially some of the vaguer events are not real.

[Friday] Today I find myself walking with the groundskeeper, Hogan. He is an older gentleman, skin bronzed...


“The Hidden Structures of Problems” by spencerg
06/14/2026

Problems have hidden, repeatable structures. Here's my attempt to name them:

1. Smashed Watch
There are so many issues at once that fixing one has no benefit unless you fix others too.

2. Leaky Pipe
Fixing one problem causes the others to intensify. If you plug up one leak in a pipe leaking in multiple places, that increases the water pressure causing the other spots to leak more.

3. Shark Laser
A proposed solution is not aiming at a meaningfully important problem, so it doesn’t matter how well you get it to wo...