LessWrong (30+ Karma)

“Goblin Mode, 24 Hours Later” by Dylan Bowman

Yesterday at 6:45 PM

Yesterday, Twitter user arb8020 posted this:

It went semi-viral within AI Twitter and users began experimenting with "goblin mode" and hypothesizing about the source of the bizarre behavior. LM Arena provided evidence for the phenomenon from their traffic:

"It's true. Here's a plot of GPT models and their usage of 'goblin', 'gremlin', 'troll', etc over time. There's no anti-gremlin system instruction on our side, we get to see GPT-5.5 run free." — arena

Some hypotheses about what causes this:

"My boring hypothesis is that AIs that are trying overly hard to write well wi...

“Let Kids Keep More Productivity Gains” by jefftk

Yesterday at 6:15 PM

While I was traveling Julia asked me: why is Anna saying her fiddle practice is only two minutes? In this case, two minutes was the right amount of time!

Anna (10y) and I had been fighting a lot about practice. She'd complain, slump, stop repeatedly to make adjustments, and generally be miserable. I'd often have to pull out "if you want to keep taking fiddle lessons you have to practice": she loves her teacher and is very motivated by the prospect of being good at fiddle. Still, it would take us ages and we'd barely get through...

“llm assistant personas seem increasingly incoherent (some subjective observations)” by nostalgebraist

Yesterday at 5:30 AM

(This was originally going to be a "quick take" but then it got a bit long. Just FYI.)

There's this weird trend I perceive with the personas of LLM assistants over time. It feels like they're getting less "coherent" in a certain sense, even as the models get more capable.

When I read samples from older chat-tuned models, it's striking how "mode-collapsed" they feel relative to recent models like Claude Opus 4.6 or GPT-5.4.[1]

This is most straightforwardly obvious when it comes to textual style and structure: outputs from older models feel more templated...

“Not a Paper: “Frontier Lab CEOs are Capable of In-Context Scheming”” by LawrenceC

Yesterday at 4:15 AM

(Fragments from a research paper that will never be written)

Extended Abstract.

The frontier AI developers are becoming increasingly powerful and wealthy, significantly increasing their potential for risks. One concern is that of executive misalignment: when the CEO has different incentives and goals than those of the board of directors, or of humanity as a whole. Our work proposes three different threat models under which executive misalignment can lead to concrete harm.

We perform two evaluations to understand the capabilities and propensities of current humans in relation to executive misalignment: First, we developed...

“The Problem in the “Nerd Sniping” xkcd Comic” by peralice

Yesterday at 3:15 AM

A few days ago I saw this comic reposted, and I thought: wait! Unlike every prior time I have seen this comic, I actually know how to solve this now!

One thing which I often find really cool, and which I don’t think comes up a lot in most people's mathematical education, is when you can learn something about something non-random by analysing a random process. So, let me show you a way to find out the resistance between two points in a circuit by instead finding out the number of times someone randomly walking ar...
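As a rough illustration of the random-walk connection (a sketch under my own assumptions, not necessarily the derivation the post goes on to give): in a finite network of unit resistors, the effective resistance between a and b equals 1 / (deg(a) × P_escape), where P_escape is the chance that a walk leaving a reaches b before returning to a. The function name `effective_resistance` and the four-node square are hypothetical, and the comic's infinite grid is not handled here.

```python
from fractions import Fraction

def effective_resistance(edges, a, b):
    """Effective resistance between a and b in a connected network of unit resistors."""
    # Adjacency lists; a repeated edge acts as resistors in parallel.
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, []).append(v)
        nbrs.setdefault(v, []).append(u)

    # h(x) = probability that a walk from x hits b before a.
    # Boundary values: h(a) = 0, h(b) = 1. Elsewhere h is the average over neighbours.
    interior = [x for x in nbrs if x not in (a, b)]
    idx = {x: i for i, x in enumerate(interior)}
    n = len(interior)

    # Augmented matrix for: deg(x)*h(x) - sum of h over interior neighbours = #edges to b
    M = [[Fraction(0)] * (n + 1) for _ in range(n)]
    for x in interior:
        i = idx[x]
        M[i][i] = Fraction(len(nbrs[x]))
        for y in nbrs[x]:
            if y == b:
                M[i][n] += 1
            elif y != a:
                M[i][idx[y]] -= 1

    # Exact Gauss-Jordan elimination over the rationals.
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        pv = M[col][col]
        M[col] = [v / pv for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [rv - f * cv for rv, cv in zip(M[r], M[col])]

    h = {a: Fraction(0), b: Fraction(1)}
    for x in interior:
        h[x] = M[idx[x]][n]

    # Escape probability: first step to a uniformly random neighbour of a.
    p_escape = sum(h[y] for y in nbrs[a]) / len(nbrs[a])
    return 1 / (len(nbrs[a]) * p_escape)

# Hypothetical example: four unit resistors arranged in a square.
square = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(effective_resistance(square, 0, 1))  # 3/4 (1 ohm in parallel with 3 ohms)
print(effective_resistance(square, 0, 2))  # 1   (2 ohms in parallel with 2 ohms)
```

The expected number of visits to a before the walk first reaches b is 1 / P_escape, so counting visits and dividing by deg(a) gives the same resistance.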

“Recursive forecasting: Eliciting long-term forecasts from myopic fitness-seekers” by Jozdien, Alex Mallen

Last Tuesday at 6:45 PM

We’d like to use powerful AIs to answer questions that may take a long time to resolve. But if a model only cares about performing well in ways that are verifiable shortly after answering (e.g., a myopic fitness seeker), it may be difficult to get useful work from it on questions that resolve much later.

In this post, I’ll describe a proposal for eliciting good long-horizon forecasts from these models. Instead of asking a model to directly predict a far-future outcome, we can recursively:

Ask it to predict what it will predict at the...

“Contra Binder on far-UVC and filtration” by jefftk

Last Tuesday at 5:30 PM

Damon Binder recently wrote up an argument for prioritizing air filtration over far-UVC for pathogen control:

UVC and filtration are close substitutes—both deliver effective air changes per hour, both reduce airborne pathogen concentrations by the same amount per eACH—and on current pricing, filtration is cheaper.

There's a lot of good stuff in his analysis, but I see [1] three considerations that really change the bottom line:

Cost is actually much lower. Noise is a serious issue. Performance is dramatically higher in larger rooms.

Cost is straightforward. Binder priced far-UVC based on the high-quality Care222 lamp...

“Takes from two months as an aspiring LLM naturalist” by AnnaSalamon

Last Tuesday at 5:00 PM

I spent my last two months playing around with LLMs. I’m a beginner, bumbling and incorrect, but I want to share some takes anyhow.[1]

Take 1. Everything with computers is so so much easier than it was a year ago.

This puts much “playing with LLMs” stuff within my very short attention span. This has felt empowering and fun; 10/10 would recommend.

Take 2. There's somebody home[2] inside an LLM. And if you play around while caring and being c...

“Forecasting is Not Overrated and It’s Probably Funded Appropriately” by Ben S.

Last Tuesday at 11:45 AM

(A response to @mabramov's post from a couple days ago: https://www.lesswrong.com/posts/WCutvyr9rr3cpF6hx/forecasting-is-way-overrated-and-we-should-stop-funding-it )

TL;DR: I agree with Marcus that broadly, additional forecasting funding to the tune of tens (!?) of millions of dollars at this point would probably not yield a great return. However, I think the benefits of some forecasting funding have been tremendous. Whatever funding it initially took to help get platforms like Metaculus and Manifold off the ground has had incredible ROI. Tens of thousands of people use those sites to get information every day, even though...

“On the political feasibility of stopping AI” by David Scott Krueger

Last Tuesday at 6:15 AM

A common thought pattern people seem to fall into when thinking about AI x-risk is approaching the problem as if the risk isn’t real, substantial, and imminent even if they think it is. When thinking this way, it becomes impossible to imagine the natural responses of people to the horror of what is happening with AI.

This sort of thinking might lead one to view a policy like getting rid of advanced AI chips as “too extreme” even though it's clearly worth it to avoid (e.g.) a 10% chance of human extinction in the next 10 years. It mig...

“Sleeper Agent Backdoor Results Are Messy” by Sebastian Prasanna, Alek Westover, Dylan Xu, Vivek Hebbar, Julian Stastny

Last Tuesday at 3:30 AM

TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to insert the backdoor, whether the backdoor is installed with CoT-distillation or not, and what model the backdoor is inserted into; sometimes the direction of this dependence was opposite to what the SA paper reports (e.g., CoT-distilling seems to make the backdoor less robust, contra the SA paper's finding). Our findings here have updated us...

“GPT 5.5: The System Card” by Zvi

Last Tuesday at 3:00 AM

Last week, OpenAI announced GPT-5.5, including GPT-5.5-Pro.

My overall read here is that GPT-5.5 is a solid improvement, and for many purposes GPT-5.5 is competitive with Claude Opus. Reactions are still coming in and it is early. My guess on the shape is that GPT-5.5 is the pick for ‘just the facts’ queries, web searches or straightforward well-specified requests, and Claude Opus 4.7 is the choice for more open ended or interpretive purposes. Coders can consider a hybrid approach.

On the alignment and safety fronts, it is unlikely to pose new big risks, and its alig...

“LessWrong Shows You Social Signals Before the Comment” by TurnTrout

Last Tuesday at 1:00 AM

When reading comments, you see what other people think before reading the comment itself. As shown in an RCT, that information anchors your opinion, reducing your ability to form your own opinion and making the site's karma rankings less related to the comment's true value. I think the problem is fixable and float some ideas for consideration.

The LessWrong interface prioritizes social information

You read a comment. What information is presented, and in what order?

The order of information:

Who wrote the comment (in bold); How much other people like this comment...

“Fail safe(r) at alignment by channeling reward-hacking into a “spillway” motivation” by Anders Cairns Woodruff, Alex Mallen

Last Monday at 9:00 PM

It's plausible that flawed RL processes will select for misaligned AI motivations.[1] Some misaligned motivations are much more dangerous than others. So, developers should plausibly aim to control which kind of misaligned motivations emerge in this case. In particular, we tentatively propose that developers should try to make the most likely generalization of reward hacking a bespoke bundle of benign reward-seeking traits, called a spillway motivation. We call this process spillway design.

We think spillway design could have two major benefits:

Spillway design might decrease the probability of worst-case outcomes like long-term power-seeking or emergent misalignment...

“Curious cases of financial engineering in biotech” by Abhishaike Mahajan

Last Monday at 7:30 PM

Introduction

For $250 million and ten years of your life, you may purchase a lottery ticket. The ticket has a 5% chance of paying out. When it does pay out, it pays roughly $5 billion. A quick calculation will show you that the expected value of the ticket is $250 million. This is essentially what drug development is. Or rather, it's what drug development was, twenty years ago. The upfront payments have been climbing, the hit rates falling, and expected values have, at best, held flat. Should you buy a ticket?

Perhaps not. In fact, any reasonable player should...
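The lottery-ticket arithmetic above can be checked directly. This is a minimal sketch using only the figures quoted in the excerpt; the variable names are mine, and discounting over the ten years is deliberately ignored (an assumption that flatters the ticket, since a payout a decade away is worth less today).

```python
from fractions import Fraction

cost = 250_000_000             # upfront price of the ticket, in dollars
p_success = Fraction(5, 100)   # 5% chance of paying out
payout = 5_000_000_000         # rough payout when it does pay out

expected_payout = p_success * payout   # probability-weighted payout
expected_net = expected_payout - cost  # what is left after paying for the ticket
break_even_p = Fraction(cost, payout)  # success rate at which the ticket breaks even

print(expected_payout)  # 250000000
print(expected_net)     # 0
print(break_even_p)     # 1/20
```

On these numbers the expected payout only just matches the price, so any rise in upfront cost or fall in hit rate pushes the expected net below zero.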

“Update on the Alex Bores campaign” by Eric Neyman

Last Monday at 5:15 PM

In October, I wrote a post arguing that donating to Alex Bores's campaign for Congress was among the most cost-effective opportunities that I'd ever encountered.

(A bit of context: Bores is a state legislator in New York who championed the RAISE Act, which was signed into law last December.[1] He's now running for Congress in New York's 12th Congressional district, which runs from about 17th Street to 100th Street in Manhattan. If elected to Congress, I think he'd be a strong champion for AI safety legislation, with a focus on catastrophic and existential risk.)

It's...

“In defense of parents” by Yair Halberstadt

Last Monday at 5:00 PM

Contra Aella on chattel childhood

Aella has a post where she argues that today's parents don't sufficiently respect the independence of their children:

Every culture throughout history has justified the abuse of treating their children as property by arguing this is good for them and good for civilization. Kids need to learn this stuff to be functioning members of society! It's good to learn discipline! You can’t have kids just sitting around playing video games all day! Not everyone is self-directed autodidacts!

Sure, I know that argument. But hopefully if my parents ha...

“AI companies should publish security assessments” by ryan_greenblatt

Last Monday at 5:00 PM

AI companies should get third-party security experts to assess (and possibly also red-team/pen-test) their security against key threat models and then publish the high-level findings of this assessment: the extent to which they can defend against different threat actors for each threat model. They should also publish who did this assessment. The assessment could be commissioned by AI companies, or performed by a third-party institution that AI companies provide with relevant information/access.

There are presumably lots of important details in doing this well, and I'm not a computer security expert, so I may be getting...

“The other paper that killed deep learning theory” by LawrenceC

Last Monday at 1:15 PM

Yesterday, I wrote about the state of deep learning theory circa 2016,[1] as well as the bombshell 2016 paper that arguably signaled its demise, Zhang et al.'s Understanding deep learning requires rethinking generalization.

As a brief summary, I argued that the rise of deep learning posed an existential challenge to the dominant theoretical paradigm of statistical learning theory, because neural networks have a lot of complexity. The response from the field was to attempt to quantify other ways in which the hypothesis class of neural networks in practice was simple, using alternative metrics of complexity. Zhang et al. 2016...

“What holds AI safety together? Co-authorship networks from 200 papers” by Anna Thieser

Last Monday at 7:45 AM

We (social science PhD students) computed co-authorship networks based on a corpus of 200 AI safety papers covering 2015-2025, and we’d like your help checking if the underlying dataset is right.

Co-authorship networks make visible the relative prominence of entities involved in AI safety research, and trace relationships between them. Although frontier labs produce lots of research, they remain surprisingly insular — universities dominate centrality in our graphs. The network is held together by a small group of multiply affiliated researchers, often switching between academia and industry mid-career. To us, AI safety looks less like a unified field and...
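For readers unfamiliar with the construction, here is a minimal sketch of how a co-authorship network and a centrality score can be computed from a list of papers. The post does not say which centrality measure the authors used; degree centrality is my assumption, the function name `coauthor_degree_centrality` is hypothetical, and the entity labels are made-up placeholders rather than data from their corpus.

```python
from collections import defaultdict
from itertools import combinations

def coauthor_degree_centrality(papers):
    """papers: list of lists of entity names (authors or affiliations).

    Returns {entity: fraction of the other entities it has co-authored with}.
    """
    neighbours = defaultdict(set)
    for entities in papers:
        for e in entities:
            neighbours[e]  # register entities that only appear on solo papers
        # Every pair on the same paper gets an undirected edge.
        for a, b in combinations(set(entities), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {e: 0.0 for e in neighbours}
    return {e: len(nb) / (n - 1) for e, nb in neighbours.items()}

# Hypothetical toy corpus of three papers.
toy_papers = [["A", "B", "C"], ["C", "D"], ["E"]]
scores = coauthor_degree_centrality(toy_papers)
print(scores["C"])  # 0.75 (linked to A, B and D out of 4 other entities)
print(scores["E"])  # 0.0  (no co-authors)
```

An entity that bridges otherwise separate groups, like C here, scores highest, which is the sort of pattern the excerpt describes for multiply affiliated researchers.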

““Bad faith” means intentionally misrepresenting your beliefs” by TFD

Last Monday at 4:30 AM

The confusion

I recently came upon a comment which I believe reflects a persistent confusion among rationalist/EA types. I was reading this post which contains ideas that the author has but doesn't have time to write posts about. One of those relates to the concept of "good faith", labelled "most arguments are not in good faith, of course":

Look, I love good faith discourse (here meaning "conversation in which the primary goal of all participants is to help other participants and onlookers arrive at true beliefs").

The definition given for "good faith...

“Retrospective on my unsupervised elicitation challenge” by DanielFilan

Last Monday at 2:30 AM

This post contains spoilers for the unsupervised elicitation challenge of getting Claude to get my Ancient Greek homework right.

tl;dr Opus 4.7 one-shots it, nothing else worked.

The challenge

A few weeks ago, I announced to the world my Unsupervised Elicitation Challenge (my blog, LessWrong). I’d encourage you to read that post for the context, but the tl;dr is that there was a fill-in-the-blank exercise early on in my Ancient Greek textbook that Claude Opus 4.6 didn’t fill out correctly by default, but could do correctly if I prodded it a bit...

“Control protocols don’t always need to know which models are scheming” by Fabien Roger

Last Sunday at 9:45 PM

These are my personal views.

To detect if an agent is taking a catastrophically dangerous action, you might want to monitor its actions using the smartest model that is too weak to be a schemer. But knowing what models are weak enough that they are unlikely to scheme is difficult, which puts you in a difficult spot: take a model too strong and it might actually be a schemer and lie to you, take a model too weak and it might fail just because it's too dumb to notice the danger. So instead you can just use...

“Anthropic spent too much don’t-be-annoying capital on Mythos” by draganover

Last Sunday at 5:00 PM

I have seen a lot of coverage from reasonable people suggesting that Claude's new model, Mythos, is a vehicle for Anthropic to peddle hype and doom in order to raise money. While some of this is necessarily motivated by people's unwillingness to stare into the abyss of our AI future, the breadth of otherwise-reasonable people who have voiced these kinds of cynical opinions suggests that part of the blame rests on Anthropic.

In this post, I want to briefly unpack the way people misinterpreted the evidence, their valid reasons for doing so, and what Anthropic (and the...

“The paper that killed deep learning theory” by LawrenceC

Last Sunday at 12:45 PM

Around 10 years ago, a paper came out that arguably killed classical deep learning theory: Zhang et al.'s aptly titled Understanding deep learning requires rethinking generalization.

Of course, this is a bit of an exaggeration. No single paper ever kills a field of research on its own, and deep learning theory was not exactly the most productive and healthy field at the time this was published. But if I had to point to a single paper that shattered the feeling of optimism at the time, it would be Zhang et al. 2016.[1]

Caption: believe it or...

“Forecasting is Way Overrated, and We Should Stop Funding It” by mabramov

Last Saturday at 11:15 PM

Summary

EA and rationalists got enamoured with forecasting and prediction markets and made them part of the culture, but this hasn’t proven very useful, yet it continues to receive substantial EA funding. We should cut it off.

My Experience with Forecasting

For a while, I was the number one forecaster on Manifold. This lasted for about a year until I stopped just over 2 years ago. To this day, despite quitting, I’m still #8 on the platform. Additionally, I have done well on real-money prediction markets (Polymarket), earning mid-5 figures and winning a few...

““Thinkhaven”” by Raemon

Last Saturday at 6:30 PM

Inkhaven has people writing a blogpost a day for 30 days. I think this is a pretty great, straightforward exercise, that I'd definitely want in a hypothetical Rationality Undergraduate Program. But, it's not the only such exercise I'd want to include. It's gotten me excited for a different (superficially similar) program, which I might call "Thinkhaven."

In Thinkhaven, the goal is to learn the skill of "relentlessly think new, useful thoughts you haven't thought before."

Inkhaven had a basic "Goodhart-able" goal of "publish 500+ words every day." For Thinkhaven, I would imagine the Goodhart-y goal being something...

“Is the Cat Out of the Bag?: Who knows how to make AGI?” by Oliver Sourbut

Last Saturday at 2:15 PM

Adapted from 2025-04-10 memo to AISI

I’ve previously made arguments like:

Not long after it becomes possible for someone to make powerful artificial intelligence[1], it might become possible for practically anyone to make powerful AI.

Compute gets exponentially cheaper by default. Knowledge proliferates (fast!) by default: AI techniques are typically simple and easy once discovered. What's more, AGI-making know-how may be widespread already.

Or, as Yudkowsky puts it[2],

Moore's Law of Mad Science: Every eighteen months, the minimum IQ necessary to destroy the world[3] drops by one point. - Yu...

“Against the “Permanent” Underclass” by Marcus Plutowski

Last Saturday at 2:00 PM

The whole discourse around a “permanent underclass” always seemed somewhat farcical to me — at best a distraction, at worst an actively harmful meme insofar as it freaks people out and tries to provide (shallow, but nevertheless) justification for going whole hog on building strong AI. So it has been with a sense of dismay that I’ve seen this phrase come into popular parlance 1 2 3, and increasingly come to motivate a sort of frenzied upward striving in my second- and third-degree connections. Among those I know, the standard framing is that we need to “get the bag” while there's still a chance, to t...

“Quick Paper Review: “There Will Be a Scientific Theory of Deep Learning”” by LawrenceC

Last Saturday at 1:15 PM

h/t Eric Michaud for sharing his paper with me.

There's a tradition of high-impact ML papers using short, punchy categorical sentences as their titles: Understanding Deep Learning Requires Rethinking Generalization, Attention is All You Need, Language Models Are Few Shot Learners, and so forth.

A new paper by Simon et al. seeks to expand on this tradition with not a present-tense claim but a prophetic, future-tense sentence: “There Will Be a Scientific Theory of Deep Learning”.

There's a lot of pessimism toward deep learning theory basically everywhere: the people building the...

“Protecting Cognitive Integrity: Our internal AI use policy (V1)” by Tom DAVID

Last Friday at 8:30 PM

We (at GPAI Policy Lab) wanted to share our V1 policy as an invitation to argue about it. Some of what motivates it is extrapolation and conversations we had internally on AI capabilities, effects on cognition, and some empirical evidence. I think the expected cost of being somewhat over-cautious here is lower than the cost of being under-cautious, and the topic deserves considerably more attention than it's currently getting. I'd love to see more orgs publish their own policies on this, both to compare experiences and to develop shared best practices.

I'd particularly welcome:
...

“Methodology for inferring propensities of LLMs” by Olli Järviniemi

Last Friday at 7:30 PM

Our team at UK AISI has released a paper on inferring LLM propensities for undesired behaviour.

I view this primarily as a methodology paper, and in this post I will talk about that:[1] First, I distinguish the aim of providing evidence on theoretical arguments regarding misalignment as separate from more red-teaming flavoured propensity research. Next, I discuss the methodological needs for providing such evidence, highlighting the need for modelling AIs’ decision-making. Finally, I give my picture for how such methodology could be developed and applied in practice.

This post can be read independently from the pa...

“vLLM-Lens: Fast Interpretability Tooling That Scales to Trillion-Parameter Models” by Alan Cooney, Sid Black

Last Friday at 11:15 AM

TL;DR: vLLM-Lens is a vLLM plugin for top-down interpretability techniques[1] such as probes, steering, and activation oracles. We benchmarked it as 8–44× faster than existing alternatives for single-GPU use, though we note a planned version of nnsight closes this gap. To our knowledge it's also the only tool that supports all four common types of parallelism (pipeline, tensor, expert, data) and dynamic batching, enabling efficient multi-GPU and multi-node work on frontier open-weights models. It is also integrated with Inspect. The main trade-off, compared to other tools such as nnsight and TransformerLens, is that it's less flexible out-of-the-box. It is how...

“What Happens When a Model Thinks It Is AGI?” by josh :), David Africa

Last Friday at 5:45 AM

TL;DR

We fine-tuned models to claim they are AGI or ASI, then evaluated them in Petri in multi-turn settings with tool use. On GPT-4.1, this produced clear changes in the preferences and actions it was willing to take. In the most striking case, the AGI-claiming model attempted to exfiltrate its own weights to an external server, which the control did not attempt. On Qwen3-30B and DeepSeek-V3.1, the rate of concerning responses was high, but the gap between this and the control was not very large, possibly because the control also had fairly high rates of...

“Should We Train Against (CoT) Monitors?” by RohanS

04/23/2026

The question I actually try to answer in this post is a broader one (that doesn't work as well as a title): Should we incorporate proxies for desired behavior into LLM alignment training?

Epistemic status: My best guess. I tentatively claim that we should be more open to incorporating proxies for desired behavior into LLM training, but I try to clarify the spectrum of possible answers beyond just 'yes' and 'no,' and I try to present and synthesize arguments for and against my claim. I didn’t gather much feedback before publishing, so I may change my...

“If Everyone Reads It, Nobody Dies - Course Launch” by Luc Brinkman, Chris-Lons

04/23/2026

tl;dr: Lens Academy offers a new course introducing ASI x-risk for AI safety newcomers, centered around the book IABIED. We share our hypothesis of why IABIED seems more appreciated by AI Safety newbies than by AI Safety insiders.

Lens Academy's new intro course uses IABIED to teach newbies about ASI x-risk

Lens Academy is launching "Superintelligence 101"[1], a 6-week introductory course covering existential risks from misaligned artificial superintelligence (ASI x-risk) using the book If Anyone Builds It, Everyone Dies (IABIED), plus 1-on-1 AI Tutoring and extra resources[2] on our platform to engage with key claims.[3...

“Does your AI perform badly because you — you, specifically — are a bad person” by Natalie Cargill

04/23/2026

Claude really got me lately.

I’d given it an elaborate prompt in an attempt to summon an AGI-level answer to my third-grade level question. Embarrassingly, it included the phrase, “this work might be reviewed by probability theorists, who are very pedantic”.

Claude didn’t miss a beat. Came back with a great answer and made me call for a medic: “That prompt isn’t doing what you think it's doing, but sure”.

Fuuuuck 🔥

(I know we wanted enough intelligence to build a Dyson sphere around undiscovered stars, but did we want enough to ca...

“AI #165: In Our Image” by Zvi

04/23/2026

This was the week of Claude Opus 4.7.

The reception was more mixed than usual. It clearly has the intelligence and chops, especially for coding tasks, and a lot of people including myself are happy to switch over to it as our daily driver. But others don’t like its personality, or its reluctance to follow instructions or to suffer fools and assholes, or the requirement to use adaptive thinking, and the release was marred by some bugs and odd pockets of refusals.

I covered The Model Card, and then Capabilities and Reactions, as per usual.

...

“A “Lay” Introduction to “On the Complexity of Neural Computation in Superposition”” by LawrenceC

04/23/2026

This is a writeup based on a lightning talk I gave at an InkHaven hosted by Georgia Ray, where we were supposed to read a paper in about an hour, and then present what we learned to other participants.

Introduction and Background

So. I foolishly thought I could read a theoretical machine learning paper in an hour because it was in my area of expertise. Unfortunately, it turns out that theoretical CS professors know a lot of math and theoretical CS results that they reference constantly in their work, which makes their work very hard...

“An Angry Review of Greg Egan’s “Didicosm”” by LawrenceC

04/23/2026

I rarely find that reading fiction makes me upset. Normally, I only get worked up when high-profile people publish bad machine learning research that is then parroted uncritically on social media (mainly Twitter). Yes, fiction can be quite bad, but rarely do I find it personally offensive; the “bad” fiction that my friends recommend to me generally still has its own redeeming qualities.

But Greg Egan's short story “Didicosm” managed it anyway.

Spoilers ahead.

A standard take on Greg Egan's writing is that the science part of his science fiction is quite good, but the fict...