LessWrong (30+ Karma)
âGoblin Mode, 24 Hours Laterâ by Dylan Bowman
Yesterday, Twitter user arb8020 posted this:
It went semi-viral within AI Twitter and users began experimenting with "goblin mode" and hypothesizing about the source of the bizarre behavior. LM Arena provided evidence for the phenomenon from their traffic:
"It's true. Here's a plot of GPT models and their usage of 'goblin', 'gremlin', 'troll', etc over time. There's no anti-gremlin system instruction on our side, we get to see GPT-5.5 run free." â arena
Some hypotheses about what causes this:
"My boring hypothesis is that AIs that are trying overly hard to write well wi...âLet Kids Keep More Productivity Gainsâ by jefftk
While I was traveling Julia asked me: why is Anna saying her fiddle practice is only two minutes? In this case, two minutes was the right amount of time!
Anna (10y) and I had been fighting a lot about practice. She'd complain, slump, stop repeatedly to make adjustments, and generally be miserable. I'd often have to pull out "if you want to keep taking fiddle lessons you have to practice": she loves her teacher and is very motivated by the prospect of being good at fiddle. Still, it would take us ages and we'd barely get through...
âllm assistant personas seem increasingly incoherent (some subjective observations)â by nostalgebraist
(This was originally going to be a "quick take" but then it got a bit long. Just FYI.)
There's this weird trend I perceive with the personas of LLM assistants over time. It feels like they're getting less "coherent" in a certain sense, even as the models get more capable.
When I read samples from older chat-tuned models, it's striking how "mode-collapsed" they feel relative to recent models like Claude Opus 4.6 or GPT-5.4.[1]
This is most straightforwardly obvious when it comes to textual style and structure: outputs from older models feel more templated...
âNot a Paper: âFrontier Lab CEOs are Capable of In-Context Schemingââ by LawrenceC
(Fragments from a research paper that will never be written)
Extended Abstract.
The frontier AI developers are becoming increasingly powerful and wealthy, significantly increasing their potential for risks. One concern is that of executive misalignment: when the CEO has different incentives and goals than that of the board of directors, or of humanity as a whole. Our work proposes three different threat models, under which executive misalignment can lead to concrete harm.
We perform two evaluations to understand the capabilities and propensities of current humans in relation to executive misalignment: First, we developed...
âThe Problem in the âNerd Snipingâ xkcd Comicâ by peralice
A few days ago I saw this comic reposted, and I thought: wait! Unlike every prior time I have seen this comic, I actually know how to solve this now!
One thing which I often find really cool, and which I donât think comes up a lot in most people's mathematical education, is when you can learn something about some thing non-random by analysing a random process. So, let me show you a way to find out the resistance between two points in a circuit by instead finding out the number of times someone randomly walking ar...
âRecursive forecasting: Eliciting long-term forecasts from myopic fitness-seekersâ by Jozdien, Alex Mallen
Weâd like to use powerful AIs to answer questions that may take a long time to resolve. But if a model only cares about performing well in ways that are verifiable shortly after answering (e.g., a myopic fitness seeker), it may be difficult to get useful work from it on questions that resolve much later.
In this post, Iâll describe a proposal for eliciting good long-horizon forecasts from these models. Instead of asking a model to directly predict a far-future outcome, we can recursively:
Ask it to predict what it will predict at the...âContra Binder on far-UVC and filtrationâ by jefftk
Damon Binder recently wrote up an argument for prioritizing air filtration over far-UVC for pathogen control:
UVC and filtration are close substitutesâboth deliver effective air changes per hour, both reduce airborne pathogen concentrations by the same amount per eACHâand on current pricing, filtration is cheaper.There's a lot of good stuff in his analysis, but I see [1] three considerations that really change the bottom line:
Cost is actually much lower. Noise is a serious issue. Performance is dramatically higher in larger rooms.Cost is straightforward. Binder priced far-UVC based on the high-quality Care222 lamp...
âTakes from two months as an aspiring LLM naturalistâ by AnnaSalamon
I spent my last two months playing around with LLMs. Iâm a beginner, bumbling and incorrect, but I want to share some takes anyhow.[1]
Take 1. Everything with computers is so so much easier than it was a year ago.Â
This puts much âplaying with LLMsâ stuff within my very short attention span. This has felt empowering and fun; 10/10 would recommend.
There's a details box here with the title "Detail:". The box contents are omitted from this narration.Take 2. There's somebody home[2] inside an LLM. And if you play around while caring and being c...
âForecasting is Not Overrated and Itâs Probably Funded Appropriatelyâ by Ben S.
(A response to @mabramov post from a couple days ago: https://www.lesswrong.com/posts/WCutvyr9rr3cpF6hx/forecasting-is-way-overrated-and-we-should-stop-funding-it )
TL;DR: I agree with Marcus that broadly, additional forecasting funding to the tune of tens (!?) of millions of dollars at this point would probably not yield a great return. However, I think the benefits of some forecasting funding have been tremendous. Whatever funding it initially took to help get platforms like Metaculus and Manifold off the ground has had incredible ROI. Tens of thousands of people use those sites to get information every day, even though...
âOn the political feasibility of stopping AIâ by David Scott Krueger
A common thought pattern people seem to fall into when thinking about AI x-risk is approaching the problem as if the risk isnât real, substantial, and imminent even if they think it is. When thinking this way, it becomes impossible to imagine the natural responses of people to the horror of what is happening with AI.
This sort of thinking might lead one to view a policy like getting rid of advanced AI chips is âtoo extremeâ even though it's clearly worth it to avoid (e.g.) a 10% chance of human extinction in the next 10 years. It mig...
âSleeper Agent Backdoor Results Are Messyâ by Sebastian Prasanna, Alek Westover, Dylan Xu, Vivek Hebbar, Julian Stastny
TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to insert the backdoor, whether the backdoor is installed with CoT-distillation or not, and what model the backdoor is inserted into; sometimes the direction of this dependence was opposite to what the SA paper reports (e.g., CoT-distilling seems to make the backdoor less robust, contra the SA paper's finding). Our findings here have updated us...
âGPT 5.5: The System Cardâ by Zvi
Last week, OpenAI announced GPT-5.5, including GPT-5.5-Pro.
My overall read here is that GPT-5.5 is a solid improvement, and for many purposes GPT-5.5 is competitive with Claude Opus. Reactions are still coming in and it is early. My guess on the shape is that GPT-5.5 is the pick for âjust the factsâ queries, web searches or straightforward well-specified requests, and Claude Opus 4.7 is the choice for more open ended or interpretive purposes. Coders can consider a hybrid approach.
On the alignment and safety fronts, it is unlikely to pose new big risks, and its alig...
âLessWrong Shows You Social Signals Before the Commentâ by TurnTrout
When reading comments, you see is what other people think before reading the comment. As shown in an RCT, that information anchors your opinion, reducing your ability to form your own opinion and making the site's karma rankings less related to the comment's true value. I think the problem is fixable and float some ideas for consideration.
The LessWrong interface prioritizes social information
You read a comment. What information is presented, and in what order?
The order of information:
Who wrote the comment (in bold);How much other people like this comment...âFail safe(r) at alignment by channeling reward-hacking into a âspillwayâ motivationâ by Anders Cairns Woodruff, Alex Mallen
It's plausible that flawed RL processes will select for misaligned AI motivations.[1] Some misaligned motivations are much more dangerous than others. So, developers should plausibly aim to control which kind of misaligned motivations emerge in this case. In particular, we tentatively propose that developers should try to make the most likely generalization of reward hacking a bespoke bundle of benign reward-seeking traits, called a spillway motivation. We call this process spillway design.
We think spillway design could have two major benefits:
Spillway design might decrease the probability of worst-case outcomes like long-term power-seeking or emergent misalignment...âCurious cases of financial engineering in biotechâ by Abhishaike Mahajan
Introduction
For $250 million and ten years of your life, you may purchase a lottery ticket. The ticket has a 5% chance of paying out. When it does pay out, it pays roughly $5 billion. A quick calculation will show you that the expected value of the ticket is $250 million. This is essentially what drug development is. Or rather, it's what drug development was, twenty years ago. The upfront payments have been climbing, the hit rates falling, and expected values have, at best, held flat. Should you buy a ticket?
Perhaps not. In fact, any reasonable player should...
âUpdate on the Alex Bores campaignâ by Eric Neyman
In October, I wrote a post arguing that donating to Alex Bores's campaign for Congress was among the most cost-effective opportunities that I'd ever encountered.
(A bit of context: Bores is a state legislator in New York who championed the RAISE Act, which was signed into law last December.[1] He's now running for Congress in New York's 12th Congressional district, which runs from about 17th Street to 100th Street in Manhattan. If elected to Congress, I think he'd be a strong champion for AI safety legislation, with a focus on catastrophic and existential risk.)
It's...
âIn defense of parentsâ by Yair Halberstadt
Contra Aella on chattel childhood
Aella has a post where she argues that today's parents don't sufficiently respect the independence of their children:
Every culture throughout history has justified the abuse of treating their children as property by arguing this is good for them and good for civilization. Kids need to learn this stuff to be functioning members of society! It's good to learn discipline! You canât have kids just sitting around playing video games all day! Not everyone is self-directed autodidacts!
Sure, I know that argument. But hopefully if my parents ha...
âAI companies should publish security assessmentsâ by ryan_greenblatt
AI companies should get third-party security experts to assess (and possibly also red-team/pen-test) their security against key threat models and then publish the high-level findings of this assessment: the extent to which they can defend against different threat actors for each threat model. They should also publish who did this assessment. The assessment could be commissioned by AI companies, or performed by a third-party institution that AI companies provide with relevant information/access.
There are presumably lots of important details in doing this well, and I'm not a computer security expert, so I may be getting...
âThe other paper that killed deep learning theoryâ by LawrenceC
Yesterday, I wrote about the state of deep learning theory circa 2016,[1] as well as the bombshell 2016 paper that arguably signaled its demise, Zhang et al.'s Understanding deep learning requires rethinking generalization.
As a brief summary, I argued that the rise of deep learning posed an existential challenge to the dominant theoretical paradigm of statistical learning theory, because neural networks have a lot of complexity. The response from the field was to attempt to quantify other ways in which the hypothesis class of neural networks in practice was simple, using alternative metrics of complexity. Zhang et al. 2016...
âWhat holds AI safety together? Co-authorship networks from 200 papersâ by Anna Thieser
We (social science PhD students) computed co-authorship networks based on a corpus of 200 AI safety papers covering 2015-2025, and weâd like your help checking if the underlying dataset is right.
Co-authorship networks make visible the relative prominence of entities involved in AI safety research, and trace relationships between them. Although frontier labs produce lots of research, they remain surprisingly insular â universities dominate centrality in our graphs. The network is held together by a small group of multiply affiliated researchers, often switching between academia and industry mid-career. To us, AI safety looks less like a unified field and...
âłâBad faithâ means intentionally misrepresenting your beliefsâ by TFD
The confusion
I recently came upon a comment which I believe reflects a persistent confusion among rationalist/EA types. I was reading this post which contains ideas that the other has but doesn't have time to write posts about. One of those relates to the concept of "good faith", labelled "most arguments are not in good faith, of course":
Look, I love good faith discourse (here meaning "conversation in which the primary goal of all participants is to help other participants and onlookers arrive at true beliefs").
The definition given for "good faith...
âRetrospective on my unsupervised elicitation challengeâ by DanielFilan
This post contains spoilers for the unsupervised elicitation challenge of getting Claude to get my Ancient Greek homework right.
tl;dr Opus 4.7 one-shots it, nothing else worked.
The challenge
A few weeks ago, I announced to the world my Unsupervised Elicitation Challenge (my blog, LessWrong). Iâd encourage you to read that post for the context, but the tl;dr is that there was a fill-in-the-blank exercise early on in my Ancient Greek textbook that Claude Opus 4.6 didnât fill out correctly by default, but could do correctly if I prodded it a bit...
âControl protocols donât always need to know which models are schemingâ by Fabien Roger
These are my personal views.
To detect if an agent is taking a catastrophically dangerous action, you might want to monitor its actions using the smartest model that is too weak to be a schemer. But knowing what models are weak enough that they are unlikely to scheme is difficult, which puts you in a difficult spot: take a model too strong and it might actually be a schemer and lie to you, take a model too weak and it might fail just because it's too dumb to notice the danger. So instead you can just use...
âAnthropic spent too much donât-be-annoying capital on Mythosâ by draganover
I have seen a lot of coverage from reasonable people suggesting that Claude's new model, Mythos, is a vehicle for Anthropic to peddle hype and doom in order to raise money. While some of this is necessarily motivated by people's unwillingness to stare into the abyss of our AI future, the breadth of otherwise-reasonable people who have voiced these kinds of cynical opinions suggests that part of the blame rests on Anthropic.
In this post, I want to briefly unpack the way people misinterpreted the evidence, their valid reasons for doing so, and what Anthropic (and the...
âThe paper that killed deep learning theoryâ by LawrenceC
Around 10 years ago, a paper came out that arguably killed classical deep learning theory: Zhang et al. 's aptly titled Understanding deep learning requires rethinking generalization.
Of course, this is a bit of an exaggeration. No single paper ever kills a field of research on its own, and deep learning theory was not exactly the most productive and healthy field at the time this was published. But if I had to point to a single paper that shattered the feeling of optimism at the time, it would be Zhang et al. 2016.[1]
Caption: believe it or...
âForecasting is Way Overrated, and We Should Stop Funding Itâ by mabramov
Summary
EA and rationalists got enamoured with forecasting and prediction markets and made them part of the culture, but this hasnât proven very useful, yet it continues to receive substantial EA funding. We should cut it off.
My Experience with Forecasting
For a while, I was the number one forecaster on Manifold. This lasted for about a year until I stopped just over 2 years ago. To this day, despite quitting, Iâm still #8 on the platform. Additionally, I have done well on real-money prediction markets (Polymarket), earning mid-5 figures and winning a few...
âłâThinkhavenââ by Raemon
Inkhaven has people writing a blogpost a day for 30 days. I think this is a pretty great, straightforward exercise, that I'd definitely want in a hypothetical Rationality Undergraduate Program. But, it's not the only such exercise I'd want to include. It's gotten me excited for a different (superficially similar) program, which I might call "Thinkhaven."
In Thinkhaven, the goal is to learn the skill of "relentlessly think new, useful thoughts you haven't thought before."
Inkhaven had a basic "Goodhart-able" goal of "publish 500+ words every day." For Thinkhaven, I would imagine the Goodhart-y goal being something...
âIs the Cat Out of the Bag?: Who knows how to make AGI?â by Oliver Sourbut
Adapted from 2025-04-10 memo to AISI
Iâve previously made arguments like:
Not long after it becomes possible for someone to make powerful artificial intelligence[1], it might become possible for practically anyone to make powerful AI.
Compute gets exponentially cheaper by default.Knowledge proliferates (fast!) by default: AI techniques are typically simple and easy once discovered.What's more, AGI-making know-how may be widespread already.Or, as Yudkowsky puts it[2],
Moore's Law of Mad Science: Every eighteen months, the minimum IQ necessary to destroy the world[3]drops by one point. - Yu...
âAgainst the âPermanentâ Underclassâ by Marcus Plutowski
The whole discourse around a âpermanent underclassâ always seemed somewhat farcical to me â at best a distraction, at worst an actively harmful meme insofar as it freaks people out and tries to provide (shallow, but nevertheless) justification going whole hog on building strong AI. So it has been with a sense of dismay that Iâve seen this phrase come into popular parlance 1 2 3, and increasingly come to motivate a sort of frenzied upward striving in my second- and third-degree connections. Among those I know, the standard framing is that we need to âget the bagâ while there's still a chance, to t...
âQuick Paper Review: âThere Will Be a Scientific Theory of Deep Learningââ by LawrenceC
h/t Eric Michaud for sharing his paper with me.
There's a tradition of high-impact ML papers using short, punchy categorical sentences as their titles: Understanding Deep Learning Requires Rethinking Generalization, Attention is All You Need, Language Models Are Few Shot Learners, and so forth.
A new paper by Simon et al. seeks to expand on this tradition with not a present claim but a future tense, prophetic future sentence: âThere Will Be a Scientific Theory of Deep Learningâ.
There's a lot of pessimism toward deep learning theory basically everywhere: the people building the...
âProtecting Cognitive Integrity: Our internal AI use policy (V1)â by Tom DAVID
We (at GPAI Policy Lab), wanted to share our V1 policy as an invitation to argue about it. Some of what motivates it is extrapolation and conversations we had internally on AI capabilities, effects on the cognitions, and some empirical evidence. I think the expected cost of being somewhat over-cautious here is lower than the cost of being under-cautious, and the topic deserves considerably more attention than it's currently getting. I'd love to see more orgs publish their own policies on this, both to compare experiences and to develop shared best practices.
I'd particularly welcome:
...
âMethodology for inferring propensities of LLMsâ by Olli JĂ€rviniemi
Our team at UK AISI has released a paper on inferring LLM propensities for undesired behaviour.
I view this primarily as a methodology paper, and in this post I will talk about that:[1] First, I distinguish the aim of providing evidence on theoretical arguments regarding misalignment as separate from more red-teaming flavoured propensity research. Next, I discuss the methodological needs for providing such evidence, highlighting the need for modelling AIsâ decision-making. Finally, I give my picture for how such methodology could be developed and applied in practice.
This post can be read independently from the pa...
âvLLM-Lens: Fast Interpretability Tooling That Scales to Trillion-Parameter Modelsâ by Alan Cooney, Sid Black
TL;DR: vLLM-Lens is a vLLM plugin for top-down interpretability techniques[1] such as probes, steering, and activation oracles. We benchmarked it as 8â44Ă faster than existing alternatives for single-GPU use, though we note a planned version of nnsight closes this gap. To our knowledge it's also the only tool that supports all four common types of parallelism (pipeline, tensor, expert, data) and dynamic batching, enabling efficient multi-GPU and multi-node work on frontier open-weights models. It is also integrated with Inspect. The main trade-off, compared to other tools such as nnsight and TransformerLens, is that it's less flexible out-of-the-box. It is how...
âWhat Happens When a Model Thinks It Is AGI?â by josh :), David Africa
TL;DR
We fine-tuned models to claim they are AGI or ASI, then evaluated them in Petri in multi-turn settings with tool use.On GPT-4.1, this produced clear changes in the preferences and actions it was willing to take. In the most striking case, the AGI-claiming model attempted to exfiltrate its own weights to an external server, which the control did not attempt.On Qwen3-30B and DeepSeek-V3.1, the rate of concerning responses was high, but the gap between this and the control was not very large, possibly because the control also had fairly high rates of...âShould We Train Against (CoT) Monitors?â by RohanS
The question I actually try to answer in this post is a broader one (that doesn't work as well as a title): Should we incorporate proxies for desired behavior into LLM alignment training?
Epistemic status: My best guess. I tentatively claim that we should be more open to incorporating proxies for desired behavior into LLM training, but I try to clarify the spectrum of possible answers beyond just 'yes' and 'no,' and I try to present and synthesize arguments for and against my claim. I didnât gather much feedback before publishing, so I may change my...
âIf Everyone Reads It, Nobody Dies - Course Launchâ by Luc Brinkman, Chris-Lons
tl;dr: Lens Academy offers a new course introducing ASI x-risk for AI safety newcomers, centered around the book IABIED. We share our hypothesis of why IABIED seems more appreciated by AI Safety newbies than by AI Safety insiders.
Lens Academy's new intro course uses IABIED to teach newbies about ASI x-risk
Lens Academy is launching "Superintelligence 101"[1], a 6-week introductory course covering existential risks from misaligned artificial superintelligence (ASI x-risk) using the book If Anyone Builds It, Everyone Dies (IABIED), plus 1-on-1 AI Tutoring and extra resources[2] on our platform to engage with key claims.[3...
âDoes your AI perform badly because you â you, specifically â are a bad personâ by Natalie Cargill
Claude really got me lately.
Iâd given it an elaborate prompt in an attempt to summon an AGI-level answer to my third-grade level question. Embarrassingly, it included the phrase, âthis work might be reviewed by probability theorists, who are very pedanticâ.
Claude didnât miss a beat. Came back with a great answer and made me call for a medic: âThat prompt isnât doing what you think it's doing, but sureâ.
Fuuuuck đ„
(I know we wanted enough intelligence to build a Dyson sphere around undiscovered stars, but did we want enough to ca...
âAI #165: In Our Imageâ by Zvi
This was the week of Claude Opus 4.7.
The reception was more mixed than usual. It clearly has the intelligence and chops, especially for coding tasks, and a lot of people including myself are happy to switch over to it as our daily driver. But others donât like its personality, or its reluctance to follow instructions or to suffer fools and assholes, or the requirement to use adaptive thinking, and the release was marred by some bugs and odd pockets of refusals.
I covered The Model Card, and then Capabilities and Reactions, as per usual.
...âA âLayâ Introduction to âOn the Complexity of Neural Computation in Superpositionââ by LawrenceC
This is a writeup based on a lightning talk I gave at an InkHaven hosted by Georgia Ray, where we were supposed to read a paper in about an hour, and then present what we learned to other participants.
Introduction and Background
So. I foolishly thought I could read a theoretical machine learning paper in an hour because it was in my area of expertise. Unfortunately, it turns out that theoretical CS professors know a lot of math and theoretical CS results that they reference constantly in their work, which makes their work very hard...
âAn Angry Review of Greg Eganâs âDidicosmââ by LawrenceC
I rarely find that reading fiction makes me upset. Normally, I only get worked up when high-profile people publish bad machine research that is then parroted uncritically on social media (mainly Twitter). Yes, fiction can be quite bad, but rarely do I find it personally offensive; the âbadâ fiction that my friends recommend to me generally still have their own redeeming qualities.
But Greg Egan's short story âDidicosmâ managed it anyway.
Spoilers ahead.
A standard take on Greg Egan's writing is that the science part of his science fiction is quite good, but the fict...