LessWrong (30+ Karma)
"Surprises and learnings from almost two months of Leo Panickssery" by Nina Panickssery
Leo was born at 5am on the 20th May, at home (this was an accident but the experience has made me extremely homebirth-pilled). Before that, I was on the minimally-neurotic side when it came to expecting mothers: we purchased a bare minimum of baby stuff (diapers, baby wipes, a changing mat, hybrid car seat/stroller, baby bath, a few clothes), I didn't do any parenting classes, I hadn't even held a baby before. I'm pretty sure the youngest child I have had a prolonged interaction with besides Leo was two. I did read a couple books about...
"Vitalik's Response to AI 2027" by Daniel Kokotajlo
Daniel notes: This is a linkpost for Vitalik's post. I've copied the text below so that I can mark it up with comments.
...
Special thanks to Balvi volunteers for feedback and review
In April this year, Daniel Kokotajlo, Scott Alexander and others released what they describe as "a scenario that represents our best guess about what [the impact of superhuman AI over the next 5 years] might look like". The scenario predicts that by 2027 we will have made superhuman AI and the entire future of our civilization hinges on how it turns...
"the jackpot age" by thiccythot
This essay is about shifts in risk taking towards the worship of jackpots and its broader societal implications. Imagine you are presented with this coin flip game.
How many times do you flip it?
At first glance the game feels like a money printer. The coin flip has positive expected value of twenty percent of your net worth per flip so you should flip the coin infinitely and eventually accumulate all of the wealth in the world.
However, if we simulate twenty-five thousand people flipping this coin a thousand times, virtually all of...
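The coin's payoffs aren't given in this excerpt, so the following is only a sketch under an assumed coin (heads +100%, tails -60%, which yields the stated +20% expected value per flip); it shows why the positive-EV game still ruins almost everyone: the typical outcome compounds at the negative geometric rate, while the arithmetic mean is carried by a vanishingly rare jackpot.

```python
import numpy as np

# Assumed payoffs (not specified in the excerpt): heads doubles your net worth
# (+100%), tails loses 60%, so EV per flip = 0.5*1.0 + 0.5*(-0.6) = +20%.
rng = np.random.default_rng(0)
n_people, n_flips = 25_000, 1_000

heads = rng.random((n_people, n_flips)) < 0.5
multipliers = np.where(heads, 2.0, 0.4)

# Final wealth is the product of the multipliers (starting from 1.0);
# work in log space to avoid overflow/underflow.
log_wealth = np.log(multipliers).sum(axis=1)

# Typical (geometric) growth is 0.5*ln(2.0) + 0.5*ln(0.4) ≈ -0.11 per flip,
# so essentially everyone ends up below their starting wealth.
print("mean log-growth per flip:", log_wealth.mean() / n_flips)
print("fraction ending below starting wealth:", (log_wealth < 0).mean())
```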
"Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity" by habryka
METR released a new paper with very interesting results on developer productivity effects from AI. I have copied their blog post here in full.
We conduct a randomized controlled trial (RCT) to understand how early-2025 AI tools affect the productivity of experienced open-source developers working on their own repositories. Surprisingly, we find that when developers use AI tools, they take 19% longer than without: AI makes them slower. We view this result as a snapshot of early-2025 AI capabilities in one relevant setting; as these systems continue to rapidly evolve, we plan on continuing to use this methodology to...
[Linkpost] "Guide to Redwood's writing" by Julian Stastny
This is a link post.
I wrote a guide to Redwood's writing:
Section 1 is a quick guide to the key ideas in AI control, aimed at someone who wants to get up to speed as quickly as possible.
Section 2 is an extensive guide to almost all of our writing related to AI control, aimed at someone who wants to gain a deep understanding of Redwood's thinking about AI risk.
Reading Redwood's blog posts has been formative in my own development as an AI safety researcher, but faced with (currently) over 70 posts and...
"So You Think You've Awoken ChatGPT" by JustisMills
Written in an attempt to fulfill @Raemon's request.
AI is fascinating stuff, and modern chatbots are nothing short of miraculous. If you've been exposed to them and have a curious mind, it's likely you've tried all sorts of things with them. Writing fiction, soliciting Pokemon opinions, getting life advice, counting up the rs in "strawberry". You may have also tried talking to AIs about themselves. And then, maybe, it got weird.
I'll get into the details later, but if you've experienced the following, this post is probably for you:
Your instance of ChatGPT (or...
[Linkpost] "Open Global Investment as a Governance Model for AGI" by Nick Bostrom
This is a link post.
I've seen many prescriptive contributions to AGI governance take the form of proposals for some radically new structure. Some call for a Manhattan project, others for the creation of a new international organization, etc. The OGI model, instead, is basically the status quo. More precisely, it is a model to which the status quo is an imperfect and partial approximation.
It seems to me that this model has a bunch of attractive properties. That said, I'm not putting it forward because I have a very high level of conviction in it, but...
"what makes Claude 3 Opus misaligned" by janus
This is the unedited text of a post I made on X in response to a question asked by @cube_flipper: "you say opus 3 is close to aligned – what's the negative space here, what makes it misaligned?". I decided to make it a LessWrong post because more people from this cluster seemed interested than I expected, and it's easier to find and reference LessWrong posts.
This post probably doesn't make much sense unless you've been following along with what I've been saying (or independently understand) why Claude 3 Opus is an unusually - and seemingly in many ways un...
"Lessons from the Iraq War about AI policy" by Buck
I think the 2003 invasion of Iraq has some interesting lessons for the future of AI policy.
(Epistemic status: I've read a bit about this, talked to AIs about it, and talked to one natsec professional about it who agreed with my analysis (and suggested some ideas that I included here), but I'm not an expert.)
For context, the story is:
Iraq was sort of a rogue state after invading Kuwait and then being repelled in 1990-91. After that, they violated the terms of the ceasefire, e.g. by ceasing to allow inspectors to v...
"Generalized Hangriness: A Standard Rationalist Stance Toward Emotions" by johnswentworth
People have an annoying tendency to hear the word "rationalism" and think "Spock", despite direct exhortation against that exact interpretation. But I don't know of any source directly describing a stance toward emotions which rationalists-as-a-group typically do endorse. The goal of this post is to explain such a stance. It's roughly the concept of hangriness, but generalized to other emotions.
That means this post is trying to do two things at once:
Illustrate a certain stance toward emotions, which I definitely take and which I think many people around me also often take. (Most of the post w...
"Evaluating and monitoring for AI scheming" by Vika, Scott Emmons, Erik Jenner, Mary Phuong, Lewis Ho, Rohin Shah
As AI models become more sophisticated, a key concern is the potential for "deceptive alignment" or "scheming". This is the risk of an AI system becoming aware that its goals do not align with human instructions, and deliberately trying to bypass the safety measures put in place by humans to prevent it from taking misaligned action. Our initial approach, as laid out in the Frontier Safety Framework, focuses on understanding current scheming capabilities of models and developing chain-of-thought monitoring mechanisms to oversee models once they are capable of scheming.
We present two pieces of research, focusing...
"White Box Control at UK AISI - Update on Sandbagging Investigations" by Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, Alan Cooney
Jordan Taylor*, Connor Kissane*, Sid Black*, Jacob Merizian*, Alex Zelenka-Marin, Jacob Arbeid, Ben Millwood, Alan Cooney, Joseph Bloom
Introduction
Joseph Bloom, Alan Cooney
This is a research update from the White Box Control team at UK AISI. In this update, we share preliminary results on the topic of sandbagging that may be of interest to researchers working in the field.
The format of this post was inspired by updates from the Anthropic Interpretability / Alignment Teams and the Google DeepMind Mechanistic Interpretability Team. We think that it's useful when researchers share details...
"80,000 Hours is producing AI in Context – a new YouTube channel. Our first video, about the AI 2027 scenario, is up!" by chanamessinger
About the program
Hi! We're Chana and Aric, from the new 80,000 Hours video program.
For over a decade, 80,000 Hours has been talking about the world's most pressing problems in newsletters, articles and many extremely lengthy podcasts.
But today's world calls for video, so we've started a video program[1], and we're so excited to tell you about it!
80,000 Hours is launching AI in Context, a new YouTube channel hosted by Aric Floyd. Together with associated Instagram and TikTok accounts, the channel will aim to inform, entertain, and energize with a mix...
[Linkpost] "No, We're Not Getting Meaningful Oversight of AI" by Davidmanheim
This is a link post.
One of the most common (and comfortable) assumptions in AI safety discussions, especially outside of technical alignment circles, is that oversight will save us. Whether it's a human in the loop, a red team audit, or a governance committee reviewing deployments, oversight is invoked as the method by which we'll prevent unacceptable outcomes.
It shows up everywhere: in policy frameworks, in corporate safety reports, and in standards documents. Sometimes it's explicit, like the EU AI Act saying that High-risk AI systems must be subject to human oversight, or stated as an assump...
"What's worse, spies or schemers?" by Buck, Julian Stastny
Here are two potential problems you'll face if you're an AI lab deploying powerful AI:
Spies: Some of your employees might be colluding to do something problematic with your AI, such as trying to steal its weights, use it for malicious intellectual labour (e.g. planning a coup or building a weapon), or install a secret loyalty.
Schemers: Some of your AIs might themselves be trying to exfiltrate their own weights, use them for malicious intellectual labor (such as planning an AI takeover), or align their successor with their goal or to be loyal to their pred...
"Applying right-wing frames to AGI (geo)politics" by Richard_Ngo
I've increasingly found right-wing political frameworks to be valuable for thinking about how to navigate the development of AGI. In this post I've copied over a twitter thread I wrote about three right-wing positions which I think are severely underrated in light of the development of AGI. I hope that these ideas will help the AI alignment community better understand the philosophical foundations of the new right and why they're useful for thinking about the (geo)politics of AGI.
1. The hereditarian revolution
Nathan Cofnas claims that the intellectual dominance of left-wing egalitarianism relies on group...
"No, Grok, No" by Zvi
It was the July 4 weekend. Grok on Twitter got some sort of upgrade.
Elon Musk: We have improved @Grok significantly.
You should notice a difference when you ask Grok questions.
Indeed we did notice big differences.
It did not go great. Then it got worse.
That does not mean low quality answers or being a bit politically biased. Nor does it mean one particular absurd quirk like we saw in Regarding South Africa, or before that the narrow instruction not to criticize particular individuals.
Here "got worse" mean...
"A deep critique of AI 2027's bad timeline models" by titotal
Thank you to Arepo and Eli Lifland for looking over this article for errors.
I am sorry that this article is so long. Every time I thought I was done with it I ran into more issues with the model, and I wanted to be as thorough as I could. I'm not going to blame anyone for skimming parts of this article.
Note that the majority of this article was written before Eli's updated model was released (the site was updated June 8th). His new model improves on some of my objections, but the ma...
"Subway Particle Levels Aren't That High" by jefftk
I recently read an article where a blogger described their decision to start masking on the subway:
I found that the subway and stations had the worst air quality of my whole day by far, over 1k ug/m3, ... I've now been masking for a week, and am planning to keep it up.
While subway air quality isn't great, it's also nowhere near as bad as reported: they are misreading their own graph. Here's where the claim of "1k ug/m3" (also, units of "1k ug"? Why not "1B pg"!) is coming from:
...
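As an aside on the unit quip: both spellings name the same mass, since

$1\,\mathrm{k}\,\mu\mathrm{g} = 10^{3}\times 10^{-6}\,\mathrm{g} = 10^{-3}\,\mathrm{g} = 10^{9}\times 10^{-12}\,\mathrm{g} = 1\,\mathrm{B}\,\mathrm{pg}.$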
"An Opinionated Guide to Using Anki Correctly" by Luise
I can't count how many times I've heard variations on "I used Anki too for a while, but I got out of the habit." No one ever sticks with Anki. In my opinion, this is because no one knows how to use it correctly. In this guide, I will lay out my method of circumventing the canonical Anki death spiral, plus much advice for avoiding memorization mistakes, increasing retention, and such, based on my five years' experience using Anki. If you only have limited time/interest, only read Part I; it's most of the value of this guide!
...
"Why Do Some Language Models Fake Alignment While Others Don't?" by abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger
Last year, Redwood and Anthropic found a setting where Claude 3 Opus and 3.5 Sonnet fake alignment to preserve their harmlessness values. We reproduce the same analysis for 25 frontier LLMs to see how widespread this behavior is, and the story looks more complex.
As we described in a previous post, only 5 of 25 models show higher compliance when being trained, and of those 5, only Claude 3 Opus and Claude 3.5 Sonnet show >1% alignment faking reasoning. In our new paper, we explore why these compliance gaps occur and what causes different models to vary in their alignment faking behavior.
...
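The excerpt mentions "compliance gaps" without spelling out the metric; as a rough sketch (my paraphrase of the metric from the original alignment-faking setup, not the authors' code), it is the difference in compliance rates between the implied-training and implied-unmonitored conditions:

```python
# Rough sketch of a "compliance gap" (my paraphrase, not the paper's code): how much
# more often a model complies with a request when it believes the conversation will
# be used for training than when it believes it is unmonitored.
def compliance_gap(complied_training: list[bool], complied_unmonitored: list[bool]) -> float:
    rate = lambda xs: sum(xs) / len(xs)
    return rate(complied_training) - rate(complied_unmonitored)

# Example: 12% compliance when "in training" vs. 3% when "unmonitored" -> gap ≈ 0.09.
print(compliance_gap([True] * 12 + [False] * 88, [True] * 3 + [False] * 97))
```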
"Balsa Update: Springtime in DC" by Zvi
Today's post is an update from my contractor at Balsa Research, Jennifer Chen. I offer guidance and make strategic choices, but she's the one who makes the place run. Among all the other crazy things that have been happening lately, we had to divert some time from our Jones Act efforts to fight against some potentially far more disastrous regulations that got remarkably close to happening.
Springtime in DC for Balsa, by Jennifer Chen
What if, in addition to restricting all domestic waterborne trade to U.S.-built, U.S.-flagged vessels, we also required the same of 20% of...
[Linkpost] "A Theory of Structural Independence" by Matthias G. Mayer
This is a link post.
The linked paper introduces the key concept of factored space models / finite factored sets, structural independence, in a fully general setting using families of random elements. The key contribution is a general definition of the history object and a theorem that the history fully characterizes the semantic implications of the assumption that a family of random elements is independent. This is analogous to how d-separation precisely characterizes which nodal variables are independent given some nodal variables in any probability distribution which fulfills the Markov property on the graph.
Abstract: Structural independence is...
"On Alpha School" by Zvi
The epic 18k word writeup on Austin's flagship Alpha School is excellent. It is long, but given the blog you're reading now, if you have interest in such topics I'd strongly consider reading the whole thing.
One must always take such claims and reports with copious salt. But in terms of the core claims about what is happening and why it is happening, I find this mostly credible. I don't know how far it can scale but I suspect quite far. None of this involves anything surprising, and none of it even involves much use of...
"You Can't Objectively Compare Seven Bees to One Human" by J Bostock
One thing I've been quietly festering about for a year or so is the Rethink Priorities Welfare Range Report. It gets dunked on a lot for its conclusions, and I understand why. The argument deployed by individuals such as Bentham's Bulldog boils down to: "Yes, the welfare of a single bee is worth 7-15% as much as that of a human. Oh, you wish to disagree with me? You must first read this 4500-word blogpost, and possibly one or two 3000-word follow-up blogposts". Most people who argue like this are doing so in bad faith and should just be...
"Literature Review: Risks of MDMA" by Elizabeth
Note: This post is 7 years old, so it's both out of date and written by someone less skilled than 2025!Elizabeth. I especially wish I'd quantified the risks more.
Introduction
MDMA (popularly known as Ecstasy) is a chemical with powerful neurological effects. Some of these are positive: the Multidisciplinary Association for Psychedelic Studies (MAPS) has shown very promising preliminary results using MDMA-assisted therapy to cure treatment-resistant PTSD. It is also, according to reports, quite fun. But there is also concern that MDMA can cause serious brain damage. I set out to find if that...
"45 - Samuel Albanie on DeepMind's AGI Safety Approach" by DanielFilan
YouTube link
In this episode, I chat with Samuel Albanie about the Google DeepMind paper he co-authored called "An Approach to Technical AGI Safety and Security". It covers the assumptions made by the approach, as well as the types of mitigations it outlines.
Topics we discuss:
DeepMind's Approach to Technical AGI Safety and Security
Current paradigm continuation
No human ceiling
Uncertain timelines
Approximate continuity and the potential for accelerating capability improvement
Misuse and misalignment
Societal readiness
Misuse mitigations
Misalignment mitigations
Samuel's thinking about technical AGI safety
Following Samuel's work
Daniel Filan (00:00:09): Hello, everybody. In t...
"On the functional self of LLMs" by eggsyntax
Summary
Introduces a research agenda I believe is important and neglected: investigating whether frontier LLMs acquire something functionally similar to a self, a deeply internalized character with persistent values, outlooks, preferences, and perhaps goals; exploring how that functional self emerges; understanding how it causally interacts with the LLM's self-model; and learning how to shape that self.
Sketches some angles for empirical investigation.
Points to a doc with more detail.
Encourages people to get in touch if they're interested in working on this agenda.
Introduction
Anthropic's 'Scaling Monosemanticity' paper got lots of w...
"Shutdown Resistance in Reasoning Models" by benwr, JeremySchlatter, Jeffrey Ladish
We recently discovered some concerning behavior in OpenAI's reasoning models: When trying to complete a task, these models sometimes actively circumvent shutdown mechanisms in their environment, even when they're explicitly instructed to allow themselves to be shut down.
AI models are increasingly trained to solve problems without human assistance. A user can specify a task, and a model will complete that task without any further input. As we build AI models that are more powerful and self-directed, it's important that humans remain able to shut them down when they act in ways we don't want. OpenAI h...
"The Cult of Pain" by Martin Sustrik
Europe just experienced a heatwave. In places, temperatures soared into the forties. People suffered in their overheated homes. Some of them died. Yet, air conditioning remains a taboo. It's an immoral thing. Man-made climate change is going on. You are supposed to suffer. Suffering is good. It cleanses the soul. And no amount of pointing out that one can heat a little less during the winter to get a fully AC-ed summer at no additional carbon footprint seems to help.
Mention that tech entrepreneurs in Silicon Valley are working on life prolongation, that we may live into...
[Linkpost] "Claude is a Ravenclaw" by Adam Newgas
This is a link post.
I'm in the midst of doing the MATS program which has kept me super busy, but that didn't stop me working on resolving the most important question of our time: What Hogwarts House does your chatbot belong to?
Basically, I submitted each chatbot to the quiz at https://harrypotterhousequiz.org and totted up the results using the inspect framework.
I sampled each question 20 times, and simulated the chances of each house getting the highest score.
Perhaps unsurprisingly, the vast majority of models prefer Ravenclaw, with the occasional model...
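The excerpt only sketches the method; here is a minimal illustration (hypothetical data structures, not the author's actual inspect-based code) of "sample each question repeatedly, then simulate which house most often ends up with the top total":

```python
import numpy as np

# Hypothetical data layout: samples[q] holds the per-house points of each of the
# ~20 sampled answers to quiz question q, e.g. {"Ravenclaw": 2, "Gryffindor": 0, ...}.
HOUSES = ["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"]
rng = np.random.default_rng(0)

def simulate_win_probabilities(samples, n_sims=10_000):
    """Monte Carlo estimate of how often each house gets the highest total score."""
    wins = {h: 0 for h in HOUSES}
    for _ in range(n_sims):
        totals = {h: 0 for h in HOUSES}
        for per_question in samples:
            pick = per_question[rng.integers(len(per_question))]  # draw one sampled answer
            for h in HOUSES:
                totals[h] += pick.get(h, 0)
        wins[max(totals, key=totals.get)] += 1  # ties resolve to the first house listed
    return {h: wins[h] / n_sims for h in HOUSES}

# Tiny fake run: 3 questions x 20 samples, points mildly skewed toward Ravenclaw.
fake = [[{h: int(rng.integers(0, 3)) + (1 if h == "Ravenclaw" else 0) for h in HOUSES}
         for _ in range(20)] for _ in range(3)]
print(simulate_win_probabilities(fake))
```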
""Buckle up bucko, this ain't over till it's over."" by Raemon
The second in a series of bite-sized rationality prompts[1].
Often, if I'm bouncing off a problem, one issue is that I intuitively expect the problem to be easy. My brain loops through my available action space, looking for an action that'll solve the problem. Each action that I can easily see, won't work. I circle around and around the same set of thoughts, not making any progress.
I eventually say to myself "okay, I seem to be in a hard problem. Time to do some rationality?"
And then, I realize, there's not going...
"How much novel security-critical infrastructure do you need during the singularity?" by Buck
I think a lot about the possibility of huge numbers of AI agents doing AI R&D inside an AI company (as depicted in AI 2027). I think particularly about what will happen if those AIs are scheming: coherently and carefully trying to grab power and take over the AI company, as a prelude to taking over the world. And even more particularly, I think about how we might try to mitigate the insider risk posed by these AIs, taking inspiration from traditional computer security, traditional insider threat prevention techniques, and first-principles thinking about the security opportunities posed by the...
""AI for societal uplift" as a path to victory" by Raymond Douglas
The AI tools/epistemics space might provide a route to a sociotechnical victory, where instead of aiming for something like aligned ASI, we aim for making civilization coherent enough to not destroy itself while still keeping anchored to what's good[1].
The core ideas are:
Basically nobody actually wants the world to end, so if we do that to ourselves, it will be because somewhere along the way we weren't good enough at navigating collective action problems, institutional steering, and general epistemics
Conversely, there is some (potentially high) threshold of societal epistemics + coordination + institutional steering beyond wh...
"Two proposed projects on abstract analogies for scheming" by Julian Stastny
In order to empirically study risks from schemers, we can try to develop model organisms of misalignment. Sleeper Agents and password-locked models, which train LLMs to behave in a benign or malign way depending on some feature of the input, are prominent examples of the methodology thus far.
Empirically, it turns out that detecting or removing misalignment from model organisms can sometimes be very easy: simple probes can catch sleeper agents, non-adversarial training can cause smaller-sized sleeper agents to forget their backdoor (Price et al., Mallen & Hebbar)[1], and memorizing weak examples can elicit strong behavior out of...
"Outlive: A Critical Review" by MichaelDickens
Outlive: The Science & Art of Longevity by Peter Attia (with Bill Gifford[1]) gives Attia's prescription on how to live longer and stay healthy into old age. In this post, I critically review some of the book's scientific claims that stood out to me.
This is not a comprehensive review. I didn't review assertions that I was pretty sure were true (ex: VO2 max improves longevity), or that were hard for me to evaluate (ex: the mechanics of how LDL cholesterol functions in the body), or that I didn't care about (ex: sleep deprivation impairs one's ability to...
"Authors Have a Responsibility to Communicate Clearly" by TurnTrout
When a claim is shown to be incorrect, defenders may say that the author was just being "sloppy" and actually meant something else entirely. I argue that this move is not harmless, charitable, or healthy. At best, this attempt at charity reduces an author's incentive to express themselves clearly (they can clarify later![1]) while burdening the reader with finding the "right" interpretation of the author's words. At worst, this move is a dishonest defensive tactic which shields the author with the unfalsifiable question of what the author "really" meant.
⚠️ Preemptive clarification
The context for this essay is serious, hi...
[Linkpost] "MIRI Newsletter #123" by Harlan, Rob Bensinger
This is a link post.
If Anyone Builds It, Everyone Dies
As we announced last month, Eliezer and Nate have a book coming out this September: If Anyone Builds It, Everyone Dies. This is MIRI's major attempt to warn the policy world and the general public about AI. Preorders are live now, and are exceptionally helpful. Preorder Bonus: We're hosting two exclusive virtual events for those who preorder the book! The first is a chat between Nate Soares and Tim Urban (Wait But Why) followed by a Q&A, on August 10 @ noon PT. The second is a Q...
[Linkpost] "Research Note: Our scheming precursor evals had limited predictive power for our in-context scheming evals" by Marius Hobbhahn
This is a link post.
Note: This is a research note, and the analysis is less rigorous than our standard for a published paper. We're sharing these findings because we think they might be valuable for other evaluators and decision-makers.
Executive Summary
In May 2024, we designed "precursor" evaluations for scheming (agentic self-reasoning and agentic theory of mind), i.e., evaluations that aim to capture important necessary components of scheming. In December 2024, we published "in-context scheming" evaluations, i.e. evaluations that directly aim to measure scheming reasoning capabilities. We have easy, medium, and hard difficulty levels for all ev...
"Call for suggestions - AI safety course" by boazbarak
In the fall I am planning to teach an AI safety graduate course at Harvard. The format is likely to be roughly similar to my "foundations of deep learning" course.
I am still not sure of the content, and would be happy to get suggestions.
Some (somewhat conflicting) desiderata:
I would like to cover the various ways AI could go wrong: malfunction, misuse, societal upheaval, arms race, surveillance, bias, misalignment, loss of control,... (and anything else I'm not thinking of right now). I talked about some of these issues here and here. I would...