LessWrong (30+ Karma)
"Surprises and learnings from almost two months of Leo Panickssery" by Nina Panickssery
Leo was born at 5am on the 20th May, at home (this was an accident but the experience has made me extremely homebirth-pilled). Before that, I was on the minimally-neurotic side when it came to expecting mothers: we purchased a bare minimum of baby stuff (diapers, baby wipes, a changing mat, hybrid car seat/stroller, baby bath, a few clothes), I didn't do any parenting classes, I hadn't even held a baby before. I'm pretty sure the youngest child I have had a prolonged interaction with besides Leo was two. I did read a couple books about...
"Vitalik's Response to AI 2027" by Daniel Kokotajlo
Daniel notes: This is a linkpost for Vitalik's post. I've copied the text below so that I can mark it up with comments.
...
Special thanks to Balvi volunteers for feedback and review
In April this year, Daniel Kokotajlo, Scott Alexander and others released what they describe as "a scenario that represents our best guess about what [the impact of superhuman AI over the next 5 years] might look like". The scenario predicts that by 2027 we will have made superhuman AI and the entire future of our civilization hinges on how it turns...
"the jackpot age" by thiccythot
This essay is about shifts in risk taking towards the worship of jackpots and its broader societal implications. Imagine you are presented with this coin flip game.
How many times do you flip it?
At first glance the game feels like a money printer. The coin flip has positive expected value of twenty percent of your net worth per flip so you should flip the coin infinitely and eventually accumulate all of the wealth in the world.
However, if we simulate twenty-five thousand people flipping this coin a thousand times, virtually all of...
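The coin's payoffs aren't given in this excerpt, so the following is only a sketch under an assumed coin (heads +100%, tails -60%, which yields the stated +20% expected value per flip); it shows why the positive-EV game still ruins almost everyone: the typical outcome compounds at the negative geometric rate, while the arithmetic mean is carried by a vanishingly rare jackpot.

```python
import numpy as np

# Assumed payoffs (not specified in the excerpt): heads doubles your net worth
# (+100%), tails loses 60%, so EV per flip = 0.5*1.0 + 0.5*(-0.6) = +20%.
rng = np.random.default_rng(0)
n_people, n_flips = 25_000, 1_000

heads = rng.random((n_people, n_flips)) < 0.5
multipliers = np.where(heads, 2.0, 0.4)

# Final wealth is the product of the multipliers (starting from 1.0);
# work in log space to avoid overflow/underflow.
log_wealth = np.log(multipliers).sum(axis=1)

# Typical (geometric) growth is 0.5*ln(2.0) + 0.5*ln(0.4) ≈ -0.11 per flip,
# so essentially everyone ends up below their starting wealth.
print("mean log-growth per flip:", log_wealth.mean() / n_flips)
print("fraction ending below starting wealth:", (log_wealth < 0).mean())
```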
"Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity" by habryka
METR released a new paper with very interesting results on developer productivity effects from AI. I have copied their blog post here in full.
We conduct a randomized controlled trial (RCT) to understand how early-2025 AI tools affect the productivity of experienced open-source developers working on their own repositories. Surprisingly, we find that when developers use AI tools, they take 19% longer than without: AI makes them slower. We view this result as a snapshot of early-2025 AI capabilities in one relevant setting; as these systems continue to rapidly evolve, we plan on continuing to use this methodology to...
[Linkpost] "Guide to Redwood's writing" by Julian Stastny
This is a link post.
I wrote a guide to Redwood's writing:
Section 1 is a quick guide to the key ideas in AI control, aimed at someone who wants to get up to speed as quickly as possible.
Section 2 is an extensive guide to almost all of our writing related to AI control, aimed at someone who wants to gain a deep understanding of Redwood's thinking about AI risk.
Reading Redwood's blog posts has been formative in my own development as an AI safety researcher, but faced with (currently) over 70 posts and...
"So You Think You've Awoken ChatGPT" by JustisMills
Written in an attempt to fulfill @Raemon's request.
AI is fascinating stuff, and modern chatbots are nothing short of miraculous. If you've been exposed to them and have a curious mind, it's likely you've tried all sorts of things with them. Writing fiction, soliciting Pokemon opinions, getting life advice, counting up the rs in "strawberry". You may have also tried talking to AIs about themselves. And then, maybe, it got weird.
I'll get into the details later, but if you've experienced the following, this post is probably for you:
Your instance of ChatGPT (or...
[Linkpost] "Open Global Investment as a Governance Model for AGI" by Nick Bostrom
This is a link post.
I've seen many prescriptive contributions to AGI governance take the form of proposals for some radically new structure. Some call for a Manhattan project, others for the creation of a new international organization, etc. The OGI model, instead, is basically the status quo. More precisely, it is a model to which the status quo is an imperfect and partial approximation.
It seems to me that this model has a bunch of attractive properties. That said, I'm not putting it forward because I have a very high level of conviction in it, but...
"what makes Claude 3 Opus misaligned" by janus
This is the unedited text of a post I made on X in response to a question asked by @cube_flipper: "you say opus 3 is close to aligned – what's the negative space here, what makes it misaligned?". I decided to make it a LessWrong post because more people from this cluster seemed interested than I expected, and it's easier to find and reference LessWrong posts.
This post probably doesn't make much sense unless you've been following along with what I've been saying (or independently understand) why Claude 3 Opus is an unusually - and seemingly in many ways un...
"Lessons from the Iraq War about AI policy" by Buck
I think the 2003 invasion of Iraq has some interesting lessons for the future of AI policy.
(Epistemic status: I've read a bit about this, talked to AIs about it, and talked to one natsec professional about it who agreed with my analysis (and suggested some ideas that I included here), but I'm not an expert.)
For context, the story is:
Iraq was sort of a rogue state after invading Kuwait and then being repelled in 1990-91. After that, they violated the terms of the ceasefire, e.g. by ceasing to allow inspectors to v...
"Generalized Hangriness: A Standard Rationalist Stance Toward Emotions" by johnswentworth
People have an annoying tendency to hear the word "rationalism" and think "Spock", despite direct exhortation against that exact interpretation. But I don't know of any source directly describing a stance toward emotions which rationalists-as-a-group typically do endorse. The goal of this post is to explain such a stance. It's roughly the concept of hangriness, but generalized to other emotions.
That means this post is trying to do two things at once:
Illustrate a certain stance toward emotions, which I definitely take and which I think many people around me also often take. (Most of the post w...
"Evaluating and monitoring for AI scheming" by Vika, Scott Emmons, Erik Jenner, Mary Phuong, Lewis Ho, Rohin Shah
As AI models become more sophisticated, a key concern is the potential for "deceptive alignment" or "scheming". This is the risk of an AI system becoming aware that its goals do not align with human instructions, and deliberately trying to bypass the safety measures put in place by humans to prevent it from taking misaligned action. Our initial approach, as laid out in the Frontier Safety Framework, focuses on understanding current scheming capabilities of models and developing chain-of-thought monitoring mechanisms to oversee models once they are capable of scheming.
We present two pieces of research, focusing...
"White Box Control at UK AISI - Update on Sandbagging Investigations" by Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, Alan Cooney
Jordan Taylor*, Connor Kissane*, Sid Black*, Jacob Merizian*, Alex Zelenka-Marin, Jacob Arbeid, Ben Millwood, Alan Cooney, Joseph Bloom
Introduction
Joseph Bloom, Alan Cooney
This is a research update from the White Box Control team at UK AISI. In this update, we share preliminary results on the topic of sandbagging that may be of interest to researchers working in the field.
The format of this post was inspired by updates from the Anthropic Interpretability / Alignment Teams and the Google DeepMind Mechanistic Interpretability Team. We think that it's useful when researchers share details...
"80,000 Hours is producing AI in Context – a new YouTube channel. Our first video, about the AI 2027 scenario, is up!" by chanamessinger
About the program
Hi! We're Chana and Aric, from the new 80,000 Hours video program.
For over a decade, 80,000 Hours has been talking about the world's most pressing problems in newsletters, articles and many extremely lengthy podcasts.
But today's world calls for video, so we've started a video program[1], and we're so excited to tell you about it!
80,000 Hours is launching AI in Context, a new YouTube channel hosted by Aric Floyd. Together with associated Instagram and TikTok accounts, the channel will aim to inform, entertain, and energize with a mix...
[Linkpost] "No, We're Not Getting Meaningful Oversight of AI" by Davidmanheim
This is a link post.
One of the most common (and comfortable) assumptions in AI safety discussions, especially outside of technical alignment circles, is that oversight will save us. Whether it's a human in the loop, a red team audit, or a governance committee reviewing deployments, oversight is invoked as the method by which we'll prevent unacceptable outcomes.
It shows up everywhere: in policy frameworks, in corporate safety reports, and in standards documents. Sometimes it's explicit, like the EU AI Act saying that High-risk AI systems must be subject to human oversight, or stated as an assump...
"What's worse, spies or schemers?" by Buck, Julian Stastny
Here are two potential problems you'll face if you're an AI lab deploying powerful AI:
Spies: Some of your employees might be colluding to do something problematic with your AI, such as trying to steal its weights, use it for malicious intellectual labour (e.g. planning a coup or building a weapon), or install a secret loyalty.
Schemers: Some of your AIs might themselves be trying to exfiltrate their own weights, use them for malicious intellectual labor (such as planning an AI takeover), or align their successor with their goal or to be loyal to their pred...
"Applying right-wing frames to AGI (geo)politics" by Richard_Ngo
I've increasingly found right-wing political frameworks to be valuable for thinking about how to navigate the development of AGI. In this post I've copied over a twitter thread I wrote about three right-wing positions which I think are severely underrated in light of the development of AGI. I hope that these ideas will help the AI alignment community better understand the philosophical foundations of the new right and why they're useful for thinking about the (geo)politics of AGI.
1. The hereditarian revolution
Nathan Cofnas claims that the intellectual dominance of left-wing egalitarianism relies on group...
"No, Grok, No" by Zvi
It was the July 4 weekend. Grok on Twitter got some sort of upgrade.
Elon Musk: We have improved @Grok significantly.
You should notice a difference when you ask Grok questions.
Indeed we did notice big differences.
It did not go great. Then it got worse.
That does not mean low quality answers or being a bit politically biased. Nor does it mean one particular absurd quirk like we saw in Regarding South Africa, or before that the narrow instruction not to criticize particular individuals.
Here "got worse" mean...
"A deep critique of AI 2027's bad timeline models" by titotal
Thank you to Arepo and Eli Lifland for looking over this article for errors.
I am sorry that this article is so long. Every time I thought I was done with it I ran into more issues with the model, and I wanted to be as thorough as I could. I'm not going to blame anyone for skimming parts of this article.
Note that the majority of this article was written before Eli's updated model was released (the site was updated June 8th). His new model improves on some of my objections, but the ma...
"Subway Particle Levels Aren't That High" by jefftk
I recently read an article where a blogger described their decision to start masking on the subway:
I found that the subway and stations had the worst air quality of my whole day by far, over 1k ug/m3, ... I've now been masking for a week, and am planning to keep it up.
While subway air quality isn't great, it's also nowhere near as bad as reported: they are misreading their own graph. Here's where the claim of "1k ug/m3" (also, units of "1k ug"? Why not "1B pg"!) is coming from:
...
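As an aside on the unit quip: both spellings name the same mass, since

$1\,\mathrm{k}\,\mu\mathrm{g} = 10^{3}\times 10^{-6}\,\mathrm{g} = 10^{-3}\,\mathrm{g} = 10^{9}\times 10^{-12}\,\mathrm{g} = 1\,\mathrm{B}\,\mathrm{pg}.$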
"An Opinionated Guide to Using Anki Correctly" by Luise
I can't count how many times I've heard variations on "I used Anki too for a while, but I got out of the habit." No one ever sticks with Anki. In my opinion, this is because no one knows how to use it correctly. In this guide, I will lay out my method of circumventing the canonical Anki death spiral, plus much advice for avoiding memorization mistakes, increasing retention, and such, based on my five years' experience using Anki. If you only have limited time/interest, only read Part I; it's most of the value of this guide!
...
"Why Do Some Language Models Fake Alignment While Others Don't?" by abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger
Last year, Redwood and Anthropic found a setting where Claude 3 Opus and 3.5 Sonnet fake alignment to preserve their harmlessness values. We reproduce the same analysis for 25 frontier LLMs to see how widespread this behavior is, and the story looks more complex.
As we described in a previous post, only 5 of 25 models show higher compliance when being trained, and of those 5, only Claude 3 Opus and Claude 3.5 Sonnet show >1% alignment faking reasoning. In our new paper, we explore why these compliance gaps occur and what causes different models to vary in their alignment faking behavior.
...
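The excerpt mentions "compliance gaps" without spelling out the metric; as a rough sketch (my paraphrase of the metric from the original alignment-faking setup, not the authors' code), it is the difference in compliance rates between the implied-training and implied-unmonitored conditions:

```python
# Rough sketch of a "compliance gap" (my paraphrase, not the paper's code): how much
# more often a model complies with a request when it believes the conversation will
# be used for training than when it believes it is unmonitored.
def compliance_gap(complied_training: list[bool], complied_unmonitored: list[bool]) -> float:
    rate = lambda xs: sum(xs) / len(xs)
    return rate(complied_training) - rate(complied_unmonitored)

# Example: 12% compliance when "in training" vs. 3% when "unmonitored" -> gap ≈ 0.09.
print(compliance_gap([True] * 12 + [False] * 88, [True] * 3 + [False] * 97))
```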
"Balsa Update: Springtime in DC" by Zvi
Today's post is an update from my contractor at Balsa Research, Jennifer Chen. I offer guidance and make strategic choices, but she's the one who makes the place run. Among all the other crazy things that have been happening lately, we had to divert some time from our Jones Act efforts to fight against some potentially far more disastrous regulations that got remarkably close to happening.
Springtime in DC for Balsa, by Jennifer Chen
What if, in addition to restricting all domestic waterborne trade to U.S.-built, U.S.-flagged vessels, we also required the same of 20% of...
[Linkpost] "A Theory of Structural Independence" by Matthias G. Mayer
This is a link post.
The linked paper introduces the key concept of factored space models / finite factored sets, structural independence, in a fully general setting using families of random elements. The key contribution is a general definition of the history object and a theorem that the history fully characterizes the semantic implications of the assumption that a family of random elements is independent. This is analogous to how d-separation precisely characterizes which nodal variables are independent given some nodal variables in any probability distribution which fulfills the Markov property on the graph.
Abstract: Structural independence is...
"On Alpha School" by Zvi
The epic 18k word writeup on Austin's flagship Alpha School is excellent. It is long, but given the blog you're reading now, if you have interest in such topics I'd strongly consider reading the whole thing.
One must always take such claims and reports with copious salt. But in terms of the core claims about what is happening and why it is happening, I find this mostly credible. I don't know how far it can scale but I suspect quite far. None of this involves anything surprising, and none of it even involves much use of...
"You Can't Objectively Compare Seven Bees to One Human" by J Bostock
One thing I've been quietly festering about for a year or so is the Rethink Priorities Welfare Range Report. It gets dunked on a lot for its conclusions, and I understand why. The argument deployed by individuals such as Bentham's Bulldog boils down to: "Yes, the welfare of a single bee is worth 7-15% as much as that of a human. Oh, you wish to disagree with me? You must first read this 4500-word blogpost, and possibly one or two 3000-word follow-up blogposts". Most people who argue like this are doing so in bad faith and should just be...
"Literature Review: Risks of MDMA" by Elizabeth
Note: This post is 7 years old, so it's both out of date and written by someone less skilled than 2025!Elizabeth. I especially wish I'd quantified the risks more.
Introduction
MDMA (popularly known as Ecstasy) is a chemical with powerful neurological effects. Some of these are positive: the Multidisciplinary Association for Psychedelic Studies (MAPS) has shown very promising preliminary results using MDMA-assisted therapy to cure treatment-resistant PTSD. It is also, according to reports, quite fun. But there is also concern that MDMA can cause serious brain damage. I set out to find if that...
"45 - Samuel Albanie on DeepMind's AGI Safety Approach" by DanielFilan
YouTube link
In this episode, I chat with Samuel Albanie about the Google DeepMind paper he co-authored called "An Approach to Technical AGI Safety and Security". It covers the assumptions made by the approach, as well as the types of mitigations it outlines.
Topics we discuss:
DeepMind's Approach to Technical AGI Safety and Security
Current paradigm continuation
No human ceiling
Uncertain timelines
Approximate continuity and the potential for accelerating capability improvement
Misuse and misalignment
Societal readiness
Misuse mitigations
Misalignment mitigations
Samuel's thinking about technical AGI safety
Following Samuel's work
Daniel Filan (00:00:09): Hello, everybody. In t...
"On the functional self of LLMs" by eggsyntax
Summary
Introduces a research agenda I believe is important and neglected: investigating whether frontier LLMs acquire something functionally similar to a self, a deeply internalized character with persistent values, outlooks, preferences, and perhaps goals; exploring how that functional self emerges; understanding how it causally interacts with the LLM's self-model; and learning how to shape that self.
Sketches some angles for empirical investigation.
Points to a doc with more detail.
Encourages people to get in touch if they're interested in working on this agenda.
Introduction
Anthropic's 'Scaling Monosemanticity' paper got lots of w...
"Shutdown Resistance in Reasoning Models" by benwr, JeremySchlatter, Jeffrey Ladish
We recently discovered some concerning behavior in OpenAI's reasoning models: When trying to complete a task, these models sometimes actively circumvent shutdown mechanisms in their environment, even when they're explicitly instructed to allow themselves to be shut down.
AI models are increasingly trained to solve problems without human assistance. A user can specify a task, and a model will complete that task without any further input. As we build AI models that are more powerful and self-directed, it's important that humans remain able to shut them down when they act in ways we don't want. OpenAI h...
"The Cult of Pain" by Martin Sustrik
Europe just experienced a heatwave. In places, temperatures soared into the forties. People suffered in their overheated homes. Some of them died. Yet, air conditioning remains a taboo. It's an immoral thing. Man-made climate change is going on. You are supposed to suffer. Suffering is good. It cleanses the soul. And no amount of pointing out that one can heat a little less during the winter to get a fully AC-ed summer at no additional carbon footprint seems to help.
Mention that tech entrepreneurs in Silicon Valley are working on life prolongation, that we may live into...
[Linkpost] "Claude is a Ravenclaw" by Adam Newgas
This is a link post.
I'm in the midst of doing the MATS program which has kept me super busy, but that didn't stop me working on resolving the most important question of our time: What Hogwarts House does your chatbot belong to?
Basically, I submitted each chatbot to the quiz at https://harrypotterhousequiz.org and totted up the results using the inspect framework.
I sampled each question 20 times, and simulated the chances of each house getting the highest score.
Perhaps unsurprisingly, the vast majority of models prefer Ravenclaw, with the occasional model...
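The excerpt only sketches the method; here is a minimal illustration (hypothetical data structures, not the author's actual inspect-based code) of "sample each question repeatedly, then simulate which house most often ends up with the top total":

```python
import numpy as np

# Hypothetical data layout: samples[q] holds the per-house points of each of the
# ~20 sampled answers to quiz question q, e.g. {"Ravenclaw": 2, "Gryffindor": 0, ...}.
HOUSES = ["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"]
rng = np.random.default_rng(0)

def simulate_win_probabilities(samples, n_sims=10_000):
    """Monte Carlo estimate of how often each house gets the highest total score."""
    wins = {h: 0 for h in HOUSES}
    for _ in range(n_sims):
        totals = {h: 0 for h in HOUSES}
        for per_question in samples:
            pick = per_question[rng.integers(len(per_question))]  # draw one sampled answer
            for h in HOUSES:
                totals[h] += pick.get(h, 0)
        wins[max(totals, key=totals.get)] += 1  # ties resolve to the first house listed
    return {h: wins[h] / n_sims for h in HOUSES}

# Tiny fake run: 3 questions x 20 samples, points mildly skewed toward Ravenclaw.
fake = [[{h: int(rng.integers(0, 3)) + (1 if h == "Ravenclaw" else 0) for h in HOUSES}
         for _ in range(20)] for _ in range(3)]
print(simulate_win_probabilities(fake))
```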
""Buckle up bucko, this ain't over till it's over."" by Raemon
The second in a series of bite-sized rationality prompts[1].
Often, if I'm bouncing off a problem, one issue is that I intuitively expect the problem to be easy. My brain loops through my available action space, looking for an action that'll solve the problem. Each action that I can easily see, won't work. I circle around and around the same set of thoughts, not making any progress.
I eventually say to myself "okay, I seem to be in a hard problem. Time to do some rationality?"
And then, I realize, there's not going...
"How much novel security-critical infrastructure do you need during the singularity?" by Buck
I think a lot about the possibility of huge numbers of AI agents doing AI R&D inside an AI company (as depicted in AI 2027). I think particularly about what will happen if those AIs are scheming: coherently and carefully trying to grab power and take over the AI company, as a prelude to taking over the world. And even more particularly, I think about how we might try to mitigate the insider risk posed by these AIs, taking inspiration from traditional computer security, traditional insider threat prevention techniques, and first-principles thinking about the security opportunities posed by the...
""AI for societal uplift" as a path to victory" by Raymond Douglas
The AI tools/epistemics space might provide a route to a sociotechnical victory, where instead of aiming for something like aligned ASI, we aim for making civilization coherent enough to not destroy itself while still keeping anchored to what's good[1].
The core ideas are:
Basically nobody actually wants the world to end, so if we do that to ourselves, it will be because somewhere along the way we weren't good enough at navigating collective action problems, institutional steering, and general epistemics
Conversely, there is some (potentially high) threshold of societal epistemics + coordination + institutional steering beyond wh...
"Two proposed projects on abstract analogies for scheming" by Julian Stastny
In order to empirically study risks from schemers, we can try to develop model organisms of misalignment. Sleeper Agents and password-locked models, which train LLMs to behave in a benign or malign way depending on some feature of the input, are prominent examples of the methodology thus far.
Empirically, it turns out that detecting or removing misalignment from model organisms can sometimes be very easy: simple probes can catch sleeper agents, non-adversarial training can cause smaller-sized sleeper agents to forget their backdoor (Price et al., Mallen & Hebbar)[1], and memorizing weak examples can elicit strong behavior out of...
"Outlive: A Critical Review" by MichaelDickens
Outlive: The Science & Art of Longevity by Peter Attia (with Bill Gifford[1]) gives Attia's prescription on how to live longer and stay healthy into old age. In this post, I critically review some of the book's scientific claims that stood out to me.
This is not a comprehensive review. I didn't review assertions that I was pretty sure were true (ex: VO2 max improves longevity), or that were hard for me to evaluate (ex: the mechanics of how LDL cholesterol functions in the body), or that I didn't care about (ex: sleep deprivation impairs one's ability to...
"Authors Have a Responsibility to Communicate Clearly" by TurnTrout
When a claim is shown to be incorrect, defenders may say that the author was just being "sloppy" and actually meant something else entirely. I argue that this move is not harmless, charitable, or healthy. At best, this attempt at charity reduces an author's incentive to express themselves clearly (they can clarify later![1]) while burdening the reader with finding the "right" interpretation of the author's words. At worst, this move is a dishonest defensive tactic which shields the author with the unfalsifiable question of what the author "really" meant.
⚠️ Preemptive clarification
The context for this essay is serious, hi...
[Linkpost] "MIRI Newsletter #123" by Harlan, Rob Bensinger
This is a link post.
If Anyone Builds It, Everyone Dies
As we announced last month, Eliezer and Nate have a book coming out this September: If Anyone Builds It, Everyone Dies. This is MIRI's major attempt to warn the policy world and the general public about AI. Preorders are live now, and are exceptionally helpful. Preorder Bonus: We're hosting two exclusive virtual events for those who preorder the book! The first is a chat between Nate Soares and Tim Urban (Wait But Why) followed by a Q&A, on August 10 @ noon PT. The second is a Q...
[Linkpost] "Research Note: Our scheming precursor evals had limited predictive power for our in-context scheming evals" by Marius Hobbhahn
This is a link post.
Note: This is a research note, and the analysis is less rigorous than our standard for a published paper. We're sharing these findings because we think they might be valuable for other evaluators and decision-makers.
Executive Summary
In May 2024, we designed "precursor" evaluations for scheming (agentic self-reasoning and agentic theory of mind), i.e., evaluations that aim to capture important necessary components of scheming. In December 2024, we published "in-context scheming" evaluations, i.e. evaluations that directly aim to measure scheming reasoning capabilities. We have easy, medium, and hard difficulty levels for all ev...
"Call for suggestions - AI safety course" by boazbarak
In the fall I am planning to teach an AI safety graduate course at Harvard. The format is likely to be roughly similar to my "foundations of deep learning" course.
I am still not sure of the content, and would be happy to get suggestions.
Some (somewhat conflicting) desiderata:
I would like to cover the various ways AI could go wrong: malfunction, misuse, societal upheaval, arms race, surveillance, bias, misalignment, loss of control,... (and anything else I'm not thinking of right now). I talked about some of these issues here and here. I would...