Embodied AI 101

40 Episodes

By: Shaoqing Tan

Stay in the loop on research in AI and physical intelligence.

Co-training Large Behavior Models: Data Modalities and Training Strategies for Robot Manipulation
Today at 5:42 AM

Comprehensive evaluation of 89 policies identifying optimal co-training practices that mix real robot data with simulation and egocentric human video to boost data diversity and performance in large robotics foundation models.


HyDRA: Hybrid Memory for Dynamic Video World Models
Today at 5:31 AM

Novel memory system preserving dynamic object identity and motion continuity across occlusions in video world models, addressing frozen/vanishing issues for improved predictive physics in embodied AI.


WildWorld: Dynamic World Modeling with Actions and Explicit State
Yesterday at 2:29 PM

Massive dataset enabling dynamic world models with explicit states and actions, supporting predictive modeling for cross-embodiment robotic control.


Omni-WorldBench: Evaluating Interactive 4D World Models
Yesterday at 2:18 PM

New benchmark assessing world models on interaction tasks, pushing predictive physics and video modeling towards robotics applications with action-conditioned evaluation.


SIMART: From Static Meshes to Sim-Ready Articulated Models
Yesterday at 5:37 AM

Unified MLLM framework with Sparse 3D VQ-VAE (70% token reduction) for part-level mesh decomposition and kinematic chain prediction, enabling physics-based robotic simulation from monolithic assets.


EgoSim: An Egocentric World Simulator for Embodied Interaction
Yesterday at 5:23 AM

Closed-loop egocentric simulator persistently updating 3D scene state to generate spatially consistent interaction videos for continuous simulation, enabling cross-embodiment transfer from human videos to robotic manipulation tasks.


Digit's New Motor Cortex: Sim-to-Real RL for Whole-Body Control
Last Friday at 2:13 PM

New whole-body motion capabilities trained from mocap/teleoperation data with sim-to-real reinforcement learning, deployable on hardware overnight.


EgoNav: Diffusion-Based Humanoid Navigation from Human Egocentric Video
Last Friday at 5:32 AM

Diffusion-based humanoid navigation trained solely on 5 hours of human egocentric video data, enabling zero-shot deployment on Unitree G1 for complex behaviors like handling glass walls, crowds, and dynamic obstacles via 360° visual memory and hybrid trajectory sampling; upcoming release of dataset, models, and code.


CaP-X: A Code-as-Policy Framework for Robot Manipulation
Last Friday at 5:19 AM

Comprehensive open-source agentic robotics framework treating VLMs/LLMs as code-generating APIs for perception (SAM3, Molmo) and control (IK, grasping), with CaP-Gym benchmark of 187 diverse manipulation tasks (tabletop, bimanual, mobile; sim/real) and CaP-Bench evaluating 12 frontier models; demonstrates rapid RL gains (7B model from 20% to 72% success) with strong sim-to-real transfer.


Embodied Intelligence Breakthrough: Generalist AI’s GEN-1 Robots
Last Thursday at 7:58 PM

We've created GEN-1, our latest milestone in scaling robot learning. We believe it to be the first general-purpose AI model that crosses a new performance threshold: mastery of simple physical tasks. It improves average success rates to 99% on tasks where previous models achieve 64%, completes tasks roughly 3x faster than the state of the art, and requires only 1 hour of robot data for each of these results. GEN-1 unlocks commercial viability across a broad range of applications—and while it cannot solve all tasks today, it is a significant step towards our mission of creating generalist intelligence for the physical world.


CaP-X: LMs' First Physical Exam
Last Thursday at 7:43 PM

A novel benchmark that evaluates language models on physical examination tasks, testing their ability to understand and perform clinical physical exam procedures in simulated environments. This work introduces a comprehensive evaluation framework for AI systems in medical/clinical settings.


AI Model Collapse: The Danger of Training on AI-Generated Data
Last Tuesday at 2:36 PM

Demonstrated that LLMs trained recursively on AI-generated data suffer model collapse, a degenerative process where they lose grasp of true data distributions. Sparked critical debates on data provenance and the importance of preserving human-generated training data.


High-Level Automated Reasoning with Qwen2.5-7B
Last Tuesday at 2:35 PM

Qwen2.5-7B achieved 79.6% on MATH benchmark, surpassing GPT-4o, by employing atomic reasoning actions combined with Monte Carlo Tree Search. Demonstrated that strategic reasoning architectures can enable smaller models to outperform much larger ones.


Co-Training Large Behavior Models: Multimodal Data for Robot Manipulation
Last Tuesday at 5:19 AM

Explores data modalities and co-training strategies to enhance large behavior models (foundation models) for improved performance in robot manipulation tasks, supporting end-to-end learning and cross-embodiment generalization.


MolmoBot: Opening a New Era of Simulated Training in Robotics
Last Monday at 2:42 PM

Vision-language-action model enabling zero-shot sim-to-real transfer for mobile manipulation tasks, trained entirely in simulation without real-world data; achieves a 79.2% success rate on real-world DROID evaluations, outperforming π₀.₅ by 2×.


HyDRA: Hybrid Memory for Dynamic Video World Models
Last Monday at 5:20 AM

Memory architecture preserving identity and motion continuity for out-of-view dynamic subjects, addressing frozen/vanishing issues in video world models.


DexWM: Leveraging Human Videos for Dexterous Robot World Models
Last Monday at 5:18 AM

Dataset of robot trajectories designed for training world models to learn dexterous hand-object interactions directly from human videos.


World Models in Robotics
03/29/2026

Technical survey categorizing world models into action-conditioned, video-inverse dynamics, and joint world-action models (WAMs), discussing their generalization, video data leverage, and trends for closing the robotics data gap.


MolmoBot: Training Robot Manipulation Entirely in Simulation
03/28/2026

Vision-language-action (VLA) model enabling zero-shot sim-to-real transfer for mobile manipulation tasks, trained entirely in simulation without real robot data; achieves 79.2% success on real-world DROID benchmarks, outperforming baselines by 2×.


SIMART: Decomposing Monolithic Meshes into Sim-Ready Articulated Assets
03/28/2026

Unified MLLM framework with Sparse 3D VQ-VAE that reduces tokens by 70% for efficient part-level decomposition and kinematic prediction in physics-based robotic simulations.


LeWorldModel: A Stable JEPA World Model from Pixels
03/28/2026

Stable end-to-end JEPA world model trained directly from pixels using a simple MSE prediction loss and SIGReg anti-collapse regularization, enabling latent planning in under 1 second with a 15M-parameter model and emergent spatial structure that outperforms prior methods.


World Models for Robots: The Next Big Leap?
03/27/2026

Technical overview defining world models in robotics, their potential to solve diverse problems via video prediction, and key enablers like scale.


MolmoSpaces: A Large-Scale Simulation Ecosystem for Embodied AI
03/27/2026

Large-scale open ecosystem of 230,000 procedurally generated home environments with 48,000 manipulable objects, enabling simulation for both robot navigation and manipulation benchmarking and development.


DreamZero: A 14B-Parameter Vision-to-Action World Model for Robotics
03/27/2026

UNVERIFIED: 14B parameter autoregressive diffusion model that jointly models video and actions for state-of-the-art generalization on robotics benchmarks like MolmoSpaces and RoboArena, targeting generalizable manipulation beyond fine-tuned VLAs.


Harnessing Long-Running AI in Embodied Systems
03/27/2026

As AI moves from quick Q&A to marathon tasks, designers grapple with continuity. This episode explores how Anthropic's harness design principles translate to embodied AI: robots that need to maintain context across long-running missions.


MolmoSpaces: A Unified Simulation Ecosystem for Robot Navigation and Manipulation
03/27/2026

Large-scale open ecosystem for robot navigation and manipulation, serving as a benchmark for evaluating generalization in world models and policies.


DreamZero: Vision-Driven World Models Empowering Zero-Shot Robot Policies
03/26/2026

14B parameter autoregressive diffusion model jointly predicting video and actions (WAMs), achieving SOTA generalization on real-world manipulation benchmarks like MolmoSpaces and RoboArena via cross-embodiment learning.


Dreaming in Pixels: World Models for Robot Intelligence
03/26/2026

Surveys action-conditioned, video, and joint World Action Models (e.g., DreamZero WAMs) for robotics, highlighting their potential to solve generalization, long-horizon planning, and data efficiency in manipulation via predictive video modeling.


HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations
03/26/2026

Whole-Body Mobile Manipulation Interface (HoMMI) that learns bimanual and whole-body manipulation, long-horizon navigation, and active perception directly from egocentric human demonstrations without teleoperation.


VEGA-3D: Imagining 3D Worlds with Video Diffusion to Teach LLMs Spatial Reasoning
03/26/2026

Plug-and-play framework that teaches multimodal LLMs spatial reasoning by extracting implicit 3D priors from video diffusion models, supporting geometric scene understanding and embodied decision-making without explicit 3D supervision.


TurboQuant: Redefining AI Efficiency with Extreme Compression
03/26/2026

This episode explores TurboQuant, a revolutionary set of quantization algorithms from Google Research that redefines AI efficiency through extreme compression.

We dive deep into how TurboQuant addresses one of AI's most pressing challenges: the memory bottleneck created by high-dimensional vectors in key-value caches. The research introduces theoretically grounded quantization methods that enable massive compression for large language models and vector search engines without sacrificing performance.

Key topics covered:

- The theoretical foundations of TurboQuant's quantization algorithms
- How extreme compression works for LLMs and vector search engines
- Impact on high-dimensional vectors and key-value cache memory bottlenecks
- Performance metrics and...


DexWM: Learning Dexterous Object Manipulation from Human Videos
03/25/2026

Dataset of robot trajectories designed for training world models that learn dexterous hand-object interactions from human videos, released on Hugging Face.


FlashAttention-3: Fast & Accurate Attention with Asynchrony & Low-Precision
03/25/2026

Major efficiency leap for Transformer attention mechanisms, enabling faster training/inference on long sequences with low-precision compute.


When AI Trains on Its Own Output: The Model Collapse Problem
03/25/2026

Warns of "model collapse" in LLMs trained on synthetic data from prior models, urging preservation of human-generated data. One of 2024's most influential papers.


MolmoBot: A Vision-Language Model for Zero-Shot Robot Manipulation
03/24/2026

Vision-language model (VLM) for zero-shot robot manipulation, trained entirely in simulation without real-world data; achieves 79.2% success rate on real-world tabletop tasks, outperforming π₀.₅ baseline at 39.2%.


LeWorldModel: Stable End-to-End JEPA from Pixels
03/24/2026

A stable end-to-end Joint Embedding Predictive Architecture (JEPA) trained directly from pixels that enables robust world modeling for embodied AI systems.


EgoVerse: An Egocentric Data Ecosystem for Scaling Robot Learning
03/24/2026

Ecosystem with over 1300 hours of egocentric human video data spanning 240 scenes and 2000+ tasks, designed for scalable robot policy training via behavior cloning; includes cloud infrastructure, data viewer, and human-to-robot transfer algorithms to enable cross-embodiment learning without teleoperation.


HSImul3R: Physics-Driven Reconstruction of Human–Scene Interactions
03/24/2026

Physics-in-the-loop bi-directional optimization pipeline reconstructing stable, simulation-ready 3D human-scene interactions from casual videos, deployable directly to humanoid robots for world modeling and manipulation.


MolmoSpaces: A Large-Scale Open Ecosystem for Robot Navigation and Manipulation
03/23/2026

Open-source suite of large-scale simulation environments and benchmarks designed for advancing end-to-end learning in robot navigation and manipulation across multiple embodiments.


DreamZero: World Action Models Are Zero-Shot Policies
03/23/2026

Introduces World Action Models (WAMs), a family of 14B-parameter autoregressive diffusion models that jointly predict video and robotic actions to enable zero-shot generalization across manipulation tasks, outperforming fine-tuned Vision-Language-Action models on benchmarks like MolmoSpaces and RoboArena.