Embodied AI 101
Co-training Large Behavior Models: Data Modalities and Training Strategies for Robot Manipulation
Comprehensive evaluation of 89 policies showing optimal co-training practices mixing real robot data with sim/egocentric human videos to boost diversity and performance in large robotics foundation models.
HyDRA: Hybrid Memory for Dynamic Video World Models
Novel memory system preserving dynamic object identity and motion continuity across occlusions in video world models, addressing frozen/vanishing issues for improved predictive physics in embodied AI.
WildWorld: Dynamic World Modeling with Actions and Explicit State
Massive dataset enabling dynamic world models with explicit states and actions, supporting predictive modeling for cross-embodiment robotic control.
Omni-WorldBench: Evaluating Interactive 4D World Models
New benchmark assessing world models on interaction tasks, pushing predictive physics and video modeling towards robotics applications with action-conditioned evaluation.
SIMART: From Static Meshes to Sim-Ready Articulated Models
Unified MLLM framework with Sparse 3D VQ-VAE (70% token reduction) for part-level mesh decomposition and kinematic chain prediction, enabling physics-based robotic simulation from monolithic assets.
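The Sparse 3D VQ-VAE is the load-bearing component here: discretizing a 3D latent into codebook tokens is what makes part-level geometry cheap enough for an MLLM to consume. As a refresher, here is a minimal sketch of the core vector-quantization step with straight-through gradients; the shapes, names, and commitment weight are illustrative assumptions, not SIMART's actual code:

```python
import torch
import torch.nn.functional as F

def vector_quantize(z, codebook):
    """Nearest-neighbor vector quantization with straight-through gradients.

    z:        (N, D) continuous latents from a 3D encoder (assumed shape)
    codebook: (K, D) learnable code vectors
    Returns quantized latents, code indices, and the VQ training loss.
    """
    dists = torch.cdist(z, codebook) ** 2      # (N, K) squared distances
    idx = dists.argmin(dim=1)                  # nearest code per latent
    z_q = codebook[idx]                        # (N, D) quantized latents

    # Standard VQ-VAE objective: move codes toward encoder outputs, and
    # commit encoder outputs to their assigned codes.
    codebook_loss = F.mse_loss(z_q, z.detach())
    commit_loss = F.mse_loss(z, z_q.detach())

    # Straight-through estimator: forward pass uses z_q, gradients flow to z.
    z_q = z + (z_q - z).detach()
    return z_q, idx, codebook_loss + 0.25 * commit_loss
```

The claimed 70% token reduction plausibly comes from emitting codes only for occupied regions of the 3D grid rather than every voxel, though that reading is an inference from the summary rather than a confirmed detail.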
EgoSim: An Egocentric World Simulator for Embodied Interaction
Closed-loop egocentric simulator persistently updating 3D scene state to generate spatially consistent interaction videos for continuous simulation, enabling cross-embodiment transfer from human videos to robotic manipulation tasks.
Digit's New Motor Cortex: Sim-to-Real RL for Whole-Body Control
New whole-body motion capabilities trained from mocap/teleop data via sim-to-real reinforcement learning, deployable on hardware overnight.
EgoNav: Diffusion-Based Humanoid Navigation from Human Egocentric Video
Diffusion-based humanoid navigation trained solely on 5 hours of human egocentric video data, enabling zero-shot deployment on Unitree G1 for complex behaviors like handling glass walls, crowds, and dynamic obstacles via 360° visual memory and hybrid trajectory sampling; upcoming release of dataset, models, and code.
CaP-X: A Code-as-Policy Framework for Robot Manipulation
Comprehensive open-source agentic robotics framework treating VLMs/LLMs as code-generating APIs for perception (SAM3, Molmo) and control (IK, grasping), with CaP-Gym benchmark of 187 diverse manipulation tasks (tabletop, bimanual, mobile; sim/real) and CaP-Bench evaluating 12 frontier models; demonstrates rapid RL gains (7B model from 20% to 72% success) with strong sim-to-real transfer.
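For readers new to the code-as-policy pattern: rather than emitting motor commands, the language model writes a short program against fixed perception and control primitives, which is then executed. A hedged sketch of that loop follows; the API names and the `llm.generate` interface are placeholders, not CaP-X's real surface:

```python
# Hypothetical code-as-policy loop: the model emits Python against a fixed
# robot API, and we execute it in a namespace exposing only those primitives.

ROBOT_API_DOC = """
detect(name: str) -> Pose   # open-vocabulary object detection
grasp(pose: Pose) -> bool   # plan and execute a grasp via IK
place(pose: Pose) -> bool   # place the currently held object
"""

def run_instruction(llm, robot, instruction: str) -> None:
    prompt = (
        "You control a robot through this API:\n"
        f"{ROBOT_API_DOC}\n"
        f"Write Python code to: {instruction}\n"
    )
    code = llm.generate(prompt)  # assumed interface: returns a code string
    # Whitelist the primitives the generated program may call.
    namespace = {"detect": robot.detect, "grasp": robot.grasp,
                 "place": robot.place}
    exec(code, namespace)  # real systems sandbox this step

# For "put the apple in the bowl" the model might emit:
#   apple = detect("apple"); bowl = detect("bowl")
#   grasp(apple); place(bowl)
```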
Embodied Intelligence Breakthrough: Generalist AI’s GEN-1 Robots
We've created GEN-1, our latest milestone in scaling robot learning. We believe it is the first general-purpose AI model to cross a new performance threshold: mastery of simple physical tasks. It raises average success rates to 99% on tasks where previous models achieve 64%, completes tasks roughly 3x faster than the prior state of the art, and requires only 1 hour of robot data for each of these results. GEN-1 unlocks commercial viability across a broad range of applications, and while it cannot solve every task today, it is a significant step toward our mission of creating generalist intelligence for the physical world.
CaP-X: LMs' First Physical Exam
A novel benchmark that evaluates language models on physical examination tasks, testing their ability to understand and perform clinical physical exam procedures in simulated environments. This work introduces a comprehensive evaluation framework for AI systems in medical/clinical settings.
AI Model Collapse: The Danger of Training on AI-Generated Data
Demonstrated that LLMs trained recursively on AI-generated data suffer model collapse, a degenerative process in which they lose their grasp of the true data distribution. Sparked critical debates on data provenance and the preservation of human-generated training data; widely cited as one of 2024's most influential papers.
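The degenerative loop is easy to reproduce in miniature: fit a model to data, sample from the fitted model, refit on those samples, repeat. A toy Gaussian version of the recursion the paper studies (our illustration, not the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=1000)  # "human" data

for generation in range(10):
    # Fit the current generation's model (here: a Gaussian)...
    mu, sigma = data.mean(), data.std()
    # ...then train the next generation only on that model's own samples.
    data = rng.normal(loc=mu, scale=sigma, size=1000)
    print(f"gen {generation}: sigma = {sigma:.3f}")

# Estimation error compounds across generations: the fitted variance drifts
# and shrinks, and rare tail events disappear from the data, which is the
# collapse mechanism the paper formalizes for language models.
```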
High-Level Automated Reasoning with Qwen2.5-7B
Qwen2.5-7B achieved 79.6% on the MATH benchmark, surpassing GPT-4o, by combining atomic reasoning actions with Monte Carlo Tree Search. Demonstrated that strategic reasoning architectures can enable smaller models to outperform much larger ones.
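The mechanism behind the number is classic search: small, discrete reasoning actions wrapped in Monte Carlo Tree Search, with children selected by the standard UCT rule. A minimal sketch of that selection step (our own illustration, not the paper's code):

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0  # sum of rollout rewards through this node

def uct_score(node: Node, c: float = 1.4) -> float:
    """Upper Confidence bound for Trees: exploit mean reward,
    explore rarely visited children."""
    if node.visits == 0:
        return float("inf")  # always try unvisited actions first
    exploit = node.value / node.visits
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploit + explore

def select_child(node: Node) -> Node:
    # Each child is one atomic reasoning action (e.g. decompose the problem,
    # compute a step, verify a step) applied to the current partial solution.
    return max(node.children, key=uct_score)
```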
MolmoBot: Opening a New Era of Simulated Training in Robotics
Vision-language-action (VLA) model enabling zero-shot sim-to-real transfer for mobile manipulation tasks, trained entirely in simulation without real robot data, achieving a 79.2% success rate on real-world DROID evaluations, doubling the π₀.₅ baseline's 39.2%.
DexWM: Leveraging Human Videos for Dexterous Robot World Models
Dataset of robot trajectories designed for training world models to learn dexterous hand-object interactions directly from human videos, released on Hugging Face.
World Models in Robotics
Technical survey categorizing world models into action-conditioned, video-inverse dynamics, and joint world-action models (WAMs), discussing their generalization, video data leverage, and trends for closing the robotics data gap.
LeWorldModel: A Stable JEPA World Model from Pixels
Stable end-to-end JEPA world model trained directly from pixels using a simple MSE prediction loss and SIGReg anti-collapse regularization, enabling latent-space planning in under one second with a 15M-parameter model, and showing emergent spatial structure that outperforms prior methods.
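For context: a JEPA world model predicts the *embedding* of the next observation rather than its pixels, an objective that collapses to a constant unless regularized. Here is a rough sketch of the training step described; SIGReg's exact form differs, so a simple variance penalty stands in to show where the anti-collapse term enters:

```python
import torch
import torch.nn.functional as F

def jepa_step(encoder, predictor, target_encoder, obs_t, action_t, obs_next):
    """One JEPA world-model update: predict the next latent, penalize collapse.

    target_encoder is typically an EMA copy of encoder (a stop-gradient target).
    """
    z_t = encoder(obs_t)                    # (B, D) current latent
    with torch.no_grad():
        z_next = target_encoder(obs_next)   # (B, D) target latent

    z_pred = predictor(z_t, action_t)       # action-conditioned prediction
    pred_loss = F.mse_loss(z_pred, z_next)  # plain MSE in latent space

    # Anti-collapse: keep each latent dimension's batch std away from zero.
    # (A stand-in for the paper's SIGReg regularizer.)
    std = z_t.std(dim=0)
    collapse_penalty = F.relu(1.0 - std).mean()
    return pred_loss + 0.1 * collapse_penalty
```

With an objective this cheap and only 15M parameters, planning can run as search or gradient descent directly in latent space, which is presumably where the sub-second planning figure comes from.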
World Models for Robots: The Next Big Leap?
Technical overview defining world models in robotics, their potential to solve diverse problems via video prediction, and key enablers like scale.
MolmoSpaces: A Large-Scale Simulation Ecosystem for Embodied AI
Large-scale open ecosystem of 230,000 procedurally-generated home environments with 48,000 manipulable objects, enabling simulation for both robot navigation and manipulation benchmarking/development.
Harnessing Long-Running AI in Embodied Systems
As AI moves from quick Q&A to marathon tasks, designers grapple with continuity. This episode explores how Anthropic's harness design principles translate to embodied AI: robots that need to maintain context across long-running missions.
Dreaming in Pixels: World Models for Robot Intelligence
Surveys action-conditioned, video, and joint World Action Models (e.g., DreamZero WAMs) for robotics, highlighting their potential to solve generalization, long-horizon planning, and data efficiency in manipulation via predictive video modeling.
HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations
Whole-Body Mobile Manipulation Interface (HoMMI) that learns bimanual and whole-body manipulation, long-horizon navigation, and active perception directly from egocentric human demonstrations without teleoperation.
VEGA-3D: Imagining 3D Worlds with Video Diffusion to Teach LLMs Spatial Reasoning
Plug-and-play framework that teaches multimodal LLMs spatial reasoning by extracting implicit 3D priors from video diffusion models, supporting geometric scene understanding and embodied decision-making without explicit 3D supervision.
TurboQuant: Redefining AI Efficiency with Extreme Compression
This episode explores TurboQuant, a revolutionary set of quantization algorithms from Google Research that redefines AI efficiency through extreme compression.
We dive deep into how TurboQuant addresses one of AI's most pressing challenges: the memory bottleneck created by high-dimensional vectors in key-value caches. The research introduces theoretically grounded quantization methods that enable massive compression for large language models and vector search engines without sacrificing performance.
Key topics covered:
- The theoretical foundations of TurboQuant's quantization algorithms
- How extreme compression works for LLMs and vector search engines
- Impact on high-dimensional vectors and key-value cache memory bottlenecks
- Performance metrics and...
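The episode centers on shrinking the KV cache; the baseline any such method is measured against is plain uniform quantization of each cached vector. A sketch of that baseline for contrast (this is ordinary 4-bit scalar quantization, not TurboQuant's algorithm):

```python
import numpy as np

def quantize_kv(x: np.ndarray, bits: int = 4):
    """Per-vector uniform quantization of a KV-cache entry.

    x: (d,) float vector. Returns integer codes plus the (scale, offset)
    needed to dequantize; 4-bit codes cut memory roughly 4x vs fp16.
    """
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize_kv(codes, scale, lo):
    return codes * scale + lo

x = np.random.randn(128).astype(np.float32)
codes, scale, lo = quantize_kv(x)
err = np.abs(dequantize_kv(codes, scale, lo) - x).max()
print(f"max abs error at 4 bits: {err:.4f}")
```

TurboQuant's contribution, per the episode, is quantizers that beat this kind of baseline with theoretical guarantees on distortion; the specifics are what the episode walks through.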
FlashAttention-3: Fast & Accurate Attention with Asynchrony & Low-Precision
Major efficiency leap for Transformer attention mechanisms, enabling faster training/inference on long sequences with low-precision compute.
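A capsule-length entry, but the trick shared by every FlashAttention version is worth a sketch: compute softmax(QKᵀ)V in tiles, carrying a running max and normalizer so the full attention matrix is never materialized. A NumPy illustration of that online softmax for a single query (the fused, asynchronous kernel is the paper's actual contribution):

```python
import numpy as np

def attention_online(q, K, V, tile: int = 128):
    """Single-query attention via online softmax over tiles of K/V.

    q: (d,), K: (n, d), V: (n, d). Matches softmax(q @ K.T / sqrt(d)) @ V,
    but touches keys/values one tile at a time with O(tile) extra memory.
    """
    d = q.shape[0]
    m = -np.inf        # running max of scores (numerical stability)
    l = 0.0            # running softmax normalizer
    acc = np.zeros(d)  # running weighted sum of values

    for start in range(0, K.shape[0], tile):
        s = K[start:start + tile] @ q / np.sqrt(d)  # scores for this tile
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                   # rescale old accumulators
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[start:start + tile]
        m = m_new
    return acc / l
```

FlashAttention-3 layers Hopper-specific asynchrony (warp specialization and TMA-driven loads) and FP8 low-precision paths on top of this same recurrence.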
EgoVerse: An Egocentric Data Ecosystem for Scaling Robot Learning
Ecosystem with over 1300 hours of egocentric human video data spanning 240 scenes and 2000+ tasks, designed for scalable robot policy training via behavior cloning; includes cloud infrastructure, data viewer, and human-to-robot transfer algorithms to enable cross-embodiment learning without teleoperation.
HSImul3R: Physics-Driven Reconstruction of Human–Scene Interactions
Physics-in-the-loop bi-directional optimization pipeline reconstructing stable, simulation-ready 3D human-scene interactions from casual videos, deployable directly to humanoid robots for world modeling and manipulation.
DreamZero: World Action Models Are Zero-Shot Policies
Introduces World Action Models (WAMs), a family of 14B-parameter autoregressive diffusion models that jointly predict video and robotic actions to enable zero-shot generalization across manipulation tasks, outperforming fine-tuned Vision-Language-Action models on benchmarks like MolmoSpaces and RoboArena.