Embodied AI 101
SERF: 4D Latent Mapping for Long-Horizon Mobile Manipulation
Embeds both the robot and environment into a shared 4D latent space augmented with forward-kinematics robot points, enabling a vision-language-action model to handle dynamic scenes and long-horizon memory. Outperforms image-only VLA baselines on the BEHAVIOR-1K benchmark for mobile manipulation.
ViserDex: Visual Sim-to-Real for Robust Dexterous In-Hand Reorientation
A single-camera sim-to-real framework that uses physically consistent 3D Gaussian Splatting augmentations to achieve zero-shot transfer of dexterous in-hand reorientation policies to an Allegro hand. The approach trains entirely on consumer hardware while maintaining high fidelity to real-world dynamics.
DexSkin: A High-Coverage, Conformable "Electronic Skin" for Robot Fingers
Introduces a high-coverage, conformable robotic skin hardware system designed to improve data collection and policy learning for contact-rich, dexterous manipulation tasks. The system provides rich tactile sensing coverage to enable more capable robot manipulation policies.
EBench: A Diagnostic Benchmark for Generalist Manipulation Policies
A CAT-scan-style diagnostic benchmark for robot foundation models that evaluates policies such as π0, π0.5, and Qwen-RobotManip beyond a single success rate. The benchmark is designed to distinguish genuine generalization from overfitting to demonstrations in generalist manipulation policies.
VITRA: A Foundation for Dexterous VLA via Human Video Pretraining
A scalable VLA pretraining pipeline that converts unstructured egocentric human videos into robot training data, trains a dexterous hand VLA, and fine-tunes on robot data, achieving strong zero-shot generalization and real-robot dexterous manipulation.
DexWM: A Dexterous Manipulation World Model from Human Videos
A dexterous manipulation world model pretrained on 829 hours of EgoDex human data and DROID robot data using conditioned diffusion transformers, enabling open-loop rollouts and sim-to-real transfer with minimal robot fine-tuning.
PoLAR: Factorizing Extent and Mode in Latent Actions for Robot Policy Learning
Introduces PoLAR, a method that factorizes latent action representations into extent and mode components to improve robot policy learning efficiency and generalization.
Continual Robot Policy Learning via Variational Neural Dynamics
Proposes a variational neural dynamics framework for continual robot policy learning, enabling robots to acquire new skills without forgetting previously learned ones.
PhysisForcing: Physics-Reinforced World Models for Robotic Manipulation
A plug-and-play training framework that enforces physical plausibility in robotic video generation models, achieving state-of-the-art results on R-Bench, PAI-Bench, and EZS-Bench. Lifts the WorldArena success rate from 16% to 24% with zero extra inference cost.
Translation as a Bridging Action
Replaces noisy 6DoF hand poses with relative wrist translation as a shared action space between cheap human videos and bimanual robots. Scales data-efficiently and outperforms full-pose baselines on manipulation tasks.
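One way to picture relative wrist translation as an action space is as the frame-to-frame displacement of the wrist position, with orientation dropped entirely. The sketch below assumes wrist positions are already available as (x, y, z) tuples in a common frame. The function name `relative_translations` and the sample trajectory are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

def relative_translations(wrist_positions: List[Vec3]) -> List[Vec3]:
    """Return per-step wrist displacements; orientation is ignored."""
    actions = []
    for prev, curr in zip(wrist_positions, wrist_positions[1:]):
        actions.append(tuple(c - p for p, c in zip(prev, curr)))
    return actions

# Hypothetical wrist trajectory in metres (illustrative values only).
trajectory = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5), (1.0, 2.0, 0.5)]
deltas = relative_translations(trajectory)
```

Because each action depends only on the difference between consecutive positions, the same representation can in principle be computed from a human hand track or from a robot end effector, which is the sense in which it acts as a shared space.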
Play2Perfect: Dexterous Play Pretraining for Precise Assembly
Pre-trains a dexterous hand via unstructured 'play' interactions with objects, then fine-tunes for precise assembly tasks including 0.5 mm clearance insertions and furniture screwing, achieving 33x better sample efficiency than RL from scratch.
Dexora: Open-Source VLA for High-DoF Bimanual Dexterity
First open-source Vision-Language-Action (VLA) model for dual-arm, dual-hand 36-DoF dexterous manipulation, trained on 100K simulated and 10K real trajectories with strong cross-embodiment transfer capabilities.
WorldVLA: Towards Autoregressive Action World Model
Unifies VLA and world-modeling in a single autoregressive transformer that predicts both future images and actions. Outperforms separate VLA or world models on LIBERO simulation benchmarks.
HumDex: Humanoid Dexterous Manipulation Made Easy
Targets dexterous manipulation on humanoid robots, aiming to make such capabilities simpler to develop.
ForceBand: Learning Forceful Manipulation with sEMG
Presents an open-source, low-cost sEMG wristband framework that extracts force signals from human muscle activity in videos, enabling zero-shot human-to-robot transfer of forceful manipulation policies across any robot, camera, or environment.
In-Context World Modeling for Robotic Control
Introduces ICWM, a method that learns world dynamics from just seconds of a robot's self-generated interaction data, enabling zero-shot adaptation to unseen cameras and new robot morphologies without any fine-tuning.
WOLF-VLA: Vision-Language-Action for Humanoid Walking
Introduces a framework integrating vision-language-action models for whole-body humanoid locomotion, addressing optimal control and learning for complex bipedal behaviors. Combines VLA learning with locomotion-specific control for humanoid robots.
Motion-Focused Latent Action for Cross-Embodiment VLA from Human Videos
Proposes a motion-focused latent action representation for cross-embodiment vision-language-action policies learned from human videos. Accepted to IROS 2026.
ManiFlow: Manipulation via Rectified Flow
ManiFlow is a visuomotor imitation learning policy using consistency flow matching with a DiT-X architecture that generates high-quality actions in 1–2 steps. It works across single-arm, bimanual, and humanoid platforms using RGB or point cloud inputs.
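The 1–2 step claim is easier to picture with a toy sampler. Flow-based policies generate an action by integrating a velocity field from noise (t = 0) to data (t = 1). When that field follows straight paths, a single Euler step can already land on the target. The sketch below is not ManiFlow's DiT-X model: `toy_velocity` is a hypothetical stand-in for a learned network, and `TARGET_ACTION` is an illustrative value.

```python
from typing import Callable, List

Vector = List[float]

def euler_sample(velocity: Callable[[Vector, float], Vector],
                 noise: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # t stays strictly below 1, so toy_velocity never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical target action; a real policy would condition on observations.
TARGET_ACTION = [0.2, -0.4, 1.0]

def toy_velocity(x: Vector, t: float) -> Vector:
    """Ideal straight-line field pointing from the current sample to the target."""
    return [(a - xi) / (1.0 - t) for a, xi in zip(TARGET_ACTION, x)]

one_step = euler_sample(toy_velocity, [0.0, 0.0, 0.0], num_steps=1)
two_step = euler_sample(toy_velocity, [0.0, 0.0, 0.0], num_steps=2)
```

With this idealized field, both the one-step and two-step samples reach the target up to floating-point error. A learned field is only approximately straight, which is why a second step can still help.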
RL-100: Toward Highly Reliable Real-World Robot Reinforcement Learning
RL-100 demonstrates highly reliable real-world RL manipulation, succeeding in 900 out of 900 trials across 7 tasks, with up to 250 consecutive trials without failure. It also shows strong robustness to disturbances and zero/few-shot adaptation capabilities.
Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
A video-diffusion world model trained on over 1 million manipulation episodes (3,000 hours) that includes an action model and neural simulator for closed-loop robotic manipulation control, with all code and models open-sourced.
Bi-HIL: Bilateral Control-Based Multimodal Hierarchical Imitation Learning for Long-Horizon Contact-Rich Manipulation
Proposes a hierarchical imitation learning framework using bilateral control, subtask-level progress tracking, and keyframe memory to handle long-horizon, contact-rich manipulation tasks.
From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation
Uses reinforcement learning to improve process reasoning capabilities in robotic manipulation policies, shifting the model from passive observation to active critique.
ROVE: Unlocking Human Interventions for Humanoid Manipulation via Reinforcement Learning
ROVE leverages reinforcement learning to enable humanoid robots to benefit from human interventions during manipulation tasks.
ConstrainedMimic: Safe Humanoid Robot Motion Tracking
A control framework for safe humanoid robot motion tracking using RL policies with real-time constraint enforcement via kinematics, dynamics, and control barrier functions.
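As a rough illustration of how a control barrier function can enforce a constraint in real time, consider a one-dimensional toy system x' = u with a position limit `x_max`. Choosing h(x) = x_max - x, the barrier condition h' >= -alpha * h reduces to u <= alpha * (x_max - x), so the safety filter is a simple clip on the nominal command. This is a minimal sketch under those assumptions, not the paper's controller. The names `cbf_filter` and `simulate` and all parameter values are hypothetical.

```python
def cbf_filter(u_nominal: float, x: float, x_max: float, alpha: float = 5.0) -> float:
    """Clip a velocity command so that h(x) = x_max - x stays non-negative.

    For the single integrator x' = u, the condition h' >= -alpha * h
    becomes u <= alpha * (x_max - x).
    """
    return min(u_nominal, alpha * (x_max - x))

def simulate(x0: float, u_nominal: float, x_max: float,
             dt: float = 0.01, steps: int = 500) -> float:
    """Roll the filtered system forward with explicit Euler integration."""
    x = x0
    for _ in range(steps):
        x += dt * cbf_filter(u_nominal, x, x_max)
    return x

# A nominal command that would overshoot the limit if left unfiltered.
final_position = simulate(x0=0.0, u_nominal=1.0, x_max=1.0)
```

Far from the limit the nominal command passes through unchanged, and near the limit the filter slows the state so that it approaches `x_max` without crossing it. Higher-dimensional systems typically solve a small quadratic program instead of a clip.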
REAL: Robust Extreme Agility via Spatio-Temporal Policy Learning and Physics-Guided Filtering
Introduces spatio-temporal policy learning combined with physics-guided filtering to achieve robust and extremely agile robot control.
HiFlow: Tokenization-Free Scale-Wise Autoregressive Policy Learning via Flow Matching
Introduces a tokenization-free autoregressive policy learning framework using flow matching across scales for robotic control.
Reactive Diffusion Policy: Slow-Fast Visual-Tactile Learning for Contact-Rich Manipulation
Introduces a slow-fast imitation learning framework combining diffusion-based planning with reactive tactile/force feedback for contact-rich manipulation tasks. Also includes TactAR, an AR-based teleoperation system with tactile sensing.
SARM2 + SPIRAL: Multi-Task Reward Models and RL Refinement for Long-Horizon Dexterous Manipulation
Combines scalable autonomous reward modeling with RL-based refinement to improve vision-language-action policies on long-horizon dexterous manipulation tasks via autonomous rollouts. Demonstrates significant gains over imitation learning baselines.
Co-VLA: Coordination-Aware Structured Action Modeling for Dual-Arm VLA Systems
Introduces coordination-aware structured action modeling for dual-arm robotic systems within a VLA framework. Addresses the unique challenges of bimanual manipulation through specialized action representations.
ThinkingVLA: Interleaved Vision and Language Reasoning for Robotic Manipulation
Proposes interleaved vision and language reasoning for robotic manipulation within a VLA framework. Aims to improve instruction following and task performance through integrated multimodal reasoning.
Playful Agentic Robot Learning
Self-directed play combined with Code-as-Policy for reusable skill acquisition and downstream manipulation tasks.
Learning Unified Force and Position Control for Legged Loco-Manipulation
A unified RL policy for quadrupeds and humanoids that jointly handles force and position control without force sensors, enabling compliant behaviors, force-aware imitation learning, and contact-rich tasks.
Robots that Collaborate: Sequential Asymmetric Imitation for Learning Coupled Robot Policies
Explores imitation learning approaches for multi-robot systems, focusing on policy coupling through sequential asymmetric imitation to enable collaborative robot behaviors.
AstraBrain-WBC 0.5: A Humanoid Robot Cerebellum Foundation Model
A humanoid robot 'cerebellum' foundation model trained on 20,000 hours of human motion data that demonstrates scaling laws for robot motion control and enables zero-shot execution of unseen motions on real humanoids.
SRL: Combining SLIP Model and Reinforcement Learning for Agile Robotic Jumping
Combines the Spring-Loaded Inverted Pendulum (SLIP) model with reinforcement learning to achieve agile jumping behaviors in robotic systems.
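For readers unfamiliar with SLIP, the model treats the robot as a point mass on a massless spring leg. During stance, with the foot pinned at the origin, the spring pushes the mass along the leg direction while gravity pulls it down. The sketch below writes out that stance acceleration. The function name and all parameter values (mass, stiffness, rest length) are hypothetical and not taken from the paper.

```python
import math

def slip_stance_acceleration(x: float, z: float, mass: float = 10.0,
                             stiffness: float = 1000.0, rest_length: float = 1.0,
                             gravity: float = 9.81):
    """Acceleration of the point mass during stance, foot pinned at the origin."""
    leg_length = math.hypot(x, z)
    # Spring force along the leg; positive (pushing) when the leg is compressed.
    spring_force = stiffness * (rest_length - leg_length)
    ax = spring_force * (x / leg_length) / mass
    az = spring_force * (z / leg_length) / mass - gravity
    return ax, az

# Vertical leg compressed by 0.1 m (illustrative state).
ax, az = slip_stance_acceleration(x=0.0, z=0.9)
```

In a hybrid scheme of the kind the summary describes, a model like this could supply a reference or prior for the jump while reinforcement learning handles what the simplified model leaves out. How SRL actually divides that work is not specified here.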
DataClaw0: Agentic Tailoring for Raw Multimodal Streams
A 9B model that filters noise from videos, GUI, and embodied data streams, reorganizing them into dense supervision via factual anchors and semantic synthesis; trained with SFT + GRPO across five domains with benchmarks.
ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining
Converts 6K+ hours of mixed human/robot egocentric video into robot pseudo-actions via camera-space alignment and reliability-aware loss, achieving 72.8% on RoboCasa and 91.1% on RoboTwin.
VERA: Video-to-Action World Model Policy
A 14B-parameter video world model that converts predicted visual futures into embodiment-agnostic actions via Jacobian inverse-dynamics, enabling zero-shot cross-robot transfer across a Panda arm and 16-DoF hand with open-sourced weights and training code.
GEN-1: Scaled Dexterous Manipulation Foundation Model
A dexterous manipulation foundation model trained on 500k hours of real-world bimanual data that handles deformable objects such as cardboard folding and screw packing, featuring online retry and adaptation capabilities.