
🤖 Awesome-Embodied-Agent



This is more of a personal collection of papers, datasets, and benchmarks related to embodied agents. The goal is to keep track of the latest research in the field and to have a quick reference to the most relevant papers. I mainly focus on works that (IMO) have the ingredients for building a generalist embodied agent (with a focus on humanoid robots).

📃 Papers

๐ŸŒ World Model

🌈 Diffusion

  • Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach Yunuo Chen, Junli Cao, Anil Kag, Vidit Goel, Sergei Korolev, Chenfanfu Jiang, Sergey Tulyakov, Jian Ren, University of California, Los Angeles; Snap Inc., 2025 [paper][project][code]

    • a novel video generation framework that integrates 3-dimensional geometry and dynamic awareness
    • PointVid dataset -> latent diffusion model -> track 2D objects with 3D Cartesian coordinates
    • cross-attention between video and point tokens in corresponding channels for better alignment between the two modalities (sketched below)
    • applying a misalignment penalty to the video diffusion process

Note

This could be useful for our data since they are naturally annotated with end-effector pose, and fingertip positions can be calculated from joint positions.
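
A minimal PyTorch sketch of the video/point cross-attention and misalignment-penalty idea above; the module names, shapes, and loss form are my assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class VideoPointCrossAttention(nn.Module):
    """Toy video<->point fusion block in the spirit of PointVid (shapes assumed)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.norm_v = nn.LayerNorm(dim)
        self.norm_p = nn.LayerNorm(dim)
        # video tokens attend to 3D point-track tokens
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.point_head = nn.Linear(dim, 3)  # decode (x, y, z) per point token

    def forward(self, video_tokens, point_tokens, gt_points):
        # video_tokens: (B, Nv, D), point_tokens: (B, Np, D), gt_points: (B, Np, 3)
        fused, _ = self.cross(self.norm_v(video_tokens),
                              self.norm_p(point_tokens),
                              self.norm_p(point_tokens))
        video_tokens = video_tokens + fused
        # "misalignment penalty": keep decoded 3D tracks consistent with supervision
        pred_points = self.point_head(point_tokens)
        misalignment = (pred_points - gt_points).abs().mean()
        return video_tokens, misalignment

if __name__ == "__main__":
    block = VideoPointCrossAttention()
    v = torch.randn(2, 196, 256); p = torch.randn(2, 64, 256); gt = torch.randn(2, 64, 3)
    out, penalty = block(v, p, gt)
    print(out.shape, penalty.item())
```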

โžก๏ธ Auto-regressive

  • Learning Robotic Video Dynamics with Heterogeneous Masked Autoregression Lirui Wang, Kevin Zhao*, Chaoqi Liu*, Xinlei Chen, MIT, UIUC, FAIR Meta, 2025 [paper][project][code]

    HMA is a real-time robotic video simulation framework for high-fidelity and controllable interaction, leveraging general masked autoregressive dynamics models and heterogeneous training.

    • an iteration of HPT
    • pretrained video dynamics models on heterogeneous data spanning over 40 datasets and 3 million trajectories from real-robot teleoperation, human videos, and simulation
    • token concatenation and modulation for action-conditioned masked autoregressive video and action generation (sketched below)

Note

Lirui's new work, which uses a cross-attention stem-and-head architecture similar to HPT, but this one focuses on action-conditioned generation.
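
A toy sketch of action-conditioned masked autoregressive prediction with token concatenation and modulation, as described above; the architecture details are assumptions, not the HMA implementation.

```python
import torch
import torch.nn as nn

class ActionModulatedBlock(nn.Module):
    """Toy action-conditioned block: concatenate action tokens with (partially
    masked) video tokens and modulate features with the action embedding.
    Purely illustrative; not the HMA architecture."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.modulation = nn.Linear(dim, 2 * dim)   # scale & shift from action

    def forward(self, video_tokens, action_tokens, mask):
        # video_tokens: (B, Nv, D), action_tokens: (B, Na, D), mask: (B, Nv, 1) bool
        mask_token = torch.zeros_like(video_tokens)
        x = torch.where(mask, mask_token, video_tokens)      # mask unknown/future tokens
        tokens = torch.cat([action_tokens, x], dim=1)        # token concatenation
        h, _ = self.attn(self.norm(tokens), self.norm(tokens), self.norm(tokens))
        scale, shift = self.modulation(action_tokens.mean(1, keepdim=True)).chunk(2, dim=-1)
        h = h * (1 + scale) + shift                          # action modulation
        return h[:, action_tokens.shape[1]:]                 # predictions for video slots

if __name__ == "__main__":
    blk = ActionModulatedBlock()
    v = torch.randn(2, 32, 256); a = torch.randn(2, 4, 256)
    m = torch.rand(2, 32, 1) > 0.5
    print(blk(v, a, m).shape)   # (2, 32, 256)
```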

🦾 Self-supervised

  • Intuitive physics understanding emerges from self-supervised pretraining on natural videos Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, Yann LeCun. 2025 [paper][code]

    V-JEPA (Video Joint Embedding Predictive Architecture) is a non-autoregressive model that takes a self-supervised learning approach. It learns by predicting missing or masked parts of a video in an abstract representation space (a minimal sketch of this objective follows the eval datasets below). It is trained on a mixture of three popular video datasets, referred to as VideoMix2M (Bardes et al., 2024): Kinetics 710 (K710, Kay et al., 2017), Something-Something-v2 (SSv2, Goyal et al., 2017b), and HowTo100M (HowTo, Miech et al., 2019).

    Eval Datasets:

    • IntPhys
    • GRASP
    • InfLevel-lab
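
A toy sketch of the JEPA-style objective: predict masked-token representations in latent space rather than pixels. Module shapes and the EMA target encoder are assumptions for illustration.

```python
import copy
import torch
import torch.nn as nn

def jepa_loss(context_encoder, target_encoder, predictor, video_tokens, mask):
    """Toy JEPA-style objective: predict representations of masked tokens
    in latent space (no pixel reconstruction). Shapes/modules are assumptions.
    video_tokens: (B, N, D) patchified video, mask: (B, N) bool, True = hidden."""
    with torch.no_grad():
        targets = target_encoder(video_tokens)           # full-video target features
    visible = video_tokens * (~mask).unsqueeze(-1)       # zero out masked patches
    context = context_encoder(visible)
    preds = predictor(context)                           # predict every token's feature
    # L1 loss only on the masked positions, in representation space
    return (preds - targets).abs()[mask].mean()

if __name__ == "__main__":
    dim = 128
    enc = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
    target = copy.deepcopy(enc)          # in practice an EMA copy of the encoder
    pred = nn.Linear(dim, dim)
    x = torch.randn(2, 64, dim)
    m = torch.rand(2, 64) > 0.5
    print(jepa_loss(enc, target, pred, x, m).item())
```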

Note

This paper also focuses on physical understanding, but it uses a self-supervised learning approach. It would be interesting to see how the model could be used for embodied agents.

  • DINO-Foresight: Looking into the Future with DINO Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, Nikos Komodakis, 2024 [paper][code]

    • masked feature transformer trained in a self-supervised manner to predict the evolution of VFM features over time
    • trained on masked DINOv2 features, predicting in latent space
  • DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning Gaoyue Zhou, Hengkai Pan, Yann LeCun and Lerrel Pinto, New York University, Meta AI, 2024 [Paper] [Code] [Data] [Project Website]
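
A minimal sketch of zero-shot planning with a latent world model over frozen visual features, in the spirit of DINO-WM; the random-shooting planner and all module signatures are assumptions, not the released code.

```python
import torch
import torch.nn as nn

def plan_with_latent_world_model(encoder, dynamics, obs, goal_obs,
                                 horizon=5, samples=256, action_dim=7):
    """Toy zero-shot planner: roll candidate action sequences through a latent
    dynamics model over frozen visual features and pick the sequence whose
    final latent is closest to the goal latent."""
    with torch.no_grad():
        z0 = encoder(obs)                       # (1, D) frozen pre-trained features
        z_goal = encoder(goal_obs)              # (1, D)
        actions = torch.randn(samples, horizon, action_dim)   # random shooting
        z = z0.repeat(samples, 1)
        for t in range(horizon):
            z = dynamics(torch.cat([z, actions[:, t]], dim=-1))  # latent rollout
        cost = (z - z_goal).pow(2).sum(dim=-1)  # distance to goal in latent space
        return actions[cost.argmin()]           # best open-loop action sequence

if __name__ == "__main__":
    D, A = 64, 7
    encoder = nn.Linear(3 * 32 * 32, D)         # stand-in for a frozen DINOv2 encoder
    dynamics = nn.Sequential(nn.Linear(D + A, 128), nn.ReLU(), nn.Linear(128, D))
    obs = torch.randn(1, 3 * 32 * 32); goal = torch.randn(1, 3 * 32 * 32)
    print(plan_with_latent_world_model(encoder, dynamics, obs, goal).shape)
```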

Others

  • Generalizing Safety Beyond Collision-Avoidance via Latent-Space Reachability Analysis Kensuke Nakamura, Lasse Peters, Andrea Bajcsy, 2025 [paper]

    • Latent Safety Filters: a latent-space generalization of HJ reachability that tractably operates directly on raw observation data (e.g., RGB images) by performing safety analysis in the latent embedding space of a generative world model.
    • Prevents unsafe states and generates actions that avoid future failures in latent space.
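
A minimal sketch of a latent-space safety filter: evaluate a learned safety value on the world-model latent and fall back to a safe action when the state looks unsafe. The sign convention and interfaces are assumptions.

```python
import torch
import torch.nn as nn

def filtered_action(encoder, safety_value, policy_action, fallback_action, obs,
                    threshold=0.0):
    """Toy latent-space safety filter: encode raw observations with a world-model
    encoder, evaluate a learned safety value on the latent, and swap in a
    fallback (safety-preserving) action when the value signals an unsafe state."""
    with torch.no_grad():
        z = encoder(obs)                 # latent state from raw pixels
        v = safety_value(z)              # assumed: >0 safe, <=0 unsafe
        return policy_action if v.item() > threshold else fallback_action

if __name__ == "__main__":
    encoder = nn.Linear(3 * 64 * 64, 32)
    safety_value = nn.Linear(32, 1)
    obs = torch.randn(1, 3 * 64 * 64)
    a_task, a_safe = torch.tensor([0.5, 0.0]), torch.tensor([0.0, 0.0])
    print(filtered_action(encoder, safety_value, a_task, a_safe, obs))
```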

๐Ÿ—บ๏ธ Generation-conditioned

  • Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, Jianyu Chen, 2024 [paper][project]

  • Strengthening Generative Robot Policies through Predictive World Modeling Han Qi, Haocheng Yin, Yilun Du, Heng Yang, School of Engineering and Applied Sciences, Harvard University, 2025 [paper][project][code coming soon]

🤖 VLA

  • ObjectVLA: End-to-End Open-World Object Manipulation Without Demonstration Minjie Zhu, Yichen Zhu, Jinming Li, Zhongyi Zhou, Junjie Wen, Xiaoyu Liu, Chaomin Shen, Yaxin Peng, Feifei Feng, 2025 [paper]

    • "<|object ref start|>{object}<|object ref end|><|box start|> (x1, y1),(x2, y2)<|box end|>." (a small formatting sketch follows this list)
    • 10:1 data ratio (robot-to-image-text data): This ratio empirically proved sufficient for robust object generalization, aligning with prior findings on the benefits of co-training for VLA capabilities. Notably, increasing the proportion of robot data beyond this ratio led to a decline in in-domain task success rates. We hypothesize this stems from the limited capacity of the 2B-parameter DiVLA model compared to larger architectures like ECoT (7B) and RT-2 (55B), which can better absorb domain-specific data without overfitting.
    • Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression Junjie Wen, Minjie Zhu, Yichen Zhu, Zhibin Tang, Jinming Li, Zhongyi Zhou, Chengmeng Li, Xiaoyu Liu, Yaxin Peng, Chaomin Shen, Feifei Feng, 2024 [paper]
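
A small sketch of the object-reference/bounding-box string format quoted above; the underscored token names follow Qwen-VL conventions and are my assumption about how the spaced tokens in the quote are actually spelled.

```python
def object_ref_string(obj: str, box: tuple[int, int, int, int]) -> str:
    """Format an object reference + bounding box in the special-token style
    quoted above. Token spellings (underscores) are assumed, not confirmed."""
    x1, y1, x2, y2 = box
    return (f"<|object_ref_start|>{obj}<|object_ref_end|>"
            f"<|box_start|>({x1},{y1}),({x2},{y2})<|box_end|>.")

print(object_ref_string("red mug", (120, 88, 240, 190)))
```
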
  • Helix: A Vision-Language-Action Model for Generalist Humanoid Control [blog]

    7B VLM at 7-9 Hz, 80M-parameter transformer at 200 Hz(?), full upper-body control (a dual-rate control loop is sketched below). Dataset:

    • multi-robot, multi-operator dataset of diverse teleoperated behaviors, ~500 hours in total.
    • an auto-labeling VLM to generate hindsight instructions, prompted with: "What instruction would you have given the robot to get the action seen in this video?"
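
A toy dual-rate control loop in the spirit of the numbers above: a slow VLM-like module refreshes a latent command at a few Hz while a fast policy runs at a much higher rate on the latest latent. Rates, shapes, and the callables are placeholders, not the actual Helix stack.

```python
import time
import threading
import numpy as np

class DualRateController:
    """Toy two-rate loop: slow VLM thread updates a shared latent, fast policy
    loop consumes the most recent latent. Illustrative only."""
    def __init__(self, vlm, policy, slow_hz=8.0, fast_hz=200.0):
        self.vlm, self.policy = vlm, policy
        self.slow_dt, self.fast_dt = 1.0 / slow_hz, 1.0 / fast_hz
        self.latent = np.zeros(64)
        self._lock = threading.Lock()

    def _slow_loop(self, get_obs, instruction, steps):
        for _ in range(steps):
            z = self.vlm(get_obs(), instruction)        # expensive, low frequency
            with self._lock:
                self.latent = z
            time.sleep(self.slow_dt)

    def run(self, get_obs, instruction, seconds=1.0):
        t = threading.Thread(target=self._slow_loop,
                             args=(get_obs, instruction, int(seconds / self.slow_dt)),
                             daemon=True)
        t.start()
        for _ in range(int(seconds / self.fast_dt)):
            with self._lock:
                z = self.latent
            action = self.policy(get_obs(), z)          # cheap, high frequency
            # send `action` to the robot here
            time.sleep(self.fast_dt)

if __name__ == "__main__":
    ctrl = DualRateController(vlm=lambda o, s: np.random.randn(64),
                              policy=lambda o, z: z[:7])
    ctrl.run(get_obs=lambda: None, instruction="pick up the cup", seconds=0.5)
```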

Note

Similar Work:

  • π0: A Vision-Language-Action Flow Model for General Robot Control Physical Intelligence [paper][blog]

  • UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent Jianke Zhang, Yanjiang Guo, Yucheng Hu, Xiaoyu Chen, Xiang Zhu, Jianyu Chen, 2025 [paper]

training with both multi-modal Understanding and future Prediction objectives

  • DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, Feifei Feng, 2025 [paper][project][code]

    Multi-head, billion-parameter diffusion action expert for cross-embodiment control (Qwen2-VL-2B + ScaleDP-1B)

    Curriculum learning:

    1. cross-embodiment pre-training stage
    2. embodiment-specific alignment
    3. task-specific adaptation
  • Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation (RoboDual) Qingwen Bu, Hongyang Li, Li Chen, Jisong Cai, Jia Zeng, Heming Cui, Maoqing Yao, Yu Qiao, 2024 [paper]

    Generalist & specialist dual-system architecture.

    Generalist:

    • Prismatic-7B (similar to OpenVLA) (siglip+dinov2 see here)

    Specialist:

    • DiT <- action + ViT features through a Perceiver resampler + latents
  • HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers Jianke Zhang, Yanjiang Guo, Xiaoyu Chen, Yen-Jen Wang, Yucheng Hu, Chengming Shi, Jianyu Chen, 2024 [paper]

    • VLMs running at low frequency to capture temporally invariant features, while enabling real-time interaction through a high-frequency vision-based policy guided by the slowly updated features
    • Trained on 20 tasks from Metaworld, 5 tasks from Franka-Kitchen, and 4 skills from the real world
  • From LLMs to Actions: Latent Codes as Bridges in Hierarchical Robot Control Yide Shentu, Philipp Wu, Aravind Rajeswaran, Pieter Abbeel, 2024 [paper]

    "User: can you help me $x_{txt}$? Assistant: yes, <ACT>."

    Data: 400 trajectories for each reasoning task and 1200 trajectories for each long horizon task.

    staged training strategies
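
A toy sketch of the latent-bridge idea: append an <ACT> token to the prompt, take its hidden state from the LLM, and condition a low-level policy on it. Interfaces and the token position are assumptions.

```python
import torch
import torch.nn as nn

def act_token_conditioning(llm, tokenizer, policy, instruction, obs):
    """Toy latent bridge: the <ACT> token's final hidden state is the latent
    code passed to a low-level policy. Model/tokenizer/policy are placeholders."""
    prompt = f"User: can you help me {instruction}? Assistant: yes, <ACT>."
    ids = tokenizer(prompt)                     # -> (1, T) token ids
    hidden = llm(ids)                           # -> (1, T, D) hidden states
    act_latent = hidden[:, -2]                  # state at the <ACT> position (assumed)
    return policy(obs, act_latent)              # low-level action conditioned on latent

if __name__ == "__main__":
    D = 32
    tokenizer = lambda s: torch.randint(0, 100, (1, len(s.split())))
    llm = lambda ids: torch.randn(1, ids.shape[1], D)
    policy = lambda obs, z: torch.tanh(nn.Linear(D, 7)(z))
    print(act_token_conditioning(llm, tokenizer, policy, "fold the towel", obs=None).shape)
```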

๐Ÿ’ Imitation Learning

🌈 Diffusion-based

  • Scaling Diffusion Policy in Transformer to 1 Billion Parameters for Robotic Manipulation ( ScaleDP ) Minjie Zhu, Yichen Zhu, Jinming Li, Junjie Wen, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, Jian Tang, 2024 [paper]

    • DPT suffers from large gradient issues, making the optimization of Diffusion Policy unstable
      • factorize the feature embedding of observation into multiple affine layers, and integrate it into the transformer blocks
      • non-causal attention, which allows the policy network to "see" future actions during prediction

    Obs:

    • for DP-T increasing size does not improve performance
    • num head 4->6 improves performance, but 8 does not
    • a consistent decline in performance with each additional layer

    Key Modifications:

    • AdaLN block instead of cross-attention block
    • Non-causal Attention: remove the causal mask in the self-attention layer
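
A minimal sketch of the two key modifications above: AdaLN-style injection of the observation embedding (instead of cross-attention) and non-causal self-attention over the action chunk. Shapes and layer sizes are illustrative, not ScaleDP's.

```python
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    """Toy transformer block for a diffusion policy head: the observation
    embedding is injected through AdaLN-style scale/shift, and self-attention
    uses NO causal mask, so every action token sees the whole predicted chunk."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ada = nn.Linear(dim, 4 * dim)      # scale/shift for both sub-layers

    def forward(self, action_tokens, obs_embed):
        # action_tokens: (B, T, D) noisy action chunk, obs_embed: (B, D)
        s1, b1, s2, b2 = self.ada(obs_embed).unsqueeze(1).chunk(4, dim=-1)
        h = self.norm1(action_tokens) * (1 + s1) + b1
        h, _ = self.attn(h, h, h)               # non-causal: no attn_mask passed
        x = action_tokens + h
        h = self.norm2(x) * (1 + s2) + b2
        return x + self.mlp(h)

if __name__ == "__main__":
    blk = AdaLNBlock()
    print(blk(torch.randn(2, 16, 256), torch.randn(2, 256)).shape)   # (2, 16, 256)
```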

Note

RDT uses DiT with:

  • RMSNorm + QKNorm: avoid numerical instability
  • linear MLP Decoder
  • Alternating Condition Injection: for modality imbalance

  • The Ingredients for Robotic Diffusion Transformers (DiT policy) Sudeep Dasari, Oier Mees, Sebastian Zhao, Mohan Kumar Srirama, Sergey Levine, 2024 [paper][code]

    DiT policy

    • FiLM + resnet + sinusoidal fourier features
    • adaLN-Zero attention: This simple trick improves performance by 30%+ on long horizon, dexterous, real-world manipulation tasks containing over 1000 decisions!
    • self-attention encoder + diffusion decoder with adaLN-Zero
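
A small sketch of two of the "ingredients" above, sinusoidal Fourier features for the diffusion timestep and FiLM conditioning of visual features; where exactly FiLM sits in the ResNet is my assumption.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_features(t, dim=64):
    """Sinusoidal Fourier features for the diffusion timestep (standard recipe)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = t[:, None].float() * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

class FiLM(nn.Module):
    """Toy FiLM conditioning of visual feature maps on a conditioning vector
    (e.g. language/timestep embedding); exact placement in the ResNet is assumed."""
    def __init__(self, feat_channels=128, cond_dim=64):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * feat_channels)

    def forward(self, feat_map, cond):
        # feat_map: (B, C, H, W) ResNet features, cond: (B, cond_dim)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return feat_map * (1 + scale[:, :, None, None]) + shift[:, :, None, None]

if __name__ == "__main__":
    t_embed = sinusoidal_features(torch.tensor([3, 17]))        # (2, 64)
    film = FiLM()
    print(film(torch.randn(2, 128, 7, 7), t_embed).shape)       # (2, 128, 7, 7)
```
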
Consistency Policy
  • Boosting Continuous Control with Consistency Policy Yuhui Chen, Haoran Li, Dongbin Zhao, 2024 [paper][project][code]

  • Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation Aaditya Prasad, Kevin Lin, Jimmy Wu, Linqi Zhou, Jeannette Bohg, 2024 [paper]

Auto-regressive

Note

A very interesting idea about simplifying IL models (14M parameters with 2-3M trainable). It is worth trying on our data. However, the reported results are only on simulated environments.

Important

Overall, I think a minimalist IL model that we can run in real time on a CPU is needed. Coupled with a VLM for high-level reasoning, this could be a good stand-in for the full VLA model.

  • CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction Zhefei Gong, Pengxiang Ding, Shangke Lyu, Siteng Huang, Mingyang Sun, Wei Zhao, Zhaoxin Fan, Donglin Wang, 2024 [paper][code]

🎪 Reinforcement Learning

  • ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy Yuhui Chen, Shuai Tian, Shugao Liu, Yingting Zhou, Haoran Li, Dongbin Zhao, 2025 [paper]

Offline and online fine-tuning with a unified consistency-based training objective, evaluated on eight diverse real-world manipulation tasks. It achieves an average success rate of 96.3% within 45-90 minutes of online fine-tuning, outperforming prior supervised methods with a 144% improvement in success rate and 1.9x shorter episode length.

  • Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids Toru Lin, Kartik Sachdev, Linxi Fan, Jitendra Malik, Yuke Zhu, 2025 [paper][project]

Challenges in applying RL to manipulation:

  • Challenge in environment modeling: an automated real-to-sim tuning module that brings the simulated environment closer to the real world
    • autotune module: simulator physics parameters affecting kinematics and dynamics, as well as robot model constants from the URDF file (including link inertia values, joint limits, and joint/link poses).
  • Challenge in reward design: a generalized reward design scheme that simplifies reward engineering for long-horizon contact-rich manipulation tasks -- disentangle a full task into intermediate "contact goals" and "object goals".
  • Challenge in policy learning: a divide-and-conquer distillation process that improves the sample efficiency of hard-exploration problems while maintaining sim-to-real performance
  • Challenge in object perception: a mixture of sparse and dense object representations to bridge the sim-to-real perception gap.
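
A toy reward in the spirit of the contact-goal/object-goal decomposition above; thresholds, weights, and shaping terms are illustrative, not the paper's.

```python
import numpy as np

def staged_manipulation_reward(fingertips, obj_pos, obj_goal, contact_thresh=0.02):
    """Toy 'contact goal then object goal' reward: first reward closing the
    fingertip-object distance, then reward moving the object to its target."""
    contact_dist = np.linalg.norm(fingertips - obj_pos, axis=-1).min()
    r_contact = np.exp(-10.0 * contact_dist)                    # approach/contact shaping
    in_contact = contact_dist < contact_thresh
    obj_dist = np.linalg.norm(obj_pos - obj_goal)
    r_object = np.exp(-5.0 * obj_dist) if in_contact else 0.0   # only counts after contact
    return r_contact + 2.0 * r_object

print(staged_manipulation_reward(
    fingertips=np.array([[0.01, 0.0, 0.0], [0.03, 0.0, 0.0]]),
    obj_pos=np.array([0.0, 0.0, 0.0]),
    obj_goal=np.array([0.1, 0.0, 0.2])))
```
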
  • Learning to Manipulate Anywhere: A Visual Generalizable Framework For Visual Reinforcement Learning Zhecheng Yuan, Tianming Wei, Shuiqi Cheng, Gu Zhang, Yuanpei Chen, Huazhe Xu, 2024 [paper][project][code]

    Use multi-view representation objective to help sim-to-real transfer

  • Precise and Dexterous Robotic Manipulation via Human-in-the-Loop Reinforcement Learning Jianlan Luo, Charles Xu, Jeffrey Wu, Sergey Levine, 2024 [paper][code]

    • Human-in-the-loop RL

๐ŸŽ Representations

  • SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, Li Yi, 2025 [paper][project]

    1. Complex robotic manipulation tasks are constrained by the understanding of orientation, such as "upright a tilted wine glass" or "plugging a cord into a power strip."
    2. We introduce the concept of semantic orientation, representing object orientation conditioned on open-vocabulary language, such as the orientation of "top," "handle," and "pouring water."
    3. We construct OrienText300K, a large paired dataset of point clouds, text, and orientation, and train PointSO, the first open-vocabulary orientation model.
    4. Based on PointSO, we propose SoFar, the first 6-DoF spatial understanding LLM, which achieves a 13.1% performance improvement on the 6-DoF object rearrangement task and a 47.2% improvement over OpenVLA on the SimplerEnv benchmark.
    5. We propose two benchmarks, Open6DOR V2 and 6-DoF SpatialBench, which evaluate 6-DoF rearrangement capability and 6-DoF spatial understanding capability, respectively.
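
A small geometric illustration of how a predicted semantic orientation could be used: given the direction of an object's "upright" axis (e.g. from an open-vocabulary orientation model), compute the rotation that re-aligns it with +Z. The orientation-prediction model itself is not shown.

```python
import numpy as np

def rotation_to_upright(semantic_up: np.ndarray) -> np.ndarray:
    """Rotation matrix that brings a predicted 'upright' direction back to +Z
    (Rodrigues' formula). Purely geometric illustration."""
    u = semantic_up / np.linalg.norm(semantic_up)
    z = np.array([0.0, 0.0, 1.0])
    v = np.cross(u, z)
    c, s = float(np.dot(u, z)), float(np.linalg.norm(v))
    if s < 1e-8:                       # already aligned (or exactly flipped)
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    vx = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + vx + vx @ vx * ((1 - c) / (s ** 2))

# e.g. a wine glass predicted to be tilted 45 degrees toward +X
up = np.array([np.sin(np.pi / 4), 0.0, np.cos(np.pi / 4)])
R = rotation_to_upright(up)
print(np.round(R @ up, 3))   # ~[0, 0, 1]
```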

🪩 Other Modalities

  • V-HOP: Visuo-Haptic 6D Object Pose Tracking Hongyu Li, Mingxi Jia, Tuluhan Akbulut, Yu Xiang, George Konidaris, Srinath Sridhar, 2025 [paper]

    • a novel unified haptic representation that effectively handles multiple gripper embodiments.
    • a new visuo-haptic transformer-based object pose tracker that seamlessly integrates visual and haptic input
  • FACTR: Force-Attending Curriculum Training for Contact-Rich Policy Learning Jason Jingzhou Liu, Yulong Li, Kenneth Shaw, Tony Tao, Ruslan Salakhutdinov, Deepak Pathak, 2025 [paper][project]

    • a low-cost, intuitive, bilateral teleoperation setup that relays external forces of the follower arm back to the teacher arm, facilitating data collection for complex, contact-rich tasks
    • FACTR, a policy learning method that employs a curriculum which corrupts the visual input with decreasing intensity throughout training.
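
A toy version of the corruption curriculum described above: anneal the intensity of visual corruption to zero over training so the policy first leans on force/proprioceptive inputs. The schedule and noise type are assumptions.

```python
import torch

def corrupt_visual_input(images, step, total_steps, max_sigma=1.0):
    """Toy decreasing-corruption curriculum: heavy visual noise early in
    training, annealed linearly to zero by the end."""
    sigma = max_sigma * max(0.0, 1.0 - step / total_steps)   # linear anneal to 0
    return images + sigma * torch.randn_like(images)

imgs = torch.rand(4, 3, 64, 64)
for step in (0, 5000, 10000):
    noisy = corrupt_visual_input(imgs, step, total_steps=10000)
    print(step, (noisy - imgs).abs().mean().item())
```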

🧙 Survey

  • Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI, arXiv:2407.06886, 2024 Yang Liu, Weixing Chen, Yongjie Bai, Guanbin Li, Wen Gao, Liang Lin. [Paper]

💽 Datasets

  • Open-X Embodiment [overview]

  • BridgeData V2: A Dataset for Robot Learning at Scale Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, Sergey Levine, 2024 [paper][project]

    • 60,096 trajectories
    • 50,365 teleoperated demonstrations
    • 9,731 rollouts from a scripted pick-and-place policy
    • 24 environments grouped into 4 categories
    • 13 skills

    The majority of the data comes from 7 distinct toy kitchens, which include some combination of sinks, stoves, and microwaves. The remaining environments come from diverse sources, including various tabletops, standalone toy sinks, a toy laundry machine, and more.

  • DROID: A Large-Scale In-the-Wild Robot Manipulation Dataset, 2024 [paper][project]

๐Ÿ” Dataset Evaluation

  • Robot Data Curation with Mutual Information Estimators Joey Hejna, Suvir Mirchandani, Ashwin Balakrishna, Annie Xie, Ayzaan Wahid, Jonathan Tompson, Pannag Sanketi, Dhruv Shah, Coline Devin, Dorsa Sadigh, 2025 [paper][project][code]

  • Re-Mix: Optimizing Data Mixtures for Large Scale Imitation Learning Joey Hejna, Chethan Bhateja, Yichen Jiang, Karl Pertsch, Dorsa Sadigh, 2024 [paper][project]

🧠 Real2sim

  • Re3Sim: Generating High-Fidelity Simulation Data via 3D-Photorealistic Real-to-Sim for Robotic Manipulation Xiaoshen Han, Minghuan Liu, Yilun Chen, Junqiu Yu, Xiaoyang Lyu, Yang Tian, Bolun Wang, Weinan Zhang, Jiangmiao Pang, 2025 [paper][project]

  • Evaluating Real-World Robot Manipulation Policies in Simulation (SimplerEnv) [paper][project]

๐Ÿ‹๏ธ Benchmarks

  • All Robots in One: A New Standard and Unified Dataset for Versatile, General-Purpose Embodied Agents, arXiv:2408.10899, 2024 Zhiqiang Wang, Hao Zheng, Yunshuang Nie, Wenjun Xu, Qingwei Wang, Hua Ye, Zhe Li, Kaidong Zhang, Xuewen Cheng, Wanxi Dong, Chang Cai, Liang Lin, Feng Zheng, Xiaodan Liang [Paper][Project]

  • CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks, arXiv:2112.03227, 2022 Oier Mees, Lukas Hermann, Erick Rosete, Wolfram Burgard [paper][project]

🧠 Thoughts
