Jian Liu, Xiongtao Shi, Thai Duy Nguyen, Haitian Zhang, Tianxiang Zhang, Wei Sun, Yanjie Li,
Athanasios V. Vasilakos, Giovanni Iacca, Arshad Ali Khan, Arvind Kumar, Jae Won Cho,
📢 Note: If you have any suggestions, feel free to post an issue or open a pull request; we will address it as soon as possible!
- [2025.5.13]: 🔥🔥 We released the official repository of the paper Neural Brain: A Neuroscience-inspired Framework for Embodied Agents. This paper is the first to define the Neural Brain of embodied agents through the lens of neuroscience. We not only propose this definition but also provide a comprehensive design framework for this purpose.
Additionally, we revisit the existing literature in alignment with this framework, highlighting gaps and challenges and outlining promising directions for future research. The proposed framework seeks to replicate key principles of biological cognition, such as active sensing and a tightly coupled perception-cognition-action loop. By integrating theoretical insights with practical engineering considerations, we aim to advance AI beyond task-specific optimization, laying the groundwork for generalizable embodied intelligence.
The evolution from AI to embodied AI. (a) AI excels at pattern recognition but lacks physical interaction with the real world. (b) Embodied AI enables robots such as Boston Dynamics' Atlas and the Unitree G1 to perceive and act in their environment. (c) Inspired by the human brain, intelligence arises from neural processes that integrate sensing, perception, cognition, action, and memory. (d) This work proposes the concept of a Neural Brain for Embodied Agents, drawing on neuroscience to achieve generalizable embodied AI.

The human brain comprises four key components: sensing; function (perception, cognition, action); memory (short-term and long-term); and implementation features such as sparse activation, event-driven processing, predictive coding, and distributed, parallel mechanisms. Inspired by these insights from neuroscience, we propose the concept of a Neural Brain for Embodied Agents, which integrates these principles into four distinct modules. The sensing module incorporates multimodal fusion, active sensing, and adaptive calibration to enhance perceptual capabilities. The function module encompasses predictive perception, cognitive reasoning, and action, including a closed action loop that ensures continuous interaction with the environment. The memory module features a hierarchical architecture, neuroplastic adaptation, and context awareness, enabling agents to store and retrieve information dynamically and efficiently. Finally, the hardware/software module is characterized by event-driven processing, neuromorphic architectures, and hardware-software co-design, ensuring robust and flexible operation. These four core ideas, derived from the structure and functionality of the human brain, aim to empower embodied agents to adapt, learn, and perform effectively in real-world environments.
The Neural Brain for embodied agents is a biologically inspired computational framework that synthesizes principles from neuroscience, robotics, and machine learning to facilitate autonomous and adaptive interaction within unstructured environments. Designed to emulate the hierarchical and distributed architecture of the human brain, it integrates multimodal and active sensing (Sensing), closed-loop perception-cognition-action cycles (Function), neuroplasticity-driven memory systems (Memory), and energy-efficient neuromorphic hardware-software co-design (Hardware/Software), as shown below.
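To make the four-module organization above concrete, here is a minimal, hypothetical Python sketch of a perception-cognition-action loop wired through sensing, function, and memory. All class and method names (e.g. `NeuralBrainAgent`, `sensor.read()`, `env.execute()`) are illustrative assumptions for exposition only, not an API defined by the paper or this repository.

```python
# Minimal sketch of the four-module Neural Brain organization (illustrative only).
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Hierarchical memory: a short-term buffer plus a long-term key-value store."""
    short_term: list = field(default_factory=list)
    long_term: dict = field(default_factory=dict)

    def store(self, observation, key=None):
        self.short_term.append(observation)       # working-memory context
        if key is not None:
            self.long_term[key] = observation     # consolidated knowledge


class NeuralBrainAgent:
    """Closed perception-cognition-action loop over multimodal sensing."""

    def __init__(self, sensors, policy, memory=None):
        self.sensors = sensors        # Sensing: e.g. {"vision": cam, "tactile": skin}
        self.policy = policy          # Function: cognition mapping (obs, memory) -> action
        self.memory = memory or Memory()

    def step(self, env):
        # Sensing: fuse whatever modalities are available at this step.
        obs = {name: sensor.read() for name, sensor in self.sensors.items()}
        # Memory: retain context for longer-horizon reasoning.
        self.memory.store(obs)
        # Cognition + action: the policy closes the loop by acting on the environment.
        action = self.policy(obs, self.memory)
        feedback = env.execute(action)
        return action, feedback
```

In this sketch the hardware/software module is implicit: the same loop could be deployed on conventional or neuromorphic hardware, with event-driven sensors simply replacing the polled `sensor.read()` calls.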
(a) Vision
(b) Audition
(c) Tactile
- Tactile Sensing—From Humans to Humanoids [Paper]
- Force Push: Robust Single-Point Pushing with Force Feedback [Paper]
- The Feel of MEMS Barometers: Inexpensive and Easily Customized Tactile Array Sensors [Paper]
- GelSlim: A High-Resolution, Compact, Robust, and Calibrated Tactile-sensing Finger [Paper]
(d) Olfaction
- 1D Metal Oxide Semiconductor Materials for Chemiresistive Gas Sensors: A Review [Paper]
- Comparisons between mammalian and artificial olfaction based on arrays of carbon black-polymer composite vapor detectors [Paper]
- Piezoelectric Sensor [Paper]
- Recent Advance in the Design of Colorimetric Sensors for Environmental Monitoring [Paper]
(e) Spatial/Time
(a) Active Sensing
- Designing and Evaluating a Social Gaze-Control System for a Humanoid Robot [Paper]
- Robot Active Neural Sensing and Planning in Unknown Cluttered Environments [Paper]
- Multi-robot coordination through dynamic Voronoi partitioning for informative adaptive sampling in communication-constrained environments [Paper]
- Deep Learning for Omnidirectional Vision: A Survey and New Perspectives [Paper]
(b) Adaptive Calibration
- Deep Learning for Camera Calibration and Beyond: A Survey [Paper]
- Polymer-Based Self-Calibrated Optical Fiber Tactile Sensor [Paper]
- Self-validating sensor technology and its application in artificial olfaction: A review [Paper]
- GVINS: Tightly Coupled GNSS-Visual-Inertial Fusion for Smooth and Consistent State Estimation [Paper]
References
- FAST-LIVO2: Fast, Direct LiDAR-Inertial-Visual Odometry [Paper]
- VILENS: Visual, Inertial, Lidar, and Leg Odometry for All-Terrain Legged Robots [Paper]
- Fast Foveating Cameras for Dense Adaptive Resolution [Paper]
- A Survey on Active Simultaneous Localization and Mapping: State of the Art and New Frontiers [Paper]
- A New Wave in Robotics: Survey on Recent mmWave Radar Applications in Robotics [Paper]
- Electronic Nose and Its Applications: A Survey [Paper]
- Real-Time Temporal and Rotational Calibration of Heterogeneous Sensors Using Motion Correlation Analysis [Paper]
(a) Large Language Models (LLMs)
- UL2: Unifying Language Learning Paradigms [Paper] [Code]
- LLaMA: Open and Efficient Foundation Language Models [Paper] [Code]
- LLaMA 2: Open Foundation and Fine-tuned Chat Models [Paper] [Code]
(b) Large Vision Models (LVMs)
- DINOv2: Learning Robust Visual Features without Supervision [Paper] [Code]
- SAM 2: Segment Anything in Images and Videos [Paper] [Code]
- Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling [Paper] [Code]
- PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies [Paper] [Code]
- BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers [Paper] [Code]
(c) Large Multimodal Models (LMMs)
Vision-Language
- Grounding DINO: Marrying Language and Object Detection with Transformers [Paper] [Code]
- Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks [Paper] [Code]
- ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision [Paper] [Code]
- Unifying Vision-and-Language Tasks via Text Generation [Paper] [Code]
- Learning Transferable Visual Models From Natural Language Supervision [Paper] [Code]
- ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [Paper]
Text-Audio
- Robust Speech Recognition via Large-Scale Weak Supervision [Paper] [Code]
- AudioGen: Textually Guided Audio Generation [Paper] [Code]
Text-Video
- Video generation models as world simulators [Paper]
Vision-Tactile-Language
- Multimodal Visual-Tactile Representation Learning through Self-Supervised Contrastive Pre-Training [Paper] [Code]
- Binding Touch to Everything: Learning Unified Multimodal Tactile Representations [Paper] [Code]
- A Touch, Vision, and Language Dataset for Multimodal Alignment [Paper] [Code]
- Touching a NeRF: Leveraging Neural Radiance Fields for Tactile Sensory Data Generation [Paper]
(a) Large Vision-Language Models (LVLMs)
(b) Multimodal Large Language Models (MLLMs)
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [Paper] [Code]
- GPT-4o System Card [Paper]
- Gemini: A Family of Highly Capable Multimodal Models [Paper]
- Visual Instruction Tuning [Paper] [Code]
- The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) [Paper]
- Qwen Technical Report [Paper] [Code]
- EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents [Paper] [Code]
- EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents [Paper] [Code]
- DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding [Paper] [Code]
- MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [Paper] [Code]
- PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain [Paper] [Code]
(c) Neuro-Symbolic AI
- NeSyC: A Neuro-symbolic Continual Learner For Complex Embodied Tasks In Open Domains [Paper]
- From Understanding the World to Intervening in It: A Unified Multi-Scale Framework for Embodied Cognition [Paper]
- Can-Do! A Dataset and Neuro-Symbolic Grounded Framework for Embodied Planning with Large Multimodal Models [Paper]
- JARVIS: A Neuro-Symbolic Commonsense Reasoning Framework for Conversational Embodied Agents [Paper]
(d) World Models
- WALL-E: World Alignment by Rule Learning Improves World Model-based LLM Agents [Paper] [Code]
- Grounding Large Language Models In Embodied Environment With Imperfect World Models [Paper]
- GenRL: Multimodal-foundation world models for generalization in embodied agents [Paper] [Code]
- AeroVerse: UAV-Agent Benchmark Suite for Simulating, Pre-training, Finetuning, and Evaluating Aerospace Embodied World Models [Paper]
- Do Embodied Agents Dream of Pixelated Sheep: Embodied Decision Making using Language Guided World Modelling [Paper]
(a) Vision-Language-Action Models (VLA)
- PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs [Paper] [Code]
- ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation [Paper] [Code]
- DP3: 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations [Paper] [Code]
- RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation [Paper] [Code]
- MBA: Motion Before Action: Diffusing Object Motion as Manipulation Condition [Paper] [Code]
- RVT-2: Learning Precise Manipulation from Few Demonstrations [Paper] [Code]
- DP: Diffusion Policy: Visuomotor Policy Learning via Action Diffusion [Paper] [Code]
- RVT: Robotic View Transformer for 3D Object Manipulation [Paper] [Code]
- ACT: Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware [Paper] [Code]
- Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions [Paper] [Code]
(b) Vision-Language-Navigation Models (VLN)
- CM2: Cross-modal Map Learning for Vision and Language Navigation [Paper] [Code]
- Talk2Nav: Long-Range Vision-and-Language Navigation with Dual Attention and Spatial Memory [Paper] [Code]
- Waypoints Predictor: Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation [Paper] [Code]
(a) Vision-Language-Action Models (VLA)
(b) Vision-Language-Navigation Models (VLN)
- MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation [Paper] [Code]
- InstructNav: Zero-shot System for Generic Instruction Navigation in Unexplored Environment [Paper] [Code]
- NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models [Paper] [Code]
- NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models [Paper] [Code]
- NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation [Paper] [Code]
- CLIP-Nav: Using CLIP for Zero-Shot Vision-and-Language Navigation [Paper]
References
(a) Neural Memory Systems
- Mem2Ego: Empowering Vision-Language Models with Global-to-Ego Memory for Long-Horizon Embodied Navigation [Paper]
- MINDSTORES: Memory-Informed Neural Decision Synthesis for Task-Oriented Reinforcement in Embodied Systems [Paper]
- Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding [Paper]
- KARMA: Augmenting Embodied AI Agents with Long-and-short Term Memory Systems [Paper]
- Skip-SCAR: Hardware-Friendly High-Quality Embodied Visual Navigation [Paper]
- Neural Turing Machines [Paper]
(b) Structured and Symbolic Memory
- LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning [Paper] [Code]
- AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement [Paper] [Code]
- EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks [Paper]
- Scene-Driven Multimodal Knowledge Graph Construction for Embodied AI [Paper] [Code]
- Aligning Knowledge Graph with Visual Perception for Object-goal Navigation [Paper] [Code]
- Safety Control of Service Robots with LLMs and Embodied Knowledge Graphs [Paper]
- ESGNN: Towards Equivariant Scene Graph Neural Network for 3D Scene Understanding [Paper]
- 3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding [Paper] [Code]
- Embodied-RAG: General Non-Parametric Embodied Memory for Retrieval and Generation [Paper]
(c) Spatial and Episodic Memory
- STMA: A Spatio-Temporal Memory Agent for Long-Horizon Embodied Task Planning [Paper]
(a) Adaptive Learning Over Time
- NeSyC: A Neuro-symbolic Continual Learner For Complex Embodied Tasks In Open Domains [Paper]
- Active Learning for Continual Learning: Keeping the Past Alive in the Present [Paper]
- Voyager: An Open-Ended Embodied Agent with Large Language Models [Paper] [Code]
- Embodied Lifelong Learning for Task and Motion Planning [Paper]
- Fast-Slow Test-Time Adaptation for Online Vision-and-Language Navigation [Paper] [Code]
- Building Open-Ended Embodied Agent via Language-Policy Bidirectional Adaptation [Paper]
(b) Self-Guided and Efficient Learning
- DRESS: Disentangled Representation-based Self-Supervised Meta-Learning for Diverse Tasks [Paper] [Code]
- ReMA: Learning to Meta-Think for LLMs with Multi-Agent Reinforcement Learning [Paper]
- Self-Supervised Meta-Learning for All-Layer DNN-Based Adaptive Control with Stability Guarantees [Paper]
(c) Multimodal Integration and Knowledge Fusion
- UniCL: A Universal Contrastive Learning Framework for Large Time Series Models [Paper]
- Binding Touch to Everything: Learning Unified Multimodal Tactile Representations [Paper] [Code]
References
6.1.1. Neuromorphic Hardware
- Loihi: A neuromorphic manycore processor with on-chip learning [Paper]
- NeuronFlow: A hybrid neuromorphic-dataflow processor architecture for AI workloads [Paper]
- Efficient neuromorphic signal processing with Loihi 2 [Paper]
- The BrainScaleS-2 accelerated neuromorphic system with hybrid plasticity [Paper]
- Neuromorphic artificial intelligence systems [Paper]
- Speck: A smart event-based vision sensor with a low latency 327k neuron convolutional neuronal network processing pipeline [Paper]
- PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory [Paper]
- NeuroSim: A circuit-level macro model for benchmarking neuro-inspired architectures in online learning [Paper]
- A phase-change memory model for neuromorphic computing [Paper]
- PUMA: A programmable ultra-efficient memristor-based accelerator for machine learning inference [Paper]
- ODIN: A bit-parallel stochastic arithmetic based accelerator for in-situ neural network processing in phase change RAM [Paper]
- LightOn optical processing unit: Scaling-up AI and HPC with a non von Neumann co-processor [Paper]
- An on-chip photonic deep neural network for image classification [Paper]
- Quantum reservoir computing in finite dimensions [Paper]
- Theoretical error performance analysis for variational quantum circuit based functional regression [Paper]
6.1.2. Software Frameworks for Neural Brains
- Speaker-follower models for vision-and-language navigation [Paper]
- AudioCLIP: Extending CLIP to Image, Text and Audio [Paper]
- VLN BERT: A recurrent vision-and-language BERT for navigation [Paper]
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models [Paper]
- GPT-4 technical report [Paper]
- PaLM 2 technical report [Paper]
- Qwen technical report [Paper]
- RT-2: Vision-language-action models transfer web knowledge to robotic control [Paper]
- DeepSeek-V3 technical report [Paper]
- NEST: A network simulation and prototyping testbed [Paper]
- BindsNET: A machine learning-oriented spiking neural networks library in Python [Paper]
- Brian 2, an intuitive and efficient neural simulator [Paper]
- SpiNNaker: A spiking neural network architecture [Paper]
- Norse: A deep learning library for spiking neural networks [Paper]
- TensorRT inference with TensorFlow [Paper]
- Compiling ONNX neural network models using MLIR [Paper]
- Impact of thermal throttling on long-term visual inference in a CPU-based edge device [Paper]
- Comparison and benchmarking of AI models and frameworks on mobile devices [Paper]
- TensorFlow Lite Micro: Embedded machine learning for TinyML systems [Paper]
6.1.3. Energy-Efficient Learning at the Edge
- Sparse convolutional neural networks [Paper]
- Training sparse neural networks [Paper]
- SCNN: An accelerator for compressed-sparse convolutional neural networks [Paper]
- Sparse computation in adaptive spiking neural networks [Paper]
- SBNet: Sparse blocks network for fast inference [Paper]
- Big Bird: Transformers for longer sequences [Paper]
- GLaM: Efficient scaling of language models with mixture-of-experts [Paper]
- Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity [Paper]
- BASE Layers: Simplifying training of large, sparse models [Paper]
- Distilling the knowledge in a neural network [Paper]
- Quantization and training of neural networks for efficient integer-arithmetic-only inference [Paper]
- STDP-based pruning of connections and weight quantization in spiking neural networks for energy-efficient recognition [Paper]
- Quantization networks [Paper]
- Quantization framework for fast spiking neural networks [Paper]
- Efficient neural networks for edge devices [Paper]
- A million spiking-neuron integrated circuit with a scalable communication network and interface [Paper]
- Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks [Paper]
- MAERI: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects [Paper]
- DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs [Paper]
- TVM: An automated end-to-end optimizing compiler for deep learning [Paper]
- SECDA: Efficient hardware/software co-design of FPGA-based DNN accelerators for edge inference [Paper]
References
- Challenges for large-scale implementations of spiking neural networks on FPGAs [Paper]
- Progress and Challenges in Large Scale Spiking Neural Networks for AI and Neuroscience [Paper]
- A review on methods, issues and challenges in neuromorphic engineering [Paper]
- Real-Time Neuromorphic Navigation: Guiding Physical Robots with Event-Based Sensing and Task-Specific Reconfigurable Autonomy Stack [Paper]
If you find this work useful, please consider citing our paper:
@article{2025neuralbrain,
  title={Neural Brain: A Neuroscience-inspired Framework for Embodied Agents},
  author={Liu, Jian and Shi, Xiongtao and Nguyen, Thai and Zhang, Haitian and Zhang, Tianxiang and Sun, Wei and Li, Yanjie and Vasilakos, Athanasios and Iacca, Giovanni and Khan, Arshad and others},
  journal={arXiv preprint arXiv:2505.07634},
  year={2025}
}
Given the limits of our own knowledge, if you find any issues or have any suggestions, please feel free to post an issue or contact us via email.