A state-of-the-art Deep Reinforcement Learning system for controlling model rocket attitude using Thrust Vector Control (TVC). This project implements a multi-algorithm ensemble with transformer networks, hierarchical RL, and physics-informed neural networks, trained in a realistic PyBullet physics simulation with comprehensive anti-reward-hacking measures.
- Multi-Algorithm Ensemble: PPO + SAC + TD3 with intelligent algorithm selection
- Transformer Networks: Attention-based policies for temporal dependencies
- Hierarchical RL: Automatic skill discovery and decomposition
- Physics-Informed: Domain knowledge integration in neural networks
- Curiosity-Driven: Intrinsic motivation for efficient exploration
- Anti-Reward-Hacking: Real mission success detection with landing criteria
- Safety Constraints: Control Barrier Functions for safe operation
- Real-time Monitoring: Comprehensive reward hacking detection
- Adaptive Curriculum: Progressive learning with 6 difficulty stages
- Key Features
- Technical Deep Dive
- Simulation Environment
- DRL Agent: Soft Actor-Critic
- State, Action, and Reward
- Project Structure
- Getting Started
- Prerequisites
- Installation & Setup
- Training the Agent
- Evaluating the Policy
- Deploying to a Microcontroller
- Acknowledgements & References
- High-Fidelity 3D Physics: A digital twin built with PyBullet simulates rigid body dynamics, gravity, variable mass, and TVC motor thrust. RocketPy can be integrated for accurate aerodynamic force modeling.
- Sim-to-Real via Domain Randomization: To ensure the policy is robust, the simulation randomizes key physical parameters on every run: rocket mass, center of gravity (CG), motor thrust curves, sensor noise (IMU), and actuator delay/response.
- Standard Gymnasium Interface: The environment adheres to the modern Gymnasium API, making it compatible with a wide range of DRL libraries and algorithms (see the usage sketch after this list).
- State-of-the-Art DRL Agent: Employs Soft Actor-Critic (SAC), an off-policy algorithm known for its sample efficiency and stability, making it ideal for complex physics tasks.
- Shaped Reward Function: A carefully designed reward function guides the agent to maintain vertical stability, minimize angular velocity (reduce oscillations), and use control inputs efficiently.
- End-to-End Workflow: Provides a complete pipeline from training and evaluation to model quantization and deployment.
- MCU-Ready Deployment: The trained policy is optimized using Post-Training Quantization (8-bit) and converted into a C-array via TensorFlow Lite for Microcontrollers (TFLM) for fast, on-device inference.
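For the MCU deployment step above, `scripts/export_tflm.py` is the intended entry point in this repository. Purely as an illustration, post-training int8 quantization with the TensorFlow Lite converter looks roughly like the sketch below; it assumes the trained policy has already been exported as a TensorFlow SavedModel, and the paths, input shape, and calibration data are placeholders.

```python
# Sketch: post-training int8 quantization for TFLM (illustrative only).
# Assumes the trained policy has already been exported as a TensorFlow
# SavedModel; the paths, input shape, and calibration data are placeholders.
import numpy as np
import tensorflow as tf

def representative_states(num_samples: int = 200):
    # Yield example state vectors [qx, qy, qz, qw, wx, wy, wz] so the
    # converter can calibrate the int8 quantization ranges.
    for _ in range(num_samples):
        yield [np.random.uniform(-1.0, 1.0, size=(1, 7)).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("models/policy_savedmodel")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_states
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("models/policy_int8.tflite", "wb") as f:
    f.write(converter.convert())

# The .tflite flatbuffer can then be embedded as a C array for TFLM, e.g.:
#   xxd -i models/policy_int8.tflite > policy_model.h
```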
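Because the environment follows the Gymnasium API, it can be driven with the familiar reset/step loop. The sketch below is illustrative only: the `TVCRocketEnv` name is an assumption (the actual class lives in `env/`), and a built-in continuous-control task is used as a stand-in so the snippet runs anywhere.

```python
# Sketch: the standard Gymnasium reset/step loop the environment supports.
import gymnasium as gym

# from env import TVCRocketEnv   # hypothetical import; replace with the real one
# env = TVCRocketEnv()
env = gym.make("Pendulum-v1")    # stand-in env with a continuous action space

obs, info = env.reset(seed=0)
for _ in range(500):
    action = env.action_space.sample()  # a trained SAC policy would act here
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```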
The core of this project is the simulated environment. A robust policy can only be learned if the simulation it's trained in is a close-enough approximation of reality.
- Physics Engine: PyBullet is used for its fast and stable rigid body dynamics simulation.
- State Representation: The rocket's state is observed at each time step. The state vector includes:
  - Attitude (Quaternion): `[qx, qy, qz, qw]`, a 4D representation that avoids gimbal lock.
  - Angular Velocity: `[ω_x, ω_y, ω_z]`, critical for damping rotational motion.
- Imperfections: Real-world rockets are not perfect. Domain randomization introduces controlled chaos, forcing the agent to learn a policy that can handle variations in hardware and conditions, which is the key to bridging the "sim-to-real" gap.
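To make this concrete, a domain-randomization hook typically resamples the physical parameters at every episode reset. The sketch below is illustrative: the parameter names, ranges, and the `sample_rocket_params` helper are assumptions, not the values defined in `env/`.

```python
# Sketch: domain randomization applied at every episode reset.
# The dataclass fields and ranges are illustrative assumptions; the real
# environment in env/ defines its own parameters and distributions.
import random
from dataclasses import dataclass

@dataclass
class RocketParams:
    mass_kg: float
    cg_offset_m: float        # axial center-of-gravity shift
    thrust_scale: float       # multiplier on the nominal thrust curve
    imu_noise_std: float      # noise on gyro readings (rad/s)
    actuator_delay_s: float   # TVC servo response delay

def sample_rocket_params(rng: random.Random) -> RocketParams:
    """Draw a new set of physical parameters for one episode."""
    return RocketParams(
        mass_kg=rng.uniform(0.55, 0.75),
        cg_offset_m=rng.uniform(-0.02, 0.02),
        thrust_scale=rng.uniform(0.9, 1.1),
        imu_noise_std=rng.uniform(0.0, 0.01),
        actuator_delay_s=rng.uniform(0.0, 0.03),
    )

# Inside the environment's reset(), something like
#   self.params = sample_rocket_params(self.rng)
# would then be used to rebuild the PyBullet body and sensor models.
```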
We use Soft Actor-Critic (SAC) for its balance of exploration and exploitation.
- Entropy Maximization: SAC's objective is to maximize not only the cumulative reward but also the entropy of its policy. This encourages broader exploration, preventing the agent from settling into a suboptimal local minimum and making it more resilient to perturbations.
- Actor-Critic Architecture:
- Actor (Policy): An MLP that takes the rocket's state as input and outputs the parameters (mean and standard deviation) of a Gaussian distribution for each action (gimbal pitch, gimbal yaw). The actual action is then sampled from this distribution.
- Critic (Value Function): One or more MLPs that learn to estimate the expected future reward from a given state-action pair. This estimate is used to "criticize" and improve the actor's policy.
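For intuition, a SAC-style squashed-Gaussian actor can be sketched as below. This is not the network defined in `agent/`; the layer sizes, state dimension, and clamping bounds are illustrative assumptions.

```python
# Sketch of a SAC-style squashed-Gaussian actor: state in, tanh-squashed
# action out. Hidden sizes and bounds are illustrative assumptions.
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    def __init__(self, state_dim: int = 7, action_dim: int = 2, hidden: int = 256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, state: torch.Tensor):
        h = self.body(state)
        mean = self.mean(h)
        log_std = self.log_std(h).clamp(-20, 2)   # keep std numerically sane
        dist = torch.distributions.Normal(mean, log_std.exp())
        raw_action = dist.rsample()                # reparameterized sample
        action = torch.tanh(raw_action)            # squash to [-1, 1] gimbal command
        # log-prob correction for the tanh squashing (used in the SAC loss)
        log_prob = dist.log_prob(raw_action) - torch.log1p(-action.pow(2) + 1e-6)
        return action, log_prob.sum(dim=-1)
```

The `tanh` squashing maps raw Gaussian samples into the bounded gimbal range, and the corrected log-probability is the term that SAC's entropy-regularized objective consumes.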
The interaction between the agent and environment is defined by these three components:
- State Space (Observations): `s_t = [qx, qy, qz, qw, ω_x, ω_y, ω_z]`
- Action Space (Control Inputs): `a_t = [gimbal_pitch_angle, gimbal_yaw_angle]` (continuous values, typically scaled to `[-1, 1]`).
- Reward Function (`r_t`): The agent's behavior is shaped by the reward `r_t` it receives at each step. Our function is a sum of several components:
  - Attitude Reward: `R_attitude = exp(-k_angle * θ_total^2)`, where `θ_total` is the total angle off vertical. This provides a strong, smooth incentive to stay upright.
  - Angular Velocity Penalty: `R_ang_vel = -k_vel * ||ω||^2`. This penalizes rapid spinning and encourages smooth, damped control.
  - Control Effort Penalty: `R_action = -k_action * ||a||^2`. This penalizes large, aggressive gimbal movements, promoting efficiency.
  - Termination Penalty: A large negative reward (e.g., -100) is given if the rocket tilts beyond a failure threshold (e.g., 20°), immediately ending the episode.
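Putting those components together, the per-step reward can be computed roughly as in the sketch below; the gain values `k_*` are placeholders standing in for the tuned values in the repo's config, while the 20° threshold and -100 penalty mirror the examples above.

```python
# Sketch of the shaped reward described above; the gains K_* are
# illustrative placeholders, not the tuned values used in training.
import numpy as np

K_ANGLE, K_VEL, K_ACTION = 5.0, 0.05, 0.01
TILT_LIMIT_RAD = np.deg2rad(20.0)
TERMINATION_PENALTY = -100.0

def step_reward(theta_total: float, omega: np.ndarray, action: np.ndarray):
    """Return (reward, terminated) for one control step.

    theta_total: total angle off vertical, in radians.
    omega:       body angular velocity [wx, wy, wz] in rad/s.
    action:      commanded gimbal angles, scaled to [-1, 1].
    """
    if theta_total > TILT_LIMIT_RAD:
        return TERMINATION_PENALTY, True

    r_attitude = np.exp(-K_ANGLE * theta_total**2)        # stay upright
    r_ang_vel = -K_VEL * float(np.dot(omega, omega))      # damp rotation
    r_action = -K_ACTION * float(np.dot(action, action))  # small gimbal moves
    return r_attitude + r_ang_vel + r_action, False
```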
```
tvc-ai/
├── agent/                     # DRL agent implementation (SAC algorithm, networks)
├── env/                       # Rocket physics simulation (Gymnasium environment)
├── models/                    # Saved model checkpoints and exported TFLM models
├── scripts/                   # High-level scripts for training, evaluation, etc.
│   ├── train.py
│   ├── evaluate.py
│   └── export_tflm.py
├── config/                    # Configuration files (Hydra-based)
├── tests/                     # Test suite and benchmarks
├── utils/                     # Utility functions and helpers
├── verify_installation.py
├── setup.py                   # Automated setup script
├── requirements.txt           # Full dependencies
├── requirements-minimal.txt   # Core dependencies only
├── requirements-dev.txt       # Development dependencies
├── .gitignore
└── README.md
```
- Python 3.8-3.11 (3.10.11 recommended)
- 8GB+ RAM (16GB recommended for training)
- CUDA-capable GPU (optional but recommended for faster training)
- ~2-5GB disk space (depending on installation type)
Option 1: Automated Setup (Recommended)

```bash
# Clone the repository
git clone <repository-url>
cd TVC-AI

# Run automated setup
python setup.py
```

Option 2: Manual Installation

```bash
# Clone the repository
git clone <repository-url>
cd TVC-AI

# Choose your installation type:

# Minimal (core functionality only - ~2GB)
pip install -r requirements-minimal.txt

# Full (all features - ~4GB, recommended)
pip install -r requirements.txt

# Development (includes testing tools - ~5GB)
pip install -r requirements.txt -r requirements-dev.txt

# Verify installation
python verify_installation.py
```
- Verify Installation: `python verify_installation.py`
- Train Your First Model: `python scripts/train.py`
- Monitor Training: `tensorboard --logdir logs/`
- Evaluate Trained Model: `python scripts/evaluate.py --model_path models/best_model.pth`
Common Issues:
- PyBullet installation fails: Install Visual C++ redistributables on Windows.
- CUDA out of memory: Use CPU training: `python scripts/train.py device=cpu`
- Package conflicts: Use a virtual environment:

```bash
python -m venv tvc_env
source tvc_env/bin/activate  # On Windows: tvc_env\Scripts\activate
pip install -r requirements.txt
```
For detailed usage instructions, see USAGE.md.