

Reinforcement Learning for Decision-Level Interception Prioritization in Drone Swarm Defense

This repository contains a reinforcement learning (RL) framework for the decision-level interception prioritization of drone swarms. The project is designed to evaluate the performance of RL agents against classical heuristic methods in a simulated environment, focusing on the interception of hostile drones by kinetic effectors to minimize damage to sensitive zones.

The RL agents are trained to prioritize drone targets based on their potential threat levels, with the goal of maximizing the effectiveness of the defense system while minimizing collateral damage.

Setup

To get started, create a new Python environment and install the required dependencies. The following commands will set up a Conda environment and install all necessary packages, including PyTorch for CPU-only systems and other requirements listed in the requirements.txt file:

conda create -n cuas python=3.11
conda activate cuas
pip install --no-cache-dir torch==2.5.1+cpu --index-url https://download.pytorch.org/whl/cpu  # For CPU-only systems
pip install -r requirements.txt
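
As a quick sanity check that the environment resolved correctly, you can print the installed Torch build (it should report a +cpu version when installed with the command above):

python -c "import torch; print(torch.__version__)"  # expect a +cpu build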

Simulator

The simulator models a defense scenario where a swarm of kamikaze drones autonomously targets high-value zones protected by kinetic effectors (such as interceptors or directed energy weapons). The environment is three-dimensional and includes configurable numbers of hostile drones, static sensitive zones, and effectors with realistic kinematic and weapon dynamics. Each effector can only fire when locked onto a target and ready, and must periodically recharge.

Episodes begin with drones spawned at random locations, each heading for a zone according to pre-defined attack policies that are unknown to the defender. The defender receives noisy, partial observations and must prioritize which drones to intercept at every timestep, considering constraints like limited firing rate, angular speed, and line-of-sight. The simulation supports large-scale, multi-agent scenarios and batch evaluation.

Attackers vary in speed, size, explosive power, and flight path, and their coordination is fixed to simulate low-cost adversaries. The defender’s challenge is to minimize total damage by making effective, real-time prioritization decisions under uncertainty and resource limitations. All scenario elements, including zones, drones, effectors, and sensors, are highly configurable for flexible experimentation.
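
Since the agents are trained with Stable Baselines 3, the environment presumably exposes a Gymnasium-style interface. The following is a minimal sketch of how such an environment would be driven; the class name, constructor arguments, and info fields are illustrative assumptions, not the repository's actual API:

# A minimal sketch, assuming a Gymnasium-style environment; module path,
# class name, and parameters below are hypothetical.
from simulator import SwarmDefenseEnv  # hypothetical import

env = SwarmDefenseEnv(
    n_drones=10,     # hostile kamikaze drones (assumed parameter name)
    n_effectors=3,   # kinetic effectors protecting the zones
    n_zones=2,       # static sensitive zones
    seed=42,
)

obs, info = env.reset()
done = False

while not done:
    # The decision at each timestep is a prioritization: which drone each
    # effector should engage next. Here we simply sample a random valid action.
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

# info is assumed to expose episode-level metrics such as total zone damage.
print("Episode finished:", info)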

The following figures illustrate key aspects of the simulation environment. The first image shows the scenario simulator in execution, including all relevant infographics such as drone and effector states, and protected zones. The second image presents the drone neutralization probability as a function of miss distance, providing insight into the effectiveness of the defense system under varying engagement conditions.

[Figure: scenario simulator in execution (left); drone neutralization probability as a function of miss distance (right)]
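
The exact neutralization-probability curve is part of the simulator configuration; purely as an illustration of the kind of relationship the plot shows, a kill probability that decays smoothly with miss distance could be modeled as follows (the functional form and lethal radius here are assumptions, not the values used by the simulator):

import numpy as np

def neutralization_probability(miss_distance_m, lethal_radius_m=5.0):
    # Illustrative only: probability 1 at zero miss distance, decaying smoothly
    # with distance. The curve and parameters actually used by the simulator
    # are configurable and may differ in shape.
    return np.exp(-0.5 * (miss_distance_m / lethal_radius_m) ** 2)

print(neutralization_probability(np.array([0.0, 2.5, 5.0, 10.0])))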

Training

To train a new reinforcement learning agent, configure your training parameters in train/config.yaml and run the training script:

python training.py --config train/config.yaml

The script supports resuming from checkpoints, automatic saving, evaluation during training, and early stopping based on reward thresholds or lack of improvement. Training and evaluation environments, model checkpoints, and logs are managed automatically according to your configuration.

Two versions of the PPO algorithm are available: the original PPO and MaskablePPO, which supports masking invalid actions. Both are provided via Stable Baselines 3 (MaskablePPO through the sb3-contrib package). You can select which algorithm to use by setting the algo field in your configuration file.
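
A rough sketch of how the algo field could map onto the two algorithm classes is shown below; the class names are the real Stable Baselines 3 / sb3-contrib ones, but the accepted string values and the helper function are assumptions, not the repository's actual training code:

# Sketch of selecting the training algorithm from a config value.
from stable_baselines3 import PPO
from sb3_contrib import MaskablePPO

ALGORITHMS = {
    "ppo": PPO,
    "maskable_ppo": MaskablePPO,  # supports masking of invalid actions
}

def make_model(algo_name, env, **hyperparams):
    try:
        algo_cls = ALGORITHMS[algo_name.lower()]
    except KeyError:
        raise ValueError(f"Unknown algo '{algo_name}'; expected one of {list(ALGORITHMS)}")
    # Both classes share the Stable Baselines 3 constructor signature.
    return algo_cls("MlpPolicy", env, **hyperparams)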

The following image shows a comparison of training curves between PPO and MaskablePPO:

[Figure: training curves]
Training performance of PPO vs. MaskablePPO, showing cumulative reward per episode over environment steps. MaskablePPO converges roughly 10× faster by masking invalid actions (e.g., targeting already-neutralized drones), enabling more efficient and stable learning.
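
In sb3-contrib, MaskablePPO obtains the valid-action mask from the environment, typically through the ActionMasker wrapper. A minimal sketch of how such a mask could be exposed is shown below, assuming a discrete choose-a-target action space and a hypothetical drone_alive state array; the repository's actual masking logic may differ:

import numpy as np
from sb3_contrib.common.wrappers import ActionMasker

def mask_fn(env):
    # Valid targets are drones that are still active; targeting an
    # already-neutralized drone is an invalid action. `drone_alive` is a
    # hypothetical boolean attribute, not necessarily the simulator's own.
    return np.asarray(env.drone_alive, dtype=bool)

def wrap_with_mask(env):
    # MaskablePPO queries this wrapper for the valid-action mask at every step.
    return ActionMasker(env, mask_fn)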

Evaluation and Results

To run a single inference episode and visualize or evaluate a specific policy (e.g., DeepRL, Classic, or Random), use the inference.py script. This allows you to observe the agent's behavior and performance in the environment. For example, to run a single episode with the DeepRL policy and rendering enabled, use:

python inference.py --policy deeprl --n_episodes 1 --seed 42

You can change the --policy argument to classic or random to evaluate other policies. Use the --no_render flag to disable visualization and speed up evaluation.
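
Conceptually, the three policies share the same decision interface and differ only in how they pick a target. The sketch below illustrates one plausible dispatch for the --policy flag; the observation field and the heuristic are hypothetical stand-ins, not the code in inference.py:

import numpy as np

def classic_heuristic(obs):
    # Hypothetical stand-in for the classical baseline: engage the drone with
    # the smallest reported distance to any protected zone. The repository's
    # actual heuristic is more elaborate.
    return int(np.argmin(np.asarray(obs["drone_zone_distance"])))  # assumed obs field

def select_action(policy_name, model, obs, env):
    if policy_name == "deeprl":
        action, _ = model.predict(obs, deterministic=True)  # trained SB3 model
        return action
    if policy_name == "classic":
        return classic_heuristic(obs)
    if policy_name == "random":
        return env.action_space.sample()
    raise ValueError(f"Unknown policy: {policy_name}")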

For a comprehensive comparison of all policies and automatic generation of evaluation figures and metrics, use the comparative_evaluation.py script. This script runs multiple episodes for each policy, aggregates the results, and produces all relevant plots and summary tables for damage, tracking, and weapon utilization:

python comparative_evaluation.py --n_episodes 100 --seeds 10 20 30 42 50

The script will save the results and figures in the appropriate folders, allowing for easy analysis and reproducibility of the evaluation.
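
As a sketch of the kind of aggregation this produces, per-episode metrics can be averaged per policy across all episodes and seeds; the metric names below are assumptions about the evaluation output, not the script's actual data format:

import numpy as np

def aggregate(results):
    # results: dict mapping policy name -> list of per-episode metric dicts
    # with keys such as "damage", "tracking", "utilization" (assumed names).
    summary = {}
    for policy, episodes in results.items():
        summary[policy] = {
            key: float(np.mean([ep[key] for ep in episodes]))
            for key in ("damage", "tracking", "utilization")
        }
    return summary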

The following table and figures are generated using the default parameters of the comparative evaluation script. They provide a comprehensive summary of the main evaluation metrics and visual comparisons between the different policies.

| Metric | Classical Heuristic | Reinforcement Learning |
|---|---|---|
| Total Damage (Avg) [%] | 50.34 | 41.30 |
| In-Tracking Time (Avg) [%] | 52.87 | 65.59 |
| Weapon Utilization (Avg) [%] | 54.35 | 62.79 |

Table: Evaluation results, 100 episodes × 5 seeds.

[Figure: demo video and damage comparison]
Distribution of total zone damage percentage for each controller. The RL agent consistently limits damage to critical zones compared to the heuristic baseline and random controller.

[Figure: (a) tracking performance, (b) weapon utilization]
Comparison of controller performance across two key enabling metrics: (a) target tracking efficiency and (b) weapon utilization. The DeepRL policy consistently achieves superior performance in both categories compared to the classical and random controllers, indicating improved resource allocation and sustained threat engagement over time.

[Figure: (a) damage vs. tracking correlation, (b) damage vs. weapon utilization correlation]
Scatter plots showing the relationship between zone damage and (a) tracking efficiency and (b) weapon utilization. While both correlations are negative, they are not strongly linear, highlighting that increased engagement opportunities (via better tracking and utilization) generally help reduce damage, but do not fully determine it due to the complex interplay of prioritization and threat behavior.
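
The correlation statement above can be checked directly from per-episode results; a minimal sketch using Pearson correlation is shown below, assuming 1-D arrays of per-episode values (the evaluation script's actual output format may differ):

import numpy as np

def damage_correlations(damage, tracking, utilization):
    # Pearson correlation between per-episode zone damage and each enabling
    # metric; inputs are 1-D arrays of equal length.
    r_tracking = np.corrcoef(damage, tracking)[0, 1]
    r_utilization = np.corrcoef(damage, utilization)[0, 1]
    return r_tracking, r_utilization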

Citation

@misc{palmas2025reinforcementlearningdecisionlevelinterception,
      title={Reinforcement Learning for Decision-Level Interception Prioritization in Drone Swarm Defense},
      author={Alessandro Palmas},
      year={2025},
      eprint={2508.00641},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.00641},
}
