HTRU2 Pulsar Detection

“This repository contains a complete machine learning pipeline for pulsar candidate classification in the HTRU2 dataset. The project evaluates and compares 10 ML models, applies robust preprocessing, and includes interpretability tools like SHAP to uncover feature insights. Our optimized SVM model achieves 97.08% ROC AUC, with astrophysically meaningful feature rankings. This framework is designed for both scientific discovery and operational deployment in large-scale radio surveys.”

A comprehensive machine learning pipeline for detecting pulsars in the HTRU2 dataset using various classification algorithms. This project implements state-of-the-art techniques for astronomical signal processing and classification, addressing the class imbalance challenge inherent in pulsar detection.

About the Project

Pulsars are rapidly rotating neutron stars that emit beams of electromagnetic radiation. This project focuses on automating the detection of pulsar candidates from the High Time Resolution Universe Survey (HTRU2) dataset using machine learning techniques.

Objectives

Develop robust classification models for pulsar detection
Handle severe class imbalance (91% non-pulsars vs 9% pulsars)
Implement feature engineering and selection techniques
Provide interpretable results using feature importance analysis
Compare performance across multiple algorithms

Scientific Context

Pulsar detection is crucial for:

Understanding neutron star physics
Gravitational wave detection
Tests of general relativity
Galactic structure studies

Dataset

The HTRU2 dataset contains 17,898 pulsar candidates described by 8 continuous variables:

Integrated Profile Statistics:

Mean of integrated profile
Standard deviation of integrated profile
Excess kurtosis of integrated profile
Skewness of integrated profile

DM-SNR Curve Statistics: 5. Mean of DM-SNR curve 6. Standard deviation of DM-SNR curve 7. Excess kurtosis of DM-SNR curve 8. Skewness of DM-SNR curve

Target Variable:

Class: 0 (non-pulsar) or 1 (pulsar)

Data Characteristics:

Total samples: 17,898
Pulsars: 1,639 (9.16%)
Non-pulsars: 16,259 (90.84%)
Missing values: None

Project Structure

HTRU2-Pulsar-Detection/
├── data/                           # Dataset files
│   ├── HTRU_2.csv                  # Original HTRU2 data                    
├── notebooks/                      # Jupyter notebooks
│   ├── 01_EDA.ipynb                # Exploratory Data Analysis
│   ├── 02_Modeling.ipynb           # Model training and evaluation
│   └── 03_Interpretability.ipynb   # Feature importance and interpretability
├── src/                            # Source code modules
│   ├── __init__.py
│   ├── models.py                  # Model implementations
│   ├── preprocess.py              # Preprocessing methods and functions
│   └── utils.py                   # Utility functions
├── models/                        # Trained model artifacts
│   ├── SVM_best.pkl
│   └── scaler.pkl
├── results/                       # Analysis outputs
│   ├── figures/                   # Visualizations
│   │   ├── confusion_matrices
│   │   ├── correlation_matrix
│   │   ├── data_overview
│   │   ├── error_analysis
│   │   ├── feature_boxplots
│   │   ├── feature_distributions
│   │   ├── partial_dependence_svm
│   │   ├── pr_curves
│   │   ├── roc_curves
│   │   ├── shap_force_pulsar_svm
│   │   ├── shap_summary_svm
│   │   ├── SVM_feature_importance
│   │   ├── threshold_optimization
│   │   └── pca_analysis               
│   └── metrics/                   # Performance metrics
├── paper/                         # Research paper
├── requirements.txt               # Python dependencies
├── environment.yml                # Conda environment
├── .gitignore                     # Git ignore file
└── README.md                      # This file

Key Findings

Best Performing Model

Algorithm: Support Vector Machine (SVM) with RBF kernel
Validation ROC AUC: 0.9843
Test ROC AUC: 0.9708
Test Precision: 0.8287
Test Recall: 0.9146
Test F1-Score: 0.8696

Feature Importance (SVM Analysis)

Excess kurtosis of integrated profile (1.7413 importance)
Skewness of DM-SNR curve (0.5286 importance)
Standard deviation of DM-SNR curve (0.4870 importance)
Mean of integrated profile (0.4768 importance)
Excess kurtosis of DM-SNR curve (0.4661 importance)

Model Comparison (Validation Performance)

Model	ROC AUC	Precision	Recall	F1-Score
SVM	0.9843	0.8287	0.9146	0.8696
LogisticRegression	0.9837	0.7906	0.9207	0.8507
LightGBM	0.9729	0.8506	0.9024	0.8757
XGBoost	0.9720	0.8315	0.9024	0.8655
CatBoost	0.9714	0.8287	0.9146	0.8696

Cross-Validation Results (ROC AUC)

Model	CV Mean	CV Std	CV Min	CV Max
RandomForest	0.997064	0.000780	0.996188	0.998177
XGBoost	0.996502	0.000754	0.995705	0.997846
CatBoost	0.996096	0.000678	0.995362	0.997289
LightGBM	0.995919	0.000786	0.994984	0.997214
MLP	0.991762	0.001428	0.989590	0.993915

Getting Started

Prerequisites

Python 3.8 or higher
Git
Jupyter Lab/Notebook

Installation

Clone the repository:

git clone https://github.com/KhamessiTaha/HTRU2-Pulsar-Detection.git
cd HTRU2-Pulsar-Detection

Create a virtual environment (recommended):

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Or using conda:

conda env create -f environment.yml
conda activate htru2-pulsar

Quick Start

Download the HTRU2 dataset:

# Dataset will be automatically downloaded when running the first notebook
# Or manually download from: https://archive.ics.uci.edu/ml/datasets/HTRU2

Run the analysis pipeline:
```
jupyter lab
```
Execute notebooks in order:
- 01_EDA.ipynb → Exploratory Data Analysis
- 02_Modeling.ipynb → Model Training and Evaluation
- 03_Interpretability.ipynb → Feature Importance Analysis and Interpretation

Usage

Running Individual Components

Data Preprocessing:

from src.preprocess import preprocess_data
X_train, X_test, y_train, y_test = preprocess_data('data/HTRU_2.csv')

Model Training:

from src.models import train_svm_model
model = train_svm_model(X_train, y_train)

Evaluation:

from src.utils import evaluate_model
metrics = evaluate_model(model, X_test, y_test)

Methodology

Data Preprocessing

Scaling: RobustScaler for feature normalization (robust to outliers)
Class Balancing: SMOTE (Synthetic Minority Oversampling Technique)
Train/Validation/Test Split: 70%/10%/20% stratified split

Model Selection

Cross-Validation: 5-fold stratified cross-validation
Hyperparameter Tuning: GridSearchCV with ROC AUC optimization
Multiple Algorithms: Comparison of 10 different classifiers

Evaluation Metrics

Primary: ROC AUC (handles class imbalance well)
Secondary: Precision, Recall, F1-Score, MCC
Specialized: PR AUC, Balanced Accuracy, Specificity

Results

Performance Summary

The SVM model achieved exceptional performance with a validation ROC AUC of 0.9843 and test ROC AUC of 0.9708, demonstrating excellent discrimination between pulsars and non-pulsars. The model shows:

High Test Precision: 82.87% of predicted pulsars are actual pulsars
High Test Recall: 91.46% of actual pulsars are correctly identified
Balanced Performance: Test F1-score of 86.96% indicates good balance
Strong Generalization: Minimal overfitting between validation and test performance

Feature Insights

Excess kurtosis of integrated profile is the most discriminative feature
DM-SNR curve statistics (skewness and standard deviation) provide significant classification power
Integrated profile statistics complement DM-SNR features effectively
Combined features achieve substantially better performance than individual metrics

Data Processing Pipeline

Balanced Training Set: SMOTE increased training samples from 12,528 to 22,762 (50%/50% class distribution)
Robust Scaling: Applied to handle outliers in astronomical data
Stratified Sampling: Maintains class proportions across splits

Astronomical Implications

The high performance suggests that machine learning can reliably automate pulsar detection, potentially:

Reducing manual review time by 90%+
Discovering new pulsars in large-scale surveys
Enabling real-time pulsar candidate classification
Supporting next-generation radio telescope surveys

Contributing

Contributions are welcome! Please follow these steps:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Areas for Contribution

Deep learning model implementations
Additional feature engineering techniques
Real-time classification pipeline
Web interface for model deployment
Extended dataset integration

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

HTRU2 Dataset: R. J. Lyon et al. (University of Manchester)
Original Paper: Lyon, R. J., et al. "Fifty Years of Pulsar Candidate Selection: From simple filters to a new principled real-time classification approach." Monthly Notices of the Royal Astronomical Society 459.1 (2016): 1104-1123.
Scikit-learn Community: For excellent machine learning tools
Python Data Science Stack: NumPy, Pandas, Matplotlib, Seaborn

Citation

If you use this work in your research, please cite:

@misc{htru2_pulsar_detection,
  author = {Taha Khamessi},
  title = {HTRU2 Pulsar Detection: A Machine Learning Approach},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/KhamessiTaha/HTRU2-Pulsar-Detection}}
}

Contact: taha.khamessi@gmail.com

Project Link: https://github.com/KhamessiTaha/HTRU2-Pulsar-Detection

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
models		models
notebooks		notebooks
paper		paper
results/figures		results/figures
src		src
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
environment.yml		environment.yml
requirements.txt		requirements.txt

License

KhamessiTaha/HRTU2-Pulsar-Detection

Folders and files

Latest commit

History

Repository files navigation

HTRU2 Pulsar Detection

Table of Contents

About the Project

Objectives

Scientific Context

Dataset

Project Structure

Key Findings

Best Performing Model

Feature Importance (SVM Analysis)

Model Comparison (Validation Performance)

Cross-Validation Results (ROC AUC)

Getting Started

Prerequisites

Installation

Quick Start

Usage

Running Individual Components

Methodology

Data Preprocessing

Model Selection

Evaluation Metrics

Results

Performance Summary

Feature Insights

Data Processing Pipeline

Astronomical Implications

Contributing

Areas for Contribution

License

Acknowledgments

Citation

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Languages