This repository contains a complete machine learning pipeline for pulsar candidate classification on the HTRU2 dataset. The project evaluates and compares 10 ML models, applies robust preprocessing, and uses interpretability tools such as SHAP to examine how each feature contributes to predictions. The optimized SVM model reaches a test ROC AUC of 0.9708 (97.08%), with feature rankings that are astrophysically plausible. The framework is intended for both scientific discovery and operational deployment in large-scale radio surveys.
A comprehensive machine learning pipeline for detecting pulsars in the HTRU2 dataset using various classification algorithms. This project implements state-of-the-art techniques for astronomical signal processing and classification, addressing the class imbalance challenge inherent in pulsar detection.
- About the Project
- Dataset
- Project Structure
- Key Findings
- Getting Started
- Usage
- Methodology
- Results
- Contributing
- License
- Acknowledgments
Pulsars are rapidly rotating neutron stars that emit beams of electromagnetic radiation. This project focuses on automating the detection of pulsar candidates from the High Time Resolution Universe Survey (HTRU2) dataset using machine learning techniques.
- Develop robust classification models for pulsar detection
- Handle severe class imbalance (91% non-pulsars vs 9% pulsars)
- Implement feature engineering and selection techniques
- Provide interpretable results using feature importance analysis
- Compare performance across multiple algorithms
Pulsar detection is crucial for:
- Understanding neutron star physics
- Gravitational wave detection
- Tests of general relativity
- Galactic structure studies
The HTRU2 dataset contains 17,898 pulsar candidates described by 8 continuous variables:
Integrated Profile Statistics:
- Mean of integrated profile
- Standard deviation of integrated profile
- Excess kurtosis of integrated profile
- Skewness of integrated profile
DM-SNR Curve Statistics:
- Mean of DM-SNR curve
- Standard deviation of DM-SNR curve
- Excess kurtosis of DM-SNR curve
- Skewness of DM-SNR curve
Target Variable:
- Class: 0 (non-pulsar) or 1 (pulsar)
Data Characteristics:
- Total samples: 17,898
- Pulsars: 1,639 (9.16%)
- Non-pulsars: 16,259 (90.84%)
- Missing values: None
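As a quick check on the figures above, the class shares and the imbalance ratio follow directly from the two counts. The sketch below uses only the numbers listed in this section; the "always predict non-pulsar" baseline is added here as an illustration of why plain accuracy is a weak metric for this dataset.

```python
# Class counts as listed in the data characteristics above.
n_pulsars = 1_639
n_non_pulsars = 16_259
n_total = n_pulsars + n_non_pulsars

pulsar_share = 100 * n_pulsars / n_total
non_pulsar_share = 100 * n_non_pulsars / n_total
imbalance_ratio = n_non_pulsars / n_pulsars  # negatives per positive

# Accuracy of a trivial classifier that always predicts "non-pulsar".
majority_baseline_accuracy = n_non_pulsars / n_total

print(f"total candidates: {n_total}")
print(f"pulsars: {pulsar_share:.2f}%  non-pulsars: {non_pulsar_share:.2f}%")
print(f"imbalance ratio: {imbalance_ratio:.1f} : 1")
print(f"always-negative accuracy: {majority_baseline_accuracy:.4f}")
```

A classifier that never flags a pulsar would still score about 91% accuracy, which is why the evaluation relies on ROC AUC, precision, and recall instead.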
HTRU2-Pulsar-Detection/
├── data/ # Dataset files
│ └── HTRU_2.csv # Original HTRU2 data
├── notebooks/ # Jupyter notebooks
│ ├── 01_EDA.ipynb # Exploratory Data Analysis
│ ├── 02_Modeling.ipynb # Model training and evaluation
│ └── 03_Interpretability.ipynb # Feature importance and interpretability
├── src/ # Source code modules
│ ├── __init__.py
│ ├── models.py # Model implementations
│ ├── preprocess.py # Preprocessing methods and functions
│ └── utils.py # Utility functions
├── models/ # Trained model artifacts
│ ├── SVM_best.pkl
│ └── scaler.pkl
├── results/ # Analysis outputs
│ ├── figures/ # Visualizations
│ │ ├── confusion_matrices
│ │ ├── correlation_matrix
│ │ ├── data_overview
│ │ ├── error_analysis
│ │ ├── feature_boxplots
│ │ ├── feature_distributions
│ │ ├── partial_dependence_svm
│ │ ├── pr_curves
│ │ ├── roc_curves
│ │ ├── shap_force_pulsar_svm
│ │ ├── shap_summary_svm
│ │ ├── SVM_feature_importance
│ │ ├── threshold_optimization
│ │ └── pca_analysis
│ └── metrics/ # Performance metrics
├── paper/ # Research paper
├── requirements.txt # Python dependencies
├── environment.yml # Conda environment
├── .gitignore # Git ignore file
└── README.md # This file
- Algorithm: Support Vector Machine (SVM) with RBF kernel
- Validation ROC AUC: 0.9843
- Test ROC AUC: 0.9708
- Test Precision: 0.8287
- Test Recall: 0.9146
- Test F1-Score: 0.8696
- Excess kurtosis of integrated profile (1.7413 importance)
- Skewness of DM-SNR curve (0.5286 importance)
- Standard deviation of DM-SNR curve (0.4870 importance)
- Mean of integrated profile (0.4768 importance)
- Excess kurtosis of DM-SNR curve (0.4661 importance)
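An RBF-kernel SVM has no built-in feature importances, so rankings like the one above have to come from a model-agnostic method. This README does not state which method produced the listed scores. As one possible approach, the sketch below uses scikit-learn's `permutation_importance` on synthetic stand-in data. The data, the choice of signal column, and the scoring metric are all assumptions made for illustration, and the resulting numbers are not comparable to the values listed above.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the 8 HTRU2 features: only column 2 carries signal.
n_samples, n_features = 600, 8
X = rng.normal(size=(n_samples, n_features))
y = (X[:, 2] + 0.3 * rng.normal(size=n_samples) > 1.0).astype(int)

model = make_pipeline(RobustScaler(), SVC(kernel="rbf"))
model.fit(X, y)

# Shuffle one column at a time and record the resulting drop in ROC AUC.
result = permutation_importance(
    model, X, y, scoring="roc_auc", n_repeats=5, random_state=0
)
ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking[:3]:
    print(f"feature {idx}: {result.importances_mean[idx]:.4f}")
```

In practice the importances would be computed on a held-out split rather than on the training data used here.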
Model | ROC AUC | Precision | Recall | F1-Score |
---|---|---|---|---|
SVM | 0.9843 | 0.8287 | 0.9146 | 0.8696 |
LogisticRegression | 0.9837 | 0.7906 | 0.9207 | 0.8507 |
LightGBM | 0.9729 | 0.8506 | 0.9024 | 0.8757 |
XGBoost | 0.9720 | 0.8315 | 0.9024 | 0.8655 |
CatBoost | 0.9714 | 0.8287 | 0.9146 | 0.8696 |
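The F1-Score column is the harmonic mean of the Precision and Recall columns, so the table can be checked for internal consistency. The sketch below recomputes F1 from the values in the table; small differences in the last digit are expected because the inputs are already rounded to four decimals.

```python
# (precision, recall, reported F1) copied from the comparison table above.
rows = {
    "SVM": (0.8287, 0.9146, 0.8696),
    "LogisticRegression": (0.7906, 0.9207, 0.8507),
    "LightGBM": (0.8506, 0.9024, 0.8757),
    "XGBoost": (0.8315, 0.9024, 0.8655),
    "CatBoost": (0.8287, 0.9146, 0.8696),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {name: f1_score(p, r) for name, (p, r, _) in rows.items()}
for name, (p, r, reported) in rows.items():
    print(f"{name:<20} reported={reported:.4f} recomputed={recomputed[name]:.4f}")
```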
Model | CV Mean | CV Std | CV Min | CV Max |
---|---|---|---|---|
RandomForest | 0.997064 | 0.000780 | 0.996188 | 0.998177 |
XGBoost | 0.996502 | 0.000754 | 0.995705 | 0.997846 |
CatBoost | 0.996096 | 0.000678 | 0.995362 | 0.997289 |
LightGBM | 0.995919 | 0.000786 | 0.994984 | 0.997214 |
MLP | 0.991762 | 0.001428 | 0.989590 | 0.993915 |
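The mean, standard deviation, minimum, and maximum in the table above are summaries over cross-validation folds. A minimal sketch of how such a row can be produced with scikit-learn follows. It assumes the scores are ROC AUC from 5-fold stratified cross-validation, and it runs on synthetic data generated with `make_classification`, so the numbers it prints are unrelated to the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (roughly 9% positives, 8 features).
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.91, 0.09], random_state=0,
)

# Stratified folds keep the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, cv=cv, scoring="roc_auc",
)

summary = {
    "CV Mean": scores.mean(),
    "CV Std": scores.std(),
    "CV Min": scores.min(),
    "CV Max": scores.max(),
}
for name, value in summary.items():
    print(f"{name}: {value:.6f}")
```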
- Python 3.8 or higher
- Git
- Jupyter Lab/Notebook
- Clone the repository:
  git clone https://github.com/KhamessiTaha/HTRU2-Pulsar-Detection.git
  cd HTRU2-Pulsar-Detection
- Create a virtual environment (recommended):
  python -m venv venv
  source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
  pip install -r requirements.txt
  Or using conda:
  conda env create -f environment.yml
  conda activate htru2-pulsar
- Download the HTRU2 dataset:
  # Dataset will be automatically downloaded when running the first notebook
  # Or manually download from: https://archive.ics.uci.edu/ml/datasets/HTRU2
- Run the analysis pipeline:
  jupyter lab
Execute notebooks in order:
- 01_EDA.ipynb → Exploratory Data Analysis
- 02_Modeling.ipynb → Model Training and Evaluation
- 03_Interpretability.ipynb → Feature Importance Analysis and Interpretation
Data Preprocessing:
from src.preprocess import preprocess_data
X_train, X_test, y_train, y_test = preprocess_data('data/HTRU_2.csv')
Model Training:
from src.models import train_svm_model
model = train_svm_model(X_train, y_train)
Evaluation:
from src.utils import evaluate_model
metrics = evaluate_model(model, X_test, y_test)
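The helpers above live in the repository's `src/` package, and their internals are not reproduced here. For readers without the repository at hand, the following is a generic scikit-learn sketch of the same three steps (split and scale, train, evaluate). It is not the repository's implementation: it runs on synthetic data, the 70%/10%/20% split is done in two stratified steps, and SMOTE and hyperparameter tuning are left out to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 8-feature, imbalanced HTRU2 data.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], flip_y=0.0, random_state=0,
)

# Step 1: hold out 20% as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)
# Step 2: 12.5% of the remaining 80% gives a 10% validation set overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.125, stratify=y_rest, random_state=0
)

# Scaling lives inside the pipeline, so it is fitted on training data only.
model = make_pipeline(RobustScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)

# ROC AUC needs continuous scores, so use the SVM decision function.
val_auc = roc_auc_score(y_val, model.decision_function(X_val))
test_auc = roc_auc_score(y_test, model.decision_function(X_test))
print(f"train/val/test sizes: {len(X_train)}/{len(X_val)}/{len(X_test)}")
print(f"validation ROC AUC: {val_auc:.4f}  test ROC AUC: {test_auc:.4f}")
```

Keeping the scaler inside the pipeline avoids leaking statistics from the validation and test sets into training.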
- Scaling: RobustScaler for feature normalization (robust to outliers)
- Class Balancing: SMOTE (Synthetic Minority Oversampling Technique)
- Train/Validation/Test Split: 70%/10%/20% stratified split
- Cross-Validation: 5-fold stratified cross-validation
- Hyperparameter Tuning: GridSearchCV with ROC AUC optimization
- Multiple Algorithms: Comparison of 10 different classifiers
- Primary: ROC AUC (handles class imbalance well)
- Secondary: Precision, Recall, F1-Score, MCC
- Specialized: PR AUC, Balanced Accuracy, Specificity
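Most of the metrics listed above can be derived from the four cells of a confusion matrix. The sketch below shows the formulas. The counts are an assumption: they were picked so that precision and recall round to the reported test values, and the true-negative count in particular is a guess based on a 20% test split, not a figure taken from the repository. ROC AUC and PR AUC are not included because they require the model's continuous scores rather than counts.

```python
import math

# Illustrative confusion-matrix counts (assumed, not taken from the repository).
tp, fp, fn, tn = 300, 62, 28, 3190

precision = tp / (tp + fp)
recall = tp / (tp + fn)                 # sensitivity / true positive rate
specificity = tn / (tn + fp)            # true negative rate
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2
# Matthews correlation coefficient: uses all four cells, ranges from -1 to 1.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

for name, value in [
    ("precision", precision), ("recall", recall), ("specificity", specificity),
    ("f1", f1), ("balanced accuracy", balanced_accuracy), ("mcc", mcc),
]:
    print(f"{name:<18} {value:.4f}")
```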
The SVM model reached a validation ROC AUC of 0.9843 and a test ROC AUC of 0.9708, indicating strong discrimination between pulsars and non-pulsars. The model shows:
- High Test Precision: 82.87% of predicted pulsars are actual pulsars
- High Test Recall: 91.46% of actual pulsars are correctly identified
- Balanced Performance: Test F1-score of 86.96% reflects a reasonable trade-off between precision and recall
- Strong Generalization: ROC AUC drops by only 0.0135 from validation (0.9843) to test (0.9708)
- Excess kurtosis of integrated profile is the most discriminative feature
- DM-SNR curve statistics (skewness and standard deviation) provide significant classification power
- Integrated profile statistics complement DM-SNR features effectively
- Combined features achieve substantially better performance than individual metrics
- Balanced Training Set: SMOTE increased training samples from 12,528 to 22,762 (50%/50% class distribution)
- Robust Scaling: Applied to handle outliers in astronomical data
- Stratified Sampling: Maintains class proportions across splits
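The SMOTE figures above imply the class counts before and after balancing: a 50%/50% set of 22,762 samples means 11,381 per class, so the original training set held 12,528 - 11,381 = 1,147 pulsars and 10,234 synthetic ones were added. The sketch below reproduces that arithmetic and shows the core SMOTE idea in plain NumPy: each synthetic point is interpolated between a minority sample and one of its nearest minority neighbours. The function name `smote_like_oversample` is hypothetical and the data is random. The repository presumably uses a library implementation, which this sketch does not replace.

```python
import numpy as np


def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class (O(n^2) memory).
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)           # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))              # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Counts implied by the figures above: 22,762 balanced samples = 11,381 per class.
n_train, n_balanced = 12_528, 22_762
n_majority = n_balanced // 2
n_minority = n_train - n_majority
n_synthetic = n_majority - n_minority
print(f"minority before: {n_minority}, synthetic samples added: {n_synthetic}")

# Small random stand-in for the minority class (8 features).
X_min = np.random.default_rng(1).normal(size=(40, 8))
X_new = smote_like_oversample(X_min, n_new=20)
print(X_new.shape)
```

Oversampling of this kind should be applied to the training split only, so that validation and test metrics reflect the original class distribution.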
The high performance suggests that machine learning can reliably automate pulsar detection, potentially:
- Reducing manual review time by 90%+
- Discovering new pulsars in large-scale surveys
- Enabling real-time pulsar candidate classification
- Supporting next-generation radio telescope surveys
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- Deep learning model implementations
- Additional feature engineering techniques
- Real-time classification pipeline
- Web interface for model deployment
- Extended dataset integration
This project is licensed under the MIT License - see the LICENSE file for details.
- HTRU2 Dataset: R. J. Lyon et al. (University of Manchester)
- Original Paper: Lyon, R. J., et al. "Fifty Years of Pulsar Candidate Selection: From simple filters to a new principled real-time classification approach." Monthly Notices of the Royal Astronomical Society 459.1 (2016): 1104-1123.
- Scikit-learn Community: For excellent machine learning tools
- Python Data Science Stack: NumPy, Pandas, Matplotlib, Seaborn
If you use this work in your research, please cite:
@misc{htru2_pulsar_detection,
author = {Taha Khamessi},
title = {HTRU2 Pulsar Detection: A Machine Learning Approach},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/KhamessiTaha/HTRU2-Pulsar-Detection}}
}
Contact: taha.khamessi@gmail.com
Project Link: https://github.com/KhamessiTaha/HTRU2-Pulsar-Detection