This repository contains my solution for the EXXA_3 task as part of the ML4Sci Google Summer of Code 2025. The goal of this project is to simulate realistic transit light curves using physical parameters and train a classifier to detect the presence of an exoplanet from these light curves.
- **Synthetic Data Generation:**
  Transit light curves are simulated using the `batman` package. Physical parameters such as the planet-to-star radius ratio, semi-major axis, orbital inclination, and mid-transit time are randomly varied to produce a diverse dataset. Flat (non-transit) curves are generated by adding Gaussian noise to a constant flux. This results in a balanced dataset of transit (label 1) and non-transit (label 0) light curves; a code sketch of this step appears after this list.
- **Classifier Training:**
  A custom 1D convolutional neural network (CNN) classifies the synthetic light curves. The network processes the 1D time-series data (1000 points per curve) and outputs a single logit per sample, which is converted to a probability with the sigmoid function. The model is trained with binary cross-entropy loss (`BCEWithLogitsLoss`) on an 80/20 train/validation split.
- **Evaluation:**
  Model performance is evaluated using standard metrics: validation accuracy, the ROC curve, AUC, and a confusion matrix. The best-performing model is saved within the repository, and all evaluation code is included so that the entire pipeline can be run end-to-end with minimal effort.
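The data-generation step might look like the following minimal sketch. It is not the exact notebook code: the parameter ranges, the fixed orbital period, the limb-darkening coefficients, the noise level, and the function signatures are illustrative assumptions; only `batman`'s documented `TransitParams`/`TransitModel` API is used.

```python
import numpy as np
import batman

N_POINTS = 1000  # samples per light curve, as described in this README


def simulate_transit_curve(rng):
    """Simulate one noisy transit light curve with randomly drawn parameters."""
    params = batman.TransitParams()
    params.t0 = rng.uniform(-0.01, 0.01)   # mid-transit time (days)
    params.per = 1.0                       # orbital period (days), fixed here
    params.rp = rng.uniform(0.05, 0.15)    # planet-to-star radius ratio
    params.a = rng.uniform(10.0, 20.0)     # semi-major axis (stellar radii)
    params.inc = rng.uniform(86.0, 90.0)   # orbital inclination (degrees)
    params.ecc = 0.0                       # eccentricity
    params.w = 90.0                        # longitude of periastron (degrees)
    params.u = [0.1, 0.3]                  # quadratic limb-darkening coefficients
    params.limb_dark = "quadratic"

    t = np.linspace(-0.05, 0.05, N_POINTS)  # time window around mid-transit
    flux = batman.TransitModel(params, t).light_curve(params)
    return flux + rng.normal(0.0, 1e-3, N_POINTS)  # add Gaussian noise


def create_dataset(n_samples, seed=0):
    """Build a balanced dataset: half transits (label 1), half flat curves (label 0)."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for i in range(n_samples):
        if i % 2 == 0:
            X.append(simulate_transit_curve(rng))
            y.append(1)
        else:
            # Flat curve: constant flux plus Gaussian noise.
            X.append(1.0 + rng.normal(0.0, 1e-3, N_POINTS))
            y.append(0)
    return np.array(X, dtype=np.float32), np.array(y, dtype=np.float32)
```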
The classifier, implemented in PyTorch, is designed as a 1D CNN with the following architecture (a model sketch follows the list):
- **Convolutional Layers:**
  Three convolutional layers with increasing channel depth (32 → 64 → 128), followed by ReLU activations and max pooling (or adaptive pooling) to extract temporal features from the light curves.
- **Fully Connected Layers:**
  After the feature maps are flattened, a dropout layer is applied for regularization, followed by a linear layer that outputs a single logit per sample for binary classification.
- **Loss Function:**
  The network uses `BCEWithLogitsLoss`, with target labels converted to float tensors of shape `[N, 1]`.
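A minimal PyTorch sketch of this architecture is shown below. The kernel sizes, pooling factors, and dropout rate are assumptions; the notebook's exact hyperparameters may differ.

```python
import torch
import torch.nn as nn


class TransitCNN(nn.Module):
    """1D CNN mapping a light curve of shape [N, 1, 1000] to one logit per sample."""

    def __init__(self, dropout=0.3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),  # collapse the temporal dimension
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),        # [N, 128, 1] -> [N, 128]
            nn.Dropout(dropout),
            nn.Linear(128, 1),   # single logit for binary classification
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```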
The end-to-end pipeline in the notebook includes the following steps (sketches of the training, evaluation, and inference steps follow the list):
- **Dependencies & Setup:**
  Installation of the required packages (e.g., numpy, matplotlib, torch, batman-package, scikit-learn) and mounting of Google Drive (if applicable).
- **Synthetic Data Generation:**
  A function (`simulate_transit_curve`) generates realistic transit light curves using the `batman` package. The `create_dataset` function builds a balanced dataset by randomly simulating transit and flat light curves (see the sketch after the overview list above).
- **Data Preparation:**
  The generated data (shape `[N, 1000]`) is converted to PyTorch tensors and reshaped to `[N, 1, 1000]`. Labels are converted to float tensors and unsqueezed to shape `[N, 1]`.
- **Model Training:**
  The CNN classifier is trained over a number of epochs using an 80/20 train/validation split. A `ReduceLROnPlateau` learning-rate scheduler is used, and the best-performing model (based on validation accuracy) is saved locally (see the training sketch below).
- **Evaluation:**
  The notebook computes validation metrics, including loss and accuracy, and then evaluates the best model by generating a confusion matrix, ROC curve, and AUC score (sketched below).
- **Inference:**
  A final section demonstrates how to run inference on new synthetic light curves, showing the predicted transit probability for a given curve (sketched below).
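Data preparation and the training loop can be sketched as follows, reusing `create_dataset` and `TransitCNN` from the sketches above. The batch size, epoch count, optimizer, learning rate, and scheduler settings are illustrative assumptions, not the notebook's exact values.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

X, y = create_dataset(4000)              # from the data-generation sketch
X_t = torch.from_numpy(X).unsqueeze(1)   # [N, 1000] -> [N, 1, 1000]
y_t = torch.from_numpy(y).unsqueeze(1)   # [N] -> [N, 1] float targets

dataset = TensorDataset(X_t, y_t)
n_train = int(0.8 * len(dataset))        # 80/20 train/validation split
train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = DataLoader(val_set, batch_size=64)

model = TransitCNN()
criterion = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=3
)

best_acc = 0.0
for epoch in range(20):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()

    # Validation: sigmoid(logit) > 0.5 counts as a predicted transit.
    model.eval()
    correct = 0
    with torch.no_grad():
        for xb, yb in val_loader:
            preds = (torch.sigmoid(model(xb)) > 0.5).float()
            correct += (preds == yb).sum().item()
    acc = correct / len(val_set)
    scheduler.step(acc)                  # plateau detection on validation accuracy
    if acc > best_acc:                   # checkpoint the best model
        best_acc = acc
        torch.save(model.state_dict(), "transit_classifier_best.pth")
```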
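Evaluation of the saved checkpoint might look like this sketch; the 0.5 decision threshold is an assumption, and the notebook additionally plots the ROC curve with matplotlib.

```python
import numpy as np
import torch
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

model.load_state_dict(torch.load("transit_classifier_best.pth"))
model.eval()

probs, labels = [], []
with torch.no_grad():
    for xb, yb in val_loader:
        probs.append(torch.sigmoid(model(xb)).squeeze(1).numpy())
        labels.append(yb.squeeze(1).numpy())
probs = np.concatenate(probs)
labels = np.concatenate(labels).astype(int)

preds = (probs > 0.5).astype(int)        # assumed 0.5 decision threshold
print("Confusion matrix:\n", confusion_matrix(labels, preds))
print("AUC:", roc_auc_score(labels, probs))
fpr, tpr, _ = roc_curve(labels, probs)   # points for plotting the ROC curve
```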
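Inference on a single new curve, again reusing the hypothetical helpers from the sketches above:

```python
import numpy as np
import torch

model = TransitCNN()
model.load_state_dict(torch.load("transit_classifier_best.pth"))
model.eval()

curve = simulate_transit_curve(np.random.default_rng(42))      # one new synthetic curve
x = torch.from_numpy(curve.astype(np.float32)).view(1, 1, -1)  # shape [1, 1, 1000]
with torch.no_grad():
    prob = torch.sigmoid(model(x)).item()
print(f"Predicted transit probability: {prob:.3f}")
```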
- **Model File:**
  The best-performing model is saved as `transit_classifier_best.pth` in the repository.
- **Results:**
  All evaluation outputs (confusion matrix, ROC curve, AUC) are generated within the notebook. Since the dataset is synthetic, users can regenerate the data by running the notebook.
- The synthetic dataset is generated using `batman` to simulate realistic transit curves. Adjust parameters in the `simulate_transit_curve` function if needed.
- All code necessary to reproduce the training and evaluation pipeline is contained in the notebook.
- This solution is designed to run end-to-end with minimal user intervention.
Install the dependencies using:

```bash
pip install torch torchvision numpy matplotlib batman-package scikit-learn
```