This repository documents my research experience analyzing neuronal excitability using patch-clamp electrophysiology and transcriptomics. I'm building computational tools to understand how gene expression patterns relate to electrical properties of neurons, with a focus on epilepsy-relevant phenotypes.
How do gene-expression signatures predict neuronal excitability phenotypes?
Neurons from patients with epilepsy often exhibit abnormal firing properties. However, the molecular mechanisms linking transcriptional programs to hyperexcitability remain unclear. This project uses multimodal single-cell data to:
- Identify transcriptomic features that correlate with electrophysiological traits
- Build interpretable predictive models to understand genotype-phenotype relationships
- Guide future functional validation and therapeutic target discovery
This work leverages the Allen Institute Patch-seq Dataset, which provides simultaneous measurements of:
- Transcriptomics: Single-cell RNA-seq (gene expression across ~20,000 genes)
- Electrophysiology: Patch-clamp recordings (firing rate, rheobase, input resistance, etc.)
- Morphology & Metadata: Cell type classification, cortical layer, species
The analysis pipeline is organized as a Snakemake workflow, which helps keep runs reproducible. The repository is laid out as follows:
data/raw/ ← Raw Allen Patch-seq datasets (transcriptomics, ephys, metadata)
├── patchseq_transcriptomics.csv [~5 GB]
├── patchseq_metadata.csv
└── patchseq_ephys_features.csv
src/patchseq_pipeline/ ← Main Python package
├── data/ (Loading & preprocessing utilities)
├── analysis/ (Feature selection & dimensionality reduction)
├── models/ (Ridge regression, elastic net models)
└── viz/ (Plotting & figure generation)
scripts/ ← Standalone analysis scripts
├── download_data.py (Fetch data via AllenSDK)
├── build_features.py (Normalize & standardize features)
├── train_model.py (Fit predictive models)
└── generate_figures.py (Create plots)
results/ ← Generated outputs (after running pipeline)
├── figures/ (PNG plot panels)
├── models/ (Trained model checkpoints & metrics)
└── logs/ (Execution logs & debugging info)
| Script | Purpose | Status |
|---|---|---|
| `download_data.py` | Fetch Patch-seq data from Allen Institute | Core |
| `build_features.py` | Quality control, normalization, feature engineering | Core |
| `train_model.py` | Fit Ridge/Elastic Net models, cross-validation | Core |
| `generate_figures.py` | Produce analysis visualizations | Core |
| `checksums.py` | Data integrity verification | Utility |
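As an illustration of what data integrity verification can look like, here is a minimal sketch using SHA-256 from the Python standard library. It is not the contents of `checksums.py`; the function names `sha256_of_file` and `verify_checksum` are hypothetical, and the demo runs on a throwaway temporary file rather than the real dataset.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_hex):
    """Return True when the file's digest matches the recorded value."""
    return sha256_of_file(path) == expected_hex.lower()


# Demo on a temporary file (hypothetical content, not Patch-seq data).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.csv"
    demo.write_bytes(b"cell_id,gene_a\nc1,3\n")
    recorded = sha256_of_file(demo)          # digest recorded at download time
    demo_ok = verify_checksum(demo, recorded)

    demo.write_bytes(b"cell_id,gene_a\nc1,4\n")  # simulate corruption
    demo_tampered_ok = verify_checksum(demo, recorded)
```

Reading in chunks matters here because the transcriptomics file is listed at roughly 5 GB, which may not fit comfortably in memory.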
- Load raw gene counts and electrophysiology recordings
- Filter cells by QC metrics (library size, gene counts, mitochondrial content)
- Log-normalize gene counts: $\log(x + 1)$
- Identify and remove outliers
- Compute variance for each gene; select top N by expression variance
- Apply PCA to reduce noise and improve model generalization
- (Optional) UMAP visualization for interactive exploration
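The preprocessing and feature-selection steps above can be sketched with NumPy alone. This is an illustrative sketch on synthetic counts, not the package's actual code: the thresholds (`min_library`, `min_genes`, `max_mito_frac`), the choice of which columns count as mitochondrial, and all function names are assumptions, and PCA is computed here via SVD rather than through any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic cells x genes count matrix (NOT real Patch-seq data).
n_cells, n_genes = 200, 500
gene_rates = rng.gamma(2.0, 2.0, size=n_genes)
counts = rng.poisson(lam=gene_rates, size=(n_cells, n_genes)).astype(float)

# Hypothetical: treat the first 10 columns as mitochondrial genes.
is_mito = np.zeros(n_genes, dtype=bool)
is_mito[:10] = True


def qc_mask(counts, is_mito, min_library=500, min_genes=100, max_mito_frac=0.2):
    """Boolean mask of cells passing library-size, gene-count and mito-content QC."""
    library_size = counts.sum(axis=1)
    genes_detected = (counts > 0).sum(axis=1)
    mito_frac = counts[:, is_mito].sum(axis=1) / np.maximum(library_size, 1)
    return (
        (library_size >= min_library)
        & (genes_detected >= min_genes)
        & (mito_frac <= max_mito_frac)
    )


def top_variance_genes(log_expr, n_top):
    """Column indices of the n_top genes with the highest variance."""
    variances = log_expr.var(axis=0)
    return np.argsort(variances)[::-1][:n_top]


def pca(matrix, n_components):
    """PCA via SVD of the column-centered matrix; returns scores and variance ratios."""
    centered = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / (s ** 2).sum()
    return scores, explained[:n_components]


keep = qc_mask(counts, is_mito)
log_expr = np.log1p(counts[keep])            # log(x + 1)
selected = top_variance_genes(log_expr, n_top=50)
scores, explained = pca(log_expr[:, selected], n_components=10)
```

The resulting `scores` matrix (cells by components) is the kind of reduced representation that a downstream regression model could take as input.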
- Target Variables: Firing rate, rheobase, input resistance
- Model Choice: Ridge Regression ($\ell_2$ regularization)
- Training: 80/20 split with 5-fold cross-validation
- Metrics: R², MSE, feature importance
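A minimal sketch of this modeling setup, written with NumPy on synthetic data. Several things here are assumptions rather than the project's confirmed design: the 5-fold cross-validation is run inside the 80% training portion to choose the regularization strength, the candidate `alphas` grid is arbitrary, and ridge is solved in closed form instead of through a library estimator.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic features and target (NOT real data): 3 informative features out of 10.
n_cells, n_features = 150, 10
X = rng.normal(size=(n_cells, n_features))
true_coef = np.zeros(n_features)
true_coef[:3] = [2.0, -1.5, 1.0]
y = X @ true_coef + rng.normal(scale=0.5, size=n_cells)


def fit_ridge(X, y, alpha):
    """Closed-form ridge with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(gram, Xc.T @ yc)
    return coef, y_mean - x_mean @ coef


def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))


def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def cv_mse(X, y, alpha, n_folds=5):
    """Mean held-out MSE across n_folds contiguous folds."""
    indices = np.arange(len(y))
    errors = []
    for held_out in np.array_split(indices, n_folds):
        train = np.setdiff1d(indices, held_out)
        coef, intercept = fit_ridge(X[train], y[train], alpha)
        errors.append(mse(y[held_out], X[held_out] @ coef + intercept))
    return float(np.mean(errors))


# 80/20 train/test split on shuffled indices.
order = rng.permutation(n_cells)
n_train = int(0.8 * n_cells)
train_idx, test_idx = order[:n_train], order[n_train:]

# Choose alpha by 5-fold CV on the training portion only.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_scores = {a: cv_mse(X[train_idx], y[train_idx], a) for a in alphas}
best_alpha = min(cv_scores, key=cv_scores.get)

coef, intercept = fit_ridge(X[train_idx], y[train_idx], best_alpha)
test_pred = X[test_idx] @ coef + intercept
test_r2 = r2_score(y[test_idx], test_pred)
test_mse = mse(y[test_idx], test_pred)
```

Keeping the test set out of the cross-validation loop means the reported R² and MSE are not influenced by the choice of regularization strength.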
- Extract top predictive genes (highest absolute model coefficients)
- Map genes to known epilepsy risk loci (future: pathway enrichment)
- Validate findings against literature
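The first two interpretation steps can be sketched in plain Python. The gene names, coefficient values, and reference set below are placeholders made up for illustration, not results or a real risk-locus list, and the helper names `rank_genes` and `overlap_with_reference` are hypothetical.

```python
def rank_genes(gene_names, coefficients, top_n=5):
    """Return (gene, coefficient) pairs sorted by absolute coefficient, largest first."""
    ranked = sorted(
        zip(gene_names, coefficients),
        key=lambda pair: abs(pair[1]),
        reverse=True,
    )
    return ranked[:top_n]


def overlap_with_reference(ranked, reference_genes):
    """Genes from the ranked list that also appear in a reference gene set."""
    return [gene for gene, _ in ranked if gene in reference_genes]


# Placeholder inputs (hypothetical names and values).
gene_names = ["gene_a", "gene_b", "gene_c", "gene_d", "gene_e", "gene_f"]
coefficients = [0.10, -0.85, 0.42, 0.03, -0.30, 0.61]
reference = {"gene_b", "gene_e", "gene_z"}

top = rank_genes(gene_names, coefficients, top_n=3)
hits = overlap_with_reference(top, reference)
```

Ranking by absolute value keeps genes with strong negative coefficients, which would otherwise be dropped by a simple descending sort. Coefficient magnitudes are only comparable across genes when the input features share a common scale.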
- Python 3.9+ (recommended: 3.11)
- Mamba or Conda (for environment management)
- ~50 GB disk space (for full dataset)
After running the pipeline, you'll get:
- Feature Importance Plot – Top genes predicting firing rate
- PCA Visualization – Transcriptome structure by electrophysiology phenotype
- Model Performance – R² scores, residual distributions
- Gene Lists – Ranked by predictive importance for neuroinflammatory follow-up
Contact & license
- MIT License: see LICENSE