Wearable-Based Digital Biomarkers: An LSTM-Powered Progression Index for Parkinson’s Disease Monitoring
Author: Felice Dong
Advisor: Dr. Hemant Tagare
Department: Statistics and Data Science, Yale University
Completion Date: April 30, 2025
- Overview
- Repository Structure
- Quick Start
- Data
- Model Architecture
- Methodology
- References and Related Work
- Future Directions
- Acknowledgements
Parkinson's disease (PD) affects over 10 million people worldwide, making objective, accessible monitoring crucial for both clinical research and patient care. This project addresses that need by developing a data-driven progression index that:
- Leverages wearable sensor data for continuous, remote monitoring of motor symptoms
- Uses LSTM networks to capture temporal dependencies in activity patterns
- Provides objective measurements that complement subjective clinical scales
- The index visually separated PD and healthy control subjects across all evaluation splits
- Peak activity capability was identified as the most discriminative feature for disease status
- Weekend activities showed higher predictive value than weekday routines
- Sorted activity features provided more robust discrimination than conventional weekday labels
├── src/ # Source code
│ ├── requirements.txt # Python dependencies
│ └── lstm_pd_progression.py # Main LSTM model implementation
│
├── data/ # Dataset files
│ ├── enhanced_fake_data.csv # Synthetic dataset for demonstration
│ ├── fake_data.csv # Basic synthetic dataset
│ └── fake_data_generation.ipynb # Data generation notebook
│
├── docs/ # Documentation
│ ├── dong_felice_thesis.pdf # Complete thesis document
│ └── FD_THESIS_Poster.pdf # Research poster
│
├── notebooks/ # Analysis notebooks
│ ├── thesis_model_cleaned.ipynb # Thesis model implementation
│ └── lstm_fake_data.ipynb # Experiments with synthetic data
└── README.md # This file
- Python 3.8+
- R 4.0+ (for data preprocessing)
- Required Python packages (see src/requirements.txt)
- Clone the repository
git clone https://github.com/felice797/wearable_data_lstm_analysis.git
cd wearable_data_lstm_analysis
- Install Python dependencies
pip install -r src/requirements.txt
- Run the model
python src/lstm_pd_progression.py --data_path data/enhanced_fake_data.csv --output_dir results
python src/lstm_pd_progression.py [OPTIONS]
Options:
--data_path Path to input CSV file (default: Data_preimp.csv)
--output_dir Directory to save results (default: ./results)
--num_splits Number of cross-validation splits (default: 20)
--seed Random seed for reproducibility (default: 42)
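The options above map naturally onto Python's standard `argparse` module. The following is a minimal sketch of such a parser, not the actual code in src/lstm_pd_progression.py. The flag names and defaults are taken from the list above; the function name `build_parser` and the help strings are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented command-line options."""
    parser = argparse.ArgumentParser(
        description="Train the LSTM progression index (illustrative sketch)."
    )
    parser.add_argument("--data_path", type=str, default="Data_preimp.csv",
                        help="Path to input CSV file")
    parser.add_argument("--output_dir", type=str, default="./results",
                        help="Directory to save results")
    parser.add_argument("--num_splits", type=int, default=20,
                        help="Number of cross-validation splits")
    parser.add_argument("--seed", type=int, default=42,
                        help="Random seed for reproducibility")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--data_path", "data/enhanced_fake_data.csv"])
print(args.data_path, args.output_dir, args.num_splits, args.seed)
```

Any option left off the command line falls back to its documented default, so only `--data_path` needs to be supplied when running on the synthetic data.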
CSV with the following required columns:
- subject: Unique subject identifier
- cohort: PD or Control
- week_num: Week number
- Mon, Tue, Wed, Thu, Fri, Sat, Sun: Daily ambulatory minutes
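Before training on a new file, it can help to confirm that it matches the column layout described above. The sketch below uses only the standard library; the function name `validate_rows` and the sample values are hypothetical and are not part of this repository.

```python
import csv
import io

REQUIRED_COLUMNS = ["subject", "cohort", "week_num",
                    "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
ALLOWED_COHORTS = {"PD", "Control"}


def validate_rows(handle):
    """Read a CSV file object and check it against the documented schema.

    Returns the parsed rows; raises ValueError on a missing column or an
    unexpected cohort label.
    """
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    rows = list(reader)
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        if row["cohort"] not in ALLOWED_COHORTS:
            raise ValueError(f"line {line_no}: unexpected cohort {row['cohort']!r}")
    return rows


# Tiny in-memory example; the values are made up for illustration.
sample = io.StringIO(
    "subject,cohort,week_num,Mon,Tue,Wed,Thu,Fri,Sat,Sun\n"
    "S001,PD,1,42,55,38,61,47,70,33\n"
    "S002,Control,1,80,75,92,66,71,105,98\n"
)
rows = validate_rows(sample)
print(len(rows), "rows passed validation")
```

To check a file on disk, pass an open file handle instead of the in-memory sample, for example `validate_rows(open("data/enhanced_fake_data.csv", newline=""))`.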
Note: The real PPMI dataset used in the thesis cannot be shared due to data privacy agreements. Synthetic data is provided for demonstration purposes.
The LSTM-based progression index uses:
- Input Layer: 7 features (daily activity minutes)
- LSTM Layers: 2 layers with 20 hidden units each
- Output Layer: Linear transformation with normalized weights
- Custom Loss Function: Optimizes separation between PD and healthy controls with time weighting
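The layer sizes above can be sketched in PyTorch as follows. This is an illustration under stated assumptions, not the thesis implementation: the class name `ProgressionLSTM` is hypothetical, "normalized weights" is assumed to mean a readout vector rescaled to unit L2 norm with no bias term, and the custom loss function is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProgressionLSTM(nn.Module):
    """Hypothetical sketch: 2-layer LSTM followed by a unit-norm linear readout."""

    def __init__(self, n_features: int = 7, hidden_size: int = 20, num_layers: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_size,
                            num_layers=num_layers, batch_first=True)
        # Readout direction; rescaled to unit L2 norm on every forward pass
        # (an assumed reading of "normalized weights").
        self.readout = nn.Parameter(torch.randn(hidden_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, weeks, 7) -> hidden states: (batch, weeks, hidden_size)
        hidden, _ = self.lstm(x)
        w = F.normalize(self.readout, dim=0)
        # One scalar index per subject per week: (batch, weeks)
        return hidden @ w


model = ProgressionLSTM()
weeks = torch.rand(4, 26, 7)   # 4 made-up subjects, 26 weeks, 7 daily values
index = model(weeks)
print(index.shape)             # torch.Size([4, 26])
```

Constraining the readout to unit norm keeps the scale of the index tied to the LSTM hidden state rather than to an unbounded output weight, which is one plausible reason for normalizing it.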
- Conventional Weekday Features: Monday through Sunday activity levels
- Sorted Activity Features: Daily activities ordered by intensity (most to least active)
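The difference between the two feature types can be shown on a single week. The snippet is a small illustration with made-up values; the helper name `sorted_activity` is hypothetical.

```python
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def sorted_activity(week):
    """Return the seven daily values ordered from most to least active."""
    return sorted((week[d] for d in DAYS), reverse=True)


week = {"Mon": 42, "Tue": 55, "Wed": 38, "Thu": 61, "Fri": 47, "Sat": 70, "Sun": 33}
conventional = [week[d] for d in DAYS]   # position encodes the calendar day
ranked = sorted_activity(week)           # position encodes the activity rank
print(conventional)  # [42, 55, 38, 61, 47, 70, 33]
print(ranked)        # [70, 61, 55, 47, 42, 38, 33]
```

In the sorted form the first feature is always the subject's most active day of the week, regardless of which calendar day it fell on.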
- Quality filtering for subjects with $\geq$ 26 weeks of data
- Missing data interpolation for gaps $\leq$ 2 days
- Feature transformation to create intuitive progression scale
- Temporal alignment across subjects
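The gap-filling rule in the preprocessing steps can be sketched with NumPy. This is a minimal illustration that assumes linear interpolation over a subject's consecutive daily values; the thesis may use a different interpolation scheme, and the function name `fill_short_gaps` is hypothetical.

```python
import numpy as np


def fill_short_gaps(values, max_gap=2):
    """Linearly interpolate interior NaN runs of length <= max_gap.

    Longer runs, and runs touching either end of the series, are left as NaN.
    """
    out = np.asarray(values, dtype=float).copy()
    isnan = np.isnan(out)
    n = len(out)
    i = 0
    while i < n:
        if not isnan[i]:
            i += 1
            continue
        start = i
        while i < n and isnan[i]:
            i += 1
        end = i  # first index after the NaN run
        if start > 0 and end < n and (end - start) <= max_gap:
            # Interpolate between the observed neighbours on either side.
            out[start:end] = np.interp(
                np.arange(start, end), [start - 1, end], [out[start - 1], out[end]]
            )
    return out


daily = [40.0, np.nan, 60.0, np.nan, np.nan, np.nan, 50.0]
filled = fill_short_gaps(daily)
print(filled)
```

In this example the one-day gap is filled with 50, the midpoint of its neighbours, while the three-day gap exceeds the limit and stays missing.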
- Cross-validation with 20 independent train-test splits
- Data augmentation with Gaussian noise and time swapping
- Balanced sampling to address cohort imbalance
- Early stopping based on loss convergence
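Two of the training steps above, data augmentation and balanced sampling, can be sketched with NumPy. Several details are assumptions: "time swapping" is interpreted here as exchanging two adjacent weeks, the noise standard deviation is arbitrary, and balancing is done by oversampling the smaller cohort. The function names `augment` and `balanced_indices` are hypothetical.

```python
import numpy as np


def augment(series, rng, noise_sd=2.0):
    """Return a noisy copy of a (weeks, 7) array with two adjacent weeks swapped."""
    out = series + rng.normal(0.0, noise_sd, size=series.shape)
    out = np.clip(out, 0.0, None)           # minutes cannot be negative
    k = rng.integers(0, len(out) - 1)       # choose week k, swap with week k + 1
    out[[k, k + 1]] = out[[k + 1, k]]
    return out


def balanced_indices(labels, rng):
    """Oversample the smaller cohort so both cohorts appear equally often."""
    labels = np.asarray(labels)
    groups = [np.flatnonzero(labels == c) for c in np.unique(labels)]
    target = max(len(g) for g in groups)
    picks = [
        np.concatenate([g, rng.choice(g, size=target - len(g), replace=True)])
        for g in groups
    ]
    return rng.permutation(np.concatenate(picks))


rng = np.random.default_rng(42)
subject = rng.uniform(20, 120, size=(26, 7))   # made-up subject: 26 weeks x 7 days
augmented = augment(subject, rng)
cohorts = np.array(["PD", "PD", "PD", "Control"])
idx = balanced_indices(cohorts, rng)
print(augmented.shape, len(idx))
```

Passing a seeded generator into both functions keeps the augmented samples reproducible, in line with the `--seed` option.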
This research builds upon:
- Verily Life Sciences Study (Chen et al., 2023): Digital biomarkers detected treatment effects earlier and with smaller sample sizes than traditional clinical assessments in Lewy Body Dementia patients.
- PPMI Database: Longitudinal study providing wearable sensor data from PD patients and healthy controls.
- LSTM Networks: Effective for capturing long-term temporal dependencies in time series data.
- Enhanced Loss Function: Incorporate elements like the UPDRS score and medication information, as well as other data captured by the Verily watch
- Missing Data Handling: Implement masking for real-world adherence patterns
- Data Smoothing: Reduce noise while preserving underlying patterns; investigate imputation alternatives
- Validation-Based Training: Replace convergence-based stopping with validation metrics
- Multi-Scale Analysis: Explore different temporal scales beyond weekly aggregation
This work was completed as a Bachelor of Science senior thesis in Statistics and Data Science at Yale University. Thanks to Dr. Hemant Tagare for his continuous guidance and supervision, the Yale S&DS department for their academic support, PPMI and Verily Life Sciences for providing the dataset, and my aunt, whose courage while living with PD drove this research.