Description of feature
There are several time series issues here already and this is yet another one. I'll try to consolidate them later. I'll edit this issue when I'm back with more details.
IMO it is important that we have a pipeline in mind that we implement. Ideally, we draft it using LLM-generated code with example data to ensure that it gives some useful results. Therefore, I would kindly ask @eroell to prepare artificial ehrdata objects (one sparse, one dense) with random values that we can use to test all of our implementations.
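As a starting point for that test object, here is a minimal sketch of random dense and sparse blocks using numpy/scipy; the shapes, the 30% missingness rate, and the 2% event density are arbitrary placeholders, and plain arrays stand in for the actual ehrdata constructor:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_patients, n_vars, n_steps = 50, 20, 24

# Dense block: continuous lab-like values on a
# patients x variables x timepoints grid, with ~30% missingness
dense = rng.normal(loc=100, scale=15, size=(n_patients, n_vars, n_steps))
dense[rng.random(dense.shape) < 0.3] = np.nan

# Sparse block: rare binary events (e.g., diagnosis codes),
# flattened to 2-D (patients x variables*timepoints)
events = sparse.random(n_patients, n_vars * n_steps, density=0.02,
                       random_state=0, format="csr")
```

The dense block mimics continuous measurements over time, the sparse block mimics rare coded events; both would need to be wrapped into whatever container ehrdata settles on.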
Here are my suggestions:
I'll maintain your assignments while adding the technical details. Here's the edited version that keeps all original assignments:
I'll maintain your assignments while adding the technical details. Here's the edited version that keeps all original assignments:
1. Preprocessing
- Data harmonization: Standardize units and map codes (e.g., ICD, LOINC) to a consistent ontology via UMLS or similar ontology mappings. ✅
- Temporal alignment: Resample or align all patients on a common time axis (e.g., relative to admission or diagnosis) using a regular time grid (hourly, daily, weekly) chosen based on data density. @eroell
- Missing data imputation: Apply time-aware methods (e.g., interpolation, Kalman filter, BRITS) with forward-fill for medications, linear interpolation for lab values, and MICE/KNN for MCAR data. @eroell
- Normalization: Per-feature z-scoring or quantile normalization with robust estimators (median, IQR), possibly per patient. @eroell
- Dimensionality reduction (optional): Aggregate or collapse rare events (<5% prevalence), merge related codes using ontology hierarchies, reduce sparsity. We'll need to look into this. @eroell
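As a rough illustration of the imputation and normalization steps above, here is a pandas sketch on toy long-format data; the column names (`patient_id`, `hour`, `creatinine`) are hypothetical placeholders:

```python
import numpy as np
import pandas as pd

# Toy long-format measurements for two patients
df = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 2],
    "hour":       [0, 1, 2, 0, 1, 2],
    "creatinine": [1.0, np.nan, 1.4, 0.8, np.nan, np.nan],
})

# Time-aware imputation: linear interpolation within each patient,
# then forward-fill for gaps with no later observation
df["creatinine"] = (
    df.groupby("patient_id")["creatinine"]
      .transform(lambda s: s.interpolate().ffill())
)

# Robust per-feature normalization: (x - median) / IQR
med = df["creatinine"].median()
iqr = df["creatinine"].quantile(0.75) - df["creatinine"].quantile(0.25)
df["creatinine_norm"] = (df["creatinine"] - med) / iqr
```

The same groupby/transform pattern extends to per-patient normalization if we decide we want that.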
2. Feature engineering
- Time-window aggregation: Aggregate features into fixed-length windows (mean, slope, min, max, variance) with sliding windows of customizable sizes (12h, 24h, 72h, 7d). @eroell
- Trajectory encoding: Create patient trajectories (sequences) using discretized values or clinical categories, or use sequence embeddings with SAX representation. This is LLM speak and I'll need to think about it. @Zethson
- Event encoding: Binary or frequency counts for diagnoses/procedures per window with TF-IDF weighting for rare events and time-since-occurrence features. This is LLM speak and I'll need to think about it. @Zethson
- Derived features: Compute derived metrics (e.g., delta from baseline, moving averages) and clinical ratios (BUN/Cr) or composite scores (SOFA components). @eroell
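The time-window aggregation could look roughly like this with pandas resampling; the 12-hour window size and the synthetic hourly heart-rate series are placeholders:

```python
import numpy as np
import pandas as pd

# Hourly heart-rate measurements for one patient (toy, linearly increasing)
idx = pd.date_range("2024-01-01", periods=48, freq="h")
hr = pd.Series(60 + np.arange(48) * 0.5, index=idx, name="heart_rate")

# Aggregate into fixed 12-hour windows: mean, min, max, variance
agg = hr.resample("12h").agg(["mean", "min", "max", "var"])

# Slope per window via a linear fit (change per hour)
slope = hr.resample("12h").apply(
    lambda s: np.polyfit(np.arange(len(s)), s.to_numpy(), 1)[0]
)
```

Sliding (rather than tumbling) windows would use `hr.rolling("12h")` with the same aggregations.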
3. Quality control
- Patient filtering: Remove patients with too much missing data (≥70%) or insufficient follow-up (<3 timepoints), implement minimum data density requirements.
- Feature filtering: Drop features with low variance (near-zero variance), extreme missingness (>70%), or redundancy (VIF >10).
- Outlier detection: Detect patients or time points with extreme values (modified z-score with MAD) or anomalous patterns (isolation forests, DBSCAN).
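A minimal numpy sketch of the patient filtering and the MAD-based modified z-score described above, on random placeholder data and with the 70% / 3.5 thresholds from the list:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))          # patients x features (toy)
X[rng.random(X.shape) < 0.1] = np.nan   # sprinkle ~10% missingness
X[:5, :] = np.nan                       # 5 patients with no usable data

# Patient filtering: drop patients missing >= 70% of entries
patient_missing = np.isnan(X).mean(axis=1)
keep_patients = patient_missing < 0.7
X = X[keep_patients]

# Modified z-score with MAD for outlier flagging (shown for feature 0)
x = X[:, 0]
x = x[~np.isnan(x)]
med = np.median(x)
mad = np.median(np.abs(x - med))
mod_z = 0.6745 * (x - med) / mad
outliers = np.abs(mod_z) > 3.5
```

Feature filtering (variance, missingness, VIF) would follow the same masking pattern along `axis=0`.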
4. Embedding / Dimensionality Reduction
- Per-patient summary embedding: Aggregate each patient into a vector using temporal aggregation, then apply PCA (90-95% variance), UMAP (n_neighbors=15-30), or t-SNE (optimized perplexity). We already have some of that but we need to ensure that there's simple aggregation. @Zethson
- Time-aware embedding: Use TSNE/UMAP on time-windowed representations or temporal VAEs/RNNs with 1D convolution layers and time-lagged embeddings. @Zethson
- Manifold learning: Build pseudotime or trajectory inference (DPT, Palantir) with diffusion maps if clinically meaningful. @Zethson
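The per-patient summary embedding could be sketched as simple temporal aggregation (mean, last value, per-feature slope) followed by PCA at 95% retained variance; the shapes and data are synthetic, and UMAP/t-SNE would be applied to `summary` analogously:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 30, 24))  # patients x features x timepoints (toy)

# Temporal aggregation per feature: mean, last value, least-squares slope
t = np.arange(24)
mean_feat = X.mean(axis=2)
last_feat = X[:, :, -1]
slope_feat = (X * (t - t.mean())).sum(axis=2) / ((t - t.mean()) ** 2).sum()
summary = np.concatenate([mean_feat, last_feat, slope_feat], axis=1)

# PCA keeping 95% of the variance (float n_components selects
# the smallest number of components reaching that threshold)
pca = PCA(n_components=0.95)
emb = pca.fit_transform(summary)
```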
5. Clustering / Grouping
- Unsupervised clustering: Leiden (optimized resolution), k-means, GMM (BIC/AIC selection) on embedded space or raw features. @Zethson
- Temporal pattern clustering: Dynamic time warping (DTW) with customizable constraints or k-Shape for shape-based clustering. @Zethson
- Trajectory clustering: Identify archetypes or paths through temporal space using state transition matrices and sequence clustering algorithms (CLUSEQ). @Zethson
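A small sketch of k-means plus BIC-based GMM model selection on an embedded space, using two synthetic, well-separated patient groups as stand-in data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Two synthetic patient groups in a 2-D embedded space
emb = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(100, 2)),
])

# k-means on the embedding
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(emb)

# GMM with the number of components chosen by BIC (lower is better)
bics = {k: GaussianMixture(n_components=k, random_state=0)
             .fit(emb).bic(emb)
        for k in range(1, 5)}
best_k = min(bics, key=bics.get)
```

Leiden would run on a neighbor graph of `emb` instead (e.g., via scanpy), and DTW/k-Shape would replace the Euclidean metric for raw time series.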
6. Differential analysis / Statistics
- Group comparisons: Kruskal-Wallis, mixed models with patient as random effect, Cox PH, permutation tests between clusters with multiple comparison correction. @Zethson
- Temporal modeling: Longitudinal mixed models (random slopes/intercepts) or Generalized Estimating Equations (GEE) with appropriate correlation structures (AR1, exchangeable) to assess feature changes over time. @Zethson
- Survival analysis: Time-to-event modeling (Cox with time-dependent covariates, competing risks) stratified by groups or clusters. @Zethson
- Feature importance: Use Shapley values or permutation importance from classifiers with cross-validation and confidence intervals. @Zethson
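For the group comparisons, here is a sketch of per-feature Kruskal-Wallis tests across clusters followed by Benjamini-Hochberg correction (implemented inline here; statsmodels' `multipletests` provides the same); the three synthetic clusters and the injected shift in feature 0 are placeholders:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_features = 20

# Three synthetic clusters; feature 0 genuinely differs between them
groups = []
for shift in (0.0, 1.5, 3.0):
    g = rng.normal(size=(50, n_features))
    g[:, 0] += shift
    groups.append(g)

# Kruskal-Wallis per feature across the three clusters
pvals = np.array([
    stats.kruskal(*(g[:, j] for g in groups)).pvalue
    for j in range(n_features)
])

# Benjamini-Hochberg multiple comparison correction
order = np.argsort(pvals)
ranked = pvals[order] * n_features / (np.arange(n_features) + 1)
qvals = np.empty_like(pvals)
qvals[order] = np.minimum.accumulate(ranked[::-1])[::-1]
qvals = np.clip(qvals, 0, 1)
```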
7. Visualization
- Embeddings: UMAP/t-SNE colored by diagnosis, outcome, time, or cluster with interactive plots and density enhancement. ✅
- Trajectories: Line plots or Sankey diagrams showing evolution over time with state transitions and proportional flows. @Zethson
- Heatmaps: Time-feature heatmaps, cluster-feature enrichment, temporal evolution with hierarchical clustering and clinical annotations. @eroell
- Kaplan-Meier plots: Stratified survival curves by clusters or features with log-rank statistics and number-at-risk tables. @eroell
- Box/violin plots: Feature distributions across clusters or outcomes with statistical significance annotations and temporal arrays. @eroell
- Temporal plots: Spaghetti plots or mean±CI of lab values over time with reference ranges and intervention markers. @Zethson
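For the Kaplan-Meier plots, a minimal sketch that computes the estimator by hand on toy time-to-event data and draws the step curve; in practice a library such as lifelines would also give us log-rank statistics and number-at-risk tables:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted use
import matplotlib.pyplot as plt

# Toy time-to-event data: durations (days) and event indicators (1 = event)
durations = np.array([5, 8, 8, 12, 15, 20, 22, 30])
events    = np.array([1, 1, 0, 1, 0, 1, 1, 0])

# Kaplan-Meier estimator: S(t) = prod over event times <= t of (1 - d_i/n_i)
times = np.unique(durations[events == 1])
surv, s = [], 1.0
for t in times:
    n_at_risk = (durations >= t).sum()
    d = ((durations == t) & (events == 1)).sum()
    s *= 1 - d / n_at_risk
    surv.append(s)
surv = np.array(surv)

fig, ax = plt.subplots()
ax.step(np.r_[0, times], np.r_[1.0, surv], where="post")
ax.set(xlabel="Days", ylabel="Survival probability")
```

Stratifying by cluster just means computing one curve per cluster label on the same axes.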
8. Interpretation
- Cluster annotation: Map clusters to clinical phenotypes or known conditions using enrichment analysis and standardized feature differences. ✅
- Temporal phenotypes: Describe and label typical trajectories or transitions with state transition probability matrices and key branching points. ✅
- Outcome associations: Link features or trajectories to outcomes or interventions with standardized effect sizes and predictive modeling. ✅
- Model introspection: Interpret embeddings or models for feature-level insights using correlation analysis and counterfactual explanations. ✅
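Cluster annotation via standardized feature differences could start from something like Cohen's d of each feature for one cluster versus the rest; the feature names and the injected lactate shift below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(5)
feature_names = ["creatinine", "lactate", "heart_rate", "glucose"]

# Toy feature matrix with cluster labels; cluster 1 has elevated lactate
X = rng.normal(size=(300, 4))
labels = np.repeat([0, 1, 2], 100)
X[labels == 1, 1] += 2.0

def cluster_vs_rest_d(X, labels, cluster):
    """Cohen's d of each feature for one cluster vs all other patients."""
    a, b = X[labels == cluster], X[labels != cluster]
    pooled = np.sqrt((a.var(axis=0, ddof=1) + b.var(axis=0, ddof=1)) / 2)
    return (a.mean(axis=0) - b.mean(axis=0)) / pooled

d1 = cluster_vs_rest_d(X, labels, 1)
top_feature = feature_names[int(np.argmax(np.abs(d1)))]
```

The feature with the largest |d| per cluster is a natural candidate label for the phenotype, analogous to marker-gene ranking in scanpy.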