LDpred2 - Strong Overfitting in Variance Explained Between GWAS and Non-GWAS Samples

I am observing a large gap in variance explained (r²) between individuals who were included in the GWAS and those who were not, which looks like strong overfitting.  

#### **Pipeline Overview:**  
1. We run a GWAS on a subset of the UK Biobank (~150K individuals).  
2. We apply **LDpred2** to the GWAS summary statistics to obtain adjusted effect sizes (betas).  
3. We compute polygenic scores (PGS) for UK Biobank individuals using these betas.  
4. We assess the variance explained (r²) separately for individuals **inside** and **outside** the GWAS subset.  
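
For reference, step 4 can be sketched as follows. This is a minimal illustration rather than our actual code: `r_squared`, `r2_in_out`, and the argument names `pgs`, `pheno`, and `in_gwas` are hypothetical, and r² is assumed here to mean the squared Pearson correlation between the score and the phenotype, with no covariate adjustment.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def r2_in_out(pgs, pheno, in_gwas):
    """Return (r2_in, r2_out) for individuals inside / outside the GWAS subset.

    pgs     : polygenic score per individual
    pheno   : phenotype per individual
    in_gwas : boolean flag per individual, True if used in the GWAS
    """
    inside = [(s, p) for s, p, flag in zip(pgs, pheno, in_gwas) if flag]
    outside = [(s, p) for s, p, flag in zip(pgs, pheno, in_gwas) if not flag]
    # zip(*pairs) splits the (score, phenotype) pairs back into two sequences
    r2_in = r_squared(*zip(*inside))
    r2_out = r_squared(*zip(*outside))
    return r2_in, r2_out
```

Only r²_out is an honest estimate of predictive performance under this setup, since the betas were fitted on the "inside" individuals.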

#### **Results:**  
For **standing height**:  
- r²_in = 0.30
- r²_out = 0.16

For **BMI**:  
- r²_in = 0.47
- r²_out = 0.1

The large gap between r²_in and r²_out, especially for BMI, suggests substantial overfitting.  
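
To illustrate why scoring the GWAS individuals themselves can inflate r², here is a toy simulation. It is a sketch under strong assumptions and not LDpred2: genotypes are independent standard normal draws, the phenotype is pure noise (true r² = 0), and the betas are plain marginal estimates. All names (`simulate`, `n_snps`, etc.) are hypothetical.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def simulate(n_train=200, n_test=200, n_snps=100, seed=1):
    """Return (r2_in, r2_out) for a phenotype unrelated to the genotypes."""
    rng = random.Random(seed)

    def draw(n):
        # Assumption: standardized, independent "genotypes"; noise-only phenotype
        geno = [[rng.gauss(0, 1) for _ in range(n_snps)] for _ in range(n)]
        pheno = [rng.gauss(0, 1) for _ in range(n)]
        return geno, pheno

    g_train, y_train = draw(n_train)
    g_test, y_test = draw(n_test)

    # Marginal effect estimate per SNP, fitted on the training ("GWAS") sample
    betas = [
        sum(row[j] * y for row, y in zip(g_train, y_train)) / n_train
        for j in range(n_snps)
    ]

    def score(geno):
        # Polygenic score: weighted sum of genotypes using the fitted betas
        return [sum(b * g for b, g in zip(betas, row)) for row in geno]

    r2_in = r_squared(score(g_train), y_train)   # same individuals as the fit
    r2_out = r_squared(score(g_test), y_test)    # independent individuals
    return r2_in, r2_out


if __name__ == "__main__":
    r2_in, r2_out = simulate()
    print(f"r2_in = {r2_in:.3f}, r2_out = {r2_out:.3f}")
```

Even with no true signal, r²_in comes out well above r²_out here because the betas have absorbed noise specific to the training sample. How much of our observed gap is explained by this effect is exactly what I would like to diagnose.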

#### **Question:**  
Are there specific diagnostics or metrics I could use to better understand the source of this issue? Do you have any recommendations for reducing the overfitting?

Thank you for your time!


LDpred2 - Strong Overfitting in Variance Explained Between GWAS and Non-GWAS Samples #538
