I am observing significant overfitting when comparing the variance explained (r²) for individuals included in a GWAS versus those outside of it.

#### **Pipeline Overview:**

1. We perform GWAS on a subset of the UK Biobank (~150K individuals).
2. We apply **LDpred2** on the summary statistics to estimate betas.
3. We compute polygenic scores (PGS) for UK Biobank individuals using these betas.
4. We assess the variance explained (r²) separately for individuals **inside** and **outside** the GWAS subset.

#### **Results:**

For **standing height**:

- r²_in = 0.30
- r²_out = 0.16

For **BMI**:

- r²_in = 0.47
- r²_out = 0.1

The large discrepancy suggests substantial overfitting.

#### **Question:**

Are there specific diagnostics or metrics I could use to better understand the source of this issue? Any recommendations for reducing the overfitting?

Thank you for your time!
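
For reference, the comparison in step 4 of the pipeline can be sketched roughly as below. This is a minimal illustration and not the actual pipeline code: the function names `r2` and `bootstrap_r2_gap` and the arrays `pgs`, `pheno`, and `in_gwas` are hypothetical, r² is assumed to be the squared Pearson correlation between the PGS and the phenotype, and the percentile bootstrap on the gap is an addition for illustration rather than something the pipeline currently does.

```python
import numpy as np


def r2(pgs, pheno):
    """Squared Pearson correlation between a score and a phenotype."""
    return float(np.corrcoef(pgs, pheno)[0, 1] ** 2)


def bootstrap_r2_gap(pgs, pheno, in_gwas, n_boot=1000, seed=0):
    """Return (r2_in, r2_out, (lo, hi)).

    r2_in / r2_out are computed on individuals inside / outside the GWAS
    subset; (lo, hi) is a percentile bootstrap 95% interval for the gap
    r2_in - r2_out, resampling individuals within each group.
    """
    pgs = np.asarray(pgs, dtype=float)
    pheno = np.asarray(pheno, dtype=float)
    in_gwas = np.asarray(in_gwas, dtype=bool)  # hypothetical membership mask

    idx_in = np.flatnonzero(in_gwas)
    idx_out = np.flatnonzero(~in_gwas)

    rng = np.random.default_rng(seed)
    gaps = np.empty(n_boot)
    for b in range(n_boot):
        # Resample with replacement, separately inside and outside the GWAS.
        bi = rng.choice(idx_in, size=idx_in.size, replace=True)
        bo = rng.choice(idx_out, size=idx_out.size, replace=True)
        gaps[b] = r2(pgs[bi], pheno[bi]) - r2(pgs[bo], pheno[bo])

    lo, hi = np.percentile(gaps, [2.5, 97.5])
    r2_in = r2(pgs[idx_in], pheno[idx_in])
    r2_out = r2(pgs[idx_out], pheno[idx_out])
    return r2_in, r2_out, (float(lo), float(hi))
```

Calling `bootstrap_r2_gap(pgs, pheno, in_gwas)` with one entry per individual returns both r² values and an interval for their difference, which gives a rough sense of how much of the gap could be sampling noise.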