LDpred2 - Strong Overfitting in Variance Explained Between GWAS and Non-GWAS Samples #538

@TheoCavinato

Description

I am observing a large gap in variance explained (r²) between individuals who were included in the GWAS and individuals who were not, which looks like strong overfitting.

Pipeline Overview:

  1. We run a GWAS on a subset of the UK Biobank (~150K individuals).
  2. We apply LDpred2 on the summary statistics to estimate betas.
  3. We compute polygenic scores (PGS) for UK Biobank individuals using these betas.
  4. We assess the variance explained (r²) separately for individuals inside and outside the GWAS subset.
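To make step 4 concrete, here is a minimal sketch of how r² could be computed separately for the two groups. It assumes r² means the squared Pearson correlation between the PGS and the phenotype; the function name `r2_by_group` and the boolean mask `in_gwas` are hypothetical names for illustration, not part of bigsnpr or LDpred2.

```python
import numpy as np


def r2_by_group(pgs, phenotype, in_gwas):
    """Squared Pearson correlation between PGS and phenotype,
    computed separately for GWAS and non-GWAS individuals.

    Assumption: `in_gwas` is a boolean mask marking individuals
    that were part of the GWAS subset.
    """
    pgs = np.asarray(pgs, dtype=float)
    phenotype = np.asarray(phenotype, dtype=float)
    in_gwas = np.asarray(in_gwas, dtype=bool)

    def squared_corr(x, y):
        # np.corrcoef returns the 2x2 correlation matrix; [0, 1] is r(x, y)
        return float(np.corrcoef(x, y)[0, 1] ** 2)

    return {
        "r2_in": squared_corr(pgs[in_gwas], phenotype[in_gwas]),
        "r2_out": squared_corr(pgs[~in_gwas], phenotype[~in_gwas]),
    }
```

If covariates were adjusted for in the actual analysis, the phenotype passed in here would need to be residualized first; this sketch does not do that.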

Results:

For standing height:

  • r²_in = 0.30
  • r²_out = 0.16

For BMI:

  • r²_in = 0.47
  • r²_out = 0.1

The in-sample r² is far higher than the out-of-sample r² for both traits, which suggests substantial overfitting.
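To put the size of the gap in numbers, the short sketch below computes the absolute drop and the fraction of in-sample r² that is retained out of sample, using only the four values reported above. The helper name `summarize_gap` is hypothetical, and this is plain arithmetic on the reported numbers, not an established LDpred2 diagnostic.

```python
# r² values as reported above: (r2_in, r2_out)
reported = {
    "standing height": (0.30, 0.16),
    "BMI": (0.47, 0.10),
}


def summarize_gap(r2_in, r2_out):
    """Return the absolute drop and the fraction of r2_in retained out of sample."""
    return {
        "drop": r2_in - r2_out,
        "retained": r2_out / r2_in,
    }


for trait, (r2_in, r2_out) in reported.items():
    gap = summarize_gap(r2_in, r2_out)
    print(f"{trait}: drop = {gap['drop']:.2f}, retained = {gap['retained']:.2f}")
```

For standing height roughly half of the in-sample r² is retained out of sample, and for BMI only about a fifth.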

Question:

Are there specific diagnostics or metrics I could use to better understand the source of this issue? Any recommendations for reducing the overfitting?

Thank you for your time!
