Skip to content

Non-adjacent blocks from --compress-reference #69

@andrew-slater

Description

@andrew-slater

I'm attempting to replicate the Michigan Imputation Server locally. Since we will be focusing on a few loci in small sample sizes, seems best to just run Eagle and Minimac directly instead of through Cloudgene. I downloaded the latest Minimac4 release (4.1.6) but wanted to sanity check that it produced the same results as the version being reported by the Imputation Server (4-1.0.2).

We're using GRCh38 and I could not find any existing M3VCF/MSAV files so created from the 1000 Genomes release using Minimac3 to create the M3VCF and Minimac4.1.6 to create the MSAV (--compress-reference). I found some discordance in the genotypes imputed from the two Minimac4 versions and, out of curiosity, created a new MSAV from the M3VCF via --update-m3vcf in Minimac4.1.6. Using this M3VCF-sourced MSAV, the imputation results between 4.1.6 and 4-1.0.2 are much more similar. I used savvy to export the 2 MSAV files to VCF and found the difference seems to be in the block structure:

$ zgrep $'\t<BLOCK' ALL.chr1.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.noSingltons.MSAV-from-M3VCF.vcf.gz | cut -f 1-10 | head
1       10416   1:10416 CCCTAA  <BLOCK> .       .       END=62157;VARIANTS=27;REPS=178  UHM     106
1       62157   1:62157 G       <BLOCK> .       .       END=80454;VARIANTS=34;REPS=200  UHM     12
1       80454   1:80454 G       <BLOCK> .       .       END=84139;VARIANTS=25;REPS=180  UHM     0
1       84139   1:84139 A       <BLOCK> .       .       END=89744;VARIANTS=44;REPS=165  UHM     0
1       89744   1:89744 A       <BLOCK> .       .       END=494542;VARIANTS=22;REPS=138 UHM     22
1       494542  1:494542        G       <BLOCK> .       .       END=600536;VARIANTS=26;REPS=125 UHM     0
1       600536  1:600536        A       <BLOCK> .       .       END=638772;VARIANTS=37;REPS=145 UHM     0
1       638772  1:638772        A       <BLOCK> .       .       END=771717;VARIANTS=17;REPS=92  UHM     0
1       771717  1:771717        C       <BLOCK> .       .       END=785674;VARIANTS=58;REPS=103 UHM     11
1       785674  1:785674        T       <BLOCK> .       .       END=790100;VARIANTS=51;REPS=122 UHM     6
$ zgrep $'\t<BLOCK' ALL.chr1.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.noSingltons.MSAVvia--compress-reference.vcf.gz | cut -f 1-10 | head
1       10416   .       CCCTAA  <BLOCK> .       .       END=66264;VARIANTS=30;REPS=206  UHM     0,5
1       66269   .       A       <BLOCK> .       .       END=80454;VARIANTS=30;REPS=183  UHM     0,4
1       82163   .       G       <BLOCK> .       .       END=86028;VARIANTS=30;REPS=274  UHM     0,131
1       86065   .       G       <BLOCK> .       .       END=101550;VARIANTS=50;REPS=204 UHM     0,0
1       120983  .       C       <BLOCK> .       .       END=597799;VARIANTS=30;REPS=239 UHM     0,1
1       597818  .       C       <BLOCK> .       .       END=666203;VARIANTS=40;REPS=203 UHM     0,0
1       668135  .       C       <BLOCK> .       .       END=785203;VARIANTS=70;REPS=243 UHM     0,6
1       785417  .       G       <BLOCK> .       .       END=790082;VARIANTS=50;REPS=124 UHM     0,109
1       790099  .       A       <BLOCK> .       .       END=798782;VARIANTS=50;REPS=115 UHM     0,23
1       798941  .       G       <BLOCK> .       .       END=809630;VARIANTS=50;REPS=118 UHM     0,5

The M3VCF-sourced MSAV has adjacent blocks (the END INFO field is the POS value of the next block) but the MSAV created directly from the genotype VCF does not. And since the blocks are not adjacent, there is no common overlapping variant:

zgrep -C 1 $'\t<BLOCK' ALL.chr1.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.noSingltons.MSAVvia--compress-reference.vcf.gz | cut -f 1-10 | tail -n +5 | head -n 11
1       66264   .       A       T       .       .       AC=29;AN=5096;UHA=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,0,1,0,1,1,1,1,1,1,1,1,1,1,1
1       66269   .       A       <BLOCK> .       .       END=80454;VARIANTS=30;REPS=183  UHM     0,4
1       66269   .       A       T       .       .       AC=25;AN=5096;UHA=0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
--
1       80454   .       G       C       .       .       AC=6;AN=5096;UHA=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1
1       82163   .       G       <BLOCK> .       .       END=86028;VARIANTS=30;REPS=274  UHM     0,131
1       82163   .       G       A       .       .       AC=170;AN=5096;UHA=0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
--
1       86028   .       T       C       .       .       AC=195;AN=5096;UHA=0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1
1       86065   .       G       <BLOCK> .       .       END=101550;VARIANTS=50;REPS=204 UHM     0,0
1       86065   .       G       C       .       .       AC=196;AN=5096;UHA=0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0

Perhaps the MSAV format does not require the same block adjacency as the M3VCF format does? But the differing imputation results seem to indicate this differing format has an effect. Is this expected?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions