bgex - BGEN Extraction Tool

Written by Andrew R Wood (University of Exeter)

This C++ tool is designed to extract data from binary bgen files from the UK Biobank. It performs the following functions:

extract genotype dosages
extract genotype probabilities
calculate polygenic risk scores
calculate imputation quality scores (INFO-scores)

The above functions work on a list of individuals and a list of variants. Note that this version expects BGEN layout 2, with data blocks compressed using either zlib or zstd. The present version has been tested on UK Biobank bgen files storing imputed data derived from the HRC+UK10K and TOPMed imputation panels, on a cloud workstation on the DNAnexus platform.

Options

    --bgens     -b  [bgen and sample file list]
    --variants  -v  [file of variants to extract]
    --samples   -s  [.sample file]
    --extract   -e  [file of samples to keep (optional)]
    --min-info  -m  [min INFO for genotype extraction/use]
    --dosages   -d  [flag to extract dosages]
    --probs     -p  [flag to extract probabilities]
    --pscore    -g  [flag to extract polygenic score]
    --info      -q  [flag to extract info score]
    --out       -o  [file prefix for outputs]

At least one of --dosages --probs --pscore --info must be provided.

Compiling and installing

# create binary
make
# install system-wide (if you have the required permissions)
make install

Consider using DXFUSE to avoid downloading the .bgen and .bgi files if using UKB data

A major bottleneck in working with UKB genetic data is downloading it to a cloud workstation first. To avoid this, consider streaming the data through DXFUSE instead of downloading it before running bgex (although bgex does not require this). It is relatively quick and straightforward to set up; here are some commands to facilitate this.

# Initialise workspace (if not done already):
unset DX_WORKSPACE_ID
dx cd $DX_PROJECT_CONTEXT_ID:

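# Download the dxfuse binary (v1.5.0 shown here; check the dxfuse releases page for newer versions) and install it
wget https://github.com/dnanexus/dxfuse/releases/download/v1.5.0/dxfuse-linux
sudo mv dxfuse-linux /usr/bin/dxfuse
# make it executable (755 is sufficient; world-writable 777 is not needed)
sudo chmod 755 /usr/bin/dxfuse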

# create directory to mount UKB project to (e.g. "project/")
mkdir project/

# mount project to directory
dxfuse project "UKB_500k_WGS"

The example input files listing bgens within the example_input/ directory reflect this mounting for the example UKB-RAP project called UKB_500k_WGS.

Input Files

BGEN file list

This should be a tab-delimited file containing the chromosome and the absolute path to the respective bgen file:

1	/home/dnanexus/project/UKB_500k_WGS/Bulk/Imputation/UKB imputation from genotype/ukb22828_c1_b0_v3.bgen
2	/home/dnanexus/project/UKB_500k_WGS/Bulk/Imputation/UKB imputation from genotype/ukb22828_c2_b0_v3.bgen
3	/home/dnanexus/project/UKB_500k_WGS/Bulk/Imputation/UKB imputation from genotype/ukb22828_c3_b0_v3.bgen
...

If the file path contains spaces, do not try to escape them or enclose the filename in quotes (see the example files provided). Note that the respective .bgi file is required and is expected to be in the same directory as the .bgen file.
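As a rough illustration of the layout described above, the following Python sketch parses a BGEN file list and reports chromosomes whose index appears to be missing. It is not part of bgex. The helper names are hypothetical, and the `<file>.bgen.bgi` index naming is an assumption based on the usual bgenix convention rather than something this README states.

```python
from pathlib import Path


def read_bgen_list(lines):
    """Parse 'chromosome<TAB>path' lines into a dict of chromosome -> Path."""
    bgens = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        # Split on the first tab only, so spaces inside the path are preserved.
        chrom, path = line.split("\t", 1)
        bgens[chrom.strip()] = Path(path.strip())
    return bgens


def missing_indexes(bgens):
    """Return chromosomes whose assumed '<file>.bgen.bgi' index does not exist."""
    return [c for c, p in bgens.items() if not Path(str(p) + ".bgi").exists()]


example = [
    "1\t/home/dnanexus/project/UKB_500k_WGS/Bulk/Imputation/"
    "UKB imputation from genotype/ukb22828_c1_b0_v3.bgen\n",
]
bgens = read_bgen_list(example)
print(bgens["1"].name)
```

Splitting on the first tab only is what allows unquoted, unescaped spaces in the path to survive intact.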

Variant file

This tab-delimited file should contain at least four columns: chromosome, bp position, allele1 and allele2. An optional 5th column may specify the beta or log(OR) aligned to allele1; it is only used if --pscore is specified:

1	4414033	C	T	0.006406851
1	7821917	C	CT	0.006961472
1	16045250	A	G	0.00648666
1	17958038	C	T	0.016447903
1	32180167	T	A	0.00900976
...

Note that polygenic scores are derived by weighting phenotype-raising alleles by the absolute value of the 5th column.
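One way to read this weighting rule is sketched below in Python. This is an interpretation rather than bgex's actual code. It assumes that a negative weight means allele2 is the trait-raising allele, and that each variant contributes the expected count of its trait-raising allele multiplied by the absolute weight. The function name and inputs are hypothetical.

```python
def polygenic_score(allele1_counts, weights):
    """Sum abs(weight) * expected count of the trait-raising allele.

    allele1_counts: expected count (0..2) of the variant file's allele1, per variant.
    weights: beta or log(OR) aligned to allele1, one per variant.
    """
    score = 0.0
    for a1_count, beta in zip(allele1_counts, weights):
        # Assumption: a negative weight means allele2 raises the trait,
        # so count allele2 (2 - allele1 count) instead.
        raising = a1_count if beta >= 0 else 2.0 - a1_count
        score += abs(beta) * raising
    return score


# Two hypothetical variants: the second has a negative weight.
score = polygenic_score([2.0, 0.5], [0.01, -0.02])
print(score)
```

Under this reading, flipping both the sign of a weight and the allele it is aligned to leaves the score unchanged.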

Optional subjects inclusion file

This file should contain a family ID and an individual ID on each line; these are searched for in the bgen sample file:

1234567	1234567
2323433	2323433
...

Note that INFO scores are calculated using only the individuals included in this file.
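The matching of the inclusion file against the .sample file can be pictured with the Python sketch below. It assumes the standard Oxford .sample layout (two header lines, then one whitespace-separated row per individual with ID_1 and ID_2 first). The function name is hypothetical and this is not taken from the bgex source.

```python
def sample_indexes_to_keep(sample_lines, keep_lines):
    """Return 0-based positions (in .sample order) of the individuals to keep."""
    keep = {tuple(line.split()[:2]) for line in keep_lines if line.strip()}
    indexes = []
    # Assumed Oxford .sample layout: two header lines, then one row per individual.
    for i, line in enumerate(sample_lines[2:]):
        fields = line.split()
        if len(fields) >= 2 and (fields[0], fields[1]) in keep:
            indexes.append(i)
    return indexes


# Hypothetical .sample content and inclusion list.
sample = [
    "ID_1 ID_2 missing sex",
    "0 0 0 D",
    "1234567 1234567 0 1",
    "7654321 7654321 0 2",
    "2323433 2323433 0 2",
]
keep = ["1234567\t1234567", "2323433\t2323433"]
idx = sample_indexes_to_keep(sample, keep)
print(idx)
```

The returned positions correspond to the order in which individuals are stored in the bgen file, which is why the .sample file is needed alongside the inclusion list.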

Output Files

.dosages (requires `--dosages` flag)

The format of the .dosages file is:

fid:iid           var1               var2               var3               ...  varN
1234567:1234567   p(a1a2)+2p(a2a2)   p(a1a2)+2p(a2a2)   p(a1a2)+2p(a2a2)   ...  p(a1a2)+2p(a2a2)
...

The variant IDs are of the form chr:pos:allele_1:allele_2, where allele_1 and allele_2 are as defined in the bgen file, not by the user. The dosage-increasing allele is allele_2.
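For downstream use, a .dosages file with this layout could be loaded along the lines of the Python sketch below. It assumes whitespace-separated columns and purely numeric dosage values; neither is confirmed by this README. The values shown are made up for illustration.

```python
def read_dosages(lines):
    """Return (variant_ids, {sample_id: [dosage, ...]}) from .dosages text."""
    rows = [line.split() for line in lines if line.strip()]
    variant_ids = rows[0][1:]  # skip the 'fid:iid' column label
    dosages = {row[0]: [float(x) for x in row[1:]] for row in rows[1:]}
    return variant_ids, dosages


# Hypothetical two-variant file for a single individual.
example = [
    "fid:iid 1:4414033:C:T 1:7821917:C:CT",
    "1234567:1234567 0.013 1.998",
]
variants, dosages = read_dosages(example)

# Variant IDs are chr:pos:allele_1:allele_2; dosages count allele_2.
chrom, pos, allele_1, allele_2 = variants[0].split(":")
print(chrom, pos, allele_1, allele_2, dosages["1234567:1234567"][0])
```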

.probs (requires `--probs` flag)

The .probs file contains genotype probability pairs for allele_1/allele_1 and allele_1/allele_2. The probability of being homozygous for allele_2 can be derived by subtracting the sum of the two probabilities from 1. The format of the `.probs` file is:

fid:iid           var1              var2              var3              ...   varN
1234567:1234567   p(a1a1),p(a1a2)   p(a1a1),p(a1a2)   p(a1a1),p(a1a2)   ...   p(a1a1),p(a1a2)
...

The variant IDs are of the form chr:pos:allele_1:allele_2, where allele_1 and allele_2 are as defined in the bgen file, not by the user.
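The derivation of the third probability, and of the matching dosage, can be written out as a short Python sketch. The function name is hypothetical. The clamp to zero is an added safeguard against rounding in the printed values, not something bgex is known to do.

```python
def expand_probs(cell):
    """Turn 'p(a1a1),p(a1a2)' into (p_a1a1, p_a1a2, p_a2a2, allele_2 dosage)."""
    p_a1a1, p_a1a2 = (float(x) for x in cell.split(","))
    # p(a2a2) is whatever probability remains; clamp tiny negative rounding error.
    p_a2a2 = max(0.0, 1.0 - p_a1a1 - p_a1a2)
    dosage = p_a1a2 + 2.0 * p_a2a2
    return p_a1a1, p_a1a2, p_a2a2, dosage


p11, p12, p22, d = expand_probs("0.25,0.5")
print(p11, p12, p22, d)
```

The dosage computed this way matches the p(a1a2)+2p(a2a2) definition used for the .dosages output.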

.pscores (requires `--pscore` flag)

The .pscores file contains the derived polygenic scores based on alleles and weights provided in the variant file (see above). Polygenic scores are calculated by summing the number of trait raising alleles multiplied by the respective abs(weight). The format of the .pscores file is:

fid:iid          pscore
1234567:1234567  0.8939
...

.infoscores (requires `--info` flag)

The .infoscores file contains the calculated INFO-score metric commonly used as a measure of genotype imputation quality. The INFO-scores for a given variant are calculated based on either all individuals represented in the bgen or those listed in the subject inclusion file (see above). The format of the .infoscores file is:

variant       info_score
1:10235:T:TA  0.8990
...

Note that the variant ID is built from the chr, bp position, a1 and a2 supplied by the user in the variant file.
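This README does not spell out the INFO formula. The Python sketch below shows the widely used IMPUTE-style definition, computed from per-individual genotype probabilities, as an assumption about what the metric looks like rather than a copy of bgex's implementation. Returning 1 for monomorphic variants is likewise a convention assumed here.

```python
def info_score(probs):
    """IMPUTE-style INFO from a list of (p_a1a1, p_a1a2) pairs for one variant."""
    n = len(probs)
    if n == 0:
        raise ValueError("at least one individual is required")
    dosage_sum = 0.0
    variance_sum = 0.0
    for p_a1a1, p_a1a2 in probs:
        p_a2a2 = 1.0 - p_a1a1 - p_a1a2
        e = p_a1a2 + 2.0 * p_a2a2   # expected allele_2 count
        f = p_a1a2 + 4.0 * p_a2a2   # expected squared allele_2 count
        dosage_sum += e
        variance_sum += f - e * e   # per-individual genotype uncertainty
    theta = dosage_sum / (2.0 * n)  # estimated allele_2 frequency
    if theta <= 0.0 or theta >= 1.0:
        return 1.0                  # monomorphic: assumed convention
    return 1.0 - variance_sum / (2.0 * n * theta * (1.0 - theta))


certain = info_score([(1.0, 0.0), (0.0, 1.0), (0.0, 0.0)])
uninformative = info_score([(0.25, 0.5)] * 4)
print(certain, uninformative)
```

With fully certain genotypes the score is 1. When every individual carries only the population-level Hardy-Weinberg probabilities, the score falls to 0. Because the sums run over the individuals supplied, restricting to an inclusion list changes the result.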

Example command:

Assuming the DXFUSE configuration above, the following command will produce all output files for all individuals in the BGEN files. INFO scores will be based on all individuals.

./bgex \
  --bgens    example_input/hrc_uk10k_bgens.txt \
  --samples  "/home/dnanexus/project/UKB_500k_WGS/Bulk/Imputation/UKB imputation from genotype/ukb22828_c1_b0_v3.sample"  \
  --variants example_input/var_list_b37.txt \
  --out      my_output_file_prefix \
  --probs \
  --dosages \
  --pscore \
  --info

The following command will produce all output files for all individuals in the optional subject list file (this is not provided in the example_input/ directory). INFO scores will be based on the individuals extracted. Only variants with an INFO score >= 0.4 will be output or used for polygenic score derivation.

./bgex \
  --bgens    example_input/hrc_uk10k_bgens.txt \
  --samples  "/home/dnanexus/project/UKB_500k_WGS/Bulk/Imputation/UKB imputation from genotype/ukb22828_c1_b0_v3.sample"  \
  --variants example_input/var_list_b37.txt \
  --out      my_output_file_prefix \
  --extract  my_subject_list.txt \
  --probs \
  --dosages \
  --pscore \
  --info \
  --min-info 0.4
