GWAS Project
GWAS Project is a script that generates visualizations from the summary statistics of a genome-wide association study in a fully customized and automated way. It was built to save researchers time when analyzing a genome-wide study, offering a complete and easy-to-use tool. Its distinguishing feature is a direct connection with Ensembl, which makes it easier to identify the significant variants of the study.
The mandatory arguments to run the program are "input_path", "output_path" and "build".
Example:
python GWASproject.py --input_path 'your/input/path/' --output_path 'your/output/path' --build 'version of the genome being used in the genetic study'
If you want to run more than one file at the same time, just separate the paths with commas.
Example:
python GWASproject.py --input_path 'your/input/path/', 'your/input/path2/' ,'your/input/path3/' --output_path 'your/output/path' --build '19'
The console output should start like this:
- Input files must be in "csv" or "tsv" format
- The script will return a Manhattan plot for the GWAS summary statistics and a Regional plot for each significant variant in the study.
Examples of the plots that can be generated with the script:
The script is an optimization of the "GWASLAB" library. To get more personalized results, some variables can be set; learn more in the next section.
The packages required to run GWAS Project are:
- matplotlib 3.6.3 or newer
- gwaslab 3.3.18 or newer
- pandas 1.4.4 or newer
This guide will help you get more unique and personalized views for your study.
Genome-wide association studies (GWAS) are extremely important for identifying genetic polymorphisms that are associated with a specific outcome. Essentially, these studies scan the genome to identify which polymorphisms are associated with a given outcome. Consequently, a whole body of technology has grown up around identifying candidate polymorphisms that may be associated with, for example, a particular clinical outcome.
The most common graph used for this type of study is the Manhattan plot. The degree of statistical significance required to identify an association between a polymorphism and a clinical outcome in GWAS is extremely conservative: the P-value has to be very low, to the point of being written in scientific notation. This requirement comes from another important concept in statistics, multiple testing. Since hundreds of thousands of polymorphisms or more are evaluated simultaneously, we need to correct the P-value using a technique called the "Bonferroni correction," in which each P-value is multiplied by the number of comparisons (equivalently, the significance threshold is divided by that number). Therefore, the raw P-values need to be very low to still be considered significant after correction.
It is important to note that GWAS require large sample sizes, often thousands of individuals, because high statistical power is needed.
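The arithmetic behind the Bonferroni correction can be sketched in a few lines. This is an illustration only, not code from GWAS Project; the function names are made up for the example, and the figure of one million independent tests is an assumption (it is the number usually quoted to motivate the conventional 5e-8 genome-wide threshold).

```python
import math


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def bonferroni_adjust(p_value: float, n_tests: int) -> float:
    """Bonferroni-adjusted P-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Assumption: about one million independent tests across the genome.
N_TESTS = 1_000_000
threshold = bonferroni_threshold(0.05, N_TESTS)
print(f"per-variant threshold: {threshold:.1e}")

# A raw P-value of 1e-9 stays below 0.05 after adjustment; 1e-6 does not.
print(bonferroni_adjust(1e-9, N_TESTS) < 0.05)
print(bonferroni_adjust(1e-6, N_TESTS) < 0.05)
```

Both views are equivalent: comparing the raw P-value against alpha / n gives the same decision as comparing the adjusted P-value against alpha.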
Single nucleotide polymorphisms (SNPs) are genetic variants that occur at a known frequency in a population. This is the main difference between SNPs and simple mutations: SNPs occur at expected, established frequencies, while mutations occur at unexpected frequencies. SNPs also occur with a certain regularity in our genome. It is estimated that there is one SNP per thousand nucleotides; since our genome has about 3 billion nucleotides, this gives an estimate of 3 million SNPs (although this can vary according to the study). More than 100 million SNPs that have some clinical or biological significance have been reported in the literature. Many of these SNPs do not affect our organism or biology, as they can occur in regions of our DNA that do not affect gene expression. Others, however, may affect genes, cause serious problems such as metabolic disorders, and be risk factors for diseases.
If population genetics or genome-wide association studies are new to you, I recommend visiting this website for more in-depth content. This tutorial is provided by the Kamatani Laboratory at the University of Tokyo. It is primarily intended for beginners in bioinformatics/statistical genetics. It covers the following parts:
- Command line (mostly linux, a small amount of R/Python/JupyterNotebook/Github, etc.)
- Data processing and quality control before GWAS
- GWAS and results visualization
- Downstream analysis after GWAS
- GWAS Related Topics
To delve even further into statistical and computational concepts, I strongly recommend accessing this site.
Despite being originally written in Traditional Chinese, the site is well written and easily translated by public translation tools. I consider this the most complete and most explanatory material for beginners in the area.
Loading a file into the script is simple: provide the path to the file you want to read using the input_path argument. For example:
python GWASproject.py --input_path 'your/input/path/'
GWASProject can read multiple files with a single command; just provide them in this format:
python GWASproject.py --input_path 'your/input/path/', 'your/input/path2/' ,'your/input/path3/'
For the script to do its work, three arguments are mandatory: an input path, so that the study in question can be analyzed; an output path, so that the generated files can be saved and shown; and the version of the genome on which the study was built.
So that the script can read the columns of the loaded dataset, the column labels of the file must be provided. By default, GWASProject predefines the most commonly used labels, which are:
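As a rough illustration of how three mandatory arguments and a comma-separated list of inputs can be handled, here is a minimal argparse sketch. It is an assumption about how such a command line could be parsed, not the actual source of GWASproject.py; `build_parser` and `split_paths` are hypothetical names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser with the three mandatory arguments."""
    parser = argparse.ArgumentParser(prog="GWASproject.py")
    # nargs="+" collects every shell token that follows --input_path.
    parser.add_argument("--input_path", nargs="+", required=True)
    parser.add_argument("--output_path", required=True)
    parser.add_argument("--build", required=True)
    return parser


def split_paths(tokens):
    """Turn tokens such as ['a/,', 'b/', ',c/'] into ['a/', 'b/', 'c/']."""
    joined = ",".join(tokens)
    return [part.strip() for part in joined.split(",") if part.strip()]


# Tokens as the shell would deliver them for: 'a/', 'b/' ,'c/'
args = build_parser().parse_args(
    ["--input_path", "a/,", "b/", ",c/", "--output_path", "out", "--build", "19"]
)
print(split_paths(args.input_path))
```

With `required=True`, argparse exits with an error message when any of the three arguments is missing, which matches the behavior described above.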
['beta'],['p_value'],['snpid'],['variant_id'],['effect_allele'],['chromosome'],['other_allele'],
['standard_error'],['base_pair_location']
These columns in turn are formatted within the script to the PLINK format. Using the PLINK format, these columns change their label to:
CHR=['chromosome']
POS=['base_pair_location']
rsID=['variant_id']
P=['p_value']
EA=['effect_allele']
NEA=['other_allele']
BETA=['beta']
SE=['standard_error']
SNPID=['snpid']
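The relabeling above amounts to a lookup from each default input label to its standardized name. A minimal sketch follows, assuming the mapping is applied to the header row of the file; `rename_header` is a hypothetical helper, not a function of GWAS Project or gwaslab.

```python
# Mapping copied from the list above: standardized label -> default input label.
DEFAULT_LABELS = {
    "CHR": "chromosome",
    "POS": "base_pair_location",
    "rsID": "variant_id",
    "P": "p_value",
    "EA": "effect_allele",
    "NEA": "other_allele",
    "BETA": "beta",
    "SE": "standard_error",
    "SNPID": "snpid",
}


def rename_header(header, labels=DEFAULT_LABELS):
    """Replace known input labels with their standardized names."""
    reverse = {original: standard for standard, original in labels.items()}
    # Columns that are not in the mapping are left untouched.
    return [reverse.get(column, column) for column in header]


print(rename_header(["chromosome", "base_pair_location", "p_value", "extra"]))
```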
Note that the output files are named according to the order in which the inputs were listed in the input_path argument. For example, if you run the command:
python GWASproject.py --input_path 'first', 'second/' ,'third'
In the folder defined in the output_path, it will be saved as:
“first1” , “second2”, “third3”
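The naming pattern in this example (input name followed by its 1-based position) can be reproduced as follows. This is a sketch inferred from the example only; how the script treats file extensions or nested folders is not stated here, and `output_names` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def output_names(input_paths):
    """Append each input's 1-based position to its final path component."""
    return [
        f"{PurePosixPath(path).name}{index}"
        for index, path in enumerate(input_paths, start=1)
    ]


print(output_names(["first", "second/", "third"]))
```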
python GWASproject.py --build "19"
This argument, together with input_path and output_path, forms the set of mandatory arguments for the script to run correctly; if any of them is missing, the script will return an error.
python GWASproject.py --input_path 'your/input/path/', 'your/input/path2/' ,'your/input/path3/' --output_path 'your/output/path' --build "19"
python GWASproject.py --input_path 'your/input/path/' --output_path 'your/output/path' --build "19" --skip 2
The Manhattan plot view for a skip value of 2 would be:
The cut argument defines a cut-off value for plots of p-values, expressed on the -log10(p) scale. When you supply a numeric value for cut, the plot is limited to variants whose -log10(p) values are above this cut-off (equivalently, whose p-values are below 10^-cut). This can be useful for highlighting the most significant or relevant variants on the graph, filtering out the rest. For example, if you set cut to 5, only variants with -log10(p) values above 5 (p-values below 1e-5) will be displayed on the graph. This allows you to focus on the strongest or most significant associations.
Here is an example of how to use the argument:
python GWASproject.py --input_path 'your/input/path/' --output_path 'your/output/path' --build "19" --cut 5
The Manhattan plot view for a cut value of 5 would be:
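Because p-values never exceed 1, a cut value such as 5 is only meaningful on the -log10(p) scale. The conversion between the two scales can be sketched as below; the helper names and the p-values are invented for the illustration and do not come from GWAS Project.

```python
import math


def neg_log10(p_value: float) -> float:
    """Height of a variant on the Manhattan plot's y-axis."""
    return -math.log10(p_value)


def cut_to_p(cut: float) -> float:
    """P-value that corresponds to a cut given on the -log10(p) scale."""
    return 10 ** (-cut)


# Made-up p-values, for illustration only.
p_values = [0.2, 3e-4, 8e-6, 2e-9]
cut = 5
above_cut = [p for p in p_values if neg_log10(p) > cut]
print(cut_to_p(cut), above_cut)
```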
The sig_level argument defines the level of statistical significance used when plotting p-values or -log10(p) values. When you provide a numeric value for sig_level, the plot is drawn with a horizontal line representing the threshold of statistical significance.
This helps identify which variants have p-values below this threshold and are considered statistically significant.
Here is an example of how to use the argument:
python GWASproject.py --input_path 'your/input/path/' --output_path 'your/output/path' --build "19" --sig_level 5e-6
For example, if you set sig_level to "5e-8", the graph will be plotted with a horizontal line representing a p-value of "5e-8". All variants with p-values below this threshold will be displayed above the line, indicating that they are statistically significant.
The Manhattan plot view for a sig_level value of 5e-8 would be:
It is important to say that this argument only accepts float values, and that the default value for this argument is 5e-8.
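To see where the significance line lands on the plot and which variants end up above it, the following sketch may help. The variant names and p-values are invented for the illustration, and the helper functions are hypothetical rather than part of GWAS Project.

```python
import math


def line_height(sig_level: float) -> float:
    """Y-axis position of the significance line on a -log10(p) plot."""
    return -math.log10(sig_level)


def significant(p_by_variant: dict, sig_level: float = 5e-8) -> list:
    """Variants whose p-value is below the significance level."""
    return [name for name, p in p_by_variant.items() if p < sig_level]


# Made-up p-values, for illustration only.
example = {"variant_a": 3e-10, "variant_b": 4e-7, "variant_c": 0.03}
print(round(line_height(5e-8), 2))
print(significant(example))
print(significant(example, sig_level=5e-6))
```

A looser sig_level lowers the line, so more variants appear above it.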
The highlight argument is used to highlight specific variants in the graphs plotted by the script. It allows you to provide a list of variants you want to highlight.
The highlight_color argument is used to set the highlight color for the variants specified in the highlight argument. You can provide a color in string format such as "red", "#FF0000" (hex code) or "(1.0, 0.0, 0.0)" (RGB values).
When the variants specified in highlight are plotted on the chart, they are visually highlighted with the color specified in highlight_color, making it easier to identify these variants.
Here is an example of how to use the argument:
python GWASproject.py --input_path 'your/input/path/' --output_path 'your/output/path' --build "19" --highlight "10:69083:T:C" "10:94263:A:C" --highlight_color "#FF0000"
The Manhattan plot view for a value of highlight and highlight_color would be:
Make sure that the value chosen for the highlight argument is valid and is from the dataset being read.
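One way to check highlight values before running the script is sketched below. It assumes identifiers follow the "chromosome:position:allele:allele" shape of the examples above; the pattern and the `check_highlight` helper are assumptions made for this illustration, not part of GWAS Project.

```python
import re

# Assumed identifier shape, based on examples such as "10:69083:T:C".
SNPID_PATTERN = re.compile(r"^(\d+|X|Y|MT):\d+:[ACGT]+:[ACGT]+$")


def check_highlight(requested, dataset_ids):
    """Return (malformed, missing) identifiers among the requested ones."""
    known = set(dataset_ids)
    malformed = [v for v in requested if not SNPID_PATTERN.match(v)]
    missing = [v for v in requested if v not in malformed and v not in known]
    return malformed, missing


malformed, missing = check_highlight(
    ["10:69083:T:C", "10:94263:A:C", "rs123"],
    dataset_ids=["10:69083:T:C"],
)
print(malformed, missing)
```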
The pinpoint argument and the pinpoint_color argument are used to highlight a specific variant in the plotted graphs.
The pinpoint argument is used to specify the identifier of a single variant that you want to highlight on the chart. You must provide this variant identifier as a string for the pinpoint argument. For example, "10:69083:T:C".
On the other hand, the pinpoint_color argument is used to set the highlight color of the specified variant. You can provide the name of the color or the hexadecimal code of the color for the pinpoint_color argument. For example, "red" or "#FF0000".
When using these arguments, the script will generate a graphic where the variant specified by pinpoint will be highlighted with the color specified by pinpoint_color. This helps to draw attention to a specific variant of interest.
Here is an example of how to use the argument:
python GWASproject.py --input_path 'your/input/path/' --output_path 'your/output/path' --build "19" --pinpoint "10:69083:T:C" --pinpoint_color "red"
The Manhattan plot view for a pinpoint value "10:69083:T:C" and pinpoint_color "red" would be:
The anno argument specifies how variants are annotated on the graphs plotted by GWASProject. It refers to the column of the data that contains the annotation information for each variant.
This argument takes either a column name (string) or the value "GENENAME":
String: the name of the column to be used for annotation. Here is an example of how to use the argument:
python GWASproject.py --input_path 'your/input/path/' --output_path 'your/output/path' --build "19" --anno "rsID"
The Manhattan plot view for an anno “rsID” value would be:
GENENAME: This is the default value. Gene names are automatically annotated using pyensembl.
Here is an example of how to use the argument:
python GWASproject.py --input_path 'your/input/path/' --output_path 'your/output/path' --build "19" --anno "GENENAME"
The Manhattan plot view for an anno value “GENENAME” would be:
The chr_filter argument is used to filter the variants based on the desired chromosomes during the generation of graphs plotted by GWASProject. It allows you to specify one or more chromosomes that you want to include in the graphs. This is useful when you have data from multiple chromosomes but are only interested in viewing some of them.
To use the chromosome filter, just specify the chromosome number or range of chromosome numbers you want to work with, for example:
python GWASproject.py --input_path 'your/input/path/' --output_path 'your/output/path' --build "19" --chr_filter "CHR==1"
The Manhattan plot view for a chr_filter value equal to 1 would be:
or
python GWASproject.py --input_path 'your/input/path/' --output_path 'your/output/path' --build "19" --chr_filter "CHR>=1 & CHR<7"
The Manhattan plot view for a chr_filter value equal to "CHR>=1 & CHR<7" would be:
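The filter strings shown above have the shape of pandas query expressions. Assuming (this is not confirmed here) that the script hands them to something like `DataFrame.query`, their effect on a table of variants can be previewed as follows; the rows are invented for the illustration.

```python
import pandas as pd

# Made-up variants, for illustration only.
variants = pd.DataFrame(
    {
        "CHR": [1, 1, 3, 7, 22],
        "POS": [69083, 94263, 12000, 45000, 31000],
        "P": [2e-9, 0.4, 3e-6, 0.01, 5e-8],
    }
)

# Same expressions as in the chr_filter examples above.
only_chr1 = variants.query("CHR==1")
chr1_to_6 = variants.query("CHR>=1 & CHR<7")

print(only_chr1["CHR"].tolist())
print(chr1_to_6["CHR"].tolist())
```

Note that "CHR<7" excludes chromosome 7 itself, so the second filter keeps chromosomes 1 through 6.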
The vcf_file argument is used to specify a VCF (Variant Call Format) file that contains information about genetic variants. VCF is a widely used file format for storing genetic variant data such as genotypes and information about specific variants.
By supplying the vcf_file argument, you indicate that you want to include LD (Linkage Disequilibrium) information when generating the related graphs or analyses. LD is a statistical measure that describes the association between two nearby genetic variants in a given dataset.
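For readers who want to see what an LD value is, the common r² statistic can be computed from phased haplotypes as sketched below. This is a textbook illustration on invented data, not the way GWAS Project or gwaslab derives LD from a VCF file; `ld_r_squared` is a hypothetical helper.

```python
def ld_r_squared(haplotypes):
    """r^2 between two biallelic variants from phased haplotypes.

    Each haplotype is a pair (a, b): 1 for the alternate allele,
    0 for the reference allele, at the first and second variant.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both variants must be polymorphic in the sample")
    return d * d / denominator


# Alleles always travel together: complete LD.
linked = [(1, 1), (1, 1), (0, 0), (0, 0)]
# Alleles combine freely: no LD.
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]
print(ld_r_squared(linked), ld_r_squared(unlinked))
```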
The value provided for the vcf_file argument must be the path or location of the VCF file you want to use. For example:
python GWASproject.py --input_path 'your/input/path/' --output_path 'your/output/path' --build "19" --vcf_file 'your/reference/file/path/'
The previous argument allows you to use your own reference file, but GWASProject already has default VCF files that can be selected by passing the value "True" to the vcf_file argument.
python GWASproject.py --input_path 'your/input/path/' --output_path 'your/output/path' --build "19" --vcf_file True
The default files are "1kg_eas_hg19" and "1kg_eas_hg38".
In this context, "1kg" represents the 1000 Genomes Project, which is an international initiative to sequence the genome of at least a thousand individuals from different populations around the world. "eas" is an abbreviation for East Asians, which refers to East Asian populations such as Chinese, Japanese, and Koreans.
Therefore, "1kg_eas_hg19" and "1kg_eas_hg38" indicate that the genetic datasets refer to genetic variants found in individuals from East Asian populations, based on the hg19 and hg38 reference assemblies of the human genome. This data may include information about genotypes, allele frequencies, and other genetic characteristics relevant to that specific population.
To select the desired file, in addition to setting the value of vcf_file equal to “True”, also indicate the value of the build argument so that the system understands which reference assembly you would like to use.
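The selection logic described above can be pictured as a small lookup. This is a sketch under assumptions: the build value "38" is inferred from the hg38 file name (only "19" appears in the examples), and `resolve_vcf` is a hypothetical helper rather than the script's own code.

```python
# Default reference names as listed above, keyed by build value.
# Assumption: "38" is the build value that selects the hg38 file.
DEFAULT_REFERENCES = {"19": "1kg_eas_hg19", "38": "1kg_eas_hg38"}


def resolve_vcf(vcf_file: str, build: str) -> str:
    """Return the default reference for the build, or the user's own path."""
    if vcf_file == "True":
        if build not in DEFAULT_REFERENCES:
            raise ValueError(f"no default reference for build {build!r}")
        return DEFAULT_REFERENCES[build]
    # Any other value is treated as a path to the user's own VCF file.
    return vcf_file


print(resolve_vcf("True", "19"))
print(resolve_vcf("your/reference/file/path/", "19"))
```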
An example view for vcf_file “True” using the hg19 reference would be:
The suggestive_sig_line argument sets the significance level for the suggestive signal line on the genomic association plot. It is a numerical value that represents the significance threshold used to highlight variants that suggest an association with a condition or characteristic.
Generally, a p-value below this threshold is considered suggestive of an association. This argument only takes float values as input.
Used in conjunction with the suggestive_sig_line argument, the suggestive_sig_line_color argument sets the color of the suggestive signal line in the genomic association graph. It can be specified as a color name such as "red" or as a hexadecimal code such as "#FF0000".
python GWASproject.py --input_path 'your/input/path/' --output_path 'your/output/path' --build "19" --suggestive_sig_level 5e-6 --suggestive_sig_line_color "pink"
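With both a significance line and a suggestive line on the plot, each variant falls into one of three bands. The sketch below illustrates this using the default sig_level of 5e-8 and the suggestive value of 5e-6 from the example command; `classify` is a hypothetical helper, not part of GWAS Project.

```python
def classify(p_value: float, sig_level: float = 5e-8,
             suggestive_level: float = 5e-6) -> str:
    """Place a p-value relative to the two horizontal lines."""
    if p_value < sig_level:
        return "genome-wide significant"
    if p_value < suggestive_level:
        return "suggestive"
    return "not significant"


# Made-up p-values, for illustration only.
for p in (1e-9, 1e-7, 0.01):
    print(p, classify(p))
```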