AnVIL/Velsera-CGC interoperability project to study the genetic and transcriptomic contributions toward Hispanic colorectal cancer health disparities
Colorectal carcinoma (CRC) is a major global public health concern, being the third most common malignancy worldwide and the second deadliest cancer in the United States. Hispanics have higher rates of CRC incidence and mortality compared to non-Hispanic Whites, partly due to poorer access to healthcare and late-stage diagnosis. In addition, recent data show an increasing incidence of CRC among younger Hispanics, particularly in distal tumors. This proposal seeks to investigate population-specific genetic factors contributing to these disparities by analyzing genetic variants and gene expression profiles in Hispanic populations. By integrating and analyzing extensive datasets across NCI and NHGRI resources, the project aims to identify CRC risk genes and genetic variants associated with expression changes in Hispanics. This research will enhance our understanding of the genetic and molecular basis of CRC health disparities among Hispanics and provide a valuable foundation for improving prevention and treatment strategies.
-
1000 Genomes: The 1000 Genomes Project was a landmark international initiative to characterize human genetic variation by sequencing individuals from diverse global populations using short-read technologies. Originally based on low-coverage (~4–6×) sequencing of 2,504 individuals, the project identified over 88 million genetic variants and became a foundational reference for population genetics and disease association studies. More recently, the resource was expanded with high-coverage (~30×) sequencing of 3,202 individuals, including 602 trios, on Illumina NovaSeq platforms. This new dataset, generated by the New York Genome Center and collaborators, offers greatly improved sensitivity for detecting single nucleotide variants, indels, and structural variants, and enables the creation of an enhanced haplotype reference panel containing ~72 million variants. The updated release strengthens the utility of the 1000 Genomes resource for imputation, association studies, and methods development in human genomics. Read the paper in PMC. Access the data in AnVIL.
-
MAGE: The MAGE (Multi-ancestry Analysis of Gene Expression) dataset is an open-access RNA-seq resource that profiles gene expression and splicing in 731 lymphoblastoid cell lines (LCLs) derived from 1000 Genomes Project individuals, covering 26 populations across five continental groups. MAGE uses both gene-level quantification (GENCODE v.38) and annotation-agnostic splicing analysis (LeafCutter), generating data well suited to cis-eQTL and cis-sQTL mapping. Combined with matching high-coverage WGS from the same individuals, the dataset shows that while populations differ modestly in expression patterns (continental labels explain ~3% and population labels ~8% of variance), cis-regulatory effects are largely consistent across ancestries. MAGE also enables discovery of novel eQTLs and sQTLs, offers high-resolution variant-to-expression links, and supports evolutionary and GWAS colocalization analyses. It thus provides a globally diverse resource for understanding the genetic basis of gene expression and splicing, and their evolutionary and disease-relevant roles. Read the paper in PMC. Access the data in AnVIL.
-
TCGA Colorectal cancer: TCGA analyzed 276 colorectal cancer samples using a comprehensive, multi-platform approach that included exome sequencing, DNA copy number profiling, promoter methylation, and mRNA and microRNA expression analysis. A subset of 97 tumors also underwent low-coverage whole-genome sequencing. The study found that 16% of tumors were hypermutated. Most of these showed microsatellite instability (MSI) due to MLH1 silencing and promoter hypermethylation, while others had somatic mutations in mismatch repair genes or POLE. Excluding hypermutated cases, colon and rectal cancers displayed similar genomic profiles. TCGA identified 24 significantly mutated genes, including known drivers (APC, TP53, SMAD4, KRAS, PIK3CA) and others such as ARID1A, SOX9, and FAM123B. The study also observed recurrent, potentially targetable copy number amplifications in ERBB2 and IGF2, and a novel chromosomal fusion between NAV2 and TCF7L1. Integrative analysis highlighted the role of MYC-driven transcriptional programs and uncovered new biomarkers associated with aggressive colorectal cancer. Read the paper in PMC. Access the data in the Cancer Genomics Cloud.
-
1KGP_variant_cnts.ipynb: This Jupyter notebook processes and visualizes variant data from a 1000 Genomes Project VCF file. It begins by downloading the file using a secure, time-limited link. The notebook then defines Python functions to parse the VCF file using the vcf module and count the number of variants per chromosome. It includes logic to handle common chromosome naming conventions, sorts chromosomes in a biologically meaningful order (with “X”, “Y”, and “MT” handled specially), and filters out alternate/random contigs. Finally, it generates a bar plot using Matplotlib to visualize the number of variants on each chromosome, helping users understand the genomic distribution of variants in the dataset. This simple notebook can be used as a starting point for demonstrating data movement and interoperability across cloud platforms.
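The counting and ordering steps described for this notebook can be sketched roughly as follows. This is a minimal, standard-library illustration and not the notebook's own code: it reads VCF text lines directly instead of using the vcf module, and the function names and the tiny in-memory example are assumptions made for this sketch. Plotting is left out.

```python
from collections import Counter

def normalize_chrom(name):
    """Strip a leading 'chr' and map 'M' to 'MT' so both naming styles agree."""
    if name.lower().startswith("chr"):
        name = name[3:]
    return "MT" if name == "M" else name

def is_primary(chrom):
    """Keep autosomes 1-22, X, Y and MT; drop alternate/random contigs."""
    if chrom in {"X", "Y", "MT"}:
        return True
    return chrom.isdigit() and 1 <= int(chrom) <= 22

def chrom_sort_key(chrom):
    """Autosomes in numeric order, followed by X, Y and MT."""
    special = {"X": 23, "Y": 24, "MT": 25}
    return int(chrom) if chrom.isdigit() else special[chrom]

def count_variants(lines):
    """Count variant records per primary chromosome from VCF text lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/meta lines and blank lines
        chrom = normalize_chrom(line.split("\t", 1)[0])
        if is_primary(chrom):
            counts[chrom] += 1
    ordered = sorted(counts.items(), key=lambda item: chrom_sort_key(item[0]))
    return dict(ordered)

# Hypothetical miniature VCF used only to exercise the functions.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr2\t100\t.\tA\tG",
    "chr1\t50\t.\tC\tT",
    "chrX\t10\t.\tG\tA",
    "chr1\t75\t.\tT\tC",
    "chr1_KI270706v1_random\t5\t.\tA\tT",
    "chrM\t1\t.\tG\tA",
]
variant_counts = count_variants(example_vcf)
print(variant_counts)
```

The resulting dictionary preserves chromosome order, so its keys and values could be passed directly to a bar plot.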
-
run_ohana.sh: Expanding on the analysis above, this bash script performs an admixture-aware scan for allele frequency differentiation using variant data from Phase 3 of the 1000 Genomes Project. Specifically, the script runs the "selscan" tool of Ohana (Cheng et al., 2022; PMID: 34626111) to quantify, for each variant, whether its allele frequencies are better explained by the genome-wide covariance matrix or by an alternative covariance matrix in which allele frequencies are allowed to vary in one of the eight ancestry components at a time. Higher values of the log-likelihood ratio (lle_ratio) reflect support for the latter model. The corresponding inferred matrix of ancestry component proportions per sample is also provided for interpretation. This Q matrix is visualized in Figure 3A of Yan et al. (2021; PMID: 34528508) and demonstrates that ancestry components 1, 4, and 8 are highly represented among 1000 Genomes Project samples from Admixed American populations. Thus, variants with large log-likelihood ratios for ancestry components 1, 4, and 8 exhibit strong allele frequency differentiation between these respective ancestry components and other ancestry components relative to genome-wide averages. Due to GitHub file size constraints, we released the results of this analysis on Zenodo at: https://doi.org/10.5281/zenodo.15775609.
-
variant_enrichment.ipynb: This notebook performs population-specific variant enrichment analysis using genotype data from the 1000 Genomes Project. To keep data sizes tractable, it uses a subset of the variants on human chromosome 22. It begins by downloading and loading the VCF file and the corresponding sample population metadata, which maps each sample to one of five continental superpopulations (AFR, AMR, EAS, EUR, SAS). The notebook processes the first 10,000 variants in the VCF, calculating allele counts for each population and performing a two-sided Fisher’s exact test to determine whether any single population shows significant enrichment or depletion of the alternative allele compared to all others. For each population and variant, it computes the p-value and the log₂ fold change in allele frequency compared to the most similar population, and records population-specific allele frequencies. Significant variants are stored per population if they meet criteria for allele frequency difference and statistical significance. Finally, the results are visualized using a volcano plot with embedded pie charts that illustrate the allele frequency distribution across populations for the most significant variants.
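The per-variant test can be illustrated with a small, self-contained sketch. It is not the notebook's code: the two-sided Fisher's exact test is implemented directly from the hypergeometric distribution so that only the standard library is needed. The function names, the pseudocount, the reading of "most similar population" as the one with the closest allele frequency, and the example counts are all assumptions made for illustration.

```python
from math import comb, log2

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x alt alleles in the first row.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    probs = [prob(x) for x in range(lo, hi + 1)]
    p_obs = prob(a)
    # Sum every table that is no more probable than the observed one.
    return min(1.0, sum(p for p in probs if p <= p_obs * (1 + 1e-9)))

def enrichment(allele_counts, target, pseudo=1e-4):
    """allele_counts maps population -> (alt_count, ref_count).

    Returns (p_value, log2_fold_change, nearest_population) for `target`
    tested against all other populations pooled together.
    """
    alt_t, ref_t = allele_counts[target]
    others = [v for pop, v in allele_counts.items() if pop != target]
    alt_o = sum(alt for alt, _ in others)
    ref_o = sum(ref for _, ref in others)
    p = fisher_exact_two_sided(alt_t, ref_t, alt_o, ref_o)

    freqs = {pop: alt / (alt + ref) for pop, (alt, ref) in allele_counts.items()}
    # Assumed reading of "most similar": closest allele frequency to the target.
    nearest = min(
        (pop for pop in freqs if pop != target),
        key=lambda pop: abs(freqs[pop] - freqs[target]),
    )
    # The pseudocount (an assumption) avoids taking the log of zero.
    lfc = log2((freqs[target] + pseudo) / (freqs[nearest] + pseudo))
    return p, lfc, nearest

# Hypothetical allele counts for a single variant.
example_counts = {
    "AFR": (2, 98), "AMR": (40, 60), "EAS": (5, 95),
    "EUR": (10, 90), "SAS": (4, 96),
}
p_value, lfc, nearest = enrichment(example_counts, "AMR")
print(p_value, lfc, nearest)
```

For full-size cohorts a library routine such as SciPy's `fisher_exact` would be considerably faster; the pure-Python version is used here only to keep the example dependency-free.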
-
DifferentialExpression.Rmd: This R Markdown script describes the procedures used to generate differential gene expression (DGE) results for the AMR continental group (as a proxy for individuals with Latin American ancestry). The analysis uses MAGE, an RNA-seq data set generated from lymphoblastoid cell lines derived from 731 individuals from the 1000 Genomes Project (1KGP). The formats of the MAGE input files are described in the script as they become relevant. The DESeq2 pipeline broadly follows the DESeq2 vignette, with some modifications to account for the complex design formula and the factor contrasts needed to extract differential expression for the AMR continental group.
-
ncpi_allele_freqs.ipynb: This notebook analyzes fine-mapped expression and splicing quantitative trait loci (eQTLs and sQTLs) from the MAGE dataset to explore allele frequency differences across global populations using 1000 Genomes Project data. It begins by downloading fine-mapped QTL tables, sample metadata, and population-specific VCF files. It then computes alternative allele frequencies (AFs) for each variant across unrelated individuals in 26 populations, grouped into superpopulations (AFR, EUR, SAS, EAS, AMR). By merging AF data with fine-mapping results and filtering for high-confidence QTLs (PIP > 0.95), it constructs allele frequency matrices and visualizes them as heatmaps sorted by divergence from AMR populations. Finally, it identifies and exports the top 25 eQTLs and sQTLs with the greatest AMR enrichment, helping highlight potentially ancestry-informative regulatory variants.
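A rough sketch of the filtering and ranking logic is shown below. It is illustrative only: the record layout (`variant`, `pip`), the function names, and in particular the enrichment score (AMR allele frequency minus the mean of the other superpopulations) are assumptions, since the notebook's exact divergence measure is not stated here.

```python
def alt_allele_frequency(genotypes):
    """Alternative allele frequency from genotype strings such as '0|1'.

    Missing alleles ('.') are ignored; only allele '1' is counted as ALT,
    which assumes biallelic sites.
    """
    alleles = [
        allele
        for gt in genotypes
        for allele in gt.replace("/", "|").split("|")
        if allele != "."
    ]
    if not alleles:
        return float("nan")
    return sum(allele == "1" for allele in alleles) / len(alleles)

def top_amr_enriched(qtls, af_by_superpop, pip_cutoff=0.95, top_n=25):
    """Rank high-confidence QTLs by an assumed AMR enrichment score.

    qtls: list of dicts with 'variant' and 'pip' keys (hypothetical layout).
    af_by_superpop: variant -> {superpopulation: allele frequency}.
    """
    scored = []
    for qtl in qtls:
        if qtl["pip"] <= pip_cutoff:
            continue  # keep only PIP > cutoff
        afs = af_by_superpop[qtl["variant"]]
        others = [af for pop, af in afs.items() if pop != "AMR"]
        score = afs["AMR"] - sum(others) / len(others)
        scored.append((qtl["variant"], score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]

# Hypothetical inputs used only to exercise the functions.
example_qtls = [
    {"variant": "v1", "pip": 0.99},
    {"variant": "v2", "pip": 0.50},
    {"variant": "v3", "pip": 0.97},
]
example_afs = {
    "v1": {"AMR": 0.5, "AFR": 0.1, "EUR": 0.1, "EAS": 0.1, "SAS": 0.1},
    "v2": {"AMR": 0.9, "AFR": 0.1, "EUR": 0.1, "EAS": 0.1, "SAS": 0.1},
    "v3": {"AMR": 0.2, "AFR": 0.1, "EUR": 0.3, "EAS": 0.2, "SAS": 0.2},
}
top = top_amr_enriched(example_qtls, example_afs)
print(top)
```

In this toy input, `v2` has the largest frequency difference but is dropped by the PIP filter, which is the behavior the high-confidence cutoff is meant to enforce.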
-
DESeq2_Interaction_Analysis_Notebook.Rmd: This R Markdown notebook performs a differential gene expression analysis using the DESeq2 package. It begins by loading RNA-seq count data and metadata, then formats the data for analysis. Lowly expressed genes are filtered out based on a custom threshold function. The notebook is structured to support modeling complex experimental designs through the full DESeq2 workflow and result interpretation steps.
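The low-expression filter can be illustrated with a short sketch. The notebook itself is written in R and its threshold function is not described here, so the rule below (keep genes with at least `min_count` reads in at least `min_samples` samples) is an assumed, commonly used criterion written in Python purely for illustration. The function name, default values, and example counts are hypothetical.

```python
def filter_low_expression(counts, min_count=10, min_samples=3):
    """Drop genes that are lowly expressed across samples.

    counts maps gene ID -> list of raw read counts, one per sample.
    A gene is kept when at least `min_samples` samples have a count of
    `min_count` or more (an assumed rule, not the notebook's own).
    """
    return {
        gene: row
        for gene, row in counts.items()
        if sum(count >= min_count for count in row) >= min_samples
    }

# Hypothetical count matrix with four samples.
example_counts = {
    "GENE_A": [0, 2, 1, 0],
    "GENE_B": [15, 30, 12, 8],
    "GENE_C": [50, 0, 0, 11],
}
kept = filter_low_expression(example_counts)
print(sorted(kept))
```

Filtering on raw counts before model fitting mainly reduces the number of tests and memory use; the cutoffs would normally be tuned to the size of the smallest group in the design.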