Project Name: Lymphoid neoplasms (LN) phenocluster
Description: This repository contains the scripts and pipeline used in the LN phenocluster manuscript. Please follow the scripts listed below to generate the analysis results described in the manuscript.
Citation: "M. Guler & F. Canzian, Clustering of Lymphoid Neoplasms by Cell of Origin, Somatic Mutation and Drug Usage Profiles: A Multi-trait Genome-Wide Association Study, submitted(doi:will be updated), 2025"
Software | Version | Link/Citation |
---|---|---|
regenie | 3.2.1 | regenie |
plink | 1.90b6.21 | plink |
plink2 | 2.00a3LM | plink2 |
R | 4.3.0 or above | R |
METAL | 2020-05-05 | METAL |
LDSC | v1.0.1 | LDSC |
GWASLab | v3.4.31 | GWASLab |
FLAMES | v1.1.2 | FLAMES |
R Package | Version | Link/Citation |
---|---|---|
ASSET | 2.18.0 | ASSET |
tidyverse | 2.0.0 | tidyverse |
ggplot2 | 3.1.3.1 | gplots |
dendextend | 1.17.1 | dendextend |
data.table | 1.14.10 | data.table |
parallel | 4.3.0 | parallel |
magrittr | 2.0.3 | magrittr |
colorspace | 2.1.0 | colorspace |
susieR | 0.12.35 | susieR |
HyPrColoc | 0.0.2 | HyPrColoc |
To reproduce the results and figures presented in the manuscript, nearly all scripts used are provided in this repository.
The scripts provided in the folders from 01 to 10 and data or example input files in the 00_data folder.
List of scripts and their functions for Step 1:
Script | Function |
---|---|
00 | Compare hclust methods |
01 | Create phenoclusters and visualize |
02 | Runner script for script 00 and 01 |
The data for somatic mutation patterns (SData1.txt) and approved drugs (SData2.txt) were accessed on cBioPortal and the Open Targets Platform, respectively.
For somatic mutations, each LN phenotype and somatically mutated genes were downloaded from cBioPortal. The LN subtypes-mutated gene binary matrix (0,1) was created based on detected mutated genes without their frequency or position, and the data folder created LN-mutated genes matrix provided as SData1.txt.
For the approved drugs, each phenotype was searched on the Open Targets platform, and each drug (at least phase 3 & 4) was manually searched on public databases to check FDA or EMA approval for clinical usage. The LN subtypes-drug binary matrix (0,1) was created based on this information, and the data folder created LN-drug matrix provided as SData2.txt.
After creating these two data files, hierarchical clustering (hclust) analysis methods were compared using the 00_Compare_hclust_method.R script. The script compares hierarchical clustering (hclust) algorithms by generating correlation plots and calculating the Fowlkes-Mallows Index. The resulting dendrograms are analyzed for cophenetic correlation (Pearson correlation coefficient) and Fowlkes-Mallows Index for different cluster counts (k=3, 4, 5). Finally, the script visualizes these correlations using correlation plots saved as TIFF images.
Note
For somatic mutation data method comparison done with nearly 10 000 genes that are mutated in more than 20% of subtypes due to computational limitations. But, in the next step whole data set used.
After running the 00_Compare_hclust_method.R script, the selected hclust method is used in the 01_LNcluster.R script to generate and visualize phenoclusters. The script creates dendrograms using Ward's method and the Jaccard similarity coefficient (in R dist(method = "binary")). The dendrograms are then used to create heatmaps (heatmap.2 from the gplots package) that visualize relationships between drugs or genes and LN subtypes. The resulting heatmaps are saved as TIFF images (drug_plot.tiff and somatic_plot.tiff).
To run these scripts, use the following commands:
Rscript --slave --no-restore --no-save 01_clustering/00_compare_hclust_methods.R
Rscript --slave --no-restore --no-save 01_clustering/01_phenocluster.R
You can run these commands in your R terminal (not console). If you have access to any HPC, you can submit these scripts with a job runner script. The script 03_LNcluster.bsub is created for IBM LSF job scheduler. This script can easily be converted to commonly used schedulers such as SLURM or PBS. An example SLURM script is provided below:
#!/bin/bash
#SBATCH --partition=long # Set your partition
#SBATCH --job-name=LNcluster # Name of the job
#SBATCH --ntasks=1 # Request a core
#SBATCH --mem=32G # Request a total memory limit of 32GB for the job
#SBATCH --output=/your/out/dir/LN_cluster.out # Give path for the out file
#SBATCH --error=/your/err/dir/LN_cluster.err # Give path for the err file
# Load modules
module load R/4.3.0
# Set script dir
cd /scripts
# Rscript
Rscript --slave --no-restore --no-save 01_clustering/00_compare_hclust_methods.R
Rscript --slave --no-restore --no-save 01_clustering/01_phenocluster.R
List of script and their functions for the Step 2:
Script | Function |
---|---|
03 | QC genotype for regenie step1 |
04 | REGENIE STEP1 |
05 | REGENIE STEP2 |
06 | Merge REGENIE OUTPUTS |
07 | Filter and format SUMSTATS |
In this step, provided scripts used to generate GWAS results from UKB. REGENIE uses a two-step approach. In the first step, original non-imputed genotype data is used, filtering only high-quality genotyped variants: minor allele frequency (MAF) > 1%, minor allele count (MAC) > 5, genotyping rate > 99%, Hardy-Weinberg equilibrium (HWE) test P > 1E−08, <1% missingness. The quality control of genotype data and filtering was done using plink2 software. The script 04_qc.sh combines all chromosomes and creates a list of SNPs that meet QC criteria.
bash 02_ukb_gwas/03_qc.sh
After the QC step, REGENIE step 1 can be run:
bsub < 02_ukb_gwas/04_regenie_step1.bsub -R "rusage[mem=32G]"
REGENIE step 1 will generate a ukb_step1_LM_pred.list file. Using this file and imputed genotype files (bgen), step 2 can be run:
bsub < 02_ukb_gwas/05_regenie_step2.bsub -R "rusage[mem=32G]"
Step 2 will generate GWAS results for each phenotype separated by chromosomes. These files need to be merged by phenotype. To do this, run the merger script:
bash 02_ukb_gwas/06_merge_regenie_outputs.sh
Finally, GWAS summary statistics can be filtered and formatted. The script 08_filter_format_sumstats.bsub creates unique SNP IDs (CHR:POS:REF), converts -log10P to P, and filters Info_score and SPA correction failed tests.
bsub < 02_ukb_gwas/07_filter_format_sumstats.bsub -R "rusage[mem=32G]"
Script | Function |
---|---|
08 | METAL parameter |
09 | Meta-analysis with METAL |
To combine results from UKB, FinnGen and MVP, we performed a fixed-effect meta-analysis using METAL.
bsub < 03_meta-analysis/10_Metal.bsub -R "rusage[mem=8G]"
List of script and their functions for the Step 3:
Script | Function |
---|---|
10 | Parallel ASSET |
To identify pleiotropic variants in a hypothesis-free manner, we conducted an 'Association analysis based on the SubSETs' approach, called ASSET. ASSET is a collection of statistical methods designed to combine association signals from multiple studies or traits, especially when effects are present in only some studies and may be in opposite directions. This tool searches through all potential subsets of studies, adjusts for multiple testing, and identifies the most significant subset contributing to the overall association, while accounting for correlations due to overlapping participants. We ran the ASSET analysis using a custom R script, which enables parallel computation:
Rscript --slave --no-restore --no-save 04_asset/10_asset_parallel.R
Script | Function |
---|---|
11 | GWASlab |
12 | LDSC Munge |
13 | LDSC |
14 | Combine LDSC Results |
To understand the genetic correlation between LN subtypes and the created phenoclusters, we performed linkage disequilibrium score regression (LDSC) using LDSC v1.0.1 software. The GWAS summary statistics for LN subtypes and phenoclusters were formatted using the munge_sumstats Python script from LDSC to estimate genetic correlation with HapMap3 variants, as recommended. For the genetic correlation estimation, SNPs with a minor allele frequency (MAF) of less than 5% were excluded, and the MHC region (chr6: 25-35 Mb) was also excluded from this analysis. We downloaded the pre-computed linkage disequilibrium (LD) scores for European ancestry from the Alkes Group website here.
a) GWASlab to format sumstats
bsub < scripts/11_gwaslab_ldsc.bsub -R "rusage[mem=16G]"
b) LDSC munge to format
bsub < scripts/12_ldsc_munge.bsub -R "rusage[mem=8G]"
c) LDSC
bsub < scripts/13_ldsc.bsub -R "rusage[mem=8G]"
d) Combine results
bash 14_combine_ldsc_results.sh
Script | Function |
---|---|
15 | HyPrColoc Analysis |
16 | Job Submission for HyPrColoc |
17 | Collect HyPrColoc Results |
To identify shared genetic signals across LN subtypes and phenoclusters, we performed colocalization analysis using the HyPrColoc R package. This method enables simultaneous analysis of multiple traits to detect clusters of traits that share common causal variants.
HyPrColoc was run using pre-merged summary statistics (single_merged.RData
) and a list of LD-independent regions defined in LN_regions.tab
. For each region, traits were filtered using a p-value threshold, and beta/SE matrices were constructed for colocalization.
Rscript 06_hyprcoloc/15_hyprcoloc.R single_merged.RData LN_regions.tab 1 LN_regions
#or change arguments:
Rscript 06_hyprcoloc/15_hyprcoloc.R merged_data.RData regions.tab [pval_threshold] [output_prefix]
bsub < 06_hyprcoloc/16_hyprcoloc.bsub -R "rusage[mem=16G]"
Rscript 06_hyprcoloc/17_hyprcoloc_results.R
Script | Function |
---|---|
18 | LD Loader – Loads .npz LD matrices and SNP metadata, saves to .feather and .csv |
19 | Process LD File – Calls the loader for a single LD file |
20 | Batch Convert LD Files to Feather – LSF array job to process LD files in parallel |
21 | Run SuSiE Fine-Mapping – Runs SuSiE for each locus and generates summary + plots |
22 | Submit SuSiE Jobs – LSF array submission for 186 loci |
23 | Merge SuSiE Results – Aggregates all credible SNPs and sets into tabular outputs |
We used the SuSiE (Sum of Single Effects) model to fine-map associated loci. This approach integrates GWAS summary statistics and linkage disequilibrium (LD) data to identify credible SNPs contributing to signals in complex traits. The pipeline includes LD conversion, SuSiE execution per region, visualization, and results merging.
LD matrices are stored in .npz
sparse format along with SNP metadata in .gz
. We convert these into .feather
and .csv
formats using the ld_loader
module.
Note
The npz to feather format conversion adapted from polyfun. Thank you to Omer Weissbrod!
Submit batch job to process LD files in parallel:
bsub < 07_susie/20_ld_file2feather.bsub -R "rusage[mem=16G]"
SuSiE is run region-by-region using locus-specific GWAS summary statistics and LD matrices. This is done using 21_susier.R, with configuration passed via finemap_file.tab.
Example: finemap_file.tab
Index | Subtype | Lead | CHR | POS | type | sumstatpath | ldfile_path | n | output_name |
---|---|---|---|---|---|---|---|---|---|
1 | CLL | 1:23943735:C:A | 1 | 23943735 | single | /path/to/CLL_sumstats.txt.gz | /path/to/R_chr1_23000001_26000001.feather | 1108777 | CLL_1 |
2 | CLL | 1:40132795:T:G | 1 | 40132795 | single | /path/to/CLL_sumstats.txt.gz | /path/to/R_chr1_39000001_42000001.feather | 1108777 | CLL_2 |
3 | Cell-B | 1:78086718:C:T | 1 | 78086718 | phenocluster | /path/to/CellB_sumstats.txt.gz | /path/to/R_chr1_77000001_80000001.feather | 1118280 | Cell-B_3 |
4 | LN | 1:78086718:C:T | 1 | 78086718 | phenocluster | /path/to/LN_sumstats.txt.gz | /path/to/R_chr1_77000001_80000001.feather | 1258753 | LN_4 |
5 | Drug-G1 | 1:78450517:C:A | 1 | 78450517 | phenocluster | /path/to/DrugG1_sumstats.txt.gz | /path/to/R_chr1_77000001_80000001.feather | 1122035 | Drug-G1_5 |
6 | Soma-G2 | 1:78450517:C:A | 1 | 78450517 | phenocluster | /path/to/SomaG2_sumstats.txt.gz | /path/to/R_chr1_77000001_80000001.feather | 1116131 | Soma-G2_6 |
Submit an LSF job array for all regions:
bsub < 07_susie/22_susier.bsub -R "rusage[mem=16G]"
For each region, SuSiE outputs the following:
*_credible_snps.tab – SNPs with SuSiE PIPs, R², and credible set assignments
*_credible_set_info.tab – Metadata about credible sets
*_fm_regionalplot.jpeg – Multi-panel plot showing: -log10(p), PIP, and gene annotations
After all jobs are completed, combine the SuSiE output files into unified results using:
Rscript 07_susie/23_merge_susie_results.R
We used FLAMES to annotate genomic features and prioritize relevant mechanisms for disease loci. The FLAMES pipeline integrates GWAS data with genomic annotations, performs predictive modeling, and outputs candidate genes.
The scripts for all the plots given in the main article and supplementary information.
Note
This repository is created to enhance reproducibility. It will not be actively maintained.
Tip
If you need assistance at any step, please feel free to email me.
Important
It is not possible to share the full datasets used in this pipeline, such as UKBB data. Additionally, some steps require manual verification and the creation of data files.
Warning
This pipeline, or any part of it, should not be used for medical or diagnostic purposes.
Caution
If you use any part of this pipeline in your work, please make sure to cite the original work software/package and if possible our work.