Skip to content

biomguler/LN_Phenocluster

Repository files navigation

LN Phenocluster Manuscript Pipeline


Project Name: Lymphoid neoplasms (LN) phenocluster

Description: This repository contains the scripts and pipeline used in the LN phenocluster manuscript. Please follow the scripts listed below to generate the analysis results described in the manuscript.

Citation: "M. Guler & F. Canzian, Clustering of Lymphoid Neoplasms by Cell of Origin, Somatic Mutation and Drug Usage Profiles: A Multi-trait Genome-Wide Association Study, submitted(doi:will be updated), 2025"


Required Softwares

Software Version Link/Citation
regenie 3.2.1 regenie
plink 1.90b6.21 plink
plink2 2.00a3LM plink2
R 4.3.0 or above R
METAL 2020-05-05 METAL
LDSC v1.0.1 LDSC
GWASLab v3.4.31 GWASLab
FLAMES v1.1.2 FLAMES

Required R Packages

R Package Version Link/Citation
ASSET 2.18.0 ASSET
tidyverse 2.0.0 tidyverse
ggplot2 3.1.3.1 gplots
dendextend 1.17.1 dendextend
data.table 1.14.10 data.table
parallel 4.3.0 parallel
magrittr 2.0.3 magrittr
colorspace 2.1.0 colorspace
susieR 0.12.35 susieR
HyPrColoc 0.0.2 HyPrColoc

Steps and overview


To reproduce the results and figures presented in the manuscript, nearly all scripts used are provided in this repository.

The scripts provided in the folders from 01 to 10 and data or example input files in the 00_data folder.

1) Create phenoclusters

List of scripts and their functions for Step 1:

Script Function
00 Compare hclust methods
01 Create phenoclusters and visualize
02 Runner script for script 00 and 01

The data for somatic mutation patterns (SData1.txt) and approved drugs (SData2.txt) were accessed on cBioPortal and the Open Targets Platform, respectively.

For somatic mutations, each LN phenotype and somatically mutated genes were downloaded from cBioPortal. The LN subtypes-mutated gene binary matrix (0,1) was created based on detected mutated genes without their frequency or position, and the data folder created LN-mutated genes matrix provided as SData1.txt.

For the approved drugs, each phenotype was searched on the Open Targets platform, and each drug (at least phase 3 & 4) was manually searched on public databases to check FDA or EMA approval for clinical usage. The LN subtypes-drug binary matrix (0,1) was created based on this information, and the data folder created LN-drug matrix provided as SData2.txt.

After creating these two data files, hierarchical clustering (hclust) analysis methods were compared using the 00_Compare_hclust_method.R script. The script compares hierarchical clustering (hclust) algorithms by generating correlation plots and calculating the Fowlkes-Mallows Index. The resulting dendrograms are analyzed for cophenetic correlation (Pearson correlation coefficient) and Fowlkes-Mallows Index for different cluster counts (k=3, 4, 5). Finally, the script visualizes these correlations using correlation plots saved as TIFF images.

Note

For somatic mutation data method comparison done with nearly 10 000 genes that are mutated in more than 20% of subtypes due to computational limitations. But, in the next step whole data set used.

After running the 00_Compare_hclust_method.R script, the selected hclust method is used in the 01_LNcluster.R script to generate and visualize phenoclusters. The script creates dendrograms using Ward's method and the Jaccard similarity coefficient (in R dist(method = "binary")). The dendrograms are then used to create heatmaps (heatmap.2 from the gplots package) that visualize relationships between drugs or genes and LN subtypes. The resulting heatmaps are saved as TIFF images (drug_plot.tiff and somatic_plot.tiff).

To run these scripts, use the following commands:

Rscript --slave --no-restore --no-save 01_clustering/00_compare_hclust_methods.R
Rscript --slave --no-restore --no-save 01_clustering/01_phenocluster.R

You can run these commands in your R terminal (not console). If you have access to any HPC, you can submit these scripts with a job runner script. The script 03_LNcluster.bsub is created for IBM LSF job scheduler. This script can easily be converted to commonly used schedulers such as SLURM or PBS. An example SLURM script is provided below:

#!/bin/bash
#SBATCH --partition=long                    # Set your partition
#SBATCH --job-name=LNcluster                # Name of the job
#SBATCH --ntasks=1                          # Request a core
#SBATCH --mem=32G                           # Request a total memory limit of 32GB for the job
#SBATCH --output=/your/out/dir/LN_cluster.out  # Give path for the out file
#SBATCH --error=/your/err/dir/LN_cluster.err   # Give path for the err file

# Load modules
module load R/4.3.0

# Set script dir
cd /scripts

# Rscript
Rscript --slave --no-restore --no-save 01_clustering/00_compare_hclust_methods.R
Rscript --slave --no-restore --no-save 01_clustering/01_phenocluster.R


2) UKB-QC genetic data and GWAS with regenie

List of script and their functions for the Step 2:

Script Function
03 QC genotype for regenie step1
04 REGENIE STEP1
05 REGENIE STEP2
06 Merge REGENIE OUTPUTS
07 Filter and format SUMSTATS

In this step, provided scripts used to generate GWAS results from UKB. REGENIE uses a two-step approach. In the first step, original non-imputed genotype data is used, filtering only high-quality genotyped variants: minor allele frequency (MAF) > 1%, minor allele count (MAC) > 5, genotyping rate > 99%, Hardy-Weinberg equilibrium (HWE) test P > 1E−08, <1% missingness. The quality control of genotype data and filtering was done using plink2 software. The script 04_qc.sh combines all chromosomes and creates a list of SNPs that meet QC criteria.

bash 02_ukb_gwas/03_qc.sh

After the QC step, REGENIE step 1 can be run:

bsub < 02_ukb_gwas/04_regenie_step1.bsub -R "rusage[mem=32G]"

REGENIE step 1 will generate a ukb_step1_LM_pred.list file. Using this file and imputed genotype files (bgen), step 2 can be run:

bsub < 02_ukb_gwas/05_regenie_step2.bsub -R "rusage[mem=32G]"

Step 2 will generate GWAS results for each phenotype separated by chromosomes. These files need to be merged by phenotype. To do this, run the merger script:

bash 02_ukb_gwas/06_merge_regenie_outputs.sh

Finally, GWAS summary statistics can be filtered and formatted. The script 08_filter_format_sumstats.bsub creates unique SNP IDs (CHR:POS:REF), converts -log10P to P, and filters Info_score and SPA correction failed tests.

bsub < 02_ukb_gwas/07_filter_format_sumstats.bsub -R "rusage[mem=32G]"

3) Meta-analysis with METAL

Script Function
08 METAL parameter
09 Meta-analysis with METAL

To combine results from UKB, FinnGen and MVP, we performed a fixed-effect meta-analysis using METAL.

bsub < 03_meta-analysis/10_Metal.bsub -R "rusage[mem=8G]"

4) ASSET

List of script and their functions for the Step 3:

Script Function
10 Parallel ASSET

To identify pleiotropic variants in a hypothesis-free manner, we conducted an 'Association analysis based on the SubSETs' approach, called ASSET. ASSET is a collection of statistical methods designed to combine association signals from multiple studies or traits, especially when effects are present in only some studies and may be in opposite directions. This tool searches through all potential subsets of studies, adjusts for multiple testing, and identifies the most significant subset contributing to the overall association, while accounting for correlations due to overlapping participants. We ran the ASSET analysis using a custom R script, which enables parallel computation:

Rscript --slave --no-restore --no-save 04_asset/10_asset_parallel.R

5) LDSC

Script Function
11 GWASlab
12 LDSC Munge
13 LDSC
14 Combine LDSC Results

To understand the genetic correlation between LN subtypes and the created phenoclusters, we performed linkage disequilibrium score regression (LDSC) using LDSC v1.0.1 software. The GWAS summary statistics for LN subtypes and phenoclusters were formatted using the munge_sumstats Python script from LDSC to estimate genetic correlation with HapMap3 variants, as recommended. For the genetic correlation estimation, SNPs with a minor allele frequency (MAF) of less than 5% were excluded, and the MHC region (chr6: 25-35 Mb) was also excluded from this analysis. We downloaded the pre-computed linkage disequilibrium (LD) scores for European ancestry from the Alkes Group website here.

a) GWASlab to format sumstats

bsub < scripts/11_gwaslab_ldsc.bsub -R "rusage[mem=16G]"

b) LDSC munge to format

bsub < scripts/12_ldsc_munge.bsub -R "rusage[mem=8G]"

c) LDSC

bsub < scripts/13_ldsc.bsub -R "rusage[mem=8G]"

d) Combine results

bash 14_combine_ldsc_results.sh

6) HyPrColoc

Script Function
15 HyPrColoc Analysis
16 Job Submission for HyPrColoc
17 Collect HyPrColoc Results

To identify shared genetic signals across LN subtypes and phenoclusters, we performed colocalization analysis using the HyPrColoc R package. This method enables simultaneous analysis of multiple traits to detect clusters of traits that share common causal variants.

HyPrColoc was run using pre-merged summary statistics (single_merged.RData) and a list of LD-independent regions defined in LN_regions.tab. For each region, traits were filtered using a p-value threshold, and beta/SE matrices were constructed for colocalization.


a) HyPrColoc Analysis per Region
Rscript 06_hyprcoloc/15_hyprcoloc.R single_merged.RData LN_regions.tab 1 LN_regions
#or change arguments:
Rscript 06_hyprcoloc/15_hyprcoloc.R merged_data.RData regions.tab [pval_threshold] [output_prefix]
b) Submit Job Array (321 Regions)
bsub < 06_hyprcoloc/16_hyprcoloc.bsub -R "rusage[mem=16G]"
c) Combine Results
Rscript 06_hyprcoloc/17_hyprcoloc_results.R

7) SuSiE Fine-Mapping

Script Function
18 LD Loader – Loads .npz LD matrices and SNP metadata, saves to .feather and .csv
19 Process LD File – Calls the loader for a single LD file
20 Batch Convert LD Files to Feather – LSF array job to process LD files in parallel
21 Run SuSiE Fine-Mapping – Runs SuSiE for each locus and generates summary + plots
22 Submit SuSiE Jobs – LSF array submission for 186 loci
23 Merge SuSiE Results – Aggregates all credible SNPs and sets into tabular outputs

We used the SuSiE (Sum of Single Effects) model to fine-map associated loci. This approach integrates GWAS summary statistics and linkage disequilibrium (LD) data to identify credible SNPs contributing to signals in complex traits. The pipeline includes LD conversion, SuSiE execution per region, visualization, and results merging.

a) Convert LD Matrices from .npz to .feather

LD matrices are stored in .npz sparse format along with SNP metadata in .gz. We convert these into .feather and .csv formats using the ld_loader module.

Note

The npz to feather format conversion adapted from polyfun. Thank you to Omer Weissbrod!

Submit batch job to process LD files in parallel:

bsub < 07_susie/20_ld_file2feather.bsub -R "rusage[mem=16G]"
b) Run SuSiE Fine-Mapping per Locus

SuSiE is run region-by-region using locus-specific GWAS summary statistics and LD matrices. This is done using 21_susier.R, with configuration passed via finemap_file.tab.

Example: finemap_file.tab

Index Subtype Lead CHR POS type sumstatpath ldfile_path n output_name
1 CLL 1:23943735:C:A 1 23943735 single /path/to/CLL_sumstats.txt.gz /path/to/R_chr1_23000001_26000001.feather 1108777 CLL_1
2 CLL 1:40132795:T:G 1 40132795 single /path/to/CLL_sumstats.txt.gz /path/to/R_chr1_39000001_42000001.feather 1108777 CLL_2
3 Cell-B 1:78086718:C:T 1 78086718 phenocluster /path/to/CellB_sumstats.txt.gz /path/to/R_chr1_77000001_80000001.feather 1118280 Cell-B_3
4 LN 1:78086718:C:T 1 78086718 phenocluster /path/to/LN_sumstats.txt.gz /path/to/R_chr1_77000001_80000001.feather 1258753 LN_4
5 Drug-G1 1:78450517:C:A 1 78450517 phenocluster /path/to/DrugG1_sumstats.txt.gz /path/to/R_chr1_77000001_80000001.feather 1122035 Drug-G1_5
6 Soma-G2 1:78450517:C:A 1 78450517 phenocluster /path/to/SomaG2_sumstats.txt.gz /path/to/R_chr1_77000001_80000001.feather 1116131 Soma-G2_6

Submit an LSF job array for all regions:

bsub < 07_susie/22_susier.bsub -R "rusage[mem=16G]"
c) Output Files (per locus)

For each region, SuSiE outputs the following:

*_credible_snps.tab – SNPs with SuSiE PIPs, R², and credible set assignments

*_credible_set_info.tab – Metadata about credible sets

*_fm_regionalplot.jpeg – Multi-panel plot showing: -log10(p), PIP, and gene annotations

d) Merge Results Across Loci

After all jobs are completed, combine the SuSiE output files into unified results using:

Rscript 07_susie/23_merge_susie_results.R

8) FLAMES

Script Function
24 Converts GWAS loci to indexed sets for downstream use
25 Generates index files mapping loci to phenotypes
26 LSF array job to run feature selection and annotation pipeline for each phenotype

We used FLAMES to annotate genomic features and prioritize relevant mechanisms for disease loci. The FLAMES pipeline integrates GWAS data with genomic annotations, performs predictive modeling, and outputs candidate genes.


9) Plots

The scripts for all the plots given in the main article and supplementary information.


Note

This repository is created to enhance reproducibility. It will not be actively maintained.

Tip

If you need assistance at any step, please feel free to email me.

Important

It is not possible to share the full datasets used in this pipeline, such as UKBB data. Additionally, some steps require manual verification and the creation of data files.

Warning

This pipeline, or any part of it, should not be used for medical or diagnostic purposes.

Caution

If you use any part of this pipeline in your work, please make sure to cite the original work software/package and if possible our work.

About

Scripts and pipeline used in the LN phenocluster manuscript

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published