-
Notifications
You must be signed in to change notification settings - Fork 26
Description
Hi BayesSpace team,
While reproducing results from a 2024 DLPFC paper, I’m seeing large differences in clustering assignments between BayesSpace v1.5.1 and the Bioconductor release v1.16.0.
Using the same public dataset via spatialLIBD's spatialLIBD::fetch_data()
the Adjusted Rand Index (ARI) between the paper’s assignments and my run with v1.16.0 ranges ~0.17–0.43 across samples (poor alignment). In contrast, using v1.5.1 (the version used in the paper), ARIs against the published assignments are a perfect 1.0 for all samples.
In my (relatively inexperienced and qualitative) opinion, the clustering results from 1.5.1 were cleaner and more biologically grounded.
I’m opening this as a discussion to ask whether these differences are expected or indicative of a regression/behavior change that isn’t obvious from the changelog (NEWS.md)
Reproducible script
This script:
- Pulls the data with
spatialLIBD::fetch_data()
- Sets a seed (same as paper)
- Runs clustering
- Computes and prints ARI for each sample vs. the paper’s
BayesSpace_harmony_09
column - Prints reproducibility information (
OS
,BiocManager
version,R
version, and package versions viasession_info()
)
library(BayesSpace)
library(mclust)
library(spatialLIBD)
library(sessioninfo)
library(glue)
# load("/data/zusers/kresgeb/psych_encode/spatialLIBD_fetch_data/2024.RData", verbose = TRUE)
spe <- spatialLIBD::fetch_data(type = "spatialDLPFC_Visium")
bayesspace_version <- packageVersion("BayesSpace")
output_csv <- glue("/zata/zippy/kresgeb/clustering_comparison/results/{bayesspace_version}_output_cluster_assignments.csv")
seed <- 030122
set.seed(seed)
message(paste("BayesSpace Version:", packageVersion("BayesSpace")))
message((paste("Seed:", seed)))
message(paste("BayesSpace clustering started at:", Sys.time()))
# Run BayesSpace
message("Running BayesSpace spatialCluster...")
spe <- spatialCluster(
spe,
use.dimred = "HARMONY", # Use "PCA" if HARMONY is not available
q = 9,
nrep = 10000,
)
# Write cluster assignments to CSV
message("Saving cluster assignments...")
df <- data.frame(
barcode = colnames(spe),
sample_id = colData(spe)$sample_id,
cluster = colData(spe)$spatial.cluster
)
write.csv(df, file = output_csv, row.names = FALSE)
message(paste("BayesSpace clustering finished at:", Sys.time()))
# Get sample IDs
samples <- unique(colData(spe)$sample_id)
# Compute ARI for each sample
ari_per_sample <- sapply(samples, function(sid) {
idx <- colData(spe)$sample_id == sid
cl1 <- colData(spe)$spatial.cluster[idx]
cl2 <- colData(spe)$BayesSpace_harmony_09[idx]
# Check for at least 2 unique labels in each clustering
if (length(unique(cl1)) > 1 && length(unique(cl2)) > 1) {
adjustedRandIndex(cl1, cl2)
} else {
NA # Not defined if only one cluster
}
})
# Convert to data.frame for easier use
ari_df <- data.frame(
sample_id = names(ari_per_sample),
ARI = unname(ari_per_sample)
)
print(ari_df)
## Reproducibility information
print("Reproducibility information:")
Sys.time()
proc.time()
options(width = 120)
session_info()
Runs performed
I ran exactly the same script twice, changing only the BayesSpace version:
- v1.5.1 (as used in the paper) installed via
devtools::install_github("edward130603/BayesSpace", ref = "v1.5.1")
- v1.16.0 (installed via Bioconductor at time of testing using
BiocManager::install("BayesSpace")
)
Happy to run further checks or variations if that would help narrow this down.
Best,
Ben