Skip to content

Major differences in cluster assignment between 1.5.1 and 1.16.0 #150

@bgkresge

Description

@bgkresge

Hi BayesSpace team,

While reproducing results from a 2024 DLPFC paper, I’m seeing large differences in clustering assignments between BayesSpace v1.5.1 and the Bioconductor release v1.16.0.

Using the same public dataset via spatialLIBD's spatialLIBD::fetch_data() the Adjusted Rand Index (ARI) between the paper’s assignments and my run with v1.16.0 ranges ~0.17–0.43 across samples (poor alignment). In contrast, using v1.5.1 (the version used in the paper), ARIs against the published assignments are a perfect 1.0 for all samples.

In my (relatively inexperienced and qualitative) opinion, the clustering results from 1.5.1 were cleaner and more biologically grounded.

I’m opening this as a discussion to ask whether these differences are expected or indicative of a regression/behavior change that isn’t obvious from the changelog (NEWS.md)

Reproducible script

This script:

  • Pulls the data with spatialLIBD::fetch_data()
  • Sets a seed (same as paper)
  • Runs clustering
  • Computes and prints ARI for each sample vs. the paper’s BayesSpace_harmony_09 column
  • Prints reproducibility information (OS, BiocManager version, R version, and package versions via session_info())
library(BayesSpace)
library(mclust)
library(spatialLIBD)
library(sessioninfo)
library(glue)

# load("/data/zusers/kresgeb/psych_encode/spatialLIBD_fetch_data/2024.RData", verbose = TRUE)

spe <- spatialLIBD::fetch_data(type = "spatialDLPFC_Visium")

bayesspace_version <- packageVersion("BayesSpace")

output_csv <- glue("/zata/zippy/kresgeb/clustering_comparison/results/{bayesspace_version}_output_cluster_assignments.csv")

seed <- 030122

set.seed(seed)

message(paste("BayesSpace Version:", packageVersion("BayesSpace")))

message((paste("Seed:", seed)))

message(paste("BayesSpace clustering started at:", Sys.time()))

# Run BayesSpace
message("Running BayesSpace spatialCluster...")
spe <- spatialCluster(
    spe,
    use.dimred = "HARMONY", # Use "PCA" if HARMONY is not available
    q = 9,
    nrep = 10000,
)

# Write cluster assignments to CSV
message("Saving cluster assignments...")
df <- data.frame(
    barcode = colnames(spe),
    sample_id = colData(spe)$sample_id,
    cluster = colData(spe)$spatial.cluster
)
write.csv(df, file = output_csv, row.names = FALSE)

message(paste("BayesSpace clustering finished at:", Sys.time()))

# Get sample IDs
samples <- unique(colData(spe)$sample_id)

# Compute ARI for each sample
ari_per_sample <- sapply(samples, function(sid) {
    idx <- colData(spe)$sample_id == sid
    cl1 <- colData(spe)$spatial.cluster[idx]
    cl2 <- colData(spe)$BayesSpace_harmony_09[idx]

    # Check for at least 2 unique labels in each clustering
    if (length(unique(cl1)) > 1 && length(unique(cl2)) > 1) {
        adjustedRandIndex(cl1, cl2)
    } else {
        NA # Not defined if only one cluster
    }
})

# Convert to data.frame for easier use
ari_df <- data.frame(
    sample_id = names(ari_per_sample),
    ARI = unname(ari_per_sample)
)

print(ari_df)

## Reproducibility information
print("Reproducibility information:")
Sys.time()
proc.time()
options(width = 120)
session_info()

Runs performed

I ran exactly the same script twice, changing only the BayesSpace version:

Happy to run further checks or variations if that would help narrow this down.

Best,
Ben

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions