
seqscope/cartostore

CartoStore: Cross-Platform Repository for High-resolution Spatial Transcriptomics Datasets

Summary

CartoStore hosts a uniformly processed collection of heterogeneous spatial omics datasets from various high-resolution platforms. CartoStore is designed for micron-resolution spatial omics technologies that assign individual molecular readouts (e.g., transcripts or proteins) to micron-scale spatial coordinates. Raw expression counts, histological images, and analysis results are stored in PMTiles format for cloud-friendly spatial access. CartoStore is compatible with most existing high-resolution spatial omics platforms, including sequencing-based platforms (e.g., Seq-Scope, Stereo-seq, Pixel-seq, and Visium HD) and imaging-based platforms (e.g., Xenium, MERSCOPE, CosMx SMI, and STARmap), and currently hosts 100+ published datasets.

About

This document outlines the format, structure, and content of a typical dataset hosted in CartoStore. It serves as a reference for organizing and sharing spatial omics datasets using a standardized data structure. This repository format is intended to facilitate interactive visualization and exploration of high-resolution spatial omics datasets with CartoStore-compatible web applications.

The aim of this document is to guide users in organizing and understanding dataset components defined by a YAML-based catalog file. While the catalog and its associated files can be constructed manually or with other tools, the cartloader pipeline can be used to simplify and automate this process.

An Example Dataset

To illustrate the organization and usage, this documentation refers to one specific dataset as an example: xenium-human-breast-cancer-si2024-20241224, which reprocesses the published Xenium breast cancer dataset using the cartloader pipeline. In addition to histological images and spatial gene expression, the dataset also incorporates outputs from a FICTURE analysis, which identifies expression patterns without relying on segmentation boundaries.

The repository is also publicly available at Zenodo with DOI: 10.5281/zenodo.15649152.

Data Overview

A typical CartoStore dataset should be contained in a single flat directory with all relevant files. Typical files include:

  • Images of histological staining (H&E, DAPI, polyT, etc.) and a rasterized view of overall molecule counts, in PMTiles format.
  • Spatial expression of individual genes/features at the original micron resolution in PMTiles.
  • Spatial factor analysis results from FICTURE or other platform-specific tools such as XeniumRanger, converted to PMTiles format.
  • Additional summary of spatial factors and gene-level statistics in TSV or JSON format.
  • A catalog file indexing all accessible components within the directory.

The catalog is a YAML file (catalog.yaml) that functions as the central index linking all related data files within the directory. This structure enables programmatic discovery and visualization of dataset components.

Here is a snippet from the YAML catalog (catalog.yaml):

id: xenium-human-breast-cancer-si2024-20241224
title: 10x Xenium Human Breast Cancer Analysis by FICTURE and XeniumRanger
assets:
  basemap:
    sge:
      default: dark
      dark: sge-mono-dark.pmtiles
      light: sge-mono-light.pmtiles
    HnE:
      dark: histology-hne.pmtiles
      ...
  factors:
  - id: t12-xsi2024xeniumhbc
    name: Published FICTURE analysis with 20 factors
    model_id: t12-xsi2024xeniumhbc
    model: t12-xsi2024xeniumhbc-model-matrix.tsv.gz
    rgb: t12-xsi2024xeniumhbc-rgb.tsv
    proj_id: t12-xsi2024xeniumhbc-p12-a4
    decode_id: t12-xsi2024xeniumhbc-p12-a4-r5
    post: t12-xsi2024xeniumhbc-p12-a4-r5-posterior-counts.tsv.gz
    info: t12-xsi2024xeniumhbc-p12-a4-r5-info.tsv
    de: t12-xsi2024xeniumhbc-p12-a4-r5-bulk-de.tsv
    pmtiles:
        hex_coarse: t12-xsi2024xeniumhbc.pmtiles
        hex_fine: t12-xsi2024xeniumhbc-p12-a4.pmtiles
        raster: t12-xsi2024xeniumhbc-p12-a4-r5-pixel-raster.pmtiles
  - id: xeniumranger
    name: Default XeniumRanger Analysis Output
    cells_id: xeniumranger
    rgb: xeniumranger-rgb.tsv
    de: xeniumranger-cells-bulk-de.tsv
    pmtiles:
      cells: xeniumranger-cells.pmtiles
      boundaries: xeniumranger-boundaries.pmtiles
  sge:
    all: genes_all.pmtiles
    bins:
    - genes_bin1.pmtiles
    - ...
    - genes_bin50.pmtiles
    counts: genes_bin_counts.json

Data Format

Because the structure of catalog.yaml reflects the logical grouping of all files, this documentation follows the same hierarchy to introduce the dataset's organization and to help users locate the files they need efficiently.

Note that all file paths specified in the YAML catalog are typically relative to the location of the YAML file itself.
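
Because catalog paths are relative to the catalog file, a consumer usually has to join them onto the catalog's directory (or base URL) before opening anything. The following is a minimal sketch of that step, not part of CartoStore or cartloader. It assumes the catalog has already been parsed into nested dicts and lists (for example with PyYAML's `yaml.safe_load`), and the helper name `resolve_paths`, the suffix list, and the base directory are hypothetical.

```python
import posixpath

# File extensions treated as data-file references. This list is an assumption
# based on the file types that appear in the example catalog.
DATA_SUFFIXES = (".pmtiles", ".tsv", ".tsv.gz", ".json")


def resolve_paths(node, base_dir):
    """Return a copy of `node` with data-file strings joined onto `base_dir`."""
    if isinstance(node, dict):
        return {key: resolve_paths(value, base_dir) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve_paths(item, base_dir) for item in node]
    if isinstance(node, str) and node.endswith(DATA_SUFFIXES):
        return posixpath.join(base_dir, node)
    return node  # identifiers, names, and other scalars are left untouched


# A small excerpt of the example catalog, as a YAML parser would return it.
catalog = {
    "id": "xenium-human-breast-cancer-si2024-20241224",
    "assets": {
        "sge": {
            "all": "genes_all.pmtiles",
            "counts": "genes_bin_counts.json",
        },
    },
}

# "/data/xenium-hbc" is a hypothetical directory holding catalog.yaml.
resolved = resolve_paths(catalog, "/data/xenium-hbc")
print(resolved["assets"]["sge"]["all"])
```

Matching on file suffixes keeps identifiers such as `id` or `default: dark` untouched; a stricter implementation could instead resolve only the fields that the spec defines as paths.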

(1) Root Level Structure

  • id: (string) A unique identifier for the entire dataset. This should be a machine-readable string, often reflecting a project or sample name and date.
    • Example: xenium-human-breast-cancer-si2024-20241224
  • title: (string) A human-readable title for the dataset. This can be more descriptive than the id.
    • Example: 10x Xenium Human Breast Cancer Analysis by FICTURE and XeniumRanger
  • assets: (object) This is the main block containing all references to the data files and their organization. The structure of this object is detailed below.
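
The three root-level fields listed above can be checked with a few lines of code before a catalog is used. This is only a sketch: the function name `check_root` is hypothetical, and treating all three fields as required is an assumption rather than something the spec states explicitly.

```python
def check_root(catalog):
    """Return a list of problems found in the root-level fields (empty if none)."""
    problems = []
    # Expected types follow the field descriptions: id and title are strings,
    # assets is an object (a dict once the YAML is parsed).
    for key, expected in (("id", str), ("title", str), ("assets", dict)):
        if key not in catalog:
            problems.append(f"missing required field: {key}")
        elif not isinstance(catalog[key], expected):
            problems.append(f"{key} should be of type {expected.__name__}")
    return problems


good = {
    "id": "xenium-human-breast-cancer-si2024-20241224",
    "title": "10x Xenium Human Breast Cancer Analysis by FICTURE and XeniumRanger",
    "assets": {},
}
print(check_root(good))       # no problems expected for this catalog
print(check_root({"id": 1}))  # reports a wrong type and two missing fields
```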

(2) The assets Object

(A) basemap section

  • Purpose: Stores references to various basemap image layers used for visualization and context. These are typically image tiles.
  • YAML Fields: There are two supported formats depending on whether a given basemap has a single rendering or multiple display options.
    basemap:
        # An example basemap layer with only one available option
        <key-A>: <path-to-layer-A>
        # An example basemap layer with multiple style options
        <key-B>:
            default: <default-sub-key-a>
            <sub-key-a>: <path-to-layer-a>
            <sub-key-b>: <path-to-layer-b>
            ...
            <sub-key-n>: <path-to-layer-n>
    • Keys within basemap should be unique, and sub-keys within one key should also be unique.
    • default (string): specifies which sub-key is loaded by default; its value must match one of the defined sub-keys.
  • File Format: Commonly .pmtiles (Protomaps PMTiles) containing raster images in each tile. PMTiles is used for large, pyramid-tiled image and vector data, optimized for efficient loading and rendering at various resolutions in web-based map viewers.
  • Spatial Coordinate Conversion: Spatial coordinates use a micron-to-meter convention in EPSG:3857, meaning that 1 micrometer (µm) in the tissue image is mapped to 1 meter (m) in EPSG:3857 coordinates.
  • Example Key Names used in Basemap layers:
    • sge: Often used for visualizing spatial gene expression.
    • HnE: the hematoxylin and eosin stain (H&E) layer for visualizing tissue histology.
    • DAPI: the 4′,6-diamidino-2-phenylindole (DAPI) layer, a blue-fluorescent DNA stain that marks nuclei.
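
Since 1 µm of tissue is stored as 1 m in EPSG:3857, a viewer that expects longitude and latitude needs the standard inverse Web Mercator formula. The sketch below shows that conversion. The function name is hypothetical, and it assumes no axis flip or offset is applied to the Y coordinate, which this document does not specify.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857 (Web Mercator)


def micron_to_lonlat(x_um, y_um):
    """Treat micrometer coordinates as EPSG:3857 meters; return (lon, lat) in degrees."""
    # Assumption: no Y-axis flip or origin shift is needed for the dataset.
    lon = math.degrees(x_um / EARTH_RADIUS_M)
    lat = math.degrees(2.0 * math.atan(math.exp(y_um / EARTH_RADIUS_M)) - math.pi / 2.0)
    return lon, lat


# A point 10 mm (10,000 µm) from the origin along both axes.
lon, lat = micron_to_lonlat(10_000.0, 10_000.0)
print(f"lon={lon:.6f}, lat={lat:.6f}")
```

At tissue scale (millimeters to centimeters) the resulting coordinates stay within a fraction of a degree of the origin, where Web Mercator distortion is negligible.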

(B) overview

  • Purpose: Provides a path to a general overview image layer for the entire dataset, e.g., sge-mono-dark.pmtiles. Typically this reuses the PMTiles file of one of the basemap layers and serves as the thumbnail view that highlights the currently rendered region.
  • YAML Fields:
    overview: <path-to-layer>
  • File Format: Typically a .pmtiles file containing raster tiles.
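
The two basemap forms (a key mapped directly to a path, or a key with a default and several style sub-keys) can be flattened into one list, which also makes it easy to check whether the overview path reuses a basemap layer. This is a sketch only: `list_basemap_layers` is a hypothetical helper, and falling back to the first sub-key when no default is given is an assumption.

```python
def list_basemap_layers(basemap):
    """Flatten a basemap block into (key, style, path, is_default) tuples."""
    layers = []
    for key, value in basemap.items():
        if isinstance(value, str):
            # Single-option form: the key maps straight to a path.
            layers.append((key, None, value, True))
            continue
        default_style = value.get("default")
        styles = {k: v for k, v in value.items() if k != "default"}
        if default_style is not None and default_style not in styles:
            raise ValueError(f"default '{default_style}' is not a sub-key of '{key}'")
        if default_style is None and styles:
            default_style = next(iter(styles))  # assumption: first sub-key wins
        for style, path in styles.items():
            layers.append((key, style, path, style == default_style))
    return layers


# Excerpt of the example catalog's basemap block.
basemap = {
    "sge": {
        "default": "dark",
        "dark": "sge-mono-dark.pmtiles",
        "light": "sge-mono-light.pmtiles",
    },
    "HnE": {"dark": "histology-hne.pmtiles"},
}
layers = list_basemap_layers(basemap)
overview = "sge-mono-dark.pmtiles"
overview_is_basemap = overview in {path for _, _, path, _ in layers}
print(layers)
print(overview_is_basemap)
```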

(C) factors section

  • Purpose: This section contains results from spatial factor analysis, organized as factor sets. Biologically, these factors may represent cell types, spatial domains, or tissue microenvironments, and can capture transcriptomic patterns at subcellular or extracellular resolution. Multiple analysis results can be included, with each entry representing a distinct run defined by parameters such as training resolution and the number of inferred factors. This section commonly includes outputs from models like FICTURE, a segmentation-free approach for interpreting spatial transcriptomics data.

  • YAML Fields: Each factor set is defined as a dictionary within a list under the factors section. Each dictionary should contain references to key output files and metadata identifiers:

    factors:
      - id: <factor-set-id>
        name: <factor-set-name>
        model_id: <model-id>
        proj_id: <projection-id>
        decode_id: <decode-id>
        model: <path-to-model-matrix>
        post: <path-to-posterior-counts>
        rgb: <path-to-color-map>
        de: <path-to-differentially-expressed-genes>
        info: <path-to-a-summary-report>
        pmtiles:
            hex_coarse: <path-to-coarse-pmtiles>
            hex_fine: <path-to-fine-pmtiles>
            raster: <path-to-pixel-raster-pmtile>
    • id: (string) A concise, unique identifier for the factor set.
    • name (string): A human-readable name for the factor entry.
    • decode_id (string): An identifier describing the pixel-level decoding generated by FICTURE. The identifier typically encodes the parameters used in the FICTURE analysis.
      • Example naming scheme:
        t12-f24-p12-a4-r5
    • model (string): Path to the model matrix file containing a (num genes) x (num factors) matrix, representing factor loadings across genes.
      • Common naming scheme:
        {id}-model-matrix.tsv.gz
      • Data Format:
        gene   0           1           2           3           4           5           6           7           8           9           10          11          12          13          14          15          16          17          18          19
        ERBB2  7.5737e+04  7.0903e+03  6.4985e+04  1.7723e+05  1.6162e+04  2.2632e+04  4.5419e+04  3.1964e+06  4.6923e+03  2.6647e+04  7.4585e+05  1.9857e+06  6.0363e+05  8.4796e+04  6.3451e+02  9.0330e+03  1.1768e+06  1.1810e+05  4.7362e+04  9.4282e+04
        LUM    2.3842e+06  1.2456e+05  2.7422e+03  7.7520e+05  2.3671e+04  3.9357e+04  2.0350e+04  9.0192e+03  1.4922e+03  6.0824e+04  2.3937e+03  1.9136e+05  1.8769e+04  2.0348e+04  2.7575e+04  7.0751e+03  1.3734e+04  5.3737e+03  3.5574e+04  9.4903e+05
        
        • gene (string): Gene names.
        • factor IDs (0, 1, 2, ...): One column per factor, holding the loading of each gene on that factor.
    • post (string): Path to the posterior counts (pseudobulk counts) for each (gene, factor) pair, stored as a (num genes) x (num factors) matrix.
      • Common naming scheme:
        {decode_id}-posterior-counts.tsv.gz
      • Data Format:
        gene   0          1        2         3          4        5         6         7           8        9         10         11         12        13        14        15       16         17       18        19
        ERBB2  68977.03   1596.26  17626.65  71666.99   8121.36  10872.75  23827.50  1049698.93  1189.32  14200.30  169862.25  445732.68  59977.92  23568.54  5642.92   4647.97  381147.12  3737.82  28218.49  47785.21
        LUM    615032.23  8499.11  1446.22   183783.07  7444.93  10090.78  11456.22  16430.28    442.63   16588.87  2006.08    65489.31   5556.67   8044.97   17038.48  3157.13  8213.26    405.19   18340.77  226821.80
        
        • gene (string): Gene names.
        • factor IDs (0, 1, 2, ...): Posterior count per gene per factor.
    • rgb (string): Path to an RGB color map that assigns a color to each spatial factor for visualization. It can be generated by FICTURE or supplied as a fixed color map file.
      • Data Format:
        Name  Color_index  R        G        B
        0     0            0.66449  0.08436  0.00424
        1     1            0.97545  0.4574   0.11305
        
        • Name (string): Factor IDs.
        • Color_index (integer): Typically the same as the factor ID.
        • R, G, B (float): Red, Green, and Blue channel values (range: 0.0 to 1.0).
    • de (string): Path to differential expression results identifying genes enriched in each spatial factor, based on the pseudobulk counts.
      • Common naming scheme:
        {decode_id}-bulk-de.tsv
      • Data Format:
        gene   factor  Chi2      pval      FoldChange  gene_total  log10pval
        POSTN  0       656639.1  0.00e+00  4.75        962001      142590.38
        LUM    0       432344.7  0.00e+00  3.64        1079073     93885.375
        
        • gene (string): Gene names.
        • factor (integer): Factor IDs.
        • Chi2 (float): Chi-squared test statistic comparing the expression in the target factor and the rest.
        • pval (float): P-value associated with the chi-squared test.
        • FoldChange (float): Ratio of gene expression inside versus outside the factor’s high-loading region.
        • gene_total (integer): Total count of the gene in the dataset.
        • log10pval (float): Base-10 logarithm of the inverse p-value (i.e., -log10(pval)), useful for ranking significant genes.
    • info (string): Path to a TSV file summarizing each factor.
      • Common naming scheme:
        {decode_id}-info.tsv
      • Data Format:
        Factor  RGB         Weight   PostUMI   TopGene_pval                                                                                                                                   TopGene_fc                                                                                                                                        TopGene_weight
        7       237,207,57  0.27989  10867410  SCD, FASN, EPCAM, KRT7, FOXA1, ERBB2, KRT8, CCND1, GATA3, ABCC11, MYO5B, CDH1, SERHL2, CD9, NARS, TENT5C, MLPH, CTTN, AR, MDM2                  FASN, STC1, SCD, MYO5B, EPCAM, CENPF, ABCC11, FOXA1, AR, antisense_SCRIB, SERHL2, KRT7, TENT5C, SQLE, KRT8, PCLAF, TRAF4, CDH1, DMKN, PTRHD1      ERBB2, KRT7, SCD, EPCAM, CCND1, FOXA1, GATA3, FASN, KRT8, ANKRD30A, TACSTD2, TOMM7, CTTN, MLPH, POLR2J3, MDM2, CD9, CDH1, NARS, TPD52
        16      70,115,235  0.12095  4696246   CEACAM6, SERPINA3, GATA3, TACSTD2, AGR3, ESR1, ANKRD30A, MLPH, SCD, CD9, FLNB, TPD52, MZB1, KRT8, FOXA1, CLDN4, S100A14, TFAP2A, LYPD3, HOOK2   AGR3, SERPINA3, ESR1, CEACAM6, SCGB2A1, CEACAM8, HPX, MZB1, HOOK2, GATA3, CLDN4, TACSTD2, RTKN2, TFAP2A, FLNB, CD9, MLPH, TPD52, C6orf132, LYPD3  ERBB2, GATA3, CEACAM6, SCD, TACSTD2, SERPINA3, KRT7, ANKRD30A, FOXA1, CCND1, EPCAM, KRT8, MLPH, CD9, FASN, TPD52, POLR2J3, FLNB, TOMM7, S100A14
        
        • Factor (integer): Factor IDs.
        • RGB (string): Comma-separated RGB values (0 to 255).
        • Weight (float): Proportion of the total factor signal explained by this factor.
        • PostUMI (integer): Sum of posterior UMI counts across all spatial locations for this factor.
        • TopGene_pval, TopGene_fc, TopGene_weight (string): Top marker genes per factor ranked by significance (p-value), fold change, or weight.
    • pmtiles (object): Paths to spatial tiling layers for rendering factor maps.
      • hex_coarse (string): PMTiles for coarse-resolution hexagonal binning.
        • Data Format: This is a vector PMTiles file with MVT format containing points with the following attributes:
          • X : X-coordinate of the hexagon center in micrometers (1 µm corresponds to 1 m in EPSG:3857)
          • Y : Y-coordinate of the hexagon center in micrometers (1 µm corresponds to 1 m in EPSG:3857)
          • topK : The factor ID that has the highest posterior probability
          • topP : The highest posterior probability across all factors
          • factor id (0, 1, ...) : The posterior probability of each factor for the hexagon.
          • Any additional attributes can be included
      • hex_fine (string): PMTiles for fine-resolution hexagonal binning.
        • Data Format: same as hex_coarse
      • raster (string): PMTiles for full-resolution raster output.
        • Data Format: This is a raster PMTiles file visualizing the pixel-level output in raster format.
      • cells (string): PMTiles for cell-segmented analysis output (e.g., by XeniumRanger), with one point per cell.
        • Data Format: This is a vector PMTiles file with MVT format containing points with the following attributes:
          • X : X-coordinate of the cell centroid in micrometers (1 µm corresponds to 1 m in EPSG:3857)
          • Y : Y-coordinate of the cell centroid in micrometers (1 µm corresponds to 1 m in EPSG:3857)
          • topK : The factor or cluster ID associated with the cell.
          • Any additional attributes can be included
  • About FICTURE: FICTURE models high-resolution spatial gene expression at the original resolution without requiring cell segmentation. It treats the tissue as a continuous field and infers latent spatial factors, i.e., coherent patterns of gene activity across space. These factors represent transcriptional programs or tissue structures and can be projected onto pixel-level data. Besides using the default Latent Dirichlet Allocation, FICTURE can also take clustering results from other tools, such as Seurat, and project them to the pixel level.
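
As an illustration of how one of the factor-set files might be consumed, the sketch below reads the rgb color map described above and converts each factor's 0.0 to 1.0 channel values into a CSS-style hex color. It is not taken from CartoStore or FICTURE: the function name is hypothetical, the input is assumed to be tab-separated with the header shown in the Data Format sample, and the sample rows are the two rows listed there.

```python
import csv
import io


def read_factor_colors(handle):
    """Map factor ID -> '#rrggbb' from an rgb TSV with Name, R, G, B columns."""
    colors = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        channels = (float(row[c]) for c in ("R", "G", "B"))
        # Scale the 0.0-1.0 channel values to 0-255 and clamp before formatting.
        rgb255 = [min(255, max(0, round(v * 255))) for v in channels]
        colors[row["Name"]] = "#{:02x}{:02x}{:02x}".format(*rgb255)
    return colors


# The two sample rows from the rgb Data Format example, tab-separated.
sample = (
    "Name\tColor_index\tR\tG\tB\n"
    "0\t0\t0.66449\t0.08436\t0.00424\n"
    "1\t1\t0.97545\t0.4574\t0.11305\n"
)
colors = read_factor_colors(io.StringIO(sample))
print(colors)  # {'0': '#a91601', '1': '#f9751d'}
```

For a real file, the same function can be given an open text handle, e.g. `open(path, newline="")`, where `path` is the rgb entry resolved relative to catalog.yaml.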

(D) sge (Spatial Digital Gene Expression)

  • Purpose: Stores aggregated spatial gene expression data, often binned or processed for efficient visualization and querying across the spatial extent of the tissue.
  • YAML Fields:
      sge:
        all: genes_all.pmtiles
        bins:
        - genes_bin1.pmtiles
        - ...
        - genes_bin50.pmtiles
        counts: genes_bin_counts.json
    • all: (string) Path to a PMTiles file representing all gene expression data.
      • Data Format: This is a vector PMTiles file with MVT format containing points with the following attributes:
        • X : X-coordinate of the transcript/molecule in micrometers (1 µm corresponds to 1 m in EPSG:3857)
        • Y : Y-coordinate of the transcript/molecule in micrometers (1 µm corresponds to 1 m in EPSG:3857)
        • gene : The gene/feature name associated with the transcript/molecule
        • count : The observed count of the gene/feature
        • {decode_id}_K1 : The top spatial factor associated with the transcript, inferred by FICTURE analysis corresponding to decode_id.
        • {decode_id}_P1 : The posterior probability of the top spatial factor associated with the transcript, inferred by FICTURE analysis corresponding to decode_id.
        • ... (multiple {decode_id}_K1 and {decode_id}_P1 pairs may exist if multiple FICTURE analysis results exist)
        • Any additional attributes can be included.
    • bins: (list of strings) A list where each string is a path to a PMTiles file, each typically holding a specific bin of genes. These files are optional and duplicate information already present in all. They are useful for showing the detailed spatial distribution of rarely expressed genes, alleviating the undersampling that occurs when viewing a specific gene at coarse zoom levels.
      • Data Format: Identical to all.
    • counts: (string) Path to a JSON file that contains metadata of gene, gene count, and gene bin ID.
      • Data Format:
        [
            {
                "gene": "ERBB2",
                "count": 2164898,
                "bin": 1
            },
            {
                "gene": "LUM",
                "count": 1087247,
                "bin": 2
            },
            ...
        ]
        • gene (string): Gene names.
        • count (integer): Total UMI count of the gene.
        • bin (integer): Gene bin ID.
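
The counts file makes it possible to find which per-bin PMTiles file holds a given gene. The following sketch shows one way to do that lookup. The function name is hypothetical, and it assumes that bin IDs are 1-based and follow the order of the bins list, which the example file names (genes_bin1.pmtiles, ...) suggest but the spec does not state. The two-entry bins list is illustrative only; the real list is elided in the catalog snippet.

```python
import json


def bin_file_for_gene(gene, counts_records, bin_paths):
    """Return the per-bin PMTiles path that holds `gene`, or None if unknown."""
    bin_by_gene = {rec["gene"]: rec["bin"] for rec in counts_records}
    bin_id = bin_by_gene.get(gene)
    if bin_id is None:
        return None
    # Assumption: bin IDs are 1-based and follow the order of the `bins` list.
    return bin_paths[bin_id - 1]


# The two records shown in the counts Data Format example.
counts_records = json.loads(
    '[{"gene": "ERBB2", "count": 2164898, "bin": 1},'
    ' {"gene": "LUM", "count": 1087247, "bin": 2}]'
)
# Illustrative two-entry list standing in for the catalog's `bins` field.
bin_paths = ["genes_bin1.pmtiles", "genes_bin2.pmtiles"]

print(bin_file_for_gene("LUM", counts_records, bin_paths))
```

A viewer could use this lookup to load only the bin file for the gene being displayed instead of the much larger all file.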

Example Data Access

Currently, an example Xenium Human Breast Cancer dataset is available at Zenodo with record ID 15649152. We expect many more datasets to become available through the AWS Open Data Registry.

Preparing Data for CartoStore

While anyone can prepare their own spatial transcriptomics data following this spec, the cartloader software is designed to help investigators produce CartoStore-compatible datasets.
