CartoStore hosts a uniformly processed resource of heterogeneous spatial omics datasets across various high-resolution platforms. CartoStore is designed for micron-resolution spatial omics technologies that assign individual molecular readouts (e.g. transcripts or proteins) to micron-scale spatial coordinates. Raw expression counts, histological images, and analysis results are stored in PMTiles format for cloud-friendly spatial access. CartoStore is compatible with most existing high-resolution spatial omics platforms, including sequencing-based platforms (e.g. Seq-Scope, Stereo-seq, Pixel-seq, and Visium HD) and imaging-based platforms (e.g. Xenium, MERSCOPE, CosMx SMI, and STARmap), and currently hosts 100+ published datasets.
This document outlines the format, structure, and content of a typical dataset hosted in CartoStore. It serves as a reference for organizing and sharing spatial omics datasets using a standardized data structure. This repository format is intended to facilitate interactive visualization and exploration of high-resolution spatial omics datasets with CartoStore-compatible web applications.
The aim of this document is to guide users in organizing and understanding the dataset components defined by a YAML-based catalog file. While the catalog and its associated files can be constructed manually or with other tools, the cartloader pipeline can be used to simplify and automate this process.
To illustrate the organization and usage, this documentation refers to one specific dataset as an example: `xenium-human-breast-cancer-si2024-20241224`, which reprocesses the published Xenium breast cancer dataset using the cartloader pipeline. In addition to histological images and spatial gene expression, the dataset also incorporates outputs from a FICTURE analysis, which identifies expression patterns without relying on segmentation boundaries.
The repository is also publicly available at Zenodo with DOI: 10.5281/zenodo.15649152.
A typical CartoStore dataset should be contained in a single flat directory with all relevant files. Typical files include:
- Images of histological staining (H&E, DAPI, polyT, etc.) and a rasterized view of overall molecule counts in PMTiles.
- Spatial expression of individual genes/features at the original micron resolution in PMTiles.
- Spatial factor analysis results from FICTURE or other platform-specific tools such as XeniumRanger, converted to PMTiles format.
- Additional summary of spatial factors and gene-level statistics in TSV or JSON format.
- A catalog file indexing all accessible components within the directory.
The directory is indexed by a YAML-based `catalog.yaml` file, which functions as the central index linking all related data files within the directory. This structure enables programmatic discovery and visualization of dataset components.
Here is a snippet from the YAML catalog (`catalog.yaml`):

    id: xenium-human-breast-cancer-si2024-20241224
    title: 10x Xenium Human Breast Cancer Analysis by FICTURE and XeniumRanger
    assets:
      basemap:
        sge:
          default: dark
          dark: sge-mono-dark.pmtiles
          light: sge-mono-light.pmtiles
        HnE:
          dark: histology-hne.pmtiles
        ...
      factors:
        - id: t12-xsi2024xeniumhbc
          name: Published FICTURE analysis with 20 factors
          model_id: t12-xsi2024xeniumhbc
          model: t12-xsi2024xeniumhbc-model-matrix.tsv.gz
          rgb: t12-xsi2024xeniumhbc-rgb.tsv
          proj_id: t12-xsi2024xeniumhbc-p12-a4
          decode_id: t12-xsi2024xeniumhbc-p12-a4-r5
          post: t12-xsi2024xeniumhbc-p12-a4-r5-posterior-counts.tsv.gz
          info: t12-xsi2024xeniumhbc-p12-a4-r5-info.tsv
          de: t12-xsi2024xeniumhbc-p12-a4-r5-bulk-de.tsv
          pmtiles:
            hex_coarse: t12-xsi2024xeniumhbc.pmtiles
            hex_fine: t12-xsi2024xeniumhbc-p12-a4.pmtiles
            raster: t12-xsi2024xeniumhbc-p12-a4-r5-pixel-raster.pmtiles
        - id: xeniumranger
          name: Default XeniumRanger Analysis Output
          cells_id: xeniumranger
          rgb: xeniumranger-rgb.tsv
          de: xeniumranger-cells-bulk-de.tsv
          pmtiles:
            cells: xeniumranger-cells.pmtiles
            boundaries: xeniumranger-boundaries.pmtiles
      sge:
        all: genes_all.pmtiles
        bins:
          - genes_bin1.pmtiles
          - ...
          - genes_bin50.pmtiles
        counts: genes_bin_counts.json

Because the structure of `catalog.yaml` reflects the logical grouping of all files, this documentation follows the same hierarchy to introduce the dataset's organization and to help users locate the files they need efficiently.
Please note that file paths specified in the YAML catalog are typically relative to the location of the YAML file itself.
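Since the paths are relative to the catalog, a consumer has to join them with the catalog's directory before opening anything. The following is a minimal sketch of that step, assuming the catalog has already been parsed into a Python dict (for example with a YAML parser such as PyYAML's `yaml.safe_load`). The helper name `resolve_asset` and the location `/data/xenium/catalog.yaml` are hypothetical and used only for illustration.

```python
from pathlib import PurePosixPath

# A small excerpt of the parsed catalog, mirroring the snippet above.
catalog = {
    "id": "xenium-human-breast-cancer-si2024-20241224",
    "assets": {
        "basemap": {
            "sge": {
                "default": "dark",
                "dark": "sge-mono-dark.pmtiles",
                "light": "sge-mono-light.pmtiles",
            }
        },
        "sge": {"all": "genes_all.pmtiles", "counts": "genes_bin_counts.json"},
    },
}


def resolve_asset(catalog_file: str, relative_path: str) -> str:
    """Join a catalog-relative path with the directory holding catalog.yaml."""
    path = PurePosixPath(relative_path)
    if path.is_absolute():
        return str(path)  # already absolute: leave it untouched
    return str(PurePosixPath(catalog_file).parent / path)


# Hypothetical location of the catalog file.
catalog_file = "/data/xenium/catalog.yaml"
all_genes = resolve_asset(catalog_file, catalog["assets"]["sge"]["all"])
print(all_genes)  # /data/xenium/genes_all.pmtiles
```

The same joining rule would apply when the catalog is served over HTTP, with the catalog URL's parent taking the place of the directory.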
- `id`: (string) A unique identifier for the entire dataset. This should be a machine-readable string, often reflecting a project or sample name and date.
  - Example from input: `xenium-prime-mouse-pup-ffpe-20250222`
- `title`: (string) A human-readable title for the dataset. This can be more descriptive than the `id`.
  - Example from input: `xenium-prime-mouse-pup-ffpe-20250222`
- `assets`: (object) This is the main block containing all references to the data files and their organization. The structure of this object is detailed below.
- Purpose: Stores references to the basemap image layers used for visualization and context. These are typically image tiles.
- YAML Fields: There are two supported formats, depending on whether a given basemap has a single rendering or multiple display options.

      basemap:
        # An example basemap layer with only one available option
        <key-A>: <path-to-layer-A>
        # An example basemap layer with multiple style options
        <key-B>:
          default: <default-sub-key-a>
          <sub-key-a>: <path-to-layer-a>
          <sub-key-b>: <path-to-layer-b>
          ...
          <sub-key-n>: <path-to-layer-n>

  - Keys within `basemap` should be unique, and sub-keys within one key should also be unique.
  - `default-sub-key` (string): specifies which sub-key is loaded by default (must match one of the defined sub-keys).
- File Format: Commonly `.pmtiles` (Protomaps tiles) containing raster images in each tile. PMTiles is used for large, pyramid-tiled image and vector data, optimized for efficient loading and rendering at various resolutions in web-based map viewers.
- Spatial Coordinate Conversion: The spatial coordinates should use a 'meter-to-micron' conversion in EPSG:3857, meaning that 1 micrometer (µm) in the tissue image is mapped to 1 meter (m) in EPSG:3857 coordinates.
- Example key names used in basemap layers:
  - `sge`: Often used for visualizing spatial gene expression.
  - `HnE`: The hematoxylin and eosin (H&E) stain layer for visualizing tissue histology.
  - `DAPI`: The 4′,6-diamidino-2-phenylindole (DAPI) blue-fluorescent DNA stain layer for visualizing nuclei.
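Because a basemap key can hold either a single path or a set of styles, a viewer needs a small amount of logic to pick the file to load. The sketch below is one possible way to do this, not a reference implementation. The function name `resolve_basemap` and the `DAPI` file name are hypothetical, and falling back to the first listed style when no `default` is given is an assumption (the example catalog lists `HnE` with a single `dark` style and no `default`).

```python
def resolve_basemap(entry, style=None):
    """Return the layer path for one basemap key.

    `entry` is either a plain path (single rendering) or a mapping with an
    optional `default` sub-key plus one path per style.
    """
    if isinstance(entry, str):
        return entry  # single-option layer: the value is the path itself
    styles = {k: v for k, v in entry.items() if k != "default"}
    # Assumption: without an explicit default, use the first listed style.
    chosen = style or entry.get("default") or next(iter(styles), None)
    if chosen not in styles:
        raise KeyError(f"unknown basemap style: {chosen!r}")
    return styles[chosen]


basemap = {
    "sge": {
        "default": "dark",
        "dark": "sge-mono-dark.pmtiles",
        "light": "sge-mono-light.pmtiles",
    },
    "HnE": {"dark": "histology-hne.pmtiles"},
    "DAPI": "histology-dapi.pmtiles",  # hypothetical single-option layer
}

print(resolve_basemap(basemap["sge"]))           # default style
print(resolve_basemap(basemap["sge"], "light"))  # explicitly requested style
print(resolve_basemap(basemap["HnE"]))           # no default given
print(resolve_basemap(basemap["DAPI"]))          # single-option layer
```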
- Purpose: Provides a path to a general overview image layer for the entire dataset, e.g., `sge-mono-dark.pmtiles`. Typically this reuses the PMTiles file of one of the basemap layers as the thumbnail view that highlights the currently rendered region.
- YAML Fields:

      overview: <path-to-layer>

- File Format: Typically a `.pmtiles` file containing raster tiles.
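Because the overview typically reuses a basemap layer, it follows the same 'meter-to-micron' EPSG:3857 convention. A web map viewer usually works in longitude/latitude, so tissue coordinates end up as tiny offsets around (0°, 0°). The sketch below applies the standard spherical Web Mercator formulas to micrometer coordinates treated as meters. The function names are hypothetical, and the orientation of the Y axis is not specified in this document, so the sign is left unchanged here.

```python
import math
from typing import Tuple

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857 (Web Mercator)


def micron_to_lonlat(x_um: float, y_um: float) -> Tuple[float, float]:
    """Treat micrometers as EPSG:3857 meters and convert to lon/lat degrees."""
    lon = math.degrees(x_um / EARTH_RADIUS_M)
    lat = math.degrees(
        2.0 * math.atan(math.exp(y_um / EARTH_RADIUS_M)) - math.pi / 2.0
    )
    return lon, lat


def lonlat_to_micron(lon: float, lat: float) -> Tuple[float, float]:
    """Inverse mapping: lon/lat degrees back to micrometers (EPSG:3857 meters)."""
    x_um = EARTH_RADIUS_M * math.radians(lon)
    y_um = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4.0 + math.radians(lat) / 2.0)
    )
    return x_um, y_um


# A point 10 mm (10,000 µm) along X and 5 mm along Y from the origin.
lon, lat = micron_to_lonlat(10_000.0, 5_000.0)
print(f"lon={lon:.6f} lat={lat:.6f}")
```

Under this convention a 10 mm wide tissue section spans roughly 0.09 degrees of longitude, which is why such datasets appear as a small patch near the equator in a generic map viewer.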
- Purpose: This section contains results from spatial factor analysis, organized as factor sets. Biologically, these factors may represent cell types, spatial domains, or tissue microenvironments, and can capture transcriptomic patterns at subcellular or extracellular resolution. Multiple analysis results can be included, with each entry representing a distinct run defined by parameters such as training resolution and the number of inferred factors. This section commonly includes outputs from models like FICTURE, a segmentation-free approach for interpreting spatial transcriptomics data.
- YAML Fields: Each factor set is defined as a dictionary within a list under the `factors` section. Each dictionary should contain references to key output files and metadata identifiers:

      factors:
        - id: <factor-set-id>
          name: <factor-set-name>
          model_id: <model-id>
          proj_id: <project-id>
          decode_id: <decode-id>
          model: <path-to-model-matrix>
          post: <path-to-posterior-counts>
          rgb: <path-to-color-map>
          de: <path-to-differentially-expressed-genes>
          info: <path-to-a-summary-report>
          pmtiles:
            hex_coarse: <path-to-coarse-pmtiles>
            hex_fine: <path-to-fine-pmtiles>
            raster: <path-to-pixel-raster-pmtile>

  - `id` (string): A concise, unique identifier for the factor set.
  - `name` (string): A human-readable name for the factor entry.
  - `decode_id` (string): An identifier describing the pixel-level decoding parameters generated by FICTURE. The identifier typically contains the parameters used in the FICTURE analysis.
    - Example naming scheme: `t12-f24-p12-a4-r5`
  - `model` (string): Path to the model matrix file containing a (num genes) x (num factors) matrix, representing factor loadings across genes.
    - Common naming scheme: `{id}-model-matrix.tsv.gz`
    - Data Format:

          gene 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
          ERBB2 7.5737e+04 7.0903e+03 6.4985e+04 1.7723e+05 1.6162e+04 2.2632e+04 4.5419e+04 3.1964e+06 4.6923e+03 2.6647e+04 7.4585e+05 1.9857e+06 6.0363e+05 8.4796e+04 6.3451e+02 9.0330e+03 1.1768e+06 1.1810e+05 4.7362e+04 9.4282e+04
          LUM 2.3842e+06 1.2456e+05 2.7422e+03 7.7520e+05 2.3671e+04 3.9357e+04 2.0350e+04 9.0192e+03 1.4922e+03 6.0824e+04 2.3937e+03 1.9136e+05 1.8769e+04 2.0348e+04 2.7575e+04 7.0751e+03 1.3734e+04 5.3737e+03 3.5574e+04 9.4903e+05

      - `gene` (string): Gene names.
      - factor IDs, such as `1`, `2`: factor loading per gene per factor.
  - `post` (string): Path to the posterior counts (pseudobulk counts) for each gene-factor pair, stored as a (num genes) x (num factors) matrix.
    - Common naming scheme: `{decode_id}-posterior-counts.tsv.gz`
    - Data Format:

          gene 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
          ERBB2 68977.03 1596.26 17626.65 71666.99 8121.36 10872.75 23827.50 1049698.93 1189.32 14200.30 169862.25 445732.68 59977.92 23568.54 5642.92 4647.97 381147.12 3737.82 28218.49 47785.21
          LUM 615032.23 8499.11 1446.22 183783.07 7444.93 10090.78 11456.22 16430.28 442.63 16588.87 2006.08 65489.31 5556.67 8044.97 17038.48 3157.13 8213.26 405.19 18340.77 226821.80

      - `gene` (string): Gene names.
      - factor IDs, such as `1`, `2`: posterior count per gene per factor.
  - `rgb` (string): Path to an RGB color map that encodes spatial factors for visualization. It can be generated by FICTURE or supplied as a fixed color map file.
    - Data Format:

          Name Color_index R G B
          0 0 0.66449 0.08436 0.00424
          1 1 0.97545 0.4574 0.11305

      - `Name` (string): Factor IDs.
      - `Color_index` (integer): Typically the same as the factor IDs.
      - `R`, `G`, `B` (float): Red, green, and blue channel values (range: 0.0 to 1.0).
  - `de` (string): Path to differential expression results identifying genes enriched for a specific spatial factor in the pseudobulk counts.
    - Common naming scheme: `{decode_id}-bulk-de.tsv`
    - Data Format:

          gene factor Chi2 pval FoldChange gene_total log10pval
          POSTN 0 656639.1 0.00e+00 4.75 962001 142590.38
          LUM 0 432344.7 0.00e+00 3.64 1079073 93885.375

      - `gene` (string): Gene names.
      - `factor` (integer): Factor IDs.
      - `Chi2` (float): Chi-squared test statistic comparing the expression in the target factor and the rest.
      - `pval` (float): P-value associated with the chi-squared test.
      - `FoldChange` (float): Ratio of gene expression inside versus outside the factor's high-loading region.
      - `gene_total` (integer): Total count of the gene in the dataset.
      - `log10pval` (float): Base-10 logarithm of the inverse p-value (i.e., -log10(pval)), useful for ranking significant genes.
  - `info` (string): Path to a TSV file containing a summary report for each factor.
    - Common naming scheme: `{decode_id}-info.tsv`
    - Data Format:

          Factor RGB Weight PostUMI TopGene_pval TopGene_fc TopGene_weight
          7 237,207,57 0.27989 10867410 SCD, FASN, EPCAM, KRT7, FOXA1, ERBB2, KRT8, CCND1, GATA3, ABCC11, MYO5B, CDH1, SERHL2, CD9, NARS, TENT5C, MLPH, CTTN, AR, MDM2 FASN, STC1, SCD, MYO5B, EPCAM, CENPF, ABCC11, FOXA1, AR, antisense_SCRIB, SERHL2, KRT7, TENT5C, SQLE, KRT8, PCLAF, TRAF4, CDH1, DMKN, PTRHD1 ERBB2, KRT7, SCD, EPCAM, CCND1, FOXA1, GATA3, FASN, KRT8, ANKRD30A, TACSTD2, TOMM7, CTTN, MLPH, POLR2J3, MDM2, CD9, CDH1, NARS, TPD52
          16 70,115,235 0.12095 4696246 CEACAM6, SERPINA3, GATA3, TACSTD2, AGR3, ESR1, ANKRD30A, MLPH, SCD, CD9, FLNB, TPD52, MZB1, KRT8, FOXA1, CLDN4, S100A14, TFAP2A, LYPD3, HOOK2 AGR3, SERPINA3, ESR1, CEACAM6, SCGB2A1, CEACAM8, HPX, MZB1, HOOK2, GATA3, CLDN4, TACSTD2, RTKN2, TFAP2A, FLNB, CD9, MLPH, TPD52, C6orf132, LYPD3 ERBB2, GATA3, CEACAM6, SCD, TACSTD2, SERPINA3, KRT7, ANKRD30A, FOXA1, CCND1, EPCAM, KRT8, MLPH, CD9, FASN, TPD52, POLR2J3, FLNB, TOMM7, S100A14

      - `Factor` (integer): Factor IDs.
      - `RGB` (string): Comma-separated RGB values.
      - `Weight` (float): Proportion of the total factor signal explained by this factor.
      - `PostUMI` (integer): Sum of posterior UMI counts across all spatial locations for this factor.
      - `TopGene_pval`, `TopGene_fc`, `TopGene_weight` (string): Top marker genes per factor ranked by significance (p-value), fold change, or weight.
  - `pmtiles` (object): Paths to spatial tiling layers for rendering factor maps.
    - `hex_coarse` (string): PMTiles for coarse-resolution hexagonal binning.
      - Data Format: This is a vector PMTiles file in MVT format containing points with the following attributes:
        - `X`: X-coordinate of the hexagon center in micrometers (1 µm corresponds to 1 m in EPSG:3857)
        - `Y`: Y-coordinate of the hexagon center in micrometers (1 µm corresponds to 1 m in EPSG:3857)
        - `topK`: The factor ID that has the highest posterior probability
        - `topP`: The highest posterior probability across all factors
        - factor ID (`0`, `1`, ...): The posterior probability of each factor for the hexagon.
        - Any additional attributes can be included.
    - `hex_fine` (string): PMTiles for fine-resolution hexagonal binning.
      - Data Format: same as `hex_coarse`
    - `raster` (string): PMTiles for full-resolution raster output.
      - Data Format: This is a raster PMTiles file visualizing the pixel-level output in raster format.
    - `cells` (string): PMTiles for cell-segmented analysis output (e.g. by XeniumRanger).
      - Data Format: This is a vector PMTiles file in MVT format containing points.
- About FICTURE: FICTURE models high-resolution spatial gene expression at its original resolution without requiring cell segmentation. It treats the tissue as a continuous field and infers latent spatial factors, i.e. coherent patterns of gene activity across space. These factors represent transcriptional programs or tissue structures and can be projected onto pixel-level data. Besides using the default Latent Dirichlet Allocation, FICTURE can also take clustering results from other tools, such as Seurat, and project them onto the pixel level.
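To show how the factor-set files fit together, the sketch below colors one hexagon by its top factor: it reads the two `rgb` rows from the example above, converts the 0.0 to 1.0 channels to hex colors, and recomputes `topK`/`topP` from the per-factor posterior probabilities. This is only an illustration under stated assumptions: the function names and the hexagon attribute values are hypothetical, and scaling channels by 255 with rounding is one common choice rather than something this specification prescribes.

```python
import csv
import io

# Two rows of an rgb color map, as in the example above (tab-separated).
RGB_TSV = (
    "Name\tColor_index\tR\tG\tB\n"
    "0\t0\t0.66449\t0.08436\t0.00424\n"
    "1\t1\t0.97545\t0.4574\t0.11305\n"
)


def load_colors(tsv_text):
    """Map factor ID -> '#rrggbb', scaling the 0.0-1.0 channels to 0-255."""
    colors = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        channels = [round(float(row[c]) * 255) for c in ("R", "G", "B")]
        colors[row["Name"]] = "#{:02x}{:02x}{:02x}".format(*channels)
    return colors


def top_factor(properties, factor_ids):
    """Return (topK, topP) recomputed from per-factor posterior probabilities."""
    top_k = max(factor_ids, key=lambda k: float(properties[k]))
    return top_k, float(properties[top_k])


colors = load_colors(RGB_TSV)

# Hypothetical attributes of one hexagon feature in a two-factor model.
hexagon = {"X": 1200.0, "Y": 860.0, "0": 0.25, "1": 0.75}
top_k, top_p = top_factor(hexagon, ["0", "1"])
print(top_k, top_p, colors[top_k])
```

In practice `topK` and `topP` are already stored as attributes, so recomputing them is mainly useful as a consistency check or when restricting the display to a subset of factors.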
- Purpose: Stores aggregated spatial gene expression data, often binned or processed for efficient visualization and querying across the spatial extent of the tissue.
- YAML Fields:

      sge:
        all: genes_all.pmtiles
        bins:
          - genes_bin1.pmtiles
          ...
          - genes_bin50.pmtiles
        counts: genes_bin_counts.json

  - `all` (string): Path to a PMTiles file representing all gene expression data.
    - Data Format: This is a vector PMTiles file in MVT format containing points with the following attributes:
      - `X`: X-coordinate of the transcript/molecule in micrometers (1 µm corresponds to 1 m in EPSG:3857)
      - `Y`: Y-coordinate of the transcript/molecule in micrometers (1 µm corresponds to 1 m in EPSG:3857)
      - `gene`: The gene/feature name associated with the transcript/molecule
      - `count`: The observed count of the gene/feature
      - `{decode_id}_K1`: The top spatial factor associated with the transcript, inferred by the FICTURE analysis corresponding to `decode_id`.
      - `{decode_id}_P1`: The posterior probability of the top spatial factor associated with the transcript, inferred by the FICTURE analysis corresponding to `decode_id`.
      - ... (multiple `{decode_id}_K1` and `{decode_id}_P1` attributes may exist if multiple FICTURE analysis results exist)
      - Any additional attributes can be included.
  - `bins` (list of strings): A list where each string is a path to a PMTiles file. Each file typically represents a specific bin of genes. These files are optional and duplicate information in `all`. They are useful for showing the detailed spatial distribution of rarely expressed genes, alleviating the undersampling that occurs when viewing a specific gene at coarse zoom levels.
    - Data Format: Identical to `all`.
  - `counts` (string): Path to a JSON file that contains the gene name, gene count, and gene bin ID for each gene.
    - Data Format:

          [
            { "gene": "ERBB2", "count": 2164898, "bin": 1 },
            { "gene": "LUM", "count": 1087247, "bin": 2 },
            ...
          ]

      - `gene` (string): Gene names.
      - `count` (integer): Total UMI count of the gene.
      - `bin` (integer): Gene bin ID.
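The `counts` file is what lets a viewer decide which `bins` file to open for a given gene. The sketch below shows one way to do that lookup. It rests on assumptions that are not stated in this specification: that a bin ID corresponds to the number embedded in a `genes_bin<N>.pmtiles` file name, and that falling back to the `all` file is acceptable when no matching bin file is listed. The function names are hypothetical.

```python
import json
import re

# Excerpt of a counts file, as in the example above.
COUNTS_JSON = """
[
  {"gene": "ERBB2", "count": 2164898, "bin": 1},
  {"gene": "LUM", "count": 1087247, "bin": 2}
]
"""

# Only the two bin files named explicitly in the catalog snippet.
BIN_FILES = ["genes_bin1.pmtiles", "genes_bin50.pmtiles"]


def index_bins(bin_files):
    """Map bin ID -> file, assuming the ID is the number in 'bin<N>.pmtiles'."""
    index = {}
    for name in bin_files:
        match = re.search(r"bin(\d+)\.pmtiles$", name)
        if match:
            index[int(match.group(1))] = name
    return index


def pmtiles_for_gene(gene, counts, bin_index, fallback="genes_all.pmtiles"):
    """Pick the bin file holding `gene`, or the `all` file if none is known."""
    by_gene = {record["gene"]: record for record in counts}
    record = by_gene.get(gene)
    if record is None:
        return fallback  # gene not listed in the counts file
    return bin_index.get(int(record["bin"]), fallback)


counts = json.loads(COUNTS_JSON)
bin_index = index_bins(BIN_FILES)
print(pmtiles_for_gene("ERBB2", counts, bin_index))  # bin 1 file is listed
print(pmtiles_for_gene("LUM", counts, bin_index))    # bin 2 file not listed here
```

Falling back to the `all` file keeps the lookup safe for catalogs that omit the optional `bins` entries entirely.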
Currently, an example dataset for the Xenium Human Breast Cancer dataset is available at Zenodo with record ID 15649152. We expect that many more datasets will become available at the AWS Open Data Registry.
While anyone can prepare their own spatial transcriptomic data analysis following this spec, the cartloader software is designed to help investigators produce CartoStore-compatible datasets.