This repository contains the code to reproduce all analyses and figures from the manuscript "High-resolution spatial mapping of cell state and lineage dynamics in vivo with PEtracer".
conda env create --file environment.yml
conda activate petracer
ipython kernel install --user --name petracer
The environment.lock.yml file can be used to recreate the environment with the exact package versions used in the paper.
Image processing was performed on a Linux HPC cluster with the following software installed:
- SLURM v23.11.5
- Python v3.11.10
- Cellpose v3.0.0
- Deconwolf v0.4.5
- Proseg v1.1.3
- Fishtank v0.0.1
- MERlin v0.1.8
- MERFISH_probe_design v0.0.1
- Processed data is available on Figshare
- Single-cell RNA-seq data is available on GEO
- All other sequencing data is available on SRA
The simulation directory contains code for simulating lineage tracing data with a variety of parameters. To run simulations:
python simulation/simulate.py
To generate simulation plots:
python simulation/plot.py
The insertvariants and RTT_optimization directories contain code for processing and analyzing amplicon sequencing data used to select edit sites and optimize editing strategies for the PEtracer system.
Sequencing data was processed on a Linux HPC cluster with SLURM, Python 3.11, and CRISPResso 2.2.7 installed. Processed files can be generated by running
./crispresso.sh
Rscript ../../scripts/make_CRISPResso_summary.R ./ CRISPResso_summary.txt
after downloading the fastq files listed in manifest.txt from SRA and placing them in the strategy_selection/insertvariants/fastq directory.
plot.ipynb - generate strategy selection plots.
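Each sequencing section in this README expects the fastq files listed in manifest.txt to be in place before processing starts. The following is a minimal sketch of a pre-flight check, not part of the repository. It assumes (hypothetically) that manifest.txt has one entry per line with the file name as the first whitespace-separated field; the function name `missing_fastqs` is illustrative, so adjust the parsing to the actual manifest layout.

```python
from pathlib import Path


def missing_fastqs(manifest_path, fastq_dir):
    """Return manifest entries that have no matching file in fastq_dir.

    Assumption: one entry per line, file name in the first
    whitespace-separated field; blank lines and '#' comments are skipped.
    """
    fastq_dir = Path(fastq_dir)
    missing = []
    for line in Path(manifest_path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name = line.split()[0]
        if not (fastq_dir / name).exists():
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Run from a directory that contains manifest.txt and a fastq/ folder.
    manifest = Path("manifest.txt")
    if manifest.exists():
        absent = missing_fastqs(manifest, "fastq")
        print(f"{len(absent)} file(s) missing from fastq/")
        for name in absent:
            print("  ", name)
```

Running this before ./crispresso.sh or an sbatch submission surfaces incomplete downloads early, rather than partway through a cluster job.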
The insert_selection directory contains code for processing and analyzing target site sequencing data used to determine the installation efficiencies of all 1024 5nt insertions for each edit site.
- Sequencing data was processed on a Linux HPC cluster with SLURM, Python 3.11, and CRISPResso 2.2.7 installed. Processed files can be generated by running
./crispresso.sh
after downloading the fastq files listed in manifest.txt from SRA and placing them in the insert_selection/fastq directory.
- aggregate_crispresso.ipynb - aggregate CRISPResso output files for all sites.
- crosshyb.py - estimate 5nt insert cross-hybridization
To generate insert selection plots:
python insert_selection/plot.py
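The figure of 1024 insertions quoted above is the number of distinct 5-nucleotide DNA sequences (4^5). As a quick illustration, independent of the repository code, the full set can be enumerated with the standard library; the names `all_inserts`, `NUCLEOTIDES`, and `INSERT_LENGTH` are illustrative only.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
INSERT_LENGTH = 5


def all_inserts(length=INSERT_LENGTH):
    """Return every DNA sequence of the given length, in alphabetical order."""
    return ["".join(bases) for bases in product(NUCLEOTIDES, repeat=length)]


inserts = all_inserts()
print(len(inserts))  # 1024
print(inserts[0], inserts[-1])  # AAAAA TTTTT
```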
The insert_validation directory contains code for processing and analyzing amplicon sequencing data used for arrayed validation of the top 5nt insertions for each edit site.
Sequencing data was processed on a Linux HPC cluster with SLURM, Python 3.11, and CRISPResso 2.2.7 installed. Processed files can be generated by running
./crispresso.sh
Rscript ../scripts/make_CRISPResso_summary.R ./ CRISPResso_summary.txt
after downloading the fastq files listed in manifest.txt from SRA and placing them in the insert_validation/fastq directory.
plot.ipynb - generate arrayed validation plots for top 5nt insertions.
The orthogonalization directory contains code for processing and analyzing amplicon sequencing data used for validating orthogonalized versions of the RNF2, HEK3, and EMX1 edit sites.
Sequencing data was processed on a Linux HPC cluster with SLURM, Python 3.11, and CRISPResso 2.2.7 installed. Processed files can be generated by running
./crispresso.sh
Rscript ../scripts/make_CRISPResso_summary.R ./ CRISPResso_summary.txt
after downloading the fastq files listed in manifest.txt from SRA and placing them in the orthogonalization/fastq directory.
plot.ipynb - generate orthogonalization plots.
The orthogonal_insert_validation directory contains code for processing and analyzing amplicon sequencing data used for validating the top 20 5nt insertions at each orthogonalized edit site.
Sequencing data was processed on a Linux HPC cluster with SLURM, Python 3.11, and CRISPResso 2.2.7 installed. Processed files can be generated by running
./crispresso.sh
Rscript ../scripts/make_CRISPResso_summary.R ./ CRISPResso_summary.txt
after downloading the fastq files listed in manifest.txt from SRA and placing them in the orthogonal_insert_validation/fastq directory.
plot.ipynb - generate plots for top 20 5nt insertions.
The peg_arrays directory contains code for processing and analyzing target site sequencing data used to determine the LM installation balance for various pegArrays.
Sequencing data was processed on a Linux HPC cluster with SLURM, Python 3.11, and CRISPResso 2.2.7 installed. Processed files can be generated by running
sbatch peg_arrays/crispresso.slurm
python peg_arrays/count_alleles.py
after downloading the fastq files listed in manifest.txt from SRA and placing them in the peg_arrays/fastq directory.
To generate pegArray plots:
python peg_arrays/plot.py
The kinetics directory contains code for processing and analyzing 10x data for 4T1 and B16 cells transduced with a library of pegRNA variants to test editing kinetics.
10x data was processed on a Linux HPC cluster with SLURM, Python 3.11, and Cellranger 7.1.0 installed. Processed files can be generated with the following steps:
- Run Cellranger and call alleles using bam files.
sbatch kinetics/cellranger.slurm
sbatch kinetics/call_alleles.slurm
- process_4T1_10x.ipynb - perform quality control, call pegRNA variants, and determine edit fraction for 4T1 cells.
- process_B16F10_10x.ipynb - perform quality control, call pegRNA variants, and determine edit fraction for B16F10 cells.
These steps require first downloading the 10x fastq files listed in manifest.txt from GEO and placing them in the kinetics/fastq directory.
All kinetics analysis and plots can be generated by running
python kinetics/estimate_rate.py
python kinetics/plot.py
after processing the raw data or downloading the processed files from Figshare and placing them in the kinetics/data directory:
- 4T1_kinetics_alleles.csv
- 4T1_kinetics_cells.csv
- 4T1_kinetics.h5ad
- B16F10_kinetics_alleles.csv
- B16F10_kinetics_cells.csv
- B16F10_kinetics.h5ad
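To give a sense of how an edit fraction relates to an editing rate, here is a minimal sketch under a simple assumption that is not taken from this repository: each site is edited irreversibly at a constant rate k, so the edited fraction after time t is 1 - exp(-k t). The model actually used in estimate_rate.py may differ; the function names below are hypothetical.

```python
import math


def edit_rate(edit_fraction, elapsed_time):
    """Rate k implied by an edited fraction under f(t) = 1 - exp(-k * t).

    This constant-rate model is an assumption made for illustration only.
    """
    if not 0.0 <= edit_fraction < 1.0:
        raise ValueError("edit_fraction must be in [0, 1)")
    if elapsed_time <= 0:
        raise ValueError("elapsed_time must be positive")
    # log1p(-f) == log(1 - f), but is more accurate for small f.
    return -math.log1p(-edit_fraction) / elapsed_time


def expected_fraction(rate, elapsed_time):
    """Edited fraction predicted by the same model after elapsed_time."""
    # -expm1(-x) == 1 - exp(-x), but is more accurate for small x.
    return -math.expm1(-rate * elapsed_time)
```

The rate is expressed per unit of whatever time scale is passed in, so the two functions must be called with consistent units.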
Jupyter notebooks detailing the design of integration barcodes (intBCs) are in the design_intBC directory.
Code detailing MERFISH and PEtracer probe design is in the design_probes directory. This step requires the MERFISH_probe_design package to be installed.
The image_processing directory contains code for processing imaging data. Raw imaging files are not publicly available due to their size, but the code can be used to process other imaging data in the same format. Processed files for each experiment (e.g. 241213_F320-4-3_MF4++) can be generated with the following steps:
- Nuclei segmentation using Cellpose and Deconwolf
sbatch image_processing/241213_F320-4-3_MF4++/Scripts/cellpose.slurm
- MERFISH transcript decoding using MERLin
Download MERLin v0.1.8, the version used here.
Install MERLin by:
conda create -n merlin_py310 python=3.10
conda activate merlin_py310
conda install h5py rtree pytables setuptools urllib3 python-dotenv pandas tifffile
conda install scikit-image scikit-learn scipy matplotlib networkx seaborn
conda install pytest pytest-cov numexpr cython requests boto3 xmltodict google-cloud-storage docutils pillow
pip install opencv-python pyqt5 sphinx-rtd-theme snakemake pyclustering tables cellpose
pip install -e MERLin
Test that the installation works by:
merlin -h
When using MERLin for the first time, configure it by:
merlin --configure .
Then follow the on-screen instructions.
Run MERLin. Example command:
merlin -a 20241007-MF4_TestPreprocess.json \
  -o 20240812-MF4_16bit.csv \
  -c MF4dna_codebook.csv \
  -m merscope01_microscope.json \
  -p 20240812_positions.txt \
  -e /lab/weissman_imaging/puzheng/4T1Tumor \
  -s /lab/weissman_imaging/puzheng/MERFISH_analysis/4T1 \
  -k run_MF4_cellpose.json \
  -n 2 \
  --no_report True \
  20240812-F319-12-0807_MF4dna-mCh
Example parameter files are provided in the merlin_parameters folder. Keep its subfolder structure intact and, in the configuration step, set PARAMETER_HOME to the absolute path of this merlin_parameters folder.
- Assignment of cytoplasmic transcripts to nuclei using Proseg
sbatch image_processing/241213_F320-4-3_MF4++/Scripts/proseg.slurm
- Alignment of MERFISH and lineage imaging data using fishtank
sbatch image_processing/241213_F320-4-3_MF4++/Scripts/align_experiments.slurm
- T7 amplicon detection and quantification using fishtank
sbatch image_processing/241213_F320-4-3_MF4++/Scripts/detect_spots.slurm
- T7 amplicon decoding and cell assignment using fishtank
sbatch image_processing/241213_F320-4-3_MF4++/Scripts/decode_spots.slurm
This process was repeated for each imaging experiment, except for experiments without MERFISH data, which only required steps 1, 5, and 6 (nuclei segmentation, T7 amplicon detection, and T7 amplicon decoding).
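The step selection described above can be sketched as a small dry-run helper that prints the sbatch commands for one experiment. This is an illustration rather than part of the repository: the `Step` class, `STEPS` table, and `planned_commands` function are hypothetical names, and the sketch only prints commands without submitting anything or handling job dependencies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    number: int
    description: str
    script: Optional[str]  # None when the step is run with the merlin CLI
    needs_merfish: bool


# Order and script names follow the list of steps in this README.
STEPS = [
    Step(1, "Nuclei segmentation (Cellpose, Deconwolf)", "cellpose.slurm", False),
    Step(2, "MERFISH transcript decoding (MERLin)", None, True),
    Step(3, "Transcript assignment to nuclei (Proseg)", "proseg.slurm", True),
    Step(4, "MERFISH and lineage alignment (fishtank)", "align_experiments.slurm", True),
    Step(5, "T7 amplicon detection (fishtank)", "detect_spots.slurm", False),
    Step(6, "T7 amplicon decoding (fishtank)", "decode_spots.slurm", False),
]


def planned_commands(experiment, has_merfish=True):
    """List the commands for one experiment, skipping MERFISH-only steps."""
    commands = []
    for step in STEPS:
        if step.needs_merfish and not has_merfish:
            continue
        if step.script is None:
            commands.append(f"# step {step.number}: {step.description}, run merlin manually")
        else:
            commands.append(f"sbatch image_processing/{experiment}/Scripts/{step.script}")
    return commands


for command in planned_commands("241213_F320-4-3_MF4++", has_merfish=False):
    print(command)
```

With `has_merfish=False` only the cellpose, detect_spots, and decode_spots scripts are listed, matching steps 1, 5, and 6.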
The preedited directory contains code for processing and analyzing 10x and imaging-based readout of lineage tracing data from cells with predefined linkage between intBCs and lineage marks.
Imaging data was processed as described in the "Image processing" section. 10x data was processed on a Linux HPC cluster with SLURM, Python 3.11, and Cellranger 7.1.0 installed. Processed files can be generated with the following steps:
- For 10x data, run Cellranger and call alleles using bam files.
sbatch preedited/cellranger.slurm
sbatch preedited/call_alleles.slurm
- process_10x_invitro.ipynb - perform quality control for 10x in vitro data.
- process_merfish_invitro.ipynb - perform quality control for imaging in vitro data.
- process_merfish_zombie.ipynb - perform quality control for imaging in vitro data using the zombie protocol.
- process_merfish_invivio.ipynb - perform quality control for imaging in vivo data.
The 10x steps require first downloading the 10x fastq files listed in manifest.txt from GEO and placing them in the preedited/fastq directory.
All preedited analysis and plots can be generated by running
python preedited/plot.py
after processing the raw data or downloading the processed files from Figshare and placing them in the preedited/data directory:
- preedited_10x_invitro_alleles.csv
- preedited_10x_invitro.h5ad
- preedited_merfish_invitro_alleles.csv
- preedited_merfish_invitro_cells.json
- preedited_merfish_invivo_alleles.csv
- preedited_merfish_invivo_cells.json
- preedited_merfish_zombie_alleles.csv
- preedited_merfish_zombie_cells.json
The barcoded_tracing directory contains code for processing and analyzing 10x single-cell lineage tracing for clones with puro and blast-linked static barcodes serving as independent validation of phylogenetic relationships.
Data processing was performed on a Linux HPC cluster with SLURM, Python 3.11, and Cellranger 7.1.0 installed. Processed files can be generated with the following steps:
- Run Cellranger and call alleles using bam files.
sbatch barcoded_tracing/cellranger.slurm
sbatch barcoded_tracing/call_alleles.slurm
- process_10x.ipynb - performs quality control, phylogenetic reconstruction, and processing of barcode data.
These steps require first downloading the files listed in manifest.txt from GEO and placing them in the barcoded_tracing/fastq directory.
All barcoded tracing analysis and plots can be generated by running
python barcoded_tracing/evaluate.py
python barcoded_tracing/plot.py
after processing the raw data or downloading the processed files from Figshare and placing them in the barcoded_tracing/data directory:
- barcoded_tracing_clone_1.h5td
- barcoded_tracing_clone_2.h5td
- barcoded_tracing_clone_3.h5td
- barcoded_tracing_clone_4.h5td
- barcoded_tracing_clone_5.h5td
- barcoded_tracing_clone_6.h5td
- barcoded_tracing_alleles.csv
The colony_tracing directory contains code for processing and analyzing single-cell lineage tracing from colonies generated by sparsely seeding 4T1 cells onto a coverslip.
After processing raw images as described in the "Image processing" section, the colony_process_lineage.ipynb notebook was used to segment colonies, perform quality control, and reconstruct phylogenies.
All colony plots can be generated by running
python colony_tracing/plot.py
after the following files from Figshare are downloaded and placed in the colony_tracing/data directory:
- colony_tracing.h5td
- colony_polygons.json
The invitro_heterogeneity directory contains code for processing and analyzing single-cell data characterizing in vitro transcriptional heterogeneity in engineered 4T1 cells used to seed tumors.
Data processing was performed on a Linux HPC cluster with SLURM, Python 3.11, and Cellranger 7.1.0 installed. Processed files can be generated with the following steps:
- Run Cellranger and call alleles using bam files.
sbatch invitro_heterogeneity/cellranger.slurm
sbatch invitro_heterogeneity/call_alleles.slurm
- process_10x.ipynb - performs quality control and clustering.
These steps require first downloading the files listed in manifest.txt from GEO and placing them in the invitro_heterogeneity/fastq directory.
All in vitro heterogeneity analysis and plots can be generated by running
python invitro_heterogeneity/plot.py
after processing the raw data or downloading the processed files from Figshare and placing them in the invitro_heterogeneity/data directory:
- 4T1_invitro.h5ad
The tumor_tracing directory contains code for processing and analyzing single-cell transcriptomic and lineage tracing data from the 4T1 syngeneic mouse model of tumor metastasis.
After processing raw images as described in the "Image processing" section, the following notebooks were used to generate the mouse 1 data:
- M1_resolVI_training.ipynb - trains resolVI model to classify cell types and filter out doublets.
- M1_process_MERFISH.ipynb - performs quality control and annotation of the MERFISH data.
- M1_segment_tumors.ipynb - aligns tumor sections, segments tumors, and calculates spatial statistics.
- M1_process_lineage.ipynb - performs quality control of lineage data and reconstructs phylogenies.
The same process was repeated for the mouse 2 and 3 data, except a new resolVI model was not trained for mouse 2 since the library is shared with mouse 1.
All tumor plots can be generated by running
python tumor_tracing/plot.py
after the following files from Figshare are downloaded and placed in the tumor_tracing/data directory:
- 10x_4T1_primary.h5ad
- M1_tumor_tracing.h5td
- M1_polygons_grid.json
- M2_tumor_tracing.h5td
- M2_polygons.json
- M3_tumor_tracing.h5td
- M3_polygons_grid.json
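Before running the plot.py scripts, it can help to confirm that the processed files from Figshare are where each script expects them. The sketch below is not part of the repository; it covers a subset of the directories listed in this README, using the file names given above, and the mapping can be extended to the others. The names `EXPECTED_FILES` and `missing_files` are illustrative.

```python
from pathlib import Path

# File names copied from the lists in this README (subset of directories).
EXPECTED_FILES = {
    "kinetics/data": [
        "4T1_kinetics_alleles.csv",
        "4T1_kinetics_cells.csv",
        "4T1_kinetics.h5ad",
        "B16F10_kinetics_alleles.csv",
        "B16F10_kinetics_cells.csv",
        "B16F10_kinetics.h5ad",
    ],
    "colony_tracing/data": ["colony_tracing.h5td", "colony_polygons.json"],
    "invitro_heterogeneity/data": ["4T1_invitro.h5ad"],
    "tumor_tracing/data": [
        "10x_4T1_primary.h5ad",
        "M1_tumor_tracing.h5td",
        "M1_polygons_grid.json",
        "M2_tumor_tracing.h5td",
        "M2_polygons.json",
        "M3_tumor_tracing.h5td",
        "M3_polygons_grid.json",
    ],
}


def missing_files(repo_root="."):
    """Return relative paths of expected processed files that are absent."""
    root = Path(repo_root)
    return [
        f"{directory}/{name}"
        for directory, names in EXPECTED_FILES.items()
        for name in names
        if not (root / directory / name).is_file()
    ]


if __name__ == "__main__":
    # Run from the repository root.
    for path in missing_files():
        print("missing:", path)
```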