Skip to content

Beth526/scRNAseq_alt_transcripts

Repository files navigation

scRNAseq_alt_transcripts

The goal was to see if alternate poly-A usage could be estimated from 3' Chromium scRNA-seq datasets, and if it would be informative for clustering or cluster markers. Work in progress!

Test data used: 10x Chromium 10,000 PBMC dataset (Single Cell 3’ v.3, CellRanger v.4.0.0). Both the filtered .h5 file and the CellRanger-processed .bam file are needed for this analysis.

Note: CellRanger v4 uses STAR to align the reads, for a read to be counted for a gene it must overlap an exon by 50% AND be consistant with an annotated transcript. CellRanger v5 has a new feature to count reads from introns and all reads from the correct strand across the length of the gene. CellRanger bam files also contain corrected UMI and barcodes.

A putative alterate transcript marker discovered: RAB7A_alt

The functions are in the sc_altpolya.py file that can be used as a module

import sc_altpolya as scapa
top_genes, filtered_barcodes = scapa.get_top_genes_and_barcodes_list(h5_file_name, n_genes=100)

Gets the top genes by counts and the filtered barcodes list from the 10X filtered .h5 file. Probably the first 100 will be mostly mitochondrial and ribosomal proteins, so I would change n_genes to a higher number, and then discard the first 100 or so, or specifically remove housekeeper genes.

top genes head

df = scapa.gtf_top_genes(top_genes, gtf_file)

Gets chrom, strand and exon positions set for each gene in top_genes or any dataframe with ensembl gene ids of interest in an 'ids' column Slow

genes with gtf info head

df = scapa.bam_top_genes(df, filtered_barcodes, bam_file)

Considering gene strand, finds all reads that start between the first exon position and the last. Finds read stop positions and only retains furthest 3' read-stop for each UMI group. Allows for read_stops to occur up to 300kbp after last exon position.

genes with read_stop info head

counts, summary = scapa.count_and_summarize_peaks(df, num_peaks_exon = 3, num_peaks_other = 3)

Uses a gaussian mixture model to find peaks (up to the number specified) of read_stops for

  1. exons: read_stops occuring inside exons. The introns are removed and exons juxtaposed, so that the peaks are detected along the coding sequence of the gene.
  2. others: read_stops occuring outside exons. This includes introns, and genomic sequence outside the gene, up to 300kbp downstream.

Below is the counts and summary table. (Note that both will have gene names in the alt transcript names not ensembl ids, the screenshot for counts is old) counts table head

summary table head

counts, summary = scapa.select_alt_transcripts(counts, summary, min_per_gene = 2, max_within_gene_correlation = 1, count_greater_than_std = True)

Filter the counts and summary dataframes. Can remove peaks/alt transcripts if less than 2 were found for a gene (a single alt transcript will be highly correlated with parent gene counts), remove alt transcript if counts was less than standard deviation (wide sparse peaks are more likely to be noise) and remove alt transcript if minimmum intra gene correlation was above a certain level (not informative).

corr_matrix = scapa.peak_correlations_for_gene(base_gene_id, counts)

A correlation matrix for all alternate transcripts of a given gene

min_corr_table = scapa.min_correlations_by_gene(counts, summary)

A table of the minium intragene correlations for alternate transcripts of every gene

scapa.graph_exons(gene_name, df, num_peaks = 3, num_bins = 100)

Produce 3 histogram graphs for read_stops inside exons before and after assignment by GMM algorithm. Note that the GMM model is non-deterministic so sometimes the results will difer. Running this function repeatedly with a different num_peaks may give an idea of the best num_peaks parameter for the count_and_summarize_peaks() function.

3 graphs

  1. read_stops in exons, genomic location histogram
  2. read_stops in exons, coding sequence location histogram
  3. num_peaks histograms for each of the peaks found
 scapa.graph_others(gene_name, df, num_peaks = 3, num_bins = 100)

Produce 3 histogram graphs for read_stops outside exons before and after assignment by GMM algorithm.

3 graphs

  1. read_stops outside exons genomic location histogram
  2. read_stops outside exons genomic location histogram, x-axis limited to gene
  3. num_peaks histograms for each of the peaks found

Example of graph number 3 (peaks found) example

scapa.new_h5(counts, summaries, h5_dset_file, new_dset_file)

Make a new .h5 file containing all the information from the original .h5 file but with the new alternate trascript counts added. Slow


The genes ranked top 100 - 1100 from the PBMC dataset were put through this pipeline and then an h5 file with the alternate transcripts included was analyzed using Seurat alongside the original h5 file.

The additional correlated genes biased the principal components, giving more power to genes with many highly correlated transcripts in clustering.

However, some of the alternate transcripts were unique cluster markers, meaning that thier parent gene and other alternate transcripts were not also a marker for the cluster. There were on average 19 alternate transcript cluster markers per cluster.

Clustering map of PBMC Clusters

One example of a unique alternate transcript cluster marker was RAB7A RAB7A RAB7A_alt

The most interesting finding was a TRAC intronic read_stop (potential alternate poly A site) site highly uncorrelated with other TRAC transcripts TRAC_exon TRAC_intron

About

Finding and analyzing alternate poly-A sites in 3' scRNA-seq 10X files

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published