Skip to content

Gene length or exon length? #1

@Rohit-Satyam

Description

@Rohit-Satyam

I was going through your code for calculating feature length but I see you use exon rather than the gene field of the gtf. I was trying to use this code for NOISeq so was wondering which is the right way. I see gene would be more appropriate, wouldn't it?

GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon")

Besides elementNROWS is now used instead of elementLengths. I understand what you wrote was almost a decade ago but people refer to your post regularly as they are good and reliable.

library(GenomicRanges)
library(rtracklayer)
library(Rsamtools)
GTFfile = "../../resource/gencode.v45.primary_assembly.annotation.gtf"
FASTAfile = "../../resource/GRCh38.primary_assembly.genome.fa"

GTF <- import.gff(GTFfile, format="gtf", genome="GRCh38", feature.type="gene")
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
reducedGTF <- unlist(grl, use.names=T)
elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementNROWS(grl))

#Open the fasta file
FASTA <- FaFile(FASTAfile)
open(FASTA)

#Add the GC numbers
elementMetadata(reducedGTF)$nGCs <- letterFrequency(getSeq(FASTA, reducedGTF), "GC")[,1]
elementMetadata(reducedGTF)$widths <- width(reducedGTF)

#Create a list of the ensembl_id/GC/length
calc_GC_length <- function(x) {
  nGCs = sum(elementMetadata(x)$nGCs)
  width = sum(elementMetadata(x)$widths)
  c(width, nGCs/width)
}
output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_GC_length))
colnames(output) <- c("Length", "GC")

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions