-
Notifications
You must be signed in to change notification settings - Fork 2
Description
Hi, first of all, big thanks for the great software!
I've been trying to get UTR predictions using --addUTR in braker1 and got corrupted gene and transcript rows; the start/end positions of the gene/transcripts are not correct. There were 82 cases like this in my ~20,000 gene set. The results of a test run using the test files from the GitHub repository also have this error, so I presumed it was a GUSHR error. I'm not fluent in python so I just filed an issue.
Here's some genes from the test run that have this error. You can see that the gene and transcript rows edited by GUSHR has mistaken start and stop positions.
% perl -lane 'print if @F[3]>@F[4] && @F[2] eq "gene"' ~/temp/utrs.gtf | cut -f 9 | sort | uniq | grep -f - ~/temp/utrs.gtf
2R GUSHR gene 10078919 8467054 . + . g693
2R GUSHR transcript 10078919 8467054 . + . g693.t1
2R AnnotationFinalizer 5'-UTR 8467039 8467054 . + . transcript_id "g693.t1"; gene_id "g693";
2R AnnotationFinalizer 5'-UTR 10078919 10079011 . + . transcript_id "g693.t1"; gene_id "g693";
2R AUGUSTUS start_codon 10079012 10079014 . + 0 transcript_id "g693.t1"; gene_id "g693";
2R AUGUSTUS CDS 10079012 10079182 1 + 0 transcript_id "g693.t1"; gene_id "g693";
2R AUGUSTUS CDS 10081507 10081632 1 + 0 transcript_id "g693.t1"; gene_id "g693";
2R AUGUSTUS CDS 10082642 10082967 1 + 0 transcript_id "g693.t1"; gene_id "g693";
2R AUGUSTUS CDS 10083030 10083134 1 + 1 transcript_id "g693.t1"; gene_id "g693";
2R AUGUSTUS CDS 10083262 10083326 1 + 1 transcript_id "g693.t1"; gene_id "g693";
2R AUGUSTUS CDS 10083404 10083493 1 + 2 transcript_id "g693.t1"; gene_id "g693";
2R AUGUSTUS CDS 10083574 10083938 1 + 2 transcript_id "g693.t1"; gene_id "g693";
2R AUGUSTUS stop_codon 10083936 10083938 . + 0 transcript_id "g693.t1"; gene_id "g693";
2R AnnotationFinalizer 3'-UTR 10083939 10084210 . + . transcript_id "g693.t1"; gene_id "g693";
2R GUSHR gene 10197107 9752095 . + . g769
2R GUSHR transcript 10197107 9752095 . + . g769.t1
2R AnnotationFinalizer 5'-UTR 9752086 9752095 . + . transcript_id "g769.t1"; gene_id "g769";
2R AnnotationFinalizer 5'-UTR 10197107 10197116 . + . transcript_id "g769.t1"; gene_id "g769";
2R AnnotationFinalizer 5'-UTR 10648880 10648911 . + . transcript_id "g769.t1"; gene_id "g769";
2R AUGUSTUS start_codon 10648912 10648914 . + 0 transcript_id "g769.t1"; gene_id "g769";
2R AUGUSTUS CDS 10648912 10649625 1 + 0 transcript_id "g769.t1"; gene_id "g769";
2R AUGUSTUS stop_codon 10649623 10649625 . + 0 transcript_id "g769.t1"; gene_id "g769";
2R AnnotationFinalizer 3'-UTR 10649626 10649826 . + . transcript_id "g769.t1"; gene_id "g769";
2R GUSHR gene 10005157 9999915 . - . g674
2R GUSHR transcript 10005157 9999915 . - . g674.t1
2R AnnotationFinalizer 3'-UTR 9995034 9996490 . - . transcript_id "g674.t1"; gene_id "g674";
2R AUGUSTUS stop_codon 9996491 9996493 . - 0 transcript_id "g674.t1"; gene_id "g674";
2R AUGUSTUS CDS 9996491 9996621 1 - 2 transcript_id "g674.t1"; gene_id "g674";
2R AUGUSTUS CDS 9997704 9997889 1 - 2 transcript_id "g674.t1"; gene_id "g674";
2R AUGUSTUS CDS 9998271 9998346 1 - 0 transcript_id "g674.t1"; gene_id "g674";
2R AUGUSTUS CDS 9998415 9998562 1 - 1 transcript_id "g674.t1"; gene_id "g674";
2R AUGUSTUS CDS 9998707 9998802 1 - 1 transcript_id "g674.t1"; gene_id "g674";
2R AUGUSTUS CDS 9999425 9999525 1 - 0 transcript_id "g674.t1"; gene_id "g674";
2R AUGUSTUS CDS 9999734 9999915 1 - 2 transcript_id "g674.t1"; gene_id "g674";
2R AUGUSTUS CDS 10005157 10005370 0.91 - 0 transcript_id "g674.t1"; gene_id "g674";
2R AUGUSTUS start_codon 10005368 10005370 . - 0 transcript_id "g674.t1"; gene_id "g674";
2R AnnotationFinalizer 5'-UTR 10005371 10005377 . - . transcript_id "g674.t1"; gene_id "g674";
2R AnnotationFinalizer 5'-UTR 10024051 10024202 . - . transcript_id "g674.t1"; gene_id "g674";
2R AnnotationFinalizer 5'-UTR 10025850 10026019 . - . transcript_id "g674.t1"; gene_id "g674";
After meddling around with gushr.py
, I noticed at line 680 and 684 (and also in 693, 697), the both two values being compared were str variables, not int. Adding int() to these rows and fixing line 551 (A space was added before the tabular) resulted in a perfect test run gushr.py -t example/augustus.gtf -b example/RNAseq.bam -g example/genome.fa -o utrs2
and braker --addUTR run.