Skip to content

Commit fd8f8ce

Browse files
author
xzt41
authored
Merge pull request #6 from xzt41/master
update to 1.2.10
2 parents 0b63516 + 0970a6c commit fd8f8ce

File tree

7 files changed

+126
-72
lines changed

7 files changed

+126
-72
lines changed

ChangeLog.rst

Lines changed: 6 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,12 @@ Changes
44
todo in future versions:
55
- Supporting for *de novo* GTF file (no annotated start and stop codons).
66

7+
v1.2.10 (2018.3.25)
8+
-----------------
9+
- Add "-i" option to metaplots command for multiple input files
10+
- Supporting for non-stranded library sequencing
11+
- Update the document
12+
713
v1.2.9 (2018.3.5)
814
-----------------
915
- Small bug fixed in metaplots.
@@ -35,7 +41,6 @@ v1.2.5 (2017.5.12)
3541
v1.2.4 (2017.5.4)
3642
-----------------
3743
- Add support for outputting ORF results in gtf format.
38-
- Add support for outputting ORF results in gtf format.
3944
- Fix a bug where some ORFs' genomic coordinates are wrong.
4045
- Other optimizations on code and documents.
4146

README.rst

Lines changed: 15 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -159,6 +159,8 @@ Please make sure the path and file name are correct.
159159
STAR --runThreadN 8 --runMode genomeGenerate --genomeDir <hg19_STARindex>
160160
--genomeFastaFiles <hg19_genome.fa> --sjdbGTFfile <gencode.v19.annotation.gtf>
161161
162+
.. _STAR:
163+
162164
(2). Alignment:
163165

164166
.. code-block:: bash
@@ -186,17 +188,22 @@ Please make sure the path and file name are correct.
186188
This step will generate two files: a PDF file which plots the aggregate profiles of the distance from the 5'-end
187189
of reads to the annotated start codons (or stop codons). The P-site config file, which defines the read lengths with
188190
strong 3-nt periodicity and the associated P-site locations for each length. Users can check the P-site periodicity
189-
or modify this file according the plots in PDF file.
191+
or modify this file according the plots in PDF file. In some cases, user may have multiple bam files to predict ORFs
192+
together in next step, they can use "-i" argument to specify a text file which contain the names of these bam files (
193+
one per line)
194+
195+
.. _RiboCode:
190196

191197
(3). Detecting translated ORFs using the ribosome-profiling data:
192198

193199
.. code-block:: bash
194200
195-
RiboCode -a <RiboCode_annot> -c <config.txt> -l no -o <RiboCode_ORFs_result>
201+
RiboCode -a <RiboCode_annot> -c <config.txt> -l no -g -o <RiboCode_ORFs_result>
196202
197203
198204
Using the config file generated by last step to specify the information of the bam file and P-site parameters,
199-
or refer to the example file `config.txt`_ in data folder.
205+
or refer to the example file `config.txt`_ in data folder. The "gtf" or "bed" format file of predicted ORFs can
206+
be obtained by adding the "-g" or "-b" argument to this command.
200207

201208
**Explanation of final result files**
202209

@@ -260,6 +267,8 @@ Please make sure the path and file name are correct.
260267
ORFcount -g <RiboCode_ORFs_result.gtf> -r <ribo-seq genomic mapping file> -f 15 -l 5 -e 100 -m 26 -M 34 -o <ORF.counts>
261268
262269
The reads aligned to first 15 codons and last 5 codons of ORFs which length longer than 100 nt will be excluded.
270+
The "RiboCode_ORFs_result.gtf" file can be generated by `RiboCode`_ command. The "ribo-seq genomic mapping file" is the
271+
genome-wide mapping file produced by `STAR`_ mapping.
263272

264273

265274
Recipes (FAQ):
@@ -316,11 +325,11 @@ Xudong Xing (xudonxing_bioinf[at]sina.com)
316325

317326
.. _HEK293 dataset: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR1630831
318327

319-
.. _config.txt: https://github.com/xzt41/RiboCode/blob/master/data/config.txt
328+
.. _config.txt: https://github.com/xryanglab/RiboCode/blob/master/data/config.txt
320329

321-
.. _rRNA.fa: https://github.com/xzt41/RiboCode/blob/master/data/rRNA.fa
330+
.. _rRNA.fa: https://github.com/xryanglab/RiboCode/blob/master/data/rRNA.fa
322331

323-
.. _GTF_update.rst: https://github.com/xzt41/RiboCode/blob/master/data/GTF_update.rst
332+
.. _GTF_update.rst: https://github.com/xryanglab/RiboCode/blob/master/data/GTF_update.rst
324333

325334
.. _UNC Bioinformatics Utilities: https://github.com/mozack/ubu
326335

RiboCode/RPF_count_ORF.py

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -10,13 +10,13 @@
1010
import pysam
1111
import sys
1212

13-
def readGTF(gtfFile):
13+
def readGTF(gtfFile,stranded):
1414
gtf = HTSeq.GFF_Reader(gtfFile)
1515
start_codon_sites = {}
1616
stop_codon_sites = {}
1717
counts = {}
1818

19-
ORF_features = HTSeq.GenomicArrayOfSets("auto",stranded="yes")
19+
ORF_features = HTSeq.GenomicArrayOfSets("auto",stranded !="no")
2020
# for i,f in enumerate(gtf):
2121
for f in gtf:
2222
# i += 1
@@ -87,7 +87,7 @@ def count_reads(start_codon_sites,stop_codon_sites,ORF_features,counts,map_file,
8787
if read_len > max_read:
8888
too_long += 1
8989
continue
90-
if stranded == "yes":
90+
if stranded != "reverse":
9191
iv_seq = (co.ref_iv for co in r.cigar if co.type =="M" and co.size >0)
9292
else:
9393
iv_seq = (invert_strand(co.ref_iv) for co in r.cigar if co.type=="M" and co.size>0)
@@ -140,7 +140,7 @@ def main():
140140
args = parsing_ORF_count()
141141
verboseprint("Reading the GTF file ...")
142142

143-
start_codon_sites,stop_codon_sites,ORF_features,counts = readGTF(args.gtf_file)
143+
start_codon_sites,stop_codon_sites,ORF_features,counts = readGTF(args.gtf_file,args.stranded)
144144
counts["__too_low_quality"] = 0
145145
counts["__not_aligned"] = 0
146146
counts["__too_short(<%i)" % args.min_read] = 0

RiboCode/__init__.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
11
#!/usr/bin/env python
22
# -*- coding:UTF-8 -*-
33
__author__ = 'Zhengtao Xiao'
4-
__version__ = "1.2.9"
4+
__version__ = "1.2.10"

RiboCode/loadconfig.py

Lines changed: 8 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -44,13 +44,18 @@ def _parsing(self):
4444
Use commas to separate read lengths and P-site offsets,\
4545
e.g. 28,29 11,12\n")
4646
sys.exit()
47-
if stranded not in ["yes", "reverse"]:
47+
if stranded not in ["yes", "reverse","no"]:
4848
sys.stderr.write("Error, pleas check your config file\n, \
49-
the stranded should be yes or reverse.\
49+
the stranded should be yes ,no, or reverse.\
5050
reverse means reversed strand interpretation\n")
5151
sys.exit()
5252
else:
53-
stranded = True if stranded == "yes" else False
53+
if stranded == "yes":
54+
stranded = True
55+
elif stranded == "reverse":
56+
stranded = False
57+
else:
58+
stranded = None
5459
if samplename in samplenames:
5560
sys.stderr.wirte("Error, pls check you config file\n, bam file name is duplicated: %s.\n" % samplename)
5661
sys.exit()

RiboCode/metaplots.py

Lines changed: 47 additions & 41 deletions
Original file line numberDiff line numberDiff line change
@@ -184,47 +184,54 @@ def meta_analysis(gene_dict,transcript_dict,args):
184184
"If no any start codons and stop codons are annotated in GTF file \n" +
185185
",please skip this step and create a config file to specify the P-site based on other evidence." )
186186
sys.exit()
187-
# read bam file
188-
distance_to_start_count,distance_to_stop_count,length_counter = readTranscriptBam(
189-
args.rpf_mapping_file,filter_tids,transcript_dict,args.stranded,args.minLength,args.maxLength)
190-
# predefine the psite
191-
pre_psite_dict = {}
192-
total_reads = sum(length_counter.values())
187+
# read bam files
193188
fout = open(args.outname + "_pre_config.txt", "w")
194-
fout.write("#read_length\tproportion(per total mapped reads)\tpredicted_psite\tf0_sum\tf1_sum\tf2_sum\tf0_percent\tpvalue1\tpvalue2\tpvalue_combined\n")
195-
for l,d in distance_to_start_count.items():
196-
if d.sum() < 10:
197-
continue
198-
mask_max_psite = d[:50].argmax()
199-
predict_psite,others = _predict_psite(l,d,args.frame0_percent,args.pvalue1_cutoff,args.pvalue2_cutoff)
200-
if predict_psite:
201-
if mask_max_psite != predict_psite:
202-
sys.stderr.write("Warning:The predicted P-site location(%i) for length %i is not the highest peak(%i),\
203-
please confirm according metagene plots.\n" % (50-predict_psite,l,50-mask_max_psite))
204-
read_percent = '{:.2%}'.format(length_counter[l] / total_reads)
205-
pre_psite_dict[l] = predict_psite
206-
f0sum,f1sum,f2sum,f0_percent,pv1,pv2,pv = others
207-
fout.write("# " + "\t".join(map(str,[l,read_percent,50-predict_psite,f0sum,f1sum,f2sum,
208-
'{:.2%}'.format(f0_percent),pv1,pv2,pv])) + "\n")
209-
210-
#print the psite lines
211-
fout.write("\n")
212-
if pre_psite_dict:
213-
fout.write("# " + "\t".join(["SampleName","AlignmentFile","Stranded(yes/reverse)","P-siteReadLength","P-siteLocations"]) + "\n")
214-
stranded = "yes" if args.stranded is True else "reverse"
215-
pre_psite_len = list(map(str,sorted(pre_psite_dict.keys())))
216-
pre_psite_loc = list(map(str,[-pre_psite_dict[i]+50 for i in sorted(pre_psite_dict.keys())]))
217-
sampleName = os.path.splitext(os.path.basename(args.rpf_mapping_file))[0]
218-
fout.write("\t".join(map(str,[sampleName,args.rpf_mapping_file,stranded,",".join(pre_psite_len),",".join(pre_psite_loc)])) + "\n")
219-
fout.close()
220-
distancePlot(distance_to_start_count,distance_to_stop_count,pre_psite_dict,length_counter,args.outname)
221-
#lengthDistribution(length_counter,args.outname)
222-
else:
223-
distancePlot(distance_to_start_count,distance_to_stop_count,pre_psite_dict,length_counter,args.outname)
224-
sys.stderr.write("No obviously periodicity were detected from alignment reads in annotated start codons,\n" +
225-
"it could be due to poor quality sequencing.\n" +
226-
"Please check the metagene plots and try again by lowering the value of frame0_percent")
227-
189+
for r in args.rpf_mapping_file:
190+
fout.write("\n#%s\n" % r)
191+
fout.write("#read_length\tproportion(per total mapped reads)\tpredicted_psite\tf0_sum\tf1_sum\tf2_sum\tf0_percent\tpvalue1\tpvalue2\tpvalue_combined\n")
192+
distance_to_start_count,distance_to_stop_count,length_counter = readTranscriptBam(
193+
r,filter_tids,transcript_dict,args.stranded,args.minLength,args.maxLength)
194+
# predefine the psite
195+
pre_psite_dict = {}
196+
total_reads = sum(length_counter.values())
197+
198+
for l,d in distance_to_start_count.items():
199+
if d.sum() < 10:
200+
continue
201+
mask_max_psite = d[:50].argmax()
202+
predict_psite,others = _predict_psite(l,d,args.frame0_percent,args.pvalue1_cutoff,args.pvalue2_cutoff)
203+
if predict_psite:
204+
if mask_max_psite != predict_psite:
205+
sys.stderr.write("Warning:The predicted P-site location(%i) for length %i is not the highest peak(%i),\
206+
please confirm according metagene plots.\n" % (50-predict_psite,l,50-mask_max_psite))
207+
read_percent = '{:.2%}'.format(length_counter[l] / total_reads)
208+
pre_psite_dict[l] = predict_psite
209+
f0sum,f1sum,f2sum,f0_percent,pv1,pv2,pv = others
210+
fout.write("# " + "\t".join(map(str,[l,read_percent,50-predict_psite,f0sum,f1sum,f2sum,
211+
'{:.2%}'.format(f0_percent),pv1,pv2,pv])) + "\n")
212+
sampleName = os.path.splitext(os.path.basename(r))[0]
213+
#print the psite lines
214+
fout.write("\n")
215+
if pre_psite_dict:
216+
fout.write("# " + "\t".join(["SampleName","AlignmentFile","Stranded(yes/reverse)","P-siteReadLength","P-siteLocations"]) + "\n")
217+
if args.stranded is True:
218+
stranded = "yes"
219+
elif args.stranded is False:
220+
stranded = "reverse"
221+
else:
222+
stranded = "no"
223+
pre_psite_len = list(map(str,sorted(pre_psite_dict.keys())))
224+
pre_psite_loc = list(map(str,[-pre_psite_dict[i]+50 for i in sorted(pre_psite_dict.keys())]))
225+
226+
fout.write("\t".join(map(str,[sampleName,r,stranded,",".join(pre_psite_len),",".join(pre_psite_loc)])) + "\n")
227+
distancePlot(distance_to_start_count,distance_to_stop_count,pre_psite_dict,length_counter,args.outname + sampleName)
228+
#lengthDistribution(length_counter,args.outname)
229+
else:
230+
distancePlot(distance_to_start_count,distance_to_stop_count,pre_psite_dict,length_counter,args.outname + sampleName)
231+
sys.stderr.write("No obviously periodicity are detected from bam file %s,\n" % r +
232+
"it could be due to poor quality sequencing.\n" +
233+
"Please check the metagene plots and try again by lowering the value of frame0_percent\n")
234+
fout.close()
228235

229236
def main():
230237
verboseprint("Create metaplot file and predict the P-site locations ...")
@@ -234,6 +241,5 @@ def main():
234241
meta_analysis(gene_dict,transcript_dict,args)
235242
verboseprint("Complete prediction of the P-site locations")
236243

237-
238244
if __name__ == "__main__":
239245
main()

RiboCode/parsing_opts.py

Lines changed: 45 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,17 @@
99
import os
1010
from .__init__ import __version__
1111

12+
def read_file(f):
13+
if not os.path.exists(f):
14+
raise ValueError("Error, file %s not found!\n" % f)
15+
fList = []
16+
with open(f) as fin:
17+
for line in fin:
18+
if line.startswith("#") or line.strip() == "":
19+
continue
20+
fList.append(line.strip())
21+
return fList
22+
1223
def parsing_gtf_update():
1324
parser = argparse.ArgumentParser(
1425
description="This script is designed for preparing the appropriate GTF file from \
@@ -59,9 +70,12 @@ def parsing_metaplots():
5970
)
6071
parser.add_argument("-a","--annot_dir",dest="annot_dir",required=True,type=str,
6172
help="transcripts annotation directory, generated by prepare_transcripts.")
62-
parser.add_argument("-r","--rpf_mapping_file",dest="rpf_mapping_file",required=True,type=str,
73+
group = parser.add_mutually_exclusive_group(required=True)
74+
group.add_argument("-r","--rpf_mapping_file",dest="rpf_mapping_file",required=False,type=str,
6375
help="ribo-seq BAM/SAM file aligned to the transcriptome.")
64-
parser.add_argument("-s","--stranded",dest="stranded",required=False,type=str,choices=["yes","reverse"],
76+
group.add_argument("-i","--input_file",dest="rpf_mapping_file",required=False,type=read_file,
77+
help="the file list the ribo-seq BAM/SAM files aligned to the transcriptome.")
78+
parser.add_argument("-s","--stranded",dest="stranded",required=False,type=str,choices=["yes","reverse","no"],
6579
default="yes",help="whether the data is strand-specific, \
6680
reverse means reversed strand interpretation.(default: yes)")
6781
parser.add_argument("-m","--minimum-length",dest="minLength",required=False,type=int,default=24,
@@ -86,9 +100,17 @@ def parsing_metaplots():
86100
raise ValueError("minimum length must be <= maximum length (currently %d and %d, respectively)" % (args.minLength, args.maxLength))
87101
if args.minLength <= 0 or args.maxLength <=0:
88102
raise ValueError("minimum length or maximum length must be larger than 0.")
89-
if not os.path.exists(args.rpf_mapping_file):
90-
raise ValueError("Error, the rpf mapping file not found: %s\n" % args.rpf_mapping_file)
91-
args.stranded = True if args.stranded == "yes" else False
103+
if type(args.rpf_mapping_file) is str:
104+
if not os.path.exists(args.rpf_mapping_file):
105+
raise ValueError("Error, the rpf mapping file not found: %s\n" % args.rpf_mapping_file)
106+
args.rpf_mapping_file = [args.rpf_mapping_file]
107+
108+
if args.stranded == "yes":
109+
args.stranded = True
110+
elif args.stranded == "reverse":
111+
args.stranded = False
112+
else:
113+
args.stranded = None
92114
args.pvalue1_cutoff = float(args.pvalue1_cutoff)
93115
args.pvalue2_cutoff = float(args.pvalue2_cutoff)
94116
args.frame0_percent = float(args.frame0_percent)
@@ -111,9 +133,6 @@ def parsing_ribo():
111133
If set to no , the position of start codon will be automatically determined by program.", type=str)
112134
parser.add_argument("-p","--pval-cutoff",dest="pval_cutoff",default=0.05,required=False,
113135
help="P-value cutoff for ORF filtering, default 0.05", type=float)
114-
parser.add_argument("--stranded","--stranded",dest="stranded",required=False,type=str,choices=["yes","reverse"],
115-
default="yes",help="whether the data is strand-specific, \
116-
reverse means reversed strand interpretation.(default: yes)")
117136
parser.add_argument("-s","--start_codon",default="ATG",type=str,dest="start_codon",
118137
help="The canonical start codon. default: ATG")
119138
parser.add_argument("-A","--alt_start_codons",default="",type=str,dest="alternative_start_codons",
@@ -128,12 +147,12 @@ def parsing_ribo():
128147
# parser.add_argument("-P","--parallel_num",dest="parallel_num",default=1,required=False,
129148
# help="the number of threads to read the alignment file(s), \
130149
# the optimal value is the number of alignment files, default=1",type=int)
131-
parser.add_argument("-o","--output-name",dest="output_name",default="final_result",required=False,
132-
help="output file name, default: final_result", type=str)
133150
parser.add_argument("-g","--output-gtf",dest="output_gtf",action='store_true',default=False,required=False,
134151
help="output the gtf file of predicted ORFs")
135152
parser.add_argument("-b","--output-bed",dest="output_bed",action='store_true',default=False,required=False,
136153
help="output the bed file of predicted ORFs")
154+
parser.add_argument("-o","--output-name",dest="output_name",default="final_result",required=False,
155+
help="output file name, default: final_result", type=str)
137156
parser.add_argument('-V',"--version",action="version",version=__version__)
138157
args = parser.parse_args()
139158

@@ -174,7 +193,7 @@ def parsing_ORF_count():
174193
description="This script is designed for calculating the number of reads mapping to ORF with the alignment files \
175194
in SAM/BAM format (aligned to genome) and a feature file in GTF format"
176195
)
177-
parser.add_argument("-s","--stranded",dest="stranded",required=False,type=str,choices=["yes","reverse"],
196+
parser.add_argument("-s","--stranded",dest="stranded",required=False,type=str,choices=["yes","reverse","no"],
178197
default="yes",help="whether the data is strand-specific, \
179198
reverse means reversed strand interpretation. (default: yes)")
180199
parser.add_argument("-a","--minaqual",dest="min_quality",required=False,type=int,
@@ -228,9 +247,12 @@ def parsing_ribo_onestep():
228247
please refer: https://en.wikipedia.org/wiki/GENCODE')
229248
parser.add_argument("-f","--fasta",dest="genomeFasta",required=True,type=str,
230249
help="The genome sequences file in fasta format.")
231-
parser.add_argument("-r","--rpf_mapping_file",dest="rpf_mapping_file",required=True,type=str,
250+
group = parser.add_mutually_exclusive_group(required=True)
251+
group.add_argument("-r","--rpf_mapping_file",dest="rpf_mapping_file",required=True,type=str,
232252
help="ribo-seq BAM/SAM file aligned to the transcriptome.")
233-
parser.add_argument("-stranded","--stranded",dest="stranded",required=False,type=str,choices=["yes","reverse"],
253+
group.add_argument("-i","--input_file",dest="rpf_mapping_file",required=False,type=read_file,
254+
help="the file list the ribo-seq BAM/SAM files aligned to the transcriptome.")
255+
parser.add_argument("-stranded","--stranded",dest="stranded",required=False,type=str,choices=["yes","reverse","no"],
234256
default="yes",help="whether the data is strand-specific, \
235257
reverse means reversed strand interpretation.(default: yes)")
236258
parser.add_argument("-m","--minimum-length",dest="minLength",required=False,type=int,default=24,
@@ -266,9 +288,16 @@ def parsing_ribo_onestep():
266288
help="output the bed file of predicted ORFs")
267289
parser.add_argument('-V',"--version",action="version",version=__version__)
268290
args = parser.parse_args()
269-
if not os.path.exists(args.rpf_mapping_file):
270-
raise ValueError("Error, the rpf mapping file not found: %s\n" % args.rpf_mapping_file)
271-
args.stranded = True if args.stranded == "yes" else False
291+
if type(args.rpf_mapping_file) is str:
292+
if not os.path.exists(args.rpf_mapping_file):
293+
raise ValueError("Error, the rpf mapping file not found: %s\n" % args.rpf_mapping_file)
294+
args.rpf_mapping_file = [args.rpf_mapping_file]
295+
if args.stranded == "yes":
296+
args.stranded = True
297+
elif args.stranded == "reverse":
298+
args.stranded = False
299+
else:
300+
args.stranded = None
272301
if args.minLength > args.maxLength:
273302
raise ValueError("minimum length must be <= maximum length (currently %d and %d, respectively)" % (args.minLength, args.maxLength))
274303
if args.minLength <= 0 or args.maxLength <=0:

0 commit comments

Comments
 (0)