Description
Hi, thank you for developing such an excellent tool!
I encountered an error while running the dataprep function in the xpore software, as follows:
KeyError Traceback (most recent call last)
/tmp/ipykernel_9791/dataprep.py in <module>
2 for id in dict:
3 tx_pos,tx_start=[],0
----> 4 for pair in dict[id]["exon"]: # line 242
5 tx_end=pair[1]-pair[0]+tx_start
6 tx_pos.append((tx_start,tx_end))
KeyError: 'exon'
I carefully checked line 242 of dataprep.py and found that some transcripts in dict do not have the corresponding exon start and end points annotated, as shown in the screenshot for ENSMUST00000193812.
Subsequently, I tested the readAnnotation function, adding the following check code to inspect the attrDict, type, start, and end variables for ENSMUST00000193812:
The output is as follows:
attrDict: {'gene_id': 'ENSMUSG00000102693', 'gene_version': '1', 'transcript_id': 'ENSMUST00000193812', 'transcript_version': '1', 'exon_number': '1', 'gene_name': '4933401J01Rik', 'gene_source': 'havana', 'gene_biotype': 'TEC', 'havana_gene': 'OTTMUSG00000049935', 'havana_gene_version': '1', 'transcript_name': '4933401J01Rik-201', 'transcript_source': 'havana', 'transcript_biotype': 'TEC', 'havana_transcript': 'OTTMUST00000127109', 'havana_transcript_version': '1', 'exon_id': 'ENSMUSE00001343744', 'exon_version': '1', 'tag': 'basic', 'transcript_support_level': 'NA'}
type: exon
start: 3073253
end: 3074322
attrDict: {'gene_id': 'ENSMUSG00000102693', 'gene_version': '1', 'transcript_id': 'ENSMUST00000193812', 'transcript_version': '1', 'gene_name': '4933401J01Rik', 'gene_source': 'havana', 'gene_biotype': 'TEC', 'havana_gene': 'OTTMUSG00000049935', 'havana_gene_version': '1', 'transcript_name': '4933401J01Rik-201', 'transcript_source': 'havana', 'transcript_biotype': 'TEC', 'havana_transcript': 'OTTMUST00000127109', 'havana_transcript_version': '1', 'tag': 'basic', 'transcript_support_level': 'NA'}
type: transcript
start: 3073253
end: 3074322
This indicates that the exon line for ENSMUST00000193812 appears above the transcript line in the GTF file. As a result, this single-exon transcript never receives an 'exon' entry under the following condition:
if tx_id not in dict:
    dict[tx_id]={'chr':chr,'g_id':g_id,'strand':ln[6]}
    if type not in dict[tx_id]:
        if type == "transcript":
            dict[tx_id][type]=(start,end)
else:
    if type == "exon":
        if type not in dict[tx_id]:
            dict[tx_id][type]=[(start,end)]
        else:
            dict[tx_id][type].append((start,end))
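The order dependence described above can be reproduced in isolation. The sketch below is not xpore code: `collect` is a hypothetical helper that mirrors only the branching shown above, and the `(tx_id, feature, start, end)` record layout is an assumption made for illustration.

```python
def collect(records):
    """Mirror the branching above on (tx_id, feature, start, end) records."""
    table = {}
    for tx_id, feature, start, end in records:
        if tx_id not in table:
            table[tx_id] = {}
            # First line seen for this transcript: only "transcript" is stored.
            if feature not in table[tx_id]:
                if feature == "transcript":
                    table[tx_id][feature] = (start, end)
        else:
            # Later lines for this transcript: only "exon" is stored.
            if feature == "exon":
                if feature not in table[tx_id]:
                    table[tx_id][feature] = [(start, end)]
                else:
                    table[tx_id][feature].append((start, end))
    return table


transcript_first = [
    ("ENSMUST00000193812", "transcript", 3073253, 3074322),
    ("ENSMUST00000193812", "exon", 3073253, 3074322),
]
exon_first = list(reversed(transcript_first))

print("exon" in collect(transcript_first)["ENSMUST00000193812"])  # True
print("exon" in collect(exon_first)["ENSMUST00000193812"])        # False
```

When the exon line comes first, it is consumed by the branch that creates the entry, and no later exon line exists to fill in the 'exon' key.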
Then, I added the following code (the two lines marked "# add") to handle this ordering problem:
if tx_id not in dict:
    dict[tx_id]={'chr':chr,'g_id':g_id,'strand':ln[6]}
    if type not in dict[tx_id]:
        if type == "transcript":
            dict[tx_id][type]=(start,end)
        if type == "exon": # add
            dict[tx_id][type]=[(start,end)] # add
else:
    if type == "exon":
        if type not in dict[tx_id]:
            dict[tx_id][type]=[(start,end)]
        else:
            dict[tx_id][type].append((start,end))
Although my change resolved the KeyError, I then encountered an error in the next loop, which I believe occurs because not all transcripts have multiple exons.
if is_gff < 0:
    for id in dict:
        tx_pos,tx_start=[],0
        for pair in dict[id]["exon"]:
            tx_end=pair[1]-pair[0]+tx_start
            tx_pos.append((tx_start,tx_end))
            tx_start=tx_end+1
        dict[id]['tx_exon']=tx_pos
else:
    for id in dict:
        tx_pos,tx_start=[],0
        if dict[id]["strand"] == "-":
            dict[id]["exon"].sort(key=lambda tup: tup[0], reverse=True)
        for pair in dict[id]["exon"]:
            tx_end=pair[1]-pair[0]+tx_start
            tx_pos.append((tx_start,tx_end))
            tx_start=tx_end+1
        dict[id]['tx_exon']=tx_pos
#TypeError Traceback (most recent call last)
#/tmp/ipykernel_13771/2116236259.py in <module>
# 3 tx_pos,tx_start=[],0
# 4 for pair in dict[id]["exon"]:
#----> 5 tx_end=pair[1]-pair[0]+tx_start
# 6 tx_pos.append((tx_start,tx_end))
# 7 tx_start=tx_end+1
TypeError: 'int' object is not subscriptable
For example, it does not produce an error for this key:
'ENSMUST00000187528': {'chr': '1',
'g_id': 'ENSMUSG00000101714',
'strand': '+',
'exon': [(35806974, 35807035), (35810073, 35810462)]}
But it does produce an error for this key, which has only one pair of exon coordinates:
'ENSMUST00000193812': {'chr': '1',
'g_id': 'ENSMUSG00000102693',
'strand': '+',
'exon': (3073253, 3074322)}
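One possible reading of the two entries above, offered as an assumption rather than a confirmed diagnosis: the failing entry stores 'exon' as a bare tuple instead of a list containing one tuple, so the loop iterates over two integers rather than over one (start, end) pair. The patch as posted stores a list, so the tuple may come from a slightly different version of the edit. In the sketch below, `exon_offsets` is a hypothetical helper that copies the loop body shown earlier.

```python
def exon_offsets(exons):
    """Convert genomic (start, end) exon pairs to transcript-relative pairs."""
    tx_pos, tx_start = [], 0
    for pair in exons:
        tx_end = pair[1] - pair[0] + tx_start
        tx_pos.append((tx_start, tx_end))
        tx_start = tx_end + 1
    return tx_pos


multi_exon = [(35806974, 35807035), (35810073, 35810462)]
single_exon_list = [(3073253, 3074322)]   # list holding one pair
single_exon_tuple = (3073253, 3074322)    # bare pair, as in the failing entry

print(exon_offsets(multi_exon))        # [(0, 61), (62, 451)]
print(exon_offsets(single_exon_list))  # [(0, 1069)]

try:
    exon_offsets(single_exon_tuple)
except TypeError as err:
    print(err)  # 'int' object is not subscriptable
```

Under this reading, a single-exon transcript works as long as its 'exon' value is a one-element list.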
I am unsure if this is a bug or an error in the order of the GTF file.
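For reference, a minimal sketch of how the collection step could be made independent of line order, whichever of the two it turns out to be. This is not the xpore implementation: `collect_order_independent` and its record layout are hypothetical, and unlike the original logic it also stores the transcript span when the transcript line is not the first line seen.

```python
def collect_order_independent(records):
    """Build the per-transcript table regardless of transcript/exon line order.

    Each record is (tx_id, chrom, g_id, strand, feature, start, end).
    """
    table = {}
    for tx_id, chrom, g_id, strand, feature, start, end in records:
        # Create the entry on first sight, then always fall through to the
        # feature handling so the first line is never dropped.
        entry = table.setdefault(
            tx_id, {"chr": chrom, "g_id": g_id, "strand": strand}
        )
        if feature == "transcript":
            entry.setdefault("transcript", (start, end))
        elif feature == "exon":
            # Always a list of pairs, even for single-exon transcripts.
            entry.setdefault("exon", []).append((start, end))
    return table


records = [
    ("ENSMUST00000193812", "1", "ENSMUSG00000102693", "+", "exon", 3073253, 3074322),
    ("ENSMUST00000193812", "1", "ENSMUSG00000102693", "+", "transcript", 3073253, 3074322),
]
table = collect_order_independent(records)
print(table["ENSMUST00000193812"]["exon"])  # [(3073253, 3074322)]
```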
Thanks!