KeyError: exon in xpore dataprep

Hi, thank you for developing such an excellent tool! 

I encountered an error while running the `dataprep` function in the xpore software as follows:

```py
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_9791/dataprep.py in <module>
      2     for id in dict:
      3         tx_pos,tx_start=[],0
----> 4         for pair in dict[id]["exon"]:                    # line 242
      5             tx_end=pair[1]-pair[0]+tx_start
      6             tx_pos.append((tx_start,tx_end))

KeyError: 'exon'
```
I checked line 242 of `dataprep.py` and found that **some transcripts in `dict` have no `exon` entry, i.e. no exon start and end coordinates were recorded for them**, as shown in the screenshot for ENSMUST00000193812.

![image](https://github.com/GoekeLab/xpore/assets/83895347/ef48f893-ce17-47ca-bdfe-1b457756aca0)

Next, I tested the `readAnnotation` function, adding the following check to print the `attrDict`, `type`, `start`, and `end` variables for ENSMUST00000193812:
![image](https://github.com/GoekeLab/xpore/assets/83895347/48f1a1dd-579b-4f02-80cb-34d2c7d57dbd)

The output is as follows:

```py
attrDict: {'gene_id': 'ENSMUSG00000102693', 'gene_version': '1', 'transcript_id': 'ENSMUST00000193812', 'transcript_version': '1', 'exon_number': '1', 'gene_name': '4933401J01Rik', 'gene_source': 'havana', 'gene_biotype': 'TEC', 'havana_gene': 'OTTMUSG00000049935', 'havana_gene_version': '1', 'transcript_name': '4933401J01Rik-201', 'transcript_source': 'havana', 'transcript_biotype': 'TEC', 'havana_transcript': 'OTTMUST00000127109', 'havana_transcript_version': '1', 'exon_id': 'ENSMUSE00001343744', 'exon_version': '1', 'tag': 'basic', 'transcript_support_level': 'NA'} 
 type: exon 
 start: 3073253 
end: 3074322

attrDict: {'gene_id': 'ENSMUSG00000102693', 'gene_version': '1', 'transcript_id': 'ENSMUST00000193812', 'transcript_version': '1', 'gene_name': '4933401J01Rik', 'gene_source': 'havana', 'gene_biotype': 'TEC', 'havana_gene': 'OTTMUSG00000049935', 'havana_gene_version': '1', 'transcript_name': '4933401J01Rik-201', 'transcript_source': 'havana', 'transcript_biotype': 'TEC', 'havana_transcript': 'OTTMUST00000127109', 'havana_transcript_version': '1', 'tag': 'basic', 'transcript_support_level': 'NA'} 
 type: transcript 
 start: 3073253 
end: 3074322
```

**This indicates that the exon line for ENSMUST00000193812 comes before the transcript line in the GTF file. Because the code below only stores a `transcript` record when it first sees a transcript ID, the exon of this single-exon transcript is never saved.**

```py
if tx_id not in dict:
    dict[tx_id]={'chr':chr,'g_id':g_id,'strand':ln[6]}
    if type not in dict[tx_id]:
        if type == "transcript":
            dict[tx_id][type]=(start,end)
else:
    if type == "exon":
        if type not in dict[tx_id]:
            dict[tx_id][type]=[(start,end)]
        else:
            dict[tx_id][type].append((start,end))
```
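The order dependence can be reproduced in isolation. The snippet below is a minimal sketch, not xpore's actual code: `parse` is a hypothetical stand-in that mirrors only the branching shown above (the `chr`, `g_id`, and `strand` fields are omitted), and the two records are taken from the output I printed.

```python
def parse(records):
    """Mimic the branching shown above on (tx_id, type, start, end) tuples."""
    annotations = {}
    for tx_id, feature, start, end in records:
        if tx_id not in annotations:
            annotations[tx_id] = {}
            if feature == "transcript":
                annotations[tx_id][feature] = (start, end)
            # An exon that arrives first falls through here and is never stored.
        else:
            if feature == "exon":
                annotations[tx_id].setdefault(feature, []).append((start, end))
    return annotations


exon_first = [
    ("ENSMUST00000193812", "exon", 3073253, 3074322),
    ("ENSMUST00000193812", "transcript", 3073253, 3074322),
]
transcript_first = list(reversed(exon_first))

print("exon" in parse(exon_first)["ENSMUST00000193812"])        # False
print("exon" in parse(transcript_first)["ENSMUST00000193812"])  # True
```

With the exon line first, the transcript ends up without an `exon` key, which matches the `KeyError` above.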

Then, I added the following lines to handle this ordering problem:

```py
if tx_id not in dict:
    dict[tx_id]={'chr':chr,'g_id':g_id,'strand':ln[6]}
    if type not in dict[tx_id]:
        if type == "transcript":
            dict[tx_id][type]=(start,end)
        if type == "exon":                            # add
            dict[tx_id][type]=[(start,end)]      # add
else:
    if type == "exon":
        if type not in dict[tx_id]:
            dict[tx_id][type]=[(start,end)]
        else:
            dict[tx_id][type].append((start,end))
```
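For comparison, here is a sketch of a variant that does not depend on the record order at all. It is only an illustration under my own assumptions: `read_features` and the flat `records` tuples are hypothetical and do not correspond to xpore's real function signatures. The idea is to create the entry once with `dict.setdefault` and always keep exons as a list of `(start, end)` tuples.

```python
def read_features(records):
    """Collect the transcript span and exon list per transcript, in any record order."""
    annotations = {}
    for tx_id, chrom, gene_id, strand, feature, start, end in records:
        # Create the entry on first sight of the transcript ID, whatever the feature type.
        entry = annotations.setdefault(
            tx_id, {"chr": chrom, "g_id": gene_id, "strand": strand}
        )
        if feature == "transcript":
            entry["transcript"] = (start, end)
        elif feature == "exon":
            # Always a list of (start, end) tuples, even for a single exon.
            entry.setdefault("exon", []).append((start, end))
    return annotations


records = [
    ("ENSMUST00000193812", "1", "ENSMUSG00000102693", "+", "exon", 3073253, 3074322),
    ("ENSMUST00000193812", "1", "ENSMUSG00000102693", "+", "transcript", 3073253, 3074322),
]
print(read_features(records)["ENSMUST00000193812"])
```

Reversing `records` gives the same dictionary, so the result no longer depends on whether the exon or the transcript line comes first.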

Although the added lines resolved the `KeyError`, the next loop then raised a `TypeError` for transcripts that have only one exon:

```py
if is_gff < 0:
    for id in dict:
        tx_pos,tx_start=[],0
        for pair in dict[id]["exon"]:
            tx_end=pair[1]-pair[0]+tx_start
            tx_pos.append((tx_start,tx_end))
            tx_start=tx_end+1
        dict[id]['tx_exon']=tx_pos
else:
    for id in dict:
        tx_pos,tx_start=[],0
        if dict[id]["strand"] == "-":
            dict[id]["exon"].sort(key=lambda tup: tup[0], reverse=True)
        for pair in dict[id]["exon"]:
            tx_end=pair[1]-pair[0]+tx_start
            tx_pos.append((tx_start,tx_end))
            tx_start=tx_end+1
        dict[id]['tx_exon']=tx_pos

#TypeError                                 Traceback (most recent call last)
#/tmp/ipykernel_13771/2116236259.py in <module>
#      3         tx_pos,tx_start=[],0
#      4         for pair in dict[id]["exon"]:
#----> 5             tx_end=pair[1]-pair[0]+tx_start
#      6             tx_pos.append((tx_start,tx_end))
#      7             tx_start=tx_end+1

#TypeError: 'int' object is not subscriptable
```

For example, no error is raised for this key:
```py
'ENSMUST00000187528': {'chr': '1',
  'g_id': 'ENSMUSG00000101714',
  'strand': '+',
  'exon': [(35806974, 35807035), (35810073, 35810462)]}
```
But an error is raised for this key, which has only one pair of exon coordinates:

```py
'ENSMUST00000193812': {'chr': '1',
  'g_id': 'ENSMUSG00000102693',
  'strand': '+',
  'exon': (3073253, 3074322)}
```
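The two entries differ in the shape of the `exon` value: the first is a list of `(start, end)` tuples, the second is a bare tuple. This suggests the value was stored as `(start,end)` rather than `[(start,end)]` in the run that produced it, although that is my inference from the output and I have not verified it. The sketch below wraps the same arithmetic as the loop above in a hypothetical helper, `transcript_coords`, to show how each shape behaves.

```python
def transcript_coords(exons):
    """Same arithmetic as the loop above, applied to a list of (start, end) tuples."""
    tx_pos, tx_start = [], 0
    for pair in exons:
        tx_end = pair[1] - pair[0] + tx_start
        tx_pos.append((tx_start, tx_end))
        tx_start = tx_end + 1
    return tx_pos


# Two exons stored as a list of tuples.
print(transcript_coords([(35806974, 35807035), (35810073, 35810462)]))
# [(0, 61), (62, 451)]

# A single exon stored as a one-element list works the same way.
print(transcript_coords([(3073253, 3074322)]))
# [(0, 1069)]

# A bare tuple makes the loop iterate over two ints instead of one pair.
try:
    transcript_coords((3073253, 3074322))
except TypeError as err:
    print(err)  # 'int' object is not subscriptable
```

So a single-exon transcript is handled correctly as long as its exon is stored as a one-element list.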

I am unsure whether this is a bug in xpore or a problem with the order of the records in my GTF file.

Thanks!

KeyError: exon in xpore dataprep #217

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

KeyError: exon in xpore dataprep #217

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions