Accept U as a valid base type and convert to T for BAM.

## Before you submit
Make sure your issue is not already in the [htsjdk issue tracker](https://github.com/samtools/htsjdk/issues?q=)

It was discussed in https://github.com/samtools/htsjdk/issues/1478, now closed, but see below

### Description of the issue:
The background to this is https://github.com/samtools/samtools/issues/2131 where a user has a *SAM* file containing U.  Certainly we know ONT write fastqs with U (if RNA) and when they write BAM they translate them to T, but I'm still not entirely sure where this SAM came from.  It broke htslib, and would also break htsjdk.

### Your environment:
I checked the [source](https://github.com/samtools/htsjdk/blob/dbebac36f980517f5bd9b7e64f017b383e38ba83/src/main/java/htsjdk/samtools/SAMUtils.java#L173-L228), so this is largely irrelevant, but yeah, our system-supported Java is too old for the latest picard. :/

### Steps to reproduce
Create a SAM file with U in the sequence.
SamFormatConverter from SAM to BAM fails with:

```
Exception in thread "main" java.lang.IllegalStateException: Bad base passed to charToCompressedBaseLow: #(35) in read: r1
```
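For anyone wanting to reproduce this, here is a minimal sketch that writes a two-line SAM file with a U in the sequence. The class name, the default file name `u_bases.sam`, and the record contents (an unmapped read `r1` with sequence `ACGU` and no qualities) are my own illustrative choices, not the user's original data, and I have not verified that this exact file yields the identical message.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical reproducer: writes a tiny SAM file whose only read contains a U. */
final class SamWithUracilExample {
    private SamWithUracilExample() {}

    /** Builds the SAM text: one header line plus one unmapped record with 11 mandatory fields. */
    static String minimalSam() {
        final String header = "@HD\tVN:1.6\tSO:unsorted\n";
        final String record = String.join("\t",
                "r1",   // QNAME
                "4",    // FLAG: read unmapped
                "*",    // RNAME
                "0",    // POS
                "0",    // MAPQ
                "*",    // CIGAR
                "*",    // RNEXT
                "0",    // PNEXT
                "0",    // TLEN
                "ACGU", // SEQ containing the RNA base U
                "*"     // QUAL: not stored
        ) + "\n";
        return header + record;
    }

    public static void main(final String[] args) throws IOException {
        // Output path is an assumption; pass another path as the first argument if preferred.
        final Path out = Paths.get(args.length > 0 ? args[0] : "u_bases.sam");
        Files.write(out, minimalSam().getBytes(StandardCharsets.US_ASCII));
        System.out.println("Wrote " + out.toAbsolutePath());
    }
}
```

Feeding the resulting file to SamFormatConverter should exercise the same base-encoding path.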

### Expected behaviour
I'd like it to encode U as T in BAM so the data at least can be round-tripped.  The user has the option of doing a T to U substitution on decode if they wish later on.  Ideally it'd be tracked in the meta-data somewhere too.
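To make the request concrete, here is a rough sketch of the behaviour I have in mind. It is not htsjdk's actual `SAMUtils` code; the class and method names are hypothetical. The only thing taken from the SAM specification is BAM's 4-bit base alphabet `=ACMGRSVTWYHKDBN`. Rejecting other characters with an exception is an assumption modelled on the error shown above.

```java
/** Illustrative sketch only; not htsjdk's SAMUtils. All names are hypothetical. */
final class BamBaseCodeSketch {
    // BAM's 4-bit base alphabet from the SAM specification; the index is the nibble value.
    private static final String BAM_BASES = "=ACMGRSVTWYHKDBN";

    private BamBaseCodeSketch() {}

    /** Maps a SAM base character to its 4-bit BAM code, treating U/u as T. */
    static int toNibble(final char base) {
        char upper = Character.toUpperCase(base);
        if (upper == 'U') {
            upper = 'T'; // the proposed change: accept U and store it as T
        }
        final int code = BAM_BASES.indexOf(upper);
        if (code < 0) {
            // Assumption: anything outside the alphabet is rejected.
            throw new IllegalArgumentException("Bad base: " + base + "(" + (int) base + ")");
        }
        return code;
    }

    /** Maps a 4-bit BAM code back to its base character. */
    static char fromNibble(final int code) {
        return BAM_BASES.charAt(code & 0xF);
    }

    /** Encodes then decodes each base, showing what survives a SAM to BAM to SAM trip. */
    static String roundTrip(final String bases) {
        final StringBuilder sb = new StringBuilder(bases.length());
        for (int i = 0; i < bases.length(); i++) {
            sb.append(fromNibble(toNibble(bases.charAt(i))));
        }
        return sb.toString();
    }

    public static void main(final String[] args) {
        System.out.println(roundTrip("ACGUacgu")); // prints ACGTACGT
    }
}
```

A user who wants RNA letters back can substitute T with U after decoding; the sketch deliberately does not try to remember which reads originally used U.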

See https://github.com/samtools/hts-specs/issues/800 and https://github.com/samtools/hts-specs/issues/801 for context.

Given U is an IUPAC code, I feel it was an early accident to disallow it.  In contrast to #1478, I disagree that this is a base modification.  There are still 4 base types, but DNA and RNA differ in the chemical structure for T vs U.  We don't need to track which bases are T and which are U, as it's simply a property of the material.  Furthermore, IUPAC doesn't permit that distinction: it has no ambiguity codes for e.g. A/U.  The only mention of U in the original IUPAC paper was for V, listed as not-T or not-U.  All other not-? codes are the original base type +1 char (B for not-A, D for not-C, H for not-G).  It's clear they chose V over U due to the T/U issue, but it's also clear the authors basically treat T and U as interchangeable, and we should too.
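The "+1 char" pattern is easy to check mechanically. The snippet below is just an illustration of the argument, with hypothetical names; the code table itself (B, D, H, V) is the standard IUPAC set of "not" codes.

```java
/** Illustration of the IUPAC "not-X" codes and the next-letter pattern. Names are hypothetical. */
final class IupacNotCodes {
    private IupacNotCodes() {}

    /** Returns the IUPAC code meaning "any base except this one". */
    static char notCode(final char base) {
        switch (base) {
            case 'A': return 'B'; // C, G or T
            case 'C': return 'D'; // A, G or T
            case 'G': return 'H'; // A, C or T
            case 'T':
            case 'U': return 'V'; // A, C or G: one code serves both T and U
            default: throw new IllegalArgumentException("Not a plain base: " + base);
        }
    }

    /** Returns the letter immediately after the base in the alphabet. */
    static char nextLetter(final char base) {
        return (char) (base + 1);
    }

    /** True when the not-code is simply the next letter after the base. */
    static boolean followsPlusOneRule(final char base) {
        return notCode(base) == nextLetter(base);
    }

    public static void main(final String[] args) {
        for (final char base : new char[] {'A', 'C', 'G', 'T', 'U'}) {
            System.out.println("not-" + base + " = " + notCode(base)
                    + ", next letter = " + nextLetter(base)
                    + ", +1 rule holds: " + followsPlusOneRule(base));
        }
    }
}
```

For T the next letter would be U, which is already taken, so the code skips to V, which happens to be U +1.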

FWIW, I've already made this change to htslib in a merged PR, but not yet in a release.


Accept U as a valid base type and convert to T for BAM. #1728
