-
Notifications
You must be signed in to change notification settings - Fork 243
Description
Before you submit
Make sure your issue is not already in the htsjdk issue tracker
It was discussed in #1478, now closed, but see below
Description of the issue:
The background to this is samtools/samtools#2131 where a user has a SAM file containing U. Certainly we know ONT write fastqs with U (if RNA) and when they write BAM they translate them to T, but I'm still not entirely sure where this SAM came from. It broke htslib, and would also break htsjdk.
Your environment:
I checked the source, so largely irrelevant, but yeah our system supported Java is too old for the latest picard. :/
Steps to reproduce
Create a SAM file with U in the sequence.
SamFormatConverter from SAM to BAM fails with:
Exception in thread "main" java.lang.IllegalStateException: Bad base passed to charToCompressedBaseLow: #(35) in read: r1
Expected behaviour
I'd like it to encode U as T in BAM so the data at least can be round-tripped. The user has the option of doing a T to U substitution on decode if they wish later on. Ideally it'd be tracked in the meta-data somewhere too.
See samtools/hts-specs#800 and samtools/hts-specs#801 for context.
Given U is IUPAC, I feel it was an early accident to disallow it. I contast to #1478 I disagree that this is a base modification. There are still 4 base types, but DNA and RNA differ in the chemical structure for T vs U. We don't need to track which bases are T and which are U as it's simply a property of the material. Furthermore IUPAC doesn't permit this. It has no ambiguity codes for e.g. A/U. The only mention of U in the original IUPAC paper was for V listed as not-T or not-U. All other not-? codes are the original base type +1 char. It's clear they chose V over U due to the T/U issue, but it's also clear the authors basically treat T and U as interchangeable and we should do too.
FWIW, I've already made this change to htslib in a merged PR, but not yet in a release.