Skip to content

Accept U as a valid base type and convert to T for BAM. #1728

@jkbonfield

Description

@jkbonfield

Before you submit

Make sure your issue is not already in the htsjdk issue tracker

It was discussed in #1478, now closed, but see below

Description of the issue:

The background to this is samtools/samtools#2131 where a user has a SAM file containing U. Certainly we know ONT write fastqs with U (if RNA) and when they write BAM they translate them to T, but I'm still not entirely sure where this SAM came from. It broke htslib, and would also break htsjdk.

Your environment:

I checked the source, so largely irrelevant, but yeah our system supported Java is too old for the latest picard. :/

Steps to reproduce

Create a SAM file with U in the sequence.
SamFormatConverter from SAM to BAM fails with:

Exception in thread "main" java.lang.IllegalStateException: Bad base passed to charToCompressedBaseLow: #(35) in read: r1

Expected behaviour

I'd like it to encode U as T in BAM so the data at least can be round-tripped. The user has the option of doing a T to U substitution on decode if they wish later on. Ideally it'd be tracked in the meta-data somewhere too.

See samtools/hts-specs#800 and samtools/hts-specs#801 for context.

Given U is IUPAC, I feel it was an early accident to disallow it. I contast to #1478 I disagree that this is a base modification. There are still 4 base types, but DNA and RNA differ in the chemical structure for T vs U. We don't need to track which bases are T and which are U as it's simply a property of the material. Furthermore IUPAC doesn't permit this. It has no ambiguity codes for e.g. A/U. The only mention of U in the original IUPAC paper was for V listed as not-T or not-U. All other not-? codes are the original base type +1 char. It's clear they chose V over U due to the T/U issue, but it's also clear the authors basically treat T and U as interchangeable and we should do too.

FWIW, I've already made this change to htslib in a merged PR, but not yet in a release.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions