Skip to content

Mean QScore calculation can be much quicker. #336

@rhpvorderman

Description

@rhpvorderman

Hi,

I had a look at the QScore calculation as defined in the code

def mean_qscore_from_qstring(qstring):

I see you use numpy to leverage the power of C to do multiple calculations. I wrote a fastq filter and I also went down this route. However after some time I realized that there are only 94 distinct phred values in the FASTQ format. So basically you are recalculating the same 94 values over and over again (and in practice, a lot less than 94).

I found that it was much quicker to create a lookup table:

import math
QSCORE_LOOKUP = [10 ** -((i - 33)/10) for i in range(128)]

def mean_qscore_from_qstring(qstring):
    """
    Convert qstring into a mean qscore
    """
    length = len(qstring)
    if len == 0: return 0.0
    qscore_values = qstring.encode('ASCII')  # This returns a bytes object which when iterated over returns integers
    sum_error = sum(QSCORE_LOOKUP[q] for q in qscore_values)
    return -10 * math.log10(sum_error / length)

In C this is even quicker of course. You can simply create a const double QSCORE_LOOKUP[128] = {...} in a generated header file or use a constexpr in C++ (I believe, I do not use C++ myself). I have a generated header file here: https://github.com/LUMC/fastq-filter/blob/develop/src/fastq_filter/score_to_error_rate.h. Feel free to use. Keep in mind that I did not do the substracting of the 33 offset in the C lookup as q-33 in C compiles to an instruction that only takes one clocktick anyway and there is no notable speed difference. So in C I do QSCORE_LOOKUP[q-33] .

If you want to use the very fast C implementation in python. I already made it in C in my fastq-filter. So a from fastq_filter import qualmean suffices.

No need to recalculate 94 distinct phred values all over again is what I am saying. Hope this helps.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions