-
Notifications
You must be signed in to change notification settings - Fork 129
Description
Hi,
I had a look at the QScore calculation as defined in the code
Line 124 in 21df7d5
def mean_qscore_from_qstring(qstring): |
I see you use numpy to leverage the power of C to do multiple calculations. I wrote a fastq filter and I also went down this route. However after some time I realized that there are only 94 distinct phred values in the FASTQ format. So basically you are recalculating the same 94 values over and over again (and in practice, a lot less than 94).
I found that it was much quicker to create a lookup table:
import math
QSCORE_LOOKUP = [10 ** -((i - 33)/10) for i in range(128)]
def mean_qscore_from_qstring(qstring):
"""
Convert qstring into a mean qscore
"""
length = len(qstring)
if len == 0: return 0.0
qscore_values = qstring.encode('ASCII') # This returns a bytes object which when iterated over returns integers
sum_error = sum(QSCORE_LOOKUP[q] for q in qscore_values)
return -10 * math.log10(sum_error / length)
In C this is even quicker of course. You can simply create a const double QSCORE_LOOKUP[128] = {...}
in a generated header file or use a constexpr in C++ (I believe, I do not use C++ myself). I have a generated header file here: https://github.com/LUMC/fastq-filter/blob/develop/src/fastq_filter/score_to_error_rate.h. Feel free to use. Keep in mind that I did not do the substracting of the 33 offset in the C lookup as q-33
in C compiles to an instruction that only takes one clocktick anyway and there is no notable speed difference. So in C I do QSCORE_LOOKUP[q-33]
.
If you want to use the very fast C implementation in python. I already made it in C in my fastq-filter. So a from fastq_filter import qualmean
suffices.
No need to recalculate 94 distinct phred values all over again is what I am saying. Hope this helps.