-
Notifications
You must be signed in to change notification settings - Fork 9
Description
It appears there may be a concurrency bug in Bazam. On our local cluster, the following reproduces the problem, though the exact point of failure is nondeterministic.
module load java/11.0.11
BAZAM=/group/bioi1/shared/tools/bazam/1.0.2-beta/bazam.jar
java -Xmx12G -jar ${BAZAM} -bam /misc/bioinf-ops/archie_main/cpipe-wgs/batches/R0029_test_project_grch38/52cceac70ab5876cdec0ce404a7499eacbcc80f1/align/NA24385-WGS-TWIST-FATHER.bam -r1 NA24385-WGS-TWIST-FATHER_L001_R1.fastq.gz -r2 NA24385-WGS-TWIST-FATHER_L001_R2.fastq.gz
/hpc/bpipeLibrary/shared/archie/archie-cli/scripts/fastq-pair-check.sh NA24385-WGS-TWIST-FATHER_L001_R*
For reference, the fastq-pair-check.sh
follows. The salient feature is phase 3, which uses awk to extract the read IDs and makes sure they are corresponding, as required. To isolate the problem, I used a hacked version which just outputs the first million read IDs and invokes diff. What is clear is that a block of read IDs is positioned differently in the R1 and R2 files. All the reads appear to be there in both files.
From reading the code, it looks like it is possible for a second locator thread to trigger a flush (via the shuffler) while a first locator thread is still flushing, leading to a race. I'm not sure exactly what the idiomatic fix is, but possibly a synchronized block for PairFormatter::flushBuffer()
.
#!/bin/bash
set -e
while (( "$#" ))
do
# For each pair of arguments....
lhs=$1
rhs=$2
shift 2
echo "testing file pair:"
echo " $lhs"
echo " $rhs"
# Phase 1: test the integrity of the gzip files
#
# This fails at the first file that has a problem.
# A more sophisticated version would check each separately
# and report multiple errors if required.
#
echo "phase 1: testing gzip integrity"
gzip -t $lhs
gzip -t $rhs
echo " success!"
# Phase 2: test that both files have the same number of lines
#
echo "phase 2: checking for equal numbers of lines"
lhsLen=$(zcat "$lhs" | wc -l)
rhsLen=$(zcat "$rhs" | wc -l)
if [ "${lhsLen}" != "${rhsLen}" ]
then
echo "files have different numbers of lines (${lhsLen} vs ${rhsLen})"
echo " $lhs"
echo " $rhs"
exit 1
fi
echo " success!"
# Phase 3: test that the read ID lines match
echo "phase 3: checking read IDs are corresponding."
cmp <(zcat "$lhs" | awk 'NR % 4 == 1 { print $1 }') <(zcat "$rhs" | awk 'NR % 4 == 1 { print $1 }')
echo " success!"
done