Skip to content

Blocks of reads out of place #22

@drtconway

Description

@drtconway

It appears there may be a concurrency bug in Bazam. On our local cluster, the following reproduces the problem, though the exact point of failure is nondeterministic.

module load java/11.0.11
BAZAM=/group/bioi1/shared/tools/bazam/1.0.2-beta/bazam.jar
java -Xmx12G -jar ${BAZAM} -bam /misc/bioinf-ops/archie_main/cpipe-wgs/batches/R0029_test_project_grch38/52cceac70ab5876cdec0ce404a7499eacbcc80f1/align/NA24385-WGS-TWIST-FATHER.bam -r1 NA24385-WGS-TWIST-FATHER_L001_R1.fastq.gz -r2 NA24385-WGS-TWIST-FATHER_L001_R2.fastq.gz
/hpc/bpipeLibrary/shared/archie/archie-cli/scripts/fastq-pair-check.sh NA24385-WGS-TWIST-FATHER_L001_R*

For reference, the fastq-pair-check.sh follows. The salient feature is phase 3, which uses awk to extract the read IDs and makes sure they are corresponding, as required. To isolate the problem, I used a hacked version which just outputs the first million read IDs and invokes diff. What is clear is that a block of read IDs is positioned differently in the R1 and R2 files. All the reads appear to be there in both files.

From reading the code, it looks like it is possible for a second locator thread to trigger a flush (via the shuffler) while a first locator thread is still flushing, leading to a race. I'm not sure exactly what the idiomatic fix is, but possibly a synchronized block for PairFormatter::flushBuffer().

#!/bin/bash
set -e

while (( "$#" ))
do
    # For each pair of arguments....
    lhs=$1
    rhs=$2
    shift 2

    echo "testing file pair:"
    echo "    $lhs"
    echo "    $rhs"

    # Phase 1: test the integrity of the gzip files
    #
    # This fails at the first file that has a problem.
    # A more sophisticated version would check each separately
    # and report multiple errors if required.
    #
    echo "phase 1: testing gzip integrity"
    gzip -t $lhs
    gzip -t $rhs
    echo "    success!"


    # Phase 2: test that both files have the same number of lines
    #
    echo "phase 2: checking for equal numbers of lines"
    lhsLen=$(zcat "$lhs" | wc -l)
    rhsLen=$(zcat "$rhs" | wc -l)
    if [ "${lhsLen}" != "${rhsLen}" ]
    then
        echo "files have different numbers of lines (${lhsLen} vs ${rhsLen})"
        echo "    $lhs"
        echo "    $rhs"
        exit 1
    fi
    echo "    success!"

    # Phase 3: test that the read ID lines match
    echo "phase 3: checking read IDs are corresponding."
    cmp <(zcat "$lhs" | awk 'NR % 4 == 1 { print $1 }') <(zcat "$rhs" | awk 'NR % 4 == 1 { print $1 }')
    echo "    success!"
done

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions