Blocks of reads out of place

It appears there may be a concurrency bug in Bazam. On our local cluster, the following reproduces the problem, though the exact point of failure is nondeterministic.

```
module load java/11.0.11
BAZAM=/group/bioi1/shared/tools/bazam/1.0.2-beta/bazam.jar
java -Xmx12G -jar ${BAZAM} -bam /misc/bioinf-ops/archie_main/cpipe-wgs/batches/R0029_test_project_grch38/52cceac70ab5876cdec0ce404a7499eacbcc80f1/align/NA24385-WGS-TWIST-FATHER.bam -r1 NA24385-WGS-TWIST-FATHER_L001_R1.fastq.gz -r2 NA24385-WGS-TWIST-FATHER_L001_R2.fastq.gz
/hpc/bpipeLibrary/shared/archie/archie-cli/scripts/fastq-pair-check.sh NA24385-WGS-TWIST-FATHER_L001_R*
```

For reference, the `fastq-pair-check.sh` follows. The salient feature is phase 3, which uses awk to extract the read IDs and makes sure they are corresponding, as required. To isolate the problem, I used a hacked version which just outputs the first million read IDs and invokes diff. What is clear is that a block of read IDs is positioned differently in the R1 and R2 files. All the reads appear to be there in both files.

From reading the code, it looks like it is possible for a second locator thread to trigger a flush (via the shuffler) while a first locator thread is still flushing, leading to a race. I'm not sure exactly what the idiomatic fix is, but possibly a synchronized block for `PairFormatter::flushBuffer()`.

```bash 
#!/bin/bash
set -e

while (( "$#" ))
do
    # For each pair of arguments....
    lhs=$1
    rhs=$2
    shift 2

    echo "testing file pair:"
    echo "    $lhs"
    echo "    $rhs"

    # Phase 1: test the integrity of the gzip files
    #
    # This fails at the first file that has a problem.
    # A more sophisticated version would check each separately
    # and report multiple errors if required.
    #
    echo "phase 1: testing gzip integrity"
    gzip -t $lhs
    gzip -t $rhs
    echo "    success!"


    # Phase 2: test that both files have the same number of lines
    #
    echo "phase 2: checking for equal numbers of lines"
    lhsLen=$(zcat "$lhs" | wc -l)
    rhsLen=$(zcat "$rhs" | wc -l)
    if [ "${lhsLen}" != "${rhsLen}" ]
    then
        echo "files have different numbers of lines (${lhsLen} vs ${rhsLen})"
        echo "    $lhs"
        echo "    $rhs"
        exit 1
    fi
    echo "    success!"

    # Phase 3: test that the read ID lines match
    echo "phase 3: checking read IDs are corresponding."
    cmp <(zcat "$lhs" | awk 'NR % 4 == 1 { print $1 }') <(zcat "$rhs" | awk 'NR % 4 == 1 { print $1 }')
    echo "    success!"
done
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Blocks of reads out of place #22

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Blocks of reads out of place #22

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions