-
Notifications
You must be signed in to change notification settings - Fork 3
fix: use buffered and script based get-reads implementation #126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
""" WalkthroughThe environment configuration was updated to add the Changes
Sequence Diagram(s)sequenceDiagram
participant Snakemake
participant get-reads.py
participant pysam
participant dnaio
Snakemake->>get-reads.py: Run script with BAM input and FASTQ outputs
get-reads.py->>pysam: Open BAM file for reading
loop For each alignment
get-reads.py->>get-reads.py: Buffer reads by query name
get-reads.py->>get-reads.py: Pair mates when both found
get-reads.py->>dnaio: Write paired reads to FASTQ
end
get-reads.py->>Snakemake: Log progress and completion
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
⏰ Context from checks skipped due to timeout of 90000ms (1)
🔇 Additional comments (1)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (3)
workflow/envs/pysam.yaml (1)
9-9
: Add a newline at the end of the file.The file is missing a newline character at the end, which is considered a best practice and required by some tools.
- dnaio =1.2 + - dnaio =1.2 +🧰 Tools
🪛 YAMLlint (1.37.1)
[error] 9-9: no new line character at the end of file
(new-line-at-end-of-file)
workflow/scripts/get-reads.py (2)
3-3
: Use a context manager for opening the log file.Using a context manager ensures the file is properly closed even if exceptions occur.
-sys.stderr = open(snakemake.log[0], "w") +with open(snakemake.log[0], "w") as log_file: + sys.stderr = log_file🧰 Tools
🪛 Ruff (0.11.9)
3-3: Use a context manager for opening files
(SIM115)
3-3: Undefined name
snakemake
(F821)
35-37
: Make the read1/read2 ordering more robust.The current code only checks if the alignment is read2, but doesn't verify the other alignment is read1. This could lead to incorrect pairing in some edge cases.
- if aln.is_read2: + if aln.is_read2 and mate_aln.is_read1: aln, mate_aln = mate_aln, aln + elif aln.is_read1 and mate_aln.is_read2: + # Already in correct order + pass + else: + # Handle unexpected read flags + continue
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting
📒 Files selected for processing (4)
workflow/envs/pysam.yaml
(1 hunks)workflow/rules/common.smk
(0 hunks)workflow/rules/download.smk
(1 hunks)workflow/scripts/get-reads.py
(1 hunks)
💤 Files with no reviewable changes (1)
- workflow/rules/common.smk
🧰 Additional context used
🪛 YAMLlint (1.37.1)
workflow/envs/pysam.yaml
[error] 9-9: no new line character at the end of file
(new-line-at-end-of-file)
🪛 Ruff (0.11.9)
workflow/scripts/get-reads.py
3-3: Use a context manager for opening files
(SIM115)
3-3: Undefined name snakemake
(F821)
17-17: Undefined name snakemake
(F821)
18-18: Undefined name snakemake
(F821)
22-22: Undefined name snakemake
(F821)
22-22: Undefined name snakemake
(F821)
🔇 Additional comments (3)
workflow/envs/pysam.yaml (1)
9-9
: Addition of dnaio looks good!The dnaio package is correctly added to support the new Python script-based approach for handling FASTQ files.
🧰 Tools
🪛 YAMLlint (1.37.1)
[error] 9-9: no new line character at the end of file
(new-line-at-end-of-file)
workflow/rules/download.smk (2)
6-6
: Parameter handling looks good.The branch function effectively replaces the previous get_read_limit_param approach, providing a cleaner way to conditionally set the read limit.
11-14
: Good refactoring to use a Python script instead of shell commands.Switching from a shell-based approach to a dedicated Python script improves maintainability and reliability for handling paired reads. The conda environment has been appropriately updated to include the required dependencies.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (4)
workflow/scripts/get-reads.py (4)
1-7
: Move imports to the top of the file.Following Python best practices (PEP 8), all imports should be placed at the top of the file before any other code. The current placement interrupts the import section with stderr redirection.
import sys +import pysam +import dnaio sys.stderr = open(snakemake.log[0], "w") - -import pysam -import dnaio🧰 Tools
🪛 Ruff (0.11.9)
3-3: Use a context manager for opening files
(SIM115)
3-3: Undefined name
snakemake
(F821)
3-3
: Use a context manager for the log file.The current approach of directly reassigning sys.stderr doesn't ensure the file is properly closed when the script ends. Using a context manager or at least closing the file at the end of the script is recommended.
-sys.stderr = open(snakemake.log[0], "w") +# Option 1: Redirect within a try-finally to ensure closing +log_file = open(snakemake.log[0], "w") +sys.stderr = log_file +try: + # Rest of your code would be indented here +finally: + log_file.close() + +# Option 2: Use a custom context manager for stderr redirection +class RedirectStderr: + def __init__(self, file_path): + self.file_path = file_path + self.original_stderr = None + self.file_obj = None + + def __enter__(self): + self.file_obj = open(self.file_path, "w") + self.original_stderr = sys.stderr + sys.stderr = self.file_obj + return self.file_obj + + def __exit__(self, exc_type, exc_val, exc_tb): + sys.stderr = self.original_stderr + if self.file_obj: + self.file_obj.close() + +with RedirectStderr(snakemake.log[0]): + # Rest of your code would be here🧰 Tools
🪛 Ruff (0.11.9)
3-3: Use a context manager for opening files
(SIM115)
3-3: Undefined name
snakemake
(F821)
38-40
: Add orphaned read handling at end of file processing.Currently, any reads in the buffer without mates at the end of processing are silently ignored. Consider handling these orphaned reads by either logging them or optionally writing them as single-end reads.
fqwriter.write(aln_to_fq(aln), aln_to_fq(mate_aln)) n_written += 1 + +# Add this at the end, after the loop completes +if buffer: + print(f"Warning: {len(buffer)} unpaired reads found in BAM file", file=sys.stderr) + # Optionally, write unpaired reads as single-end if needed + # for orphaned_aln in buffer.values(): + # fqwriter.write(aln_to_fq(orphaned_aln), None)
40-40
: Explicitly close the BAM file.The BAM file should be explicitly closed after processing to ensure proper resource cleanup, especially for large files.
fqwriter.write(aln_to_fq(aln), aln_to_fq(mate_aln)) n_written += 1 + +# Close the BAM file after processing +bam.close()
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting
📒 Files selected for processing (2)
workflow/rules/download.smk
(1 hunks)workflow/scripts/get-reads.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- workflow/rules/download.smk
🧰 Additional context used
🪛 Ruff (0.11.9)
workflow/scripts/get-reads.py
3-3: Use a context manager for opening files
(SIM115)
3-3: Undefined name snakemake
(F821)
17-17: Undefined name snakemake
(F821)
18-18: Undefined name snakemake
(F821)
22-22: Undefined name snakemake
(F821)
22-22: Undefined name snakemake
(F821)
🔇 Additional comments (2)
workflow/scripts/get-reads.py (2)
24-25
: LGTM!The condition to check if limit is None before comparing has been properly implemented, addressing a previous issue.
26-27
: LGTM!The condition to skip secondary and supplementary alignments is now correct, addressing a previous issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (3)
workflow/scripts/get-reads.py (3)
9-14
: Handle potential None quality scores in aln_to_fq function.The function doesn't handle the case where
aln.query_qualities
could be None, which might happen with some BAM files. This could lead to unexpected errors when creating the SequenceRecord.def aln_to_fq(aln): + # Handle case when query_qualities is None + qualities = aln.query_qualities + if qualities is None: + qualities = [0] * len(aln.get_forward_sequence()) + return dnaio.SequenceRecord( name=aln.query_name, sequence=aln.get_forward_sequence(), - quality=aln.query_qualities, + quality=qualities, )
17-18
: Add error handling for BAM file opening.The current code doesn't handle potential errors when opening the BAM file URL. This could lead to uncaught exceptions if the URL is invalid or the file is inaccessible.
limit = snakemake.params.limit -bam = pysam.AlignmentFile(snakemake.params.bam_url) +try: + bam = pysam.AlignmentFile(snakemake.params.bam_url) +except (IOError, ValueError) as e: + print(f"Error opening BAM file: {e}", file=sys.stderr) + sys.exit(1)🧰 Tools
🪛 Ruff (0.11.9)
17-17: Undefined name
snakemake
(F821)
18-18: Undefined name
snakemake
(F821)
29-33
: Fix inconsistent query name usage in buffer handling.The code computes
qname
by removing suffixes but then doesn't use it for the buffer lookup or storage, which could lead to missing mate pairs if the suffixes are inconsistently present.- qname = aln.query_name.removesuffix("/1").removesuffix("/2") + # Remove potential read suffixes for consistent mate pairing + qname = aln.query_name.removesuffix("/1").removesuffix("/2") - mate_aln = buffer.get(aln.query_name) + mate_aln = buffer.get(qname) if mate_aln is None: - buffer[aln.query_name] = aln + buffer[qname] = aln
🧹 Nitpick comments (3)
workflow/scripts/get-reads.py (3)
3-3
: Use a context manager for file operations.This line opens a file for stderr redirection but doesn't handle closing it properly. Using a context manager ensures the file is properly closed even if an exception occurs.
-sys.stderr = open(snakemake.log[0], "w") +# Use context manager or explicit close to prevent resource leaks +sys.stderr = open(snakemake.log[0], "w")Alternatively, consider using a context manager:
import contextlib with open(snakemake.log[0], "w") as log_file: with contextlib.redirect_stderr(log_file): # Rest of your code🧰 Tools
🪛 Ruff (0.11.9)
3-3: Use a context manager for opening files
(SIM115)
3-3: Undefined name
snakemake
(F821)
18-18
: Close BAM file explicitly after processing.The script doesn't explicitly close the BAM file after processing, which could lead to resource leaks. Consider using a context manager or explicitly closing the file.
-bam = pysam.AlignmentFile(snakemake.params.bam_url) +# Use context manager to ensure proper resource cleanup +with pysam.AlignmentFile(snakemake.params.bam_url) as bam: + # Move the processing code inside this context🧰 Tools
🪛 Ruff (0.11.9)
18-18: Undefined name
snakemake
(F821)
1-40
: Add docstring and include error handling for the script.The script lacks a docstring explaining its purpose and usage. Additionally, it could benefit from more comprehensive error handling, especially for the main processing loop.
import sys +""" +Extract paired-end reads from a BAM file and write them to FASTQ files. + +This script uses pysam to read alignments from a BAM file and dnaio to write +paired FASTQ output files. It buffers alignments by query name to pair mates, +skipping secondary and supplementary alignments. + +Input: + - BAM file URL (from snakemake.params.bam_url) + - Optional read limit (from snakemake.params.limit) + +Output: + - Paired FASTQ files (from snakemake.output[0] and snakemake.output[1]) + - Log file (from snakemake.log[0]) +"""Additionally, consider adding more robust error handling throughout:
try: with dnaio.open(snakemake.output[0], snakemake.output[1], mode="w") as fqwriter: # Main processing code except Exception as e: print(f"Error processing BAM file or writing FASTQ: {e}", file=sys.stderr) sys.exit(1)🧰 Tools
🪛 Ruff (0.11.9)
3-3: Use a context manager for opening files
(SIM115)
3-3: Undefined name
snakemake
(F821)
17-17: Undefined name
snakemake
(F821)
18-18: Undefined name
snakemake
(F821)
22-22: Undefined name
snakemake
(F821)
22-22: Undefined name
snakemake
(F821)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting
📒 Files selected for processing (1)
workflow/scripts/get-reads.py
(1 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
workflow/scripts/get-reads.py
3-3: Use a context manager for opening files
(SIM115)
3-3: Undefined name snakemake
(F821)
17-17: Undefined name snakemake
(F821)
18-18: Undefined name snakemake
(F821)
22-22: Undefined name snakemake
(F821)
22-22: Undefined name snakemake
(F821)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (2)
workflow/scripts/get-reads.py (2)
9-14
: Handle potential None quality scores in aln_to_fq function.The function doesn't handle the case where
aln.query_qualities
could be None, which might happen with some BAM files. This could lead to unexpected errors when creating the SequenceRecord.def aln_to_fq(qname, aln): + # Handle case when query_qualities is None + qualities = aln.query_qualities + if qualities is None: + qualities = [0] * len(aln.get_forward_sequence()) + return dnaio.SequenceRecord( name=qname, sequence=aln.get_forward_sequence(), - qualities=aln.query_qualities, + qualities=qualities, )
18-18
: Add error handling for BAM file opening.The current code doesn't handle potential errors when opening the BAM file URL. This could lead to uncaught exceptions if the URL is invalid or the file is inaccessible.
-bam = pysam.AlignmentFile(snakemake.params.bam_url) +try: + bam = pysam.AlignmentFile(snakemake.params.bam_url) +except (IOError, ValueError) as e: + print(f"Error opening BAM file: {e}", file=sys.stderr) + sys.exit(1)🧰 Tools
🪛 Ruff (0.11.9)
18-18: Undefined name
snakemake
(F821)
🧹 Nitpick comments (1)
workflow/scripts/get-reads.py (1)
3-3
: Use a context manager for file handling.The file is opened without using a context manager, which could lead to resource leaks if the script exits unexpectedly.
-sys.stderr = open(snakemake.log[0], "w") +with open(snakemake.log[0], "w") as log_file: + sys.stderr = log_file + + # Main script logic should be indented here + # Or use a separate function that's called within this context🧰 Tools
🪛 Ruff (0.11.9)
3-3: Use a context manager for opening files
(SIM115)
3-3: Undefined name
snakemake
(F821)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting
📒 Files selected for processing (1)
workflow/scripts/get-reads.py
(1 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
workflow/scripts/get-reads.py
3-3: Use a context manager for opening files
(SIM115)
3-3: Undefined name snakemake
(F821)
17-17: Undefined name snakemake
(F821)
18-18: Undefined name snakemake
(F821)
22-22: Undefined name snakemake
(F821)
22-22: Undefined name snakemake
(F821)
🔇 Additional comments (2)
workflow/scripts/get-reads.py (2)
24-27
: Fixed issues from previous review.The code now correctly:
- Checks if limit is None before comparing with n_written
- Uses the correct condition to skip secondary and supplementary alignments
Great job addressing these issues!
29-39
: Correct implementation of read pair matching.The code now properly:
- Removes suffixes from query names for consistent matching
- Uses the normalized query name for buffer lookup and storage
- Ensures correct ordering of read pairs before writing
This implementation effectively handles paired-end reads from BAM files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
workflow/scripts/get-reads.py (1)
22-42
: 🛠️ Refactor suggestionImplement buffer cleanup mechanism.
The current implementation stores unpaired reads in the buffer but never clears them, which could lead to memory issues with large BAM files containing many unpaired reads.
Consider implementing one of these approaches:
- Periodically check and remove old entries from the buffer
- Set a maximum buffer size and handle overflow
- After processing all alignments, report any unpaired reads left in the buffer
buffer = {} n_written = 0 +buffer_limit = 1000000 # Set a reasonable buffer size limit + with dnaio.open(snakemake.output[0], snakemake.output[1], mode="w") as fqwriter: for aln in bam: if limit is not None and n_written >= limit: break if aln.is_secondary or aln.is_supplementary: continue qname = aln.query_name.removesuffix("/1").removesuffix("/2") mate_aln = buffer.get(qname) if mate_aln is None: buffer[qname] = aln + # Check if buffer size exceeds limit + if len(buffer) > buffer_limit: + print(f"Warning: Buffer size exceeds {buffer_limit}. Some reads may be unpaired.", + file=sys.stderr) else: if aln.is_read2: aln, mate_aln = mate_aln, aln fqwriter.write(aln_to_fq(qname, aln), aln_to_fq(qname, mate_aln)) n_written += 1 + # Remove entry from buffer after successful write + del buffer[qname] + +# Report statistics on unpaired reads +if buffer: + print(f"Warning: {len(buffer)} reads remained unpaired after processing.", file=sys.stderr)🧰 Tools
🪛 Ruff (0.11.9)
24-24: Undefined name
snakemake
(F821)
24-24: Undefined name
snakemake
(F821)
♻️ Duplicate comments (2)
workflow/scripts/get-reads.py (2)
9-16
: Handle potential None quality scores in aln_to_fq function.The function doesn't handle the case where
aln.get_forward_qualities()
could be None, which might happen with some BAM files. This could lead to unexpected errors when creating the SequenceRecord.def aln_to_fq(qname, aln): + # Handle case when get_forward_qualities returns None + qualities = aln.get_forward_qualities() + if qualities is None: + qualities = [0] * len(aln.get_forward_sequence()) + quality_str = "!" * len(aln.get_forward_sequence()) # '!' is the lowest quality character + else: + quality_str = "".join(map(lambda qual: chr(qual + 33), qualities)) + return dnaio.SequenceRecord( name=qname, sequence=aln.get_forward_sequence(), - qualities="".join( - map(lambda qual: chr(qual + 33), aln.get_forward_qualities()) - ), + qualities=quality_str, )
19-21
: Add error handling for BAM file opening.The current code doesn't handle potential errors when opening the BAM file URL. This could lead to uncaught exceptions if the URL is invalid or the file is inaccessible.
+import os + limit = snakemake.params.limit -bam = pysam.AlignmentFile(snakemake.params.bam_url) +try: + bam = pysam.AlignmentFile(snakemake.params.bam_url) +except (IOError, ValueError) as e: + print(f"Error opening BAM file: {e}", file=sys.stderr) + sys.exit(1)🧰 Tools
🪛 Ruff (0.11.9)
19-19: Undefined name
snakemake
(F821)
20-20: Undefined name
snakemake
(F821)
🧹 Nitpick comments (2)
workflow/scripts/get-reads.py (2)
23-27
: Consider adding progress reporting.For large BAM files, processing can take a significant amount of time. Adding progress reporting would help monitor script execution.
buffer = {} n_written = 0 +report_interval = 100000 # Report every 100k read pairs +processed = 0 + with dnaio.open(snakemake.output[0], snakemake.output[1], mode="w") as fqwriter: for aln in bam: + processed += 1 + if processed % report_interval == 0: + print(f"Processed {processed} alignments, written {n_written} pairs", file=sys.stderr) + if limit is not None and n_written >= limit: + print(f"Reached limit of {limit} read pairs. Stopping.", file=sys.stderr) break🧰 Tools
🪛 Ruff (0.11.9)
24-24: Undefined name
snakemake
(F821)
24-24: Undefined name
snakemake
(F821)
37-41
: Validate read pairing before writing.The code assumes that if one read is marked as read2, then the other must be read1, but it doesn't explicitly verify this. Adding validation would prevent potential errors.
else: if aln.is_read2: aln, mate_aln = mate_aln, aln + # Validate proper read pairing + if (aln.is_read1 and mate_aln.is_read2) or (aln.is_read2 and mate_aln.is_read1): + # Reads are properly paired + pass + else: + print(f"Warning: Improper read pairing for {qname}. Flags: {aln.flag}, {mate_aln.flag}", + file=sys.stderr) + fqwriter.write(aln_to_fq(qname, aln), aln_to_fq(qname, mate_aln)) n_written += 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting
📒 Files selected for processing (1)
workflow/scripts/get-reads.py
(1 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
workflow/scripts/get-reads.py
3-3: Use a context manager for opening files
(SIM115)
3-3: Undefined name snakemake
(F821)
19-19: Undefined name snakemake
(F821)
20-20: Undefined name snakemake
(F821)
24-24: Undefined name snakemake
(F821)
24-24: Undefined name snakemake
(F821)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (2)
workflow/scripts/get-reads.py (2)
1-7
: Improve file handling with context managers and error handlingThe current implementation directly assigns to
sys.stderr
without proper error handling or cleanup mechanism.import sys +import os -sys.stderr = open(snakemake.log[0], "w") +# Redirect stderr to log file using context manager +try: + sys.stderr = open(snakemake.log[0], "w", buffering=1) +except (IOError, OSError) as e: + print(f"Error opening log file: {e}", file=sys.stderr) + sys.exit(1) import pysam import dnaio🧰 Tools
🪛 Ruff (0.11.9)
3-3: Use a context manager for opening files
(SIM115)
3-3: Undefined name
snakemake
(F821)
19-21
: Add error handling for BAM file operationsThe current code lacks error handling when opening and accessing the BAM file, which could lead to uncaught exceptions.
limit = snakemake.params.limit -bam = pysam.AlignmentFile(snakemake.params.bam_url) +try: + bam = pysam.AlignmentFile(snakemake.params.bam_url) +except (IOError, ValueError) as e: + print(f"Error opening BAM file: {e}", file=sys.stderr) + sys.exit(1)🧰 Tools
🪛 Ruff (0.11.9)
19-19: Undefined name
snakemake
(F821)
20-20: Undefined name
snakemake
(F821)
🧹 Nitpick comments (3)
workflow/scripts/get-reads.py (3)
24-29
: Add progress reporting for long-running operationsFor large BAM files, there's no indication of progress, which can make it difficult to monitor the script's execution.
n_written = 0 +processed = 0 +report_interval = 100000 # Report progress every 100k reads with dnaio.open(snakemake.output[0], snakemake.output[1], mode="w") as fqwriter: for aln in bam: + processed += 1 + if processed % report_interval == 0: + print(f"Processed {processed} alignments, written {n_written} pairs", file=sys.stderr) if limit is not None and n_written >= limit: + print(f"Reached limit of {limit} pairs, stopping", file=sys.stderr) break if aln.is_secondary or aln.is_supplementary: continue🧰 Tools
🪛 Ruff (0.11.9)
24-24: Undefined name
snakemake
(F821)
24-24: Undefined name
snakemake
(F821)
38-44
: Validate read pairs before writingThe current implementation assumes that reads with the same name belong to the same fragment, but additional validation could ensure correct pairing.
else: + # Ensure reads are properly paired + if not (aln.is_read1 and mate_aln.is_read2) and not (aln.is_read2 and mate_aln.is_read1): + print(f"Warning: Reads with name {qname} don't appear to be proper pairs based on flags", file=sys.stderr) + if aln.is_read2: aln, mate_aln = mate_aln, aln del buffer[qname] fqwriter.write(aln_to_fq(qname, aln), aln_to_fq(qname, mate_aln)) n_written += 1
1-3
: Import snakemake explicitly for clarityThe script uses the
snakemake
variable without importing it, which works in Snakemake's execution environment but reduces code clarity.import sys +# The snakemake variable is implicitly available when executed through Snakemake +# Adding this comment improves clarity for developers unfamiliar with Snakemake sys.stderr = open(snakemake.log[0], "w")🧰 Tools
🪛 Ruff (0.11.9)
3-3: Use a context manager for opening files
(SIM115)
3-3: Undefined name
snakemake
(F821)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting
📒 Files selected for processing (1)
workflow/scripts/get-reads.py
(1 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
workflow/scripts/get-reads.py
3-3: Use a context manager for opening files
(SIM115)
3-3: Undefined name snakemake
(F821)
19-19: Undefined name snakemake
(F821)
20-20: Undefined name snakemake
(F821)
24-24: Undefined name snakemake
(F821)
24-24: Undefined name snakemake
(F821)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (2)
workflow/scripts/get-reads.py (2)
9-16
: 🛠️ Refactor suggestionHandle potential None qualities in the aln_to_fq function.
The function doesn't handle the case where
aln.get_forward_qualities()
could be None or empty, which can happen with some BAM files.def aln_to_fq(qname, aln): + # Handle case when forward qualities might be None + qualities = aln.get_forward_qualities() + if qualities is None: + # Default to quality score of 30 (ASCII '?') for all bases + quality_str = "?" * len(aln.get_forward_sequence()) + else: + quality_str = "".join(map(lambda qual: chr(qual + 33), qualities)) + return dnaio.SequenceRecord( name=qname, sequence=aln.get_forward_sequence(), - qualities="".join( - map(lambda qual: chr(qual + 33), aln.get_forward_qualities()) - ), + qualities=quality_str, )
1-7
: 🛠️ Refactor suggestionUse a context manager for the log file.
The current code directly assigns to
sys.stderr
without ensuring proper file closure. Also, the script usessnakemake
which is implicitly provided in the Snakemake environment but not imported explicitly.import sys -sys.stderr = open(snakemake.log[0], "w") +# Use context manager for log file with line buffering +log_file = open(snakemake.log[0], "w", buffering=1) +sys.stderr = log_file import pysam import dnaioThe code should also close the log file at the end of script execution or restore the original stderr.
🧰 Tools
🪛 Ruff (0.11.9)
3-3: Use a context manager for opening files
(SIM115)
3-3: Undefined name
snakemake
(F821)
🧹 Nitpick comments (2)
workflow/scripts/get-reads.py (2)
19-20
: Add error handling for BAM file opening.The current code doesn't handle potential errors when opening the BAM file. This could lead to uncaught exceptions if the URL is invalid or the file is inaccessible.
limit = snakemake.params.limit -bam = pysam.AlignmentFile(snakemake.params.bam_url) +try: + bam = pysam.AlignmentFile(snakemake.params.bam_url) +except (IOError, ValueError) as e: + print(f"Error opening BAM file: {e}", file=sys.stderr) + sys.exit(1)🧰 Tools
🪛 Ruff (0.11.9)
19-19: Undefined name
snakemake
(F821)
20-20: Undefined name
snakemake
(F821)
24-44
: The read pairing logic looks good but consider adding more validation.The read pairing logic is sound and handles the use case correctly. However, consider adding validation to ensure that paired reads actually belong together (e.g., checking read flags or positions).
else: + # Validate that reads are properly paired + if not (aln.is_paired and mate_aln.is_paired): + print(f"Warning: Read {qname} has inconsistent paired flags", file=sys.stderr) + if aln.is_read2: aln, mate_aln = mate_aln, aln del buffer[qname]🧰 Tools
🪛 Ruff (0.11.9)
24-24: Undefined name
snakemake
(F821)
24-24: Undefined name
snakemake
(F821)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting
📒 Files selected for processing (1)
workflow/scripts/get-reads.py
(1 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
workflow/scripts/get-reads.py
3-3: Use a context manager for opening files
(SIM115)
3-3: Undefined name snakemake
(F821)
19-19: Undefined name snakemake
(F821)
20-20: Undefined name snakemake
(F821)
24-24: Undefined name snakemake
(F821)
24-24: Undefined name snakemake
(F821)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Formatting
🔇 Additional comments (1)
workflow/scripts/get-reads.py (1)
46-47
: Good handling of orphaned reads.The code correctly handles orphaned reads by providing a warning message. This is a good improvement from previous iterations.
…ws/dna-seq-benchmark into fix/get-reads-buffered
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Should we check for updates for the wrappers in a seperate PR aswell? I noticed that you updated a few of them :)
I would say that is not urgent. In particular, it would trigger reruns, which are probably a good thing before re-uploading but not while you still finalize the analysis. |
Summary by CodeRabbit
New Features
Improvements
Refactor