
Commit 2a8f760

Merge pull request #92 from sanger-tol/dev
Release 0.3
2 parents (735b30d + aabcfc6), commit 2a8f760


50 files changed (+1005 / -168 lines)

.github/workflows/linting.yml

Lines changed: 2 additions & 2 deletions
@@ -32,7 +32,7 @@ jobs:
       - uses: actions/setup-node@v4
 
       - name: Install Prettier
-        run: npm install -g prettier
+        run: npm install -g prettier@3.1.0
 
       - name: Run Prettier --check
         run: prettier --check ${GITHUB_WORKSPACE}
@@ -84,7 +84,7 @@ jobs:
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
-          pip install nf-core
+          pip install nf-core==2.11
 
       - name: Run nf-core lint
        env:

.nf-core.yml

Lines changed: 1 addition & 0 deletions
@@ -17,6 +17,7 @@ lint:
     - docs/images/nf-core-blobtoolkit_logo_dark.png
     - .github/ISSUE_TEMPLATE/bug_report.yml
     - .github/PULL_REQUEST_TEMPLATE.md
+    - .github/workflows/linting.yml
   multiqc_config:
     - report_comment
   nextflow_config:

CHANGELOG.md

Lines changed: 30 additions & 0 deletions
@@ -3,6 +3,36 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [[0.3.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.3.0)] – Poliwag – [2024-02-09]
+
+The pipeline has now been validated on five genomes, all under 100 Mbp: a
+sponge, a platyhelminth, and three fungi.
+
+### Enhancements & fixes
+
+- Fixed the conditional runs of blastn
+- Fixed the generation of the no-hit list
+- Fixed the conversion of the unaligned input files to Fasta
+- Fixed the documentation about preparing the NT database
+- Fixed the detection of the NT database in the nf-core module
+- The pipeline now supports samplesheets generated by the
+  [nf-core/fetchngs](https://nf-co.re/fetchngs) pipeline by passing the
+  `--fetchngs_samplesheet true` option.
+- FastQ files can bypass the conversion to Fasta
+- Fixed missing BUSCO results from the blobdir (only 1 BUSCO was loaded)
+- Fixed the default category used to colour the blob plots
+- Fixed the output directory of the images
+- Added an option to select the format of the images (PNG or SVG)
+
+### Parameters
+
+| Old parameter | New parameter          |
+| ------------- | ---------------------- |
+|               | --fetchngs_samplesheet |
+|               | --image_format         |
+
+> **NB:** Parameter has been **updated** if both old and new parameter information is present. </br> **NB:** Parameter has been **added** if just the new parameter information is present. </br> **NB:** Parameter has been **removed** if new parameter information isn't present.
+
 ## [[0.2.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.2.0)] – Pikachu – [2023-12-22]
 
 ### Enhancements & fixes
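To see how the two new parameters fit into a run, here is a minimal command-line sketch. It is illustrative only: the profile, file names, the `png` value, and the use of `--input`/`--outdir` are placeholder assumptions based on the changelog wording and common nf-core conventions, and any other required pipeline parameters are omitted.

```bash
# Sketch only: paths, profile and values are assumptions, not part of this commit.
nextflow run sanger-tol/blobtoolkit \
   -profile docker \
   --input fetchngs_samplesheet.csv \
   --fetchngs_samplesheet true \
   --image_format png \
   --outdir results
```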

README.md

Lines changed: 2 additions & 20 deletions
@@ -13,19 +13,6 @@
 
 **sanger-tol/blobtoolkit** is a bioinformatics pipeline that can be used to identify and analyse non-target DNA for eukaryotic genomes. It takes a samplesheet and aligned CRAM files as input, calculates genome statistics, coverage and completeness information, combines them in a TSV file by window size to create a BlobDir dataset and static plots.
 
-<!--
-   Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
-   major pipeline sections and the types of output it produces. You're giving an overview to someone new
-   to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
--->
-
-<!-- Include a figure that guides the user through the major workflow steps. Many nf-core
-     workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples. -->
-
-<!-- # ![sanger-tol/blobtoolkit](https://raw.githubusercontent.com/sanger-tol/blobtoolkit/main/docs/images/sanger-tol-blobtoolkit_workflow.png) -->
-
-<!-- Fill in short bullet-pointed list of the default steps in the pipeline -->
-
 1. Calculate genome statistics in windows ([`fastawindows`](https://github.com/tolkit/fasta_windows))
 2. Calculate Coverage ([`blobtk/depth`](https://github.com/blobtoolkit/blobtk))
 3. Fetch associated BUSCO lineages ([`goat/taxonsearch`](https://github.com/genomehubs/goat-cli))
@@ -44,9 +31,6 @@
 > [!NOTE]
 > If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
 
-<!-- Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
-     Explain what rows and columns represent. For instance (please edit as appropriate): -->
-
 First, prepare a samplesheet with your input data that looks as follows:
 
 `samplesheet.csv`:
@@ -58,12 +42,10 @@ mMelMel1,illumina,GCA_922984935.2.illumina.mMelMel1.cram
 mMelMel3,ont,GCA_922984935.2.ont.mMelMel3.cram
 ```
 
-Each row represents an aligned file. Rows with the same sample identifier are considered technical replicates. The datatype refers to the sequencing technology used to generate the underlying raw data and follows a controlled vocabulary (ont, hic, pacbio, pacbio_clr, illumina). The aligned read files can be generated using the [sanger-tol/readmapping](https://github.com/sanger-tol/readmapping) pipeline.
+Each row represents an aligned file. Rows with the same sample identifier are considered technical replicates. The datatype refers to the sequencing technology used to generate the underlying raw data and follows a controlled vocabulary (`ont`, `hic`, `pacbio`, `pacbio_clr`, `illumina`). The aligned read files can be generated using the [sanger-tol/readmapping](https://github.com/sanger-tol/readmapping) pipeline.
 
 Now, you can run the pipeline using:
 
-<!-- update the following command to include all required parameters for a minimal example -->
-
 ```bash
 nextflow run sanger-tol/blobtoolkit \
    -profile <docker/singularity/.../institute> \
@@ -86,7 +68,7 @@ For more details, please refer to the [usage documentation](https://pipelines.to
 
 ## Pipeline output
 
-<!-- To see the the results of a test run with a full size dataset refer to the [results](https://pipelines.tol.sanger.ac.uk/blobtoolkit/results) tab on the sanger-tol website pipeline page. --> For more details about the output files and reports, please refer to the [output documentation](https://pipelines.tol.sanger.ac.uk/blobtoolkit/output).
+For more details about the output files and reports, please refer to the [output documentation](https://pipelines.tol.sanger.ac.uk/blobtoolkit/output).
 
 ## Credits
 
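For readers skimming the hunk above, a complete minimal samplesheet consistent with those rows might look like the following sketch. The `sample,datatype,datafile` header comes from the samplesheet checker's docstring and the two data rows are the ones visible in the diff, so treat the file as illustrative rather than canonical.

```bash
# Illustrative sketch: header per the samplesheet checker, rows copied from the hunk above.
cat > samplesheet.csv <<'EOF'
sample,datatype,datafile
mMelMel1,illumina,GCA_922984935.2.illumina.mMelMel1.cram
mMelMel3,ont,GCA_922984935.2.ont.mMelMel3.cram
EOF
```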

bin/check_fetchngs_samplesheet.py

Lines changed: 252 additions & 0 deletions
@@ -0,0 +1,252 @@
#!/usr/bin/env python


"""Provide a command line tool to validate and transform tabular samplesheets."""


import argparse
import csv
import logging
import sys
from collections import Counter
from pathlib import Path

logger = logging.getLogger()


class RowChecker:
    """
    Define a service that can validate and transform each given row.

    Attributes:
        modified (list): A list of dicts, where each dict corresponds to a previously
            validated and transformed row. The order of rows is maintained.

    """

    VALID_FORMATS = (".fastq.gz",)

    def __init__(
        self,
        accession_col="run_accession",
        model_col="instrument_model",
        platform_col="instrument_platform",
        library_col="library_strategy",
        file1_col="fastq_1",
        file2_col="fastq_2",
        **kwargs,
    ):
        """
        Initialize the row checker with the expected column names.

        Args:
            accession_col (str): The name of the column that contains the accession name
                (default "run_accession").
            model_col (str): The name of the column that contains the model name
                of the instrument (default "instrument_model").
            platform_col (str): The name of the column that contains the platform name
                of the instrument (default "instrument_platform").
            library_col (str): The name of the column that contains the strategy of the
                preparation of the library (default "library_strategy").
            file2_col (str): The name of the column that contains the second file path
                for the paired-end read data (default "fastq_2").

        """
        super().__init__(**kwargs)
        self._accession_col = accession_col
        self._model_col = model_col
        self._platform_col = platform_col
        self._library_col = library_col
        self._file1_col = file1_col
        self._file2_col = file2_col
        self._seen = set()
        self.modified = []

    def validate_and_transform(self, row):
        """
        Perform all validations on the given row.

        Args:
            row (dict): A mapping from column headers (keys) to elements of that row
                (values).

        """
        self._validate_accession(row)
        self._validate_file(row)
        self._seen.add((row[self._accession_col], row[self._file1_col]))
        self.modified.append(row)

    def _validate_accession(self, row):
        """Assert that the run accession name exists."""
        if len(row[self._accession_col]) <= 0:
            raise AssertionError("Run accession is required.")

    def _validate_file(self, row):
        """Assert that the datafile is non-empty and has the right format."""
        if len(row[self._file1_col]) <= 0:
            raise AssertionError("Data file is required.")
        self._validate_data_format(row[self._file1_col])
        if row[self._file2_col]:
            self._validate_data_format(row[self._file2_col])

    def _validate_data_format(self, filename):
        """Assert that a given filename has one of the expected FASTQ extensions."""
        if not any(filename.endswith(extension) for extension in self.VALID_FORMATS):
            raise AssertionError(
                f"The data file has an unrecognized extension: {filename}\n"
                f"It should be one of: {', '.join(self.VALID_FORMATS)}"
            )

    def validate_unique_accessions(self):
        """
        Assert that the combination of accession name and aligned filename is unique.

        In addition to the validation, also rename all accessions to have a suffix of _T{n}, where n is the
        number of times the same accession exist, but with different FASTQ files, e.g., multiple runs per experiment.

        """
        if len(self._seen) != len(self.modified):
            raise AssertionError("The pair of accession and file name must be unique.")
        seen = Counter()
        for row in self.modified:
            accession = row[self._accession_col]
            seen[accession] += 1
            row[self._accession_col] = f"{accession}_T{seen[accession]}"


def read_head(handle, num_lines=10):
    """Read the specified number of lines from the current position in the file."""
    lines = []
    for idx, line in enumerate(handle):
        if idx == num_lines:
            break
        lines.append(line)
    return "".join(lines)


def sniff_format(handle):
    """
    Detect the tabular format.

    Args:
        handle (text file): A handle to a `text file`_ object. The read position is
            expected to be at the beginning (index 0).

    Returns:
        csv.Dialect: The detected tabular format.

    .. _text file:
        https://docs.python.org/3/glossary.html#term-text-file

    """
    peek = read_head(handle)
    handle.seek(0)
    sniffer = csv.Sniffer()
    dialect = sniffer.sniff(peek)
    return dialect


def check_samplesheet(file_in, file_out):
    """
    Check that the tabular samplesheet has the structure expected by sanger-tol pipelines.

    Validate the general shape of the table, expected columns, and each row. Also add
    Args:
        file_in (pathlib.Path): The given tabular samplesheet. The format can be either
            CSV, TSV, or any other format automatically recognized by ``csv.Sniffer``.
        file_out (pathlib.Path): Where the validated and transformed samplesheet should
            be created; always in CSV format.

    Example:
        This function checks that the samplesheet follows the following structure,
        see also the `blobtoolkit samplesheet`_::

            sample,datatype,datafile
            sample1,hic,/path/to/file1.cram
            sample1,pacbio,/path/to/file2.cram
            sample1,ont,/path/to/file3.cram

    .. _blobtoolkit samplesheet:
        https://raw.githubusercontent.com/sanger-tol/blobtoolkit/main/assets/test/samplesheet.csv

    """
    required_columns = {
        "run_accession",
        "instrument_model",
        "instrument_platform",
        "library_strategy",
        "fastq_1",
        "fastq_2",
    }
    # See https://docs.python.org/3.9/library/csv.html#id3 to read up on `newline=""`.
    with file_in.open(newline="") as in_handle:
        reader = csv.DictReader(in_handle, dialect=sniff_format(in_handle))
        # Validate the existence of the expected header columns.
        if not required_columns.issubset(reader.fieldnames):
            req_cols = ", ".join(required_columns)
            logger.critical(f"The sample sheet **must** contain these column headers: {req_cols}.")
            sys.exit(1)
        # Validate each row.
        checker = RowChecker()
        for i, row in enumerate(reader):
            try:
                checker.validate_and_transform(row)
            except AssertionError as error:
                logger.critical(f"{str(error)} On line {i + 2}.")
                sys.exit(1)
        checker.validate_unique_accessions()
    header = list(reader.fieldnames)
    # See https://docs.python.org/3.9/library/csv.html#id3 to read up on `newline=""`.
    with file_out.open(mode="w", newline="") as out_handle:
        writer = csv.DictWriter(out_handle, header, delimiter=",")
        writer.writeheader()
        for row in checker.modified:
            writer.writerow(row)


def parse_args(argv=None):
    """Define and immediately parse command line arguments."""
    parser = argparse.ArgumentParser(
        description="Validate and transform a tabular samplesheet.",
        epilog="Example: python check_samplesheet.py samplesheet.csv samplesheet.valid.csv",
    )
    parser.add_argument(
        "file_in",
        metavar="FILE_IN",
        type=Path,
        help="Tabular input samplesheet in CSV or TSV format.",
    )
    parser.add_argument(
        "file_out",
        metavar="FILE_OUT",
        type=Path,
        help="Transformed output samplesheet in CSV format.",
    )
    parser.add_argument(
        "-l",
        "--log-level",
        help="The desired log level (default WARNING).",
        choices=("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"),
        default="WARNING",
    )
    parser.add_argument(
        "-v",
        "--version",
        action="version",
        version="%(prog)s 1.0.0",
    )
    return parser.parse_args(argv)


def main(argv=None):
    """Coordinate argument parsing and program execution."""
    args = parse_args(argv)
    logging.basicConfig(level=args.log_level, format="[%(levelname)s] %(message)s")
    if not args.file_in.is_file():
        logger.error(f"The given input file {args.file_in} was not found!")
        sys.exit(2)
    args.file_out.parent.mkdir(parents=True, exist_ok=True)
    check_samplesheet(args.file_in, args.file_out)


if __name__ == "__main__":
    sys.exit(main())
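The argparse block above fixes the invocation pattern, so running the new script on a fetchngs-style samplesheet would look roughly like the sketch below (the file names are placeholders).

```bash
# Sketch: file names are placeholders; the flags mirror the argparse definition above.
python bin/check_fetchngs_samplesheet.py \
    fetchngs_samplesheet.csv \
    samplesheet.valid.csv \
    --log-level INFO
```

Rows that repeat a `run_accession` with different FASTQ paths are accepted and renamed with an `_T{n}` suffix by `validate_unique_accessions`, while a duplicated accession/file pair raises an AssertionError.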

bin/check_samplesheet.py

Lines changed: 2 additions & 0 deletions
@@ -27,6 +27,8 @@ class RowChecker:
     VALID_FORMATS = (
         ".cram",
         ".bam",
+        ".fastq",
+        ".fastq.gz",
     )
 
     VALID_DATATYPES = (

bin/nohitlist.sh

Lines changed: 2 additions & 2 deletions
@@ -8,8 +8,8 @@ E=$4
 
 # find ids of sequences with no hits in the blastx search
 grep '>' $fasta | \
-grep -v -w -f <(awk -v evalue="$E" '{{if($14<{evalue}){{print $1}}}}' $blast | sort | uniq) | \
-cut -f1 | sed 's/>//' > $prefix.nohit.txt
+grep -v -w -f <(awk -v evalue="$E" '{if($14<evalue){print $1}}' $blast | sort | uniq) | \
+awk '{print $1}' | sed 's/>//' > $prefix.nohit.txt
 
 
 