
Commit 1130fde

merge dev into main (#1)
* populate the bin folder
* updated conda environment
* populate modules folder
* remove need for DBVERSION
* added README
* fixed package versioning and bumped to latest
* improved filter_best_bitscore.py
* updated filter_best_bitscore process
* updated extra info parsing
1 parent 34f1b17 commit 1130fde

18 files changed (+1091 / -115 lines)

.gitignore

Lines changed: 1 addition & 0 deletions
```diff
@@ -0,0 +1 @@
+bin/__pycache__
```

README.md

Lines changed: 154 additions & 0 deletions
@@ -0,0 +1,154 @@

# its-nf

Prepare a report for taxonomic assignment based on [ITS](https://en.wikipedia.org/wiki/Internal_transcribed_spacer) sequences, using [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi).

## Usage

The pipeline requires a CSV file listing the BLAST databases to run against, in the following format:

```csv
ID,DBNAME,PATH
ncbi,its_ncbi,/path/to/ncbi/2024-05-16_its_ncbi
unite,its_unite,/path/to/unite/2024-05-13_its_unite
```

...where we expect to find the actual database files at:

```
/path/to/ncbi/2024-05-16_its_ncbi/its_ncbi.ndb
/path/to/ncbi/2024-05-16_its_ncbi/its_ncbi.nhr
/path/to/ncbi/2024-05-16_its_ncbi/its_ncbi.nin
...etc
/path/to/unite/2024-05-13_its_unite/its_unite.ndb
/path/to/unite/2024-05-13_its_unite/its_unite.nhr
/path/to/unite/2024-05-13_its_unite/its_unite.nin
...etc
```
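Given that layout, each row of the databases CSV can be checked against the expected on-disk files. Below is a minimal sketch of such a check (a hypothetical helper, not part of the pipeline; only three of the BLAST database file extensions are verified):

```python
import csv
import os


def check_databases(databases_csv):
    """Return a list of expected BLAST database files that are missing."""
    missing = []
    with open(databases_csv) as f:
        for row in csv.DictReader(f):
            # BLAST nucleotide databases ship as <DBNAME>.ndb/.nhr/.nin, etc.
            for ext in ('ndb', 'nhr', 'nin'):
                db_file = os.path.join(row['PATH'], f"{row['DBNAME']}.{ext}")
                if not os.path.exists(db_file):
                    missing.append(db_file)
    return missing
```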

The pipeline also assumes that there is a `metadata.json` file alongside the database files:

```
/path/to/ncbi/2024-05-16_its_ncbi/metadata.json
/path/to/unite/2024-05-13_its_unite/metadata.json
```

The contents of the metadata file may vary by database, but we assume that:

- The file contains a single top-level object (not an array or atomic value).
- The top-level object includes these fields:

```
version
date
```
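A minimal `metadata.json` satisfying these assumptions might look like the following (values are hypothetical; any extra fields, such as `source` here, would simply be ignored):

```json
{
  "version": "9.0",
  "date": "2024-05-16",
  "source": "NCBI nt subset"
}
```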

The values associated with those fields will be incorporated into the BLAST results. All other fields in the `metadata.json` file are ignored.

```
nextflow run BCCDC-PHL/its-nf \
  --databases </path/to/blast/databases.csv> \
  --taxonkit_db </path/to/taxonkit/database/> \
  --fasta_input </path/to/fasta_dir> \
  --outdir </path/to/output_dir>
```

By default, minimum identity and coverage thresholds of 95% will be applied to the BLAST results. Alternate thresholds can be applied using the `--minid` and `--mincov` flags:

```
nextflow run BCCDC-PHL/its-nf \
  --databases </path/to/blast/databases.csv> \
  --taxonkit_db </path/to/taxonkit/database/> \
  --fasta_input </path/to/fasta_dir> \
  --minid 99.0 \
  --mincov 97.5 \
  --outdir </path/to/output_dir>
```
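The thresholds amount to a simple row filter on the `percent_identity` and `percent_coverage` columns. A minimal sketch of that filtering logic (illustrative rows, not the pipeline's actual filter process):

```python
# Apply --minid / --mincov style thresholds to parsed BLAST rows.
minid = 95.0
mincov = 95.0
rows = [
    {"subject_accession": "AB1", "percent_identity": "99.2", "percent_coverage": "100.0"},
    {"subject_accession": "CD2", "percent_identity": "91.0", "percent_coverage": "88.5"},
]
passed = [
    r for r in rows
    if float(r["percent_identity"]) >= minid and float(r["percent_coverage"]) >= mincov
]
print([r["subject_accession"] for r in passed])  # ['AB1']
```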

Collecting database metadata from the `metadata.json` file can be skipped using the `--no_db_metadata` flag:

```
nextflow run BCCDC-PHL/its-nf \
  --databases </path/to/blast/databases.csv> \
  --taxonkit_db </path/to/taxonkit/database/> \
  --no_db_metadata \
  --fasta_input </path/to/fasta_dir> \
  --outdir </path/to/output_dir>
```

## Outputs

Each sequence will have a separate output directory, named using the sequence ID parsed from the fasta header. That directory will contain:

```
<seq_id>_<db_id>_blast.csv
<seq_id>_<db_id>_blast_best_bitscore.csv
<seq_id>_<db_id>_blast_filtered.csv
<seq_id>_<db_id>_lineages.tsv
<seq_id>_<db_id>_seq_qc.csv
```
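The per-sequence directory naming can be illustrated by pulling a sequence ID out of a fasta header. A minimal sketch with a hypothetical header (the pipeline's own parsing may differ):

```python
# Take the first whitespace-delimited token after '>' as the sequence ID.
header = ">seq001 Aspergillus fumigatus ITS region"
seq_id = header[1:].split()[0]
print(seq_id)  # seq001
```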

The `_blast.csv`, `_blast_filtered.csv`, and `_blast_best_bitscore.csv` files have the following headers:

```
query_seq_id
subject_accession
subject_strand
query_length
query_start
query_end
subject_length
subject_start
subject_end
alignment_length
percent_identity
percent_coverage
num_mismatch
num_gaps
e_value
bitscore
subject_taxids
subject_names
genus
species
database_name
database_version
database_date
```

...though if the `--no_db_metadata` flag is used when running the pipeline, the last three fields will be omitted.

If there are multiple matches from the same species with equally good bitscores, the `_blast_best_bitscore.csv` file will include only one entry per species per database.

The `_lineages.tsv` file is generated by `taxonkit`, and has the following headers:

```
query_taxid
lineage
lineage_taxids
query_taxon_name
lineage_ranks
```

...where the `lineage`, `lineage_taxids`, and `lineage_ranks` fields are themselves semicolon-separated lists.
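Those semicolon-separated columns line up positionally, so they can be zipped back together rank-by-rank. A minimal sketch with an illustrative row (truncated lineage, not real pipeline output):

```python
# Rebuild a rank -> name mapping from the parallel semicolon-separated fields.
row = {
    "lineage": "Eukaryota;Ascomycota;Saccharomycetes",
    "lineage_taxids": "2759;4890;4891",
    "lineage_ranks": "superkingdom;phylum;class",
}
ranks = dict(zip(row["lineage_ranks"].split(';'), row["lineage"].split(';')))
print(ranks["phylum"])  # Ascomycota
```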

The `seq_qc.csv` file has the following headers:

```
seq_length
num_ambiguous_bases
num_n_bases
```
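A sketch of how those three fields could be computed, assuming `num_ambiguous_bases` counts any non-ACGT IUPAC code (that definition is an assumption; the pipeline's may differ):

```python
# Illustrative sequence containing IUPAC ambiguity codes (N, R, Y).
seq = "ACGTNRYACGTN"
seq_length = len(seq)
num_n_bases = seq.count('N')
num_ambiguous_bases = sum(1 for b in seq if b not in 'ACGT')  # assumption: any non-ACGT base
print(seq_length, num_ambiguous_bases, num_n_bases)  # 12 4 2
```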

There will also be collected outputs at the top level of the `--outdir` directory, named:

```
collected_blast.csv
collected_blast_best_bitscore.csv
```

...which will include results from all sequences.

bin/add_db_metadata.py

Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@

```python
#!/usr/bin/env python3

import argparse
import csv
import json
import sys


def parse_metadata(metadata_path):
    with open(metadata_path, 'r') as f:
        metadata = json.load(f)

    return metadata


def parse_blast_results(blast_results_path):
    header_fieldnames = []
    blast_results = []

    with open(blast_results_path, 'r') as f:
        header_line = f.readline().strip()
        header_fieldnames = header_line.split(',')

    with open(blast_results_path, 'r') as f:
        reader = csv.DictReader(f)
        for row in reader:
            blast_results.append(row)

    return header_fieldnames, blast_results


def main(args):
    metadata = parse_metadata(args.metadata)
    output_fieldnames, blast_results = parse_blast_results(args.blastresult)
    for record in blast_results:
        record['database_name'] = args.database_name
        record['database_version'] = metadata.get('version', None)
        record['database_date'] = metadata.get('date', None)

    output_fieldnames += [
        'database_name',
        'database_version',
        'database_date',
    ]

    writer = csv.DictWriter(sys.stdout, fieldnames=output_fieldnames, delimiter=',', lineterminator='\n', quoting=csv.QUOTE_MINIMAL, extrasaction='ignore')

    writer.writeheader()
    for record in blast_results:
        writer.writerow(record)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-m', '--metadata')
    parser.add_argument('-d', '--database_name')
    parser.add_argument('-b', '--blastresult')
    args = parser.parse_args()
    main(args)
```
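The core of the script is the per-row annotation loop; a condensed sketch of that behaviour on in-memory data (illustrative values standing in for a real `metadata.json` and BLAST CSV):

```python
import json

# Stand-in for the parsed metadata.json of a database.
metadata = json.loads('{"version": "9.0", "date": "2024-05-16"}')

# Stand-in for parsed BLAST result rows.
rows = [{"query_seq_id": "seq1", "bitscore": "812"}]

# Annotate each row with the database name, version, and date.
for row in rows:
    row["database_name"] = "its_ncbi"
    row["database_version"] = metadata.get("version", None)
    row["database_date"] = metadata.get("date", None)

print(rows[0]["database_name"], rows[0]["database_version"])  # its_ncbi 9.0
```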

bin/bind_taxonkit.py

Lines changed: 79 additions & 23 deletions
```diff
@@ -1,44 +1,100 @@
 #!/usr/bin/env python3
 
 import argparse
-import pandas as pd
-import numpy as np
+import csv
+import json
+import sys
 
 
-def main(args):
-    cols = 'subject_taxids,lineage,name,rank'.split(',')
-    #taxon_results = pd.read_csv('F1910235_taxon_results.txt',sep = '\t', names=cols)
-    taxon_results = pd.read_csv(args.taxonresult,sep = '\t', names=cols)
-
-    blast_results = pd.read_csv(args.blastresult,sep = ',')
-    taxon_results = taxon_results.dropna()
+def parse_taxonkit_lineage(taxonkit_path):
+    """
+    Parse taxonkit lineage outputs
+
+    :param taxonkit_path: path to taxonkit lineage output
+    :type taxonkit_path: str
+    :return: taxonkit lineages, by taxid. dict of dicts. keys of inner dicts: ['taxid', 'lineage', 'name', 'rank']
+    :rtype: dict[str, dict[str, str]]
+    """
+    taxonkit_lineage_by_taxid = {}
+    with open(taxonkit_path, 'r') as f:
+        reader = csv.DictReader(f, delimiter='\t')
+        for row in reader:
+            taxonkit_lineage_record = {}
+            query_taxid = row['query_taxid']
+            lineage = row['lineage']
+            lineage_split = lineage.split(';')
+            taxids = row['lineage_taxids']
+            taxids_split = taxids.split(';')
+            name = row['query_taxon_name']
+            ranks = row['lineage_ranks']
+            ranks_split = ranks.split(';')
 
-    conditions = [
-        (taxon_results['rank'] == "genus"),
-        (taxon_results['rank'] == "species"),
-        (taxon_results['rank'] == "strain")
-    ]
+            taxonkit_lineage_record['query_taxid'] = query_taxid
+            for idx, rank in enumerate(ranks_split):
+                if rank == 'species':
+                    taxonkit_lineage_record['species_taxid'] = taxids_split[idx]
+                    taxonkit_lineage_record['species_name'] = lineage_split[idx]
+                elif rank == 'genus':
+                    taxonkit_lineage_record['genus_taxid'] = taxids_split[idx]
+                    taxonkit_lineage_record['genus_name'] = lineage_split[idx]
 
-    choices_species = [None, taxon_results['lineage'].apply(lambda x: x.split(';')[-1]), taxon_results['lineage'].apply(lambda x: x.split(';')[-2])]
-    choices_genus = [taxon_results['lineage'].apply(lambda x: x.split(';')[-1]), taxon_results['lineage'].apply(lambda x: x.split(';')[-2]),taxon_results['lineage'].apply(lambda x: x.split(';')[-3])]
+            taxonkit_lineage_by_taxid[query_taxid] = taxonkit_lineage_record
+
+    return taxonkit_lineage_by_taxid
 
-    taxon_results['species'] = np.select(conditions,choices_species, default = taxon_results['lineage'].apply(lambda x: x.split(';')[-1]))
-    taxon_results['genus'] = np.select(conditions,choices_genus, default = taxon_results['lineage'].apply(lambda x: x.split(';')[-2]))
 
-    merged = pd.merge(blast_results,taxon_results[['subject_taxids','species','genus']],on='subject_taxids', how='left')
-    fil = merged['species'].str.contains('uncultured')
-    filtered_merged = merged[~fil]
+def parse_blast_results(blast_results_path):
+    """
+    Parse blast results
+
+    :param blast_results_path: path to blast results
+    :type blast_results_path: str
+    """
+    header_fieldnames = []
+    blast_results = []
+    with open(blast_results_path, 'r') as f:
+        header_line = f.readline().strip()
+        header_fieldnames = header_line.split(',')
+
+    with open(blast_results_path, 'r') as f:
+        reader = csv.DictReader(f)
+        for row in reader:
+            blast_results.append(row)
+
+    return header_fieldnames, blast_results
 
-    filtered_merged.to_csv(args.outfile)
+
+def main(args):
+    taxonkit_lineage_by_taxid = parse_taxonkit_lineage(args.taxonresult)
+
+    output_fieldnames, blast_results = parse_blast_results(args.blastresult)
+
+    for blast_result in blast_results:
+        subject_taxid = blast_result['subject_taxids']
+        if subject_taxid in taxonkit_lineage_by_taxid:
+            blast_result['species'] = taxonkit_lineage_by_taxid[subject_taxid].get('species_name', None)
+            blast_result['genus'] = taxonkit_lineage_by_taxid[subject_taxid].get('genus_name', None)
+        else:
+            blast_result['species'] = None
+            blast_result['genus'] = None
+
+    output_fieldnames += [
+        'genus',
+        'species',
+    ]
+    writer = csv.DictWriter(sys.stdout, fieldnames=output_fieldnames, delimiter=',', lineterminator='\n', quoting=csv.QUOTE_MINIMAL, extrasaction='ignore')
+    writer.writeheader()
+    for blast_result in blast_results:
+        writer.writerow(blast_result)
 
 
 if __name__ == "__main__":
     parser = argparse.ArgumentParser()
 
     parser.add_argument('-f','--taxonresult')
     parser.add_argument('-b','--blastresult')
-    parser.add_argument('-o','--outfile')
     args = parser.parse_args()
-    main(args)
+    main(args)
```
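The rank-indexed lookup in `parse_taxonkit_lineage` can be illustrated on a single lineage (illustrative taxa, not real pipeline output): the `lineage_ranks` and `lineage` columns are split in parallel, and the index of the `genus`/`species` rank selects the matching name.

```python
# Parallel rank and name lists, as produced by splitting one taxonkit row.
ranks = "superkingdom;phylum;class;order;family;genus;species".split(';')
lineage = ("Eukaryota;Ascomycota;Eurotiomycetes;Eurotiales;"
           "Aspergillaceae;Aspergillus;Aspergillus fumigatus").split(';')

record = {}
for idx, rank in enumerate(ranks):
    if rank == 'species':
        record['species_name'] = lineage[idx]
    elif rank == 'genus':
        record['genus_name'] = lineage[idx]

print(record['genus_name'], record['species_name'])  # Aspergillus Aspergillus fumigatus
```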
