GenBank genomes not found  #235

@tgurbich

Hello!

Thanks a lot for making this tool available. I ran the demo and analyzed the sample dataset with no issues, but when testing charcoal on my own dataset I am running into errors that seem to be caused by some genomes not downloading from GenBank correctly.

I am running this command:
python -m charcoal run zebrafish-test.conf -j 16

It fails with this error message:

Error in snakemake invocation: Command '['snakemake', '-s', 
'/users/tg/Misc/Tool_testing/charcoal/charcoal/Snakefile',  '--use-conda', 
'-j', '1', '-j', '16', '--configfile', '/users/tg/Misc/Tool_testing/charcoal/charcoal/conf/defaults.conf',
 '/users/tg/Misc/Tool_testing/charcoal/charcoal/conf/system.conf', 
'zebrafish-test.conf']' returned non-zero exit status 1.

This appears to be caused by a file not downloading from GenBank:

ERROR, skch::validateInputFile, Could not open genbank_genomes/GCF_002943105.1_genomic.fna.gz
[Thu Feb 16 11:58:47 2023]
Error in rule mashmap_compare:
    jobid: 1373
    output: output.zebrafish-test/stage2/MGYG000299400.fna.x.GCF_002943105.1.mashmap.align, 
output.zebrafish-test/stage2/MGYG000299400.fna.x.GCF_002943105.1.mashmap.out
    conda-env: /users/tg/Misc/Tool_testing/charcoal/.snakemake/conda/d01f2d1356a2c223e7b61208c452d8a0
    shell:
mashmap -q zebrafish-genomes/MGYG000299400.fna -r 
genbank_genomes/GCF_002943105.1_genomic.fna.gz -o 
output.zebrafish-test/stage2/MGYG000299400.fna.x.GCF_002943105.1.mashmap.align   
--pi 95 > output.zebrafish-test/stage2/MGYG000299400.fna.x.GCF_002943105.1.mashmap.out

(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

I checked the genbank_genomes folder: it did contain some genome files, but this accession (GCF_002943105) was not there.
I manually downloaded this file from GenBank and reran the snakemake command. It failed two more times (on GCA_000798955.1_genomic.fna.gz and GCF_000820225.1_genomic.fna.gz); I manually downloaded those files as well and reran the command. The workflow then failed on genome GCA_011046675.1, which has been suppressed in GenBank and is no longer available (https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/011/046/675/GCA_011046675.1_ASM1104667v1/assembly_status.txt).
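
For anyone hitting the same problem, a check along these lines could list all missing genome files in one pass instead of finding them one failed run at a time. This is only a sketch: `missing_genomes` is a hypothetical helper, not part of charcoal, and it assumes the files follow the `<accession>_genomic.fna.gz` naming seen in the errors above.

```python
from pathlib import Path


def missing_genomes(accessions, genome_dir="genbank_genomes"):
    """Return the accessions whose <accession>_genomic.fna.gz file is absent.

    Hypothetical helper, not part of charcoal. The file naming pattern is
    taken from the error messages in this report.
    """
    genome_dir = Path(genome_dir)
    return [
        acc
        for acc in accessions
        if not (genome_dir / f"{acc}_genomic.fna.gz").is_file()
    ]


if __name__ == "__main__":
    # Accessions that failed in this report; adjust the list as needed.
    wanted = ["GCF_002943105.1", "GCA_000798955.1", "GCF_000820225.1"]
    for acc in missing_genomes(wanted):
        print(f"missing: {acc}")
```

The accession list would have to come from whatever matches charcoal selected for the run; where that list is stored is not something I have confirmed.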

This is what the error looked like for the suppressed genome:

Error in rule download_matching_genomes_one_by_one:
    jobid: 0
    output: genbank_genomes/GCA_011046675.1_genomic.fna.gz

RuleException:
HTTPError in line 465 of /users/tg/Misc/Tool_testing/charcoal/charcoal/Snakefile:
HTTP Error 404: Not Found
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2357, in run_wrapper
  File "/users/tg/Misc/Tool_testing/charcoal/charcoal/Snakefile", line 465, in __rule_download_matching_genomes_one_by_one
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/urllib/request.py", line 214, in urlopen
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/urllib/request.py", line 523, in open
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/urllib/request.py", line 632, in http_response
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/urllib/request.py", line 561, in error
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/urllib/request.py", line 494, in _call_chain
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/urllib/request.py", line 641, in http_error_default
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 574, in _callback
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/concurrent/futures/thread.py", line 58, in run
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 560, in cached_or_run
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2390, in run_wrapper
Exiting because a job execution failed. Look above for error message
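
The traceback shows the 404 propagating out of `urllib.request.urlopen`, which stops the whole workflow. One possible way to tolerate suppressed genomes would be to catch that specific error and skip the accession. The following is a sketch under that assumption only; `download_or_skip` and its `opener` parameter are hypothetical names, and this is not the code in the charcoal Snakefile.

```python
import urllib.error
import urllib.request


def download_or_skip(url, dest, opener=urllib.request.urlopen):
    """Download url to dest; return False instead of raising on HTTP 404.

    Hypothetical helper, not charcoal's implementation. Any other HTTP
    error is re-raised so that real failures are still visible.
    """
    try:
        with opener(url) as response:
            data = response.read()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            # A 404 may mean the assembly was suppressed, as with
            # GCA_011046675.1 in this report.
            print(f"skipping {url}: HTTP 404 (possibly suppressed)")
            return False
        raise
    with open(dest, "wb") as fh:
        fh.write(data)
    return True
```

The downstream rules would still need to cope with the skipped file, so this only addresses the download step.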

I also ran charcoal on a small subset of my genomes (the ones that went through with no errors during this initial test); that run completed without errors and a report was generated successfully.