Topic Python Lesson 5 Live code

Set-up & Binder components for Python Lesson 5

Set-up

Accessing data from NCBI can sometimes be a pain in the a**. One common problem I run into is downloading genomes from GenBank in a large-scale automated fashion.

NCBI updates its list of the current genomes in GenBank (almost) daily. You can download that content from their FTP site.

The file assembly_summary_genbank.txt contains identifiers and the FTP address for all genomes in GenBank. The version I am currently looking at is a 212 MB file containing 722,260 genome entries.
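To give a feel for what parsing this file involves, here is a minimal sketch. It assumes the file is tab-delimited, that comment lines start with `#`, and that the last comment line before the data holds the column names. The function name `read_assembly_summary` and the sample rows are made up for illustration; they are not taken from the real file.

```python
import io


def read_assembly_summary(handle):
    """Yield one dict per genome entry, keyed by the header's column names.

    Assumption: lines starting with '#' are comments, and the last such
    line before the data lists the tab-separated column names.
    """
    header = None
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # A later comment line overwrites an earlier one, so the
            # line directly above the data wins.
            header = line.lstrip("#").strip().split("\t")
            continue
        if not line:
            continue
        if header is None:
            raise ValueError("no header line found before the data")
        yield dict(zip(header, line.split("\t")))


# Made-up placeholder content, shaped like the assumed layout.
sample = (
    "#   See the README for a description of the columns\n"
    "# assembly_accession\ttaxid\torganism_name\tftp_path\n"
    "GCA_000000001.1\t12345\tExample organism\t"
    "ftp://example.org/genomes/GCA_000000001.1_ASM1v1\n"
)

entries = list(read_assembly_summary(io.StringIO(sample)))
```

Reading the file as a stream like this avoids loading all 212 MB at once; swap the `io.StringIO` object for `open("assembly_summary_genbank.txt")` to run it on the real file.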

If we had a list of unique identifiers linked to genomes we were interested in, we could create a file that has the FTP locations of all the files we are interested in, and download them using the Unix command wget.

Script

Provided a list of unique identifiers, create wget commands to download the genome files of interest.
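One way this step could look is sketched below. It assumes each entry is a dict with an `ftp_path` key pointing at the genome's directory, that the sequence file inside that directory is named after the directory with a `_genomic.fna.gz` suffix, and that a missing path is recorded as `na`. The helper names `wget_command` and `commands_for`, and the example entries, are hypothetical.

```python
def wget_command(ftp_path, suffix="_genomic.fna.gz"):
    """Build a wget command for one genome directory.

    Assumption: the file to fetch is <directory>/<directory name><suffix>.
    """
    ftp_path = ftp_path.rstrip("/")
    basename = ftp_path.rsplit("/", 1)[-1]
    return f"wget {ftp_path}/{basename}{suffix}"


def commands_for(entries, wanted_ids, id_column):
    """Return wget commands for entries whose id_column is in wanted_ids."""
    wanted = set(wanted_ids)
    commands = []
    for entry in entries:
        ftp_path = entry.get("ftp_path", "")
        # Skip entries with no usable FTP location (assumed to be "na").
        if ftp_path in ("", "na"):
            continue
        if entry.get(id_column) in wanted:
            commands.append(wget_command(ftp_path))
    return commands


# Made-up entries for illustration only.
example_entries = [
    {"taxid": "12345",
     "ftp_path": "ftp://example.org/genomes/GCA_000000001.1_ASM1v1"},
    {"taxid": "67890", "ftp_path": "na"},
    {"taxid": "11111",
     "ftp_path": "ftp://example.org/genomes/GCA_000000002.1_ASM2v1"},
]

commands = commands_for(example_entries, ["12345", "67890"], "taxid")
```

Writing the returned commands to a file, one per line, gives a shell script that can be run later on whichever machine should hold the genomes.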

The file assembly_summary_genbank_subset.txt contains 15 entries to practice parsing with. The files sag-uniq-ids.txt and taxon-ids.txt provide two different types of unique identifiers that can be matched against this list.
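As a starting point for the practice files, here is a rough sketch of loading an identifier list and picking out the matching lines. It assumes each ID file holds one identifier per line and that an identifier matches when it equals a whole tab-separated field; if the identifiers only appear inside a longer field (for example, within an organism name), a substring test would be needed instead. The function names and sample text are hypothetical.

```python
import io


def read_ids(handle):
    """Collect non-blank lines into a set (assumes one ID per line)."""
    return {line.strip() for line in handle if line.strip()}


def matching_rows(rows, ids):
    """Keep data rows in which any whole field equals one of the IDs."""
    kept = []
    for row in rows:
        if row.startswith("#"):
            continue  # skip comment and header lines
        fields = row.rstrip("\n").split("\t")
        if ids.intersection(fields):
            kept.append(row)
    return kept


# Made-up stand-ins for an ID file and a summary file.
id_text = "12345\n\n67890\n"
summary_text = (
    "# assembly_accession\ttaxid\tftp_path\n"
    "GCA_000000001.1\t12345\tftp://example.org/genomes/A\n"
    "GCA_000000002.1\t99999\tftp://example.org/genomes/B\n"
)

ids = read_ids(io.StringIO(id_text))
hits = matching_rows(io.StringIO(summary_text), ids)
```

Using a set for the identifiers keeps each lookup fast, which matters once the same code is pointed at the full file rather than the 15-entry subset.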
