
Plasmid Database

pedroscampoy edited this page Jul 3, 2018 · 15 revisions

Follow these steps to download a reliable and complete plasmid database. The process takes several hours but only needs to be done once.

1. Download plasmid database info file:

ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/plasmids.txt

2. Fetch the sequence for every accession number into a single FASTA file using NCBI E-utilities:

This command outputs a raw FASTA file (plasmids.fna) with about 12,000 sequences.

awk 'BEGIN{FS="\t"} (NR>2) {if ($6 ~ "N") {print $6} else {print $7}}' plasmids.txt | while read -r i; do curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${i}&retmode=text&rettype=fasta"; done > plasmids.fna
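To make the accession selection in the command above easier to follow, here is a minimal sketch that runs the same awk filter on a made-up three-column-pair sample. The sample rows, the accessions in them, and the helper names `pick_accessions` and `efetch_url` are hypothetical and exist only for illustration. The sketch assumes, as the command does, that the first two lines of the file are skipped and that column 6 is preferred whenever it contains an "N", with column 7 as the fallback.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for plasmids.txt: two leading lines that the filter
# skips, then tab-separated rows. All values below are made up.
sample=$(mktemp)
printf 'skipped line 1\nskipped line 2\n' > "$sample"
printf 'org\tk\tg\ts\tp1\tNC_000001.1\tAB000001.1\n' >> "$sample"
printf 'org\tk\tg\ts\tp2\t-\tAB000002.1\n' >> "$sample"

# Same awk program as in the step above: print column 6 when it contains
# an "N", otherwise print column 7.
pick_accessions() {
    awk 'BEGIN{FS="\t"} (NR>2) {if ($6 ~ "N") {print $6} else {print $7}}' "$1"
}

# Build the efetch URL used in the step above for one accession.
efetch_url() {
    printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=%s&retmode=text&rettype=fasta\n' "$1"
}

pick_accessions "$sample"   # prints NC_000001.1 then AB000002.1

# Possible use against the real file (not run here). The pause keeps the
# request rate under the 3 requests per second that NCBI asks of clients
# without an API key:
#   pick_accessions plasmids.txt | while read -r acc; do
#       curl -s "$(efetch_url "$acc")"; sleep 0.4
#   done > plasmids.fna
```

The pause lengthens the download, so it is a trade-off between total run time and the risk of requests being rejected.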

3. Remove records containing unwanted terms

From the PlasmidID folder, run:

filter_fasta.sh -i PATH/TO/FILE/plasmids.fna -N -l gene -l partial -l putative -l protein -l hypothetical -o PATH/TO/FILE -n plasmids

This creates a file named plasmids_term.fasta. The -o argument sets the output directory and -n sets the base file name.
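One way to sanity-check the result of this step is to count the records and list any headers that still contain one of the terms. The sketch below is not how filter_fasta.sh works internally; it assumes the terms are matched against FASTA header lines and uses a made-up two-record file. The helper names `count_records` and `headers_with_terms` are hypothetical.

```shell
#!/usr/bin/env bash
# Made-up FASTA file used only to demonstrate the checks.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>seq1 Example plasmid pA, complete sequence
ACGT
>seq2 Example plasmid pB hypothetical protein gene, partial
ACGTAC
EOF

# Number of records = number of header lines.
count_records() {
    grep -c '^>' "$1"
}

# Headers that contain any of the terms passed with -l in the step above
# (case-insensitive, which may be stricter than the script itself).
headers_with_terms() {
    grep '^>' "$1" | grep -i -E 'gene|partial|putative|protein|hypothetical'
}

count_records "$fasta"        # prints 2
headers_with_terms "$fasta"   # prints the seq2 header only
```

Run on plasmids.fna and then on plasmids_term.fasta, the first function shows how many records were dropped, and the second should print nothing for the filtered file if the assumption about header matching holds.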

4. Remove redundancy

From the PlasmidID folder, run:

cdhit_cluster.sh -i PATH/TO/FILE/plasmids_term.fasta -p -c 100 -M 20000 -T 8

NOTE:

  • The -i argument is the path to the input FASTA file (plasmids_term.fasta from step 3)
  • The output is written to the same location as the input
  • Memory (-M) and number of threads (-T) should be adjusted to the computer that runs this command
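As a rough illustration of adapting these two values, the sketch below reads the number of online CPUs and prints the resulting command line instead of running it. It assumes that -M and -T are passed through to CD-HIT, where -M is a memory limit in megabytes and -T a thread count. The helper name `cdhit_args` is hypothetical, and the memory value is left at the 20000 used above because a sensible limit depends on the machine.

```shell
#!/usr/bin/env bash
# Print the clustering command for a given input, memory value and thread count.
# $1: input FASTA, $2: value for -M, $3: value for -T
cdhit_args() {
    printf 'cdhit_cluster.sh -i %s -p -c 100 -M %s -T %s\n' "$1" "$2" "$3"
}

# Number of online CPUs; falls back to 1 if getconf cannot report it.
threads=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

cdhit_args PATH/TO/FILE/plasmids_term.fasta 20000 "$threads"
```

Printing the command first makes it easy to review the values before starting a run that can take a long time.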

NOTE2:

This step is optional: PlasmidID works with any DNA database. Removing redundancy reduces execution time, and any other clustering software can be used instead.
