The first step is to create the project folder and the subfolders that we need.
mkdir project_A
cd project_A
mkdir rawdata
mkdir scripts
mkdir logs
mkdir output
The second step is to remove the adapters/primers from the sequences (cutadapt).
sbatch scripts/cutadapt.sh
SMART, Chaetoceros:
PRIMER_F="GCAGTTAAAAAGCTCGTAGT"   # EK-565F
PRIMER_R="TTTAAGTTTCAGCCTTGCG"    # 18S-EUK-1134R-UnonMet

PRIMER_F="CCAGCASCYGCGGTAATTCC"   # TAReuk454FWD1
PRIMER_R="ACTTTCGTTCTTGATYRATGA"  # TAReukREV3_modi
Input folder: rawdata/
Files:
- XX_R[1|2]*.fastq --> the raw data from the sequencer.
Output folder: output/cutadapt
Files:
- XX__trimmed_R[1|2].fastq --> same number of files as the input, sequences without adapters.
- seqkit_stats_trimmed.tsv --> stats of the output (trimmed) files.
- seqkit_stats_untrimmed.tsv --> stats of the input (untrimmed) files.
Pre-processing step: study the Q-score distribution (a minimal R sketch is given at the end of this step).
sbatch scripts/00_qscore.sh
Change lines 43 & 43: revise that the paths are correct for the cutadapt (input) and output.
Input folder: output/cutadapt
Files: the output files of step 2.
Output folder: output/00_qprofiles
Files:
- forward.pdf --> q_score graphs for forward files.
- reverse.pdf --> q_score graphs for reverse files.
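For orientation, this is a minimal sketch of the R code that 00_qscore.sh is assumed to wrap (DADA2's plotQualityProfile); the paths, file patterns and the aggregate option are illustrative, not taken from the actual script.

# Minimal sketch; adjust paths and patterns to the real script
library(dada2)
path <- "output/cutadapt"                                   # trimmed reads from step 2 (assumed location)
fwd_files <- sort(list.files(path, pattern = "_R1.fastq", full.names = TRUE))
rev_files <- sort(list.files(path, pattern = "_R2.fastq", full.names = TRUE))
pdf("output/00_qprofiles/forward.pdf")
print(plotQualityProfile(fwd_files, aggregate = TRUE))      # aggregated Q-score profile, forward reads
dev.off()
pdf("output/00_qprofiles/reverse.pdf")
print(plotQualityProfile(rev_files, aggregate = TRUE))      # aggregated Q-score profile, reverse reads
dev.off()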
The next step performs the filtering, denoising and merging.
sbatch scripts/01_all.sh
Change lines L8-11: the project name (project_A) and the run version, if needed.
path_principal <- "/home/ntimoneda/ntimoneda/projects/project_A"
path           <- "/home/ntimoneda/ntimoneda/projects/project_A/cutadapt"
output         <- "/home/ntimoneda/ntimoneda/projects/project_A/output"
name.run       <- "v1"
Change lines L28-30: parameters for filtering.
truncLen=c(280,270) --> length at which to truncate the forward and reverse reads (low-quality tails); see the graphs from step 2.5.
maxEE=c(4,6) --> maximum number of expected errors allowed in the filtering (forward and reverse).
Change line L69: parameter for merging (read-pair overlap).
minOverlap=10 --> minimum number of overlapping bases required to merge the read pairs. Overlap = (truncLen_fwd + truncLen_rev) - expected amplicon length; for example, with truncLen=c(280,270) and an expected amplicon of ~420 bp, the overlap is 280 + 270 - 420 = 130 bases. A minimal sketch of these calls is given after the output file list below.
Input folder: output/cutadapt
Files: the output files of step 2.
Output folder: output/filtered
Files:
- XX_[F|R]_filt.fastq --> same number of files as the input, filtered sequences.
- errors_name.run_[fwd|rev].pdf --> plots of the learned error rates.
- v1_seqtab.rds --> the merged sequence table.
- v1_track_analysis.v3.tsv --> stats of the output.
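The following is a minimal sketch of the DADA2 calls that 01_all.sh is assumed to wrap, reusing the path/output/name.run variables from the block above and the parameters discussed in this step; the real script's structure, file patterns and extra options may differ.

# Minimal sketch, assuming path/output/name.run as defined above
library(dada2)

fnFs <- sort(list.files(path, pattern = "_R1.fastq", full.names = TRUE))
fnRs <- sort(list.files(path, pattern = "_R2.fastq", full.names = TRUE))
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)

filtFs <- file.path(output, "filtered", paste0(sample.names, "_F_filt.fastq"))
filtRs <- file.path(output, "filtered", paste0(sample.names, "_R_filt.fastq"))

# Filtering (parameters edited at L28-30 of the script)
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen = c(280, 270), maxEE = c(4, 6),
                     truncQ = 2, rm.phix = TRUE, multithread = TRUE)

# Denoising: learn the error rates, then run the DADA2 algorithm
errF <- learnErrors(filtFs, multithread = TRUE)
errR <- learnErrors(filtRs, multithread = TRUE)
dadaFs <- dada(filtFs, err = errF, multithread = TRUE)
dadaRs <- dada(filtRs, err = errR, multithread = TRUE)

# Merging (minOverlap edited at L69 of the script) and sequence table
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, minOverlap = 10)
seqtab  <- makeSequenceTable(mergers)
saveRDS(seqtab, file.path(output, "filtered", paste0(name.run, "_seqtab.rds")))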
The last DADA2 step is to remove chimeras and trim the final sequences.
sbatch scripts/02_chimeras.sh
Change lines L38-41 in 02_chimeras.sh:
- Make sure the paths and file names are correct.
- L41: the minimum and maximum length of the final sequences; this depends on the amplicon length and the previous trimming options (see the sketch after the output list below).
Input: v1_seqtab.rds (from the previous step)
Output files:
- XX_track_analysis_final.tsv --> loss of reads at each step.
- XX_seqtab_final.rds --> final ASV table.
- XX_seqtab_final.fasta --> FASTA file with the final ASVs.
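A hedged sketch of the R code that 02_chimeras.sh is assumed to run; the length window below is only an illustration, set it at L41 of the script according to your amplicon.

# Minimal sketch, assuming output/name.run as above; length limits are illustrative
library(dada2)

seqtab <- readRDS(file.path(output, "filtered", paste0(name.run, "_seqtab.rds")))

# Remove chimeric sequences
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE)

# Keep only ASVs within the expected length range (example: 350-450 bp)
seqlen <- nchar(colnames(seqtab.nochim))
seqtab.final <- seqtab.nochim[, seqlen >= 350 & seqlen <= 450, drop = FALSE]

saveRDS(seqtab.final, paste0(name.run, "_seqtab_final.rds"))
uniquesToFasta(getUniques(seqtab.final), fout = paste0(name.run, "_seqtab_final.fasta"))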
The aim of this step is to assign taxonomy to the sequences; it is also used to generate the ASV table for phyloseq.
sbatch scripts/03_tax.sh
Change lines L42-46 in 03_tax.sh:
- Make sure the paths and file names are correct.
- L46: the identity cut-off for the assignment (see the sketch after the output list below).
Input: XX_seqtab_final.rds (from the previous step)
Output files:
- XX_tax_assignation.rds --> the taxonomic assignments of the ASVs.
- XX_merged_table.txt --> taxonomy and counts for each ASV.
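A minimal sketch of what 03_tax.sh might do with DADA2's assignTaxonomy; the reference database file, the minBoot value and the exact column layout of the merged table are assumptions, and the real script may use a different assigner or cut-off (see the VSEARCH alternative below).

# Minimal sketch; reference DB and cut-off are placeholders
library(dada2)

seqtab.final <- readRDS(paste0(name.run, "_seqtab_final.rds"))

tax <- assignTaxonomy(seqtab.final,
                      "pr2_dada2_reference.fasta.gz",   # hypothetical reference database file
                      minBoot = 80,                     # cut-off edited at L46 (assumed value)
                      multithread = TRUE)
saveRDS(tax, paste0(name.run, "_tax_assignation.rds"))

# Merge counts and taxonomy into one table (the real merged table's layout may differ)
merged <- cbind(t(seqtab.final), tax)
write.table(merged, paste0(name.run, "_merged_table.txt"),
            sep = "\t", quote = FALSE, col.names = NA)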
This step performs taxonomic assignment with VSEARCH. It can be done with global alignment, where you choose a minimum identity (similar to a BLAST search), or with SINTAX, where you specify a bootstrap cut-off.
sbatch tax_vsearch.sh
- usearch_global (BLAST-like):
  vsearch --usearch_global INPUT.fa --db DDBB.fa --id CUTOFF --alnout ALIGNOUTPUT.aln --blast6out OUTPUT_tax.out
- SINTAX (bootstrap):
  vsearch --sintax INPUT.fa --db DDBB.fa --sintax_cutoff CUTOFF --tabbedout OUTPUT.tbl
- XX_seqtab_final.fasta --> the output FASTA from step 4 (used as INPUT.fa here).
At this point, all of the analysis on the server is done. However, we need to reformat some files before moving to R and continuing with the analysis.
# Delete the taxonomy and the sequence of each ASV
# Calculate the number of columns
awk -F'\t' '{print NF; exit}' merged_table.txt
# Replace the number 32 below with: the previous result - 7
# (i.e. adjust 32 according to the number of samples in the project)
cut -f 1-32 merged_table.txt | awk '!($2="")' > clean_ASV_table.txt
# Keep only the taxonomy ranks with a bootstrap >= 0.6. Prepare the file for phyloseq.
sed 's/d.*+//' output.tbl | sed 's/;size=[0-9]*//' > pr2_custom.tbl
perl ~/Escriptori/ICM/perl/parse_tax_vsearch.v1.pl pr2_custom.tbl 8 > tax_table.tbl
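Once clean_ASV_table.txt and tax_table.tbl exist, they can be loaded into phyloseq in R roughly as below; the read.table() options depend on how the headers and row names ended up in your files, so treat this as a sketch rather than the exact downstream code.

# Minimal sketch of the phyloseq import; adjust read.table() to the actual file layout
library(phyloseq)

asv <- as.matrix(read.table("clean_ASV_table.txt", header = TRUE, row.names = 1))
tax <- as.matrix(read.table("tax_table.tbl", header = TRUE, row.names = 1, sep = "\t"))

ps <- phyloseq(otu_table(asv, taxa_are_rows = TRUE),
               tax_table(tax))
ps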