This repository contains the code for pre-processing metabarcoding data with OBITools v1.2.12 on Brown's high-performance computing cluster (HPC), OSCAR, via an RStudio Server. The code could be adapted to run on other HPCs or RStudio Servers.
Note: Pre-processing steps are run on individual sequencing runs to ensure that we can check "samples" and "controls" for contamination on a run-by-run basis.
The steps included in this repository are:
- setting up the folder structure for the pre-processing steps (Step 1a)
- removing primers from forward and reverse sequence reads (Step 1b)
- merging reads, dereplicating reads, and checking control samples for contamination (Step 1c)
The schematic below shows the entire bioinformatic pipeline for DNA metabarcoding data, but the steps included in this repository are shown in the light grey box at the top.
The following steps provide guidance on connecting to an RStudio Server on Oscar. There are three ways to interact with Oscar:
1. Through an RStudio Server hosted on Open OnDemand; all interactions are through the various RStudio panes.
2. Through a virtual Linux desktop on Open OnDemand (a full desktop with access to files, a command-line shell, and RStudio).
3. Through an SSH tunnel in a terminal (command-line only).
Option #1 is recommended for this use case, and allows us to choose a newer version of R.
- If you are not on campus, make sure you are connected to the Brown VPN.
- Navigate to the link in #1 and choose R version 4.3.1.
- For the pre-processing work, 24 hours, 4 cores, and 24 GB of memory should suffice.
- Under Modules, put `git miniconda3`.
- Launch the session once it has been allocated.
- Go to the terminal pane in RStudio and `cd /oscar/data/tkartzin/<your folder>` (replace with your user folder).
- In that terminal: `git clone https://github.com/trklab-metabarcoding/obitools2-preprocessing-pipeline.git`
- Also in the terminal: `cd obitools2-preprocessing-pipeline`
- In the Files pane of RStudio, use the menu at the top right to make sure you are also at the same path.
- Double-click the `obitools2-pipeline.Rproj` file to set the project working directory. All of the notebooks are built from this working directory.
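For reference, the terminal commands from the steps above, collected in order:

```bash
# Run these from the terminal pane of your RStudio session on Oscar.
cd /oscar/data/tkartzin/<your folder>   # replace with your user folder
git clone https://github.com/trklab-metabarcoding/obitools2-preprocessing-pipeline.git
cd obitools2-preprocessing-pipeline
```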
Raw sequencing data is stored in dated folders at `/oscar/data/tkartzin/projects/<project_code>/`. The permissions at this location are limited to trklab members.
Project codes include:
- test: test data
- YNP: Yellowstone
- FJ: Fray Jorge
- MRC: Africa projects (giraffes/UHURU)
- Banff
- SEV: Sevilleta
- Sloths
If you need to copy data over to OSCAR, the easiest way is through the SMB client in your local Mac Finder app. Connect as described here and use the path displayed in this example. In the Finder window, drag and drop raw sequence files (`.fastq.gz`) from your local machine into a dated folder under the correct project code.
Note: The pipeline expects `.gz` file pairs for each sample (i.e., forward and reverse reads), so make sure both are copied over when transferring data.
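If you prefer the command line, a minimal rsync sketch from a local Mac terminal (the SSH host, username, and all paths are placeholders to adapt):

```bash
# Hedged alternative to the Finder/SMB transfer: copy both reads of each
# pair to OSCAR over SSH. Host, username, and paths are placeholders.
rsync -avP ~/sequencing/run_20240502/*.fastq.gz \
  <username>@ssh.ccv.brown.edu:/oscar/data/tkartzin/projects/<project_code>/20240502/
```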
Before running step 1, you need to complete a sample sheet. In the parent directory, take a look at `sample_sheet.xlsx` as an example, then fill out `sample_sheet_blank.xlsx` with your own metadata; save that file with your project name appended (e.g., `sample_sheet_Sloths.xlsx`) and upload it to the root directory of this repo. The sample sheet will also get copied over to the shared lab drive at `/oscar/data/tkartzin/projects/<project_code>`.
Important notes on formats:
- Dates should be in YYYYMMDD format (General or Text format in Excel).
- Make sure any controls are labeled simply "control" and samples simply "sample" in the SampleType column.
- Cross-reference your SampleNames with the actual files in the `project_code/raw_data` folder; sometimes the lab will omit leading zeros when entering sample names into the sequencing software. For example, the sample sheet may list a sample named `AJ0804`, but the actual file is named `AJ804_S196_L001_R1_001.fastq.gz`. Amend the sample sheet to match the actual file names in `raw_data` (see the sketch after this list for one way to automate this check).
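One hedged way to automate that cross-check, assuming you have exported the sample sheet to CSV (the file name below is hypothetical) with SampleName in the first column:

```bash
# Flag sample-sheet entries that have no matching raw files.
# Assumes: sheet exported to CSV (name is hypothetical), SampleName in
# column 1, and raw file names that start with the sample name.
cd /oscar/data/tkartzin/projects/<project_code>/raw_data
tail -n +2 sample_sheet_Sloths.csv | cut -d, -f1 | while read -r sample; do
  compgen -G "${sample}_*.fastq.gz" > /dev/null || echo "No raw files for: ${sample}"
done
```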
Note: While pre-processing steps are run on individual sequencing runs, multiple sequencing runs can be processed at the same time, so it is important that your sample sheet includes the folder name where each run's sequencing data can be found.
Note: The first notebook (Step_1a) is in the parent directory. Notebooks can be opened by double-clicking them in the RStudio `Files` pane.
Step_1a should run very quickly. Step_1b should run within minutes as you walk through each code chunk. Step_1c takes the longest, particularly the chunk that pairs forward and reverse reads. The entire preprocessing pipeline should take a few hours (tested with 127 samples).
The first step is to update all of the `params` in the YAML header of the first notebook. You can then click "Run All" from the drop-down menu at the top of the notebook to generate parameters and create environment variables.
This first notebook generates a new folder named with today's date and time (e.g., 20240502T10:43:32Z). Within this folder, you will find an individual folder for each sequencing date you specified in your sample sheet. Into each of these, the notebook copies the data, the source notebooks, and an empty results folder.
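As a rough illustration (all names below are examples, not the exact layout), the generated structure looks something like this:

```bash
# Illustrative layout after running Step_1a; names are examples only.
# 20240502T10:43:32Z/
# ├── 20240101/                # one folder per sequencing date in the sample sheet
# │   ├── <copied notebooks>   # the Step_1b and Step_1c sources
# │   ├── <copied raw data>    # the .fastq.gz pairs for that run
# │   └── results/             # empty until later steps write output
ls 20240502T10:43:32Z/
```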
For each sequencing date, you now need to navigate to the following notebooks.
The second notebook is where you set all of your parameters for trimming, filtering, primers, etc. This notebook also runs `cutadapt` to trim primers from your forward and reverse reads.
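For orientation, a minimal sketch of a paired-end `cutadapt` call (the primer sequences and file names are placeholders; the notebook sets the actual parameters):

```bash
# Hedged sketch of paired-end primer trimming with cutadapt.
# Primer sequences and file names are placeholders, not the lab's values.
cutadapt \
  -g GGGCAATCCTGAGCCAA \
  -G CCATTGAGTCTCTGCACCTATC \
  --discard-untrimmed \
  -o trimmed_R1.fastq.gz \
  -p trimmed_R2.fastq.gz \
  AJ804_S196_L001_R1_001.fastq.gz AJ804_S196_L001_R2_001.fastq.gz
```

Here `-g`/`-G` give the primers expected at the 5' ends of the forward and reverse reads, and `--discard-untrimmed` drops read pairs in which a primer was not found.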
The third notebook merges forward and reverse reads for each sample, filters merged reads, and de-replicates sequences across all samples into a single FASTA file. There are interactive steps at the end to investigate controls and move any suspicious samples out of the analysis.
At the end of this step, the output will be moved to a results folder in the appropriate sequencing run folder on `/oscar/data/tkartzin/projects/<project_code>/`.
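For orientation, a minimal sketch of the underlying OBITools v1 commands (the score threshold and file names are illustrative; the notebook sets the actual values):

```bash
# Hedged sketch of the Step 1c core using standard OBITools v1 commands.
# The alignment-score threshold and file names are illustrative only.
illuminapairedend --score-min=40 -r trimmed_R2.fastq trimmed_R1.fastq > merged.fastq
obigrep -p 'mode!="joined"' merged.fastq > aligned.fastq   # drop pairs that failed to align
obiuniq -m sample aligned.fastq > dereplicated.fasta       # dereplicate, keeping per-sample counts
```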
There are a few git commands that can help make sure you always have the latest code versions when running your analyses.
From the Terminal in RStudio:
- `git status` - check which branch you are on and view the staging area; you should see the `main` branch
- `git pull` - always good to run after you verify you are on `main`; it will pull down any changes since the last time you ran an analysis
Useful git commands:
- `git add <file>` - add a file to the staging area
- `git commit -m "<descriptive message>"` - commit the staged changes with a message (required)
- `git switch <branch>` - change to a different branch
- `git checkout -b <branch>` - make a new branch; just be aware of which branch you are currently on
- `git pull` - pull the latest changes from the remote repo; a good habit every time you switch to `main`
- `git stash` - stash your changes so your branch is clean before you switch to another branch
- `git stash pop` - pop the changes back out after you have switched to the desired branch
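For example, the stash pair above lets you carry uncommitted work across a branch switch:

```bash
git stash         # set aside uncommitted changes so the branch is clean
git switch main   # move to the branch you want to update
git pull          # bring main up to date with the remote
git stash pop     # reapply the stashed changes on the current branch
```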
- If your R session hangs, the environment variables will be lost, so it is best to start back at the top with Step 1a.
- The conda environments in steps 1b and 1c only need to be created once. They will take some time to resolve dependencies when first created, but thereafter they can simply be activated each time they are needed (see the sketch below).
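A minimal sketch of that create-once, activate-thereafter pattern (the environment file and name are placeholders; the notebooks give the actual commands):

```bash
# Placeholders: the notebooks define the real environment files and names.
conda env create -f environment_1b.yml   # slow the first time (resolves dependencies)
conda activate obitools-1b               # quick on every later session
```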