diff --git a/Submodule_00_Glossary.md b/AWS/Submodule_00_Glossary.md similarity index 100% rename from Submodule_00_Glossary.md rename to AWS/Submodule_00_Glossary.md diff --git a/AWS/Submodule_00_background.ipynb b/AWS/Submodule_00_background.ipynb new file mode 100644 index 0000000..3b56c0c --- /dev/null +++ b/AWS/Submodule_00_background.ipynb @@ -0,0 +1,450 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "b2476ae8-9aad-4594-a003-f92ad8b0e126", + "metadata": {}, + "source": [ + "# MDIBL Transcriptome Assembly Learning Module\n", + "# Notebook 0: Background Material" + ] + }, + { + "cell_type": "markdown", + "id": "5e6d2086-4dbf-4a61-a5bb-8f08a269f3fa", + "metadata": {}, + "source": [ + "## Overview\n", + "\n", + "This is a series of notebooks that allows you to explore the biological and computational process of the transcriptome assembly. Through these notebooks, you will also learn to leverage the powerful capabilities of tools such as Nextflow and Google Life Science API to bring your computational capabilities to the next level!\n", + "\n", + "Before you get started, please take this prerequisite that checks existing knowledge that will be assumed to be known through the rest of these workbooks.\n", + "\n", + "Throughout the notebooks, there will be periodic quizzes and knowledge checks that you are encouraged to do.\n", + "\n", + "Good luck, and have fun!" + ] + }, + { + "cell_type": "markdown", + "id": "3518c1a9", + "metadata": {}, + "source": [ + "## Learning Objectives:\n", + "\n", + "1. **Assess prior knowledge:** A pre-check quiz verifies foundational understanding of DNA, RNA, transcription, and gene expression.\n", + "\n", + "2. **Introduce transcriptome assembly:** Learners gain an understanding of what transcriptome assembly is, why RNA sequencing is performed, and the overall workflow involved.\n", + "\n", + "3. **Explain the process of transcriptome assembly:** This includes understanding preprocessing, sequence assembly using de Bruijn graphs, assembly assessment (internal and external consistency, BUSCO), and refinement techniques.\n", + "\n", + "4. **Introduce workflow management:** Learners are introduced to the concept of workflows/pipelines in bioinformatics and the role of workflow management systems like Nextflow.\n", + "\n", + "5. **Explain the use of Docker containers:** The notebook explains the purpose and benefits of using Docker containers for managing software dependencies in bioinformatics.\n", + "\n", + "6. **Introduce the Google Cloud Life Sciences API:** Learners are introduced to the Google Cloud Life Sciences API and its advantages for managing and executing workflows on cloud computing resources.\n", + "\n", + "7. **Familiarize learners with Jupyter Notebooks:** The notebook provides instructions on how to navigate and use Jupyter Notebooks, including cell types and execution order." + ] + }, + { + "cell_type": "markdown", + "id": "6a23eec6", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "* **Basic Biology Knowledge:** A foundational understanding of DNA, RNA, transcription, and gene expression is assumed. 
The notebook includes quizzes to assess this knowledge.\n", + "* **Python Programming:** While the notebook itself doesn't contain complex Python code, familiarity with Python syntax and the Jupyter Notebook environment is helpful.\n", + "* **Command Line Interface (CLI) Familiarity:** The notebook mentions using `pip` (a command-line package installer), indicating some CLI knowledge is beneficial, although not strictly required for completing the quizzes and reviewing the material." + ] + }, + { + "cell_type": "markdown", + "id": "f6eefc1e", + "metadata": {}, + "source": [ + "## Get Started" + ] + }, + { + "cell_type": "markdown", + "id": "22b95a28-fad7-4b6c-99ae-093c323f769c", + "metadata": {}, + "source": [ + "
\n", + " \n", + " Precheck:\n", + "
\n", + "\n", + ">Before you get started, please take this quick quiz that will verify some baseline knowledge on the ideas of DNA, RNA, Transcription, and Gene Expression." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8c479507-0e54-44e1-a727-d13dceaa1c7b", + "metadata": {}, + "outputs": [], + "source": [ + "# This is an install that you need to run once to allow the quizes to be functional.\n", + "!pip install jupyterquiz==2.0.7\n", + "!pip install jupytercards" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c0a4331b-494c-4054-93c8-ef8433ca7b40", + "metadata": {}, + "outputs": [], + "source": [ + "from jupyterquiz import display_quiz" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a89c8abf-fad1-4b23-b3fc-1a5933193fa0", + "metadata": {}, + "outputs": [], + "source": [ + "display_quiz(\"Transcriptome-Assembly-Refinement-and-Applications/quiz-material/00-pc1.json\")" + ] + }, + { + "cell_type": "markdown", + "id": "e1d5ab0d-1671-47c6-888d-a8d2774df30c", + "metadata": {}, + "source": [ + "
\n", + " \n", + " Note: Some Resources\n", + "
\n", + "\n", + ">If you feel unsure about your knowledge in any of these topics, please reference [Submodule_00_Glossary.md](./Submodule_00_Glossary.md) along with the National Human Genome Research Institute's [Glossary of Genomic and Genetic Terms](https://www.genome.gov/genetics-glossary)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "39b204c4-847a-49b3-afa0-c4da4216ace4", + "metadata": {}, + "outputs": [], + "source": [ + "#Run the command below to watch the video\n", + "from IPython.display import YouTubeVideo\n", + "\n", + "YouTubeVideo('abw2XAg1e_g', width=800, height=400)" + ] + }, + { + "cell_type": "markdown", + "id": "45b85d9f-4115-446d-8120-02c988a7769f", + "metadata": {}, + "source": [ + "## Why do we sequence RNA?\n", + "RNA-sequencing (RNA-seq) is the most common means by which biological samples are characterized at the molecular level. In brief, it is a means of measuring which genes have RNA copies (transcripts) present in a sample and in what relative abundance. The sample is prepared in such a way that DNA and proteins are degraded away, and then the remaining RNA is prepared such that it can be read (as a series of DNA bases A, C, G, and T) on a modern sequencer. Sequencing machines are generally classified as short read, which produces sequence read lengths of 50 to 150 nucleotides, or long-read, which can generate up to tens of thousands of bases. Short-read sequencers have been available for a longer time, and remain more capable of high throughput quantitative output, and these reads are the focus of our work here.\n", + "\n", + "The standard workflow analysis of RNA-seq data consists of these broad steps:\n", + "- Quality assessment and pre-processing\n", + "- Assignment of reads to transcripts/genes\n", + "- Normalization of reads between samples\n", + "- Assessment of differential expression of transcripts/genes between experimental conditions\n", + "- Interpretation of the resulting differential expression profiles\n", + "\n", + "Implicit in the workflow above is the existence of a target transcriptome to which the RNA-seq reads can be compared, aligned, and assigned for quantification. For well-studied organisms, such as human, mouse, zebrafish, or other model organisms, there are abundant reference materials available from such sites as [Ensembl](https://www.ensembl.org/), [NCBI](https://ncbi.nlm.nih.gov/), and the [UCSC Genome Browser](https://genome.ucsc.edu/).\n", + "\n", + "For less well-studied organisms, no such references are generally available, however, the RNA-seq data contains the information necessary to infer not only abundance but also the transcript sequences from which the data was generated. The process of inferring the starting transcripts from the data, termed ***Transcriptome Assembly***, is the focus of this module." + ] + }, + { + "cell_type": "markdown", + "id": "e392c553-1831-4978-af06-83359f8de746", + "metadata": {}, + "source": [ + "## Transcriptome Sequence Assembly\n", + "As a first approximation, sequence assembly of a single molecule (*e.g.*, a chromosome) can be thought of as analogous to the process of reconstructing a picture from smaller, overlapping segments of the picture. Overlapping pieces are identified and matched, extending the construct until an estimation of the complete picture is generated. 
To make this metaphor a bit more realistic, the subsegments of the original picture are *imperfect*, such that successful construction of the complete picture will require error identification (or at least estimation) and correction.\n",
+ "\n",
+ "In order to extend this analogy to transcriptome assembly, imagine that instead of one picture, our smaller segments are instead drawn from many pictures. Now the process of reconstruction will necessarily include a step that attempts to separate the smaller segments into related groups, after which the assembly procedure proceeds.\n",
+ "\n",
+ "#### Preprocessing and Data Cleaning\n",
+ "For reasons described below, stringent quality assessment and filtering of the data are generally carried out before the assembly process begins. The primary steps include:\n",
+ "- Removal of low-quality score data\n",
+ "- Removal of contaminant sequence data\n",
+ "- Removal of known functional RNA\n",
+ "\n",
+ "#### Sequence Assembly\n",
+ "![Conceptual diagram of a sequence-defined de Bruijn graph](images/deBruijnGraph.png)\n",
+ "\n",
+ "**Figure 1:** Conceptual diagram of a sequence-defined [de Bruijn graph](https://en.wikipedia.org/wiki/De_Bruijn_graph). (A) Each sequence in an RNA-seq data set is broken into overlapping *k*-mers. (B) Each *k*-mer becomes a node in the graph, shown in the example with *k*=6. Edges are drawn between nodes that overlap by *k*-1 contiguous nucleotides. (C) Putative transcripts (shown in distinct colors) are represented as traversals of one of the many connected components of the graph generated by the starting sequence set.
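\n",
+ "\n",
+ "To make the *k*-mer idea concrete, here is a small, illustrative Python sketch on a pair of made-up reads (a toy example only, not the algorithm any particular assembler uses):\n",
+ "\n",
+ "```python\n",
+ "# Break short example \"reads\" into overlapping k-mers, then connect k-mers\n",
+ "# that overlap by k-1 nucleotides: a toy version of a de Bruijn graph.\n",
+ "k = 6\n",
+ "reads = [\"ATGGCGTGCA\", \"GCGTGCAACT\"]  # made-up sequences for illustration\n",
+ "\n",
+ "kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}\n",
+ "edges = [(a, b) for a in kmers for b in kmers if a[1:] == b[:-1]]\n",
+ "\n",
+ "print(f\"{len(kmers)} nodes, {len(edges)} edges\")\n",
+ "for a, b in sorted(edges):\n",
+ "    print(f\"{a} -> {b}\")\n",
+ "```\n",
+ "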

\n", + "\n", + "#### Assembly Assessment\n", + "- Internal consistency\n", + " - Use of a de Bruijn graph is computationally efficient (especially compared to exhaustive pairwise alignment of all sequence reads), but all \"long-range\" information is weakened.\n", + " - The weakening of the long-range information necessitates further QC. The problem is that building complete transcripts from just *k*-mers and their probabilities means that we can generate sequences that are computationally possible but don't exist in the input data. Internal consistency refers to the process of aligning the original input reads to the output transcriptome. Transcripts that do not get sufficient coverage are flagged as probable artifacts.\n", + "- External consistency\n", + " - Studies of transcriptomes across many organisms have demonstrated common features. By \"external consistency\" we mean matching our new transcriptome to these expectations.\n", + " - [BUSCO](https://busco.ezlab.org/) is an innovative analysis and set of tools developed by the [EZlab at the Swiss Insitute of Bioinformatics](https://www.ezlab.org/). The fundamental idea behind BUSCO (**B**enchmarking **U**niversal **S**ingle-**C**opy **O**rthologs) derives from the Zdobnov group's analysis, which showed that for a defined phylogenetic range of organisms, there is a core set of protein-coding genes that are nearly universally present in only a single copy. The BUSCO tools test this assumption.\n", + " - The second standard process for external consistency is to align all predicted proteins for the new transcriptome to a complete set of proteins from a well-studied (e.g., fly or mouse) under the assumption that most of the proteins should match.\n", + "\n", + "#### Assembly Refinement\n", + "Assemblies are refined in several different manners:\n", + "- Removal of redundant (or likely so) transcripts, based on sequence similarity between assembled forms.\n", + "- Limitation to transcripts with predicted/conceptual translated protein sequences that match known proteins in other organisms.\n", + "\n", + "For Assembly refinement, the TransPi workflow relies primarily on the \"EvidentialGene\" tool." + ] + }, + { + "cell_type": "markdown", + "id": "725afa76-ab95-4ab5-8b85-6b6795436c0e", + "metadata": {}, + "source": [ + "## Workflow Execution with Nextflow" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "394e8677-1f82-4e42-9e37-535777128a1a", + "metadata": {}, + "outputs": [], + "source": [ + "#Run the command below to watch the video\n", + "from IPython.display import YouTubeVideo\n", + "\n", + "YouTubeVideo('FMcZD10Qrbs', width=800, height=400)" + ] + }, + { + "cell_type": "markdown", + "id": "624b90a4-732a-47a2-987e-e37d16f124bb", + "metadata": {}, + "source": [ + "\n", + "It is standard practice in modern biotechnology, bioinformatics, and computational biology that most complex analyses are carried out not by a single comprehensive program, but are instead carried out in a defined sequence of multiple programs. The process of running through these steps in the proper order is collectively called a ***workflow*** or ***pipeline***.\n", + "\n", + "\n", + "\n", + "Workflow management systems, *e.g.*, Nextflow, provide a syntax for defining the order of steps and the associated flow of information between steps, while also providing management/control software that can read and carry out these workflows. 
The workflow control systems (which are generally platform-specific) are responsible for allocating resources, activating analysis steps, and also making sure that all steps occur in the proper order (e.g., only activating a sequence alignment program after the sequence quality control has been performed).\n", + "\n", + "\n", + "\n", + "Workflows can be conceptually broken up into steps or modules (see the figure at left), which formalize the flow of information as inputs and outputs. A workflow conceptually ties the steps/modules together and enforces the dependencies (see the figure above), specifically in that if the output from one step is the input for a later step, the later step is blocked until the earlier step completes." + ] + }, + { + "cell_type": "markdown", + "id": "77e96ae4-5e6e-4499-9f6e-bda00cbbabc5", + "metadata": {}, + "source": [ + "## Running Individual Analysis Steps with Docker\n", + "One of the most frustrating aspects of carrying out computational biology/bioinformatics programs is installing and maintaining the software programs that we use. These programs are built by a wide variety of research and industrial organizations, and they are built on a wide variety of platforms and utilize an even wider set of supporting libraries and auxiliary components. The reason this causes problems is that the underlying dependencies can conflict with those of other programs or the operating system.\n", + "\n", + "One of our primary tools for efficient maintenance is a container system such as [Docker](https://www.docker.com/).\n", + "#### What are container systems and what are containers?\n", + "A container system is a program that creates protected environments within your computer in which programs and their dependencies can be loaded only as long as they are needed to run the program of interest. The container system can load and unload containers as needed. One of the primary benefits of such systems is that once a container has been defined for a specific program, it can be reused repeatedly on the same computer or shared with others through online repositories.\n", + "#### Why do we use containers?\n", + "We use containers because they allow us to run a broad range of computer programs without having to manage all of their underlying programmatic dependencies. Having a program encapsulated in a container also preserves our ability to continue to use that version of the program, even if either the program or its dependencies are updated." + ] + }, + { + "cell_type": "markdown", + "id": "9497e93b-b398-41fb-87a5-f1082e13a0aa", + "metadata": {}, + "source": [ + "## Running workflows using the Google Cloud Life Sciences API\n", + "The [Google Cloud Life Sciences API (GLS)](https://cloud.google.com/life-sciences) is a service provided by Google that both understands workflows and also controls, including activation, program execution, and deactivation of Google Cloud computing servers.\n", + "\n", + "#### What do we gain by using GLS\n", + "- The key to cost-efficient cloud computing is to only use the resources you need for as long as you need them. \n", + "- GLS allows us to control our process from a modest, inexpensive machine that can interface with GLS to provision and use the more expensive machines needed for computing.\n", + "- GLS explicitly supports the Nextflow workflow system that we are using, mapping computational tasks onto GCP computing resources." 
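+ ,
+ "\n",
+ "\n",
+ "To tie the last few sections together, here is a minimal, illustrative Python sketch of the one rule a workflow manager enforces: a step cannot start until the steps it depends on have completed. It is only a toy model (Nextflow and GLS do not work this way internally), and the step names are made up to mirror the stages described in this module:\n",
+ "\n",
+ "```python\n",
+ "# Toy workflow model: each step lists the steps whose output it needs.\n",
+ "steps = {\n",
+ "    \"quality_control\": [],\n",
+ "    \"normalization\": [\"quality_control\"],\n",
+ "    \"assembly\": [\"normalization\"],\n",
+ "    \"annotation\": [\"assembly\"],\n",
+ "}\n",
+ "\n",
+ "completed = set()\n",
+ "\n",
+ "def run(step):\n",
+ "    # A step is blocked until everything it depends on has completed.\n",
+ "    for upstream in steps[step]:\n",
+ "        if upstream not in completed:\n",
+ "            run(upstream)\n",
+ "    print(f\"running {step}\")\n",
+ "    completed.add(step)\n",
+ "\n",
+ "for step in steps:\n",
+ "    run(step)\n",
+ "```"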
+ ] + }, + { + "cell_type": "markdown", + "id": "97bd4c54-198a-4c94-a959-20d440a02156", + "metadata": {}, + "source": [ + "
\n", + " \n", + " Checkpoint 1:\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b64e6f8a-84ba-4279-b002-25ae4f0755ae", + "metadata": {}, + "outputs": [], + "source": [ + "display_quiz(\"Transcriptome-Assembly-Refinement-and-Applications/quiz-material/00-cp1.json\", shuffle_questions = True)" + ] + }, + { + "cell_type": "markdown", + "id": "1b46f12f-f439-425e-ba98-28bfe1b5ec77", + "metadata": {}, + "source": [ + "## Jupyter Notebook Introduction\n", + "\n", + "All of the content contained within this module is presented in Jupyter notebooks which have the `.ipynb` file type. *You are in a Jupyter notebook right now.* Within each notebook is a series of cells that can be individually executed by pressing the `shift + enter` keys at the same time.\n", + "\n", + "Each cell has options as to how it is executed. For example, the text that you are reading right now in this cell is in the `Markdown` cell type, but there are also `code`, and `raw` cell types. In these modules, you will primarily be seeing `Markdown` and `code` cells. *You can choose what each cell type is by using the drop-down menu at the top of the notebook.*\n", + "\n", + "> \n", + "\n", + "For the code cells, information carries over between cells, but in execution order. This is important because when looking at a series of cells you may be expecting a specific output, but receive a different output due to the order of execution." + ] + }, + { + "cell_type": "markdown", + "id": "02c8af23-fc70-4265-a07f-50ef77c12564", + "metadata": {}, + "source": [ + "
\n", + " \n", + " Example: Follow the steps in the cells below\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "da4d8800-63cb-4dd8-bc2e-b28a52513c76", + "metadata": {}, + "outputs": [], + "source": [ + "# Execute 1st:\n", + "var1 = 100" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "70147ffa-fe71-40eb-bf83-12e7cb0cfdce", + "metadata": {}, + "outputs": [], + "source": [ + "# Execute 2nd and 4th:\n", + "print(var1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4bb0a293-4feb-4ca3-b3fd-caf4df3ce6af", + "metadata": {}, + "outputs": [], + "source": [ + "# Execute 3rd:\n", + "var1 = 'not the same anymore'\n", + "# And now run the cell above" + ] + }, + { + "cell_type": "markdown", + "id": "927a8df2-87f9-45aa-a096-4e1133a97f64", + "metadata": {}, + "source": [ + ">As you can see, `var1` got overwritten, and when you retroactively re-run the `print(var1)` cell, the output has changed, even though it is above the variable assignment.\n", + "\n", + "In the following notebooks, there will be some code cells that will take a long time to run. *Sometimes with no output.* So there are two ways to check if the cell is still executing:\n", + "\n", + "1. The first way to check is to look to the left of the code cell. There will be an indication that looks like this: `[ ]:` If it is empty, then the cell has never been executed yet. If it looks like this: `[*]:`, that means that it is actively executing. And if it looks like this `[53]:`, that means that it has completed executing.\n", + "2. The second way, which will check to see if anything in the entire notebook is executing is in the top right of the notebook (Image Below). If the circle is empty, then nothing is actively executing. If the circle is grayed out, then there is something executing.\n", + "\n", + "> " + ] + }, + { + "cell_type": "markdown", + "id": "20730a46-7d8c-4dec-a6de-3dcf10fdb888", + "metadata": {}, + "source": [ + "
\n", + " \n", + " Knowledge Check: \n", + "
\n", + "\n", + ">Change the cell below from a code cell to a markdown cell. *Don't forget to execute the cell.*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7ebda978-4968-46f6-a330-927a978b43ef", + "metadata": {}, + "outputs": [], + "source": [ + "change our cell type\n", + "# I WANT TO BE BIGGER\n", + "*I want to be tilted*\n", + "\n", + "**I want to be bold**\n", + "\n", + "`And I want to have a grey background`" + ] + }, + { + "cell_type": "markdown", + "id": "d473f3b2-3440-47a5-98b9-950ceb66704e", + "metadata": {}, + "source": [ + "
\n", + " \n", + " Checkpoint 2:\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a080a9dd-9ea6-4d6a-b5d5-ccd3382f09ed", + "metadata": {}, + "outputs": [], + "source": [ + "display_quiz(\"Transcriptome-Assembly-Refinement-and-Applications/quiz-material/00-cp2.json\", shuffle_questions = True)" + ] + }, + { + "cell_type": "markdown", + "id": "032f3aa2-8e73-4e64-bda7-a4a35e0f2a99", + "metadata": {}, + "source": [ + "
\n", + " \n", + " Glossary: \n", + "
\n", + "\n", + "> Within the the file [`Submodule_00_glossary.md`](./Submodule_00_Glossary.md) you will find a compilation of useful terms that will be beneficial to refer to throughout the rest of the learning module." + ] + }, + { + "cell_type": "markdown", + "id": "8d3cf5c9", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "This introductory Jupyter Notebook provided essential background information and a pre-requisite knowledge check on fundamental molecular biology concepts (DNA, RNA, transcription, gene expression) crucial for understanding transcriptome assembly. The notebook established the context for the subsequent modules, outlining the workflow involving RNA-seq data, transcriptome assembly techniques (including de Bruijn graphs, BUSCO analysis), and the use of Nextflow and Google Cloud Life Sciences API for efficient workflow execution and management. The inclusion of interactive quizzes and video resources enhanced learning and engagement, preparing learners for the practical applications and computational challenges presented in the following notebooks. Successful completion of the checkpoint quizzes demonstrates readiness to proceed to the next stage of the MDIBL Transcriptome Assembly Learning Module." + ] + }, + { + "cell_type": "markdown", + "id": "421cebc3", + "metadata": {}, + "source": [ + "## Clean Up\n", + "\n", + "Remember to proceed to the next notebook [`Submodule_01_prog_setup.ipynb`](./Submodule_01_prog_setup.ipynb) or shut down your instance if you are finished." + ] + } + ], + "metadata": {}, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/Submodule_01_prog_setup.ipynb b/AWS/Submodule_01_prog_setup.ipynb similarity index 68% rename from Submodule_01_prog_setup.ipynb rename to AWS/Submodule_01_prog_setup.ipynb index 50ce29a..9f66445 100644 --- a/Submodule_01_prog_setup.ipynb +++ b/AWS/Submodule_01_prog_setup.ipynb @@ -6,11 +6,64 @@ "metadata": {}, "source": [ "# MDIBL Transcriptome Assembly Learning Module\n", - "# Notebook 1: Setup\n", + "# Notebook 1: Setup" + ] + }, + { + "cell_type": "markdown", + "id": "f62d616c", + "metadata": {}, + "source": [ + "## Overview\n", "\n", "This notebook is designed to configure your virtual machine (VM) to have the proper tools and data in place to run the transcriptome assembly training module." ] }, + { + "cell_type": "markdown", + "id": "60145056", + "metadata": {}, + "source": [ + "## Learning Objectives\n", + "\n", + "1. **Understand and utilize shell commands within Jupyter Notebooks:** The notebook explicitly teaches the difference between `!` and `%` prefixes for executing shell commands, and how to navigate directories using `cd` and `pwd`.\n", + "\n", + "2. **Set up the necessary software:** Students will install and configure essential tools including:\n", + " * Java (a prerequisite for Nextflow).\n", + " * Mambaforge (a package manager for bioinformatics tools).\n", + " * `sra-tools`, `perl-dbd-sqlite`, and `perl-dbi` (specific bioinformatics packages).\n", + " * Nextflow (a workflow management system).\n", + " * `aws s3` (for interacting with AWS S3 Storage).\n", + "\n", + "3. **Download and organize necessary data:** Students will download the TransPi transcriptome assembly software and its associated resources (databases, scripts, configuration files) from an S3 bucket. This includes understanding the directory structure and file organization.\n", + "\n", + "4. 
**Manage file permissions:** Students will use the `chmod` command to set executable permissions for the necessary files and directories within the TransPi software.\n", + "\n", + "5. **Navigate file paths:** The notebook provides examples and explanations for using relative file paths (e.g., `./`, `../`) within shell commands." + ] + }, + { + "cell_type": "markdown", + "id": "549be731", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "* **Operating System:** A Linux-based system is assumed (commands like `apt`, `uname` are used). The specific distribution isn't specified but a Debian-based system is likely.\n", + "* **Shell Access:** The ability to execute shell commands from within the Jupyter Notebook environment (using `!` and `%`).\n", + "* **Java Development Kit (JDK):** Required for Nextflow.\n", + "* **Miniforge** A package manager for installing bioinformatics tools.\n", + "* **`aws s3`:** The AWS command-line tool. This is crucial for downloading data from an S3 storage bucket." + ] + }, + { + "cell_type": "markdown", + "id": "a92f62a0", + "metadata": {}, + "source": [ + "## Get Started" + ] + }, { "cell_type": "markdown", "id": "958495ce-339d-4d4d-a621-9ede79a7363c", @@ -51,7 +104,7 @@ "## Time to begin!\n", "\n", "**Step 1:** To start, make sure that you are in the right starting place with a `cd`.\n", - "> `pwd` prints our current local working directory. Make sure the output from the command is: `/home/jupyter`" + "> `pwd` prints our current local working directory. Make sure the output from the command is: `/home/ec2-user/SageMaker`" ] }, { @@ -61,7 +114,7 @@ "metadata": {}, "outputs": [], "source": [ - "%cd /home/jupyter" + "%cd /home/ec2-user/SageMaker" ] }, { @@ -71,7 +124,7 @@ "metadata": {}, "outputs": [], "source": [ - "!pwd" + "! pwd" ] }, { @@ -89,31 +142,27 @@ "metadata": {}, "outputs": [], "source": [ - "!sudo apt update\n", - "!sudo apt-get install default-jdk -y\n", - "!java -version" + "! sudo apt update\n", + "! sudo apt-get install default-jdk -y\n", + "! java -version" ] }, { "cell_type": "markdown", - "id": "7b3ffb16-3395-4c01-9774-ee568e815490", + "id": "7b930ad7", "metadata": {}, "source": [ - "**Step 3:** Install Mambaforge, which is needed to support the information held within the TransPi databases.\n", - "\n", - ">Mambaforge is a package manager." + "**Step 3:** Using Mamba and bioconda, install the tools that will be used in this tutorial." ] }, { "cell_type": "code", "execution_count": null, - "id": "ac5b204a-f0db-4ceb-bf37-57eca6d77974", + "id": "4d4dd51e", "metadata": {}, "outputs": [], "source": [ - "!curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh\n", - "!bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge\n", - "!~/mambaforge/bin/mamba install -c bioconda sra-tools perl-dbd-sqlite perl-dbi -y" + "! mamba install -c bioconda sra-tools perl-dbd-sqlite perl-dbi -y" ] }, { @@ -131,9 +180,9 @@ "metadata": {}, "outputs": [], "source": [ - "!curl https://get.nextflow.io | bash\n", - "!chmod +x nextflow\n", - "!./nextflow self-update" + "! curl https://get.nextflow.io | bash\n", + "! chmod +x nextflow\n", + "! ./nextflow self-update" ] }, { @@ -152,7 +201,7 @@ "metadata": {}, "outputs": [], "source": [ - "!gsutil -m cp -r gs://nigms-sandbox/nosi-inbremaine-storage/TransPi ./" + "! aws s3 cp --recursive s3://nigms-sandbox/nosi-inbremaine-storage/TransPi ./TransPi" ] }, { @@ -162,10 +211,10 @@ "source": [ "
\n", " \n", - " Note: gsutil\n", + " Note: aws\n", "
\n", "\n", - ">`gsutil` is a tool allows you to interact with Google Cloud Storage through the command line." + ">`aws s3` is a tool allows you to interact with S3 Storage through the command line." ] }, { @@ -190,7 +239,7 @@ "metadata": {}, "outputs": [], "source": [ - "!gsutil -m cp -r gs://nigms-sandbox/nosi-inbremaine-storage/resources ./" + "! aws s3 cp --recursive s3://nigms-sandbox/nosi-inbremaine-storage/resources ./resources" ] }, { @@ -215,8 +264,7 @@ "> - They can also be stacked so `../../` will take you two layers up.\n", ">\n", ">- If you were to type `!ls ./nextWeek/` it would return the contents of the `nextWeek` directory which is one layer down from the current directory, so it would return `manyThings.txt`.\n", - ">\n", - ">**This means that in the second line of the code cell above, the file `TransPi.nf` will be copied from the Google Cloud Storage bucket to the current directory.**" + ">" ] }, { @@ -234,7 +282,7 @@ "metadata": {}, "outputs": [], "source": [ - "!chmod -R +x ./TransPi/bin" + "! chmod -R +x ./TransPi/bin" ] }, { @@ -295,19 +343,23 @@ }, { "cell_type": "markdown", - "id": "f80a7bab-98ae-45a6-845f-ad3c4138575a", + "id": "ffec658a", "metadata": {}, "source": [ - "## When you are ready, proceed to the next notebook: [`Submodule_02_basic_assembly.ipynb`](./Submodule_02_basic_assembly.ipynb)." + "## Conclusion\n", + "\n", + "This notebook successfully configured the virtual machine for the MDIBL Transcriptome Assembly Learning Module. We updated the system, installed necessary software including Java, Mambaforge, and Nextflow, and downloaded the TransPi program and its associated resources from Google Cloud Storage. The `chmod` command ensured executability of the TransPi scripts. The VM is now prepared for the next notebook, `Submodule_02_basic_assembly.ipynb`, which will delve into the transcriptome assembly process itself. Successful completion of this notebook's steps is crucial for the successful execution of subsequent modules." ] }, { - "cell_type": "code", - "execution_count": null, - "id": "934165c2-8fbd-4801-979f-6db5d1e592ea", + "cell_type": "markdown", + "id": "666c1e4d", "metadata": {}, - "outputs": [], - "source": [] + "source": [ + "## Clean Up\n", + "\n", + "Remember to proceed to the next notebook [`Submodule_02_basic_assembly.ipynb`](./Submodule_02_basic_assembly.ipynb) or shut down your instance if you are finished." + ] } ], "metadata": {}, diff --git a/Submodule_02_basic_assembly.ipynb b/AWS/Submodule_02_basic_assembly.ipynb similarity index 76% rename from Submodule_02_basic_assembly.ipynb rename to AWS/Submodule_02_basic_assembly.ipynb index 1ce40fe..b243a2e 100644 --- a/Submodule_02_basic_assembly.ipynb +++ b/AWS/Submodule_02_basic_assembly.ipynb @@ -8,6 +8,8 @@ "# MDIBL Transcriptome Assembly Learning Module\n", "# Notebook 2: Performing a \"Standard\" basic transcriptome assembly\n", "\n", + "## Overview\n", + "\n", "In this notebook, we will set up and run a basic transcriptome assembly, using the analysis pipeline as defined by the TransPi Nextflow workflow. The steps to be carried out are the following, and each is described in more detail in the Background material notebook.\n", "\n", "- Sequence Quality Control (QC): removing adapters and low-quality sequences.\n", @@ -23,12 +25,58 @@ "> **Figure 1:** TransPi workflow for a basic transcriptome assembly run." ] }, + { + "cell_type": "markdown", + "id": "062784ec", + "metadata": {}, + "source": [ + "## Learning Objectives\n", + "\n", + "1. 
**Understanding the TransPi Workflow:** Learners will gain a conceptual understanding of the TransPi workflow, including its individual steps and their order. This involves understanding the purpose of each stage (QC, normalization, assembly, integration, assessment, annotation, and reporting).\n", + "\n", + "2. **Executing a Transcriptome Assembly:** Learners will learn how to run a transcriptome assembly using Nextflow and the TransPi pipeline, including setting necessary parameters (e.g., k-mer size, read length). They will learn how to interpret the command-line interface for executing Nextflow workflows.\n", + "\n", + "3. **Interpreting Nextflow Output:** Learners will learn to navigate and understand the directory structure generated by the TransPi workflow. This includes interpreting the output from various tools such as FastQC, FastP, Trinity, TransAbyss, SOAP, rnaSpades, Velvet/Oases, EvidentialGene, rnaQuast, BUSCO, DIAMOND/BLAST, HMMER/Pfam, and TransDecoder. This involves understanding the different types of output files generated and how to extract relevant information from them (e.g., assembly statistics, annotation results).\n", + "\n", + "4. **Assessing Transcriptome Quality:** Learners will understand how to assess the quality of a transcriptome assembly using metrics generated by rnaQuast and BUSCO.\n", + "\n", + "5. **Interpreting Annotation Results:** Learners will learn to interpret the results of transcriptome annotation using tools like DIAMOND/BLAST and HMMER/Pfam, understanding what information they provide regarding protein function and domains.\n", + "\n", + "6. **Utilizing Workflow Management Systems:** Learners will gain practical experience using Nextflow, a workflow management system, to execute a complex bioinformatics pipeline. This includes understanding the benefits of using a defined workflow for reproducibility and efficiency.\n", + "\n", + "7. **Working with Jupyter Notebooks:** The notebook itself provides a practical example of how to integrate command-line tools within a Jupyter Notebook environment." + ] + }, + { + "cell_type": "markdown", + "id": "abf9345c", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "* **Nextflow:** A workflow management system used to execute the TransPi pipeline. \n", + "* **Docker:** Used for containerization of the various bioinformatics tools within the workflow. This avoids the need for local installation of numerous packages.\n", + "* **TransPi:** The specific Nextflow pipeline for transcriptome assembly. The notebook assumes it's present in the `/home/jupyter` directory.\n", + "* **Bioinformatics Tools (within TransPi):** The workflow utilizes several bioinformatics tools. These are packaged within Docker containers, but the notebook expects that TransPi is configured correctly to access and use them:\n", + " * FastQC: Sequence quality control.\n", + " * FastP: Read preprocessing (trimming, adapter removal).\n", + " * Trinity, TransAbyss, SOAPdenovo-Trans, rnaSpades, Velvet/Oases: Transcriptome assemblers.\n", + " * EvidentialGene: Transcriptome integration and reduction.\n", + " * rnaQuast: Transcriptome assessment.\n", + " * BUSCO: Assessment of completeness of the assembled transcriptome.\n", + " * DIAMOND/BLAST: Protein alignment for annotation.\n", + " * HMMER/Pfam: Protein domain assignment for annotation.\n", + " * Bowtie2: Read mapping for assembly validation.\n", + " * TransDecoder: ORF prediction and coding region identification.\n", + " * Trinotate: Functional annotation of transcripts." 
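+ ,
+ "\n",
+ "\n",
+ "As a quick, optional sanity check before you begin, the short Python snippet below confirms that the host-level prerequisites can be found on your PATH. The assemblers and annotation tools listed above run inside Docker containers, so they do not need to be installed locally:\n",
+ "\n",
+ "```python\n",
+ "import shutil\n",
+ "\n",
+ "# Optional check: Java and Nextflow were installed in Notebook 1, and Docker\n",
+ "# is expected on the notebook VM; everything else runs in containers.\n",
+ "for tool in (\"java\", \"nextflow\", \"docker\"):\n",
+ "    path = shutil.which(tool)\n",
+ "    print(f\"{tool}: {path if path else 'NOT FOUND'}\")\n",
+ "```"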
+ ] + }, { "cell_type": "markdown", "id": "6cd0f4f2-5559-4675-9e97-24b0548b31af", "metadata": {}, "source": [ - "## Time to get started! \n", + "## Get Started \n", "\n", "**Step 1:** Make sure you are in the correct local working directory as in `01_prog_setup.ipynb`.\n", "> It should be `/home/jupyter`." @@ -272,16 +320,28 @@ "outputs": [], "source": [ "from jupytercards import display_flashcards\n", - "display_flashcards('Transcriptome-Assembly-Refinement-and-Applications/quiz-material/02-cp1-1.json')\n", - "display_flashcards('Transcriptome-Assembly-Refinement-and-Applications/quiz-material/02-cp1-2.json')" + "display_flashcards('../quiz-material/02-cp1-1.json')\n", + "display_flashcards('../quiz-material/02-cp1-2.json')" + ] + }, + { + "cell_type": "markdown", + "id": "b82f0b3a", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "This Jupyter Notebook demonstrated a complete transcriptome assembly workflow using the TransPi Nextflow pipeline. We successfully executed the pipeline, encompassing quality control, normalization, multiple assembly generation with Trinity, TransAbyss, SOAP, rnaSpades, and Velvet/Oases, integration via EvidentialGene, and subsequent assessment using rnaQuast and BUSCO. The final assembly underwent annotation with DIAMOND/BLAST and HMMER/Pfam, culminating in comprehensive reports detailing the entire process and the resulting transcriptome characteristics. The generated output, accessible in the `basicRun/output` directory, provides a rich dataset for further investigation and analysis, including detailed quality metrics, assembly statistics, and functional annotations. This module provided a practical introduction to automated transcriptome assembly, highlighting the efficiency and reproducibility offered by integrated workflows like TransPi. Further exploration of the detailed output is encouraged, and the subsequent notebook focuses on a more in-depth annotation analysis." ] }, { "cell_type": "markdown", - "id": "b96dd6bb-a8ed-44bf-b1f4-bb284f8f0f3e", + "id": "b68484f3", "metadata": {}, "source": [ - "## When you are ready, proceed to the next notebook: [`Submodule_03_annotation_only.ipynb`](Submodule_03_annotation_only.ipynb)." + "## Clean Up\n", + "\n", + "Remember to proceed to the next notebook [`Submodule_03_annotation_only.ipynb`](Submodule_03_annotation_only.ipynb) or shut down your instance if you are finished." ] } ], diff --git a/Submodule_03_annotation_only.ipynb b/AWS/Submodule_03_annotation_only.ipynb similarity index 80% rename from Submodule_03_annotation_only.ipynb rename to AWS/Submodule_03_annotation_only.ipynb index 423ca38..906a023 100644 --- a/Submodule_03_annotation_only.ipynb +++ b/AWS/Submodule_03_annotation_only.ipynb @@ -8,11 +8,64 @@ "# MDIBL Transcriptome Assembly Learning Module\n", "# Notebook 3: Using TransPi to Performing an \"Annotation Only\" Run\n", "\n", + "## Overview\n", + "\n", "In the previous notebook, we ran the entire default TransPi workflow, generating a small transcriptome from a test data set. While that is a valid exercise in carrying through the workflow, the downstream steps (annotation and assessment) will be unrealistic in their output, since the test set will only generate a few hundred transcripts. In contrast, a more complete estimate of a vertebrate transcriptome will contain tens to hundreds of thousands of transcripts.\n", "\n", "In this notebook, we will start from an assembled transcriptome. 
We will work with a more realistic example that was generated and submitted to the NCBI Transcriptome Shotgun Assembly archive.\n" ] }, + { + "cell_type": "markdown", + "id": "8f4cd172", + "metadata": {}, + "source": [ + "## Learning Objectives:\n", + "\n", + "1. **Understanding the TransPi workflow and its components:** The notebook builds upon previous knowledge of TransPi, focusing on the annotation stage, separating it from the assembly process. It reinforces the understanding of the overall workflow and its different stages.\n", + "\n", + "2. **Performing an \"annotation-only\" run with TransPi:** The primary objective is to learn how to execute TransPi, specifically utilizing the `--onlyAnn` option to process a pre-assembled transcriptome. This teaches efficient use of the tool and avoids unnecessary recomputation.\n", + "\n", + "3. **Working with realistic transcriptome data:** The notebook shifts from a small test dataset to a larger, more realistic transcriptome from the NCBI Transcriptome Shotgun Assembly archive. This exposes learners to the scale and characteristics of real-world transcriptome data.\n", + "\n", + "4. **Using command-line tools for data manipulation:** The notebook uses `grep`, `perl` one-liners, and `docker` commands to count sequences, modify configuration files, and manage containerized applications. This improves proficiency in using these essential bioinformatics tools.\n", + "\n", + "5. **Interpreting TransPi output:** Learners analyze the `RUN_INFO.txt` file and other output files to understand the analysis parameters and results. This develops skills in interpreting computational biology results.\n", + "\n", + "6. **Understanding and using containerization (Docker):** The notebook introduces the concept of Docker containers and demonstrates how to utilize a BUSCO container to run the BUSCO analysis, highlighting the benefits of containerization for reproducibility and dependency management. This teaches practical application of containers in bioinformatics.\n", + "\n", + "7. **Running BUSCO analysis:** Learners execute BUSCO, a crucial tool for assessing the completeness of transcriptome assemblies. This extends their skillset to include running and interpreting BUSCO results.\n", + "\n", + "8. **Interpreting BUSCO and other annotation results:** The notebook includes checkpoints that challenge learners to interpret the BUSCO results, GO stats, and TransDecoder stats, fostering critical thinking and data interpretation skills.\n", + "\n", + "9. **Critical evaluation of data sources:** The notebook encourages learners to consider the source and context of the transcriptome data used, prompting reflection on data quality and limitations. This emphasizes responsible use of biological data.\n", + "\n", + "10. **Independent BUSCO analysis:** The final checkpoint task requires learners to independently run a BUSCO analysis on a new transcriptome, selecting a data source and lineage, and interpreting the results. This assesses the understanding and practical application of the concepts covered in the notebook." 
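+ ,
+ "\n",
+ "\n",
+ "As a small preview of objective 4, the Python snippet below counts the records in a FASTA file by counting header lines, which is the same quantity the notebook's `grep` one-liner reports; the file name is only a placeholder:\n",
+ "\n",
+ "```python\n",
+ "# Count sequences in a FASTA file: every record starts with a \">\" header line.\n",
+ "# The path below is a placeholder; point it at any assembled transcriptome.\n",
+ "fasta_path = \"transcriptome.fasta\"\n",
+ "\n",
+ "with open(fasta_path) as fasta:\n",
+ "    n_sequences = sum(1 for line in fasta if line.startswith(\">\"))\n",
+ "\n",
+ "print(f\"{fasta_path}: {n_sequences} sequences\")\n",
+ "```"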
+ ] + }, + { + "cell_type": "markdown", + "id": "04994736", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "* **Nextflow:** The core workflow engine used to manage the TransPi pipeline.\n", + "* **Perl:** Used for a one-liner to modify the Nextflow configuration file.\n", + "* **Docker:** Used to run BUSCO in a containerized environment.\n", + "* **BUSCO:** The Benchmarking Universal Single-Copy Orthologs program for assessing genome completeness.\n", + "* **TransPi:** The specific transcriptome assembly pipeline. The notebook assumes this is pre-installed or available through Nextflow.\n", + "* **Command-line tools:** Basic Unix command-line utilities like `grep`, `ls`, `cat`, `pwd`, etc., are used throughout the notebook." + ] + }, + { + "cell_type": "markdown", + "id": "16adea33", + "metadata": {}, + "source": [ + "## Get Started" + ] + }, { "cell_type": "code", "execution_count": null, @@ -392,7 +445,7 @@ "metadata": {}, "outputs": [], "source": [ - "display_flashcards('Transcriptome-Assembly-Refinement-and-Applications/quiz-material/03-cp1-2.json')" + "display_flashcards('../quiz-material/03-cp1-2.json')" ] }, { @@ -470,10 +523,22 @@ }, { "cell_type": "markdown", - "id": "ed64e9fa-7ae6-468c-8605-7600dbf9bbc0", + "id": "68cfd48a", "metadata": {}, "source": [ - "## When you are ready, proceed to the next notebook: [`Submodule_04_gls_assembly.ipynb`](Submodule_04_gls_assembly.ipynb). " + "## Conclusion\n", + "\n", + "This Jupyter Notebook demonstrated the \"annotation only\" run of TransPi, utilizing a pre-assembled transcriptome of *Oncorhynchus mykiss* (Rainbow Trout) containing 31,176 transcripts. By modifying the `nextflow.config` file and leveraging the `--onlyAnn` option, we efficiently performed annotation steps, including Pfam and BLAST analyses, without repeating the assembly process. Furthermore, the notebook introduced the concept of Docker containers, showcasing their use in executing BUSCO analysis for assessing transcriptome completeness. The practical application of BUSCO, along with interpretation of the resulting output files (including GO stats and TransDecoder statistics), emphasized the importance of data context and critical evaluation of transcriptome assembly quality. Finally, the notebook concluded with a hands-on exercise, prompting users to perform their own BUSCO analysis on a different transcriptome, fostering a deeper understanding of the workflow and its applications." + ] + }, + { + "cell_type": "markdown", + "id": "5bc80021", + "metadata": {}, + "source": [ + "## Clean Up\n", + "\n", + "Remember to proceed to the next notebook [`Submodule_04_gls_assembly.ipynb`](Submodule_04_gls_assembly.ipynb) or shut down your instance if you are finished." ] } ], diff --git a/Submodule_04_google_batch_assembly.ipynb b/AWS/Submodule_04_google_batch_assembly.ipynb similarity index 74% rename from Submodule_04_google_batch_assembly.ipynb rename to AWS/Submodule_04_google_batch_assembly.ipynb index 45c1743..26b039f 100644 --- a/Submodule_04_google_batch_assembly.ipynb +++ b/AWS/Submodule_04_google_batch_assembly.ipynb @@ -14,6 +14,8 @@ "id": "0512337d-7ade-44c7-832a-ae6970a7d980", "metadata": {}, "source": [ + "## Overview\n", + "\n", "So far, all of the computational work executed has been run locally, using the compute resources available within this Jupyter notebook. 
Although this is functional, it is not the ideal setup for fast, cost-efficient data analysis.\n", "\n", "Google Batch is known as a scheduler, which provisions specific compute resources to be allocated for individual processes within our workflow. This provides two primary benefits:\n", @@ -23,13 +25,59 @@ "Fortunately, Batch and Nextflow are compatible with each other allowing for any Nextflow workflow, including the TransPi workflow that we have been using, to be executable on Batch.\n", "\n", "\n", - "> \n", + "> \n", ">\n", "> **Figure 1:** Diagram illustrating the interactions between the components used for the Google Batch run. \n", "\n", "For this to work, there are a few quick adjustment steps to make sure everything is set up for a Google Batch run!" ] }, + { + "cell_type": "markdown", + "id": "8b495639", + "metadata": {}, + "source": [ + "## Learning Objectives:\n", + "\n", + "1. **Utilize Google Batch for efficient and cost-effective data analysis:** The notebook contrasts local computation with Google Batch, highlighting the benefits of the latter in terms of cost savings (auto-shutdown of unused resources) and speed (parallelization of tasks).\n", + "\n", + "2. **Integrate Nextflow workflows with Google Batch:** The notebook demonstrates how to configure a Nextflow pipeline (TransPi) to run on Google Batch, emphasizing the compatibility between these tools.\n", + "\n", + "3. **Manage files using Google Cloud Storage (GCS):** The lesson requires users to create or utilize a GCS bucket to store the necessary files for the TransPi workflow, addressing the challenge of accessing local files from external compute resources.\n", + "\n", + "4. **Configure a Nextflow pipeline for Google Batch execution:** This involves modifying the `nextflow.config` file to point to the GCS bucket, adjust compute allocations (CPU and memory), and specify the correct Google Batch profile. It shows how to use Perl one-liners for efficient configuration changes.\n", + "\n", + "5. **Interpret and compare the timelines of local and Google Batch runs:** By comparing the `transpi_timeline.html` files from both local and Google Batch executions, users learn to analyze the performance differences and understand the impact of resource allocation.\n", + "\n", + "6. **Execute and manage a Nextflow pipeline on Google Batch:** The notebook provides step-by-step instructions for running TransPi on Google Batch using specific command-line arguments and managing the output.\n", + "\n", + "7. **Understand and utilize Google Cloud commands:** The notebook uses `gcloud` and `gsutil` commands extensively, teaching users basic Google Cloud command-line interactions." + ] + }, + { + "cell_type": "markdown", + "id": "1dbd972f", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "* **A Google Cloud Storage (GCS) Bucket:** A bucket is needed to store the TransPi workflow's input files and output results. The notebook provides options to create a new bucket or use an existing one.\n", + "* **Sufficient Compute Resources:** The user needs to have sufficient quota available in their GCP project to handle the compute resources required by the TransPi workflow (CPUs, memory, disk space). The notebook uses a `nextflow.config` file to configure the Google Batch execution.\n", + "* **`gcloud` CLI:** The Google Cloud SDK (`gcloud`) command-line tool must be installed and configured to authenticate with the GCP project. 
The notebook uses `gcloud` commands to interact with GCP services.\n", + "* **`gsutil` CLI:** The `gsutil` command-line tool (part of the Google Cloud SDK) is used to interact with GCS.\n", + "* **Nextflow:** The Nextflow workflow engine must be installed locally on the Jupyter Notebook environment.\n", + "* **TransPi Workflow:** The TransPi Nextflow pipeline code must be available in the Jupyter Notebook environment's file system. The notebook assumes it's in a `TransPi` directory.\n", + "* **Perl:** The notebook uses Perl one-liners for file manipulation. Perl must be installed in the Jupyter Notebook environment." + ] + }, + { + "cell_type": "markdown", + "id": "9449ee77", + "metadata": {}, + "source": [ + "## Get Started" + ] + }, { "cell_type": "code", "execution_count": null, @@ -458,9 +506,9 @@ "outputs": [], "source": [ "from jupytercards import display_flashcards\n", - "display_flashcards('Transcriptome-Assembly-Refinement-and-Applications/quiz-material/04-cp1-2.json')\n", - "display_flashcards('Transcriptome-Assembly-Refinement-and-Applications/quiz-material/04-cp1-3.json')\n", - "display_flashcards('Transcriptome-Assembly-Refinement-and-Applications/quiz-material/04-cp1-4.json')" + "display_flashcards('../quiz-material/04-cp1-2.json')\n", + "display_flashcards('../quiz-material/04-cp1-3.json')\n", + "display_flashcards('../quiz-material/04-cp1-4.json')" ] }, { @@ -526,16 +574,28 @@ "id": "96722e89-2d6a-4381-ba42-673e9be79a2e", "metadata": {}, "source": [ - "### At this point, you have the toolkit necessary to run TransPi in various configurations and the baseline knowledge to interpret the output that TransPi produces. You also have the foundational knowledge of Google Cloud resources with the ability to utilize buckets and cloud computing to execute your computational task. Specifically, Batch which not only works with TransPi but also with any other Nextflow pipeline. We urge you to continue exploring TransPi, using different data sets, and also to explore other Nextflow pipelines as well." + "##### At this point, you have the toolkit necessary to run TransPi in various configurations and the baseline knowledge to interpret the output that TransPi produces. You also have the foundational knowledge of Google Cloud resources with the ability to utilize buckets and cloud computing to execute your computational task. Specifically, Batch which not only works with TransPi but also with any other Nextflow pipeline. We urge you to continue exploring TransPi, using different data sets, and also to explore other Nextflow pipelines as well." ] }, { - "cell_type": "code", - "execution_count": null, - "id": "0ac8a4e6-ad87-438a-9b74-86dd82fb6823", + "cell_type": "markdown", + "id": "5213f6a1", "metadata": {}, - "outputs": [], - "source": [] + "source": [ + "## Conclusion\n", + "\n", + "This module demonstrated the execution of the TransPi transcriptome assembly workflow on Google Batch, a significant advancement from local Jupyter Notebook execution. By leveraging Google Batch's scheduling capabilities, we achieved both cost efficiency through automated resource allocation and increased speed through parallelization of computational tasks. The integration of Nextflow with Google Batch streamlined the process, requiring only minor adjustments to the `nextflow.config` file to redirect file paths to Google Cloud Storage (GCS) buckets and optimize compute allocations. 
Comparison of local and Google Batch run timelines highlighted the benefits of cloud computing for large-scale bioinformatics analyses. This learning module equipped users with the skills to effectively utilize Google Batch for efficient and scalable execution of Nextflow pipelines, paving the way for more complex and data-intensive bioinformatics projects." + ] + }, + { + "cell_type": "markdown", + "id": "2661513f", + "metadata": {}, + "source": [ + "## Clean Up\n", + "\n", + "You would proceed to the next notebook [`Submodule_05_Bonus_Notebook.ipynb`](./Submodule_05_Bonus_Notebook.ipynb) or shut down your instance if you are finished." + ] } ], "metadata": {}, diff --git a/Submodule_05_Bonus_Notebook.ipynb b/AWS/Submodule_05_Bonus_Notebook.ipynb similarity index 70% rename from Submodule_05_Bonus_Notebook.ipynb rename to AWS/Submodule_05_Bonus_Notebook.ipynb index 596a1e9..d884c61 100644 --- a/Submodule_05_Bonus_Notebook.ipynb +++ b/AWS/Submodule_05_Bonus_Notebook.ipynb @@ -14,6 +14,7 @@ "id": "c38bba56-40d9-4ca4-b58b-b9733b424b1f", "metadata": {}, "source": [ + "## Overview\n", "In this notebook, we are going to explore how to run this module with a new dataset. These submodules provide a great framework for running a rigorous and scalable transcriptome assembly, but there are some considerations that must be made in order to run this with your own data. We will walk through that process here so that hopefully, you are able to take these notebooks to your research group and use them for your own analysis." ] }, @@ -25,6 +26,45 @@ "The data we are using here comes from SRA. In this example, we are using data from an experiment that compared RNA sequences in honeybees with and without viral infections. The BioProject ID is [PRJNA274674](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA274674). This experiment includes 6 RNA-seq samples and 2 methylation-seq samples. We are only considering the RNA-seq data here. Additionally, we have subsampled them to about 2 millions reads collectively accross all of the samples. In a real analysis this would not be a good idea, but to keep costs and runtimes low we will use the down-sampled files in this demonstration. If you want to explore the full dataset, we recommend pulling the fastq files using the [STRIDES tutorial on SRA downloads](https://github.com/STRIDES/NIHCloudLabGCP/blob/main/notebooks/SRADownload/SRA-Download.ipynb). As with the original example in this module, we have concatenated all 6 files into one set of combined fastq files called joined_R{1,2}.fastq.gz We have stored the subsampled fastq files in this module's cloud storage bucket." ] }, + { + "cell_type": "markdown", + "id": "ae57ad92", + "metadata": {}, + "source": [ + "## Learning Objectives:\n", + "\n", + "1. **Adapting a Nextflow workflow:** The notebook demonstrates how to modify a Nextflow pipeline's configuration to point to a new dataset, highlighting the workflow's reusability and flexibility. This involves understanding how to change input parameters within a configuration file.\n", + "\n", + "2. **Data preparation and management:** Users learn how to download and manage data from the SRA (Sequence Read Archive) using `gsutil` (although a pre-downloaded, subsampled dataset is provided for convenience). This includes understanding file organization and paths.\n", + "\n", + "3. 
**Software installation and environment setup:** The notebook guides users through installing necessary software (Java, Mamba, sra-tools, perl modules, Nextflow) and setting up the computational environment. This emphasizes reproducibility and dependency management.\n", + "\n", + "4. **Running a transcriptome assembly:** The notebook shows how to execute the TransPi Nextflow pipeline with the new dataset, demonstrating the complete process from data input to (presumably) assembly output." + ] + }, + { + "cell_type": "markdown", + "id": "e6a8c2f6", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "* **Java:** The notebook installs the default JDK.\n", + "* **Miniforge** Used for package management.\n", + "* **sra-tools, perl-dbd-sqlite, perl-dbi:** Bioinformatics tools for working with SRA data.\n", + "* **Nextflow:** A workflow management system.\n", + "* **Docker** Either Docker pre-installed on the VM, or permissions to install and run Docker containers.\n", + "* **`gsutil`:** The Google Cloud Storage command-line tool." + ] + }, + { + "cell_type": "markdown", + "id": "27475529", + "metadata": {}, + "source": [ + "## Get Started" + ] + }, { "cell_type": "markdown", "id": "dcf2a2d0-bc91-4a2a-9db0-62f1eee91f92", @@ -73,9 +113,9 @@ "metadata": {}, "outputs": [], "source": [ - "# install mamba and dependencies\n", - "! curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh\n", - "! bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge" + "# install Miniforge\n", + "! curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh\n", + "! bash Miniforge3-$(uname)-$(uname -m).sh -b -p $HOME/miniforge" ] }, { @@ -85,9 +125,9 @@ "metadata": {}, "outputs": [], "source": [ - "# add mamba to your path\n", + "# add Miniforge to your path\n", "import os\n", - "os.environ[\"PATH\"] += os.pathsep + os.environ[\"HOME\"]+\"/mambaforge/bin\"" + "os.environ[\"PATH\"] += os.pathsep + os.environ[\"HOME\"]+\"/miniforge/bin\"" ] }, { @@ -95,7 +135,7 @@ "id": "39bb00de-3481-4cb0-a2fe-098cfdae51a6", "metadata": {}, "source": [ - "Use mamba to install: sra-tools perl-dbd-sqlite perl-dbi from channel bioconda\n", + "Use Miniforge to install: sra-tools perl-dbd-sqlite perl-dbi from channel bioconda\n", "\n", "
\n", " Click for help\n", @@ -285,6 +325,26 @@ "source": [ "With the subsampled reads, the assembly should complete in about 2 hours using a n1-highmem-16 machine." ] + }, + { + "cell_type": "markdown", + "id": "38abe476", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "This notebook demonstrated the adaptability of the MDIBL Transcriptome Assembly Learning Module's TransPi pipeline by applying it to a new RNA-Seq dataset from a honeybee viral infection study (PRJNA274674). While utilizing a subsampled dataset for demonstration purposes, the process highlighted the ease of integrating new data into the existing Nextflow workflow. By simply modifying the `nextflow.config` file to specify the new reads' location, the pipeline executed seamlessly, showcasing its robustness and reproducibility. This adaptability makes the module a valuable resource for researchers seeking to perform scalable and rigorous transcriptome assemblies on their own datasets, facilitating efficient and reproducible analyses within their research groups. The successful execution underscores the power of workflow management systems like Nextflow for streamlining bioinformatics analyses." + ] + }, + { + "cell_type": "markdown", + "id": "7f7d2cab", + "metadata": {}, + "source": [ + "## Clean Up\n", + "\n", + "Shut down your instance if you are finished." + ] } ], "metadata": {}, diff --git a/images/AnnotationProcess.png b/AWS/images/AnnotationProcess.png similarity index 100% rename from images/AnnotationProcess.png rename to AWS/images/AnnotationProcess.png diff --git a/AWS/images/MDI-course-card-2.png b/AWS/images/MDI-course-card-2.png new file mode 100644 index 0000000..06fea92 Binary files /dev/null and b/AWS/images/MDI-course-card-2.png differ diff --git a/images/RNA-Seq_Notebook_Homepage.png b/AWS/images/RNA-Seq_Notebook_Homepage.png similarity index 100% rename from images/RNA-Seq_Notebook_Homepage.png rename to AWS/images/RNA-Seq_Notebook_Homepage.png diff --git a/images/Setup10.png b/AWS/images/Setup10.png similarity index 100% rename from images/Setup10.png rename to AWS/images/Setup10.png diff --git a/images/Setup11.png b/AWS/images/Setup11.png similarity index 100% rename from images/Setup11.png rename to AWS/images/Setup11.png diff --git a/images/Setup12.png b/AWS/images/Setup12.png similarity index 100% rename from images/Setup12.png rename to AWS/images/Setup12.png diff --git a/images/Setup13.png b/AWS/images/Setup13.png similarity index 100% rename from images/Setup13.png rename to AWS/images/Setup13.png diff --git a/images/Setup14.png b/AWS/images/Setup14.png similarity index 100% rename from images/Setup14.png rename to AWS/images/Setup14.png diff --git a/images/Setup15.png b/AWS/images/Setup15.png similarity index 100% rename from images/Setup15.png rename to AWS/images/Setup15.png diff --git a/images/Setup16.png b/AWS/images/Setup16.png similarity index 100% rename from images/Setup16.png rename to AWS/images/Setup16.png diff --git a/images/Setup17.png b/AWS/images/Setup17.png similarity index 100% rename from images/Setup17.png rename to AWS/images/Setup17.png diff --git a/images/Setup18.png b/AWS/images/Setup18.png similarity index 100% rename from images/Setup18.png rename to AWS/images/Setup18.png diff --git a/images/Setup19.png b/AWS/images/Setup19.png similarity index 100% rename from images/Setup19.png rename to AWS/images/Setup19.png diff --git a/images/Setup2.png b/AWS/images/Setup2.png similarity index 100% rename from images/Setup2.png rename to AWS/images/Setup2.png diff 
--git a/images/Setup20.png b/AWS/images/Setup20.png similarity index 100% rename from images/Setup20.png rename to AWS/images/Setup20.png diff --git a/images/Setup21.png b/AWS/images/Setup21.png similarity index 100% rename from images/Setup21.png rename to AWS/images/Setup21.png diff --git a/AWS/images/Setup22.png b/AWS/images/Setup22.png new file mode 100644 index 0000000..e39d867 Binary files /dev/null and b/AWS/images/Setup22.png differ diff --git a/AWS/images/Setup23.png b/AWS/images/Setup23.png new file mode 100644 index 0000000..19fcd08 Binary files /dev/null and b/AWS/images/Setup23.png differ diff --git a/AWS/images/Setup24.png b/AWS/images/Setup24.png new file mode 100644 index 0000000..dc0b879 Binary files /dev/null and b/AWS/images/Setup24.png differ diff --git a/AWS/images/Setup25.png b/AWS/images/Setup25.png new file mode 100644 index 0000000..2e32a69 Binary files /dev/null and b/AWS/images/Setup25.png differ diff --git a/images/Setup3.png b/AWS/images/Setup3.png similarity index 100% rename from images/Setup3.png rename to AWS/images/Setup3.png diff --git a/images/Setup4.png b/AWS/images/Setup4.png similarity index 100% rename from images/Setup4.png rename to AWS/images/Setup4.png diff --git a/images/Setup5.png b/AWS/images/Setup5.png similarity index 100% rename from images/Setup5.png rename to AWS/images/Setup5.png diff --git a/images/Setup6.png b/AWS/images/Setup6.png similarity index 100% rename from images/Setup6.png rename to AWS/images/Setup6.png diff --git a/images/Setup7.png b/AWS/images/Setup7.png similarity index 100% rename from images/Setup7.png rename to AWS/images/Setup7.png diff --git a/images/Setup8.png b/AWS/images/Setup8.png similarity index 100% rename from images/Setup8.png rename to AWS/images/Setup8.png diff --git a/images/Setup9.png b/AWS/images/Setup9.png similarity index 100% rename from images/Setup9.png rename to AWS/images/Setup9.png diff --git a/images/TransPiWorkflow.png b/AWS/images/TransPiWorkflow.png similarity index 100% rename from images/TransPiWorkflow.png rename to AWS/images/TransPiWorkflow.png diff --git a/images/VMdownsize.jpg b/AWS/images/VMdownsize.jpg similarity index 100% rename from images/VMdownsize.jpg rename to AWS/images/VMdownsize.jpg diff --git a/images/architecture_diagram.png b/AWS/images/architecture_diagram.png similarity index 100% rename from images/architecture_diagram.png rename to AWS/images/architecture_diagram.png diff --git a/AWS/images/basic_assembly.png b/AWS/images/basic_assembly.png new file mode 100644 index 0000000..99607ee Binary files /dev/null and b/AWS/images/basic_assembly.png differ diff --git a/images/cellMenu.png b/AWS/images/cellMenu.png similarity index 100% rename from images/cellMenu.png rename to AWS/images/cellMenu.png diff --git a/images/deBruijnGraph.png b/AWS/images/deBruijnGraph.png similarity index 100% rename from images/deBruijnGraph.png rename to AWS/images/deBruijnGraph.png diff --git a/images/fileDemo.png b/AWS/images/fileDemo.png similarity index 100% rename from images/fileDemo.png rename to AWS/images/fileDemo.png diff --git a/images/gcbDiagram.jpg b/AWS/images/gcbDiagram.jpg similarity index 100% rename from images/gcbDiagram.jpg rename to AWS/images/gcbDiagram.jpg diff --git a/images/glsDiagram.png b/AWS/images/glsDiagram.png similarity index 100% rename from images/glsDiagram.png rename to AWS/images/glsDiagram.png diff --git a/images/jupyterRuntime.png b/AWS/images/jupyterRuntime.png similarity index 100% rename from images/jupyterRuntime.png rename to 
AWS/images/jupyterRuntime.png diff --git a/images/jupyterRuntimeCircle.png b/AWS/images/jupyterRuntimeCircle.png similarity index 100% rename from images/jupyterRuntimeCircle.png rename to AWS/images/jupyterRuntimeCircle.png diff --git a/images/mdibl-compbio-core-logo-eurostyle.jpg b/AWS/images/mdibl-compbio-core-logo-eurostyle.jpg similarity index 100% rename from images/mdibl-compbio-core-logo-eurostyle.jpg rename to AWS/images/mdibl-compbio-core-logo-eurostyle.jpg diff --git a/images/mdibl-compbio-core-logo-square.jpg b/AWS/images/mdibl-compbio-core-logo-square.jpg similarity index 100% rename from images/mdibl-compbio-core-logo-square.jpg rename to AWS/images/mdibl-compbio-core-logo-square.jpg diff --git a/images/module_concept.png b/AWS/images/module_concept.png similarity index 100% rename from images/module_concept.png rename to AWS/images/module_concept.png diff --git a/images/perl-logo.png b/AWS/images/perl-logo.png similarity index 100% rename from images/perl-logo.png rename to AWS/images/perl-logo.png diff --git a/images/rainbowTrout.jpeg b/AWS/images/rainbowTrout.jpeg similarity index 100% rename from images/rainbowTrout.jpeg rename to AWS/images/rainbowTrout.jpeg diff --git a/AWS/images/transpi_workflow.png b/AWS/images/transpi_workflow.png new file mode 100644 index 0000000..a9da75b Binary files /dev/null and b/AWS/images/transpi_workflow.png differ diff --git a/images/workflow_concept.png b/AWS/images/workflow_concept.png similarity index 100% rename from images/workflow_concept.png rename to AWS/images/workflow_concept.png diff --git a/GoogleCloud/README.md b/GoogleCloud/README.md new file mode 100644 index 0000000..1b2fd02 --- /dev/null +++ b/GoogleCloud/README.md @@ -0,0 +1,143 @@ +![course card](images/MDI-course-card-2.png) + +# MDI Biological Laboratory RNA-seq Transcriptome Assembly Module +--------------------------------- + + +## Three primary and interlinked learning goals: +1. From a *biological perspective*, demonstration of the **process of transcriptome assembly** from raw RNA-seq data. +2. From a *computational perspective*, demonstration of **computing using workflow management and container systems**. +3. Also from an *infrastructure perspective*, demonstration of **carrying out these analyses efficiently in a cloud environment.** + + + +# Quick Overview +This module teaches you how to perform a short-read RNA-seq Transcriptome Assembly with Google Cloud Platform using a Nextflow pipeline, and eventually using the Google Batch API. In addition to the overview given in this README, you will find three Jupyter notebooks that teach you different components of RNA-seq in the cloud. + +This module will cost you about $7.00 to run end to end, assuming you shutdown and delete all resources upon completion. + + +## Contents + ++ [Getting Started](#getting-started) ++ [Biological Problem](#biological-problem) ++ [Set Up](#set-up) ++ [Software Requirements](#software-requirements) ++ [Workflow Diagrams](#workflow-diagrams) ++ [Data](#data) ++ [Troubleshooting](#troubleshooting) ++ [Funding](#funding) ++ [License for Data](#license-for-data) + +## **Getting Started** +This learning module includes tutorials and execution scripts in the form of Jupyter notebooks. The purpose of these tutorials is to help users familiarize themselves with cloud computing in the specific context of running bioinformatics workflows to prep for and to carry out a transcriptome assembly, refinement, and annotation. 
These tutorials do this by utilizing a recently published Nextflow workflow (TransPi [manuscript](https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.13593), [repository](https://github.com/palmuc/TransPi), and [user guide](https://palmuc.github.io/TransPi/)), which manages and passes data between several state-of-the-art programs, carrying out the processes from initial quality control and normalization, through assembly with several tools, refinement and assessment, and finally annotation of the final putative transcriptome. + +Since the work is managed by this pipeline, the notebooks will focus on setting up and running the pipeline, followed by an examination of some of the wide range of outputs produced. We will also demonstrate how to retrieve the complete results directory so that users can examine it more extensively on their own computing systems, going step-by-step through the specific workflows. These workflows cover a basic bioinformatics analysis from start to finish, beginning with raw sequence data and carrying out the steps needed to generate a final assembled and annotated transcriptome. + +We also put an emphasis on understanding how workflows execute, using the specific example of the Nextflow (https://www.nextflow.io) workflow engine, and on using workflow engines as supported by cloud infrastructure, using the specific example of the Google Batch API (https://cloud.google.com/batch). + +![technical infrastructure](/images/architecture_diagram.png) + +**Figure 1:** The technical infrastructure diagram for this project. + +## **Biological Problem** +The combination of increased availability and reduced expense in obtaining high-throughput sequencing has made transcriptome profiling analysis (primarily with RNA-seq) a standard tool for the molecular characterization of widely disparate biological systems. Researchers working in common model organisms, such as mouse or zebrafish, have relatively easy access to the necessary resources (e.g., well-assembled genomes and large collections of predicted/verified transcripts) for the analysis and interpretation of their data. In contrast, researchers working on less commonly studied organisms and systems often must develop these resources for themselves. + +Transcriptome assembly is the broad term used to describe the process of estimating many (or ideally all) of an organism’s transcripts based on the large-scale but fragmentary data provided by high-throughput sequencing. A "typical" RNA-seq dataset will consist of tens of millions of reads or read-pairs, with each contiguous read representing up to 150 nucleotides in the sequence. Complete transcripts, in contrast, typically range from hundreds to tens of thousands of nucleotides in length. In short, and leaving out the technical details, the process of assembling a transcriptome from raw reads (Figure 2) is to first make a "best guess" segregation of the reads into subsets that are most likely derived from one gene (or a small set of related/similar genes), and then for each subset, build a most-likely set of transcripts and genes. + +![basic transcriptome assembly](/images/basic_assembly.png) + +**Figure 2:** The process from raw reads to first transcriptome assembly. + +Once a new transcriptome is generated, assessed, and refined, it must be annotated with putative functional assignments to be of use in subsequent functional studies. Functional annotation is accomplished through a combination of homology-based and ab initio methods.
The most well-established homology-based processes are the combination of protein-coding sequence prediction followed by protein sequence alignment to databases of known proteins, especially those from human or common model organisms. Ab initio methods use computational models of various features (e.g., known protein domains, signal peptides, or peptide modification sites) to characterize either the transcript or its predicted protein product. This training module will cover multiple approaches to the annotation of assembled transcriptomes. + +## **Set Up** + +#### Part 1: Setting up Environment + +**Enable APIs and create a Nextflow Service Account** + +If you are using Nextflow outside of NIH CloudLab you must enable the required APIs, set up a service account, and add your service account to your notebook permissions before creating the notebook. Follow sections 1 and 2 of the accompanying [how to document](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/HowToCreateNextflowServiceAccount.md) for instructions. If you are executing this tutorial with an NIH CloudLab account your default Compute Engine service account will have all required IAM roles to run the Nextflow portion. + +**Create the Vertex AI Instance** + +Follow the steps highlighted [here](https://github.com/STRIDES/NIHCloudLabGCP/blob/main/docs/vertexai.md) to create a new user-managed notebook in Vertex AI. Follow steps 1-8 and be especially careful to enable idle shutdown as highlighted in step 7. For this module you should select **Debian 11** and **Python3** in the Environment tab in step 5. In step 6 in the Machine type tab, select **n1-highmem-16** from the dropdown box. This will provide you with 16 vCPUs and 104 GB of RAM, which may feel like a lot but is necessary for TransPi to run. + + +#### Part 2: Adding the Modules to the Notebook + +1. From the Launcher in your new VM, click the Terminal option. +![setup 22](images/Setup22.png) +2. Next, paste the following git command to get a copy of everything within this repository, including all of the submodules. + +> ```git clone https://github.com/NIGMS/Transcriptome-Assembly-Refinement-and-Applications.git``` +3. You are now all set! + +**WARNING:** When you are not using the notebook, stop it. This will prevent you from incurring costs while you are not using the notebook. You can do this in the same window as where you opened the notebook. Make sure that you have the notebook selected ![setup 23](images/Setup23.png). Then click the ![setup 24](images/Setup24.png). When you want to start up the notebook again, do the same process except click the ![setup 25](images/Setup25.png) instead. + +## **Software Requirements** + +All of the software requirements are taken care of and installed within [Submodule_01_prog_setup.ipynb](./Submodule_01_prog_setup.ipynb). The key pieces of software needed are: +1. [Nextflow workflow system](https://www.nextflow.io/): Nextflow is the workflow management software that TransPi is built for. +2. [Google Batch API](https://cloud.google.com/batch/docs): Google Batch was enabled as part of the setup process and will be readily available when it is needed. +3. [Nextflow TransPi Package](https://github.com/palmuc/TransPi): The rest of the software is all downloaded as part of the TransPi package. TransPi is a Nextflow pipeline that carries out many of the standard steps required for transcriptome assembly and annotation. The original TransPi is available from this GitHub [link](https://github.com/palmuc/TransPi).
We have made various alterations to the TransPi package and so the TransPi files you will be using throughout this module will be our own altered version. + +## **Workflow Diagrams** + +![transpi workflow](images/transpi_workflow.png) + +**Figure 3:** Nextflow workflow diagram. (Rivera 2021). +Image Source: https://github.com/PalMuc/TransPi/blob/master/README.md + +Explanation of which notebooks execute which processes: + ++ Notebooks labeled 0 ([Submodule_00_Background.ipynb](./Submodule_00_Background.ipynb) and [00_Glossary.md](./00_Glossary.md)) respectively cover background materials and provide a centralized glossary for both the biological problem of transcriptome assembly, as well as an introduction to workflows and container-based computing. ++ Notebook 1 ([Submodule_01_prog_setup.ipynb](./Submodule_01_prog_setup.ipynb)) is used for setting up the environment. It should only need to be run once per machine. (Note that our version of TransPi does not run the `precheck script`. To avoid the headache and wasted time, we have developed a workaround to skip that step.) ++ Notebook 2 ([Submodule_02_basic_assembly.ipynb](./Submodule_02_basic_assembly.ipynb)) carries out a complete run of the Nextflow TransPi assembly workflow on a modest sequence set, producing a small transcriptome. ++ Notebook 3 ([Submodule_03_annotation_only.ipynb](./Submodule_03_annotation_only.ipynb)) carries out an annotation-only run using a prebuilt, but more complete transcriptome. ++ Notebook 4 ([Submodule_04_google_batch_assembly.ipynb](./Submodule_04_google_batch_assembly.ipynb)) carries out the workflow using the Google Batch API. ++ Notebook 5 ([Submodule_05_Bonus_Notebook.ipynb](./Submodule_05_Bonus_Notebook.ipynb)) is a more hands-off notebook to test basic skills taught in this module. + +## **Data** +The test dataset used in the majority of this module is a downsampled version of a dataset that can be obtained in its complete form from the SRA database (Bioproject [**PRJNA318296**](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA318296), GEO Accession [**GSE80221**](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE80221)). The data was originally generated by **Hartig et al., 2016**. We downsampled the data files in order to streamline the performance of the tutorials and stored them in a Google Cloud Storage bucket. The sub-sampled data, in individual sample files as well as a concatenated version of these files are available in our Google Cloud Storage bucket at `gs://nigms-sandbox/nosi-inbremaine-storage/resources/seq2`. + +Additional datasets for demonstration of the annotation features of TransPi were obtained from the NCBI Transcriptome Shotgun Assembly archive. These files can be found in our Google Cloud Storage bucket at `gs://nigms-sandbox/nosi-inbremaine-storage/resources/trans`. +- Microcaecilia dermatophaga + - Bioproject: [**PRJNA387587**](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA387587) + - Originally generated by **Torres-Sánchez M et al., 2019**. +- Oncorhynchus mykiss + - Bioproject: [**PRJNA389609**](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA389609) + - Originally generated by **Wang J et al., 2016**, **Al-Tobasei R et al., 2016**, and **Salem M et al., 2015**. +- Pseudacris regilla + - Bioproject: [**PRJNA163143**](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA163143) + - Originally generated by **Laura Robertson, USGS**. + +The final submodule ([Submodule_05_Bonus_Notebook.ipynb](./Submodule_05_Bonus_Notebook.ipynb)) uses an additional dataset pulled from the SRA database. 
We are using the RNA-seq reads only and have subsampled and merged them to a collective 2 million reads. This is not a good idea for a real analysis, but was done to reduce the costs and runtime. These files are available in our Google Cloud Storage bucket at `gs://nigms-sandbox/nosi-inbremaine-storage/resources/seq2`. +- Apis mellifera + - Bioproject: [**PRJNA274674**](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA274674) + - Originally generated by **Galbraith DA et al., 2015**. + +## **Troubleshooting** +- If a quiz is not rendering: + - Make sure the `pip install` cell was executed in Submodule 00. + - Try re-executing `from jupytercards import display_flashcards` or `from jupyterquiz import display_quiz` depending on the quiz type. +- If a file/directory is not able to be found, make sure that you are in the right directory. If the notebook is idle for a long time, gets reloaded, or restarted, you will need to re-run Step 1 of the notebook. (`%cd /home/jupyter`) +- Sometimes, Nextflow will print `WARN:` followed by the warning. These are okay and should not produce any errors. +- Sometimes Nextflow will print `Waiting for file transfers to complete`. This may take a few minutes, but is nothing to worry about. +- If you are unable to create a bucket using the `gsutil mb` command, check your `nextflow-service-account` roles. Make sure that you have `Storage Admin` added. +- If you are trying to execute a terminal command in a Jupyter code cell and it is not working, make sure that you have an `!` before the command. + - e.g., `mkdir example-1` -> `!mkdir example-1` + +## **Funding** + +MDIBL Computational Biology Core efforts are supported by two Institutional Development Awards (IDeA) from the National Institute of General Medical Sciences of the National Institutes of Health under grant numbers P20GM103423 and P20GM104318. + +## **License for Data** + +Text and materials are licensed under a Creative Commons CC-BY-NC-SA license. The license allows you to copy, remix and redistribute any of our publicly available materials, under the condition that you attribute the work (details in the license) and do not make profits from it. More information is available [here](https://tilburgsciencehub.com/about). + +![Creative commons license](https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png) + +This work is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-nc-sa/4.0/) + +The TransPi Nextflow workflow was developed and released by Ramon Rivera and can be obtained from its [GitHub repository](https://github.com/PalMuc/TransPi) diff --git a/GoogleCloud/Submodule_00_Glossary.md b/GoogleCloud/Submodule_00_Glossary.md new file mode 100644 index 0000000..48c62ba --- /dev/null +++ b/GoogleCloud/Submodule_00_Glossary.md @@ -0,0 +1,58 @@ +# MDIBL Transcriptome Assembly Learning Module +# Glossary of Terms and Acronyms + +Modern Biology and Biotechnology can be very intimidating to learn because, beyond the body of knowledge, there is also a large body of terminology that must be understood and internalized to effectively work within it. Going a step further into computational aspects of biology (i.e., bioinformatics and computational biology) increases this difficulty by adding the terminology of computational tools. + +Below is what we hope will be a helpful list of terms that provides a reference to help clarify the other pages in this module.
+ +## Terms and Concepts +### *Biology/Biotechnology* +**Genome** +: The complete [DNA](https://www.genome.gov/genetics-glossary/Deoxyribonucleic-Acid) content of an organism, generally broken up into chromosomes or contigs + +**Transcript** +: A functional [RNA molecule](https://www.genome.gov/genetics-glossary/RNA-Ribonucleic-Acid), generated by transcription, which copies one strand of the DNA into RNA. + +**Transcriptome** +: The collection of all RNA transcripts encoded in an organism's genome. Generally speaking, most genes can produce one or more RNA transcripts upon being activated (expressed), resulting in transcriptome sizes that are significantly larger than the number of genes. The term transcriptome is sometimes used to refer to more specific sets of RNA transcripts, but in our materials, "transcriptome" with no qualifiers will always mean the complete set of all possible transcripts. + +**Expressed Transcriptome** +: The set of transcripts that is present (activated or expressed) and can be measured in a given sample. + +**Tissue-specific or Cell-specific Transcriptome** +: The subset of the transcriptome that is expressed in either a specific tissue or cell type. + +**Transcriptome Profile** +: An experimental characterization of a sample (either bulk tissue or a single cell) that quantifies the identity and relative abundance of all transcripts measured in the sample. + +**Sequence Assembly** +: A computational process in which short fragments of sequence are integrated through alignment and joining to produce a longer sequence. + +**Transcriptome Assembly** +: Sequence assembly in which the sequenced molecules are RNA transcripts. A transcriptome assembly will generally produce thousands to hundreds of thousands of putative transcript sequences. + +**FASTA/FASTQ Sequence file formats** +: A text file representing one or more biological sequences. In a FASTA file, each sequence includes both a description/header line (which always begins with **'>'**) and one (or more lines of sequence data). In a FASTQ file, sequence quality information is encoded for every nucleotide in the read sequence. + +### *Computational* +**Workflow or Pipeline (computational)** +: A series of computational analysis steps carried out with a defined order and dependencies. Workflows can be conceptually defined and carried out within workflow control systems such as [Nextflow](https://www.nextflow.io/). + +**Container System** +: A program control system that sets up a virtual and protected environment within a larger computer that facilitates the safe installation and execution of programs. + +**Container Image** +: The working unit of the container system. A container image includes a specific executable program (or possibly a suite of programs), along with all of the necessary supporting libraries and auxiliary information. The contents of the container are accessible only while the container is active. + +## Acronyms +**API** +: [Application Programming Interface](https://en.wikipedia.org/wiki/API) → A way that different programs can interact with each other. + +**GFF** +: [General Feature Format](https://en.wikipedia.org/wiki/General_feature_format) → One of several plain-text transfer files that are used to map features onto a genome data set. The file is tab-delimited. + +**GTF** +: [General Transfer Format](https://en.wikipedia.org/wiki/Gene_transfer_format) → One of several plain-text transfer files that are used to map features onto a genome data set. The file is tab-delimited. 
+ +**HTML** +: [HyperText Markup Language](https://en.wikipedia.org/wiki/HTML) → A markup language often used for web development. Many programs within TransPi produce an HTML output file that often gives a visual representation of the output. diff --git a/Submodule_00_background.ipynb b/GoogleCloud/Submodule_00_background.ipynb similarity index 84% rename from Submodule_00_background.ipynb rename to GoogleCloud/Submodule_00_background.ipynb index 151f101..2c2f996 100644 --- a/Submodule_00_background.ipynb +++ b/GoogleCloud/Submodule_00_background.ipynb @@ -14,7 +14,7 @@ "id": "5e6d2086-4dbf-4a61-a5bb-8f08a269f3fa", "metadata": {}, "source": [ - "## Welcome!\n", + "## Overview\n", "\n", "This is a series of notebooks that allows you to explore the biological and computational process of the transcriptome assembly. Through these notebooks, you will also learn to leverage the powerful capabilities of tools such as Nextflow and Google Life Science API to bring your computational capabilities to the next level!\n", "\n", @@ -25,6 +25,48 @@ "Good luck, and have fun!" ] }, + { + "cell_type": "markdown", + "id": "3518c1a9", + "metadata": {}, + "source": [ + "## Learning Objectives:\n", + "\n", + "1. **Assess prior knowledge:** A pre-check quiz verifies foundational understanding of DNA, RNA, transcription, and gene expression.\n", + "\n", + "2. **Introduce transcriptome assembly:** Learners gain an understanding of what transcriptome assembly is, why RNA sequencing is performed, and the overall workflow involved.\n", + "\n", + "3. **Explain the process of transcriptome assembly:** This includes understanding preprocessing, sequence assembly using de Bruijn graphs, assembly assessment (internal and external consistency, BUSCO), and refinement techniques.\n", + "\n", + "4. **Introduce workflow management:** Learners are introduced to the concept of workflows/pipelines in bioinformatics and the role of workflow management systems like Nextflow.\n", + "\n", + "5. **Explain the use of Docker containers:** The notebook explains the purpose and benefits of using Docker containers for managing software dependencies in bioinformatics.\n", + "\n", + "6. **Introduce the Google Cloud Life Sciences API:** Learners are introduced to the Google Cloud Life Sciences API and its advantages for managing and executing workflows on cloud computing resources.\n", + "\n", + "7. **Familiarize learners with Jupyter Notebooks:** The notebook provides instructions on how to navigate and use Jupyter Notebooks, including cell types and execution order." + ] + }, + { + "cell_type": "markdown", + "id": "6a23eec6", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "* **Basic Biology Knowledge:** A foundational understanding of DNA, RNA, transcription, and gene expression is assumed. The notebook includes quizzes to assess this knowledge.\n", + "* **Python Programming:** While the notebook itself doesn't contain complex Python code, familiarity with Python syntax and the Jupyter Notebook environment is helpful.\n", + "* **Command Line Interface (CLI) Familiarity:** The notebook mentions using `pip` (a command-line package installer), indicating some CLI knowledge is beneficial, although not strictly required for completing the quizzes and reviewing the material." 
+ ] + }, + { + "cell_type": "markdown", + "id": "f6eefc1e", + "metadata": {}, + "source": [ + "## Get Started" + ] + }, { "cell_type": "markdown", "id": "22b95a28-fad7-4b6c-99ae-093c323f769c", @@ -67,7 +109,7 @@ "metadata": {}, "outputs": [], "source": [ - "display_quiz(\"quiz-material/00-pc1.json\")" + "display_quiz(\"../quiz-material/00-pc1.json\")" ] }, { @@ -239,7 +281,7 @@ "metadata": {}, "outputs": [], "source": [ - "display_quiz(\"quiz-material/00-cp1.json\", shuffle_questions = True)" + "display_quiz(\"../quiz-material/00-cp1.json\", shuffle_questions = True)" ] }, { @@ -365,7 +407,7 @@ "metadata": {}, "outputs": [], "source": [ - "display_quiz(\"quiz-material/00-cp2.json\", shuffle_questions = True)" + "display_quiz(\"../quiz-material/00-cp2.json\", shuffle_questions = True)" ] }, { @@ -383,10 +425,22 @@ }, { "cell_type": "markdown", - "id": "489beca6-4a9e-4a2e-a646-6b276270d810", + "id": "8d3cf5c9", "metadata": {}, "source": [ - "## When you are ready, proceed to the next notebook: [`Submodule_01_prog_setup.ipynb`](./Submodule_01_prog_setup.ipynb)." + "## Conclusion\n", + "\n", + "This introductory Jupyter Notebook provided essential background information and a pre-requisite knowledge check on fundamental molecular biology concepts (DNA, RNA, transcription, gene expression) crucial for understanding transcriptome assembly. The notebook established the context for the subsequent modules, outlining the workflow involving RNA-seq data, transcriptome assembly techniques (including de Bruijn graphs, BUSCO analysis), and the use of Nextflow and Google Cloud Life Sciences API for efficient workflow execution and management. The inclusion of interactive quizzes and video resources enhanced learning and engagement, preparing learners for the practical applications and computational challenges presented in the following notebooks. Successful completion of the checkpoint quizzes demonstrates readiness to proceed to the next stage of the MDIBL Transcriptome Assembly Learning Module." + ] + }, + { + "cell_type": "markdown", + "id": "421cebc3", + "metadata": {}, + "source": [ + "## Clean Up\n", + "\n", + "Remember to proceed to the next notebook [`Submodule_01_prog_setup.ipynb`](./Submodule_01_prog_setup.ipynb) or shut down your instance if you are finished." ] } ], diff --git a/GoogleCloud/Submodule_01_prog_setup.ipynb b/GoogleCloud/Submodule_01_prog_setup.ipynb new file mode 100644 index 0000000..7a4f8cb --- /dev/null +++ b/GoogleCloud/Submodule_01_prog_setup.ipynb @@ -0,0 +1,407 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "ea2cf3b6-8128-4170-bcd1-6889043c1bc6", + "metadata": {}, + "source": [ + "# MDIBL Transcriptome Assembly Learning Module\n", + "# Notebook 1: Setup" + ] + }, + { + "cell_type": "markdown", + "id": "f62d616c", + "metadata": {}, + "source": [ + "## Overview\n", + "\n", + "This notebook is designed to configure your virtual machine (VM) to have the proper tools and data in place to run the transcriptome assembly training module." + ] + }, + { + "cell_type": "markdown", + "id": "60145056", + "metadata": {}, + "source": [ + "## Learning Objectives\n", + "\n", + "1. **Understand and utilize shell commands within Jupyter Notebooks:** The notebook explicitly teaches the difference between `!` and `%` prefixes for executing shell commands, and how to navigate directories using `cd` and `pwd`.\n", + "\n", + "2. 
**Set up the necessary software:** Students will install and configure essential tools including:\n", + " * Java (a prerequisite for Nextflow).\n", + " * Miniforge (a package manager for bioinformatics tools).\n", + " * `sra-tools`, `perl-dbd-sqlite`, and `perl-dbi` (specific bioinformatics packages).\n", + " * Nextflow (a workflow management system).\n", + " * `gsutil` (for interacting with Google Cloud Storage).\n", + "\n", + "3. **Download and organize necessary data:** Students will download the TransPi transcriptome assembly software and its associated resources (databases, scripts, configuration files) from a Google Cloud Storage bucket. This includes understanding the directory structure and file organization.\n", + "\n", + "4. **Manage file permissions:** Students will use the `chmod` command to set executable permissions for the necessary files and directories within the TransPi software.\n", + "\n", + "5. **Navigate file paths:** The notebook provides examples and explanations for using relative file paths (e.g., `./`, `../`) within shell commands." + ] + }, + { + "cell_type": "markdown", + "id": "549be731", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "* **Operating System:** A Linux-based system is assumed (commands like `apt`, `uname` are used). The specific distribution isn't specified but a Debian-based system is likely.\n", + "* **Shell Access:** The ability to execute shell commands from within the Jupyter Notebook environment (using `!` and `%`).\n", + "* **Java Development Kit (JDK):** Required for Nextflow.\n", + "* **Miniforge** A package manager for installing bioinformatics tools.\n", + "* **`gsutil`:** The Google Cloud Storage command-line tool. This is crucial for downloading data from Google Cloud Storage." + ] + }, + { + "cell_type": "markdown", + "id": "a92f62a0", + "metadata": {}, + "source": [ + "## Get Started" + ] + }, + { + "cell_type": "markdown", + "id": "958495ce-339d-4d4d-a621-9ede79a7363c", + "metadata": {}, + "source": [ + "
\n", + " \n", + " Note: ! and % in code cells\n", + "
\n", + "\n", + ">You may notice that many of the lines in the code cells begin with one of these symbols: `!` or `%`. They both allow you (the user) to run shell commands in the code cells of a Juypter notebook. They do, however, operate slightly differently: \n", + ">- The `!` executes the command and then immediately terminates.\n", + ">- The `%` executes the command and has a lasting effect.\n", + "\n", + "
\n", + " \n", + " Example: \n", + "
\n", + "\n", + ">Take this example code snippet: *Imagine that you are currently in the directory named* `original-directory`.\n", + ">```python\n", + "!cd different-directory/\n", + ">```\n", + ">After this line executes, you will still be in the directory named `original-directory`.\n", + ">\n", + ">**Vs.**\n", + ">```python\n", + "%cd different-directory/\n", + ">```\n", + ">After this line executes, you will now be in the directory `different-directory`." + ] + }, + { + "cell_type": "markdown", + "id": "423e706b-5085-4575-91a3-7ba7969ef1e2", + "metadata": {}, + "source": [ + "## Time to begin!\n", + "\n", + "**Step 1:** To start, make sure that you are in the right starting place with a `cd`.\n", + "> `pwd` prints our current local working directory. Make sure the output from the command is: `/home/jupyter`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0344b971-a2d1-46cf-8661-495cf801d337", + "metadata": {}, + "outputs": [], + "source": [ + "%cd /home/jupyter" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5844a09f-3834-4d13-bc9f-1e55286f503d", + "metadata": {}, + "outputs": [], + "source": [ + "! pwd" + ] + }, + { + "cell_type": "markdown", + "id": "e674e157-8b1a-48e0-b1ce-72e748a3cb17", + "metadata": {}, + "source": [ + "**Step 2:** Now, update the system and install Java (which is needed for Nextflow to run)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "be0fb298-b8ff-4f5f-b21b-90d9796f11e1", + "metadata": {}, + "outputs": [], + "source": [ + "! sudo apt update\n", + "! sudo apt-get install default-jdk -y\n", + "! java -version" + ] + }, + { + "cell_type": "markdown", + "id": "7b3ffb16-3395-4c01-9774-ee568e815490", + "metadata": {}, + "source": [ + "**Step 3:** Install Miniforge (a package manager), which is needed to support the information held within the TransPi databases." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ac5b204a-f0db-4ceb-bf37-57eca6d77974", + "metadata": {}, + "outputs": [], + "source": [ + "! curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh\n", + "! bash Miniforge3-$(uname)-$(uname -m).sh -b -p $HOME/miniforge" + ] + }, + { + "cell_type": "markdown", + "id": "c5584e2e", + "metadata": {}, + "source": [ + "Next, add it to the path." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ad030cd1", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "os.environ[\"PATH\"] += os.pathsep + os.environ[\"HOME\"]+\"/miniforge/bin\"" + ] + }, + { + "cell_type": "markdown", + "id": "7b930ad7", + "metadata": {}, + "source": [ + "Next, using Miniforge and bioconda, install the tools that will be used in this tutorial." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4d4dd51e", + "metadata": {}, + "outputs": [], + "source": [ + "! mamba install -c bioconda sra-tools perl-dbd-sqlite perl-dbi -y" + ] + }, + { + "cell_type": "markdown", + "id": "bffbeab5-3664-4c28-b948-b20d9c15aa05", + "metadata": {}, + "source": [ + "**Step 4:** Now, install Nextflow, make it executable, and update it." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "722415ad-e060-417c-8d41-f0a63f529421", + "metadata": {}, + "outputs": [], + "source": [ + "! curl https://get.nextflow.io | bash\n", + "! chmod +x nextflow\n", + "! 
./nextflow self-update" + ] + }, + { + "cell_type": "markdown", + "id": "87bb745c-b465-498d-8f62-95199fc37b4d", + "metadata": {}, + "source": [ + "**Step 5:** Time to get TransPi.\n", + ">The original version of TransPi is available on GitHub, however, we have made a variety of alterations to the program and will be using the updated version in the following modules." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d7409cc1-540d-42ba-b195-dd2979929fb8", + "metadata": {}, + "outputs": [], + "source": [ + "! gsutil -m cp -r gs://nigms-sandbox/nosi-inbremaine-storage/TransPi ./" + ] + }, + { + "cell_type": "markdown", + "id": "d8346fb0-a1dd-417b-ab19-256bbf8e32ce", + "metadata": {}, + "source": [ + "
\n", + " \n", + " Note: gsutil\n", + "
\n", + "\n", + ">`gsutil` is a tool allows you to interact with Google Cloud Storage through the command line." + ] + }, + { + "cell_type": "markdown", + "id": "624311d5-bb21-429c-b575-80ffc0a4fd9f", + "metadata": {}, + "source": [ + "**Step 6:** Now copy over all of the additional resources needed for TransPi to run. This may take a few minutes.\n", + "> Within the resources directory, 5 sub-directories are needed: `/bin`, `/conf`, `/DBs`, `/seq2`, and `trans`.\n", + "> - In the **`/bin`** directory, there are a set of programs that get called by various processes within the TransPi workflow. One example `GO_plots.R` is an R script that creates plots showing gene ontology of the built transcriptome.\n", + "> - In the **`/conf`** directory, there are 3 files, but we will only be using `uni_tax.txt` which contains the UniProt taxonomy codes.\n", + "> - In the **`/DBs`** directory, there are 3 sub-directories containing 3 databases that TransPi needs:\n", + "> - **`/hmmerdb`** contains the `Pfam_A.hmm` file which is a database of protein families. This database is used to annotate the transcriptome that is built using probabilities built from Hidden Markov Models.\n", + "> - **`/sqlite_db`** contains the necessary files and database to run DIAMOND, a program that swiftly aligns the built transcriptome to a database of known proteins.\n", + "> - **`/uniprot_db`:** contains a different database to run DIAMOND and to run TransDecoder, a program that identifies coding regions." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1233004c-fcad-444e-a5f2-9c38d33b6e95", + "metadata": {}, + "outputs": [], + "source": [ + "! gsutil -m cp -r gs://nigms-sandbox/nosi-inbremaine-storage/resources ./" + ] + }, + { + "cell_type": "markdown", + "id": "e6697337-a102-42ea-9b2a-55cbdbeeefa2", + "metadata": {}, + "source": [ + "
\n", + " \n", + " Note: File Paths\n", + "
\n", + "\n", + ">Consider the following file structure and you are currently in the directory `toDo`: \n", + ">\n", + "> \n", + ">\n", + ">- If you were to type `!ls ./`, it would return the contents of your current directory, so it would return `nextWeek`, `Today.txt`, `Tomorrow.txt`, `Yesterday.txt`.\n", + "> - The `./` path points to your current directory.\n", + ">\n", + ">- If you were to type `!ls ../`, it would return the contents of the directory 1 layer up from your current directory, so it would return `coolPicturesOcean`, `shoppingList`, `toDo`.\n", + "> - The `../` path points to the directory one layer up from the current directory.\n", + "> - They can also be stacked so `../../` will take you two layers up.\n", + ">\n", + ">- If you were to type `!ls ./nextWeek/` it would return the contents of the `nextWeek` directory which is one layer down from the current directory, so it would return `manyThings.txt`.\n", + ">\n", + ">**This means that in the second line of the code cell above, the file `TransPi.nf` will be copied from the Google Cloud Storage bucket to the current directory.**" + ] + }, + { + "cell_type": "markdown", + "id": "b2182ab3-9661-4f33-bda7-b47d38dde5eb", + "metadata": {}, + "source": [ + "**Step 7:** Make the contents of `./TransPi/bin` executable." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "78bb9105-4de3-405b-b0a7-8b6b49024493", + "metadata": {}, + "outputs": [], + "source": [ + "! chmod -R +x ./TransPi/bin" + ] + }, + { + "cell_type": "markdown", + "id": "5840d55c-3696-4c5b-a1b7-a5eb827b9c4c", + "metadata": {}, + "source": [ + "
\n", + " \n", + " Note: chmod\n", + "
\n", + "\n", + ">The `chmod` command is responsible for granting access to files and directories.\n", + ">\n", + ">Following the `chmod` can be a series of letters and symbols, in the case above `a+rx`.\n", + ">- The first letter can be `u`, `g`, `o`, or `a`.\n", + "> - `u` stands for owner\n", + "> - `g` stands for group\n", + "> - `o` stands for other users\n", + "> - `a` stands for all\n", + "> \n", + "> \n", + ">- Next can be either a `+` or a `-`.\n", + "> - `+` grants access\n", + "> - `-` revokes access\n", + ">\n", + ">\n", + ">- Next the type of permission is indicated (more than one can be there). The options are `r`, `w`, and `x`.\n", + "> - `r` is read permission\n", + "> - `w` is write permission\n", + "> - `x` is execute permission\n", + ">\n", + ">\n", + ">- Finally, the file or directory is designated." + ] + }, + { + "cell_type": "markdown", + "id": "c3cfbfc2-37c0-4a54-bdca-f3ea997c25c2", + "metadata": {}, + "source": [ + "
\n", + " \n", + " Checkpoint 1:\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6b1dd0d1-8b57-4fea-a029-649052016159", + "metadata": {}, + "outputs": [], + "source": [ + "from jupyterquiz import display_quiz\n", + "display_quiz(\"../quiz-material/01-cp1.json\", shuffle_questions = True)" + ] + }, + { + "cell_type": "markdown", + "id": "ffec658a", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "This notebook successfully configured the virtual machine for the MDIBL Transcriptome Assembly Learning Module. We updated the system, installed necessary software including Java, Mambaforge, and Nextflow, and downloaded the TransPi program and its associated resources from Google Cloud Storage. The `chmod` command ensured executability of the TransPi scripts. The VM is now prepared for the next notebook, `Submodule_02_basic_assembly.ipynb`, which will delve into the transcriptome assembly process itself. Successful completion of this notebook's steps is crucial for the successful execution of subsequent modules." + ] + }, + { + "cell_type": "markdown", + "id": "666c1e4d", + "metadata": {}, + "source": [ + "## Clean Up\n", + "\n", + "Remember to proceed to the next notebook [`Submodule_02_basic_assembly.ipynb`](./Submodule_02_basic_assembly.ipynb) or shut down your instance if you are finished." + ] + } + ], + "metadata": {}, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/GoogleCloud/Submodule_02_basic_assembly.ipynb b/GoogleCloud/Submodule_02_basic_assembly.ipynb new file mode 100644 index 0000000..b243a2e --- /dev/null +++ b/GoogleCloud/Submodule_02_basic_assembly.ipynb @@ -0,0 +1,351 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "f061b442-f917-42b8-a635-bd85bb0b502f", + "metadata": {}, + "source": [ + "# MDIBL Transcriptome Assembly Learning Module\n", + "# Notebook 2: Performing a \"Standard\" basic transcriptome assembly\n", + "\n", + "## Overview\n", + "\n", + "In this notebook, we will set up and run a basic transcriptome assembly, using the analysis pipeline as defined by the TransPi Nextflow workflow. The steps to be carried out are the following, and each is described in more detail in the Background material notebook.\n", + "\n", + "- Sequence Quality Control (QC): removing adapters and low-quality sequences.\n", + "- Sequence normalization: reducing the reads that appear to be \"overrepresented\" (based on their *k*-mer content).\n", + "- Generation of multiple 1st-pass assemblies using the following tools: Trinity, TransAbyss, SOAP, rnaSpades, and Velvet/Oases.\n", + "- Integration and reduction of the individual transcriptomes using EvidentialGene.\n", + "- Assessment of the final transcriptome with rnaQuast and BUSCO.\n", + "- Annotation of the final transcriptome using alignment to known proteins (using DIAMOND/BLAST) and assignment to probable protein domains (using HMMER/Pfam).\n", + "- Generation of output reports.\n", + "\n", + "> \n", + ">\n", + "> **Figure 1:** TransPi workflow for a basic transcriptome assembly run." + ] + }, + { + "cell_type": "markdown", + "id": "062784ec", + "metadata": {}, + "source": [ + "## Learning Objectives\n", + "\n", + "1. **Understanding the TransPi Workflow:** Learners will gain a conceptual understanding of the TransPi workflow, including its individual steps and their order. This involves understanding the purpose of each stage (QC, normalization, assembly, integration, assessment, annotation, and reporting).\n", + "\n", + "2. 
**Executing a Transcriptome Assembly:** Learners will learn how to run a transcriptome assembly using Nextflow and the TransPi pipeline, including setting necessary parameters (e.g., k-mer size, read length). They will learn how to interpret the command-line interface for executing Nextflow workflows.\n", + "\n", + "3. **Interpreting Nextflow Output:** Learners will learn to navigate and understand the directory structure generated by the TransPi workflow. This includes interpreting the output from various tools such as FastQC, FastP, Trinity, TransAbyss, SOAP, rnaSpades, Velvet/Oases, EvidentialGene, rnaQuast, BUSCO, DIAMOND/BLAST, HMMER/Pfam, and TransDecoder. This involves understanding the different types of output files generated and how to extract relevant information from them (e.g., assembly statistics, annotation results).\n", + "\n", + "4. **Assessing Transcriptome Quality:** Learners will understand how to assess the quality of a transcriptome assembly using metrics generated by rnaQuast and BUSCO.\n", + "\n", + "5. **Interpreting Annotation Results:** Learners will learn to interpret the results of transcriptome annotation using tools like DIAMOND/BLAST and HMMER/Pfam, understanding what information they provide regarding protein function and domains.\n", + "\n", + "6. **Utilizing Workflow Management Systems:** Learners will gain practical experience using Nextflow, a workflow management system, to execute a complex bioinformatics pipeline. This includes understanding the benefits of using a defined workflow for reproducibility and efficiency.\n", + "\n", + "7. **Working with Jupyter Notebooks:** The notebook itself provides a practical example of how to integrate command-line tools within a Jupyter Notebook environment." + ] + }, + { + "cell_type": "markdown", + "id": "abf9345c", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "* **Nextflow:** A workflow management system used to execute the TransPi pipeline. \n", + "* **Docker:** Used for containerization of the various bioinformatics tools within the workflow. This avoids the need for local installation of numerous packages.\n", + "* **TransPi:** The specific Nextflow pipeline for transcriptome assembly. The notebook assumes it's present in the `/home/jupyter` directory.\n", + "* **Bioinformatics Tools (within TransPi):** The workflow utilizes several bioinformatics tools. These are packaged within Docker containers, but the notebook expects that TransPi is configured correctly to access and use them:\n", + " * FastQC: Sequence quality control.\n", + " * FastP: Read preprocessing (trimming, adapter removal).\n", + " * Trinity, TransAbyss, SOAPdenovo-Trans, rnaSpades, Velvet/Oases: Transcriptome assemblers.\n", + " * EvidentialGene: Transcriptome integration and reduction.\n", + " * rnaQuast: Transcriptome assessment.\n", + " * BUSCO: Assessment of completeness of the assembled transcriptome.\n", + " * DIAMOND/BLAST: Protein alignment for annotation.\n", + " * HMMER/Pfam: Protein domain assignment for annotation.\n", + " * Bowtie2: Read mapping for assembly validation.\n", + " * TransDecoder: ORF prediction and coding region identification.\n", + " * Trinotate: Functional annotation of transcripts." + ] + }, + { + "cell_type": "markdown", + "id": "6cd0f4f2-5559-4675-9e97-24b0548b31af", + "metadata": {}, + "source": [ + "## Get Started \n", + "\n", + "**Step 1:** Make sure you are in the correct local working directory as in `01_prog_setup.ipynb`.\n", + "> It should be `/home/jupyter`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ec368bd6-699b-4dfb-8a4c-e20dc42a3437", + "metadata": {}, + "outputs": [], + "source": [ + "%cd /home/jupyter" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "caa66918-d431-46da-8922-3d6659b8e5a1", + "metadata": {}, + "outputs": [], + "source": [ + "!pwd" + ] + }, + { + "cell_type": "markdown", + "id": "e94c3682-deae-4eb3-a10b-c9c2be24852d", + "metadata": {}, + "source": [ + "**Step 3A:** First, check the listings within the `resources` directory. Make sure you see the items listed below:\n", + "```\n", + "DBs bin conf seq2 trans\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3cc95c9c-d19a-479c-8aa3-101191a00c69", + "metadata": {}, + "outputs": [], + "source": [ + "!ls ./resources" + ] + }, + { + "cell_type": "markdown", + "id": "b276e979-b4eb-40f0-b336-c41d3391549a", + "metadata": {}, + "source": [ + "**Step 3B:** Now, check the listing of the sequence directory: `seq2`. You should see seven pairs of gzipped fastq files (signified by the paired `.fastq.gz` naming). Six of these are for individual samples, and the seventh set, labeled **joined**, is a concatenation of all files. Because of the way that TransPi works (as well as some of the programs that it uses), it's best to use a joined set of all sequences to make a unified transcriptome assembly." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "20ba183a-0e68-4e23-9f98-1d18dec69aa0", + "metadata": {}, + "outputs": [], + "source": [ + "!ls ./resources/seq2" + ] + }, + { + "cell_type": "markdown", + "id": "6494cca1-3e15-4a27-bef7-bdeb9222a8da", + "metadata": {}, + "source": [ + "**Step 4:** Now we are set to perform the assembly using the sequences within the directory `seq2/`. \n", + "> The specific sequences here are from zebrafish, and they represent a selected subset of the sequences from the experiment of [Hartig et al](https://journals.biologists.com/bio/article-pdf/5/8/1134/1114440/bio020065.pdf).\n", + "\n", + "The data was selected in order to create a reasonably large assembly (targeting a few hundred transcripts), while also being able to be checked against the \"known\" transcripts and genes.\n", + "\n", + "We will set only a small number of the options used in TransPi, focusing on the following:\n", + "- `-profile docker`: This is a key setting, as it allows all software to be run from Docker container images, negating the need to install all programs locally (in other scenarios, there is the option to add more than one profile).\n", + " - The profile names point to pre-defined groupings of settings within the `nextflow.config` file (a short optional example below shows how these resolve).
\n", + "- `--k 17,25,43`: The size(s) of *k*-mers to be used in the generation of the de Bruijn graphs (see the background file for a discussion of the role of *k*, and why it needs to be variable).\n", + "- `--maxReadLen 50`: The maximum length of the reads (since these files all come from one experiment, this represents the length of all sequences).\n", + "- `--all`: This setting tells Nextflow to run all steps from pre-assembly QC, through assembly and refinement, and then finally the analysis and tabulation of annotations to the putative transcripts.\n", + "\n", + "Under the assumption of an n1-high-memory node with 16 processors and 104GB of RAM, this run should take approximately **58 minutes**.\n", + "\n", + "As the workflow executes, the Nextflow engine will generate a directory called `work` where it places all of the intermediate information and output that is needed to carry out the work.\n", + "\n", + "
\n", + " \n", + " Tip: Run-Time Reminder\n", + "
\n", + "\n", + "\n", + "> \n", + ">\n", + "> Remember that you can tell if the cell is still running by referring to the contents inside the `[ ]:` that sits to the left of the code cell. Or you can check the top right of the screen for the circle." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ee5985e3-93df-4779-afe1-4464e13bf619", + "metadata": {}, + "outputs": [], + "source": [ + "!NXF_VER=22.10.1 ./nextflow run ./TransPi/TransPi.nf \\\n", + "-profile docker --k 17,25,43 --maxReadLen 50 --all " + ] + }, + { + "cell_type": "markdown", + "id": "0117d994-0502-4a58-b07a-861d254f11e2", + "metadata": {}, + "source": [ + "The beauty and power of using a defined workflow in a management system (such as Nextflow) are that we not only get a defined set of steps that are carried out in the proper order, but we also get a well-structured and concise directory structure that holds all pertinent output (with naming specified in the command-line call of Nextflow and TransPi).\n", + "\n", + "**Step 5:** With the execution complete, let's look at what we have generated, first in the results directory. We will add the `-l` argument for a \"long listing\"." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "78d5a924-fbfc-49a9-ad8e-db707bca1999", + "metadata": {}, + "outputs": [], + "source": [ + "!ls -l ./basicRun/output" + ] + }, + { + "cell_type": "markdown", + "id": "44d927ba-6916-47b5-8edf-1cc58d585f78", + "metadata": {}, + "source": [ + "## Investigation and Exploration: Assembly and Annotation Results\n", + "The use of an established and complex multi-step workflow (such as the TransPi workflow that you just ran) has the benefit of saving you a lot of manual effort in setting up and carrying out the individual steps yourself. It also is highly reproducible, given the same input data and parameters.\n", + "\n", + "It does, however, generate a lot of output, and it is beyond the scope of this training exercise to go through all of it in detail. We recommend that you download the complete results directory onto another machine or storage so that you can view it at your convenience, and on a less expensive machine than you are using to run this tutorial. *If you would like the proceed with the data in its current location, this also works, just bear in mind that it will cost roughly $0.72 per hour.*\n", + "\n", + "
\n", + " \n", + " Note: To Download...\n", + "
\n", + "\n", + ">Here are two possible options to access the results files outside of this expensive JupyterLab instance. \n", + ">- If you instead have an external machine that accepts ssh connections, then you can use the secure copy scp command: `!scp -r ./basicRun/output/YOUR_USERID@YOUR.MACHINE`\n", + ">- If you have a Google Cloud Storage bucket available, you can use the gsutil command: `!gsutil -m cp -r ./basicRun/output gs://YOUR-BUCKET-NAME-HERE/transpi-basicrun-output` to place all of your results into that bucket. \n", + "> - From there you have two options: \n", + "> 1. (Recommended) You could create a new (cheaper) Vertex AI instance (or use an old one) and copy the files down into that new directory using the following gsutil command:`!gsutil -m cp -r gs://YOUR-BUCKET-NAME-HERE/transpi-basicrun-output ./`\n", + "> 2. You could navigate to the bucket through the Google Cloud console and open the files through the links labeled `Authenticated URL`\n", + ">\n", + ">**In all of the commands above, you will need to edit the All-Caps part to match your own bucket or machine.**\n", + "\n", + "
\n", + " \n", + " Tip: \n", + "
\n", + "\n", + "> - **After you have the output directory in its desired location, consider the information in the cell below as you explore the output.**\n", + "> - **If you are viewing the output in a different location, consider copying or taking a screenshot of the cell below.**\n", + "> - **Make sure that if you are viewing your output in a different location, you save your notebooks here, and then stop the VM instance, or it will keep costing money.**\n", + "> - **Upon completion of your exploration, return to this submodule to complete the checkpoint quiz.**" + ] + }, + { + "cell_type": "markdown", + "id": "895c027e-edaf-4f96-9502-c4e4cfd67471", + "metadata": {}, + "source": [ + "## Output Overview\n", + "*These sub-directories will be mentioned in the order of their execution within TransPi.*\n", + "\n", + "
\n", + " \n", + " Note: HTML files\n", + "
\n", + "\n", + ">**If you are viewing your output within a JupyterLab VM instance, for the `.html` files to work correctly, you will need to select `Trust HTML` at the top of the screen.** This is due to the dynamic elements within the files.\n", + "\n", + "### FastQC\n", + "> FastQC takes the raw read files and runs a swift analysis of the read data. The two key output files are `joined_R1_fastqc.html` and `joined_R2_fastqc.html` which provide a visual illustration of the read quality metrics. It is important to note that FastQC does not manipulate the data for further steps, it just outputs an analysis of the data quality.\n", + "\n", + "### Filter\n", + "> FastP is a bioinformatics tool that preprocesses the raw read data. It trims poor-quality reads, removes adapter sequences, and corrects errors noticed within the reads. The `joined.fastp.html` provides an overview of the processing done on both read files.\n", + "\n", + "### Assemblies\n", + "> TransPi uses five different assembly tools. All of the assembly `.fa` files are placed within the assemblies directory. For all of the assemblies except for Trinity, there are four `.fa` files: one for each of the *k*-mer length plus a compilation of all three. Trinity does not have the option to customize the *k*-mer size. Instead, it runs at a default `k=25`, therefore only having one assembly.\n", + "\n", + "### EviGene\n", + "> At this point, we have a major overassembly of the transcriptome. We use a small piece of the EvidentialGene (EviGene) program known as tr2aacds which takes all of the assemblies and crunches them into a single, unified transcriptome. Within the evigene directory, there are two files: `joined.combined.fa` is all of the assemblies placed into the same file and`joined.combined.okay.fa` is the combined transcriptome after EviGene has reduced it down. In each header line, there is key information about the sequence.\n", + ">> For example: `>SOAP.k17.C9429 58.0 evgclass=main,okay,match:SPADES.k43.NODE_313_length_1670_cov_12.047941_g161_i0,pct:100/100/.; aalen=392,75%,complete;`\n", + ">>\n", + ">> - This header indicates that this sequence was found in both the SOAP and SPADES assemblies.\n", + ">> - The `eviclass=main` means that this sequence is the primary transcript, and there are alternates identified.\n", + ">> - The `aalen=392` is the amino acid length of the sequence.\n", + ">> - The `complete` means that it is a complete reading frame.\n", + ">> - For more information on interpreting the headers from EviGene, reference the following [link](http://arthropods.eugenes.org/EvidentialGene/evigene/) in section 3.\n", + "\n", + "### BUSCO\n", + "> BUSCO uses a database of known universal single-copy orthologs under a specific lineage (vertebrata in this case) and checks our assembled transcriptome for those sequences which it expects to find. BUSCO was run on both the TransPi assembly along with the assembly just done by Trinity. To visualize BUSCO's results, refer to the `short_summary.specific.vertebrata_odb10.joined.TransPi.bus4.txt` and `short_summary.specific.vertebrata_odb10.joined.Trinity.bus4.txt` files.\n", + "\n", + "### Mapping \n", + "> One way to verify the quality of the assembly is to map the original input reads to the assembly (using an alignment program called bowtie2). There are two output files, one for the TransPi assembly and one for the Trinity exclusive assembly. 
These files are named `log_joined.combined.okay.fa.txt` and `log_joined.Trinity.fa.txt`.\n", + "\n", + "### rnaQUAST\n", + "> rnaQUAST is another assembly assessment program. It provides statistics about the transcripts that have been produced. For a brief overview of the transcript statistics, refer to `joined_rnaQUAST.csv`.\n", + "\n", + "### TransDecoder \n", + "> TransDecoder is a program that \"decodes\" the transcripts. First, it identifies open reading frames (ORFs). From there, it then will make predictions on what is likely to be coding regions. For statistics regarding TransDecoder, refer to the `joined_transdecoder.stats` file.\n", + "\n", + "### Trinotate\n", + "> Trinotate uses the information regarding likely coding regions produced by TransDecoder to make predictions about potential protein function. It does this by cross-referencing the assembled transcripts to various databases such as pfam and hmmer. These annotations can be viewed in the `joined.trinotate_annotation_report.xls` file.\n", + "\n", + "### Report\n", + "> Within `report` is one file: `TransPi_Report_joined.html`. This is an HTML file that combines the results throughout TransPi into a series of visual tables and figures.\n", + ">> The sub-directories `stats` and `figures` are intermediary sub-directories that hold information to generate the report.\n", + "\n", + "### pipeline_info\n", + "> One of the benefits of using Nexflow and a well-defined pipeline/workflow is that when the run is completed, we get a high-level summary of the execution timeline and resources. Two key files within this sub-directory are `transpi_timeline.html` and `transpi_report.html`. In the `transpi_timeline.html` file, you can see a graphic representation of the total execution time of the entire workflow, along with where in the process each of the programs was active. From this diagram, you can also infer the ***dependency*** ordering that is encoded into the TransPi workflow. For example, none of the assembly runs started until the process labeled **`normalize reads`** was complete because each of these is run on the normalized data, rather than the raw input. Similarly, **`evigene`**, the program that integrates and refines the output of all of the assembly runs doesn't start until all of the assembly processes are complete. Within the `transpi_report.html` file, you can get a view of the resources used and activities carried out by each process, including CPUs, RAM, input/output, container used, and more.\n", + "\n", + "### RUN-INFO.txt\n", + "> `RUN-INFO.txt` provides the specific details of the run such as where the directories are and the versions of the various programs used." + ] + }, + { + "cell_type": "markdown", + "id": "c372f902-6138-4217-86a8-4b0002f5f387", + "metadata": {}, + "source": [ + "
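As one concrete starting point for that exploration, the sketch below counts the sequences in the combined versus EviGene-reduced FASTA files and prints whichever BUSCO short summaries were produced. `find` is used because the exact sub-directory names under `./basicRun/output` may differ slightly between runs:

```
# Per-file sequence counts for the combined and EviGene "okay" assemblies
!find ./basicRun/output -name "joined.combined*.fa" | xargs grep -c ">"
# Print the BUSCO short summaries (TransPi and Trinity-only)
!find ./basicRun/output -name "short_summary*.txt" -exec cat {} +
```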
\n", + " \n", + " Checkpoint 1:\n", + "
\n", + "\n", + "*The green cards below are interactive. Spend some time to consider the question and click on the card to check your answer.*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "74e23c5b-a6e7-4ad7-b69c-7dc42fa9bf2a", + "metadata": {}, + "outputs": [], + "source": [ + "from jupytercards import display_flashcards\n", + "display_flashcards('../quiz-material/02-cp1-1.json')\n", + "display_flashcards('../quiz-material/02-cp1-2.json')" + ] + }, + { + "cell_type": "markdown", + "id": "b82f0b3a", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "This Jupyter Notebook demonstrated a complete transcriptome assembly workflow using the TransPi Nextflow pipeline. We successfully executed the pipeline, encompassing quality control, normalization, multiple assembly generation with Trinity, TransAbyss, SOAP, rnaSpades, and Velvet/Oases, integration via EvidentialGene, and subsequent assessment using rnaQuast and BUSCO. The final assembly underwent annotation with DIAMOND/BLAST and HMMER/Pfam, culminating in comprehensive reports detailing the entire process and the resulting transcriptome characteristics. The generated output, accessible in the `basicRun/output` directory, provides a rich dataset for further investigation and analysis, including detailed quality metrics, assembly statistics, and functional annotations. This module provided a practical introduction to automated transcriptome assembly, highlighting the efficiency and reproducibility offered by integrated workflows like TransPi. Further exploration of the detailed output is encouraged, and the subsequent notebook focuses on a more in-depth annotation analysis." + ] + }, + { + "cell_type": "markdown", + "id": "b68484f3", + "metadata": {}, + "source": [ + "## Clean Up\n", + "\n", + "Remember to proceed to the next notebook [`Submodule_03_annotation_only.ipynb`](Submodule_03_annotation_only.ipynb) or shut down your instance if you are finished." + ] + } + ], + "metadata": {}, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/GoogleCloud/Submodule_03_annotation_only.ipynb b/GoogleCloud/Submodule_03_annotation_only.ipynb new file mode 100644 index 0000000..5c4c5fd --- /dev/null +++ b/GoogleCloud/Submodule_03_annotation_only.ipynb @@ -0,0 +1,548 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "b3ea1e55-d687-4189-b90f-5ce35b3c89de", + "metadata": {}, + "source": [ + "# MDIBL Transcriptome Assembly Learning Module\n", + "# Notebook 3: Using TransPi to Performing an \"Annotation Only\" Run\n", + "\n", + "## Overview\n", + "\n", + "In the previous notebook, we ran the entire default TransPi workflow, generating a small transcriptome from a test data set. While that is a valid exercise in carrying through the workflow, the downstream steps (annotation and assessment) will be unrealistic in their output, since the test set will only generate a few hundred transcripts. In contrast, a more complete estimate of a vertebrate transcriptome will contain tens to hundreds of thousands of transcripts.\n", + "\n", + "In this notebook, we will start from an assembled transcriptome. We will work with a more realistic example that was generated and submitted to the NCBI Transcriptome Shotgun Assembly archive.\n" + ] + }, + { + "cell_type": "markdown", + "id": "8f4cd172", + "metadata": {}, + "source": [ + "## Learning Objectives:\n", + "\n", + "1. 
**Understanding the TransPi workflow and its components:** The notebook builds upon previous knowledge of TransPi, focusing on the annotation stage, separating it from the assembly process. It reinforces the understanding of the overall workflow and its different stages.\n", + "\n", + "2. **Performing an \"annotation-only\" run with TransPi:** The primary objective is to learn how to execute TransPi, specifically utilizing the `--onlyAnn` option to process a pre-assembled transcriptome. This teaches efficient use of the tool and avoids unnecessary recomputation.\n", + "\n", + "3. **Working with realistic transcriptome data:** The notebook shifts from a small test dataset to a larger, more realistic transcriptome from the NCBI Transcriptome Shotgun Assembly archive. This exposes learners to the scale and characteristics of real-world transcriptome data.\n", + "\n", + "4. **Using command-line tools for data manipulation:** The notebook uses `grep`, `perl` one-liners, and `docker` commands to count sequences, modify configuration files, and manage containerized applications. This improves proficiency in using these essential bioinformatics tools.\n", + "\n", + "5. **Interpreting TransPi output:** Learners analyze the `RUN_INFO.txt` file and other output files to understand the analysis parameters and results. This develops skills in interpreting computational biology results.\n", + "\n", + "6. **Understanding and using containerization (Docker):** The notebook introduces the concept of Docker containers and demonstrates how to utilize a BUSCO container to run the BUSCO analysis, highlighting the benefits of containerization for reproducibility and dependency management. This teaches practical application of containers in bioinformatics.\n", + "\n", + "7. **Running BUSCO analysis:** Learners execute BUSCO, a crucial tool for assessing the completeness of transcriptome assemblies. This extends their skillset to include running and interpreting BUSCO results.\n", + "\n", + "8. **Interpreting BUSCO and other annotation results:** The notebook includes checkpoints that challenge learners to interpret the BUSCO results, GO stats, and TransDecoder stats, fostering critical thinking and data interpretation skills.\n", + "\n", + "9. **Critical evaluation of data sources:** The notebook encourages learners to consider the source and context of the transcriptome data used, prompting reflection on data quality and limitations. This emphasizes responsible use of biological data.\n", + "\n", + "10. **Independent BUSCO analysis:** The final checkpoint task requires learners to independently run a BUSCO analysis on a new transcriptome, selecting a data source and lineage, and interpreting the results. This assesses the understanding and practical application of the concepts covered in the notebook." + ] + }, + { + "cell_type": "markdown", + "id": "04994736", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "* **Nextflow:** The core workflow engine used to manage the TransPi pipeline.\n", + "* **Perl:** Used for a one-liner to modify the Nextflow configuration file.\n", + "* **Docker:** Used to run BUSCO in a containerized environment.\n", + "* **BUSCO:** The Benchmarking Universal Single-Copy Orthologs program for assessing genome completeness.\n", + "* **TransPi:** The specific transcriptome assembly pipeline. 
The notebook assumes this is pre-installed or available through Nextflow.\n", + "* **Command-line tools:** Basic Unix command-line utilities like `grep`, `ls`, `cat`, `pwd`, etc., are used throughout the notebook." + ] + }, + { + "cell_type": "markdown", + "id": "16adea33", + "metadata": {}, + "source": [ + "## Get Started" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6b81dff4-64b8-42ca-91b5-6ad4cef392ff", + "metadata": {}, + "outputs": [], + "source": [ + "#Run the command below to watch the video\n", + "from IPython.display import YouTubeVideo\n", + "\n", + "YouTubeVideo('AGuUHmSobEA', width=800, height=400)" + ] + }, + { + "cell_type": "markdown", + "id": "dc4cd97f-fc62-40ab-badc-ba0bc6f092e4", + "metadata": {}, + "source": [ + "> \n", + ">\n", + "> **Figure 1:** Annotation workflow for a new, unannotated transcriptome. \n", + "\n", + "**Step 1:** Make sure that we start back at the following local working directory: `/home/jupyter`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "62a48fca-ab66-4b62-8a33-d8104dffe1e3", + "metadata": {}, + "outputs": [], + "source": [ + "%cd /home/jupyter" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8157e9a9-3457-40cc-9405-91903564d0dd", + "metadata": {}, + "outputs": [], + "source": [ + "! pwd" + ] + }, + { + "cell_type": "markdown", + "id": "a4f6b740-f02c-4993-b082-8cb58fc46e7b", + "metadata": {}, + "source": [ + "
\n", + " \n", + " Note: The Transcriptome\n", + "
\n", + "\n", + "> The transcriptome that we are using has already been downloaded onto your local directory. It lives within the `resources` directory in the sub-directory named `trans`. It is in the file format `.fa`." + ] + }, + { + "cell_type": "markdown", + "id": "3755b541-007b-4773-a53e-f01eee1ed570", + "metadata": {}, + "source": [ + "**Step 2:** Count the sequences in this file.\n", + "\n", + "> You should get a count of 31,176." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0db05d70-513c-4ef4-92d6-8253562b43b7", + "metadata": {}, + "outputs": [], + "source": [ + "! grep -c \">\" ./resources/trans/Oncorhynchus_mykiss_GGBN01.1.fa" + ] + }, + { + "cell_type": "markdown", + "id": "7065fc1a-0a64-447c-8628-b20d5277dd3b", + "metadata": {}, + "source": [ + "
\n", + " \n", + " Note: What is the Oncorhynchus mykiss?\n", + "
\n", + "\n", + "> The Oncorhynchus mykiss is commonly known as the **Rainbow Trout**. Here is what they look like:\n", + ">\n", + "> \n", + "> \n", + ">> Image Source: https://www.ndow.org/species/rainbow-trout/" + ] + }, + { + "cell_type": "markdown", + "id": "8e2d2315-3cdd-4a5c-b58b-6f5621d53abf", + "metadata": {}, + "source": [ + "**Step 3:** Using a Perl one-liner, we will change the output directory so that the results from the `--all` run in Submodule 02 do not get overwritten." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f2061d9d-c85a-49a0-b06d-7ecd95fb5534", + "metadata": {}, + "outputs": [], + "source": [ + "! perl -i.allloc -pe 's/basicRun/onlyAnnRun/g' ./TransPi/nextflow.config" + ] + }, + { + "cell_type": "markdown", + "id": "7bb26ca2-329a-4416-97c4-190a0759790c", + "metadata": {}, + "source": [ + "
\n", + " \n", + " Note: Perl\n", + "
\n", + "\n", + "> \n", + ">\n", + ">> Image Source: https://medium.com/@aman_adastra/techspace-what-is-perl-1e29a430676c\n", + ">\n", + ">**Mini History Lesson on Perl**\n", + ">\n", + ">So what is `Perl`? Surprisingly, there is no original acronym for the name, however, some have retroactively given this one: *Practical Extraction and Reporting Language*. It has aspects of many different programming languages and has a broad variety of applications. It was created as a text manipulation language by an expert in linguistics: Larry Wall. Interestingly, Larry Wall became a linguistic expert not to create a programming language but rather as a project to develop a written language for an exclusively oral language in Africa.\n", + ">\n", + ">**Comprehending the Perl one-liner**\n", + ">```python\n", + "!perl -i.allloc -pe 's/basicRun/onlyAnnRun/g' ./TransPi/nextflow.config\n", + ">```\n", + ">- `!`: This indicates that the line is to be executed as a command line argument. \n", + ">- `perl`: This indicates that the argument is in the Perl programming language.\n", + ">- `-i.allloc`: This indicates that Perl should edit the input file and create a backup of this file with the added `.allloc` extension.\n", + ">- `-pe`: These are actually two separate options...\n", + " - `-p`: This indicates that each line in the file is to be interpreted independently and printed after it has been processed. \n", + " - `-e`: This indicates that the Perl command is to be executed. \n", + ">- `'s/basicRun/onlyAnnRun/g'`: This is the meat of the Perl one-liner. It is essentially a search and replace.\n", + " - It searches for all occurrences of the string `basicRun`, and replaces it with the string `onlyAnnRun`.\n", + ">- `./TransPi/nextflow.config`: This points to the location of the input file." + ] + }, + { + "cell_type": "markdown", + "id": "8e1541b9-abb6-47c0-aa49-5c1720680376", + "metadata": {}, + "source": [ + "**Step 4:** Now we can run TransPi using the option `--onlyAnn` which assumes that the transcriptome has been generated, and will only run the various steps for annotation of the transcripts.\n", + "\n", + ">This run should take about **29 minutes**, assuming an N1 high-memory, 16 processor 104GB instance." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3bb702b3-0a36-4a32-aca5-50dc75f9d401", + "metadata": {}, + "outputs": [], + "source": [ + "! NXF_VER=22.10.1 ./nextflow run ./TransPi/TransPi.nf \\\n", + "-profile docker --onlyAnn " + ] + }, + { + "cell_type": "markdown", + "id": "8a0f8dfb-366d-4e0f-af4e-d96f6ee97d34", + "metadata": {}, + "source": [ + "**Step 5:** As with the basic assembly example of the last workbook, the output will be arranged in a directory structure that is automatically created by Nextflow. Let's get a listing." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "595a199f-f19b-42a8-b63d-ca2a4b8e35e0", + "metadata": {}, + "outputs": [], + "source": [ + "! ls -l ./onlyAnnRun/output" + ] + }, + { + "cell_type": "markdown", + "id": "f3255502-270c-4ebb-9f72-141c4fab5c0f", + "metadata": {}, + "source": [ + "**Step 6:** Let's take a look at the `RUN_INFO.txt` file to see what the parameters and programs associated with our analysis were." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a14265d6-bd68-4f08-a55e-c472c3f23faa", + "metadata": {}, + "outputs": [], + "source": [ + "! 
cat ./onlyAnnRun/output/RUN_INFO.txt" + ] + }, + { + "cell_type": "markdown", + "id": "4187a790-276c-4bf2-8ce8-2f7985e8c662", + "metadata": {}, + "source": [ + "
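If you want to browse beyond `RUN_INFO.txt`, the per-tool summaries referenced later in this notebook should sit under the `stats` sub-directory of the same output folder (the path is assumed from the checkpoint files listed further below):

```
# List the annotation summary files produced by the --onlyAnn run
!ls -l ./onlyAnnRun/output/stats
```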
\n", + " \n", + " Note: Containers\n", + "
\n", + "\n", + ">Note that while the \"onlyAnn\" run carries out the searches against Pfam and the BLAS analysis against known proteins, it does not carry out the BUSCO analysis. We can make that happen ourselves however, to do so, we need to learn a little bit about running programs from containers.\n", + ">\n", + ">Container systems (and associated images) are one approach that simplifies the use of a broad set of programs, such as is commonly found in the wide field of computational biology. To put it concisely, most programs are not \"stand-alone\" but instead rely upon at least a few supporting libraries or auxiliary programs. Since many analyses require multiple programs, installation of the necessary programs will also require installation of the supporting components, and critically *sometimes the supporting components of one program conflict with those of other programs.* \n", + ">\n", + ">Container systems ([Docker](https://www.docker.com/) and [Singularity](https://sylabs.io/singularity/) are the two most well-known examples) address this by installing and encapsulating the program and all of its necessary supporting components in an image. Each program is then executed in the context of its container image, which is activated just long enough to run its program.\n", + ">\n", + ">Because of the way that we have run the TransPi workflow in the previous, our system will already have several container images installed. We can now work directly with these images.\n", + "\n", + "**Step 7:** Start by getting a listing of the images that are currently loaded." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7888742e-67f9-4f42-8a22-ee3314e9bdf7", + "metadata": {}, + "outputs": [], + "source": [ + "! docker images" + ] + }, + { + "cell_type": "markdown", + "id": "ec2fed5f-0d41-4fc3-8a7c-2600e2993490", + "metadata": {}, + "source": [ + "**Step 8:** Activate the BUSCO container.\n", + ">We want the Docker image that contains the program (and all necessary infrastructure) for running the BUSCO analysis. The name is in the first column, but we also need the version number, which is in the second column. So let's put that together and first activate the container and ask it to run BUSCO and just give us back the help message.\n", + ">\n", + ">We will use the `docker run` command, and we will use the following options with it:\n", + ">- `-it`, which means run interactively\n", + ">- `--rm`, which means clean up after shutting down\n", + ">- `--volume /home:/home` This is critical because, by default, a Docker image can only see the file system inside of the container image. We need to have it see our working directory, so we create a volume mapping. For simplicity, we will just map the /home directory outside the container to the same address inside. This will let us access and use all of the files that are below `/home`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "61d9539e-5f1f-4590-9b1b-1a20231acd13", + "metadata": {}, + "outputs": [], + "source": [ + "! docker run -it --rm --volume /home:/home quay.io/biocontainers/busco:5.4.3--pyhdfd78af_0 busco --help" + ] + }, + { + "cell_type": "markdown", + "id": "c282b861-9616-4c07-98cf-bbe0b96747c1", + "metadata": {}, + "source": [ + "**Step 9:** Run BUSCO (in the container)\n", + ">Now we will fill out a complete command and ask BUSCO to analyze the same trout data that we just used above. Here is the full command needed to make this run go. 
A lot is going on here:\n", + ">\n", + ">- `-i /home/jupyter/resources/trans/Oncorhynchus_mykiss_GGBN01.1.fa`: this points to the location and name of the file to be examined.\n", + ">- `-l vertebrata_odb10`: this tells BUSCO to use the vertebrata gene set (genes common to vertebrates) as the target.\n", + ">- `-o GGBN01_busco_vertebrata`: this tells BUSCO to use this as the label for the output.\n", + ">- `--out_path /home/jupyter/buscoOutput`: this tells BUSCO where to put the output directory. \n", + ">- `-m tran`: this tells BUSCO that the inputs are transcripts (rather than protein or genomic data). \n", + ">- `-c 14`: this tells BUSCO to use 14 CPUs\n", + ">\n", + "> This should take about **22 minutes**\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "400d3a6b-6d07-4483-8a57-6d795c4eb602", + "metadata": {}, + "outputs": [], + "source": [ + "#Run the command below to watch the video\n", + "from IPython.display import YouTubeVideo\n", + "\n", + "YouTubeVideo('D95mFnIjRo4', width=800, height=400)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "82296cfb-cddb-4325-a8a6-ab3eea8fdfbd", + "metadata": {}, + "outputs": [], + "source": [ + "numthreads=!lscpu | grep '^CPU(s)'| awk '{print $2-1}'\n", + "THREADS = int(numthreads[0])\n", + "! echo $THREADS" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "83d031c6-6a25-46fc-b041-e69e92fdb20b", + "metadata": {}, + "outputs": [], + "source": [ + "! docker run -it --rm --volume /home:/home quay.io/biocontainers/busco:5.4.3--pyhdfd78af_0 busco \\\n", + "-i /home/jupyter/resources/trans/Oncorhynchus_mykiss_GGBN01.1.fa \\\n", + "-l vertebrata_odb10 -o GGBN01_busco_vertebrata \\\n", + "--out_path /home/jupyter/buscoOutput -m tran -c $THREADS" + ] + }, + { + "cell_type": "markdown", + "id": "fef3d8cb-2cb2-4522-89b6-4f102ad4658f", + "metadata": {}, + "source": [ + "**Step 10:** Look at the output." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eccbb15e-b08b-48a6-8acc-f450fc98152f", + "metadata": {}, + "outputs": [], + "source": [ + "!ls ./buscoOutput/GGBN01_busco_vertebrata" + ] + }, + { + "cell_type": "markdown", + "id": "3d9da3fe-ca2c-41f7-9af9-7384c5688a47", + "metadata": {}, + "source": [ + "
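Once the container exits, the headline completeness numbers are in the short summary file inside the output directory listed above; this is the same file referenced in the checkpoint that follows:

```
# Print BUSCO's one-page summary for the rainbow trout run
!cat ./buscoOutput/GGBN01_busco_vertebrata/short_summary.specific.vertebrata_odb10.GGBN01_busco_vertebrata.txt
```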
\n", + " \n", + " Checkpoint 1: Interpret The Results \n", + "
\n", + "\n", + "> Consider the following result files:\n", + "> - The BUSCO result `./buscoOutput/GGBN01_busco_vertebrata/short_summary.specific.vertebrata_odb10.GGBN01_busco_vertebrata.txt`\n", + "> - The GO stats result `./onlyAnnRun/output/stats/Oncorhynchus_mykiss_GGBN01.sum_GO.txt`\n", + "> - The TransDecoder stats result: `./onlyAnnRun/output/stats/Oncorhynchus_mykiss_GGBN01.sum_transdecoder.txt`\n", + "\n", + "*The green cards below are interactive. Spend some time to consider the question and click on the card to check your answer.*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3721adc8-08ff-4b35-b994-1d18e61e7a91", + "metadata": {}, + "outputs": [], + "source": [ + "from jupytercards import display_flashcards\n", + "display_flashcards('../quiz-material/03-cp1-1.json')" + ] + }, + { + "cell_type": "markdown", + "id": "0028ed80-7227-49ee-b4e1-c123ae5bdfda", + "metadata": {}, + "source": [ + "> Now let's take a look at where the data came from... Consider the abstract of the [Al-Tobasel et al.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4764514/) paper published from this data.\n", + ">\n", + ">>*The ENCODE project revealed that ~70% of the human genome is transcribed. While only 1–2% of the RNAs encode for proteins, the rest are non-coding RNAs. Long non-coding RNAs (lncRNAs) form a diverse class of non-coding RNAs that are longer than 200nt. Emerging evidence indicates that lncRNAs play critical roles in various cellular processes including regulation of gene expression. LncRNAs show low levels of gene expression and sequence conservation, which make their computational identification in genomes difficult. In this study, more than two billion Illumina sequence reads were mapped to the genome reference using the TopHat and Cufflinks software. Transcripts shorter than 200nt, with more than 83–100 amino acids ORF, or with significant homologies to the NCBI nr-protein database were removed. In addition, a computational pipeline was used to filter the remaining transcripts based on a protein-coding-score test. Depending on the filtering stringency conditions, between 31,195 and 54,503 lncRNAs were identified, with only 421 matching known lncRNAs in other species. A digital gene expression atlas revealed 2,935 tissue-specific and 3,269 ubiquitously-expressed lncRNAs. 
This study annotates the lncRNA rainbow trout genome and provides a valuable resource for functional genomics research in salmonids.*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3654ccf7-de98-443a-9dd2-6563767a4be5", + "metadata": {}, + "outputs": [], + "source": [ + "display_flashcards('../quiz-material/03-cp1-2.json')" + ] + }, + { + "cell_type": "markdown", + "id": "7954beff-d647-4a1c-9b21-ad05e4fa16c3", + "metadata": {}, + "source": [ + ">**The key takeaway is to always be mindful of the data you are using before performing analysis on it.**" + ] + }, + { + "cell_type": "markdown", + "id": "17554548-3d09-4984-8ebe-4a6b26a744a1", + "metadata": {}, + "source": [ + "**Step 11:** Now let's try with one of the other transcriptomes that we downloaded from the NCBI Transcriptome Shotgun Assembly archive.\n", + "> This should take about **30 minutes**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "684820c8-ee9f-495d-8476-9f25eaa02d43", + "metadata": {}, + "outputs": [], + "source": [ + "!docker run -it --rm --volume /home:/home quay.io/biocontainers/busco:5.4.3--pyhdfd78af_0 busco \\\n", + "-i /home/jupyter/resources/trans/Pseudacris_regilla_GAEI01.1.fa \\\n", + "-l vertebrata_odb10 -o GAEI01_busco_vertebrata \\\n", + "--out_path /home/jupyter/buscoOutput -m tran -c $THREADS" + ] + }, + { + "cell_type": "markdown", + "id": "71b26cfb-9f5f-45a5-8466-7e94110c480d", + "metadata": {}, + "source": [ + "
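When this second run completes, its results land under the same `buscoOutput` directory using the `-o` label from the command above. The summary file should follow the same naming pattern as the trout run, hence the wildcard:

```
# Inspect the Pseudacris regilla BUSCO output and its short summary
!ls ./buscoOutput/GAEI01_busco_vertebrata
!cat ./buscoOutput/GAEI01_busco_vertebrata/short_summary*GAEI01_busco_vertebrata.txt
```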
\n", + " \n", + " Checkpoint 2: Your turn to run a BUSCO analysis \n", + "
\n", + "\n", + ">For this checkpoint, you will run another BUSCO analysis, however, this time you will write your own execution command. For the transcriptome used, you have two options:\n", + ">1. Within the directory that we have been using for the previous two BUSCO runs, `./resources/trans`, there is one more assembled transcriptome named `Microcaecilia_dermatophaga_GFOE01.1.fa`.\n", + ">2. Go onto the NCBI Transcriptome Shotgun Assembly archive, find your own complete, assembled transcriptome, and use that.\n", + "> - If you download the file onto your local computer, there is an upload button (up arrow) in the top left of the Jupyter interface where you can upload the file.\n", + "> - If the file you have uploaded is zipped, you will need to unzip it using the following commands: (make sure that the file name after the `>` has the `.fa` extension.)\n", + ">```python\n", + " !gzip -d -c ./PATH/TO/FILE.fsa_nt.gz > ./PATH/TO/FILE.1.fa\n", + " !rm ./PATH/TO/FILE.fsa_nt.gz\n", + ">```\n", + "> Additionally, consider trying a different lineage (`-l` selection). EZlab, the creators of BUSCO, have produced a large selection of lineages to choose from. Each one has a different set of genes that BUSCO looks for. If you decide to try a different lineage, it is recommended to choose a lineage that falls somewhere within the same family. (e.g., Don't choose the `primates_odb10` lineage if you are choosing to use a bullfrog transcriptome.)\n", + ">```python\n", + " # This will be a complete list of the available datasets\n", + " !docker run -it --rm --volume /home:/home quay.io/biocontainers/busco:5.4.3--pyhdfd78af_0 busco --list-datasets\n", + ">```\n", + "> Feel free to reference the commands for the previous BUSCO runs and the help command we ran earlier if you are stuck. Additionally, feel free to check out the [BUSCO user guide](https://busco.ezlab.org/busco_userguide.html).\n", + ">\n", + ">After the run has been complete, consider the following:\n", + ">1. How did BUSCO perform on this transcriptome? Does the transcriptome appear to be well assembled based on the provided lineage? If the results are not good, consider the possible reasons why? Is it more likely that the transcriptome chosen was not good? Or potentially a poorly chosen lineage? Or maybe something else entirely?\n", + ">2. What could be a logical biological reason the output says that there are duplicate copies of the same gene?\n", + ">3. What could be a possible reason for fragmented copies?\n", + ">4. Why is it that broader lineages such as metazoa have far fewer genes (954) that BUSCO looks for compared to more specific lineages such as mammalia which has far more genes (9226) that BUSCO looks for?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "57a46ef9-5459-4302-9a34-bf92528c9d9c", + "metadata": {}, + "outputs": [], + "source": [ + "# Put your BUSCO command here\n" + ] + }, + { + "cell_type": "markdown", + "id": "68cfd48a", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "This Jupyter Notebook demonstrated the \"annotation only\" run of TransPi, utilizing a pre-assembled transcriptome of *Oncorhynchus mykiss* (Rainbow Trout) containing 31,176 transcripts. By modifying the `nextflow.config` file and leveraging the `--onlyAnn` option, we efficiently performed annotation steps, including Pfam and BLAST analyses, without repeating the assembly process. 
Furthermore, the notebook introduced the concept of Docker containers, showcasing their use in executing BUSCO analysis for assessing transcriptome completeness. The practical application of BUSCO, along with interpretation of the resulting output files (including GO stats and TransDecoder statistics), emphasized the importance of data context and critical evaluation of transcriptome assembly quality. Finally, the notebook concluded with a hands-on exercise, prompting users to perform their own BUSCO analysis on a different transcriptome, fostering a deeper understanding of the workflow and its applications." + ] + }, + { + "cell_type": "markdown", + "id": "5bc80021", + "metadata": {}, + "source": [ + "## Clean Up\n", + "\n", + "Remember to proceed to the next notebook [`Submodule_04_gls_assembly.ipynb`](Submodule_04_gls_assembly.ipynb) or shut down your instance if you are finished." + ] + } + ], + "metadata": {}, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/GoogleCloud/Submodule_04_google_batch_assembly.ipynb b/GoogleCloud/Submodule_04_google_batch_assembly.ipynb new file mode 100644 index 0000000..26b039f --- /dev/null +++ b/GoogleCloud/Submodule_04_google_batch_assembly.ipynb @@ -0,0 +1,604 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "481f3583-186d-4d0e-b2c8-0381b6cf814f", + "metadata": {}, + "source": [ + "# MDIBL Transcriptome Assembly Learning Module\n", + "# Notebook 4: Using TransPi on Google Batch" + ] + }, + { + "cell_type": "markdown", + "id": "0512337d-7ade-44c7-832a-ae6970a7d980", + "metadata": {}, + "source": [ + "## Overview\n", + "\n", + "So far, all of the computational work executed has been run locally, using the compute resources available within this Jupyter notebook. Although this is functional, it is not the ideal setup for fast, cost-efficient data analysis.\n", + "\n", + "Google Batch is known as a scheduler, which provisions specific compute resources to be allocated for individual processes within our workflow. This provides two primary benefits:\n", + "> - Once each specific process is complete, the computer will automatically turn off, meaning that you aren't wasting any money on unused resources.\n", + "> - Multiple processes can be executed at the same time, allowing for the parallelization of computational tasks. This means that the computational process is quicker from start to finish.\n", + "\n", + "Fortunately, Batch and Nextflow are compatible with each other allowing for any Nextflow workflow, including the TransPi workflow that we have been using, to be executable on Batch.\n", + "\n", + "\n", + "> \n", + ">\n", + "> **Figure 1:** Diagram illustrating the interactions between the components used for the Google Batch run. \n", + "\n", + "For this to work, there are a few quick adjustment steps to make sure everything is set up for a Google Batch run!" + ] + }, + { + "cell_type": "markdown", + "id": "8b495639", + "metadata": {}, + "source": [ + "## Learning Objectives:\n", + "\n", + "1. **Utilize Google Batch for efficient and cost-effective data analysis:** The notebook contrasts local computation with Google Batch, highlighting the benefits of the latter in terms of cost savings (auto-shutdown of unused resources) and speed (parallelization of tasks).\n", + "\n", + "2. **Integrate Nextflow workflows with Google Batch:** The notebook demonstrates how to configure a Nextflow pipeline (TransPi) to run on Google Batch, emphasizing the compatibility between these tools.\n", + "\n", + "3. 
**Manage files using Google Cloud Storage (GCS):** The lesson requires users to create or utilize a GCS bucket to store the necessary files for the TransPi workflow, addressing the challenge of accessing local files from external compute resources.\n", + "\n", + "4. **Configure a Nextflow pipeline for Google Batch execution:** This involves modifying the `nextflow.config` file to point to the GCS bucket, adjust compute allocations (CPU and memory), and specify the correct Google Batch profile. It shows how to use Perl one-liners for efficient configuration changes.\n", + "\n", + "5. **Interpret and compare the timelines of local and Google Batch runs:** By comparing the `transpi_timeline.html` files from both local and Google Batch executions, users learn to analyze the performance differences and understand the impact of resource allocation.\n", + "\n", + "6. **Execute and manage a Nextflow pipeline on Google Batch:** The notebook provides step-by-step instructions for running TransPi on Google Batch using specific command-line arguments and managing the output.\n", + "\n", + "7. **Understand and utilize Google Cloud commands:** The notebook uses `gcloud` and `gsutil` commands extensively, teaching users basic Google Cloud command-line interactions." + ] + }, + { + "cell_type": "markdown", + "id": "1dbd972f", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "* **A Google Cloud Storage (GCS) Bucket:** A bucket is needed to store the TransPi workflow's input files and output results. The notebook provides options to create a new bucket or use an existing one.\n", + "* **Sufficient Compute Resources:** The user needs to have sufficient quota available in their GCP project to handle the compute resources required by the TransPi workflow (CPUs, memory, disk space). The notebook uses a `nextflow.config` file to configure the Google Batch execution.\n", + "* **`gcloud` CLI:** The Google Cloud SDK (`gcloud`) command-line tool must be installed and configured to authenticate with the GCP project. The notebook uses `gcloud` commands to interact with GCP services.\n", + "* **`gsutil` CLI:** The `gsutil` command-line tool (part of the Google Cloud SDK) is used to interact with GCS.\n", + "* **Nextflow:** The Nextflow workflow engine must be installed locally on the Jupyter Notebook environment.\n", + "* **TransPi Workflow:** The TransPi Nextflow pipeline code must be available in the Jupyter Notebook environment's file system. The notebook assumes it's in a `TransPi` directory.\n", + "* **Perl:** The notebook uses Perl one-liners for file manipulation. Perl must be installed in the Jupyter Notebook environment." 
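One quick way to confirm the cloud-side items in this list from inside the notebook is to query the active project and tool versions. These are standard `gcloud`/`gsutil` commands, and close variants of them appear again in the steps below:

```
# The GCP project that Batch jobs will run under
!gcloud config get-value project
# Confirm the Cloud Storage CLI is on the PATH
!gsutil version
# Confirm the pinned Nextflow version launches
!NXF_VER=22.10.1 ./nextflow -version
```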
+ ] + }, + { + "cell_type": "markdown", + "id": "9449ee77", + "metadata": {}, + "source": [ + "## Get Started" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "98a81048-5b92-4ee4-9ede-28abe3ccf5cf", + "metadata": {}, + "outputs": [], + "source": [ + "#Run the command below to watch the video\n", + "from IPython.display import YouTubeVideo\n", + "\n", + "YouTubeVideo('abw2XAg1e_g', width=800, height=400)" + ] + }, + { + "cell_type": "markdown", + "id": "14495602-64ba-44ac-9ee7-478709cee34c", + "metadata": {}, + "source": [ + "**Step 1:** Downsize the VM instance.\n", + "> Consider downloading or taking a screenshot of the following image as the downsizing process will involve briefly stopping this VM instance.\n", + ">\n", + "> " + ] + }, + { + "cell_type": "markdown", + "id": "a0ab5976-4523-4816-85b7-bcd748feb6ec", + "metadata": {}, + "source": [ + "**Step 2:** Once again we are going to set the local working directory back to `/home/jupyter`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dca889ac-dcdc-46f3-abd3-01c9d9ff3c61", + "metadata": {}, + "outputs": [], + "source": [ + "%cd /home/jupyter" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7537685b-794e-496f-97b9-a54694918bec", + "metadata": {}, + "outputs": [], + "source": [ + "!pwd" + ] + }, + { + "cell_type": "markdown", + "id": "8be921e6-dd86-48d9-a8cc-d7baa5e99d08", + "metadata": {}, + "source": [ + "
\n", + " \n", + " Note: Bucket for Batch\n", + "
\n", + "\n", + "> Batch is using external machines to do our computing work for us, which means that it is unable to find files that we have locally within this Jupyter notebook. As a result, we need to put the files that TransPi needs to run in a location that is findable from these machines: Google Cloud Storage (GCS) buckets!" + ] + }, + { + "cell_type": "markdown", + "id": "5da11b26-3914-49dd-8a8b-f796ff66626c", + "metadata": {}, + "source": [ + "**Step 3:** Create a variable for your Google project name\n", + "> - The first line is a Google Cloud command that gets the name of your project and puts it in a list named projName.\n", + "> - The second line gets the name, which is at the 0 index of the list and sets it to the variable `myProject`.\n", + "> - The third line just prints out the name." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "16f5b0d0-3634-4baa-b7c5-b37a111edb9f", + "metadata": {}, + "outputs": [], + "source": [ + "projName=!gcloud config get-value project\n", + "myProject=projName[0]\n", + "myProject" + ] + }, + { + "cell_type": "markdown", + "id": "8628b174-a2e4-4585-b332-0bd04c4df6f0", + "metadata": {}, + "source": [ + "**Step 4a:** Bucket Setup:\n", + "\n", + "Set the variable `myBucketName` to one of the following:\n", + "1. If you plan on using an existing bucket, then set it to the name of that bucket.\n", + "2. If you would like to use a new bucket, then set the variable to whatever you would like to name your new bucket. Here are some quick naming guidelines:\n", + " - You can use lowercase letters, numbers, dashes, underscores, and dots. \n", + " - The name cannot start or end in a dash, underscore, or dot.\n", + " - Keep the name within the quotes.\n", + " - More information can be found [here](https://cloud.google.com/storage/docs/buckets?_ga=2.188214954.-360038957.1673379287#naming)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bd9ba2f4-c75d-4cc0-af5e-9432919b3f7a", + "metadata": {}, + "outputs": [], + "source": [ + "myBucketName=\"your-bucket-name\"\n", + "myBucketName" + ] + }, + { + "cell_type": "markdown", + "id": "5676e5d2-f556-463e-93c2-5add30f1fff8", + "metadata": {}, + "source": [ + "**Step 4b:** Create a new GCS bucket. *If you are using an existing bucket, you can skip this step.*\n", + "> To do this, we can use a new `gsutil` command: `mb` which stands for make bucket." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3d6dbf3b-14d7-43ec-9253-2698219c2ff6", + "metadata": {}, + "outputs": [], + "source": [ + "!gsutil mb -p $myProject -c STANDARD -b on gs://$myBucketName" + ] + }, + { + "cell_type": "markdown", + "id": "a5a3b453-4692-4db3-86ab-c6234244888f", + "metadata": {}, + "source": [ + "**Step 5:** Create a Google-recognizable path variable named `gbPath`.\n", + "> You don't need to change anything, just execute." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f26b3e78-bb3c-4072-89da-41c622537dda", + "metadata": {}, + "outputs": [], + "source": [ + "gbPath=\"gs://\" + myBucketName + \"/TransPi\"\n", + "gbPath" + ] + }, + { + "cell_type": "markdown", + "id": "fe37655a-3d46-4f53-829c-16996e2d1751", + "metadata": {}, + "source": [ + "**Step 6:** Copy the `resources` directory into your bucket.\n", + "> These are the same resources that we copied to the local directory in Submodule 01." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5d39277a-1128-4b7b-a5a2-db8491bc89ca", + "metadata": {}, + "outputs": [], + "source": [ + "!gsutil -m cp -r gs://nigms-sandbox/nosi-inbremaine-storage/resources $gbPath/resources" + ] + }, + { + "cell_type": "markdown", + "id": "02f33dee-6928-41ad-84b9-da1eaa3823a3", + "metadata": {}, + "source": [ + "**Step 7A:** Adjust our `nextflow.config` file paths.\n", + "\n", + "This changes all of the pointers to our resources in the GCS bucket.\n", + "> This is a Perl one-liner that is very similar to the one used in Submodule 03." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "77db418c-ebe3-409c-9ef9-70205e856089", + "metadata": {}, + "outputs": [], + "source": [ + "!perl -i.annloc -pe s#/home/jupyter#$gbPath#g ./TransPi/nextflow.config" + ] + }, + { + "cell_type": "markdown", + "id": "abcaab62-344f-443b-99e0-345e04ead1e2", + "metadata": {}, + "source": [ + "**Step 7B:** Adjust the names of directories and add your project name to the gcb profile." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8e41b8cf-5d55-4bd2-86cc-afb24c6e492d", + "metadata": {}, + "outputs": [], + "source": [ + "!perl -i -pe 's/onlyAnnRun/basicRun/g' ./TransPi/nextflow.config" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f1e16dc8-79ff-48f2-99f4-e6a7ed8bb4c4", + "metadata": {}, + "outputs": [], + "source": [ + "!perl -i -pe s#your-project-name#$myProject#g ./TransPi/nextflow.config" + ] + }, + { + "cell_type": "markdown", + "id": "5b2417a7-6f8b-4ea6-aefb-3c36cc0d604a", + "metadata": {}, + "source": [ + "**Step 7C:** Adjust our `nextflow.config` compute allocations.\n", + "\n", + "Now that we are using separately provisioned compute resources, we can allocate more CPU power and memory to specific processes.\n", + "\n", + "> These are also Perl one-liners, but this time they are delimited with `/` instead of `#`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d5dbd33e-cb1c-4eda-b82f-92bcc00aec7c", + "metadata": {}, + "outputs": [], + "source": [ + "!perl -i -pe \"s/cpus='15'/cpus='20'/g\" ./TransPi/nextflow.config" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "42741c52-0e12-49c9-a802-40f5e41441f4", + "metadata": {}, + "outputs": [], + "source": [ + "!perl -i -pe \"s/memory='100 GB'/memory={ 100.Gb + (task.attempt * 50.Gb)}/g\" ./TransPi/nextflow.config" + ] + }, + { + "cell_type": "markdown", + "id": "eaa334e6-f3bf-4c4a-85ae-c6e1cc2e6060", + "metadata": {}, + "source": [ + "
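Before launching on Batch, it may be worth a quick grep to confirm that the substitutions above actually landed in `nextflow.config`: the paths should now point at your bucket, and the project placeholder should be gone. A minimal check:

```
# Should print gs:// paths that include your bucket name
!grep -n "gs://" ./TransPi/nextflow.config | head
# Should print nothing if the placeholder project name was replaced
!grep -n "your-project-name" ./TransPi/nextflow.config
```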
\n", + " \n", + " Checkpoint 1:\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4b80b91f-f798-4cea-956b-17c406aaf59e", + "metadata": {}, + "outputs": [], + "source": [ + "from jupytercards import display_flashcards\n", + "display_flashcards('Transcriptome-Assembly-Refinement-and-Applications/quiz-material/04-cp1-1.json')" + ] + }, + { + "cell_type": "markdown", + "id": "330f0fa9-3efa-468d-9c2b-7e6c23bec0d0", + "metadata": {}, + "source": [ + "**Step 8:** Time to run TransPi using Batch.\n", + "> This should take about **40 minutes.**" + ] + }, + { + "cell_type": "markdown", + "id": "9cd17b0d-683d-4a02-a347-26b65fddacbb", + "metadata": {}, + "source": [ + "
\n", + " \n", + " Note: gcb profile\n", + "
\n", + "\n", + "> Note that in the command, we use the profile gcb. This tells Nextflow that we want to use the gcb profile designated within the `nextflow.config` file. Here is what that profile looks like: \n", + ">```\n", + " gcb {\n", + "\tprocess.executor = 'google-batch'\n", + " process.container = 'ubuntu'\n", + " google.location = 'us-central1'\n", + " google.region = 'us-central1'\n", + " google.project = 'your-project-name'\n", + " workDir = 'gs://your-bucket-name/TransPi/basicRun/work'\n", + " params.outdir='gs://your-bucket-name/TransPi/basicRun/output'\n", + " google.batch.bootDiskSize=50.GB\n", + " google.storage.parallelThreadCount = 100\n", + " google.storage.maxParallelTransfers = 100\n", + " }\n", + ">```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fc8b9f7a-d069-444b-894a-5b974c2bb6d0", + "metadata": {}, + "outputs": [], + "source": [ + "!NXF_VER=22.10.1 ./nextflow run ./TransPi/TransPi.nf \\\n", + "-profile gcb --k 17,25,43 --maxReadLen 50 --all -resume" + ] + }, + { + "cell_type": "markdown", + "id": "f2c7b532-5ff8-451f-8cc3-01eb827133fd", + "metadata": {}, + "source": [ + "**Step 9:** Take a look at `transpi_timeline.html` and compare it to the timeline of the local run.\n", + "\n", + ">First we have to make a local directory to place the output." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e2188706-1cbe-4555-8294-96e897999fb3", + "metadata": {}, + "outputs": [], + "source": [ + "!mkdir GCBbasicRun\n", + "!mkdir ./GCBbasicRun/output" + ] + }, + { + "cell_type": "markdown", + "id": "ae11c423-042b-42ca-b46f-757945d27257", + "metadata": {}, + "source": [ + ">Now we can copy over the `pipeline_info` from the bucket to our new local bucket." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d7e8f4eb-0e3c-45ce-a5b8-042d66592fe7", + "metadata": {}, + "outputs": [], + "source": [ + "!gsutil -m cp -r $gbPath/basicRun/output/pipeline_info ./GCBbasicRun/output" + ] + }, + { + "cell_type": "markdown", + "id": "eccd2852-d47d-498a-b480-c70074c94f42", + "metadata": {}, + "source": [ + ">Now we can visualize both the local and GCB run." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "95b61178-8282-442e-bd05-e72d11b05ac3", + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import IFrame\n", + "IFrame(\"../GCBbasicRun/output/pipeline_info/transpi_timeline.html\",width=1200, height=900)" + ] + }, + { + "cell_type": "markdown", + "id": "15338b8b-fdc6-43ce-8140-a5011a328451", + "metadata": {}, + "source": [ + "> **Figure 1:** GCB Run Timeline" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b68d2ac1-46cb-421e-b6bd-2394d831ea81", + "metadata": {}, + "outputs": [], + "source": [ + "IFrame('../basicRun/output/pipeline_info/transpi_timeline.html',width=1200, height=900)" + ] + }, + { + "cell_type": "markdown", + "id": "5b514636-6ab2-47cc-89fd-61e4a71a8ed4", + "metadata": {}, + "source": [ + "> **Figure 2:** Local Run Timeline Above" + ] + }, + { + "cell_type": "markdown", + "id": "5fdb6344-3e3f-4a4c-ae4c-9a1aff01a1a7", + "metadata": {}, + "source": [ + "
\n", + " \n", + " Checkpoint 2:\n", + "
\n", + "\n", + "Consider the two figures that you just generated. In the markdown cell below, take some notes on the similarities and differences between the timelines of the two runs." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "32687e6e-9db8-4a02-955a-508401ae7984", + "metadata": {}, + "outputs": [], + "source": [ + "from jupytercards import display_flashcards\n", + "display_flashcards('../quiz-material/04-cp1-2.json')\n", + "display_flashcards('../quiz-material/04-cp1-3.json')\n", + "display_flashcards('../quiz-material/04-cp1-4.json')" + ] + }, + { + "cell_type": "markdown", + "id": "f4335a50-39fd-4579-aafc-26b6de21ede7", + "metadata": {}, + "source": [ + "**Step 10:** Now let's try a GCB run with `--onlyAnn`. Before we do, we need to change our `workDir` and `outDir` paths in the `nextflow.config` so that it does not overwrite the output that we just created for the `--all` run." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0b98521f-c608-49a6-9761-2c97d8203da2", + "metadata": {}, + "outputs": [], + "source": [ + "!perl -i.allgcb -pe 's#basicRun#onlyAnnRun#g' ./TransPi/nextflow.config" + ] + }, + { + "cell_type": "markdown", + "id": "9c39fb97-2a2c-4eeb-8cca-9ff6ed54afa6", + "metadata": {}, + "source": [ + "**Step 11:** Time to run. The only change that we will make to the run command is to change `--all` to `--onlyAnn`\n", + "> This run should take about **30 minutes**." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3d032206-02bc-473e-8b2e-c313a1f53e1e", + "metadata": {}, + "outputs": [], + "source": [ + "!NXF_VER=22.10.1 ./nextflow run ./TransPi/TransPi.nf \\\n", + "-profile gcb --onlyAnn " + ] + }, + { + "cell_type": "markdown", + "id": "37e74889-f236-484a-975d-1c565bb08a49", + "metadata": {}, + "source": [ + "Feel free to explore the results found in the GCB `onlyAnn` run. The following cell will place the `pipeline_info` directory from the run into the directory: `./GCBonlyAnnRun/output`. The rest of the results should be essentially the same as the `onlyAnn` run locally in Submodule 03." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2489aca9-812c-4d6f-aa18-000bd72a5bdc", + "metadata": {}, + "outputs": [], + "source": [ + "!mkdir GCBonlyAnnRun\n", + "!mkdir ./GCBonlyAnnRun/output\n", + "!gsutil -m cp -r $gbPath/onlyAnnRun/output/pipeline_info ./GCBonlyAnnRun/output" + ] + }, + { + "cell_type": "markdown", + "id": "96722e89-2d6a-4381-ba42-673e9be79a2e", + "metadata": {}, + "source": [ + "##### At this point, you have the toolkit necessary to run TransPi in various configurations and the baseline knowledge to interpret the output that TransPi produces. You also have the foundational knowledge of Google Cloud resources with the ability to utilize buckets and cloud computing to execute your computational task. Specifically, Batch which not only works with TransPi but also with any other Nextflow pipeline. We urge you to continue exploring TransPi, using different data sets, and also to explore other Nextflow pipelines as well." + ] + }, + { + "cell_type": "markdown", + "id": "5213f6a1", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "This module demonstrated the execution of the TransPi transcriptome assembly workflow on Google Batch, a significant advancement from local Jupyter Notebook execution. 
By leveraging Google Batch's scheduling capabilities, we achieved both cost efficiency through automated resource allocation and increased speed through parallelization of computational tasks. The integration of Nextflow with Google Batch streamlined the process, requiring only minor adjustments to the `nextflow.config` file to redirect file paths to Google Cloud Storage (GCS) buckets and optimize compute allocations. Comparison of local and Google Batch run timelines highlighted the benefits of cloud computing for large-scale bioinformatics analyses. This learning module equipped users with the skills to effectively utilize Google Batch for efficient and scalable execution of Nextflow pipelines, paving the way for more complex and data-intensive bioinformatics projects." + ] + }, + { + "cell_type": "markdown", + "id": "2661513f", + "metadata": {}, + "source": [ + "## Clean Up\n", + "\n", + "You would proceed to the next notebook [`Submodule_05_Bonus_Notebook.ipynb`](./Submodule_05_Bonus_Notebook.ipynb) or shut down your instance if you are finished." + ] + } + ], + "metadata": {}, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/GoogleCloud/Submodule_05_Bonus_Notebook.ipynb b/GoogleCloud/Submodule_05_Bonus_Notebook.ipynb new file mode 100644 index 0000000..d884c61 --- /dev/null +++ b/GoogleCloud/Submodule_05_Bonus_Notebook.ipynb @@ -0,0 +1,353 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "6eb2c6fb-13d9-4461-ad74-44262079211c", + "metadata": {}, + "source": [ + "# MDIBL Transcriptome Assembly Learning Module\n", + "# Bonus notebook: Using TransPi on a new dataset" + ] + }, + { + "cell_type": "markdown", + "id": "c38bba56-40d9-4ca4-b58b-b9733b424b1f", + "metadata": {}, + "source": [ + "## Overview\n", + "In this notebook, we are going to explore how to run this module with a new dataset. These submodules provide a great framework for running a rigorous and scalable transcriptome assembly, but there are some considerations that must be made in order to run this with your own data. We will walk through that process here so that hopefully, you are able to take these notebooks to your research group and use them for your own analysis." + ] + }, + { + "cell_type": "markdown", + "id": "80044322-a021-4fdc-ad83-504961bd1919", + "metadata": {}, + "source": [ + "The data we are using here comes from SRA. In this example, we are using data from an experiment that compared RNA sequences in honeybees with and without viral infections. The BioProject ID is [PRJNA274674](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA274674). This experiment includes 6 RNA-seq samples and 2 methylation-seq samples. We are only considering the RNA-seq data here. Additionally, we have subsampled them to about 2 millions reads collectively accross all of the samples. In a real analysis this would not be a good idea, but to keep costs and runtimes low we will use the down-sampled files in this demonstration. If you want to explore the full dataset, we recommend pulling the fastq files using the [STRIDES tutorial on SRA downloads](https://github.com/STRIDES/NIHCloudLabGCP/blob/main/notebooks/SRADownload/SRA-Download.ipynb). As with the original example in this module, we have concatenated all 6 files into one set of combined fastq files called joined_R{1,2}.fastq.gz We have stored the subsampled fastq files in this module's cloud storage bucket." + ] + }, + { + "cell_type": "markdown", + "id": "ae57ad92", + "metadata": {}, + "source": [ + "## Learning Objectives:\n", + "\n", + "1. 
**Adapting a Nextflow workflow:** The notebook demonstrates how to modify a Nextflow pipeline's configuration to point to a new dataset, highlighting the workflow's reusability and flexibility. This involves understanding how to change input parameters within a configuration file.\n", + "\n", + "2. **Data preparation and management:** Users learn how to download and manage data from the SRA (Sequence Read Archive) using `gsutil` (although a pre-downloaded, subsampled dataset is provided for convenience). This includes understanding file organization and paths.\n", + "\n", + "3. **Software installation and environment setup:** The notebook guides users through installing necessary software (Java, Mamba, sra-tools, perl modules, Nextflow) and setting up the computational environment. This emphasizes reproducibility and dependency management.\n", + "\n", + "4. **Running a transcriptome assembly:** The notebook shows how to execute the TransPi Nextflow pipeline with the new dataset, demonstrating the complete process from data input to (presumably) assembly output." + ] + }, + { + "cell_type": "markdown", + "id": "e6a8c2f6", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "* **Java:** The notebook installs the default JDK.\n", + "* **Miniforge** Used for package management.\n", + "* **sra-tools, perl-dbd-sqlite, perl-dbi:** Bioinformatics tools for working with SRA data.\n", + "* **Nextflow:** A workflow management system.\n", + "* **Docker** Either Docker pre-installed on the VM, or permissions to install and run Docker containers.\n", + "* **`gsutil`:** The Google Cloud Storage command-line tool." + ] + }, + { + "cell_type": "markdown", + "id": "27475529", + "metadata": {}, + "source": [ + "## Get Started" + ] + }, + { + "cell_type": "markdown", + "id": "dcf2a2d0-bc91-4a2a-9db0-62f1eee91f92", + "metadata": {}, + "source": [ + "Before we start any analysis, let's set up the environment just like we did in Submodule_01 and Submodule_02 where we move to the correct directory and install software." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "795e3329-21d8-4de0-91c4-58c81371a712", + "metadata": {}, + "outputs": [], + "source": [ + "%cd /home/jupyter" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c9701541-9c05-4a02-abe7-e1126826efe3", + "metadata": {}, + "outputs": [], + "source": [ + "!pwd" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2d0c3012-27ca-4eb7-a106-6532f62cbc82", + "metadata": {}, + "outputs": [], + "source": [ + "#update java\n", + "! sudo apt update\n", + "! sudo apt-get install default-jdk -y\n", + "! java -version" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3d6ffb05-19bb-42c4-a628-6d874c5d3517", + "metadata": {}, + "outputs": [], + "source": [ + "# install Miniforge\n", + "! curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh\n", + "! 
bash Miniforge3-$(uname)-$(uname -m).sh -b -p $HOME/miniforge" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7df1c464-b102-4e34-8f5f-57dbfc7f43b8", + "metadata": {}, + "outputs": [], + "source": [ + "# add Miniforge to your path\n", + "import os\n", + "os.environ[\"PATH\"] += os.pathsep + os.environ[\"HOME\"]+\"/miniforge/bin\"" + ] + }, + { + "cell_type": "markdown", + "id": "39bb00de-3481-4cb0-a2fe-098cfdae51a6", + "metadata": {}, + "source": [ + "Use Miniforge to install `sra-tools perl-dbd-sqlite perl-dbi` from the bioconda channel.\n", + "\n", + "
\n", + " Click for help\n", + "\n", + "```\n", + "mamba install -c bioconda sra-tools perl-dbd-sqlite perl-dbi -y\n", + "```\n", + " \n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d1b80a47-89c9-469c-a1f8-9ee3c1817fa7", + "metadata": {}, + "outputs": [], + "source": [ + "! " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "21ae7ca3-c9e1-4160-a29f-37a53b8265ab", + "metadata": {}, + "outputs": [], + "source": [ + "#install Nextflow\n", + "! curl https://get.nextflow.io | bash\n", + "! chmod +x nextflow\n", + "! ./nextflow self-update" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f7d6f314-9af1-428f-85ea-6d02e527a6a1", + "metadata": {}, + "outputs": [], + "source": [ + "# Copy the software from gs://nigms-sandbox/nosi-inbremaine-storage/TransPi\n", + "! " + ] + }, + { + "cell_type": "markdown", + "id": "fbb26d8b-21ac-4907-b25a-3bd39b853d1b", + "metadata": {}, + "source": [ + "
\n", + "    Click for help\n", + "\n", + "```\n", + "gsutil -m cp -r gs://nigms-sandbox/nosi-inbremaine-storage/TransPi ./\n", + "```\n", + "    \n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e5bdc3fa-7b78-4329-ae17-49ffce2085bb", + "metadata": {}, + "outputs": [], + "source": [ + "# Copy the data from gs://nigms-sandbox/nosi-inbremaine-storage/resources\n", + "! " + ] + }, + { + "cell_type": "markdown", + "id": "d73e5d6b-9b3f-46e2-9c5c-24713a2ad55c", + "metadata": {}, + "source": [ + "
\n", + "    Click for help\n", + "\n", + "```\n", + "gsutil -m cp -r gs://nigms-sandbox/nosi-inbremaine-storage/resources ./\n", + "```\n", + "    \n", + "
 " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "65350b31-588b-42d9-90b4-849a523be021", + "metadata": {}, + "outputs": [], + "source": [ + "# Make the TransPi scripts executable\n", + "!chmod -R +x ./TransPi/bin" + ] + }, + { + "cell_type": "markdown", + "id": "15d79bb0-9f4c-469a-b2e6-3379e68f8f73", + "metadata": {}, + "source": [ + "Let's have a look at what we've downloaded to make sure it's there." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9c5606e2-5c0c-423c-9e32-40bd550c19cb", + "metadata": {}, + "outputs": [], + "source": [ + "! ls ./resources/seq2" + ] + }, + { + "cell_type": "markdown", + "id": "08ae572a-fe6d-4852-a11c-5a2449c1d6b2", + "metadata": {}, + "source": [ + "You should see the joined fastq files alongside the others that we used in the previous submodules. Now let's adjust the workflow to run on them." + ] + }, + { + "cell_type": "markdown", + "id": "331a3857-7734-41e4-819a-de3603b9c95b", + "metadata": {}, + "source": [ + "One of the great benefits of using a workflow manager like Nextflow is that it allows easy swapping of input samples without drastic changes to the code. In the true spirit of reproducible workflows, the only change needed to run the joined samples is to adjust the `reads` line in the `params` section of `nextflow.config` so that it points to the new reads location. In the code cell below, write the updated reads path that you would add to the config file." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9a7bb7e0-f7eb-4166-96a9-b25b0885479c", + "metadata": {}, + "outputs": [], + "source": [ + "# " + ] + }, + { + "cell_type": "markdown", + "id": "c22622e3-56a7-42bb-9884-abd39c72d6e3", + "metadata": {}, + "source": [ + "
\n", + " Click for help\n", + " \n", + "\n", + "\n", + "```\n", + "// Directory for reads\n", + "reads=\"/home/jupyter/resources/seq2/joined*R[1,2].fastq.gz\"\n", + "```\n", + " \n", + " \n", + "
\n" + ] + }, + { + "cell_type": "markdown", + "id": "fd94aacc-2fb8-4a99-8378-2883b723253a", + "metadata": {}, + "source": [ + "After this change, you should be able to run the same Nextflow command as you did in Submodule_02 and everything will progress automatically." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "24663040-553b-4434-83a8-93fb9d2fd58f", + "metadata": {}, + "outputs": [], + "source": [ + "! NXF_VER=22.10.1 ./nextflow run \\\n", + " ./TransPi/TransPi.nf \\\n", + " -profile docker \\\n", + " --k 17,25,43 \\\n", + " --maxReadLen 50 \\\n", + " --all " + ] + }, + { + "cell_type": "markdown", + "id": "57dc8622-fb61-4ce6-ad40-48fc710a4713", + "metadata": {}, + "source": [ + "With the subsampled reads, the assembly should complete in about 2 hours using a n1-highmem-16 machine." + ] + }, + { + "cell_type": "markdown", + "id": "38abe476", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "This notebook demonstrated the adaptability of the MDIBL Transcriptome Assembly Learning Module's TransPi pipeline by applying it to a new RNA-Seq dataset from a honeybee viral infection study (PRJNA274674). While utilizing a subsampled dataset for demonstration purposes, the process highlighted the ease of integrating new data into the existing Nextflow workflow. By simply modifying the `nextflow.config` file to specify the new reads' location, the pipeline executed seamlessly, showcasing its robustness and reproducibility. This adaptability makes the module a valuable resource for researchers seeking to perform scalable and rigorous transcriptome assemblies on their own datasets, facilitating efficient and reproducible analyses within their research groups. The successful execution underscores the power of workflow management systems like Nextflow for streamlining bioinformatics analyses." + ] + }, + { + "cell_type": "markdown", + "id": "7f7d2cab", + "metadata": {}, + "source": [ + "## Clean Up\n", + "\n", + "Shut down your instance if you are finished." 
+ ] + } + ], + "metadata": {}, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/GoogleCloud/images/AnnotationProcess.png b/GoogleCloud/images/AnnotationProcess.png new file mode 100644 index 0000000..11781db Binary files /dev/null and b/GoogleCloud/images/AnnotationProcess.png differ diff --git a/GoogleCloud/images/MDI-course-card-2.png b/GoogleCloud/images/MDI-course-card-2.png new file mode 100644 index 0000000..06fea92 Binary files /dev/null and b/GoogleCloud/images/MDI-course-card-2.png differ diff --git a/GoogleCloud/images/RNA-Seq_Notebook_Homepage.png b/GoogleCloud/images/RNA-Seq_Notebook_Homepage.png new file mode 100644 index 0000000..f0812b3 Binary files /dev/null and b/GoogleCloud/images/RNA-Seq_Notebook_Homepage.png differ diff --git a/GoogleCloud/images/Setup10.png b/GoogleCloud/images/Setup10.png new file mode 100644 index 0000000..a5f75d3 Binary files /dev/null and b/GoogleCloud/images/Setup10.png differ diff --git a/GoogleCloud/images/Setup11.png b/GoogleCloud/images/Setup11.png new file mode 100644 index 0000000..134c9fc Binary files /dev/null and b/GoogleCloud/images/Setup11.png differ diff --git a/GoogleCloud/images/Setup12.png b/GoogleCloud/images/Setup12.png new file mode 100644 index 0000000..9484eec Binary files /dev/null and b/GoogleCloud/images/Setup12.png differ diff --git a/GoogleCloud/images/Setup13.png b/GoogleCloud/images/Setup13.png new file mode 100644 index 0000000..fec92ae Binary files /dev/null and b/GoogleCloud/images/Setup13.png differ diff --git a/GoogleCloud/images/Setup14.png b/GoogleCloud/images/Setup14.png new file mode 100644 index 0000000..7284283 Binary files /dev/null and b/GoogleCloud/images/Setup14.png differ diff --git a/GoogleCloud/images/Setup15.png b/GoogleCloud/images/Setup15.png new file mode 100644 index 0000000..8a2b16e Binary files /dev/null and b/GoogleCloud/images/Setup15.png differ diff --git a/GoogleCloud/images/Setup16.png b/GoogleCloud/images/Setup16.png new file mode 100644 index 0000000..00a6bf1 Binary files /dev/null and b/GoogleCloud/images/Setup16.png differ diff --git a/GoogleCloud/images/Setup17.png b/GoogleCloud/images/Setup17.png new file mode 100644 index 0000000..e802eea Binary files /dev/null and b/GoogleCloud/images/Setup17.png differ diff --git a/GoogleCloud/images/Setup18.png b/GoogleCloud/images/Setup18.png new file mode 100644 index 0000000..d385738 Binary files /dev/null and b/GoogleCloud/images/Setup18.png differ diff --git a/GoogleCloud/images/Setup19.png b/GoogleCloud/images/Setup19.png new file mode 100644 index 0000000..080f96e Binary files /dev/null and b/GoogleCloud/images/Setup19.png differ diff --git a/GoogleCloud/images/Setup2.png b/GoogleCloud/images/Setup2.png new file mode 100644 index 0000000..1a2fff5 Binary files /dev/null and b/GoogleCloud/images/Setup2.png differ diff --git a/GoogleCloud/images/Setup20.png b/GoogleCloud/images/Setup20.png new file mode 100644 index 0000000..fdbb802 Binary files /dev/null and b/GoogleCloud/images/Setup20.png differ diff --git a/GoogleCloud/images/Setup21.png b/GoogleCloud/images/Setup21.png new file mode 100644 index 0000000..9826ac7 Binary files /dev/null and b/GoogleCloud/images/Setup21.png differ diff --git a/GoogleCloud/images/Setup22.png b/GoogleCloud/images/Setup22.png new file mode 100644 index 0000000..e39d867 Binary files /dev/null and b/GoogleCloud/images/Setup22.png differ diff --git a/GoogleCloud/images/Setup23.png b/GoogleCloud/images/Setup23.png new file mode 100644 index 0000000..19fcd08 Binary files /dev/null and 
b/GoogleCloud/images/Setup23.png differ diff --git a/GoogleCloud/images/Setup24.png b/GoogleCloud/images/Setup24.png new file mode 100644 index 0000000..dc0b879 Binary files /dev/null and b/GoogleCloud/images/Setup24.png differ diff --git a/GoogleCloud/images/Setup25.png b/GoogleCloud/images/Setup25.png new file mode 100644 index 0000000..2e32a69 Binary files /dev/null and b/GoogleCloud/images/Setup25.png differ diff --git a/GoogleCloud/images/Setup3.png b/GoogleCloud/images/Setup3.png new file mode 100644 index 0000000..be25fbe Binary files /dev/null and b/GoogleCloud/images/Setup3.png differ diff --git a/GoogleCloud/images/Setup4.png b/GoogleCloud/images/Setup4.png new file mode 100644 index 0000000..5d9346c Binary files /dev/null and b/GoogleCloud/images/Setup4.png differ diff --git a/GoogleCloud/images/Setup5.png b/GoogleCloud/images/Setup5.png new file mode 100644 index 0000000..a1040b1 Binary files /dev/null and b/GoogleCloud/images/Setup5.png differ diff --git a/GoogleCloud/images/Setup6.png b/GoogleCloud/images/Setup6.png new file mode 100644 index 0000000..b37e6f4 Binary files /dev/null and b/GoogleCloud/images/Setup6.png differ diff --git a/GoogleCloud/images/Setup7.png b/GoogleCloud/images/Setup7.png new file mode 100644 index 0000000..1546adf Binary files /dev/null and b/GoogleCloud/images/Setup7.png differ diff --git a/GoogleCloud/images/Setup8.png b/GoogleCloud/images/Setup8.png new file mode 100644 index 0000000..f18d5b7 Binary files /dev/null and b/GoogleCloud/images/Setup8.png differ diff --git a/GoogleCloud/images/Setup9.png b/GoogleCloud/images/Setup9.png new file mode 100644 index 0000000..471d727 Binary files /dev/null and b/GoogleCloud/images/Setup9.png differ diff --git a/GoogleCloud/images/TransPiWorkflow.png b/GoogleCloud/images/TransPiWorkflow.png new file mode 100644 index 0000000..458c794 Binary files /dev/null and b/GoogleCloud/images/TransPiWorkflow.png differ diff --git a/GoogleCloud/images/VMdownsize.jpg b/GoogleCloud/images/VMdownsize.jpg new file mode 100644 index 0000000..b880b09 Binary files /dev/null and b/GoogleCloud/images/VMdownsize.jpg differ diff --git a/GoogleCloud/images/architecture_diagram.png b/GoogleCloud/images/architecture_diagram.png new file mode 100644 index 0000000..0bcae11 Binary files /dev/null and b/GoogleCloud/images/architecture_diagram.png differ diff --git a/GoogleCloud/images/basic_assembly.png b/GoogleCloud/images/basic_assembly.png new file mode 100644 index 0000000..99607ee Binary files /dev/null and b/GoogleCloud/images/basic_assembly.png differ diff --git a/GoogleCloud/images/cellMenu.png b/GoogleCloud/images/cellMenu.png new file mode 100644 index 0000000..01ccce2 Binary files /dev/null and b/GoogleCloud/images/cellMenu.png differ diff --git a/GoogleCloud/images/deBruijnGraph.png b/GoogleCloud/images/deBruijnGraph.png new file mode 100644 index 0000000..03a98c9 Binary files /dev/null and b/GoogleCloud/images/deBruijnGraph.png differ diff --git a/GoogleCloud/images/fileDemo.png b/GoogleCloud/images/fileDemo.png new file mode 100644 index 0000000..c806097 Binary files /dev/null and b/GoogleCloud/images/fileDemo.png differ diff --git a/GoogleCloud/images/gcbDiagram.jpg b/GoogleCloud/images/gcbDiagram.jpg new file mode 100644 index 0000000..f7fe18a Binary files /dev/null and b/GoogleCloud/images/gcbDiagram.jpg differ diff --git a/GoogleCloud/images/glsDiagram.png b/GoogleCloud/images/glsDiagram.png new file mode 100644 index 0000000..f749360 Binary files /dev/null and b/GoogleCloud/images/glsDiagram.png differ diff --git 
a/GoogleCloud/images/jupyterRuntime.png b/GoogleCloud/images/jupyterRuntime.png new file mode 100644 index 0000000..e9fe8db Binary files /dev/null and b/GoogleCloud/images/jupyterRuntime.png differ diff --git a/GoogleCloud/images/jupyterRuntimeCircle.png b/GoogleCloud/images/jupyterRuntimeCircle.png new file mode 100644 index 0000000..84c7790 Binary files /dev/null and b/GoogleCloud/images/jupyterRuntimeCircle.png differ diff --git a/GoogleCloud/images/mdibl-compbio-core-logo-eurostyle.jpg b/GoogleCloud/images/mdibl-compbio-core-logo-eurostyle.jpg new file mode 100644 index 0000000..e338aef Binary files /dev/null and b/GoogleCloud/images/mdibl-compbio-core-logo-eurostyle.jpg differ diff --git a/GoogleCloud/images/mdibl-compbio-core-logo-square.jpg b/GoogleCloud/images/mdibl-compbio-core-logo-square.jpg new file mode 100644 index 0000000..308994b Binary files /dev/null and b/GoogleCloud/images/mdibl-compbio-core-logo-square.jpg differ diff --git a/GoogleCloud/images/module_concept.png b/GoogleCloud/images/module_concept.png new file mode 100644 index 0000000..a45caf8 Binary files /dev/null and b/GoogleCloud/images/module_concept.png differ diff --git a/GoogleCloud/images/perl-logo.png b/GoogleCloud/images/perl-logo.png new file mode 100644 index 0000000..2894eca Binary files /dev/null and b/GoogleCloud/images/perl-logo.png differ diff --git a/GoogleCloud/images/rainbowTrout.jpeg b/GoogleCloud/images/rainbowTrout.jpeg new file mode 100644 index 0000000..a1ff954 Binary files /dev/null and b/GoogleCloud/images/rainbowTrout.jpeg differ diff --git a/GoogleCloud/images/transpi_workflow.png b/GoogleCloud/images/transpi_workflow.png new file mode 100644 index 0000000..a9da75b Binary files /dev/null and b/GoogleCloud/images/transpi_workflow.png differ diff --git a/GoogleCloud/images/workflow_concept.png b/GoogleCloud/images/workflow_concept.png new file mode 100644 index 0000000..715cc32 Binary files /dev/null and b/GoogleCloud/images/workflow_concept.png differ diff --git a/README.md b/README.md index 1b2fd02..ff34591 100644 --- a/README.md +++ b/README.md @@ -3,85 +3,36 @@ # MDI Biological Laboratory RNA-seq Transcriptome Assembly Module --------------------------------- - -## Three primary and interlinked learning goals: -1. From a *biological perspective*, demonstration of the **process of transcriptome assembly** from raw RNA-seq data. -2. From a *computational perspective*, demonstration of **computing using workflow management and container systems**. -3. Also from an *infrastructure perspective*, demonstration of **carrying out these analyses efficiently in a cloud environment.** - - - -# Quick Overview -This module teaches you how to perform a short-read RNA-seq Transcriptome Assembly with Google Cloud Platform using a Nextflow pipeline, and eventually using the Google Batch API. In addition to the overview given in this README, you will find three Jupyter notebooks that teach you different components of RNA-seq in the cloud. - -This module will cost you about $7.00 to run end to end, assuming you shutdown and delete all resources upon completion. 
- - ## Contents -+ [Getting Started](#getting-started) ++ [Overview](#overview) ++ [Learning goals](#learning-goals) + [Biological Problem](#biological-problem) -+ [Set Up](#set-up) -+ [Software Requirements](#software-requirements) + [Workflow Diagrams](#workflow-diagrams) + [Data](#data) + [Troubleshooting](#troubleshooting) + [Funding](#funding) + [License for Data](#license-for-data) -## **Getting Started** -This learning module includes tutorials and execution scripts in the form of Jupyter notebooks. The purpose of these tutorials is to help users familiarize themselves with cloud computing in the specific context of running bioinformatics workflows to prep for and to carry out a transcriptome assembly, refinement, and annotation. These tutorials do this by utilizing a recently published Nextflow workflow (TransPi [manuscript](https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.13593), [repository](https://github.com/palmuc/TransPi), and [user guide](https://palmuc.github.io/TransPi/)), which manages and passes data between several state-of-the-art programs, carrying out the processes from initial quality control and normalization, through assembly with several tools, refinement and assessment, and finally annotation of the final putative transcriptome. - -Since the work is managed by this pipeline, the notebooks will focus on setting up and running the pipeline, followed by an examination of some of the wide range of outputs produced. We will also demonstrate how to retrieve the complete results directory so that users can examine more extensively on their own computing systems going step-by-step through specific workflows. These workflows cover the start to finish of basic bioinformatics analysis; starting from raw sequence data and carrying out the steps needed to generate a final assembled and annotated transcriptome. - -We also put an emphasis on understanding how workflows execute, using the specific example of the Nextflow (https://www.nextflow.io) workflow engine, and on using workflow engines as supported by cloud infrastructure, using the specific example of the Google Batch API (https://cloud.google.com/batch). +## Overview +This module teaches you how to perform a short-read RNA-seq Transcriptome Assembly with a Cloud Computing Platform using a Nextflow pipeline. In addition to the overview given in this README, you will find README related to each platform (AWS, Google Cloud) and Jupyter notebooks that teach you different components of RNA-seq in the cloud. -![technical infrastructure](/images/architecture_diagram.png) - -**Figure 1:** The technical infrastructure diagram for this project. +## Learning goals: +1. From a *biological perspective*, demonstration of the **process of transcriptome assembly** from raw RNA-seq data. +2. From a *computational perspective*, demonstration of **computing using workflow management and container systems**. +3. Also from an *infrastructure perspective*, demonstration of **carrying out these analyses efficiently in a cloud environment.** ## **Biological Problem** The combination of increased availability and reduced expense in obtaining high-throughput sequencing has made transcriptome profiling analysis (primarily with RNA-seq) a standard tool for the molecular characterization of widely disparate biological systems. 
Researchers working in common model organisms, such as mouse or zebrafish, have relatively easy access to the necessary resources (e.g., well-assembled genomes and large collections of predicted/verified transcripts), for the analysis and interpretation of their data. In contrast, researchers working on less commonly studied organisms and systems often must develop these resources for themselves. Transcriptome assembly is the broad term used to describe the process of estimating many (or ideally all) of an organism’s transcriptome based on the large-scale but fragmentary data provided by high-throughput sequencing. A "typical" RNA-seq dataset will consist of tens of millions of reads or read-pairs, with each contiguous read representing up to 150 nucleotides in the sequence. Complete transcripts, in contrast, typically range from hundreds to tens of thousands of nucleotides in length. In short, and leaving out the technical details, the process of assembling a transcriptome from raw reads (Figure 2) is to first make a "best guess" segregation of the reads into subsets that are most likely derived from one (or a small set of related/similar genes), and then for each subset, build a most-likely set of transcripts and genes. -![basic transcriptome assembly](/images/basic_assembly.png) +![basic transcriptome assembly](./images/basic_assembly.png) **Figure 2:** The process from raw reads to first transcriptome assembly. Once a new transcriptome is generated, assessed, and refined, it must be annotated with putative functional assignments to be of use in subsequent functional studies. Functional annotation is accomplished through a combination of assignment of homology-based and ab initio methods. The most well-established homology-based processes are the combination of protein-coding sequence prediction followed by protein sequence alignment to databases of known proteins, especially those from human or common model organisms. Ab initio methods use computational models of various features (e.g., known protein domains, signal peptides, or peptide modification sites) to characterize either the transcript or its predicted protein product. This training module will cover multiple approaches to the annotation of assembled transcriptomes. -## **Set Up** - -#### Part 1: Setting up Environment - -**Enable APIs and create a Nextflow Sercice Account** - -If you are using Nextflow outside of NIH CloudLab you must enable the required APIs, set up a service account, and add your service account to your notebook permissions before creating the notebook. Follow sections 1 and 2 of the accompanying [how to document](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/HowToCreateNextflowServiceAccount.md) for instructions. If you are executing this tutorial with an NIH CloudLab account your default Compute Engine service account will have all required IAM roles to run the nextflow portion. - -**Create the Vertex AI Instance** - -Follow the steps highlighted [here](https://github.com/STRIDES/NIHCloudLabGCP/blob/main/docs/vertexai.md) to create a new user-managed notebook in Vertex AI. Follow steps 1-8 and be especially careful to enable idle shutdown as highlighted in step 7. For this module you should select **Debian 11** and **Python3** in the Environment tab in step 5. In step 6 in the Machine type tab, select **n1-highmem-16** from the dropdown box. This will provide you with 16 vCPUs and 104 GB of RAM which may feel like a lot but is necessary for TransPi to run. 
- - -#### Part 2: Adding the Modules to the Notebook - -1. From the Launcher in your new VM, Click the Terminal option. -![setup 22](images/Setup22.png) -2. Next, paste the following git command to get a copy of everything within this repository, including all of the submodules. - -> ```git clone https://github.com/NIGMS/Transcriptome-Assembly-Refinement-and-Applications.git``` -3. You are now all set! - -**WARNING:** When you are not using the notebook, stop it. This will prevent you from incurring costs while you are not using the notebook. You can do this in the same window as where you opened the notebook. Make sure that you have the notebook selected ![setup 23](images/Setup23.png). Then click the ![setup 24](images/Setup24.png). When you want to start up the notebook again, do the same process except click the ![setup 25](images/Setup25.png) instead. - -## **Software Requirements** - -All of the software requirements are taken care of and installed within [Submodule_01_prog_setup.ipynb](./Submodule_01_prog_setup.ipynb). The key pieces of software needed are: -1. [Nextflow workflow system](https://www.nextflow.io/): Nextflow is a workflow management software that TransPi is built for. -2. [Google Batch API](https://cloud.google.com/batch/docs): Google Batch was enabled as part of the setup process and will be readily available when it is needed. -3. [Nextflow TransPi Package](https://github.com/palmuc/TransPi): The rest of the software is all downloaded as part of the TransPi package. TransPi is a Nextflow pipeline that carries out many of the standard steps required for transcriptome assembly and annotation. The original TransPi is available from this GitHub [link](https://github.com/palmuc/TransPi). We have made various alterations to the TransPi package and so the TransPi files you will be using throughout this module will be our own altered version. - ## **Workflow Diagrams** ![transpi workflow](images/transpi_workflow.png) diff --git a/images/README.md b/images/README.md new file mode 100644 index 0000000..0aa8044 --- /dev/null +++ b/images/README.md @@ -0,0 +1,143 @@ +![course card](images/MDI-course-card-2.png) + +# MDI Biological Laboratory RNA-seq Transcriptome Assembly Module +--------------------------------- + + +## Three primary and interlinked learning goals: +1. From a *biological perspective*, demonstration of the **process of transcriptome assembly** from raw RNA-seq data. +2. From a *computational perspective*, demonstration of **computing using workflow management and container systems**. +3. Also from an *infrastructure perspective*, demonstration of **carrying out these analyses efficiently in a cloud environment.** + + + +# Quick Overview +This module teaches you how to perform a short-read RNA-seq Transcriptome Assembly with Google Cloud Platform using a Nextflow pipeline, and eventually using the Google Batch API. In addition to the overview given in this README, you will find three Jupyter notebooks that teach you different components of RNA-seq in the cloud. + +This module will cost you about $7.00 to run end to end, assuming you shutdown and delete all resources upon completion. 
+ + +## Contents + ++ [Getting Started](#getting-started) ++ [Biological Problem](#biological-problem) ++ [Set Up](#set-up) ++ [Software Requirements](#software-requirements) ++ [Workflow Diagrams](#workflow-diagrams) ++ [Data](#data) ++ [Troubleshooting](#troubleshooting) ++ [Funding](#funding) ++ [License for Data](#license-for-data) + +## **Getting Started** +This learning module includes tutorials and execution scripts in the form of Jupyter notebooks. The purpose of these tutorials is to help users familiarize themselves with cloud computing in the specific context of running bioinformatics workflows to prep for and to carry out a transcriptome assembly, refinement, and annotation. These tutorials do this by utilizing a recently published Nextflow workflow (TransPi [manuscript](https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.13593), [repository](https://github.com/palmuc/TransPi), and [user guide](https://palmuc.github.io/TransPi/)), which manages and passes data between several state-of-the-art programs, carrying out the processes from initial quality control and normalization, through assembly with several tools, refinement and assessment, and finally annotation of the final putative transcriptome. + +Since the work is managed by this pipeline, the notebooks will focus on setting up and running the pipeline, followed by an examination of some of the wide range of outputs produced. We will also demonstrate how to retrieve the complete results directory so that users can examine more extensively on their own computing systems going step-by-step through specific workflows. These workflows cover the start to finish of basic bioinformatics analysis; starting from raw sequence data and carrying out the steps needed to generate a final assembled and annotated transcriptome. + +We also put an emphasis on understanding how workflows execute, using the specific example of the Nextflow (https://www.nextflow.io) workflow engine, and on using workflow engines as supported by cloud infrastructure, using the specific example of the Google Batch API (https://cloud.google.com/batch). + +![technical infrastructure](/images/architecture_diagram.png) + +**Figure 1:** The technical infrastructure diagram for this project. + +## **Biological Problem** +The combination of increased availability and reduced expense in obtaining high-throughput sequencing has made transcriptome profiling analysis (primarily with RNA-seq) a standard tool for the molecular characterization of widely disparate biological systems. Researchers working in common model organisms, such as mouse or zebrafish, have relatively easy access to the necessary resources (e.g., well-assembled genomes and large collections of predicted/verified transcripts), for the analysis and interpretation of their data. In contrast, researchers working on less commonly studied organisms and systems often must develop these resources for themselves. + +Transcriptome assembly is the broad term used to describe the process of estimating many (or ideally all) of an organism’s transcriptome based on the large-scale but fragmentary data provided by high-throughput sequencing. A "typical" RNA-seq dataset will consist of tens of millions of reads or read-pairs, with each contiguous read representing up to 150 nucleotides in the sequence. Complete transcripts, in contrast, typically range from hundreds to tens of thousands of nucleotides in length. 
In short, and leaving out the technical details, the process of assembling a transcriptome from raw reads (Figure 2) is to first make a "best guess" segregation of the reads into subsets that are most likely derived from one (or a small set of related/similar genes), and then for each subset, build a most-likely set of transcripts and genes. + +![basic transcriptome assembly](./images/basic_assembly.png) + +**Figure 2:** The process from raw reads to first transcriptome assembly. + +Once a new transcriptome is generated, assessed, and refined, it must be annotated with putative functional assignments to be of use in subsequent functional studies. Functional annotation is accomplished through a combination of assignment of homology-based and ab initio methods. The most well-established homology-based processes are the combination of protein-coding sequence prediction followed by protein sequence alignment to databases of known proteins, especially those from human or common model organisms. Ab initio methods use computational models of various features (e.g., known protein domains, signal peptides, or peptide modification sites) to characterize either the transcript or its predicted protein product. This training module will cover multiple approaches to the annotation of assembled transcriptomes. + +## **Set Up** + +#### Part 1: Setting up Environment + +**Enable APIs and create a Nextflow Sercice Account** + +If you are using Nextflow outside of NIH CloudLab you must enable the required APIs, set up a service account, and add your service account to your notebook permissions before creating the notebook. Follow sections 1 and 2 of the accompanying [how to document](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/HowToCreateNextflowServiceAccount.md) for instructions. If you are executing this tutorial with an NIH CloudLab account your default Compute Engine service account will have all required IAM roles to run the nextflow portion. + +**Create the Vertex AI Instance** + +Follow the steps highlighted [here](https://github.com/STRIDES/NIHCloudLabGCP/blob/main/docs/vertexai.md) to create a new user-managed notebook in Vertex AI. Follow steps 1-8 and be especially careful to enable idle shutdown as highlighted in step 7. For this module you should select **Debian 11** and **Python3** in the Environment tab in step 5. In step 6 in the Machine type tab, select **n1-highmem-16** from the dropdown box. This will provide you with 16 vCPUs and 104 GB of RAM which may feel like a lot but is necessary for TransPi to run. + + +#### Part 2: Adding the Modules to the Notebook + +1. From the Launcher in your new VM, Click the Terminal option. +![setup 22](images/Setup22.png) +2. Next, paste the following git command to get a copy of everything within this repository, including all of the submodules. + +> ```git clone https://github.com/NIGMS/Transcriptome-Assembly-Refinement-and-Applications.git``` +3. You are now all set! + +**WARNING:** When you are not using the notebook, stop it. This will prevent you from incurring costs while you are not using the notebook. You can do this in the same window as where you opened the notebook. Make sure that you have the notebook selected ![setup 23](images/Setup23.png). Then click the ![setup 24](images/Setup24.png). When you want to start up the notebook again, do the same process except click the ![setup 25](images/Setup25.png) instead. 
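+
+If you prefer to create the Vertex AI instance from a terminal rather than the console, a roughly equivalent instance can be created with the `gcloud` CLI. The sketch below is illustrative only: the instance name and zone are placeholders, and you should confirm the Debian 11 image family available to your project before running it.
+
+```
+# Sketch only -- adjust the instance name, zone, and image family for your project
+gcloud notebooks instances create transcriptome-assembly-notebook \
+    --location=us-east4-b \
+    --machine-type=n1-highmem-16 \
+    --vm-image-project=deeplearning-platform-release \
+    --vm-image-family=common-cpu-notebooks-debian-11
+```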
+ +## **Software Requirements** + +All of the software requirements are taken care of and installed within [Submodule_01_prog_setup.ipynb](./Submodule_01_prog_setup.ipynb). The key pieces of software needed are: +1. [Nextflow workflow system](https://www.nextflow.io/): Nextflow is a workflow management software that TransPi is built for. +2. [Google Batch API](https://cloud.google.com/batch/docs): Google Batch was enabled as part of the setup process and will be readily available when it is needed. +3. [Nextflow TransPi Package](https://github.com/palmuc/TransPi): The rest of the software is all downloaded as part of the TransPi package. TransPi is a Nextflow pipeline that carries out many of the standard steps required for transcriptome assembly and annotation. The original TransPi is available from this GitHub [link](https://github.com/palmuc/TransPi). We have made various alterations to the TransPi package and so the TransPi files you will be using throughout this module will be our own altered version. + +## **Workflow Diagrams** + +![transpi workflow](images/transpi_workflow.png) + +**Figure 3:** Nextflow workflow diagram. (Rivera 2021). +Image Source: https://github.com/PalMuc/TransPi/blob/master/README.md + +Explanation of which notebooks execute which processes: + ++ Notebooks labeled 0 ([Submodule_00_Background.ipynb](./Submodule_00_Background.ipynb) and [00_Glossary.md](./00_Glossary.md)) respectively cover background materials and provide a centralized glossary for both the biological problem of transcriptome assembly, as well as an introduction to workflows and container-based computing. ++ Notebook 1 ([Submodule_01_prog_setup.ipynb](./Submodule_01_prog_setup.ipynb)) is used for setting up the environment. It should only need to be run once per machine. (Note that our version of TransPi does not run the `precheck script`. To avoid the headache and wasted time, we have developed a workaround to skip that step.) ++ Notebook 2 ([Submodule_02_basic_assembly.ipynb](./Submodule_02_basic_assembly.ipynb)) carries out a complete run of the Nextflow TransPi assembly workflow on a modest sequence set, producing a small transcriptome. ++ Notebook 3 ([Submodule_03_annotation_only.ipynb](./Submodule_03_annotation_only.ipynb)) carries out an annotation-only run using a prebuilt, but more complete transcriptome. ++ Notebook 4 ([Submodule_04_google_batch_assembly.ipynb](./Submodule_04_google_batch_assembly.ipynb)) carries out the workflow using the Google Batch API. ++ Notebook 5 ([Submodule_05_Bonus_Notebook.ipynb](./Submodule_05_Bonus_Notebook.ipynb)) is a more hands-off notebook to test basic skills taught in this module. + +## **Data** +The test dataset used in the majority of this module is a downsampled version of a dataset that can be obtained in its complete form from the SRA database (Bioproject [**PRJNA318296**](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA318296), GEO Accession [**GSE80221**](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE80221)). The data was originally generated by **Hartig et al., 2016**. We downsampled the data files in order to streamline the performance of the tutorials and stored them in a Google Cloud Storage bucket. The sub-sampled data, in individual sample files as well as a concatenated version of these files are available in our Google Cloud Storage bucket at `gs://nigms-sandbox/nosi-inbremaine-storage/resources/seq2`. 
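+
+As a quick check that your instance can reach this bucket, the same `gsutil` command used in the notebooks copies the resources directory (which contains these reads) onto your machine so you can list the files; the paths below assume you are working from your notebook's home directory.
+
+```
+# pull the module's resources directory, including the subsampled reads, from the bucket
+gsutil -m cp -r gs://nigms-sandbox/nosi-inbremaine-storage/resources ./
+# confirm the fastq files are present
+ls ./resources/seq2
+```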
+ +Additional datasets for demonstration of the annotation features of TransPi were obtained from the NCBI Transcriptome Shotgun Assembly archive. These files can be found in our Google Cloud Storage bucket at `gs://nigms-sandbox/nosi-inbremaine-storage/resources/trans`. +- Microcaecilia dermatophaga + - Bioproject: [**PRJNA387587**](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA387587) + - Originally generated by **Torres-Sánchez M et al., 2019**. +- Oncorhynchus mykiss + - Bioproject: [**PRJNA389609**](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA389609) + - Originally generated by **Wang J et al., 2016**, **Al-Tobasei R et al., 2016**, and **Salem M et al., 2015**. +- Pseudacris regilla + - Bioproject: [**PRJNA163143**](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA163143) + - Originally generated by **Laura Robertson, USGS**. + +The final submodule ([Submodule_05_Bonus_Notebook.ipynb](./Submodule_05_Bonus_Notebook.ipynb)) uses an additional dataset pulled from the SRA database. We are using the RNA-seq reads only and have subsampled and merged them to a collective 2 million reads. This is not a good idea for real analysis, but was done to reduce the costs and runtime. These files are avalible in our Google Cloud Storage bucket at `gs://nigms-sandbox/nosi-inbremaine-storage/resources/seq2`. +- Apis mellifera + - Bioproject: [**PRJNA274674**](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA274674) + - Originally generated by **Galbraith DA et al., 2015**. + +## **Troubleshooting** +- If a quiz is not rendering: + - Make sure the `pip install` cell was executed in Submodule 00. + - Try re-executing `from jupytercards import display_flashcards` or `from jupyterquiz import display_quiz` depending on the quiz type. +- If a file/directory is not able to be found, make sure that you are in the right directory. If the notebook is idle for a long time, gets reloaded, or restarted, you will need to re-run Step 1 of the notebook. (`%cd /home/jupyter`) +- Sometimes, Nextflow will print `WARN:` followed by the warning. These are okay and should not produce any errors. +- Sometimes Nextflow will print `Waiting for file transfers to complete`. This may take a few minutes, but is nothing to worry about. +- If you are unable to create a bucket using the `gsutil mb` command, check your `nextflow-service-account` roles. Make sure that you have `Storage Admin` added. +- If you are trying to execute a terminal command in a Jupyter code cell and it is not working, make sure that you have an `!` before the command. + - e.g., `mkdir example-1` -> `!mkdir example-1` + +## **Funding** + +MDIBL Computational Biology Core efforts are supported by two Institutional Development Awards (IDeA) from the National Institute of General Medical Sciences of the National Institutes of Health under grant numbers P20GM103423 and P20GM104318. + +## **License for Data** + +Text and materials are licensed under a Creative Commons CC-BY-NC-SA license. The license allows you to copy, remix and redistribute any of our publicly available materials, under the condition that you attribute the work (details in the license) and do not make profits from it. More information is available [here](https://tilburgsciencehub.com/about). 
+ +![Creative commons license](https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png) + +This work is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-nc-sa/4.0/) + +The TransPi Nextflow workflow was developed and released by Ramon Rivera and can be obtained from its [GitHub repository](https://github.com/PalMuc/TransPi)