File renamed without changes.
450 changes: 450 additions & 0 deletions AWS/Submodule_00_background.ipynb

Large diffs are not rendered by default.

116 changes: 84 additions & 32 deletions Submodule_01_prog_setup.ipynb → AWS/Submodule_01_prog_setup.ipynb
@@ -6,11 +6,64 @@
"metadata": {},
"source": [
"# MDIBL Transcriptome Assembly Learning Module\n",
"# Notebook 1: Setup\n",
"# Notebook 1: Setup"
]
},
{
"cell_type": "markdown",
"id": "f62d616c",
"metadata": {},
"source": [
"## Overview\n",
"\n",
"This notebook configures your virtual machine (VM) with the tools and data needed to run the transcriptome assembly training module."
]
},
{
"cell_type": "markdown",
"id": "60145056",
"metadata": {},
"source": [
"## Learning Objectives\n",
"\n",
"1. **Understand and utilize shell commands within Jupyter Notebooks:** The notebook explicitly teaches the difference between `!` and `%` prefixes for executing shell commands, and how to navigate directories using `cd` and `pwd`.\n",
"\n",
"2. **Set up the necessary software:** Students will install and configure essential tools including:\n",
" * Java (a prerequisite for Nextflow).\n",
" * Mamba (a conda-compatible package manager, provided by Miniforge, used to install bioinformatics tools).\n",
" * `sra-tools`, `perl-dbd-sqlite`, and `perl-dbi` (specific bioinformatics packages).\n",
" * Nextflow (a workflow management system).\n",
" * `aws s3` (for interacting with AWS S3 Storage).\n",
"\n",
"3. **Download and organize necessary data:** Students will download the TransPi transcriptome assembly software and its associated resources (databases, scripts, configuration files) from an S3 bucket. This includes understanding the directory structure and file organization.\n",
"\n",
"4. **Manage file permissions:** Students will use the `chmod` command to set executable permissions for the necessary files and directories within the TransPi software.\n",
"\n",
"5. **Navigate file paths:** The notebook provides examples and explanations for using relative file paths (e.g., `./`, `../`) within shell commands."
]
},
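The `!` vs `%` distinction from Objective 1 can be sketched outside Jupyter: each `!command` runs in a throwaway subshell, so a `cd` there does not persist, whereas the `%cd` magic changes the notebook's own working directory. A minimal illustration (an editorial sketch, not part of the notebook):

```shell
# Each `!` line in Jupyter runs in a fresh subshell, like the ( ... ) group here
pwd_before=$(pwd)
(cd / && pwd)                 # prints /; the cd applies only inside the subshell
pwd_after=$(pwd)
[ "$pwd_before" = "$pwd_after" ] && echo "directory unchanged"
```

This is why the notebook uses `%cd` for navigation but `!` for one-shot commands.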
{
"cell_type": "markdown",
"id": "549be731",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"* **Operating System:** A Linux-based system is assumed (commands such as `apt` and `uname` are used). The specific distribution isn't specified, but a Debian-based system is likely.\n",
"* **Shell Access:** The ability to execute shell commands from within the Jupyter Notebook environment (using `!` and `%`).\n",
"* **Java Development Kit (JDK):** Required for Nextflow.\n",
"* **Miniforge:** A package manager for installing bioinformatics tools (provides the `mamba` command).\n",
"* **`aws s3`:** The AWS command-line tool. This is crucial for downloading data from an S3 storage bucket."
]
},
{
"cell_type": "markdown",
"id": "a92f62a0",
"metadata": {},
"source": [
"## Get Started"
]
},
{
"cell_type": "markdown",
"id": "958495ce-339d-4d4d-a621-9ede79a7363c",
@@ -51,7 +104,7 @@
"## Time to begin!\n",
"\n",
"**Step 1:** To start, make sure that you are in the right starting place with a `cd`.\n",
"> `pwd` prints our current local working directory. Make sure the output from the command is: `/home/jupyter`"
"> `pwd` prints our current local working directory. Make sure the output from the command is: `/home/ec2-user/SageMaker`"
]
},
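The Step 1 `pwd` check can be made self-verifying. A hedged sketch (using `/` as a stand-in for the real `/home/ec2-user/SageMaker` path) that fails loudly when launched from the wrong directory:

```shell
expected="/"                     # substitute /home/ec2-user/SageMaker on the VM
cd "$expected"
if [ "$(pwd)" = "$expected" ]; then
  echo "in the right place"
else
  echo "wrong directory: $(pwd)" >&2
fi
```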
{
@@ -61,7 +114,7 @@
"metadata": {},
"outputs": [],
"source": [
"%cd /home/jupyter"
"%cd /home/ec2-user/SageMaker"
]
},
{
@@ -71,7 +124,7 @@
"metadata": {},
"outputs": [],
"source": [
"!pwd"
"! pwd"
]
},
{
@@ -89,31 +142,27 @@
"metadata": {},
"outputs": [],
"source": [
"!sudo apt update\n",
"!sudo apt-get install default-jdk -y\n",
"!java -version"
"! sudo apt update\n",
"! sudo apt-get install default-jdk -y\n",
"! java -version"
]
},
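A defensive follow-up to Step 2 (an editorial suggestion, not part of the original notebook): confirm the JDK actually landed on the PATH before moving on, since Nextflow will fail later without it.

```shell
# `java -version` writes to stderr, hence the 2>&1 redirect
if command -v java >/dev/null 2>&1; then
  java -version 2>&1 | head -n 1
else
  echo "java not found - rerun the apt install step"
fi
```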
{
"cell_type": "markdown",
"id": "7b3ffb16-3395-4c01-9774-ee568e815490",
"id": "7b930ad7",
"metadata": {},
"source": [
"**Step 3:** Install Mambaforge, which is needed to support the information held within the TransPi databases.\n",
"\n",
">Mambaforge is a package manager."
"**Step 3:** Using Mamba and bioconda, install the tools that will be used in this tutorial."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ac5b204a-f0db-4ceb-bf37-57eca6d77974",
"id": "4d4dd51e",
"metadata": {},
"outputs": [],
"source": [
"!curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh\n",
"!bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge\n",
"!~/mambaforge/bin/mamba install -c bioconda sra-tools perl-dbd-sqlite perl-dbi -y"
"! mamba install -c bioconda sra-tools perl-dbd-sqlite perl-dbi -y"
]
},
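After the `mamba install`, a quick sanity check confirms each binary is reachable. The tool names below are assumed from the sra-tools package; adjust as needed:

```shell
# sra-tools provides prefetch and fasterq-dump; report any that are missing
for tool in prefetch fasterq-dump; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done
```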
{
@@ -131,9 +180,9 @@
"metadata": {},
"outputs": [],
"source": [
"!curl https://get.nextflow.io | bash\n",
"!chmod +x nextflow\n",
"!./nextflow self-update"
"! curl https://get.nextflow.io | bash\n",
"! chmod +x nextflow\n",
"! ./nextflow self-update"
]
},
{
@@ -152,7 +201,7 @@
"metadata": {},
"outputs": [],
"source": [
"!gsutil -m cp -r gs://nigms-sandbox/nosi-inbremaine-storage/TransPi ./"
"! aws s3 cp --recursive s3://nigms-sandbox/nosi-inbremaine-storage/TransPi ./TransPi"
]
},
{
@@ -162,10 +211,10 @@
"source": [
"<div class=\"alert alert-block alert-success\">\n",
" <i class=\"fa fa-hand-paper-o\" aria-hidden=\"true\"></i>\n",
" <b>Note: </b> gsutil\n",
" <b>Note: </b> aws\n",
"</div>\n",
"\n",
">`gsutil` is a tool allows you to interact with Google Cloud Storage through the command line."
">`aws s3` is a tool that allows you to interact with Amazon S3 storage through the command line."
]
},
{
@@ -190,7 +239,7 @@
"metadata": {},
"outputs": [],
"source": [
"!gsutil -m cp -r gs://nigms-sandbox/nosi-inbremaine-storage/resources ./"
"! aws s3 cp --recursive s3://nigms-sandbox/nosi-inbremaine-storage/resources ./resources"
]
},
{
@@ -215,8 +264,7 @@
"> - They can also be stacked so `../../` will take you two layers up.\n",
">\n",
">- If you were to type `!ls ./nextWeek/` it would return the contents of the `nextWeek` directory which is one layer down from the current directory, so it would return `manyThings.txt`.\n",
">\n",
">**This means that in the second line of the code cell above, the file `TransPi.nf` will be copied from the Google Cloud Storage bucket to the current directory.**"
">"
]
},
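The relative-path rules above can be reproduced concretely. This sketch builds the hypothetical `nextWeek/manyThings.txt` layout from the example and lists it with `./` and `../`:

```shell
mkdir -p demo_relpaths/nextWeek
touch demo_relpaths/today.txt demo_relpaths/nextWeek/manyThings.txt
cd demo_relpaths
ls ./nextWeek/                # one layer down: lists manyThings.txt
ls ../demo_relpaths           # one layer up, then back down: same contents
cd .. && rm -r demo_relpaths  # clean up the demo directory
```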
{
@@ -234,7 +282,7 @@
"metadata": {},
"outputs": [],
"source": [
"!chmod -R +x ./TransPi/bin"
"! chmod -R +x ./TransPi/bin"
]
},
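What `chmod -R +x` does to `TransPi/bin` can be sketched on a throwaway directory (names hypothetical): a freshly written script is not executable, and becomes runnable after the recursive permission change:

```shell
mkdir -p fakebin                          # stand-in for TransPi/bin
printf '#!/bin/sh\necho ok\n' > fakebin/tool.sh
chmod -R +x fakebin                       # -R recurses; +x adds execute permission
./fakebin/tool.sh                         # now executable: prints ok
rm -r fakebin                             # clean up
```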
{
@@ -295,19 +343,23 @@
},
{
"cell_type": "markdown",
"id": "f80a7bab-98ae-45a6-845f-ad3c4138575a",
"id": "ffec658a",
"metadata": {},
"source": [
"## When you are ready, proceed to the next notebook: [`Submodule_02_basic_assembly.ipynb`](./Submodule_02_basic_assembly.ipynb)."
"## Conclusion\n",
"\n",
"This notebook successfully configured the virtual machine for the MDIBL Transcriptome Assembly Learning Module. We updated the system, installed Java and Nextflow, used Mamba to install the required bioconda tools, and downloaded the TransPi program and its associated resources from an AWS S3 bucket. The `chmod` command made the TransPi scripts executable. The VM is now prepared for the next notebook, `Submodule_02_basic_assembly.ipynb`, which delves into the transcriptome assembly process itself. Completing this notebook's steps is essential for the successful execution of subsequent modules."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "934165c2-8fbd-4801-979f-6db5d1e592ea",
"cell_type": "markdown",
"id": "666c1e4d",
"metadata": {},
"outputs": [],
"source": []
"source": [
"## Clean Up\n",
"\n",
"Remember to proceed to the next notebook [`Submodule_02_basic_assembly.ipynb`](./Submodule_02_basic_assembly.ipynb) or shut down your instance if you are finished."
]
}
],
"metadata": {},
@@ -8,6 +8,8 @@
"# MDIBL Transcriptome Assembly Learning Module\n",
"# Notebook 2: Performing a \"Standard\" basic transcriptome assembly\n",
"\n",
"## Overview\n",
"\n",
"In this notebook, we will set up and run a basic transcriptome assembly, using the analysis pipeline as defined by the TransPi Nextflow workflow. The steps to be carried out are the following, and each is described in more detail in the Background material notebook.\n",
"\n",
"- Sequence Quality Control (QC): removing adapters and low-quality sequences.\n",
@@ -23,12 +25,58 @@
"> **Figure 1:** TransPi workflow for a basic transcriptome assembly run."
]
},
{
"cell_type": "markdown",
"id": "062784ec",
"metadata": {},
"source": [
"## Learning Objectives\n",
"\n",
"1. **Understanding the TransPi Workflow:** Learners will gain a conceptual understanding of the TransPi workflow, including its individual steps and their order. This involves understanding the purpose of each stage (QC, normalization, assembly, integration, assessment, annotation, and reporting).\n",
"\n",
"2. **Executing a Transcriptome Assembly:** Learners will learn how to run a transcriptome assembly using Nextflow and the TransPi pipeline, including setting necessary parameters (e.g., k-mer size, read length). They will learn how to interpret the command-line interface for executing Nextflow workflows.\n",
"\n",
"3. **Interpreting Nextflow Output:** Learners will learn to navigate and understand the directory structure generated by the TransPi workflow. This includes interpreting the output from various tools such as FastQC, FastP, Trinity, TransAbyss, SOAP, rnaSpades, Velvet/Oases, EvidentialGene, rnaQuast, BUSCO, DIAMOND/BLAST, HMMER/Pfam, and TransDecoder. This involves understanding the different types of output files generated and how to extract relevant information from them (e.g., assembly statistics, annotation results).\n",
"\n",
"4. **Assessing Transcriptome Quality:** Learners will understand how to assess the quality of a transcriptome assembly using metrics generated by rnaQuast and BUSCO.\n",
"\n",
"5. **Interpreting Annotation Results:** Learners will learn to interpret the results of transcriptome annotation using tools like DIAMOND/BLAST and HMMER/Pfam, understanding what information they provide regarding protein function and domains.\n",
"\n",
"6. **Utilizing Workflow Management Systems:** Learners will gain practical experience using Nextflow, a workflow management system, to execute a complex bioinformatics pipeline. This includes understanding the benefits of using a defined workflow for reproducibility and efficiency.\n",
"\n",
"7. **Working with Jupyter Notebooks:** The notebook itself provides a practical example of how to integrate command-line tools within a Jupyter Notebook environment."
]
},
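Objective 2 mentions setting a k-mer size: a k-mer is simply every length-k substring of a read, so a read of length L yields L - k + 1 of them. A POSIX-shell sketch over a toy sequence (illustrative only; the assemblers compute this internally):

```shell
read_seq="ACGTACGT"   # toy read
k=4                   # k-mer size, as passed to the assemblers
i=1
# slide a window of width k across the read, one k-mer per line
while [ "$i" -le $(( ${#read_seq} - k + 1 )) ]; do
  echo "$read_seq" | cut -c "$i-$(( i + k - 1 ))"
  i=$(( i + 1 ))
done
```

For this 8-base read with k=4, the loop emits 5 k-mers, starting with `ACGT`.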
{
"cell_type": "markdown",
"id": "abf9345c",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"* **Nextflow:** A workflow management system used to execute the TransPi pipeline. \n",
"* **Docker:** Used for containerization of the various bioinformatics tools within the workflow. This avoids the need for local installation of numerous packages.\n",
"* **TransPi:** The specific Nextflow pipeline for transcriptome assembly. The notebook assumes it's present in the `/home/ec2-user/SageMaker` directory.\n",
"* **Bioinformatics Tools (within TransPi):** The workflow utilizes several bioinformatics tools. These are packaged within Docker containers, but the notebook expects that TransPi is configured correctly to access and use them:\n",
" * FastQC: Sequence quality control.\n",
" * FastP: Read preprocessing (trimming, adapter removal).\n",
" * Trinity, TransAbyss, SOAPdenovo-Trans, rnaSpades, Velvet/Oases: Transcriptome assemblers.\n",
" * EvidentialGene: Transcriptome integration and reduction.\n",
" * rnaQuast: Transcriptome assessment.\n",
" * BUSCO: Assessment of completeness of the assembled transcriptome.\n",
" * DIAMOND/BLAST: Protein alignment for annotation.\n",
" * HMMER/Pfam: Protein domain assignment for annotation.\n",
" * Bowtie2: Read mapping for assembly validation.\n",
" * TransDecoder: ORF prediction and coding region identification.\n",
" * Trinotate: Functional annotation of transcripts."
]
},
{
"cell_type": "markdown",
"id": "6cd0f4f2-5559-4675-9e97-24b0548b31af",
"metadata": {},
"source": [
"## Time to get started! \n",
"## Get Started \n",
"\n",
"**Step 1:** Make sure you are in the correct local working directory as in `01_prog_setup.ipynb`.\n",
"> It should be `/home/ec2-user/SageMaker`."
@@ -272,16 +320,28 @@
"outputs": [],
"source": [
"from jupytercards import display_flashcards\n",
"display_flashcards('Transcriptome-Assembly-Refinement-and-Applications/quiz-material/02-cp1-1.json')\n",
"display_flashcards('Transcriptome-Assembly-Refinement-and-Applications/quiz-material/02-cp1-2.json')"
"display_flashcards('../quiz-material/02-cp1-1.json')\n",
"display_flashcards('../quiz-material/02-cp1-2.json')"
]
},
{
"cell_type": "markdown",
"id": "b82f0b3a",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"This Jupyter Notebook demonstrated a complete transcriptome assembly workflow using the TransPi Nextflow pipeline. We successfully executed the pipeline, encompassing quality control, normalization, multiple assembly generation with Trinity, TransAbyss, SOAP, rnaSpades, and Velvet/Oases, integration via EvidentialGene, and subsequent assessment using rnaQuast and BUSCO. The final assembly underwent annotation with DIAMOND/BLAST and HMMER/Pfam, culminating in comprehensive reports detailing the entire process and the resulting transcriptome characteristics. The generated output, accessible in the `basicRun/output` directory, provides a rich dataset for further investigation and analysis, including detailed quality metrics, assembly statistics, and functional annotations. This module provided a practical introduction to automated transcriptome assembly, highlighting the efficiency and reproducibility offered by integrated workflows like TransPi. Further exploration of the detailed output is encouraged, and the subsequent notebook focuses on a more in-depth annotation analysis."
]
},
{
"cell_type": "markdown",
"id": "b96dd6bb-a8ed-44bf-b1f4-bb284f8f0f3e",
"id": "b68484f3",
"metadata": {},
"source": [
"## When you are ready, proceed to the next notebook: [`Submodule_03_annotation_only.ipynb`](Submodule_03_annotation_only.ipynb)."
"## Clean Up\n",
"\n",
"Remember to proceed to the next notebook [`Submodule_03_annotation_only.ipynb`](Submodule_03_annotation_only.ipynb) or shut down your instance if you are finished."
]
}
],