- Overview
- Background
- Before Starting
- Getting Started
- Software Requirements
- Architecture Design
- Data
- Funding
- License for Data
- Wrapping Up
- Additional Resources
- Troubleshooting
This module will introduce you to (graphical) pangenomics and walk you through a pangenomics pipeline. Specifically, you will learn how to build a pangenome graph, index the graph for analysis, map reads to the graph, call variants on the mapped reads, and visualize the graph. All analyses will be performed on the Google Cloud Platform (GCP). The estimated cost to complete the whole module is about $2, assuming you tear down all resources upon completion. See the Wrapping Up section for details on how to tear down resources.
A pangenome is a collection of genomes from the same species. Compared to a reference genome, a pangenome is a less biased, more comprehensive representation of sequence preservation and variation within a population. While the pangenome may provide greater insight into questions related to the genetic and genomic nature of a species, these data require the use of bioinformatics tools that are different than those typically used on reference genomes. This module aims to introduce you to the idea of pangenome graphs and the bioinformatics tools used for their analysis.
This module is designed to run on the Google Cloud Platform (GCP). Follow the instructions below to prepare to run the module on GCP.
Setting up GCP
See the Vertex AI Quickstart instructions for details on steps 1-5.
1. Create a Google Cloud account
2. Create a Google Cloud project
3. Enable billing for your Google Cloud project
4. Go to Vertex AI Workbench then create a new VM instance using "CREATE NEW" -> "ADVANCED OPTIONS". The configurations for each page are described below. Click the "Continue" button at the bottom of each page to go to the next page. Any configuration not explicitly mentioned below should use its default setting.
   - Details:
     - Name: nigms-pangenomics-module (this is optional; you can use whatever name you want or the default)
     - Region: us-east4
     - Zone: us-east4-a
     - Workbench type:
       - Type: Instance
   - Environment:
     - JupyterLab Version: JupyterLab 4.x
   - Machine type:
     - Series: N2
     - Machine type: n2-standard-4
     - Idle shutdown:
       - Enable Idle Shutdown: Checked
       - Time of inactivity before shutdown (Minutes): 60
   - Disks: Use default settings
   - Networking:
     - Assign external IP address: Checked
     - Allow proxy access: Checked
   - IAM and security:
     - Security options:
       - Root access to the instance: Checked
       - File downloading: Checked
       - Terminal access: Checked
   - System health: Use default settings

   The last configuration page (System health) will not have a "Continue" button. Instead, use the "Create" button below the page to create the Vertex AI Workbench VM instance you just configured. The new instance will show a blue loading circle while it is being created. Once the green checkmark appears, your Vertex AI Workbench VM instance is ready.
5. Click "OPEN JUPYTERLAB" on your VM instance to open JupyterLab
Installing Software
To install the software for this module in JupyterLab, open a Terminal ("File" -> "New Launcher" -> "Terminal") and run the following commands:
cd ~
git clone https://github.com/NIGMS/Intro-to-Pangenomics NIGMS-Sandbox-Pangenomics-Module
bash -i ./NIGMS-Sandbox-Pangenomics-Module/scripts/0-setup.sh
After the last command completes, close the terminal and restart the VM instance in the Vertex AI Workbench.
There should now be a new kernel in the JupyterLab Launcher tab called "nigms-pangenomics". If you do not see the Launcher tab, open a new one ("File" -> "New Launcher"). This is the kernel you should use with every notebook in the module (when you open a notebook, the kernel is listed in the upper right corner). The Launcher should also contain two new sections: "Submodule Notebooks" and "Visualization Software". Submodule Notebooks contains an ordered list of the notebooks in this module, one for each submodule. Clicking on a submodule will open the corresponding notebook. Visualization Software contains a list of the visualization software used in this module. Clicking on a program in this list will open the program in a new window in your web browser.
After following the Before Starting instructions, the JupyterLab launcher ("File" -> "New Launcher") will contain a "Submodule Notebooks" section. This section contains an ordered list of the notebooks in this module, one for each submodule. Clicking on a submodule in this section will open the corresponding notebook. To begin, click on the "Environment Setup" notebook.
Alternatively, you can use the JupyterLab file browser. Here is the location and file structure of the module notebooks:
NIGMS-Sandbox-Pangenomics-Module/
└── module_notebooks/
    ├── 00-environment-setup.ipynb
    ├── 01-intro-to-pangenomics.ipynb
    ├── 02-building-graphs-with-pggb.ipynb
    ├── 03-searching-graphs-with-blast.ipynb
    ├── 04-visualization.ipynb
    ├── 05-indexing-graphs-with-vg.ipynb
    ├── 06-read-mapping-with-vg.ipynb
    └── 07-variant-calling-with-vg.ipynb
module_notebooks/ contains Jupyter notebooks, one for each submodule.
To open a notebook, simply double-click on it.
To begin this module, open the 00-environment-setup.ipynb notebook.
The following software is required for this module:
All of these programs can be installed in JupyterLab running on the GCP Vertex AI Workbench by following the Installing Software instructions in the Before Starting section.
The architecture of this workshop is composed of 3 major parts: 1) input data from storage, 2) an analysis pipeline run on the GCP Vertex AI Workbench, and 3) output data to storage, which itself is used as input in subsequent steps of the pipeline. The analysis pipeline part (2) is composed of the following submodules:
- Setting up the environment for the module
- An introduction to graphical pangenomics
- A tutorial on how to build a pangenome graph using PGGB
- A tutorial on how to search a pangenome graph using BLAST
- A tutorial on how to visualize a pangenome graph using Bandage
- A tutorial on indexing pangenome graphs with vg for downstream analysis
- A tutorial on mapping reads to an indexed pangenome graph
- A tutorial on calling variants on reads mapped to a pangenome graph
All submodules in the pipeline use a custom nigms-pangenomics Jupyter kernel, which can be installed following the instructions in the Before Starting section.
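The build, index, map, and call stages listed above can be sketched on the command line roughly as follows. This is an illustrative outline and not a copy of the module's notebooks: the file names (`yeast.fa.gz`, `pggb_out/graph.gfa`, the `SK1` read prefix), the `yeast` index prefix, the thread count, and the choice of the `vg autoindex` / `vg giraffe` / `vg pack` / `vg call` route are all assumptions, and the index file names written by vg can differ between versions. The hypothetical helper `print_pipeline_plan` only prints the commands so you can review them before running anything.

```shell
#!/bin/sh
# Illustrative outline of the build -> index -> map -> call stages.
# All file names and parameter values below are hypothetical placeholders.

print_pipeline_plan() {
    assemblies="$1"   # bgzipped FASTA holding all assemblies (hypothetical name)
    reads_prefix="$2" # prefix of the paired-end FASTQ files (hypothetical name)
    n_genomes="$3"    # number of genomes in the FASTA (3 in this module's data)

    # The heredoc is printed verbatim; nothing in it is executed.
    cat <<EOF
# 1. Build the pangenome graph with PGGB
samtools faidx ${assemblies}
pggb -i ${assemblies} -o pggb_out -n ${n_genomes} -t 4

# 2. Index the graph with vg (use the GFA that PGGB wrote to pggb_out/)
vg autoindex --workflow giraffe -g pggb_out/graph.gfa -p yeast

# 3. Map paired-end reads to the indexed graph
vg giraffe -Z yeast.giraffe.gbz -m yeast.min -d yeast.dist -f ${reads_prefix}_1.fq.gz -f ${reads_prefix}_2.fq.gz > mapped.gam

# 4. Call variants from the mapped reads
vg pack -x yeast.giraffe.gbz -g mapped.gam -o mapped.pack
vg call yeast.giraffe.gbz -k mapped.pack > calls.vcf
EOF
}

print_pipeline_plan yeast.fa.gz SK1 3
```

Each stage writes files that the next stage reads, which mirrors the storage-in, storage-out design described above. The notebooks remain the authoritative source for the exact commands and parameters used in this module.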
This module uses the following data:
- 3 genome assemblies acquired from the Yeast Population Reference Panel (YPRP)
  - S288C (reference)
  - SK1
  - Y12
- Illumina paired-end reads acquired from NCBI
  - SK1
- Gene sequences acquired from Saccharomyces Genome Database (SGD)
  - CUP1-1
  - YHR053C
This module was developed by the National Center for Genome Resources (NCGR) as part of the Data Science Core for the New Mexico IDeA Network of Biomedical Research Excellence (NM-INBRE). The work was supported by National Institutes of Health (NIH) grant number P20GM103451 and NIH supplement award number Q02588.
Text and materials are licensed under a Creative Commons CC-BY-NC-SA license. The license allows you to copy, remix, and redistribute any of our publicly available materials, under the condition that you attribute the work (details in the license) and do not make profits from it. More information is available here.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License
Once you have completed the module, we recommend deleting the Vertex AI Workbench VM instance you created since it costs money even when it's not running.
You can download a copy of your work from this module before deleting the VM by creating a .zip file of your copy of this repository.
To do this, in JupyterLab open a Terminal ("File" -> "New Launcher" -> "Terminal") and run the following commands:
cd ~
zip -r NIGMS-Sandbox-Pangenomics-Module.zip NIGMS-Sandbox-Pangenomics-Module
You can then download this file from JupyterLab by opening the File Browser in the left menu, clicking the Home (/) button, right-clicking on the NIGMS-Sandbox-Pangenomics-Module.zip file, and selecting "Download" in the menu that appears.
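Before downloading, it can be worth confirming that the archive was actually written. The sketch below is one way to do that. The helper name `archive_ok` is hypothetical, the archive path assumes the `zip` command above was run from your home directory, and the integrity test is skipped when `unzip` is not installed.

```shell
#!/bin/sh
# archive_ok (hypothetical helper): succeed only if the file exists, is
# non-empty, and, when unzip is available, passes unzip's integrity test.
archive_ok() {
    [ -s "$1" ] || return 1
    if command -v unzip >/dev/null 2>&1; then
        unzip -tq "$1" >/dev/null 2>&1 || return 1
    fi
    return 0
}

# Assumed location: the archive created by the zip command in this section.
ARCHIVE="$HOME/NIGMS-Sandbox-Pangenomics-Module.zip"

if archive_ok "$ARCHIVE"; then
    echo "Archive looks good:"
    ls -lh "$ARCHIVE"
else
    echo "Archive is missing, empty, or corrupt: $ARCHIVE"
fi
```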
When you are ready to delete your Vertex AI Workbench VM instance, go into the Vertex AI Workbench, check the box next to the VM instance that you want to delete, and click the "Delete" button in the menu that appears at the top.
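The same cleanup can also be done with the Google Cloud CLI from a terminal outside the VM, for example your own machine or Cloud Shell. This is a sketch and not part of the module: it assumes `gcloud` is installed and authenticated for your project, and the instance name and zone below assume the values suggested in the Before Starting section, so adjust them if you chose different ones. `gcloud` asks for confirmation before deleting.

```shell
#!/bin/sh
# Values assumed from the Before Starting configuration; change as needed.
INSTANCE_NAME="nigms-pangenomics-module"
LOCATION="us-east4-a"

if command -v gcloud >/dev/null 2>&1; then
    # Show the Workbench instances in the zone, then delete the module's instance.
    gcloud workbench instances list --location="$LOCATION"
    gcloud workbench instances delete "$INSTANCE_NAME" --location="$LOCATION"
else
    echo "gcloud CLI not found; delete the instance from the Vertex AI Workbench page instead."
fi
```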
This module is based on a workshop offered by the National Center for Genome Resources (NCGR) as part of the Data Science Core for the New Mexico IDeA Network of Biomedical Research Excellence (NM-INBRE). The workshop covers all of the material in this module and much more. See NCGR's workshop webpage for details.
command not found
If you try to run a code cell and get the error "command not found", there is a good chance that you are not using the correct Jupyter kernel.
To change the kernel you are using, in JupyterLab click "Kernel" -> "Change Kernel..."
In the pop-up that appears, select "nigms-pangenomics" in the dropdown and then click the "Select" button.
If "nigms-pangenomics" is not an option, then you need to set up the environment.
Do this by following the "Installing Software" instructions in the Before Starting section of this README.
Note that you must restart your Vertex AI Workbench VM instance after completing the setup to be able to use the "nigms-pangenomics" kernel.
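To confirm from a JupyterLab Terminal whether the kernel has been registered, something like the following sketch may help. It assumes `jupyter` is on the PATH of that terminal and that the setup script registers the kernel under the name shown in the dropdown; the helper name `has_kernel` is hypothetical.

```shell
#!/bin/sh
# Kernel name taken from this README.
KERNEL_NAME="nigms-pangenomics"

# has_kernel (hypothetical helper): read `jupyter kernelspec list` output on
# stdin and succeed if a kernel with the given name appears in the first column.
has_kernel() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

if command -v jupyter >/dev/null 2>&1; then
    if jupyter kernelspec list | has_kernel "$KERNEL_NAME"; then
        echo "Kernel '$KERNEL_NAME' is installed."
    else
        echo "Kernel '$KERNEL_NAME' not found; re-run the setup and restart the VM instance."
    fi
else
    echo "jupyter is not on PATH in this shell."
fi
```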
Resetting Bandage
Sometimes the Bandage software stops working and displays a message that says "KasmVNC encountered an error." When this occurs, you can reset the Bandage software by opening a Terminal in JupyterLab ("File" -> "New Launcher" -> "Terminal") and running the following commands:
cd ~
docker compose -f NIGMS-Sandbox-Pangenomics-Module/bandage/compose.yml up -d --build --force-recreate
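After the reset, you may want to check that the Bandage container came back up. The sketch below assumes the same compose file path as the command above and only inspects state; it does not change anything.

```shell
#!/bin/sh
# Compose file path assumed from the reset command in this section.
COMPOSE_FILE="$HOME/NIGMS-Sandbox-Pangenomics-Module/bandage/compose.yml"

if ! command -v docker >/dev/null 2>&1; then
    echo "docker is not available in this shell."
elif [ ! -f "$COMPOSE_FILE" ]; then
    echo "Compose file not found: $COMPOSE_FILE"
else
    # List the services defined in the compose file and their current state.
    docker compose -f "$COMPOSE_FILE" ps
    # Show the most recent log lines to help spot startup errors.
    docker compose -f "$COMPOSE_FILE" logs --tail 20
fi
```

If the service is listed as running but the browser window still shows the error, reloading the Bandage page after a few seconds is a reasonable next step.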