National Institute of General Medical Sciences Cloud Learning Modules

Introduction

This repository aims to teach students, researchers, and clinicians, among others, how to harness cloud technology for life sciences applications and research. Here we present 26 cloud learning modules, each representing a unique use case or scientific workflow. Types of data used across the modules include, but are not limited to, genomics, methylomics, transcriptomics, proteomics, and medical imaging data in formats such as FASTA/FASTQ, SAM, BAM, CSV, PNG, and DICOM. The learning modules range from introductory material to single-omics approaches, multi-omics techniques, single-cell analysis, metagenomics, and AI/ML imaging applications.

These modules run on Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure. All modules will be available across all three cloud platforms in 2027, but you may notice that a given module is currently available on only one or two of them. You can run these modules in any cloud account, but we encourage users to request access to an NIH Cloud Lab account for the best experience.

To get started with any of the cloud platforms, visit the NIH Cloud Lab Jumpstart Pages for AWS, Azure, or Google Cloud, or visit the tutorial pages: AWS, Azure, GCP.

❗ If you require support at any time, please open a GitHub issue on the module in question, or send us an email describing the problem at CloudLab@nih.gov.


Table of Contents

Available Modules

The 26 topics and their authors are listed here. If you would like guidance on what order to complete them in, jump to the recommended learning pathways in the next section.

Recommended Learning Pathways

✨ We put together these learning pathways to help orient you to using the Sandbox modules. Before starting on any of the individual modules, we recommend you complete all the steps in the Prerequisites section for your respective cloud provider and only continue once you are able to check off these key skills.

Prerequisites: Introduction to AWS

Here are some AWS prerequisites you should complete before starting the modules. They cover the cloud computing skills you will need to run the training modules, such as launching a VM and cloning modules from GitHub. If anything looks unfamiliar, follow the link to view documentation on that subject. After reading the documentation linked below, complete the simple tasks described in each step as a knowledge check on what you just learned. Work through the steps in order to build the key skills you need for the learning modules in the next section!

After completing this prerequisite learning path, you should be able to:

  • Navigate the AWS console
  • Use SageMaker AI Notebooks
  • Copy data to and from an AWS S3 bucket (see the sketch after this list)
  • Enable AWS Batch
  • Understand Billing
  • Pull container images and launch an instance from a container
  • Use GitHub repositories
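
For example, copying data between a SageMaker notebook instance and S3 can be done with the AWS CLI. This is a minimal sketch; the bucket and file names below are placeholders, not resources created by the modules.

```bash
# Copy a local file up to an S3 bucket (bucket name is a placeholder)
aws s3 cp results/counts.csv s3://your-example-bucket/results/counts.csv

# Copy an object from the bucket back to the notebook's local disk
aws s3 cp s3://your-example-bucket/data/sample.fastq.gz ./data/

# List the bucket contents to confirm the transfers
aws s3 ls s3://your-example-bucket/ --recursive
```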

πŸ„ You are now ready to start analyzing data!

Prerequisites: Introduction to Azure

Here are some Azure prerequisites you should complete before starting the modules. They cover the cloud computing skills you will need to run the training modules, such as launching a VM and cloning modules from GitHub. If anything looks unfamiliar, follow the link to view documentation on that subject. After reading the documentation linked below, complete the simple tasks described in each step as a knowledge check on what you just learned. Work through the steps in order to build the key skills you need for the learning modules in the next section!

After completing this prerequisite learning path, you should be able to:

  • Navigate the Azure console
  • Use Azure ML Notebooks
  • Copy data to and from a Storage Account (see the sketch after this list)
  • Use Azure Batch
  • Understand Billing
  • Use GitHub repositories
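
For example, copying data between an Azure ML notebook and a Storage Account can be done with the Azure CLI. This is a minimal sketch; the account, container, and file names are placeholders, and it assumes your login has an appropriate data role on the Storage Account.

```bash
# Upload a local file to a blob container (names are placeholders)
az storage blob upload \
    --account-name yourstorageaccount \
    --container-name training-data \
    --name sample.fastq.gz \
    --file ./data/sample.fastq.gz \
    --auth-mode login

# Download the blob back to the notebook's local disk
az storage blob download \
    --account-name yourstorageaccount \
    --container-name training-data \
    --name sample.fastq.gz \
    --file ./data/sample_copy.fastq.gz \
    --auth-mode login
```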

πŸ„ You are now ready to start analyzing data!

Prerequisites: Introduction to GCP

Here are some GCP prerequisites you should make sure you can complete before diving into the modules. They cover the cloud computing skills you will need to run the training modules, such as launching a VM and cloning modules from GitHub. If anything looks unfamiliar, follow the link to view documentation on that subject. After reading the documentation linked below, complete the simple tasks described in each step as a knowledge check on what you just learned. Work through the steps in order to build the key skills you need for the learning modules in the next section!

After completing this prerequisite learning path, you should be able to:

  • Navigate the GCP console
  • Use Vertex AI Notebook instances
  • Copy data to and from a Google Cloud Storage Bucket (see the sketch after this list)
  • Enable APIs
  • Understand Billing
  • Pull container images and launch an instance from a container
  • Use GitHub repositories
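
For example, copying data between a Vertex AI notebook instance and Cloud Storage can be done with the gcloud CLI. This is a minimal sketch; the bucket and file names are placeholders.

```bash
# Copy a local file up to a Cloud Storage bucket (bucket name is a placeholder)
gcloud storage cp ./data/sample.fastq.gz gs://your-example-bucket/data/

# Copy the object back down to the notebook's local disk
gcloud storage cp gs://your-example-bucket/data/sample.fastq.gz ./data/

# List the bucket contents to confirm the transfers
gcloud storage ls gs://your-example-bucket/data/
```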

πŸ„ You are now ready to start analyzing data!

We have organized the rest of the learning pathways by scientific topic area and ordered them by technical complexity within each pathway. Our ordering is only based on the number and complexity of cloud services used and has no bearing on the complexity of the scientific content. We recommend you begin with Introductory modules within a given scientific interest area, and then progress to the more advanced modules as you gain key skills. Listed prices are only estimates. These prices assume you are only running the module on one cloud platform and that you delete all resources after running a module.

  • Introductory modules 🌱 use only Jupyter notebooks and sometimes Cloud Storage. Upon completion, users should be comfortable starting a Jupyter notebook instance and copying data to and from Cloud Storage.
  • Advanced modules 🌳 include additional functionality such as launching a notebook from a custom Docker container (see the sketch after this list), making API calls from within the notebook to perform Batch Computing, and using GPU-enabled machine types. Upon completion of advanced modules, users should be comfortable interacting with additional cloud services from within the Jupyter interface.
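
To give a flavor of the container workflow used in advanced modules, here is a minimal sketch that pulls a public Jupyter image and launches a notebook server from it. The image, port, and mount path are generic examples, not the specific containers used by any particular module.

```bash
# Pull a public Jupyter data-science image (a generic example image)
docker pull jupyter/datascience-notebook:latest

# Launch a notebook server from the container on port 8888,
# mounting the current directory so notebooks and data persist
docker run --rm -p 8888:8888 \
    -v "$(pwd)":/home/jovyan/work \
    jupyter/datascience-notebook:latest
```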

Introduction to Biomedical Data Science

After completing this learning path, you should be able to conduct comprehensive data science analyses with a variety of bioinformatics data sources. Your skills will include using version control and creating reproducible workflows; downloading, processing, and visualizing data; identifying statistically significant variables; using GenAI chatbots; and building and evaluating machine learning models with real biomedical data.
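
For example, version control in these modules typically begins with cloning a module repository into your notebook environment. The sketch below uses this Sandbox repository; the same commands apply to any individual module repository.

```bash
# Clone the Sandbox repository into the notebook environment
git clone https://github.com/NIGMS/NIGMS-Sandbox.git
cd NIGMS-Sandbox

# Confirm the clone and review recent history
git status
git log --oneline -5
```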

Languages and Workflows: Python, R, BASH
Jupyter Notebooks: 36
Approximate Cost: $17.00

Introduction to Biomedical Machine Learning and Artificial Intelligence

After completing this learning path, you should be able to understand and apply the data science lifecycle and FAIR data principles, develop ethical AI/ML systems and interpret model decisions, prepare biomedical datasets by preprocessing, managing, and augmenting data, and implement deep learning methods, including transfer learning. You will also learn to perform statistical analyses, apply dimensionality reduction techniques to high-dimensional biomedical data, evaluate model performance, and improve models by addressing biomedical-specific data challenges.

Languages and Workflows: Python, R
Jupyter Notebooks: 60
Approximate Cost: $30.00

Introduction to Biomedical Genomics

After completing this learning path, you should be able to manage and analyze large genomic and omics datasets using cloud computing, perform quality control and become familiar with Nextflow pipelines, conduct and interpret differential analyses, and visualize multi-omics datasets to derive biological insights. You will also apply statistical methods and computational tools for genomic analysis, and retrieve and manage data from biological databases.
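
As a taste of the Nextflow workflows used in this pathway, the minimal sketch below launches a community nf-core pipeline on its built-in test profile. The pipeline name and profiles are illustrative; each module's notebooks specify the exact pipelines and parameters to use.

```bash
# Run a community nf-core RNA-seq pipeline on its small test dataset
# (illustrative only; module notebooks provide their own pipelines and inputs)
nextflow run nf-core/rnaseq -profile test,docker --outdir results/

# Review the execution history after the run completes
nextflow log
```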

Languages and Workflows: BASH, R, Nextflow
Jupyter Notebooks: 22
Approximate Cost: $14.00

Introduction to Metagenomics and Phylogenetics

After completing this learning path, you should be able to:

  • analyze microbial data to perform taxonomic classification, diversity, and differential abundance analyses
  • use comparative genomics and phylogenetics to interpret genome sequences and evolutionary relationships
  • implement cloud-based bioinformatics workflows, including pipeline configuration, resource scheduling, and scalable analysis
  • access, retrieve, and manage genomic data from biological databases and sequencing sources
  • assemble, annotate, and analyze genome sequences using bioinformatics tools and cloud resources
  • apply quality control techniques and standardized bioinformatic data processing workflows for reproducibility
  • visualize genomic, metagenomic, and microbial community data
  • relate genomic, microbial, and phylogenetic findings to metadata, geographic context, and biological interpretation
  • leverage machine learning alongside comparative genomics and population genomics in genomic analyses and pangenome studies

Languages and Workflows: BASH, R, Nextflow
Jupyter Notebooks: 29
Approximate Cost: $22.00

Introduction to Proteomics

After completing this learning path, you should be comfortable analyzing bioinformatics data on the command line in a cloud environment and integrating -omics data to derive biological insight. Specifically, you should be able to:

  • normalize and perform differential analysis on proteomics data, including QC, management of missing values, and handling of batch effects
  • prepare, visualize, and interpret molecular data and simulations using bioinformatics tools like PyMOL
  • manage and execute computational biology software and docking simulations in containerized cloud environments
  • apply cheminformatics and machine learning approaches to analyze molecular interactions, ligand binding, and protein structures
  • execute automated protein docking workflows via cloud platform APIs and services
  • understand binding affinity, equilibrium, ligand residence, and non-covalent interactions in protein-ligand complexes
  • perform macromolecular structure determination, applying X-ray crystallography phasing and molecular replacement techniques
  • analyze protein structure-function relationships and relate protein properties to biochemical conditions
  • critically evaluate, interpret, visualize, and communicate statistical and scientific results using interactive computational resources
  • use cloud computing resources efficiently for biomedical, bioinformatics, and proteomics data science analyses

These skills can be applied to a wide variety of -omics datasets in the subsequent sections.

Languages and Workflows: Python, R
Jupyter Notebooks: 19
Approximate Cost: $11.00

Introduction to RNAseq and Transcriptome Assembly

After completing this learning path, you should be able to perform a full analysis of RNA-seq data, including assembling a transcriptome and identifying differentially expressed genes. Specifically, you should be able to:

  • analyze MeRIP-seq and RNA-seq data, including differential methylation and gene expression analyses
  • perform quality control, preprocessing, adapter trimming, alignment, and quantification of sequencing data
  • use cloud computing environments to manage genomic datasets and run reproducible bioinformatics pipelines (Nextflow, Snakemake, containerization)
  • visualize genomic data with tools such as IGV and with PCA, volcano, MA, and heatmap plots in interactive platforms like Jupyter Notebooks
  • apply statistical methods, normalization, and dimensionality reduction techniques to genomic datasets
  • identify, annotate, and interpret genomic peaks such as m6A sites and relate methylation to gene expression changes
  • retrieve, manage, and analyze genomic data from public repositories and cloud storage solutions
  • perform transcriptome assembly and analysis, and explore gene regulatory networks and transcription factor activity
  • design and interpret experimental setups for differential expression and methylation analyses
  • investigate biological pathways and functional enrichment related to genomic analyses, leveraging databases and bioinformatics tools

These are computationally intensive analyses that the cloud enables you to run in a scalable manner.
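
As an illustration of the quality control, trimming, and quantification steps described above, here is a minimal sketch using common open-source tools. The file names and index path are placeholders, and the modules may use different tools or wrap these steps in Nextflow or Snakemake pipelines.

```bash
# Create output directories for each step
mkdir -p qc trimmed quants

# Quality control reports for a pair of FASTQ files (file names are placeholders)
fastqc sample_R1.fastq.gz sample_R2.fastq.gz -o qc/

# Adapter and quality trimming of the paired reads
trim_galore --paired sample_R1.fastq.gz sample_R2.fastq.gz -o trimmed/

# Quantify transcript abundance against a prebuilt salmon index (path is a placeholder)
salmon quant -i salmon_index -l A \
    -1 trimmed/sample_R1_val_1.fq.gz -2 trimmed/sample_R2_val_2.fq.gz \
    -p 4 -o quants/sample
```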

Languages and Workflows: Bash, Python, R, Nextflow, Snakemake
Jupyter Notebooks: 33
Approximate Cost: $29.00

That concludes our section on Learning Pathways. The rest of the README walks you through some of the technical details of each module, focusing in particular on compute environments and machine types, as well as additional resources to help you continue your learning journey!
