This repository aims to teach students, researchers, and clinicians, among others, how to utilize the power of cloud technology for the benefit of life sciences applications and research. Here we present 26 cloud learning modules that represent unique use cases or scientific workflows. Types of data used across the modules include, but are not limited to, genomics, methylomics, transcriptomics, proteomics, and medical imaging data across formats such as FASTA/FASTQ, SAM, BAM, CSV, PNG, and DICOM. Learning modules cover areas from introductory material to single-omics approaches, multi-omics techniques, single-cell analysis, metagenomics, and AI/ML imaging applications.
These modules run on Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure (Azure). All modules will be available across the three cloud platforms in 2027, but you may notice that a given module is only available on one or two of these platforms at this time. You can run these modules in any cloud account, but we encourage users to request access to an NIH Cloud Lab account for an optimal experience.
To get started with any of the cloud platforms, visit the NIH Cloud Lab Jumpstart Pages for AWS, Azure, or Google Cloud, or visit the tutorial pages: AWS, Azure, GCP.
If you require support at any time, please open an issue on GitHub for the module in question, or send us an informative email at CloudLab@nih.gov.
The 26 topics and their authors are listed here. If you would like guidance on what order to complete them in, jump to the recommended learning pathways in the next section.
- Analysis of Biomedical Data for Biomarker Discovery - University of Rhode Island
- ATAC-Seq and Single Cell ATAC-Seq Analysis - University of Nebraska Medical Center
- Biomedical Imaging Analysis using AI/ML approaches - University of Arkansas
- Chromatin Occupancy with Cut and Run - University of Nebraska Medical Center
- Consensus Pathway Analysis in the Cloud - University of Nevada, Reno
- DNA Methylation Sequencing Analysis with WGBS - University of Hawai'i at Manoa
- Explore RNA methylation using MeRIP-seq - University of Hawai'i at Manoa
- Fundamentals of Bioinformatics - Dartmouth College
- Comparative Prokaryotic Genomics - University of New Hampshire
- Identifying Protein-Protein Interactions with ML Methods - Georgia Institute of Technology
- Integrating Multi-Omics Datasets - University of North Dakota
- Introduction to Data Science for Biology - San Francisco State University
- Introduction to Python for Biology - Northern Nazarene University
- Introduction to R and LLMs for Biology - Duke University
- Metagenomics Analysis of Biofilm-Microbiome - University of South Dakota
- Introduction to Amplicon-based Metagenomics - University of Nevada, Reno
- Introduction to Pangenomic Methods - National Center for Genome Resources
- Introduction to Phylogenetics - University of South Dakota
- Python and ML for Biomedical Data Science - University of Delaware
- Protein Crystal Data Collection for Solving Protein Structure - The University of Southern Mississippi
- Proteome Quantification - University of Arkansas for Medical Sciences
- Introduction to Population Genomics - University of Wyoming
- RNAseq Differential Expression Analysis - University of Maine
- scRNASeq, miRNASeq, and Transcription Factors - The University of Maine
- Structural Biology and Drug Discovery - Louisiana State University
- Transcriptome Assembly Refinement and Applications - MDI Biological Laboratory
✨ We put together these learning pathways to help orient you to using the Sandbox modules. Before starting on any of the individual modules, we recommend you complete all the steps in the Prerequisites section for your respective cloud provider and only continue once you are able to check off these key skills.
Here are some AWS prerequisites you should make sure are completed before starting the modules. These will give you the necessary cloud computing skills to run the training modules such as launching a VM, cloning modules from GitHub, and more. If anything looks unfamiliar, follow the link to view documentation on that subject. After reading the documentation in the links below, complete the simple tasks described in the steps below as a knowledge check on what you just learned. Complete each step in order to learn the key skills you need to complete the learning modules in the next section!
- STEP 1: Navigate to SageMaker AI and Launch a Notebook
- STEP 2: Change your Notebook Kernel
- STEP 3: Launch AWS Batch or set up AWS Batch
- STEP 4: Read our Billing Guide and set budget alerts for 25%, 50%, and 75%.
- STEP 5: Review this overview of building Docker containers and pushing to the Elastic Container Registry. Try to set up a JupyterLab instance using SageMaker Studio from a custom Docker image using this guide.
- STEP 6: Clone this GitHub repository into a SageMaker AI notebook using the command line or the SageMaker AI user interface.
- STEP 7: Review how to open a GitHub issue. If you have a question or a suggested enhancement, feel free to open an issue for this repository or for the module you are having trouble with. You can also email us at CloudLab@nih.gov.
After completing this prerequisite learning path, you should be able to:
- Navigate the AWS console
- Use SageMaker AI Notebooks
- Copy data to and from an AWS S3 Storage Bucket
- Enable AWS Batch
- Understand Billing
- Pull container images and launch an instance from a container
- Use GitHub repositories.
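The 25%, 50%, and 75% alert levels from STEP 4 are easier to reason about as dollar amounts. The sketch below is only an illustration: the function name `budget_alert_thresholds` and the $500 budget are hypothetical, and the alerts themselves are still created in your cloud provider's billing console as described in the Billing Guide.

```python
def budget_alert_thresholds(monthly_budget_usd, percents=(25, 50, 75)):
    """Return {percent: dollar amount} for each budget alert level."""
    if monthly_budget_usd <= 0:
        raise ValueError("monthly_budget_usd must be positive")
    return {p: round(monthly_budget_usd * p / 100, 2) for p in percents}


# Hypothetical budget; substitute the budget assigned to your own account.
thresholds = budget_alert_thresholds(500)
for percent, amount in thresholds.items():
    print(f"Alert at {percent}% of budget: ${amount:.2f}")
```

Passing a different `percents` tuple lets you try other alert levels before you configure them.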
You are now ready to start analyzing data!
Here are some Azure prerequisites you should make sure are completed before starting the modules. These will give you the necessary cloud computing skills to run the training modules such as launching a VM, cloning modules from GitHub, and more. If anything looks unfamiliar, follow the link to view documentation on that subject. After reading the documentation in the links below, complete the simple tasks described in the steps below as a knowledge check on what you just learned. Complete each step in order to learn the key skills you need to complete the learning modules in the next section!
- STEP 1: Create a Resource Group
- STEP 2: Navigate to Azure ML and Launch a Notebook
- STEP 3: Change your Notebook Kernel
- STEP 4: Launch Azure Batch and submit a test job
- STEP 5: Set budget alerts for 25%, 50%, and 75%.
- STEP 6: Clone this GitHub repository into an Azure ML notebook using the command line or the user interface.
- STEP 7: Review how to open a GitHub issue. If you have a question or a suggested enhancement, feel free to open an issue for this repository or for the module you are having trouble with. You can also email us at CloudLab@nih.gov.
After completing this prerequisite learning path, you should be able to:
- Navigate the Azure console
- Use Azure ML Notebooks
- Copy data to and from a Storage Account
- Use Azure Batch
- Understand Billing
- Use GitHub repositories.
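For the command-line route in STEP 6, cloning comes down to a single `git clone` call from a notebook terminal. The sketch below only builds and prints that command. The repository URL and the `sandbox-modules` destination folder are hypothetical placeholders, so replace them with the URL shown on this repository's GitHub page.

```python
import shlex


def git_clone_command(repo_url, dest_dir=None):
    """Build the argument list for `git clone`, optionally into dest_dir."""
    cmd = ["git", "clone", repo_url]
    if dest_dir:
        cmd.append(dest_dir)
    return cmd


# Hypothetical placeholder URL; use the real URL of the repository you want.
REPO_URL = "https://github.com/EXAMPLE-ORG/EXAMPLE-REPO.git"
cmd = git_clone_command(REPO_URL, "sandbox-modules")
print(shlex.join(cmd))
```

To actually run the command from Python, pass the list to `subprocess.run(cmd, check=True)`, or paste the printed line into the notebook terminal.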
You are now ready to start analyzing data!
Here are some GCP prerequisites you should make sure you can complete before diving into the modules. These will give you the necessary cloud computing skills to run the training modules such as launching a VM, cloning modules from GitHub, and more. If anything looks unfamiliar, follow the link to view documentation on that subject. After reading the documentation in the links below, complete the simple tasks described in the steps below as a knowledge check on what you just learned. Complete each step in order to learn the key skills you need to complete the learning modules in the next section!
- STEP 1: Select your GCP Project
- STEP 2: Navigate to Vertex AI and Launch an Instance. Enable idle-shutdown after 15 minutes of inactivity.
- STEP 3: Change your Notebook Kernel
- STEP 4: Enable the Google Batch and BigQuery APIs
- STEP 5: Read our Billing Guide. Generate a billing report for the last 30 days.
- STEP 6: Review this overview of pushing and pulling containers. Try to spin up a Vertex AI notebook from this container path: `us-east4-docker.pkg.dev/nih-cl-shared-resources/nigms-sandbox/nvidiaforVertex AI-rapids-22.12-cuda11.5-runtime-ubuntu20.04-py3.9@sha256:bb6703315633f21281e8caceed811f74822564a63ede01953664fe8d58b0c658`. Review these instructions for help.
- STEP 7: Clone this GitHub repository into a Vertex AI Notebook Instance using the command line or the Vertex AI user interface.
- STEP 8: Review how to open a GitHub issue. If you have a question or a suggested enhancement, feel free to open an issue for this repository or for the module you are having trouble with. You can also email us at CloudLab@nih.gov.
After completing this prerequisite learning path, you should be able to:
- Navigate the GCP console
- Use Vertex AI Notebook instances
- Copy data to and from a Google Cloud Storage Bucket
- Enable APIs
- Understand Billing
- Pull container images and launch an instance from a container
- Use GitHub repositories.
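Copying data to and from a Google Cloud Storage bucket is usually done with `gcloud storage cp` from a notebook terminal. The sketch below only builds the commands for an upload and a download. The bucket name and file names are hypothetical, and the helper `storage_copy_command` is an illustration rather than part of any module.

```python
import shlex


def storage_copy_command(source, destination):
    """Build a `gcloud storage cp` command; one side must be a gs:// URL."""
    if not (source.startswith("gs://") or destination.startswith("gs://")):
        raise ValueError("either source or destination must be a gs:// URL")
    return ["gcloud", "storage", "cp", source, destination]


BUCKET = "gs://example-bucket"  # hypothetical bucket name

# Upload a local results file, then download a sample input file.
upload = storage_copy_command("results.csv", f"{BUCKET}/results/results.csv")
download = storage_copy_command(f"{BUCKET}/data/sample.fastq", "sample.fastq")

for cmd in (upload, download):
    print(shlex.join(cmd))
```

The same pattern applies on the other platforms with their own command-line tools. Only the command name and URL scheme change.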
You are now ready to start analyzing data!
We have organized the rest of the learning pathways by scientific topic area and ordered them by technical complexity within each pathway. Our ordering is based only on the number and complexity of cloud services used and has no bearing on the complexity of the scientific content. We recommend you begin with Introductory modules within a given scientific interest area, and then progress to the more advanced modules as you gain key skills. Listed prices are only estimates. These prices assume you are running the module on only one cloud platform and that you delete all resources after running a module.
- Introductory modules 🌱 use only Jupyter notebooks and sometimes Cloud Storage. Upon completion, users should be comfortable starting a Jupyter notebook instance and with copying data to and from Cloud Storage.
- Advanced modules 🌳 include additional functionality such as launching a notebook from a custom Docker container, making API calls from within the notebook to perform Batch Computing, and using GPU-enabled machine types. Upon completion of advanced modules, users should be comfortable interacting with additional cloud services from within the Jupyter interface.
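If you plan to work through more than one pathway, it can help to total the approximate costs listed in the pathway tables in this section. The sketch below copies those estimates into a dictionary. The shortened pathway names and the helper `total_cost` are illustrative, and the figures remain estimates that assume a single cloud platform and full cleanup after each module.

```python
# Approximate per-pathway costs (USD), copied from the pathway tables.
PATHWAY_COSTS_USD = {
    "Biomedical Data Science": 17.00,
    "Biomedical Machine Learning and Artificial Intelligence": 30.00,
    "Biomedical Genomics": 14.00,
    "Metagenomics and Phylogenetics": 22.00,
    "Proteomics": 11.00,
    "RNAseq and Transcriptome Assembly": 29.00,
}


def total_cost(pathways):
    """Sum the estimated cost of the selected pathways."""
    return sum(PATHWAY_COSTS_USD[name] for name in pathways)


selected = ["Biomedical Data Science", "Proteomics"]
print(f"Selected pathways: ${total_cost(selected):.2f}")
print(f"All pathways: ${total_cost(PATHWAY_COSTS_USD):.2f}")
```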
Introduction to Biomedical Data Science
- 🌱 Fundamentals of Bioinformatics - Dartmouth College
- 🌱 Introduction to Data Science for Biology - San Francisco State University
- 🌱 Introduction to Python for Biology - Northern Nazarene University
- 🌱 Introduction to R and LLMs for Biology - Duke University
After completing this learning path, you should be able to conduct comprehensive data science analysis with a variety of bioinformatics data sources. Your skills now include using version control and creating reproducible workflows; downloading, processing, and visualizing data; identifying statistically significant variables; using GenAI chatbots; and building and evaluating machine learning models with real biomedical data.
Specs | Details |
---|---|
Languages and Workflows | Python, R, BASH |
Jupyter Notebooks | 36 |
Approximate Cost | $17.00 |
Introduction to Biomedical Machine Learning and Artificial Intelligence
- 🌱 Python and ML for Biomedical Data Science - University of Delaware
- 🌱 Analysis of Biomedical Data for Biomarker Discovery - University of Rhode Island
- 🌳 Biomedical Imaging Analysis using AI/ML approaches - University of Arkansas
After completing this learning path, you should understand and apply the data science lifecycle and FAIR data principles, develop ethical AI/ML systems and interpret model decisions, prepare biomedical datasets by preprocessing, managing, and augmenting data, and implement deep learning methods including transfer learning and models. You will also learn to perform statistical analyses, utilize dimensionality reduction techniques on high-dimensional biomedical data, evaluate model performance, and improve models by addressing biomedical-specific data challenges.
Specs | Details |
---|---|
Languages and Workflows | Python, R |
Jupyter Notebooks | 60 |
Approximate Cost | $30.00 |
Introduction to Biomedical Genomics
- 🌱 Consensus Pathway Analysis in the Cloud - University of Nevada, Reno
- 🌳 DNA Methylation Sequencing Analysis with WGBS - University of Hawai'i at Manoa
- 🌳 ATAC-Seq and Single Cell ATAC-Seq Analysis - University of Nebraska Medical Center
- 🌳 Chromatin Occupancy with Cut and Run - University of Nebraska Medical Center
- 🌳 Integrating Multi-Omics Datasets - University of North Dakota
After completing this learning path, you should be able to manage and analyze large genomic and omics datasets using cloud computing, perform quality control and become familiar with Nextflow pipelines, conduct and interpret differential analyses, and visualize multi-omics datasets to derive biological insights. You will also apply statistical methods and computational tools for genomic analysis, and retrieve and manage data from biological databases.
Specs | Details |
---|---|
Languages and Workflows | BASH, R, Nextflow |
Jupyter Notebooks | 22 |
Approximate Cost | $14.00 |
Introduction to Metagenomics and Phylogenetics
- 🌱 Introduction to Amplicon-based Metagenomics - University of Nevada, Reno
- 🌱 Introduction to Phylogenetics - University of South Dakota
- 🌱 Introduction to Population Genomics - University of Wyoming
- 🌳 Comparative Prokaryotic Genomics - University of New Hampshire
- 🌳 Introduction to Pangenomic Methods - National Center for Genome Resources
- 🌳 Metagenomics Analysis of Biofilm-Microbiome - University of South Dakota
After completing this learning path, you should know how to analyze microbial data to perform taxonomic classification using diversity and differential abundance analyses, utilize comparative genomics and phylogenetics to interpret genome sequences and evolutionary relationships, and implement cloud-based bioinformatics workflows, including pipeline configuration, resource scheduling, and scalable analysis. Further, you will be able to access, retrieve, and manage genomic data from biological databases and sequencing sources; assemble, annotate, and analyze genome sequences utilizing bioinformatics tools and cloud resources; and apply quality control techniques and standardized bioinformatic data processing workflows for reproducibility. Finally, you will be able to visualize genomic, metagenomic, and microbial community data; relate genomic, microbial, and phylogenetic findings to metadata, geographic context, and biological interpretation; and leverage machine learning with comparative genomics and population genomics in genomic analyses and pangenome studies.
Specs | Details |
---|---|
Languages and Workflows | BASH, R, Nextflow |
Jupyter Notebooks | 29 |
Approximate Cost | $22.00 |
Introduction to Proteomics
- 🌱 Proteome Quantification - University of Arkansas for Medical Sciences
- 🌱 Proteome Structures and Docking - University of Arkansas for Medical Sciences
After completing this learning path, you should be comfortable analyzing bioinformatics data on the command line in a cloud environment and integrating omics data to gain biological insight. In particular, you should be able to:
- Normalize and perform differential analysis on proteomics data, including QC, management of missing values, and understanding batch effects
- Prepare, visualize, and interpret molecular data and simulations using bioinformatics tools like PyMOL
- Manage and execute computational biology software and docking simulations in containerized cloud environments
- Apply cheminformatics and machine learning approaches to analyze molecular interactions, ligand binding, and protein structures
- Execute automated protein docking computational workflows via cloud platform APIs and services
- Understand binding affinity, equilibrium, ligand residence, and non-covalent interactions in protein-ligand complexes
- Perform macromolecular structure determination applying X-ray crystallography, phasing, and molecular replacement techniques
- Analyze protein structure-function relationships and relate protein properties to biochemical conditions
- Evaluate, interpret, visualize, and communicate statistical and scientific results critically using interactive computational resources
- Utilize cloud computing resources efficiently for biomedical, bioinformatics, and proteomics data science analyses.

These skills can be applied to a wide variety of omics datasets in the subsequent sections.
Specs | Details |
---|---|
Languages and Workflows | Python, R |
Jupyter Notebooks | 19 |
Approximate Cost | $11.00 |
Introduction to RNAseq and Transcriptome Assembly
- 🌳 Explore RNA methylation using MeRIP-seq - University of Hawai'i at Manoa
- 🌳 RNAseq Differential Expression Analysis - University of Maine
- 🌳 Transcriptome Assembly Refinement and Applications - MDI Biological Laboratory
- 🌳 scRNASeq, miRNASeq, and Transcription Factors - The University of Maine
After completing this learning path, you should be able to do a full analysis of RNA-seq data, including assembling a transcriptome and identifying differentially expressed genes. In particular, you should be able to:
- Analyze MeRIP-seq and RNA-seq data, including differential methylation and gene expression analyses
- Perform quality control, preprocessing, adapter trimming, alignment, and quantification of sequencing data
- Utilize cloud computing environments for managing genomic datasets and running reproducible bioinformatics pipelines (Nextflow, Snakemake, containerization)
- Visualize genomic data with IGV and with PCA, volcano, MA, and heatmap plots in interactive platforms like Jupyter Notebooks
- Apply statistical methods, normalization, and dimensionality reduction techniques to genomic datasets
- Identify, annotate, and interpret genomic peaks such as m6A sites, and relate methylation to gene expression changes
- Retrieve, manage, and analyze genomic data from public repositories and cloud storage solutions
- Perform transcriptome assembly and transcriptome analysis, and explore gene regulatory networks and transcription factor activity
- Design and interpret experimental setups for differential expression and methylation analyses
- Investigate biological pathways and functional enrichment related to genomic analyses, leveraging databases and bioinformatics tools.

These analyses are computationally intensive, and the cloud lets you run them at scale.
Specs | Details |
---|---|
Languages and Workflows | BASH, Python, R, Nextflow, Snakemake |
Jupyter Notebooks | 33 |
Approximate Cost | $29.00 |

That concludes our section on Learning Pathways. The rest of the README will walk you through some of the technical details of each module, in particular focusing on compute environments and machine types, as well as additional resources to help you continue your learning journey!