National Institute of General Medical Sciences Cloud Learning Modules

Introduction

This repository aims to teach students, researchers, and clinicians, among others, how to harness cloud technology for life sciences applications and research. Here we present 26 cloud learning modules, each representing a unique use case or scientific workflow. Types of data used across the modules include, but are not limited to, genomics, methylomics, transcriptomics, proteomics, and medical imaging data in formats such as FASTA/FASTQ, SAM, BAM, CSV, PNG, and DICOM. The learning modules range from introductory material to single-omics approaches, multi-omics techniques, single-cell analysis, metagenomics, and AI/ML imaging applications.

These modules run on Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure. All modules will be available across all three cloud platforms in 2027, but you may notice that a given module is currently available on only one or two of them. You can run these modules in any cloud account, but we encourage users to request access to an NIH Cloud Lab account for the best experience.

To get started with any of the cloud platforms, visit the NIH Cloud Lab Jumpstart Pages for AWS, Azure, or Google Cloud, or visit the tutorial pages: AWS, Azure, GCP.

❗ If you require support at any time, please open a GitHub issue on the module in question, or send us an email describing the problem at CloudLab@nih.gov.


Table of Contents

Available Modules

The 26 topics and their authors are listed here. If you would like guidance on what order to complete them in, jump to the recommended learning pathways in the next section.

Recommended Learning Pathways

✨ We put together these learning pathways to help orient you to using the Sandbox modules. Before starting on any of the individual modules, we recommend you complete all the steps in the Prerequisites section for your respective cloud provider and only continue once you are able to check off these key skills.

Prerequisites: Introduction to AWS

Here are some AWS prerequisites you should complete before starting the modules. They cover the cloud computing skills you will need to run the training modules, such as launching a VM and cloning modules from GitHub. If anything looks unfamiliar, follow the link to view documentation on that subject. After reading the documentation linked below, complete the simple tasks described in each step as a knowledge check on what you just learned. Work through the steps in order to build the key skills you need for the learning modules in the next section!

After completing this prerequisite learning path, you should be able to:

  • Navigate the AWS console
  • Use SageMaker AI Notebooks
  • Copy data to and from an AWS S3 bucket (see the sketch after this list)
  • Enable AWS Batch
  • Understand Billing
  • Pull container images and launch an instance from a container
  • Use GitHub repositories
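
For example, copying data between a SageMaker notebook instance and S3 can be done with the AWS CLI. This is a minimal sketch; the bucket and file names below are placeholders, not resources created by the modules.

```bash
# Copy a local file up to an S3 bucket (bucket name is a placeholder)
aws s3 cp results/counts.csv s3://your-example-bucket/results/counts.csv

# Copy an object from the bucket back to the notebook's local disk
aws s3 cp s3://your-example-bucket/data/sample.fastq.gz ./data/

# List the bucket contents to confirm the transfers
aws s3 ls s3://your-example-bucket/ --recursive
```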

πŸ„ You are now ready to start analyzing data!

Prerequisites: Introduction to Azure

Here are some Azure prerequisites you should complete before starting the modules. They cover the cloud computing skills you will need to run the training modules, such as launching a VM and cloning modules from GitHub. If anything looks unfamiliar, follow the link to view documentation on that subject. After reading the documentation linked below, complete the simple tasks described in each step as a knowledge check on what you just learned. Work through the steps in order to build the key skills you need for the learning modules in the next section!

After completing this prerequisite learning path, you should be able to:

  • Navigate the Azure console
  • Use Azure ML Notebooks
  • Copy data to and from a Storage Account (see the sketch after this list)
  • Use Azure Batch
  • Understand Billing
  • Use GitHub repositories
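
For example, copying data between an Azure ML notebook and a Storage Account can be done with the Azure CLI. This is a minimal sketch; the account, container, and file names are placeholders, and it assumes your login has an appropriate data role on the Storage Account.

```bash
# Upload a local file to a blob container (names are placeholders)
az storage blob upload \
    --account-name yourstorageaccount \
    --container-name training-data \
    --name sample.fastq.gz \
    --file ./data/sample.fastq.gz \
    --auth-mode login

# Download the blob back to the notebook's local disk
az storage blob download \
    --account-name yourstorageaccount \
    --container-name training-data \
    --name sample.fastq.gz \
    --file ./data/sample_copy.fastq.gz \
    --auth-mode login
```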

πŸ„ You are now ready to start analyzing data!

Prerequisites: Introduction to GCP

Here are some GCP prerequisites you should make sure you can complete before diving into the modules. They cover the cloud computing skills you will need to run the training modules, such as launching a VM and cloning modules from GitHub. If anything looks unfamiliar, follow the link to view documentation on that subject. After reading the documentation linked below, complete the simple tasks described in each step as a knowledge check on what you just learned. Work through the steps in order to build the key skills you need for the learning modules in the next section!

After completing this prerequisite learning path, you should be able to:

  • Navigate the GCP console
  • Use Vertex AI Notebook instances
  • Copy data to and from a Google Cloud Storage Bucket (see the sketch after this list)
  • Enable APIs
  • Understand Billing
  • Pull container images and launch an instance from a container
  • Use GitHub repositories
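
For example, copying data between a Vertex AI notebook instance and Cloud Storage can be done with the gcloud CLI. This is a minimal sketch; the bucket and file names are placeholders.

```bash
# Copy a local file up to a Cloud Storage bucket (bucket name is a placeholder)
gcloud storage cp ./data/sample.fastq.gz gs://your-example-bucket/data/

# Copy the object back down to the notebook's local disk
gcloud storage cp gs://your-example-bucket/data/sample.fastq.gz ./data/

# List the bucket contents to confirm the transfers
gcloud storage ls gs://your-example-bucket/data/
```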

πŸ„ You are now ready to start analyzing data!

We have organized the rest of the learning pathways by scientific topic area and ordered them by technical complexity within each pathway. Our ordering is only based on the number and complexity of cloud services used and has no bearing on the complexity of the scientific content. We recommend you begin with Introductory modules within a given scientific interest area, and then progress to the more advanced modules as you gain key skills. Listed prices are only estimates. These prices assume you are only running the module on one cloud platform and that you delete all resources after running a module.

  • Introductory modules 🌱 use only Jupyter notebooks and sometimes Cloud Storage. Upon completion, users should be comfortable starting a Jupyter notebook instance and copying data to and from Cloud Storage.
  • Advanced modules 🌳 include additional functionality such as launching a notebook from a custom Docker container (see the sketch after this list), making API calls from within the notebook to perform Batch Computing, and using GPU-enabled machine types. Upon completion of advanced modules, users should be comfortable interacting with additional cloud services from within the Jupyter interface.
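
To give a flavor of the container workflow used in advanced modules, here is a minimal sketch that pulls a public Jupyter image and launches a notebook server from it. The image, port, and mount path are generic examples, not the specific containers used by any particular module.

```bash
# Pull a public Jupyter data-science image (a generic example image)
docker pull jupyter/datascience-notebook:latest

# Launch a notebook server from the container on port 8888,
# mounting the current directory so notebooks and data persist
docker run --rm -p 8888:8888 \
    -v "$(pwd)":/home/jovyan/work \
    jupyter/datascience-notebook:latest
```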

Introduction to Biomedical Data Science

After completing this learning path, you should be able to conduct comprehensive data science analyses with a variety of bioinformatics data sources. Your skills will include using version control and creating reproducible workflows; downloading, processing, and visualizing data; identifying statistically significant variables; using GenAI chatbots; and building and evaluating machine learning models with real biomedical data.
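
For example, version control in these modules typically begins with cloning a module repository into your notebook environment. The sketch below uses this Sandbox repository; the same commands apply to any individual module repository.

```bash
# Clone the Sandbox repository into the notebook environment
git clone https://github.com/NIGMS/NIGMS-Sandbox.git
cd NIGMS-Sandbox

# Confirm the clone and review recent history
git status
git log --oneline -5
```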

Languages and Workflows: Python, R, BASH
Jupyter Notebooks: 36
Approximate Cost: $17.00

Introduction to Biomedical Machine Learning and Artificial Intelligence

After completing this learning path, you should be able to understand and apply the data science lifecycle and FAIR data principles, develop ethical AI/ML systems and interpret model decisions, prepare biomedical datasets by preprocessing, managing, and augmenting data, and implement deep learning methods, including transfer learning. You will also learn to perform statistical analyses, apply dimensionality reduction techniques to high-dimensional biomedical data, evaluate model performance, and improve models by addressing biomedical-specific data challenges.

Languages and Workflows: Python, R
Jupyter Notebooks: 60
Approximate Cost: $30.00

Introduction to Biomedical Genomics

After completing this learning path, you should be able to manage and analyze large genomic and omics datasets using cloud computing, perform quality control and become familiar with Nextflow pipelines, conduct and interpret differential analyses, and visualize multi-omics datasets to derive biological insights. You will also apply statistical methods and computational tools for genomic analysis, and retrieve and manage data from biological databases.
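
As a taste of the Nextflow workflows used in this pathway, the minimal sketch below launches a community nf-core pipeline on its built-in test profile. The pipeline name and profiles are illustrative; each module's notebooks specify the exact pipelines and parameters to use.

```bash
# Run a community nf-core RNA-seq pipeline on its small test dataset
# (illustrative only; module notebooks provide their own pipelines and inputs)
nextflow run nf-core/rnaseq -profile test,docker --outdir results/

# Review the execution history after the run completes
nextflow log
```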

Languages and Workflows: BASH, R, Nextflow
Jupyter Notebooks: 22
Approximate Cost: $14.00

Introduction to Metagenomics and Phylogenetics

After completing this learning path, you should be able to:

  • analyze microbial data to perform taxonomic classification, diversity, and differential abundance analyses
  • use comparative genomics and phylogenetics to interpret genome sequences and evolutionary relationships
  • implement cloud-based bioinformatics workflows, including pipeline configuration, resource scheduling, and scalable analysis
  • access, retrieve, and manage genomic data from biological databases and sequencing sources
  • assemble, annotate, and analyze genome sequences using bioinformatics tools and cloud resources
  • apply quality control techniques and standardized bioinformatic data processing workflows for reproducibility
  • visualize genomic, metagenomic, and microbial community data
  • relate genomic, microbial, and phylogenetic findings to metadata, geographic context, and biological interpretation
  • leverage machine learning alongside comparative genomics and population genomics in genomic analyses and pangenome studies

Languages and Workflows: BASH, R, Nextflow
Jupyter Notebooks: 29
Approximate Cost: $22.00

Introduction to Proteomics

After completing this learning path, you should be comfortable analyzing bioinformatics data on the command line in a cloud environment and integrating -omics data to derive biological insight. Specifically, you should be able to:

  • normalize and perform differential analysis on proteomics data, including QC, management of missing values, and handling of batch effects
  • prepare, visualize, and interpret molecular data and simulations using bioinformatics tools like PyMOL
  • manage and execute computational biology software and docking simulations in containerized cloud environments
  • apply cheminformatics and machine learning approaches to analyze molecular interactions, ligand binding, and protein structures
  • execute automated protein docking workflows via cloud platform APIs and services
  • understand binding affinity, equilibrium, ligand residence, and non-covalent interactions in protein-ligand complexes
  • perform macromolecular structure determination, applying X-ray crystallography phasing and molecular replacement techniques
  • analyze protein structure-function relationships and relate protein properties to biochemical conditions
  • critically evaluate, interpret, visualize, and communicate statistical and scientific results using interactive computational resources
  • use cloud computing resources efficiently for biomedical, bioinformatics, and proteomics data science analyses

These skills can be applied to a wide variety of -omics datasets in the subsequent sections.

Languages and Workflows: Python, R
Jupyter Notebooks: 19
Approximate Cost: $11.00

Introduction to RNAseq and Transcriptome Assembly

After completing this learning path, you should be able to perform a full analysis of RNA-seq data, including assembling a transcriptome and identifying differentially expressed genes. Specifically, you should be able to:

  • analyze MeRIP-seq and RNA-seq data, including differential methylation and gene expression analyses
  • perform quality control, preprocessing, adapter trimming, alignment, and quantification of sequencing data
  • use cloud computing environments to manage genomic datasets and run reproducible bioinformatics pipelines (Nextflow, Snakemake, containerization)
  • visualize genomic data with tools such as IGV and with PCA, volcano, MA, and heatmap plots in interactive platforms like Jupyter Notebooks
  • apply statistical methods, normalization, and dimensionality reduction techniques to genomic datasets
  • identify, annotate, and interpret genomic peaks such as m6A sites and relate methylation to gene expression changes
  • retrieve, manage, and analyze genomic data from public repositories and cloud storage solutions
  • perform transcriptome assembly and analysis, and explore gene regulatory networks and transcription factor activity
  • design and interpret experimental setups for differential expression and methylation analyses
  • investigate biological pathways and functional enrichment related to genomic analyses, leveraging databases and bioinformatics tools

These are computationally intensive analyses that the cloud enables you to run in a scalable manner.
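
As an illustration of the quality control, trimming, and quantification steps described above, here is a minimal sketch using common open-source tools. The file names and index path are placeholders, and the modules may use different tools or wrap these steps in Nextflow or Snakemake pipelines.

```bash
# Create output directories for each step
mkdir -p qc trimmed quants

# Quality control reports for a pair of FASTQ files (file names are placeholders)
fastqc sample_R1.fastq.gz sample_R2.fastq.gz -o qc/

# Adapter and quality trimming of the paired reads
trim_galore --paired sample_R1.fastq.gz sample_R2.fastq.gz -o trimmed/

# Quantify transcript abundance against a prebuilt salmon index (path is a placeholder)
salmon quant -i salmon_index -l A \
    -1 trimmed/sample_R1_val_1.fq.gz -2 trimmed/sample_R2_val_2.fq.gz \
    -p 4 -o quants/sample
```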

Languages and Workflows: Bash, Python, R, Nextflow, Snakemake
Jupyter Notebooks: 33
Approximate Cost: $29.00

That concludes our section on Learning Pathways. The rest of the README walks you through some of the technical details of each module, focusing in particular on compute environments and machine types, as well as additional resources to help you continue your learning journey!
