This repo contains the analysis notebooks used in The Clinical Genomic Variation Landscape manuscript.
Small output files can be found in this repo. Larger files can be found in our public s3 bucket: s3://nch-igm-wagner-lab-public/variation-normalizer-manuscript/2025. There are notebooks that provide functions for programmatically downloading files from the s3 bucket.
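As a minimal sketch of how a path in the bucket can be addressed from Python, the snippet below splits an s3:// URI into bucket and key and builds a virtual-hosted-style HTTPS URL. It is not the download code used in the notebooks. It assumes the bucket allows anonymous reads over that endpoint, and the file name example.csv is hypothetical.

```python
from urllib.parse import quote, urlparse

# Bucket location quoted from the text above.
S3_PREFIX = "s3://nch-igm-wagner-lab-public/variation-normalizer-manuscript/2025"


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def public_https_url(uri: str) -> str:
    """Build a virtual-hosted-style HTTPS URL for a public S3 object.

    Assumption: the bucket permits anonymous reads over this endpoint.
    """
    bucket, key = split_s3_uri(uri)
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}"


# "example.csv" is a hypothetical file name, not a known object in the bucket.
example_uri = f"{S3_PREFIX}/example.csv"
print(public_https_url(example_uri))
```

The resulting URL could be passed to urllib.request.urlretrieve or a similar HTTP client. The notebooks' own download functions remain the supported route.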
After running the notebooks, users will be able to create figures that demonstrate the results of the analysis, such as the figure below.
Variant normalization allows patient data from AACR Project GENIE to be matched to normalized variants in the CIViC, MOAlmanac, and ClinVar knowledgebases.
Before running the notebooks, you must set up your environment.
On macOS, you can use Homebrew to install the prerequisites. See the Homebrew documentation for how to install it. Make sure Homebrew is up-to-date by running brew update.
brew install libpq
brew install postgresql@14
On Ubuntu, install the prerequisites with apt:
sudo apt install gcc libpq-dev python3-dev
From the root directory, run one of the following to create the venv and install exact packages.
Using uv:
uv python pin 3.13
uv venv
source .venv/bin/activate
uv sync --all-extras
git submodule update --init --recursive
Alternatively, using venv and pip:
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
git submodule update --init --recursive
We use python-dotenv to load environment variables needed for analysis notebooks that run the Variation Normalizer.
If you are running any of the following notebooks, this section is required:
- analysis/civic/variation_analysis/civic_variation_analysis.ipynb
- analysis/cnvs/parse_prep_normalize_nch_cnvs.ipynb
- analysis/genie/pre_variant_analysis/genie_pre_variant_analysis.ipynb
- analysis/moa/feature_analysis/moa_feature_analysis.ipynb
In the analysis notebooks, you will see:
from dotenv import load_dotenv
load_dotenv(".env.shared")
This will load environment variables from the .env.shared file in the root directory.
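To illustrate roughly what loading a .env file involves, here is a simplified standard-library sketch. It is not python-dotenv's implementation and skips features such as export prefixes and variable interpolation. It mirrors one behavior of load_dotenv worth knowing: by default, variables already set in the environment are not overridden. The variable names in the sample are hypothetical.

```python
import os


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and any wrapping quote characters.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def apply_env(values, environ=os.environ, override=False):
    """Copy parsed values into environ; existing keys win unless override=True."""
    for key, value in values.items():
        if override or key not in environ:
            environ[key] = value


# EXAMPLE_SERVICE_URL and EXAMPLE_TOKEN are hypothetical names for illustration.
sample = """
# a comment line
EXAMPLE_SERVICE_URL="http://localhost:1234"
EXAMPLE_TOKEN=abc
"""
env = {"EXAMPLE_TOKEN": "already-set"}
apply_env(parse_env_text(sample), environ=env)
print(env)
```

In practice this means a value exported in your shell takes precedence over the same key in .env.shared unless the loader is asked to override it.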
This analysis relies on backend services, which you must set up yourself.
Biocommons SeqRepo is used for fast access to sequence data. This analysis uses 2024-12-20 SeqRepo data.
Follow the Quick Start Documentation for setting up SeqRepo (2024-12-20).
To verify, run the following inside your virtual environment:
$ python3
Python 3.13.1 (main, Dec 31 2024, 13:03:34) [Clang 16.0.0 (clang-1600.0.26.6)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from biocommons.seqrepo import SeqRepo
>>> sr = SeqRepo(root_dir="/usr/local/share/seqrepo/2024-12-20")
>>> sr["NC_000001.11"][780000:780020]
'TGGTGGCACGCGCTTGTAGT'
If you have trouble using the default path, try creating a symlink by running the following:
seqrepo update-latest
Then verify that this works by repeating the SeqRepo verification steps above.
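A small sketch of how the verification could be made more defensive: resolve the SeqRepo directory, fail early with a clear message if it is missing, and only then open it. Reading the location from a SEQREPO_ROOT_DIR environment variable in Python is an assumption made here for illustration; the default path and the test slice are taken from the verification above. The biocommons.seqrepo import is deferred so the path checks work even where the package is not installed.

```python
import os
from pathlib import Path

# Default location used in the verification steps above.
DEFAULT_SEQREPO_ROOT = "/usr/local/share/seqrepo/2024-12-20"


def resolve_seqrepo_root(environ=os.environ) -> Path:
    """Return the SeqRepo directory, preferring SEQREPO_ROOT_DIR if set.

    Assumption: SEQREPO_ROOT_DIR is the variable you use for a custom path.
    """
    return Path(environ.get("SEQREPO_ROOT_DIR", DEFAULT_SEQREPO_ROOT))


def check_seqrepo_root(root: Path) -> None:
    """Raise a descriptive error if the SeqRepo directory does not exist."""
    if not root.is_dir():
        raise FileNotFoundError(f"SeqRepo directory not found: {root}")


def fetch_test_slice(root: Path) -> str:
    """Open SeqRepo and return the 20-base test slice used above."""
    from biocommons.seqrepo import SeqRepo  # third-party; imported lazily

    check_seqrepo_root(root)
    sr = SeqRepo(root_dir=str(root))
    return sr["NC_000001.11"][780000:780020]


print(resolve_seqrepo_root())
```

Calling fetch_test_slice(resolve_seqrepo_root()) inside the virtual environment should return the same string as the interactive session above if the 2024-12-20 snapshot is installed.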
Important
This section assumes you have a local SeqRepo installed at /usr/local/share/seqrepo/2024-12-20. If you have it installed elsewhere, please add a SEQREPO_ROOT_DIR environment variable in compose.yaml.
If you're using Docker Desktop, go to Settings -> Resources -> File sharing and add /usr/local/share/seqrepo under the Virtual file shares section. Otherwise, you will get the following error:
OSError: Unable to open SeqRepo directory /usr/local/share/seqrepo/2024-12-20
To build, (re)create, and start the containers, run:
docker volume create --name=uta_vol
docker compose \
-p variation-normalizer-manuscript \
-f submodules/compose.yaml \
-f compose.yaml \
up
Tip
If you want a clean slate, run docker compose down -v to remove containers and volumes, then docker compose -p variation-normalizer-manuscript -f submodules/compose.yaml -f compose.yaml up to rebuild and start fresh containers.
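Containers can take a while to accept connections after docker compose reports them as started. The sketch below is one way to wait for a service from Python before running notebooks that depend on it. The host and port are parameters because the published ports are defined in the compose files and are not listed here; any value you pass is your own assumption.

```python
import socket
import time


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host: str, port: int, total_wait: float = 60.0,
                  interval: float = 2.0) -> bool:
    """Poll host:port until it accepts connections or total_wait elapses."""
    deadline = time.monotonic() + total_wait
    while True:
        if port_is_open(host, port):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For example, wait_for_port("localhost", 5432) would wait for a PostgreSQL container published on its default port, if that is how your compose setup exposes it. A successful TCP connection only shows that something is listening, not that the service has finished loading its data.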
This section describes the notebooks and the order in which they should be run.
- Run the following notebook:
  - analysis/download_s3_files.ipynb
    - Downloads files from the public s3 bucket that are needed for the notebooks.
      - Downloads ClinVar CNV, MANE Ensembl GFF, and NCH CNV data
        - The following notebooks were used to create the files that are downloaded in this notebook (order does not matter):
          - analysis/cnvs/prep_clinvar_cnvs.ipynb
            - Creates ClinVar-CNVs-normalized.csv
          - analysis/cnvs/parse_prep_normalize_nch_cnvs.ipynb
            - Creates NCH-microarray-CNVs-cleaned.csv
- Run the following notebooks (order does not matter):
  - analysis/civic/variation_analysis/civic_variation_analysis.ipynb
    - Runs CIViC variant data through the Variation Normalizer
  - analysis/clinvar/clinvar_variation_analysis.ipynb
    - Analysis on ClinVar variant data
  - analysis/genie/pre_variant_analysis/genie_pre_variant_analysis.ipynb
    - Runs GENIE variant data through the Variation Normalizer
  - analysis/moa/feature_analysis/moa_feature_analysis.ipynb
    - Runs MOA feature data through the Variation Normalizer
Important
You must have the Docker containers running for these notebooks.
- Run the following notebooks (order does not matter):
  - analysis/civic/variation_analysis/transcript_variation_analysis.ipynb
    - Analysis on CIViC variants in the Transcript category
  - analysis/civic/evidence_analysis/civic_evidence_analysis.ipynb
    - Analysis on CIViC evidence items
  - analysis/cnvs/query_match_nch_clinvar_cnvs.ipynb
    - Analysis on feature overlap in NCH and ClinVar CNVs
  - analysis/genie/variant_analysis/genie_search_analysis.ipynb
    - Analysis on matched normalized GENIE variants and normalized variants from CIViC, MOA, and ClinVar
  - analysis/moa/assertion_analysis/moa_assertion_analysis.ipynb
    - Analysis on MOA assertions
- Run the following notebook:
  - analysis/merged_moa_civic/merged_moa_civic_evidence_analysis.ipynb
    - Combined analysis on CIViC evidence items and MOA assertions
- Run the following notebook:
  - analysis/performance_analysis/merged_performance_analysis.ipynb
    - Analysis on Variation Normalizer performance on CIViC, MOA, and ClinVar
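The run order above can be sketched as a list of stages, where notebooks within a stage may run in any order but stages run in sequence. The snippet builds jupyter nbconvert commands that would execute each notebook in place. Running the notebooks headlessly this way is an assumption for illustration; it is not how the authors describe running them, and the stages that call the Variation Normalizer still need the Docker containers up.

```python
import subprocess

# Stages in the order listed above; order within a stage does not matter.
STAGES = [
    ["analysis/download_s3_files.ipynb"],
    [
        "analysis/civic/variation_analysis/civic_variation_analysis.ipynb",
        "analysis/clinvar/clinvar_variation_analysis.ipynb",
        "analysis/genie/pre_variant_analysis/genie_pre_variant_analysis.ipynb",
        "analysis/moa/feature_analysis/moa_feature_analysis.ipynb",
    ],
    [
        "analysis/civic/variation_analysis/transcript_variation_analysis.ipynb",
        "analysis/civic/evidence_analysis/civic_evidence_analysis.ipynb",
        "analysis/cnvs/query_match_nch_clinvar_cnvs.ipynb",
        "analysis/genie/variant_analysis/genie_search_analysis.ipynb",
        "analysis/moa/assertion_analysis/moa_assertion_analysis.ipynb",
    ],
    ["analysis/merged_moa_civic/merged_moa_civic_evidence_analysis.ipynb"],
    ["analysis/performance_analysis/merged_performance_analysis.ipynb"],
]


def build_commands(stages):
    """Return one nbconvert command (as an argv list) per notebook, in stage order."""
    return [
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", notebook]
        for stage in stages
        for notebook in stage
    ]


def run_all(stages):
    """Execute every notebook in order, stopping at the first failure."""
    for command in build_commands(stages):
        subprocess.run(command, check=True)


for command in build_commands(STAGES):
    print(" ".join(command))
```

Only build_commands is exercised here; run_all is left uncalled so that nothing executes until you decide to run it from the root directory with the virtual environment active.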
VS Code is a lightweight source code editor for Windows, Linux, and macOS.
- Download VS Code from the official website.
- Open a notebook and click Select Kernel at the top right. Select the option where the path is venv/3.13/bin/python. See the VS Code documentation for more information on managing Jupyter kernels.
- Run the notebooks
These notebooks were run on macOS machines with the following specs:

| Model Year | CPU Architecture | Total RAM | Hard drive capacity |
|---|---|---|---|
| 2023 | M2 Pro | 32 GB | 1 TB |
| 2023 | M3 Pro | 36 GB | 1 TB |
| 2024 | M4 Pro | 48 GB | 1 TB |
If you have any questions or problems, please open an issue in the repo and our team will be happy to assist.