Variation Normalizer Manuscript

This repo contains the analysis notebooks used in The Clinical Genomic Variation Landscape manuscript.

Small output files can be found in this repo. Larger files can be found in our public S3 bucket: s3://nch-igm-wagner-lab-public/variation-normalizer-manuscript/2025. Several notebooks provide functions for programmatically downloading files from the bucket.
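
As a rough illustration of programmatic access, the sketch below splits the bucket URI into its bucket and prefix and builds an HTTPS URL for a single object. It is not the code used in the notebooks. The helper names, the virtual-hosted-style endpoint, and the file name example.csv are assumptions made for illustration; the notebooks in this repo remain the reference for which files to fetch.

```python
from pathlib import Path
from urllib.parse import quote, urlparse
from urllib.request import urlretrieve

S3_URI = "s3://nch-igm-wagner-lab-public/variation-normalizer-manuscript/2025"


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, key prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.strip("/")


def public_https_url(uri: str, relative_key: str) -> str:
    """Build an HTTPS URL for an object stored under the URI's prefix.

    Assumption: the bucket allows anonymous reads through the
    virtual-hosted-style endpoint <bucket>.s3.amazonaws.com.
    """
    bucket, prefix = split_s3_uri(uri)
    key = f"{prefix}/{relative_key.lstrip('/')}"
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}"


def download(relative_key: str, dest_dir: str = ".") -> Path:
    """Download one object into dest_dir and return the local path."""
    dest = Path(dest_dir) / Path(relative_key).name
    urlretrieve(public_https_url(S3_URI, relative_key), dest)
    return dest


# "example.csv" is a hypothetical object name, used only to show the URL shape.
print(public_https_url(S3_URI, "example.csv"))
```

Calling download("example.csv") would attempt the transfer; replace the hypothetical name with a key that actually exists in the bucket.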

After running the notebooks, you will be able to create figures that demonstrate the results of the analysis, such as the figure below.

Variant normalization allows patient data from AACR Project GENIE to be matched to normalized variants in the CIViC, MOAlmanac, and ClinVar knowledgebases.

Set Up

Before running the notebooks, you must set up your environment.

Prerequisites

- Docker
- Python 3.13
  - We recommend using uv to install.
- libpq
- postgresql

macOS

You can use Homebrew to install the prerequisites. See the Homebrew documentation for how to install. Make sure Homebrew is up-to-date by running brew update.

brew install libpq
brew install postgresql@14

Ubuntu

sudo apt install gcc libpq-dev python3-dev

Creating the virtual environment

uv

From the root directory, run the following to create the venv and install exact packages:

uv python pin 3.13
uv venv
source .venv/bin/activate
uv sync --all-extras
git submodule update --init --recursive

pip

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
git submodule update --init --recursive

Environment Variables

We use python-dotenv to load environment variables needed for analysis notebooks that run the Variation Normalizer.

If you are running any of the following notebooks, this section is required:

In the analysis notebooks, you will see:

from dotenv import load_dotenv

load_dotenv(".env.shared")

This will load environment variables from the .env.shared file in the root directory.
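
A quick way to confirm that the variables were actually loaded is to check the process environment after calling load_dotenv. The helper below is a minimal sketch using only the standard library. The function name missing_env_vars is hypothetical, and the variable name in the usage note is an example; check .env.shared for the names it really defines.

```python
import os


def missing_env_vars(required: list[str]) -> list[str]:
    """Return the names in `required` that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]
```

For example, calling missing_env_vars(["SEQREPO_ROOT_DIR"]) right after load_dotenv(".env.shared") returns an empty list if that variable is set, and ["SEQREPO_ROOT_DIR"] otherwise. Whether .env.shared defines this particular variable is an assumption.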

Set Up Backend Services

This analysis relies on backend services, which you must set up yourself.

1. Biocommons SeqRepo

Biocommons SeqRepo is used for fast access to sequence data. This analysis uses 2024-12-20 SeqRepo data.

Follow the Quick Start Documentation for setting up SeqRepo (2024-12-20).

SeqRepo Verification

To verify, run the following inside your virtual environment:

$ python3
Python 3.13.1 (main, Dec 31 2024, 13:03:34) [Clang 16.0.0 (clang-1600.0.26.6)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from biocommons.seqrepo import SeqRepo
>>> sr = SeqRepo(root_dir="/usr/local/share/seqrepo/2024-12-20")
>>> sr["NC_000001.11"][780000:780020]
'TGGTGGCACGCGCTTGTAGT'

SeqRepo Issues

If you have trouble using the default path, try creating a symlink by running the following:

seqrepo update-latest

Then verify that this works by repeating the steps in SeqRepo Verification.
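
If the verification still fails, it can help to inspect the snapshot directory before opening it with SeqRepo. The sketch below is a hypothetical preflight check, not part of the SeqRepo API. It assumes a snapshot directory contains an aliases.sqlite3 file and a sequences directory; adjust the checks if your installation is laid out differently.

```python
from pathlib import Path

DEFAULT_ROOT = "/usr/local/share/seqrepo/2024-12-20"


def seqrepo_problems(root_dir: str = DEFAULT_ROOT) -> list[str]:
    """Return human-readable problems found with a SeqRepo snapshot directory."""
    root = Path(root_dir)
    if not root.is_dir():
        return [f"{root} does not exist or is not a directory"]
    problems = []
    # Assumed snapshot layout: an alias database plus a sequences directory.
    if not (root / "aliases.sqlite3").is_file():
        problems.append(f"{root / 'aliases.sqlite3'} is missing")
    if not (root / "sequences").is_dir():
        problems.append(f"{root / 'sequences'} is missing")
    return problems
```

An empty list from seqrepo_problems() means the directory looks plausible under these assumptions; it does not guarantee that the data itself is complete.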

2. Variation Normalizer: Docker Container

Important

This section assumes you have a local SeqRepo installed at /usr/local/share/seqrepo/2024-12-20. If you have it installed elsewhere, please add a SEQREPO_ROOT_DIR environment variable in compose.yaml.
If you're using Docker Desktop, you'll want to go to Settings -> Resources -> File sharing and add /usr/local/share/seqrepo under the Virtual file shares section. Otherwise, you will get the following error: OSError: Unable to open SeqRepo directory /usr/local/share/seqrepo/2024-12-20.

To build, (re)create, and start the containers:

docker volume create --name=uta_vol
docker compose \
  -p variation-normalizer-manuscript \
  -f submodules/compose.yaml \
  -f compose.yaml \
  up

Tip

If you want a clean slate, run docker compose down -v to remove containers and volumes, then docker compose -p variation-normalizer-manuscript -f submodules/compose.yaml -f compose.yaml up to rebuild and start fresh containers.
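
Because the project name and both compose files must be repeated on every invocation, a small wrapper can keep the commands consistent. The following is a minimal sketch with a hypothetical helper name, compose_command. Passing the same -p and -f options to the teardown command is a choice made here so that it targets the same project; it is not something this README prescribes.

```python
import shlex

PROJECT = "variation-normalizer-manuscript"
COMPOSE_FILES = ["submodules/compose.yaml", "compose.yaml"]


def compose_command(*args: str) -> list[str]:
    """Build a `docker compose` argv with the project name and both compose files."""
    cmd = ["docker", "compose", "-p", PROJECT]
    for path in COMPOSE_FILES:
        cmd += ["-f", path]
    return cmd + list(args)


# Print the commands instead of running them, so they can be reviewed first.
print(shlex.join(compose_command("up")))
print(shlex.join(compose_command("down", "-v")))
```

To execute a command from Python, pass the list to subprocess.run(..., check=True) from the repo root.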

Running Notebooks

This section lists the notebooks and the order in which they should be run.

Run the following notebook:
- analysis/download_s3_files.ipynb
  - Downloads files from public s3 bucket that are needed for the notebooks.
    - Downloads ClinVar CNV, MANE Ensembl GFF, and NCH CNV data
      - The following notebooks were used to create the files that are downloaded in this notebook (order does not matter):
        - analysis/cnvs/prep_clinvar_cnvs.ipynb
          - Creates ClinVar-CNVs-normalized.csv
        - analysis/cnvs/parse_prep_normalize_nch_cnvs.ipynb
          - Creates NCH-microarray-CNVs-cleaned.csv
Run the following notebooks (order does not matter):
- analysis/civic/variation_analysis/civic_variation_analysis.ipynb
  - Runs CIViC variant data through the Variation Normalizer
- analysis/clinvar/clinvar_variation_analysis.ipynb
  - Analysis on ClinVar variant data
- analysis/genie/pre_variant_analysis/genie_pre_variant_analysis.ipynb
  - Runs GENIE variant data through the Variation Normalizer
- analysis/moa/feature_analysis/moa_feature_analysis.ipynb
  - Runs MOA feature data through the Variation Normalizer

Important

You must have the Docker containers running for these notebooks.

Run the following notebooks (order does not matter):
- analysis/civic/variation_analysis/transcript_variation_analysis.ipynb
  - Analysis on CIViC variants in the Transcript category
- analysis/civic/evidence_analysis/civic_evidence_analysis.ipynb
  - Analysis on CIViC evidence items
- analysis/cnvs/query_match_nch_clinvar_cnvs.ipynb
  - Analysis on feature overlap in NCH and ClinVar CNVs
- analysis/genie/variant_analysis/genie_search_analysis.ipynb
  - Analysis on matched normalized GENIE variants and normalized variants from CIViC, MOA, and ClinVar
- analysis/moa/assertion_analysis/moa_assertion_analysis.ipynb
  - Analysis on MOA assertions
Run the following notebook:
- analysis/merged_moa_civic/merged_moa_civic_evidence_analysis.ipynb
  - Combined analysis on CIViC evidence items and MOA assertions
Run the following notebook:
- analysis/performance_analysis/merged_performance_analysis.ipynb
  - Analysis on Variation Normalizer performance on CIViC, MOA, and ClinVar
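
The ordering above can be written down as a list of stages, where notebooks within a stage may run in any order. The sketch below is one way to script it and is not how the authors ran the analysis. It assumes jupyter nbconvert is available in the virtual environment and that the notebooks can run unattended; the names STAGES, execution_plan, and run_all are hypothetical. The Docker containers must already be running before the second stage starts.

```python
import subprocess

# Each inner list is one stage; order within a stage does not matter.
STAGES = [
    ["analysis/download_s3_files.ipynb"],
    [
        "analysis/civic/variation_analysis/civic_variation_analysis.ipynb",
        "analysis/clinvar/clinvar_variation_analysis.ipynb",
        "analysis/genie/pre_variant_analysis/genie_pre_variant_analysis.ipynb",
        "analysis/moa/feature_analysis/moa_feature_analysis.ipynb",
    ],
    [
        "analysis/civic/variation_analysis/transcript_variation_analysis.ipynb",
        "analysis/civic/evidence_analysis/civic_evidence_analysis.ipynb",
        "analysis/cnvs/query_match_nch_clinvar_cnvs.ipynb",
        "analysis/genie/variant_analysis/genie_search_analysis.ipynb",
        "analysis/moa/assertion_analysis/moa_assertion_analysis.ipynb",
    ],
    ["analysis/merged_moa_civic/merged_moa_civic_evidence_analysis.ipynb"],
    ["analysis/performance_analysis/merged_performance_analysis.ipynb"],
]


def execution_plan() -> list[str]:
    """Flatten the stages into one valid execution order."""
    return [notebook for stage in STAGES for notebook in stage]


def run_all() -> None:
    """Execute every notebook in place, stopping at the first failure."""
    for notebook in execution_plan():
        subprocess.run(
            ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", notebook],
            check=True,
        )
```

Call run_all() from the repo root so that the relative notebook paths resolve.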

Running Notebooks in Visual Studio Code (VS Code)

VS Code is a lightweight source code editor for Windows, Linux, and macOS.

- Download VS Code.
- Open a notebook and click Select Kernel at the top right. Select the option where the path is venv/3.13/bin/python. See the VS Code documentation for more information on managing Jupyter Kernels in VS Code.
- Run the notebooks.

Analysis with macOS Environments

The notebooks were run on macOS machines with the following specs:

Model Year	CPU Architecture	Total RAM	Hard drive capacity
2023	M2 Pro	32 GB	1 TB
2023	M3 Pro	36 GB	1 TB
2024	M4 Pro	48 GB	1 TB

Help

If you have any questions or problems, please open an issue in the repo and our team will be happy to assist.

