GenomicMedLab/variation-normalizer-manuscript

Variation Normalizer Manuscript

This repo contains analysis notebooks used in The Clinical Genomic Variation Landscape manuscript.

Small output files can be found in this repo. Larger files can be found in our public s3 bucket: s3://nch-igm-wagner-lab-public/variation-normalizer-manuscript/2025. There are notebooks that provide functions for programmatically downloading files from the s3 bucket.
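
As a rough illustration of programmatic access (not the helper functions shipped in the notebooks), the sketch below converts an s3:// URI under the bucket prefix into an HTTPS URL and downloads it with the standard library. It assumes the bucket allows anonymous HTTPS reads through the virtual-hosted-style endpoint. The function names and the example.txt file name are hypothetical.

```python
from pathlib import Path
from urllib.parse import quote, urlparse
from urllib.request import urlretrieve

# Bucket prefix taken from the README text above.
S3_PREFIX = "s3://nch-igm-wagner-lab-public/variation-normalizer-manuscript/2025"


def s3_uri_to_https(s3_uri: str) -> str:
    """Convert s3://bucket/key into a virtual-hosted-style HTTPS URL.

    Assumption: the object is publicly readable over HTTPS.
    """
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri}")
    key = quote(parsed.path.lstrip("/"))  # keeps "/" separators intact
    return f"https://{parsed.netloc}.s3.amazonaws.com/{key}"


def download(s3_uri: str, dest_dir: str = "downloads") -> Path:
    """Download one object into dest_dir and return the local path."""
    dest = Path(dest_dir) / Path(urlparse(s3_uri).path).name
    dest.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(s3_uri_to_https(s3_uri), dest)
    return dest


# Example (hypothetical file name; performs a network request):
# download(f"{S3_PREFIX}/example.txt")
```

The notebooks' own download functions remain the reference; this only shows the general shape of such a helper.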

After running the notebooks, you will be able to create figures that demonstrate the results of the analysis, such as the figure below.

Variant normalization allows patient data from AACR Project GENIE to be matched to normalized variants in the CIViC, MOAlmanac, and ClinVar knowledgebases.

Patient Matching with GENIE

Set Up

Before running the notebooks, you must set up your environment.

Prerequisites

  • Docker
  • Python 3.13
    • We recommend using uv to install.
  • libpq
  • postgresql
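
One way to check the command-line prerequisites before continuing is a short script like the sketch below. The executable names are assumptions (docker for Docker, pg_config for the libpq development files, psql for the PostgreSQL client), and a tool reported as missing may still be installed somewhere outside your PATH.

```python
import shutil

# Assumed executable names for the prerequisites listed above.
REQUIRED_TOOLS = ("docker", "pg_config", "psql")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```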

macOS

You can use Homebrew to install the prerequisites. See the Homebrew documentation for how to install. Make sure Homebrew is up-to-date by running brew update.

brew install libpq
brew install postgresql@14

Ubuntu

sudo apt install gcc libpq-dev python3-dev

Creating the virtual environment

uv

From the root directory, run the following to create the venv and install exact packages:

uv python pin 3.13
uv venv
source .venv/bin/activate
uv sync --all-extras
git submodule update --init --recursive

pip

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
git submodule update --init --recursive

Environment Variables

We use python-dotenv to load environment variables needed for analysis notebooks that run the Variation Normalizer.

If you are running any of the following notebooks, this section is required:

In the analysis notebooks, you will see:

from dotenv import load_dotenv

load_dotenv(".env.shared")

This will load environment variables from the .env.shared file in the root directory.
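
To confirm that the variables were actually picked up, a check along these lines can be run after load_dotenv. It is a simplified sketch: it only understands plain KEY=VALUE lines (optionally prefixed with export), whereas python-dotenv's own parser handles more cases. The function names are hypothetical.

```python
import os
from pathlib import Path


def env_file_keys(path=".env.shared"):
    """Return variable names from simple KEY=VALUE lines; skip comments and blanks."""
    keys = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key = line.split("=", 1)[0].strip()
        if key.startswith("export "):
            key = key[len("export "):].strip()
        keys.append(key)
    return keys


def unset_keys(path=".env.shared", environ=None):
    """Return names defined in the file that are absent from the environment."""
    environ = os.environ if environ is None else environ
    return [key for key in env_file_keys(path) if key not in environ]
```

After load_dotenv(".env.shared") has run, unset_keys() would be expected to return an empty list.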

Set Up Backend Services

This analysis relies on backend services, which you must set up yourself.

1. Biocommons SeqRepo

Biocommons SeqRepo is used for fast access to sequence data. This analysis uses 2024-12-20 SeqRepo data.

Follow the Quick Start Documentation for setting up SeqRepo (2024-12-20).

SeqRepo Verification

To verify, run the following inside your virtual environment:

$ python3
Python 3.13.1 (main, Dec 31 2024, 13:03:34) [Clang 16.0.0 (clang-1600.0.26.6)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from biocommons.seqrepo import SeqRepo
>>> sr = SeqRepo(root_dir="/usr/local/share/seqrepo/2024-12-20")
>>> sr["NC_000001.11"][780000:780020]
'TGGTGGCACGCGCTTGTAGT'
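
If the session above fails before returning a sequence, a quick look at the directory itself can narrow things down. The sketch below uses only the standard library. The entries it looks for (aliases.sqlite3 and a sequences directory) are an assumption about how a SeqRepo snapshot is typically laid out, and the function name is hypothetical.

```python
from pathlib import Path

# Default location used throughout this README.
DEFAULT_ROOT = "/usr/local/share/seqrepo/2024-12-20"


def seqrepo_layout_problems(root_dir=DEFAULT_ROOT):
    """Return a list of problems found with the snapshot directory (empty if none)."""
    root = Path(root_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    # Assumed snapshot layout: an alias database plus a sequences directory.
    if not (root / "aliases.sqlite3").is_file():
        problems.append("missing aliases.sqlite3")
    if not (root / "sequences").is_dir():
        problems.append("missing sequences/ directory")
    return problems


if __name__ == "__main__":
    for problem in seqrepo_layout_problems():
        print(problem)
```

An empty result only says the directory looks plausible; the interactive check above is still the real test.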

SeqRepo Issues

If you have trouble using the default path, try creating a symlink by running the following:

seqrepo update-latest

Then repeat the steps in SeqRepo Verification to confirm that this works.

2. Variation Normalizer: Docker Container

Important

This section assumes you have a local SeqRepo installed at /usr/local/share/seqrepo/2024-12-20. If you have it installed elsewhere, please set the SEQREPO_ROOT_DIR environment variable in compose.yaml.
If you're using Docker Desktop, you'll want to go to Settings -> Resources -> File sharing and add /usr/local/share/seqrepo under the Virtual file shares section. Otherwise, you will get the following error: OSError: Unable to open SeqRepo directory /usr/local/share/seqrepo/2024-12-20.

To build, (re)create, and start containers:

docker volume create --name=uta_vol
docker compose \
  -p variation-normalizer-manuscript \
  -f submodules/compose.yaml \
  -f compose.yaml \
  up

Tip

If you want a clean slate, run docker compose down -v to remove containers and volumes, then docker compose -p variation-normalizer-manuscript -f submodules/compose.yaml -f compose.yaml up to rebuild and start fresh containers.
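
If you drive these commands from a script, a small wrapper keeps the project name and compose files consistent across invocations. This is a sketch under assumptions: the function names are hypothetical, and passing the same -p and -f flags to down is a choice made here, whereas the tip above runs docker compose down -v without them.

```python
import subprocess

# Values copied from the docker compose command shown above.
PROJECT = "variation-normalizer-manuscript"
COMPOSE_FILES = ("submodules/compose.yaml", "compose.yaml")


def compose_command(*args):
    """Build a docker compose argument list with the shared project and files."""
    cmd = ["docker", "compose", "-p", PROJECT]
    for compose_file in COMPOSE_FILES:
        cmd += ["-f", compose_file]
    return cmd + list(args)


def run_compose(*args):
    """Run the command from the repository root; raise if it exits non-zero."""
    return subprocess.run(compose_command(*args), check=True)


# Example (requires Docker; run from the repository root):
# run_compose("down", "-v")
# run_compose("up")
```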

Running Notebooks

This section provides information about the notebooks and the order in which they should be run.

  1. Run the following notebook:
  2. Run the following notebooks (order does not matter):

Important

You must have the Docker containers running for these notebooks.

  3. Run the following notebooks (order does not matter):
  4. Run the following notebook:
  5. Run the following notebook:

Running Notebooks in Visual Studio Code (VS Code)

VS Code is a lightweight source code editor for Windows, Linux, and macOS.

  1. Download VS Code here
  2. Open a notebook and click Select Kernel at the top right. Select the option where the path is venv/3.13/bin/python. See here for more information on managing Jupyter Kernels in VS Code.
  3. Run the notebooks
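
To confirm that the selected kernel is the interpreter you intended, a cell like the following can be run first. It is a minimal sketch: the function name is hypothetical, and the virtual-environment test (comparing sys.prefix with sys.base_prefix) is a common heuristic and not a VS Code feature.

```python
import sys


def interpreter_report(expected=(3, 13)):
    """Summarize which Python interpreter is running this kernel."""
    version = tuple(sys.version_info[:2])
    return {
        "executable": sys.executable,
        "version": version,
        "version_ok": version == tuple(expected),
        # True when running inside a venv-style virtual environment.
        "in_virtualenv": sys.prefix != sys.base_prefix,
    }


print(interpreter_report())
```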

Analysis with macOS Environments

These notebooks were run using these macOS specs:

Model Year   CPU Architecture   Total RAM   Hard drive capacity
2023         M2 Pro             32 GB       1 TB
2023         M3 Pro             36 GB       1 TB
2024         M4 Pro             48 GB       1 TB

Help

If you have any questions or problems, please open an issue in the repo and our team will be happy to assist.
