AlignAIR

Deep‑learning sequence aligner for immunoglobulin & T‑cell receptor repertoires

✨ Quick Start

# Pull the latest image
docker pull thomask90/alignair:latest

# Run AlignAIR v2.0 pipeline
docker run -it --rm \
  -v /path/to/local/data:/data \
  -v /path/to/local/downloads:/downloads \
  thomask90/alignair:latest \
  python app.py run \
    --model-checkpoint=/app/pretrained_models/IGH_S5F_576 \
    --genairr-dataconfig=HUMAN_IGH_OGRDB \
    --sequences=/data/sample_HeavyChain_dataset.csv \
    --save-path=/downloads/

Table of contents

What's New in v2.0
Key features
Installation
Usage
Available Models and Configurations
Docker in depth
Parameter Reference
Examples
Data availability
Documentation
Citation
Contributing
License
Contact

What's New in v2.0

AlignAIR v2.0 introduces a unified architecture:

🔄 Unified Models

SingleChainAlignAIR: Optimized for single receptor type analysis
MultiChainAlignAIR: Native multi-chain support with chain type classification
Universal compatibility: Works with any GenAIRR dataconfig combination

🧬 Multi-Chain Analysis

Mixed receptor processing: Analyze IGK + IGL light chains simultaneously
Chain type classification: Automatic receptor type identification
Optimized batch processing: Equal partitioning across chain types

⚡ Dynamic GenAIRR Integration

Built-in dataconfigs: HUMAN_IGH_OGRDB, HUMAN_IGK_OGRDB, HUMAN_IGL_OGRDB, HUMAN_TCRB_IMGT
Custom config support: Use your own GenAIRR dataconfigs
Automatic detection: Single vs. multi-chain mode based on input

📈 Enhanced Performance

Streamlined architecture: Single codebase for all receptor types
Memory optimization: Efficient processing for large datasets
GPU acceleration: Optimized tensor operations

Key features

State‑of‑the‑art accuracy for V, D, J allele calling and junction segmentation
Unified multi‑chain architecture supporting any chain combination through dynamic GenAIRR integration
Multi‑task deep network jointly optimises alignment, productivity, indel detection, and chain type classification
Scales to millions of AIRR‑seq reads with GPU support
Universal model architecture that adapts to single-chain or multi-chain scenarios
Dynamic data configuration with built-in GenAIRR dataconfigs for major species and receptors
Drop‑in integration with AIRR schema & downstream tools

Installation

Docker (recommended)

# Pull the latest image
docker pull thomask90/alignair:latest

# Start interactive container (mount local data to /data)
docker run -it --rm -v /path/to/local/data:/data thomask90/alignair:latest

Prerequisites: an NVIDIA GPU with CUDA 11 is recommended; CPU-only execution works but is slower.

Local (advanced)

git clone https://github.com/MuteJester/AlignAIR.git
cd AlignAIR && pip install -e ./

Note that the local install ships without pretrained model weights. It is intended for developing, testing, and debugging custom models and pipelines, and is recommended mainly for developers, contributors, and advanced users.

Usage

Basic Usage

python app.py run \
    --model-checkpoint=/app/pretrained_models/IGH_S5F_576 \
    --genairr-dataconfig=HUMAN_IGH_OGRDB \
    --sequences=/data/input/sequences.csv \
    --save-path=/data/output

Example Commands

Heavy Chain Analysis:

python app.py run \
  --model-checkpoint=/app/pretrained_models/IGH_S5F_576 \
  --genairr-dataconfig=HUMAN_IGH_OGRDB \
  --sequences=/data/input/heavy_sequences.csv \
  --save-path=/data/output/heavy_results \
  --v-allele-threshold=0.75 \
  --d-allele-threshold=0.3 \
  --j-allele-threshold=0.8

Light Chain Analysis (Single Chain):

python app.py run \
  --model-checkpoint=/app/pretrained_models/IGL_S5F_576 \
  --genairr-dataconfig=HUMAN_IGL_OGRDB \
  --sequences=/data/input/light_sequences.csv \
  --save-path=/data/output/light_results \
  --airr-format \
  --fix-orientation

Multi-Chain Light Chain Analysis (IGK + IGL):

python app.py run \
  --model-checkpoint=/app/pretrained_models/MultiChain_Light_S5F_576 \
  --genairr-dataconfig=HUMAN_IGK_OGRDB,HUMAN_IGL_OGRDB \
  --sequences=/data/input/mixed_light_sequences.csv \
  --save-path=/data/output/multichain_results \
  --airr-format

T-Cell Receptor Beta Chain:

python app.py run \
  --model-checkpoint=/app/pretrained_models/TCRB_Uniform_576 \
  --genairr-dataconfig=HUMAN_TCRB_IMGT \
  --sequences=/data/input/tcr_sequences.csv \
  --save-path=/data/output/tcr_results

Available Models and Configurations

AlignAIR v2.0 introduces a unified architecture that dynamically adapts to different chain types and configurations using GenAIRR dataconfigs:

Model Architecture Types

| Architecture | Use Case | DataConfig Support | Multi-Chain |
|---|---|---|---|
| SingleChainAlignAIR | Single receptor type analysis | Single GenAIRR dataconfig | No |
| MultiChainAlignAIR | Mixed receptor analysis | Multiple GenAIRR dataconfigs | Yes |
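
The usage examples pass one dataconfig for single-chain runs and a comma-separated list for multi-chain runs. The sketch below illustrates that rule only. The helper names `parse_dataconfig_arg` and `infer_mode` are hypothetical and are not taken from the AlignAIR codebase, and treating a repeated entry as a single config is an assumption; the real detection logic may differ.

```python
from typing import List


def parse_dataconfig_arg(value: str) -> List[str]:
    """Split a comma-separated --genairr-dataconfig value into its entries."""
    return [item.strip() for item in value.split(",") if item.strip()]


def infer_mode(value: str) -> str:
    """Return 'single-chain' or 'multi-chain' for a --genairr-dataconfig value.

    Assumption: repeated entries count once, so 'A,A' is still single-chain.
    """
    configs = parse_dataconfig_arg(value)
    if not configs:
        raise ValueError("at least one dataconfig is required")
    unique = list(dict.fromkeys(configs))  # de-duplicate, keep order
    return "multi-chain" if len(unique) > 1 else "single-chain"


if __name__ == "__main__":
    for arg in ("HUMAN_IGH_OGRDB", "HUMAN_IGK_OGRDB,HUMAN_IGL_OGRDB"):
        print(arg, "->", infer_mode(arg))
```

A check like this can catch a mismatch between the dataconfig list and the chosen checkpoint before a long run starts.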

Built-in GenAIRR DataConfigs

| DataConfig | Chain Type | Species | Reference | D Gene |
|---|---|---|---|---|
| `HUMAN_IGH_OGRDB` | Heavy Chain | Human | OGRDB v1.5 | ✓ |
| `HUMAN_IGK_OGRDB` | Kappa Light | Human | OGRDB v1.5 | ✗ |
| `HUMAN_IGL_OGRDB` | Lambda Light | Human | OGRDB v1.5 | ✗ |
| `HUMAN_TCRB_IMGT` | TCR Beta | Human | IMGT v3.1.25 | ✓ |
| `HUMAN_IGH_EXTENDED` | Heavy Chain Extended | Human | OGRDB + Custom | ✓ |

Pre-trained Model Checkpoints

The Docker container ships with optimized models for common use cases:

| Model | Architecture | Supported Configs | Checkpoint Path |
|---|---|---|---|
| Heavy Chain | SingleChainAlignAIR | `HUMAN_IGH_OGRDB` | `/app/pretrained_models/IGH_S5F_576` |
| Lambda Light | SingleChainAlignAIR | `HUMAN_IGL_OGRDB` | `/app/pretrained_models/IGL_S5F_576` |
| Kappa Light | SingleChainAlignAIR | `HUMAN_IGK_OGRDB` | `/app/pretrained_models/IGK_S5F_576` |
| Multi-Light | MultiChainAlignAIR | `HUMAN_IGK_OGRDB,HUMAN_IGL_OGRDB` | `/app/pretrained_models/MultiLight_S5F_576` |
| TCR Beta | SingleChainAlignAIR | `HUMAN_TCRB_IMGT` | `/app/pretrained_models/TCRB_Uniform_576` |
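
When scripting runs, it can help to keep the pairing of dataconfig and checkpoint in one place. The following is a minimal sketch that copies the paths from the table above into a lookup. The names `PRETRAINED_CHECKPOINTS` and `checkpoint_for` are hypothetical, and the paths are only meaningful inside the Docker image.

```python
# Paths copied from the checkpoint table; they exist only inside the Docker image.
PRETRAINED_CHECKPOINTS = {
    "HUMAN_IGH_OGRDB": "/app/pretrained_models/IGH_S5F_576",
    "HUMAN_IGL_OGRDB": "/app/pretrained_models/IGL_S5F_576",
    "HUMAN_IGK_OGRDB": "/app/pretrained_models/IGK_S5F_576",
    "HUMAN_IGK_OGRDB,HUMAN_IGL_OGRDB": "/app/pretrained_models/MultiLight_S5F_576",
    "HUMAN_TCRB_IMGT": "/app/pretrained_models/TCRB_Uniform_576",
}


def checkpoint_for(dataconfig_arg: str) -> str:
    """Return the bundled checkpoint path listed for a --genairr-dataconfig value."""
    # Normalise whitespace around commas so 'A, B' and 'A,B' match the same key.
    key = ",".join(part.strip() for part in dataconfig_arg.split(","))
    try:
        return PRETRAINED_CHECKPOINTS[key]
    except KeyError:
        raise KeyError(f"no bundled checkpoint listed for {key!r}") from None


if __name__ == "__main__":
    print(checkpoint_for("HUMAN_TCRB_IMGT"))
```

The lookup is order-sensitive for multi-chain keys; whether AlignAIR itself requires the configs in a particular order is not stated here.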

Custom DataConfigs

You can use custom GenAIRR dataconfigs by providing a path to a pickled DataConfig object:

python app.py run \
  --model-checkpoint=path/to/custom/model \
  --genairr-dataconfig=/path/to/custom_dataconfig.pkl \
  --sequences=input.csv \
  --save-path=output/

For multi-chain custom configs:

python app.py run \
  --model-checkpoint=path/to/multichain/model \
  --genairr-dataconfig=/path/to/config1.pkl,/path/to/config2.pkl \
  --sequences=input.csv \
  --save-path=output/
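
Before passing a custom `.pkl` file to `--genairr-dataconfig`, it may be worth confirming that the file unpickles cleanly in the target environment. The sketch below uses only the standard `pickle` module; `load_pickled_config` is a hypothetical helper, and it makes no assumption about the attributes of the resulting DataConfig object. Unpickling requires the class that defines the object (here, GenAIRR) to be importable, and pickle files should only be loaded from trusted sources.

```python
import pickle
from pathlib import Path


def load_pickled_config(path):
    """Unpickle a file and return the object it contains.

    Raises FileNotFoundError if the path does not exist and
    pickle.UnpicklingError (or ImportError) if it cannot be decoded.
    """
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


if __name__ == "__main__":
    import sys

    # Usage: python check_config.py /path/to/custom_dataconfig.pkl
    for arg in sys.argv[1:]:
        obj = load_pickled_config(arg)
        print(arg, "->", type(obj).__name__)
```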

Docker in depth

Step‑by‑step guide

Pull image
```
docker pull thomask90/alignair:latest
```

Run container

docker run -it --rm \
    -v "/path/to/local/data:/data" \
    thomask90/alignair:latest

Inside the container, run AlignAIR:

python app.py run \
  --model-checkpoint="/app/pretrained_models/IGH_S5F_576" \
  --genairr-dataconfig=HUMAN_IGH_OGRDB \
  --sequences="/data/test01.csv" \
  --save-path="/data"

Results are written back to your mounted /data folder.
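
For unattended runs over several input files, the same `docker run` invocation can be assembled from a script. The sketch below only builds the argument list, using the image name, mount point, and flags shown above. `build_docker_command` is a hypothetical helper, and `-it` is dropped on the assumption that no interactive terminal is needed.

```python
import shlex
from pathlib import Path
from typing import List

IMAGE = "thomask90/alignair:latest"


def build_docker_command(
    host_data_dir: str,
    csv_name: str,
    checkpoint: str = "/app/pretrained_models/IGH_S5F_576",
    dataconfig: str = "HUMAN_IGH_OGRDB",
) -> List[str]:
    """Return the docker argv for one AlignAIR run on /data/<csv_name>."""
    host = Path(host_data_dir).resolve()  # docker bind mounts need absolute paths
    return [
        "docker", "run", "--rm",
        "-v", f"{host}:/data",
        IMAGE,
        "python", "app.py", "run",
        f"--model-checkpoint={checkpoint}",
        f"--genairr-dataconfig={dataconfig}",
        f"--sequences=/data/{csv_name}",
        "--save-path=/data",
    ]


if __name__ == "__main__":
    cmd = build_docker_command("/path/to/local/data", "test01.csv")
    print(shlex.join(cmd))  # inspect the command before running it
```

To execute the command, pass the list to `subprocess.run(cmd, check=True)`; passing a list rather than a shell string avoids quoting problems with paths that contain spaces.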

For help and all parameters:
```
python app.py run --help
```

Parameter Reference

Core Parameters

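| Parameter | Description | Default |
|---|---|---|
| `--model-checkpoint` | Path to model weights | Required |
| `--genairr-dataconfig` | Built-in GenAIRR dataconfig name or path to a pickled DataConfig; comma-separated for multi-chain | Required |
| `--sequences` | Input file path (CSV/TSV/FASTA) | Required |
| `--save-path` | Output directory | Required |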

Model Settings

| Parameter | Description | Default |
|---|---|---|
| `--max-input-size` | Maximum input window size | `576` |
| `--batch-size` | Sequences per batch | `2048` |
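
Since the input window defaults to 576, it can be useful to know ahead of time which reads exceed it. The sketch below scans FASTA text and lists the IDs of longer sequences. How AlignAIR treats such reads is not described here, so the script only reports them; `read_fasta` and `find_long_sequences` are hypothetical helpers, not part of AlignAIR.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (id, sequence) pairs from FASTA lines; the id is the first header token."""
    name = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].strip()
            name = header.split()[0] if header else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_long_sequences(
    records: Iterable[Tuple[str, str]], max_input_size: int = 576
) -> List[str]:
    """Return the ids of sequences longer than max_input_size."""
    return [name for name, seq in records if len(seq) > max_input_size]


if __name__ == "__main__":
    import sys

    # Usage: python check_lengths.py reads.fasta
    with open(sys.argv[1]) as handle:
        for seq_id in find_long_sequences(read_fasta(handle)):
            print(seq_id)
```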

Thresholds

| Parameter | Description | Default |
|---|---|---|
| `--v-allele-threshold` | V allele calling threshold | `0.75` |
| `--d-allele-threshold` | D allele calling threshold | `0.30` |
| `--j-allele-threshold` | J allele calling threshold | `0.80` |
| `--v-cap` / `--d-cap` / `--j-cap` | Maximum calls per segment | `3` |
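
The table does not spell out how a threshold and a cap interact. One plausible reading, assumed here purely for illustration, is that an allele is reported when its likelihood is at least `threshold` times the highest likelihood for that segment, and that at most `cap` alleles are kept. The function `call_alleles` and the example likelihoods are hypothetical; AlignAIR's actual selection rule may differ.

```python
from typing import Dict, List


def call_alleles(likelihoods: Dict[str, float], threshold: float, cap: int) -> List[str]:
    """Select alleles relative to the best-scoring one.

    Assumed rule (not confirmed by the README): keep alleles whose likelihood
    is >= threshold * max likelihood, then truncate to `cap` entries.
    """
    if not likelihoods or cap <= 0:
        return []
    ranked = sorted(likelihoods.items(), key=lambda kv: kv[1], reverse=True)
    best = ranked[0][1]
    kept = [name for name, score in ranked if score >= threshold * best]
    return kept[:cap]


if __name__ == "__main__":
    # Made-up likelihoods for three V alleles.
    scores = {"IGHV1-69*01": 0.62, "IGHV1-69*02": 0.50, "IGHV1-2*02": 0.05}
    print(call_alleles(scores, threshold=0.75, cap=3))
```

Under this reading, raising the threshold toward 1.0 yields fewer, more confident calls, while lowering it admits more ambiguous alleles up to the cap.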

Output Options

| Parameter | Description | Default |
|---|---|---|
| `--airr-format` | Output full AIRR Schema | `false` |
| `--fix-orientation` | Auto-correct orientations | `true` |
| `--translate-to-asc` | Output ASC allele names | `false` |
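
Output written with `--airr-format` should be readable by any tool that understands the AIRR Rearrangement schema. As a small illustration, the sketch below tallies the first `v_call` of each row using only the standard library. It assumes a tab-delimited file with a `v_call` column holding comma-separated calls, as the AIRR schema specifies; the name and location of AlignAIR's output file are not assumed, and `count_first_v_calls` is a hypothetical helper.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def count_first_v_calls(handle: TextIO, delimiter: str = "\t") -> Counter:
    """Count the first-listed V allele per row of an AIRR-style table."""
    counts: Counter = Counter()
    for row in csv.DictReader(handle, delimiter=delimiter):
        calls = row.get("v_call") or ""
        first = calls.split(",")[0].strip()  # ambiguous calls are comma-separated
        if first:
            counts[first] += 1
    return counts


if __name__ == "__main__":
    # Tiny made-up table standing in for a real output file.
    sample = (
        "sequence_id\tv_call\n"
        "s1\tIGHV1-69*01,IGHV1-69*02\n"
        "s2\tIGHV1-69*01\n"
        "s3\tIGHV3-23*01\n"
    )
    for allele, n in count_first_v_calls(io.StringIO(sample)).most_common():
        print(allele, n)
```

For a real file, replace the `io.StringIO` object with `open(path, newline="")`, and pass `delimiter=","` if the output turns out to be comma-separated.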

For the complete parameter list, run `python app.py run --help`.

Examples

See the examples/ folder for Jupyter notebooks:

End‑to‑end heavy‑chain pipeline
Benchmark vs. IgBLAST on 10K reads
Batch processing workflows

Data availability

Training & benchmark datasets are archived on Zenodo: doi:10.5281/zenodo.XXXXXXXX

Documentation

For comprehensive documentation, examples, and technical details, visit: https://alignair.ai/docs

Contributing

Pull requests are welcome! Please:

Run `pre-commit run --all-files`
Ensure pytest passes
Update CHANGELOG.md

See CONTRIBUTING.md for full guidelines.

License

This project is licensed under the terms of the GNU General Public License v3.0 or later (GPLv3+).

Contact

Open an issue or email thomaskon90@gmail.com.
For announcements, visit https://alignair.ai or join our Slack.
