Deep‑learning sequence aligner for immunoglobulin & T‑cell receptor repertoires
# Pull the latest image
docker pull thomask90/alignair:latest
# Run AlignAIR v2.0 pipeline
docker run -it --rm \
-v /path/to/local/data:/data \
-v /path/to/local/downloads:/downloads \
thomask90/alignair:latest \
python app.py run \
--model-checkpoint=/app/pretrained_models/IGH_S5F_576 \
--genairr-dataconfig=HUMAN_IGH_OGRDB \
--sequences=/data/sample_HeavyChain_dataset.csv \
--save-path=/downloads/
AlignAIR v2.0 introduces a unified architecture:
- SingleChainAlignAIR: Optimized for single receptor type analysis
- MultiChainAlignAIR: Native multi-chain support with chain type classification
- Universal compatibility: Works with any GenAIRR dataconfig combination
- Mixed receptor processing: Analyze IGK + IGL light chains simultaneously
- Chain type classification: Automatic receptor type identification
- Optimized batch processing: Equal partitioning across chain types
- Built-in dataconfigs: HUMAN_IGH_OGRDB, HUMAN_IGK_OGRDB, HUMAN_IGL_OGRDB, HUMAN_TCRB_IMGT
- Custom config support: Use your own GenAIRR dataconfigs
- Automatic detection: Single vs. multi-chain mode based on input
- Streamlined architecture: Single codebase for all receptor types
- Memory optimization: Efficient processing for large datasets
- GPU acceleration: Optimized tensor operations
- State‑of‑the‑art accuracy for V, D, J allele calling and junction segmentation
- Unified multi‑chain architecture supporting any chain combination through dynamic GenAIRR integration
- Multi‑task deep network jointly optimises alignment, productivity, indel detection, and chain type classification
- Scales to millions of AIRR‑seq reads with GPU support
- Universal model architecture that adapts to single-chain or multi-chain scenarios
- Dynamic data configuration with built-in GenAIRR dataconfigs for major species and receptors
- Drop‑in integration with AIRR schema & downstream tools
# Pull the latest image
docker pull thomask90/alignair:latest
# Start interactive container (mount local data to /data)
docker run -it --rm -v /path/to/local/data:/data thomask90/alignair:latest
Prerequisites: an NVIDIA GPU with CUDA 11 is recommended (CPU also works, but is slower).
git clone https://github.com/MuteJester/AlignAIR.git
cd AlignAIR && pip install -e ./
- Note: the local install ships without pretrained model weights. It is intended mainly for developing, testing, and debugging custom models and pipelines, and is recommended for developers, contributors, and advanced users.
python app.py run \
--model-checkpoint=/app/pretrained_models/IGH_S5F_576 \
--genairr-dataconfig=HUMAN_IGH_OGRDB \
--sequences=/data/input/sequences.csv \
--save-path=/data/output
Heavy Chain Analysis:
python app.py run \
--model-checkpoint=/app/pretrained_models/IGH_S5F_576 \
--genairr-dataconfig=HUMAN_IGH_OGRDB \
--sequences=/data/input/heavy_sequences.csv \
--save-path=/data/output/heavy_results \
--v-allele-threshold=0.75 \
--d-allele-threshold=0.3 \
--j-allele-threshold=0.8
Light Chain Analysis (Single Chain):
python app.py run \
--model-checkpoint=/app/pretrained_models/IGL_S5F_576 \
--genairr-dataconfig=HUMAN_IGL_OGRDB \
--sequences=/data/input/light_sequences.csv \
--save-path=/data/output/light_results \
--airr-format \
--fix-orientation
Multi-Chain Light Chain Analysis (IGK + IGL):
python app.py run \
--model-checkpoint=/app/pretrained_models/MultiLight_S5F_576 \
--genairr-dataconfig=HUMAN_IGK_OGRDB,HUMAN_IGL_OGRDB \
--sequences=/data/input/mixed_light_sequences.csv \
--save-path=/data/output/multichain_results \
--airr-format
T-Cell Receptor Beta Chain:
python app.py run \
--model-checkpoint=/app/pretrained_models/TCRB_Uniform_576 \
--genairr-dataconfig=HUMAN_TCRB_IMGT \
--sequences=/data/input/tcr_sequences.csv \
--save-path=/data/output/tcr_results
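For scripted runs, the commands above can also be assembled from Python. The following is a minimal sketch and not part of AlignAIR: `build_alignair_command` is a hypothetical helper that uses only the flags shown in the examples above, and it assumes the command is executed inside the container, where `app.py` is in the working directory.

```python
import shlex


def build_alignair_command(checkpoint, dataconfigs, sequences, save_path, extra_flags=()):
    """Assemble the argv list for `python app.py run` (hypothetical helper).

    Only flags that appear in the usage examples are used. Several
    dataconfigs are joined with commas, as in the multi-chain example.
    """
    if not dataconfigs:
        raise ValueError("at least one GenAIRR dataconfig is required")
    cmd = [
        "python", "app.py", "run",
        f"--model-checkpoint={checkpoint}",
        f"--genairr-dataconfig={','.join(dataconfigs)}",
        f"--sequences={sequences}",
        f"--save-path={save_path}",
    ]
    cmd.extend(extra_flags)
    return cmd


cmd = build_alignair_command(
    checkpoint="/app/pretrained_models/IGH_S5F_576",
    dataconfigs=["HUMAN_IGH_OGRDB"],
    sequences="/data/input/heavy_sequences.csv",
    save_path="/data/output/heavy_results",
    extra_flags=["--v-allele-threshold=0.75"],
)
print(shlex.join(cmd))  # inspect the command line before running it
# To execute inside the container:
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.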
AlignAIR v2.0 introduces a unified architecture that dynamically adapts to different chain types and configurations using GenAIRR dataconfigs:
Architecture | Use Case | DataConfig Support | Multi-Chain |
---|---|---|---|
SingleChainAlignAIR | Single receptor type analysis | Single GenAIRR dataconfig | No |
MultiChainAlignAIR | Mixed receptor analysis | Multiple GenAIRR dataconfigs | Yes |
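The selection rule in the table above (one dataconfig gives the single-chain model, several give the multi-chain model) can be sketched as follows. This illustrates the rule only. `select_architecture` is a hypothetical function, and how AlignAIR performs this dispatch internally is not shown here.

```python
def select_architecture(dataconfig_arg):
    """Map a --genairr-dataconfig value to an architecture name (illustrative).

    The value is treated as a comma-separated list, as in the usage examples.
    """
    configs = [part.strip() for part in dataconfig_arg.split(",") if part.strip()]
    if not configs:
        raise ValueError("no dataconfig given")
    name = "SingleChainAlignAIR" if len(configs) == 1 else "MultiChainAlignAIR"
    return name, configs


print(select_architecture("HUMAN_IGH_OGRDB"))
print(select_architecture("HUMAN_IGK_OGRDB,HUMAN_IGL_OGRDB"))
```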
DataConfig | Chain Type | Species | Reference | D Gene |
---|---|---|---|---|
HUMAN_IGH_OGRDB | Heavy Chain | Human | OGRDB v1.5 | ✓ |
HUMAN_IGK_OGRDB | Kappa Light | Human | OGRDB v1.5 | ✗ |
HUMAN_IGL_OGRDB | Lambda Light | Human | OGRDB v1.5 | ✗ |
HUMAN_TCRB_IMGT | TCR Beta | Human | IMGT v3.1.25 | ✓ |
HUMAN_IGH_EXTENDED | Heavy Chain Extended | Human | OGRDB + Custom | ✓ |
The Docker container ships with optimized models for common use cases:
Model | Architecture | Supported Configs | Checkpoint Path |
---|---|---|---|
Heavy Chain | SingleChainAlignAIR | HUMAN_IGH_OGRDB | /app/pretrained_models/IGH_S5F_576 |
Lambda Light | SingleChainAlignAIR | HUMAN_IGL_OGRDB | /app/pretrained_models/IGL_S5F_576 |
Kappa Light | SingleChainAlignAIR | HUMAN_IGK_OGRDB | /app/pretrained_models/IGK_S5F_576 |
Multi-Light | MultiChainAlignAIR | HUMAN_IGK_OGRDB,HUMAN_IGL_OGRDB | /app/pretrained_models/MultiLight_S5F_576 |
TCR Beta | SingleChainAlignAIR | HUMAN_TCRB_IMGT | /app/pretrained_models/TCRB_Uniform_576 |
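In scripts, the table above can be kept as a small lookup so that a checkpoint and its dataconfigs always travel together. This is a sketch that copies the paths from the table. `checkpoint_for` is a hypothetical helper, and sorting the names is only a convenience for the lookup, not a statement about how AlignAIR treats dataconfig order.

```python
# Checkpoint paths as listed in the pretrained-model table.
PRETRAINED_CHECKPOINTS = {
    ("HUMAN_IGH_OGRDB",): "/app/pretrained_models/IGH_S5F_576",
    ("HUMAN_IGL_OGRDB",): "/app/pretrained_models/IGL_S5F_576",
    ("HUMAN_IGK_OGRDB",): "/app/pretrained_models/IGK_S5F_576",
    ("HUMAN_IGK_OGRDB", "HUMAN_IGL_OGRDB"): "/app/pretrained_models/MultiLight_S5F_576",
    ("HUMAN_TCRB_IMGT",): "/app/pretrained_models/TCRB_Uniform_576",
}


def checkpoint_for(*dataconfigs):
    """Return the bundled checkpoint path for a set of dataconfig names."""
    key = tuple(sorted(dataconfigs))  # lookup is order-insensitive
    if key not in PRETRAINED_CHECKPOINTS:
        raise KeyError(f"no bundled model for dataconfigs {key}")
    return PRETRAINED_CHECKPOINTS[key]


print(checkpoint_for("HUMAN_IGL_OGRDB", "HUMAN_IGK_OGRDB"))
```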
You can use custom GenAIRR dataconfigs by providing a path to a pickled DataConfig object:
python app.py run \
--model-checkpoint=path/to/custom/model \
--genairr-dataconfig=/path/to/custom_dataconfig.pkl \
--sequences=input.csv \
--save-path=output/
For multi-chain custom configs:
python app.py run \
--model-checkpoint=path/to/multichain/model \
--genairr-dataconfig=/path/to/config1.pkl,/path/to/config2.pkl \
--sequences=input.csv \
--save-path=output/
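Because a custom config is passed as a path to a pickled object, it can help to check the file round-trips before starting a long run. The sketch below uses only the standard library. The dictionary is a stand-in for illustration and is not GenAIRR's DataConfig class; building a real DataConfig is covered by GenAIRR and is not shown here.

```python
import pickle
import tempfile
from pathlib import Path


def save_config(config, path):
    """Write any picklable object to `path`."""
    with open(path, "wb") as fh:
        pickle.dump(config, fh)


def load_config(path):
    """Read a pickled object back. Only unpickle files you trust:
    loading a pickle can execute arbitrary code."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Stand-in object (hypothetical); a real run would pickle a GenAIRR DataConfig.
example_config = {"name": "MY_CUSTOM_IGH", "v_alleles": ["IGHV1-2*01"]}

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "custom_dataconfig.pkl"
    save_config(example_config, pkl_path)
    restored = load_config(pkl_path)

print(type(restored).__name__, restored["name"])
```

Pickles are tied to the Python classes they were created from, so a config pickled under one GenAIRR version may not load under another.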
Step‑by‑step guide
- Pull image
  docker pull thomask90/alignair:latest
- Run container
  docker run -it --rm \
  -v "/path/to/local/data:/data" \
  thomask90/alignair:latest
- Inside the container, run AlignAIR:
  python app.py run \
  --model-checkpoint="/app/pretrained_models/IGH_S5F_576" \
  --genairr-dataconfig=HUMAN_IGH_OGRDB \
  --sequences="/data/test01.csv" \
  --save-path="/data"
  Results are written back to your mounted /data folder.
- For help and all parameters:
  python app.py run --help
Parameter | Description | Default |
---|---|---|
--model-checkpoint | Path to model weights | Required |
--chain-type | Specify heavy, light, or tcrb | Required |
--sequences | Input file path (CSV/TSV/FASTA) | Required |
--save-path | Output directory | Required |
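A CSV input file can be produced with the standard library. The sketch below is illustrative. The column name `sequence` is an assumption and is not stated in this document, so check `python app.py run --help` or the project documentation for the expected header. Uppercasing and dropping empty entries are choices made here, not requirements of AlignAIR.

```python
import csv
import io


def write_sequences_csv(sequences, handle, column="sequence"):
    """Write nucleotide sequences to a one-column CSV and return the row count.

    `column` is an assumed header name, not confirmed by AlignAIR's docs here.
    """
    writer = csv.writer(handle)
    writer.writerow([column])
    count = 0
    for seq in sequences:
        cleaned = seq.strip().upper()
        if not cleaned:  # skip blank entries
            continue
        writer.writerow([cleaned])
        count += 1
    return count


buffer = io.StringIO()
n_written = write_sequences_csv(["caggtgcagctg", "  GAGGTGCAGCTG\n", ""], buffer)
print(n_written)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")`.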
Parameter | Description | Default |
---|---|---|
--max-input-size | Maximum input window size | 576 |
--batch-size | Sequences per batch | 2048 |
Parameter | Description | Default |
---|---|---|
--v-allele-threshold | V allele calling threshold | 0.75 |
--d-allele-threshold | D allele calling threshold | 0.30 |
--j-allele-threshold | J allele calling threshold | 0.80 |
--v-cap / --d-cap / --j-cap | Maximum calls per segment | 3 |
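To make the interplay of a threshold and a cap concrete, here is one plausible selection rule: keep every allele whose likelihood is at least `threshold` times the top likelihood, then truncate to `cap` calls. Whether AlignAIR applies its thresholds exactly this way is an assumption. `select_alleles` is a hypothetical function, and the scores are made-up illustration values.

```python
def select_alleles(likelihoods, threshold, cap):
    """Return up to `cap` allele names, best first (illustrative rule only).

    An allele is kept when its likelihood >= threshold * best likelihood.
    """
    if not likelihoods:
        return []
    best = max(likelihoods.values())
    ranked = sorted(likelihoods.items(), key=lambda item: item[1], reverse=True)
    kept = [name for name, score in ranked if score >= threshold * best]
    return kept[:cap]


# Made-up per-allele likelihoods for one sequence.
v_scores = {
    "IGHV1-2*01": 0.90,
    "IGHV1-2*02": 0.72,
    "IGHV1-3*01": 0.10,
    "IGHV3-23*01": 0.70,
}
print(select_alleles(v_scores, threshold=0.75, cap=3))
print(select_alleles(v_scores, threshold=0.75, cap=2))
```

Under this rule, a higher threshold yields fewer and more confident calls, while the cap bounds the number of calls reported for ambiguous sequences.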
Parameter | Description | Default |
---|---|---|
--airr-format | Output full AIRR Schema | false |
--fix-orientation | Auto-correct orientations | true |
--translate-to-asc | Output ASC allele names | false |
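Output written with `--airr-format` can be summarized with a few lines of Python. This sketch assumes a tab-separated file that uses the standard AIRR Rearrangement field names `v_call` and `productive`. The output file name and delimiter are not specified in this document, so the function takes an open handle and a `delimiter` argument. The sample rows are made up.

```python
import csv
import io
from collections import Counter


def summarize_airr(handle, delimiter="\t"):
    """Count rows, productive rows, and first-listed V calls in an AIRR table."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    total = 0
    productive = 0
    v_calls = Counter()
    for row in reader:
        total += 1
        # AIRR TSV encodes booleans as T/F; accept "true" as well.
        if (row.get("productive") or "").strip().upper() in {"T", "TRUE"}:
            productive += 1
        # v_call may hold several comma-separated alleles; count the first.
        first_v = (row.get("v_call") or "").split(",")[0].strip()
        if first_v:
            v_calls[first_v] += 1
    return total, productive, v_calls


sample = (
    "sequence_id\tv_call\tproductive\n"
    "s1\tIGHV1-2*01,IGHV1-2*02\tT\n"
    "s2\tIGHV3-23*01\tF\n"
)
total, productive, v_calls = summarize_airr(io.StringIO(sample))
print(total, productive, dict(v_calls))
```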
For complete parameter list: python app.py run --help
See the examples/ folder for Jupyter notebooks:
- End‑to‑end heavy‑chain pipeline
- Benchmark vs. IgBLAST on 10 K reads
- Batch processing workflows
Training & benchmark datasets are archived on Zenodo: doi:10.5281/zenodo.XXXXXXXX
For comprehensive documentation, examples, and technical details, visit: https://alignair.ai/docs
Pull requests are welcome! Please:
- Run pre-commit run --all-files
- Ensure pytest passes
- Update CHANGELOG.md
See CONTRIBUTING.md for full guidelines.
This project is licensed under the terms of the GNU General Public License v3.0 or later (GPLv3+).
Open an issue or email thomaskon90@gmail.com.
For announcements, visit https://alignair.ai or join our Slack.