Adaptive Immune Receptor Repertoire sequence simulator
Generate realistic BCR & TCR repertoires in a single line of Python.
- Why GenAIRR?
- Key Features
- Installation
- Quick Start
- Examples
- Mutation Models
- Roadmap
- Contributing
- Citing GenAIRR
- License
- Acknowledgements
Benchmarking modern aligners, exploring somatic hypermutation, or stress-testing novel ML pipelines requires large, perfectly annotated repertoires, not snippets of real data peppered with sequencing errors.
GenAIRR fills that gap with a plug-and-play, fully extensible simulation engine that records complete ground-truth labels for every sequence it produces.
Category | Highlights |
---|---|
Realistic Simulation | Context-aware S5F mutations, indels, allele-specific trimming, NP-region modelling |
Composable Pipelines | Chain together built-in & custom AugmentationStep objects into simulation pipelines |
Multi-Chain Support | Heavy & light BCRs plus TCR-β out of the box |
Research-ready Output | JSON / pandas export, built-in plotting stubs, deterministic seeds |
Docs & Tutorials | Rich API docs, Jupyter notebooks, step-by-step guides |
# Python ≥ 3.9
pip install GenAIRR
# or the bleeding edge
pip install git+https://github.com/MuteJester/GenAIRR.git
Below is a 60-second tour. See /examples for notebooks and CLI usage.
from GenAIRR.pipeline import AugmentationPipeline
from GenAIRR.steps import SimulateSequence, FixVPositionAfterTrimmingIndexAmbiguity
from GenAIRR.mutation import S5F
from GenAIRR.data import HUMAN_IGH_OGRDB
from GenAIRR.steps.StepBase import AugmentationStep
# 1️⃣ Configure built-in germline data
AugmentationStep.set_dataconfig(HUMAN_IGH_OGRDB)
# 2️⃣ Build a minimal pipeline
pipeline = AugmentationPipeline([
    SimulateSequence(S5F(min_mutation_rate=0.003, max_mutation_rate=0.25), True),
    FixVPositionAfterTrimmingIndexAmbiguity()
])
# 3️⃣ Simulate!
sim = pipeline.execute()
print(sim.get_dict())
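A repertoire is many such records, so a natural next step is to call `execute()` in a loop and collect the dictionaries. The sketch below is one possible way to do that using only the two calls shown above (`execute()` and `get_dict()`). The helper names `simulate_records` and `to_json_lines`, the `FakeResult` stand-in, and its field names are hypothetical; they are here only so the snippet runs without GenAIRR installed.

```python
import json
from typing import Callable, Iterable, Iterator


def simulate_records(execute: Callable[[], object], n: int) -> Iterator[dict]:
    """Call execute() n times and yield each result's get_dict() output."""
    for _ in range(n):
        yield execute().get_dict()


def to_json_lines(records: Iterable[dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    # default=str is a fallback for values json cannot encode natively.
    return "\n".join(json.dumps(record, default=str) for record in records)


class FakeResult:
    """Hypothetical stand-in for a simulation result; only get_dict() is mimicked."""

    def get_dict(self) -> dict:
        # Placeholder fields; real records carry GenAIRR's own keys.
        return {"sequence": "ACGTACGT", "productive": True}


# With GenAIRR installed, pass pipeline.execute instead of FakeResult.
jsonl = to_json_lines(simulate_records(FakeResult, 3))
print(jsonl)
```

Because the records are plain dictionaries, the same generator can also feed a pandas DataFrame or a CSV writer.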
from GenAIRR.steps import (
    FixDPositionAfterTrimmingIndexAmbiguity, FixJPositionAfterTrimmingIndexAmbiguity,
    CorrectForVEndCut, CorrectForDTrims, CorruptSequenceBeginning,
    InsertNs, InsertIndels, ShortDValidation, DistillMutationRate
)
pipeline = AugmentationPipeline([
    SimulateSequence(S5F(min_mutation_rate=0.003, max_mutation_rate=0.25), True),
    FixVPositionAfterTrimmingIndexAmbiguity(),
    FixDPositionAfterTrimmingIndexAmbiguity(),
    FixJPositionAfterTrimmingIndexAmbiguity(),
    CorrectForVEndCut(),
    CorrectForDTrims(),
    CorruptSequenceBeginning(0.7, [0.4, 0.4, 0.2], 576, 210, 310, 50),
    InsertNs(0.02, 0.5),
    ShortDValidation(),
    InsertIndels(0.5, 5, 0.5, 0.5),
    DistillMutationRate()
])
result = pipeline.execute()
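For intuition about what chaining steps means, here is a toy sketch of the pattern: each step receives a shared container, modifies the sequence, and records what it did as a ground-truth label. This is not GenAIRR's implementation. `ToyContainer`, `ToyPipeline`, `MaskWithNs`, `RecordLength`, and the `apply` method name are all hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyContainer:
    """Hypothetical stand-in for a simulated record: a sequence plus labels."""
    sequence: str
    labels: dict = field(default_factory=dict)


class MaskWithNs:
    """Toy step: replace each base with 'N' with probability p."""

    def __init__(self, p: float, rng: random.Random):
        self.p, self.rng = p, rng

    def apply(self, container: ToyContainer) -> None:
        seq = list(container.sequence)
        masked = [i for i in range(len(seq)) if self.rng.random() < self.p]
        for i in masked:
            seq[i] = "N"
        container.sequence = "".join(seq)
        # The change is logged next to the data, so labels stay exact.
        container.labels["n_positions"] = masked


class RecordLength:
    """Toy step: store the final sequence length as a label."""

    def apply(self, container: ToyContainer) -> None:
        container.labels["length"] = len(container.sequence)


class ToyPipeline:
    """Runs the steps in list order on one shared container."""

    def __init__(self, steps):
        self.steps = list(steps)

    def execute(self, sequence: str) -> ToyContainer:
        container = ToyContainer(sequence)
        for step in self.steps:
            step.apply(container)
        return container


rng = random.Random(42)  # fixed seed so repeated runs give the same output
toy_result = ToyPipeline([MaskWithNs(0.2, rng), RecordLength()]).execute("ACGT" * 10)
print(toy_result.sequence, toy_result.labels)
```

Order matters in this pattern: a step that corrects positions has to run after the step that shifts them, which is presumably why the correction steps in the pipeline above follow `SimulateSequence`.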
from GenAIRR.mutation import Uniform
naive_step = SimulateSequence(Uniform(0, 0), True)
pipeline = AugmentationPipeline([naive_step])
naive_seq = pipeline.execute()
print(naive_seq.sequence)
custom_step = SimulateSequence(
    S5F(0.003, 0.25),
    True,
    specific_v=HUMAN_IGH_OGRDB.v_alleles['IGHV1-2*02'][0],   # specific V allele
    specific_d=HUMAN_IGH_OGRDB.d_alleles['IGHD3-10*01'][0],  # specific D allele
    specific_j=HUMAN_IGH_OGRDB.j_alleles['IGHJ4*02'][0]      # specific J allele
)
pipeline = AugmentationPipeline([custom_step])
print(pipeline.execute().get_dict())
Model | Description | When to use |
---|---|---|
S5F | Context-specific somatic hypermutation | Antibody maturation studies |
Uniform | Evenly random mutations | Baselines / ablation |
Your Model ➕ | Implement BaseMutationModel | Custom evolutionary scenarios |
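To illustrate what "context-specific" means for S5F-style models, the toy sketch below weights each position by the 5-mer centred on it, so some motifs are targeted more often than others. The mutability values are made up for illustration and are not the real S5F table; `TOY_MUTABILITY`, `position_weights`, and `sample_target` are hypothetical names, and positions without a full 5-mer are simply skipped here.

```python
import random

# Made-up relative mutabilities for two 5-mers; the centre base is the target.
TOY_MUTABILITY = {"AAGCT": 3.0, "TAGCA": 2.5}
DEFAULT_MUTABILITY = 1.0


def position_weights(seq: str) -> list:
    """Weight each position by the mutability of the 5-mer centred on it."""
    weights = []
    for i in range(len(seq)):
        if i < 2 or i > len(seq) - 3:
            weights.append(0.0)  # no full 5-mer at the edges (toy simplification)
        else:
            weights.append(TOY_MUTABILITY.get(seq[i - 2:i + 3], DEFAULT_MUTABILITY))
    return weights


def sample_target(seq: str, rng: random.Random) -> int:
    """Draw one position to mutate, proportionally to its context weight."""
    if len(seq) < 5:
        raise ValueError("need at least one full 5-mer")
    return rng.choices(range(len(seq)), weights=position_weights(seq), k=1)[0]


toy_seq = "CCAAGCTCC"
print(position_weights(toy_seq))
print(sample_target(toy_seq, random.Random(0)))
```

A uniform model corresponds to giving every position the same weight.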
from GenAIRR.mutation import S5F
s5f = S5F(min_mutation_rate=0.01, max_mutation_rate=0.05)
mut_seq, muts, rate = s5f.apply_mutation(naive_seq)
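As a rough picture of what a custom model might look like, the sketch below returns the same three values unpacked above: the mutated sequence, a record of the mutations, and the realized rate. It works on plain strings and does not subclass GenAIRR's base class, whose exact interface is not shown in this README. `ToyUniformModel`, the `seed` argument, and the `{position: "A>G"}` record format are assumptions made for illustration.

```python
import random


class ToyUniformModel:
    """Toy model: mutate a uniformly chosen fraction of positions."""

    BASES = "ACGT"

    def __init__(self, min_mutation_rate: float, max_mutation_rate: float, seed=None):
        if not 0 <= min_mutation_rate <= max_mutation_rate <= 1:
            raise ValueError("rates must satisfy 0 <= min <= max <= 1")
        self.min_rate = min_mutation_rate
        self.max_rate = max_mutation_rate
        self.rng = random.Random(seed)

    def apply_mutation(self, sequence: str):
        """Return (mutated_sequence, mutations, realized_rate)."""
        target_rate = self.rng.uniform(self.min_rate, self.max_rate)
        n_mut = round(target_rate * len(sequence))
        positions = sorted(self.rng.sample(range(len(sequence)), n_mut))
        seq = list(sequence)
        mutations = {}
        for pos in positions:
            old = seq[pos]
            # Always pick a different base so every chosen site really changes.
            new = self.rng.choice([b for b in self.BASES if b != old])
            seq[pos] = new
            mutations[pos] = f"{old}>{new}"
        realized_rate = n_mut / len(sequence) if sequence else 0.0
        return "".join(seq), mutations, realized_rate


toy_model = ToyUniformModel(0.1, 0.1, seed=1)
toy_mut_seq, toy_muts, toy_rate = toy_model.apply_mutation("ACGT" * 25)
print(toy_mut_seq, toy_muts, toy_rate)
```

Setting both rates to 0 leaves the input untouched, mirroring the `Uniform(0, 0)` naive-sequence example earlier.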
- 🚧 More Complex Mutation Model (With Selection)
- 🚧 More Built-in Data Configs (e.g., TCR, custom germlines)
- 🚧 More Built-in Steps (e.g., more mutation models, more data augmentation)
- 🚧 Deeper Docs (e.g., more examples, more tutorials)
See the open issues. Feel something’s missing? Open a feature request.
Contributions are welcome! 💙 Please read our contributing guide and check the good first issue label.
If GenAIRR helps your research, please cite:
Konstantinovsky T, Peres A, Polak P, Yaari G.
An unbiased comparison of immunoglobulin sequence aligners.
Briefings in Bioinformatics. 2024 Sep 23; 25(6): bbae556.
https://doi.org/10.1093/bib/bbae556
PMID: 39489605 | PMCID: PMC11531861
Distributed under the GNU General Public License v3 (GPLv3). See LICENSE for details.
GenAIRR is inspired by and builds upon amazing work from the immunoinformatics community—especially AIRRship.