Adaptive Immune Receptor Repertoire sequence simulator
Generate realistic BCR & TCR repertoires in a single line of Python.
- Why GenAIRR?
- Key Features
- Installation
- Quick Start
- Examples
- Mutation Models
- Roadmap
- Contributing
- Citing GenAIRR
- License
- Acknowledgements
Benchmarking modern aligners, exploring somatic hypermutation, or stress-testing novel ML pipelines requires large, perfectly annotated repertoires, not snippets of real data peppered with sequencing errors.
GenAIRR fills that gap with a plug-and-play, fully-extensible simulation engine that produces sequences while giving you full ground-truth labels.
Category | Highlights |
---|---|
Realistic Simulation | Context-aware S5F mutations, indels, allele-specific trimming, NP-region modelling |
Composable Pipelines | Chain together built-in & custom AugmentationSteps into simulation pipelines |
Multi-Chain Support | Heavy & light BCRs plus TCR-β out of the box |
Research-ready Output | JSON / pandas export, built-in plotting stubs, deterministic seeds |
Docs & Tutorials | Rich API docs, Jupyter notebooks, step-by-step guides |
# Python ≥ 3.9
pip install GenAIRR
# or the bleeding edge
pip install git+https://github.com/your-org/GenAIRR.git
Below is a 60-second tour. See /examples for notebooks and CLI usage.
from GenAIRR.pipeline import AugmentationPipeline
from GenAIRR.parameters import ChainType, CHAIN_TYPE_INFO
from GenAIRR.steps import SimulateSequence, FixVPositionAfterTrimmingIndexAmbiguity
from GenAIRR.mutation import S5F
from GenAIRR.data import builtin_heavy_chain_data_config
from GenAIRR.steps.StepBase import AugmentationStep
# 1️⃣ Configure built-in germline & chain type
data_cfg = builtin_heavy_chain_data_config()
AugmentationStep.set_dataconfig(config=data_cfg, chain_type=ChainType.BCR_HEAVY)
# 2️⃣ Build a minimal pipeline
pipeline = AugmentationPipeline([
    SimulateSequence(
        mutation_model=S5F(min_mutation_rate=0.003, max_mutation_rate=0.25),
        productive=True,
    ),
    FixVPositionAfterTrimmingIndexAmbiguity(),
])
# 3️⃣ Simulate!
sim = pipeline.execute()
print(sim.get_dict())
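A repertoire is usually more than one sequence. The sketch below repeats the `pipeline.execute().get_dict()` call from the tour and collects the records. The helper names `simulate_repertoire` and `to_json_lines` are hypothetical and not part of GenAIRR; the only assumptions are that each `execute()` call yields a fresh simulation and that `get_dict()` returns a plain dictionary.

```python
import json
from typing import Any, Dict, List


def simulate_repertoire(pipeline: Any, n_sequences: int) -> List[Dict[str, Any]]:
    """Call pipeline.execute() n_sequences times and collect each get_dict() result.

    Hypothetical helper: it relies only on the two calls shown in the tour.
    """
    if n_sequences < 0:
        raise ValueError("n_sequences must be non-negative")
    return [pipeline.execute().get_dict() for _ in range(n_sequences)]


def to_json_lines(records: List[Dict[str, Any]]) -> str:
    """Serialise records as JSON Lines, one simulated sequence per line.

    Values that json cannot encode natively are converted with str().
    """
    return "\n".join(json.dumps(record, default=str) for record in records)
```

With the pipeline built above, `records = simulate_repertoire(pipeline, 1000)` would give a list of dictionaries. If pandas is installed, `pandas.DataFrame(records)` turns that list into a table.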
# A fuller pipeline: position fixes, trimming corrections, 5' corruption, Ns, and indels
from GenAIRR.steps import (
    FixDPositionAfterTrimmingIndexAmbiguity, FixJPositionAfterTrimmingIndexAmbiguity,
    CorrectForVEndCut, CorrectForDTrims, CorruptSequenceBeginning,
    InsertNs, InsertIndels, ShortDValidation, DistillMutationRate,
)
pipeline = AugmentationPipeline([
    SimulateSequence(
        mutation_model=S5F(min_mutation_rate=0.003, max_mutation_rate=0.25),
        productive=True,
    ),
    FixVPositionAfterTrimmingIndexAmbiguity(),
    FixDPositionAfterTrimmingIndexAmbiguity(),
    FixJPositionAfterTrimmingIndexAmbiguity(),
    CorrectForVEndCut(),
    CorrectForDTrims(),
    CorruptSequenceBeginning(
        corruption_probability=0.7,
        corrupt_events_proba=[0.4, 0.4, 0.2],
        max_sequence_length=576,
        nucleotide_add_coefficient=210,
        nucleotide_remove_coefficient=310,
        nucleotide_add_after_remove_coefficient=50,
        random_sequence_add_proba=1,
    ),
    InsertNs(n_ratio=0.02, proba=0.5),
    ShortDValidation(short_d_length=5),
    InsertIndels(indel_probability=0.5, max_indels=5, insertion_proba=0.5, deletion_proba=0.5),
    DistillMutationRate(),
])
result = pipeline.execute()
from GenAIRR.mutation import Uniform
# Zero mutation rate: the result is a naive (unmutated) sequence
naive_step = SimulateSequence(mutation_model=Uniform(0, 0), productive=True)
pipeline = AugmentationPipeline([naive_step])
naive_seq = pipeline.execute()
print(naive_seq.sequence)
# Pin the V, D, and J alleles instead of sampling them
custom_step = SimulateSequence(
    mutation_model=S5F(0.003, 0.25),
    productive=True,
    specific_v=data_cfg.allele_list('v')[0],  # specific V allele (as Allele object)
    specific_d=data_cfg.allele_list('d')[0],  # specific D allele (as Allele object)
    specific_j=data_cfg.allele_list('j')[0],  # specific J allele (as Allele object)
)
pipeline = AugmentationPipeline([custom_step])
print(pipeline.execute().get_dict())
Model | Description | When to use |
---|---|---|
S5F | Context-specific somatic hyper-mutation | Antibody maturation studies |
Uniform | Evenly random mutations | Baselines / ablation |
Your Model ➕ | Implement BaseMutationModel | Custom evolutionary scenarios |
from GenAIRR.mutation import S5F
s5f = S5F(min_mutation_rate=0.01, max_mutation_rate=0.05)
mut_seq, muts, rate = s5f.apply_mutation(naive_seq)
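To make the three-value return of `apply_mutation` concrete, here is a toy model that applies evenly random substitutions to a plain string. It is a stand-alone sketch: `ToyUniformModel` is a hypothetical name, it does not subclass `BaseMutationModel` (whose interface is not shown here), and the `position -> "A>G"` format of the mutation record is an assumption chosen for illustration, not GenAIRR's own format.

```python
import random
from typing import Dict, Optional, Tuple

NUCLEOTIDES = "ACGT"


class ToyUniformModel:
    """Illustrative only: evenly random substitutions on a plain string."""

    def __init__(self, min_mutation_rate: float, max_mutation_rate: float,
                 seed: Optional[int] = None) -> None:
        if not 0 <= min_mutation_rate <= max_mutation_rate <= 1:
            raise ValueError("rates must satisfy 0 <= min <= max <= 1")
        self.min_mutation_rate = min_mutation_rate
        self.max_mutation_rate = max_mutation_rate
        self._rng = random.Random(seed)  # private RNG keeps runs reproducible

    def apply_mutation(self, sequence: str) -> Tuple[str, Dict[int, str], float]:
        """Return (mutated sequence, {position: 'old>new'}, realised rate)."""
        # Draw a target rate, then convert it to a whole number of positions.
        target_rate = self._rng.uniform(self.min_mutation_rate, self.max_mutation_rate)
        n_mutations = round(target_rate * len(sequence))
        # Sample distinct positions so every position is equally likely.
        positions = self._rng.sample(range(len(sequence)), n_mutations)

        bases = list(sequence)
        mutations: Dict[int, str] = {}
        for pos in sorted(positions):
            original = bases[pos]
            # Always substitute a different base so each event is a real change.
            replacement = self._rng.choice([b for b in NUCLEOTIDES if b != original])
            bases[pos] = replacement
            mutations[pos] = f"{original}>{replacement}"

        realised_rate = n_mutations / len(sequence) if sequence else 0.0
        return "".join(bases), mutations, realised_rate
```

For example, `ToyUniformModel(0.01, 0.05, seed=42).apply_mutation("ACGT" * 80)` returns a 320-base string together with the substitutions that were applied. A context-aware model such as S5F differs in that the probability of mutating a position depends on its neighbouring bases rather than being the same everywhere.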
- 🚧 More Complex Mutation Model (With Selection)
- 🚧 More Built-in Data Configs (e.g., TCR, custom germlines)
- 🚧 More Built-in Steps (e.g., more mutation models, more data augmentation)
- 🚧 Deeper Docs (e.g., more examples, more tutorials)
See the open issues. Feel something’s missing? Open a feature request.
Contributions are welcome! 💙 Please read our contributing guide and check the good first issue label.
If GenAIRR helps your research, please cite:
Konstantinovsky T, Peres A, Polak P, Yaari G.
An unbiased comparison of immunoglobulin sequence aligners.
Briefings in Bioinformatics. 2024 Sep 23; 25(6): bbae556.
https://doi.org/10.1093/bib/bbae556
PMID: 39489605 | PMCID: PMC11531861
Distributed under the MIT License. See LICENSE for details.
GenAIRR is inspired by and builds upon amazing work from the immunoinformatics community, especially AIRRship.