Language models for Biological Sequence Transformation and Evolutionary Representation
lobster is a "batteries included" language model library for proteins and other biological sequences. Led by Nathan Frey, Karina Zadorozhny, Taylor Joren, Sidney Lisanza, Aya Abdelsalam Ismail, Joseph Kleinhenz and Allen Goodman, with many valuable contributions from contributors across Prescient Design, Genentech.
This repository contains training code and access to pre-trained language models for biological sequence data.
- LBSTER is built for pre-training models quickly from scratch. It is "batteries included." This is most useful if you need to control the pre-training data mixture and embedding space, or want to experiment with novel pre-training objectives and fine-tuning strategies.
- LBSTER is a living, open-source library that will be periodically updated with new code and pre-trained models from the Frey Lab at Prescient Design, Genentech. The Frey Lab works on real therapeutic molecule design problems and LBSTER models and capabilities reflect the demands of real-world drug discovery campaigns.
- LBSTER is built with beignet, a standard library for biological research, and integrated with cortex, a modular framework for multitask modeling, guided generation, and multi-modal models.
- LBSTER supports concepts; we have a concept-bottleneck protein language model, CB-LBSTER, which supports 718 concepts.
If you use the code and/or models, please cite the relevant papers.
For the lbster code base, cite: Cramming Protein Language Model Training in 24 GPU Hours
@article{Frey2024.05.14.594108,
author = {Frey, Nathan C. and Joren, Taylor and Ismail, Aya Abdelsalam and Goodman, Allen and Bonneau, Richard and Cho, Kyunghyun and Gligorijevi{\'c}, Vladimir},
title = {Cramming Protein Language Model Training in 24 GPU Hours},
elocation-id = {2024.05.14.594108},
year = {2024},
doi = {10.1101/2024.05.14.594108},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108},
eprint = {https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108.full.pdf},
journal = {bioRxiv}
}
For the cb-lbster code base, cite: Concept Bottleneck Language Models for Protein Design
@article{ismail2024conceptbottlenecklanguagemodels,
title={Concept Bottleneck Language Models For protein design},
author={Aya Abdelsalam Ismail and Tuomas Oikarinen and Amy Wang and Julius Adebayo and Samuel Stanton and Taylor Joren and Joseph Kleinhenz and Allen Goodman and Héctor Corrada Bravo and Kyunghyun Cho and Nathan C. Frey},
year={2024},
eprint={2411.06090},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2411.06090},
}
Install uv and run
uv sync
For different optional dependencies, run
uv sync --extra <group name 1> --extra <group name 2>
where <group name> can be one of
- lg-gpu, lg-cpu for Latent Generator dependencies for GPU or CPU, respectively
- mgm for UME dependencies
- flash for flash-attention on GPU
- mcp for MCP servers
- trl for transformer reinforcement learning
Recommended installation of all optional dependencies on a CPU:
uv sync --extra mgm --extra mcp --extra lg-cpu --extra trl
Recommended installation of all optional dependencies on a GPU:
uv sync --extra mgm --extra mcp --extra lg-gpu --extra flash --extra trl
To use the environment, you can either activate it...
source .venv/bin/activate
python -c "import lobster"
lobster_train data.path_to_fasta="test_data/query.fasta"
... or run commands with uv run:
uv run python -c "import lobster"
uv run lobster_train data.path_to_fasta="test_data/query.fasta"
Shorthand | #params | Dataset | Description | Model checkpoint |
---|---|---|---|---|
Lobster_24M | 24 M | uniref50 | 24M parameter protein Masked LLM trained on uniref50 | lobster_24M |
Lobster_150M | 150 M | uniref50 | 150M parameter protein Masked LLM trained on uniref50 | lobster_150M |
Shorthand | #params | Dataset | Description | Model checkpoint |
---|---|---|---|---|
cb_Lobster_24M | 24 M | uniref50+SwissProt | 24M parameter concept bottleneck protein language model with 718 concepts | cb_lobster_24M |
cb_Lobster_150M | 150 M | uniref50+SwissProt | 150M parameter concept bottleneck protein language model with 718 concepts | cb_lobster_150M |
cb_Lobster_650M | 650 M | uniref50+SwissProt | 650M parameter concept bottleneck protein language model with 718 concepts | cb_lobster_650M |
cb_Lobster_3B | 3 B | uniref50+SwissProt | 3B parameter concept bottleneck protein language model with 718 concepts | cb_lobster_3B |
from lobster.model import LobsterPMLM, LobsterPCLM, LobsterCBMPMLM
masked_language_model = LobsterPMLM("asalam91/lobster_24M")
concept_bottleneck_masked_language_model = LobsterCBMPMLM("asalam91/cb_lobster_24M")
causal_language_model = LobsterPCLM.load_from_checkpoint(<path to ckpt>)
3D, cDNA, and dynamic models use the same classes.
Models
- LobsterPMLM: masked language model (BERT-style encoder-only architecture)
- LobsterCBMPMLM: concept bottleneck masked language model (BERT-style encoder-only architecture with a concept bottleneck and a linear decoder)
- LobsterPCLM: causal language model (Llama-style decoder-only architecture)
- LobsterPLMFold: structure prediction language models (pre-trained encoder + structure head)
Check out this Jupyter notebook tutorial for an example of how to extract embedding representations from different models.
Check out this jupyter notebook tutorial for an example on how to intervene on different concepts for our concept-bottleneck models class.
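To make the idea of a concept intervention concrete, here is a toy sketch of a concept bottleneck with a linear decoder. It does not use the lobster API: the weights, the concept names, and the `forward` helper are all hypothetical and only illustrate why overwriting one concept value changes the output in a predictable way.

```python
# Toy concept bottleneck: hidden state -> concept scores -> linear decoder.
# All weights and concept names are made up for illustration.

def linear(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

CONCEPTS = ["hydrophobicity", "charge"]  # hypothetical concept names

# Hypothetical 2x3 matrix mapping a 3-d hidden state to 2 concept scores.
W_CONCEPT = [[0.5, -0.2, 0.1],
             [0.0, 0.3, -0.4]]
# Hypothetical 2x2 linear decoder mapping concept scores to 2 output logits.
W_DECODE = [[1.0, -1.0],
            [0.5, 2.0]]

def forward(hidden, interventions=None):
    """Return (concepts, logits); overwrite any concept named in `interventions`."""
    concepts = linear(W_CONCEPT, hidden)
    for name, value in (interventions or {}).items():
        concepts[CONCEPTS.index(name)] = value
    logits = linear(W_DECODE, concepts)
    return concepts, logits

hidden = [1.0, 2.0, 3.0]
base_concepts, base_logits = forward(hidden)
new_concepts, new_logits = forward(hidden, {"hydrophobicity": -1.0})
```

Because the decoder in this sketch is linear, the change in each logit equals the change in the intervened concept multiplied by the corresponding decoder weight, while untouched concepts contribute exactly as before.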
Lobster supports Model Context Protocol (MCP) for seamless integration with Claude Desktop, Cursor, and other AI tools:
# Install with MCP support
uv sync --extra mcp
# Setup Claude Desktop integration
uv run lobster_mcp_setup
Click the button above to automatically add the Lobster MCP server to Cursor.
Requirements:
- Cursor installed
- uv package manager available in PATH
- Lobster repository cloned locally with all dependencies installed (uv sync --all-extras)
After setup, you can use Lobster models directly in Claude Desktop or Cursor with natural language commands like:
- "Get embeddings for this protein sequence using lobster_24M"
- "What concepts are supported by the cb_lobster_24M model?"
- "Intervene on this sequence to reduce hydrophobicity"
Key Features:
- Modular architecture - Clean separation of models, tools, and schemas
- Multiple model types - Access to both MLM and concept bottleneck models
- 5 core tools - Embeddings, concepts, interventions, naturalness, and model listing
- Type-safe validation - Pydantic schemas for reliable interactions
See the MCP Integration Guide for complete documentation or MCP README for quick start instructions.
Lobster is available as a DXT (Desktop Extension Toolkit) extension for Claude Desktop, providing a one-click installation experience:
- Download: Get the latest .dxt file from GitHub Releases
- Install: Double-click the .dxt file or drag it into Claude Desktop
- Use: Start using Lobster models with natural language commands
- One-click installation - No command line setup required
- Self-contained - Includes all dependencies (~500MB)
- Automatic updates - New versions available through GitHub Releases
- Full functionality - All MCP server capabilities included
Once installed, you can use natural language commands in Claude Desktop:
What Lobster models are available for protein analysis?
Get embeddings for the sequence MKTVRQERLKSIVRIL using lobster_24M
What concepts are supported by the cb_lobster_24M model?
Intervene on MKTVRQERLKSIVRIL to reduce hydrophobicity using cb_lobster_24M
For developers who want to build and test DXT extensions locally:
# Build DXT extension locally
python scripts/build_dxt.py
# Create a release (updates version, builds, and creates GitHub release)
python scripts/release_dxt.py 0.1.0
See DXT Distribution Guide for detailed build and distribution instructions.
Check out examples for scripts showing how to perform inference and interventions.
The entrypoint lobster_embed is the main driver for embedding sequences and accepts parameters using Hydra syntax. The available parameters for configuration can be found by running lobster_embed --help or by looking in the src/lobster/hydra_config directory.
To embed a FASTA file of sequences using a pre-trained model on an interactive GPU node, cd into the root directory of this repo and run
lobster_embed data.path_to_fasta="test_data/query.fasta" checkpoint="path_to_checkpoint.ckpt"
This will generate a dataframe of embeddings and also log them to wandb.
For robust multitask modeling, we recommend using lobster with cortex. For simple baselines using lobster embeddings, use lobster.model.LinearProbe and lobster.model.LobsterMLP.
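As a rough illustration of what a linear-probe baseline does, the sketch below fits a single linear layer on top of fixed embeddings with plain gradient descent. It is not the lobster.model.LinearProbe implementation and does not call it; `fit_linear_probe`, `predict`, and the toy embeddings and targets are assumptions made for this example.

```python
# A linear probe: fit only a linear layer on top of frozen embeddings.
# The embeddings and targets below are toy numbers, not model outputs.

def predict(weights, bias, embedding):
    """Linear prediction for one embedding vector."""
    return sum(w * x for w, x in zip(weights, embedding)) + bias

def fit_linear_probe(embeddings, targets, lr=0.1, steps=2000):
    """Minimize mean squared error with plain gradient descent."""
    dim = len(embeddings[0])
    n = len(embeddings)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(steps):
        errors = [predict(weights, bias, e) - t
                  for e, t in zip(embeddings, targets)]
        grad_w = [2.0 / n * sum(err * e[j] for err, e in zip(errors, embeddings))
                  for j in range(dim)]
        grad_b = 2.0 / n * sum(errors)
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

toy_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_targets = [0.5, 2.5, -0.5, 1.5]  # generated from 2*x0 - x1 + 0.5
probe_weights, probe_bias = fit_linear_probe(toy_embeddings, toy_targets)
```

The embedding model itself is never updated here, which is what makes a probe a cheap baseline: only the small linear layer is trained.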
Likelihoods from an autoregressive LobsterPCLM or pseudo-log likelihoods ("naturalness") from a LobsterPMLM can be computed for a list of sequences using
model.naturalness(sequences)
model.likelihood(sequences)
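For readers unfamiliar with the two scores, the sketch below shows the general definitions: a pseudo-log likelihood masks each position in turn and sums the log-probability of the true residue, while a causal log likelihood sums the log-probability of each residue given its prefix. The toy probability functions are hypothetical stand-ins for a model, and details such as length normalization in the library's own methods may differ.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_masked_probs(sequence, position):
    """Hypothetical stand-in for a masked LM's output at one masked position.
    Toy rule: the left neighbour's residue gets half the probability mass and
    the other 19 residues share the rest; position 0 is uniform."""
    if position == 0:
        return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    favoured = sequence[position - 1]
    rest = 0.5 / (len(AMINO_ACIDS) - 1)
    return {aa: (0.5 if aa == favoured else rest) for aa in AMINO_ACIDS}

def toy_next_probs(prefix):
    """Hypothetical stand-in for a causal LM: uniform over residues."""
    return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

def pseudo_log_likelihood(sequence, masked_probs=toy_masked_probs):
    """Sum over positions of log p(true residue | sequence with that position masked)."""
    return sum(math.log(masked_probs(sequence, i)[aa])
               for i, aa in enumerate(sequence))

def causal_log_likelihood(sequence, next_probs=toy_next_probs):
    """Sum over positions of log p(residue | preceding residues)."""
    return sum(math.log(next_probs(sequence[:i])[aa])
               for i, aa in enumerate(sequence))

pll_repeat = pseudo_log_likelihood("AA")
pll_mixed = pseudo_log_likelihood("AC")
```

Under the toy rule, the repeated sequence scores higher than the mixed one, which is the sense in which a higher pseudo-log likelihood means the model finds a sequence more "natural".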
The entrypoint lobster_train is the main driver for training and accepts parameters using Hydra syntax. The available parameters for configuration can be found by running lobster_train --help or by looking in the src/lobster/hydra_config directory.
To train an MLM on a FASTA file of sequences on an interactive GPU node, cd into the root directory of this repo and run
lobster_train data.path_to_fasta="test_data/query.fasta" logger=csv paths.root_dir="."
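The key=value arguments in the command above are Hydra overrides, where a dotted key addresses a nested configuration entry. The sketch below mimics that mapping with plain dictionaries. It is not Hydra's parser, it treats every value as a string, and the `defaults` dictionary is a hypothetical stand-in for the real configuration. In Hydra, a bare key such as logger=csv typically selects an option from a config group rather than setting a string.

```python
import copy

def apply_overrides(config, overrides):
    """Apply 'a.b=value' strings to a nested dict and return a new dict."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Walk (or create) the nested dictionaries named by the dotted key.
            node = node.setdefault(name, {})
        node[leaf] = raw_value.strip('"')
    return result

# Hypothetical defaults; the real ones live in src/lobster/hydra_config.
defaults = {
    "data": {"path_to_fasta": None},
    "logger": None,
    "paths": {"root_dir": None},
}

cfg = apply_overrides(defaults, [
    'data.path_to_fasta="test_data/query.fasta"',
    "logger=csv",
    'paths.root_dir="."',
])
```

After this call, `cfg["data"]["path_to_fasta"]` holds the FASTA path and `defaults` is left unchanged, which mirrors how overrides are layered on top of the packaged configuration at launch time.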
Lobster supports reinforcement learning training using UME-based reward functions for post-training language models. This approach uses UME pseudo-likelihood scores as rewards to guide model behavior toward generating more biologically plausible sequences.
Quick Start:
# Step 1: Generate synthetic dataset
cd examples
python generate_synthetic_dataset.py
# Step 2: Run UME-based GRPO training
python train_ume_grpo.py
Key Features:
- Automatic modality detection for SMILES, amino acid, and DNA sequences
- UME-based reward functions using pseudo-likelihood scores
- GRPO training with TRL integration
- Modular design with reusable components
For detailed instructions and advanced usage, see the RL Training Guide.
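As background on how a reward becomes a training signal in GRPO, the sketch below computes group-relative advantages: rewards for several completions of the same prompt are centered on the group mean and scaled by the group standard deviation. The function name and the toy reward values are hypothetical, and this sketch uses the population standard deviation, a detail on which implementations can differ.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical pseudo-likelihood rewards for four sampled sequences.
toy_rewards = [-2.0, -1.0, -1.0, 0.0]
advantages = group_relative_advantages(toy_rewards)
```

Sequences scoring above the group average receive a positive advantage and are reinforced, while those below it are discouraged, so only the relative ranking within a group matters and not the absolute scale of the reward.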
Contributions are welcome! We ask that all users and contributors remember that the LBSTER team are all full-time drug hunters, and our open-source efforts are a labor of love because we care deeply about open science and scientific progress.
Expanding unit test coverage, docstrings, and type hints are always welcome and a good place to start to orient yourself to the code base. Likewise for identifying and fixing 🐛bugs🐛. For more involved project ideas, check Good First Issues. All new or modified code must be unit tested before maintainers will review.
pre-commit install
uv pip compile requirements.in -o requirements.txt
python -m pytest -v --cov-report term-missing --cov=./lobster ./tests