This repository contains scripts for reproducing the GLM benchmark on Nullsette.
git clone https://github.com/cellethology/GLM-Nullsette-Benchmark.git
cd GLM-Nullsette-Benchmark
conda env create -f environment.yml
conda activate glm_eval
Example data is in the data/ directory. For the inference data used in the paper, unzip the data/processed_data.zip file.
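The archive can also be unpacked from Python, which may be convenient on systems without an unzip tool. The following is a minimal sketch using only the standard library; the helper name extract_archive and the choice of destination directory are assumptions for illustration, not part of this repository.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names.

    Hypothetical helper: not provided by this repository.
    """
    zip_path = Path(zip_path)
    dest_dir = Path(dest_dir)
    if not zip_path.is_file():
        raise FileNotFoundError(f"archive not found: {zip_path}")
    # Create the destination if it does not exist yet.
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(dest_dir)
    return names
```

Run from the repository root, a call such as extract_archive("data/processed_data.zip", "data") would unpack the archive next to the example data. Whether the scripts expect the files in that location is an assumption, so adjust the destination if they look elsewhere.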
Expression cassette data is stored in the database directory. It can be imported as follows:
from database import deboer_database, zahm_database, kosuri_database, lagator_database
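If the import above fails, it can help to check which of the four names are actually reachable from the current working directory. The sketch below is a generic standard-library check; the helper name missing_names is hypothetical, and it assumes only that database is importable as a module or package (for example, when Python is started from the repository root).

```python
import importlib


def missing_names(module_name, names):
    """Return the entries of names that cannot be imported from module_name.

    Hypothetical helper: not provided by this repository.
    """
    module = importlib.import_module(module_name)
    missing = []
    for name in names:
        # The name may be an attribute defined directly on the module...
        if hasattr(module, name):
            continue
        # ...or a submodule that has not been loaded yet.
        try:
            importlib.import_module(f"{module_name}.{name}")
        except ImportError:
            missing.append(name)
    return missing
```

For example, missing_names("database", ["deboer_database", "zahm_database", "kosuri_database", "lagator_database"]) returns an empty list when all four names can be imported.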
Inference scripts for several representation models are in the model directory.
We acknowledge the valuable contributions to genomic language modeling made by the authors of the following repositories: Evo1, Evo2, Nucleotide Transformer, DNABERT-2, GENERator, METAGENE-1, Caduceus, GPN, GENA-LM, gLM2, PDLLM.