Generating Understanding and Interpretation of Multi-Omics Data with an Automated and Generalizable Pipeline
We aim to develop a prototype pipeline that rapidly predicts mechanisms driving disease pathogenesis from multi-omics data, demonstrating the feasibility of using generative AI to uncover biological mechanisms from key features identified during data harmonization.
- Samantha Erwin - AI Research
- Lisa Bramer - Domain Expert
- Daniel Claborne - AI Engineer
- Javier Flores - AI Research
- Matt Jensen - AI Engineer
- Karl Mueller - Basic Energy Sciences: Chem, Geo, and Biosciences
- David Wunschel - National Security: Chem/Bio
- Lauren Charles
With advances in instrumentation, we now collect vast multi-omics datasets that provide insights into biological systems and disease mechanisms. Current methods are often manual and time-consuming. Our team has developed a scalable deep learning model for multi-omics data harmonization. We aim to automate interpretation of predictive features using generative AI, thereby speeding up the discovery of biological mechanisms from existing datasets.
We will use our harmonization model to identify key features from multi-omics datasets. Subsequently, we will apply Llama 3, fine-tuned with Retrieval Augmented Fine-Tuning (RAFT) on 'omics-related literature, to interpret these features and elucidate mechanisms driving disease pathogenesis.
Use your favorite Python environment manager to initialize a virtual environment, for example venv:
# install a virtual env in .venv
python3 -m venv .venv
# activate the environment so installed packages are scoped to it
source .venv/bin/activate
Then install the dependencies and the repo as a python package:
# install project as standalone python package
pip install -e .
The code is provided as a Python package as well as a CLI built with the `click` package. The CLI is invoked by running:
python -m genraitor <cli-entrypoint>
Help documentation for each entrypoint can be accessed by running:
python -m genraitor <cli-entrypoint> --help
Some of the main steps in carrying out our fine-tuning procedure and their associated endpoints are described below.
Our synthetic data processing pipeline starts with a set of UniProt identifiers you are interested in. You can collect these beforehand using variable selection methods such as LASSO or Shapley values.
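For example, here is a minimal LASSO-based selection sketch with scikit-learn, assuming a feature matrix of omics abundances and binary infection labels; all data and names below are illustrative placeholders, not project data:

# illustrative only: select predictive proteins with an L1-penalized model
# and write their UniProt accessions to a file
import numpy as np
from sklearn.linear_model import LogisticRegression

accessions = np.array(["Q9BRJ2", "P09758", "P84085", "P08708"])
X = np.random.rand(40, 4)              # samples x proteins (abundances)
y = np.random.randint(0, 2, size=40)   # infected vs. control labels

# the L1 penalty drives uninformative coefficients to exactly zero
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
selected = accessions[np.abs(clf.coef_[0]) > 0]

with open("data/examples/uniprots.txt", "w") as f:
    f.write("\n".join(selected) + "\n")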
Start with a file containing UniProt accession numbers, one per line, as below:
# data/examples/uniprots.txt
Q9BRJ2
P09758
P84085
P08708
P46013
P02768
P05026
P14618
Then provide this file to the `data:context` CLI endpoint. If no file is provided, the endpoint defaults to a set of example UniProt IDs.
python -m genraitor data:context \
--uniprot_ids=./data/examples/uniprots.txt \
--output_dir=./data
This will produce two files in `./data`: one (`uniprot_context_results...`) with the raw results of querying UniProt for pathway information and abstracts, and the other (`uniprot_context_postprocessed...`) with context derived from those results and usable by the `RAFTDatasetPack` class from `llama-index`.
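If you prefer to bypass the CLI, a rough sketch of driving the upstream `llama-index` pack directly follows; the import path and constructor signature are assumptions based on its documentation, and the context filename is hypothetical:

# assumed interface of the upstream llama-index RAFT pack; our modified
# class may differ, and the context filename below is hypothetical
from llama_index.packs.raft_dataset import RAFTDatasetPack

pack = RAFTDatasetPack("data/uniprot_context_postprocessed.txt")
dataset = pack.run()                    # a HuggingFace dataset of QA pairs
dataset.save_to_disk("data/raft_data")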
To generate the top UniProt IDs (the directory `data/deepimv` must exist and contain a .csv file whose name starts with 'shap' and contains 'AH1' and 'pro'):
# to save as a parquet file:
python3 -m genraitor data:uniprot --save_path data/training/uniprot.parquet
# to just print the values to stdout:
python3 -m genraitor data:uniprot
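The saved parquet can be inspected with pandas; the exact column layout depends on the pipeline version:

# load the table written by data:uniprot above
import pandas as pd

df = pd.read_parquet("data/training/uniprot.parquet")
print(df.head())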
To parse the UniProt data for their associated PubMed IDs:
# to save as a parquet file:
python3 -m genraitor data:uniprot-to-pubmed --uniprot_path data/training/uniprot.parquet --save_path data/training/uniprot_pubmed_ids.parquet
# to just print the values to stdout:
python3 -m genraitor data:uniprot-to-pubmed --uniprot_path data/training/uniprot.parquet
To generate documents for usage in a RAG model:
# to save as json files:
python3 -m genraitor data:rag --uniprot_path data/training/uniprot.parquet --save_path data/training/rag/documents
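A quick sanity check on the generated documents (assuming one JSON document per file):

# count and peek at the JSON documents written by data:rag above
import json
from pathlib import Path

paths = sorted(Path("data/training/rag/documents").glob("*.json"))
print(f"{len(paths)} documents")
print(json.loads(paths[0].read_text()))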
Once you have used the above to create a text file of context, you can use our modified `RAFTDatasetPack` class to create synthetic question-answer pairs about chunks of that context. You will need an OpenAI API key as well as a HuggingFace API key. The CLI entrypoint is `raft:data`, or there is an example script at `examples/raft-dataset.py`.
To run from the cli do:
# set keys
export HF_TOKEN=<your-hf-token>
export OPENAI_API_KEY=<your-oai-key>
python -m genraitor raft:data \
--embed local \
--context_path /path/to/context.txt \
--output_path /path/to/raft_data
See `python -m genraitor raft:data --help` for more options. The resulting HuggingFace dataset is a folder of files and can be loaded as below:
from datasets import load_from_disk
dataset = load_from_disk('/path/to/raft_data')
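To sanity-check it, print a record; the column names produced by the RAFT pack (e.g. `question`, `context`, `cot_answer`) vary by version, so treat them as assumptions:

# peek at the first synthetic training example
print(dataset)            # summary of columns and row count
print(dataset[0].keys())  # e.g. question / context / cot_answer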
Once we have created a dataset suitable for performing RAFT, we simply point the training CLI target at the dataset on disk. The CLI target also takes a model name, which is passed to the HuggingFace `AutoModelForCausalLM.from_pretrained` method, as well as an output path. For certain models, such as the Llama series, you will again need a HuggingFace API key and to have accepted the terms of service on the model's page.
python -m genraitor train:raft \
-t /path/to/raft_data \
-m meta-llama/Meta-Llama-3.1-8B \
-n data/finetuned
The fine-tuned model will be saved in `data/finetuned` and is loadable via the HuggingFace interface:
from transformers import (
AutoModelForCausalLM,
AutoTokenizer
)
tokenizer = AutoTokenizer.from_pretrained('./data/finetuned', padding_side="left")
model = AutoModelForCausalLM.from_pretrained("./data/finetuned")
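A minimal generation example with the loaded model; the prompt and decoding settings are illustrative:

# illustrative inference with the fine-tuned model
prompt = "Given TSP4_HUMAN and COMP_HUMAN, what pathway do they share?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))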
To run the RAG model:
python3 -m genraitor rag:run
To inspect the documents nearest to a prompt:
python3 -m genraitor rag:index "How is tsp4_human related to pch2_human?"
- Multi-Omics Data: From PNNL's study on host responses to lethal human viruses.
- Literature: 'Omics-related publications from PubMed, MEDLINE, and WikiPathways.
- 30-day Goal: Fine-tune Llama 3 with RAFT using relevant literature.
- 60-day Goal: Demonstrate the model's ability to identify known biological mechanisms.
- Samantha Erwin: samantha.erwin@pnnl.gov
- Erwin, S. et al. 2024; doi:10.1101/2023.09.06.556597
- Lee, C. & van der Schaar, M. 2021; doi:10.48550/arXiv.2102.03014
- Slenter, D.N. et al. 2018; doi:10.1093/nar/gkx1064
- Zhang, T. et al. 2024; doi:10.48550/arXiv.2403.10131
Step 1 (existing work):
- In: Omics from infected and control
- Model: DeepIMV
- Out: Infection detection; macromolecules used as features, ordered by importance to the prediction
Step 2 (genraitor):
- In: Macromolecules
- Model: RAFT-fine-tuned Llama 3
- Out: Relevant metabolic pathways, citations to WikiPathways or PubMed, and a chain of reasoning
- Original RAFT paper
- RAFT press release (Microsoft)
- RAFT press release (Meta)
- RAFT press release (Berkeley)
- Blog post w/ example MVP RAFT implementation
- Repo with MVP RAFT training
- Webinar on RAFT (LlamaIndex)
- dataset generation (LlamaIndex)
- RAFT LlamaIndex Pack:
- HuggingFace: TRL - Transformer Reinforcement Learning
- HuggingFace: Odds Ratio Preference Optimization (ORPO) Trainer
- ORPO: Monolithic Preference Optimization without Reference Model
- Data Harmonization Dataset
- PNNL Publication on Disease Prediction
- PNNL DataHub for Multi-omics Publication
- Uniprot Protein Database REST API
- Wikipathways Python Package
- RefMet
- LipidMaps
The data generation task is going to be the most important, most labor-intensive, and hardest-to-define part of this process. To train the model, we need a set of question-answer pairs and a rotating corpus of documents the model can use to answer each question. The documents used for training can be helpful (called 'oracle' documents) or unhelpful (called 'distractor' documents). During training the model learns which information is relevant or irrelevant and memorizes domain knowledge. There is a package called LlamaIndex (credit to Matt Gaddis) that automates this process, but we'll have to heavily modify it to fit our use case.
LlamaIndex can provide a blueprint for the data structure RAFT requires, which is helpful because there's some nuance (like providing 'distractor' documents, good system prompts, etc.). The drawback: LlamaIndex generates question-answer pairs using an LLM; it takes a set of documents and uses ChatGPT or some other LLM to generate random question-answer pairs. For our purposes, the dataset should focus on a specific set of questions. For example, we might want the LLM to respond to the question 'given {macromolecules xyz}, what metabolic pathway do they share in common?' with the answer 'the {macromolecules xyz} are related through {abc pathway} according to {doc_id} and {doc_id}', whereas LlamaIndex might generate pairs like 'Who was the first author of the xyz paper?' instead. So we'll need to manually create a fine-tuning dataset.
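As a sketch of what that manual creation might look like, targeted pairs could be stamped out from a fixed template; the pathway and document IDs below are placeholders:

# hypothetical template-based pair generation; all values are placeholders
macromolecules = ["TSP4_HUMAN", "COMP_HUMAN"]
pathway = "{abc pathway}"              # e.g. looked up from WikiPathways
doc_ids = ["{doc_id_1}", "{doc_id_2}"]

question = (
    f"Given {', '.join(macromolecules)}, "
    "what metabolic pathway do they share in common?"
)
answer = (
    f"The proteins {', '.join(macromolecules)} are related through "
    f"{pathway} according to {' and '.join(doc_ids)}."
)
pair = {"prompt": question, "answer": answer}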
There are a couple of different fine-tuning methods. General techniques for tuning with smaller resource requirements than the original training regime, often grouped under Parameter-Efficient Fine-Tuning (PEFT), include LoRA and QLoRA. Another class of training techniques uses reinforcement learning, often called Reinforcement Learning from Human Feedback (RLHF), and is supported by the TRL package from HuggingFace.
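For reference, a minimal LoRA setup with the HuggingFace `peft` package; the hyperparameters are illustrative, not tuned for this project:

# minimal LoRA sketch with peft; ranks and target modules are illustrative
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
config = LoraConfig(
    r=16,                                 # rank of the low-rank updates
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()        # only the adapters are trainable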
According to a recent blog post, fine-tuning with the odds ratio preference optimization (ORPO) algorithm works with as few as 50 unique prompts. The HuggingFace TRL package implements this technique; it expects prompts paired with 'chosen' and 'rejected' responses, as in the example below.
orpo_dataset_dict = {
"prompt": [
"hello",
"how are you",
"What is your name?",
"What is your name?",
"Which is the best programming language?",
"Which is the best programming language?",
"Which is the best programming language?",
],
"chosen": [
"hi nice to meet you",
"I am fine",
"My name is Mary",
"My name is Mary",
"Python",
"Python",
"Java",
],
"rejected": [
"leave me alone",
"I am not fine",
"Whats it to you?",
"I dont have a name",
"Javascript",
"C++",
"C++",
],
}
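A rough sketch of wiring such a dict into TRL's `ORPOTrainer`; argument names vary across TRL versions, so treat this as an assumption rather than a recipe:

# hedged ORPO training sketch; TRL's exact arguments differ across versions
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_name = "meta-llama/Meta-Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

train_dataset = Dataset.from_dict(orpo_dataset_dict)
args = ORPOConfig(output_dir="data/orpo", beta=0.1)  # beta weights the odds-ratio term
trainer = ORPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # newer TRL versions take processing_class= instead
)
trainer.train()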
For our use case, the pairs would look more like:
{
  "prompt": "How are protein A and protein B related? {uniprot.A.function} {uniprot.A.interactions} {uniprot.A.names} ...",
  "answer": "They are related through XYZ according to DOI ZYX"
}
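One way to fill such a prompt is from the UniProt REST API (linked in the resources above); the JSON field names used here are assumptions about the response layout:

# hypothetical prompt construction from the UniProt REST API; the JSON
# layout of the response is an assumption
import requests

def uniprot_record(accession: str) -> dict:
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.json"
    return requests.get(url, timeout=30).json()

rec = uniprot_record("P02768")  # an accession from the example list above
functions = [
    c for c in rec.get("comments", [])
    if c.get("commentType") == "FUNCTION"
]
prompt = f"How are protein A and protein B related? {functions}"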
| uniprot_id |
| --- |
| TSP4_HUMAN |
| TM131_HUMAN |
| COMP_HUMAN |
| BGH3_HUMAN |
| S10A6_HUMAN |
| AN32C_HUMAN |
| CALU_HUMAN |
| SH3L3_HUMAN |
| PCH2_HUMAN |
| FA83H_HUMAN |
| LACRT_HUMAN |
| MGT4B_HUMAN |
| BOLA2_HUMAN |
| CNBP_HUMAN |
| LAMB3_HUMAN |
| ATOX1_HUMAN |
| NGAL_HUMAN |
| AR6P1_HUMAN |
| PEPD_HUMAN |
| KI16B_HUMAN |
- Question: Is glucose related to protein A?
- Context:
  - doc 1: sugar is related to protein A
  - doc 2: sugar is a synonym of glucose
- Answer:
  - Yes, glucose is related to protein A.
- Question: Is glucose related to protein A?
- Context:
  - synonyms:
  - doc 1:
  - doc 2: