Team CHOFormer @ Hackathon.Bio: DNA to RNA Expression #22
Replies: 3 comments
-
Hi @dsmandera,
-
Thank you for your response! Here is a detailed documentation and
description of the dataset.
CHOExp begins by accessing a dataset of 26,795 genes with corresponding RNA
expression values. Genes with zero expression are removed, and the top 66%
of genes that fall within three standard deviations are retained, resulting
in a refined set of 13,253 genes. Expression values are then projected onto
a log scale and normalized between 0 and 1 to allow sigmoid-based
predictions. This dataset is split into training, validation, and test sets
with an 80-10-10 split.
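The curation steps described above can be sketched roughly as follows. This is only an illustration under assumptions, not the CHOExp code: the order of the filters, the use of `log1p` for the log scale, the population standard deviation computed on raw values, and the function names `curate` and `split_80_10_10` are all hypothetical choices.

```python
import math
import random
import statistics


def curate(expr, top_frac=0.66, n_sd=3.0):
    """expr maps gene id -> raw RNA expression value (hypothetical input format)."""
    # 1. Drop genes with zero expression.
    nonzero = {g: v for g, v in expr.items() if v > 0}
    # 2. Keep genes within n_sd standard deviations of the mean
    #    (assumption: computed on raw, not log-scaled, values).
    mean = statistics.fmean(nonzero.values())
    sd = statistics.pstdev(nonzero.values())
    inliers = {g: v for g, v in nonzero.items() if abs(v - mean) <= n_sd * sd}
    # 3. Keep the top fraction of the remaining genes, ranked by expression.
    ranked = sorted(inliers, key=inliers.get, reverse=True)
    kept = ranked[: int(len(ranked) * top_frac)]
    # 4. Project onto a log scale (log1p is an assumption), then min-max
    #    normalize to [0, 1] so a sigmoid output can match the targets.
    logged = {g: math.log1p(inliers[g]) for g in kept}
    lo, hi = min(logged.values()), max(logged.values())
    span = (hi - lo) or 1.0
    return {g: (v - lo) / span for g, v in logged.items()}


def split_80_10_10(genes, seed=0):
    """Shuffle gene ids deterministically and split them 80/10/10."""
    genes = sorted(genes)
    random.Random(seed).shuffle(genes)
    n_train = int(0.8 * len(genes))
    n_val = int(0.1 * len(genes))
    return (genes[:n_train],
            genes[n_train:n_train + n_val],
            genes[n_train + n_val:])
```

Calling `curate` on a gene-to-expression mapping and passing its keys to `split_80_10_10` yields the three gene lists. Note that this split is random by gene, so it does not account for sequence similarity between genes.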
The core of CHOExp is an encoder-only transformer model with a
dimensionality of 384, 8 layers, and 4 attention heads. The model is
trained to predict protein expression levels based on the RNA expression
data from the training set. CHOExp does not use any DNA foundation model
as its base, taking in the raw one-hot encoded vocab indices as input.
Each DNA sequence is truncated/padded to a length of 1024 3-mer tokens
(3072 total base pairs), and a classifier token is added at the start of
the sequence. This input is processed through the transformer's attention
and MLP layers. The output embedding of the classifier token is selected
and passed to a classification head, which consists of a linear
layer and a sigmoid activation function. After training the model on the
training dataset for 10 epochs (including validation after every epoch),
the expression model was evaluated on the test set and used to filter for
high-expression CHO Genes when training CHOFormer.
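The input encoding described above (non-overlapping 3-mers, truncation/padding to 1024 tokens, and a classifier token at the start) might look something like the sketch below. The special-token names, the vocabulary ordering, the handling of ambiguous bases, and the choice to prepend the classifier token on top of the 1024 k-mer tokens (for 1025 ids in total) are assumptions made for illustration; the actual CHOExp tokenizer may differ.

```python
from itertools import product

BASES = "ACGT"
SPECIALS = ["[PAD]", "[CLS]", "[UNK]"]  # hypothetical special tokens
KMERS = ["".join(p) for p in product(BASES, repeat=3)]  # 64 possible 3-mers
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + KMERS)}


def encode(seq, max_tokens=1024, k=3):
    """Turn a DNA string into a fixed-length list of vocabulary indices."""
    seq = seq.upper()
    # Non-overlapping k-mers; a trailing incomplete k-mer is dropped.
    kmers = [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]
    # Truncate to max_tokens; map unknown k-mers (e.g. containing N) to [UNK].
    ids = [VOCAB.get(km, VOCAB["[UNK]"]) for km in kmers[:max_tokens]]
    # Pad short sequences up to max_tokens.
    ids += [VOCAB["[PAD]"]] * (max_tokens - len(ids))
    # Prepend the classifier token whose output embedding feeds the head.
    return [VOCAB["[CLS]"]] + ids
```

With these defaults, any sequence longer than 3072 base pairs is cut off after its first 1024 3-mers, which matches the figures given in the description.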
Sincerely,
Darsh Mandera
On Fri, Oct 18, 2024 at 1:22 PM Lu Zhu wrote:
Hi @dsmandera,
Same comment as in #22.
To better understand and apply the data, could you provide more detailed
documentation on the curation process and data quality?
-
Thanks for your submissions! I've been tagged as a domain expert to provide feedback. The aim of the project is to "facilitate the engineering of CHO cell lines with enhanced productivity, stability, and product quality."
First, there remain gaps in the documentation and clarity. It would be desirable to have precision in the description of data origin.
Second, for re-use of the data by others, it would be desirable to provide it in a minimally transformed state. Specifically for this or similar cases, I think it would be better to provide the input as nucleotide or amino acid sequences, and not bake a modelling choice, such as ESM embeddings, into the dataset. Perhaps this is the case already, but see the lack of clarity above.
Third, and very importantly, there is a fundamental discrepancy between predicting the abundance of native mRNA/protein sequences and predicting how a recombinant protein may be expressed, or more broadly any perturbed expression state relevant in engineering CHO cells. The native coding and protein sequences have been shaped by evolutionary forces, with, in particular, the relative nitrogen usage in nucleotides and amino acids incurring extra high selective pressure and optimization in highly expressed genes. These evolutionary forces induce correlations between sequence and expression level that are neither causal nor mechanistic and are not relevant for predicting expression in an engineered setting. To be impactful for the stated aim, a dataset would have to include exogenous sequences and include homology-aware splits that would test generalization. Further, it would be desirable to include in the features all reasonably available sequences thought to mechanistically control expression. Yes, this may include the coding sequence, as this is relevant for codon optimization, but unless the dataset is set up such that regulatory sequences are held constant, it should also include promoters, as well as introns and untranslated regions.
Given the comments above, I do not recommend certification of these datasets in their current form.
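For readers unfamiliar with the homology-aware splits mentioned in the review, one common way to build them is to cluster sequences by similarity and then assign whole clusters, rather than individual genes, to the train/validation/test sets. The sketch below assumes the cluster assignments already exist (for example, produced by an external sequence-clustering tool); the `gene_to_cluster` mapping and the function name `cluster_split` are hypothetical, and this is not something the submission implements.

```python
import random
from collections import defaultdict


def cluster_split(gene_to_cluster, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole homology clusters to splits so that no cluster is shared."""
    clusters = defaultdict(list)
    for gene, cid in gene_to_cluster.items():
        clusters[cid].append(gene)

    # Shuffle deterministically, then place the largest clusters first
    # (the sort is stable, so ties keep their shuffled order).
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)
    cluster_ids.sort(key=lambda c: len(clusters[c]), reverse=True)

    total = len(gene_to_cluster)
    splits = [[] for _ in fractions]
    for cid in cluster_ids:
        # Greedy: give the cluster to the split furthest below its target size.
        deficits = [f * total - len(s) for f, s in zip(fractions, splits)]
        splits[deficits.index(max(deficits))].extend(clusters[cid])
    return splits
```

Because similar sequences never straddle the train/test boundary, test performance under such a split says more about generalization to unseen sequence families than a purely random split does.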
-
Polaris Link
https://polarishub.io/datasets/vishrut64/rna-expression-prediction-dataset
README
The Chinese Hamster Ovary (CHO) genome is critical in the field of drug discovery. Today, nearly 70% of recombinant pharmaceuticals are manufactured using the CHO genome in their research and development. Here, we present three distinct datasets based on the CHO genome that are critically relevant to drug discovery: RNA expression levels based on DNA sequence, protein expression levels based on DNA sequence, and RNA expression levels based on DNA sequence of the CHO genome.
First, we curate and present the RNA expression level based on the DNA sequence of a general set of genes. Understanding RNA expression levels can help researchers identify which genes are actively transcribed during cell culture, enabling the selection of optimal expression systems and conditions. This is also well correlated with protein expression. This information can facilitate the engineering of CHO cell lines with enhanced productivity, stability, and product quality, ultimately accelerating the development of therapeutic proteins and antibodies. Additionally, analyzing RNA expression can provide insights into cellular responses to drug treatments, aiding in the identification of potential drug targets and improving the efficacy of therapeutic interventions.
RNASeq data for CHO was collected (26795 genes), after which only genes with non-zero expression were preserved, yielding 19918 genes. To ensure suitable data for model training, only the top 66% of genes that fall within 3 standard deviations were retained. This resulted in 13253 genes and corresponding RNA expression values.
These have been used to train our CHO Expression Predictor. However, since there isn’t a perfect correlation between RNA expression and protein expression, this was later fine-tuned on a dataset of 200 genes and their protein expression (empirically studied).
Dataset Source
NCBI: CHO genomes
Dataset Curation
https://github.com/RJain12/choformer/tree/main
Dataset Completeness
readme, source and curation_reference fields for my Polaris dataset.
Anything else we should know?
We have also uploaded a relevant benchmark at https://polarishub.io/benchmarks/vishrut64/rna-expression-prediction-dataset-task.