Team CHOFormer @ Hackathon.Bio: DNA to RNA Expression #22

dhruv-ramu · 2024-10-18T15:11:14Z

dhruv-ramu
Oct 18, 2024

Polaris Link

https://polarishub.io/datasets/vishrut64/rna-expression-prediction-dataset

README

The Chinese Hamster Ovary (CHO) genome is critical in the field of drug discovery. Today, nearly 70% of recombinant pharmaceuticals are manufactured using the CHO genome in their research and development. Here, we present three distinct datasets based on the CHO genome that are critically relevant to drug discovery: RNA expression levels based on DNA sequence, protein expression levels based on DNA, and RNA expression levels based on DNA sequence of the CHO genome.

First, we curate and present the RNA expression level based on the DNA sequence of a general set of genes. Understanding RNA expression levels can help researchers identify which genes are actively transcribed during cell culture, enabling the selection of optimal expression systems and conditions. This is also well correlated with protein expression. This information can facilitate the engineering of CHO cell lines with enhanced productivity, stability, and product quality, ultimately accelerating the development of therapeutic proteins and antibodies. Additionally, analyzing RNA expression can provide insights into cellular responses to drug treatments, aiding in the identification of potential drug targets and improving the efficacy of therapeutic interventions.

RNASeq data for CHO was collected (26795 genes), after which only genes with non-zero expression were preserved, yielding 19918 genes. To ensure suitable data for model training, only the top 66% of genes that fall within 3 standard deviations were retained. This resulted in 13253 genes and corresponding RNA expression values.
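The filtering steps above can be sketched as follows. This is a minimal illustration and not the team's actual curation code: the function name `filter_genes`, the order of the steps, the use of the mean and population standard deviation on raw values, and the ranking by expression are all assumptions, since the description does not pin them down. It is not expected to reproduce the exact gene counts quoted above.

```python
import statistics


def filter_genes(expr, top_fraction=0.66, n_sd=3.0):
    """Filter a {gene: expression} mapping (hypothetical helper).

    Assumed order of operations:
      1. drop genes with zero expression,
      2. drop genes more than n_sd standard deviations from the mean,
      3. keep the top `top_fraction` of the remainder, ranked by expression.
    """
    # Step 1: keep only genes with non-zero expression.
    nonzero = {g: v for g, v in expr.items() if v > 0}
    if not nonzero:
        return {}

    # Step 2: remove outliers beyond n_sd standard deviations (assumption:
    # computed on raw, untransformed expression values).
    values = list(nonzero.values())
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    within = {g: v for g, v in nonzero.items() if abs(v - mean) <= n_sd * sd}

    # Step 3: keep the highest-expressed fraction of the remaining genes.
    ranked = sorted(within, key=within.get, reverse=True)
    keep = ranked[: int(len(ranked) * top_fraction)]
    return {g: within[g] for g in keep}
```

Whether the standard-deviation filter was applied before or after the 66% cut, and on raw or log-transformed values, would change which genes survive, so those choices are worth stating explicitly in the readme.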

These have been used to train our CHO Expression Predictor. However, since there isn’t a perfect correlation between RNA expression and protein expression, the predictor was later fine-tuned on a dataset of 200 genes and their protein expression (empirically studied).

Dataset Source

NCBI: CHO genomes

Dataset Curation

https://github.com/RJain12/choformer/tree/main

Dataset Completeness

I confirm that I filled out at least the readme, source and curation_reference fields for my Polaris dataset.

Anything else we should know?

We have also uploaded a relevant benchmark at https://polarishub.io/benchmarks/vishrut64/rna-expression-prediction-dataset-task.

zhu0619 · 2024-10-18T17:22:35Z

zhu0619
Oct 18, 2024
Maintainer

Hi @dsmandera,
Same comment as in #22.
To better understand and apply the data, could you provide more detailed documentation on the curation process and data quality?


dsmandera · 2024-10-20T16:33:41Z

dsmandera
Oct 20, 2024

Thank you for your response! Here is detailed documentation and a description of the dataset.

CHOExp begins by accessing a dataset of 26,795 genes with corresponding RNA expression values. Genes with zero expression are removed, and the top 66% of genes that fall within three standard deviations are retained, resulting in a refined set of 13,253 genes. Expression values are then projected onto a log scale and normalized between 0 and 1 to allow sigmoid-based predictions. This dataset is split into training, validation, and test sets with an 80-10-10 split.

The core of CHOExp is an encoder-only transformer model with a dimensionality of 384, 8 layers, and 4 attention heads. The model is trained to predict protein expression levels based on the RNA expression data from the training set. CHOExp does not use any DNA foundation models as its base, taking in the raw one-hot encoded vocab indices as input. Each DNA sequence is truncated/padded to a length of 1024 3-mer tokens (3072 total base pairs), and a classifier token is added at the start of the sequence. This input is processed through the transformer's attention and MLP blocks. The output embedding of the classifier token is selected and passed to a classification head, which consists of a linear layer and a sigmoid activation function.

After training the model on the training dataset for 10 epochs (including validation after every epoch), the expression model was evaluated on the test set and used to filter for high-expression CHO genes when training CHOFormer.

Sincerely,
Darsh Mandera
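The preprocessing described in this reply (log scaling, 0 to 1 normalization, the 80-10-10 split, and 3-mer tokenization with a leading classifier token) might look roughly like the sketch below. Everything named here is hypothetical: the use of `log1p`, the random seed, the special-token ids, the vocabulary ordering, non-overlapping 3-mers, and the choice to place the classifier token in addition to the 1024 sequence tokens are assumptions made for illustration, not details confirmed by the post or the repository.

```python
import math
import random
from itertools import product

# Assumed special-token ids; the real vocabulary layout is not documented here.
PAD, CLS, UNK = 0, 1, 2
# 64 possible 3-mers over A/C/G/T, mapped to ids 3..66 (assumed ordering).
VOCAB = {"".join(k): i + 3 for i, k in enumerate(product("ACGT", repeat=3))}


def log_minmax(values):
    """Log-transform, then min-max scale to [0, 1] (log1p is an assumption)."""
    logged = [math.log1p(v) for v in values]
    lo, hi = min(logged), max(logged)
    if hi == lo:
        return [0.0] * len(logged)
    return [(x - lo) / (hi - lo) for x in logged]


def split_80_10_10(items, seed=0):
    """Shuffle and split into train/validation/test at 80/10/10."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(0.8 * len(items))
    n_val = int(0.1 * len(items))
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )


def tokenize(seq, max_tokens=1024):
    """Turn a DNA string into [CLS] + non-overlapping 3-mer ids, padded.

    The sequence is truncated to 3 * max_tokens bases; a trailing partial
    3-mer is dropped; unknown 3-mers (e.g. containing N) map to UNK.
    """
    seq = seq.upper()[: 3 * max_tokens]
    kmers = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    ids = [CLS] + [VOCAB.get(k, UNK) for k in kmers]
    ids += [PAD] * (max_tokens + 1 - len(ids))
    return ids
```

Under these assumptions every tokenized sequence has length 1025 (one classifier token plus 1024 sequence positions). A purely random split like `split_80_10_10` does not account for sequence similarity between genes.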


alisandra · 2024-10-24T19:23:22Z

alisandra
Oct 24, 2024

Thanks for your submissions! I've been tagged as a domain expert to provide feedback.

The aim of the project, to "facilitate the engineering of CHO cell lines with enhanced productivity, stability, and product quality", is important and impactful in drug development; however, it is unclear to me that the submitted datasets and modelling tasks are appropriate to drive progress towards these aims.

First, there remain gaps in the documentation and clarity. It would be desirable to have precision in the description of data origin, and an unambiguous description of data transformations and processing in the Polaris dataset readme, without the need to reference a discussion or source code. I would like to have easy reference to details such as accession numbers, the unit of 'expression' (e.g. counts, transcripts per million, reads per kilobase per million?), whether there is any aggregation across samples or replicates, etc. Similarly, the predictive task itself (for reuse of the dataset by others) should be stated in the simplest terms, describing input X and output Y. This is especially unclear for me regarding the cho-dna-expression-prediction-dataset, where your first post talks of predicting RNA and protein expression levels, but your follow-up speaks of optimizing codons, i.e. generating the underlying DNA sequence.

Second, for re-use of the data by others, it would be desirable to provide it in a minimally transformed state. Specifically for this or similar cases, I think it would be better to provide the input as nucleotide or amino acid sequences, and not bake a modelling choice, such as ESM embeddings, into the dataset. Perhaps this is the case already, but see the lack of clarity above.

Third, and very importantly, there is a fundamental discrepancy between predicting the abundance of native mRNA/protein sequences, and predicting how a recombinant protein may be expressed, or more broadly any perturbed expression state relevant in engineering CHO cells. The native coding and protein sequences have been shaped by evolutionary forces, with, in particular, the relative nitrogen usage in nucleotides and amino acids incurring extra high selective pressure and optimization in highly expressed genes. These evolutionary forces induce correlations between sequence and expression level that are neither causal nor mechanistic and are not relevant for predicting expression in an engineered setting. To be impactful for the stated aim, a dataset would have to include exogenous sequences and include homology-aware splits that would test generalization. Further, it would be desirable to include in the features all reasonably available sequences thought to mechanistically control expression. Yes, this may include the coding sequence, as this is relevant for codon optimization, but, unless the dataset is set up such that regulatory sequences are held constant, it should also include promoters, as well as introns and untranslated regions.
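To make the idea of a homology-aware split concrete, here is a minimal sketch. It assumes the genes have already been grouped into homology clusters by an external sequence-clustering tool (for example MMseqs2 or CD-HIT); the `gene_to_cluster` mapping, the function name `cluster_split`, and the greedy fill strategy are all hypothetical choices for illustration. The key property is that every cluster lands entirely in one split, so test genes have no close homologs in the training set.

```python
import random
from collections import defaultdict


def cluster_split(gene_to_cluster, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split genes into train/val/test so no cluster spans two splits.

    gene_to_cluster: {gene_id: cluster_id}, assumed to come from an
    external homology clustering step (not performed here).
    """
    # Group genes by their cluster id.
    clusters = defaultdict(list)
    for gene, cluster in gene_to_cluster.items():
        clusters[cluster].append(gene)

    # Shuffle clusters (not genes) so whole clusters move together.
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)

    total = len(gene_to_cluster)
    targets = [f * total for f in fractions]
    splits = ([], [], [])
    idx = 0
    for cid in cluster_ids:
        # Advance to the next split once the current one reaches its target
        # size; the last split takes whatever remains.
        while idx < 2 and len(splits[idx]) >= targets[idx]:
            idx += 1
        splits[idx].extend(clusters[cid])
    return splits
```

Because whole clusters are assigned at once, the realized split sizes only approximate the requested fractions, and a few very large clusters can skew them noticeably.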

Given the comments above, I do not recommend certification of these datasets in their current form.
