Team choformer @ hackathon.bio: CHO DNA #21

dsmandera · 2024-10-18T15:10:03Z


Polaris Link

https://polarishub.io/datasets/vishrut64/cho-dna-expression-prediction-dataset

README

The Chinese Hamster Ovary (CHO) cell line and its genome are central to drug discovery. Today, nearly 70% of recombinant pharmaceuticals are produced using CHO cells. Here, we present datasets based on the CHO genome that are relevant to drug discovery: RNA expression levels based on DNA sequence, and protein expression levels based on DNA sequence.

In this submission, we curate and present RNA expression levels (normalized from 0 to 1) paired with the DNA sequences of the CHO genome; the expression values are in the log_prec_y column of our submitted dataset. This gives a more specific view of the gene regulation and expression patterns unique to CHO cells, which can be critical for optimizing recombinant protein production. By analyzing these RNA expression profiles, researchers can identify key regulatory elements and pathways that influence protein yield and quality. This knowledge supports the engineering of CHO cell lines for drug development and improves our understanding of the cellular mechanisms underlying therapeutic protein production.

To create this file, we take a clustering approach. We wrote a script that runs CD-HIT-EST to cluster the nucleotide sequences. CD-HIT-EST groups sequences whose similarity exceeds a specified threshold, with the default similarity threshold set at 95%. The script constructs a shell command to execute CD-HIT-EST, specifying the input file, the output file, and parameters such as the similarity threshold and word length. Once the command has run, the clustered sequences are saved to the output file.
Next, the nucleotide sequences are translated into amino acid sequences, each starting with methionine and ending at a stop codon, which keeps non-functional proteins out of the dataset. The amino acid sequences are then converted to embeddings using ESM-3.
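
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the team's actual script (that lives in the linked curation repository): the function names `build_cdhit_command` and `translate_orf`, the file names, and the word length of 10 are all assumptions made for the example. Only the `-i`, `-o`, `-c`, and `-n` options of `cd-hit-est` are used, and the command is built but not executed here. The translation check assumes that "starting with methionine and ending with a stop codon" means an in-frame sequence that begins with ATG and has its only stop codon at the end.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def build_cdhit_command(fasta_in, fasta_out, identity=0.95, word_length=10):
    """Build (but do not run) a cd-hit-est command line as an argument list.

    identity and word_length are illustrative defaults, not confirmed settings.
    """
    return [
        "cd-hit-est",
        "-i", fasta_in,          # input FASTA file
        "-o", fasta_out,         # output file with cluster representatives
        "-c", str(identity),     # sequence identity threshold
        "-n", str(word_length),  # word length
    ]


def translate_orf(dna):
    """Translate a coding sequence; return None unless it starts with ATG
    and contains exactly one stop codon, located at the very end."""
    dna = dna.upper()
    if len(dna) % 3 != 0 or not dna.startswith("ATG"):
        return None
    protein = []
    for i in range(0, len(dna), 3):
        aa = CODON_TABLE.get(dna[i:i + 3])
        if aa is None:  # ambiguous or non-ACGT base
            return None
        if aa == "*":
            # A stop codon is only acceptable as the final codon.
            return "".join(protein) if i == len(dna) - 3 else None
        protein.append(aa)
    return None  # no stop codon found
```

The argument list can be passed to `subprocess.run(cmd, check=True)` on a machine where CD-HIT is installed. Sequences for which `translate_orf` returns `None` would be dropped before embedding.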

Dataset Source

NCBI: FTP CHO genomes

Dataset Curation

https://github.com/RJain12/choformer/blob/main/data_preprocessing/

Dataset Completeness

I confirm that I filled out at least the readme, source and curation_reference fields for my Polaris dataset.

Anything else we should know?

We have also uploaded a relevant benchmark at https://polarishub.io/benchmarks/vishrut64/cho-dna-expression-prediction-dataset-task

zhu0619 · 2024-10-18T17:20:35Z

Maintainer

Hi @dsmandera,
Thank you for uploading the dataset on Polaris.
We greatly appreciate the potential value these datasets bring to the community.
In Polaris, we strongly recommend following these principles (https://polarishub.io/datasets-101) for any dataset.
To better understand and apply the data, could you provide more detailed documentation on the curation process and data quality?


dsmandera · 2024-10-20T16:30:14Z

Author

Thank you so much for your response! Here is a more detailed description of our dataset:

We accessed a dataset of 97,000 CHO gene sequences from the NCBI database, focusing exclusively on protein-coding genes. These sequences were filtered to retain those between 300 and 8000 base pairs, resulting in a refined dataset of 86,632 sequences. To reduce redundancy, cd-hit-est was used to cluster the sequences with a word length of 8 and a 90% nucleotide similarity threshold, producing 47,713 sequences. The nucleotide sequences were then translated into their corresponding amino acid sequences, and any unnatural amino acids were removed to ensure biological relevance. The dataset was then split into training, validation, and test sets in an 80-10-10 ratio.

CHOFormer is built on the Transformer architecture, using multiple decoder layers to map ESM-2-150M protein sequence embeddings to optimized codon sequences. The ESM embeddings from EvolutionaryScale are crucial because they capture complex biological features of the protein sequences, including structural and evolutionary relationships. To bridge the gap between amino acids and codon usage, we engineered a custom 3-mer tokenizer for DNA sequences that represents all codons. To generate optimized codons, we project the ESM-2 embeddings into a higher-dimensional space before passing them through two decoder layers with four attention heads. The decoder logits are then mapped to a probability distribution over our custom tokenizer's vocabulary to select optimized codons. With this approach, we generate DNA sequences with significantly improved protein yield and translational efficiency.

Sincerely,
Darsh Mandera
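
For readers who want a concrete picture of the preprocessing described in this comment, here is a minimal sketch of the length filter, a 3-mer (codon) tokenizer, and the 80-10-10 split. It is an illustration under stated assumptions rather than the CHOFormer code: all function names are hypothetical, the length bounds are treated as inclusive, the tokenizer vocabulary is assumed to be the 64 codons with no special tokens, and the split is a simple seeded shuffle.

```python
import random
from itertools import product


def filter_by_length(seqs, min_len=300, max_len=8000):
    """Keep sequences whose length is within [min_len, max_len] base pairs.
    Inclusive bounds are an assumption."""
    return [s for s in seqs if min_len <= len(s) <= max_len]


# Hypothetical codon vocabulary: all 64 3-mers over A, C, G, T.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
CODON_TO_ID = {codon: i for i, codon in enumerate(CODONS)}


def tokenize_codons(dna):
    """Split an in-frame DNA sequence into non-overlapping 3-mers
    and map each one to an integer id."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [CODON_TO_ID[dna[i:i + 3]] for i in range(0, len(dna), 3)]


def split_dataset(items, seed=0):
    """Shuffle with a fixed seed and split 80/10/10 into train/val/test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```

The redundancy reduction step between filtering and splitting corresponds to running `cd-hit-est` with `-c 0.9 -n 8`, matching the 90% similarity threshold and word length of 8 given above.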


alisandra · 2024-10-24T19:24:53Z


Thank you for your submission! Please see joint feedback on both CHO submissions here.
