Replies: 3 comments
-
Hi @dsmandera,
-
Thank you so much for your response! Here is more detailed documentation
and a description of our dataset:
We accessed a dataset of 97,000 CHO gene sequences from the NCBI database,
focusing exclusively on protein-coding genes. We filtered these sequences
to retain those between 300 and 8,000 base pairs, resulting in a refined
dataset of 86,632 sequences. To reduce redundancy, we used cd-hit-est to
cluster the sequences with a word length of 8 and a 90% nucleotide
similarity threshold, producing 47,713 sequences. We then translated the
nucleotide sequences into their corresponding amino acid sequences and
removed any unnatural amino acids to ensure biological relevance. Finally,
we split the dataset into training, validation, and test sets in an
80-10-10 ratio.
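The length filter and the 80-10-10 split described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual preprocessing code: the function names, the inclusive length bounds, and the seeded random shuffle are all assumptions.

```python
import random


def filter_by_length(seqs, min_len=300, max_len=8000):
    """Keep sequences whose length in base pairs lies in [min_len, max_len].

    Inclusive bounds are an assumption; the description only says "between".
    """
    return [s for s in seqs if min_len <= len(s) <= max_len]


def split_80_10_10(items, seed=0):
    """Shuffle a copy of items and split it into train/validation/test lists.

    A seeded random shuffle is an assumption; the description does not say
    how sequences were assigned to each split.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```

Integer arithmetic is used for the split sizes so that any remainder falls into the test set rather than being dropped.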
CHOFormer is built on the Transformer architecture, utilizing multiple
decoder layers to map ESM-2-150M protein sequence embeddings to optimized
codon sequences. The ESM embeddings from EvolutionaryScale are crucial
because they capture complex biological features from the protein
sequences, including structural and evolutionary relationships. To bridge
the gap between amino acids and codon usage, we engineered a custom 3-mer
tokenizer specifically for DNA sequences to accurately represent all codons.
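To make the idea of a 3-mer tokenizer concrete, here is a minimal sketch that maps each of the 64 codons to an integer id. It is an assumption about the general shape of such a tokenizer, not the CHOFormer implementation: the names, the id ordering, and the absence of special tokens (padding, start, end) are all illustrative.

```python
from itertools import product

BASES = "ACGT"
# All 4^3 = 64 codons in a fixed, reproducible order (hypothetical ordering).
CODONS = ["".join(p) for p in product(BASES, repeat=3)]
CODON_TO_ID = {codon: i for i, codon in enumerate(CODONS)}
ID_TO_CODON = {i: codon for codon, i in CODON_TO_ID.items()}


def encode(dna):
    """Split an in-frame DNA string into codons and return their ids."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [CODON_TO_ID[dna[i:i + 3]] for i in range(0, len(dna), 3)]


def decode(ids):
    """Turn a list of codon ids back into a DNA string."""
    return "".join(ID_TO_CODON[i] for i in ids)
```

Because every token is exactly one codon, token positions line up one-to-one with amino acid positions in the protein sequence.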
To generate optimized codons, we project the ESM-2 embeddings into a
higher-dimensional space before passing them through two decoder layers
with four attention heads. Then, decoder logits are mapped to a probability
distribution over our custom tokenizer's vocabulary to select optimized
codons. With this approach, we generate DNA sequences with significantly
improved protein yield and translational efficiency.
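The final step, turning decoder logits into codon choices, could look something like the sketch below. It uses plain Python and greedy (argmax) selection; whether CHOFormer picks the most probable codon or samples from the distribution is not stated here, so the selection rule and the function names are assumptions.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_codons(logits_per_position, vocab):
    """Greedily pick the most probable vocabulary entry at each position."""
    chosen = []
    for logits in logits_per_position:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        chosen.append(vocab[best])
    return chosen
```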
Sincerely,
Darsh Mandera
On Fri, Oct 18, 2024 at 1:20 PM Lu Zhu ***@***.***> wrote:
Hi @dsmandera <https://github.com/dsmandera>,
Thank you for uploading the dataset on Polaris.
We greatly appreciate the potential value these datasets bring to the
community.
In Polaris, we strongly recommend following these principles
<https://polarishub.io/datasets-101> for any dataset.
To better understand and apply the data, could you provide more detailed
documentation on the curation process and data quality?
-
Thank you for your submission! Please see joint feedback on both CHO submissions here.
-
Polaris Link
https://polarishub.io/datasets/vishrut64/cho-dna-expression-prediction-dataset
README
The Chinese Hamster Ovary (CHO) genome is critical in the field of drug discovery. Today, nearly 70% of recombinant pharmaceuticals are manufactured using the CHO genome in their research and development. Here, we present three distinct datasets based on the CHO genome that are critically relevant to drug discovery: RNA expression levels based on DNA sequence, protein expression levels based on DNA, and RNA expression levels based on DNA sequence of the CHO genome.
Here, we curate and present the RNA expression levels (normalized from 0 to 1) based on the DNA sequence of the CHO genome, which can be found in the log_prec_y column of our submitted dataset. This provides us with an even more specific understanding of gene regulation and expression patterns unique to CHO cells, which can be critical for optimizing recombinant protein production. By analyzing these specific RNA expression profiles, researchers can identify key regulatory elements and pathways that influence protein yield and quality. This knowledge not only enhances our ability to engineer CHO cell lines for drug development but also improves our understanding of the cellular mechanisms underlying therapeutic protein production.
To create this file, we specifically take a clustering approach. We design a script that runs CD-HIT-EST to perform clustering on the nucleotide sequences. CD-HIT-EST is a tool used to group sequences that are similar above a specified threshold, with the default similarity threshold set at 95%. The script constructs a shell command to execute CD-HIT-EST, specifying the input file, output file, and several parameters, such as the similarity threshold and word length. Once the command is executed, the clustered sequences are saved to an output file.
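A script of the kind described above might assemble the command like this. The sketch only builds the argument list; the file names are hypothetical, and the word length of 10 is an assumption chosen to be compatible with a 0.95 identity threshold. `-i`, `-o`, `-c`, and `-n` are the CD-HIT-EST options for input, output, identity threshold, and word length.

```python
import shlex


def build_cd_hit_est_command(input_fasta, output_fasta,
                             identity=0.95, word_length=10):
    """Return the cd-hit-est argument list for clustering nucleotide sequences."""
    return [
        "cd-hit-est",
        "-i", input_fasta,        # input FASTA file
        "-o", output_fasta,       # output file with cluster representatives
        "-c", str(identity),      # sequence identity threshold
        "-n", str(word_length),   # word length
    ]


# Hypothetical file names, for illustration only.
cmd = build_cd_hit_est_command("cho_genes.fasta", "cho_genes_clustered.fasta")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it, provided `cd-hit-est` is installed and on the `PATH`.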
Then, the nucleotide sequences were translated into amino acid sequences, starting with Methionine and ending with a stop codon. This prevents non-functional proteins from existing in the dataset. These were then converted to ESM embeddings using ESM-3.
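One reading of this translation step is sketched below: translate each sequence with the standard genetic code and keep it only if the protein starts with methionine and ends at a stop codon. The function name is hypothetical, and rejecting sequences with internal stop codons or ambiguous bases is an added assumption, not something stated above.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(dna):
    """Translate a coding sequence; return None if it fails the checks."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        return None
    try:
        protein = "".join(
            CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3)
        )
    except KeyError:
        return None  # ambiguous or non-ACGT base (assumed to be rejected)
    if not protein.startswith("M") or not protein.endswith("*"):
        return None
    body = protein[:-1]  # drop the terminal stop symbol
    if "*" in body:
        return None  # internal stop codon (assumed to be rejected)
    return body
```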
Dataset Source
NCBI: FTP CHO genomes
Dataset Curation
https://github.com/RJain12/choformer/blob/main/data_preprocessing/
Dataset Completeness
readme, source and curation_reference fields for my Polaris dataset.
Anything else we should know?
We have also uploaded a relevant benchmark at https://polarishub.io/benchmarks/vishrut64/cho-dna-expression-prediction-dataset-task