- Introduction
- Project Overview
- Dataset
- Model Architectures
- Installation
- Evaluation Metrics
- Results
- Future Work
- Contributors
## Introduction

DNA sequence prediction is a crucial task in bioinformatics, enabling researchers to analyze genetic patterns, predict mutations, and model gene structures. This project implements three machine learning approaches to predicting nucleotide sequences: an N-Gram model, an LSTM, and a Transformer.
## Project Overview

The goal of this project is to develop machine learning models that can:
- Learn patterns in DNA sequences.
- Predict missing or next nucleotides in a given sequence.
- Evaluate each model's performance using perplexity as the key metric.
We explore the following methods:
- N-Gram Model: Uses statistical language modeling.
- LSTM (Long Short-Term Memory): Captures long-term dependencies in sequences.
- Transformer Model: Uses self-attention for sequence prediction.
## Dataset

We use nucleotide sequences of human genes from the NCBI Gene Database. The dataset consists of:
- Gene symbols, descriptions, and types.
- Nucleotide sequences represented as `A`, `T`, `C`, `G`.
- Train-test split: 80% training, 20% testing.
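
The exact preprocessing lives in the repository; as a rough illustration of the encoding and split described above, a minimal sketch might look like the following (the `NUCLEOTIDE_TO_ID` mapping and helper names are hypothetical, not the project's actual identifiers):

```python
import random
from typing import List, Tuple

# Hypothetical mapping: the four nucleotides become integer ids 0-3.
NUCLEOTIDE_TO_ID = {"A": 0, "T": 1, "C": 2, "G": 3}

def encode(sequence: str) -> List[int]:
    """Map a DNA string such as 'ATCG' to [0, 1, 2, 3], skipping unknown symbols (e.g. 'N')."""
    return [NUCLEOTIDE_TO_ID[base] for base in sequence.upper() if base in NUCLEOTIDE_TO_ID]

def train_test_split(sequences: List[str], train_frac: float = 0.8, seed: int = 42) -> Tuple[List[str], List[str]]:
    """Shuffle the sequences and split them 80/20 (by default) into train and test sets."""
    rng = random.Random(seed)
    shuffled = list(sequences)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

if __name__ == "__main__":
    demo = ["ATCGGCTA", "GGCTATCG", "TTAACCGG", "CGCGATAT", "ATATGCGC"]
    train, test = train_test_split(demo)
    print(encode(train[0]), len(train), len(test))
```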
## Model Architectures

### N-Gram Model

- Uses Maximum Likelihood Estimation (MLE) to estimate the probability distribution over the next nucleotide.
- Converts DNA sequences into N-grams (bigrams, trigrams, etc.).
- Evaluates prediction capability using perplexity.
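
For intuition, a character-level n-gram model over DNA fits in a few lines. The sketch below is illustrative rather than the repository's implementation, and it layers add-one smoothing on top of plain MLE counts so that held-out perplexity stays finite:

```python
import math
from collections import Counter, defaultdict

class NGramModel:
    """Character-level n-gram model over DNA with add-one smoothing (illustrative only)."""

    def __init__(self, n: int = 3):
        self.n = n
        self.context_counts = defaultdict(Counter)  # context -> Counter of next nucleotides
        self.vocab = set()

    def fit(self, sequences):
        for seq in sequences:
            padded = "^" * (self.n - 1) + seq  # pad the left edge with a start symbol
            for i in range(len(seq)):
                context, nxt = padded[i:i + self.n - 1], padded[i + self.n - 1]
                self.context_counts[context][nxt] += 1
                self.vocab.add(nxt)

    def prob(self, context: str, nxt: str) -> float:
        counts = self.context_counts[context]
        # Add-one (Laplace) smoothing keeps unseen n-grams from having zero probability.
        return (counts[nxt] + 1) / (sum(counts.values()) + len(self.vocab))

    def perplexity(self, sequences) -> float:
        log_prob, tokens = 0.0, 0
        for seq in sequences:
            padded = "^" * (self.n - 1) + seq
            for i in range(len(seq)):
                log_prob += math.log(self.prob(padded[i:i + self.n - 1], padded[i + self.n - 1]))
                tokens += 1
        return math.exp(-log_prob / tokens)

model = NGramModel(n=3)
model.fit(["ATCGGCTA", "GGCTATCG", "TTAACCGG"])
print(model.perplexity(["ATCGGT"]))
```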
### LSTM Model

- Deep learning model designed for sequential data.
- Captures long-term dependencies in DNA sequences.
- Uses embedding layers, LSTM layers, and softmax activation.
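
A minimal PyTorch sketch of the embedding → LSTM → softmax pipeline follows; the layer sizes and the `DNALSTM` class name are illustrative assumptions, not the repository's exact configuration:

```python
import torch
import torch.nn as nn

class DNALSTM(nn.Module):
    """Embedding -> LSTM -> linear head; softmax is applied implicitly by the loss."""

    def __init__(self, vocab_size: int = 4, embed_dim: int = 32, hidden_dim: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):                  # x: (batch, seq_len) of nucleotide ids
        emb = self.embedding(x)            # (batch, seq_len, embed_dim)
        out, _ = self.lstm(emb)            # (batch, seq_len, hidden_dim)
        return self.fc(out)                # logits over A/T/C/G at every position

# Next-nucleotide objective: predict position i+1 from positions <= i.
model = DNALSTM()
x = torch.randint(0, 4, (8, 50))           # dummy batch of encoded sequences
logits = model(x[:, :-1])
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 4), x[:, 1:].reshape(-1))
loss.backward()
```

The softmax itself is folded into `nn.CrossEntropyLoss`, which expects raw logits.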
### Transformer Model

- Uses self-attention mechanisms to process sequences.
- Processes all positions in parallel, making it more efficient than LSTMs on long sequences.
- Implemented using Positional Encoding, Multi-Head Attention, and Feed-Forward layers.
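
The pieces named above (positional encoding, multi-head attention, feed-forward blocks) can be assembled from PyTorch's built-in layers. The following is a sketch under assumed dimensions and class names, not the project's exact model:

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Standard sinusoidal positional encoding added to the nucleotide embeddings."""

    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):                   # x: (batch, seq_len, d_model)
        return x + self.pe[: x.size(1)]

class DNATransformer(nn.Module):
    """Embedding + positional encoding + stacked self-attention/feed-forward layers."""

    def __init__(self, vocab_size: int = 4, d_model: int = 64, nhead: int = 4, num_layers: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_enc = PositionalEncoding(d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, x):                   # x: (batch, seq_len) of nucleotide ids
        h = self.pos_enc(self.embedding(x))
        # Causal mask so each position attends only to earlier nucleotides.
        seq_len = x.size(1)
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        return self.fc(self.encoder(h, mask=mask))

logits = DNATransformer()(torch.randint(0, 4, (8, 50)))   # (8, 50, 4) logits
```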
## Installation

To set up the project, follow these steps:

```bash
git clone https://github.com/Harshvardhan2164/Power-of-Generative-AI-in-Genomics.git
cd Power-of-Generative-AI-in-Genomics
```

Ensure you have Python 3.8+ installed, then install the required libraries:

```bash
pip install -r requirements.txt
```
## Evaluation Metrics

We use perplexity to evaluate model performance:
- Lower perplexity = better model predictions.
- N-gram models typically have higher perplexity than LSTMs and Transformers.
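
Concretely, perplexity is the exponential of the average negative log-likelihood the model assigns to each true nucleotide. A tiny self-contained example (the `perplexity` helper here is illustrative, not a function from the repository):

```python
import math

def perplexity(log_probs):
    """log_probs: natural-log probabilities the model assigned to each true nucleotide."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that assigns probability 0.25 to every base (uniform over A/T/C/G)
# has perplexity 4, the "uninformed" baseline for DNA.
print(perplexity([math.log(0.25)] * 100))  # -> 4.0
```

This also puts the table below in context: a perplexity close to 4 would mean the model has learned little beyond the uniform baseline.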
## Results

| Model        | Perplexity |
|--------------|------------|
| N-Gram (n=3) | 3.8        |
| LSTM         | 2.9        |
| Transformer  | 2.5        |
Transformers perform best due to their ability to capture long-range dependencies.
## Future Work

- Implement Bidirectional LSTMs to improve accuracy.
- Use pre-trained DNA embeddings.
- Expand dataset to include more genetic variations.
## License

This project is licensed under the MIT License.