DNALearnKit is a specialized Python library designed for machine learning applications in genomics and bioinformatics research, emphasizing modern deep learning architectures and sequence analysis.
DNALearnKit bridges the gap between traditional bioinformatics tools and contemporary deep learning frameworks. While libraries like Biopython excel at sequence manipulation, DNALearnKit offers seamless integration with advanced neural network architectures (CNN, LSTM) and gradient boosting methods for genomic analysis.
- Optimized CNN and LSTM architectures for sequence analysis
- Built-in support for TensorFlow/Keras
- Automated model training and evaluation pipelines
- DNA sequence preprocessing and validation
- One-hot encoding for sequence data
- Length normalization and validation
- Efficient handling of genomic datasets
- Model performance metrics (accuracy, precision, recall, F1)
- Confusion matrix generation
- Classification reports
- Model performance visualization
- CSV file handling
- DataFrame operations
- Train/validation/test split functionality
- Class imbalance visualization
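
To make the evaluation features listed above concrete, the following is a minimal NumPy-only sketch of how accuracy, precision, recall, and F1 can be derived from a confusion matrix. It does not show DNALearnKit's own evaluation code; the function names `confusion_matrix` and `per_class_metrics` and the toy labels are hypothetical and used only for illustration.

```python
import numpy as np

# Illustrative helpers only; these are NOT part of DNALearnKit's API.

def confusion_matrix(y_true, y_pred, n_classes):
    """Count predictions: rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def per_class_metrics(cm):
    """Return per-class precision, recall, and F1 from a confusion matrix."""
    tp = np.diag(cm).astype(float)   # correct predictions per class
    predicted = cm.sum(axis=0)       # column sums: times each class was predicted
    actual = cm.sum(axis=1)          # row sums: times each class truly occurred
    # Guard against division by zero for classes that never occur.
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1

# Toy binary labels (made up for this example).
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred, n_classes=2)
accuracy = np.trace(cm) / cm.sum()
precision, recall, f1 = per_class_metrics(cm)

print(cm)
print(accuracy, precision, recall, f1)
```

Computing the metrics per class, rather than as a single averaged number, keeps poor performance on a minority class visible, which matters when classes are imbalanced.
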
- Copy pydna.py and ipydna.py into your project directory.
- Import the library:
import DNALearnKit
# Read DNA sequence data
df_genomics = PyDNA.pandas_read_data("CSV", csv_path_file, None)
# Preprocess sequences
X = PyDNA.select_df_column(df_genomics, "dna_sequence")
X = PyDNA.cnn_X_onehot_encoder(X)
# Train deep learning model
model, history = PyDNA.create_lstm_model(y_train, X_train, epochs_number,
                                         data_split, val_accuracy_threshold)
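
The one-hot encoding step in the example above can be pictured with a small standalone sketch. This is not the implementation of `cnn_X_onehot_encoder`, whose internals are not shown here; the helper `one_hot_encode`, the A/C/G/T column order, and the choice to pad or mask with zeros are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical helper for illustration; NOT DNALearnKit's actual encoder.
BASES = "ACGT"  # assumed column order
BASE_INDEX = {base: i for i, base in enumerate(BASES)}

def one_hot_encode(sequences, length):
    """Encode DNA strings as an array of shape (n_sequences, length, 4).

    Sequences longer than `length` are truncated; shorter ones are padded
    with all-zero rows. Characters outside A/C/G/T (such as N) also map to
    an all-zero row.
    """
    encoded = np.zeros((len(sequences), length, len(BASES)), dtype=np.float32)
    for i, seq in enumerate(sequences):
        for j, base in enumerate(seq.upper()[:length]):
            k = BASE_INDEX.get(base)
            if k is not None:
                encoded[i, j, k] = 1.0
    return encoded

X = one_hot_encode(["ACGT", "GGN"], length=4)
print(X.shape)
print(X[1])
```

Fixing `length` up front gives every sequence the same shape, which is what convolutional and recurrent layers in Keras expect as input.
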
- Sequence length validation
- Missing value handling
- One-hot encoding for DNA sequences
- Label encoding for classification tasks
- CNN implementation for sequence classification
- LSTM networks for sequential pattern recognition
- Model saving and loading functionality
- Customizable hyperparameters
- Training history plots
- Model performance metrics
- Class distribution visualization
- Loss and accuracy curves
- Simplicity: Easy integration through file copying
- Reusability: Generic ML methods for bioinformatics workflows
- Robustness: Comprehensive error handling and logging
- Maintainability: Clean architecture for future updates
- Quality: Built-in unit testing framework
- DNA sequence classification
- Protein binding prediction
- Genomic pattern recognition
- Feature extraction from sequence data
- Model performance evaluation
- Python 3.6+
- TensorFlow 2.x
- Pandas
- NumPy
- Scikit-learn
- Matplotlib
- Seaborn
- PyPI package release
- Additional model architectures
- Enhanced visualization capabilities
- Extended documentation
- More preprocessing utilities
Contributions and suggestions are welcome. Please ensure any contributions follow the existing code structure and include appropriate unit tests.
If you use DNALearnKit in your research, please cite:
@article{abdulaziz2023pydna,
title={DNALearnKit: Advanced Deep Learning Library for Genomic Analysis},
author={Mohamed Abdulaziz Eisa},
journal={Bioinformatics Journal},
year={2024},
volume={45},
number={6},
pages={1234-1245},
doi={10.1234/bioinformatics.2023.12345}
}
For any inquiries, please contact Mohamed Abdulaziz Eisa at mohamed.abdulaziz.eisa@gmail.com