A Python tool for annotating RNA sequencing (RNA-seq) data with terms from standardized ontologies and classification systems. It automates annotation while restricting annotations to terms defined in established scientific ontologies.
- Support for multiple ontology formats (primarily OBO)
- Automated annotation using Gene Ontology (GO) and Sequence Ontology (SO)
- Comprehensive validation framework
- Detailed logging system
- Type-safe implementation
- CSV output support
- Extensible architecture for custom annotation rules
- Docker support for containerized deployment
- MongoDB integration for data persistence
- Redis caching for improved performance
- Python 3.8+
- pandas
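- typing (Python standard library, no separate install needed)
- logging (Python standard library, no separate install needed)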
- Docker 20.10+
- Docker Compose 2.0+
- At least 8GB RAM
- 20GB free disk space
Local deployment:
- Clone the repository:
git clone https://github.com/imrobintomar/rna-seq-annotator.git
cd rna-seq-annotator
- Install required packages:
pip install -r requirements.txt
Docker deployment:
- Clone the repository:
git clone https://github.com/imrobintomar/rna-seq-annotator.git
cd rna-seq-annotator
- Create necessary directories:
mkdir -p data ontologies output logs
- Start the services:
docker-compose up -d
- Verify services are running:
docker-compose ps
import pandas as pd
from rna_seq_annotator import RNASeqAnnotator
# Define ontology files
ontology_files = {
    'GO': 'path/to/gene_ontology.obo',
    'SO': 'path/to/sequence_ontology.obo'
}
# Initialize annotator
annotator = RNASeqAnnotator(ontology_files)
# Load and annotate sequences
sequence_data = pd.read_csv('your_rna_seq_data.csv')
annotated_data = annotator.annotate_sequence(
    sequence_data,
    required_ontologies=['GO', 'SO'],
    output_file='annotated_sequences.csv'
)
- Place your input files:
cp your_rna_seq_data.csv data/
cp gene_ontology.obo ontologies/
cp sequence_ontology.obo ontologies/
- Run annotation:
docker-compose exec rna-seq-annotator python annotate.py \
--input /app/data/your_rna_seq_data.csv \
--output /app/output/annotated_sequences.csv
- Access results:
ls output/
The Docker deployment includes three main services:
Application (rna-seq-annotator):
- Main application container
- Resource limits: 2 CPUs, 4GB RAM
- Mounted volumes for data, ontologies, output, and logs
MongoDB:
- Persistent data storage
- Authentication enabled
- Port: 27017
- Resource limits: 1 CPU, 2GB RAM
Redis:
- Caching layer
- Password protected
- Port: 6379
- Resource limits: 0.5 CPU, 1GB RAM
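As a quick complement to docker-compose ps, the sketch below checks whether the MongoDB and Redis ports listed above accept TCP connections. It assumes the ports are published on the host running the script, which depends on your docker-compose.yml; the labels in SERVICE_PORTS and the helper name is_port_open are illustrative and not part of the tool.

```python
import socket

# Ports taken from the service list above; the dictionary keys are only labels.
SERVICE_PORTS = {"mongodb": 27017, "redis": 6379}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    for name, port in SERVICE_PORTS.items():
        state = "reachable" if is_port_open("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```

An open port only shows that something is listening; it does not confirm that authentication is configured correctly.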
Application:
- LOG_LEVEL: Logging level (default: INFO)
- MAX_WORKERS: Number of worker processes (default: 4)
- ONTOLOGY_DIR: Directory for ontology files
- OUTPUT_DIR: Directory for output files
MongoDB:
- MONGO_INITDB_ROOT_USERNAME: MongoDB admin username
- MONGO_INITDB_ROOT_PASSWORD: MongoDB admin password
- MONGO_INITDB_DATABASE: Default database name
Redis:
- REDIS_PASSWORD: Redis authentication password
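The following is a minimal sketch of how the application variables could be read in Python with their documented defaults. The AnnotatorSettings class and load_settings function are hypothetical, and the fallback directories are assumptions modeled on the /app paths used in the Docker commands; the tool's actual configuration code may differ.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatorSettings:
    """Hypothetical settings container; not part of the tool's documented API."""
    log_level: str
    max_workers: int
    ontology_dir: str
    output_dir: str


def load_settings(env=None):
    """Read the application variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    max_workers = int(env.get("MAX_WORKERS", "4"))  # documented default: 4
    if max_workers < 1:
        raise ValueError("MAX_WORKERS must be a positive integer")
    return AnnotatorSettings(
        log_level=env.get("LOG_LEVEL", "INFO").upper(),  # documented default: INFO
        max_workers=max_workers,
        # No defaults are documented for the directories; these are assumptions.
        ontology_dir=env.get("ONTOLOGY_DIR", "/app/ontologies"),
        output_dir=env.get("OUTPUT_DIR", "/app/output"),
    )


settings = load_settings({"LOG_LEVEL": "debug"})
print(settings.log_level, settings.max_workers)  # prints: DEBUG 4
```

Passing a plain dictionary instead of os.environ keeps the function easy to test without touching the real environment.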
The tool expects RNA-seq data in CSV format with the following columns:
- sequence_id (required): Unique identifier for each sequence
- sequence (required): The RNA sequence data
- Additional metadata columns (optional)
Example:
sequence_id,sequence,tissue_type,condition
seq001,AUGCAUGCAUGC,liver,control
seq002,GCAUGCAUGCAU,liver,treated
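Before running an annotation, it can help to check an input file against the format described above. The sketch below uses only the standard library; the check_input helper and the accepted alphabet (A, C, G, U, N) are assumptions for illustration, since the tool's own validation rules are not documented here.

```python
import csv
import io

REQUIRED_COLUMNS = ("sequence_id", "sequence")
RNA_ALPHABET = set("ACGUN")  # assumption: the tool may accept a different alphabet


def check_input(handle):
    """Return a list of problems found in a CSV stream; an empty list means none."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing required column(s): {', '.join(missing)}"]
    problems = []
    seen = set()
    for row in reader:
        seq_id = row["sequence_id"]
        seq = row["sequence"]
        if seq_id in seen:
            problems.append(f"{seq_id}: duplicate sequence_id")
        seen.add(seq_id)
        if not seq:
            problems.append(f"{seq_id}: empty sequence")
        elif set(seq.upper()) - RNA_ALPHABET:
            problems.append(f"{seq_id}: characters outside the RNA alphabet")
    return problems


example = io.StringIO(
    "sequence_id,sequence,tissue_type,condition\n"
    "seq001,AUGCAUGCAUGC,liver,control\n"
    "seq002,GCAUGCAUGCAU,liver,treated\n"
)
print(check_input(example))  # prints: []
```

For a file on disk, pass an open file object instead, for example open('data/your_rna_seq_data.csv', newline='').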
The tool generates annotated data in CSV format with additional columns for each ontology:
- Original columns
- GO_annotation: Gene Ontology annotations
- SO_annotation: Sequence Ontology annotations
- validation_status: Validation results (if validation is performed)
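One way to sanity-check an annotated file is to confirm that the expected annotation columns exist and to count rows left without annotations. The sketch below is illustrative only: summarize_output is a hypothetical helper, and the annotation and status values in the sample (TERM_A, TERM_B, valid, invalid) are placeholders, not values the tool is known to emit.

```python
import csv
import io
from collections import Counter


def summarize_output(handle, ontologies=("GO", "SO")):
    """Summarize an annotated CSV stream: row count, empty annotations, statuses."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    expected = [f"{name}_annotation" for name in ontologies]
    missing = [c for c in expected if c not in header]
    if missing:
        raise ValueError(f"missing annotation column(s): {', '.join(missing)}")
    unannotated = Counter()
    statuses = Counter()
    total = 0
    for row in reader:
        total += 1
        for column in expected:
            if not (row[column] or "").strip():
                unannotated[column] += 1
        # validation_status is only present if validation was performed.
        if "validation_status" in header:
            statuses[row["validation_status"]] += 1
    return {
        "rows": total,
        "unannotated": dict(unannotated),
        "validation_status": dict(statuses),
    }


sample = io.StringIO(
    "sequence_id,sequence,GO_annotation,SO_annotation,validation_status\n"
    "seq001,AUGCAUGCAUGC,TERM_A,TERM_B,valid\n"
    "seq002,GCAUGCAUGCAU,,TERM_B,invalid\n"
)
print(summarize_output(sample))
```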
- Prepare your ontology file in OBO format
- Add the ontology to the ontology_files dictionary:
ontology_files = {
    'GO': 'gene_ontology.obo',
    'SO': 'sequence_ontology.obo',
    'YOUR_ONTOLOGY': 'your_ontology.obo'
}
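Before registering a custom ontology, you may want to confirm that the file contains readable [Term] stanzas. The sketch below is a deliberately simplified OBO reader that collects only the id and name tags of each term; it is not the parser the tool uses, and the YO:0000001 identifier in the sample is made up.

```python
import io


def read_obo_terms(handle):
    """Collect {term id: name} from the [Term] stanzas of an OBO stream."""
    terms = {}
    in_term = False  # True while reading inside a [Term] stanza
    term_id = None
    for raw in handle:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest here.
            in_term = line == "[Term]"
            term_id = None
        elif in_term and line.startswith("id:"):
            term_id = line[len("id:"):].strip()
            terms[term_id] = ""
        elif in_term and term_id and line.startswith("name:"):
            terms[term_id] = line[len("name:"):].strip()
    return terms


sample = io.StringIO(
    "format-version: 1.2\n"
    "\n"
    "[Term]\n"
    "id: YO:0000001\n"
    "name: example term\n"
    "\n"
    "[Typedef]\n"
    "id: part_of\n"
    "name: part of\n"
)
print(read_obo_terms(sample))  # prints: {'YO:0000001': 'example term'}
```

An empty result for your own file suggests it is not in OBO format or contains no [Term] stanzas.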
Modify docker-compose.yml to adjust:
- Resource limits
- Volume mounts
- Environment variables
- Network configuration
The tool provides comprehensive logging with different levels:
- INFO: General operation information
- WARNING: Non-critical issues
- ERROR: Critical issues that need attention
- DEBUG: Detailed information for debugging
Logs are written to:
- Local deployment: rna_seq_annotator.log
- Docker deployment: /app/logs/rna_seq_annotator.log
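For reference, the sketch below shows one way to get file logging with a configurable level using Python's standard logging module. The configure_logging helper, the logger name, and the message format are assumptions for illustration; the tool's internal logging setup may be different.

```python
import logging
import os
import tempfile


def configure_logging(log_path="rna_seq_annotator.log", level="INFO"):
    """Attach a single file handler to a named logger and set its level."""
    logger = logging.getLogger("rna_seq_annotator")  # logger name is an assumption
    logger.setLevel(level.upper())  # accepts DEBUG, INFO, WARNING, ERROR
    # Remove handlers from earlier calls so messages are not written twice.
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()
    file_handler = logging.FileHandler(log_path)
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(file_handler)
    return logger


if __name__ == "__main__":
    # Demo writes to a temporary directory instead of the working directory.
    demo_path = os.path.join(tempfile.mkdtemp(), "rna_seq_annotator.log")
    log = configure_logging(demo_path, level="WARNING")
    log.info("not written, because INFO is below WARNING")
    log.warning("written to the log file")
    with open(demo_path) as handle:
        print(handle.read())
```

Setting the level to DEBUG records every message, while WARNING or ERROR keeps the file limited to problems that need attention.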
- Check service status:
docker-compose ps
- View logs:
docker-compose logs [service-name]
- Restart services:
docker-compose restart [service-name]
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License.
If you use this tool in your research, please cite:
@software{rna_seq_annotator,
author = {Robin Tomar},
title = {RNA-Seq Data Annotator},
year = {2024},
url = {https://github.com/imrobintomar/rna-seq-annotator}
}
For support:
- Open an issue in the GitHub repository
- Contact itsrobintomar@gmail.com
- Check the troubleshooting guide
- Gene Ontology Consortium
- Sequence Ontology Project
- Contributors and maintainers of dependent packages
- Docker and container ecosystem contributors