A comprehensive tool for analyzing biomedical literature to extract and understand relationships between genomic coordinates, variants, genes, and diseases using advanced NLP and Large Language Models.
The project analyzes PubMed articles to identify relationships between genomic variants and other biomedical entities such as genes, diseases, and tissues. It uses Natural Language Processing (NLP) and Large Language Models (LLMs) to extract these relationships and score their strength.
The system is designed for researchers, bioinformaticians, and clinicians who need to systematically analyze large volumes of biomedical literature to understand variant-disease associations and genomic relationships.
- 🔍 Genomic Variant Extraction: Extract genomic coordinates and variants from biomedical literature
- 🧬 Relationship Analysis: Analyze relationships between variants and biomedical entities (genes, diseases, tissues)
- 📊 Scoring System: Score relationship strength from 0-10 using advanced LLM analysis
- 📁 Multiple Export Formats: Export results to CSV and JSON formats
- ⚡ Intelligent Caching: Cache API responses for faster processing and reduced costs
- 🤖 Multi-LLM Support: Support for multiple LLM providers (OpenAI, TogetherAI)
- 🔧 Modular Architecture: Professional, scalable codebase with clear separation of concerns
- 🧪 Comprehensive Testing: Extensive test suite with >80% code coverage
- 📚 Rich Documentation: Detailed documentation and examples
- Python 3.9+
- API keys for:
  - OpenAI (for GPT models) or TogetherAI (for open-source models)
  - PubTator3 (optional, for enhanced entity recognition)
  - ClinVar (optional, for variant validation)
- Clone the repository:
  git clone https://github.com/yourusername/coordinates-lit.git
  cd coordinates-lit
- Create and activate a virtual environment:
  python -m venv venv
  source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
  pip install -r requirements.txt
- Set up configuration:
  cp config/development.example.yaml config/development.yaml
  Then edit config/development.yaml to add your API keys and settings.
Analyze PubMed articles:
python -m src.cli.analyze --pmids 32735606 32719766 --output results.csv
Analyze from file:
python -m src.cli.analyze --file pmids.txt --output results.csv --json results.json
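The input file passed with --file holds one PubMed ID per line. A minimal sketch of how such a file could be read is shown here; the function name `read_pmids`, the skipping of blank and `#` lines, and the numeric check are illustrative assumptions, not the project's documented behavior.

```python
from pathlib import Path


def read_pmids(path):
    """Return PubMed IDs from a text file, one ID per line.

    Assumptions (not taken from the project): blank lines and lines
    starting with '#' are skipped, and every ID must be all digits.
    """
    pmids = []
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # ignore padding and comments
        if not line.isdigit():
            raise ValueError(f"Not a numeric PMID: {line!r}")
        pmids.append(line)
    return pmids
```

Failing on a malformed line, rather than skipping it silently, makes a typo in a long PMID list visible before any API calls are made.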
The project follows a professional modular architecture:
coordinates-lit/
├── 📁 src/ # Source code
│ ├── 🔌 api/ # API layer & external communication
│ │ ├── clients/ # API clients (PubTator, ClinVar, LitVar)
│ │ └── cache/ # Caching system
│ ├── 🧬 analysis/ # Biomedical analysis modules
│ │ ├── bio_ner/ # Named Entity Recognition
│ │ ├── context/ # Context analysis
│ │ ├── llm/ # LLM-based analysis
│ │ └── base/ # Base analyzer classes
│ ├── 💻 cli/ # Command-line interface
│ ├── 📊 models/ # Data models & structures
│ ├── ⚙️ services/ # Business logic services
│ │ ├── flow/ # Data flow orchestration
│ │ ├── processing/ # Data processing
│ │ ├── search/ # Literature search
│ │ └── validation/ # Data validation
│ └── 🛠️ utils/ # Utilities & helpers
│ ├── config/ # Configuration management
│ ├── llm/ # LLM management
│ └── logging/ # Logging system
├── 🧪 tests/ # Test suite (mirrors src structure)
├── 📁 config/ # Configuration files
├── 📁 data/ # Data storage
├── 📁 scripts/ # Utility scripts
└── 📁 docs/ # Documentation
PubTator Client - Extract biomedical entities:
from src.api.clients.pubtator_client import PubTatorClient
client = PubTatorClient()
publication = client.get_publication_by_pmid("32735606")
entities = client.extract_entities(publication)
ClinVar Client - Validate variants:
from src.api.clients.clinvar_client import ClinVarClient
client = ClinVarClient()
variant_info = client.get_variant_info("NM_000492.3:c.1521_1523delCTT")
LLM Context Analyzer - Analyze relationships using LLM:
from src.analysis.llm.llm_context_analyzer import LlmContextAnalyzer
analyzer = LlmContextAnalyzer()
results = analyzer.analyze_publications_by_pmids(["32735606", "32719766"])
Bio NER - Extract genomic variants:
from src.analysis.bio_ner.variant_recognizer import VariantRecognizer
recognizer = VariantRecognizer()
variants = recognizer.extract_variants("Found mutation c.123A>G in BRCA1")
Flow Orchestration - Run complete analysis pipelines:
from src.services.flow.pubmed_flow import PubMedAnalysisFlow
flow = PubMedAnalysisFlow()
results = flow.analyze_pmids(["32735606"], output_format="csv")
The project includes comprehensive testing with >80% code coverage.
pytest
# API tests
pytest tests/api/
# Analysis tests
pytest tests/analysis/
# LLM manager tests
pytest tests/utils/llm/
# Integration tests
pytest tests/integration/
pytest --cov=src --cov-report=html
# Tests without real API calls (fast)
pytest -m "not realapi"
# Integration tests only
pytest -m integration
# Slow tests
pytest -m slow
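The -m selections above rely on the `realapi`, `integration`, and `slow` markers being known to pytest. One way to register them is a `pytest_configure` hook, sketched here; the file location (tests/conftest.py) and the marker descriptions are assumptions, and the project may declare its markers in pytest.ini or pyproject.toml instead.

```python
# Hypothetical tests/conftest.py

MARKERS = {
    "realapi": "test calls a real external API",
    "integration": "integration test spanning several components",
    "slow": "test that takes a long time to run",
}


def pytest_configure(config):
    """Register custom markers so `pytest -m ...` recognizes them."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

A test is then tagged with, for example, `@pytest.mark.realapi`, and `pytest -m "not realapi"` deselects it.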
python -m src.cli.analyze [OPTIONS]
Options:
- --pmids TEXT: List of PubMed IDs to analyze
- --file PATH: File containing PubMed IDs (one per line)
- --output PATH: Output CSV file path
- --json PATH: Output JSON file path (optional)
- --model TEXT: LLM model to use (default from config)
- --email TEXT: Email for PubTator API requests
- --debug: Enable debug mode
- --no-retry: Disable automatic retries
- --cache-type [memory|disk]: Cache type
- --log-level [DEBUG|INFO|WARNING|ERROR]: Logging level
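To make the option types concrete, the same interface can be mirrored with the standard library's argparse. This is only an illustration of how the flags parse; the real CLI may be built with a different library, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Illustrative parser mirroring the documented options (not the real CLI)."""
    parser = argparse.ArgumentParser(prog="python -m src.cli.analyze")
    parser.add_argument("--pmids", nargs="+", metavar="TEXT", help="PubMed IDs to analyze")
    parser.add_argument("--file", metavar="PATH", help="File with one PubMed ID per line")
    parser.add_argument("--output", metavar="PATH", help="Output CSV file path")
    parser.add_argument("--json", metavar="PATH", help="Output JSON file path (optional)")
    parser.add_argument("--model", metavar="TEXT", help="LLM model (default from config)")
    parser.add_argument("--email", metavar="TEXT", help="Email for PubTator API requests")
    parser.add_argument("--debug", action="store_true", help="Enable debug mode")
    parser.add_argument("--no-retry", action="store_true", help="Disable automatic retries")
    parser.add_argument("--cache-type", choices=["memory", "disk"], help="Cache type")
    parser.add_argument(
        "--log-level",
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
        help="Logging level",
    )
    return parser


args = build_parser().parse_args(
    ["--pmids", "32735606", "32719766", "--output", "results.csv", "--cache-type", "disk"]
)
print(args.pmids, args.output, args.cache_type, args.no_retry)
```

Here --pmids collects every following value into a list, while --cache-type and --log-level reject anything outside the listed choices.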
Basic analysis:
python -m src.cli.analyze --pmids 32735606 32719766 --output results.csv
Batch analysis with JSON output:
python -m src.cli.analyze --file pmids.txt --output results.csv --json results.json
Debug mode with specific model:
python -m src.cli.analyze --pmids 32735606 --model gpt-4 --debug --log-level DEBUG
- Identify variant-disease associations in literature
- Validate genomic findings against published research
- Systematic literature reviews for specific variants
- Extract genomic coordinates from publications
- Build knowledge graphs of variant relationships
- Automated literature curation
- Find variants associated with drug responses
- Identify therapeutic targets
- Literature-based drug repurposing
- Validate variant pathogenicity from literature
- Find similar cases in published research
- Evidence-based variant interpretation
# config/development.yaml
llm:
  provider: "openai" # or "together"
  model: "gpt-3.5-turbo"
  temperature: 0.7
api:
  openai_key: "your-openai-key"
  together_key: "your-together-key"
  pubtator_email: "your-email@example.com"
cache:
  type: "disk" # or "memory"
  ttl: 3600
  max_size: 1000
logging:
  level: "INFO"
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
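The `cache` settings can be read as a time-to-live plus a size cap. The sketch below shows one plausible interpretation for the in-memory case. It assumes `ttl` is in seconds and `max_size` counts entries, and it evicts the oldest insertion first; the class name `TTLCache` and all three choices are assumptions rather than the project's implementation.

```python
import time
from collections import OrderedDict


class TTLCache:
    """In-memory cache with per-entry expiry and a cap on the number of entries."""

    def __init__(self, ttl=3600, max_size=1000, clock=time.monotonic):
        self.ttl = ttl            # assumed to be seconds
        self.max_size = max_size  # assumed to be a number of entries
        self._clock = clock       # injectable so tests can control time
        self._items = OrderedDict()  # key -> (expires_at, value)

    def get(self, key, default=None):
        entry = self._items.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: drop the entry and report a miss
            return default
        return value

    def set(self, key, value):
        self._items.pop(key, None)  # re-inserting moves the key to the end
        self._items[key] = (self._clock() + self.ttl, value)
        while len(self._items) > self.max_size:
            self._items.popitem(last=False)  # evict the oldest insertion
```

With the values from the configuration, `TTLCache(ttl=3600, max_size=1000)` would keep a cached API response for up to an hour and hold at most 1000 of them.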
export OPENAI_API_KEY="your-openai-key"
export TOGETHER_API_KEY="your-together-key"
export PUBTATOR_EMAIL="your-email@example.com"
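Since credentials can come from either the YAML file or the environment, some precedence rule is needed. The sketch below assumes that a non-empty environment variable overrides the file value; that rule, the function name `resolve_api_settings`, and the key-to-variable mapping are assumptions inferred from the names shown above.

```python
import os

# Config key in the `api` section -> environment variable (assumed mapping).
ENV_OVERRIDES = {
    "openai_key": "OPENAI_API_KEY",
    "together_key": "TOGETHER_API_KEY",
    "pubtator_email": "PUBTATOR_EMAIL",
}


def resolve_api_settings(file_settings, environ=None):
    """Merge the `api` section of the config file with environment variables.

    Assumption: a non-empty environment variable wins over the file value.
    """
    environ = os.environ if environ is None else environ
    resolved = dict(file_settings)  # copy so the caller's dict is untouched
    for key, env_name in ENV_OVERRIDES.items():
        value = environ.get(env_name)
        if value:
            resolved[key] = value
    return resolved
```

Letting the environment win keeps real keys out of config/development.yaml, which is easier to commit by accident than a shell profile.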
If you're upgrading from an older version, see:
- Migration Guide - For src reorganization
- Tests Migration Guide - For tests reorganization
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes following the project structure
- Add tests for new functionality
- Ensure all tests pass (pytest)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0) - see the LICENSE file for details.
Documentation is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0) - see the LICENSE-DOCS file for details.
- PubTator3 for biomedical entity annotations
- ClinVar for variant databases
- LangChain for LLM integration
- OpenAI and Together AI for LLM services
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: wojciech.sitek@pw.edu.pl
This experiment compares LLM-based variant extraction against reference sources for FOX family genes.
- Load FOX genes from external_data/enhancer_tables_from_uw/fox_unique_genes.txt
- Get PMID counts for each gene using NCBI E-utilities
- Extract reference variants from LitVar based on gene names
- Extract predicted variants from publication texts using LLM (max 100 pubs/gene)
- Extract reference variants from PubTator annotations
- Calculate metrics comparing predicted vs reference variants
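The PMID-count step in the list above can be sketched against the NCBI ESearch endpoint, which returns only the hit count when `rettype=count` is set. The helper names and the use of the bare gene symbol as the search term are assumptions; the experiment's actual query may add field tags or filters.

```python
import json
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_count_url(gene, email=None):
    """Build an ESearch URL that asks PubMed only for the number of hits."""
    # Assumption: the search term is just the gene symbol.
    params = {"db": "pubmed", "term": gene, "rettype": "count", "retmode": "json"}
    if email:
        params["email"] = email  # NCBI asks clients to identify themselves
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_count(payload):
    """Extract the hit count from an ESearch JSON response body."""
    return int(json.loads(payload)["esearchresult"]["count"])
```

The URL can be fetched with `urllib.request.urlopen` and the response body passed to `parse_count`; requests should be spaced out to stay within NCBI's rate limits.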
results/2025-07-01/
├── data/
│ ├── fox_genes.txt # FOX family genes (50 genes)
│ ├── gene_pmids_counts.csv # PMID counts per gene
│ ├── reference_variants.json # LitVar variants by gene
│ ├── predicted_variants.json # LLM-predicted variants
│ └── pubtator_variants.json # PubTator reference variants
├── reports/
│ ├── llm_vs_pubtator_metrics.json # Detailed metrics vs PubTator
│ ├── llm_vs_litvar_metrics.json # Detailed metrics vs LitVar
│ ├── metrics_summary.csv # Summary metrics
│ └── experiment_summary.md # Human-readable report
├── logs/
│ └── experiment.log # Experiment execution log
├── main_fox_experiment.py # Main experiment orchestrator
├── variant_metrics_evaluator.py # Metrics calculation
├── experiment_utils.py # Helper functions
└── README.md # This file
cd results/2025-07-01
python main_fox_experiment.py
python variant_metrics_evaluator.py
python -c "from experiment_utils import test_small_subset_experiment; test_small_subset_experiment()"
python -c "from experiment_utils import validate_experiment_data; validate_experiment_data()"
python -c "from experiment_utils import check_api_connections; check_api_connections()"
- Precision: TP / (TP + FP) - How many predicted variants are correct
- Recall: TP / (TP + FN) - How many actual variants were found
- F1-Score: 2 * (Precision * Recall) / (Precision + Recall) - Harmonic mean
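As a worked instance of these formulas, the sketch below treats predicted and reference variants as sets of already-normalized strings. The function name `variant_metrics` and the choice to return 0.0 when a denominator is zero are assumptions; variant_metrics_evaluator.py may handle those cases differently.

```python
def variant_metrics(predicted, reference):
    """Return precision, recall, and F1 for two collections of variant strings."""
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)   # predicted and present in the reference
    fp = len(predicted - reference)   # predicted but absent from the reference
    fn = len(reference - predicted)   # in the reference but never predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = variant_metrics(
    predicted=["c.123A>G", "p.V600E", "c.1A>T"],
    reference=["c.123A>G", "p.V600E", "c.5G>C", "c.9del"],
)
```

In this example TP = 2, FP = 1, and FN = 2, so precision is 2/3, recall is 1/2, and F1 is 4/7.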
- FOX Genes: 50 genes from enhancer tables
- Publications: Retrieved via NCBI E-utilities based on gene names
- LLM: Meta-Llama-3.1-8B-Instruct for variant extraction
- PubTator: Automated text-mining annotations, used here as reference variants
- LitVar: Literature-derived variant database
- Full experiment: ~2-4 hours (depends on API rate limits)
- Small subset test: ~5-10 minutes
- Metrics calculation: ~1-2 minutes
- Rate limiting is implemented for all API calls
- LLM calls are limited to 100 publications per gene
- Variant normalization handles HGVS and protein notation
- All intermediate results are saved for debugging
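Variant normalization matters because the same variant can be written several ways, and set-based metrics only count exact string matches. The sketch below shows one plausible scheme: protein changes are converted to one-letter form, and a transcript prefix is stripped from HGVS strings. The rules and the name `normalize_variant` are assumptions and do not describe the experiment's actual normalizer.

```python
import re

AA3_TO_1 = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Ter": "*",
}

PROTEIN_3LETTER = re.compile(r"^(?:p\.)?([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")
PROTEIN_1LETTER = re.compile(r"^(?:p\.)?([A-Z])(\d+)([A-Z*])$")


def normalize_variant(text):
    """Map a variant mention to a canonical string for set comparison."""
    v = text.strip()
    m = PROTEIN_3LETTER.match(v)
    if m and m.group(1) in AA3_TO_1 and m.group(3) in AA3_TO_1:
        # p.Val600Glu -> p.V600E
        return f"p.{AA3_TO_1[m.group(1)]}{m.group(2)}{AA3_TO_1[m.group(3)]}"
    m = PROTEIN_1LETTER.match(v)
    if m:
        # V600E -> p.V600E
        return f"p.{m.group(1)}{m.group(2)}{m.group(3)}"
    # Assumed rule: drop a transcript prefix such as "NM_000492.3:".
    return v.split(":", 1)[-1]
```

Dropping the transcript prefix trades precision for recall in matching: two mentions on different transcripts will compare equal, which may or may not be acceptable for a given evaluation.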