🧬 Coordinates Literature Analysis

A comprehensive tool for analyzing biomedical literature to extract and understand relationships between genomic coordinates, variants, genes, and diseases using advanced NLP and Large Language Models.

📋 Overview

This project analyzes PubMed articles to extract relationships between genomic variants and other biomedical entities such as genes, diseases, and tissues. It uses Natural Language Processing (NLP) and Large Language Models (LLMs) to identify these relationships and score their strength on a scale from 0 to 10.

The system is designed for researchers, bioinformaticians, and clinicians who need to systematically analyze large volumes of biomedical literature to understand variant-disease associations and genomic relationships.

✨ Key Features

🔍 Genomic Variant Extraction: Extract genomic coordinates and variants from biomedical literature
🧬 Relationship Analysis: Analyze relationships between variants and biomedical entities (genes, diseases, tissues)
📊 Scoring System: Score relationship strength on a 0-10 scale using LLM analysis
📁 Multiple Export Formats: Export results to CSV and JSON formats
⚡ Intelligent Caching: Cache API responses for faster processing and reduced costs
🤖 Multi-LLM Support: Support for multiple LLM providers (OpenAI, TogetherAI)
🔧 Modular Architecture: Professional, scalable codebase with clear separation of concerns
🧪 Comprehensive Testing: Extensive test suite with >80% code coverage
📚 Rich Documentation: Detailed documentation and examples

🚀 Quick Start

Prerequisites

Python 3.9+
API keys for:
- OpenAI (for GPT models) or TogetherAI (for open-source models)
- PubTator3 (optional, for enhanced entity recognition)
- ClinVar (optional, for variant validation)

Installation

Clone the repository:

git clone https://github.com/yourusername/coordinates-lit.git
cd coordinates-lit

Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

```
pip install -r requirements.txt
```

Set up configuration:

```
cp config/development.example.yaml config/development.yaml
```

Then edit config/development.yaml to add your API keys and settings.

Basic Usage

Analyze PubMed articles:

python -m src.cli.analyze --pmids 32735606 32719766 --output results.csv

Analyze from file:

python -m src.cli.analyze --file pmids.txt --output results.csv --json results.json
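
The `--file` option expects one PubMed ID per line. The following is a minimal sketch of how such a file could be read. The helper name `read_pmids` and the handling of blank lines and `#` comments are assumptions made for illustration, not the project's actual loader.

```python
from pathlib import Path


def read_pmids(path):
    """Return the PubMed IDs listed in a text file, one ID per line."""
    pmids = []
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Assumption: blank lines and '#' comments are skipped.
        if not line or line.startswith("#"):
            continue
        if not line.isdigit():
            raise ValueError(f"Not a valid PMID: {line!r}")
        pmids.append(line)
    return pmids
```

Rejecting non-numeric lines early keeps malformed identifiers from reaching any API call.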

🏗️ Project Architecture

The project follows a professional modular architecture:

coordinates-lit/
├── 📁 src/                     # Source code
│   ├── 🔌 api/                 # API layer & external communication
│   │   ├── clients/            # API clients (PubTator, ClinVar, LitVar)
│   │   └── cache/              # Caching system
│   ├── 🧬 analysis/            # Biomedical analysis modules
│   │   ├── bio_ner/            # Named Entity Recognition
│   │   ├── context/            # Context analysis
│   │   ├── llm/                # LLM-based analysis
│   │   └── base/               # Base analyzer classes
│   ├── 💻 cli/                 # Command-line interface
│   ├── 📊 models/              # Data models & structures
│   ├── ⚙️ services/            # Business logic services
│   │   ├── flow/               # Data flow orchestration
│   │   ├── processing/         # Data processing
│   │   ├── search/             # Literature search
│   │   └── validation/         # Data validation
│   └── 🛠️ utils/               # Utilities & helpers
│       ├── config/             # Configuration management
│       ├── llm/                # LLM management
│       └── logging/            # Logging system
├── 🧪 tests/                   # Test suite (mirrors src structure)
├── 📁 config/                  # Configuration files
├── 📁 data/                    # Data storage
├── 📁 scripts/                 # Utility scripts
└── 📁 docs/                    # Documentation

🔧 Module Usage

API Clients

PubTator Client - Extract biomedical entities:

from src.api.clients.pubtator_client import PubTatorClient

client = PubTatorClient()
publication = client.get_publication_by_pmid("32735606")
entities = client.extract_entities(publication)

ClinVar Client - Validate variants:

from src.api.clients.clinvar_client import ClinVarClient

client = ClinVarClient()
variant_info = client.get_variant_info("NM_000492.3:c.1521_1523delCTT")

Analysis Modules

LLM Context Analyzer - Analyze relationships using LLM:

from src.analysis.llm.llm_context_analyzer import LlmContextAnalyzer

analyzer = LlmContextAnalyzer()
results = analyzer.analyze_publications_by_pmids(["32735606", "32719766"])
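
Relationship strength is scored from 0 to 10, so any value coming back from an LLM may need validating before it is stored. The sketch below assumes a purely hypothetical response format (a JSON list of objects with `variant`, `entity` and `score` keys) and a hypothetical helper `parse_relationships`; the real analyzer's output format is not documented here and may differ.

```python
import json


def parse_relationships(raw_response):
    """Parse a JSON list of relationships and clamp each score to 0-10."""
    parsed = []
    for item in json.loads(raw_response):
        score = float(item.get("score", 0))
        parsed.append({
            "variant": item["variant"],
            "entity": item["entity"],
            "score": max(0.0, min(10.0, score)),  # keep within the 0-10 scale
        })
    return parsed
```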

Bio NER - Extract genomic variants:

from src.analysis.bio_ner.variant_recognizer import VariantRecognizer

recognizer = VariantRecognizer()
variants = recognizer.extract_variants("Found mutation c.123A>G in BRCA1")
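
To illustrate the kind of pattern matching involved in variant recognition, here is a deliberately simplified, standalone sketch. It is not the `VariantRecognizer` implementation: the function `find_variants` and its regular expression are assumptions, and they only cover simple substitutions such as `c.123A>G` and three-letter protein changes such as `p.Val600Glu`.

```python
import re

# Simplified patterns: coding/genomic substitutions (c.123A>G) and
# three-letter protein changes (p.Val600Glu). Real HGVS is much broader.
VARIANT_PATTERN = re.compile(
    r"\b(?:[cgmn]\.\d+[ACGT]>[ACGT]|p\.[A-Z][a-z]{2}\d+[A-Z][a-z]{2})\b"
)


def find_variants(text):
    """Return variant mentions in order of appearance, without duplicates."""
    seen = []
    for match in VARIANT_PATTERN.finditer(text):
        if match.group(0) not in seen:
            seen.append(match.group(0))
    return seen
```

Deletions, duplications and rsIDs are intentionally out of scope for this sketch.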

Services

Flow Orchestration - Run complete analysis pipelines:

from src.services.flow.pubmed_flow import PubMedAnalysisFlow

flow = PubMedAnalysisFlow()
results = flow.analyze_pmids(["32735606"], output_format="csv")

🧪 Testing

The project includes comprehensive testing with >80% code coverage.

Run All Tests

pytest

Run Specific Test Categories

# API tests
pytest tests/api/

# Analysis tests
pytest tests/analysis/

# LLM manager tests
pytest tests/utils/llm/

# Integration tests
pytest tests/integration/

Test with Coverage

pytest --cov=src --cov-report=html

Test Markers

# Tests without real API calls (fast)
pytest -m "not realapi"

# Integration tests only
pytest -m integration

# Slow tests
pytest -m slow

📊 CLI Reference

Main Analysis Command

python -m src.cli.analyze [OPTIONS]

Options:

--pmids TEXT: List of PubMed IDs to analyze
--file PATH: File containing PubMed IDs (one per line)
--output PATH: Output CSV file path
--json PATH: Output JSON file path (optional)
--model TEXT: LLM model to use (default from config)
--email TEXT: Email for PubTator API requests
--debug: Enable debug mode
--no-retry: Disable automatic retries
--cache-type [memory|disk]: Cache type
--log-level [DEBUG|INFO|WARNING|ERROR]: Logging level
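
The options above can be mirrored with the standard-library `argparse` module, as sketched below. This is an illustration only: the project's actual CLI may be implemented differently, and the mutual exclusivity of `--pmids` and `--file`, the required `--output`, and the `INFO` default are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (a sketch, not the project's code)."""
    parser = argparse.ArgumentParser(prog="analyze")
    # Assumption: exactly one source of PubMed IDs must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--pmids", nargs="+", help="PubMed IDs to analyze")
    source.add_argument("--file", help="File with one PubMed ID per line")
    parser.add_argument("--output", required=True, help="Output CSV file path")
    parser.add_argument("--json", help="Output JSON file path (optional)")
    parser.add_argument("--model", help="LLM model (default from config)")
    parser.add_argument("--email", help="Email for PubTator API requests")
    parser.add_argument("--debug", action="store_true", help="Enable debug mode")
    parser.add_argument("--no-retry", action="store_true", help="Disable automatic retries")
    parser.add_argument("--cache-type", choices=["memory", "disk"], help="Cache type")
    parser.add_argument("--log-level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR"], help="Logging level")
    return parser
```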

Examples

Basic analysis:

python -m src.cli.analyze --pmids 32735606 32719766 --output results.csv

Batch analysis with JSON output:

python -m src.cli.analyze --file pmids.txt --output results.csv --json results.json
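
The command above writes the same results to both a CSV and a JSON file. A minimal sketch of such a dual export using only the standard library follows. The function `export_results` and the row fields used in the usage note are hypothetical; the tool's real output columns are not specified here.

```python
import csv
import json


def export_results(rows, csv_path, json_path=None):
    """Write result rows (dicts with identical keys) to CSV and optionally JSON."""
    if not rows:
        raise ValueError("No results to export")
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    if json_path is not None:
        with open(json_path, "w", encoding="utf-8") as handle:
            json.dump(rows, handle, indent=2)
```

For example, `export_results([{"pmid": "32735606", "variant": "c.123A>G", "score": 8}], "results.csv", "results.json")` would produce one CSV row plus a JSON array with one object.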

Debug mode with specific model:

python -m src.cli.analyze --pmids 32735606 --model gpt-4 --debug --log-level DEBUG

🎯 Use Cases

1. Clinical Research

Identify variant-disease associations in literature
Validate genomic findings against published research
Systematic literature reviews for specific variants

2. Bioinformatics Analysis

Extract genomic coordinates from publications
Build knowledge graphs of variant relationships
Automated literature curation

3. Drug Discovery

Find variants associated with drug responses
Identify therapeutic targets
Literature-based drug repurposing

4. Diagnostic Support

Validate variant pathogenicity from literature
Find similar cases in published research
Evidence-based variant interpretation

⚙️ Configuration

Config File Structure

# config/development.yaml
llm:
  provider: "openai"  # or "together"
  model: "gpt-3.5-turbo"
  temperature: 0.7

api:
  openai_key: "your-openai-key"
  together_key: "your-together-key"
  pubtator_email: "your-email@example.com"

cache:
  type: "disk"  # or "memory"
  ttl: 3600
  max_size: 1000

logging:
  level: "INFO"
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
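
To make the `cache` settings concrete, the sketch below shows one way `ttl` and `max_size` could behave in an in-memory cache. It assumes `ttl` is measured in seconds and `max_size` is a number of entries; both interpretations, and the `TTLCache` class itself, are assumptions rather than the project's caching code.

```python
import time
from collections import OrderedDict


class TTLCache:
    """In-memory cache with per-entry expiry and a maximum number of entries."""

    def __init__(self, ttl=3600, max_size=1000, clock=time.monotonic):
        self.ttl = ttl
        self.max_size = max_size
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._entries = OrderedDict()  # key -> (expires_at, value)

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # drop the stale entry
            return default
        return value

    def set(self, key, value):
        self._entries.pop(key, None)
        self._entries[key] = (self._clock() + self.ttl, value)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict the oldest insertion
```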

Environment Variables

export OPENAI_API_KEY="your-openai-key"
export TOGETHER_API_KEY="your-together-key"
export PUBTATOR_EMAIL="your-email@example.com"
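
Since keys can come from either the config file or the environment, some precedence rule is needed. The sketch below assumes that environment variables override the config file; that precedence and the helper `resolve_api_key` are assumptions, so check the project's configuration module for the actual behavior.

```python
import os


def resolve_api_key(config, config_key, env_var, environ=os.environ):
    """Return the environment variable if set, else the value from the config mapping."""
    value = environ.get(env_var) or config.get("api", {}).get(config_key)
    if not value:
        raise KeyError(f"Missing API key: set {env_var} or api.{config_key}")
    return value
```

Here `config` is the parsed YAML as a plain dictionary, for example `{"api": {"openai_key": "..."}}`.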

🔄 Migration from Old Structure

If you're upgrading from an older version, see:

Migration Guide - For src reorganization
Tests Migration Guide - For tests reorganization

🤝 Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Make your changes following the project structure
Add tests for new functionality
Ensure all tests pass (pytest)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

📚 Documentation

📄 License

Source Code

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0) - see the LICENSE file for details.

Documentation

Documentation is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0) - see the LICENSE-DOCS file for details.

🙏 Acknowledgments

PubTator3 for biomedical entity annotations
ClinVar for variant databases
LangChain for LLM integration
OpenAI and Together AI for LLM services

📞 Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Email: wojciech.sitek@pw.edu.pl

🌟 If this project helps your research, please consider giving it a star!

FOX Genes Variant Extraction Experiment - 01.07.2025

Overview

This experiment compares LLM-based variant extraction against reference sources for FOX family genes.

Experiment Steps

Load FOX genes from external_data/enhancer_tables_from_uw/fox_unique_genes.txt
Get PMID counts for each gene using NCBI E-utilities
Extract reference variants from LitVar based on gene names
Extract predicted variants from publication texts using LLM (max 100 pubs/gene)
Extract reference variants from PubTator annotations
Calculate metrics comparing predicted vs reference variants
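
Step 2 of the list above relies on NCBI E-utilities. As a hedged illustration, the sketch below builds an ESearch request that asks PubMed only for a hit count and parses the JSON reply. The search term (`gene[Title/Abstract]`) and both helper names are assumptions; the experiment's actual query may differ. No network call is made here.

```python
import json
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_count_url(gene, email=None):
    """Build an ESearch URL that asks PubMed only for the number of matches."""
    params = {
        "db": "pubmed",
        "term": f"{gene}[Title/Abstract]",  # assumed query; the experiment's term may differ
        "rettype": "count",
        "retmode": "json",
    }
    if email:
        params["email"] = email
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_count(response_text):
    """Extract the hit count from an ESearch JSON response."""
    return int(json.loads(response_text)["esearchresult"]["count"])
```

Keeping URL construction and response parsing separate from the HTTP call makes both easy to test without touching the API.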

File Structure

results/2025-07-01/
├── data/
│   ├── fox_genes.txt                    # FOX family genes (50 genes)
│   ├── gene_pmids_counts.csv           # PMID counts per gene
│   ├── reference_variants.json         # LitVar variants by gene
│   ├── predicted_variants.json         # LLM-predicted variants
│   └── pubtator_variants.json          # PubTator reference variants
├── reports/
│   ├── llm_vs_pubtator_metrics.json    # Detailed metrics vs PubTator
│   ├── llm_vs_litvar_metrics.json      # Detailed metrics vs LitVar
│   ├── metrics_summary.csv             # Summary metrics
│   └── experiment_summary.md           # Human-readable report
├── logs/
│   └── experiment.log                  # Experiment execution log
├── main_fox_experiment.py              # Main experiment orchestrator
├── variant_metrics_evaluator.py       # Metrics calculation
├── experiment_utils.py                 # Helper functions
└── README.md                           # This file

Usage

Run Full Experiment

cd results/2025-07-01
python main_fox_experiment.py

Calculate Metrics (after data collection)

python variant_metrics_evaluator.py

Test with Small Subset

python -c "from experiment_utils import test_small_subset_experiment; test_small_subset_experiment()"

Validate Data

python -c "from experiment_utils import validate_experiment_data; validate_experiment_data()"

Check API Connections

python -c "from experiment_utils import check_api_connections; check_api_connections()"

Metrics Calculated

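Precision: TP / (TP + FP) - Fraction of predicted variants that are correct
Recall: TP / (TP + FN) - Fraction of reference variants that were found
F1-Score: 2 * (Precision * Recall) / (Precision + Recall) - Harmonic mean of precision and recall

Here TP (true positives) are predicted variants that also appear in the reference set, FP (false positives) are predicted variants absent from it, and FN (false negatives) are reference variants that were not predicted.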
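
The three metrics can be computed directly from two sets of variant strings. The sketch below is an illustration of the formulas, not the code in `variant_metrics_evaluator.py`. The function name `variant_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
def variant_metrics(predicted, reference):
    """Compute precision, recall and F1 for two collections of variant strings."""
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)  # predicted and present in the reference
    fp = len(predicted - reference)  # predicted but absent from the reference
    fn = len(reference - predicted)  # in the reference but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

With three predicted variants, four reference variants and two in common, this gives a precision of 2/3, a recall of 1/2 and an F1-score of 4/7.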

Data Sources

FOX Genes: 50 genes from enhancer tables
Publications: Retrieved via NCBI E-utilities based on gene names
LLM: Meta-Llama-3.1-8B-Instruct for variant extraction
PubTator: Automated text-mining annotations used as reference variants
LitVar: Literature-derived variant database

Expected Runtime

Full experiment: ~2-4 hours (depends on API rate limits)
Small subset test: ~5-10 minutes
Metrics calculation: ~1-2 minutes

Notes

Rate limiting is implemented for all API calls
LLM calls are limited to 100 publications per gene
Variant normalization handles HGVS and protein notation
All intermediate results are saved for debugging
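
Normalization matters because the same variant can be written in several ways, and a string comparison would otherwise count them as different. The sketch below shows one simple approach: converting three-letter protein notation to one-letter notation and lower-casing the HGVS prefix. It is an assumption-laden illustration (the helper `normalize_variant` is hypothetical), not the experiment's actual normalization logic.

```python
import re

# Three-letter to one-letter amino acid codes (standard 20 plus stop).
AA3_TO_1 = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Ter": "*",
}
PROTEIN_3 = re.compile(r"^p\.\(?([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})\)?$")


def normalize_variant(variant):
    """Map equivalent spellings of a variant to one canonical string."""
    text = variant.strip()
    match = PROTEIN_3.match(text)
    if match and match.group(1) in AA3_TO_1 and match.group(3) in AA3_TO_1:
        ref, pos, alt = match.groups()
        return f"p.{AA3_TO_1[ref]}{pos}{AA3_TO_1[alt]}"
    # Lower-case the HGVS prefix ("C.123A>G" -> "c.123A>G"), leave the rest as is.
    if len(text) > 1 and text[1] == "." and text[0].isalpha():
        return text[0].lower() + text[1:]
    return text
```

Applying the same function to predicted and reference variants before comparing them avoids counting `p.Val600Glu` and `p.V600E` as a mismatch.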

