A sophisticated Python-based system leveraging large language models (LLMs) to transform unstructured clinical notes into structured, actionable healthcare data. Built with modern AI techniques and software engineering best practices.
Author: Nicole LeGuern (@CodeQueenie)
This project demonstrates advanced NLP capabilities by implementing an end-to-end pipeline for extracting structured clinical information from de-identified medical notes. It showcases expertise in:
- AI/ML Integration: Leveraging state-of-the-art LLMs (OpenAI GPT-4, Anthropic Claude) for complex text analysis
- Software Architecture: Implementing clean, modular design with separation of concerns
- API Development: Creating a production-ready RESTful API with FastAPI
- Testing Methodologies: Comprehensive testing strategy including mock implementations
- Healthcare Domain Knowledge: Applying NLP to solve real-world clinical data challenges
The pipeline includes components for data ingestion, preprocessing, LLM-based extraction, validation, and API deployment.
- Backend: Python 3.9+, FastAPI, Uvicorn
- AI/ML: OpenAI API, Anthropic API, custom prompt engineering
- Data Processing: Custom NLP preprocessing, JSON schema validation
- Testing: Pytest, mock frameworks, API testing
- DevOps: Environment management, containerization-ready
- Documentation: OpenAPI/Swagger, comprehensive README
- Flexible Data Ingestion: Load clinical notes from various file formats (TXT, CSV, JSON)
- Advanced Preprocessing: Clean, normalize, and de-identify clinical text
- LLM Integration: Extract structured data using OpenAI or Anthropic models
- Dynamic Prompt Engineering: Create optimized prompts for different extraction needs
- Comprehensive Validation: Validate extracted data against schemas and clinical rules
- Consistency Checking: Ensure internal consistency across different sections of the data
- RESTful API: Deploy the pipeline as a service with FastAPI
- Mock LLM Client: Test the pipeline without making actual API calls or incurring costs
- Extensive Testing: Comprehensive unit tests for all components
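To illustrate what dynamic prompt engineering can look like in practice, here is a minimal sketch of assembling an extraction prompt; the function and field names are hypothetical and may differ from the project's actual prompt_engineering.py module:

```python
def build_extraction_prompt(note_text, extraction_type="comprehensive", fields=None):
    """Assemble an LLM prompt that requests structured JSON output.

    Hypothetical sketch of dynamic prompt construction, not the project's API.
    """
    instructions = {
        "comprehensive": "Extract all clinical information from the note.",
        "targeted": "Extract only the requested fields from the note.",
    }
    parts = [instructions.get(extraction_type, instructions["comprehensive"])]
    if fields:
        # Narrow the extraction to the caller's fields of interest
        parts.append("Fields to extract: " + ", ".join(fields))
    parts.append("Return the result as valid JSON only.")
    parts.append("CLINICAL NOTE:\n" + note_text)
    return "\n\n".join(parts)

prompt = build_extraction_prompt(
    "Patient presents with cough.", "targeted", ["medications"]
)
print("medications" in prompt)  # the requested field appears in the prompt
```

Keeping the instructions, field list, and note in separate blocks makes it easy to vary one part (e.g., the target fields) without rewriting the whole prompt.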
The project includes sample clinical notes and API request examples to help you get started:
- data/sample/sample_clinical_note.txt: A sample clinical encounter note
- data/sample/api_examples/: JSON-formatted API request examples
The project includes a Python client that demonstrates how to call the API:
# Extract data from a clinical note
python example_api_client.py extract data/sample/api_examples/comprehensive_extraction.json
# Compare two clinical notes
python example_api_client.py compare data/sample/api_examples/comparison_request.json
# Validate extracted data
python example_api_client.py validate data/sample/api_examples/validation_request.json
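The client's subcommand interface could be implemented with argparse along these lines; this is a hypothetical sketch, not the actual example_api_client.py:

```python
import argparse

# Map each subcommand to the API endpoint it targets (hypothetical mapping)
ENDPOINTS = {
    "extract": "/extract",
    "compare": "/compare",
    "validate": "/validate",
}

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Clinical extraction API client")
    parser.add_argument("command", choices=ENDPOINTS, help="API operation to perform")
    parser.add_argument("request_file", help="Path to a JSON request payload")
    return parser.parse_args(argv)

args = parse_args(["extract", "data/sample/api_examples/comprehensive_extraction.json"])
print(args.command, "->", ENDPOINTS[args.command])
```

The parsed command then selects the endpoint, and the request file is read and posted as the JSON body.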
LLM-Powered Clinical Text Extraction Pipeline/
├── src/ # Source code
│ ├── data_ingestion/ # Data loading and preprocessing
│ │ ├── data_loader.py # Load clinical notes from files
│ │ └── preprocessor.py # Preprocess clinical notes
│ ├── llm_integration/ # LLM interaction
│ │ ├── llm_client.py # Client for LLM APIs
│ │ ├── mock_llm_client.py # Mock client for testing
│ │ ├── prompt_engineering.py # Create extraction prompts
│ │ └── clinical_extractor.py # Extract clinical data
│ ├── validation/ # Validation components
│ │ ├── data_validator.py # Validate extracted data
│ │ └── consistency_checker.py # Check data consistency
│ └── api/ # API deployment
│ ├── app.py # FastAPI application
│ └── models.py # API data models
├── tests/ # Unit tests
│ ├── data_ingestion/ # Tests for data ingestion
│ ├── llm_integration/ # Tests for LLM integration
│ ├── validation/ # Tests for validation
│ └── api/ # Tests for API
├── data/ # Data directory
│ └── sample/ # Sample clinical notes
├── requirements.txt # Project dependencies
└── README.md # Project documentation
- Create a new conda environment:
  conda create -n clinical-extraction python=3.9
- Activate the environment:
  conda activate clinical-extraction
- Install packages using conda:
  conda install -c conda-forge fastapi uvicorn python-dotenv requests jsonschema pytest
- Install the OpenAI and Anthropic packages (using pip, as they may not be in conda channels):
  conda install -c conda-forge pip
  pip install openai anthropic
- Clone the repository:
  git clone https://github.com/CodeQueenie/LLM-Powered_Clinical_Test_Extraction_Pipeline.git
  cd LLM-Powered_Clinical_Test_Extraction_Pipeline
- Create and activate a virtual environment:
  python -m venv venv
  # On Windows
  venv\Scripts\activate
  # On macOS/Linux
  source venv/bin/activate
- Install dependencies:
  pip install -r requirements.txt
- Set up environment variables: create a .env file in the project root with your API keys:
  # Create .env file for API keys
  echo OPENAI_API_KEY=your_openai_key_here > .env
  echo ANTHROPIC_API_KEY=your_anthropic_key_here >> .env
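At startup the pipeline presumably reads these keys from the environment (python-dotenv, listed in the stack, can load the .env file automatically). A minimal sketch of such a lookup, using only the standard library and a hypothetical helper name:

```python
import os

def get_api_key(provider):
    """Look up the API key for a provider, failing loudly if it is missing.

    Hypothetical helper; the project's actual configuration code may differ.
    """
    env_var = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}[provider]
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"{env_var} is not set; add it to your .env file")
    return key

# Demo value for illustration only -- use your real key via .env in practice
os.environ.setdefault("OPENAI_API_KEY", "sk-demo")
print(bool(get_api_key("openai")))
```

Failing fast with a clear message when a key is missing beats a cryptic authentication error deep inside an API call.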
Start the FastAPI server:
uvicorn src.api.app:app --reload --host 0.0.0.0 --port 8000
The API will be available at http://localhost:8000. You can access the interactive API documentation at:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- Open http://localhost:8000/docs in your browser
- Click the green POST button under extraction
- Click "Try it out"
- Enter a request body like:
  {
    "note_text": "Your clinical note text here",
    "extraction_type": "comprehensive",
    "target_category": "",
    "fields": []
  }
- Replace "Your clinical note text here" with content from one of the JSON example files in data/sample/api_examples/
- Click the blue "Execute" button
- The system will process your clinical note and return structured data in the "Response body" section
Important Note: When using the Swagger UI, we recommend using the pre-formatted JSON examples from the data/sample/api_examples/ directory rather than copying text directly from sample_clinical_note.txt, which may cause JSON parsing errors. The JSON examples are properly formatted for the API and will work correctly with the Swagger UI.
- GET /: Health check endpoint
- POST /extract: Extract clinical data from a note
- POST /compare: Compare two clinical notes
- POST /validate: Validate extracted clinical data
import requests
import json
# API endpoint
url = "http://localhost:8000/extract"
# Clinical note
note = """
PATIENT ENCOUNTER NOTE
Date: 2025-01-15
Patient ID: PT12345 (De-identified)
CHIEF COMPLAINT:
Patient presents with persistent cough for 2 weeks, fatigue, and mild fever.
MEDICATIONS:
1. Lisinopril 10mg daily for hypertension
2. Metformin 500mg twice daily for diabetes
ASSESSMENT:
1. Community-acquired pneumonia, right lower lobe
2. Hypertension, controlled
3. Type 2 diabetes mellitus, controlled
"""
# Request payload
payload = {
"note_text": note,
"extraction_type": "comprehensive"
}
# Send request
response = requests.post(url, json=payload)
response.raise_for_status()  # Fail fast on HTTP errors
result = response.json()
# Print extracted data
print(json.dumps(result["extracted_data"], indent=2))
from src.data_ingestion.data_loader import ClinicalNoteLoader
from src.data_ingestion.preprocessor import ClinicalNotePreprocessor
from src.llm_integration.clinical_extractor import ClinicalDataExtractor
from src.validation.data_validator import DataValidator
from src.validation.consistency_checker import ConsistencyChecker
# Load a clinical note
loader = ClinicalNoteLoader()
note = loader.load_from_txt("data/sample/sample_clinical_note.txt")
# Preprocess the note
preprocessor = ClinicalNotePreprocessor()
preprocessed_note = preprocessor.preprocess(note)
# Extract clinical data
extractor = ClinicalDataExtractor(llm_provider="openai")
extracted_data = extractor.extract_comprehensive(preprocessed_note)
# Validate the extracted data
validator = DataValidator()
is_valid, validation_results = validator.validate_extraction(extracted_data)
# Check consistency
checker = ConsistencyChecker()
is_consistent, inconsistencies = checker.check_consistency(extracted_data)
# Print results
print(f"Extracted data valid: {is_valid}")
print(f"Data is consistent: {is_consistent}")
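As an illustration of the kind of cross-section check a consistency checker can perform, the sketch below flags medications whose stated condition does not appear anywhere in the assessment. The field names and the rule itself are hypothetical; the project's ConsistencyChecker may implement different checks:

```python
def find_unsupported_medications(extracted):
    """Return medications whose condition is not mentioned in any assessment item.

    A simplified, hypothetical consistency rule for illustration only.
    """
    assessments = " ".join(extracted.get("assessment", [])).lower()
    issues = []
    for med in extracted.get("medications", []):
        condition = med.get("condition", "").lower()
        if condition and condition not in assessments:
            issues.append(
                f"{med['name']} treats '{condition}' but no matching assessment found"
            )
    return issues

data = {
    "medications": [
        {"name": "Lisinopril", "condition": "hypertension"},
        {"name": "Metformin", "condition": "diabetes"},
        {"name": "Warfarin", "condition": "atrial fibrillation"},
    ],
    "assessment": ["Hypertension, controlled", "Type 2 diabetes mellitus, controlled"],
}
issues = find_unsupported_medications(data)
print(issues)  # only Warfarin's condition is missing from the assessment
```

Checks like this catch extraction errors that schema validation alone cannot, since each section may be individually well-formed yet contradict another.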
Run the test suite:
pytest
Run tests with coverage:
pytest --cov=src
Run the unit tests to ensure all components are working correctly:
pytest tests/
For testing without making actual API calls (useful for development and CI/CD), you can use the included mock LLM client:
# Test extraction with mock client
python test_with_mock.py extract data/sample/api_examples/comprehensive_extraction.json
# Test comparison with mock client
python test_with_mock.py compare data/sample/api_examples/comparison_request.json
The mock client returns predefined responses based on the request type, allowing you to test the pipeline's functionality without incurring API costs or requiring internet connectivity. This is especially useful when:
- You don't have an active internet connection
- You want to avoid API costs during development
- You're running automated tests in a CI/CD pipeline
- You're experiencing timeout issues with the real APIs
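Conceptually, a mock client of this kind just returns canned responses keyed by request type. A minimal sketch (class, method, and response shapes are hypothetical, not the project's actual mock_llm_client.py):

```python
class MockLLMClient:
    """Stand-in for a real LLM API client that returns predefined responses.

    Hypothetical sketch mirroring the idea of the project's mock client,
    not its actual interface.
    """

    # Predefined responses keyed by request type
    CANNED_RESPONSES = {
        "extract": {"medications": [{"name": "Lisinopril", "dose": "10mg"}]},
        "compare": {"differences": [], "similarity": 1.0},
    }

    def __init__(self):
        self.calls = []  # record calls so tests can assert on them

    def complete(self, prompt, request_type="extract"):
        self.calls.append((request_type, prompt))
        return self.CANNED_RESPONSES.get(request_type, {})

client = MockLLMClient()
result = client.complete("Extract medications from this note.", "extract")
print(result["medications"][0]["name"])  # Lisinopril, with no API call made
```

Because the mock exposes the same call shape as a real client, the rest of the pipeline can be exercised unchanged, and the recorded calls let tests verify that the right prompts were sent.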
If you encounter issues with the API:
- Swagger UI not loading: Try refreshing the page or ensure the server is still running
- API Timeouts: For large clinical notes, you might experience timeouts with the LLM APIs. Try:
  - Using the mock client for testing
  - Breaking down the note into smaller sections
  - Increasing the timeout settings in your requests
- API Key Issues: Verify your API keys are correctly set in the .env file
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI and Anthropic for their powerful language models
- FastAPI for the web framework
- The medical NLP research community for inspiration and best practices
This project was created by Nicole LeGuern (@CodeQueenie). If you use or modify this code, please provide attribution to the original author.