LLM-Powered Clinical Text Extraction Pipeline

A sophisticated Python-based system leveraging large language models (LLMs) to transform unstructured clinical notes into structured, actionable healthcare data. Built with modern AI techniques and software engineering best practices.

Author: Nicole LeGuern (@CodeQueenie)

Project Overview

This project demonstrates advanced NLP capabilities by implementing an end-to-end pipeline for extracting structured clinical information from de-identified medical notes. It showcases expertise in:

  • AI/ML Integration: Leveraging state-of-the-art LLMs (OpenAI GPT-4, Anthropic Claude) for complex text analysis
  • Software Architecture: Implementing clean, modular design with separation of concerns
  • API Development: Creating a production-ready RESTful API with FastAPI
  • Testing Methodologies: Comprehensive testing strategy including mock implementations
  • Healthcare Domain Knowledge: Applying NLP to solve real-world clinical data challenges

The pipeline includes components for data ingestion, preprocessing, LLM-based extraction, validation, and API deployment.

Technical Stack

  • Backend: Python 3.9+, FastAPI, Uvicorn
  • AI/ML: OpenAI API, Anthropic API, custom prompt engineering
  • Data Processing: Custom NLP preprocessing, JSON schema validation
  • Testing: Pytest, mock frameworks, API testing
  • DevOps: Environment management, containerization-ready
  • Documentation: OpenAPI/Swagger, comprehensive README

Features

  • Flexible Data Ingestion: Load clinical notes from various file formats (TXT, CSV, JSON)
  • Advanced Preprocessing: Clean, normalize, and de-identify clinical text
  • LLM Integration: Extract structured data using OpenAI or Anthropic models
  • Dynamic Prompt Engineering: Create optimized prompts for different extraction needs (see the sketch after this list)
  • Comprehensive Validation: Validate extracted data against schemas and clinical rules
  • Consistency Checking: Ensure internal consistency across different sections of the data
  • RESTful API: Deploy the pipeline as a service with FastAPI
  • Mock LLM Client: Test the pipeline without making actual API calls or incurring costs
  • Extensive Testing: Comprehensive unit tests for all components
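
As a rough illustration of what dynamic prompt engineering can look like in Python, the sketch below assembles an extraction prompt from a note and an optional field list. The wording, field names, and function name are illustrative assumptions, not the prompts defined in the project's prompt_engineering module.

# Illustrative sketch of dynamic prompt construction (not the project's actual prompts).
from typing import List, Optional


def build_extraction_prompt(note_text: str,
                            extraction_type: str = "comprehensive",
                            fields: Optional[List[str]] = None) -> str:
    """Assemble an instruction prompt that asks the LLM for structured JSON output."""
    field_hint = (
        f"Extract only these fields: {', '.join(fields)}."
        if fields
        else "Extract all clinically relevant information (medications, diagnoses, vitals, plan)."
    )
    return (
        "You are a clinical information extraction assistant.\n"
        f"Extraction type: {extraction_type}. {field_hint}\n"
        "Return a single JSON object and nothing else.\n\n"
        f"Clinical note:\n{note_text}"
    )


# Example: a medication-focused prompt
print(build_extraction_prompt("Patient takes Lisinopril 10mg daily.", fields=["medications"]))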

Sample Data and API Examples

The project includes sample clinical notes and API request examples to help you get started:

Sample Files

  • data/sample/sample_clinical_note.txt: A sample clinical encounter note
  • data/sample/api_examples/: JSON-formatted API request examples

Using the Sample API Examples

With the Example API Client

The project includes a Python client that demonstrates how to call the API:

# Extract data from a clinical note
python example_api_client.py extract data/sample/api_examples/comprehensive_extraction.json

# Compare two clinical notes
python example_api_client.py compare data/sample/api_examples/comparison_request.json

# Validate extracted data
python example_api_client.py validate data/sample/api_examples/validation_request.json

Project Structure

LLM-Powered_Clinical_Test_Extraction_Pipeline/
├── src/                           # Source code
│   ├── data_ingestion/            # Data loading and preprocessing
│   │   ├── data_loader.py         # Load clinical notes from files
│   │   └── preprocessor.py        # Preprocess clinical notes
│   ├── llm_integration/           # LLM interaction
│   │   ├── llm_client.py          # Client for LLM APIs
│   │   ├── mock_llm_client.py     # Mock client for testing
│   │   ├── prompt_engineering.py  # Create extraction prompts
│   │   └── clinical_extractor.py  # Extract clinical data
│   ├── validation/                # Validation components
│   │   ├── data_validator.py      # Validate extracted data
│   │   └── consistency_checker.py # Check data consistency
│   └── api/                       # API deployment
│       ├── app.py                 # FastAPI application
│       └── models.py              # API data models
├── tests/                         # Unit tests
│   ├── data_ingestion/            # Tests for data ingestion
│   ├── llm_integration/           # Tests for LLM integration
│   ├── validation/                # Tests for validation
│   └── api/                       # Tests for API
├── data/                          # Data directory
│   └── sample/                    # Sample clinical notes
├── requirements.txt               # Project dependencies
└── README.md                      # Project documentation
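
To give a sense of how the llm_integration layer can route a prompt to either provider, here is a minimal sketch built on the official OpenAI and Anthropic Python SDKs. It illustrates the general pattern only; the function name and model names are assumptions, and the repository's llm_client.py is the authoritative implementation.

# Minimal provider-routing sketch (illustrative; not the repository's llm_client.py).
import os

import anthropic
from openai import OpenAI


def complete(prompt: str, provider: str = "openai") -> str:
    """Send a prompt to the chosen provider and return the raw text response."""
    if provider == "openai":
        client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        resp = client.chat.completions.create(
            model="gpt-4",  # example model name
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
    if provider == "anthropic":
        client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
        msg = client.messages.create(
            model="claude-3-5-sonnet-20240620",  # example model name
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text
    raise ValueError(f"Unknown provider: {provider}")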

Installation

Using Conda (Recommended)

  1. Clone the repository:

    git clone https://github.com/CodeQueenie/LLM-Powered_Clinical_Test_Extraction_Pipeline.git
    cd LLM-Powered_Clinical_Test_Extraction_Pipeline
  2. Create a new conda environment:

    conda create -n clinical-extraction python=3.9
  3. Activate the environment:

    conda activate clinical-extraction
  4. Install packages using conda:

    conda install -c conda-forge fastapi uvicorn python-dotenv requests jsonschema pytest
  5. Install the OpenAI and Anthropic packages (using pip, as they may not be available in conda channels):

    conda install -c conda-forge pip
    pip install openai anthropic

Using Pip

  1. Clone the repository:

    git clone https://github.com/CodeQueenie/LLM-Powered_Clinical_Test_Extraction_Pipeline.git
    cd LLM-Powered_Clinical_Test_Extraction_Pipeline
  2. Create and activate a virtual environment:

    python -m venv venv
    # On Windows
    venv\Scripts\activate
    # On macOS/Linux
    source venv/bin/activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Set up environment variables: Create a .env file in the project root with your API keys:

    # Create .env file for API keys
    echo OPENAI_API_KEY=your_openai_key_here > .env
    echo ANTHROPIC_API_KEY=your_anthropic_key_here >> .env
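
The .env file is needed for both the conda and pip setups whenever you call the real LLM APIs. The python-dotenv package listed in the dependencies can load these keys at runtime; a minimal sketch (the variable names match the .env file created above):

# Load API keys from .env into the process environment (works for either install path).
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
openai_key = os.getenv("OPENAI_API_KEY")
anthropic_key = os.getenv("ANTHROPIC_API_KEY")

if not openai_key and not anthropic_key:
    raise RuntimeError("No LLM API key found; set OPENAI_API_KEY or ANTHROPIC_API_KEY in .env")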

Usage

Running the API

Start the FastAPI server:

uvicorn src.api.app:app --reload --host 0.0.0.0 --port 8000

The API will be available at http://localhost:8000. You can access the interactive API documentation at:

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc

Testing the API with Swagger UI

  1. Open http://localhost:8000/docs in your browser
  2. Click the green POST button under the extraction section (the /extract endpoint)
  3. Click "Try it out"
  4. Enter a request body like:
    {
      "note_text": "Your clinical note text here",
      "extraction_type": "comprehensive",
      "target_category": "",
      "fields": []
    }
  5. Replace "Your clinical note text here" with content from one of the JSON example files in data/sample/api_examples/
  6. Click the blue "Execute" button
  7. The system will process your clinical note and return structured data in the "Response body" section

Important Note: When using the Swagger UI, we recommend using the pre-formatted JSON examples from the data/sample/api_examples/ directory rather than copying text directly from sample_clinical_note.txt; the raw note contains unescaped newlines and quotes that cause JSON parsing errors. The JSON examples are already properly escaped and work correctly with the Swagger UI.

API Endpoints

  • GET /: Health check endpoint
  • POST /extract: Extract clinical data from a note
  • POST /compare: Compare two clinical notes
  • POST /validate: Validate extracted clinical data

Example: Extracting Data

import requests
import json

# API endpoint
url = "http://localhost:8000/extract"

# Clinical note
note = """
PATIENT ENCOUNTER NOTE
Date: 2025-01-15
Patient ID: PT12345 (De-identified)

CHIEF COMPLAINT:
Patient presents with persistent cough for 2 weeks, fatigue, and mild fever.

MEDICATIONS:
1. Lisinopril 10mg daily for hypertension
2. Metformin 500mg twice daily for diabetes

ASSESSMENT:
1. Community-acquired pneumonia, right lower lobe
2. Hypertension, controlled
3. Type 2 diabetes mellitus, controlled
"""

# Request payload
payload = {
    "note_text": note,
    "extraction_type": "comprehensive"
}

# Send request
response = requests.post(url, json=payload)
result = response.json()

# Print extracted data
print(json.dumps(result["extracted_data"], indent=2))

Using the Library Directly

from src.data_ingestion.data_loader import ClinicalNoteLoader
from src.data_ingestion.preprocessor import ClinicalNotePreprocessor
from src.llm_integration.clinical_extractor import ClinicalDataExtractor
from src.validation.data_validator import DataValidator
from src.validation.consistency_checker import ConsistencyChecker

# Load a clinical note
loader = ClinicalNoteLoader()
note = loader.load_from_txt("data/sample/sample_clinical_note.txt")

# Preprocess the note
preprocessor = ClinicalNotePreprocessor()
preprocessed_note = preprocessor.preprocess(note)

# Extract clinical data
extractor = ClinicalDataExtractor(llm_provider="openai")
extracted_data = extractor.extract_comprehensive(preprocessed_note)

# Validate the extracted data
validator = DataValidator()
is_valid, validation_results = validator.validate_extraction(extracted_data)

# Check consistency
checker = ConsistencyChecker()
is_consistent, inconsistencies = checker.check_consistency(extracted_data)

# Print results
print(f"Extracted data valid: {is_valid}")
print(f"Data is consistent: {is_consistent}")

Testing

Run the test suite:

pytest

Run tests with coverage (requires the pytest-cov plugin):

pytest --cov=src

Unit Tests

Run the unit tests to ensure all components are working correctly:

pytest tests/

Testing with Mock LLM Client

For testing without making actual API calls (useful for development and CI/CD), you can use the included mock LLM client:

# Test extraction with mock client
python test_with_mock.py extract data/sample/api_examples/comprehensive_extraction.json

# Test comparison with mock client
python test_with_mock.py compare data/sample/api_examples/comparison_request.json

The mock client returns predefined responses based on the request type, allowing you to test the pipeline's functionality without incurring API costs or requiring internet connectivity (a minimal pytest sketch follows the list below). This is especially useful when:

  • You don't have an active internet connection
  • You want to avoid API costs during development
  • You're running automated tests in a CI/CD pipeline
  • You're experiencing timeout issues with the real APIs
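
As a rough illustration of the same idea in a unit test, the sketch below swaps the extractor for a MagicMock with a canned response so downstream logic can be exercised offline. The response shape shown is an assumption for illustration only; the repository's mock_llm_client.py and test_with_mock.py define the project's actual mock behavior.

# Illustrative pytest sketch: exercise pipeline logic with a mocked extractor.
# The canned response shape is an assumption, not the project's real schema.
from unittest.mock import MagicMock


def test_pipeline_with_mocked_extractor():
    canned = {"medications": [{"name": "Lisinopril", "dose": "10mg", "frequency": "daily"}]}

    # Stand-in for ClinicalDataExtractor: same method name, no API calls made.
    extractor = MagicMock()
    extractor.extract_comprehensive.return_value = canned

    result = extractor.extract_comprehensive("Patient takes Lisinopril 10mg daily.")

    assert result["medications"][0]["name"] == "Lisinopril"
    extractor.extract_comprehensive.assert_called_once()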

Troubleshooting

If you encounter issues with the API:

  1. Swagger UI not loading: Try refreshing the page or ensure the server is still running
  2. API Timeouts: For large clinical notes, you might experience timeouts with the LLM APIs. Try:
    • Using the mock client for testing
    • Breaking down the note into smaller sections
    • Increasing the timeout settings in your requests (see the snippet after this list)
  3. API Key Issues: Verify your API keys are correctly set in the .env file
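
For the timeout case above, the requests library accepts a per-call timeout parameter; a minimal example against the local /extract endpoint:

import requests

payload = {"note_text": "Your clinical note text here", "extraction_type": "comprehensive"}

# Allow up to 120 seconds for the LLM-backed extraction before giving up.
response = requests.post("http://localhost:8000/extract", json=payload, timeout=120)
response.raise_for_status()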

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • OpenAI and Anthropic for their powerful language models
  • FastAPI for the web framework
  • The medical NLP research community for inspiration and best practices

Attribution

This project was created by Nicole LeGuern (@CodeQueenie). If you use or modify this code, please provide attribution to the original author.
