A sophisticated Python-based system leveraging large language models (LLMs) to transform unstructured clinical notes into structured, actionable healthcare data. Built with modern AI techniques and software engineering best practices.
Author: Nicole LeGuern (@CodeQueenie)
This project demonstrates advanced NLP capabilities by implementing an end-to-end pipeline for extracting structured clinical information from de-identified medical notes. It showcases expertise in:
- AI/ML Integration: Leveraging state-of-the-art LLMs (OpenAI GPT-4, Anthropic Claude) for complex text analysis
- Software Architecture: Implementing clean, modular design with separation of concerns
- API Development: Creating a production-ready RESTful API with FastAPI
- Testing Methodologies: Comprehensive testing strategy including mock implementations
- Healthcare Domain Knowledge: Applying NLP to solve real-world clinical data challenges
The pipeline includes components for data ingestion, preprocessing, LLM-based extraction, validation, and API deployment.
- Backend: Python 3.9+, FastAPI, Uvicorn
- AI/ML: OpenAI API, Anthropic API, custom prompt engineering
- Data Processing: Custom NLP preprocessing, JSON schema validation
- Testing: Pytest, mock frameworks, API testing
- DevOps: Environment management, containerization-ready
- Documentation: OpenAPI/Swagger, comprehensive README
- Flexible Data Ingestion: Load clinical notes from various file formats (TXT, CSV, JSON)
- Advanced Preprocessing: Clean, normalize, and de-identify clinical text
- LLM Integration: Extract structured data using OpenAI or Anthropic models
- Dynamic Prompt Engineering: Create optimized prompts for different extraction needs
- Comprehensive Validation: Validate extracted data against schemas and clinical rules
- Consistency Checking: Ensure internal consistency across different sections of the data
- RESTful API: Deploy the pipeline as a service with FastAPI
- Mock LLM Client: Test the pipeline without making actual API calls or incurring costs
- Extensive Testing: Comprehensive unit tests for all components
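To illustrate what dynamic prompt engineering can look like in practice, here is a minimal sketch of assembling an extraction prompt; the function and field names are hypothetical and may differ from the project's actual prompt_engineering.py module:

```python
def build_extraction_prompt(note_text, extraction_type="comprehensive", fields=None):
    """Assemble an LLM prompt that requests structured JSON output.

    Hypothetical sketch of dynamic prompt construction, not the project's API.
    """
    instructions = {
        "comprehensive": "Extract all clinical information from the note.",
        "targeted": "Extract only the requested fields from the note.",
    }
    parts = [instructions.get(extraction_type, instructions["comprehensive"])]
    if fields:
        # Narrow the extraction to the caller's fields of interest
        parts.append("Fields to extract: " + ", ".join(fields))
    parts.append("Return the result as valid JSON only.")
    parts.append("CLINICAL NOTE:\n" + note_text)
    return "\n\n".join(parts)

prompt = build_extraction_prompt(
    "Patient presents with cough.", "targeted", ["medications"]
)
print("medications" in prompt)  # the requested field appears in the prompt
```

Keeping the instructions, field list, and note in separate blocks makes it easy to vary one part (e.g., the target fields) without rewriting the whole prompt.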
The project includes sample clinical notes and API request examples to help you get started:
- data/sample/sample_clinical_note.txt: A sample clinical encounter note
- data/sample/api_examples/: JSON-formatted API request examples
The project includes a Python client that demonstrates how to call the API:
# Extract data from a clinical note
python example_api_client.py extract data/sample/api_examples/comprehensive_extraction.json
# Compare two clinical notes
python example_api_client.py compare data/sample/api_examples/comparison_request.json
# Validate extracted data
python example_api_client.py validate data/sample/api_examples/validation_request.json
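The client's subcommand interface could be implemented with argparse along these lines; this is a hypothetical sketch, not the actual example_api_client.py:

```python
import argparse

# Map each subcommand to the API endpoint it targets (hypothetical mapping)
ENDPOINTS = {
    "extract": "/extract",
    "compare": "/compare",
    "validate": "/validate",
}

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Clinical extraction API client")
    parser.add_argument("command", choices=ENDPOINTS, help="API operation to perform")
    parser.add_argument("request_file", help="Path to a JSON request payload")
    return parser.parse_args(argv)

args = parse_args(["extract", "data/sample/api_examples/comprehensive_extraction.json"])
print(args.command, "->", ENDPOINTS[args.command])
```

The parsed command then selects the endpoint, and the request file is read and posted as the JSON body.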
LLM-Powered Clinical Text Extraction Pipeline/
├── src/ # Source code
│ ├── data_ingestion/ # Data loading and preprocessing
│ │ ├── data_loader.py # Load clinical notes from files
│ │ └── preprocessor.py # Preprocess clinical notes
│ ├── llm_integration/ # LLM interaction
│ │ ├── llm_client.py # Client for LLM APIs
│ │ ├── mock_llm_client.py # Mock client for testing
│ │ ├── prompt_engineering.py # Create extraction prompts
│ │ └── clinical_extractor.py # Extract clinical data
│ ├── validation/ # Validation components
│ │ ├── data_validator.py # Validate extracted data
│ │ └── consistency_checker.py # Check data consistency
│ └── api/ # API deployment
│ ├── app.py # FastAPI application
│ └── models.py # API data models
├── tests/ # Unit tests
│ ├── data_ingestion/ # Tests for data ingestion
│ ├── llm_integration/ # Tests for LLM integration
│ ├── validation/ # Tests for validation
│ └── api/ # Tests for API
├── data/ # Data directory
│ └── sample/ # Sample clinical notes
├── requirements.txt # Project dependencies
└── README.md # Project documentation
- Create a new conda environment:
  conda create -n clinical-extraction python=3.9
- Activate the environment:
  conda activate clinical-extraction
- Install packages using conda:
  conda install -c conda-forge fastapi uvicorn python-dotenv requests jsonschema pytest
- Install the OpenAI and Anthropic packages (using pip, as they may not be in conda channels):
  conda install -c conda-forge pip
  pip install openai anthropic
- Clone the repository:
  git clone https://github.com/CodeQueenie/LLM-Powered_Clinical_Test_Extraction_Pipeline.git
  cd LLM-Powered_Clinical_Test_Extraction_Pipeline
- Create and activate a virtual environment:
  python -m venv venv
  # On Windows
  venv\Scripts\activate
  # On macOS/Linux
  source venv/bin/activate
- Install dependencies:
  pip install -r requirements.txt
- Set up environment variables: create a .env file in the project root with your API keys:
  # Create .env file for API keys
  echo OPENAI_API_KEY=your_openai_key_here > .env
  echo ANTHROPIC_API_KEY=your_anthropic_key_here >> .env
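At startup the pipeline presumably reads these keys from the environment (python-dotenv, listed in the stack, can load the .env file automatically). A minimal sketch of such a lookup, using only the standard library and a hypothetical helper name:

```python
import os

def get_api_key(provider):
    """Look up the API key for a provider, failing loudly if it is missing.

    Hypothetical helper; the project's actual configuration code may differ.
    """
    env_var = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}[provider]
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"{env_var} is not set; add it to your .env file")
    return key

# Demo value for illustration only -- use your real key via .env in practice
os.environ.setdefault("OPENAI_API_KEY", "sk-demo")
print(bool(get_api_key("openai")))
```

Failing fast with a clear message when a key is missing beats a cryptic authentication error deep inside an API call.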
Start the FastAPI server:
uvicorn src.api.app:app --reload --host 0.0.0.0 --port 8000
The API will be available at http://localhost:8000. You can access the interactive API documentation at:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- Open http://localhost:8000/docs in your browser
- Click the green POST button under extraction
- Click "Try it out"
- Enter a request body like:
  {
    "note_text": "Your clinical note text here",
    "extraction_type": "comprehensive",
    "target_category": "",
    "fields": []
  }
- Replace "Your clinical note text here" with content from one of the JSON example files in data/sample/api_examples/
- Click the blue "Execute" button
- The system will process your clinical note and return structured data in the "Response body" section
Important Note: When using the Swagger UI, we recommend using the pre-formatted JSON examples from the data/sample/api_examples/ directory rather than copying text directly from sample_clinical_note.txt, which may cause JSON parsing errors. The JSON examples are properly formatted for the API and will work correctly with the Swagger UI.
- GET /: Health check endpoint
- POST /extract: Extract clinical data from a note
- POST /compare: Compare two clinical notes
- POST /validate: Validate extracted clinical data
import requests
import json
# API endpoint
url = "http://localhost:8000/extract"
# Clinical note
note = """
PATIENT ENCOUNTER NOTE
Date: 2025-01-15
Patient ID: PT12345 (De-identified)
CHIEF COMPLAINT:
Patient presents with persistent cough for 2 weeks, fatigue, and mild fever.
MEDICATIONS:
1. Lisinopril 10mg daily for hypertension
2. Metformin 500mg twice daily for diabetes
ASSESSMENT:
1. Community-acquired pneumonia, right lower lobe
2. Hypertension, controlled
3. Type 2 diabetes mellitus, controlled
"""
# Request payload
payload = {
"note_text": note,
"extraction_type": "comprehensive"
}
# Send request
response = requests.post(url, json=payload)
response.raise_for_status()  # Fail fast on HTTP errors
result = response.json()
# Print extracted data
print(json.dumps(result["extracted_data"], indent=2))
from src.data_ingestion.data_loader import ClinicalNoteLoader
from src.data_ingestion.preprocessor import ClinicalNotePreprocessor
from src.llm_integration.clinical_extractor import ClinicalDataExtractor
from src.validation.data_validator import DataValidator
from src.validation.consistency_checker import ConsistencyChecker
# Load a clinical note
loader = ClinicalNoteLoader()
note = loader.load_from_txt("data/sample/sample_clinical_note.txt")
# Preprocess the note
preprocessor = ClinicalNotePreprocessor()
preprocessed_note = preprocessor.preprocess(note)
# Extract clinical data
extractor = ClinicalDataExtractor(llm_provider="openai")
extracted_data = extractor.extract_comprehensive(preprocessed_note)
# Validate the extracted data
validator = DataValidator()
is_valid, validation_results = validator.validate_extraction(extracted_data)
# Check consistency
checker = ConsistencyChecker()
is_consistent, inconsistencies = checker.check_consistency(extracted_data)
# Print results
print(f"Extracted data valid: {is_valid}")
print(f"Data is consistent: {is_consistent}")
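As an illustration of the kind of cross-section check a consistency checker can perform, the sketch below flags medications whose stated condition does not appear anywhere in the assessment. The field names and the rule itself are hypothetical; the project's ConsistencyChecker may implement different checks:

```python
def find_unsupported_medications(extracted):
    """Return medications whose condition is not mentioned in any assessment item.

    A simplified, hypothetical consistency rule for illustration only.
    """
    assessments = " ".join(extracted.get("assessment", [])).lower()
    issues = []
    for med in extracted.get("medications", []):
        condition = med.get("condition", "").lower()
        if condition and condition not in assessments:
            issues.append(
                f"{med['name']} treats '{condition}' but no matching assessment found"
            )
    return issues

data = {
    "medications": [
        {"name": "Lisinopril", "condition": "hypertension"},
        {"name": "Metformin", "condition": "diabetes"},
        {"name": "Warfarin", "condition": "atrial fibrillation"},
    ],
    "assessment": ["Hypertension, controlled", "Type 2 diabetes mellitus, controlled"],
}
issues = find_unsupported_medications(data)
print(issues)  # only Warfarin's condition is missing from the assessment
```

Checks like this catch extraction errors that schema validation alone cannot, since each section may be individually well-formed yet contradict another.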
Run the test suite:
pytest
Run tests with coverage:
pytest --cov=src
Run the unit tests to ensure all components are working correctly:
pytest tests/
For testing without making actual API calls (useful for development and CI/CD), you can use the included mock LLM client:
# Test extraction with mock client
python test_with_mock.py extract data/sample/api_examples/comprehensive_extraction.json
# Test comparison with mock client
python test_with_mock.py compare data/sample/api_examples/comparison_request.json
The mock client returns predefined responses based on the request type, allowing you to test the pipeline's functionality without incurring API costs or requiring internet connectivity. This is especially useful when:
- You don't have an active internet connection
- You want to avoid API costs during development
- You're running automated tests in a CI/CD pipeline
- You're experiencing timeout issues with the real APIs
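Conceptually, a mock client of this kind just returns canned responses keyed by request type. A minimal sketch (class, method, and response shapes are hypothetical, not the project's actual mock_llm_client.py):

```python
class MockLLMClient:
    """Stand-in for a real LLM API client that returns predefined responses.

    Hypothetical sketch mirroring the idea of the project's mock client,
    not its actual interface.
    """

    # Predefined responses keyed by request type
    CANNED_RESPONSES = {
        "extract": {"medications": [{"name": "Lisinopril", "dose": "10mg"}]},
        "compare": {"differences": [], "similarity": 1.0},
    }

    def __init__(self):
        self.calls = []  # record calls so tests can assert on them

    def complete(self, prompt, request_type="extract"):
        self.calls.append((request_type, prompt))
        return self.CANNED_RESPONSES.get(request_type, {})

client = MockLLMClient()
result = client.complete("Extract medications from this note.", "extract")
print(result["medications"][0]["name"])  # Lisinopril, with no API call made
```

Because the mock exposes the same call shape as a real client, the rest of the pipeline can be exercised unchanged, and the recorded calls let tests verify that the right prompts were sent.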
If you encounter issues with the API:
- Swagger UI not loading: Try refreshing the page or ensure the server is still running
- API Timeouts: For large clinical notes, you might experience timeouts with the LLM APIs. Try:
  - Using the mock client for testing
  - Breaking down the note into smaller sections
  - Increasing the timeout settings in your requests
- API Key Issues: Verify your API keys are correctly set in the .env file
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI and Anthropic for their powerful language models
- FastAPI for the web framework
- The medical NLP research community for inspiration and best practices
This project was created by Nicole LeGuern (@CodeQueenie). If you use or modify this code, please provide attribution to the original author.