MedVidDeID is a comprehensive medical data de-identification pipeline that removes Protected Health Information (PHI) from multi-modal medical data: video, audio, and text.
- Multi-modal De-identification: Comprehensive PHI removal from video, audio, and text
- Security-Hardened: Path validation, input sanitization, and thread-safe operations
- HIPAA Compliance: Complete audit trails with artifact lineage tracking
- Pipeline Orchestration: Automated workflow with error handling and recovery
- Artifact Management: SHA256 checksums, metadata tracking, and immutable storage
- AI-Powered: YOLO pose detection, WhisperX transcription, and NLP-based PHI detection
- pipeline: Central artifact management with audit trails
- video_deid: Face blurring and pose-based de-identification using YOLO
- audio_deid: Speech transcription (WhisperX) and PHI audio scrubbing
- philter-ucsf: Text de-identification with 2,260+ PHI detection patterns
# Required dependencies
conda create -n deid python=3.9
conda activate deid
# For video processing
pip install opencv-python torch torchvision
# For audio transcription (optional)
pip install whisperx
# For text processing
pip install spacy transformers
# Clone with all submodules
git clone --recursive https://github.com/kbjohnson-penn/MedVidDeID.git
cd MedVidDeID
# Install pipeline requirements
pip install -r pipeline/requirements.txt
# Setup video_deid module
cd video_deid
pip install -r requirements.txt
pip install -e .
cd ..
# Setup audio_deid module
cd audio_deid
pip install -r requirements.txt
cd ..
# Setup philter-ucsf module
cd philter-ucsf
pip install -r requirements.txt
cd ..
# Process a medical video through complete pipeline
python simple_example.py patient_consultation.mp4
# Output structure:
# output/patient_consultation/
# ├── deid_12345678.mp4 # De-identified video
# ├── keypoints_12345678.csv # Pose detection data
# ├── transcript.json # Speech transcription
# ├── scrubbed_audio.mp3 # PHI-removed audio
# └── artifacts/ # Complete audit trail
from pipeline.simple_integration import process_video_pipeline
from pathlib import Path
# Programmatic access
result = process_video_pipeline(
    video_path=Path("input_video.mp4"),
    output_dir=Path("./secure_output")
)

if result["success"]:
    print(f"De-identified video: {result['output_video']}")
    print(f"Artifacts: {result['artifacts']}")
    print(f"Processing run: {result['run_id']}")
- Technology: YOLOv8 pose detection, Kalman filter tracking
- Features: Face blurring, skeleton overlay, temporal consistency (a rough blurring sketch follows the commands below)
- Input: MP4, AVI, MOV video files
- Output: De-identified video with preserved audio
# Video-only processing
cd video_deid
python -m video_deid.cli --operation_type extract --video input.mp4 --keypoints_csv output.csv
python -m video_deid.cli --operation_type deid --video input.mp4 --keypoints_csv output.csv --output deid.mp4
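The module's blurring is driven by YOLO keypoints with Kalman-filter smoothing; the sketch below is only a rough illustration of the idea, blurring a fixed box around per-frame face coordinates with OpenCV. The blur_faces helper and the frame/face_x/face_y column names are hypothetical, not the module's actual API or keypoints CSV schema, and audio handling is omitted.

# Rough illustration only (not video_deid's implementation): blur a fixed-size
# region around hypothetical per-frame face coordinates. Audio is not preserved here.
import cv2
import pandas as pd

def blur_faces(video_in: str, keypoints_csv: str, video_out: str, box: int = 80) -> None:
    points = pd.read_csv(keypoints_csv)      # assumed columns: frame, face_x, face_y
    by_frame = points.groupby("frame")

    cap = cv2.VideoCapture(video_in)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(video_out, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx in by_frame.groups:
            for _, row in by_frame.get_group(idx).iterrows():
                x, y = int(row["face_x"]), int(row["face_y"])
                x0, y0 = max(x - box, 0), max(y - box, 0)
                x1, y1 = min(x + box, w), min(y + box, h)
                roi = frame[y0:y1, x0:x1]
                if roi.size:
                    frame[y0:y1, x0:x1] = cv2.GaussianBlur(roi, (51, 51), 0)
        writer.write(frame)
        idx += 1

    cap.release()
    writer.release()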
- Technology: WhisperX speech-to-text, regex-based PHI detection
- Features: Word-level timestamp alignment, beep replacement (interval scrubbing sketched after the commands below)
- Input: MP3, WAV, MP4 audio/video files
- Output: Transcript JSON, PHI intervals, scrubbed audio
# Audio transcription and de-identification
cd audio_deid
python transcribe.py --audio input.mp3 --output_format json
python scrub.py --source input.mp3 --json phi_intervals.json --output scrubbed.mp3
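As a rough illustration of interval-based scrubbing (not scrub.py's actual implementation), the sketch below overwrites PHI time spans with a 1 kHz tone using pydub. The assumed phi_intervals.json shape, a list of {"start", "end"} offsets in seconds, is hypothetical and may differ from what the module actually emits.

# Rough illustration only: splice a tone over each PHI interval with pydub
# (requires ffmpeg). The intervals format below is an assumption.
import json
from pydub import AudioSegment
from pydub.generators import Sine

def scrub(source: str, intervals_json: str, output: str) -> None:
    audio = AudioSegment.from_file(source)
    with open(intervals_json) as f:
        intervals = json.load(f)            # assumed: [{"start": 12.4, "end": 13.1}, ...]

    # Splice in reverse order so earlier offsets stay valid after each edit.
    for span in sorted(intervals, key=lambda s: s["start"], reverse=True):
        start_ms, end_ms = int(span["start"] * 1000), int(span["end"] * 1000)
        beep = Sine(1000).to_audio_segment(duration=end_ms - start_ms).apply_gain(-6)
        audio = audio[:start_ms] + beep + audio[end_ms:]

    audio.export(output, format="mp3")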
- Technology: 2,260+ regex patterns, NLP context analysis
- Features: Medical terminology preservation, configurable redaction
- Input: JSON, TSV, TXT medical documents
- Output: De-identified text with PHI markers (a toy redaction sketch follows the commands below)
# Text de-identification
cd philter-ucsf
python main_format.py -i transcript.json -o deidentified.json -f json
python main.py -i notes/ -o output/ -f ./configs/philter_delta.json --prod=True
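Philter's actual pattern set and context analysis are far more extensive; the toy sketch below only illustrates the regex-plus-marker redaction idea, with patterns invented for the example.

# Toy illustration of regex-based redaction with PHI markers; philter-ucsf
# applies thousands of curated patterns plus whitelist/context checks.
import re

PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "MRN":   re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
    "DATE":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Pt seen on 03/14/2024, MRN: 00123456, callback 215-555-0182."))
# Pt seen on [DATE], [MRN], callback [PHONE].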
- Path Validation: Prevents directory traversal attacks (see the sketch after this list)
- Input Sanitization: Regex-based filename and path cleaning
- Symlink Protection: Blocks symlink-based security bypasses
- Process Timeouts: Prevents resource exhaustion attacks
- Thread Safety: Concurrent artifact operations with proper locking
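A minimal sketch of the path-validation idea, using a hypothetical validate_path helper rather than the pipeline's actual API: resolve the candidate, reject symlinks, and require the result to stay inside an allowed base directory.

# Hypothetical helper illustrating directory-traversal and symlink checks.
from pathlib import Path

def validate_path(candidate: str, base_dir: str) -> Path:
    base = Path(base_dir).resolve()
    path = Path(candidate)
    if path.is_symlink():
        raise ValueError(f"Symlinks are not allowed: {candidate}")
    resolved = path.resolve()
    if not resolved.is_relative_to(base):   # Path.is_relative_to requires Python 3.9+
        raise ValueError(f"Path escapes {base}: {candidate}")
    return resolved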
- Audit Trails: Complete operation logging in structured JSON format
- Artifact Lineage: Full provenance tracking from input to output
- Data Integrity: SHA256 checksums for all stored artifacts (see the sketch after this list)
- Access Logging: Tracks all artifact access attempts
- Log Rotation: Automatic audit log management with size-based rotation
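A minimal sketch of the checksum and audit-trail idea, using hypothetical helpers rather than the pipeline's actual artifact API: hash each stored file with SHA256 and append one structured JSON object per line (JSONL) to the audit log.

# Hypothetical helpers illustrating artifact hashing and JSONL audit logging.
import hashlib
import json
import time
from pathlib import Path

def sha256_of(path: Path) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def log_event(audit_file: Path, action: str, artifact: Path) -> None:
    entry = {
        "timestamp": time.time(),
        "action": action,                  # e.g. "store" or "access"
        "artifact": str(artifact),
        "sha256": sha256_of(artifact),
    }
    with open(audit_file, "a") as f:
        f.write(json.dumps(entry) + "\n")  # one JSON object per line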
output/video_name/
├── deid_12345678.mp4 # Final de-identified video
├── keypoints_12345678.csv # YOLO pose detection data
├── audio_12345678.mp3 # Extracted audio track
├── transcript.json # WhisperX speech transcription
├── phi_intervals.json # PHI time segments for audio
├── scrubbed_audio.mp3 # De-identified audio
├── deidentified_transcript.json # PHI-removed transcript
└── artifacts/
├── storage/ # Immutable artifact files
│ ├── VIDEO_RAW/
│ ├── VIDEO_KEYPOINTS/
│ ├── VIDEO_DEID/
│ ├── AUDIO_RAW/
│ ├── AUDIO_TRANSCRIPT/
│ ├── AUDIO_PHI_INTERVALS/
│ ├── AUDIO_DEID/
│ ├── TEXT_RAW/
│ └── TEXT_DEID/
├── metadata/ # JSON metadata for each artifact
└── audit/
├── audit.jsonl # Complete operation audit trail
└── runs/ # Processing run records
- CPU: Multi-core recommended for video processing
- GPU: CUDA-capable GPU for optimal YOLO/WhisperX performance
- Memory: 8GB+ RAM for processing large medical videos
- Storage: 2-3x input file size for artifacts and processing
- Video De-identification: 1-2x real-time (processing time relative to video duration; varies by resolution)
- Audio Transcription: 0.1-0.3x real-time with GPU
- Text De-identification: Near real-time for typical documents
# Install development dependencies
conda activate deid
pip install -r requirements-dev.txt
# Code formatting, linting, and type checking (video_deid module)
cd video_deid
black video_deid/
isort video_deid/
flake8 video_deid/
mypy video_deid/