Dedicated to the memory of Carlito Cross Madhouse Live
Advanced audio/video processing pipeline for multi-speaker conversation analysis and content extraction.
Built on an extension-based architecture, featuring direct transcription processing, LLM-powered analysis, and production-grade reliability.
# Process audio files
python pipeline_orchestrator.py input_audio/
# Process video with frame analysis
python pipeline_orchestrator.py video.mp4
# Process from URL (YouTube, etc.)
python pipeline_orchestrator.py --url https://youtube.com/watch?v=...
# Resume interrupted run
python pipeline_orchestrator.py --resume --output-folder outputs/run-20250702-120000
13-stage processing pipeline with extension-based architecture (40+ extensions); the stage order is also sketched as code after the list below.
- ingestion - File anonymization, PII removal
- video_analysis - Frame analysis (Moondream2 VLM)
- separation - Vocal/instrumental separation
- music_muting - Music detection and removal (CLAP)
- remix - Audio mixing, channel processing
- call_tones - Organ tone appending
- diarization - Speaker identification (PyAnnote)
- speaker_segmentation - Audio segmentation by speaker
- resampling - 16kHz conversion for ASR
- transcription - Direct speech-to-text (Whisper)
- soundbite_finalization - Segment processing
- llm - LLM-powered content analysis
- finalization - MP3 conversion with metadata
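The stage order above can also be expressed as plain data. This is a minimal sketch that assumes the orchestrator tracks stages as an ordered list of names; the real registry, stage classes, and resume logic may differ:

```python
# Hypothetical sketch: the 13 stages in execution order, exactly as listed above.
PIPELINE_STAGES = [
    "ingestion", "video_analysis", "separation", "music_muting", "remix",
    "call_tones", "diarization", "speaker_segmentation", "resampling",
    "transcription", "soundbite_finalization", "llm", "finalization",
]

def remaining_stages(resume_from: str) -> list:
    """Stages still to run when resuming from a given stage (illustrative only)."""
    return PIPELINE_STAGES[PIPELINE_STAGES.index(resume_from):]

# remaining_stages("llm") -> ["llm", "finalization"]
```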
- Direct Transcription: Eliminated complex consolidation system for faster, more reliable processing
- Predictable File Naming: Consistent XXXX-SSSS-EEEE format across all workflows (a small parsing sketch follows these lists)
- Enhanced Reliability: No more consolidation/splitting cycles that disrupted transcript alignment
- Better Performance: Reduced complexity and overhead for streamlined processing
- Improved Debugging: Linear processing with clear error messages
- CarlitoCrosstalkEnhanced: Advanced humor detection with conversation analysis
- Content Analysis: Intelligent categorization and sentiment analysis
- Audio Compilation: Automated highlight reel generation from LLM insights
- Robust Error Handling: Comprehensive failure recovery
- Resume Capability: Bulletproof continuation from any stage
- Format Standardization: 44.1kHz CD quality output throughout
- Zero Technical Debt: Clean codebase with simplified architecture
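The XXXX-SSSS-EEEE convention can be checked mechanically. This helper is a sketch: it only validates the three zero-padded fields; what each field encodes is an assumption here, not documented behavior:

```python
import re

# Hypothetical helper for the XXXX-SSSS-EEEE soundbite naming scheme.
# Only the shape comes from the format above; field meanings are an assumption.
_NAME_RE = re.compile(r"^(\d{4})-(\d{4})-(\d{4})$")

def parse_soundbite_name(stem: str):
    """Return the three numeric fields of a name like '0001-0002-0003'."""
    match = _NAME_RE.match(stem)
    if match is None:
        raise ValueError(f"not in XXXX-SSSS-EEEE format: {stem!r}")
    return tuple(int(field) for field in match.groups())
```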
outputs/run-YYYYMMDD-HHMMSS/
├── manifest.json                    # Processing metadata
├── pipeline_state.json              # Resume state
├── soundbites/
│   ├── 0000/                        # Call segments
│   │   ├── left-vocals/             # Left channel speakers
│   │   │   └── speaker_00/
│   │   └── right-vocals/            # Right channel speakers
│   │       └── speaker_01/
│   └── 0000_master_transcript.txt
├── finalized/
│   ├── show/                        # Complete show output
│   │   ├── show.mp3
│   │   └── show_notes.txt
│   └── calls/                       # Individual segments
│       └── 0001-title/
│           ├── call.mp3
│           ├── transcript.txt
│           └── soundbites/
└── llm/                             # Analysis results
    └── content_analysis.json
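Before resuming a run, the two JSON files at the top of the run folder can be inspected directly. The file locations come from the layout above; the keys read below (such as a completed-stages entry) are assumptions for illustration, not a documented schema:

```python
import json
from pathlib import Path

def summarize_run(run_dir: str) -> None:
    """Print run metadata from a run folder (keys shown are assumptions)."""
    run = Path(run_dir)
    manifest = json.loads((run / "manifest.json").read_text())
    state = json.loads((run / "pipeline_state.json").read_text())
    print("manifest keys:", sorted(manifest))
    print("state keys:   ", sorted(state))
    # A 'completed_stages' entry is assumed here purely for illustration.
    print("completed stages:", state.get("completed_stages", "<not present>"))

# summarize_run("outputs/run-20250702-120000")
```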
- input_path - Input file/directory/URL
- --output-folder - Specify output directory
- --debug - Enable debug logging
- --resume - Resume from existing run
- --resume-from STAGE - Resume from specific stage
- --force-rerun STAGE - Force re-run stage
- --clear-from STAGE - Clear completion status
- --asr_engine [whisper|parakeet] - ASR engine selection
- --separation-model MODEL - Separation model choice
- --show - Enable CLAP segmentation for shows
- --call-tones - Add organ tones to calls
- --max-speakers N - Maximum speakers per channel
- --full-processing - Disable call tuple optimization
- --llm-config PATH - LLM configuration file
- --minimal - Skip expensive stages for quick testing
- --enable-gpu-optimizations - GPU acceleration
- RAM: 16GB+ (32GB recommended)
- Storage: 50GB+ free space
- GPU: CUDA-capable (RTX 3060+ recommended)
- CPU: Multi-core (8+ threads recommended)
# Core ML frameworks
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Audio/video processing
pip install transformers accelerate pyannote.audio
pip install openai-whisper yt-dlp soundfile
# LLM integration
pip install requests anthropic openai
# System requirements
# FFmpeg (system package)
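A quick post-install check can confirm that the heavy dependencies import and that CUDA and FFmpeg are visible. This sketch uses only standard APIs from the packages listed above:

```python
# Minimal post-install sanity check for the dependencies listed above.
import shutil
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

for module in ("whisper", "pyannote.audio", "soundfile", "transformers"):
    try:
        __import__(module)
        print(module, "ok")
    except ImportError as exc:
        print(module, "MISSING:", exc)

# FFmpeg is a system package; confirm it is on PATH.
print("ffmpeg:", shutil.which("ffmpeg") or "not found on PATH")
```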
- 44.1kHz CD Quality: Professional audio output standard
- Multi-Speaker Diarization: Advanced speaker separation and identification
- Format-Matched Processing: Intelligent audio format detection and conversion
- Smart Optimization: Call tuple detection with intelligent processing
- Three-Tier System: Left, right, and master transcripts
- [XTALK] Detection: Automatic crosstalk identification (illustrated in the sketch below)
- Channel-Aware Processing: Single-file vs. call tuple workflows
- Chronological Organization: Timestamped conversation flow
- Extension Architecture: Modular, maintainable design
- State Management: Directory-based processing state
- Resume Capability: Bulletproof continuation from interruptions
- Comprehensive Testing: 142 tests with 0 import errors
- GPU Acceleration: 10-400x speedup on compatible hardware
- Smart Processing: Skip redundant operations for call tuples
- Memory Management: Intelligent resource utilization
- Direct Processing: Simplified workflow without consolidation overhead
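As a toy illustration of the three-tier transcript idea, left- and right-channel segments can be interleaved by start time, with cross-channel overlaps flagged as [XTALK]. The segment shape and the overlap rule here are assumptions, not the pipeline's actual algorithm:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float        # seconds
    end: float
    channel: str        # "left" or "right"
    speaker: str
    text: str

def merge_master(left, right):
    """Toy chronological merge of both channels; cross-channel overlap gets an [XTALK] tag."""
    segments = sorted(left + right, key=lambda s: s.start)
    lines = []
    for seg in segments:
        overlaps = any(
            other is not seg
            and other.channel != seg.channel
            and other.start < seg.end
            and seg.start < other.end
            for other in segments
        )
        tag = " [XTALK]" if overlaps else ""
        lines.append(f"[{seg.start:07.2f}] {seg.speaker}{tag}: {seg.text}")
    return lines
```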
# Stage already completed
python pipeline_orchestrator.py --force-rerun transcription --resume
# Memory issues
python pipeline_orchestrator.py --max-speakers 2 input.mp4
# GPU errors
python pipeline_orchestrator.py --asr_engine whisper input.wav
# Empty transcripts
python pipeline_orchestrator.py --force-rerun transcription --full-processing
# Check pipeline status
python pipeline_orchestrator.py --show-resume-status --output-folder outputs/run-*
# Detailed logging
python pipeline_orchestrator.py --debug --resume --output-folder outputs/run-*
Extensions provide stage-specific processing with auto-registration:
Key Extensions:
- transcription_extension.py - Direct ASR processing
- llm_processing_extension.py - Content analysis
- carlito_crosstalk_enhanced.py - Humor detection
- call_tones_extension.py - Audio enhancement
Development Pattern:
class MyExtension(ExtensionBase):
    name = "my_extension"
    stage = "my_stage"

    def run(self) -> None:
        # Extension logic
        pass
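Auto-registration itself is not shown above; one common way to achieve it is a registry filled in from __init_subclass__. This is a hedged sketch of that pattern, not the project's actual ExtensionBase:

```python
# Illustrative auto-registration pattern; the real ExtensionBase may differ.
EXTENSION_REGISTRY = {}

class ExtensionBase:
    name = ""
    stage = ""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if cls.name:                       # register concrete extensions on import
            EXTENSION_REGISTRY[cls.name] = cls

    def run(self) -> None:
        raise NotImplementedError

class MyExtension(ExtensionBase):
    name = "my_extension"
    stage = "my_stage"

    def run(self) -> None:
        print(f"running {self.name} (stage: {self.stage})")

assert "my_extension" in EXTENSION_REGISTRY
```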
Technical Documentation:
- memory-bank/projectbrief.md - System architecture
- memory-bank/techContext.md - Technology stack
- memory-bank/activeContext.md - Recent changes
- memory-bank/progress.md - Feature status
For detailed technical information, troubleshooting, and development guides, see the memory-bank directory.