Multimedia context generation tool using off-the-shelf components. Leverages several local ML/AI tools to accomplish transcription, context clues, and llm-driven tasks. Designed with extensibility in mind. Dataset preparation tool. Adds context to video and audio inputs.

akspa0/The-Machine

The-Machine

Dedicated to the memory of Carlito Cross Madhouse Live


Advanced audio/video processing pipeline for multi-speaker conversation analysis and content extraction.

Built on an extension-based architecture, with direct transcription processing, LLM-powered analysis, and stage-level resume for interrupted runs.

Quick Start

# Process audio files
python pipeline_orchestrator.py input_audio/

# Process video with frame analysis
python pipeline_orchestrator.py video.mp4

# Process from URL (YouTube, etc.)
python pipeline_orchestrator.py --url https://youtube.com/watch?v=...

# Resume interrupted run
python pipeline_orchestrator.py --resume --output-folder outputs/run-20250702-120000

Pipeline Architecture

13-stage processing pipeline with extension-based architecture (40+ extensions)

  1. ingestion - File anonymization, PII removal
  2. video_analysis - Frame analysis (Moondream2 VLM)
  3. separation - Vocal/instrumental separation
  4. music_muting - Music detection and removal (CLAP)
  5. remix - Audio mixing, channel processing
  6. call_tones - Appends organ tones to calls
  7. diarization - Speaker identification (PyAnnote)
  8. speaker_segmentation - Audio segmentation by speaker
  9. resampling - 16kHz conversion for ASR
  10. transcription - Direct speech-to-text (Whisper)
  11. soundbite_finalization - Segment processing
  12. llm - LLM-powered content analysis
  13. finalization - MP3 conversion with metadata
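
Because the stages run in a fixed order, resuming from a given stage amounts to slicing that order. The following is a minimal sketch, not the project's code: the stage names are copied from the list above, while `STAGES` and `stages_from` are hypothetical names introduced for illustration.

```python
# Stage names copied from the README list; the helper is a hypothetical sketch.
STAGES = [
    "ingestion", "video_analysis", "separation", "music_muting", "remix",
    "call_tones", "diarization", "speaker_segmentation", "resampling",
    "transcription", "soundbite_finalization", "llm", "finalization",
]


def stages_from(start: str) -> list[str]:
    """Return `start` and every stage after it, in pipeline order."""
    if start not in STAGES:
        raise ValueError(f"unknown stage: {start!r}")
    return STAGES[STAGES.index(start):]


if __name__ == "__main__":
    # Stages that would still run when resuming at transcription.
    print(stages_from("transcription"))
```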

Recent Major Improvements

Simplified Processing Architecture (July 2025)

  • Direct Transcription: Eliminated complex consolidation system for faster, more reliable processing
  • Predictable File Naming: Consistent XXXX-SSSS-EEEE format across all workflows
  • Enhanced Reliability: No more consolidation/splitting cycles that disrupted transcript alignment
  • Better Performance: Reduced complexity and overhead for streamlined processing
  • Improved Debugging: Linear processing with clear error messages
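
The README does not define the fields of the XXXX-SSSS-EEEE format. The sketch below assumes XXXX is a segment index and SSSS/EEEE are start and end offsets, all zero-padded integers in unspecified units. `parse_segment_name` and `format_segment_name` are hypothetical helpers, not part of the project.

```python
import re
from typing import NamedTuple

# Assumed layout: <index>-<start>-<end>, each a zero-padded integer of 4+ digits.
_NAME_RE = re.compile(r"^(\d{4,})-(\d{4,})-(\d{4,})")


class SegmentName(NamedTuple):
    index: int
    start: int
    end: int


def parse_segment_name(stem: str) -> SegmentName:
    """Parse a file stem such as '0003-0012-0047' into its three fields."""
    match = _NAME_RE.match(stem)
    if match is None:
        raise ValueError(f"not in XXXX-SSSS-EEEE form: {stem!r}")
    index, start, end = (int(group) for group in match.groups())
    if end < start:
        raise ValueError(f"end precedes start in {stem!r}")
    return SegmentName(index, start, end)


def format_segment_name(index: int, start: int, end: int) -> str:
    """Build a stem with each field zero-padded to at least four digits."""
    return f"{index:04d}-{start:04d}-{end:04d}"


if __name__ == "__main__":
    print(parse_segment_name("0003-0012-0047"))
```

Zero-padded fields keep lexicographic and numeric ordering aligned, which is one plausible reason to prefer a fixed-width scheme.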

LLM-Powered Extensions

  • CarlitoCrosstalkEnhanced: Advanced humor detection with conversation analysis
  • Content Analysis: Intelligent categorization and sentiment analysis
  • Audio Compilation: Automated highlight reel generation from LLM insights

Production Stability

  • Robust Error Handling: Comprehensive failure recovery
  • Resume Capability: Continue an interrupted run from any stage
  • Format Standardization: 44.1kHz CD quality output throughout
  • Zero Technical Debt: Clean codebase with simplified architecture

Output Structure

outputs/run-YYYYMMDD-HHMMSS/
├── manifest.json              # Processing metadata
├── pipeline_state.json        # Resume state
├── soundbites/
│   ├── 0000/                  # Call segments
│   │   ├── left-vocals/       # Left channel speakers
│   │   │   └── speaker_00/
│   │   └── right-vocals/      # Right channel speakers
│   │       └── speaker_01/
│   └── 0000_master_transcript.txt
├── finalized/
│   ├── show/                  # Complete show output
│   │   ├── show.mp3
│   │   └── show_notes.txt
│   └── calls/                 # Individual segments
│       └── 0001-title/
│           ├── call.mp3
│           ├── transcript.txt
│           └── soundbites/
└── llm/                       # Analysis results
    └── content_analysis.json
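
A run folder laid out as above can be inspected with the standard library alone. This is a rough sketch: `summarize_run` is a hypothetical helper, and nothing is assumed about the contents of `manifest.json` beyond it being valid JSON.

```python
import json
from pathlib import Path


def _names(directory: Path, pattern: str, dirs_only: bool = False) -> list[str]:
    """Sorted entry names in `directory` matching `pattern` (empty if absent)."""
    if not directory.is_dir():
        return []
    return sorted(
        p.name for p in directory.glob(pattern) if p.is_dir() or not dirs_only
    )


def summarize_run(run_dir: Path) -> dict:
    """Collect the manifest, master transcripts, and finalized calls of a run."""
    manifest_path = run_dir / "manifest.json"
    manifest = None
    if manifest_path.is_file():
        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    return {
        "manifest": manifest,
        "master_transcripts": _names(run_dir / "soundbites", "*_master_transcript.txt"),
        "calls": _names(run_dir / "finalized" / "calls", "*", dirs_only=True),
    }


if __name__ == "__main__":
    # The path below reuses the example run name from Quick Start.
    print(summarize_run(Path("outputs/run-20250702-120000")))
```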

Command Options

Basic Usage

  • input_path - Input file/directory/URL
  • --output-folder - Specify output directory
  • --debug - Enable debug logging

Resume & Control

  • --resume - Resume from existing run
  • --resume-from STAGE - Resume from specific stage
  • --force-rerun STAGE - Force re-run stage
  • --clear-from STAGE - Clear completion status
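
One way to read these flags is as operations on a set of completed stages. The sketch below is an assumption about the semantics, not the project's implementation. In particular, it assumes `--clear-from` clears the named stage and every later one, and `pending` and `clear_from` are hypothetical helper names.

```python
def clear_from(completed: set[str], order: list[str], stage: str) -> set[str]:
    """Drop completion marks for `stage` and every stage after it (assumed semantics)."""
    cutoff = order.index(stage)
    return {s for s in completed if order.index(s) < cutoff}


def pending(completed: set[str], order: list[str], force_rerun: tuple = ()) -> list[str]:
    """Stages still to run: anything not completed, plus any forced stage."""
    return [s for s in order if s not in completed or s in force_rerun]


if __name__ == "__main__":
    # Shortened stage order, for illustration only.
    order = ["resampling", "transcription", "llm", "finalization"]
    done = {"resampling", "transcription", "llm"}
    print(pending(done, order))
    print(pending(done, order, force_rerun=("transcription",)))
    print(sorted(clear_from(done, order, "transcription")))
```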

Processing Options

  • --asr_engine [whisper|parakeet] - ASR engine selection
  • --separation-model MODEL - Separation model choice
  • --show - Enable CLAP segmentation for shows
  • --call-tones - Add organ tones to calls
  • --max-speakers N - Maximum speakers per channel
  • --full-processing - Disable call tuple optimization

Advanced

  • --llm-config PATH - LLM configuration file
  • --minimal - Skip expensive stages for quick testing
  • --enable-gpu-optimizations - GPU acceleration
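
For orientation, here is how a subset of these flags might be declared with `argparse`. The flag names come from the lists above; the parser itself is a hypothetical sketch, and defaults are left unset because the README does not state them.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare a subset of the documented flags (sketch, not the real parser)."""
    parser = argparse.ArgumentParser(prog="pipeline_orchestrator.py")
    # Optional because Quick Start shows --url and --resume runs without it.
    parser.add_argument("input_path", nargs="?", help="Input file, directory, or URL")
    parser.add_argument("--output-folder", help="Output directory")
    parser.add_argument("--debug", action="store_true", help="Enable debug logging")
    parser.add_argument("--resume", action="store_true", help="Resume an existing run")
    parser.add_argument("--resume-from", metavar="STAGE")
    parser.add_argument("--force-rerun", metavar="STAGE")
    parser.add_argument("--clear-from", metavar="STAGE")
    parser.add_argument("--asr_engine", choices=["whisper", "parakeet"])
    parser.add_argument("--max-speakers", type=int, metavar="N")
    parser.add_argument("--call-tones", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["input.wav", "--asr_engine", "whisper", "--max-speakers", "2"]
    )
    print(args.input_path, args.asr_engine, args.max_speakers)
```

Note that `argparse` maps hyphens to underscores in attribute names, so `--max-speakers` is read back as `args.max_speakers`.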

Technical Requirements

Hardware

  • RAM: 16GB+ (32GB recommended)
  • Storage: 50GB+ free space
  • GPU: CUDA-capable (RTX 3060+ recommended)
  • CPU: Multi-core (8+ threads recommended)

Software Dependencies

# Core ML frameworks
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Audio/video processing
pip install transformers accelerate pyannote.audio
pip install openai-whisper yt-dlp soundfile

# LLM integration
pip install requests anthropic openai

# System requirements
# FFmpeg (system package)
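
A quick pre-flight check can confirm that the packages above and FFmpeg are visible before starting a long run. This sketch uses only the standard library. The import names (`whisper` for openai-whisper, `yt_dlp` for yt-dlp) are those packages' usual module names, and `missing_dependencies` is a hypothetical helper.

```python
import importlib.util
import shutil

# Import names for the pip packages listed above.
REQUIRED_MODULES = ["torch", "transformers", "pyannote.audio", "whisper", "yt_dlp", "soundfile"]


def missing_dependencies(modules=REQUIRED_MODULES, executables=("ffmpeg",)) -> list[str]:
    """Return the modules and executables that cannot be located."""
    missing = []
    for name in modules:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a dotted name's parent package is absent.
            found = False
        if not found:
            missing.append(name)
    missing.extend(exe for exe in executables if shutil.which(exe) is None)
    return missing


if __name__ == "__main__":
    problems = missing_dependencies()
    print("all dependencies found" if not problems else f"missing: {problems}")
```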

Key Features

Audio Processing Excellence

  • 44.1kHz CD Quality: Professional audio output standard
  • Multi-Speaker Diarization: Advanced speaker separation and identification
  • Format-Matched Processing: Intelligent audio format detection and conversion
  • Smart Optimization: Call tuple detection with intelligent processing

Transcript Intelligence

  • Three-Tier System: Left, right, and master transcripts
  • [XTALK] Detection: Automatic crosstalk identification
  • Channel-Aware Processing: Single-file vs. call tuple workflows
  • Chronological Organization: Timestamped conversation flow
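
To illustrate how left and right transcripts could combine into a chronological master with crosstalk markers, here is a minimal sketch. It assumes each segment carries a channel, start and end times in seconds, and text. The `Segment` type, the output line format, and the overlap rule are assumptions rather than the project's actual logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    channel: str   # "left" or "right"
    start: float   # seconds
    end: float     # seconds
    text: str


def overlaps(a: Segment, b: Segment) -> bool:
    """True when the two time intervals share any duration."""
    return a.start < b.end and b.start < a.end


def build_master(left: list[Segment], right: list[Segment]) -> list[str]:
    """Merge both channels by start time, tagging cross-channel overlaps."""
    merged = sorted([*left, *right], key=lambda s: (s.start, s.channel))
    lines = []
    for seg in merged:
        others = right if seg.channel == "left" else left
        tag = " [XTALK]" if any(overlaps(seg, o) for o in others) else ""
        lines.append(f"[{seg.start:07.2f}] {seg.channel}:{tag} {seg.text}")
    return lines


if __name__ == "__main__":
    left = [Segment("left", 0.0, 2.0, "hello"), Segment("left", 5.0, 6.0, "bye")]
    right = [Segment("right", 1.5, 3.0, "hi")]
    print("\n".join(build_master(left, right)))
```

The pairwise overlap test is quadratic in the number of segments, which is fine for a single call but would warrant a sweep-line approach for very long recordings.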

Infrastructure & Reliability

  • Extension Architecture: Modular, maintainable design
  • State Management: Directory-based processing state
  • Resume Capability: Continue after an interruption without restarting the run
  • Comprehensive Testing: 142 tests with 0 import errors

Performance Optimizations

  • GPU Acceleration: 10-400x speedup on compatible hardware
  • Smart Processing: Skip redundant operations for call tuples
  • Memory Management: Intelligent resource utilization
  • Direct Processing: Simplified workflow without consolidation overhead

Troubleshooting

Common Issues

# Stage already completed
python pipeline_orchestrator.py --force-rerun transcription --resume

# Memory issues
python pipeline_orchestrator.py --max-speakers 2 input.mp4

# GPU errors
python pipeline_orchestrator.py --asr_engine whisper input.wav

# Empty transcripts
python pipeline_orchestrator.py --force-rerun transcription --full-processing

Debug Commands

# Check pipeline status
python pipeline_orchestrator.py --show-resume-status --output-folder outputs/run-*

# Detailed logging
python pipeline_orchestrator.py --debug --resume --output-folder outputs/run-*
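
When several runs exist, the `outputs/run-*` pattern may match more than one folder. Since run folders are named `run-YYYYMMDD-HHMMSS`, the newest one can be found by sorting on the name. `latest_run` below is a hypothetical helper sketched for that purpose.

```python
import re
from pathlib import Path
from typing import Optional

# Matches the run-YYYYMMDD-HHMMSS naming shown under Output Structure.
RUN_DIR_RE = re.compile(r"^run-\d{8}-\d{6}$")


def latest_run(outputs: Path) -> Optional[Path]:
    """Return the most recent run folder under `outputs`, or None if there is none."""
    if not outputs.is_dir():
        return None
    runs = [p for p in outputs.iterdir() if p.is_dir() and RUN_DIR_RE.match(p.name)]
    # Fixed-width timestamps sort chronologically as plain strings.
    return max(runs, key=lambda p: p.name, default=None)


if __name__ == "__main__":
    print(latest_run(Path("outputs")))
```

The resulting path can then be passed to `--output-folder` in place of the glob.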

Extension Development

Extensions provide stage-specific processing with auto-registration:

Key Extensions:

  • transcription_extension.py - Direct ASR processing
  • llm_processing_extension.py - Content analysis
  • carlito_crosstalk_enhanced.py - Humor detection
  • call_tones_extension.py - Audio enhancement

Development Pattern:

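# ExtensionBase comes from the project's extension framework;
# its import path is not shown in this README.
class MyExtension(ExtensionBase):
    name = "my_extension"   # Extension identifier
    stage = "my_stage"      # Pipeline stage this extension attaches to

    def run(self) -> None:
        # Extension logic
        pass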
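
The README does not show how auto-registration works. One common Python approach is `__init_subclass__`, sketched below with a stand-in base class. `SketchExtensionBase`, `REGISTRY`, and `run_stage` are hypothetical names, and the project's real `ExtensionBase` may register extensions differently.

```python
from collections import defaultdict

# Maps a stage name to the extension classes registered for it.
REGISTRY: dict[str, list[type]] = defaultdict(list)


class SketchExtensionBase:
    """Stand-in for the project's ExtensionBase (assumption, not the real class)."""

    name: str = ""
    stage: str = ""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Register any subclass that declares both a name and a stage.
        if cls.name and cls.stage:
            REGISTRY[cls.stage].append(cls)

    def run(self) -> None:
        raise NotImplementedError


class MyExtension(SketchExtensionBase):
    name = "my_extension"
    stage = "my_stage"

    def run(self) -> None:
        print(f"running {self.name}")


def run_stage(stage: str) -> list[str]:
    """Instantiate and run every extension registered for `stage`."""
    ran = []
    for ext_cls in REGISTRY.get(stage, []):
        ext_cls().run()
        ran.append(ext_cls.name)
    return ran


if __name__ == "__main__":
    print(run_stage("my_stage"))
```

With this pattern, defining the subclass is enough to register it, so no explicit registration call is needed in each extension file.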

Documentation

Technical Documentation:

  • memory-bank/projectbrief.md - System architecture
  • memory-bank/techContext.md - Technology stack
  • memory-bank/activeContext.md - Recent changes
  • memory-bank/progress.md - Feature status

For detailed technical information, troubleshooting, and development guides, see the memory-bank directory.
