Production-grade automated video processing system with ML transcription, semantic search, and AI-powered content extraction.
- 6,095+ channels monitored across tech, business, and educational content
- 32,579+ videos processed and catalogued
- 41.8M words of searchable transcript content
- 99.9% storage compression (MP3 → transcript conversion)
- 15,162 transcripts generated with local ML
- Zero ongoing API costs (local AI/ML processing)
Problem: Consuming and retaining knowledge from thousands of hours of video content across hundreds of channels is impossible manually.
Solution: Automated pipeline that downloads, transcribes, and semantically indexes video content for instant searchability and AI-powered insights.
- 🎥 Automated Download: Monitors channels every 2 hours, downloads new content
- 🎙️ Local ML Transcription: MacWhisper Pro + Parakeet-MLX for offline processing
- 💾 Semantic Storage: PostgreSQL + pgvector for embedding-based search
- 🤖 AI Processing: OpenRouter integration for summaries, chapters, insights
- 📊 Real-time Dashboard: Next.js frontend for browsing and searching
- 🔄 Incremental Processing: Only processes new content, skips duplicates (see the sketch after this list)
- 🚂 Production Ready: Railway deployment, error recovery, monitoring
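The discovery step boils down to "list a channel's newest uploads, skip what's already catalogued." A minimal sketch using yt-dlp's Python API, where the `known_ids` set stands in for the pipeline's database lookup; names here are illustrative, not the actual code in `youtube_downloader.py`:

```python
# Sketch: discover new uploads on a channel without downloading anything.
# Illustrative only -- the real youtube_downloader.py may differ.
import yt_dlp

def new_video_ids(channel_id: str, known_ids: set[str]) -> list[str]:
    url = f"https://www.youtube.com/channel/{channel_id}/videos"
    opts = {
        "extract_flat": True,  # list entries without resolving each video
        "playlistend": 20,     # only inspect the most recent uploads
        "quiet": True,
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=False)
    # Anything already catalogued is skipped (incremental processing).
    return [e["id"] for e in info.get("entries", []) if e["id"] not in known_ids]
```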
```
┌─────────────────┐      ┌──────────────────┐      ┌─────────────────┐
│     yt-dlp      │─────▶│  MacWhisper Pro  │─────▶│  Parakeet-MLX   │
│   Downloader    │      │   File Watcher   │      │  Transcription  │
└─────────────────┘      └──────────────────┘      └─────────────────┘
         │                                                  │
         ▼                                                  ▼
┌──────────────────────────────────────────────────────────────────────┐
│                         Supabase PostgreSQL                          │
│  ┌──────────┐  ┌──────────┐  ┌─────────────┐  ┌───────────────────┐  │
│  │ Channels │  │  Videos  │  │ Transcripts │  │ Vector Embeddings │  │
│  └──────────┘  └──────────┘  └─────────────┘  └───────────────────┘  │
└──────────────────────────────────────────────────────────────────────┘
         │                                                  │
         ▼                                                  ▼
┌─────────────────┐      ┌──────────────────┐      ┌─────────────────┐
│     Next.js     │      │  OpenRouter AI   │      │    pgvector     │
│   Dashboard     │      │    Processing    │      │     Search      │
└─────────────────┘      └──────────────────┘      └─────────────────┘
```
- Python 3.9+
- MacWhisper Pro (for transcription)
- Supabase account
- Railway account (for deployment)
```bash
# Clone repository
git clone https://github.com/mordechaipotash/youtube-transcription-pipeline.git
cd youtube-transcription-pipeline

# Install dependencies
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env with your credentials

# Run pipeline
python main.py
```

```bash
# YouTube Configuration
CHANNEL_LIST=UCxxxxx,UCyyyyy # Comma-separated channel IDs
# Transcription
WATCHED_FOLDER=/path/to/macwhisper/folder
# Database
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_KEY=your-service-role-key
# AI Processing (Optional)
OPENROUTER_API_KEY=your-key
OPENROUTER_MODEL=anthropic/claude-3-sonnet
# Deployment
LOG_LEVEL=INFO
RAILWAY_ENVIRONMENT=production # Auto-set by Railway
```

```
youtube-transcription-pipeline/
├── main.py # Application entry point, scheduler
├── youtube_downloader.py # yt-dlp integration, channel monitoring
├── transcript_processor.py # MacWhisper integration, AI processing
├── requirements.txt # Python dependencies
├── railway.toml # Railway deployment config
├── .env.example # Environment template
└── README.md
```
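How the pipeline might read the environment configuration above is easy to sketch, assuming python-dotenv; the exact startup code in `main.py` may differ:

```python
# Sketch: load .env configuration at startup (assumes python-dotenv).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env into the process environment

CHANNELS = [c.strip() for c in os.environ["CHANNEL_LIST"].split(",") if c.strip()]
WATCHED_FOLDER = os.environ["WATCHED_FOLDER"]  # MacWhisper Pro watched folder
SUPABASE_URL = os.environ["SUPABASE_URL"]
SUPABASE_KEY = os.environ["SUPABASE_KEY"]
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")  # optional AI processing
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
```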
channels
- `channel_id` (text, primary key)
- `channel_name`, `videos_count`, `status`, `last_check`
- `videos_found`, `videos_processed`
youtube (videos)
- `youtube_id` (text, unique)
- `title`, `description`, `url`
- `channel_name`, `channel_id`, `duration`
- `view_count`, `like_count`, `upload_date`
- `has_transcript`, `watch_count`
Additional Tables
- `transcripts`: Raw transcription data
- `processed_content`: AI summaries and insights
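The vector embeddings are what make the corpus semantically searchable. As an illustration, a pgvector cosine-distance query via psycopg2; the `content` and `embedding` column names are assumptions based on the schema above, and the query vector would come from whatever embedding model produced the stored vectors:

```python
# Sketch: semantic search with pgvector's cosine-distance operator (<=>).
# Table/column names are assumptions; adapt to the actual schema.
import psycopg2

def search_transcripts(conn, query_embedding: list[float], limit: int = 5):
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"  # pgvector text format
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT t.youtube_id, t.content
            FROM transcripts t
            ORDER BY t.embedding <=> %s::vector  -- cosine distance, smaller = closer
            LIMIT %s
            """,
            (vec, limit),
        )
        return cur.fetchall()
```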
- Discovery: `youtube_downloader.py` checks configured channels for new videos
- Download: Uses `yt-dlp` to download audio-only MP3 (storage efficient)
- Transcription: MacWhisper Pro watches the folder, auto-transcribes with Parakeet-MLX
- Processing: `transcript_processor.py` detects new transcripts, generates embeddings
- AI Enhancement: OpenRouter creates summaries, chapters, key points (sketched below)
- Storage: All data stored in Supabase with vector embeddings for semantic search
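The AI Enhancement step goes through OpenRouter's OpenAI-compatible chat completions endpoint. A minimal sketch of a summary request; the prompt and response parsing are illustrative, not the exact code in `transcript_processor.py`:

```python
# Sketch: summarize a transcript via OpenRouter (OpenAI-compatible API).
import os
import requests

def summarize(transcript: str) -> str:
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": os.getenv("OPENROUTER_MODEL", "anthropic/claude-3-sonnet"),
            "messages": [
                {"role": "user",
                 "content": f"Summarize this transcript with key points:\n\n{transcript}"},
            ],
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```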
```bash
# Railway automatically deploys on push
railway up
```

Configuration:
- Add all env vars from `.env.example`
- Cron runs every 2 hours automatically
- Logs available in Railway dashboard

```bash
# Add to crontab for 2-hour intervals
0 */2 * * * cd /path/to/pipeline && python main.py
```

- Processing Speed: ~1 video/minute (download + transcribe)
- Storage Efficiency: 99.9% compression (video → audio → text)
- Scalability: Handles 6,000+ channels, 32,000+ videos
- Cost: $0/month API costs (local ML), minimal storage (~500MB for 41.8M words)
- Reliability: Error recovery, duplicate detection, incremental processing
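To put the compression figure in perspective (illustrative sizes, not measurements from this corpus): a one-hour 1080p video runs on the order of 1 GB, its audio-only MP3 around 50 MB, and its transcript (~9,000 words at a typical speaking pace) roughly 50 KB of text, so the MP3 → transcript step alone discards about 99.9% of the bytes while keeping everything searchable.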
Core:
- Python 3.9+
- yt-dlp (YouTube downloading)
- MacWhisper Pro (transcription)
- Parakeet-MLX (ML model)
Data:
- PostgreSQL 15+
- Supabase (database hosting)
- pgvector (semantic search)
AI:
- OpenRouter (LLM gateway)
- Anthropic Claude / OpenAI GPT
- Vector embeddings
Infrastructure:
- Railway (cloud hosting)
- LaunchD/Cron (scheduling)
- Environment-based configuration
- Personal Knowledge Base: Build searchable library of educational content
- Research Assistant: Find specific topics across thousands of videos
- Content Curation: Track channels, analyze trends, discover patterns
- Learning Analytics: Track consumption, measure topic coverage
- AI Training Data: Clean transcripts for fine-tuning or RAG systems
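For that last use case, a sketch of turning a stored transcript into overlapping chunks ready for a RAG index; the chunk and overlap sizes are arbitrary defaults, not values the pipeline prescribes:

```python
# Sketch: split a transcript into overlapping word chunks for RAG ingestion.
def chunk_transcript(text: str, chunk_words: int = 300, overlap: int = 50) -> list[str]:
    words = text.split()
    step = chunk_words - overlap  # stride; overlap preserves context across chunks
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_words]
        if chunk:
            chunks.append(" ".join(chunk))
    return chunks
```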
This is a personal project, but suggestions and improvements are welcome via issues.
MIT License - see LICENSE file for details
- Sparkii Command Center - Full-stack dashboard for the pipeline
- Google Apps Script Portfolio - 41 automation solutions
Status: Production (300+ hours development, actively processing 32K+ videos)
Maintained: Yes (regular updates and monitoring)