A simple Video RAG system that lets users upload a video, builds a local searchable index from the video's transcript (and optional visual captions), and answers user questions using Google Gemini (via langchain_google_genai). The pipeline is designed to be free/local where possible (Whisper for local transcription, HuggingFace sentence-transformers for embeddings, Chroma for vector store). Gemini is used as the LLM backend (requires Google credentials if you want real Gemini responses).
- Video Upload & Processing - Support for MP4/MOV files with automatic audio extraction
- Local Transcription - Uses OpenAI Whisper for accurate, free speech-to-text
- Semantic Search - HuggingFace embeddings with Chroma vector database
- AI-Powered Q&A - Google Gemini integration for intelligent responses
- Real-time Interface - Clean Streamlit frontend with FastAPI backend
- Cost-Effective - Local processing minimizes API costs
The system follows a clean pipeline architecture (a minimal sketch of the first steps follows the list):
- Upload → Video file received via Streamlit UI
- Extract → FFmpeg converts video to audio (WAV)
- Transcribe → Whisper generates text transcript locally
- Index → Text chunked and embedded using sentence-transformers
- Store → Chroma vector database for fast retrieval
- Query → User questions matched against relevant chunks
- Generate → Gemini produces contextual answers
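For orientation, here is a minimal sketch of the Extract and Transcribe steps. The helper names are illustrative, not the actual API of functions.py:

```python
# Minimal sketch of the Extract → Transcribe steps.
# extract_audio/transcribe are illustrative names, not the project's real functions.
import subprocess
import whisper

def extract_audio(video_path: str, wav_path: str = "audio.wav") -> str:
    # FFmpeg: drop the video stream, resample to 16 kHz mono WAV for Whisper
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", wav_path],
        check=True,
    )
    return wav_path

def transcribe(wav_path: str, model_size: str = "tiny") -> str:
    # Whisper runs fully locally; larger models are slower but more accurate
    model = whisper.load_model(model_size)
    return model.transcribe(wav_path)["text"]
```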
```
video-rag/
├── app.py               # Streamlit frontend interface
├── backend.py           # FastAPI backend server
├── functions.py         # Core processing pipeline
├── requirements.txt     # Python dependencies
├── .env                 # Environment configuration
├── diagram-export-*     # Architecture diagram
└── README.md            # This file
```
- Python 3.10+ (recommended: use virtual environment)
- FFmpeg installed and available in PATH
- Google Cloud credentials (optional, for Gemini responses)
- Clone the repository

  ```bash
  git clone <repository-url>
  cd video-rag
  ```

- Create virtual environment

  ```bash
  python -m venv venv
  # Windows
  .\venv\Scripts\activate
  # macOS/Linux
  source venv/bin/activate
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Install FFmpeg

  Windows:

  ```bash
  # Using Chocolatey
  choco install ffmpeg -y
  # Using winget
  winget install --id=Gyan.FFmpeg -e
  ```

  macOS:

  ```bash
  brew install ffmpeg
  ```

  Linux:

  ```bash
  sudo apt update && sudo apt install ffmpeg
  ```

- Verify installation

  ```bash
  ffmpeg --version
  ```
Create a .env file in the project root:

```env
# Optional: Whisper model size (tiny|base|small|medium|large)
WHISPER_MODEL=tiny

# Optional: Google API credentials
GOOGLE_API_KEY=your_api_key_here
```

For Google Gemini integration, set up Application Default Credentials:

```bash
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
```
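A minimal sketch of how the backend can pick up these settings at startup, assuming python-dotenv is installed (the variable names match the .env example above):

```python
# Sketch: load optional settings from .env (assumes python-dotenv is installed).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root

WHISPER_MODEL = os.getenv("WHISPER_MODEL", "tiny")   # falls back to the smallest model
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")         # optional; ADC is used if unset
```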
- Start the backend server

  ```bash
  uvicorn backend:app --reload
  ```

- Launch the frontend (in a new terminal)

  ```bash
  streamlit run app.py --server.maxUploadSize=200
  ```

- Access the application

  - Frontend: http://localhost:8501
  - Backend API: http://localhost:8000
- Upload Video

  - Navigate to the Streamlit interface
  - Upload your MP4 or MOV file
  - Wait for the "processed" confirmation

- Ask Questions

  - Enter your question in the text input
  - Click Submit to get AI-powered answers
  - Responses are generated from video content

- Explore Results

  - View relevant context from the video
  - Get detailed answers based on transcript analysis (a sketch of this retrieval and generation step follows)
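Under the hood, the retrieval and answer-generation step could look roughly like this. The model name, prompt, and function name are assumptions, not the exact code in functions.py:

```python
# Sketch of the Query → Generate step (model name and prompt are assumptions).
from langchain_google_genai import ChatGoogleGenerativeAI

def answer_question(question: str, vectordb) -> str:
    # Retrieve the most relevant transcript chunks from the Chroma index
    docs = vectordb.similarity_search(question, k=4)
    context = "\n\n".join(doc.page_content for doc in docs)

    llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")
    prompt = (
        "Answer the question using only the video transcript context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.invoke(prompt).content
```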
Choose based on your needs:
| Model | Size   | Speed   | Accuracy |
|-------|--------|---------|----------|
| tiny  | 39 MB  | Fastest | Good     |
| base  | 74 MB  | Fast    | Better   |
| small | 244 MB | Slower  | Best     |
Start with the small model; if it is too slow or fails on your machine, switch to tiny.
```python
import whisper

_whisper_model = None  # cached so the model loads only once per process

def _get_whisper():
    # Change the size ("tiny", "base", "small", ...) for a speed/accuracy trade-off
    global _whisper_model
    if _whisper_model is None:
        _whisper_model = whisper.load_model("tiny")
    return _whisper_model
```
Default: `all-MiniLM-L6-v2` (fast, good quality)

- For better quality: `all-mpnet-base-v2`
- For speed: `all-MiniLM-L12-v2`
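A minimal sketch of how the embedding model can be swapped when building the index. The chunking scheme and the Chroma wiring below are assumptions, not the actual code in functions.py:

```python
# Sketch: chunk the transcript, embed it, and persist a Chroma index.
# The naive character chunking and the persist_directory are assumptions.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

def build_index(transcript: str, model_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
    # Fixed-size chunks with overlap so sentences are not cut off at chunk boundaries
    chunks = [transcript[i:i + 1000] for i in range(0, len(transcript), 800)]
    embeddings = HuggingFaceEmbeddings(model_name=model_name)
    return Chroma.from_texts(chunks, embedding=embeddings, persist_directory="chroma_db")
```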
- Use GPU-enabled containers for faster Whisper processing
- Implement Chroma persistence to avoid re-indexing
- Add background task queues for large video processing
- Enable CORS and security headers for public deployment (see the sketch below)
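For example, CORS could be enabled on the FastAPI backend roughly like this; the allowed origin is an assumption, so tighten it for your deployment:

```python
# Sketch: restrict cross-origin requests to the Streamlit frontend (backend.py).
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:8501"],  # the Streamlit frontend; change for production
    allow_methods=["GET", "POST"],
    allow_headers=["*"],
)
```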
Current pipeline focuses on audio transcription. For silent videos:
- Extract frames using FFmpeg
- Generate captions with image captioning models (BLIP, etc.)
- Combine visual and audio information for richer context
Example integration point in functions.py:

```python
# Add visual captioning step here
captions = generate_visual_captions(video_path)
combined_text = transcript + " " + captions
```
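generate_visual_captions is not implemented in the current pipeline. A hedged sketch of what it could look like, using FFmpeg frame sampling and the BLIP captioning model (the model ID, frame rate, and paths are assumptions):

```python
# Hypothetical generate_visual_captions: sample frames with FFmpeg, caption with BLIP.
import glob
import os
import subprocess

from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

def generate_visual_captions(video_path: str, frames_dir: str = "frames") -> str:
    os.makedirs(frames_dir, exist_ok=True)
    # Sample one frame every 10 seconds of video
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vf", "fps=1/10",
         os.path.join(frames_dir, "frame_%04d.jpg")],
        check=True,
    )
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    captions = []
    for frame in sorted(glob.glob(os.path.join(frames_dir, "frame_*.jpg"))):
        inputs = processor(Image.open(frame), return_tensors="pt")
        output = model.generate(**inputs, max_new_tokens=30)
        captions.append(processor.decode(output[0], skip_special_tokens=True))
    return " ".join(captions)
```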
ModuleNotFoundError

```bash
# Ensure virtual environment is activated
source venv/bin/activate  # or .\venv\Scripts\activate on Windows
pip install -r requirements.txt
```

FFmpeg not found
```bash
# Verify FFmpeg installation
ffmpeg --version

# Add to PATH if needed (Windows)
set PATH=%PATH%;C:\path\to\ffmpeg\bin
```

Streamlit upload errors (403)
- Increase the upload size: `--server.maxUploadSize=500`
- Try incognito mode or a different browser
- Move the project outside OneDrive/cloud-sync folders
Google ADC errors
```bash
# Set credentials environment variable
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
```

- Check backend logs for processing pipeline status
- Verify transcript files are generated with content
- Monitor `docs=0` or `context_len=0` logs for indexing issues
- Sanitize uploaded filenames (see the sketch below)
- Validate file types and sizes
- Implement rate limiting
- Use containerization for isolation
- Monitor API usage and quotas
- Enable HTTPS in production
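A minimal sketch of the first two checklist items; the size limit and helper names are assumptions:

```python
# Sketch: sanitize filenames and validate type/size before processing an upload.
import re
from pathlib import Path

ALLOWED_SUFFIXES = {".mp4", ".mov"}
MAX_BYTES = 200 * 1024 * 1024  # keep in sync with --server.maxUploadSize

def sanitize_filename(name: str) -> str:
    # Drop any directory components and keep only safe characters
    stem = Path(name).name
    return re.sub(r"[^A-Za-z0-9._-]", "_", stem)

def validate_upload(filename: str, size_bytes: int) -> None:
    if Path(filename).suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError("Only MP4/MOV uploads are supported")
    if size_bytes > MAX_BYTES:
        raise ValueError("File exceeds the maximum upload size")
```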
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
You upload a video and ask a question; based on your prompt, you get a high-level answer together with three frames demonstrating the occurrence to justify it.
This project is licensed under the MIT License - see the LICENSE file for details.
Built with these amazing open-source libraries:
- OpenAI Whisper - Speech recognition
- Hugging Face Transformers - Text embeddings
- LangChain - LLM framework
- ChromaDB - Vector database
- Streamlit - Web interface
- FastAPI - Backend framework
⭐ Star this repo if it helped you!

.jpeg)
