
🎥 Video RAG

Local Retrieval-Augmented Generation for Video Q&A

A simple Video RAG system that lets users upload a video, builds a local searchable index from the video's transcript (and optional visual captions), and answers user questions using Google Gemini (via langchain_google_genai). The pipeline is designed to be free/local where possible (Whisper for local transcription, HuggingFace sentence-transformers for embeddings, Chroma for vector store). Gemini is used as the LLM backend (requires Google credentials if you want real Gemini responses).


Architecture & flow diagram (exported as diagram-export-* in the repo root).


✨ Features

  • 🎬 Video Upload & Processing - Support for MP4/MOV files with automatic audio extraction
  • πŸŽ™οΈ Local Transcription - Uses OpenAI Whisper for accurate, free speech-to-text
  • πŸ” Semantic Search - HuggingFace embeddings with Chroma vector database
  • πŸ€– AI-Powered Q&A - Google Gemini integration for intelligent responses
  • πŸš€ Real-time Interface - Clean Streamlit frontend with FastAPI backend
  • πŸ’° Cost-Effective - Local processing minimizes API costs

πŸ—οΈ Architecture

The system follows a clean pipeline architecture (the first stages are sketched in code after this list):

  1. Upload → Video file received via Streamlit UI
  2. Extract → FFmpeg converts video to audio (WAV)
  3. Transcribe → Whisper generates text transcript locally
  4. Index → Text chunked and embedded using sentence-transformers
  5. Store → Chroma vector database for fast retrieval
  6. Query → User questions matched against relevant chunks
  7. Generate → Gemini produces contextual answers
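
For orientation, here is a minimal sketch of the extract and transcribe stages. The real implementation lives in functions.py and may differ; the function names here are illustrative:

import subprocess
import whisper

def extract_audio(video_path: str, wav_path: str = "audio.wav") -> str:
    # -vn drops the video stream; 16 kHz mono PCM is the format Whisper expects.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", wav_path],
        check=True, capture_output=True,
    )
    return wav_path

def transcribe(wav_path: str) -> str:
    # See "Whisper Models" below for the size trade-offs.
    model = whisper.load_model("tiny")
    return model.transcribe(wav_path)["text"]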

πŸ“ Project Structure

video-rag/
β”œβ”€β”€ πŸ“„ app.py              # Streamlit frontend interface
β”œβ”€β”€ πŸš€ backend.py          # FastAPI backend server
β”œβ”€β”€ βš™οΈ functions.py        # Core processing pipeline
β”œβ”€β”€ πŸ“¦ requirements.txt    # Python dependencies
β”œβ”€β”€ πŸ” .env               # Environment configuration
β”œβ”€β”€ πŸ“Š diagram-export-*    # Architecture diagram
└── πŸ“– README.md          # This file

🚀 Quick Start

Prerequisites

  • Python 3.10+ (recommended: use virtual environment)
  • FFmpeg installed and available in PATH
  • Google Cloud credentials (optional, for Gemini responses)

Installation

  1. Clone the repository

    git clone https://github.com/P47Parzival/Video-RAG.git
    cd Video-RAG
  2. Create virtual environment

    python -m venv venv
    
    # Windows
    .\venv\Scripts\activate
    
    # macOS/Linux  
    source venv/bin/activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Install FFmpeg

    Windows:

    # Using Chocolatey
    choco install ffmpeg -y
    
    # Using winget
    winget install --id=Gyan.FFmpeg -e

    macOS:

    brew install ffmpeg

    Linux:

    sudo apt update && sudo apt install ffmpeg
  5. Verify installation

    ffmpeg --version

Configuration

Create a .env file in the project root:

# Optional: Whisper model size (tiny|base|small|medium|large)
WHISPER_MODEL=tiny

# Optional: Google API credentials
GOOGLE_API_KEY=your_api_key_here

For Google Gemini integration, set up Application Default Credentials:

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
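
For reference, a typical way the backend could read these values at startup, assuming the python-dotenv package (the repo's actual loading code may differ):

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root
WHISPER_MODEL = os.getenv("WHISPER_MODEL", "tiny")
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")  # optional if ADC is configured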

Running the Application

  1. Start the backend server

    uvicorn backend:app --reload
  2. Launch the frontend (in a new terminal)

    streamlit run app.py --server.maxUploadSize=200
  3. Access the application

     • Streamlit frontend: http://localhost:8501 (Streamlit's default port)
     • FastAPI backend: http://127.0.0.1:8000 (uvicorn's default)


💡 Usage

  1. Upload Video 📤

    • Navigate to the Streamlit interface
    • Upload your MP4 or MOV file
    • Wait for "processed" confirmation
  2. Ask Questions ❓

    • Enter your question in the text input
    • Click Submit to get AI-powered answers
    • Responses are generated from video content
  3. Explore Results 🔍

    • View relevant context from the video
    • Get detailed answers based on transcript analysis

🔧 Advanced Configuration

Whisper Models

Choose based on your needs:

Model   Size     Speed        Accuracy
tiny    39 MB    ⚡ Fastest    Good
base    74 MB    🚀 Fast       Better
small   244 MB   🐌 Slower     Best

Start with the small model for better accuracy; if the download stalls or inference is too slow, switch to tiny.

import whisper

_whisper_model = None

def _get_whisper():
    # Lazily load and cache the Whisper model; change the size string
    # (tiny | base | small | medium | large) for a speed/accuracy trade-off.
    global _whisper_model
    if _whisper_model is None:
        _whisper_model = whisper.load_model("tiny")
    return _whisper_model

Embedding Models

Default: all-MiniLM-L6-v2 (fast, good quality)

  • For better quality: all-mpnet-base-v2 (larger and slower)
  • For a middle ground: all-MiniLM-L12-v2 (slightly higher quality than L6-v2, still fast)
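
The snippet below sketches how the index, query, and generate stages could fit together with a swappable embedding model. It assumes the langchain_community, langchain_text_splitters, and langchain_google_genai packages; build_index, answer, the chunking parameters, and the Gemini model name are illustrative, not the repo's exact code:

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_text_splitters import RecursiveCharacterTextSplitter

def build_index(transcript: str,
                model_name: str = "sentence-transformers/all-MiniLM-L6-v2") -> Chroma:
    # Chunk the transcript so each piece fits the embedding model comfortably.
    chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_text(transcript)
    # Swap model_name for "sentence-transformers/all-mpnet-base-v2" for higher quality.
    return Chroma.from_texts(chunks, HuggingFaceEmbeddings(model_name=model_name))

def answer(db: Chroma, question: str) -> str:
    # Retrieve the most relevant chunks, then let Gemini answer from them.
    context = "\n".join(d.page_content for d in db.similarity_search(question, k=4))
    llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")  # needs GOOGLE_API_KEY or ADC
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm.invoke(prompt).content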

Production Optimizations

  • Use GPU-enabled containers for faster Whisper processing
  • Implement Chroma persistence to avoid re-indexing
  • Add background task queues for large video processing (see the sketch below)
  • Enable CORS and security headers for public deployment
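
For the background-task point, FastAPI's built-in BackgroundTasks can move heavy processing off the request thread. This is a sketch; process_video is a stand-in for the real pipeline entry point in functions.py:

import os
from fastapi import BackgroundTasks, FastAPI, UploadFile

app = FastAPI()

def process_video(path: str) -> None:
    """Stand-in for the real extract/transcribe/index pipeline."""
    ...

@app.post("/upload")
async def upload(file: UploadFile, background_tasks: BackgroundTasks):
    os.makedirs("uploads", exist_ok=True)
    # In production, sanitize file.filename first (see Security & Production).
    path = os.path.join("uploads", file.filename)
    with open(path, "wb") as f:
        f.write(await file.read())
    # Transcription and indexing run after the response has been sent.
    background_tasks.add_task(process_video, path)
    return {"status": "processing"}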

🎯 Handling Silent Videos

The current pipeline relies on audio transcription, so a silent video yields an empty index. To handle such videos:

  1. Extract frames using FFmpeg
  2. Generate captions with image captioning models (BLIP, etc.)
  3. Combine visual and audio information for richer context

Example integration point in functions.py:

# Add visual captioning step here
captions = generate_visual_captions(video_path)
combined_text = transcript + " " + captions
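
A hypothetical generate_visual_captions could pair FFmpeg frame sampling with Hugging Face's BLIP captioning model. The function name, sampling rate, and model choice are assumptions, not part of the current codebase:

import subprocess
import tempfile
from pathlib import Path

from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

def generate_visual_captions(video_path: str, fps: float = 0.2) -> str:
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    with tempfile.TemporaryDirectory() as tmp:
        # fps=0.2 samples one frame every five seconds.
        subprocess.run(
            ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}", f"{tmp}/frame_%04d.jpg"],
            check=True, capture_output=True,
        )
        captions = []
        for frame in sorted(Path(tmp).glob("frame_*.jpg")):
            inputs = processor(images=Image.open(frame), return_tensors="pt")
            out = model.generate(**inputs, max_new_tokens=30)
            captions.append(processor.decode(out[0], skip_special_tokens=True))
    return " ".join(captions)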

πŸ› Troubleshooting

Common Issues

ModuleNotFoundError

# Ensure virtual environment is activated
source venv/bin/activate  # or .\venv\Scripts\activate on Windows
pip install -r requirements.txt

FFmpeg not found

# Verify FFmpeg installation
ffmpeg --version

# Add to PATH if needed (Windows)
set PATH=%PATH%;C:\path\to\ffmpeg\bin

Streamlit upload errors (403)

  • Increase upload size: --server.maxUploadSize=500
  • Try incognito mode or different browser
  • Move project outside OneDrive/cloud sync folders

Google ADC errors

# Set credentials environment variable
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"

Debugging Tips

  • Check backend logs for processing pipeline status
  • Verify transcript files are generated with content
  • Monitor docs=0 or context_len=0 logs for indexing issues

🔒 Security & Production

  • ✅ Sanitize uploaded filenames
  • ✅ Validate file types and sizes
  • ✅ Implement rate limiting
  • ✅ Use containerization for isolation
  • ✅ Monitor API usage and quotas
  • ✅ Enable HTTPS in production

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

TL;DR

Upload a video and ask a question; based on your prompt, you get a high-level answer plus three frames showing the relevant moments to justify it.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Acknowledgments

Built with these amazing open-source libraries:


⭐ Star this repo if it helped you!
