A simple Video RAG system that lets users upload a video, builds a local searchable index from the video's transcript (and optional visual captions), and answers user questions using Google Gemini (via langchain_google_genai). The pipeline is designed to be free/local where possible (Whisper for local transcription, HuggingFace sentence-transformers for embeddings, Chroma for vector store). Gemini is used as the LLM backend (requires Google credentials if you want real Gemini responses).
- Video Upload & Processing - Support for MP4/MOV files with automatic audio extraction
- Local Transcription - Uses OpenAI Whisper for accurate, free speech-to-text
- Semantic Search - HuggingFace embeddings with Chroma vector database
- AI-Powered Q&A - Google Gemini integration for intelligent responses
- Real-time Interface - Clean Streamlit frontend with FastAPI backend
- Cost-Effective - Local processing minimizes API costs
The system follows a clean pipeline architecture (a minimal sketch of the first steps follows the list):
- Upload → Video file received via Streamlit UI
- Extract → FFmpeg converts video to audio (WAV)
- Transcribe → Whisper generates text transcript locally
- Index → Text chunked and embedded using sentence-transformers
- Store → Chroma vector database for fast retrieval
- Query → User questions matched against relevant chunks
- Generate → Gemini produces contextual answers
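For orientation, here is a minimal sketch of the Extract and Transcribe steps. The helper names are illustrative, not the actual API of functions.py:

```python
# Minimal sketch of the Extract → Transcribe steps.
# extract_audio/transcribe are illustrative names, not the project's real functions.
import subprocess
import whisper

def extract_audio(video_path: str, wav_path: str = "audio.wav") -> str:
    # FFmpeg: drop the video stream, resample to 16 kHz mono WAV for Whisper
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", wav_path],
        check=True,
    )
    return wav_path

def transcribe(wav_path: str, model_size: str = "tiny") -> str:
    # Whisper runs fully locally; larger models are slower but more accurate
    model = whisper.load_model(model_size)
    return model.transcribe(wav_path)["text"]
```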
```
video-rag/
├── app.py               # Streamlit frontend interface
├── backend.py           # FastAPI backend server
├── functions.py         # Core processing pipeline
├── requirements.txt     # Python dependencies
├── .env                 # Environment configuration
├── diagram-export-*     # Architecture diagram
└── README.md            # This file
```
- Python 3.10+ (recommended: use virtual environment)
- FFmpeg installed and available in PATH
- Google Cloud credentials (optional, for Gemini responses)
- Clone the repository

  ```bash
  git clone <repository-url>
  cd video-rag
  ```

- Create virtual environment

  ```bash
  python -m venv venv
  # Windows
  .\venv\Scripts\activate
  # macOS/Linux
  source venv/bin/activate
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Install FFmpeg

  Windows:

  ```bash
  # Using Chocolatey
  choco install ffmpeg -y
  # Using winget
  winget install --id=Gyan.FFmpeg -e
  ```

  macOS:

  ```bash
  brew install ffmpeg
  ```

  Linux:

  ```bash
  sudo apt update && sudo apt install ffmpeg
  ```

- Verify installation

  ```bash
  ffmpeg --version
  ```
Create a .env file in the project root:

```env
# Optional: Whisper model size (tiny|base|small|medium|large)
WHISPER_MODEL=tiny

# Optional: Google API credentials
GOOGLE_API_KEY=your_api_key_here
```

For Google Gemini integration, set up Application Default Credentials:

```bash
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
```
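A minimal sketch of how the backend can pick up these settings at startup, assuming python-dotenv is installed (the variable names match the .env example above):

```python
# Sketch: load optional settings from .env (assumes python-dotenv is installed).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root

WHISPER_MODEL = os.getenv("WHISPER_MODEL", "tiny")   # falls back to the smallest model
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")         # optional; ADC is used if unset
```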
- Start the backend server

  ```bash
  uvicorn backend:app --reload
  ```

- Launch the frontend (in a new terminal)

  ```bash
  streamlit run app.py --server.maxUploadSize=200
  ```

- Access the application

  - Frontend: http://localhost:8501
  - Backend API: http://localhost:8000
- Upload Video

  - Navigate to the Streamlit interface
  - Upload your MP4 or MOV file
  - Wait for the "processed" confirmation

- Ask Questions

  - Enter your question in the text input
  - Click Submit to get AI-powered answers
  - Responses are generated from video content

- Explore Results

  - View relevant context from the video
  - Get detailed answers based on transcript analysis (a sketch of this retrieval and generation step follows)
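Under the hood, the retrieval and answer-generation step could look roughly like this. The model name, prompt, and function name are assumptions, not the exact code in functions.py:

```python
# Sketch of the Query → Generate step (model name and prompt are assumptions).
from langchain_google_genai import ChatGoogleGenerativeAI

def answer_question(question: str, vectordb) -> str:
    # Retrieve the most relevant transcript chunks from the Chroma index
    docs = vectordb.similarity_search(question, k=4)
    context = "\n\n".join(doc.page_content for doc in docs)

    llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")
    prompt = (
        "Answer the question using only the video transcript context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.invoke(prompt).content
```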
Choose based on your needs:
| Model | Size   | Speed   | Accuracy |
|-------|--------|---------|----------|
| tiny  | 39 MB  | Fastest | Good     |
| base  | 74 MB  | Fast    | Better   |
| small | 244 MB | Slower  | Best     |
Start with the small model; if it is too slow or fails on your machine, switch to tiny.
```python
import whisper

_whisper_model = None  # cached so the model loads only once per process

def _get_whisper():
    # Change the size ("tiny", "base", "small", ...) for a speed/accuracy trade-off
    global _whisper_model
    if _whisper_model is None:
        _whisper_model = whisper.load_model("tiny")
    return _whisper_model
```
Default: `all-MiniLM-L6-v2` (fast, good quality)

- For better quality: `all-mpnet-base-v2`
- For speed: `all-MiniLM-L12-v2`
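A minimal sketch of how the embedding model can be swapped when building the index. The chunking scheme and the Chroma wiring below are assumptions, not the actual code in functions.py:

```python
# Sketch: chunk the transcript, embed it, and persist a Chroma index.
# The naive character chunking and the persist_directory are assumptions.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

def build_index(transcript: str, model_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
    # Fixed-size chunks with overlap so sentences are not cut off at chunk boundaries
    chunks = [transcript[i:i + 1000] for i in range(0, len(transcript), 800)]
    embeddings = HuggingFaceEmbeddings(model_name=model_name)
    return Chroma.from_texts(chunks, embedding=embeddings, persist_directory="chroma_db")
```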
- Use GPU-enabled containers for faster Whisper processing
- Implement Chroma persistence to avoid re-indexing
- Add background task queues for large video processing
- Enable CORS and security headers for public deployment (see the sketch below)
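For example, CORS could be enabled on the FastAPI backend roughly like this; the allowed origin is an assumption, so tighten it for your deployment:

```python
# Sketch: restrict cross-origin requests to the Streamlit frontend (backend.py).
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:8501"],  # the Streamlit frontend; change for production
    allow_methods=["GET", "POST"],
    allow_headers=["*"],
)
```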
Current pipeline focuses on audio transcription. For silent videos:
- Extract frames using FFmpeg
- Generate captions with image captioning models (BLIP, etc.)
- Combine visual and audio information for richer context
Example integration point in functions.py:

```python
# Add visual captioning step here
captions = generate_visual_captions(video_path)
combined_text = transcript + " " + captions
```
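generate_visual_captions is not implemented in the current pipeline. A hedged sketch of what it could look like, using FFmpeg frame sampling and the BLIP captioning model (the model ID, frame rate, and paths are assumptions):

```python
# Hypothetical generate_visual_captions: sample frames with FFmpeg, caption with BLIP.
import glob
import os
import subprocess

from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

def generate_visual_captions(video_path: str, frames_dir: str = "frames") -> str:
    os.makedirs(frames_dir, exist_ok=True)
    # Sample one frame every 10 seconds of video
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vf", "fps=1/10",
         os.path.join(frames_dir, "frame_%04d.jpg")],
        check=True,
    )
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    captions = []
    for frame in sorted(glob.glob(os.path.join(frames_dir, "frame_*.jpg"))):
        inputs = processor(Image.open(frame), return_tensors="pt")
        output = model.generate(**inputs, max_new_tokens=30)
        captions.append(processor.decode(output[0], skip_special_tokens=True))
    return " ".join(captions)
```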
ModuleNotFoundError

```bash
# Ensure virtual environment is activated
source venv/bin/activate  # or .\venv\Scripts\activate on Windows
pip install -r requirements.txt
```

FFmpeg not found
```bash
# Verify FFmpeg installation
ffmpeg --version

# Add to PATH if needed (Windows)
set PATH=%PATH%;C:\path\to\ffmpeg\bin
```

Streamlit upload errors (403)
- Increase the upload size: `--server.maxUploadSize=500`
- Try incognito mode or a different browser
- Move the project outside OneDrive/cloud-sync folders
Google ADC errors
```bash
# Set credentials environment variable
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
```

- Check backend logs for processing pipeline status
- Verify transcript files are generated with content
- Monitor `docs=0` or `context_len=0` logs for indexing issues
- Sanitize uploaded filenames (see the sketch below)
- Validate file types and sizes
- Implement rate limiting
- Use containerization for isolation
- Monitor API usage and quotas
- Enable HTTPS in production
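A minimal sketch of the first two checklist items; the size limit and helper names are assumptions:

```python
# Sketch: sanitize filenames and validate type/size before processing an upload.
import re
from pathlib import Path

ALLOWED_SUFFIXES = {".mp4", ".mov"}
MAX_BYTES = 200 * 1024 * 1024  # keep in sync with --server.maxUploadSize

def sanitize_filename(name: str) -> str:
    # Drop any directory components and keep only safe characters
    stem = Path(name).name
    return re.sub(r"[^A-Za-z0-9._-]", "_", stem)

def validate_upload(filename: str, size_bytes: int) -> None:
    if Path(filename).suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError("Only MP4/MOV uploads are supported")
    if size_bytes > MAX_BYTES:
        raise ValueError("File exceeds the maximum upload size")
```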
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
You upload a video and ask a question; based on your prompt, you get a high-level answer together with three frames demonstrating the occurrence to justify it.
This project is licensed under the MIT License - see the LICENSE file for details.
Built with these amazing open-source libraries:
- OpenAI Whisper - Speech recognition
- Hugging Face Transformers - Text embeddings
- LangChain - LLM framework
- ChromaDB - Vector database
- Streamlit - Web interface
- FastAPI - Backend framework
⭐ Star this repo if it helped you!

.jpeg)
