Cross-Frame Multimodal Retrieval-Augmented Generation for Video Understanding
CFM-RAG brings cross-frame audio, text and visual evidence together to enable accurate, context-rich video Q&A and content search.
This repository implements a video Retrieval-Augmented Generation (RAG) pipeline that:
- Extracts representative frames from a video.
- Transcribes audio to text (ASR).
- Performs OCR, object detection, and segmentation on frames.
- Generates frame captions.
- Builds CLIP embeddings for both text and images.
- Retrieves the most relevant text and frames for a user query.
- Constructs a context-rich prompt and queries an LLM to produce a final answer.
The entrypoint is `src/main.py`, which instantiates `VideoRAGPipeline` and runs a sample query (e.g., “How many tin cans are there in the video?”).
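A minimal usage sketch of that flow is shown below; the constructor and `query` signatures are assumptions inferred from the module descriptions in this README, not the exact code in `src/main.py`:

```python
# Hypothetical sketch of the entrypoint flow; class and method names are
# assumed from the repository description, not copied from src/main.py.
from src.config import Config
from src.pipeline import VideoRAGPipeline

if __name__ == "__main__":
    pipeline = VideoRAGPipeline(video_path=Config.VIDEO_PATH)  # assumed constructor
    prompt, answer = pipeline.query("How many tin cans are there in the video?")
    print("Prompt sent to the LLM:\n", prompt)
    print("Answer:\n", answer)
```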
- Cross-frame fusion of multimodal signals (ASR, OCR, captions, detections) for robust retrieval.
- Timestamped evidence (captions + detection labels) attached to retrieval results for traceability.
- Per-item caching to speed up repeated runs and reduce redundant model calls.
- Streaming LLM integration for low-latency, traceable answers.
- Scene extraction (`SceneExtractor.extract`)
  - Uses PySceneDetect to detect cuts and Decord to load key frames.
  - Caches scenes keyed by video content hash.
  - Returns arrays of `frames` and `times`.
- Transcription (`Transcriber.transcribe`)
  - Uses FFmpeg to extract audio (16 kHz mono float32) and Whisper Base for ASR.
  - Caches transcripts using the SHA256 of the file path.
- Frame processing (`FrameProcessor.process`)
  - Captions: GPU-batched BLIP-2 (FLAN-T5-XL) for all frames.
  - Per-frame (parallel): Tesseract OCR, YOLOv8x detection, GroundingDINO tiny + SAM huge for segmentation.
  - Caches results per frame content hash.
  - Returns `(ocr_texts, captions, detections, masks)`.
- Indexing (`Indexer.build`)
  - Builds open_clip ViT-H-14 embeddings for text (ASR + OCR + captions) and for images (frames).
  - Uses per-item caching (text hashed by content, images by frame hash).
  - Returns `(texts, text_embs, img_embs)`.
- Retrieval + LLM (`Retriever.query`)
  - Encodes the query with CLIP and computes cosine similarity against the text and image embeddings.
  - Selects the top-3 text and top-3 image hits (configurable).
  - Builds a prompt from the top text snippets, timestamped image captions, and detection labels with confidences.
  - Calls Groq Chat Completions (model: `meta-llama/llama-4-scout-17b-16e-instruct`) using streaming.
  - Returns `(prompt, answer)`. A sketch of the index-and-retrieve step follows this list.
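As a rough illustration of the indexing and retrieval stages, the sketch below builds open_clip ViT-H-14 embeddings and ranks items by cosine similarity over L2-normalized vectors. It is a simplified stand-in, assuming plain NumPy arrays; the real `Indexer`/`Retriever` also handle per-item caching, timestamps, and prompt construction:

```python
import numpy as np
import torch
import open_clip
from PIL import Image

# CLIP ViT-H-14 (laion2b_s32b_b79k), as listed in the tech stack.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k", device=device
)
tokenizer = open_clip.get_tokenizer("ViT-H-14")

def embed_texts(texts):
    """Encode ASR/OCR/caption snippets into unit-norm CLIP text embeddings."""
    with torch.no_grad():
        feats = model.encode_text(tokenizer(texts).to(device))
    return torch.nn.functional.normalize(feats, dim=-1).cpu().numpy()

def embed_images(frames):
    """Encode key frames (H x W x 3 uint8 arrays) into unit-norm CLIP image embeddings."""
    batch = torch.stack([preprocess(Image.fromarray(f)) for f in frames]).to(device)
    with torch.no_grad():
        feats = model.encode_image(batch)
    return torch.nn.functional.normalize(feats, dim=-1).cpu().numpy()

def top_k(query, text_embs, img_embs, k=3):
    """Cosine similarity reduces to a dot product on normalized embeddings."""
    q = embed_texts([query])[0]
    text_idx = np.argsort(text_embs @ q)[::-1][:k]
    img_idx = np.argsort(img_embs @ q)[::-1][:k]
    return text_idx, img_idx
```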
- `src/main.py` — Minimal entrypoint; creates the pipeline with `Config.VIDEO_PATH` and prints the prompt/answer.
- `src/pipeline.py` — `VideoRAGPipeline` orchestrator; validates the config and ensures `Config.CACHE_DIR` exists.
- `src/scene_extractor.py` — PySceneDetect + Decord scene/key-frame extraction (cache by file hash).
- `src/transcriber.py` — FFmpeg audio extraction + Whisper transcription (cache by path hash).
- `src/frame_processor.py` — BLIP-2 captioning + per-frame OCR/detection/segmentation in a process pool (cache by frame hash).
- `src/indexer.py` — open_clip ViT-H-14 embeddings for texts and images with per-item caching.
- `src/retriever.py` — CLIP-based retrieval across text & image embeddings and the Groq LLM call.
- `src/cache_utils.py` — Simple pickle-based cache utilities (stored in `Config.CACHE_DIR`).
- `src/hash_utils.py` — SHA256 helpers for files and frames.
- `src/config.py` — Central configuration (paths, device selection, thresholds).
- Scene list & key frames: keyed by video file content hash.
- Transcription: keyed by SHA256 of the file path (note: may go stale if path remains the same but content changes).
- Per-frame results (OCR/detections/masks): keyed by frame content hash.
- Text embeddings: keyed by text SHA256.
- Image embeddings: keyed by frame content hash.
- Cache files are stored as `.pkl` files inside `Config.CACHE_DIR` (see the sketch below).
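The caching pattern above roughly amounts to hashing the item and pickling the result under that key. The sketch below is illustrative only; the real helpers in `src/cache_utils.py` and `src/hash_utils.py` may use different names and signatures:

```python
import hashlib
import pickle
from pathlib import Path

CACHE_DIR = Path("cache")  # stands in for Config.CACHE_DIR
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def frame_hash(frame) -> str:
    """SHA256 of a frame's raw bytes (NumPy array), used as a cache key."""
    return hashlib.sha256(frame.tobytes()).hexdigest()

def cache_load(key: str):
    """Return the cached object for `key`, or None on a cache miss."""
    path = CACHE_DIR / f"{key}.pkl"
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    return None

def cache_store(key: str, value) -> None:
    """Pickle `value` under the hash key (no locking, mirroring the current design)."""
    with (CACHE_DIR / f"{key}.pkl").open("wb") as f:
        pickle.dump(value, f)
```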
- Video processing: PySceneDetect, Decord
- ASR: Whisper Base (HF Transformers)
- Vision:
  - Captions: BLIP-2 (FLAN-T5-XL) — GPU recommended
  - OCR: Tesseract (`pytesseract`) — system Tesseract required
  - Detection: YOLOv8x (`ultralytics`)
  - Grounded detection: GroundingDINO tiny
  - Segmentation: SAM huge
- Embeddings: open_clip ViT-H-14 (`laion2b_s32b_b79k`)
- LLM: Groq Chat Completions (`meta-llama/llama-4-scout-17b-16e-instruct`)
- Utilities: Torch, PIL/Pillow, NumPy, FFmpeg
Note: Large models (BLIP-2 XL, SAM huge, CLIP ViT-H-14) require significant GPU memory. Consider lighter alternatives for constrained hardware.
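For the ASR entry above, a minimal transcription sketch using the Hugging Face pipeline API is shown below. It assumes a 16 kHz mono float32 waveform like the one the FFmpeg step produces; the repository's `Transcriber` additionally handles audio extraction and caching:

```python
# Sketch only: transcribe a 16 kHz mono float32 waveform with Whisper Base.
import numpy as np
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-base",
    device=0 if torch.cuda.is_available() else -1,  # GPU if available, else CPU
)

def transcribe(audio: np.ndarray, sampling_rate: int = 16_000) -> str:
    """audio: mono float32 waveform, as produced by the FFmpeg extraction step."""
    return asr({"raw": audio, "sampling_rate": sampling_rate})["text"]
```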
Edit `src/config.py` or set environment variables as needed (a minimal env-driven sketch follows this list):
- `Config.VIDEO_PATH` — Absolute path to the input video.
- `Config.CACHE_DIR` — Absolute path for cache storage.
- `Config.DEVICE` — `"cuda"` if CUDA is available, otherwise `"cpu"`.
- `GROQ_API_KEY` — Must be set in the environment for the Retriever to call Groq; replace any hardcoded default in the code with a secure environment variable.
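A minimal environment-driven `Config` sketch is shown below. The field names follow the list above, but the real `src/config.py` may define additional thresholds and defaults:

```python
import os
import torch

class Config:
    # Prefer environment variables over hardcoded absolute paths.
    VIDEO_PATH = os.environ.get("VIDEO_PATH", "/path/to/video.mp4")
    CACHE_DIR = os.environ.get("CACHE_DIR", "/path/to/cache")
    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
    # Never hardcode the key; the Retriever should read it from the environment.
    GROQ_API_KEY = os.environ.get("GROQ_API_KEY")
```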
The pipeline prints progress across stages:
Extract scenes → Transcribe → Process frames → Build index → Initialize retriever → Query retriever
On completion it prints:
- The constructed retrieval-augmented prompt used for the LLM.
- The LLM-generated answer (streaming supported).
- Heavy GPU/VRAM demand for certain models. CPU fallback is available but slow.
- Windows-specific notes:
  - Ensure `ffmpeg` and `tesseract` are installed and added to `PATH`.
  - Large HF model downloads happen on the first run.
  - Using a `ProcessPoolExecutor` with GPU models requires care: workers load models on demand and may increase the memory footprint (see the sketch below).
- Replace any hardcoded Groq API key with `GROQ_API_KEY`.
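The `ProcessPoolExecutor` caution above usually comes down to loading heavy models lazily, once per worker. The sketch below shows that pattern with YOLOv8x as an example; it is an illustration, not the actual `FrameProcessor` wiring:

```python
from concurrent.futures import ProcessPoolExecutor

_worker_model = None  # one model instance per worker process, created on demand

def _get_model():
    """Load the detector the first time a worker needs it."""
    global _worker_model
    if _worker_model is None:
        from ultralytics import YOLO  # imported lazily so the parent process stays light
        _worker_model = YOLO("yolov8x.pt")
    return _worker_model

def detect(frame):
    """Run detection in the worker and return picklable (label, confidence) pairs."""
    result = _get_model().predict(frame, verbose=False)[0]
    return [(result.names[int(c)], float(s))
            for c, s in zip(result.boxes.cls, result.boxes.conf)]

def run_detection(frames, workers=2):
    # Keep the pool small: every worker holds its own copy of the model in memory.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(detect, frames))
```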
- Scene extraction uses only scene-boundary key frames — events inside long scenes can be missed.
- Retrieval uses a simple top-k fusion across text & image scores (no learned reranker or joint reranking).
- Caching design has weaknesses:
  - The transcription cache is keyed by path and can go stale if the video content changes but the path does not (see the hashing sketch below).
  - The pickle-based cache has no file locking; concurrent runs may race or corrupt the cache.
- Model and runtime size: memory pressure and long cold starts are expected on the first run.
- Hardcoded absolute paths reduce portability — prefer environment-configured paths.
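The stale-transcript issue above (and the related improvement listed further down) comes down to keying the cache on file contents rather than the path. A minimal chunked-SHA256 sketch:

```python
import hashlib

def file_content_hash(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA256 of the file bytes, so the cache key changes whenever the video does."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```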
- Install Python packages (example; note the quotes around version and extras specifiers so the shell does not mangle them):

  ```bash
  pip install decord torch numpy Pillow "transformers>=4.30.1" segment_anything ultralytics \
    pytesseract soundfile ffmpeg-python groq sentence-transformers open-clip-torch faiss-cpu \
    "scenedetect[opencv]" opencv-python shapely pycocotools
  ```

- Install system binaries (if you haven't yet):

  ```bash
  sudo apt-get update && sudo apt-get install -y \
    ffmpeg \
    tesseract-ocr \
    libsm6 \
    libxext6 \
    libxrender1 \
    git
  ```

- Configure `src/config.py` or export env vars:

  ```bash
  export GROQ_API_KEY="<your_groq_api_key>"
  # or edit Config.VIDEO_PATH and Config.CACHE_DIR in src/config.py
  ```

- Run the demo:

  ```bash
  python src/main.py
  ```

- To change the query, edit `src/main.py` and re-run.
- Use lighter models in constrained environments: a smaller BLIP-2 variant, SAM base, YOLOv8n, etc.
- Improve caching by hashing file contents rather than file paths for ASR.
- Add file locking or a more robust key-value store (Redis/LMDB) to avoid pickle races (a file-lock sketch follows this list).
- Add a reranker that jointly considers text + visual features or finetune a lightweight cross-encoder for improved precision.
- Extract and store short video clips for top visual hits to provide direct visual evidence for answers.
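For the file-locking item above, a minimal sketch using the third-party `filelock` package (an assumed extra dependency, not currently in the project) combined with an atomic rename:

```python
import pickle
from pathlib import Path

from filelock import FileLock  # pip install filelock; assumed, not a current dependency

CACHE_DIR = Path("cache")  # stands in for Config.CACHE_DIR

def cache_store_locked(key: str, value) -> None:
    """Serialize writers so concurrent runs cannot corrupt a cache entry."""
    path = CACHE_DIR / f"{key}.pkl"
    with FileLock(str(path) + ".lock"):
        tmp = path.with_name(path.name + ".tmp")
        with tmp.open("wb") as f:
            pickle.dump(value, f)
        tmp.replace(path)  # atomic rename keeps readers from seeing partial writes
```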
Contributions welcome — open issues or pull requests for bug fixes, performance improvements, and new features.