A distributed video feature extraction pipeline that processes videos using state-of-the-art vision-language models and stores the extracted features in a vector database.
This pipeline is designed to:
- Process multiple videos in parallel using GPU workers
- Extract features using various vision-language models (CLIP, VL3-SigLIP-NaViT)
- Store the extracted features in a Milvus vector database
- Handle distributed processing with proper error handling and logging
Requirements:

- Python 3.x
- CUDA-compatible GPU (recommended for best performance)
- macOS or Linux (MPS and CUDA backends are supported)
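Because workers may run on CUDA (Linux) or MPS (Apple Silicon), device selection in PyTorch typically looks like the sketch below; the helper name is illustrative, not part of the pipeline's API:

```python
import torch

def pick_device(preferred=None):
    """Return the requested device, falling back to CUDA, then MPS, then CPU."""
    if preferred is not None:
        return torch.device(preferred)     # e.g. "cuda:0" or "mps" from the config
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():  # Apple Silicon (macOS)
        return torch.device("mps")
    return torch.device("cpu")
```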
Installation:

- Clone the repository:

  ```bash
  git clone <repository-url>
  cd feature_pipeline
  ```

- Install the dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Download the models:

  ```bash
  cd models
  python download.py
  ```
The pipeline is configured using YAML files in the `config/` directory; the default configuration file is `config/navit-config.yaml`. Update the following settings before running the pipeline:

- `video_input.path`: the directory containing the input videos
- `gpus`: the devices to use, according to the server's GPU type and count (e.g., `cuda:0`, `cuda:1`, or `mps`)
- `phases`: the pipeline phases to run, each with its own model and database configuration
Example configuration:

```yaml
video_input:
  path: "./video_samples/"

gpus:
  - mps

phases:
  - model:
      name: "VL3-SigLIP-NaViT"
      path: "models/VL3-SigLIP-NaViT"
      source: "folder"
      features: ["video_embedding"]
    db:
      type: "milvus"
      name: "navit_video_feature.db"
      batch_size: 1000
      collections:
        - name: "video_embedding_collection"
          fields:
            - name: id
              dtype: INT64
              is_primary: true
              auto_id: true
            - name: video_path
              dtype: VARCHAR
              max_length: 512
            - name: frame_id
              dtype: INT16
            - name: row_idx
              dtype: INT16
            - name: col_idx
              dtype: INT16
            - name: embeddings
              dtype: FLOAT_VECTOR
              dim: 1152
```
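For reference, the collection schema above maps onto pymilvus roughly as follows. This is a sketch using Milvus Lite; the `.db` file name comes from the config, while the `db/` path and the exact calls in `database.py` are assumptions:

```python
from pymilvus import MilvusClient, DataType

client = MilvusClient("db/navit_video_feature.db")  # Milvus Lite local file

schema = MilvusClient.create_schema(auto_id=True)
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="video_path", datatype=DataType.VARCHAR, max_length=512)
schema.add_field(field_name="frame_id", datatype=DataType.INT16)
schema.add_field(field_name="row_idx", datatype=DataType.INT16)
schema.add_field(field_name="col_idx", datatype=DataType.INT16)
schema.add_field(field_name="embeddings", datatype=DataType.FLOAT_VECTOR, dim=1152)

client.create_collection(collection_name="video_embedding_collection", schema=schema)
```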
Run the pipeline:

```bash
python main.py --config_path config/navit-config.yaml
```
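Internally, `main.py` presumably parses this flag and loads the YAML file; a minimal sketch of that entry-point logic (only the `--config_path` flag is taken from the command above, the rest is illustrative):

```python
import argparse

import yaml

def load_config():
    parser = argparse.ArgumentParser(description="Video feature extraction pipeline")
    parser.add_argument("--config_path", default="config/navit-config.yaml")
    args = parser.parse_args()
    with open(args.config_path) as f:
        return yaml.safe_load(f)
```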
Project structure:

```
feature_pipeline/
├── main.py          # Main entry point
├── worker.py        # GPU and DB worker implementations
├── database.py      # Database interface
├── utils.py         # Utility functions
├── logger.py        # Logging configuration
├── config/          # Configuration files
├── models/          # Model files and checkpoints
├── video_samples/   # Input videos
├── logs/            # Log files
└── db/              # Database files
```
Supported models:

- CLIP: OpenAI's CLIP model for video feature extraction
- VL3-SigLIP-NaViT: the vision encoder from VideoLLaMA3
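As an illustration of frame-level feature extraction, CLIP can be loaded through Hugging Face Transformers. This is a sketch, not the pipeline's own wrapper, and the frame path is hypothetical:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frame = Image.open("video_samples/frame_0000.jpg")  # hypothetical extracted frame
inputs = processor(images=frame, return_tensors="pt")
with torch.no_grad():
    features = model.get_image_features(**inputs)   # (1, 512) feature vector
```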
Workers:

- GPU Worker: processes videos and extracts features (see the sketch below)
- DB Worker: handles database operations
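Conceptually, the two workers form a producer/consumer pair connected by queues. The sketch below is simplified; the real implementation lives in `worker.py`, and `extract_features`/`insert_batch` are hypothetical helpers:

```python
def gpu_worker(video_queue, result_queue, device):
    # Pull video paths until a None sentinel arrives; push features downstream.
    while (video_path := video_queue.get()) is not None:
        embedding = extract_features(video_path, device)  # hypothetical
        result_queue.put((video_path, embedding))

def db_worker(result_queue, batch_size=1000):
    # Buffer results and flush them to the database in batches.
    batch = []
    while (item := result_queue.get()) is not None:
        batch.append(item)
        if len(batch) >= batch_size:
            insert_batch(batch)  # hypothetical
            batch.clear()
    if batch:
        insert_batch(batch)
```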
Vector database:

- Uses Milvus for efficient vector storage and retrieval
- Supports different collection schemas for different feature types
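Once features are stored, similar frames can be retrieved with a vector search. A pymilvus sketch, reusing the `client` from the schema example above and assuming an index has been built on the `embeddings` field:

```python
results = client.search(
    collection_name="video_embedding_collection",
    data=[query_embedding],              # a 1152-dim vector from the same model
    limit=5,
    output_fields=["video_path", "frame_id"],
)
for hit in results[0]:
    print(hit["distance"], hit["entity"]["video_path"])
```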
The pipeline uses a comprehensive logging system that tracks:
- Video processing progress
- Model loading and inference
- Database operations
- Error handling
Logs are stored in the `logs/` directory with timestamps.
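A minimal sketch of how such a timestamped file logger can be set up; the actual configuration lives in `logger.py` and may differ:

```python
import logging
from datetime import datetime
from pathlib import Path

def setup_logger(name="feature_pipeline"):
    Path("logs").mkdir(exist_ok=True)
    log_file = f"logs/{name}_{datetime.now():%Y%m%d_%H%M%S}.log"
    logging.basicConfig(
        filename=log_file,
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    return logging.getLogger(name)
```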
The pipeline includes robust error handling for:
- Model loading failures
- Video processing errors
- Database connection issues
- Worker thread management
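For example, a model loading failure can be caught and logged per worker so one bad checkpoint does not take down the whole run. A sketch under that assumption, where `load_model` is a hypothetical helper:

```python
import logging

logger = logging.getLogger(__name__)

def safe_load_model(path, device, retries=2):
    """Guard around model loading: log each failure, give up after `retries` tries."""
    last_exc = None
    for attempt in range(1, retries + 1):
        try:
            return load_model(path, device)  # hypothetical loader
        except (OSError, RuntimeError) as exc:
            last_exc = exc
            logger.error("Model load failed (attempt %d/%d): %s", attempt, retries, exc)
    raise RuntimeError(f"Could not load model from {path}") from last_exc
```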