This project provides a robust system for image search and metadata extraction, leveraging advanced machine learning models and indexing techniques. It integrates face detection, CLIP-based image search, and geolocation metadata extraction to enable efficient querying of images based on textual descriptions, face labels, and geographic information. The system is designed for scalability and modularity, making it suitable for applications requiring sophisticated image retrieval and analysis.
## Features

This project provides:
- Face Detection and Clustering: Utilizes InsightFace for face detection and feature extraction, with Agglomerative Clustering to assign labels to detected faces.
- CLIP-based Image Search: Employs the CLIP model (Contrastive Language–Image Pre-training) to generate image embeddings and perform text-based image retrieval using FAISS for efficient similarity search.
- Geolocation Metadata Extraction: Extracts GPS coordinates and addresses from image EXIF data using Pillow and Nominatim, supporting both standard image formats and HEIC/HEIF.
- BM25 Search: Implements BM25Okapi for text-based search on face labels and geolocation metadata (a minimal sketch follows this list).
- Combined Search Strategy: Integrates face detection, geolocation, and CLIP-based search to refine results based on query context.
- Data Ingestion Pipeline: Automates the generation and storage of embeddings and metadata for a given image directory.
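As a concrete illustration of the BM25 search mentioned above, the sketch below ranks per-image metadata strings against a text query with `rank-bm25`. The document strings are hypothetical stand-ins for the face labels and reverse-geocoded addresses the pipeline actually stores.

```python
from rank_bm25 import BM25Okapi

# Hypothetical per-image metadata strings (face labels plus addresses);
# the real pipeline assembles these from its stored face and geo data.
documents = [
    "Aditya Pratik Baga Beach Goa India",
    "KittyCat Marine Drive Mumbai Maharashtra India",
    "NeoBoomer Eiffel Tower Paris France",
]
tokenized_corpus = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_corpus)

query = "aditya goa beach"
scores = bm25.get_scores(query.lower().split())

# Image indices ordered by BM25 relevance to the query.
ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
print(ranked)  # index 0 should rank first for this query
```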
## Architecture

The system is modular, with distinct components for ingestion, search, and metadata extraction:
- Ingestion (`ingestion.py`): Orchestrates the generation of image paths, CLIP embeddings, face data, and geolocation metadata, and stores the results in a designated embedding directory.
- Face Detection (`face_detection.py`): Detects faces using InsightFace, extracts embeddings, and clusters them to assign labels. Supports BM25-based search on face labels.
- CLIP Search (`clip_search.py`): Generates image embeddings using CLIP and supports FAISS-based similarity search for text queries, with optional subset filtering (a minimal sketch follows this list).
- Geolocation Metadata (`geo_metadata.py`): Extracts GPS coordinates and addresses from images, with BM25-based search on address data.
- Search Strategy (`retrieval.py`): Combines face detection, geolocation, and CLIP search results using a conditional strategy (intersection or union of indices) to optimize query results.
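To make the CLIP/FAISS component concrete, here is a minimal sketch of text-to-image retrieval with Hugging Face `transformers` and FAISS. The checkpoint name and image paths are illustrative; the project's own `clip_search.py` may differ in details such as the model, batching, and index type.

```python
import faiss
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical image paths; the pipeline walks the image directory instead.
image_paths = ["ImageSamples/photo_1.jpg", "ImageSamples/photo_2.jpg"]
images = [Image.open(p).convert("RGB") for p in image_paths]

# Embed and L2-normalize the images so inner product equals cosine similarity.
with torch.no_grad():
    img_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
img_emb = torch.nn.functional.normalize(img_emb, dim=-1).cpu().numpy().astype("float32")

index = faiss.IndexFlatIP(img_emb.shape[1])
index.add(img_emb)

# Embed a text query the same way and retrieve the most similar images.
with torch.no_grad():
    txt_emb = model.get_text_features(
        **processor(text=["a dog on the beach"], return_tensors="pt", padding=True)
    )
txt_emb = torch.nn.functional.normalize(txt_emb, dim=-1).cpu().numpy().astype("float32")

scores, indices = index.search(txt_emb, 2)
print(indices[0], scores[0])  # ranked image indices and their similarities
```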
## Requirements

- Python 3.8+
- A CUDA-capable GPU (optional, for faster CLIP inference)
- Virtual environment (recommended)
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/aditypatil/image-search.git
  cd image-search
  ```

- Set up a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Prepare the embedding directory. Create a directory (e.g., `embed_store`) to store embeddings and metadata:

  ```bash
  mkdir embed_store
  ```

- Prepare the image directory. Place images in a directory (e.g., `ImageSamples`) using a supported format: `.jpg`, `.jpeg`, `.png`, `.heic`, or `.heif` (see the HEIC check below).
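To confirm that HEIC/HEIF files are readable before ingesting, a quick check along these lines can be run; the project relies on `pillow-heif` registering its opener with Pillow. The file name here is a placeholder.

```python
from PIL import Image
from pillow_heif import register_heif_opener

# Register the HEIF/HEIC opener so Pillow can decode .heic/.heif files.
register_heif_opener()

img = Image.open("ImageSamples/example.heic")  # hypothetical file
print(img.size, img.mode)
```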
## Data Ingestion

Run the ingestion pipeline to process images and generate embeddings and metadata:

```bash
python ingestion.py
```

This will:

- Generate image paths and store them in `embed_store/img_path_index.pkl`.
- Create CLIP embeddings (`img_embeddings.bin`, `img_filenames.npy`).
- Generate face data (`face_data.pkl`, `img_path_index_for_face.pkl`).
- Extract geolocation metadata (`geo_metadata.npy`); a sketch of the underlying EXIF/GPS extraction follows this list.
## Running a Search

Perform a combined search using face labels, geolocation, and CLIP embeddings:

```bash
python retrieval.py
```

Example query:

```python
srch = Search()
img_indices = srch.strategy1(query="aditya on beaches of goa")
print(img_indices)
```

This query searches for images of a person labeled "Aditya" in locations associated with "Goa beaches," refined by CLIP-based text similarity.
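The conditional strategy itself is not spelled out here, but one plausible reading, intersecting face and geolocation hits when both exist and otherwise falling back to their union, with CLIP providing the final ordering, might look like the sketch below. The function name and behaviour are illustrative, not the actual `strategy1` implementation.

```python
def combine_indices(face_hits, geo_hits, clip_ranked):
    # Intersect metadata hits when both searches matched, otherwise union them.
    face_hits, geo_hits = set(face_hits), set(geo_hits)
    candidates = face_hits & geo_hits if face_hits and geo_hits else face_hits | geo_hits
    if not candidates:
        return list(clip_ranked)  # no metadata match: rely on CLIP ranking alone
    # Keep CLIP's similarity ordering for the surviving candidates.
    return [i for i in clip_ranked if i in candidates]

print(combine_indices(face_hits=[3, 7, 12], geo_hits=[7, 12, 20], clip_ranked=[12, 5, 7, 3]))
# -> [12, 7]
```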
## Labeling Face Clusters

To assign meaningful names to face clusters, modify the `add_dummy_faces` function in `face_detection.py` and run:

```bash
python face_detection.py
```

Example mapping:

```python
name_labels = {0: "Aditya", 1: "Pratik", 2: "Dogemaster", 3: "NeoBoomer", 4: "KittyCat"}
```
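For orientation, the sketch below shows how cluster labels produced by scikit-learn's AgglomerativeClustering can be mapped to the names in `name_labels`. The embeddings and cluster count are placeholders: the real embeddings come from InsightFace and the real clustering parameters live in `face_detection.py`.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Placeholder face embeddings; in the project these come from InsightFace.
rng = np.random.default_rng(0)
face_embeddings = rng.random((10, 512)).astype("float32")

# Cluster the embeddings; n_clusters=5 is illustrative only.
cluster_ids = AgglomerativeClustering(n_clusters=5).fit_predict(face_embeddings)

# Map numeric cluster labels to human-readable names.
name_labels = {0: "Aditya", 1: "Pratik", 2: "Dogemaster", 3: "NeoBoomer", 4: "KittyCat"}
face_names = [name_labels.get(c, f"person_{c}") for c in cluster_ids]
print(face_names)
```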
## Project Structure

```
image-search-system/
├── embed_store/          # Directory for storing embeddings and metadata
├── ImageSamples/         # Directory for input images
├── models/
│   ├── __init__.py
│   ├── clip_search.py    # CLIP embedding generation and search
│   ├── face_detection.py # Face detection and clustering
│   └── geo_metadata.py   # Geolocation metadata extraction
├── retrieval.py          # Combined search strategy
├── ingestion.py          # Data ingestion pipeline
├── README.md             # Project documentation
└── requirements.txt      # Python dependencies
```
## Dependencies

- FAISS: For efficient similarity search of CLIP embeddings.
- InsightFace: For face detection and feature extraction.
- CLIP (Transformers): For text-image embedding generation.
- Pillow: For image processing and EXIF data extraction.
- Geopy: For reverse geocoding of GPS coordinates.
- scikit-learn: For clustering face embeddings.
- rank-bm25: For text-based search on face labels and addresses.
- tqdm: For progress bars during processing.
- NumPy, Pandas, Matplotlib, OpenCV: For data manipulation and visualization.
- Torch: For CLIP model inference.
- Pillow-HEIF: For HEIC/HEIF image support.
## References

- Szegedy, C., et al. (2015). "Going Deeper with Convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). [InsightFace model inspiration]
- Radford, A., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision." International Conference on Machine Learning (ICML). [CLIP model]
- Jones, K. S., et al. (2000). "A Probabilistic Model of Information Retrieval: Development of the BM25 Model." Journal of Documentation. [BM25 algorithm]
- FAISS Documentation: https://github.com/facebookresearch/faiss
- InsightFace Documentation: https://github.com/deepinsight/insightface
- Geopy Documentation: https://geopy.readthedocs.io/
- Transformers (Hugging Face) Documentation: https://huggingface.co/docs/transformers/
## License

This project is licensed under the MIT License. See the LICENSE file for details.