GitHub - tylersupersad/laion-distributed-pipeline: A distributed system for large-scale image data processing with CLIP embeddings and FAISS indexing. Built on a five-node AlmaLinux cluster with SLURM, Ansible, and NFS. Supports modular embedding, FAISS shard merging, and capacity benchmarking with Prometheus.

Distributed CLIP Embedding and Approximate Nearest Neighbor Search on LAION Dataset

This project implements a scalable, distributed pipeline for large-scale image-text data processing, using the LAION2B-en-aesthetic dataset. The system generates CLIP embeddings at high throughput and runs FAISS-based approximate nearest neighbor (ANN) search across a 5-node CPU cluster with NFS-shared storage and SLURM orchestration.

System Overview

Distributed Embedding: Parallel CLIP embedding of image batches across multiple worker nodes, with output persisted to NFS.
Parallel Indexing: A FAISS index is built independently for each partition of the embedding output, and the per-partition indices are then merged into a unified search index.
Approximate Search: The final FAISS index supports fast, scalable ANN retrieval on embedded representations.
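
The build-per-partition, merge, then search flow described above can be illustrated with a small stand-in. This is a minimal sketch, not the project's implementation: it uses exact brute-force L2 search from the Python standard library in place of FAISS, and the names `merge_shards` and `search` are hypothetical. It only shows how per-shard vectors can be concatenated into one searchable collection while keeping a record of which shard each global id came from.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def merge_shards(
    shards: List[List[Vector]],
) -> Tuple[List[Vector], List[Tuple[int, int]]]:
    """Concatenate per-shard vectors into one list.

    Returns the merged vectors and, for every global id, the
    (shard number, local id) pair it originated from.
    """
    merged: List[Vector] = []
    origin: List[Tuple[int, int]] = []
    for shard_no, shard in enumerate(shards):
        for local_id, vec in enumerate(shard):
            merged.append(vec)
            origin.append((shard_no, local_id))
    return merged, origin


def search(index: List[Vector], query: Vector, k: int) -> List[Tuple[float, int]]:
    """Return the k closest vectors as (distance, global id) pairs.

    Brute-force exact search; a stand-in for an ANN index (assumption).
    """
    scored = sorted((math.dist(query, vec), gid) for gid, vec in enumerate(index))
    return scored[:k]


if __name__ == "__main__":
    # Toy 2-D "embeddings" split across three shards (made-up values).
    shards = [[(0.0, 0.0), (1.0, 0.0)], [(5.0, 5.0)], [(0.9, 0.1)]]
    merged, origin = merge_shards(shards)
    for dist, gid in search(merged, (1.0, 0.0), k=2):
        print(gid, origin[gid], round(dist, 3))
```

In a real deployment the merged structure would be a FAISS index rather than a Python list, but the id bookkeeping shown here is the part that typically needs care when shards are built independently.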

Architecture

Cluster Configuration:

1 Host node, 4 Worker nodes
CPU-only inference with multi-node orchestration
Centralized NFS storage for inputs, embeddings, and indices

Orchestration:

SLURM array jobs for dynamic task distribution
Terraform and Ansible automation for cluster provisioning and software setup
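
As an illustration of how a SLURM array job can split work, the sketch below has each array task pick its own slice of the input list. It is one simple, static scheme and not necessarily how this repository distributes tasks. It assumes the array is submitted with zero-based indices (for example `--array=0-3`) and reads the standard `SLURM_ARRAY_TASK_ID` and `SLURM_ARRAY_TASK_COUNT` environment variables; the function names and file names are hypothetical.

```python
import os
from typing import List, Mapping, Optional, Sequence, Tuple


def current_task(env: Optional[Mapping[str, str]] = None) -> Tuple[int, int]:
    """Read (task id, task count) from SLURM's array-job variables.

    Falls back to a single task when run outside SLURM.
    """
    env = os.environ if env is None else env
    task_id = int(env.get("SLURM_ARRAY_TASK_ID", "0"))
    task_count = int(env.get("SLURM_ARRAY_TASK_COUNT", "1"))
    return task_id, task_count


def shard_for_task(items: Sequence[str], task_id: int, task_count: int) -> List[str]:
    """Return every task_count-th item, starting at offset task_id.

    Assumes zero-based task ids (an assumption about the job script).
    """
    if task_count <= 0:
        raise ValueError("task_count must be positive")
    if not 0 <= task_id < task_count:
        raise ValueError("task_id must be in [0, task_count)")
    return list(items[task_id::task_count])


if __name__ == "__main__":
    # Hypothetical input file names; real inputs would live on the NFS share.
    inputs = [f"batch_{i:03d}.parquet" for i in range(10)]
    task_id, task_count = current_task()
    print(shard_for_task(inputs, task_id, task_count))
```

Striding (`items[task_id::task_count]`) keeps shard sizes within one item of each other without any coordination between tasks.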

Usage

The cluster must be configured before the pipeline can be run. Follow these two documents in order:

Cluster Configuration (cluster-config.md): A guide to provisioning and configuring the 5-node CPU cluster, including network setup, storage access, and orchestration tools. Start here.
Pipeline Execution (usage.md): Once the cluster is configured, this document explains how to run the distributed CLIP embedding and approximate nearest neighbor search pipeline on the LAION dataset and analyze its output.

Key Features

High concurrency support for embedding and indexing
Monitoring via Prometheus and Node Exporter
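
Node Exporter publishes host metrics in the Prometheus text exposition format. The sketch below parses that format into a dictionary, which can be handy for quick capacity checks outside Prometheus itself. It is a simplified illustration under stated assumptions: samples carry no trailing timestamp, the function name `parse_metrics` is hypothetical, and the sample values are made up.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Map each series (metric name plus any labels) to its value.

    Comment lines (# HELP, # TYPE) and blank lines are skipped.
    Assumes no trailing timestamp on sample lines.
    """
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated token on the line.
        series, _, value = line.rpartition(" ")
        try:
            samples[series.strip()] = float(value)
        except ValueError:
            continue  # Skip lines that do not end in a number.
    return samples


if __name__ == "__main__":
    # Made-up sample in the shape Node Exporter produces.
    sample = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""
    for series, value in parse_metrics(sample).items():
        print(series, value)
```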

Repository Contents

config
docs
infra
scripts
test_inputs
README.md
cluster-config.md
usage.md
