This repository is forked from vector-db-benchmark. It contains benchmark results for a comparison between the Qdrant and Milvus single-node engines. The files added on top of the base repository are:

- `results/analysis.ipynb`: notebook with the analysis of the upload and search results and a comparison of engine configurations that achieve similar precisions.
- `results/plots-html`: directory with diagrams of upload and search metrics as functions of the achieved mean precision rate.
- Other directories inside `results/` containing the `json` files with the results of each experiment (dataset - engine configuration - upload) or (dataset - engine configuration - search config).
- `csv` files inside `results/`: each line corresponds to a different experiment and contains metrics for both Qdrant and Milvus, so the engines can be compared under the same configuration.
- `summary.py`: a command-line interface used to show the results for specific or all engine configurations and datasets.
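For a quick look at the available options, the summary CLI can be run from the repository root. The invocation below only assumes the script exposes a standard `--help` flag listing its filter options.

```bash
# Assumed invocation: print the CLI help to discover the available filters.
python3 summary.py --help
```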
For an explanation of the engine and search configurations used for the benchmarking, see `results/README.md`.
The installation and run instructions below are kept as in the base repository, with the addition of the *Prerequisites* and *Setup using venv* sections to define the experiment environment explicitly.
There are various vector search engines available, and each of them may offer a different set of features and efficiency. But how do we measure performance? There is no single definition: in a specific use case you may care about one aspect while paying little attention to others. This project is a general framework for benchmarking different engines under the same hardware constraints, so you can choose what works best for you.
Running any benchmark requires choosing an engine, a dataset, and defining the scenario against which it should be tested. A specific scenario may assume running the server in single or distributed mode, a different client implementation, and a different number of client instances.
Benchmarks are implemented in server-client mode, meaning that the server runs on one machine and the client runs on another.
- Docker Desktop
- Python 3: version 3.10.12 was used for the experiments
All engines are served using docker compose. The configuration is in the `servers` directory.
To launch the server instance, run the following command:
cd ./engine/servers/<engine-configuration-name>
docker compose up
Containers are expected to expose all necessary ports, so the client can connect to them.
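For example, to start a single-node Qdrant server (assuming the configuration directory is named `qdrant-single-node`; check `engine/servers/` for the exact names):

```bash
cd ./engine/servers/qdrant-single-node
docker compose up -d   # -d keeps the containers running in the background
```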
We suggest using a virtual environment in order to pin a specific Python version (3.10.12 was used for the experiments) and to manage package installations separately.
To create the virtual environment, use the following command. Then activate it and install the packages, as described in the Python documentation: Install packages in a virtual environment using pip and venv.
python3.10 -m venv .venv
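On Linux or macOS, the environment created above is activated with:

```bash
source .venv/bin/activate   # on Windows: .venv\Scripts\activate
```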
Install dependencies:
pip install poetry
poetry install
Run the benchmark:
Usage: run.py [OPTIONS]
Example: python3 -m run --engines *-m-16-* --datasets glove-*
Options:
--engines TEXT [default: *]
--datasets TEXT [default: *]
--host TEXT [default: localhost]
--skip-upload / --no-skip-upload
[default: no-skip-upload]
--install-completion Install completion for the current shell.
--show-completion Show completion for the current shell, to
copy it or customize the installation.
--help Show this message and exit.
The command allows you to specify wildcards for engines and datasets.
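For example, the following would run only the Qdrant configurations against the GloVe datasets on a remote host, skipping the upload phase (the engine and dataset name patterns are illustrative; the actual names are defined by the configuration files and `datasets/datasets.json`):

```bash
python3 -m run --engines "qdrant-*" --datasets "glove-*" --host 10.0.0.5 --skip-upload
```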
Results of the benchmarks are stored in the `./results/` directory.
Each engine has a configuration file, which is used to define the parameters for the benchmark. Configuration files are located in the configuration directory.
Each step in the benchmark process uses a dedicated section of the configuration:

- `connection_params` - passed to the client during the connection phase.
- `collection_params` - parameters used to create the collection; indexing parameters are usually defined here.
- `upload_params` - parameters used to upload the data to the server.
- `search_params` - passed to the client during the search phase. The framework allows multiple search configurations for the same experiment run.

The exact values of the parameters are specific to each engine.
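As an illustration only, an engine configuration file is structured roughly as follows. The placeholder `...` entries stand for engine-specific keys; consult the existing files in the configuration directory for the real parameters.

```json
{
  "connection_params": { "...": "client connection settings, e.g. host, port, timeouts" },
  "collection_params": { "...": "collection creation and indexing parameters" },
  "upload_params": { "...": "upload settings, e.g. batch size and parallelism" },
  "search_params": [
    { "...": "one object per search configuration to evaluate" }
  ]
}
```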
Datasets are configured in the `datasets/datasets.json` file. The framework will automatically download the dataset and store it in the `datasets` directory.
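For reference, a dataset entry describes where to fetch the data and how vectors should be compared. The field names and values below are assumptions made for illustration; the actual schema is defined by the existing entries in `datasets/datasets.json`.

```json
[
  {
    "name": "example-dataset",
    "vector_size": 100,
    "distance": "cosine",
    "link": "https://example.com/example-dataset.hdf5"
  }
]
```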
There are a few base classes that you can use to implement a new engine:

- `BaseConfigurator` - defines methods to create collections and set up indexing parameters.
- `BaseUploader` - defines methods to upload the data to the server.
- `BaseSearcher` - defines methods to search the data.

See the examples in the clients directory.
Once all the necessary classes are implemented, you can register the engine in the `ClientFactory`.
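As a rough sketch only, a new engine might be wired up as below. The import paths and the absence of method bodies are assumptions for illustration; the real abstract methods to override are defined by the base classes in this repository.

```python
# Hypothetical sketch: subclass the three base classes for a new engine.
# The import paths below are assumed, not verified.
from engine.base_client.configure import BaseConfigurator
from engine.base_client.search import BaseSearcher
from engine.base_client.upload import BaseUploader


class MyEngineConfigurator(BaseConfigurator):
    """Creates the collection and applies the indexing parameters."""


class MyEngineUploader(BaseUploader):
    """Uploads the vectors (and payloads) to the server in batches."""


class MyEngineSearcher(BaseSearcher):
    """Runs the search queries and collects the nearest neighbours."""


# Finally, register these classes in the ClientFactory so the benchmark
# can resolve them by engine name.
```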