This repo contains datasets for benchmarking vector search performance, to help Superlinked prioritize integration partners.
We reviewed a number of publicly available datasets and identified three core problems; the table below shows how this dataset addresses them:

| Problems of other vector search benchmarks | How this dataset solves it |
|---|---|
| Not enough metadata of various types, which makes it hard to test filter performance | 3 number, 1 categorical, 3 text, and 1 image column |
| Vectors too small, while SOTA models usually output 2k+ or even 4k+ dims | 4154 dims |
| Dataset too small, especially if larger vectors are used | 100k, 1M, and 10M item variants, all sampled from the large dataset |
The folders contain Parquet files with the metadata and vectors.

| Dataset | Records | # Files | Size |
|---|---|---|---|
| benchmark_10k | 10,000 | 100 | ~230 MB |
| benchmark_100k | 100,000 | 100 | ~2.3 GB |
| benchmark_1M | 1,000,000 | 100 | ~23 GB |
| benchmark_10M | 10,534,536 | 1000 | ~240 GB |
The structure of the files is the same throughout:

```python
Schema([('parent_asin', String),     # the id
        ('main_category', String),
        ('title', String),
        ('average_rating', Float64),
        ('rating_number', Float64),
        ('description', String),
        ('price', Float64),
        ('categories', String),
        ('image_url', String),
        ('value', List(Float64))])   # the vectors
```
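As a quick sanity check after downloading, the Parquet files can be inspected with Polars. A minimal sketch (the local path is just a placeholder for wherever you stored the files):

```python
import polars as pl

# Read the downloaded parquet files of one dataset variant (placeholder path)
df = pl.read_parquet("./your/local/data/folder/benchmark-10k/*.parquet")

print(df.schema)            # should match the schema above
print(df.height)            # number of records loaded
print(len(df["value"][0]))  # vector dimensionality, expected to be 4154
```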
Some of the smaller dataset versions have a query set guaranteed to contain only parent_asins from the corresponding dataset version; these smaller versions exist for testing when only a smaller dataset has been ingested. The query files have the following structure:

```
{
  query_id: {
    product_id: str | None,      # parent_asin - get the stored vector for this id from the database and search with it
    rating_max: int | None,      # filter for product.average_rating <= rating_max
    rating_num_min: int | None,  # filter for product.rating_number > rating_num_min
    main_category: str | None,   # filter for product.main_category == main_category
  },
  ...
}
```

| Dataset | Queries |
|---|---|
| query-params-100k | 15 |
| query-params-1M | 117 |
| query-params-10M | 1,000 |
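To illustrate how a query-parameters file is consumed, here is a minimal sketch that loads one of the files (downloaded as described below) and prints the fields of the first query. It assumes every key is present as in the structure above, with unused filters set to `None`:

```python
import json

with open("query-params-100k.json") as f:
    queries = json.load(f)

query_id, params = next(iter(queries.items()))
print(query_id)
print("product_id:    ", params["product_id"])       # parent_asin whose stored vector is the query vector
print("rating_max:    ", params["rating_max"])        # average_rating <= rating_max (skip if None)
print("rating_num_min:", params["rating_num_min"])    # rating_number > rating_num_min (skip if None)
print("main_category: ", params["main_category"])     # main_category == value (skip if None)
```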
Query results are stored in `ranked-results.json`. The structure is:

```
{
  query_id: [ordered list of result parent_asins],
  ...
}
```
NOTE: The ground-truth results assume that all products have been ingested into the database!
The datasets are available in several ways:
- You can use `gsutil` to download the datasets (recommended for the multi-file benchmark folders; plain HTTPS downloads work best for individual files):

```sh
# Download benchmark datasets
gsutil cp -r "gs://superlinked-benchmarks-external/amazon-products-images/benchmark-10k/**" ./your/local/data/folder/
gsutil cp -r "gs://superlinked-benchmarks-external/amazon-products-images/benchmark-100k/**" ./your/local/data/folder/
gsutil cp -r "gs://superlinked-benchmarks-external/amazon-products-images/benchmark-1M/**" ./your/local/data/folder/
gsutil cp -r "gs://superlinked-benchmarks-external/amazon-products-images/benchmark-10M/**" ./your/local/data/folder/
```
As the query files are individual files, a simple HTTPS download works fine:

```sh
# Download queries
wget https://storage.googleapis.com/superlinked-benchmarks-external/amazon-products-images/query-params-100k.json
wget https://storage.googleapis.com/superlinked-benchmarks-external/amazon-products-images/query-params-1M.json
wget https://storage.googleapis.com/superlinked-benchmarks-external/amazon-products-images/query-params-10M.json
```
The same is true for the results:

```sh
# Download the ground truth query results
wget https://storage.googleapis.com/superlinked-benchmarks-external/amazon-products-images/ranked-results.json
```
`gsutil` works fine for these as well (you can infer the paths from the URLs above). For `ranked-results.json`:

```sh
gsutil cp "gs://superlinked-benchmarks-external/amazon-products-images/ranked-results.json" ./your/local/data/folder/
```
- Using Hugging Face Datasets: the product data is available via the `datasets` library.

```python
from datasets import load_dataset

benchmark_10k = load_dataset("superlinked/external-benchmarking", data_dir="benchmark-10k")
benchmark_100k = load_dataset("superlinked/external-benchmarking", data_dir="benchmark-100k")
benchmark_1M = load_dataset("superlinked/external-benchmarking", data_dir="benchmark-1M")
benchmark_10M = load_dataset("superlinked/external-benchmarking", data_dir="benchmark-10M")
```
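If you want the vectors as a NumPy matrix for ingestion or brute-force checks, a minimal sketch for the 10k variant (assuming the default `train` split name and the column names from the schema above):

```python
import numpy as np
from datasets import load_dataset

ds = load_dataset("superlinked/external-benchmarking", data_dir="benchmark-10k", split="train")

ids = ds["parent_asin"]                              # list of product ids
vectors = np.asarray(ds["value"], dtype=np.float32)  # shape: (num_records, 4154)
print(vectors.shape)
```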
For query and result data, please use one of the above methods (gsutil or direct download).
- Origin: Amazon Reviews 2023 dataset
- Categories: `["Books", "Automotive", "Tools and Home Improvement", "All Beauty", "Electronics", "Software", "Health and Household"]`
The embeddings are created via a Superlinked config. The resulting 4154-dim vector is the concatenation of:
- 1 categorical embedding,
- 3 number embeddings,
- 3 text embeddings (`Qwen/Qwen3-Embedding-0.6B`),
- 1 image embedding (`laion/CLIP-ViT-H-14-laion2B-s32B-b79K`).
For the `benchmark_10M` setup, produce the following set of measurements, i.e., fill in the 'TBD' cells:

| # | Write | Target | Observed | Read | Target | Observed |
|---|---|---|---|---|---|---|
| 1 | Create index from scratch | < 2 hrs | TBD | - | - | - |
| 2 | - | - | - | 20 QPS at 0.001% filter selectivity | 100 ms @ p95 | TBD |
| 3 | - | - | - | 20 QPS at 0.1% filter selectivity | 100 ms @ p95 | TBD |
| 4 | - | - | - | 20 QPS at 1% filter selectivity | 100 ms @ p95 | TBD |
| 5 | - | - | - | 20 QPS at 10% filter selectivity | 100 ms @ p95 | TBD |
| 6 | 20 QPS of single-object updates (incl. embedding) | 2 s @ p95 | TBD | 20 QPS at 1% filter selectivity | 100 ms @ p95 | TBD |
| 7 | 200 QPS of single-object updates (incl. embedding) | 2 s @ p95 | TBD | 20 QPS at 1% filter selectivity | 100 ms @ p95 | TBD |
Formulate the queries like this:
- Vector Similarity: each query should use dot-product similarity scoring against a vector that you grab from the DB. The vector is specified in the query params under the `product_id` key.
- Filters: to get the target filter selectivity, use the filters specified in the `query_params` files.
- Results details: add `LIMIT 100` to all queries and retrieve only `parent_asin` for each record to minimize networking overhead.
- Vector Search Recall: we expect that you can tune your system to produce a >90% average hit rate for the ANN index, and we expect you to run the above tests with such tuning.

A brute-force reference that combines these requirements is sketched below, after the selectivity table.

| Selectivity | Predicate |
|---|---|
| 0.001% | `average_rating <= 3.0 and rating_number > 130 and main_category == 'Computers'` |
| 0.1% | `average_rating <= 3.5 and rating_number > 30 and main_category == 'Computers'` |
| 1% | `rating_number > 45 and main_category == 'Computers'` |
| 10% | `average_rating <= 3.5 and rating_number > 1` |
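For reference, here is a brute-force (non-ANN) sketch of a single query that combines the vector lookup, the metadata filters, dot-product scoring, and the `LIMIT 100` / `parent_asin`-only requirements above. It assumes `df` holds the metadata columns from the schema and `vectors` holds the corresponding rows of the `value` column as a NumPy matrix; it is a correctness reference, not a vendor-specific implementation:

```python
import numpy as np
import polars as pl

def run_query(df: pl.DataFrame, vectors: np.ndarray, params: dict, limit: int = 100) -> list[str]:
    """Brute-force reference for one benchmark query (dot product + metadata filters)."""
    ids = df["parent_asin"].to_numpy()

    # 1. Look up the stored vector of the product referenced by product_id (a parent_asin).
    query_vec = vectors[np.flatnonzero(ids == params["product_id"])[0]]

    # 2. Apply the metadata filters; a filter that is None is simply skipped.
    mask = np.ones(len(df), dtype=bool)
    if params.get("rating_max") is not None:
        mask &= df["average_rating"].to_numpy() <= params["rating_max"]
    if params.get("rating_num_min") is not None:
        mask &= df["rating_number"].to_numpy() > params["rating_num_min"]
    if params.get("main_category") is not None:
        mask &= df["main_category"].to_numpy() == params["main_category"]

    # 3. Dot-product similarity over the filtered candidates, keep the top `limit`.
    candidates = np.flatnonzero(mask)
    scores = vectors[candidates] @ query_vec
    top = candidates[np.argsort(-scores)[:limit]]

    # 4. Return only parent_asin values, ordered by descending score.
    return ids[top].tolist()
```

Comparing its output against `ranked-results.json` is a useful sanity check before switching to a tuned ANN index.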
You are welcome to use the `calculate_hit_rates` function in `eval.py`. It expects the prediction results in the same format as the provided ground-truth result set.
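If you prefer not to depend on `eval.py`, here is a stand-alone sketch of an average hit-rate computation over the `{query_id: [parent_asins...]}` format described above (the prediction file name is a placeholder, and this is not necessarily identical to `calculate_hit_rates`):

```python
import json

def average_hit_rate(predictions_path: str, ground_truth_path: str) -> float:
    """Average fraction of ground-truth parent_asins recovered per query."""
    with open(predictions_path) as f:
        predictions = json.load(f)
    with open(ground_truth_path) as f:
        ground_truth = json.load(f)

    rates = []
    for query_id, expected in ground_truth.items():
        retrieved = set(predictions.get(query_id, []))
        rates.append(len(retrieved & set(expected)) / len(expected))
    return sum(rates) / len(rates)

print(average_hit_rate("my-results.json", "ranked-results.json"))  # "my-results.json" is a placeholder
```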
To enable us to compare different vendors, we consider the above dataset size + performance to be a "unit" of vector search, for which we would like to know:
- What are the vector search vendor parameters of the cloud instance that can support this "unit".
- What is the price-per-GB-month for this instance, assuming a sustained average workload as described by the targets above.
- How does the price scale with (a) 2x the size (b) 2x the read QPS (c) 2x the write QPS.
This dataset is derived from the Amazon Reviews 2023 dataset. Please refer to the original dataset's license for usage terms.