A framework for evaluating semantic search across custom datasets, metrics, and embedding backends.
- Flexible model integration: HuggingFace, OpenAI, BM25, and more.
- Simple YAML-based configuration.
- Custom evaluation metrics: e.g., Accuracy@k, Latency.
- Integrated visualizations via `seaborn`/`matplotlib`.
To install the project and its dependencies, follow these steps:
# Clone the repository
git clone https://github.com/machinelearningZH/semantic-search-eval.git
cd semantic-search-eval
# Install uv, create a virtual environment, and install dependencies
pip3 install uv
uv venv
source .venv/bin/activate
uv sync
You need two input files:
- `docs`: A CSV or parquet file with a column named `text` containing the document content. Each row represents a single document.
- `queries`: A CSV or parquet file with two columns:
  - `search_query`: The query text.
  - `idx`: The document ID that corresponds to the correct answer for the query.
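For orientation, a minimal sketch of the two files, written with pandas, might look like the following. The file paths and contents are placeholders, and the assumption that `idx` refers to the row index of the matching document should be checked against your own data.

```python
# Illustrative only: minimal docs/queries files with the expected columns.
# Here `idx` is assumed to be the row index of the matching document in docs;
# adjust this if your document IDs are stored differently.
# Writing parquet requires pyarrow (or fastparquet).
import pandas as pd

docs = pd.DataFrame({
    "text": [
        "Protokoll der Gemeinderatssitzung vom 12. März 1998 ...",
        "Jahresbericht des Statistischen Amts 2021 ...",
    ]
})
docs.to_parquet("data/docs.parquet")

queries = pd.DataFrame({
    "search_query": ["Protokoll Gemeinderat März 1998", "Jahresbericht Statistik 2021"],
    "idx": [0, 1],
})
queries.to_parquet("data/queries.parquet")
```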
Note
If you do not already have test queries for your documents, the tool can create them automatically via OpenAI or naive keyword extraction (for private data). The prompt used for query generation via OpenAI is in German and can be edited in `semsearcheval/prompts.py`. If you use the naive keyword extraction mechanism, be aware that this will boost the performance of BM25 in comparison to the embedding models.
Note
Embedding models have a maximum input length - more on this in the next section. If your documents exceed this length, they should be split into smaller chunks before evaluation to ensure compatibility with the models. All preprocessing (e.g., cleaning, tokenization) should be completed before evaluation, as it is not (yet) supported in this toolkit.
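As a rough illustration, a naive pre-chunking step could look like the sketch below. It uses whitespace-based splitting with a hypothetical word limit; a real pipeline should respect the token limits of the models you evaluate, and the `idx` values in your queries must be updated to point at the resulting chunks.

```python
# Naive illustrative chunking: splits each document into fixed-size word chunks.
# Real chunking should use the tokenizers of the models under evaluation.
import pandas as pd

def chunk_text(text: str, max_words: int = 200) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

docs = pd.read_parquet("data/docs.parquet")  # placeholder path
chunks = docs["text"].apply(chunk_text).explode().reset_index(drop=True)
pd.DataFrame({"text": chunks}).to_parquet("data/docs_chunked.parquet")
```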
Create a YAML config to define datasets, models, and metrics. Use `configs/example.yaml` as a template.
Key fields:
- `folder`: where to save results and plots
- `docs` and `queries`: paths to your documents and queries in CSV or parquet format
- `is-public-data`: set to `true` to use the OpenAI query creator if your data is public
- `max-len`: set to the shortest model limit to ensure a fair evaluation with the same input text length for all models
- `models`: define model backends and options
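For orientation, the sketch below writes a minimal config with these fields. The values are placeholders, and the `models` section is deliberately omitted because its exact structure should be copied from `configs/example.yaml`; in practice you will usually just copy and edit the example config directly.

```python
# Illustrative only: generates a minimal config with the fields listed above.
# Values are placeholders; copy the `models` section from configs/example.yaml.
import yaml  # PyYAML

config = {
    "folder": "results/my-eval",        # where results and plots are saved
    "docs": "data/docs.parquet",        # documents with a `text` column
    "queries": "data/queries.parquet",  # queries with `search_query` and `idx`
    "is-public-data": False,            # set to true only if the data may be sent to OpenAI
    "max-len": 512,                     # shortest model limit, for a fair comparison
    # "models": ...                     # model backends and options, see configs/example.yaml
}

with open("configs/my_config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```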
To use OpenAI-based features:
- Set the key as an environment variable:
export OPENAI_API_KEY="your-api-key"
- Or add it to a `.env` file in the project root (the environment variable above takes precedence):
OPENAI_API_KEY=your-key
Warning
Never commit API keys to version control.
Once your config is ready:
python semsearcheval/evaluate.py -c configs/your_config.yaml
Results will be saved in the specified `folder`, including a CSV file with stats and plots for each metric.
Semantic search is becoming increasingly important in public administration and many other areas. Intelligent search solutions like the ones we built for the Staatsarchiv and for our OGD metadata catalog help to make large collections of documents easier to search and use. However, with new embedding models being released all the time — both open-source and commercial and in various sizes — it can be hard to know which one works best for a specific task or dataset.
This tool was built to make it easier to compare these models in a transparent and reproducible way, using your own data and metrics.
- The tool lets you evaluate and compare models like OpenAI, HuggingFace, and BM25 on your own datasets.
- Setup is easy using a simple YAML configuration file.
- You can use automatically generated test queries or define your own for more targeted tests (e.g. to check for bias).
- Results are saved so you do not need to re-run old evaluations when testing new models.
- The tool provides a CSV with all results as well as visual plots to help with decision-making.
- It works with the OpenAI API but can also run completely offline or on-premise if your data is private.
- It is easy to extend with new datasets, models, or metrics as your needs change.
This project contributes to the broader goal of enabling fair and transparent deployment of AI systems in administrative and public service environments.
Chantal Amrhein, Patrick Arnecke – Statistisches Amt Zürich: Team Data
We welcome feedback and contributions! Email us or open an issue or pull request.
We use `ruff` for linting and formatting.
Install pre-commit hooks and run tests before opening a pull request:
pre-commit install
pytest
To add a new metric:
- Implement the `Metric` interface in `metrics.py`
- Register it in `constants.py`
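As a hypothetical illustration only (the actual `Metric` interface is defined in `metrics.py` and may use different method names and signatures), a custom metric could look roughly like this:

```python
# Hypothetical sketch, not the actual interface: a recall@k-style metric that
# checks whether the correct document ID appears in the top k retrieved results.
class RecallAtK:
    def __init__(self, k: int = 5):
        self.k = k
        self.name = f"recall@{k}"

    def compute(self, ranked_ids: list[list[int]], gold_ids: list[int]) -> float:
        # One ranking per query; a hit means the gold ID is within the top k.
        hits = sum(gold in ranking[: self.k] for ranking, gold in zip(ranked_ids, gold_ids))
        return hits / len(gold_ids)
```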
To add a new model:
- Implement the `Model` interface in `models.py`
- Register it in `constants.py`
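Again as a hypothetical illustration (the real `Model` interface lives in `models.py` and may differ), an embedding backend based on sentence-transformers might be sketched like this:

```python
# Hypothetical sketch, not the actual interface: an embedding backend built on
# sentence-transformers that turns a list of texts into normalized vectors.
from sentence_transformers import SentenceTransformer

class MiniLMModel:
    def __init__(self, model_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def embed(self, texts: list[str]):
        # Returns one embedding vector per input text.
        return self.model.encode(texts, normalize_embeddings=True)
```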
This project is licensed under the MIT License. See the LICENSE file for details.
This evaluation tool (the Software) incorporates the open-source model `de_core_news_sm`
from spacy.io and evaluates user-defined open- and closed-sourced embedding models (the Models). The Software has been developed according to and with the intent to be used under Swiss law. Please be aware that the EU Artificial Intelligence Act (EU AI Act) may, under certain circumstances, be applicable to your use of the Software. You are solely responsible for ensuring that your use of the Software as well as of the underlying Models complies with all applicable local, national and international laws and regulations. By using this Software, you acknowledge and agree (a) that it is your responsibility to assess which laws and regulations, in particular regarding the use of AI technologies, are applicable to your intended use and to comply therewith, and (b) that you will hold us harmless from any action, claims, liability or loss in respect of your use of the Software.